2023-10-27 18:25:21

by Sean Christopherson

Subject: [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes

Non-KVM people, please take a gander at two small-ish patches buried in the
middle of this series:

fs: Export anon_inode_getfile_secure() for use by KVM
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

Our plan/hope is to take this through the KVM tree for 6.8; reviews (and acks!)
would be much appreciated. Note, adding AS_UNMOVABLE isn't strictly required as
it's "just" an optimization, but we'd prefer to have it in place straightaway.

Reviews on all the KVM changes, especially the guest_memfd.c implementation, are
also most definitely welcome.

The "what and why" at the very bottom is hopefully old news for most readers. My
plan is to copy the blurb into a tag when this is merged (today's word of the day
is: optimism), e.g. so that the big picture and why we're doing this is captured
in the git history.

Note, the v13 changelog below captures only changes that were not posted and
applied to the v12+ development branch. Those changes can be found in commits
46c10adeda81..74a4d3b6a284 at

https://github.com/kvm-x86/linux.git tags/kvm-x86-guest_memfd-v12

This series can be found at

https://github.com/kvm-x86/linux.git guest_memfd

kvm-x86/guest_memfd is also now being fed into kvm-x86/next, i.e. will be getting
coverage in linux-next as of the next build.

Similar to the v12 "development cycle", any changes needed will be applied on
top of v13, and squashed prior to sending v14 (if needed) or merging (optimism!).

KVM folks, ***LOOK HERE***. v13 has several breaking userspace changes relative
to v12. Some were "necessary" (removal of a pointless ioctl), others were
opportunistic and opinionated (renaming kvm_userspace_memory_region2 fields to
use guest_memfd instead of gmem). I didn't post changes as I found the "issues"
very late (when writing documentation) and didn't want to delay v13.

Here's a diff of the linux/include/uapi/linux/kvm.h changes that will break
userspace developed for v12.

@@ -102,8 +102,8 @@ struct kvm_userspace_memory_region2 {
__u64 guest_phys_addr;
__u64 memory_size;
__u64 userspace_addr;
- __u64 gmem_offset;
- __u32 gmem_fd;
+ __u64 guest_memfd_offset;
+ __u32 guest_memfd;
__u32 pad1;
__u64 pad2[14];
};

@@ -1231,9 +1215,10 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
#define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
#define KVM_CAP_USER_MEMORY2 230
-#define KVM_CAP_MEMORY_ATTRIBUTES 231
-#define KVM_CAP_GUEST_MEMFD 232
-#define KVM_CAP_VM_TYPES 233
+#define KVM_CAP_MEMORY_FAULT_INFO 231
+#define KVM_CAP_MEMORY_ATTRIBUTES 232
+#define KVM_CAP_GUEST_MEMFD 233
+#define KVM_CAP_VM_TYPES 234

#ifdef KVM_CAP_IRQ_ROUTING

@@ -2301,8 +2286,7 @@ struct kvm_s390_zpci_op {
#define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)

/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
-#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES _IOR(KVMIO, 0xd2, __u64)
-#define KVM_SET_MEMORY_ATTRIBUTES _IOW(KVMIO, 0xd3, struct kvm_memory_attributes)
+#define KVM_SET_MEMORY_ATTRIBUTES _IOW(KVMIO, 0xd2, struct kvm_memory_attributes)

v13:
- Drop KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, have KVM_CAP_MEMORY_ATTRIBUTES
return the supported attributes (see the sketch after this list).
- Add KVM_CAP_MEMORY_FAULT_INFO to report support for KVM_EXIT_MEMORY_FAULT,
and shift capability numbers accordingly.
- Do s/gmem/guest_memfd (roughly) in userspace-facing APIs, i.e. use guest_memfd
as the formal name. Going off of various internal conversations, "gmem" isn't
at all intuitive, whereas "guest_memfd" gives readers/listeners a rough idea
of what's going on. If you don't like the rename, then next time volunteer
to write the documentation. :-)
- Rename a leftover "out_restricted" label to "out_unbind".
- Write and clean up changelogs.
- Write and clean up documentation.
- Move "memory_fault" to the standard exit reasons union (requires userspace to
rebuild, but shouldn't require code changes).
- Fix intermediate build issues (hidden behind unselectable Kconfigs).
- Put KVM_CAP_GUEST_MEMFD and KVM_CREATE_GUEST_MEMFD under the same #ifdefs.
- Fix a bug in kvm_mmu_invalidate_range_add() where adding multiple ranges in a
single invalidation would capture only the last range. [Xu Yilun]
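
For reference (see the changelog item above on dropping
KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES), a minimal userspace sketch of the new
discovery flow; this is illustrative only, with "vm_fd" assumed to be an open
VM file descriptor:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /*
   * KVM_CHECK_EXTENSION on KVM_CAP_MEMORY_ATTRIBUTES now returns the
   * supported attributes directly, e.g. a bitmask containing
   * KVM_MEMORY_ATTRIBUTE_PRIVATE, or 0 if memory attributes are unsupported.
   */
  static int supported_memory_attributes(int vm_fd)
  {
          return ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MEMORY_ATTRIBUTES);
  }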

v12: https://lore.kernel.org/all/[email protected]
v11: https://lore.kernel.org/all/[email protected]
v10: https://lore.kernel.org/all/[email protected]

Fodder for a merge tag:
---
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to
provide features, enhancements, and optimizations that are kludgy or outright
impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which, similar
to the generic memfd_create(), creates an anonymous file and returns a file
descriptor that refers to it. Again like "regular" memfd files, guest_memfd
files live in RAM, have volatile storage, and are automatically released when
the last reference is dropped. The key differences from memfd files (and
every other memory subsystem) are that guest_memfd files are bound to their
owning virtual machine, cannot be mapped, read, or written by userspace, and
cannot be resized (guest_memfd files do however support PUNCH_HOLE).
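
For illustration only, a minimal userspace sketch (not part of the series) of
creating a guest_memfd and binding it to a private memslot; "vm_fd" is an
assumed open VM file descriptor and error handling is elided:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Create a 2MiB guest_memfd and bind it to slot 0 as private memory. */
  static int bind_guest_memfd(int vm_fd, __u64 gpa, void *shared_hva)
  {
          struct kvm_create_guest_memfd gmem = {
                  .size = 0x200000,
                  .flags = 0,
          };
          struct kvm_userspace_memory_region2 region = {
                  .slot = 0,
                  .flags = KVM_MEM_PRIVATE,
                  .guest_phys_addr = gpa,
                  .memory_size = 0x200000,
                  /* Shared/conventional memory is still backed by the HVA. */
                  .userspace_addr = (__u64)(unsigned long)shared_hva,
                  .guest_memfd_offset = 0,
          };
          int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

          region.guest_memfd = fd;
          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
  }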

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify
attributes for a given page of guest memory, e.g. in the long term, it will
likely be extended to allow userspace to specify per-gfn RWX protections.
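
E.g. a similarly hedged sketch of converting one page to private (again
assuming "vm_fd"; pass attributes=0 to convert back to shared):

  /* Mark a single 4KiB page of guest memory as private. */
  static int set_page_private(int vm_fd, __u64 gpa)
  {
          struct kvm_memory_attributes attr = {
                  .address = gpa,
                  .size = 0x1000,
                  .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
                  .flags = 0,
          };

          return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
  }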

The immediate and driving use case for guest_memfd is Confidential Computing
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For
KVM CoCo use cases, being able to map memory into KVM guests without requiring
said memory to be mapped into the host is a hard requirement. While SEV+ and
TDX prevent untrusted software from reading guest private data by encrypting
guest memory, pKVM provides confidentiality and integrity *without* relying on
memory encryption. And with SEV-SNP and TDX, accessing guest private memory can
be fatal to the host, i.e. KVM must prevent host userspace from accessing guest
memory irrespective of hardware behavior.

Long term, guest_memfd provides KVM line-of-sight to use cases beyond CoCo VMs,
e.g. KVM currently doesn't support mapping memory as writable in the guest
without it also being writable in host userspace, as KVM's ABI uses userspace
VMA protections to define the allowed guest protections (with an exception
granted to mapping guest memory executable).

Similarly, KVM currently requires the guest mapping size to be a strict subset
of the host userspace mapping size, e.g. KVM doesn't support creating a 1GiB
guest mapping unless userspace also has a 1GiB host mapping. Decoupling the
mapping sizes would allow userspace to precisely map only what is needed
without impacting guest performance, e.g. to again harden against unintentional
accesses to guest memory.

A guest-first memory subsystem also provides clearer line of sight to things
like a dedicated memory pool (for slice-of-hardware VMs) and elimination of
"struct page" (for offload setups where userspace _never_ needs to mmap() guest
memory).

guest_memfd is the result of 3+ years of development and exploration; taking on
memory management responsibilities in KVM was not the first, second, or even
third choice for supporting CoCo VMs. But after many failed attempts to avoid
KVM-specific backing memory, and looking at where things ended up, it is quite
clear that of all approaches tried, guest_memfd is the simplest, most robust,
and most extensible, and the right thing to do for KVM and the kernel at large.
---

Ackerley Tng (1):
KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (8):
KVM: Use gfn instead of hva for mmu_notifier_retry
KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
KVM: Introduce per-page memory attributes
KVM: x86: Disallow hugepages when memory attributes are mixed
KVM: x86/mmu: Handle page fault for private memory
KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
KVM: selftests: Expand set_memory_region_test to validate
guest_memfd()
KVM: selftests: Add basic selftest for guest_memfd()

Sean Christopherson (23):
KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn
ranges
KVM: Assert that mmu_invalidate_in_progress *never* goes negative
KVM: WARN if there are dangling MMU invalidations at VM destruction
KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU
KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to
CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: Introduce KVM_SET_USER_MEMORY_REGION2
KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
KVM: Drop .on_unlock() mmu_notifier hook
KVM: Prepare for handling only shared mappings in mmu_notifier events
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
fs: Export anon_inode_getfile_secure() for use by KVM
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing
memory
KVM: Add transparent hugepage support for dedicated guest memory
KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
KVM: Allow arch code to track number of memslot address spaces per VM
KVM: x86: Add support for "protected VMs" that can utilize private
memory
KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
KVM: selftests: Convert lib's mem regions to
KVM_SET_USER_MEMORY_REGION2
KVM: selftests: Add support for creating private memslots
KVM: selftests: Introduce VM "shape" to allow tests to specify the VM
type
KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data

Vishal Annapurve (3):
KVM: selftests: Add helpers to convert guest memory b/w private and
shared
KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls
(x86)
KVM: selftests: Add x86-only selftest for private memory conversions

Documentation/virt/kvm/api.rst | 208 ++++++
arch/arm64/include/asm/kvm_host.h | 2 -
arch/arm64/kvm/Kconfig | 2 +-
arch/mips/include/asm/kvm_host.h | 2 -
arch/mips/kvm/Kconfig | 2 +-
arch/powerpc/include/asm/kvm_host.h | 2 -
arch/powerpc/kvm/Kconfig | 8 +-
arch/powerpc/kvm/book3s_hv.c | 2 +-
arch/powerpc/kvm/powerpc.c | 7 +-
arch/riscv/include/asm/kvm_host.h | 2 -
arch/riscv/kvm/Kconfig | 2 +-
arch/x86/include/asm/kvm_host.h | 17 +-
arch/x86/include/uapi/asm/kvm.h | 3 +
arch/x86/kvm/Kconfig | 14 +-
arch/x86/kvm/debugfs.c | 2 +-
arch/x86/kvm/mmu/mmu.c | 271 +++++++-
arch/x86/kvm/mmu/mmu_internal.h | 2 +
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/x86.c | 26 +-
fs/anon_inodes.c | 1 +
include/linux/kvm_host.h | 143 ++++-
include/linux/kvm_types.h | 1 +
include/linux/pagemap.h | 19 +-
include/uapi/linux/kvm.h | 51 ++
mm/compaction.c | 43 +-
mm/migrate.c | 2 +
tools/testing/selftests/kvm/Makefile | 3 +
tools/testing/selftests/kvm/dirty_log_test.c | 2 +-
.../testing/selftests/kvm/guest_memfd_test.c | 221 +++++++
.../selftests/kvm/include/kvm_util_base.h | 148 ++++-
.../testing/selftests/kvm/include/test_util.h | 5 +
.../selftests/kvm/include/ucall_common.h | 11 +
.../selftests/kvm/include/x86_64/processor.h | 15 +
.../selftests/kvm/kvm_page_table_test.c | 2 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 233 ++++---
tools/testing/selftests/kvm/lib/memstress.c | 3 +-
.../selftests/kvm/set_memory_region_test.c | 100 +++
.../kvm/x86_64/private_mem_conversions_test.c | 487 ++++++++++++++
.../kvm/x86_64/private_mem_kvm_exits_test.c | 120 ++++
.../kvm/x86_64/ucna_injection_test.c | 2 +-
virt/kvm/Kconfig | 17 +
virt/kvm/Makefile.kvm | 1 +
virt/kvm/dirty_ring.c | 2 +-
virt/kvm/guest_memfd.c | 607 ++++++++++++++++++
virt/kvm/kvm_main.c | 505 ++++++++++++---
virt/kvm/kvm_mm.h | 26 +
46 files changed, 3083 insertions(+), 272 deletions(-)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
create mode 100644 virt/kvm/guest_memfd.c


base-commit: 2b3f2325e71f09098723727d665e2e8003d455dc
--
2.42.0.820.g83a721a137-goog


2023-10-27 18:25:33

by Sean Christopherson

Subject: [PATCH v13 26/35] KVM: selftests: Add support for creating private memslots

Add support for creating "private" memslots via KVM_CREATE_GUEST_MEMFD and
KVM_SET_USER_MEMORY_REGION2. Make vm_userspace_mem_region_add() a wrapper
to its effective replacement, vm_mem_add(), so that private memslots are
fully opt-in, i.e. don't require updating all tests that add memory regions.

Pivot on the KVM_MEM_PRIVATE flag instead of the validity of the "gmem"
file descriptor so that simple tests can let vm_mem_add() do the heavy
lifting of creating the guest memfd, but also allow the caller to pass in
an explicit fd+offset so that fancier tests can do things like back
multiple memslots with a single file. If the caller passes in a fd, dup()
the fd so that (a) __vm_mem_region_delete() can close the fd associated
with the memory region without needing yet another flag, and (b) so that
the caller can safely close its copy of the fd without having to first
destroy memslots.
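
E.g. a hypothetical test snippet (not part of this patch) that backs two
private memslots with a single guest_memfd, where "size" is npages multiplied
by the VM's page size:

  int fd = vm_create_guest_memfd(vm, 2 * size, 0);

  vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa, slot, npages,
             KVM_MEM_PRIVATE, fd, 0);
  vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa + size, slot + 1, npages,
             KVM_MEM_PRIVATE, fd, size);

  /* vm_mem_add() dup()s the fd, the test's copy can be closed right away. */
  close(fd);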

Co-developed-by: Ackerley Tng <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 23 +++++
.../testing/selftests/kvm/include/test_util.h | 5 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 85 ++++++++++++-------
3 files changed, 82 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f144841c2ee..9f861182c02a 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -431,6 +431,26 @@ static inline uint64_t vm_get_stat(struct kvm_vm *vm, const char *stat_name)

void vm_create_irqchip(struct kvm_vm *vm);

+static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+ uint64_t flags)
+{
+ struct kvm_create_guest_memfd guest_memfd = {
+ .size = size,
+ .flags = flags,
+ };
+
+ return __vm_ioctl(vm, KVM_CREATE_GUEST_MEMFD, &guest_memfd);
+}
+
+static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+ uint64_t flags)
+{
+ int fd = __vm_create_guest_memfd(vm, size, flags);
+
+ TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_CREATE_GUEST_MEMFD, fd));
+ return fd;
+}
+
void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
uint64_t gpa, uint64_t size, void *hva);
int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
@@ -439,6 +459,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
uint32_t flags);
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+ uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+ uint32_t flags, int guest_memfd_fd, uint64_t guest_memfd_offset);

void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags);
void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa);
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 7e614adc6cf4..7257f2243ab9 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -142,6 +142,11 @@ static inline bool backing_src_is_shared(enum vm_mem_backing_src_type t)
return vm_mem_backing_src_alias(t)->flag & MAP_SHARED;
}

+static inline bool backing_src_can_be_huge(enum vm_mem_backing_src_type t)
+{
+ return t != VM_MEM_SRC_ANONYMOUS && t != VM_MEM_SRC_SHMEM;
+}
+
/* Aligns x up to the next multiple of size. Size must be a power of 2. */
static inline uint64_t align_up(uint64_t x, uint64_t size)
{
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 3676b37bea38..45050f54701a 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -669,6 +669,8 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret));
close(region->fd);
}
+ if (region->region.guest_memfd >= 0)
+ close(region->region.guest_memfd);

free(region);
}
@@ -870,36 +872,15 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
errno, strerror(errno));
}

-/*
- * VM Userspace Memory Region Add
- *
- * Input Args:
- * vm - Virtual Machine
- * src_type - Storage source for this region.
- * NULL to use anonymous memory.
- * guest_paddr - Starting guest physical address
- * slot - KVM region slot
- * npages - Number of physical pages
- * flags - KVM memory region flags (e.g. KVM_MEM_LOG_DIRTY_PAGES)
- *
- * Output Args: None
- *
- * Return: None
- *
- * Allocates a memory area of the number of pages specified by npages
- * and maps it to the VM specified by vm, at a starting physical address
- * given by guest_paddr. The region is created with a KVM region slot
- * given by slot, which must be unique and < KVM_MEM_SLOTS_NUM. The
- * region is created with the flags given by flags.
- */
-void vm_userspace_mem_region_add(struct kvm_vm *vm,
- enum vm_mem_backing_src_type src_type,
- uint64_t guest_paddr, uint32_t slot, uint64_t npages,
- uint32_t flags)
+/* FIXME: This thing needs to be ripped apart and rewritten. */
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+ uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+ uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset)
{
int ret;
struct userspace_mem_region *region;
size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
+ size_t mem_size = npages * vm->page_size;
size_t alignment;

TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
@@ -952,7 +933,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
/* Allocate and initialize new mem region structure. */
region = calloc(1, sizeof(*region));
TEST_ASSERT(region != NULL, "Insufficient Memory");
- region->mmap_size = npages * vm->page_size;
+ region->mmap_size = mem_size;

#ifdef __s390x__
/* On s390x, the host address must be aligned to 1M (due to PGSTEs) */
@@ -999,14 +980,47 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
/* As needed perform madvise */
if ((src_type == VM_MEM_SRC_ANONYMOUS ||
src_type == VM_MEM_SRC_ANONYMOUS_THP) && thp_configured()) {
- ret = madvise(region->host_mem, npages * vm->page_size,
+ ret = madvise(region->host_mem, mem_size,
src_type == VM_MEM_SRC_ANONYMOUS ? MADV_NOHUGEPAGE : MADV_HUGEPAGE);
TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx src_type: %s",
- region->host_mem, npages * vm->page_size,
+ region->host_mem, mem_size,
vm_mem_backing_src_alias(src_type)->name);
}

region->backing_src_type = src_type;
+
+ if (flags & KVM_MEM_PRIVATE) {
+ if (guest_memfd < 0) {
+ uint32_t guest_memfd_flags = 0;
+
+ /*
+ * Allow hugepages for the guest memfd backing if the
+ * "normal" backing is allowed/required to be huge.
+ */
+ if (src_type != VM_MEM_SRC_ANONYMOUS &&
+ src_type != VM_MEM_SRC_SHMEM)
+ guest_memfd_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+
+ TEST_ASSERT(!guest_memfd_offset,
+ "Offset must be zero when creating new guest_memfd");
+ guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
+ } else {
+ /*
+ * Install a unique fd for each memslot so that the fd
+ * can be closed when the region is deleted without
+ * needing to track if the fd is owned by the framework
+ * or by the caller.
+ */
+ guest_memfd = dup(guest_memfd);
+ TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
+ }
+
+ region->region.guest_memfd = guest_memfd;
+ region->region.guest_memfd_offset = guest_memfd_offset;
+ } else {
+ region->region.guest_memfd = -1;
+ }
+
region->unused_phy_pages = sparsebit_alloc();
sparsebit_set_num(region->unused_phy_pages,
guest_paddr >> vm->page_shift, npages);
@@ -1019,9 +1033,10 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
" rc: %i errno: %i\n"
" slot: %u flags: 0x%x\n"
- " guest_phys_addr: 0x%lx size: 0x%lx",
+ " guest_phys_addr: 0x%lx size: 0x%lx guest_memfd: %d\n",
ret, errno, slot, flags,
- guest_paddr, (uint64_t) region->region.memory_size);
+ guest_paddr, (uint64_t) region->region.memory_size,
+ region->region.guest_memfd);

/* Add to quick lookup data structures */
vm_userspace_mem_region_gpa_insert(&vm->regions.gpa_tree, region);
@@ -1042,6 +1057,14 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
}
}

+void vm_userspace_mem_region_add(struct kvm_vm *vm,
+ enum vm_mem_backing_src_type src_type,
+ uint64_t guest_paddr, uint32_t slot,
+ uint64_t npages, uint32_t flags)
+{
+ vm_mem_add(vm, src_type, guest_paddr, slot, npages, flags, -1, 0);
+}
+
/*
* Memslot to region
*
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:26:01

by Sean Christopherson

Subject: [PATCH v13 27/35] KVM: selftests: Add helpers to convert guest memory b/w private and shared

From: Vishal Annapurve <[email protected]>

Add helpers to convert memory between private and shared via KVM's
memory attributes, as well as helpers to free/allocate guest_memfd memory
via fallocate(). Userspace, i.e. tests, is NOT required to do fallocate()
when converting memory, as the attributes are the single source of truth.
Provide allocate() helpers so that tests can mimic a userspace that frees
private memory on conversion, e.g. to prioritize memory usage over
performance.
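
E.g. a hypothetical conversion sequence (not from this patch) for a userspace
that eagerly frees private memory:

  /* Convert to shared, then free the stale private backing via PUNCH_HOLE. */
  vm_mem_set_shared(vm, gpa, size);
  vm_guest_mem_punch_hole(vm, gpa, size);

  /* Convert back to private, preallocating the guest_memfd backing. */
  vm_mem_set_private(vm, gpa, size);
  vm_guest_mem_allocate(vm, gpa, size);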

Signed-off-by: Vishal Annapurve <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 48 +++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 28 +++++++++++
2 files changed, 76 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f861182c02a..1441fca6c273 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -333,6 +333,54 @@ static inline void vm_enable_cap(struct kvm_vm *vm, uint32_t cap, uint64_t arg0)
vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
}

+static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
+ uint64_t size, uint64_t attributes)
+{
+ struct kvm_memory_attributes attr = {
+ .attributes = attributes,
+ .address = gpa,
+ .size = size,
+ .flags = 0,
+ };
+
+ /*
+ * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes. These flows
+ * need significant enhancements to support multiple attributes.
+ */
+ TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
+ "Update me to support multiple attributes!");
+
+ vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+}
+
+
+static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
+ uint64_t size)
+{
+ vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
+ uint64_t size)
+{
+ vm_set_memory_attributes(vm, gpa, size, 0);
+}
+
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
+ bool punch_hole);
+
+static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa,
+ uint64_t size)
+{
+ vm_guest_mem_fallocate(vm, gpa, size, true);
+}
+
+static inline void vm_guest_mem_allocate(struct kvm_vm *vm, uint64_t gpa,
+ uint64_t size)
+{
+ vm_guest_mem_fallocate(vm, gpa, size, false);
+}
+
void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
const char *vm_guest_mode_string(uint32_t i);

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 45050f54701a..a140aee8d0f5 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1176,6 +1176,34 @@ void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot)
__vm_mem_region_delete(vm, memslot2region(vm, slot), true);
}

+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
+ bool punch_hole)
+{
+ const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
+ struct userspace_mem_region *region;
+ uint64_t end = base + size;
+ uint64_t gpa, len;
+ off_t fd_offset;
+ int ret;
+
+ for (gpa = base; gpa < end; gpa += len) {
+ uint64_t offset;
+
+ region = userspace_mem_region_find(vm, gpa, gpa);
+ TEST_ASSERT(region && region->region.flags & KVM_MEM_PRIVATE,
+ "Private memory region not found for GPA 0x%lx", gpa);
+
+ offset = (gpa - region->region.guest_phys_addr);
+ fd_offset = region->region.guest_memfd_offset + offset;
+ len = min_t(uint64_t, end - gpa, region->region.memory_size - offset);
+
+ ret = fallocate(region->region.guest_memfd, mode, fd_offset, len);
+ TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), fd = %d, mode = %x, offset = %lx\n",
+ punch_hole ? "punch hole" : "allocate", gpa, len,
+ region->region.guest_memfd, mode, fd_offset);
+ }
+}
+
/* Returns the size of a vCPU's kvm_run structure. */
static int vcpu_mmap_sz(void)
{
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:26:31

by Sean Christopherson

Subject: [PATCH v13 32/35] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper

From: Chao Peng <[email protected]>

Add helpers to invoke KVM_SET_USER_MEMORY_REGION2 directly so that tests
can validate features that are unique to "version 2" of "set user
memory region", e.g. do negative testing on guest_memfd and
guest_memfd_offset.

Provide a raw version as well as an assert-success version to reduce
the amount of boilerplate code needed for basic usage.
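
E.g. a hypothetical negative test enabled by the raw helper (sketch, not from
this patch; assumes KVM rejects a misaligned guest_memfd_offset with EINVAL):

  int fd = vm_create_guest_memfd(vm, size, 0);
  int r = __vm_set_user_memory_region2(vm, slot, KVM_MEM_PRIVATE, gpa, size,
                                       hva, fd, 1);

  TEST_ASSERT(r && errno == EINVAL,
              "Misaligned guest_memfd_offset should fail with EINVAL");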

Signed-off-by: Chao Peng <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 7 +++++
tools/testing/selftests/kvm/lib/kvm_util.c | 29 +++++++++++++++++++
2 files changed, 36 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 157508c071f3..8ec122f5fcc8 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -522,6 +522,13 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
uint64_t gpa, uint64_t size, void *hva);
int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
uint64_t gpa, uint64_t size, void *hva);
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+ uint64_t gpa, uint64_t size, void *hva,
+ uint32_t guest_memfd, uint64_t guest_memfd_offset);
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+ uint64_t gpa, uint64_t size, void *hva,
+ uint32_t guest_memfd, uint64_t guest_memfd_offset);
+
void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 52b131e3aca5..1620452c1cf7 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -873,6 +873,35 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
errno, strerror(errno));
}

+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+ uint64_t gpa, uint64_t size, void *hva,
+ uint32_t guest_memfd, uint64_t guest_memfd_offset)
+{
+ struct kvm_userspace_memory_region2 region = {
+ .slot = slot,
+ .flags = flags,
+ .guest_phys_addr = gpa,
+ .memory_size = size,
+ .userspace_addr = (uintptr_t)hva,
+ .guest_memfd = guest_memfd,
+ .guest_memfd_offset = guest_memfd_offset,
+ };
+
+ return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
+}
+
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+ uint64_t gpa, uint64_t size, void *hva,
+ uint32_t guest_memfd, uint64_t guest_memfd_offset)
+{
+ int ret = __vm_set_user_memory_region2(vm, slot, flags, gpa, size, hva,
+ guest_memfd, guest_memfd_offset);
+
+ TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed, errno = %d (%s)",
+ errno, strerror(errno));
+}
+
+
/* FIXME: This thing needs to be ripped apart and rewritten. */
void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:26:48

by Sean Christopherson

Subject: [PATCH v13 35/35] KVM: selftests: Test KVM exit behavior for private memory/access

From: Ackerley Tng <[email protected]>

"Testing private access when memslot gets deleted" tests the behavior
of KVM when a private memslot gets deleted while the VM is using the
private memslot. When KVM looks up the deleted (slot = NULL) memslot,
KVM should exit to userspace with KVM_EXIT_MEMORY_FAULT.

In the second test, upon a private access to non-private memslot, KVM
should also exit to userspace with KVM_EXIT_MEMORY_FAULT.

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.

Signed-off-by: Ackerley Tng <[email protected]>
[sean: call out the similarities with set_memory_region_test]
Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../kvm/x86_64/private_mem_kvm_exits_test.c | 120 ++++++++++++++++++
2 files changed, 121 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 2b1ef809d73a..f7fdd8244547 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -82,6 +82,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
new file mode 100644
index 000000000000..7f7ca4475745
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#include <linux/kvm.h>
+#include <pthread.h>
+#include <stdint.h>
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+/* Arbitrarily selected to avoid overlaps with anything else */
+#define EXITS_TEST_GVA 0xc0000000
+#define EXITS_TEST_GPA EXITS_TEST_GVA
+#define EXITS_TEST_NPAGES 1
+#define EXITS_TEST_SIZE (EXITS_TEST_NPAGES * PAGE_SIZE)
+#define EXITS_TEST_SLOT 10
+
+static uint64_t guest_repeatedly_read(void)
+{
+ volatile uint64_t value;
+
+ while (true)
+ value = *((uint64_t *) EXITS_TEST_GVA);
+
+ return value;
+}
+
+static uint32_t run_vcpu_get_exit_reason(struct kvm_vcpu *vcpu)
+{
+ int r;
+
+ r = _vcpu_run(vcpu);
+ if (r) {
+ TEST_ASSERT(errno == EFAULT, KVM_IOCTL_ERROR(KVM_RUN, r));
+ TEST_ASSERT_EQ(vcpu->run->exit_reason, KVM_EXIT_MEMORY_FAULT);
+ }
+ return vcpu->run->exit_reason;
+}
+
+const struct vm_shape protected_vm_shape = {
+ .mode = VM_MODE_DEFAULT,
+ .type = KVM_X86_SW_PROTECTED_VM,
+};
+
+static void test_private_access_memslot_deleted(void)
+{
+ struct kvm_vm *vm;
+ struct kvm_vcpu *vcpu;
+ pthread_t vm_thread;
+ void *thread_return;
+ uint32_t exit_reason;
+
+ vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+ guest_repeatedly_read);
+
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ EXITS_TEST_GPA, EXITS_TEST_SLOT,
+ EXITS_TEST_NPAGES,
+ KVM_MEM_PRIVATE);
+
+ virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+ /* Request to access page privately */
+ vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+ pthread_create(&vm_thread, NULL,
+ (void *(*)(void *))run_vcpu_get_exit_reason,
+ (void *)vcpu);
+
+ vm_mem_region_delete(vm, EXITS_TEST_SLOT);
+
+ pthread_join(vm_thread, &thread_return);
+ exit_reason = (uint32_t)(uint64_t)thread_return;
+
+ TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+
+ kvm_vm_free(vm);
+}
+
+static void test_private_access_memslot_not_private(void)
+{
+ struct kvm_vm *vm;
+ struct kvm_vcpu *vcpu;
+ uint32_t exit_reason;
+
+ vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+ guest_repeatedly_read);
+
+ /* Add a non-private memslot (flags = 0) */
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ EXITS_TEST_GPA, EXITS_TEST_SLOT,
+ EXITS_TEST_NPAGES, 0);
+
+ virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+ /* Request to access page privately */
+ vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+ exit_reason = run_vcpu_get_exit_reason(vcpu);
+
+ TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+
+ kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+ TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+
+ test_private_access_memslot_deleted();
+ test_private_access_memslot_not_private();
+}
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:27:26

by Sean Christopherson

Subject: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
support for guest private memory can be added without needing an entirely
separate set of helpers.

Note, this obviously makes selftests backwards-incompatible with older KVM
versions from this point forward.
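
If backwards compatibility were ever desired, a hypothetical one-line guard
(not added by this patch) could gate the library on the new ioctl:

  /* KVM_CAP_USER_MEMORY2 advertises KVM_SET_USER_MEMORY_REGION2 support. */
  TEST_REQUIRE(kvm_check_cap(KVM_CAP_USER_MEMORY2));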

Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 2 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 19 ++++++++++---------
2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 967eaaeacd75..9f144841c2ee 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -44,7 +44,7 @@ typedef uint64_t vm_paddr_t; /* Virtual Machine (Guest) physical address */
typedef uint64_t vm_vaddr_t; /* Virtual Machine (Guest) virtual address */

struct userspace_mem_region {
- struct kvm_userspace_memory_region region;
+ struct kvm_userspace_memory_region2 region;
struct sparsebit *unused_phy_pages;
int fd;
off_t offset;
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index f09295d56c23..3676b37bea38 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -453,8 +453,9 @@ void kvm_vm_restart(struct kvm_vm *vmp)
vm_create_irqchip(vmp);

hash_for_each(vmp->regions.slot_hash, ctr, region, slot_node) {
- int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION, &region->region);
- TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+ int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION2, &region->region);
+
+ TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
" rc: %i errno: %i\n"
" slot: %u flags: 0x%x\n"
" guest_phys_addr: 0x%llx size: 0x%llx",
@@ -657,7 +658,7 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
}

region->region.memory_size = 0;
- vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+ vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);

sparsebit_free(&region->unused_phy_pages);
ret = munmap(region->mmap_start, region->mmap_size);
@@ -1014,8 +1015,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->region.guest_phys_addr = guest_paddr;
region->region.memory_size = npages * vm->page_size;
region->region.userspace_addr = (uintptr_t) region->host_mem;
- ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
- TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+ ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+ TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
" rc: %i errno: %i\n"
" slot: %u flags: 0x%x\n"
" guest_phys_addr: 0x%lx size: 0x%lx",
@@ -1097,9 +1098,9 @@ void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags)

region->region.flags = flags;

- ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+ ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);

- TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+ TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
" rc: %i errno: %i slot: %u flags: 0x%x",
ret, errno, slot, flags);
}
@@ -1127,9 +1128,9 @@ void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa)

region->region.guest_phys_addr = new_gpa;

- ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+ ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);

- TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed\n"
+ TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed\n"
"ret: %i errno: %i slot: %u new_gpa: 0x%lx",
ret, errno, slot, new_gpa);
}
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:27:49

by Sean Christopherson

Subject: [PATCH v13 06/35] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU

Advertise that KVM's MMU is synchronized with the primary MMU for all
flavors of PPC KVM support, i.e. advertise that the MMU is synchronized
when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y but the VM is not using hypervisor
mode (a.k.a. PR VMs). PR VMs, via kvm_unmap_gfn_range_pr(), do the right
thing for mmu_notifier invalidation events, and more tellingly, KVM
returns '1' for KVM_CAP_SYNC_MMU when CONFIG_KVM_BOOK3S_HV_POSSIBLE=n
and CONFIG_KVM_BOOK3S_PR_POSSIBLE=y, i.e. KVM already advertises a
synchronized MMU for PR VMs, just not when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/powerpc/kvm/powerpc.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b0a512ede764..8d3ec483bc2b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -635,11 +635,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
BUILD_BUG();
#endif
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
- r = hv_enabled;
-#else
r = 1;
-#endif
break;
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
case KVM_CAP_PPC_HTAB_FD:
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:27:58

by Sean Christopherson

Subject: [PATCH v13 24/35] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper

Drop kvm_userspace_memory_region_find(), it's unused and a terrible API
(probably why it's unused). If anything outside of kvm_util.c needs to
get at the memslot, userspace_mem_region_find() can be exposed to give
others full access to all memory region/slot information.

Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 4 ---
tools/testing/selftests/kvm/lib/kvm_util.c | 29 -------------------
2 files changed, 33 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a18db6a7b3cf..967eaaeacd75 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -776,10 +776,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, unsigned int num_guest_pages)
return n;
}

-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
- uint64_t end);
-
#define sync_global_to_guest(vm, g) ({ \
typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g)); \
memcpy(_p, &(g), sizeof(g)); \
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 7a8af1821f5d..f09295d56c23 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -590,35 +590,6 @@ userspace_mem_region_find(struct kvm_vm *vm, uint64_t start, uint64_t end)
return NULL;
}

-/*
- * KVM Userspace Memory Region Find
- *
- * Input Args:
- * vm - Virtual Machine
- * start - Starting VM physical address
- * end - Ending VM physical address, inclusive.
- *
- * Output Args: None
- *
- * Return:
- * Pointer to overlapping region, NULL if no such region.
- *
- * Public interface to userspace_mem_region_find. Allows tests to look up
- * the memslot datastructure for a given range of guest physical memory.
- */
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
- uint64_t end)
-{
- struct userspace_mem_region *region;
-
- region = userspace_mem_region_find(vm, start, end);
- if (!region)
- return NULL;
-
- return &region->region;
-}
-
__weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
{

--
2.42.0.820.g83a721a137-goog

2023-10-27 18:28:03

by Sean Christopherson

Subject: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

Extend guest_memfd to allow backing guest memory with transparent
hugepages. Require userspace to opt-in via a flag even though there's no
known/anticipated use case for forcing small pages as THP is optional,
i.e. to avoid ending up in a situation where userspace is unaware that
KVM can't provide hugepages.

For simplicity, require the guest_memfd size to be a multiple of the
hugepage size, e.g. so that KVM doesn't need to do bounds checking when
deciding whether or not to allocate a huge folio.

When reporting the max order for a pfn that KVM gets from guest_memfd, force
order-0 pages if the hugepage is not fully contained by the memslot binding,
e.g. if userspace requested hugepages but punches a hole in the memslot
bindings in order to emulate x86's VGA hole.
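
For illustration, a hypothetical userspace snippet (not part of this patch)
opting in to hugepages, assuming a 2MiB PMD hugepage size and an open VM file
descriptor "vm_fd":

  struct kvm_create_guest_memfd gmem = {
          .size = 1024 * 1024 * 1024, /* 1GiB, a multiple of the 2MiB THP size */
          .flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
  };
  int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);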

Signed-off-by: Sean Christopherson <[email protected]>
---
Documentation/virt/kvm/api.rst | 7 ++++
include/uapi/linux/kvm.h | 2 +
virt/kvm/guest_memfd.c | 73 ++++++++++++++++++++++++++++++----
3 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e82c69d5e755..7f00c310c24a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6176,6 +6176,8 @@ and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
__u64 reserved[6];
};

+ #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
Conceptually, the inode backing a guest_memfd file represents physical memory,
i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The
file itself, which is bound to a "struct kvm", is that instance's view of the
@@ -6192,6 +6194,11 @@ most one mapping per page, i.e. binding multiple memory regions to a single
guest_memfd range is not allowed (any number of memory regions can be bound to
a single guest_memfd file, but the bound ranges must not overlap).

+If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
+and map hugepages for the guest_memfd file. This is currently best effort. If
+KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, the size must be aligned to the maximum
+transparent hugepage size supported by the kernel
+
See KVM_SET_USER_MEMORY_REGION2 for additional details.

5. The kvm_run structure
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 25caee8d1a80..33d542de0a61 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2303,4 +2303,6 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 98a12da80214..94bc478c26f3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -13,14 +13,47 @@ struct kvm_gmem {
struct list_head entry;
};

+static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ unsigned long huge_index = round_down(index, HPAGE_PMD_NR);
+ unsigned long flags = (unsigned long)inode->i_private;
+ struct address_space *mapping = inode->i_mapping;
+ gfp_t gfp = mapping_gfp_mask(mapping);
+ struct folio *folio;
+
+ if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
+ return NULL;
+
+ if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
+ (huge_index + HPAGE_PMD_NR - 1) << PAGE_SHIFT))
+ return NULL;
+
+ folio = filemap_alloc_folio(gfp, HPAGE_PMD_ORDER);
+ if (!folio)
+ return NULL;
+
+ if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
+ folio_put(folio);
+ return NULL;
+ }
+
+ return folio;
+#else
+ return NULL;
+#endif
+}
+
static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
{
struct folio *folio;

- /* TODO: Support huge pages. */
- folio = filemap_grab_folio(inode->i_mapping, index);
- if (IS_ERR_OR_NULL(folio))
- return NULL;
+ folio = kvm_gmem_get_huge_folio(inode, index);
+ if (!folio) {
+ folio = filemap_grab_folio(inode->i_mapping, index);
+ if (IS_ERR_OR_NULL(folio))
+ return NULL;
+ }

/*
* Use the up-to-date flag to track whether or not the memory has been
@@ -373,6 +406,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
inode->i_mode |= S_IFREG;
inode->i_size = size;
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ mapping_set_large_folios(inode->i_mapping);
mapping_set_unmovable(inode->i_mapping);
/* Unmovable mappings are supposed to be marked unevictable as well. */
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
@@ -398,12 +432,21 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
u64 flags = args->flags;
u64 valid_flags = 0;

+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+
if (flags & ~valid_flags)
return -EINVAL;

if (size < 0 || !PAGE_ALIGNED(size))
return -EINVAL;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+ !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+ return -EINVAL;
+#endif
+
return __kvm_gmem_create(kvm, size, flags);
}

@@ -501,7 +544,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
{
- pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+ pgoff_t index, huge_index;
struct kvm_gmem *gmem;
struct folio *folio;
struct page *page;
@@ -514,6 +557,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,

gmem = file->private_data;

+ index = gfn - slot->base_gfn + slot->gmem.pgoff;
if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
r = -EIO;
goto out_fput;
@@ -533,9 +577,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
page = folio_file_page(folio, index);

*pfn = page_to_pfn(page);
- if (max_order)
+ if (!max_order)
+ goto success;
+
+ *max_order = compound_order(compound_head(page));
+ if (!*max_order)
+ goto success;
+
+ /*
+ * The folio can be mapped with a hugepage if and only if the folio is
+ * fully contained by the range the memslot is bound to. Note, the
+ * caller is responsible for handling gfn alignment, this only deals
+ * with the file binding.
+ */
+ huge_index = ALIGN(index, 1ull << *max_order);
+ if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
+ huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages)
*max_order = 0;
-
+success:
r = 0;

out_unlock:
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:28:45

by Sean Christopherson

Subject: [PATCH v13 30/35] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data

Add GUEST_SYNC[1-6]() so that tests can pass the maximum amount of
information supported via ucall(), without needing to resort to shared
memory.
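
E.g. a hypothetical guest/host pairing (not from this patch) to show the
plumbing, where handle_sync() is a made-up host-side consumer:

  /* Guest: pass five values in a single ucall. */
  GUEST_SYNC5(gpa, size, flags, expected, actual);

  /* Host: retrieve the values from the ucall arguments. */
  struct ucall uc;

  switch (get_ucall(vcpu, &uc)) {
  case UCALL_SYNC:
          handle_sync(uc.args[0], uc.args[1], uc.args[2],
                      uc.args[3], uc.args[4]);
          break;
  default:
          TEST_FAIL("Unexpected ucall: %lu", uc.cmd);
  }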

Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/include/ucall_common.h | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/ucall_common.h b/tools/testing/selftests/kvm/include/ucall_common.h
index ce33d306c2cb..0fb472a5a058 100644
--- a/tools/testing/selftests/kvm/include/ucall_common.h
+++ b/tools/testing/selftests/kvm/include/ucall_common.h
@@ -52,6 +52,17 @@ int ucall_nr_pages_required(uint64_t page_size);
#define GUEST_SYNC_ARGS(stage, arg1, arg2, arg3, arg4) \
ucall(UCALL_SYNC, 6, "hello", stage, arg1, arg2, arg3, arg4)
#define GUEST_SYNC(stage) ucall(UCALL_SYNC, 2, "hello", stage)
+#define GUEST_SYNC1(arg0) ucall(UCALL_SYNC, 1, arg0)
+#define GUEST_SYNC2(arg0, arg1) ucall(UCALL_SYNC, 2, arg0, arg1)
+#define GUEST_SYNC3(arg0, arg1, arg2) \
+ ucall(UCALL_SYNC, 3, arg0, arg1, arg2)
+#define GUEST_SYNC4(arg0, arg1, arg2, arg3) \
+ ucall(UCALL_SYNC, 4, arg0, arg1, arg2, arg3)
+#define GUEST_SYNC5(arg0, arg1, arg2, arg3, arg4) \
+ ucall(UCALL_SYNC, 5, arg0, arg1, arg2, arg3, arg4)
+#define GUEST_SYNC6(arg0, arg1, arg2, arg3, arg4, arg5) \
+ ucall(UCALL_SYNC, 6, arg0, arg1, arg2, arg3, arg4, arg5)
+
#define GUEST_PRINTF(_fmt, _args...) ucall_fmt(UCALL_PRINTF, _fmt, ##_args)
#define GUEST_DONE() ucall(UCALL_DONE, 0)

--
2.42.0.820.g83a721a137-goog

2023-10-27 18:29:31

by Sean Christopherson

Subject: [PATCH v13 31/35] KVM: selftests: Add x86-only selftest for private memory conversions

From: Vishal Annapurve <[email protected]>

Add a selftest to exercise implicit/explicit conversion functionality
within KVM and verify:

- Shared memory is visible to host userspace
- Private memory is not visible to host userspace
- Host userspace and guest can communicate over shared memory
- Data in shared backing is preserved across conversions (test's
host userspace doesn't free the data)
- Private memory is bound to the lifetime of the VM

Ideally, KVM's selftests infrastructure would be reworked to allow backing
a single region of guest memory with multiple memslots for _all_ backing
types and shapes, i.e. ideally the code for using a single backing fd
across multiple memslots would work for "regular" memory as well. But
sadly, support for KVM_CREATE_GUEST_MEMFD has languished for far too long,
and overhauling selftests' memslots infrastructure would likely open a can
of worms, i.e. delay things even further.

In addition to the more obvious tests, verify that PUNCH_HOLE actually
frees memory. Directly verifying that KVM frees memory is impractical, if
it's even possible, so instead indirectly verify memory is freed by
asserting that the guest reads zeroes after a PUNCH_HOLE. E.g. if KVM
zaps SPTEs but doesn't actually punch a hole in the inode, the subsequent
read will still see the previous value. And obviously punching a hole
shouldn't cause explosions.

Let the user specify the number of memslots in the private mem conversion
test, i.e. don't require the number of memslots to be '1' or "nr_vcpus".
Creating more memslots than vCPUs is particularly interesting, e.g. it can
result in a single KVM_SET_MEMORY_ATTRIBUTES spanning multiple memslots.
To keep the math reasonable, align each vCPU's chunk to at least 2MiB (the
size is 2MiB+4KiB), and require the total size to be cleanly divisible by
the number of memslots. The goal is to be able to validate that KVM plays
nice with multiple memslots; being able to create a truly arbitrary number
of memslots doesn't add meaningful value, i.e. isn't worth the cost.

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.

Signed-off-by: Vishal Annapurve <[email protected]>
Co-developed-by: Ackerley Tng <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../kvm/x86_64/private_mem_conversions_test.c | 487 ++++++++++++++++++
2 files changed, 488 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index a3bb36fb3cfc..b709a52d5cdb 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -81,6 +81,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
new file mode 100644
index 000000000000..be311944e90a
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -0,0 +1,487 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <fcntl.h>
+#include <limits.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <linux/kvm_para.h>
+#include <linux/memfd.h>
+#include <linux/sizes.h>
+
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+#define BASE_DATA_SLOT 10
+#define BASE_DATA_GPA ((uint64_t)(1ull << 32))
+#define PER_CPU_DATA_SIZE ((uint64_t)(SZ_2M + PAGE_SIZE))
+
+/* Horrific macro so that the line info is captured accurately :-( */
+#define memcmp_g(gpa, pattern, size) \
+do { \
+ uint8_t *mem = (uint8_t *)gpa; \
+ size_t i; \
+ \
+ for (i = 0; i < size; i++) \
+ __GUEST_ASSERT(mem[i] == pattern, \
+ "Guest expected 0x%x at offset %lu (gpa 0x%llx), got 0x%x", \
+ pattern, i, gpa + i, mem[i]); \
+} while (0)
+
+static void memcmp_h(uint8_t *mem, uint64_t gpa, uint8_t pattern, size_t size)
+{
+ size_t i;
+
+ for (i = 0; i < size; i++)
+ TEST_ASSERT(mem[i] == pattern,
+ "Host expected 0x%x at gpa 0x%lx, got 0x%x",
+ pattern, gpa + i, mem[i]);
+}
+
+/*
+ * Run memory conversion tests with explicit conversion:
+ * Execute KVM hypercall to map/unmap gpa range which will cause userspace exit
+ * to back/unback private memory. Subsequent accesses by guest to the gpa range
+ * will not cause exit to userspace.
+ *
+ * Test memory conversion scenarios with following steps:
+ * 1) Access private memory using private access and verify that memory contents
+ * are not visible to userspace.
+ * 2) Convert memory to shared using explicit conversions and ensure that
+ * userspace is able to access the shared regions.
+ * 3) Convert memory back to private using explicit conversions and ensure that
+ * userspace is again not able to access converted private regions.
+ */
+
+#define GUEST_STAGE(o, s) { .offset = o, .size = s }
+
+enum ucall_syncs {
+ SYNC_SHARED,
+ SYNC_PRIVATE,
+};
+
+static void guest_sync_shared(uint64_t gpa, uint64_t size,
+ uint8_t current_pattern, uint8_t new_pattern)
+{
+ GUEST_SYNC5(SYNC_SHARED, gpa, size, current_pattern, new_pattern);
+}
+
+static void guest_sync_private(uint64_t gpa, uint64_t size, uint8_t pattern)
+{
+ GUEST_SYNC4(SYNC_PRIVATE, gpa, size, pattern);
+}
+
+/* Arbitrary values, KVM doesn't care about the attribute flags. */
+#define MAP_GPA_SET_ATTRIBUTES BIT(0)
+#define MAP_GPA_SHARED BIT(1)
+#define MAP_GPA_DO_FALLOCATE BIT(2)
+
+static void guest_map_mem(uint64_t gpa, uint64_t size, bool map_shared,
+ bool do_fallocate)
+{
+ uint64_t flags = MAP_GPA_SET_ATTRIBUTES;
+
+ if (map_shared)
+ flags |= MAP_GPA_SHARED;
+ if (do_fallocate)
+ flags |= MAP_GPA_DO_FALLOCATE;
+ kvm_hypercall_map_gpa_range(gpa, size, flags);
+}
+
+static void guest_map_shared(uint64_t gpa, uint64_t size, bool do_fallocate)
+{
+ guest_map_mem(gpa, size, true, do_fallocate);
+}
+
+static void guest_map_private(uint64_t gpa, uint64_t size, bool do_fallocate)
+{
+ guest_map_mem(gpa, size, false, do_fallocate);
+}
+
+struct {
+ uint64_t offset;
+ uint64_t size;
+} static const test_ranges[] = {
+ GUEST_STAGE(0, PAGE_SIZE),
+ GUEST_STAGE(0, SZ_2M),
+ GUEST_STAGE(PAGE_SIZE, PAGE_SIZE),
+ GUEST_STAGE(PAGE_SIZE, SZ_2M),
+ GUEST_STAGE(SZ_2M, PAGE_SIZE),
+};
+
+static void guest_test_explicit_conversion(uint64_t base_gpa, bool do_fallocate)
+{
+ const uint8_t def_p = 0xaa;
+ const uint8_t init_p = 0xcc;
+ uint64_t j;
+ int i;
+
+ /* Memory should be shared by default. */
+ memset((void *)base_gpa, def_p, PER_CPU_DATA_SIZE);
+ memcmp_g(base_gpa, def_p, PER_CPU_DATA_SIZE);
+ guest_sync_shared(base_gpa, PER_CPU_DATA_SIZE, def_p, init_p);
+
+ memcmp_g(base_gpa, init_p, PER_CPU_DATA_SIZE);
+
+ for (i = 0; i < ARRAY_SIZE(test_ranges); i++) {
+ uint64_t gpa = base_gpa + test_ranges[i].offset;
+ uint64_t size = test_ranges[i].size;
+ uint8_t p1 = 0x11;
+ uint8_t p2 = 0x22;
+ uint8_t p3 = 0x33;
+ uint8_t p4 = 0x44;
+
+ /*
+ * Set the test region to pattern one to differentiate it from
+ * the data range as a whole (contains the initial pattern).
+ */
+ memset((void *)gpa, p1, size);
+
+ /*
+ * Convert to private, set and verify the private data, and
+ * then verify that the rest of the data (map shared) still
+ * holds the initial pattern, and that the host always sees the
+ * shared memory (initial pattern). Unlike shared memory,
+ * punching a hole in private memory is destructive, i.e.
+ * previous values aren't guaranteed to be preserved.
+ */
+ guest_map_private(gpa, size, do_fallocate);
+
+ if (size > PAGE_SIZE) {
+ memset((void *)gpa, p2, PAGE_SIZE);
+ goto skip;
+ }
+
+ memset((void *)gpa, p2, size);
+ guest_sync_private(gpa, size, p1);
+
+ /*
+ * Verify that the private memory was set to pattern two, and
+ * that shared memory still holds the initial pattern.
+ */
+ memcmp_g(gpa, p2, size);
+ if (gpa > base_gpa)
+ memcmp_g(base_gpa, init_p, gpa - base_gpa);
+ if (gpa + size < base_gpa + PER_CPU_DATA_SIZE)
+ memcmp_g(gpa + size, init_p,
+ (base_gpa + PER_CPU_DATA_SIZE) - (gpa + size));
+
+ /*
+ * Convert odd-number page frames back to shared to verify KVM
+ * also correctly handles holes in private ranges.
+ */
+ for (j = 0; j < size; j += PAGE_SIZE) {
+ if ((j >> PAGE_SHIFT) & 1) {
+ guest_map_shared(gpa + j, PAGE_SIZE, do_fallocate);
+ guest_sync_shared(gpa + j, PAGE_SIZE, p1, p3);
+
+ memcmp_g(gpa + j, p3, PAGE_SIZE);
+ } else {
+ guest_sync_private(gpa + j, PAGE_SIZE, p1);
+ }
+ }
+
+skip:
+ /*
+ * Convert the entire region back to shared, explicitly write
+ * pattern three to fill in the even-number frames before
+ * asking the host to verify (and write pattern four).
+ */
+ guest_map_shared(gpa, size, do_fallocate);
+ memset((void *)gpa, p3, size);
+ guest_sync_shared(gpa, size, p3, p4);
+ memcmp_g(gpa, p4, size);
+
+ /* Reset the shared memory back to the initial pattern. */
+ memset((void *)gpa, init_p, size);
+
+ /*
+ * Free (via PUNCH_HOLE) *all* private memory so that the next
+ * iteration starts from a clean slate, e.g. with respect to
+ * whether or not there are pages/folios in guest_mem.
+ */
+ guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true);
+ }
+}
+
+static void guest_punch_hole(uint64_t gpa, uint64_t size)
+{
+ /* "Mapping" memory shared via fallocate() is done via PUNCH_HOLE. */
+ uint64_t flags = MAP_GPA_SHARED | MAP_GPA_DO_FALLOCATE;
+
+ kvm_hypercall_map_gpa_range(gpa, size, flags);
+}
+
+/*
+ * Test that PUNCH_HOLE actually frees memory by punching holes without doing a
+ * proper conversion. Freeing (PUNCH_HOLE) should zap SPTEs, and reallocating
+ * (subsequent fault) should zero memory.
+ */
+static void guest_test_punch_hole(uint64_t base_gpa, bool precise)
+{
+ const uint8_t init_p = 0xcc;
+ int i;
+
+ /*
+ * Convert the entire range to private, this testcase is all about
+ * punching holes in guest_memfd, i.e. shared mappings aren't needed.
+ */
+ guest_map_private(base_gpa, PER_CPU_DATA_SIZE, false);
+
+ for (i = 0; i < ARRAY_SIZE(test_ranges); i++) {
+ uint64_t gpa = base_gpa + test_ranges[i].offset;
+ uint64_t size = test_ranges[i].size;
+
+ /*
+ * Free all memory before each iteration, even for the !precise
+ * case where the memory will be faulted back in. Freeing and
+ * reallocating should obviously work, and freeing all memory
+ * minimizes the probability of cross-testcase influence.
+ */
+ guest_punch_hole(base_gpa, PER_CPU_DATA_SIZE);
+
+ /* Fault-in and initialize memory, and verify the pattern. */
+ if (precise) {
+ memset((void *)gpa, init_p, size);
+ memcmp_g(gpa, init_p, size);
+ } else {
+ memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE);
+ memcmp_g(base_gpa, init_p, PER_CPU_DATA_SIZE);
+ }
+
+ /*
+ * Punch a hole at the target range and verify that reads from
+ * the guest succeed and return zeroes.
+ */
+ guest_punch_hole(gpa, size);
+ memcmp_g(gpa, 0, size);
+ }
+}
+
+static void guest_code(uint64_t base_gpa)
+{
+ /*
+ * Run the conversion test twice, with and without doing fallocate() on
+ * the guest_memfd backing when converting between shared and private.
+ */
+ guest_test_explicit_conversion(base_gpa, false);
+ guest_test_explicit_conversion(base_gpa, true);
+
+ /*
+ * Run the PUNCH_HOLE test twice too, once with the entire guest_memfd
+ * faulted in, once with only the target range faulted in.
+ */
+ guest_test_punch_hole(base_gpa, false);
+ guest_test_punch_hole(base_gpa, true);
+ GUEST_DONE();
+}
+
+static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
+{
+ struct kvm_run *run = vcpu->run;
+ uint64_t gpa = run->hypercall.args[0];
+ uint64_t size = run->hypercall.args[1] * PAGE_SIZE;
+ bool set_attributes = run->hypercall.args[2] & MAP_GPA_SET_ATTRIBUTES;
+ bool map_shared = run->hypercall.args[2] & MAP_GPA_SHARED;
+ bool do_fallocate = run->hypercall.args[2] & MAP_GPA_DO_FALLOCATE;
+ struct kvm_vm *vm = vcpu->vm;
+
+ TEST_ASSERT(run->hypercall.nr == KVM_HC_MAP_GPA_RANGE,
+ "Wanted MAP_GPA_RANGE (%u), got '%llu'",
+ KVM_HC_MAP_GPA_RANGE, run->hypercall.nr);
+
+ if (do_fallocate)
+ vm_guest_mem_fallocate(vm, gpa, size, map_shared);
+
+ if (set_attributes)
+ vm_set_memory_attributes(vm, gpa, size,
+ map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
+ run->hypercall.ret = 0;
+}
+
+static bool run_vcpus;
+
+static void *__test_mem_conversions(void *__vcpu)
+{
+ struct kvm_vcpu *vcpu = __vcpu;
+ struct kvm_run *run = vcpu->run;
+ struct kvm_vm *vm = vcpu->vm;
+ struct ucall uc;
+
+ while (!READ_ONCE(run_vcpus))
+ ;
+
+ for ( ;; ) {
+ vcpu_run(vcpu);
+
+ if (run->exit_reason == KVM_EXIT_HYPERCALL) {
+ handle_exit_hypercall(vcpu);
+ continue;
+ }
+
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+ run->exit_reason, exit_reason_str(run->exit_reason));
+
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_ABORT:
+ REPORT_GUEST_ASSERT(uc);
+ case UCALL_SYNC: {
+ uint64_t gpa = uc.args[1];
+ size_t size = uc.args[2];
+ size_t i;
+
+ TEST_ASSERT(uc.args[0] == SYNC_SHARED ||
+ uc.args[0] == SYNC_PRIVATE,
+ "Unknown sync command '%ld'", uc.args[0]);
+
+ for (i = 0; i < size; i += vm->page_size) {
+ size_t nr_bytes = min_t(size_t, vm->page_size, size - i);
+ uint8_t *hva = addr_gpa2hva(vm, gpa + i);
+
+ /* In all cases, the host should observe the shared data. */
+ memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
+
+ /* For shared, write the new pattern to guest memory. */
+ if (uc.args[0] == SYNC_SHARED)
+ memset(hva, uc.args[4], nr_bytes);
+ }
+ break;
+ }
+ case UCALL_DONE:
+ return NULL;
+ default:
+ TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+ }
+ }
+}
+
+static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus,
+ uint32_t nr_memslots)
+{
+ /*
+ * Allocate enough memory so that each vCPU's chunk of memory can be
+ * naturally aligned with respect to the size of the backing store.
+ */
+ const size_t alignment = max_t(size_t, SZ_2M, get_backing_src_pagesz(src_type));
+ const size_t per_cpu_size = align_up(PER_CPU_DATA_SIZE, alignment);
+ const size_t memfd_size = per_cpu_size * nr_vcpus;
+ const size_t slot_size = memfd_size / nr_memslots;
+ struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+ pthread_t threads[KVM_MAX_VCPUS];
+ uint64_t memfd_flags;
+ struct kvm_vm *vm;
+ int memfd, i, r;
+
+ const struct vm_shape shape = {
+ .mode = VM_MODE_DEFAULT,
+ .type = KVM_X86_SW_PROTECTED_VM,
+ };
+
+ TEST_ASSERT(slot_size * nr_memslots == memfd_size,
+ "The memfd size (0x%lx) needs to be cleanly divisible by the number of memslots (%u)",
+ memfd_size, nr_memslots);
+ vm = __vm_create_with_vcpus(shape, nr_vcpus, 0, guest_code, vcpus);
+
+ vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
+
+ if (backing_src_can_be_huge(src_type))
+ memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+ else
+ memfd_flags = 0;
+ memfd = vm_create_guest_memfd(vm, memfd_size, memfd_flags);
+
+ for (i = 0; i < nr_memslots; i++)
+ vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
+ BASE_DATA_SLOT + i, slot_size / vm->page_size,
+ KVM_MEM_PRIVATE, memfd, slot_size * i);
+
+ for (i = 0; i < nr_vcpus; i++) {
+ uint64_t gpa = BASE_DATA_GPA + i * per_cpu_size;
+
+ vcpu_args_set(vcpus[i], 1, gpa);
+
+ /*
+ * Map only what is needed so that an out-of-bounds access
+	 * results in a #PF => SHUTDOWN instead of data corruption.
+ */
+ virt_map(vm, gpa, gpa, PER_CPU_DATA_SIZE / vm->page_size);
+
+ pthread_create(&threads[i], NULL, __test_mem_conversions, vcpus[i]);
+ }
+
+ WRITE_ONCE(run_vcpus, true);
+
+ for (i = 0; i < nr_vcpus; i++)
+ pthread_join(threads[i], NULL);
+
+ kvm_vm_free(vm);
+
+ /*
+ * Allocate and free memory from the guest_memfd after closing the VM
+ * fd. The guest_memfd is gifted a reference to its owning VM, i.e.
+ * should prevent the VM from being fully destroyed until the last
+ * reference to the guest_memfd is also put.
+ */
+ r = fallocate(memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, memfd_size);
+ TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r));
+
+ r = fallocate(memfd, FALLOC_FL_KEEP_SIZE, 0, memfd_size);
+ TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r));
+}
+
+static void usage(const char *cmd)
+{
+ puts("");
+ printf("usage: %s [-h] [-m nr_memslots] [-s mem_type] [-n nr_vcpus]\n", cmd);
+ puts("");
+ backing_src_help("-s");
+ puts("");
+ puts(" -n: specify the number of vcpus (default: 1)");
+ puts("");
+ puts(" -m: specify the number of memslots (default: 1)");
+ puts("");
+}
+
+int main(int argc, char *argv[])
+{
+ enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
+ uint32_t nr_memslots = 1;
+ uint32_t nr_vcpus = 1;
+ int opt;
+
+ TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+
+ while ((opt = getopt(argc, argv, "hm:s:n:")) != -1) {
+ switch (opt) {
+ case 's':
+ src_type = parse_backing_src_type(optarg);
+ break;
+ case 'n':
+ nr_vcpus = atoi_positive("nr_vcpus", optarg);
+ break;
+ case 'm':
+ nr_memslots = atoi_positive("nr_memslots", optarg);
+ break;
+ case 'h':
+ default:
+ usage(argv[0]);
+ exit(0);
+ }
+ }
+
+ test_mem_conversions(src_type, nr_vcpus, nr_memslots);
+
+ return 0;
+}
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:29:46

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs. Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).
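
For illustration, the resulting iteration pattern is roughly the
following sketch (not a literal hunk from this patch; the actual memslot
walk is elided):

	int i;

	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
		struct kvm_memslots *slots = __kvm_memslots(kvm, i);

		/* Walk the memslots of address space 'i'... */
	}

Note, KVM_MAX_NR_ADDRESS_SPACES still sizes the per-VM arrays; only the
runtime iteration bound becomes per-VM.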

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/powerpc/kvm/book3s_hv.c | 2 +-
arch/x86/include/asm/kvm_host.h | 8 +++++++-
arch/x86/kvm/debugfs.c | 2 +-
arch/x86/kvm/mmu/mmu.c | 6 +++---
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 17 +++++++++++------
virt/kvm/dirty_ring.c | 2 +-
virt/kvm/kvm_main.c | 26 ++++++++++++++------------
8 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..9b0eaa17275a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
}

srcu_idx = srcu_read_lock(&kvm->srcu);
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
struct kvm_memory_slot *memslot;
struct kvm_memslots *slots = __kvm_memslots(kvm, i);
int bkt;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6702f795c862..f9e8d5642069 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2124,9 +2124,15 @@ enum {
#define HF_SMM_MASK (1 << 1)
#define HF_SMM_INSIDE_NMI_MASK (1 << 2)

-# define KVM_ADDRESS_SPACE_NUM 2
+# define KVM_MAX_NR_ADDRESS_SPACES 2
# define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
# define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
+
+static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
+{
+ return KVM_MAX_NR_ADDRESS_SPACES;
+}
+
#else
# define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
#endif
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index ee8c4c3496ed..42026b3f3ff3 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
mutex_lock(&kvm->slots_lock);
write_lock(&kvm->mmu_lock);

- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
int bkt;

slots = __kvm_memslots(kvm, i);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c4e758f0aebb..baeba8fc1c38 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3755,7 +3755,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
kvm_page_track_write_tracking_enabled(kvm))
goto out_success;

- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
kvm_for_each_memslot(slot, bkt, slots) {
/*
@@ -6294,7 +6294,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
if (!kvm_memslots_have_rmaps(kvm))
return flush;

- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);

kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
@@ -6791,7 +6791,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
* modifier prior to checking for a wrap of the MMIO generation so
* that a wrap in any address space is detected.
*/
- gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
+ gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);

/*
* The very rare case: if the MMIO generation number has wrapped,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 824b58b44382..c4d17727b199 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12456,7 +12456,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
hva = slot->userspace_addr;
}

- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
struct kvm_userspace_memory_region2 m;

m.slot = id | (i << 16);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c3cfe08b1300..687589ce9f63 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -80,8 +80,8 @@
/* Two fragments for cross MMIO pages. */
#define KVM_MAX_MMIO_FRAGMENTS 2

-#ifndef KVM_ADDRESS_SPACE_NUM
-#define KVM_ADDRESS_SPACE_NUM 1
+#ifndef KVM_MAX_NR_ADDRESS_SPACES
+#define KVM_MAX_NR_ADDRESS_SPACES 1
#endif

/*
@@ -692,7 +692,12 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
#define KVM_MEM_SLOTS_NUM SHRT_MAX
#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)

-#if KVM_ADDRESS_SPACE_NUM == 1
+#if KVM_MAX_NR_ADDRESS_SPACES == 1
+static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
+{
+ return KVM_MAX_NR_ADDRESS_SPACES;
+}
+
static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
{
return 0;
@@ -747,9 +752,9 @@ struct kvm {
struct mm_struct *mm; /* userspace tied to this vm */
unsigned long nr_memslot_pages;
/* The two memslot sets - active and inactive (per address space) */
- struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
+ struct kvm_memslots __memslots[KVM_MAX_NR_ADDRESS_SPACES][2];
/* The current active memslot set for each address space */
- struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
+ struct kvm_memslots __rcu *memslots[KVM_MAX_NR_ADDRESS_SPACES];
struct xarray vcpu_array;
/*
* Protected by slots_lock, but can be read outside if an
@@ -1018,7 +1023,7 @@ void kvm_put_kvm_no_destroy(struct kvm *kvm);

static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
{
- as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
+ as_id = array_index_nospec(as_id, KVM_MAX_NR_ADDRESS_SPACES);
return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
lockdep_is_held(&kvm->slots_lock) ||
!refcount_read(&kvm->users_count));
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index c1cd7dfe4a90..86d267db87bb 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -58,7 +58,7 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
as_id = slot >> 16;
id = (u16)slot;

- if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+ if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
return;

memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5d1a2f1b4e94..23633984142f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -615,7 +615,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,

idx = srcu_read_lock(&kvm->srcu);

- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
struct interval_tree_node *node;

slots = __kvm_memslots(kvm, i);
@@ -1248,7 +1248,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
goto out_err_no_irq_srcu;

refcount_set(&kvm->users_count, 1);
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
for (j = 0; j < 2; j++) {
slots = &kvm->__memslots[i][j];

@@ -1398,7 +1398,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
#endif
kvm_arch_destroy_vm(kvm);
kvm_destroy_devices(kvm);
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
}
@@ -1681,7 +1681,7 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
* space 0 will use generations 0, 2, 4, ... while address space 1 will
* use generations 1, 3, 5, ...
*/
- gen += KVM_ADDRESS_SPACE_NUM;
+ gen += kvm_arch_nr_memslot_as_ids(kvm);

kvm_arch_memslots_updated(kvm, gen);

@@ -2051,7 +2051,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
(mem->guest_memfd_offset & (PAGE_SIZE - 1) ||
mem->guest_memfd_offset + mem->memory_size < mem->guest_memfd_offset))
return -EINVAL;
- if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
+ if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
return -EINVAL;
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
return -EINVAL;
@@ -2187,7 +2187,7 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,

as_id = log->slot >> 16;
id = (u16)log->slot;
- if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+ if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
return -EINVAL;

slots = __kvm_memslots(kvm, as_id);
@@ -2249,7 +2249,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)

as_id = log->slot >> 16;
id = (u16)log->slot;
- if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+ if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
return -EINVAL;

slots = __kvm_memslots(kvm, as_id);
@@ -2361,7 +2361,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,

as_id = log->slot >> 16;
id = (u16)log->slot;
- if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+ if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
return -EINVAL;

if (log->first_page & 63)
@@ -2502,7 +2502,7 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
gfn_range.only_private = false;
gfn_range.only_shared = false;

- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);

kvm_for_each_memslot_in_gfn_range(&iter, slots, range->start, range->end) {
@@ -4857,9 +4857,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
case KVM_CAP_IRQ_ROUTING:
return KVM_MAX_IRQ_ROUTES;
#endif
-#if KVM_ADDRESS_SPACE_NUM > 1
+#if KVM_MAX_NR_ADDRESS_SPACES > 1
case KVM_CAP_MULTI_ADDRESS_SPACE:
- return KVM_ADDRESS_SPACE_NUM;
+ if (kvm)
+ return kvm_arch_nr_memslot_as_ids(kvm);
+ return KVM_MAX_NR_ADDRESS_SPACES;
#endif
case KVM_CAP_NR_MEMSLOTS:
return KVM_USER_MEM_SLOTS;
@@ -4967,7 +4969,7 @@ bool kvm_are_all_memslots_empty(struct kvm *kvm)

lockdep_assert_held(&kvm->slots_lock);

- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
if (!kvm_memslots_empty(__kvm_memslots(kvm, i)))
return false;
}
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:32:28

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v13 34/35] KVM: selftests: Add basic selftest for guest_memfd()

From: Chao Peng <[email protected]>

Add a selftest to verify the basic functionality of guest_memfd() (see
the sketch after the list):

+ file descriptor created with the guest_memfd() ioctl does not allow
read/write/mmap operations
+ file size and block size as returned from fstat are as expected
+ fallocate(FALLOC_FL_PUNCH_HOLE) on the fd fails unless the offset and
length are page aligned
+ invalid inputs (misaligned size, invalid flags) are rejected
+ file size and inode are unique (the innocuous-sounding
anon_inode_getfile() backs all files with a single inode...)
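
Roughly, the core expectations reduce to the sketch below, where 'fd'
comes from KVM_CREATE_GUEST_MEMFD and 'page_size' from getpagesize()
(error handling elided):

	char buf[64];

	/* No read/write/mmap access from host userspace... */
	TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0, "read should fail");
	TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0, "write should fail");
	TEST_ASSERT(mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0) == MAP_FAILED, "mmap should fail");

	/* ...but page-aligned fallocate()/PUNCH_HOLE is allowed. */
	TEST_ASSERT(!fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
			       0, page_size), "aligned PUNCH_HOLE should work");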

Signed-off-by: Chao Peng <[email protected]>
Co-developed-by: Ackerley Tng <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
Co-developed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 221 ++++++++++++++++++
2 files changed, 222 insertions(+)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index b709a52d5cdb..2b1ef809d73a 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -124,6 +124,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
TEST_GEN_PROGS_x86_64 += demand_paging_test
TEST_GEN_PROGS_x86_64 += dirty_log_test
TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_test
TEST_GEN_PROGS_x86_64 += guest_print_test
TEST_GEN_PROGS_x86_64 += hardware_disable_test
TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
new file mode 100644
index 000000000000..c15de9852316
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Intel Corporation, 2023
+ *
+ * Author: Chao Peng <[email protected]>
+ */
+
+#define _GNU_SOURCE
+#include "test_util.h"
+#include "kvm_util_base.h"
+#include <linux/bitmap.h>
+#include <linux/falloc.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+#include <fcntl.h>
+
+static void test_file_read_write(int fd)
+{
+ char buf[64];
+
+ TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0,
+ "read on a guest_mem fd should fail");
+ TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0,
+ "write on a guest_mem fd should fail");
+ TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0,
+ "pread on a guest_mem fd should fail");
+ TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0,
+ "pwrite on a guest_mem fd should fail");
+}
+
+static void test_mmap(int fd, size_t page_size)
+{
+ char *mem;
+
+ mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_file_size(int fd, size_t page_size, size_t total_size)
+{
+ struct stat sb;
+ int ret;
+
+ ret = fstat(fd, &sb);
+ TEST_ASSERT(!ret, "fstat should succeed");
+ TEST_ASSERT_EQ(sb.st_size, total_size);
+ TEST_ASSERT_EQ(sb.st_blksize, page_size);
+}
+
+static void test_fallocate(int fd, size_t page_size, size_t total_size)
+{
+ int ret;
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
+ TEST_ASSERT(!ret, "fallocate with aligned offset and size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ page_size - 1, page_size);
+ TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
+ TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size);
+ TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ total_size, page_size);
+ TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ total_size + page_size, page_size);
+ TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ page_size, page_size - 1);
+ TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ page_size, page_size);
+ TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size);
+ TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
+}
+
+static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
+{
+ struct {
+ off_t offset;
+ off_t len;
+ } testcases[] = {
+ {0, 1},
+ {0, page_size - 1},
+ {0, page_size + 1},
+
+ {1, 1},
+ {1, page_size - 1},
+ {1, page_size},
+ {1, page_size + 1},
+
+ {page_size, 1},
+ {page_size, page_size - 1},
+ {page_size, page_size + 1},
+ };
+ int ret, i;
+
+ for (i = 0; i < ARRAY_SIZE(testcases); i++) {
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ testcases[i].offset, testcases[i].len);
+ TEST_ASSERT(ret == -1 && errno == EINVAL,
+ "PUNCH_HOLE with !PAGE_SIZE offset (%lx) and/or length (%lx) should fail",
+ testcases[i].offset, testcases[i].len);
+ }
+}
+
+static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
+{
+ uint64_t valid_flags = 0;
+ size_t page_size = getpagesize();
+ uint64_t flag;
+ size_t size;
+ int fd;
+
+ for (size = 1; size < page_size; size++) {
+ fd = __vm_create_guest_memfd(vm, size, 0);
+ TEST_ASSERT(fd == -1 && errno == EINVAL,
+ "guest_memfd() with non-page-aligned page size '0x%lx' should fail with EINVAL",
+ size);
+ }
+
+ if (thp_configured()) {
+ for (size = page_size * 2; size < get_trans_hugepagesz(); size += page_size) {
+ fd = __vm_create_guest_memfd(vm, size, KVM_GUEST_MEMFD_ALLOW_HUGEPAGE);
+ TEST_ASSERT(fd == -1 && errno == EINVAL,
+ "guest_memfd() with non-hugepage-aligned page size '0x%lx' should fail with EINVAL",
+ size);
+ }
+
+ valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+ }
+
+ for (flag = 1; flag; flag <<= 1) {
+ uint64_t bit;
+
+ if (flag & valid_flags)
+ continue;
+
+ fd = __vm_create_guest_memfd(vm, page_size, flag);
+ TEST_ASSERT(fd == -1 && errno == EINVAL,
+ "guest_memfd() with flag '0x%lx' should fail with EINVAL",
+ flag);
+
+ for_each_set_bit(bit, &valid_flags, 64) {
+ fd = __vm_create_guest_memfd(vm, page_size, flag | BIT_ULL(bit));
+ TEST_ASSERT(fd == -1 && errno == EINVAL,
+ "guest_memfd() with flags '0x%llx' should fail with EINVAL",
+ flag | BIT_ULL(bit));
+ }
+ }
+}
+
+static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
+{
+ int fd1, fd2, ret;
+ struct stat st1, st2;
+
+ fd1 = __vm_create_guest_memfd(vm, 4096, 0);
+ TEST_ASSERT(fd1 != -1, "memfd creation should succeed");
+
+ ret = fstat(fd1, &st1);
+ TEST_ASSERT(ret != -1, "memfd fstat should succeed");
+ TEST_ASSERT(st1.st_size == 4096, "memfd st_size should match requested size");
+
+ fd2 = __vm_create_guest_memfd(vm, 8192, 0);
+ TEST_ASSERT(fd2 != -1, "memfd creation should succeed");
+
+ ret = fstat(fd2, &st2);
+ TEST_ASSERT(ret != -1, "memfd fstat should succeed");
+ TEST_ASSERT(st2.st_size == 8192, "second memfd st_size should match requested size");
+
+ ret = fstat(fd1, &st1);
+ TEST_ASSERT(ret != -1, "memfd fstat should succeed");
+ TEST_ASSERT(st1.st_size == 4096, "first memfd st_size should still match requested size");
+ TEST_ASSERT(st1.st_ino != st2.st_ino, "different memfd should have different inode numbers");
+}
+
+int main(int argc, char *argv[])
+{
+ size_t page_size;
+ size_t total_size;
+ int fd;
+ struct kvm_vm *vm;
+
+ TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+
+ page_size = getpagesize();
+ total_size = page_size * 4;
+
+ vm = vm_create_barebones();
+
+ test_create_guest_memfd_invalid(vm);
+ test_create_guest_memfd_multiple(vm);
+
+ fd = vm_create_guest_memfd(vm, total_size, 0);
+
+ test_file_read_write(fd);
+ test_mmap(fd, page_size);
+ test_file_size(fd, page_size, total_size);
+ test_fallocate(fd, page_size, total_size);
+ test_invalid_punch_hole(fd, page_size, total_size);
+
+ close(fd);
+}
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:32:30

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro

Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.

No functional change intended.

Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
include/linux/kvm_host.h | 2 +-
2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d60e4745e8b..6702f795c862 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2124,7 +2124,6 @@ enum {
#define HF_SMM_MASK (1 << 1)
#define HF_SMM_INSIDE_NMI_MASK (1 << 2)

-# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
# define KVM_ADDRESS_SPACE_NUM 2
# define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
# define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e3223cafd7db..c3cfe08b1300 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -692,7 +692,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
#define KVM_MEM_SLOTS_NUM SHRT_MAX
#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)

-#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
+#if KVM_ADDRESS_SPACE_NUM == 1
static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
{
return 0;
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:32:51

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v13 33/35] KVM: selftests: Expand set_memory_region_test to validate guest_memfd()

From: Chao Peng <[email protected]>

Expand set_memory_region_test to exercise various positive and negative
testcases for private memory (see the binding sketch after the list).

- Non-guest_memfd() file descriptor for private memory
- guest_memfd() from different VM
- Overlapping bindings
- Unaligned bindings
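
The positive path being exercised is binding a guest_memfd range to a
KVM_MEM_PRIVATE memslot, roughly as below (selftest helpers; the args
are vm, slot, flags, gpa, size, hva, guest_memfd, offset):

	memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0);

	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
				   MEM_REGION_GPA, MEM_REGION_SIZE,
				   0, memfd, 0);

The negative testcases then substitute bogus fds, a different VM's
guest_memfd, overlapping bindings, and unaligned offsets, and assert
EINVAL or EEXIST as appropriate.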

Signed-off-by: Chao Peng <[email protected]>
Co-developed-by: Ackerley Tng <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
[sean: trim the testcases to remove duplicate coverage]
Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 10 ++
.../selftests/kvm/set_memory_region_test.c | 100 ++++++++++++++++++
2 files changed, 110 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 8ec122f5fcc8..e4d2cd9218b2 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -819,6 +819,16 @@ static inline struct kvm_vm *vm_create_barebones(void)
return ____vm_create(VM_SHAPE_DEFAULT);
}

+static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
+{
+ const struct vm_shape shape = {
+ .mode = VM_MODE_DEFAULT,
+ .type = KVM_X86_SW_PROTECTED_VM,
+ };
+
+ return ____vm_create(shape);
+}
+
static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
{
return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index b32960189f5f..ca83e3307a98 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -385,6 +385,98 @@ static void test_add_max_memory_regions(void)
kvm_vm_free(vm);
}

+
+static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd,
+ size_t offset, const char *msg)
+{
+ int r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+ MEM_REGION_GPA, MEM_REGION_SIZE,
+ 0, memfd, offset);
+ TEST_ASSERT(r == -1 && errno == EINVAL, "%s", msg);
+}
+
+static void test_add_private_memory_region(void)
+{
+ struct kvm_vm *vm, *vm2;
+ int memfd, i;
+
+ pr_info("Testing ADD of KVM_MEM_PRIVATE memory regions\n");
+
+ vm = vm_create_barebones_protected_vm();
+
+ test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail");
+ test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail");
+
+ memfd = kvm_memfd_alloc(MEM_REGION_SIZE, false);
+ test_invalid_guest_memfd(vm, memfd, 0, "Regular memfd() should fail");
+ close(memfd);
+
+ vm2 = vm_create_barebones_protected_vm();
+ memfd = vm_create_guest_memfd(vm2, MEM_REGION_SIZE, 0);
+ test_invalid_guest_memfd(vm, memfd, 0, "Other VM's guest_memfd() should fail");
+
+ vm_set_user_memory_region2(vm2, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+ MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+ close(memfd);
+ kvm_vm_free(vm2);
+
+ memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0);
+ for (i = 1; i < PAGE_SIZE; i++)
+ test_invalid_guest_memfd(vm, memfd, i, "Unaligned offset should fail");
+
+ vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+ MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+ close(memfd);
+
+ kvm_vm_free(vm);
+}
+
+static void test_add_overlapping_private_memory_regions(void)
+{
+ struct kvm_vm *vm;
+ int memfd;
+ int r;
+
+ pr_info("Testing ADD of overlapping KVM_MEM_PRIVATE memory regions\n");
+
+ vm = vm_create_barebones_protected_vm();
+
+ memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE * 4, 0);
+
+ vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+ MEM_REGION_GPA, MEM_REGION_SIZE * 2, 0, memfd, 0);
+
+ vm_set_user_memory_region2(vm, MEM_REGION_SLOT + 1, KVM_MEM_PRIVATE,
+ MEM_REGION_GPA * 2, MEM_REGION_SIZE * 2,
+ 0, memfd, MEM_REGION_SIZE * 2);
+
+ /*
+ * Delete the first memslot, and then attempt to recreate it except
+ * with a "bad" offset that results in overlap in the guest_memfd().
+ */
+ vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+ MEM_REGION_GPA, 0, NULL, -1, 0);
+
+ /* Overlap the front half of the other slot. */
+ r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+ MEM_REGION_GPA * 2 - MEM_REGION_SIZE,
+ MEM_REGION_SIZE * 2,
+ 0, memfd, 0);
+ TEST_ASSERT(r == -1 && errno == EEXIST, "%s",
+ "Overlapping guest_memfd() bindings should fail with EEXIST");
+
+ /* And now the back half of the other slot. */
+ r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+ MEM_REGION_GPA * 2 + MEM_REGION_SIZE,
+ MEM_REGION_SIZE * 2,
+ 0, memfd, 0);
+ TEST_ASSERT(r == -1 && errno == EEXIST, "%s",
+ "Overlapping guest_memfd() bindings should fail with EEXIST");
+
+ close(memfd);
+ kvm_vm_free(vm);
+}
+
int main(int argc, char *argv[])
{
#ifdef __x86_64__
@@ -401,6 +493,14 @@ int main(int argc, char *argv[])

test_add_max_memory_regions();

+ if (kvm_has_cap(KVM_CAP_GUEST_MEMFD) &&
+ (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) {
+ test_add_private_memory_region();
+ test_add_overlapping_private_memory_regions();
+ } else {
+ pr_info("Skipping tests for KVM_MEM_PRIVATE memory regions\n");
+ }
+
#ifdef __x86_64__
if (argc > 1)
loops = atoi_positive("Number of iterations", argv[1]);
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:33:17

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v13 28/35] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)

From: Vishal Annapurve <[email protected]>

Add helpers for x86 guests to invoke the KVM_HC_MAP_GPA_RANGE hypercall,
which KVM will forward to userspace and thus can be used by tests to
coordinate private<=>shared conversions between host userspace code and
guest code.
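
From the guest's perspective, usage boils down to the sketch below.
Note, MAP_GPA_SHARED is a test-defined flag (KVM forwards the flags to
userspace verbatim), not a KVM ABI value:

	/* Ask the host to treat [gpa, gpa + size) as shared. */
	kvm_hypercall_map_gpa_range(gpa, size, MAP_GPA_SHARED);

The non-double-underscore variant asserts that the hypercall returns
'0', i.e. that host userspace actually handled the KVM_EXIT_HYPERCALL
and set run->hypercall.ret accordingly.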

Signed-off-by: Vishal Annapurve <[email protected]>
[sean: drop shared/private helpers (let tests specify flags)]
Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/x86_64/processor.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 25bc61dac5fb..a84863503fcb 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -15,6 +15,7 @@
#include <asm/msr-index.h>
#include <asm/prctl.h>

+#include <linux/kvm_para.h>
#include <linux/stringify.h>

#include "../kvm_util.h"
@@ -1194,6 +1195,20 @@ uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2,
uint64_t __xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
void xen_hypercall(uint64_t nr, uint64_t a0, void *a1);

+static inline uint64_t __kvm_hypercall_map_gpa_range(uint64_t gpa,
+ uint64_t size, uint64_t flags)
+{
+ return kvm_hypercall(KVM_HC_MAP_GPA_RANGE, gpa, size >> PAGE_SHIFT, flags, 0);
+}
+
+static inline void kvm_hypercall_map_gpa_range(uint64_t gpa, uint64_t size,
+ uint64_t flags)
+{
+ uint64_t ret = __kvm_hypercall_map_gpa_range(gpa, size, flags);
+
+ GUEST_ASSERT(!ret);
+}
+
void __vm_xsave_require_permission(uint64_t xfeature, const char *name);

#define vm_xsave_require_permission(xfeature) \
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:47:11

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that isn't universally accessible. I.e. relying on SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.
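
As a sketch of the intended userspace flow (raw ioctls, error handling
elided; the type is usable only if enumerated by KVM_CAP_VM_TYPES):

	int kvm_fd = open("/dev/kvm", O_RDWR);
	int types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);
	int vm_fd;

	if (types & (1 << KVM_X86_SW_PROTECTED_VM))
		vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);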

Signed-off-by: Sean Christopherson <[email protected]>
---
Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
arch/x86/include/asm/kvm_host.h | 15 +++++++++------
arch/x86/include/uapi/asm/kvm.h | 3 +++
arch/x86/kvm/Kconfig | 12 ++++++++++++
arch/x86/kvm/mmu/mmu_internal.h | 1 +
arch/x86/kvm/x86.c | 16 +++++++++++++++-
include/uapi/linux/kvm.h | 1 +
virt/kvm/Kconfig | 5 +++++
8 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38dc1fda4f45..00029436ac5b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,29 @@ described as 'basic' will be available.
The new VM has no virtual cpus and no memory.
You probably want to use 0 as machine type.

+X86:
+^^^^
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^^^^^
+
In order to create user controlled virtual machines on S390, check
KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
privileged user (CAP_SYS_ADMIN).

+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
On arm64, the physical address size for a VM (IPA Size limit) is limited
to 40bits by default. The limit can be configured if the host supports the
extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -8650,6 +8669,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
64-bit bitmap (each bit describing a block size). The default value is
0, to disable the eager page splitting.

+8.41 KVM_CAP_VM_TYPES
+---------------------
+
+:Capability: KVM_CAP_VM_TYPES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of supported VM types. The 1-setting of bit @n
+means the VM type with value @n is supported. Possible values of @n are::
+
+ #define KVM_X86_DEFAULT_VM 0
+ #define KVM_X86_SW_PROTECTED_VM 1
+
9. Known KVM API problems
=========================

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f9e8d5642069..dff10051e9b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1244,6 +1244,7 @@ enum kvm_apicv_inhibit {
};

struct kvm_arch {
+ unsigned long vm_type;
unsigned long n_used_mmu_pages;
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
@@ -2077,6 +2078,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
int tdp_max_root_level, int tdp_huge_page_level);

+#ifdef CONFIG_KVM_PRIVATE_MEM
+#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != KVM_X86_DEFAULT_VM)
+#else
+#define kvm_arch_has_private_mem(kvm) false
+#endif
+
static inline u16 kvm_read_ldt(void)
{
u16 ldt;
@@ -2125,14 +2132,10 @@ enum {
#define HF_SMM_INSIDE_NMI_MASK (1 << 2)

# define KVM_MAX_NR_ADDRESS_SPACES 2
+/* SMM is currently unsupported for guests with private memory. */
+# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 2)
# define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
# define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
-
-static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
-{
- return KVM_MAX_NR_ADDRESS_SPACES;
-}
-
#else
# define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
#endif
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1a6a1f987949..a448d0964fc0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
/* x86-specific KVM_EXIT_HYPERCALL flags. */
#define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)

+#define KVM_X86_DEFAULT_VM 0
+#define KVM_X86_SW_PROTECTED_VM 1
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 091b74599c22..8452ed0228cb 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -77,6 +77,18 @@ config KVM_WERROR

If in doubt, say "N".

+config KVM_SW_PROTECTED_VM
+ bool "Enable support for KVM software-protected VMs"
+ depends on EXPERT
+ depends on X86_64
+ select KVM_GENERIC_PRIVATE_MEM
+ help
+ Enable support for KVM software-protected VMs. Currently "protected"
+ means the VM can be backed with memory provided by
+ KVM_CREATE_GUEST_MEMFD.
+
+ If unsure, say "N".
+
config KVM_INTEL
tristate "KVM for Intel (and compatible) processors support"
depends on KVM && IA32_FEAT_CTL
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 86c7cb692786..b66a7d47e0e4 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -297,6 +297,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.max_level = KVM_MAX_HUGEPAGE_LEVEL,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
+ .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
};
int r;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c4d17727b199..e3eb608b6692 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4441,6 +4441,13 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
return 0;
}

+static bool kvm_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_DEFAULT_VM ||
+ (type == KVM_X86_SW_PROTECTED_VM &&
+ IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
+}
+
int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
{
int r = 0;
@@ -4632,6 +4639,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_X86_NOTIFY_VMEXIT:
r = kvm_caps.has_notify_vmexit;
break;
+ case KVM_CAP_VM_TYPES:
+ r = BIT(KVM_X86_DEFAULT_VM);
+ if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
+ r |= BIT(KVM_X86_SW_PROTECTED_VM);
+ break;
default:
break;
}
@@ -12314,9 +12326,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
int ret;
unsigned long flags;

- if (type)
+ if (!kvm_is_vm_type_supported(type))
return -EINVAL;

+ kvm->arch.vm_type = type;
+
ret = kvm_page_track_init(kvm);
if (ret)
goto out;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 29e9eb51dec9..5b5820d19e71 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1218,6 +1218,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_MEMORY_FAULT_INFO 231
#define KVM_CAP_MEMORY_ATTRIBUTES 232
#define KVM_CAP_GUEST_MEMFD 233
+#define KVM_CAP_VM_TYPES 234

#ifdef KVM_CAP_IRQ_ROUTING

diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 08afef022db9..2c964586aa14 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -104,3 +104,8 @@ config KVM_GENERIC_MEMORY_ATTRIBUTES
config KVM_PRIVATE_MEM
select XARRAY_MULTI
bool
+
+config KVM_GENERIC_PRIVATE_MEM
+ select KVM_GENERIC_MEMORY_ATTRIBUTES
+ select KVM_PRIVATE_MEM
+ bool
--
2.42.0.820.g83a721a137-goog

2023-10-27 18:47:11

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v13 29/35] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type

Add a "vm_shape" structure to encapsulate the selftests-defined "mode",
along with the KVM-defined "type" for use when creating a new VM. "mode"
tracks physical and virtual address properties, as well as the preferred
backing memory type, while "type" corresponds to the VM type.

Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD,
a.k.a. guest private memory, without needing an entirely separate set of
helpers. Guest private memory is effectively usable only by confidential
VM types, and it's expected that x86 will double down and require unique
VM types for TDX and SNP guests.
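
A test that wants a non-default type can then do something like the
sketch below (KVM_X86_SW_PROTECTED_VM comes from earlier in the series,
guest_code is whatever the test runs in the guest):

	const struct vm_shape shape = {
		.mode = VM_MODE_DEFAULT,
		.type = KVM_X86_SW_PROTECTED_VM,
	};
	struct kvm_vcpu *vcpu;
	struct kvm_vm *vm;

	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);

while existing callers are mechanically converted to VM_SHAPE(mode) or
VM_SHAPE_DEFAULT.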

Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/dirty_log_test.c | 2 +-
.../selftests/kvm/include/kvm_util_base.h | 54 +++++++++++++++----
.../selftests/kvm/kvm_page_table_test.c | 2 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 43 +++++++--------
tools/testing/selftests/kvm/lib/memstress.c | 3 +-
.../kvm/x86_64/ucna_injection_test.c | 2 +-
6 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 936f3a8d1b83..6cbecf499767 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -699,7 +699,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, struct kvm_vcpu **vcpu,

pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode));

- vm = __vm_create(mode, 1, extra_mem_pages);
+ vm = __vm_create(VM_SHAPE(mode), 1, extra_mem_pages);

log_mode_create_vm_done(vm);
*vcpu = vm_vcpu_add(vm, 0, guest_code);
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 1441fca6c273..157508c071f3 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -188,6 +188,23 @@ enum vm_guest_mode {
NUM_VM_MODES,
};

+struct vm_shape {
+ enum vm_guest_mode mode;
+ unsigned int type;
+};
+
+#define VM_TYPE_DEFAULT 0
+
+#define VM_SHAPE(__mode) \
+({ \
+ struct vm_shape shape = { \
+ .mode = (__mode), \
+ .type = VM_TYPE_DEFAULT \
+ }; \
+ \
+ shape; \
+})
+
#if defined(__aarch64__)

extern enum vm_guest_mode vm_mode_default;
@@ -220,6 +237,8 @@ extern enum vm_guest_mode vm_mode_default;

#endif

+#define VM_SHAPE_DEFAULT VM_SHAPE(VM_MODE_DEFAULT)
+
#define MIN_PAGE_SIZE (1U << MIN_PAGE_SHIFT)
#define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE)

@@ -784,21 +803,21 @@ vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm);
* __vm_create() does NOT create vCPUs, @nr_runnable_vcpus is used purely to
* calculate the amount of memory needed for per-vCPU data, e.g. stacks.
*/
-struct kvm_vm *____vm_create(enum vm_guest_mode mode);
-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *____vm_create(struct vm_shape shape);
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
uint64_t nr_extra_pages);

static inline struct kvm_vm *vm_create_barebones(void)
{
- return ____vm_create(VM_MODE_DEFAULT);
+ return ____vm_create(VM_SHAPE_DEFAULT);
}

static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
{
- return __vm_create(VM_MODE_DEFAULT, nr_runnable_vcpus, 0);
+ return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
}

-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
uint64_t extra_mem_pages,
void *guest_code, struct kvm_vcpu *vcpus[]);

@@ -806,17 +825,27 @@ static inline struct kvm_vm *vm_create_with_vcpus(uint32_t nr_vcpus,
void *guest_code,
struct kvm_vcpu *vcpus[])
{
- return __vm_create_with_vcpus(VM_MODE_DEFAULT, nr_vcpus, 0,
+ return __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus, 0,
guest_code, vcpus);
}

+
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+ struct kvm_vcpu **vcpu,
+ uint64_t extra_mem_pages,
+ void *guest_code);
+
/*
* Create a VM with a single vCPU with reasonable defaults and @extra_mem_pages
* additional pages of guest memory. Returns the VM and vCPU (via out param).
*/
-struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
- uint64_t extra_mem_pages,
- void *guest_code);
+static inline struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
+ uint64_t extra_mem_pages,
+ void *guest_code)
+{
+ return __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, vcpu,
+ extra_mem_pages, guest_code);
+}

static inline struct kvm_vm *vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
void *guest_code)
@@ -824,6 +853,13 @@ static inline struct kvm_vm *vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
return __vm_create_with_one_vcpu(vcpu, 0, guest_code);
}

+static inline struct kvm_vm *vm_create_shape_with_one_vcpu(struct vm_shape shape,
+ struct kvm_vcpu **vcpu,
+ void *guest_code)
+{
+ return __vm_create_shape_with_one_vcpu(shape, vcpu, 0, guest_code);
+}
+
struct kvm_vcpu *vm_recreate_with_one_vcpu(struct kvm_vm *vm);

void kvm_pin_this_task_to_pcpu(uint32_t pcpu);
diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c b/tools/testing/selftests/kvm/kvm_page_table_test.c
index 69f26d80c821..e37dc9c21888 100644
--- a/tools/testing/selftests/kvm/kvm_page_table_test.c
+++ b/tools/testing/selftests/kvm/kvm_page_table_test.c
@@ -254,7 +254,7 @@ static struct kvm_vm *pre_init_before_test(enum vm_guest_mode mode, void *arg)

/* Create a VM with enough guest pages */
guest_num_pages = test_mem_size / guest_page_size;
- vm = __vm_create_with_vcpus(mode, nr_vcpus, guest_num_pages,
+ vm = __vm_create_with_vcpus(VM_SHAPE(mode), nr_vcpus, guest_num_pages,
guest_code, test_args.vcpus);

/* Align down GPA of the testing memslot */
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index a140aee8d0f5..52b131e3aca5 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -209,7 +209,7 @@ __weak void vm_vaddr_populate_bitmap(struct kvm_vm *vm)
(1ULL << (vm->va_bits - 1)) >> vm->page_shift);
}

-struct kvm_vm *____vm_create(enum vm_guest_mode mode)
+struct kvm_vm *____vm_create(struct vm_shape shape)
{
struct kvm_vm *vm;

@@ -221,13 +221,13 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
vm->regions.hva_tree = RB_ROOT;
hash_init(vm->regions.slot_hash);

- vm->mode = mode;
- vm->type = 0;
+ vm->mode = shape.mode;
+ vm->type = shape.type;

- vm->pa_bits = vm_guest_mode_params[mode].pa_bits;
- vm->va_bits = vm_guest_mode_params[mode].va_bits;
- vm->page_size = vm_guest_mode_params[mode].page_size;
- vm->page_shift = vm_guest_mode_params[mode].page_shift;
+ vm->pa_bits = vm_guest_mode_params[vm->mode].pa_bits;
+ vm->va_bits = vm_guest_mode_params[vm->mode].va_bits;
+ vm->page_size = vm_guest_mode_params[vm->mode].page_size;
+ vm->page_shift = vm_guest_mode_params[vm->mode].page_shift;

/* Setup mode specific traits. */
switch (vm->mode) {
@@ -265,7 +265,7 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
/*
* Ignore KVM support for 5-level paging (vm->va_bits == 57),
* it doesn't take effect unless a CR4.LA57 is set, which it
- * isn't for this VM_MODE.
+ * isn't for this mode (48-bit virtual address space).
*/
TEST_ASSERT(vm->va_bits == 48 || vm->va_bits == 57,
"Linear address width (%d bits) not supported",
@@ -285,10 +285,11 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
vm->pgtable_levels = 5;
break;
default:
- TEST_FAIL("Unknown guest mode, mode: 0x%x", mode);
+ TEST_FAIL("Unknown guest mode: 0x%x", vm->mode);
}

#ifdef __aarch64__
+ TEST_ASSERT(!vm->type, "ARM doesn't support test-provided types");
if (vm->pa_bits != 40)
vm->type = KVM_VM_TYPE_ARM_IPA_SIZE(vm->pa_bits);
#endif
@@ -347,19 +348,19 @@ static uint64_t vm_nr_pages_required(enum vm_guest_mode mode,
return vm_adjust_num_guest_pages(mode, nr_pages);
}

-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
uint64_t nr_extra_pages)
{
- uint64_t nr_pages = vm_nr_pages_required(mode, nr_runnable_vcpus,
+ uint64_t nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
nr_extra_pages);
struct userspace_mem_region *slot0;
struct kvm_vm *vm;
int i;

- pr_debug("%s: mode='%s' pages='%ld'\n", __func__,
- vm_guest_mode_string(mode), nr_pages);
+ pr_debug("%s: mode='%s' type='%d', pages='%ld'\n", __func__,
+ vm_guest_mode_string(shape.mode), shape.type, nr_pages);

- vm = ____vm_create(mode);
+ vm = ____vm_create(shape);

vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0);
for (i = 0; i < NR_MEM_REGIONS; i++)
@@ -400,7 +401,7 @@ struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
* extra_mem_pages is only used to calculate the maximum page table size,
* no real memory allocation for non-slot0 memory in this function.
*/
-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
uint64_t extra_mem_pages,
void *guest_code, struct kvm_vcpu *vcpus[])
{
@@ -409,7 +410,7 @@ struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus

TEST_ASSERT(!nr_vcpus || vcpus, "Must provide vCPU array");

- vm = __vm_create(mode, nr_vcpus, extra_mem_pages);
+ vm = __vm_create(shape, nr_vcpus, extra_mem_pages);

for (i = 0; i < nr_vcpus; ++i)
vcpus[i] = vm_vcpu_add(vm, i, guest_code);
@@ -417,15 +418,15 @@ struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus
return vm;
}

-struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
- uint64_t extra_mem_pages,
- void *guest_code)
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+ struct kvm_vcpu **vcpu,
+ uint64_t extra_mem_pages,
+ void *guest_code)
{
struct kvm_vcpu *vcpus[1];
struct kvm_vm *vm;

- vm = __vm_create_with_vcpus(VM_MODE_DEFAULT, 1, extra_mem_pages,
- guest_code, vcpus);
+ vm = __vm_create_with_vcpus(shape, 1, extra_mem_pages, guest_code, vcpus);

*vcpu = vcpus[0];
return vm;
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index df457452d146..d05487e5a371 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -168,7 +168,8 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
* The memory is also added to memslot 0, but that's a benign side
* effect as KVM allows aliasing HVAs in memslots.
*/
- vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages,
+ vm = __vm_create_with_vcpus(VM_SHAPE(mode), nr_vcpus,
+ slot0_pages + guest_num_pages,
memstress_guest_code, vcpus);

args->vm = vm;
diff --git a/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c b/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
index 85f34ca7e49e..0ed32ec903d0 100644
--- a/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
+++ b/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
@@ -271,7 +271,7 @@ int main(int argc, char *argv[])

kvm_check_cap(KVM_CAP_MCE);

- vm = __vm_create(VM_MODE_DEFAULT, 3, 0);
+ vm = __vm_create(VM_SHAPE_DEFAULT, 3, 0);

kvm_ioctl(vm->kvm_fd, KVM_X86_GET_MCE_CAP_SUPPORTED,
&supported_mcg_caps);
--
2.42.0.820.g83a721a137-goog
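
For illustration, the net effect of the vm_shape conversion on callers (a hedged
sketch; struct vm_shape fields per the diff above, KVM_X86_SW_PROTECTED_VM being
the new VM type the series adds, and guest_code standing in for a test's guest
entry point):

struct kvm_vcpu *vcpu;
struct kvm_vm *vm;

/* Existing callers keep the old behavior via VM_SHAPE()/VM_SHAPE_DEFAULT,
 * which wrap a guest mode with type 0. */
vm = __vm_create(VM_SHAPE_DEFAULT, 3, 0);

/* New tests can additionally request a VM type, e.g. a software-protected
 * VM for exercising guest_memfd without SEV-SNP/TDX hardware. */
struct vm_shape shape = {
	.mode = VM_MODE_DEFAULT,
	.type = KVM_X86_SW_PROTECTED_VM,
};
vm = __vm_create_shape_with_one_vcpu(shape, &vcpu, 0, guest_code);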

2023-10-30 17:35:57

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM

On 10/27/23 20:22, Sean Christopherson wrote:
> Let x86 track the number of address spaces on a per-VM basis so that KVM
> can disallow SMM memslots for confidential VMs. Confidential VMs are
> fundamentally incompatible with emulating SMM, which as the name suggests
> requires being able to read and write guest memory and register state.
>
> Disallowing SMM will simplify support for guest private memory, as KVM
> will not need to worry about tracking memory attributes for multiple
> address spaces (SMM is the only "non-default" address space across all
> architectures).

Reviewed-by: Paolo Bonzini <[email protected]>

2023-10-30 17:37:17

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory

On 10/27/23 20:22, Sean Christopherson wrote:
> Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
> and testing vehicle for Confidential (CoCo) VMs, and potentially to even
> become a "real" product in the distant future, e.g. a la pKVM.
>
> The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
> Intel's TDX, but those technologies are extremely complex (understatement),
> difficult to debug, don't support running as nested guests, and require
> hardware that isn't universally accessible. I.e. relying on SEV-SNP or TDX
> for maintaining guest private memory isn't a realistic option.
>
> At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
> selftests for guest_memfd and private memory support without requiring
> unique hardware.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Paolo Bonzini <[email protected]>

with one nit:

> +---------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: system ioctl
> +
> +This capability returns a bitmap of support VM types. The 1-setting of bit @n

s/support/supported/

Paolo

2023-10-30 17:40:49

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes

On 10/27/23 20:21, Sean Christopherson wrote:
> Non-KVM people, please take a gander at two small-ish patches buried in the
> middle of this series:
>
> fs: Export anon_inode_getfile_secure() for use by KVM
> mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
>
> Our plan/hope is to take this through the KVM tree for 6.8, reviews (and acks!)
> would be much appreciated. Note, adding AS_UNMOVABLE isn't strictly required as
> it's "just" an optimization, but we'd prefer to have it in place straightaway.

Reporting what I wrote in the other thread, for wider distribution:

I'm going to wait a couple days more for reviews to come in, post a v14
myself, and apply the series to kvm/next as soon as Linus merges the 6.7
changes. The series will be based on the 6.7 tags/for-linus, and when
6.7-rc1 comes up, I'll do this to straighten the history:

git checkout kvm/next
git tag -s -f kvm-gmem HEAD
git reset --hard v6.7-rc1
git merge tags/kvm-gmem
# fix conflict with Christian Brauner's VFS series
git commit
git push kvm

6.8 is not going to be out for four months, and I'm pretty sure that
anything that would be discovered within "a few weeks" can also be
applied on top, and the heaviness of a 35-patch series will outweigh any
imperfections by a long margin.

(Full disclosure: this is _also_ because I want to apply this series to
the RHEL kernel, and Red Hat has a high level of disdain for
non-upstream patches. But it's mostly because I want all dependencies
to be able to move on and be developed on top of stock kvm/next).

Paolo

2023-10-31 08:37:19

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> Extend guest_memfd to allow backing guest memory with transparent
> hugepages. Require userspace to opt-in via a flag even though there's no
> known/anticipated use case for forcing small pages as THP is optional,
> i.e. to avoid ending up in a situation where userspace is unaware that
> KVM can't provide hugepages.

Personally, it doesn't seem so "transparent" if userspace is required to opt in.

People need to 1) check if the kernel is built with TRANSPARENT_HUGEPAGE
support, or check whether the transparent hugepage sysfs entry exists;
2) get the maximum supported hugepage size; and 3) ensure the size
satisfies the alignment; all before opting in.

Even simpler, userspace can blindly try to create the guest memfd with
the transparent hugepage flag and, on error, fall back to creating it
without the flag.

However, it doesn't look transparent to me.
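
A minimal userspace sketch of that "blindly try, then fall back" approach,
assuming the v13 KVM_CREATE_GUEST_MEMFD ioctl and flag name (error handling
beyond the fallback elided):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int create_guest_memfd(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd args = {
		.size  = size,
		.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};
	int fd;

	/* Try with the hugepage flag first... */
	fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
	if (fd >= 0)
		return fd;

	/* ...and fall back to creating the file without the flag. */
	args.flags = 0;
	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
}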

2023-10-31 14:16:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > Extend guest_memfd to allow backing guest memory with transparent
> > hugepages. Require userspace to opt-in via a flag even though there's no
> > known/anticipated use case for forcing small pages as THP is optional,
> > i.e. to avoid ending up in a situation where userspace is unaware that
> > KVM can't provide hugepages.
>
> Personally, it doesn't seem so "transparent" if userspace is required to opt in.
>
> People need to 1) check if the kernel is built with TRANSPARENT_HUGEPAGE
> support, or check whether the transparent hugepage sysfs entry exists;
> 2) get the maximum supported hugepage size; and 3) ensure the size
> satisfies the alignment; all before opting in.
>
> Even simpler, userspace can blindly try to create the guest memfd with
> the transparent hugepage flag and, on error, fall back to creating it
> without the flag.
>
> However, it doesn't look transparent to me.

The "transparent" part is referring to the underlying kernel mechanism, it's not
saying anything about the API. The "transparent" part of THP is that the kernel
doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
(mostly) transparent to userspace.

Paolo also isn't the biggest fan[*], but there are also downsides to always
allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
and there's precedent in the form of MADV_HUGEPAGE.

[*] https://lore.kernel.org/all/[email protected]
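
For reference, the MADV_HUGEPAGE precedent being cited is purely advisory; a
sketch of the hint semantics it provides:

#include <stdio.h>
#include <sys/mman.h>

/* MADV_HUGEPAGE is a best-effort hint: the kernel may or may not back the
 * range with hugepages, and a failure is typically ignorable. */
static void hint_hugepages(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
}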

2023-11-01 07:26:28

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Xiaoyao Li wrote:
>> On 10/28/2023 2:21 AM, Sean Christopherson wrote:
>>> Extend guest_memfd to allow backing guest memory with transparent
>>> hugepages. Require userspace to opt-in via a flag even though there's no
>>> known/anticipated use case for forcing small pages as THP is optional,
>>> i.e. to avoid ending up in a situation where userspace is unaware that
>>> KVM can't provide hugepages.
>>
>> Personally, it doesn't seem so "transparent" if userspace is required to opt in.
>>
>> People need to 1) check if the kernel is built with TRANSPARENT_HUGEPAGE
>> support, or check whether the transparent hugepage sysfs entry exists;
>> 2) get the maximum supported hugepage size; and 3) ensure the size
>> satisfies the alignment; all before opting in.
>>
>> Even simpler, userspace can blindly try to create the guest memfd with
>> the transparent hugepage flag and, on error, fall back to creating it
>> without the flag.
>>
>> However, it doesn't look transparent to me.
>
> The "transparent" part is referring to the underlying kernel mechanism, it's not
> saying anything about the API. The "transparent" part of THP is that the kernel
> doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
> (mostly) transparent to userspace.
>
> Paolo also isn't the biggest fan[*], but there are also downsides to always
> allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
> and there's precedent in the form of MADV_HUGEPAGE.
>
> [*] https://lore.kernel.org/all/[email protected]

But it's different from MADV_HUGEPAGE, in a way. Per my understanding,
the failure of MADV_HUGEPAGE is not fatal; user space can ignore it and
continue.

However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which
leads to failure of guest memfd creation.

For the current implementation, I think maybe
KVM_GUEST_MEMFD_DESIRE_HUGEPAGE fits better than
KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? Or maybe *PREFER*?

2023-11-01 13:42:03

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > > Extend guest_memfd to allow backing guest memory with transparent
> > > > hugepages. Require userspace to opt-in via a flag even though there's no
> > > > known/anticipated use case for forcing small pages as THP is optional,
> > > > i.e. to avoid ending up in a situation where userspace is unaware that
> > > > KVM can't provide hugepages.
> > >
> > > Personally, it doesn't seem so "transparent" if userspace is required to opt in.
> > >
> > > People need to 1) check if the kernel is built with TRANSPARENT_HUGEPAGE
> > > support, or check whether the transparent hugepage sysfs entry exists;
> > > 2) get the maximum supported hugepage size; and 3) ensure the size
> > > satisfies the alignment; all before opting in.
> > >
> > > Even simpler, userspace can blindly try to create the guest memfd with
> > > the transparent hugepage flag and, on error, fall back to creating it
> > > without the flag.
> > >
> > > However, it doesn't look transparent to me.
> >
> > The "transparent" part is referring to the underlying kernel mechanism, it's not
> > saying anything about the API. The "transparent" part of THP is that the kernel
> > doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
> > (mostly) transparent to userspace.
> >
> > Paolo also isn't the biggest fan[*], but there are also downsides to always
> > allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
> > and there's precedent in the form of MADV_HUGEPAGE.
> >
> > [*] https://lore.kernel.org/all/[email protected]
>
> But it's different from MADV_HUGEPAGE, in a way. Per my understanding, the
> failure of MADV_HUGEPAGE is not fatal; user space can ignore it and
> continue.
>
> However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> to failure of guest memfd creation.

Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
action from userspace, i.e. instead of ignoring the error, userspace could redo
KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.

We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could
extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
userspace provide advice/hints after creating a guest_memfd. But I suspect that
guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation hint
is actually less desirable.

KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
compatible size or the kernel doesn't support THP. An incompatible size is likely
a userspace bug, and for most setups that want to utilize guest_memfd, lack of THP
support is likely a configuration bug. I.e. many/most uses *want* failures due to
KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.

> For the current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? Or maybe *PREFER*?

Why? Verbs like "prefer" and "desire" aren't a good fit IMO because they suggest
the flag is a hint, and hints are usually best effort only, i.e. are ignored if
there is a fundamental incompatibility.

"Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
(yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory
pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.

2023-11-01 13:50:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On Wed, Nov 1, 2023 at 2:41 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> > On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > > > Extend guest_memfd to allow backing guest memory with transparent
> > > > > hugepages. Require userspace to opt-in via a flag even though there's no
> > > > > known/anticipated use case for forcing small pages as THP is optional,
> > > > > i.e. to avoid ending up in a situation where userspace is unaware that
> > > > > KVM can't provide hugepages.
> > > >
> > > > Personally, it doesn't seem so "transparent" if userspace is required to opt in.
> > > >
> > > > People need to 1) check if the kernel is built with TRANSPARENT_HUGEPAGE
> > > > support, or check whether the transparent hugepage sysfs entry exists;
> > > > 2) get the maximum supported hugepage size; and 3) ensure the size
> > > > satisfies the alignment; all before opting in.
> > > >
> > > > Even simpler, userspace can blindly try to create the guest memfd with
> > > > the transparent hugepage flag and, on error, fall back to creating it
> > > > without the flag.
> > > >
> > > > However, it doesn't look transparent to me.
> > >
> > > The "transparent" part is referring to the underlying kernel mechanism, it's not
> > > saying anything about the API. The "transparent" part of THP is that the kernel
> > > doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
> > > (mostly) transparent to userspace.
> > >
> > > Paolo also isn't the biggest fan[*], but there are also downsides to always
> > > allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
> > > and there's precedent in the form of MADV_HUGEPAGE.
> > >
> > > [*] https://lore.kernel.org/all/[email protected]
> >
> > But it's different from MADV_HUGEPAGE, in a way. Per my understanding, the
> > failure of MADV_HUGEPAGE is not fatal; user space can ignore it and
> > continue.
> >
> > However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> > to failure of guest memfd creation.
>
> Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
> action from userspace, i.e. instead of ignoring the error, userspace could redo
> KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.
>
> We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could
> extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
> userspace provide advice/hints after creating a guest_memfd. But I suspect that
> guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation hint
> is actually less desirable.
>
> KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
> compatible size or the kernel doesn't support THP. An incompatible size is likely
> a userspace bug, and for most setups that want to utilize guest_memfd, lack of THP
> support is likely a configuration bug. I.e. many/most uses *want* failures due to
> KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.
>
> > For the current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> > fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? Or maybe *PREFER*?
>
> Why? Verbs like "prefer" and "desire" aren't a good fit IMO because they suggest
> the flag is a hint, and hints are usually best effort only, i.e. are ignored if
> there is a fundamental incompatibility.
>
> "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
> or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
> (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
> a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory
> pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.

I think that the current patch is fine, but I will adjust it to always
allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
If hugepages are not guaranteed, and (theoretically) you could have no
hugepage at all in the result, it's okay to get this result even if THP is not
available in the kernel.

Paolo

2023-11-01 16:36:11

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> On Wed, Nov 1, 2023 at 2:41 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> > > On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > > > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > But it's different from MADV_HUGEPAGE, in a way. Per my understanding, the
> > > failure of MADV_HUGEPAGE is not fatal; user space can ignore it and
> > > continue.
> > >
> > > However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> > > to failure of guest memfd creation.
> >
> > Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
> > action from userspace, i.e. instead of ignoring the error, userspace could redo
> > KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.
> >
> > We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could
> > extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
> > userspace provide advice/hints after creating a guest_memfd. But I suspect that
> > guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation hint
> > is actually less desirable.
> >
> > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
> > compatible size or the kernel doesn't support THP. An incompatible size is likely
> > a userspace bug, and for most setups that want to utilize guest_memfd, lack of THP
> > support is likely a configuration bug. I.e. many/most uses *want* failures due to
> > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.
> >
> > > For the current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> > > fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? Or maybe *PREFER*?
> >
> > Why? Verbs like "prefer" and "desire" aren't a good fit IMO because they suggest
> > the flag is a hint, and hints are usually best effort only, i.e. are ignored if
> > there is a fundamental incompatibility.
> >
> > "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
> > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
> > (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
> > a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory
> > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
>
> I think that the current patch is fine, but I will adjust it to always
> allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
> If hugepages are not guaranteed, and (theoretically) you could have no
> hugepage at all in the result, it's okay to get this result even if THP is not
> available in the kernel.

Can you post a fixup patch? It's not clear to me exactly what behavior you intend
to end up with.

2023-11-01 22:29:43

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On 11/1/23 17:36, Sean Christopherson wrote:
>>> "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
>>> or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
>>> (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
>>> a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory
>>> pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
>> I think that the current patch is fine, but I will adjust it to always
>> allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
>> If hugepages are not guaranteed, and (theoretically) you could have no
>> hugepage at all in the result, it's okay to get this result even if THP is not
>> available in the kernel.
> Can you post a fixup patch? It's not clear to me exactly what behavior you intend
> to end up with.

Sure, just this:

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7d1a33c2ad42..34fd070e03d9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
{
loff_t size = args->size;
u64 flags = args->flags;
- u64 valid_flags = 0;
-
- if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
- valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+ u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;

if (flags & ~valid_flags)
return -EINVAL;
@@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
if (size < 0 || !PAGE_ALIGNED(size))
return -EINVAL;

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
!IS_ALIGNED(size, HPAGE_PMD_SIZE))
return -EINVAL;
-#endif

return __kvm_gmem_create(kvm, size, flags);
}

Paolo

2023-11-01 22:35:52

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> On 11/1/23 17:36, Sean Christopherson wrote:
> > > > "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
> > > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
> > > > (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
> > > > a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory
> > > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
> > > I think that the current patch is fine, but I will adjust it to always
> > > allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
> > > If hugepages are not guaranteed, and (theoretically) you could have no
> > > hugepage at all in the result, it's okay to get this result even if THP is not
> > > available in the kernel.
> > Can you post a fixup patch? It's not clear to me exactly what behavior you intend
> > to end up with.
>
> Sure, just this:
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 7d1a33c2ad42..34fd070e03d9 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> {
> loff_t size = args->size;
> u64 flags = args->flags;
> - u64 valid_flags = 0;
> -
> - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> if (flags & ~valid_flags)
> return -EINVAL;
> @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> if (size < 0 || !PAGE_ALIGNED(size))
> return -EINVAL;
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> return -EINVAL;
> -#endif

That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y.

#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

...

> return __kvm_gmem_create(kvm, size, flags);
> }
>
> Paolo
>

2023-11-01 23:18:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> > On 11/1/23 17:36, Sean Christopherson wrote:
> > > > > "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
> > > > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
> > > > > (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
> > > > > a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory
> > > > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
> > > > I think that the current patch is fine, but I will adjust it to always
> > > > allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
> > > > If hugepages are not guaranteed, and (theoretically) you could have no
> > > > hugepage at all in the result, it's okay to get this result even if THP is not
> > > > available in the kernel.
> > > Can you post a fixup patch? It's not clear to me exactly what behavior you intend
> > > to end up with.
> >
> > Sure, just this:
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 7d1a33c2ad42..34fd070e03d9 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > {
> > loff_t size = args->size;
> > u64 flags = args->flags;
> > - u64 valid_flags = 0;
> > -
> > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > if (flags & ~valid_flags)
> > return -EINVAL;
> > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > if (size < 0 || !PAGE_ALIGNED(size))
> > return -EINVAL;
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> > !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> > return -EINVAL;
> > -#endif
>
> That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y.
>
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

Would have caught it when actually testing it, I guess. :) It has to
be PMD_SIZE, possibly with

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
#endif

for extra safety.
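
Putting those two pieces together, the adjusted check would presumably become
the following (an untested sketch; Sean's follow-up patch later in the thread
likewise lands on PMD_SIZE):

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* Sanity check that the THP PMD size matches the generic one. */
	BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
#endif
	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
	    !IS_ALIGNED(size, PMD_SIZE))
		return -EINVAL;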

Paolo

2023-11-02 14:36:29

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <[email protected]> wrote:
>
> Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
> KVM_ADDRESS_SPACE_NUM.
>
> No functional change intended.
>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---

Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>

Cheers,
/fuad

> arch/x86/include/asm/kvm_host.h | 1 -
> include/linux/kvm_host.h | 2 +-
> 2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8d60e4745e8b..6702f795c862 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2124,7 +2124,6 @@ enum {
> #define HF_SMM_MASK (1 << 1)
> #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>
> -# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> # define KVM_ADDRESS_SPACE_NUM 2
> # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
> # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e3223cafd7db..c3cfe08b1300 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -692,7 +692,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
> #define KVM_MEM_SLOTS_NUM SHRT_MAX
> #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
>
> -#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> +#if KVM_ADDRESS_SPACE_NUM == 1
> static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> {
> return 0;
> --
> 2.42.0.820.g83a721a137-goog
>

2023-11-02 14:53:38

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <[email protected]> wrote:
>
> Let x86 track the number of address spaces on a per-VM basis so that KVM
> can disallow SMM memslots for confidential VMs. Confidential VMs are
> fundamentally incompatible with emulating SMM, which as the name suggests
> requires being able to read and write guest memory and register state.
>
> Disallowing SMM will simplify support for guest private memory, as KVM
> will not need to worry about tracking memory attributes for multiple
> address spaces (SMM is the only "non-default" address space across all
> architectures).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---

Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>

Cheers,
/fuad

> arch/powerpc/kvm/book3s_hv.c | 2 +-
> arch/x86/include/asm/kvm_host.h | 8 +++++++-
> arch/x86/kvm/debugfs.c | 2 +-
> arch/x86/kvm/mmu/mmu.c | 6 +++---
> arch/x86/kvm/x86.c | 2 +-
> include/linux/kvm_host.h | 17 +++++++++++------
> virt/kvm/dirty_ring.c | 2 +-
> virt/kvm/kvm_main.c | 26 ++++++++++++++------------
> 8 files changed, 39 insertions(+), 26 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 130bafdb1430..9b0eaa17275a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
> }
>
> srcu_idx = srcu_read_lock(&kvm->srcu);
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> struct kvm_memory_slot *memslot;
> struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> int bkt;
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6702f795c862..f9e8d5642069 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2124,9 +2124,15 @@ enum {
> #define HF_SMM_MASK (1 << 1)
> #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>
> -# define KVM_ADDRESS_SPACE_NUM 2
> +# define KVM_MAX_NR_ADDRESS_SPACES 2
> # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
> # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> +
> +static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
> +{
> + return KVM_MAX_NR_ADDRESS_SPACES;
> +}
> +
> #else
> # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
> #endif
> diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
> index ee8c4c3496ed..42026b3f3ff3 100644
> --- a/arch/x86/kvm/debugfs.c
> +++ b/arch/x86/kvm/debugfs.c
> @@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
> mutex_lock(&kvm->slots_lock);
> write_lock(&kvm->mmu_lock);
>
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> int bkt;
>
> slots = __kvm_memslots(kvm, i);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c4e758f0aebb..baeba8fc1c38 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3755,7 +3755,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
> kvm_page_track_write_tracking_enabled(kvm))
> goto out_success;
>
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> slots = __kvm_memslots(kvm, i);
> kvm_for_each_memslot(slot, bkt, slots) {
> /*
> @@ -6294,7 +6294,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
> if (!kvm_memslots_have_rmaps(kvm))
> return flush;
>
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> slots = __kvm_memslots(kvm, i);
>
> kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
> @@ -6791,7 +6791,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> * modifier prior to checking for a wrap of the MMIO generation so
> * that a wrap in any address space is detected.
> */
> - gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
> + gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
>
> /*
> * The very rare case: if the MMIO generation number has wrapped,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 824b58b44382..c4d17727b199 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12456,7 +12456,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> hva = slot->userspace_addr;
> }
>
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> struct kvm_userspace_memory_region2 m;
>
> m.slot = id | (i << 16);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c3cfe08b1300..687589ce9f63 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -80,8 +80,8 @@
> /* Two fragments for cross MMIO pages. */
> #define KVM_MAX_MMIO_FRAGMENTS 2
>
> -#ifndef KVM_ADDRESS_SPACE_NUM
> -#define KVM_ADDRESS_SPACE_NUM 1
> +#ifndef KVM_MAX_NR_ADDRESS_SPACES
> +#define KVM_MAX_NR_ADDRESS_SPACES 1
> #endif
>
> /*
> @@ -692,7 +692,12 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
> #define KVM_MEM_SLOTS_NUM SHRT_MAX
> #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
>
> -#if KVM_ADDRESS_SPACE_NUM == 1
> +#if KVM_MAX_NR_ADDRESS_SPACES == 1
> +static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
> +{
> + return KVM_MAX_NR_ADDRESS_SPACES;
> +}
> +
> static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> {
> return 0;
> @@ -747,9 +752,9 @@ struct kvm {
> struct mm_struct *mm; /* userspace tied to this vm */
> unsigned long nr_memslot_pages;
> /* The two memslot sets - active and inactive (per address space) */
> - struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
> + struct kvm_memslots __memslots[KVM_MAX_NR_ADDRESS_SPACES][2];
> /* The current active memslot set for each address space */
> - struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
> + struct kvm_memslots __rcu *memslots[KVM_MAX_NR_ADDRESS_SPACES];
> struct xarray vcpu_array;
> /*
> * Protected by slots_lock, but can be read outside if an
> @@ -1018,7 +1023,7 @@ void kvm_put_kvm_no_destroy(struct kvm *kvm);
>
> static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
> {
> - as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
> + as_id = array_index_nospec(as_id, KVM_MAX_NR_ADDRESS_SPACES);
> return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
> lockdep_is_held(&kvm->slots_lock) ||
> !refcount_read(&kvm->users_count));
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index c1cd7dfe4a90..86d267db87bb 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -58,7 +58,7 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> as_id = slot >> 16;
> id = (u16)slot;
>
> - if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> + if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
> return;
>
> memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5d1a2f1b4e94..23633984142f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -615,7 +615,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>
> idx = srcu_read_lock(&kvm->srcu);
>
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> struct interval_tree_node *node;
>
> slots = __kvm_memslots(kvm, i);
> @@ -1248,7 +1248,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> goto out_err_no_irq_srcu;
>
> refcount_set(&kvm->users_count, 1);
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> for (j = 0; j < 2; j++) {
> slots = &kvm->__memslots[i][j];
>
> @@ -1398,7 +1398,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
> #endif
> kvm_arch_destroy_vm(kvm);
> kvm_destroy_devices(kvm);
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> }
> @@ -1681,7 +1681,7 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
> * space 0 will use generations 0, 2, 4, ... while address space 1 will
> * use generations 1, 3, 5, ...
> */
> - gen += KVM_ADDRESS_SPACE_NUM;
> + gen += kvm_arch_nr_memslot_as_ids(kvm);
>
> kvm_arch_memslots_updated(kvm, gen);
>
> @@ -2051,7 +2051,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> (mem->guest_memfd_offset & (PAGE_SIZE - 1) ||
> mem->guest_memfd_offset + mem->memory_size < mem->guest_memfd_offset))
> return -EINVAL;
> - if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> + if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
> return -EINVAL;
> if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> return -EINVAL;
> @@ -2187,7 +2187,7 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
>
> as_id = log->slot >> 16;
> id = (u16)log->slot;
> - if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> + if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
> return -EINVAL;
>
> slots = __kvm_memslots(kvm, as_id);
> @@ -2249,7 +2249,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>
> as_id = log->slot >> 16;
> id = (u16)log->slot;
> - if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> + if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
> return -EINVAL;
>
> slots = __kvm_memslots(kvm, as_id);
> @@ -2361,7 +2361,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>
> as_id = log->slot >> 16;
> id = (u16)log->slot;
> - if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> + if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
> return -EINVAL;
>
> if (log->first_page & 63)
> @@ -2502,7 +2502,7 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
> gfn_range.only_private = false;
> gfn_range.only_shared = false;
>
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> slots = __kvm_memslots(kvm, i);
>
> kvm_for_each_memslot_in_gfn_range(&iter, slots, range->start, range->end) {
> @@ -4857,9 +4857,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> case KVM_CAP_IRQ_ROUTING:
> return KVM_MAX_IRQ_ROUTES;
> #endif
> -#if KVM_ADDRESS_SPACE_NUM > 1
> +#if KVM_MAX_NR_ADDRESS_SPACES > 1
> case KVM_CAP_MULTI_ADDRESS_SPACE:
> - return KVM_ADDRESS_SPACE_NUM;
> + if (kvm)
> + return kvm_arch_nr_memslot_as_ids(kvm);
> + return KVM_MAX_NR_ADDRESS_SPACES;
> #endif
> case KVM_CAP_NR_MEMSLOTS:
> return KVM_USER_MEM_SLOTS;
> @@ -4967,7 +4969,7 @@ bool kvm_are_all_memslots_empty(struct kvm *kvm)
>
> lockdep_assert_held(&kvm->slots_lock);
>
> - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> if (!kvm_memslots_empty(__kvm_memslots(kvm, i)))
> return false;
> }
> --
> 2.42.0.820.g83a721a137-goog
>
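
The payoff of this refactor comes in the next patch: with the count now
per-VM, x86 can report a single memslot address space for VMs with private
memory, since SMM is disallowed there. A hedged sketch of what that per-VM
override looks like (not the literal follow-up code):

/* Hypothetical x86 override: SMM is incompatible with private memory, so
 * such VMs get only the one default address space. */
static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
{
	return kvm_arch_has_private_mem(kvm) ? 1 : KVM_MAX_NR_ADDRESS_SPACES;
}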

2023-11-02 15:38:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> > > On 11/1/23 17:36, Sean Christopherson wrote:
> > > > Can you post a fixup patch? It's not clear to me exactly what behavior you intend
> > > > to end up with.
> > >
> > > Sure, just this:
> > >
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 7d1a33c2ad42..34fd070e03d9 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > > {
> > > loff_t size = args->size;
> > > u64 flags = args->flags;
> > > - u64 valid_flags = 0;
> > > -
> > > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > > if (flags & ~valid_flags)
> > > return -EINVAL;
> > > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > > if (size < 0 || !PAGE_ALIGNED(size))
> > > return -EINVAL;
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> > > !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> > > return -EINVAL;
> > > -#endif
> >
> > That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y.
> >
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> > #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> > #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> > #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
>
> Would have caught it when actually testing it, I guess. :) It has to
> be PMD_SIZE, possibly with
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
> #endif

Yeah, that works for me.

Actually, looking at this again, there's not actually a hard dependency on THP.
A THP-enabled kernel _probably_ gives a higher probability of using hugepages,
but mostly because THP selects COMPACTION, and I suppose because using THP for
other allocations reduces overall fragmentation.

So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
we should do the below (I verified KVM can create hugepages with THP=n). We'll
need another capability, but (a) we probably should have that anyways and (b) it
provides a cleaner path to adding PUD-sized hugepage support in the future.

And then adjust the tests like so:

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index c15de9852316..c9f449718fce 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -201,6 +201,10 @@ int main(int argc, char *argv[])

TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));

+ if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured())
+ TEST_ASSERT_EQ(get_trans_hugepagesz(),
+ kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE));
+
page_size = getpagesize();
total_size = page_size * 4;

diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
index be311944e90a..245901587ed2 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t

vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));

- if (backing_src_can_be_huge(src_type))
+ if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE))
memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
else
memfd_flags = 0;

--
From: Sean Christopherson <[email protected]>
Date: Wed, 25 Oct 2023 16:26:41 -0700
Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest
memory

Extend guest_memfd to allow backing guest memory with hugepages. For now,
make hugepage utilization best-effort, i.e. fall back to non-huge mappings
if a hugepage can't be allocated. Guaranteeing hugepages would require a
dedicated memory pool and significantly more complexity and churn.

Require userspace to opt-in via a flag even though it's unlikely real use
cases will ever want to use order-0 pages, e.g. to give userspace a safety
valve in case hugepage support is buggy, and to allow for easier testing
of both paths.

Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling
primarily deals with userspace page tables, which are explicitly not in
play for guest_memfd. Selecting THP does make obtaining hugepages more
likely, but only because THP selects CONFIG_COMPACTION. Don't select
CONFIG_COMPACTION either, because again it's not a hard dependency.

For simplicity, require the guest_memfd size to be a multiple of the
hugepage size, e.g. so that KVM doesn't need to do bounds checking when
deciding whether or not to allocate a huge folio.

When reporting the max order when KVM gets a pfn from guest_memfd, force
order-0 pages if the hugepage is not fully contained by the memslot
binding, e.g. if userspace requested hugepages but punches a hole in the
memslot bindings in order to emulate x86's VGA hole.

Signed-off-by: Sean Christopherson <[email protected]>
---
Documentation/virt/kvm/api.rst | 17 +++++++++
include/uapi/linux/kvm.h | 3 ++
virt/kvm/guest_memfd.c | 69 +++++++++++++++++++++++++++++-----
virt/kvm/kvm_main.c | 4 ++
4 files changed, 84 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e82c69d5e755..ccdd5413920d 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6176,6 +6176,8 @@ and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
__u64 reserved[6];
};

+ #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
Conceptually, the inode backing a guest_memfd file represents physical memory,
i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The
file itself, which is bound to a "struct kvm", is that instance's view of the
@@ -6192,6 +6194,12 @@ most one mapping per page, i.e. binding multiple memory regions to a single
guest_memfd range is not allowed (any number of memory regions can be bound to
a single guest_memfd file, but the bound ranges must not overlap).

+If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
+and map PMD-size hugepages for the guest_memfd file. This is currently best
+effort. If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, size must be aligned to at
+least the size reported by KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE (which also
+enumerates support for KVM_GUEST_MEMFD_ALLOW_HUGEPAGE).
+
See KVM_SET_USER_MEMORY_REGION2 for additional details.

5. The kvm_run structure
@@ -8639,6 +8647,15 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
64-bit bitmap (each bit describing a block size). The default value is
0, to disable the eager page splitting.

+
+8.41 KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE
+------------------------------------------
+
+This is an information-only capability that returns guest_memfd's hugepage size
+for PMD hugepages. Returns '0' if guest_memfd is not supported, or if KVM
+doesn't support creating hugepages for guest_memfd. Note, guest_memfd doesn't
+currently support PUD-sized hugepages.
+
9. Known KVM API problems
=========================

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 25caee8d1a80..b78d0e3bf22a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1217,6 +1217,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_MEMORY_FAULT_INFO 231
#define KVM_CAP_MEMORY_ATTRIBUTES 232
#define KVM_CAP_GUEST_MEMFD 233
+#define KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE 234

#ifdef KVM_CAP_IRQ_ROUTING

@@ -2303,4 +2304,6 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 98a12da80214..31b5e94d461a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -13,14 +13,44 @@ struct kvm_gmem {
struct list_head entry;
};

+#define NR_PAGES_PER_PMD (1 << PMD_ORDER)
+
+static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
+{
+ unsigned long huge_index = round_down(index, NR_PAGES_PER_PMD);
+ unsigned long flags = (unsigned long)inode->i_private;
+ struct address_space *mapping = inode->i_mapping;
+ gfp_t gfp = mapping_gfp_mask(mapping);
+ struct folio *folio;
+
+ if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
+ return NULL;
+
+ if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
+ (huge_index + NR_PAGES_PER_PMD - 1) << PAGE_SHIFT))
+ return NULL;
+
+ folio = filemap_alloc_folio(gfp, PMD_ORDER);
+ if (!folio)
+ return NULL;
+
+ if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
+ folio_put(folio);
+ return NULL;
+ }
+ return folio;
+}
+
static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
{
struct folio *folio;

- /* TODO: Support huge pages. */
- folio = filemap_grab_folio(inode->i_mapping, index);
- if (IS_ERR_OR_NULL(folio))
- return NULL;
+ folio = kvm_gmem_get_huge_folio(inode, index);
+ if (!folio) {
+ folio = filemap_grab_folio(inode->i_mapping, index);
+ if (IS_ERR_OR_NULL(folio))
+ return NULL;
+ }

/*
* Use the up-to-date flag to track whether or not the memory has been
@@ -373,6 +403,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
inode->i_mode |= S_IFREG;
inode->i_size = size;
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ mapping_set_large_folios(inode->i_mapping);
mapping_set_unmovable(inode->i_mapping);
/* Unmovable mappings are supposed to be marked unevictable as well. */
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
@@ -394,14 +425,18 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)

int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
{
+ u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
loff_t size = args->size;
u64 flags = args->flags;
- u64 valid_flags = 0;

if (flags & ~valid_flags)
return -EINVAL;

- if (size < 0 || !PAGE_ALIGNED(size))
+ if (size <= 0 || !PAGE_ALIGNED(size))
+ return -EINVAL;
+
+ if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+ !IS_ALIGNED(size, PMD_SIZE))
return -EINVAL;

return __kvm_gmem_create(kvm, size, flags);
@@ -501,7 +536,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
{
- pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+ pgoff_t index, huge_index;
struct kvm_gmem *gmem;
struct folio *folio;
struct page *page;
@@ -514,6 +549,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,

gmem = file->private_data;

+ index = gfn - slot->base_gfn + slot->gmem.pgoff;
if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
r = -EIO;
goto out_fput;
@@ -533,9 +569,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
page = folio_file_page(folio, index);

*pfn = page_to_pfn(page);
- if (max_order)
+ if (!max_order)
+ goto success;
+
+ *max_order = compound_order(compound_head(page));
+ if (!*max_order)
+ goto success;
+
+ /*
+ * The folio can be mapped with a hugepage if and only if the folio is
+ * fully contained by the range the memslot is bound to. Note, the
+ * caller is responsible for handling gfn alignment, this only deals
+ * with the file binding.
+ */
+ huge_index = ALIGN(index, 1ull << *max_order);
+ if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
+ huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages)
*max_order = 0;
-
+success:
r = 0;

out_unlock:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5d1a2f1b4e94..0711f2c75667 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4888,6 +4888,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
#ifdef CONFIG_KVM_PRIVATE_MEM
case KVM_CAP_GUEST_MEMFD:
return !kvm || kvm_arch_has_private_mem(kvm);
+ case KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE:
+ if (kvm && !kvm_arch_has_private_mem(kvm))
+ return 0;
+ return PMD_SIZE;
#endif
default:
break;

base-commit: fcbef1e5e5d2a60dacac0d16c06ac00bedaefc0f
--

2023-11-02 15:49:15

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson <[email protected]> wrote:
> Actually, looking at this again, there's not actually a hard dependency on THP.
> A THP-enabled kernel _probably_ gives a higher probability of using hugepages,
> but mostly because THP selects COMPACTION, and I suppose because using THP for
> other allocations reduces overall fragmentation.

Yes, that's why I didn't even bother enabling it unless THP is
enabled, but it makes even more sense to just try.

> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
> we should do the below (I verified KVM can create hugepages with THP=n). We'll
> need another capability, but (a) we probably should have that anyways and (b) it
> provides a cleaner path to adding PUD-sized hugepage support in the future.

I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
should be a generic kernel API and in fact the sizes are available in
a not-so-friendly format in /sys/kernel/mm/hugepages.

We should just add /sys/kernel/mm/hugepages/sizes that contains
"2097152 1073741824" on x86 (only the former if 1G pages are not
supported).
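
For context, today's sysfs interface is one directory per supported size under
/sys/kernel/mm/hugepages (e.g. hugepages-2048kB), which userspace has to parse
by name; a minimal sketch of that not-so-friendly dance:

#include <dirent.h>
#include <stdio.h>

/* Fill sizes[] (in bytes) from /sys/kernel/mm/hugepages; returns the count. */
static int get_hugepage_sizes(unsigned long *sizes, int max)
{
	DIR *dir = opendir("/sys/kernel/mm/hugepages");
	struct dirent *d;
	unsigned long kb;
	int n = 0;

	if (!dir)
		return 0;

	while (n < max && (d = readdir(dir))) {
		/* Directory names follow the "hugepages-<size>kB" pattern. */
		if (sscanf(d->d_name, "hugepages-%lukB", &kb) == 1)
			sizes[n++] = kb * 1024;
	}
	closedir(dir);
	return n;
}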

Plus: is this the best API if we need something else for 1G pages?

Let's drop *this* patch and proceed incrementally. (Again, this is
what I want to do with this final review: identify places that are
still sticky, and don't let them block the rest).

Coincidentally we have an open spot next week at Plumbers. Let's
extend Fuad's section to cover more guestmem work.

Paolo

> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
> index c15de9852316..c9f449718fce 100644
> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> @@ -201,6 +201,10 @@ int main(int argc, char *argv[])
>
> TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
>
> + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured())
> + TEST_ASSERT_EQ(get_trans_hugepagesz(),
> + kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE));
> +
> page_size = getpagesize();
> total_size = page_size * 4;
>
> diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> index be311944e90a..245901587ed2 100644
> --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> @@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
>
> vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
>
> - if (backing_src_can_be_huge(src_type))
> + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE))
> memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> else
> memfd_flags = 0;
>
> --
> From: Sean Christopherson <[email protected]>
> Date: Wed, 25 Oct 2023 16:26:41 -0700
> Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest
> memory
>
> Extend guest_memfd to allow backing guest memory with hugepages. For now,
> make hugepage utilization best-effort, i.e. fall back to non-huge mappings
> if a hugepage can't be allocated. Guaranteeing hugepages would require a
> dedicated memory pool and significantly more complexity and churn.
>
> Require userspace to opt-in via a flag even though it's unlikely real use
> cases will ever want to use order-0 pages, e.g. to give userspace a safety
> valve in case hugepage support is buggy, and to allow for easier testing
> of both paths.
>
> Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling
> primarily deals with userspace page tables, which are explicitly not in
> play for guest_memfd. Selecting THP does make obtaining hugepages more
> likely, but only because THP selects CONFIG_COMPACTION. Don't select
> CONFIG_COMPACTION either, because again it's not a hard dependency.
>
> For simplicity, require the guest_memfd size to be a multiple of the
> hugepage size, e.g. so that KVM doesn't need to do bounds checking when
> deciding whether or not to allocate a huge folio.
>
> When reporting the max order when KVM gets a pfn from guest_memfd, force
> order-0 pages if the hugepage is not fully contained by the memslot
> binding, e.g. if userspace requested hugepages but punches a hole in the
> memslot bindings in order to emulate x86's VGA hole.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 17 +++++++++
> include/uapi/linux/kvm.h | 3 ++
> virt/kvm/guest_memfd.c | 69 +++++++++++++++++++++++++++++-----
> virt/kvm/kvm_main.c | 4 ++
> 4 files changed, 84 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index e82c69d5e755..ccdd5413920d 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6176,6 +6176,8 @@ and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
> __u64 reserved[6];
> };
>
> + #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
> +
> Conceptually, the inode backing a guest_memfd file represents physical memory,
> i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The
> file itself, which is bound to a "struct kvm", is that instance's view of the
> @@ -6192,6 +6194,12 @@ most one mapping per page, i.e. binding multiple memory regions to a single
> guest_memfd range is not allowed (any number of memory regions can be bound to
> a single guest_memfd file, but the bound ranges must not overlap).
>
> +If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
> +and map PMD-size hugepages for the guest_memfd file. This is currently best
> +effort. If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, size must be aligned to at
> +least the size reported by KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE (which also
> +enumerates support for KVM_GUEST_MEMFD_ALLOW_HUGEPAGE).
> +
> See KVM_SET_USER_MEMORY_REGION2 for additional details.
>
> 5. The kvm_run structure
> @@ -8639,6 +8647,15 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
> 64-bit bitmap (each bit describing a block size). The default value is
> 0, to disable the eager page splitting.
>
> +
> +8.41 KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE
> +------------------------------------------
> +
> +This is an information-only capability that returns guest_memfd's hugepage size
> +for PMD hugepages. Returns '0' if guest_memfd is not supported, or if KVM
> +doesn't support creating hugepages for guest_memfd. Note, guest_memfd doesn't
> +currently support PUD-sized hugepages.
> +
> 9. Known KVM API problems
> =========================
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 25caee8d1a80..b78d0e3bf22a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1217,6 +1217,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_MEMORY_FAULT_INFO 231
> #define KVM_CAP_MEMORY_ATTRIBUTES 232
> #define KVM_CAP_GUEST_MEMFD 233
> +#define KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE 234
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2303,4 +2304,6 @@ struct kvm_create_guest_memfd {
> __u64 reserved[6];
> };
>
> +#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 98a12da80214..31b5e94d461a 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -13,14 +13,44 @@ struct kvm_gmem {
> struct list_head entry;
> };
>
> +#define NR_PAGES_PER_PMD (1 << PMD_ORDER)
> +
> +static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
> +{
> + unsigned long huge_index = round_down(index, NR_PAGES_PER_PMD);
> + unsigned long flags = (unsigned long)inode->i_private;
> + struct address_space *mapping = inode->i_mapping;
> + gfp_t gfp = mapping_gfp_mask(mapping);
> + struct folio *folio;
> +
> + if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
> + return NULL;
> +
> + if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
> + (huge_index + NR_PAGES_PER_PMD - 1) << PAGE_SHIFT))
> + return NULL;
> +
> + folio = filemap_alloc_folio(gfp, PMD_ORDER);
> + if (!folio)
> + return NULL;
> +
> + if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
> + folio_put(folio);
> + return NULL;
> + }
> + return folio;
> +}
> +
> static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> {
> struct folio *folio;
>
> - /* TODO: Support huge pages. */
> - folio = filemap_grab_folio(inode->i_mapping, index);
> - if (IS_ERR_OR_NULL(folio))
> - return NULL;
> + folio = kvm_gmem_get_huge_folio(inode, index);
> + if (!folio) {
> + folio = filemap_grab_folio(inode->i_mapping, index);
> + if (IS_ERR_OR_NULL(folio))
> + return NULL;
> + }
>
> /*
> * Use the up-to-date flag to track whether or not the memory has been
> @@ -373,6 +403,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> inode->i_mode |= S_IFREG;
> inode->i_size = size;
> mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> + mapping_set_large_folios(inode->i_mapping);
> mapping_set_unmovable(inode->i_mapping);
> /* Unmovable mappings are supposed to be marked unevictable as well. */
> WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> @@ -394,14 +425,18 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>
> int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> {
> + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> loff_t size = args->size;
> u64 flags = args->flags;
> - u64 valid_flags = 0;
>
> if (flags & ~valid_flags)
> return -EINVAL;
>
> - if (size < 0 || !PAGE_ALIGNED(size))
> + if (size <= 0 || !PAGE_ALIGNED(size))
> + return -EINVAL;
> +
> + if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> + !IS_ALIGNED(size, PMD_SIZE))
> return -EINVAL;
>
> return __kvm_gmem_create(kvm, size, flags);
> @@ -501,7 +536,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
> int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> {
> - pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
> + pgoff_t index, huge_index;
> struct kvm_gmem *gmem;
> struct folio *folio;
> struct page *page;
> @@ -514,6 +549,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>
> gmem = file->private_data;
>
> + index = gfn - slot->base_gfn + slot->gmem.pgoff;
> if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
> r = -EIO;
> goto out_fput;
> @@ -533,9 +569,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> page = folio_file_page(folio, index);
>
> *pfn = page_to_pfn(page);
> - if (max_order)
> + if (!max_order)
> + goto success;
> +
> + *max_order = compound_order(compound_head(page));
> + if (!*max_order)
> + goto success;
> +
> + /*
> + * The folio can be mapped with a hugepage if and only if the folio is
> + * fully contained by the range the memslot is bound to. Note, the
> + * caller is responsible for handling gfn alignment, this only deals
> + * with the file binding.
> + */
> + huge_index = ALIGN(index, 1ull << *max_order);
> + if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
> + huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages)
> *max_order = 0;
> -
> +success:
> r = 0;
>
> out_unlock:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5d1a2f1b4e94..0711f2c75667 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4888,6 +4888,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> #ifdef CONFIG_KVM_PRIVATE_MEM
> case KVM_CAP_GUEST_MEMFD:
> return !kvm || kvm_arch_has_private_mem(kvm);
> + case KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE:
> + if (kvm && !kvm_arch_has_private_mem(kvm))
> + return 0;
> + return PMD_SIZE;
> #endif
> default:
> break;
>
> base-commit: fcbef1e5e5d2a60dacac0d16c06ac00bedaefc0f
> --
>

2023-11-06 11:01:48

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory

Hi,


On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <[email protected]> wrote:
>
> Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
> and testing vehicle for Confidential (CoCo) VMs, and potentially to even
> become a "real" product in the distant future, e.g. a la pKVM.
>
> The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
> Intel's TDX, but those technologies are extremely complex (understatement),
> difficult to debug, don't support running as nested guests, and require
> hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX

nit: "that isn't"

Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>

Cheers,
/fuad

> for maintaining guest private memory isn't a realistic option.
>
> At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
> selftests for guest_memfd and private memory support without requiring
> unique hardware.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
> arch/x86/include/asm/kvm_host.h | 15 +++++++++------
> arch/x86/include/uapi/asm/kvm.h | 3 +++
> arch/x86/kvm/Kconfig | 12 ++++++++++++
> arch/x86/kvm/mmu/mmu_internal.h | 1 +
> arch/x86/kvm/x86.c | 16 +++++++++++++++-
> include/uapi/linux/kvm.h | 1 +
> virt/kvm/Kconfig | 5 +++++
> 8 files changed, 78 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 38dc1fda4f45..00029436ac5b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -147,10 +147,29 @@ described as 'basic' will be available.
> The new VM has no virtual cpus and no memory.
> You probably want to use 0 as machine type.
>
> +X86:
> +^^^^
> +
> +Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
> +
> +S390:
> +^^^^^
> +
> In order to create user controlled virtual machines on S390, check
> KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
> privileged user (CAP_SYS_ADMIN).
>
> +MIPS:
> +^^^^^
> +
> +To use hardware assisted virtualization on MIPS (VZ ASE) rather than
> +the default trap & emulate implementation (which changes the virtual
> +memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
> +flag KVM_VM_MIPS_VZ.
> +
> +ARM64:
> +^^^^^^
> +
> On arm64, the physical address size for a VM (IPA Size limit) is limited
> to 40bits by default. The limit can be configured if the host supports the
> extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
> @@ -8650,6 +8669,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
> 64-bit bitmap (each bit describing a block size). The default value is
> 0, to disable the eager page splitting.
>
> +8.41 KVM_CAP_VM_TYPES
> +---------------------
> +
> +:Capability: KVM_CAP_VM_TYPES
> +:Architectures: x86
> +:Type: system ioctl
> +
> +This capability returns a bitmap of supported VM types. The 1-setting of bit @n
> +means the VM type with value @n is supported. Possible values of @n are::
> +
> + #define KVM_X86_DEFAULT_VM 0
> + #define KVM_X86_SW_PROTECTED_VM 1
> +
> 9. Known KVM API problems
> =========================
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f9e8d5642069..dff10051e9b6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1244,6 +1244,7 @@ enum kvm_apicv_inhibit {
> };
>
> struct kvm_arch {
> + unsigned long vm_type;
> unsigned long n_used_mmu_pages;
> unsigned long n_requested_mmu_pages;
> unsigned long n_max_mmu_pages;
> @@ -2077,6 +2078,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
> void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> int tdp_max_root_level, int tdp_huge_page_level);
>
> +#ifdef CONFIG_KVM_PRIVATE_MEM
> +#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != KVM_X86_DEFAULT_VM)
> +#else
> +#define kvm_arch_has_private_mem(kvm) false
> +#endif
> +
> static inline u16 kvm_read_ldt(void)
> {
> u16 ldt;
> @@ -2125,14 +2132,10 @@ enum {
> #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>
> # define KVM_MAX_NR_ADDRESS_SPACES 2
> +/* SMM is currently unsupported for guests with private memory. */
> +# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 2)
> # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
> # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> -
> -static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
> -{
> - return KVM_MAX_NR_ADDRESS_SPACES;
> -}
> -
> #else
> # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
> #endif
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 1a6a1f987949..a448d0964fc0 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
> /* x86-specific KVM_EXIT_HYPERCALL flags. */
> #define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)
>
> +#define KVM_X86_DEFAULT_VM 0
> +#define KVM_X86_SW_PROTECTED_VM 1
> +
> #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 091b74599c22..8452ed0228cb 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -77,6 +77,18 @@ config KVM_WERROR
>
> If in doubt, say "N".
>
> +config KVM_SW_PROTECTED_VM
> + bool "Enable support for KVM software-protected VMs"
> + depends on EXPERT
> + depends on X86_64
> + select KVM_GENERIC_PRIVATE_MEM
> + help
> + Enable support for KVM software-protected VMs. Currently "protected"
> + means the VM can be backed with memory provided by
> + KVM_CREATE_GUEST_MEMFD.
> +
> + If unsure, say "N".
> +
> config KVM_INTEL
> tristate "KVM for Intel (and compatible) processors support"
> depends on KVM && IA32_FEAT_CTL
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 86c7cb692786..b66a7d47e0e4 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -297,6 +297,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> .req_level = PG_LEVEL_4K,
> .goal_level = PG_LEVEL_4K,
> + .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
> };
> int r;
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c4d17727b199..e3eb608b6692 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4441,6 +4441,13 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
> return 0;
> }
>
> +static bool kvm_is_vm_type_supported(unsigned long type)
> +{
> + return type == KVM_X86_DEFAULT_VM ||
> + (type == KVM_X86_SW_PROTECTED_VM &&
> + IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
> +}
> +
> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> {
> int r = 0;
> @@ -4632,6 +4639,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_X86_NOTIFY_VMEXIT:
> r = kvm_caps.has_notify_vmexit;
> break;
> + case KVM_CAP_VM_TYPES:
> + r = BIT(KVM_X86_DEFAULT_VM);
> + if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
> + r |= BIT(KVM_X86_SW_PROTECTED_VM);
> + break;
> default:
> break;
> }
> @@ -12314,9 +12326,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> int ret;
> unsigned long flags;
>
> - if (type)
> + if (!kvm_is_vm_type_supported(type))
> return -EINVAL;
>
> + kvm->arch.vm_type = type;
> +
> ret = kvm_page_track_init(kvm);
> if (ret)
> goto out;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 29e9eb51dec9..5b5820d19e71 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1218,6 +1218,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_MEMORY_FAULT_INFO 231
> #define KVM_CAP_MEMORY_ATTRIBUTES 232
> #define KVM_CAP_GUEST_MEMFD 233
> +#define KVM_CAP_VM_TYPES 234
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 08afef022db9..2c964586aa14 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -104,3 +104,8 @@ config KVM_GENERIC_MEMORY_ATTRIBUTES
> config KVM_PRIVATE_MEM
> select XARRAY_MULTI
> bool
> +
> +config KVM_GENERIC_PRIVATE_MEM
> + select KVM_GENERIC_MEMORY_ATTRIBUTES
> + select KVM_PRIVATE_MEM
> + bool
> --
> 2.42.0.820.g83a721a137-goog
>
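
As an aside, querying and using the new VM type from userspace is simple; a
minimal sketch using raw ioctls, assuming uapi headers from a kernel with
this series applied and omitting error handling:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Create a software-protected VM iff KVM advertises support for it. */
int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int types = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);

	if (types <= 0 || !(types & (1 << KVM_X86_SW_PROTECTED_VM))) {
		fprintf(stderr, "KVM_X86_SW_PROTECTED_VM not supported\n");
		return 1;
	}

	/* The VM type is passed as the KVM_CREATE_VM argument. */
	return ioctl(kvm, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM) < 0;
}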

2023-11-06 11:04:43

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory

On 11/6/23 12:00, Fuad Tabba wrote:
> Hi,
>
>
> On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <[email protected]> wrote:
>>
>> Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
>> and testing vehicle for Confidential (CoCo) VMs, and potentially to even
>> become a "real" product in the distant future, e.g. a la pKVM.
>>
>> The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
>> Intel's TDX, but those technologies are extremely complex (understatement),
>> difficult to debug, don't support running as nested guests, and require
>> hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX
>
> nit: "that isn't"
>
> Reviewed-by: Fuad Tabba <[email protected]>
> Tested-by: Fuad Tabba <[email protected]>

Hi Fuad,

thanks for your reviews and tests of the gmem patches! Can you please
continue replying to v14?

Thanks,

Paolo

> Cheers,
> /fuad
>
>> for maintaining guest private memory isn't a realistic option.
>>
>> At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
>> selftests for guest_memfd and private memory support without requiring
>> unique hardware.
>>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> ---
>> Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
>> arch/x86/include/asm/kvm_host.h | 15 +++++++++------
>> arch/x86/include/uapi/asm/kvm.h | 3 +++
>> arch/x86/kvm/Kconfig | 12 ++++++++++++
>> arch/x86/kvm/mmu/mmu_internal.h | 1 +
>> arch/x86/kvm/x86.c | 16 +++++++++++++++-
>> include/uapi/linux/kvm.h | 1 +
>> virt/kvm/Kconfig | 5 +++++
>> 8 files changed, 78 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 38dc1fda4f45..00029436ac5b 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -147,10 +147,29 @@ described as 'basic' will be available.
>> The new VM has no virtual cpus and no memory.
>> You probably want to use 0 as machine type.
>>
>> +X86:
>> +^^^^
>> +
>> +Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
>> +
>> +S390:
>> +^^^^^
>> +
>> In order to create user controlled virtual machines on S390, check
>> KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
>> privileged user (CAP_SYS_ADMIN).
>>
>> +MIPS:
>> +^^^^^
>> +
>> +To use hardware assisted virtualization on MIPS (VZ ASE) rather than
>> +the default trap & emulate implementation (which changes the virtual
>> +memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
>> +flag KVM_VM_MIPS_VZ.
>> +
>> +ARM64:
>> +^^^^^^
>> +
>> On arm64, the physical address size for a VM (IPA Size limit) is limited
>> to 40bits by default. The limit can be configured if the host supports the
>> extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
>> @@ -8650,6 +8669,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
>> 64-bit bitmap (each bit describing a block size). The default value is
>> 0, to disable the eager page splitting.
>>
>> +8.41 KVM_CAP_VM_TYPES
>> +---------------------
>> +
>> +:Capability: KVM_CAP_VM_TYPES
>> +:Architectures: x86
>> +:Type: system ioctl
>> +
>> +This capability returns a bitmap of supported VM types. The 1-setting of bit @n
>> +means the VM type with value @n is supported. Possible values of @n are::
>> +
>> + #define KVM_X86_DEFAULT_VM 0
>> + #define KVM_X86_SW_PROTECTED_VM 1
>> +
>> 9. Known KVM API problems
>> =========================
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index f9e8d5642069..dff10051e9b6 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1244,6 +1244,7 @@ enum kvm_apicv_inhibit {
>> };
>>
>> struct kvm_arch {
>> + unsigned long vm_type;
>> unsigned long n_used_mmu_pages;
>> unsigned long n_requested_mmu_pages;
>> unsigned long n_max_mmu_pages;
>> @@ -2077,6 +2078,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
>> void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>> int tdp_max_root_level, int tdp_huge_page_level);
>>
>> +#ifdef CONFIG_KVM_PRIVATE_MEM
>> +#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != KVM_X86_DEFAULT_VM)
>> +#else
>> +#define kvm_arch_has_private_mem(kvm) false
>> +#endif
>> +
>> static inline u16 kvm_read_ldt(void)
>> {
>> u16 ldt;
>> @@ -2125,14 +2132,10 @@ enum {
>> #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>>
>> # define KVM_MAX_NR_ADDRESS_SPACES 2
>> +/* SMM is currently unsupported for guests with private memory. */
>> +# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 2)
>> # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
>> # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
>> -
>> -static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
>> -{
>> - return KVM_MAX_NR_ADDRESS_SPACES;
>> -}
>> -
>> #else
>> # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
>> #endif
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 1a6a1f987949..a448d0964fc0 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
>> /* x86-specific KVM_EXIT_HYPERCALL flags. */
>> #define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)
>>
>> +#define KVM_X86_DEFAULT_VM 0
>> +#define KVM_X86_SW_PROTECTED_VM 1
>> +
>> #endif /* _ASM_X86_KVM_H */
>> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
>> index 091b74599c22..8452ed0228cb 100644
>> --- a/arch/x86/kvm/Kconfig
>> +++ b/arch/x86/kvm/Kconfig
>> @@ -77,6 +77,18 @@ config KVM_WERROR
>>
>> If in doubt, say "N".
>>
>> +config KVM_SW_PROTECTED_VM
>> + bool "Enable support for KVM software-protected VMs"
>> + depends on EXPERT
>> + depends on X86_64
>> + select KVM_GENERIC_PRIVATE_MEM
>> + help
>> + Enable support for KVM software-protected VMs. Currently "protected"
>> + means the VM can be backed with memory provided by
>> + KVM_CREATE_GUEST_MEMFD.
>> +
>> + If unsure, say "N".
>> +
>> config KVM_INTEL
>> tristate "KVM for Intel (and compatible) processors support"
>> depends on KVM && IA32_FEAT_CTL
>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
>> index 86c7cb692786..b66a7d47e0e4 100644
>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>> @@ -297,6 +297,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>> .max_level = KVM_MAX_HUGEPAGE_LEVEL,
>> .req_level = PG_LEVEL_4K,
>> .goal_level = PG_LEVEL_4K,
>> + .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
>> };
>> int r;
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c4d17727b199..e3eb608b6692 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4441,6 +4441,13 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
>> return 0;
>> }
>>
>> +static bool kvm_is_vm_type_supported(unsigned long type)
>> +{
>> + return type == KVM_X86_DEFAULT_VM ||
>> + (type == KVM_X86_SW_PROTECTED_VM &&
>> + IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
>> +}
>> +
>> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> {
>> int r = 0;
>> @@ -4632,6 +4639,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> case KVM_CAP_X86_NOTIFY_VMEXIT:
>> r = kvm_caps.has_notify_vmexit;
>> break;
>> + case KVM_CAP_VM_TYPES:
>> + r = BIT(KVM_X86_DEFAULT_VM);
>> + if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
>> + r |= BIT(KVM_X86_SW_PROTECTED_VM);
>> + break;
>> default:
>> break;
>> }
>> @@ -12314,9 +12326,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>> int ret;
>> unsigned long flags;
>>
>> - if (type)
>> + if (!kvm_is_vm_type_supported(type))
>> return -EINVAL;
>>
>> + kvm->arch.vm_type = type;
>> +
>> ret = kvm_page_track_init(kvm);
>> if (ret)
>> goto out;
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 29e9eb51dec9..5b5820d19e71 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1218,6 +1218,7 @@ struct kvm_ppc_resize_hpt {
>> #define KVM_CAP_MEMORY_FAULT_INFO 231
>> #define KVM_CAP_MEMORY_ATTRIBUTES 232
>> #define KVM_CAP_GUEST_MEMFD 233
>> +#define KVM_CAP_VM_TYPES 234
>>
>> #ifdef KVM_CAP_IRQ_ROUTING
>>
>> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
>> index 08afef022db9..2c964586aa14 100644
>> --- a/virt/kvm/Kconfig
>> +++ b/virt/kvm/Kconfig
>> @@ -104,3 +104,8 @@ config KVM_GENERIC_MEMORY_ATTRIBUTES
>> config KVM_PRIVATE_MEM
>> select XARRAY_MULTI
>> bool
>> +
>> +config KVM_GENERIC_PRIVATE_MEM
>> + select KVM_GENERIC_MEMORY_ATTRIBUTES
>> + select KVM_PRIVATE_MEM
>> + bool
>> --
>> 2.42.0.820.g83a721a137-goog
>>
>

2023-11-06 11:27:11

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v13 27/35] KVM: selftests: Add helpers to convert guest memory b/w private and shared

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <[email protected]> wrote:
>
> From: Vishal Annapurve <[email protected]>
>
> Add helpers to convert memory between private and shared via KVM's
> memory attributes, as well as helpers to free/allocate guest_memfd memory
> via fallocate(). Userspace, i.e. tests, is NOT required to do fallocate()
> when converting memory, as the attributes are the single source of true.

true->truth

> Provide allocate() helpers so that tests can mimic a userspace that frees
> private memory on conversion, e.g. to prioritize memory usage over
> performance.
>
> Signed-off-by: Vishal Annapurve <[email protected]>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> .../selftests/kvm/include/kvm_util_base.h | 48 +++++++++++++++++++
> tools/testing/selftests/kvm/lib/kvm_util.c | 28 +++++++++++
> 2 files changed, 76 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
> index 9f861182c02a..1441fca6c273 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util_base.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
> @@ -333,6 +333,54 @@ static inline void vm_enable_cap(struct kvm_vm *vm, uint32_t cap, uint64_t arg0)
> vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
> }
>
> +static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
> + uint64_t size, uint64_t attributes)
> +{
> + struct kvm_memory_attributes attr = {
> + .attributes = attributes,
> + .address = gpa,
> + .size = size,
> + .flags = 0,
> + };
> +
> + /*
> + * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes. These flows
> + * need significant enhancements to support multiple attributes.
> + */
> + TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
> + "Update me to support multiple attributes!");
> +
> + vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
> +}
> +
> +
> +static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
> + uint64_t size)
> +{
> + vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +}
> +
> +static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
> + uint64_t size)
> +{
> + vm_set_memory_attributes(vm, gpa, size, 0);
> +}
> +
> +void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
> + bool punch_hole);
> +
> +static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa,
> + uint64_t size)
> +{
> + vm_guest_mem_fallocate(vm, gpa, size, true);
> +}
> +
> +static inline void vm_guest_mem_allocate(struct kvm_vm *vm, uint64_t gpa,
> + uint64_t size)
> +{
> + vm_guest_mem_fallocate(vm, gpa, size, false);
> +}
> +
> void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
> const char *vm_guest_mode_string(uint32_t i);
>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 45050f54701a..a140aee8d0f5 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -1176,6 +1176,34 @@ void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot)
> __vm_mem_region_delete(vm, memslot2region(vm, slot), true);
> }
>
> +void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
> + bool punch_hole)
> +{
> + const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
> + struct userspace_mem_region *region;
> + uint64_t end = base + size;
> + uint64_t gpa, len;
> + off_t fd_offset;
> + int ret;
> +
> + for (gpa = base; gpa < end; gpa += len) {
> + uint64_t offset;
> +
> + region = userspace_mem_region_find(vm, gpa, gpa);
> + TEST_ASSERT(region && region->region.flags & KVM_MEM_PRIVATE,
> + "Private memory region not found for GPA 0x%lx", gpa);
> +
> + offset = (gpa - region->region.guest_phys_addr);

nit: why the parentheses?

> + fd_offset = region->region.guest_memfd_offset + offset;
> + len = min_t(uint64_t, end - gpa, region->region.memory_size - offset);
> +
> + ret = fallocate(region->region.guest_memfd, mode, fd_offset, len);
> + TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), fd = %d, mode = %x, offset = %lx\n",
> + punch_hole ? "punch hole" : "allocate", gpa, len,
> + region->region.guest_memfd, mode, fd_offset);
> + }
> +}
> +

Nits aside:

Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>

Cheers,
/fuad


> /* Returns the size of a vCPU's kvm_run structure. */
> static int vcpu_mmap_sz(void)
> {
> --
> 2.42.0.820.g83a721a137-goog
>
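
For context, a sketch of how a test might use the helpers above to mimic a
userspace that reclaims memory on conversion; the GPA and size are made-up
values and assume a guest_memfd-backed memslot covers the range:

/* Hypothetical test helper built on the conversion helpers above. */
static void convert_and_reclaim(struct kvm_vm *vm)
{
	const uint64_t gpa = 0x80000000;
	const uint64_t size = 0x200000;		/* 2MiB */

	/* Attributes are the source of truth; fallocate() is optional. */
	vm_mem_set_private(vm, gpa, size);

	/* Free the private backing when converting back to shared. */
	vm_mem_set_shared(vm, gpa, size);
	vm_guest_mem_punch_hole(vm, gpa, size);
}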

2023-11-27 11:14:29

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On 11/2/23 16:46, Paolo Bonzini wrote:
> On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson <[email protected]> wrote:
>> Actually, looking at this again, there's not actually a hard dependency on THP.
>> A THP-enabled kernel _probably_ gives a higher probability of using hugepages,
>> but mostly because THP selects COMPACTION, and I suppose because using THP for
>> other allocations reduces overall fragmentation.
>
> Yes, that's why I didn't even bother enabling it unless THP is
> enabled, but it makes even more sense to just try.
>
>> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
>> we should do the below (I verified KVM can create hugepages with THP=n). We'll
>> need another capability, but (a) we probably should have that anyways and (b) it
>> provides a cleaner path to adding PUD-sized hugepage support in the future.
>
> I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
> should be a generic kernel API and in fact the sizes are available in
> a not-so-friendly format in /sys/kernel/mm/hugepages.
>
> We should just add /sys/kernel/mm/hugepages/sizes that contains
> "2097152 1073741824" on x86 (only the former if 1G pages are not
> supported).
>
> Plus: is this the best API if we need something else for 1G pages?
>
> Let's drop *this* patch and proceed incrementally. (Again, this is
> what I want to do with this final review: identify places that are
> still sticky, and don't let them block the rest).
>
> Coincidentally we have an open spot next week at plumbers. Let's
> extend Fuad's section to cover more guestmem work.

Hi,

was there any outcome wrt this one? Based on my experience with THPs it
would be best if userspace didn't have to opt in, nor care about the
supported size. If the given size is unaligned, provide a mix of large pages
up to an aligned size, and for the rest fall back to base pages, which should
be better than -EINVAL on creation (is it possible with the current
implementation? I'd hope so?). A way to opt out from huge pages could be
useful, although there's always the risk of some initial troubles resulting
in various online sources cargo-cult recommending to opt out forever.

Vlastimil
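
For illustration, a sketch of the mixed-size behavior Vlastimil describes,
modeled on kvm_gmem_get_huge_folio() from the dropped patch; the helper and
its containment policy are hypothetical, not part of the posted series:

/*
 * Best-effort allocation: use a PMD-sized folio when the index is aligned
 * and the folio fits within the file, else fall back to a base page
 * instead of failing.
 */
static struct folio *gmem_grab_folio_best_effort(struct address_space *mapping,
						 pgoff_t index, pgoff_t nr_pages)
{
	gfp_t gfp = mapping_gfp_mask(mapping);
	struct folio *folio = NULL;

	if (IS_ALIGNED(index, NR_PAGES_PER_PMD) &&
	    index + NR_PAGES_PER_PMD <= nr_pages)
		folio = filemap_alloc_folio(gfp, PMD_ORDER);

	if (folio && filemap_add_folio(mapping, folio, index, gfp)) {
		folio_put(folio);
		folio = NULL;
	}

	/* Cover the unaligned head/tail of the file with base pages. */
	if (!folio)
		folio = filemap_grab_folio(mapping, index);

	return IS_ERR_OR_NULL(folio) ? NULL : folio;
}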

2023-11-29 22:40:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

On Mon, Nov 27, 2023, Vlastimil Babka wrote:
> On 11/2/23 16:46, Paolo Bonzini wrote:
> > On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson <[email protected]> wrote:
> >> Actually, looking at this again, there's not actually a hard dependency on THP.
> >> A THP-enabled kernel _probably_ gives a higher probability of using hugepages,
> >> but mostly because THP selects COMPACTION, and I suppose because using THP for
> >> other allocations reduces overall fragmentation.
> >
> > Yes, that's why I didn't even bother enabling it unless THP is
> > enabled, but it makes even more sense to just try.
> >
> >> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
> >> we should do the below (I verified KVM can create hugepages with THP=n). We'll
> >> need another capability, but (a) we probably should have that anyways and (b) it
> >> provides a cleaner path to adding PUD-sized hugepage support in the future.
> >
> > I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
> > should be a generic kernel API and in fact the sizes are available in
> > a not-so-friendly format in /sys/kernel/mm/hugepages.
> >
> > We should just add /sys/kernel/mm/hugepages/sizes that contains
> > "2097152 1073741824" on x86 (only the former if 1G pages are not
> > supported).
> >
> > Plus: is this the best API if we need something else for 1G pages?
> >
> > Let's drop *this* patch and proceed incrementally. (Again, this is
> > what I want to do with this final review: identify places that are
> > still sticky, and don't let them block the rest).
> >
> > Coincidentally we have an open spot next week at plumbers. Let's
> > extend Fuad's section to cover more guestmem work.
>
> Hi,
>
> was there any outcome wrt this one?

No, we punted on hugepage support for the initial guest_memfd merge. We definitely
plan on adding hugepage support sooner rather than later, but we haven't yet agreed on
exactly what that will look like.

> Based on my experience with THPs it would be best if userspace didn't have
> to opt in, nor care about the supported size. If the given size is unaligned,
> provide a mix of large pages up to an aligned size, and for the rest fall back
> to base pages, which should be better than -EINVAL on creation (is it
> possible with the current implementation? I'd hope so?).

guest_memfd serves a different use case than THP. For modern VMs, and especially
for slice-of-hardware VMs that are one of the main targets for guest_memfd, if not
_the_ main target, guest memory should _always_ be backed by hugepages in the
physical domain. The actual guest mappings might not be huge, e.g. x86 needs to
do partial mappings to skip over (legacy) memory holes, but KVM already gracefully
handles that.

In other words, for most guest_memfd use cases, if userspace wants hugepages but
KVM can't provide hugepages, then it is much more desirable to return an error
than to silently fall back to small pages.

I 100% agree that having to opt-in is suboptimal, but IMO providing "error on an
incompatible configuration" semantics without requiring userspace to opt-in is an
even worse experience for userspace.

> A way to opt out from huge pages could be useful, although there's always the
> risk of some initial troubles resulting in various online sources cargo-cult
> recommending to opt out forever.

2024-04-25 14:12:58

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

On Fri, Oct 27, 2023 at 11:22:07AM -0700, Sean Christopherson wrote:
> Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
> support for guest private memory can be added without needing an entirely
> separate set of helpers.
>
> Note, this obviously makes selftests backwards-incompatible with older KVM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> versions from this point forward.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Is there a way we could disable the tests on older kernels instead of
making them fail? Check uname or something? There is probably a
standard way to do this... It's these tests which fail.

kvm_aarch32_id_regs
kvm_access_tracking_perf_test
kvm_arch_timer
kvm_debug-exceptions
kvm_demand_paging_test
kvm_dirty_log_perf_test
kvm_dirty_log_test
kvm_guest_print_test
kvm_hypercalls
kvm_kvm_page_table_test
kvm_memslot_modification_stress_test
kvm_memslot_perf_test
kvm_page_fault_test
kvm_psci_test
kvm_rseq_test
kvm_smccc_filter
kvm_steal_time
kvm_vgic_init
kvm_vgic_irq
kvm_vpmu_counter_access

regards,
dan carpenter


2024-04-25 14:45:52

by Shuah Khan

[permalink] [raw]
Subject: Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

On 4/25/24 08:12, Dan Carpenter wrote:
> On Fri, Oct 27, 2023 at 11:22:07AM -0700, Sean Christopherson wrote:
>> Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
>> support for guest private memory can be added without needing an entirely
>> separate set of helpers.
>>
>> Note, this obviously makes selftests backwards-incompatible with older KVM
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> versions from this point forward.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Is there a way we could disable the tests on older kernels instead of
> making them fail? Check uname or something? There is probably a
> standard way to do this... It's these tests which fail.

They shouldn't fail - the tests should be skipped on older kernels.
If it is absolutely necessary to use uname to check the kernel version,
refer to zram/zram_lib.sh for an example.

thanks,
-- Shuah
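
For completeness, a minimal sketch of the uname(2) route mentioned above;
the version threshold is hypothetical, and the KVM_CHECK_EXTENSION-based
check in the next reply is the more robust approach:

#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Return true if the running kernel is at least major.minor. */
static bool kernel_at_least(int major, int minor)
{
	struct utsname u;
	int maj, min;

	if (uname(&u) || sscanf(u.release, "%d.%d", &maj, &min) != 2)
		return false;

	return maj > major || (maj == major && min >= minor);
}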

2024-04-25 15:10:08

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

On Thu, Apr 25, 2024, Shuah Khan wrote:
> On 4/25/24 08:12, Dan Carpenter wrote:
> > On Fri, Oct 27, 2023 at 11:22:07AM -0700, Sean Christopherson wrote:
> > > Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
> > > support for guest private memory can be added without needing an entirely
> > > separate set of helpers.
> > >
> > > Note, this obviously makes selftests backwards-incompatible with older KVM
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > versions from this point forward.
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Is there a way we could disable the tests on older kernels instead of
> > making them fail? Check uname or something? There is probably a
> > standard way to do this... It's these tests which fail.
>
> They shouldn't fail - the tests should be skipped on older kernels.

Ah, that makes sense. Except for a few outliers that aren't all that interesting,
all KVM selftests create memslots, so I'm tempted to just make it a hard requirement
to spare us headache, e.g.

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index b2262b5fad9e..4b2038b1f11f 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2306,6 +2306,9 @@ void __attribute((constructor)) kvm_selftest_init(void)
/* Tell stdout not to buffer its content. */
setbuf(stdout, NULL);

+ __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
+ "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
+
kvm_selftest_arch_init();
}

--

but it's also easy enough to be more precise and skip only those that actually
create memslots.

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index b2262b5fad9e..b21152adf448 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -944,6 +944,9 @@ int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flag
.guest_memfd_offset = guest_memfd_offset,
};

+ __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
+ "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
+
return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
}

@@ -970,6 +973,9 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
size_t mem_size = npages * vm->page_size;
size_t alignment;

+ __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
+ "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
+
TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
"Number of guest pages is not compatible with the host. "
"Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages));
--

2024-04-25 16:23:03

by Shuah Khan

[permalink] [raw]
Subject: Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

On 4/25/24 09:09, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Shuah Khan wrote:
>> On 4/25/24 08:12, Dan Carpenter wrote:
>>> On Fri, Oct 27, 2023 at 11:22:07AM -0700, Sean Christopherson wrote:
>>>> Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
>>>> support for guest private memory can be added without needing an entirely
>>>> separate set of helpers.
>>>>
>>>> Note, this obviously makes selftests backwards-incompatible with older KVM
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> versions from this point forward.
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Is there a way we could disable the tests on older kernels instead of
>>> making them fail? Check uname or something? There is probably a
>>> standard way to do this... It's these tests which fail.
>>
>> They shouldn't fail - the tests should be skipped on older kernels.
>
> Ah, that makes sense. Except for a few outliers that aren't all that interesting,
> all KVM selftests create memslots, so I'm tempted to just make it a hard requirement
> to spare us headache, e.g.
>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index b2262b5fad9e..4b2038b1f11f 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -2306,6 +2306,9 @@ void __attribute((constructor)) kvm_selftest_init(void)
> /* Tell stdout not to buffer its content. */
> setbuf(stdout, NULL);
>
> + __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
> + "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
> +
> kvm_selftest_arch_init();
> }
>
> --
>
> but it's also easy enough to be more precise and skip only those that actually
> create memslots.

This approach is what is recommended in the kselftest documentation. Run as
many tests as possible and skip the ones that can't be run due to unmet
dependencies.

>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index b2262b5fad9e..b21152adf448 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -944,6 +944,9 @@ int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flag
> .guest_memfd_offset = guest_memfd_offset,
> };
>
> + __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
> + "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
> +
> return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
> }
>
> @@ -970,6 +973,9 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
> size_t mem_size = npages * vm->page_size;
> size_t alignment;
>
> + __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
> + "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
> +
> TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
> "Number of guest pages is not compatible with the host. "
> "Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages));
> --

thanks,
-- Shuah

2024-04-26 07:34:12

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

On Thu Apr 25, 2024 at 6:09 PM EEST, Sean Christopherson wrote:
> + __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
> + "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");

This would also work as additional documentation for a casual (but not
seasoned) visitor to the KVM code.

BR, Jarkko