2023-07-19 00:07:30

by Sean Christopherson

Subject: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

This is the next iteration of implementing fd-based (instead of vma-based)
memory for KVM guests. If you want the full background of why we are doing
this, please go read the v10 cover letter[1].

The biggest change from v10 is to implement the backing storage in KVM
itself and expose it via a KVM ioctl() instead of a "generic" syscall.
See link[2] for details on why we pivoted to a KVM-specific approach.
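
Roughly, the new userspace flow looks like the sketch below (based on
the uAPI added by the KVM_CREATE_GUEST_MEMFD patch later in this
series; the struct layout and flag name shown are assumptions taken
from that patch, and error handling is elided):

	/* Sketch: create a guest_memfd via the new VM-scoped ioctl(). */
	struct kvm_create_guest_memfd gmem = {
		.size  = memslot_size,	/* hypothetical VMM variable */
		.flags = 0,		/* e.g. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE */
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

The resulting fd provides the guest-private backing store for a
memslot, instead of memory obtained via a standalone syscall.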

Key word is "biggest". Relative to v10, there are many big changes.
Highlights below (I can't remember everything that got changed at
this point).

Tagged RFC as there are a lot of empty changelogs, and a lot of missing
documentation. And ideally, we'll have even more tests before merging.
There are also several gaps/opens (to be discussed in tomorrow's PUCK).

v11:
- Test private<=>shared conversions *without* doing fallocate()
- PUNCH_HOLE all memory between iterations of the conversion test so that
KVM doesn't retain pages in the guest_memfd
- Rename hugepage control to be a very generic ALLOW_HUGEPAGE, instead of
giving it a THP or PMD specific name.
- Fold in fixes from a lot of people (thank you!)
- Zap SPTEs *before* updating attributes to ensure no weirdness, e.g. if
KVM handles a page fault and looks at inconsistent attributes
- Refactor MMU interaction with attributes updates to reuse much of KVM's
framework for mmu_notifiers.

[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]

Ackerley Tng (1):
KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (7):
KVM: Use gfn instead of hva for mmu_notifier_retry
KVM: Add KVM_EXIT_MEMORY_FAULT exit
KVM: Introduce per-page memory attributes
KVM: x86: Disallow hugepages when memory attributes are mixed
KVM: x86/mmu: Handle page fault for private memory
KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
KVM: selftests: Expand set_memory_region_test to validate
guest_memfd()

Sean Christopherson (18):
KVM: Wrap kvm_gfn_range.pte in a per-action union
KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn
ranges
KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to
CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: Introduce KVM_SET_USER_MEMORY_REGION2
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
security: Export security_inode_init_security_anon() for use by KVM
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing
memory
KVM: Add transparent hugepage support for dedicated guest memory
KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
KVM: Allow arch code to track number of memslot address spaces per VM
KVM: x86: Add support for "protected VMs" that can utilize private
memory
KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
KVM: selftests: Convert lib's mem regions to
KVM_SET_USER_MEMORY_REGION2
KVM: selftests: Add support for creating private memslots
KVM: selftests: Introduce VM "shape" to allow tests to specify the VM
type
KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data
KVM: selftests: Add basic selftest for guest_memfd()

Vishal Annapurve (3):
KVM: selftests: Add helpers to convert guest memory b/w private and
shared
KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls
(x86)
KVM: selftests: Add x86-only selftest for private memory conversions

Documentation/virt/kvm/api.rst | 114 ++++
arch/arm64/include/asm/kvm_host.h | 2 -
arch/arm64/kvm/Kconfig | 2 +-
arch/arm64/kvm/mmu.c | 2 +-
arch/mips/include/asm/kvm_host.h | 2 -
arch/mips/kvm/Kconfig | 2 +-
arch/mips/kvm/mmu.c | 2 +-
arch/powerpc/include/asm/kvm_host.h | 2 -
arch/powerpc/kvm/Kconfig | 8 +-
arch/powerpc/kvm/book3s_hv.c | 2 +-
arch/powerpc/kvm/powerpc.c | 5 +-
arch/riscv/include/asm/kvm_host.h | 2 -
arch/riscv/kvm/Kconfig | 2 +-
arch/riscv/kvm/mmu.c | 2 +-
arch/x86/include/asm/kvm_host.h | 17 +-
arch/x86/include/uapi/asm/kvm.h | 3 +
arch/x86/kvm/Kconfig | 14 +-
arch/x86/kvm/debugfs.c | 2 +-
arch/x86/kvm/mmu/mmu.c | 287 +++++++-
arch/x86/kvm/mmu/mmu_internal.h | 4 +
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/tdp_mmu.c | 8 +-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/x86.c | 24 +-
include/linux/kvm_host.h | 129 +++-
include/linux/pagemap.h | 11 +
include/uapi/linux/kvm.h | 50 ++
include/uapi/linux/magic.h | 1 +
mm/compaction.c | 4 +
mm/migrate.c | 2 +
security/security.c | 1 +
tools/testing/selftests/kvm/Makefile | 3 +
tools/testing/selftests/kvm/dirty_log_test.c | 2 +-
.../testing/selftests/kvm/guest_memfd_test.c | 114 ++++
.../selftests/kvm/include/kvm_util_base.h | 141 +++-
.../testing/selftests/kvm/include/test_util.h | 5 +
.../selftests/kvm/include/ucall_common.h | 12 +
.../selftests/kvm/include/x86_64/processor.h | 15 +
.../selftests/kvm/kvm_page_table_test.c | 2 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 230 ++++---
tools/testing/selftests/kvm/lib/memstress.c | 3 +-
.../selftests/kvm/set_memory_region_test.c | 99 +++
.../kvm/x86_64/private_mem_conversions_test.c | 408 +++++++++++
.../kvm/x86_64/private_mem_kvm_exits_test.c | 115 ++++
.../kvm/x86_64/ucna_injection_test.c | 2 +-
virt/kvm/Kconfig | 17 +
virt/kvm/Makefile.kvm | 1 +
virt/kvm/dirty_ring.c | 2 +-
virt/kvm/guest_mem.c | 635 ++++++++++++++++++
virt/kvm/kvm_main.c | 384 +++++++++--
virt/kvm/kvm_mm.h | 38 ++
51 files changed, 2700 insertions(+), 246 deletions(-)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
create mode 100644 virt/kvm/guest_mem.c


base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
--
2.41.0.255.g8b1d071c50-goog



2023-07-19 00:07:36

by Sean Christopherson

Subject: [RFC PATCH v11 28/29] KVM: selftests: Add basic selftest for guest_memfd()

Add a selftest to verify the basic functionality of guest_memfd():

+ a file descriptor created with the KVM_CREATE_GUEST_MEMFD ioctl() does
  not allow read/write/mmap operations
+ the file size and block size returned by fstat() are as expected
+ fallocate(FALLOC_FL_PUNCH_HOLE) rejects offsets and lengths that are
  not page-aligned

Signed-off-by: Chao Peng <[email protected]>
Co-developed-by: Ackerley Tng <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 114 ++++++++++++++++++
2 files changed, 115 insertions(+)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index fdc7dff8d6ae..18c43336ede3 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -123,6 +123,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
TEST_GEN_PROGS_x86_64 += demand_paging_test
TEST_GEN_PROGS_x86_64 += dirty_log_test
TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_test
TEST_GEN_PROGS_x86_64 += hardware_disable_test
TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
TEST_GEN_PROGS_x86_64 += kvm_page_table_test
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
new file mode 100644
index 000000000000..d698f9fde987
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Intel Corporation, 2023
+ *
+ * Author: Chao Peng <[email protected]>
+ */
+
+#define _GNU_SOURCE
+#include "test_util.h"
+#include "kvm_util_base.h"
+#include <linux/falloc.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+#include <fcntl.h>
+
+static void test_file_read_write(int fd)
+{
+ char buf[64];
+
+ TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0,
+ "read on a guest_mem fd should fail");
+ TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0,
+ "write on a guest_mem fd should fail");
+ TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0,
+ "pread on a guest_mem fd should fail");
+ TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0,
+ "pwrite on a guest_mem fd should fail");
+}
+
+static void test_mmap(int fd, size_t page_size)
+{
+ char *mem;
+
+ mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_file_size(int fd, size_t page_size, size_t total_size)
+{
+ struct stat sb;
+ int ret;
+
+ ret = fstat(fd, &sb);
+ TEST_ASSERT(!ret, "fstat should succeed");
+ ASSERT_EQ(sb.st_size, total_size);
+ ASSERT_EQ(sb.st_blksize, page_size);
+}
+
+static void test_fallocate(int fd, size_t page_size, size_t total_size)
+{
+ int ret;
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
+ TEST_ASSERT(!ret, "fallocate with aligned offset and size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ page_size - 1, page_size);
+ TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
+ TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size);
+ TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ total_size, page_size);
+ TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ total_size + page_size, page_size);
+ TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ page_size, page_size - 1);
+ TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ page_size, page_size);
+ TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size);
+ TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
+}
+
+
+int main(int argc, char *argv[])
+{
+ size_t page_size;
+ size_t total_size;
+ int fd;
+ struct kvm_vm *vm;
+
+ page_size = getpagesize();
+ total_size = page_size * 4;
+
+ vm = vm_create_barebones();
+
+ fd = vm_create_guest_memfd(vm, total_size, 0);
+
+ test_file_read_write(fd);
+ test_mmap(fd, page_size);
+ test_file_size(fd, page_size, total_size);
+ test_fallocate(fd, page_size, total_size);
+
+ close(fd);
+}
--
2.41.0.255.g8b1d071c50-goog


2023-07-19 00:10:44

by Sean Christopherson

Subject: [RFC PATCH v11 22/29] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)

From: Vishal Annapurve <[email protected]>

Signed-off-by: Vishal Annapurve <[email protected]>
[sean: drop shared/private helpers (let tests specify flags)]
Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/x86_64/processor.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index aa434c8f19c5..8857143d400a 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -15,6 +15,7 @@
#include <asm/msr-index.h>
#include <asm/prctl.h>

+#include <linux/kvm_para.h>
#include <linux/stringify.h>

#include "../kvm_util.h"
@@ -1166,6 +1167,20 @@ uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2,
uint64_t __xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
void xen_hypercall(uint64_t nr, uint64_t a0, void *a1);

+static inline uint64_t __kvm_hypercall_map_gpa_range(uint64_t gpa,
+ uint64_t size, uint64_t flags)
+{
+ return kvm_hypercall(KVM_HC_MAP_GPA_RANGE, gpa, size >> PAGE_SHIFT, flags, 0);
+}
+
+static inline void kvm_hypercall_map_gpa_range(uint64_t gpa, uint64_t size,
+ uint64_t flags)
+{
+ uint64_t ret = __kvm_hypercall_map_gpa_range(gpa, size, flags);
+
+ GUEST_ASSERT_1(!ret, ret);
+}
+
void __vm_xsave_require_permission(uint64_t xfeature, const char *name);

#define vm_xsave_require_permission(xfeature) \
--
2.41.0.255.g8b1d071c50-goog
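
As a usage sketch (an illustration, not a verbatim excerpt from the
series), guest code can use these helpers to request private<=>shared
conversion of a GPA range; the KVM_MAP_GPA_RANGE_* flags come from
<linux/kvm_para.h>, which is why the header is now included:

	/* Guest code: convert [gpa, gpa + size) to private... */
	kvm_hypercall_map_gpa_range(gpa, size,
				    KVM_MAP_GPA_RANGE_ENCRYPTED |
				    KVM_MAP_GPA_RANGE_PAGE_SZ_4K);

	/* ...and back to shared. */
	kvm_hypercall_map_gpa_range(gpa, size,
				    KVM_MAP_GPA_RANGE_DECRYPTED |
				    KVM_MAP_GPA_RANGE_PAGE_SZ_4K);

The non-underscore variant asserts on failure, so tests that expect the
hypercall to fail can use __kvm_hypercall_map_gpa_range() and check the
return value themselves.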


2023-07-19 00:12:35

by Sean Christopherson

Subject: [RFC PATCH v11 14/29] KVM: x86/mmu: Handle page fault for private memory

From: Chao Peng <[email protected]>

A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
hva-based shared memory. Architecture code (like TDX code) can tell
whether the ongoing fault is private or not. Add an 'is_private' field
to kvm_page_fault to convey that information; architecture code is
expected to set it.

To handle a page fault for such a memslot, the handling logic differs
depending on whether the fault is private or shared. KVM checks whether
'is_private' matches the host's view of the page (maintained in
mem_attr_array).
- On a match, the private pfn is obtained with kvm_gmem_get_pfn() and
  the shared pfn with the existing get_user_pages().
- On a mismatch, KVM exits to userspace with KVM_EXIT_MEMORY_FAULT.
  Userspace can then convert the memory between private and shared in
  the host's view and retry the fault.

Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 82 +++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/mmu_internal.h | 3 ++
arch/x86/kvm/mmu/mmutrace.h | 1 +
3 files changed, 81 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index aefe67185637..4cf73a579ee1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3179,9 +3179,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
return level;
}

-int kvm_mmu_max_mapping_level(struct kvm *kvm,
- const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level)
+static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ gfn_t gfn, int max_level, bool is_private)
{
struct kvm_lpage_info *linfo;
int host_level;
@@ -3193,6 +3193,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
break;
}

+ if (is_private)
+ return max_level;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;

@@ -3200,6 +3203,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
return min(host_level, max_level);
}

+int kvm_mmu_max_mapping_level(struct kvm *kvm,
+ const struct kvm_memory_slot *slot, gfn_t gfn,
+ int max_level)
+{
+ bool is_private = kvm_slot_can_be_private(slot) &&
+ kvm_mem_is_private(kvm, gfn);
+
+ return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
+}
+
void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
@@ -3220,8 +3233,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* Enforce the iTLB multihit workaround after capturing the requested
* level, which will be used to do precise, accurate accounting.
*/
- fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
- fault->gfn, fault->max_level);
+ fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
+ fault->gfn, fault->max_level,
+ fault->is_private);
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;

@@ -4304,6 +4318,55 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
}

+static inline u8 kvm_max_level_for_order(int order)
+{
+ BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+ MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
+ order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
+ order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+ return PG_LEVEL_1G;
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+ return PG_LEVEL_2M;
+
+ return PG_LEVEL_4K;
+}
+
+static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+ if (fault->is_private)
+ vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+ else
+ vcpu->run->memory.flags = 0;
+ vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+ vcpu->run->memory.size = PAGE_SIZE;
+ return RET_PF_USER;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ int max_order, r;
+
+ if (!kvm_slot_can_be_private(fault->slot))
+ return kvm_do_memory_fault_exit(vcpu, fault);
+
+ r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
+ &max_order);
+ if (r)
+ return r;
+
+ fault->max_level = min(kvm_max_level_for_order(max_order),
+ fault->max_level);
+ fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+ return RET_PF_CONTINUE;
+}
+
static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
@@ -4336,6 +4399,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}

+ if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
+ return kvm_do_memory_fault_exit(vcpu, fault);
+
+ if (fault->is_private)
+ return kvm_faultin_pfn_private(vcpu, fault);
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
fault->write, &fault->map_writable,
@@ -5771,6 +5840,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
return -EIO;
}

+ if (r == RET_PF_USER)
+ return 0;
+
if (r < 0)
return r;
if (r != RET_PF_EMULATE)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d39af5639ce9..268b517e88cb 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -203,6 +203,7 @@ struct kvm_page_fault {

/* Derived from mmu and global state. */
const bool is_tdp;
+ const bool is_private;
const bool nx_huge_page_workaround_enabled;

/*
@@ -259,6 +260,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
* RET_PF_RETRY: let CPU fault again on the address.
* RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
* RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
* RET_PF_FIXED: The faulting entry has been fixed.
* RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
*
@@ -275,6 +277,7 @@ enum {
RET_PF_RETRY,
RET_PF_EMULATE,
RET_PF_INVALID,
+ RET_PF_USER,
RET_PF_FIXED,
RET_PF_SPURIOUS,
};
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
TRACE_DEFINE_ENUM(RET_PF_RETRY);
TRACE_DEFINE_ENUM(RET_PF_EMULATE);
TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
TRACE_DEFINE_ENUM(RET_PF_FIXED);
TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);

--
2.41.0.255.g8b1d071c50-goog
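
On the userspace side of this contract, a VMM handles the resulting
KVM_EXIT_MEMORY_FAULT by converting the range's attributes to match the
access and re-entering the guest. A sketch, assuming the
KVM_SET_MEMORY_ATTRIBUTES uAPI from the per-page memory attributes
patch earlier in this series (field and flag names per that patch;
"run" is the vCPU's mapped kvm_run, "vm_fd" the VM fd):

	/* VMM: resolve a private<->shared mismatch, then retry the fault. */
	if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		struct kvm_memory_attributes attr = {
			.address    = run->memory.gpa,
			.size       = run->memory.size,
			.attributes = (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
				      KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
		};

		ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
		/* Re-run the vCPU; KVM retries the fault and now finds a match. */
	}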


2023-07-19 00:15:12

by Sean Christopherson

Subject: [RFC PATCH v11 05/29] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 2 --
arch/arm64/kvm/Kconfig | 2 +-
arch/mips/include/asm/kvm_host.h | 2 --
arch/mips/kvm/Kconfig | 2 +-
arch/powerpc/include/asm/kvm_host.h | 2 --
arch/powerpc/kvm/Kconfig | 8 ++++----
arch/powerpc/kvm/powerpc.c | 4 +---
arch/riscv/include/asm/kvm_host.h | 2 --
arch/riscv/kvm/Kconfig | 2 +-
arch/x86/include/asm/kvm_host.h | 2 --
arch/x86/kvm/Kconfig | 2 +-
include/linux/kvm_host.h | 8 +++++---
virt/kvm/Kconfig | 4 ++++
virt/kvm/kvm_main.c | 10 +++++-----
14 files changed, 23 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 8b6096753740..50d89d400bf1 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -912,8 +912,6 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
struct kvm_vcpu_events *events);

-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
void kvm_arm_halt_guest(struct kvm *kvm);
void kvm_arm_resume_guest(struct kvm *kvm);

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index f531da6b362e..a650b46f4f2f 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,7 +22,7 @@ menuconfig KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select KVM_GENERIC_HARDWARE_ENABLING
- select MMU_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER
select PREEMPT_NOTIFIERS
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_ARCH_TLB_FLUSH_ALL
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 04cedf9f8811..22a41d941bf3 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -810,8 +810,6 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn);
pgd_t *kvm_pgd_alloc(void);
void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);

-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
/* Emulation */
enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index a8cdba75f98d..c04987d2ed2e 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -25,7 +25,7 @@ config KVM
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
select KVM_MMIO
- select MMU_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER
select INTERVAL_TREE
select KVM_GENERIC_HARDWARE_ENABLING
help
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..4b5c3f2acf78 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -62,8 +62,6 @@

#include <linux/mmu_notifier.h>

-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
#define HPTEG_CACHE_NUM (1 << 15)
#define HPTEG_HASH_BITS_PTE 13
#define HPTEG_HASH_BITS_PTE_LONG 12
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 902611954200..b33358ee6424 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -42,7 +42,7 @@ config KVM_BOOK3S_64_HANDLER
config KVM_BOOK3S_PR_POSSIBLE
bool
select KVM_MMIO
- select MMU_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER

config KVM_BOOK3S_HV_POSSIBLE
bool
@@ -85,7 +85,7 @@ config KVM_BOOK3S_64_HV
tristate "KVM for POWER7 and later using hypervisor mode in host"
depends on KVM_BOOK3S_64 && PPC_POWERNV
select KVM_BOOK3S_HV_POSSIBLE
- select MMU_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER
select CMA
help
Support running unmodified book3s_64 guest kernels in
@@ -194,7 +194,7 @@ config KVM_E500V2
depends on !CONTEXT_TRACKING_USER
select KVM
select KVM_MMIO
- select MMU_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER
help
Support running unmodified E500 guest kernels in virtual machines on
E500v2 host processors.
@@ -211,7 +211,7 @@ config KVM_E500MC
select KVM
select KVM_MMIO
select KVM_BOOKE_HV
- select MMU_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER
help
Support running unmodified E500MC/E5500/E6500 guest kernels in
virtual machines on E500MC/E5500/E6500 host processors.
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 5cf9e5e3112a..f97fbac7eac9 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -635,9 +635,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
r = hv_enabled;
#else
-#ifndef KVM_ARCH_WANT_MMU_NOTIFIER
- BUILD_BUG();
-#endif
+ BUILD_BUG_ON(!IS_ENABLED(CONFIG_KVM_GENERIC_MMU_NOTIFIER));
r = 1;
#endif
break;
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 2d8ee53b66c7..6ddaf0b9278c 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -249,8 +249,6 @@ struct kvm_vcpu_arch {
static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}

-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
#define KVM_RISCV_GSTAGE_TLB_MIN_ORDER 12

void kvm_riscv_local_hfence_gvma_vmid_gpa(unsigned long vmid,
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index dfc237d7875b..ae2e05f050ec 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -30,7 +30,7 @@ config KVM
select KVM_GENERIC_HARDWARE_ENABLING
select KVM_MMIO
select KVM_XFER_TO_GUEST_WORK
- select MMU_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER
select PREEMPT_NOTIFIERS
help
Support hosting virtualized guest machines.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 28bd38303d70..f9a927296d85 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2110,8 +2110,6 @@ enum {
# define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
#endif

-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
int kvm_cpu_has_extint(struct kvm_vcpu *v);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 89ca7f4c1464..a7eb2bdbfb18 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -24,7 +24,7 @@ config KVM
depends on HIGH_RES_TIMERS
depends on X86_LOCAL_APIC
select PREEMPT_NOTIFIERS
- select MMU_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER
select HAVE_KVM_IRQCHIP
select HAVE_KVM_PFNCACHE
select HAVE_KVM_IRQFD
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 90a0be261a5c..d2d3e083ec7f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -255,7 +255,9 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif

-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+struct kvm_gfn_range;
+
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -784,7 +786,7 @@ struct kvm {
struct hlist_head irq_ack_notifier_list;
#endif

-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
@@ -1916,7 +1918,7 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
extern const struct kvm_stats_header kvm_vcpu_stats_header;
extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];

-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
{
if (unlikely(kvm->mmu_invalidate_in_progress))
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index b74916de5183..2fa11bd26cfc 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -95,3 +95,7 @@ config HAVE_KVM_PM_NOTIFIER

config KVM_GENERIC_HARDWARE_ENABLING
bool
+
+config KVM_GENERIC_MMU_NOTIFIER
+ select MMU_NOTIFIER
+ bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8101b11a13ba..53346bc2902a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -510,7 +510,7 @@ void kvm_destroy_vcpus(struct kvm *kvm)
}
EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);

-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
{
return container_of(mn, struct kvm, mmu_notifier);
@@ -938,14 +938,14 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
}

-#else /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
+#else /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */

static int kvm_init_mmu_notifier(struct kvm *kvm)
{
return 0;
}

-#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
+#endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */

#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
static int kvm_pm_notifier_call(struct notifier_block *bl,
@@ -1265,7 +1265,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_debugfs:
kvm_coalesced_mmio_free(kvm);
out_no_coalesced_mmio:
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
if (kvm->mmu_notifier.ops)
mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
#endif
@@ -1325,7 +1325,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm->buses[i] = NULL;
}
kvm_coalesced_mmio_free(kvm);
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
/*
* At this point, pending calls to invalidate_range_start()
--
2.41.0.255.g8b1d071c50-goog


2023-07-19 00:18:22

by Sean Christopherson

Subject: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

Signed-off-by: Sean Christopherson <[email protected]>
---
include/linux/pagemap.h | 11 +++++++++++
mm/compaction.c | 4 ++++
mm/migrate.c | 2 ++
3 files changed, 17 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716953ee1ebd..931d2f1da7d5 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,6 +203,7 @@ enum mapping_flags {
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
AS_LARGE_FOLIO_SUPPORT = 6,
+ AS_UNMOVABLE = 7, /* The mapping cannot be moved, ever */
};

/**
@@ -273,6 +274,16 @@ static inline int mapping_use_writeback_tags(struct address_space *mapping)
return !test_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
}

+static inline void mapping_set_unmovable(struct address_space *mapping)
+{
+ set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+ return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index dbc9f86b1934..a3d2b132df52 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
goto isolate_fail_put;

+ /* The mapping truly isn't movable. */
+ if (mapping && mapping_unmovable(mapping))
+ goto isolate_fail_put;
+
/*
* Only allow to migrate anonymous pages in GFP_NOFS context
* because those do not depend on fs locks.
diff --git a/mm/migrate.c b/mm/migrate.c
index 24baad2571e3..c00a4ca86698 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -954,6 +954,8 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,

if (!mapping)
rc = migrate_folio(mapping, dst, src, mode);
+ else if (mapping_unmovable(mapping))
+ rc = -EOPNOTSUPP;
else if (mapping->a_ops->migrate_folio)
/*
* Most folios have a mapping and most filesystems
--
2.41.0.255.g8b1d071c50-goog
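
The intended consumer is guest_memfd: KVM marks the backing inode's
address_space as unmovable at creation time so that compaction and
migration leave the pages alone. Roughly (see virt/kvm/guest_mem.c in
this series for the real call site):

	/* During guest_memfd inode setup: */
	mapping_set_unmovable(inode->i_mapping);

Compaction then skips such folios in isolate_migratepages_block(), and
move_to_new_folio() fails them with -EOPNOTSUPP, per the hunks above.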


2023-07-19 00:19:24

by Sean Christopherson

Subject: [RFC PATCH v11 18/29] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper

Drop kvm_userspace_memory_region_find(); it's unused and a terrible API
(which is probably why it's unused). If anything outside of kvm_util.c
needs to get at the memslot, userspace_mem_region_find() can be exposed
to give others full access to all memory region/slot information.

Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 4 ---
tools/testing/selftests/kvm/lib/kvm_util.c | 29 -------------------
2 files changed, 33 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 07732a157ccd..6aeb008dd668 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -753,10 +753,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, unsigned int num_guest_pages)
return n;
}

-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
- uint64_t end);
-
#define sync_global_to_guest(vm, g) ({ \
typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g)); \
memcpy(_p, &(g), sizeof(g)); \
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 9741a7ff6380..45d21e052db0 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -586,35 +586,6 @@ userspace_mem_region_find(struct kvm_vm *vm, uint64_t start, uint64_t end)
return NULL;
}

-/*
- * KVM Userspace Memory Region Find
- *
- * Input Args:
- * vm - Virtual Machine
- * start - Starting VM physical address
- * end - Ending VM physical address, inclusive.
- *
- * Output Args: None
- *
- * Return:
- * Pointer to overlapping region, NULL if no such region.
- *
- * Public interface to userspace_mem_region_find. Allows tests to look up
- * the memslot datastructure for a given range of guest physical memory.
- */
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
- uint64_t end)
-{
- struct userspace_mem_region *region;
-
- region = userspace_mem_region_find(vm, start, end);
- if (!region)
- return NULL;
-
- return &region->region;
-}
-
__weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
{

--
2.41.0.255.g8b1d071c50-goog


2023-07-19 00:19:28

by Sean Christopherson

Subject: [RFC PATCH v11 06/29] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

Cc: Jarkko Sakkinen <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 4 ++--
include/uapi/linux/kvm.h | 13 +++++++++++++
virt/kvm/kvm_main.c | 38 ++++++++++++++++++++++++++++++--------
4 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a6b9bea62fb8..92e77afd3ffd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12420,7 +12420,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
}

for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- struct kvm_userspace_memory_region m;
+ struct kvm_userspace_memory_region2 m;

m.slot = id | (i << 16);
m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d2d3e083ec7f..e9ca49d451f3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1130,9 +1130,9 @@ enum kvm_mr_change {
};

int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region2 *mem);
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region2 *mem);
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f089ab290978..4d4b3de8ac55 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
};

+/* for KVM_SET_USER_MEMORY_REGION2 */
+struct kvm_userspace_memory_region2 {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size;
+ __u64 userspace_addr;
+ __u64 pad[16];
+};
+
/*
* The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
* userspace, other bits are reserved for kvm internal use which are defined
@@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_COUNTER_OFFSET 227
#define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
#define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_USER_MEMORY2 230

#ifdef KVM_CAP_IRQ_ROUTING

@@ -1466,6 +1477,8 @@ struct kvm_vfio_spapr_tce {
struct kvm_userspace_memory_region)
#define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47)
#define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64)
+#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
+ struct kvm_userspace_memory_region2)

/* enable ucontrol for s390 */
struct kvm_s390_ucas_mapping {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 53346bc2902a..c14adf93daec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1549,7 +1549,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}

-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_userspace_memory_region2 *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

@@ -1951,7 +1951,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
* Must be called holding kvm->slots_lock for write.
*/
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region2 *mem)
{
struct kvm_memory_slot *old, *new;
struct kvm_memslots *slots;
@@ -2055,7 +2055,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);

int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region2 *mem)
{
int r;

@@ -2067,7 +2067,7 @@ int kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(kvm_set_memory_region);

static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
- struct kvm_userspace_memory_region *mem)
+ struct kvm_userspace_memory_region2 *mem)
{
if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
return -EINVAL;
@@ -4514,6 +4514,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
{
switch (arg) {
case KVM_CAP_USER_MEMORY:
+ case KVM_CAP_USER_MEMORY2:
case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
case KVM_CAP_INTERNAL_ERROR_DATA:
@@ -4757,6 +4758,14 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
return fd;
}

+#define SANITY_CHECK_MEM_REGION_FIELD(field) \
+do { \
+ BUILD_BUG_ON(offsetof(struct kvm_userspace_memory_region, field) != \
+ offsetof(struct kvm_userspace_memory_region2, field)); \
+ BUILD_BUG_ON(sizeof_field(struct kvm_userspace_memory_region, field) != \
+ sizeof_field(struct kvm_userspace_memory_region2, field)); \
+} while (0)
+
static long kvm_vm_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -4779,15 +4788,28 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
break;
}
+ case KVM_SET_USER_MEMORY_REGION2:
case KVM_SET_USER_MEMORY_REGION: {
- struct kvm_userspace_memory_region kvm_userspace_mem;
+ struct kvm_userspace_memory_region2 mem;
+ unsigned long size;
+
+ if (ioctl == KVM_SET_USER_MEMORY_REGION)
+ size = sizeof(struct kvm_userspace_memory_region);
+ else
+ size = sizeof(struct kvm_userspace_memory_region2);
+
+ /* Ensure the common parts of the two structs are identical. */
+ SANITY_CHECK_MEM_REGION_FIELD(slot);
+ SANITY_CHECK_MEM_REGION_FIELD(flags);
+ SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+ SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+ SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);

r = -EFAULT;
- if (copy_from_user(&kvm_userspace_mem, argp,
- sizeof(kvm_userspace_mem)))
+ if (copy_from_user(&mem, argp, size))
goto out;

- r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+ r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
break;
}
case KVM_GET_DIRTY_LOG: {
--
2.41.0.255.g8b1d071c50-goog
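
Existing userspace is unaffected; new userspace opts in by passing the
extended struct. A minimal sketch (error handling elided; "backing" is
a hypothetical page-aligned host allocation):

	struct kvm_userspace_memory_region2 region = {
		.slot            = 0,
		.flags           = 0,
		.guest_phys_addr = 0,
		.memory_size     = mem_size,
		.userspace_addr  = (__u64)(unsigned long)backing,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

The pad[16] space is what later patches in this series consume for the
guest_memfd fd and offset.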


2023-07-19 00:19:53

by Sean Christopherson

Subject: [RFC PATCH v11 29/29] KVM: selftests: Test KVM exit behavior for private memory/access

From: Ackerley Tng <[email protected]>

"Testing private access when memslot gets deleted" tests the behavior
of KVM when a private memslot gets deleted while the VM is using the
private memslot. When KVM looks up the deleted (slot = NULL) memslot,
KVM should exit to userspace with KVM_EXIT_MEMORY_FAULT.

In the second test, upon a private access to non-private memslot, KVM
should also exit to userspace with KVM_EXIT_MEMORY_FAULT.

sean: These testcases belong in set_memory_region_test.c, they're private
variants on existing testscases and aren't as robust, e.g. don't ensure
the vCPU is actually running and accessing memory when converting and
deleting.

Signed-off-by: Ackerley Tng <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../kvm/x86_64/private_mem_kvm_exits_test.c | 115 ++++++++++++++++++
2 files changed, 116 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 18c43336ede3..cb9450022302 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -81,6 +81,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
new file mode 100644
index 000000000000..8daaa08c0d90
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#include <linux/kvm.h>
+#include <pthread.h>
+#include <stdint.h>
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+/* Arbitrarily selected to avoid overlaps with anything else */
+#define EXITS_TEST_GVA 0xc0000000
+#define EXITS_TEST_GPA EXITS_TEST_GVA
+#define EXITS_TEST_NPAGES 1
+#define EXITS_TEST_SIZE (EXITS_TEST_NPAGES * PAGE_SIZE)
+#define EXITS_TEST_SLOT 10
+
+static uint64_t guest_repeatedly_read(void)
+{
+ volatile uint64_t value;
+
+ while (true)
+ value = *((uint64_t *) EXITS_TEST_GVA);
+
+ return value;
+}
+
+static uint32_t run_vcpu_get_exit_reason(struct kvm_vcpu *vcpu)
+{
+ vcpu_run(vcpu);
+
+ return vcpu->run->exit_reason;
+}
+
+const struct vm_shape protected_vm_shape = {
+ .mode = VM_MODE_DEFAULT,
+ .type = KVM_X86_SW_PROTECTED_VM,
+};
+
+static void test_private_access_memslot_deleted(void)
+{
+ struct kvm_vm *vm;
+ struct kvm_vcpu *vcpu;
+ pthread_t vm_thread;
+ void *thread_return;
+ uint32_t exit_reason;
+
+ vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+ guest_repeatedly_read);
+
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ EXITS_TEST_GPA, EXITS_TEST_SLOT,
+ EXITS_TEST_NPAGES,
+ KVM_MEM_PRIVATE);
+
+ virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+ /* Request to access page privately */
+ vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+ pthread_create(&vm_thread, NULL,
+ (void *(*)(void *))run_vcpu_get_exit_reason,
+ (void *)vcpu);
+
+ vm_mem_region_delete(vm, EXITS_TEST_SLOT);
+
+ pthread_join(vm_thread, &thread_return);
+ exit_reason = (uint32_t)(uint64_t)thread_return;
+
+ ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+ ASSERT_EQ(vcpu->run->memory.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+ ASSERT_EQ(vcpu->run->memory.gpa, EXITS_TEST_GPA);
+ ASSERT_EQ(vcpu->run->memory.size, EXITS_TEST_SIZE);
+
+ kvm_vm_free(vm);
+}
+
+static void test_private_access_memslot_not_private(void)
+{
+ struct kvm_vm *vm;
+ struct kvm_vcpu *vcpu;
+ uint32_t exit_reason;
+
+ vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+ guest_repeatedly_read);
+
+ /* Add a non-private memslot (flags = 0) */
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ EXITS_TEST_GPA, EXITS_TEST_SLOT,
+ EXITS_TEST_NPAGES, 0);
+
+ virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+ /* Request to access page privately */
+ vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+ exit_reason = run_vcpu_get_exit_reason(vcpu);
+
+ ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+ ASSERT_EQ(vcpu->run->memory.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+ ASSERT_EQ(vcpu->run->memory.gpa, EXITS_TEST_GPA);
+ ASSERT_EQ(vcpu->run->memory.size, EXITS_TEST_SIZE);
+
+ kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+ TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+
+ test_private_access_memslot_deleted();
+ test_private_access_memslot_not_private();
+}
--
2.41.0.255.g8b1d071c50-goog


2023-07-19 07:43:49

by Yuan Yao

Subject: Re: [RFC PATCH v11 05/29] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

On Tue, Jul 18, 2023 at 04:44:48PM -0700, Sean Christopherson wrote:
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> [...]
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 90a0be261a5c..d2d3e083ec7f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -255,7 +255,9 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +struct kvm_gfn_range;

Not sure why a forward declaration is needed here; before this patch,
the struct was only defined for architectures that defined
KVM_ARCH_WANT_MMU_NOTIFIER.


2023-07-19 14:41:45

by Sean Christopherson

Subject: Re: [RFC PATCH v11 05/29] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

On Wed, Jul 19, 2023, Yuan Yao wrote:
> On Tue, Jul 18, 2023 at 04:44:48PM -0700, Sean Christopherson wrote:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 90a0be261a5c..d2d3e083ec7f 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -255,7 +255,9 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > +struct kvm_gfn_range;
>
> Not sure why a forward declaration is needed here; before this patch,
> the struct was only defined for architectures that defined
> KVM_ARCH_WANT_MMU_NOTIFIER.

The forward declaration exists to handle cases where CONFIG_KVM=n, specifically
arch/powerpc/include/asm/kvm_ppc.h's declaration of hooks to forward calls to
uarch modules:

bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Prior to using a Kconfig, a forward declaration wasn't necessary because
arch/powerpc/include/asm/kvm_host.h would #define KVM_ARCH_WANT_MMU_NOTIFIER even
if CONFIG_KVM=n.

Alternatively, kvm_ppc.h could declare the struct. I went this route mainly to
avoid the possibility of someone encountering the same problem on a different
architecture.
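
For readers following along: the relevant C rule is that a pointer to an
incomplete type is perfectly legal in a prototype, but if the struct tag has
never been declared at file scope, its first appearance inside a parameter
list only gets prototype scope and gcc warns about it. A minimal standalone
sketch (illustrative, not kernel code):

  #include <stdbool.h>

  /*
   * Without these forward declarations, gcc would warn that
   * "'struct kvm_gfn_range' declared inside parameter list will not
   * be visible outside of this definition or declaration" for the
   * prototype below.
   */
  struct kvm;
  struct kvm_gfn_range;

  /* Incomplete types behind pointers are fine in prototypes. */
  bool unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);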

2023-07-20 01:31:33

by Yuan Yao

[permalink] [raw]
Subject: Re: [RFC PATCH v11 05/29] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

On Wed, Jul 19, 2023 at 07:15:09AM -0700, Sean Christopherson wrote:
> On Wed, Jul 19, 2023, Yuan Yao wrote:
> > On Tue, Jul 18, 2023 at 04:44:48PM -0700, Sean Christopherson wrote:
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 90a0be261a5c..d2d3e083ec7f 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -255,7 +255,9 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > #endif
> > >
> > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > +struct kvm_gfn_range;
> >
> > Not sure why a declaration is needed here; it's already defined for ARCHs
> > which defined KVM_ARCH_WANT_MMU_NOTIFIER before.
>
> The forward declaration exists to handle cases where CONFIG_KVM=n, specifically
> arch/powerpc/include/asm/kvm_ppc.h's declaration of hooks to forward calls to
> uarch modules:
>
> bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
> bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>
> Prior to using a Kconfig, a forward declaration wasn't necessary because
> arch/powerpc/include/asm/kvm_host.h would #define KVM_ARCH_WANT_MMU_NOTIFIER even
> if CONFIG_KVM=n.
>
> Alternatively, kvm_ppc.h could declare the struct. I went this route mainly to
> avoid the possibility of someone encountering the same problem on a different
> architecture.

Ah I see, thanks for your explanation!

2023-07-21 10:10:07

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v11 06/29] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

On 7/19/23 01:44, Sean Christopherson wrote:
> Cc: Jarkko Sakkinen <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/x86.c | 2 +-
> include/linux/kvm_host.h | 4 ++--
> include/uapi/linux/kvm.h | 13 +++++++++++++
> virt/kvm/kvm_main.c | 38 ++++++++++++++++++++++++++++++--------
> 4 files changed, 46 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a6b9bea62fb8..92e77afd3ffd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12420,7 +12420,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> }
>
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> - struct kvm_userspace_memory_region m;
> + struct kvm_userspace_memory_region2 m;
>
> m.slot = id | (i << 16);
> m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d2d3e083ec7f..e9ca49d451f3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1130,9 +1130,9 @@ enum kvm_mr_change {
> };
>
> int kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem);
> + const struct kvm_userspace_memory_region2 *mem);
> int __kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem);
> + const struct kvm_userspace_memory_region2 *mem);
> void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f089ab290978..4d4b3de8ac55 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
> __u64 userspace_addr; /* start of the userspace allocated memory */
> };
>
> +/* for KVM_SET_USER_MEMORY_REGION2 */
> +struct kvm_userspace_memory_region2 {
> + __u32 slot;
> + __u32 flags;
> + __u64 guest_phys_addr;
> + __u64 memory_size;
> + __u64 userspace_addr;
> + __u64 pad[16];
> +};
> +
> /*
> * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
> * userspace, other bits are reserved for kvm internal use which are defined
> @@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_COUNTER_OFFSET 227
> #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
> #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
> +#define KVM_CAP_USER_MEMORY2 230
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1466,6 +1477,8 @@ struct kvm_vfio_spapr_tce {
> struct kvm_userspace_memory_region)
> #define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47)
> #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64)
> +#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> + struct kvm_userspace_memory_region2)
>
> /* enable ucontrol for s390 */
> struct kvm_s390_ucas_mapping {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 53346bc2902a..c14adf93daec 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1549,7 +1549,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> }
> }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_userspace_memory_region2 *mem)
> {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> @@ -1951,7 +1951,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> * Must be called holding kvm->slots_lock for write.
> */
> int __kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem)
> + const struct kvm_userspace_memory_region2 *mem)
> {
> struct kvm_memory_slot *old, *new;
> struct kvm_memslots *slots;
> @@ -2055,7 +2055,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
> int kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem)
> + const struct kvm_userspace_memory_region2 *mem)
> {
> int r;
>
> @@ -2067,7 +2067,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>
> static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> - struct kvm_userspace_memory_region *mem)
> + struct kvm_userspace_memory_region2 *mem)
> {
> if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> return -EINVAL;
> @@ -4514,6 +4514,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> {
> switch (arg) {
> case KVM_CAP_USER_MEMORY:
> + case KVM_CAP_USER_MEMORY2:
> case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
> case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
> case KVM_CAP_INTERNAL_ERROR_DATA:
> @@ -4757,6 +4758,14 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> return fd;
> }
>
> +#define SANITY_CHECK_MEM_REGION_FIELD(field) \
> +do { \
> + BUILD_BUG_ON(offsetof(struct kvm_userspace_memory_region, field) != \
> + offsetof(struct kvm_userspace_memory_region2, field)); \
> + BUILD_BUG_ON(sizeof_field(struct kvm_userspace_memory_region, field) != \
> + sizeof_field(struct kvm_userspace_memory_region2, field)); \
> +} while (0)
> +
> static long kvm_vm_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -4779,15 +4788,28 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
> break;
> }
> + case KVM_SET_USER_MEMORY_REGION2:
> case KVM_SET_USER_MEMORY_REGION: {
> - struct kvm_userspace_memory_region kvm_userspace_mem;
> + struct kvm_userspace_memory_region2 mem;
> + unsigned long size;
> +
> + if (ioctl == KVM_SET_USER_MEMORY_REGION)
> + size = sizeof(struct kvm_userspace_memory_region);
> + else
> + size = sizeof(struct kvm_userspace_memory_region2);
> +
> + /* Ensure the common parts of the two structs are identical. */
> + SANITY_CHECK_MEM_REGION_FIELD(slot);
> + SANITY_CHECK_MEM_REGION_FIELD(flags);
> + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> + SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
>
> r = -EFAULT;
> - if (copy_from_user(&kvm_userspace_mem, argp,
> - sizeof(kvm_userspace_mem)))
> + if (copy_from_user(&mem, argp, size))
> goto out;
>
> - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> + r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> break;
> }
> case KVM_GET_DIRTY_LOG: {

Reviewed-by: Paolo Bonzini <[email protected]>


2023-07-21 15:19:59

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v11 14/29] KVM: x86/mmu: Handle page fault for private memory

On 7/19/23 01:44, Sean Christopherson wrote:
> From: Chao Peng <[email protected]>
>
> A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> hva-based shared memory. Architecture code (like TDX code) can tell
> whether the on-going fault is private or not. This patch adds a
> 'is_private' field to kvm_page_fault to indicate this and architecture
> code is expected to set it.
>
> To handle page fault for such memslot, the handling logic is different
> depending on whether the fault is private or shared. KVM checks if
> 'is_private' matches the host's view of the page (maintained in
> mem_attr_array).
> - For a successful match, private pfn is obtained with
> restrictedmem_get_page() and shared pfn is obtained with existing
> get_user_pages().
> - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> userspace. Userspace then can convert memory between private/shared
> in host's view and retry the fault.
>
> Co-developed-by: Yu Zhang <[email protected]>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> Reviewed-by: Fuad Tabba <[email protected]>
> Tested-by: Fuad Tabba <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 82 +++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/mmu_internal.h | 3 ++
> arch/x86/kvm/mmu/mmutrace.h | 1 +
> 3 files changed, 81 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index aefe67185637..4cf73a579ee1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3179,9 +3179,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> return level;
> }
>
> -int kvm_mmu_max_mapping_level(struct kvm *kvm,
> - const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level)
> +static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
> + const struct kvm_memory_slot *slot,
> + gfn_t gfn, int max_level, bool is_private)
> {
> struct kvm_lpage_info *linfo;
> int host_level;
> @@ -3193,6 +3193,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> break;
> }
>
> + if (is_private)
> + return max_level;
> +
> if (max_level == PG_LEVEL_4K)
> return PG_LEVEL_4K;
>
> @@ -3200,6 +3203,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> return min(host_level, max_level);
> }
>
> +int kvm_mmu_max_mapping_level(struct kvm *kvm,
> + const struct kvm_memory_slot *slot, gfn_t gfn,
> + int max_level)
> +{
> + bool is_private = kvm_slot_can_be_private(slot) &&
> + kvm_mem_is_private(kvm, gfn);
> +
> + return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
> +}
> +
> void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_memory_slot *slot = fault->slot;
> @@ -3220,8 +3233,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> * Enforce the iTLB multihit workaround after capturing the requested
> * level, which will be used to do precise, accurate accounting.
> */
> - fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> - fault->gfn, fault->max_level);
> + fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> + fault->gfn, fault->max_level,
> + fault->is_private);
> if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> return;
>
> @@ -4304,6 +4318,55 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
> }
>
> +static inline u8 kvm_max_level_for_order(int order)
> +{
> + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> + MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
> + order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
> + order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> + return PG_LEVEL_1G;
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> + return PG_LEVEL_2M;
> +
> + return PG_LEVEL_4K;
> +}
> +
> +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + if (fault->is_private)
> + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> + else
> + vcpu->run->memory.flags = 0;
> + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> + vcpu->run->memory.size = PAGE_SIZE;
> + return RET_PF_USER;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + int max_order, r;
> +
> + if (!kvm_slot_can_be_private(fault->slot))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> + &max_order);
> + if (r)
> + return r;
> +
> + fault->max_level = min(kvm_max_level_for_order(max_order),
> + fault->max_level);
> + fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> + return RET_PF_CONTINUE;
> +}
> +
> static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_memory_slot *slot = fault->slot;
> @@ -4336,6 +4399,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return RET_PF_EMULATE;
> }
>
> + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + if (fault->is_private)
> + return kvm_faultin_pfn_private(vcpu, fault);
> +
> async = false;
> fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> fault->write, &fault->map_writable,
> @@ -5771,6 +5840,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> return -EIO;
> }
>
> + if (r == RET_PF_USER)
> + return 0;
> +
> if (r < 0)
> return r;
> if (r != RET_PF_EMULATE)
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index d39af5639ce9..268b517e88cb 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -203,6 +203,7 @@ struct kvm_page_fault {
>
> /* Derived from mmu and global state. */
> const bool is_tdp;
> + const bool is_private;
> const bool nx_huge_page_workaround_enabled;
>
> /*
> @@ -259,6 +260,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * RET_PF_RETRY: let CPU fault again on the address.
> * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> + * RET_PF_USER: need to exit to userspace to handle this fault.
> * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> *
> @@ -275,6 +277,7 @@ enum {
> RET_PF_RETRY,
> RET_PF_EMULATE,
> RET_PF_INVALID,
> + RET_PF_USER,
> RET_PF_FIXED,
> RET_PF_SPURIOUS,
> };
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..2d7555381955 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> TRACE_DEFINE_ENUM(RET_PF_RETRY);
> TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> TRACE_DEFINE_ENUM(RET_PF_INVALID);
> +TRACE_DEFINE_ENUM(RET_PF_USER);
> TRACE_DEFINE_ENUM(RET_PF_FIXED);
> TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
>

Reviewed-by: Paolo Bonzini <[email protected]>


2023-07-21 16:43:42

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v11 18/29] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper

On 7/19/23 01:45, Sean Christopherson wrote:
> Drop kvm_userspace_memory_region_find(), it's unused and a terrible API
> (probably why it's unused). If anything outside of kvm_util.c needs to
> get at the memslot, userspace_mem_region_find() can be exposed to give
> others full access to all memory region/slot information.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> .../selftests/kvm/include/kvm_util_base.h | 4 ---
> tools/testing/selftests/kvm/lib/kvm_util.c | 29 -------------------
> 2 files changed, 33 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
> index 07732a157ccd..6aeb008dd668 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util_base.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
> @@ -753,10 +753,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, unsigned int num_guest_pages)
> return n;
> }
>
> -struct kvm_userspace_memory_region *
> -kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
> - uint64_t end);
> -
> #define sync_global_to_guest(vm, g) ({ \
> typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g)); \
> memcpy(_p, &(g), sizeof(g)); \
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 9741a7ff6380..45d21e052db0 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -586,35 +586,6 @@ userspace_mem_region_find(struct kvm_vm *vm, uint64_t start, uint64_t end)
> return NULL;
> }
>
> -/*
> - * KVM Userspace Memory Region Find
> - *
> - * Input Args:
> - * vm - Virtual Machine
> - * start - Starting VM physical address
> - * end - Ending VM physical address, inclusive.
> - *
> - * Output Args: None
> - *
> - * Return:
> - * Pointer to overlapping region, NULL if no such region.
> - *
> - * Public interface to userspace_mem_region_find. Allows tests to look up
> - * the memslot datastructure for a given range of guest physical memory.
> - */
> -struct kvm_userspace_memory_region *
> -kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
> - uint64_t end)
> -{
> - struct userspace_mem_region *region;
> -
> - region = userspace_mem_region_find(vm, start, end);
> - if (!region)
> - return NULL;
> -
> - return &region->region;
> -}
> -
> __weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
> {
>

Will queue this.

Paolo


2023-07-24 07:00:31

by Nikunj A. Dadhania

[permalink] [raw]
Subject: Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

On 7/19/2023 5:14 AM, Sean Christopherson wrote:
> This is the next iteration of implementing fd-based (instead of vma-based)
> memory for KVM guests. If you want the full background of why we are doing
> this, please go read the v10 cover letter[1].
>
> The biggest change from v10 is to implement the backing storage in KVM
> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
> See link[2] for details on why we pivoted to a KVM-specific approach.
>
> Key word is "biggest". Relative to v10, there are many big changes.
> Highlights below (I can't remember everything that got changed at
> this point).
>
> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> documentation. And ideally, we'll have even more tests before merging.
> There are also several gaps/opens (to be discussed in tomorrow's PUCK).

As per our discussion on the PUCK call, here are the memory/NUMA accounting
related observations that I had while working on SNP guest secure page migration:

* gmem allocations are currently treated as file page allocations
accounted to the kernel and not to the QEMU process.

Starting an SNP guest with 40G memory with memory interleave between
Node2 and Node3

$ numactl -i 2,3 ./bootg_snp.sh

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
242179 root 20 0 40.4g 99580 51676 S 78.0 0.0 0:56.58 qemu-system-x86

-> Incorrect process resident memory and shared memory is reported

Accounting of the memory happens in the host page fault handler path,
but for private guest pages we will never hit that.

* NUMA allocation does use the process mempolicy for appropriate node
allocation (Node2 and Node3), but they again do not get attributed to
the QEMU process

Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage" gomati: Mon Jul 24 11:51:34 2023

Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Node 3 Total
242179 (qemu-system-x86) 21.14 1.61 39.44 39.38 101.57
Per-node system memory usage (in MBs):
Node 0 Node 1 Node 2 Node 3 Total
FilePages 2475.63 2395.83 23999.46 23373.22 52244.14


* Most of the memory accounting relies on the VMAs and as private-fd of
gmem doesn't have a VMA (and that was the design goal), user-space fails
to attribute the memory appropriately to the process.

/proc/<qemu pid>/numa_maps
7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4

/proc/<qemu pid>/smaps
7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025 /memfd:rom-backend-memfd-shared (deleted)

* QEMU based NUMA bindings will not work. Memory backend uses mbind()
to set the policy for a particular virtual memory range but gmem
private-FD does not have a virtual memory range visible in the host.

Regards,
Nikunj

2023-07-24 17:25:33

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
> > This is the next iteration of implementing fd-based (instead of vma-based)
> > memory for KVM guests. If you want the full background of why we are doing
> > this, please go read the v10 cover letter[1].
> >
> > The biggest change from v10 is to implement the backing storage in KVM
> > itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
> > See link[2] for details on why we pivoted to a KVM-specific approach.
> >
> > Key word is "biggest". Relative to v10, there are many big changes.
> > Highlights below (I can't remember everything that got changed at
> > this point).
> >
> > Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> > documentation. And ideally, we'll have even more tests before merging.
> > There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>
> As per our discussion on the PUCK call, here are the memory/NUMA accounting
> related observations that I had while working on SNP guest secure page migration:
>
> * gmem allocations are currently treated as file page allocations
> accounted to the kernel and not to the QEMU process.

We need to level set on terminology: these are all *stats*, not accounting. That
distinction matters because we have wiggle room on stats, e.g. we can probably get
away with just about any definition of how guest_memfd memory impacts stats, so
long as the information that is surfaced to userspace is useful and expected.

But we absolutely need to get accounting correct, specifically the allocations
need to be correctly accounted in memcg. And unless I'm missing something,
nothing in here shows anything related to memcg.

> Starting an SNP guest with 40G memory with memory interleave between
> Node2 and Node3
>
> $ numactl -i 2,3 ./bootg_snp.sh
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 242179 root 20 0 40.4g 99580 51676 S 78.0 0.0 0:56.58 qemu-system-x86
>
> -> Incorrect process resident memory and shared memory is reported

I don't know that I would call these "incorrect". Shared memory definitely is
correct, because by definition guest_memfd isn't shared. RSS is less clear cut;
gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
memslots).

> Accounting of the memory happens in the host page fault handler path,
> but for private guest pages we will never hit that.
>
> * NUMA allocation does use the process mempolicy for appropriate node
> allocation (Node2 and Node3), but they again do not get attributed to
> the QEMU process
>
> Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage" gomati: Mon Jul 24 11:51:34 2023
>
> Per-node process memory usage (in MBs)
> PID Node 0 Node 1 Node 2 Node 3 Total
> 242179 (qemu-system-x86) 21.14 1.61 39.44 39.38 101.57
> Per-node system memory usage (in MBs):
> Node 0 Node 1 Node 2 Node 3 Total
> FilePages 2475.63 2395.83 23999.46 23373.22 52244.14
>
>
> * Most of the memory accounting relies on the VMAs and as private-fd of
> gmem doesn't have a VMA (and that was the design goal), user-space fails
> to attribute the memory appropriately to the process.
>
> /proc/<qemu pid>/numa_maps
> 7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
> 7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
> 7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
> 7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4
>
> /proc/<qemu pid>/smaps
> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025 /memfd:rom-backend-memfd-shared (deleted)

This is all expected, and IMO correct. There are no userspace mappings, and so
not accounting anything is working as intended.

> * QEMU based NUMA bindings will not work. Memory backend uses mbind()
> to set the policy for a particular virtual memory range but gmem
> private-FD does not have a virtual memory range visible in the host.

Yes, adding a generic fbind() is the way to solve this.
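
For reference, mbind(2) operates on a virtual address range:

  long mbind(void *addr, unsigned long len, int mode,
             const unsigned long *nodemask, unsigned long maxnode,
             unsigned int flags);

so a generic fbind() would presumably take a file plus offset/length instead.
A purely hypothetical sketch of the shape such an interface might take, not a
proposed ABI:

  long fbind(int fd, loff_t offset, size_t len, int mode,
             const unsigned long *nodemask, unsigned long maxnode,
             unsigned int flags);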

2023-07-24 20:33:07

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

Dropped non-KVM folks from Cc: so as not to bother them too much.

On Tue, Jul 18, 2023, Sean Christopherson wrote:
> This is the next iteration of implementing fd-based (instead of vma-based)
> memory for KVM guests. If you want the full background of why we are doing
> this, please go read the v10 cover letter[1].
>
> The biggest change from v10 is to implement the backing storage in KVM
> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
> See link[2] for details on why we pivoted to a KVM-specific approach.
>
> Key word is "biggest". Relative to v10, there are many big changes.
> Highlights below (I can't remember everything that got changed at
> this point).
>
> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> documentation. And ideally, we'll have even more tests before merging.
> There are also several gaps/opens (to be discussed in tomorrow's PUCK).

I've pushed this to

https://github.com/kvm-x86/linux/tree/guest_memfd

along with Isaku's fix for the lock ordering bug on top.

As discussed at PUCK, I'll apply fixes/tweaks/changes on top until development
stabilizes, and will only squash/fixup when we're ready to post v12 for broad
review.

Please "formally" post patches just like you normally would do, i.e. don't *just*
respond to the buggy mail (though that is also helpful). Standalone patches make
it easier for me to manage things via lore/b4.

If you can, put gmem or guest_memfd inside the square brackets, e.g.

[PATCH gmem] KVM: <shortlog>

so that it's obvious the patch is intended for the guest_memfd branch. For fixes,
please also be sure to use Fixes: tags and split patches to fix exactly one base
commit, again to make my life easier.
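
For example, a fix posted against the branch might look like this (the sha and
shortlog are illustrative placeholders, not real commits):

  [PATCH gmem] KVM: Fix <one specific bug> in guest_memfd

  <changelog>

  Fixes: 0123456789ab ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
  Signed-off-by: <author>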

I'll likely add my own annotations when applying, e.g. [FIXUP] and whatnot, but
that's purely notes for myself for the future squash/rebase.

Thanks!

2023-07-25 11:54:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index dbc9f86b1934..a3d2b132df52 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
> goto isolate_fail_put;
>
> + /* The mapping truly isn't movable. */
> + if (mapping && mapping_unmovable(mapping))
> + goto isolate_fail_put;
> +

I doubt that it is safe to dereference mapping here. I believe the folio
can be truncated from under us and the mapping freed with the inode.

The folio has to be locked to dereference mapping safely (given that the
mapping is still tied to the folio).

Vlastimil, any comments?

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-07-25 13:39:21

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

On Tue, Jul 25, 2023 at 01:24:03PM +0300, Kirill A . Shutemov wrote:
> On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index dbc9f86b1934..a3d2b132df52 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
> > goto isolate_fail_put;
> >
> > + /* The mapping truly isn't movable. */
> > + if (mapping && mapping_unmovable(mapping))
> > + goto isolate_fail_put;
> > +
>
> I doubt that it is safe to dereference mapping here. I believe the folio
> can be truncated from under us and the mapping freed with the inode.
>
> The folio has to be locked to dereference mapping safely (given that the
> mapping is still tied to the folio).

There's even a comment to that effect later on in the function:

/*
* Only pages without mappings or that have a
* ->migrate_folio callback are possible to migrate
* without blocking. However, we can be racing with
* truncation so it's necessary to lock the page
* to stabilise the mapping as truncation holds
* the page lock until after the page is removed
* from the page cache.
*/

(that could be reworded to make it clear how dangerous dereferencing
->mapping is without the lock ... and it does need to be changed to say
"folio lock" instead of "page lock", so ...)

How does this look?

/*
* Only folios without mappings or that have
* a ->migrate_folio callback are possible to
* migrate without blocking. However, we can
* be racing with truncation, which can free
* the mapping. Truncation holds the folio lock
* until after the folio is removed from the page
* cache so holding it ourselves is sufficient.
*/
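
For illustration, that suggests re-checking the AS_ bit only with the folio
lock held; a minimal sketch against isolate_migratepages_block() (placement
and labels are illustrative, not the actual fix):

  /*
   * The folio lock pins ->mapping: truncation holds the folio lock
   * until the folio is removed from the page cache, so the mapping
   * cannot be freed out from under us while we hold the lock.
   */
  if (unlikely(!folio_trylock(folio)))
          goto isolate_fail_put;

  mapping = folio_mapping(folio);
  if (mapping && mapping_unmovable(mapping)) {
          folio_unlock(folio);
          goto isolate_fail_put;
  }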


2023-07-26 12:19:15

by Nikunj A. Dadhania

[permalink] [raw]
Subject: Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

Hi Sean,

On 7/24/2023 10:30 PM, Sean Christopherson wrote:
> On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
>> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
>>> This is the next iteration of implementing fd-based (instead of vma-based)
>>> memory for KVM guests. If you want the full background of why we are doing
>>> this, please go read the v10 cover letter[1].
>>>
>>> The biggest change from v10 is to implement the backing storage in KVM
>>> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
>>> See link[2] for details on why we pivoted to a KVM-specific approach.
>>>
>>> Key word is "biggest". Relative to v10, there are many big changes.
>>> Highlights below (I can't remember everything that got changed at
>>> this point).
>>>
>>> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
>>> documentation. And ideally, we'll have even more tests before merging.
>>> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>>
>> As per our discussion on the PUCK call, here are the memory/NUMA accounting
>> related observations that I had while working on SNP guest secure page migration:
>>
>> * gmem allocations are currently treated as file page allocations
>> accounted to the kernel and not to the QEMU process.
>
> We need to level set on terminology: these are all *stats*, not accounting. That
> distinction matters because we have wiggle room on stats, e.g. we can probably get
> away with just about any definition of how guest_memfd memory impacts stats, so
> long as the information that is surfaced to userspace is useful and expected.
>
> But we absolutely need to get accounting correct, specifically the allocations
> need to be correctly accounted in memcg. And unless I'm missing something,
> nothing in here shows anything related to memcg.

I tried out memcg after creating a separate cgroup for the qemu process. Guest
memory is accounted in memcg.

$ egrep -w "file|file_thp|unevictable" memory.stat
file 42978775040
file_thp 42949672960
unevictable 42953588736

NUMA allocations are coming from right nodes as set by the numactl.

$ egrep -w "file|file_thp|unevictable" memory.numa_stat
file N0=0 N1=20480 N2=21489377280 N3=21489377280
file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
unevictable N0=0 N1=0 N2=21474697216 N3=21478891520

>
>> Starting an SNP guest with 40G memory with memory interleave between
>> Node2 and Node3
>>
>> $ numactl -i 2,3 ./bootg_snp.sh
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 242179 root 20 0 40.4g 99580 51676 S 78.0 0.0 0:56.58 qemu-system-x86
>>
>> -> Incorrect process resident memory and shared memory is reported
>
> I don't know that I would call these "incorrect". Shared memory definitely is
> correct, because by definition guest_memfd isn't shared. RSS is less clear cut;
> gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
> scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
> assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
> memslots).

I am not sure why RSS would exceed VIRT; it should be at most 40G (assuming all the
memory is private).

As per my experiments with a hack below, MM_FILEPAGES does get accounted to RSS/SHR in top

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4339 root 20 0 40.4g 40.1g 40.1g S 76.7 16.0 0:13.83 qemu-system-x86

diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..5b1f48a2e714 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,6 +166,7 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
{
trace_rss_stat(mm, member);
}
+EXPORT_SYMBOL(mm_trace_rss_stat);

/*
* Note: this doesn't free the actual pages themselves. That
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index a7e926af4255..e4f268bf9ce2 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -91,6 +91,10 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
clear_highpage(folio_page(folio, i));
}

+ /* Account only once for the first time */
+ if (!folio_test_dirty(folio))
+ add_mm_counter(current->mm, MM_FILEPAGES, folio_nr_pages(folio));
+
folio_mark_accessed(folio);
folio_mark_dirty(folio);
folio_mark_uptodate(folio);

We can update the rss_stat appropriately to get correct reporting in userspace.

>> Accounting of the memory happens in the host page fault handler path,
>> but for private guest pages we will never hit that.
>>
>> * NUMA allocation does use the process mempolicy for appropriate node
>> allocation (Node2 and Node3), but they again do not get attributed to
>> the QEMU process
>>
>> Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage" gomati: Mon Jul 24 11:51:34 2023
>>
>> Per-node process memory usage (in MBs)
>> PID Node 0 Node 1 Node 2 Node 3 Total
>> 242179 (qemu-system-x86) 21.14 1.61 39.44 39.38 101.57
>>
>> Per-node system memory usage (in MBs):
>> Node 0 Node 1 Node 2 Node 3 Total
>> FilePages 2475.63 2395.83 23999.46 23373.22 52244.14
>>
>>
>> * Most of the memory accounting relies on the VMAs and as private-fd of
>> gmem doesn't have a VMA (and that was the design goal), user-space fails
>> to attribute the memory appropriately to the process.
>>
>> /proc/<qemu pid>/numa_maps
>> 7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
>> 7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
>> 7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
>> 7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4
>>
>> /proc/<qemu pid>/smaps
>> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
>> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
>> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
>> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025 /memfd:rom-backend-memfd-shared (deleted)
>
> This is all expected, and IMO correct. There are no userspace mappings, and so
> not accounting anything is working as intended.

That doesn't sound correct: if 10 SNP guests are running, each using 10GB, how
would we know who is using 100GB of memory?

>
>> * QEMU based NUMA bindings will not work. Memory backend uses mbind()
>> to set the policy for a particular virtual memory range but gmem
>> private-FD does not have a virtual memory range visible in the host.
>
> Yes, adding a generic fbind() is the way to solve this.

Regards,
Nikunj


2023-07-26 12:57:59

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

On Tue, Jul 25, 2023 at 01:51:55PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 25, 2023 at 01:24:03PM +0300, Kirill A . Shutemov wrote:
> > On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index dbc9f86b1934..a3d2b132df52 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > > if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
> > > goto isolate_fail_put;
> > >
> > > + /* The mapping truly isn't movable. */
> > > + if (mapping && mapping_unmovable(mapping))
> > > + goto isolate_fail_put;
> > > +
> >
> > I doubt that it is safe to dereference mapping here. I believe the folio
> > can be truncated from under us and the mapping freed with the inode.
> >
> > The folio has to be locked to dereference mapping safely (given that the
> > mapping is still tied to the folio).
>
> There's even a comment to that effect later on in the function:
>
> /*
> * Only pages without mappings or that have a
> * ->migrate_folio callback are possible to migrate
> * without blocking. However, we can be racing with
> * truncation so it's necessary to lock the page
> * to stabilise the mapping as truncation holds
> * the page lock until after the page is removed
> * from the page cache.
> */
>
> (that could be reworded to make it clear how dangerous dereferencing
> ->mapping is without the lock ... and it does need to be changed to say
> "folio lock" instead of "page lock", so ...)
>
> How does this look?
>
> /*
> * Only folios without mappings or that have
> * a ->migrate_folio callback are possible to
> * migrate without blocking. However, we can
> * be racing with truncation, which can free
> * the mapping. Truncation holds the folio lock
> * until after the folio is removed from the page
> * cache so holding it ourselves is sufficient.
> */
>

Looks good to me.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-07-26 15:23:33

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

On Wed, Jul 26, 2023, Nikunj A. Dadhania wrote:
> Hi Sean,
>
> On 7/24/2023 10:30 PM, Sean Christopherson wrote:
> >> Starting an SNP guest with 40G memory with memory interleave between
> >> Node2 and Node3
> >>
> >> $ numactl -i 2,3 ./bootg_snp.sh
> >>
> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >> 242179 root 20 0 40.4g 99580 51676 S 78.0 0.0 0:56.58 qemu-system-x86
> >>
> >> -> Incorrect process resident memory and shared memory is reported
> >
> > I don't know that I would call these "incorrect". Shared memory definitely is
> > correct, because by definition guest_memfd isn't shared. RSS is less clear cut;
> > gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
> > scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
> > assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
> > memslots).
>
> I am not sure why RSS would exceed VIRT; it should be at most 40G (assuming all the
> memory is private).

And also assuming that (a) userspace mmap()'d the shared side of things 1:1 with
private memory and (b) that the shared mappings have not been populated. Those
assumptions will most probably hold true for QEMU, but kernel correctness
shouldn't depend on assumptions about one specific userspace application.

> >> /proc/<qemu pid>/smaps
> >> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
> >> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
> >> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
> >> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025 /memfd:rom-backend-memfd-shared (deleted)
> >
> > This is all expected, and IMO correct. There are no userspace mappings, and so
> > not accounting anything is working as intended.
> That doesn't sound correct: if 10 SNP guests are running, each using 10GB, how
> would we know who is using 100GB of memory?

It's correct with respect to what the interfaces show, which is how much memory
is *mapped* into userspace.

As I said (or at least tried to say) in my first reply, I am not against exposing
memory usage to userspace via stats, only that it's not obvious to me that the
existing VMA-based stats are the most appropriate way to surface this information.

2023-07-27 07:36:30

by Nikunj A. Dadhania

[permalink] [raw]
Subject: Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

On 7/26/2023 7:54 PM, Sean Christopherson wrote:
> On Wed, Jul 26, 2023, Nikunj A. Dadhania wrote:
>> On 7/24/2023 10:30 PM, Sean Christopherson wrote:

>>>> /proc/<qemu pid>/smaps
>>>> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
>>>> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
>>>> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
>>>> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025 /memfd:rom-backend-memfd-shared (deleted)
>>>
>>> This is all expected, and IMO correct. There are no userspace mappings, and so
>>> not accounting anything is working as intended.
>> That doesn't sound correct: if 10 SNP guests are running, each using 10GB, how
>> would we know who is using 100GB of memory?
>
> It's correct with respect to what the interfaces show, which is how much memory
> is *mapped* into userspace.
>
> As I said (or at least tried to say) in my first reply, I am not against exposing
> memory usage to userspace via stats, only that it's not obvious to me that the
> existing VMA-based stats are the most appropriate way to surface this information.

Right, then should we think along the lines of creating a VM ioctl for querying
current memory usage for guest-memfd?

We could use memcg for statistics, but the memory cgroup can be disabled, so
memcg isn't really a dependable option.

Do you have some ideas on how to expose the memory usage to userspace other than
VMA-based stats?

Regards,
Nikunj

2023-07-28 09:52:22

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH v11 06/29] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

On Tuesday 18 Jul 2023 at 16:44:49 (-0700), Sean Christopherson wrote:
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
> __u64 userspace_addr; /* start of the userspace allocated memory */
> };
>
> +/* for KVM_SET_USER_MEMORY_REGION2 */
> +struct kvm_userspace_memory_region2 {
> + __u32 slot;
> + __u32 flags;
> + __u64 guest_phys_addr;
> + __u64 memory_size;
> + __u64 userspace_addr;
> + __u64 pad[16];

Should we replace that pad[16] with:

__u64 size;

where 'size' is the size of the structure as seen by userspace? This is
used in other UAPIs (see struct sched_attr for example) and is a bit
more robust for future extensions (e.g. an 'old' kernel can correctly
reject a newer version of the struct with additional fields it doesn't
know about if that makes sense, etc).
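
For reference, a sketch of how the kernel side typically consumes such a
size-versioned struct (modeled on the sched_attr pattern; this assumes a
__u64 size field replacing part of the padding, and all names here are
illustrative):

  struct kvm_userspace_memory_region2 mem;
  u64 usize;
  int ret;

  /* First read only the size userspace claims to have passed. */
  if (get_user(usize, &u_region->size))
          return -EFAULT;

  /*
   * copy_struct_from_user() copes with both older (smaller) and newer
   * (larger) userspace layouts: missing tail fields read as zero, and
   * extra tail bytes the kernel doesn't know about must be zero.
   */
  ret = copy_struct_from_user(&mem, sizeof(mem), u_region, usize);
  if (ret)
          return ret;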

2023-07-28 16:37:30

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

On 7/25/23 14:51, Matthew Wilcox wrote:
> On Tue, Jul 25, 2023 at 01:24:03PM +0300, Kirill A . Shutemov wrote:
>> On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
>> > diff --git a/mm/compaction.c b/mm/compaction.c
>> > index dbc9f86b1934..a3d2b132df52 100644
>> > --- a/mm/compaction.c
>> > +++ b/mm/compaction.c
>> > @@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>> > if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
>> > goto isolate_fail_put;
>> >
>> > + /* The mapping truly isn't movable. */
>> > + if (mapping && mapping_unmovable(mapping))
>> > + goto isolate_fail_put;
>> > +
>>
>> I doubt that it is safe to dereference mapping here. I believe the folio
>> can be truncated from under us and the mapping freed with the inode.
>>
>> The folio has to be locked to dereference mapping safely (given that the
>> mapping is still tied to the folio).
>
> There's even a comment to that effect later on in the function:

Hmm, well spotted. But it wouldn't be so great if we now had to lock every
inspected page (and not just dirty pages), just to check the AS_ bit.

But I wonder if this is leftover from previous versions. Are the guest pages
even PageLRU currently? (and should they be, given how they can't be swapped
out or anything?) If not, isolate_migratepages_block will skip them anyway.

>
> /*
> * Only pages without mappings or that have a
> * ->migrate_folio callback are possible to migrate
> * without blocking. However, we can be racing with
> * truncation so it's necessary to lock the page
> * to stabilise the mapping as truncation holds
> * the page lock until after the page is removed
> * from the page cache.
> */
>
> (that could be reworded to make it clear how dangerous dereferencing
> ->mapping is without the lock ... and it does need to be changed to say
> "folio lock" instead of "page lock", so ...)

> How does this look?
>
> /*
> * Only folios without mappings or that have
> * a ->migrate_folio callback are possible to
> * migrate without blocking. However, we can
> * be racing with truncation, which can free
> * the mapping. Truncation holds the folio lock
> * until after the folio is removed from the page
> * cache so holding it ourselves is sufficient.
> */
>


2023-07-28 17:42:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

On 7/28/23 18:02, Vlastimil Babka wrote:
>> There's even a comment to that effect later on in the function:
> Hmm, well spotted. But it wouldn't be so great if we now had to lock every
> inspected page (and not just dirty pages), just to check the AS_ bit.
>
> But I wonder if this is leftover from previous versions. Are the guest pages
> even PageLRU currently? (and should they be, given how they can't be swapped
> out or anything?) If not, isolate_migratepages_block will skip them anyway.

No, they're not (migration or even swap-out is not excluded for the
future, but for now it's left as future work).

Paolo


2023-07-29 00:30:12

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v11 06/29] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

On Fri, Jul 28, 2023, Quentin Perret wrote:
> On Tuesday 18 Jul 2023 at 16:44:49 (-0700), Sean Christopherson wrote:
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> > };
> >
> > +/* for KVM_SET_USER_MEMORY_REGION2 */
> > +struct kvm_userspace_memory_region2 {
> > + __u32 slot;
> > + __u32 flags;
> > + __u64 guest_phys_addr;
> > + __u64 memory_size;
> > + __u64 userspace_addr;
> > + __u64 pad[16];
>
> Should we replace that pad[16] with:
>
> __u64 size;
>
> where 'size' is the size of the structure as seen by userspace? This is
> used in other UAPIs (see struct sched_attr for example) and is a bit
> more robust for future extensions (e.g. an 'old' kernel can correctly
> reject a newer version of the struct with additional fields it doesn't
> know about if that makes sense, etc).

"flags" serves that purpose, i.e. allows userspace to opt-in to having KVM actually
consume what is currently just padding.

The padding is there mainly to simplify kernel/KVM code, e.g. the number of bytes
that KVM needs to copy in is static.

But now that I think more on this, I don't know why we didn't just unconditionally
bump the size of kvm_userspace_memory_region. We tried to play games with unions
and overlays, but that was a mess[*].

KVM would need to do multiple uaccess reads, but that's not a big deal. Am I
missing something, or did past us just get too clever and miss the obvious solution?

[*] https://lkml.kernel.org/r/Y7xrtf9FCuYRYm1q%40google.com

2023-07-31 10:50:08

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH v11 06/29] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

On Friday 28 Jul 2023 at 17:03:33 (-0700), Sean Christopherson wrote:
> On Fri, Jul 28, 2023, Quentin Perret wrote:
> > On Tuesday 18 Jul 2023 at 16:44:49 (-0700), Sean Christopherson wrote:
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
> > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > };
> > >
> > > +/* for KVM_SET_USER_MEMORY_REGION2 */
> > > +struct kvm_userspace_memory_region2 {
> > > + __u32 slot;
> > > + __u32 flags;
> > > + __u64 guest_phys_addr;
> > > + __u64 memory_size;
> > > + __u64 userspace_addr;
> > > + __u64 pad[16];
> >
> > Should we replace that pad[16] with:
> >
> > __u64 size;
> >
> > where 'size' is the size of the structure as seen by userspace? This is
> > used in other UAPIs (see struct sched_attr for example) and is a bit
> > more robust for future extensions (e.g. an 'old' kernel can correctly
> > reject a newer version of the struct with additional fields it doesn't
> > know about if that makes sense, etc).
>
> "flags" serves that purpose, i.e. allows userspace to opt-in to having KVM actually
> consume what is currently just padding.

Sure, I've just grown to dislike static padding of that type -- it ends
up being either a waste of space, or too small, while the 'superior'
alternative (having a 'size' member) doesn't cost much and avoids those
problems.

But no strong opinion really, this struct really shouldn't grow much,
so I'm sure that'll be fine in practice.

> The padding is there mainly to simplify kernel/KVM code, e.g. the number of bytes
> that KVM needs to copy in is static.
>
> But now that I think more on this, I don't know why we didn't just unconditionally
> bump the size of kvm_userspace_memory_region. We tried to play games with unions
> and overlays, but that was a mess[*].
>
> KVM would need to do multiple uaccess reads, but that's not a big deal. Am I
> missing something, or did past us just get too clever and miss the obvious solution?
>
> [*] https://lkml.kernel.org/r/Y7xrtf9FCuYRYm1q%40google.com

Right, so the first uaccess would get_user() the flags; based on that
we'd figure out the size of the struct, copy_from_user() what we need,
and then sanity check that the flags are the same across both reads, or
something along those lines?

That doesn't sound too complicated to me, and as long as every extension
to the struct does come with a new flag I can't immediately see what
would go wrong.
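
A sketch of that double-read flow (the flag name is made up purely for
illustration; the series as posted keys off the ioctl number instead):

  struct kvm_userspace_memory_region2 mem;
  unsigned long size;
  u32 flags;

  if (get_user(flags, &u_mem->flags))
          return -EFAULT;

  if (flags & KVM_MEM_HYPOTHETICAL_EXT)   /* made-up flag */
          size = sizeof(struct kvm_userspace_memory_region2);
  else
          size = sizeof(struct kvm_userspace_memory_region);

  if (copy_from_user(&mem, u_mem, size))
          return -EFAULT;

  /* Guard against the flags changing between the two reads. */
  if (mem.flags != flags)
          return -EINVAL;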

2023-07-31 17:01:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v11 06/29] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

On 7/29/23 02:03, Sean Christopherson wrote:
> KVM would need to do multiple uaccess reads, but that's not a big
> deal. Am I missing something, or did past us just get too clever and
> miss the obvious solution?

You would have to introduce struct kvm_userspace_memory_region2 anyway,
though not a new ioctl, for two reasons:

1) the current size of the struct is part of the userspace API via the
KVM_SET_USER_MEMORY_REGION #define, so introducing a new struct is the
easiest way to preserve this

2) the struct can (at least theoretically) enter the ABI of a shared
library, and such mismatches are really hard to detect and resolve. So
it's better to add the padding to a new struct, and keep struct
kvm_userspace_memory_region backwards-compatible.
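
As background for (1): _IOW() encodes sizeof(type) into the ioctl number
itself, so growing the struct would silently change the value of the existing
KVM_SET_USER_MEMORY_REGION #define. From include/uapi/asm-generic/ioctl.h:

  #define _IOW(type,nr,size) \
          _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))

  /* the size argument lands in bits 16-29 of the resulting number */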


As to whether we should introduce a new ioctl: doing so makes
KVM_SET_USER_MEMORY_REGION's detection of bad flags a bit more robust;
it's not like we cannot introduce new flags at all, of course, but
having out-of-bounds reads as a side effect of new flags is a bit nasty.
Protecting programs from their own bugs gets into diminishing returns
very quickly, but introducing a new ioctl can make exploits a bit harder
when struct kvm_userspace_memory_region is on the stack and adjacent to
an attacker-controlled location.

Paolo


2023-08-03 11:39:55

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

On 7/26/23 13:20, Nikunj A. Dadhania wrote:
> Hi Sean,
>
> On 7/24/2023 10:30 PM, Sean Christopherson wrote:
>> On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
>>> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
>>>> This is the next iteration of implementing fd-based (instead of vma-based)
>>>> memory for KVM guests. If you want the full background of why we are doing
>>>> this, please go read the v10 cover letter[1].
>>>>
>>>> The biggest change from v10 is to implement the backing storage in KVM
>>>> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
>>>> See link[2] for details on why we pivoted to a KVM-specific approach.
>>>>
>>>> Key word is "biggest". Relative to v10, there are many big changes.
>>>> Highlights below (I can't remember everything that got changed at
>>>> this point).
>>>>
>>>> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
>>>> documentation. And ideally, we'll have even more tests before merging.
>>>> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>>>
>>> As per our discussion on the PUCK call, here are the memory/NUMA accounting
>>> related observations that I had while working on SNP guest secure page migration:
>>>
>>> * gmem allocations are currently treated as file page allocations
>>> accounted to the kernel and not to the QEMU process.
>>
>> We need to level set on terminology: these are all *stats*, not accounting. That
>> distinction matters because we have wiggle room on stats, e.g. we can probably get
>> away with just about any definition of how guest_memfd memory impacts stats, so
>> long as the information that is surfaced to userspace is useful and expected.
>>
>> But we absolutely need to get accounting correct, specifically the allocations
>> need to be correctly accounted in memcg. And unless I'm missing something,
>> nothing in here shows anything related to memcg.
>
> I tried out memcg after creating a separate cgroup for the qemu process. Guest
> memory is accounted in memcg.
>
> $ egrep -w "file|file_thp|unevictable" memory.stat
> file 42978775040
> file_thp 42949672960
> unevictable 42953588736
>
> NUMA allocations are coming from the right nodes, as set by numactl.
>
> $ egrep -w "file|file_thp|unevictable" memory.numa_stat
> file N0=0 N1=20480 N2=21489377280 N3=21489377280
> file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
> unevictable N0=0 N1=0 N2=21474697216 N3=21478891520
>
>>
>>> Starting an SNP guest with 40G of memory interleaved between
>>> Node2 and Node3:
>>>
>>> $ numactl -i 2,3 ./bootg_snp.sh
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 242179 root 20 0 40.4g 99580 51676 S 78.0 0.0 0:56.58 qemu-system-x86
>>>
>>> -> Incorrect process resident memory and shared memory are reported
>>
>> I don't know that I would call these "incorrect". Shared memory definitely is
>> correct, because by definition guest_memfd isn't shared. RSS is less clear cut;
>> gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
>> scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
>> assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
>> memslots).
>
> I am not sure why RSS would exceed VIRT; it should be at most 40G (assuming all the
> memory is private).
>
> As per my experiments with the hack below, MM_FILEPAGES does get accounted to RSS/SHR in top:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4339 root 20 0 40.4g 40.1g 40.1g S 76.7 16.0 0:13.83 qemu-system-x86
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f456f3b5049c..5b1f48a2e714 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -166,6 +166,7 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
> {
> trace_rss_stat(mm, member);
> }
> +EXPORT_SYMBOL(mm_trace_rss_stat);
>
> /*
> * Note: this doesn't free the actual pages themselves. That
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index a7e926af4255..e4f268bf9ce2 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -91,6 +91,10 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
> clear_highpage(folio_page(folio, i));
> }
>
> + /* Account only once for the first time */
> + if (!folio_test_dirty(folio))
> + add_mm_counter(current->mm, MM_FILEPAGES, folio_nr_pages(folio));

I think this alone would cause "Bad rss-counter" messages when the process
exits, because there's no corresponding decrement when page tables are torn
down. We would probably have to instantiate the page tables (i.e. with
PROT_NONE so userspace can't really do accesses through them) for this to
work properly.

So then it wouldn't technically be "unmapped private memory" anymore, but
effectively still would be. Maybe there would be more benefits, like the
mbind() working. But where would the PROT_NONE page tables be instantiated
if there's no page fault? During the ioctl? And is that perhaps too much (CPU)
work for little benefit? Maybe, but we could say it makes things simpler and
can be optimized later?

Anyway, IMHO it would be really great if the memory usage were attributable in
the usual way, without new ioctls or similar mechanisms. Each time some memory
appears "unaccounted" somewhere, it causes confusion.

> +
> folio_mark_accessed(folio);
> folio_mark_dirty(folio);
> folio_mark_uptodate(folio);
>
> We can update the rss_stat appropriately to get correct reporting in userspace.
>
>>> Accounting of the memory happens in the host page fault handler path,
>>> but for private guest pages we will never hit that.
>>>
>>> * NUMA allocation does use the process mempolicy for appropriate node
>>> allocation (Node2 and Node3), but they again do not get attributed to
>>> the QEMU process
>>>
>>> Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage" gomati: Mon Jul 24 11:51:34 2023
>>>
>>> Per-node process memory usage (in MBs)
>>> PID Node 0 Node 1 Node 2 Node 3 Total
>>> 242179 (qemu-system-x86) 21.14 1.61 39.44 39.38 101.57
>>>
>>> Per-node system memory usage (in MBs):
>>> Node 0 Node 1 Node 2 Node 3 Total
>>> FilePages 2475.63 2395.83 23999.46 23373.22 52244.14
>>>
>>>
>>> * Most of the memory accounting relies on VMAs, and as the private fd of
>>> gmem doesn't have a VMA (and that was the design goal), user-space fails
>>> to attribute the memory appropriately to the process.
>>>
>>> /proc/<qemu pid>/numa_maps
>>> 7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
>>> 7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
>>> 7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
>>> 7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4
>>>
>>> /proc/<qemu pid>/smaps
>>> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
>>> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
>>> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
>>> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025 /memfd:rom-backend-memfd-shared (deleted)
>>
>> This is all expected, and IMO correct. There are no userspace mappings, and so
>> not accounting anything is working as intended.
> That doesn't sound correct. If 10 SNP guests are running, each using 10GB, how would we know who is using 100GB of memory?
>
>>
>>> * QEMU-based NUMA bindings will not work. The memory backend uses mbind()
>>> to set the policy for a particular virtual memory range, but the gmem
>>> private FD does not have a virtual memory range visible in the host.
>>
>> Yes, adding a generic fbind() is the way to solve this.
>
> Regards,
> Nikunj
>
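
For illustration, a plausible shape for such an fbind(), i.e. an fd-based
analogue of mbind(2) for files that have no VMA in the host, might be the
following (purely hypothetical; no such syscall exists as of this thread):

        long fbind(int fd, unsigned long offset, unsigned long len,
                   unsigned long mode, const unsigned long *nodemask,
                   unsigned long maxnode);

The offset/len pair would take the place of mbind()'s virtual address range,
so a NUMA policy could be attached to a range of a gmem private fd directly.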


2023-08-07 23:42:38

by Ackerley Tng

[permalink] [raw]
Subject: Re: [RFC PATCH v11 28/29] KVM: selftests: Add basic selftest for guest_memfd()

Sean Christopherson <[email protected]> writes:

> Add a selftest to verify the basic functionality of guest_memfd():
>
> + file descriptor created with the guest_memfd() ioctl does not allow
> read/write/mmap operations
> + file size and block size as returned from fstat are as expected
> + fallocate on the fd verifies that offset/length for
> fallocate(FALLOC_FL_PUNCH_HOLE) must be page-aligned
>

> <snip>

> +
> +static void test_fallocate(int fd, size_t page_size, size_t total_size)
> +{
> + int ret;
> +
> + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
> + TEST_ASSERT(!ret, "fallocate with aligned offset and size should succeed");
> +
> + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
> + page_size - 1, page_size);
> + TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
> +
> + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
> + TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
> +
> + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size);
> + TEST_ASSERT(ret, "fallocate beginning at total_size should fail");

This should be

TEST_ASSERT(ret, "fallocate beginning after total_size should fail");

> +
> + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
> + total_size, page_size);
> + TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
> +
> + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
> + total_size + page_size, page_size);
> + TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should succeed");
> +
> + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
> + page_size, page_size - 1);
> + TEST_ASSERT(ret, "fallocate with unaligned size should fail");
> +
> + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
> + page_size, page_size);
> + TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size should succeed");
> +
> + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size);
> + TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
> +}

> <snip>

2023-08-08 02:07:27

by Ackerley Tng

[permalink] [raw]
Subject: Re: [RFC PATCH v11 28/29] KVM: selftests: Add basic selftest for guest_memfd()

Sean Christopherson <[email protected]> writes:

> Add a selftest to verify the basic functionality of guest_memfd():
>
> <snip>

Here's one more test:

From 72dc6836f01bdd613d64d4c6a4f2af8f2b777ba2 Mon Sep 17 00:00:00 2001
From: Ackerley Tng <[email protected]>
Date: Tue, 1 Aug 2023 18:02:50 +0000
Subject: [PATCH] KVM: selftests: Add tests - invalid inputs for
KVM_CREATE_GUEST_MEMFD

Test that invalid inputs for KVM_CREATE_GUEST_MEMFD, such as a
non-page-aligned size or invalid flags, are rejected by the
KVM_CREATE_GUEST_MEMFD ioctl with EINVAL.

Signed-off-by: Ackerley Tng <[email protected]>
---
tools/testing/selftests/kvm/guest_memfd_test.c | 17 +++++++++++++++++
.../selftests/kvm/include/kvm_util_base.h | 11 +++++++++--
2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index eb93c608a7e0..ad20f11b2d2c 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -90,6 +90,21 @@ static void test_fallocate(int fd, size_t page_size, size_t total_size)
TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
}

+static void test_create_guest_memfd_invalid(struct kvm_vm *vm, size_t page_size)
+{
+ int fd;
+
+ /* Non-page-aligned size */
+ fd = __vm_create_guest_memfd(vm, 1, 0);
+ ASSERT_EQ(fd, -1);
+ ASSERT_EQ(errno, EINVAL);
+
+ /* Invalid flags */
+ fd = __vm_create_guest_memfd(vm, page_size, 99);
+ ASSERT_EQ(fd, -1);
+ ASSERT_EQ(errno, EINVAL);
+}
+

int main(int argc, char *argv[])
{
@@ -103,6 +118,8 @@ int main(int argc, char *argv[])

vm = vm_create_barebones();

+ test_create_guest_memfd_invalid(vm, page_size);
+
fd = vm_create_guest_memfd(vm, total_size, 0);

test_file_read_write(fd);
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 39b38c75b99c..8bdfadd72349 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -474,7 +474,8 @@ static inline uint64_t vm_get_stat(struct kvm_vm *vm, const char *stat_name)
}

void vm_create_irqchip(struct kvm_vm *vm);
-static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+
+static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
uint64_t flags)
{
struct kvm_create_guest_memfd gmem = {
@@ -482,7 +483,13 @@ static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
.flags = flags,
};

- int fd = __vm_ioctl(vm, KVM_CREATE_GUEST_MEMFD, &gmem);
+ return __vm_ioctl(vm, KVM_CREATE_GUEST_MEMFD, &gmem);
+}
+
+static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+ uint64_t flags)
+{
+ int fd = __vm_create_guest_memfd(vm, size, flags);

TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_CREATE_GUEST_MEMFD, fd));
return fd;
--
2.41.0.640.ga95def55d0-goog


2023-08-22 03:04:42

by Ackerley Tng

[permalink] [raw]
Subject: Re: [RFC PATCH v11 28/29] KVM: selftests: Add basic selftest for guest_memfd()

Sean Christopherson <[email protected]> writes:

> On Mon, Aug 07, 2023, Ackerley Tng wrote:
>> Sean Christopherson <[email protected]> writes:
>>
>> > Add a selftest to verify the basic functionality of guest_memfd():
>> >
>> > <snip>
>>
>> Here's one more test:
>
> First off, thank you! I greatly appreciate all the selftests work you (and
> others!) have been doing.
>
> For v2, can you please post a standalone patch? My workflow barfs on unrelated,
> inlined patches. I'm guessing I can get b4 to play nice, but it's easier to just
> yell at people :-)
>

Here's a standalone patch :)
https://lore.kernel.org/lkml/[email protected]/

> <snip>

2023-09-03 13:47:15

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

On 7/25/23 14:51, Matthew Wilcox wrote:
> On Tue, Jul 25, 2023 at 01:24:03PM +0300, Kirill A . Shutemov wrote:
>> On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
>> > diff --git a/mm/compaction.c b/mm/compaction.c
>> > index dbc9f86b1934..a3d2b132df52 100644
>> > --- a/mm/compaction.c
>> > +++ b/mm/compaction.c
>> > @@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>> > if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
>> > goto isolate_fail_put;
>> >
>> > + /* The mapping truly isn't movable. */
>> > + if (mapping && mapping_unmovable(mapping))
>> > + goto isolate_fail_put;
>> > +
>>
>> I doubt that it is safe to dereference mapping here. I believe the folio
>> can be truncated from under us and the mapping freed with the inode.
>>
>> The folio has to be locked to dereference mapping safely (given that the
>> mapping is still tied to the folio).
>
> There's even a comment to that effect later on in the function:
>
> /*
> * Only pages without mappings or that have a
> * ->migrate_folio callback are possible to migrate
> * without blocking. However, we can be racing with
> * truncation so it's necessary to lock the page
> * to stabilise the mapping as truncation holds
> * the page lock until after the page is removed
> * from the page cache.
> */
>
> (that could be reworded to make it clear how dangerous dereferencing
> ->mapping is without the lock ... and it does need to be changed to say
> "folio lock" instead of "page lock", so ...)
>
> How does this look?
>
> /*
> * Only folios without mappings or that have
> * a ->migrate_folio callback are possible to
> * migrate without blocking. However, we can
> * be racing with truncation, which can free
> * the mapping. Truncation holds the folio lock
> * until after the folio is removed from the page
> * cache so holding it ourselves is sufficient.
> */

Incorporated into my attempt at a fix (posted separately, per the requested
process):

https://lore.kernel.org/all/[email protected]/
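
For context, the fix re-checks the mapping only after the folio lock is held.
A rough sketch of the resulting shape in isolate_migratepages_block()
(simplified for illustration; the label and surrounding logic are
approximations, not a verbatim excerpt of the posted patch):

        /*
         * ->mapping can only be dereferenced safely while the folio
         * lock is held: truncation holds the folio lock until the
         * folio has been removed from the page cache.
         */
        if (unlikely(!folio_trylock(folio)))
                goto isolate_fail_put;

        /* Re-read the mapping now that it cannot be freed under us. */
        mapping = folio_mapping(folio);
        if (mapping && mapping_unmovable(mapping)) {
                folio_unlock(folio);
                goto isolate_fail_put;
        }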