2022-10-25 16:21:25

by Chao Peng

Subject: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

This patch series implements KVM guest private memory for confidential
computing scenarios like Intel TDX[1]. If a TDX host accesses
TDX-protected guest memory, a machine check can occur and potentially
crash the running host system, which is disastrous for multi-tenant
configurations. The host accesses include those from KVM userspace like
QEMU. This series addresses the KVM-userspace-induced crash by
introducing new mm and KVM interfaces so KVM userspace can still manage
guest memory via a fd-based approach, but can never access the guest
memory content.

The patch series touches both core mm and KVM code. I would appreciate
it if Andrew/Hugh and Paolo/Sean could review and pick up these patches.
Any other reviews are always welcome.
- 01: mm change, target for mm tree
- 02-08: KVM change, target for KVM tree

Given that KVM is currently the only user of the mm part, I have
discussed this with Paolo and he is OK with merging the mm change
through the KVM tree, but reviewed-by/acked-by tags are still expected
from the mm people.

The patches have been verified in an Intel TDX environment, but Vishal
has done excellent work on the selftests[4] dedicated to this series,
making it possible to test it without special hardware or the elaborate
steps of building a VM environment. See the Test section below for more
info.


Introduction
============
KVM userspace being able to crash the host is unacceptable. Under the
current KVM architecture, all guest memory is inherently accessible from
KVM userspace and is therefore exposed to the crash issue mentioned
above. The goal of this series is to provide an aligned mm/KVM solution
that exposes guest memory in a way that is inaccessible to userspace.

Normally, KVM populates the secondary page table (e.g. EPT) using a host
virtual address (hva) resolved through the core mm page table (e.g. the
x86 userspace page table). This requires guest memory to be mmaped into
KVM userspace, which is also where the mentioned crash issue can
originate. In theory, apart from the 'shared' memory used for device
emulation etc., guest memory does not have to be mmaped into KVM
userspace at all.

This series introduces fd-based guest memory that is never mmaped into
KVM userspace. KVM populates the secondary page table using a fd/offset
pair backed by a memory filesystem. The fd can be created from a
supported memory filesystem like tmpfs/hugetlbfs, and KVM interacts with
it directly through a newly introduced in-kernel interface, thereby
removing KVM userspace from the path of accessing/mmaping the guest
memory.

Kirill had a patch [2] addressing the same issue in a different way. It
tracks guest encrypted memory at the 'struct page' level and relies on
HWPOISON to reject userspace access. The patch has been discussed in
several online and offline threads and resulted in a design document [3]
which is also the original proposal for this series. This patch series
later evolved as more comments were received from the community, but the
major concepts in [3] still hold true, so it remains recommended
reading.

The patch series may also be useful for other use cases; for example, a
pure software approach could use it to harden itself against
unintentional access to guest memory. This series is designed with such
use cases in mind, but it does not contain code that directly supports
them, so extensions may be needed.


mm change
=========
Introduces a new memfd_restricted system call which creates a memory
file that is restricted from ordinary userspace access via read(),
write(), mmap(), etc. The only way to use such a file is to pass the fd
to an in-kernel consumer like KVM, which accesses it through the newly
added restrictedmem kernel interface. The restrictedmem interface
bridges the memory file subsystems (tmpfs/hugetlbfs etc.) and their
users (KVM in this case) and provides bi-directional communication
between them.
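
As a rough illustration only, userspace creation of such a file could
look like the sketch below. The syscall number and the flags value are
placeholders here, not the values wired up by this series:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_memfd_restricted
    #define __NR_memfd_restricted 451	/* placeholder, not the real number */
    #endif

    int main(void)
    {
    	/* No flags are used in this sketch; pass 0. */
    	int fd = syscall(__NR_memfd_restricted, 0);

    	if (fd < 0) {
    		perror("memfd_restricted");
    		return 1;
    	}

    	/*
    	 * read()/write()/mmap() on this fd are expected to fail; the fd
    	 * only becomes useful once handed to an in-kernel consumer such
    	 * as KVM.
    	 */
    	return 0;
    }

The fd is then passed to KVM through the extended memslot described in
the next section.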


KVM change
==========
Extends the KVM memslot to provide guest private (encrypted) memory
from a fd. With this extension, a single memslot can maintain both
private memory, through a private fd (restricted_fd/restricted_offset),
and shared (unencrypted) memory, through a userspace-mmaped host virtual
address (userspace_addr). A particular guest page can only be either
private or shared, and only one of the shared/private parts of the
memslot is visible to the guest at any time. For how this new extension
is used in QEMU, please refer to kvm_set_phys_mem() in the TDX-enabled
QEMU repo below.
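
A minimal userspace sketch of setting up such a memslot is shown below.
The restricted_fd/restricted_offset fields and the KVM_MEM_PRIVATE flag
come from this series; the struct name and exact layout are assumptions
here (the authoritative definition is in patch 1), and gpa/size/
shared_va/restricted_fd/vm_fd are assumed to be set up elsewhere:

    struct kvm_userspace_memory_region_ext ext = {	/* name assumed */
    	.region = {
    		.slot = 0,
    		.flags = KVM_MEM_PRIVATE,
    		.guest_phys_addr = gpa,
    		.memory_size = size,
    		/* shared (unencrypted) part: normal host mmap */
    		.userspace_addr = (unsigned long)shared_va,
    	},
    	/* private part: fd returned by memfd_restricted() */
    	.restricted_fd = restricted_fd,
    	.restricted_offset = 0,
    };

    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);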

Introduces a new KVM_EXIT_MEMORY_FAULT exit to give userspace the chance
to make the decision on a shared <-> private memory conversion. The exit
can be triggered by an implicit conversion in the KVM page fault handler
or by an explicit conversion request from the guest OS.

Extends the existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
convert a guest page between private <-> shared. The data maintained by
these ioctls is the source of truth on whether a guest page is private
or shared, and this information is used by the KVM page fault handler to
decide whether the private or the shared part of the memslot is visible
to the guest.
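
Putting the last two pieces together, a VMM run loop might react to a
conversion request roughly as follows. This is a sketch only: the
kvm_run field and flag names for KVM_EXIT_MEMORY_FAULT are assumptions
(the authoritative definitions are in patch 2), while struct
kvm_enc_region and the two ioctls are the existing uAPI:

    struct kvm_enc_region region;

    if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
    	/* For private memory, addr/size describe a GPA range, not an HVA. */
    	region.addr = run->memory.gpa;		/* field names assumed */
    	region.size = run->memory.size;

    	if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)	/* assumed */
    		ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
    	else
    		ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &region);

    	/*
    	 * Re-enter the guest; KVM zaps the old mapping and faults in the
    	 * other (private or shared) half of the memslot.
    	 */
    }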


Test
====
Ran two kinds of tests:
- Selftests [4] from Vishal and VM boot tests in non-TDX environment
Code also in below repo: https://github.com/chao-p/linux/tree/privmem-v9

- Functional tests in a TDX-capable environment
Tested the new functionality in a TDX environment. Code repos:
Linux: https://github.com/chao-p/linux/tree/privmem-v9-tdx
QEMU: https://github.com/chao-p/qemu/tree/privmem-v9

An example QEMU command line for TDX test:
-object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
-machine confidential-guest-support=tdx \
-object memory-backend-memfd-private,id=ram1,size=${mem} \
-machine memory-backend=ram1


TODO
====
- Page accounting and limiting for encrypted memory
- hugetlbfs support


Changelog
=========
v9:
- mm: move inaccessible memfd into a separate syscall.
- mm: return page instead of pfn_t for inaccessible_get_pfn and remove
inaccessible_put_pfn.
- KVM: rename inaccessible/private to restricted and CONFIG change to
make the code friendly to pKVM.
- KVM: add invalidate_begin/end pair to fix race contention and revise
the lock protection for invalidation path.
- KVM: optimize setting lpage_info for levels > 2M by directly accessing
the lower level's result.
- KVM: avoid loading the xarray in kvm_mmu_max_mapping_level() and
instead let the caller pass in is_private.
- KVM: API doc improvement.
v8:
- mm: redesign the mm part by introducing a shim layer (inaccessible_memfd)
in memfd to avoid touching the memory filesystems directly.
- mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and causes
confusion in this series; it will be sent out separately.
- doc: exclude the man page change; it's not a kernel patch and will be
sent out separately.
- KVM: adapt to use the new mm inaccessible_memfd interface.
- KVM: update lpage_info when setting mem_attr_array to support
large page.
- KVM: change from xa_store_range to xa_store for mem_attr_array because
xa_store_range overrides all entries, which is not the intended
behavior for us.
- KVM: refine the mmu_invalidate_retry_gfn mechanism for private page.
- KVM: reorganize KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and private page
handling code suggested by Sean.
v7:
- mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
- KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
private/shared info.
- KVM: use similar sync mechanism between zap/page fault paths as
mmu_notifier for memfile_notifier based invalidation.
v6:
- mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
feature consistency among all memfile_notifier users and to get rid of
internal flags like SHM_F_INACCESSIBLE.
- mm: make pfn_ops callbacks being members of memfile_backing_store
and then refer to it directly in memfile_notifier.
- mm: remove backing store unregister.
- mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
- KVM: reorganize patch sequence for page fault handling and private
memory enabling.
v5:
- Add a man page for the MFD_INACCESSIBLE flag and improve the KVM API
doc for the new memslot extensions.
- mm: introduce memfile_{un}register_backing_store to allow a memory
backing store to register/unregister itself with memfile_notifier.
- mm: remove F_SEAL_INACCESSIBLE, use in-kernel flag
(SHM_F_INACCESSIBLE for shmem) instead.
- mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
MFD_INACCESSIBLE memory.
- KVM: remove the overlap check for mapping the same file+offset into
multiple gfns for performance reasons; this is warned about in the
documentation.
v4:
- mm: rename memfd_ops to memfile_notifier and separate it from
memfd.c to standalone memfile-notifier.c.
- KVM: move pfn_ops to per-memslot scope from per-vm scope and allow
registering multiple memslots to the same memory backing store.
- KVM: add a 'kvm' reference in memslot so that we can recover kvm in
memfile_notifier handlers.
- KVM: add 'private_' prefix for the new fields in memslot.
- KVM: reshape the 'type' to 'flag' for kvm_memory_exit
v3:
- Remove 'RFC' prefix.
- Fix race condition between memfile_notifier handlers and kvm destroy.
- mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
setting F_SEAL_INACCESSIBLE when the fd is created.
- KVM: add the shared part of the memslot back to make private/shared
pages live in one memslot.

Reference
=========
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/[email protected]/T/
[3] Original design proposal:
https://lore.kernel.org/all/[email protected]/
[4] Selftest:
https://lore.kernel.org/all/[email protected]/


Chao Peng (7):
KVM: Extend the memslot to support fd-based private memory
KVM: Add KVM_EXIT_MEMORY_FAULT exit
KVM: Use gfn instead of hva for mmu_notifier_retry
KVM: Register/unregister the guest private memory regions
KVM: Update lpage info when private/shared memory are mixed
KVM: Handle page fault for private memory
KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
mm: Introduce memfd_restricted system call to create restricted user
memory

Documentation/virt/kvm/api.rst | 88 ++++-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/kvm_host.h | 8 +
arch/x86/kvm/Kconfig | 3 +
arch/x86/kvm/mmu/mmu.c | 170 +++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 14 +-
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/spte.h | 6 +
arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
arch/x86/kvm/x86.c | 4 +-
include/linux/kvm_host.h | 89 ++++-
include/linux/restrictedmem.h | 62 ++++
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/kvm.h | 38 +++
include/uapi/linux/magic.h | 1 +
kernel/sys_ni.c | 3 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/restrictedmem.c | 250 ++++++++++++++
virt/kvm/Kconfig | 7 +
virt/kvm/kvm_main.c | 453 +++++++++++++++++++++----
23 files changed, 1121 insertions(+), 92 deletions(-)
create mode 100644 include/linux/restrictedmem.h
create mode 100644 mm/restrictedmem.c


base-commit: e18d6152ff0f41b7f01f9817372022df04e0d354
--
2.25.1



2022-10-25 16:24:35

by Chao Peng

Subject: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

Introduce generic private memory register/unregister by reusing the
existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. They differ from
the SEV case by treating the address in the region as a gpa instead of
an hva. Which path these ioctls take is determined by
kvm_arch_has_private_mem(); architectures which support KVM_PRIVATE_MEM
should override this function.

KVM internally defaults all guest memory to private and maintains the
shared memory in 'mem_attr_array'. The above ioctls operate on this
field and unmap existing mappings, if any.

Signed-off-by: Chao Peng <[email protected]>
---
Documentation/virt/kvm/api.rst | 17 ++-
arch/x86/kvm/Kconfig | 1 +
include/linux/kvm_host.h | 10 +-
virt/kvm/Kconfig | 4 +
virt/kvm/kvm_main.c | 227 +++++++++++++++++++++++++--------
5 files changed, 198 insertions(+), 61 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 975688912b8c..08253cf498d1 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
This ioctl can be used to register a guest memory region which may
contain encrypted data (e.g. guest RAM, SMRAM etc).

-It is used in the SEV-enabled guest. When encryption is enabled, a guest
-memory region may contain encrypted data. The SEV memory encryption
-engine uses a tweak such that two identical plaintext pages, each at
-different locations will have differing ciphertexts. So swapping or
+Currently this ioctl supports registering memory regions for two usages:
+private memory and SEV-encrypted memory.
+
+When private memory is enabled, this ioctl is used to register guest private
+memory region and the addr/size of kvm_enc_region represents guest physical
+address (GPA). In this usage, this ioctl zaps the existing guest memory
+mappings in KVM that fallen into the region.
+
+When SEV-encrypted memory is enabled, this ioctl is used to register guest
+memory region which may contain encrypted data for a SEV-enabled guest. The
+addr/size of kvm_enc_region represents userspace address (HVA). The SEV
+memory encryption engine uses a tweak such that two identical plaintext pages,
+each at different locations will have differing ciphertexts. So swapping or
moving ciphertext of those pages will not result in plaintext being
swapped. So relocating (or migrating) physical backing pages for the SEV
guest will require some additional steps.
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8d2bd455c0cd..73fdfa429b20 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -51,6 +51,7 @@ config KVM
select HAVE_KVM_PM_NOTIFIER if PM
select HAVE_KVM_RESTRICTED_MEM if X86_64
select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
+ select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 79e5cbc35fcf..4ce98fa0153c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif

-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -254,6 +255,9 @@ struct kvm_gfn_range {
bool may_block;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+#endif
+
+#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
@@ -794,6 +798,9 @@ struct kvm {
struct notifier_block pm_notifier;
#endif
char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+ struct xarray mem_attr_array;
+#endif
};

#define kvm_err(fmt, ...) \
@@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
int kvm_arch_post_init_vm(struct kvm *kvm);
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_has_private_mem(struct kvm *kvm);

#ifndef __KVM_HAVE_ARCH_VM_ALLOC
/*
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 9ff164c7e0cc..69ca59e82149 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER

config HAVE_KVM_RESTRICTED_MEM
bool
+
+config KVM_GENERIC_PRIVATE_MEM
+ bool
+ depends on HAVE_KVM_RESTRICTED_MEM
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 09c9cdeb773c..fc3835826ace 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
}
EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);

+static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
+ gfn_t end)
+{
+ if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+ kvm->mmu_invalidate_range_start = start;
+ kvm->mmu_invalidate_range_end = end;
+ } else {
+ /*
+ * Fully tracking multiple concurrent ranges has diminishing
+ * returns. Keep things simple and just find the minimal range
+ * which includes the current and new ranges. As there won't be
+ * enough information to subtract a range after its invalidate
+ * completes, any ranges invalidated concurrently will
+ * accumulate and persist until all outstanding invalidates
+ * complete.
+ */
+ kvm->mmu_invalidate_range_start =
+ min(kvm->mmu_invalidate_range_start, start);
+ kvm->mmu_invalidate_range_end =
+ max(kvm->mmu_invalidate_range_end, end);
+ }
+}
+
+static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ /*
+ * The count increase must become visible at unlock time as no
+ * spte can be established without taking the mmu_lock and
+ * count is also read inside the mmu_lock critical section.
+ */
+ kvm->mmu_invalidate_in_progress++;
+}
+
+void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ mark_invalidate_in_progress(kvm, start, end);
+ update_invalidate_range(kvm, start, end);
+}
+
+void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ /*
+ * This sequence increase will notify the kvm page fault that
+ * the page that is going to be mapped in the spte could have
+ * been freed.
+ */
+ kvm->mmu_invalidate_seq++;
+ smp_wmb();
+ /*
+ * The above sequence increase must be visible before the
+ * below count decrease, which is ensured by the smp_wmb above
+ * in conjunction with the smp_rmb in mmu_invalidate_retry().
+ */
+ kvm->mmu_invalidate_in_progress--;
+}
+
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
{
@@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
}

-static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
- gfn_t end)
-{
- if (likely(kvm->mmu_invalidate_in_progress == 1)) {
- kvm->mmu_invalidate_range_start = start;
- kvm->mmu_invalidate_range_end = end;
- } else {
- /*
- * Fully tracking multiple concurrent ranges has diminishing
- * returns. Keep things simple and just find the minimal range
- * which includes the current and new ranges. As there won't be
- * enough information to subtract a range after its invalidate
- * completes, any ranges invalidated concurrently will
- * accumulate and persist until all outstanding invalidates
- * complete.
- */
- kvm->mmu_invalidate_range_start =
- min(kvm->mmu_invalidate_range_start, start);
- kvm->mmu_invalidate_range_end =
- max(kvm->mmu_invalidate_range_end, end);
- }
-}
-
-static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
-{
- /*
- * The count increase must become visible at unlock time as no
- * spte can be established without taking the mmu_lock and
- * count is also read inside the mmu_lock critical section.
- */
- kvm->mmu_invalidate_in_progress++;
-}
-
static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
update_invalidate_range(kvm, range->start, range->end);
return kvm_unmap_gfn_range(kvm, range);
}

-void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
-{
- mark_invalidate_in_progress(kvm, start, end);
- update_invalidate_range(kvm, start, end);
-}
-
static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
@@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
return 0;
}

-void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
-{
- /*
- * This sequence increase will notify the kvm page fault that
- * the page that is going to be mapped in the spte could have
- * been freed.
- */
- kvm->mmu_invalidate_seq++;
- smp_wmb();
- /*
- * The above sequence increase must be visible before the
- * below count decrease, which is ensured by the smp_wmb above
- * in conjunction with the smp_rmb in mmu_invalidate_retry().
- */
- kvm->mmu_invalidate_in_progress--;
-}
-
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
@@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)

#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+
+static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_gfn_range gfn_range;
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ int i;
+ int r = 0;
+
+ gfn_range.pte = __pte(0);
+ gfn_range.may_block = true;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+ slot = iter.slot;
+ gfn_range.start = max(start, slot->base_gfn);
+ gfn_range.end = min(end, slot->base_gfn + slot->npages);
+ if (gfn_range.start >= gfn_range.end)
+ continue;
+ gfn_range.slot = slot;
+
+ r |= kvm_unmap_gfn_range(kvm, &gfn_range);
+ }
+ }
+
+ if (r)
+ kvm_flush_remote_tlbs(kvm);
+}
+
+#define KVM_MEM_ATTR_SHARED 0x0001
+static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
+ bool is_private)
+{
+ gfn_t start, end;
+ unsigned long i;
+ void *entry;
+ int idx;
+ int r = 0;
+
+ if (size == 0 || gpa + size < gpa)
+ return -EINVAL;
+ if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
+ return -EINVAL;
+
+ start = gpa >> PAGE_SHIFT;
+ end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
+
+ /*
+ * Guest memory defaults to private, kvm->mem_attr_array only stores
+ * shared memory.
+ */
+ entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
+
+ idx = srcu_read_lock(&kvm->srcu);
+ KVM_MMU_LOCK(kvm);
+ kvm_mmu_invalidate_begin(kvm, start, end);
+
+ for (i = start; i < end; i++) {
+ r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
+ GFP_KERNEL_ACCOUNT));
+ if (r)
+ goto err;
+ }
+
+ kvm_unmap_mem_range(kvm, start, end);
+
+ goto ret;
+err:
+ for (; i > start; i--)
+ xa_erase(&kvm->mem_attr_array, i);
+ret:
+ kvm_mmu_invalidate_end(kvm, start, end);
+ KVM_MMU_UNLOCK(kvm);
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ return r;
+}
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
+
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
static int kvm_pm_notifier_call(struct notifier_block *bl,
unsigned long state,
@@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
spin_lock_init(&kvm->mn_invalidate_lock);
rcuwait_init(&kvm->mn_memslots_update_rcuwait);
xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+ xa_init(&kvm->mem_attr_array);
+#endif

INIT_LIST_HEAD(&kvm->gpc_list);
spin_lock_init(&kvm->gpc_lock);
@@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
}
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+ xa_destroy(&kvm->mem_attr_array);
+#endif
cleanup_srcu_struct(&kvm->irq_srcu);
cleanup_srcu_struct(&kvm->srcu);
kvm_arch_free_vm(kvm);
@@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}

+bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
+{
+ return false;
+}
+
static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
break;
}
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+ case KVM_MEMORY_ENCRYPT_REG_REGION:
+ case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+ struct kvm_enc_region region;
+ bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
+
+ if (!kvm_arch_has_private_mem(kvm))
+ goto arch_vm_ioctl;
+
+ r = -EFAULT;
+ if (copy_from_user(&region, argp, sizeof(region)))
+ goto out;
+
+ r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
+ region.size, set);
+ break;
+ }
+#endif
case KVM_GET_DIRTY_LOG: {
struct kvm_dirty_log log;

@@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_get_stats_fd(kvm);
break;
default:
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+arch_vm_ioctl:
+#endif
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
}
out:
--
2.25.1


2022-10-27 10:35:36

by Fuad Tabba

Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

Hi,

On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <[email protected]> wrote:
>
> Introduce generic private memory register/unregister by reusing existing
> SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from SEV case
> by treating address in the region as gpa instead of hva. Which cases
> should these ioctls go is determined by the kvm_arch_has_private_mem().
> Architecture which supports KVM_PRIVATE_MEM should override this function.
>
> KVM internally defaults all guest memory as private memory and maintain
> the shared memory in 'mem_attr_array'. The above ioctls operate on this
> field and unmap existing mappings if any.
>
> Signed-off-by: Chao Peng <[email protected]>
> ---

Reviewed-by: Fuad Tabba <[email protected]>

Cheers,
/fuad



2022-11-03 12:38:48

by Vishal Annapurve

Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <[email protected]> wrote:
>
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.
>
> The patch series touches both core mm and KVM code. I appreciate
> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> reviews are always welcome.
> - 01: mm change, target for mm tree
> - 02-08: KVM change, target for KVM tree
>
> Given KVM is the only current user for the mm part, I have chatted with
> Paolo and he is OK to merge the mm change through KVM tree, but
> reviewed-by/acked-by is still expected from the mm people.
>
> The patches have been verified in Intel TDX environment, but Vishal has
> done an excellent work on the selftests[4] which are dedicated for this
> series, making it possible to test this series without innovative
> hardware and fancy steps of building a VM environment. See Test section
> below for more info.
>
>
> Introduction
> ============
> KVM userspace being able to crash the host is horrible. Under current
> KVM architecture, all guest memory is inherently accessible from KVM
> userspace and is exposed to the mentioned crash issue. The goal of this
> series is to provide a solution to align mm and KVM, on a userspace
> inaccessible approach of exposing guest memory.
>
> Normally, KVM populates secondary page table (e.g. EPT) by using a host
> virtual address (hva) from core mm page table (e.g. x86 userspace page
> table). This requires guest memory being mmaped into KVM userspace, but
> this is also the source where the mentioned crash issue can happen. In
> theory, apart from those 'shared' memory for device emulation etc, guest
> memory doesn't have to be mmaped into KVM userspace.
>
> This series introduces fd-based guest memory which will not be mmaped
> into KVM userspace. KVM populates secondary page table by using a

With no mappings in place for the userspace VMM, IIUC, it looks like the
host kernel will not be able to find the culprit userspace process in
case of a machine check error on guest private memory. As implemented in
hwpoison_user_mappings(), the host kernel tries to look at the processes
which have mapped the pfns with the hardware error.

Is a modification needed in the MCE handling logic of the host kernel to
immediately send a signal to the vcpu thread accessing the faulting pfn
backing guest private memory?


> fd/offset pair backed by a memory file system. The fd can be created
> from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> directly interact with them with newly introduced in-kernel interface,
> therefore remove the KVM userspace from the path of accessing/mmaping
> the guest memory.

2022-11-03 23:28:11

by Sean Christopherson

Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

On Tue, Oct 25, 2022, Chao Peng wrote:
> @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> break;
> }
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> + case KVM_MEMORY_ENCRYPT_REG_REGION:
> + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {

I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Aside
from the fact that restricted/protected memory may not be encrypted, there are
other potential use cases for per-page memory attributes[*], e.g. to make memory
read-only (or no-exec, or exec-only, etc...) without having to modify memslots.

Any paravirt use case where the attributes of a page are effectively dictated by
the guest is going to run into the exact same performance problems with memslots,
which isn't surprising in hindsight since shared vs. private is really just an
attribute, albeit with extra special semantics.

And if we go with a brand new ioctl(), maybe someday in the very distant future
we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.

Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big
of a wrench into things.

Something like:

KVM_SET_MEMORY_ATTRIBUTES

struct kvm_memory_attributes {
__u64 address;
__u64 size;
__u64 flags;
}

[*] https://lore.kernel.org/all/Y1a1i9vbJ%[email protected]

> + struct kvm_enc_region region;
> + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> + if (!kvm_arch_has_private_mem(kvm))
> + goto arch_vm_ioctl;
> +
> + r = -EFAULT;
> + if (copy_from_user(&region, argp, sizeof(region)))
> + goto out;
> +
> + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> + region.size, set);
> + break;
> + }
> +#endif

2022-11-04 09:29:40

by Chao Peng

Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Chao Peng wrote:
> > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > break;
> > }
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > + case KVM_MEMORY_ENCRYPT_REG_REGION:
> > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
>
> I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Aside
> from the fact that restricted/protected memory may not be encrypted, there are
> other potential use cases for per-page memory attributes[*], e.g. to make memory
> read-only (or no-exec, or exec-only, etc...) without having to modify memslots.
>
> Any paravirt use case where the attributes of a page are effectively dictated by
> the guest is going to run into the exact same performance problems with memslots,
> which isn't suprising in hindsight since shared vs. private is really just an
> attribute, albeit with extra special semantics.
>
> And if we go with a brand new ioctl(), maybe someday in the very distant future
> we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
>
> Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big
> of a wrench into things.
>
> Something like:
>
> KVM_SET_MEMORY_ATTRIBUTES
>
> struct kvm_memory_attributes {
> __u64 address;
> __u64 size;
> __u64 flags;
> }

I like the idea of adding a new ioctl(). But putting all attributes into
a single flags field in the uAPI does not sound good to me; e.g. forcing
userspace to set all attributes in one call can cause pain for
userspace, and probably for the KVM implementation as well. For
private<->shared memory conversion, we actually only care about the
KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE bit, but we would force
userspace to set other, irrelevant bits as well if we used this API.

I looked at kvm_device_attr; it sounds like we can do something similar:

KVM_SET_MEMORY_ATTR

struct kvm_memory_attr {
__u64 address;
__u64 size;
#define KVM_MEM_ATTR_SHARED BIT(0)
#define KVM_MEM_ATTR_READONLY BIT(1)
#define KVM_MEM_ATTR_NOEXEC BIT(2)
__u32 attr;
__u32 pad;
}

I'm not sure whether we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as
well, but it sounds like we need a KVM_UNSET_MEMORY_ATTR.

Since we are exposing the attribute directly to userspace, I also think
we'd better treat shared memory as the default, so that even when
private memory is not used the bit is still meaningful. So define BIT(0)
as KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED.

Thanks,
Chao

>
> [*] https://lore.kernel.org/all/Y1a1i9vbJ%[email protected]
>
> > + struct kvm_enc_region region;
> > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > +
> > + if (!kvm_arch_has_private_mem(kvm))
> > + goto arch_vm_ioctl;
> > +
> > + r = -EFAULT;
> > + if (copy_from_user(&region, argp, sizeof(region)))
> > + goto out;
> > +
> > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > + region.size, set);
> > + break;
> > + }
> > +#endif

2022-11-04 21:53:26

by Sean Christopherson

Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

Paolo, any thoughts before I lead things further astray?

On Fri, Nov 04, 2022, Chao Peng wrote:
> On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote:
> > On Tue, Oct 25, 2022, Chao Peng wrote:
> > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > break;
> > > }
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > + case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> >
> > I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Aside
> > from the fact that restricted/protected memory may not be encrypted, there are
> > other potential use cases for per-page memory attributes[*], e.g. to make memory
> > read-only (or no-exec, or exec-only, etc...) without having to modify memslots.
> >
> > Any paravirt use case where the attributes of a page are effectively dictated by
> > the guest is going to run into the exact same performance problems with memslots,
> > which isn't suprising in hindsight since shared vs. private is really just an
> > attribute, albeit with extra special semantics.
> >
> > And if we go with a brand new ioctl(), maybe someday in the very distant future
> > we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
> >
> > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big
> > of a wrench into things.
> >
> > Something like:
> >
> > KVM_SET_MEMORY_ATTRIBUTES
> >
> > struct kvm_memory_attributes {
> > __u64 address;
> > __u64 size;
> > __u64 flags;

Oh, this is half-baked. I lost track of which flags were which. What I intended
was a separate, initially-unused flags field, e.g.

struct kvm_memory_attributes {
__u64 address;
__u64 size;
__u64 attributes;
__u64 flags;
}

so that KVM can tweak behavior and/or extend the effective size of the struct.
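
As a purely illustrative sketch of what that buys us (none of this is
existing code, and kvm_supported_mem_attributes() is made up), the handler
could start out by rejecting any use of the extra field and any unknown
attribute bits:

	if (attrs->flags)
		return -EINVAL;	/* no flags defined yet */

	if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
		return -EINVAL;	/* unknown/unsupported attribute bits */

so a future flag can change the interpretation later without breaking
existing userspace.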

> I like the idea of adding a new ioctl(). But putting all attributes into
> a flags in uAPI sounds not good to me, e.g. forcing userspace to set all
> attributes in one call can cause pain for userspace, probably for KVM
> implementation as well. For private<->shared memory conversion, we
> actually only care the KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE bit,

Not necessarily, e.g. I can see pKVM wanting to convert from RW+PRIVATE => RO+SHARED
or even RW+PRIVATE => NONE+SHARED so that the guest can't write/access the memory
while it's accessible from the host.

And if this does extend beyond shared/private, dropping from RWX=>R, i.e. dropping
WX permissions, would also be a common operation.

Hmm, typing that out makes me think that if we do end up supporting other "attributes",
i.e. protections, we should go straight to full RWX protections instead of doing
things piecemeal, i.e. add individual protections instead of combinations like
NO_EXEC and READ_ONLY. The protections would have to be inverted for backwards
compatibility, but that's easy enough to handle. The semantics could be like
protection keys, which also have inverted permissions, where the final protections
are the combination of memslot+attributes, i.e. a read-only memslot couldn't be made
writable via attributes.

E.g. userspace could do "NO_READ | NO_WRITE | NO_EXEC" to temporarily block access
to memory without needing to delete the memslot. KVM would need to disallow
unsupported combinations, e.g. disallowed effective protections would be:

- W or WX [unless there's an arch that supports write-only memory]
- R or RW [until KVM plumbs through support for no-exec, or it's unsupported in hardware]
- X [until KVM plumbs through support for exec-only, or it's unsupported in hardware]
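
Purely as a sketch of that validation (the KVM_MEM_ATTR_NO_* names and the
helper are hypothetical, following the inverted-permissions idea above):

	static bool kvm_mem_attrs_valid(u64 attrs)
	{
		bool r = !(attrs & KVM_MEM_ATTR_NO_READ);
		bool w = !(attrs & KVM_MEM_ATTR_NO_WRITE);
		bool x = !(attrs & KVM_MEM_ATTR_NO_EXEC);

		if (w && !r)		/* W or WX: write-only */
			return false;
		if (r && !x)		/* R or RW: no-exec not supported */
			return false;
		if (x && !r && !w)	/* X: exec-only not supported */
			return false;

		return true;	/* RX, RWX, or fully blocked are all fine */
	}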

Anyways, that's all future work...

> but we force userspace to set other irrelevant bits as well if use this
> API.

They aren't irrelevant though, as the memory attributes are all describing the
allowed protections for a given page. If there's a use case where userspace "can't"
keep track of the attributes for whatever reason, then userspace could do a RMW
to set/clear attributes. Alternatively, the ioctl() could take an "operation" and
support WRITE/OR/AND to allow setting/clearing individual flags, e.g. tweak the
above to be:

struct kvm_memory_attributes {
__u64 address;
__u64 size;
__u64 attributes;
__u32 operation;
__u32 flags;
}
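
The operation could then be handled along these lines (again just a sketch,
none of the KVM_MEM_ATTR_OP_* names exist; old_attrs is the current value
for the page, new_attrs is what gets stored back):

	switch (args->operation) {
	case KVM_MEM_ATTR_OP_WRITE:	/* replace outright */
		new_attrs = args->attributes;
		break;
	case KVM_MEM_ATTR_OP_OR:	/* set the given bits */
		new_attrs = old_attrs | args->attributes;
		break;
	case KVM_MEM_ATTR_OP_AND:	/* clear bits via a mask */
		new_attrs = old_attrs & args->attributes;
		break;
	default:
		return -EINVAL;
	}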

> I looked at kvm_device_attr, sounds we can do similar:

The device attributes deal with isolated, arbitrary values, whereas memory attributes
are flags, i.e. devices are 1:1 whereas memory is 1:MANY. There is no "unset" for
device attributes, because they aren't flags. Device attributes vs. memory attributes
really are two very different things that just happen to use a common name.

If it helped clarify things without creating naming problems, we could even use
PROTECTIONS instead of ATTRIBUTES.

> KVM_SET_MEMORY_ATTR
>
> struct kvm_memory_attr {
> __u64 address;
> __u64 size;
> #define KVM_MEM_ATTR_SHARED BIT(0)
> #define KVM_MEM_ATTR_READONLY BIT(1)
> #define KVM_MEM_ATTR_NOEXEC BIT(2)
> __u32 attr;

As above, letting userspace set only a single attribute would prevent setting
(or clearing) multiple attributes in a single ioctl().

> __u32 pad;
> }
>
> I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well,

Definitely would need to communicate to userspace that various attributes are
supported. That doesn't necessarily require a common ioctl(), but I don't see
any reason not to add a common helper, and adding a common helper would mean
KVM_CAP_PRIVATE_MEM can go away. But it should return a bitmask so that userspace
can do a single query to get all supported attributes, e.g. KVM_SUPPORTED_MEMORY_ATTRIBUTES.
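
E.g. userspace could do something like this up front (sketch only; whether
the query ends up as a dedicated ioctl or a capability is exactly the open
question, and none of the names are final):

	static bool have_private_attr(int vm_fd)
	{
		__u64 supported = 0;

		/* hypothetical query returning a bitmask of supported attributes */
		if (ioctl(vm_fd, KVM_SUPPORTED_MEMORY_ATTRIBUTES, &supported))
			return false;

		return supported & KVM_MEMORY_ATTRIBUTE_PRIVATE;
	}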

As for KVM_GET_MEMORY_ATTRIBUTES, we wouldn't necessarily have to provide such an
API, e.g. we could hold off until someone came along with a RMW use case (as above).
That said, debug would likely be a nightmare without KVM_GET_MEMORY_ATTRIBUTES,
so it's probably best to add it straightaway.

> but sounds like we need a KVM_UNSET_MEMORY_ATTR.

No need if the setter operates on all attributes.

> Since we are exposing the attribute directly to userspace I also think
> we'd better treat shared memory as the default, so even when the private
> memory is not used, the bit can still be meaningful. So define BIT(0) as
> KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED.

Ah, right.

2022-11-08 00:47:16

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

On Thu, Nov 03, 2022 at 05:43:52PM +0530,
Vishal Annapurve <[email protected]> wrote:

> On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <[email protected]> wrote:
> >
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> > - 01: mm change, target for mm tree
> > - 02-08: KVM change, target for KVM tree
> >
> > Given KVM is the only current user for the mm part, I have chatted with
> > Paolo and he is OK to merge the mm change through KVM tree, but
> > reviewed-by/acked-by is still expected from the mm people.
> >
> > The patches have been verified in Intel TDX environment, but Vishal has
> > done an excellent work on the selftests[4] which are dedicated for this
> > series, making it possible to test this series without innovative
> > hardware and fancy steps of building a VM environment. See Test section
> > below for more info.
> >
> >
> > Introduction
> > ============
> > KVM userspace being able to crash the host is horrible. Under current
> > KVM architecture, all guest memory is inherently accessible from KVM
> > userspace and is exposed to the mentioned crash issue. The goal of this
> > series is to provide a solution to align mm and KVM, on a userspace
> > inaccessible approach of exposing guest memory.
> >
> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > table). This requires guest memory being mmaped into KVM userspace, but
> > this is also the source where the mentioned crash issue can happen. In
> > theory, apart from those 'shared' memory for device emulation etc, guest
> > memory doesn't have to be mmaped into KVM userspace.
> >
> > This series introduces fd-based guest memory which will not be mmaped
> > into KVM userspace. KVM populates secondary page table by using a
>
> With no mappings in place for userspace VMM, IIUC, looks like the host
> kernel will not be able to find the culprit userspace process in case
> of Machine check error on guest private memory. As implemented in
> hwpoison_user_mappings, host kernel tries to look at the processes
> which have mapped the pfns with hardware error.
>
> Is there a modification needed in mce handling logic of the host
> kernel to immediately send a signal to the vcpu thread accessing
> faulting pfn backing guest private memory?

mce_register_decode_chain() can be used. The MCE physical address
(p->mce_addr) includes the host key id (hkid) in addition to the real
physical address. By looking up the hkids in use by KVM, we can determine
whether the page is assigned to a guest TD or not. If yes, send SIGBUS.
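
Very roughly (only mce_register_decode_chain() and the notifier plumbing are
existing kernel interfaces; the hkid lookup and the SIGBUS helper below are
hypothetical):

	static int tdx_mce_notify(struct notifier_block *nb, unsigned long val,
				  void *data)
	{
		struct mce *m = data;

		/* hypothetical: is the hkid encoded in m->addr one used by a TD? */
		if (!tdx_hkid_is_td(tdx_hkid_from_paddr(m->addr)))
			return NOTIFY_DONE;

		/* hypothetical: SIGBUS the vCPU thread that touched the page */
		tdx_handle_mce_on_private_mem(m);
		return NOTIFY_STOP;
	}

	static struct notifier_block tdx_mce_nb = {
		.notifier_call = tdx_mce_notify,
	};

	/* at TDX init time */
	mce_register_decode_chain(&tdx_mce_nb);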

kvm_machine_check() can be enhanced for KVM-specific use, although that
runs before memory_failure() is called.

Any other ideas?
--
Isaku Yamahata <[email protected]>

2022-11-08 02:02:38

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote:
> Introduce generic private memory register/unregister by reusing existing
> SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from SEV case
> by treating address in the region as gpa instead of hva. Which cases
> should these ioctls go is determined by the kvm_arch_has_private_mem().
> Architecture which supports KVM_PRIVATE_MEM should override this function.
>
> KVM internally defaults all guest memory as private memory and maintain
> the shared memory in 'mem_attr_array'. The above ioctls operate on this
> field and unmap existing mappings if any.
>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 17 ++-
> arch/x86/kvm/Kconfig | 1 +
> include/linux/kvm_host.h | 10 +-
> virt/kvm/Kconfig | 4 +
> virt/kvm/kvm_main.c | 227 +++++++++++++++++++++++++--------
> 5 files changed, 198 insertions(+), 61 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 975688912b8c..08253cf498d1 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> This ioctl can be used to register a guest memory region which may
> contain encrypted data (e.g. guest RAM, SMRAM etc).
>
> -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> -memory region may contain encrypted data. The SEV memory encryption
> -engine uses a tweak such that two identical plaintext pages, each at
> -different locations will have differing ciphertexts. So swapping or
> +Currently this ioctl supports registering memory regions for two usages:
> +private memory and SEV-encrypted memory.
> +
> +When private memory is enabled, this ioctl is used to register guest private
> +memory region and the addr/size of kvm_enc_region represents guest physical
> +address (GPA). In this usage, this ioctl zaps the existing guest memory
> +mappings in KVM that fallen into the region.
> +
> +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> +memory region which may contain encrypted data for a SEV-enabled guest. The
> +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> +memory encryption engine uses a tweak such that two identical plaintext pages,
> +each at different locations will have differing ciphertexts. So swapping or
> moving ciphertext of those pages will not result in plaintext being
> swapped. So relocating (or migrating) physical backing pages for the SEV
> guest will require some additional steps.
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 8d2bd455c0cd..73fdfa429b20 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -51,6 +51,7 @@ config KVM
> select HAVE_KVM_PM_NOTIFIER if PM
> select HAVE_KVM_RESTRICTED_MEM if X86_64
> select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> + select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
> help
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions. You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 79e5cbc35fcf..4ce98fa0153c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +
> +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
> struct kvm_gfn_range {
> struct kvm_memory_slot *slot;
> gfn_t start;
> @@ -254,6 +255,9 @@ struct kvm_gfn_range {
> bool may_block;
> };
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +#endif
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -794,6 +798,9 @@ struct kvm {
> struct notifier_block pm_notifier;
> #endif
> char stats_id[KVM_STATS_NAME_SIZE];
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> + struct xarray mem_attr_array;
> +#endif
> };
>
> #define kvm_err(fmt, ...) \
> @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
> #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> /*
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 9ff164c7e0cc..69ca59e82149 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER
>
> config HAVE_KVM_RESTRICTED_MEM
> bool
> +
> +config KVM_GENERIC_PRIVATE_MEM
> + bool
> + depends on HAVE_KVM_RESTRICTED_MEM
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 09c9cdeb773c..fc3835826ace 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> }
> EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> + gfn_t end)
> +{
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = start;
> + kvm->mmu_invalidate_range_end = end;
> + } else {
> + /*
> + * Fully tracking multiple concurrent ranges has diminishing
> + * returns. Keep things simple and just find the minimal range
> + * which includes the current and new ranges. As there won't be
> + * enough information to subtract a range after its invalidate
> + * completes, any ranges invalidated concurrently will
> + * accumulate and persist until all outstanding invalidates
> + * complete.
> + */
> + kvm->mmu_invalidate_range_start =
> + min(kvm->mmu_invalidate_range_start, start);
> + kvm->mmu_invalidate_range_end =
> + max(kvm->mmu_invalidate_range_end, end);
> + }
> +}
> +
> +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + /*
> + * The count increase must become visible at unlock time as no
> + * spte can be established without taking the mmu_lock and
> + * count is also read inside the mmu_lock critical section.
> + */
> + kvm->mmu_invalidate_in_progress++;
> +}
> +
> +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + mark_invalidate_in_progress(kvm, start, end);
> + update_invalidate_range(kvm, start, end);
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + /*
> + * This sequence increase will notify the kvm page fault that
> + * the page that is going to be mapped in the spte could have
> + * been freed.
> + */
> + kvm->mmu_invalidate_seq++;
> + smp_wmb();
> + /*
> + * The above sequence increase must be visible before the
> + * below count decrease, which is ensured by the smp_wmb above
> + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> + */
> + kvm->mmu_invalidate_in_progress--;
> +}
> +
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> {
> @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> }
>
> -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> - gfn_t end)
> -{
> - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> - kvm->mmu_invalidate_range_start = start;
> - kvm->mmu_invalidate_range_end = end;
> - } else {
> - /*
> - * Fully tracking multiple concurrent ranges has diminishing
> - * returns. Keep things simple and just find the minimal range
> - * which includes the current and new ranges. As there won't be
> - * enough information to subtract a range after its invalidate
> - * completes, any ranges invalidated concurrently will
> - * accumulate and persist until all outstanding invalidates
> - * complete.
> - */
> - kvm->mmu_invalidate_range_start =
> - min(kvm->mmu_invalidate_range_start, start);
> - kvm->mmu_invalidate_range_end =
> - max(kvm->mmu_invalidate_range_end, end);
> - }
> -}
> -
> -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> - /*
> - * The count increase must become visible at unlock time as no
> - * spte can be established without taking the mmu_lock and
> - * count is also read inside the mmu_lock critical section.
> - */
> - kvm->mmu_invalidate_in_progress++;
> -}
> -
> static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> update_invalidate_range(kvm, range->start, range->end);
> return kvm_unmap_gfn_range(kvm, range);
> }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> - mark_invalidate_in_progress(kvm, start, end);
> - update_invalidate_range(kvm, start, end);
> -}
> -
> static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> const struct mmu_notifier_range *range)
> {
> @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> return 0;
> }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> - /*
> - * This sequence increase will notify the kvm page fault that
> - * the page that is going to be mapped in the spte could have
> - * been freed.
> - */
> - kvm->mmu_invalidate_seq++;
> - smp_wmb();
> - /*
> - * The above sequence increase must be visible before the
> - * below count decrease, which is ensured by the smp_wmb above
> - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> - */
> - kvm->mmu_invalidate_in_progress--;
> -}
> -
> static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> const struct mmu_notifier_range *range)
> {
> @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
> #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + struct kvm_gfn_range gfn_range;
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> + int r = 0;
> +
> + gfn_range.pte = __pte(0);
> + gfn_range.may_block = true;
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> + slot = iter.slot;
> + gfn_range.start = max(start, slot->base_gfn);
> + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> + if (gfn_range.start >= gfn_range.end)
> + continue;
> + gfn_range.slot = slot;
> +
> + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> + }
> + }
> +
> + if (r)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
> +#define KVM_MEM_ATTR_SHARED 0x0001
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> + bool is_private)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + int idx;
> + int r = 0;
> +
> + if (size == 0 || gpa + size < gpa)
> + return -EINVAL;
> + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> + return -EINVAL;
> +
> + start = gpa >> PAGE_SHIFT;
> + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + /*
> + * Guest memory defaults to private, kvm->mem_attr_array only stores
> + * shared memory.
> + */
> + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> + idx = srcu_read_lock(&kvm->srcu);
> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm, start, end);
> +
> + for (i = start; i < end; i++) {
> + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> + GFP_KERNEL_ACCOUNT));
> + if (r)
> + goto err;
> + }
> +
> + kvm_unmap_mem_range(kvm, start, end);

The lock is held by KVM_MMU_LOCK(), so how about doing
kvm_mmu_invalidate_begin() after changing the xarray:

kvm_mmu_invalidate_begin(kvm, start, end);
kvm_unmap_mem_range(kvm, start, end);
kvm_mmu_invalidate_end(kvm, start, end);

Also, the error handling path then doesn't need to care about it.

> +
> + goto ret;
> +err:
> + for (; i > start; i--)
> + xa_erase(&kvm->mem_attr_array, i);

The entry at 'start' is not covered here. Simply changing the condition to
i >= start doesn't work either, since i is unsigned long and start may be
0; another variable, e.g. j, may be needed for this.
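
E.g. just a sketch (with 'j' declared next to 'i' at the top of the
function):

err:
	for (j = start; j < i; j++)
		xa_erase(&kvm->mem_attr_array, j);

which erases exactly the entries that were successfully stored before the
failure (start .. i-1) and cannot wrap below zero.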

> +ret:
> + kvm_mmu_invalidate_end(kvm, start, end);
> + KVM_MMU_UNLOCK(kvm);
> + srcu_read_unlock(&kvm->srcu, idx);
> +
> + return r;
> +}
> +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> +
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> static int kvm_pm_notifier_call(struct notifier_block *bl,
> unsigned long state,
> @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> + xa_init(&kvm->mem_attr_array);
> +#endif
>
> INIT_LIST_HEAD(&kvm->gpc_list);
> spin_lock_init(&kvm->gpc_lock);
> @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> }
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> + xa_destroy(&kvm->mem_attr_array);
> +#endif
> cleanup_srcu_struct(&kvm->irq_srcu);
> cleanup_srcu_struct(&kvm->srcu);
> kvm_arch_free_vm(kvm);
> @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> }
> }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> + return false;
> +}
> +
> static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> break;
> }
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> + case KVM_MEMORY_ENCRYPT_REG_REGION:
> + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> + struct kvm_enc_region region;
> + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> + if (!kvm_arch_has_private_mem(kvm))
> + goto arch_vm_ioctl;
> +
> + r = -EFAULT;
> + if (copy_from_user(&region, argp, sizeof(region)))
> + goto out;
> +
> + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> + region.size, set);
> + break;
> + }
> +#endif
> case KVM_GET_DIRTY_LOG: {
> struct kvm_dirty_log log;
>
> @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_get_stats_fd(kvm);
> break;
> default:
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +arch_vm_ioctl:
> +#endif
> r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> }
> out:
> --
> 2.25.1
>
>

2022-11-08 08:55:57

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

On Fri, Nov 04, 2022 at 09:19:31PM +0000, Sean Christopherson wrote:
> Paolo, any thoughts before I lead things further astray?
>
> On Fri, Nov 04, 2022, Chao Peng wrote:
> > On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote:
> > > On Tue, Oct 25, 2022, Chao Peng wrote:
> > > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > > break;
> > > > }
> > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > > + case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > >
> > > I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Aside
> > > from the fact that restricted/protected memory may not be encrypted, there are
> > > other potential use cases for per-page memory attributes[*], e.g. to make memory
> > > read-only (or no-exec, or exec-only, etc...) without having to modify memslots.
> > >
> > > Any paravirt use case where the attributes of a page are effectively dictated by
> > > the guest is going to run into the exact same performance problems with memslots,
> > > which isn't suprising in hindsight since shared vs. private is really just an
> > > attribute, albeit with extra special semantics.
> > >
> > > And if we go with a brand new ioctl(), maybe someday in the very distant future
> > > we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
> > >
> > > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big
> > > of a wrench into things.
> > >
> > > Something like:
> > >
> > > KVM_SET_MEMORY_ATTRIBUTES
> > >
> > > struct kvm_memory_attributes {
> > > __u64 address;
> > > __u64 size;
> > > __u64 flags;
>
> Oh, this is half-baked. I lost track of which flags were which. What I intended
> was a separate, initially-unused flags, e.g.

That makes sense.

>
> struct kvm_memory_attributes {
> __u64 address;
> __u64 size;
> __u64 attributes;
> __u64 flags;
> }
>
> so that KVM can tweak behavior and/or extend the effective size of the struct.
>
> > I like the idea of adding a new ioctl(). But putting all attributes into
> > a flags in uAPI sounds not good to me, e.g. forcing userspace to set all
> > attributes in one call can cause pain for userspace, probably for KVM
> > implementation as well. For private<->shared memory conversion, we
> > actually only care the KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE bit,
>
> Not necessarily, e.g. I can see pKVM wanting to convert from RW+PRIVATE => RO+SHARED
> or even RW+PRIVATE => NONE+SHARED so that the guest can't write/access the memory
> while it's accessible from the host.
>
> And if this does extend beyond shared/private, dropping from RWX=>R, i.e. dropping
> WX permissions, would also be a common operation.
>
> Hmm, typing that out makes me think that if we do end up supporting other "attributes",
> i.e. protections, we should go straight to full RWX protections instead of doing
> things piecemeal, i.e. add individual protections instead of combinations like
> NO_EXEC and READ_ONLY. The protections would have to be inverted for backwards
> compatibility, but that's easy enough to handle. The semantics could be like
> protection keys, which also have inverted persmissions, where the final protections
> are the combination of memslot+attributes, i.e. a read-only memslot couldn't be made
> writable via attributes.
>
> E.g. userspace could do "NO_READ | NO_WRITE | NO_EXEC" to temporarily block access
> to memory without needing to delete the memslot. KVM would need to disallow
> unsupported combinations, e.g. disallowed effective protections would be:
>
> - W or WX [unless there's an arch that supports write-only memory]
> - R or RW [until KVM plumbs through support for no-exec, or it's unsupported in hardware]
> - X [until KVM plumbs through support for exec-only, or it's unsupported in hardware]
>
> Anyways, that's all future work...
>
> > but we force userspace to set other irrelevant bits as well if use this
> > API.
>
> They aren't irrelevant though, as the memory attributes are all describing the
> allowed protections for a given page.

The 'allowed' protections seem to answer my concern. But once we enable
"NO_READ | NO_WRITE | NO_EXEC", are we going to require that "NO_READ |
NO_WRITE | NO_EXEC" also be set together with the PRIVATE bit? I just
can't imagine what the semantics would be if the PRIVATE bit is set but
the other bits indicate the memory can actually be read/written/executed
from userspace.

> If there's a use case where userspace "can't"
> keep track of the attributes for whatever reason, then userspace could do a RMW
> to set/clear attributes. Alternatively, the ioctl() could take an "operation" and
> support WRITE/OR/AND to allow setting/clearing individual flags, e.g. tweak the
> above to be:

A getter would be good; it might also be needed for live migration.

>
> struct kvm_memory_attributes {
> __u64 address;
> __u64 size;
> __u64 attributes;
> __u32 operation;
> __u32 flags;
> }
>
> > I looked at kvm_device_attr, sounds we can do similar:
>
> The device attributes deal with isolated, arbitrary values, whereas memory attributes
> are flags, i.e. devices are 1:1 whereas memory is 1:MANY. There is no "unset" for
> device attributes, because they aren't flags. Device attributes vs. memory attributes
> really are two very different things that just happen to use a common name.
>
> If it helped clarify things without creating naming problems, we could even use
> PROTECTIONS instead of ATTRIBUTES.
>
> > KVM_SET_MEMORY_ATTR
> >
> > struct kvm_memory_attr {
> > __u64 address;
> > __u64 size;
> > #define KVM_MEM_ATTR_SHARED BIT(0)
> > #define KVM_MEM_ATTR_READONLY BIT(1)
> > #define KVM_MEM_ATTR_NOEXEC BIT(2)
> > __u32 attr;
>
> As above, letting userspace set only a single attribute would prevent setting
> (or clearing) multiple attributes in a single ioctl().
>
> > __u32 pad;
> > }
> >
> > I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well,
>
> Definitely would need to communicate to userspace that various attributes are
> supported. That doesn't necessarily require a common ioctl(), but I don't see
> any reason not to add a common helper, and adding a common helper would mean
> KVM_CAP_PRIVATE_MEM can go away. But it should return a bitmask so that userspace
> can do a single query to get all supported attributes, e.g. KVM_SUPPORTED_MEMORY_ATTRIBUTES.

Do you have a preference between a new ioctl and just keeping it as a cap?
E.g. KVM_CAP_MEMORY_ATTRIBUTES could also return a mask.

>
> As for KVM_GET_MEMORY_ATTRIBUTES, we wouldn't necessarily have to provide such an
> API, e.g. we could hold off until someone came along with a RMW use case (as above).
> That said, debug would likely be a nightmare without KVM_GET_MEMORY_ATTRIBUTES,
> so it's probably best to add it straightway.

Diving into the implementation a bit: for KVM_GET_MEMORY_ATTRIBUTES we can
have different attributes for different pages in the same user-provided
range, in which case we would have to either return a list or just an
error number. Or should the getter only support per-page attributes?

Chao
>
> > but sounds like we need a KVM_UNSET_MEMORY_ATTR.
>
> No need if the setter operates on all attributes.
>
> > Since we are exposing the attribute directly to userspace I also think
> > we'd better treat shared memory as the default, so even when the private
> > memory is not used, the bit can still be meaningful. So define BIT(0) as
> > KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED.
>
> Ah, right.

2022-11-08 10:52:01

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

On Tue, Nov 08, 2022 at 09:35:06AM +0800, Yuan Yao wrote:
> On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote:
> > Introduce generic private memory register/unregister by reusing existing
> > SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from SEV case
> > by treating address in the region as gpa instead of hva. Which cases
> > should these ioctls go is determined by the kvm_arch_has_private_mem().
> > Architecture which supports KVM_PRIVATE_MEM should override this function.
> >
> > KVM internally defaults all guest memory as private memory and maintain
> > the shared memory in 'mem_attr_array'. The above ioctls operate on this
> > field and unmap existing mappings if any.
> >
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > Documentation/virt/kvm/api.rst | 17 ++-
> > arch/x86/kvm/Kconfig | 1 +
> > include/linux/kvm_host.h | 10 +-
> > virt/kvm/Kconfig | 4 +
> > virt/kvm/kvm_main.c | 227 +++++++++++++++++++++++++--------
> > 5 files changed, 198 insertions(+), 61 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 975688912b8c..08253cf498d1 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> > This ioctl can be used to register a guest memory region which may
> > contain encrypted data (e.g. guest RAM, SMRAM etc).
> >
> > -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> > -memory region may contain encrypted data. The SEV memory encryption
> > -engine uses a tweak such that two identical plaintext pages, each at
> > -different locations will have differing ciphertexts. So swapping or
> > +Currently this ioctl supports registering memory regions for two usages:
> > +private memory and SEV-encrypted memory.
> > +
> > +When private memory is enabled, this ioctl is used to register guest private
> > +memory region and the addr/size of kvm_enc_region represents guest physical
> > +address (GPA). In this usage, this ioctl zaps the existing guest memory
> > +mappings in KVM that fallen into the region.
> > +
> > +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> > +memory region which may contain encrypted data for a SEV-enabled guest. The
> > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> > +memory encryption engine uses a tweak such that two identical plaintext pages,
> > +each at different locations will have differing ciphertexts. So swapping or
> > moving ciphertext of those pages will not result in plaintext being
> > swapped. So relocating (or migrating) physical backing pages for the SEV
> > guest will require some additional steps.
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index 8d2bd455c0cd..73fdfa429b20 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -51,6 +51,7 @@ config KVM
> > select HAVE_KVM_PM_NOTIFIER if PM
> > select HAVE_KVM_RESTRICTED_MEM if X86_64
> > select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > + select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
> > help
> > Support hosting fully virtualized guest machines using hardware
> > virtualization extensions. You will need a fairly recent
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 79e5cbc35fcf..4ce98fa0153c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > +
> > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
> > struct kvm_gfn_range {
> > struct kvm_memory_slot *slot;
> > gfn_t start;
> > @@ -254,6 +255,9 @@ struct kvm_gfn_range {
> > bool may_block;
> > };
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +#endif
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -794,6 +798,9 @@ struct kvm {
> > struct notifier_block pm_notifier;
> > #endif
> > char stats_id[KVM_STATS_NAME_SIZE];
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > + struct xarray mem_attr_array;
> > +#endif
> > };
> >
> > #define kvm_err(fmt, ...) \
> > @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > int kvm_arch_post_init_vm(struct kvm *kvm);
> > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > /*
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 9ff164c7e0cc..69ca59e82149 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER
> >
> > config HAVE_KVM_RESTRICTED_MEM
> > bool
> > +
> > +config KVM_GENERIC_PRIVATE_MEM
> > + bool
> > + depends on HAVE_KVM_RESTRICTED_MEM
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 09c9cdeb773c..fc3835826ace 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > }
> > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > + gfn_t end)
> > +{
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = start;
> > + kvm->mmu_invalidate_range_end = end;
> > + } else {
> > + /*
> > + * Fully tracking multiple concurrent ranges has diminishing
> > + * returns. Keep things simple and just find the minimal range
> > + * which includes the current and new ranges. As there won't be
> > + * enough information to subtract a range after its invalidate
> > + * completes, any ranges invalidated concurrently will
> > + * accumulate and persist until all outstanding invalidates
> > + * complete.
> > + */
> > + kvm->mmu_invalidate_range_start =
> > + min(kvm->mmu_invalidate_range_start, start);
> > + kvm->mmu_invalidate_range_end =
> > + max(kvm->mmu_invalidate_range_end, end);
> > + }
> > +}
> > +
> > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + /*
> > + * The count increase must become visible at unlock time as no
> > + * spte can be established without taking the mmu_lock and
> > + * count is also read inside the mmu_lock critical section.
> > + */
> > + kvm->mmu_invalidate_in_progress++;
> > +}
> > +
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + mark_invalidate_in_progress(kvm, start, end);
> > + update_invalidate_range(kvm, start, end);
> > +}
> > +
> > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + /*
> > + * This sequence increase will notify the kvm page fault that
> > + * the page that is going to be mapped in the spte could have
> > + * been freed.
> > + */
> > + kvm->mmu_invalidate_seq++;
> > + smp_wmb();
> > + /*
> > + * The above sequence increase must be visible before the
> > + * below count decrease, which is ensured by the smp_wmb above
> > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > + */
> > + kvm->mmu_invalidate_in_progress--;
> > +}
> > +
> > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > {
> > @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > }
> >
> > -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > - gfn_t end)
> > -{
> > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > - kvm->mmu_invalidate_range_start = start;
> > - kvm->mmu_invalidate_range_end = end;
> > - } else {
> > - /*
> > - * Fully tracking multiple concurrent ranges has diminishing
> > - * returns. Keep things simple and just find the minimal range
> > - * which includes the current and new ranges. As there won't be
> > - * enough information to subtract a range after its invalidate
> > - * completes, any ranges invalidated concurrently will
> > - * accumulate and persist until all outstanding invalidates
> > - * complete.
> > - */
> > - kvm->mmu_invalidate_range_start =
> > - min(kvm->mmu_invalidate_range_start, start);
> > - kvm->mmu_invalidate_range_end =
> > - max(kvm->mmu_invalidate_range_end, end);
> > - }
> > -}
> > -
> > -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > - /*
> > - * The count increase must become visible at unlock time as no
> > - * spte can be established without taking the mmu_lock and
> > - * count is also read inside the mmu_lock critical section.
> > - */
> > - kvm->mmu_invalidate_in_progress++;
> > -}
> > -
> > static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > update_invalidate_range(kvm, range->start, range->end);
> > return kvm_unmap_gfn_range(kvm, range);
> > }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > - mark_invalidate_in_progress(kvm, start, end);
> > - update_invalidate_range(kvm, start, end);
> > -}
> > -
> > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > const struct mmu_notifier_range *range)
> > {
> > @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > return 0;
> > }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > - /*
> > - * This sequence increase will notify the kvm page fault that
> > - * the page that is going to be mapped in the spte could have
> > - * been freed.
> > - */
> > - kvm->mmu_invalidate_seq++;
> > - smp_wmb();
> > - /*
> > - * The above sequence increase must be visible before the
> > - * below count decrease, which is ensured by the smp_wmb above
> > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > - */
> > - kvm->mmu_invalidate_in_progress--;
> > -}
> > -
> > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > const struct mmu_notifier_range *range)
> > {
> > @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >
> > #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + struct kvm_gfn_range gfn_range;
> > + struct kvm_memory_slot *slot;
> > + struct kvm_memslots *slots;
> > + struct kvm_memslot_iter iter;
> > + int i;
> > + int r = 0;
> > +
> > + gfn_range.pte = __pte(0);
> > + gfn_range.may_block = true;
> > +
> > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > + slots = __kvm_memslots(kvm, i);
> > +
> > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > + slot = iter.slot;
> > + gfn_range.start = max(start, slot->base_gfn);
> > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > + if (gfn_range.start >= gfn_range.end)
> > + continue;
> > + gfn_range.slot = slot;
> > +
> > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > + }
> > + }
> > +
> > + if (r)
> > + kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> > +#define KVM_MEM_ATTR_SHARED 0x0001
> > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > + bool is_private)
> > +{
> > + gfn_t start, end;
> > + unsigned long i;
> > + void *entry;
> > + int idx;
> > + int r = 0;
> > +
> > + if (size == 0 || gpa + size < gpa)
> > + return -EINVAL;
> > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > + return -EINVAL;
> > +
> > + start = gpa >> PAGE_SHIFT;
> > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > + /*
> > + * Guest memory defaults to private, kvm->mem_attr_array only stores
> > + * shared memory.
> > + */
> > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > +
> > + idx = srcu_read_lock(&kvm->srcu);
> > + KVM_MMU_LOCK(kvm);
> > + kvm_mmu_invalidate_begin(kvm, start, end);
> > +
> > + for (i = start; i < end; i++) {
> > + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > + GFP_KERNEL_ACCOUNT));
> > + if (r)
> > + goto err;
> > + }
> > +
> > + kvm_unmap_mem_range(kvm, start, end);
>
> lock is hold by KVM_MMU_LOCK() so how about do
> kvm_mmu_invalidate_begin() after changing xarray:
>
> kvm_mmu_invalidate_begin(kvm, start, end);
> kvm_unmap_mem_range(kvm, start, end);
> kvm_mmu_invalidate_end(kvm, start, end);
>
> Also the error handling path doesn't need to care it yet.

The mem_attr_array is consumed in the page fault handler (i.e.
kvm_mem_is_private() in patch 08) so it should also be protected by
kvm_mmu_invalidate_begin/end(). E.g. if we change the mem_attr_array here
after the page fault handler has read it, mmu_invalidate_retry_gfn()
should return 1 to let the page fault handler retry the fault.
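
I.e. the fault-handler side relies on roughly this ordering (simplified from
how this series uses it, not verbatim code):

	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	fault->is_private = kvm_mem_is_private(kvm, fault->gfn);
	/* ... resolve the pfn, prepare the new mapping ... */

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, fault->gfn))
		goto retry;	/* attribute changed concurrently, redo the fault */

so the xarray update has to be done inside the invalidate_begin/end window
for the retry to be triggered.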

>
> > +
> > + goto ret;
> > +err:
> > + for (; i > start; i--)
> > + xa_erase(&kvm->mem_attr_array, i);
>
> the start should be covered yet, consider the i is
> unsigned long and case of start is 0, may need another
> variable j for this.

Ah, right!

Thanks,
Chao
>
> > +ret:
> > + kvm_mmu_invalidate_end(kvm, start, end);
> > + KVM_MMU_UNLOCK(kvm);
> > + srcu_read_unlock(&kvm->srcu, idx);
> > +
> > + return r;
> > +}
> > +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> > +
> > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > static int kvm_pm_notifier_call(struct notifier_block *bl,
> > unsigned long state,
> > @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > spin_lock_init(&kvm->mn_invalidate_lock);
> > rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > + xa_init(&kvm->mem_attr_array);
> > +#endif
> >
> > INIT_LIST_HEAD(&kvm->gpc_list);
> > spin_lock_init(&kvm->gpc_lock);
> > @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> > kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> > }
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > + xa_destroy(&kvm->mem_attr_array);
> > +#endif
> > cleanup_srcu_struct(&kvm->irq_srcu);
> > cleanup_srcu_struct(&kvm->srcu);
> > kvm_arch_free_vm(kvm);
> > @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > }
> > }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > + return false;
> > +}
> > +
> > static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > {
> > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > break;
> > }
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > + case KVM_MEMORY_ENCRYPT_REG_REGION:
> > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > + struct kvm_enc_region region;
> > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > +
> > + if (!kvm_arch_has_private_mem(kvm))
> > + goto arch_vm_ioctl;
> > +
> > + r = -EFAULT;
> > + if (copy_from_user(&region, argp, sizeof(region)))
> > + goto out;
> > +
> > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > + region.size, set);
> > + break;
> > + }
> > +#endif
> > case KVM_GET_DIRTY_LOG: {
> > struct kvm_dirty_log log;
> >
> > @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp,
> > r = kvm_vm_ioctl_get_stats_fd(kvm);
> > break;
> > default:
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +arch_vm_ioctl:
> > +#endif
> > r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > }
> > out:
> > --
> > 2.25.1
> >
> >

2022-11-09 06:11:55

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

On Tue, Nov 08, 2022 at 05:41:41PM +0800, Chao Peng wrote:
> On Tue, Nov 08, 2022 at 09:35:06AM +0800, Yuan Yao wrote:
> > On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote:
> > > Introduce generic private memory register/unregister by reusing existing
> > > SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from SEV case
> > > by treating address in the region as gpa instead of hva. Which cases
> > > should these ioctls go is determined by the kvm_arch_has_private_mem().
> > > Architecture which supports KVM_PRIVATE_MEM should override this function.
> > >
> > > KVM internally defaults all guest memory as private memory and maintain
> > > the shared memory in 'mem_attr_array'. The above ioctls operate on this
> > > field and unmap existing mappings if any.
> > >
> > > Signed-off-by: Chao Peng <[email protected]>
> > > ---
> > > Documentation/virt/kvm/api.rst | 17 ++-
> > > arch/x86/kvm/Kconfig | 1 +
> > > include/linux/kvm_host.h | 10 +-
> > > virt/kvm/Kconfig | 4 +
> > > virt/kvm/kvm_main.c | 227 +++++++++++++++++++++++++--------
> > > 5 files changed, 198 insertions(+), 61 deletions(-)
> > >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index 975688912b8c..08253cf498d1 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> > > This ioctl can be used to register a guest memory region which may
> > > contain encrypted data (e.g. guest RAM, SMRAM etc).
> > >
> > > -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> > > -memory region may contain encrypted data. The SEV memory encryption
> > > -engine uses a tweak such that two identical plaintext pages, each at
> > > -different locations will have differing ciphertexts. So swapping or
> > > +Currently this ioctl supports registering memory regions for two usages:
> > > +private memory and SEV-encrypted memory.
> > > +
> > > +When private memory is enabled, this ioctl is used to register guest private
> > > +memory region and the addr/size of kvm_enc_region represents guest physical
> > > +address (GPA). In this usage, this ioctl zaps the existing guest memory
> > > +mappings in KVM that fallen into the region.
> > > +
> > > +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> > > +memory region which may contain encrypted data for a SEV-enabled guest. The
> > > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> > > +memory encryption engine uses a tweak such that two identical plaintext pages,
> > > +each at different locations will have differing ciphertexts. So swapping or
> > > moving ciphertext of those pages will not result in plaintext being
> > > swapped. So relocating (or migrating) physical backing pages for the SEV
> > > guest will require some additional steps.
> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > index 8d2bd455c0cd..73fdfa429b20 100644
> > > --- a/arch/x86/kvm/Kconfig
> > > +++ b/arch/x86/kvm/Kconfig
> > > @@ -51,6 +51,7 @@ config KVM
> > > select HAVE_KVM_PM_NOTIFIER if PM
> > > select HAVE_KVM_RESTRICTED_MEM if X86_64
> > > select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > > + select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
> > > help
> > > Support hosting fully virtualized guest machines using hardware
> > > virtualization extensions. You will need a fairly recent
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 79e5cbc35fcf..4ce98fa0153c 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > #endif
> > >
> > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > +
> > > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
> > > struct kvm_gfn_range {
> > > struct kvm_memory_slot *slot;
> > > gfn_t start;
> > > @@ -254,6 +255,9 @@ struct kvm_gfn_range {
> > > bool may_block;
> > > };
> > > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +#endif
> > > +
> > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > @@ -794,6 +798,9 @@ struct kvm {
> > > struct notifier_block pm_notifier;
> > > #endif
> > > char stats_id[KVM_STATS_NAME_SIZE];
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > + struct xarray mem_attr_array;
> > > +#endif
> > > };
> > >
> > > #define kvm_err(fmt, ...) \
> > > @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > > int kvm_arch_post_init_vm(struct kvm *kvm);
> > > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > >
> > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > > /*
> > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > index 9ff164c7e0cc..69ca59e82149 100644
> > > --- a/virt/kvm/Kconfig
> > > +++ b/virt/kvm/Kconfig
> > > @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER
> > >
> > > config HAVE_KVM_RESTRICTED_MEM
> > > bool
> > > +
> > > +config KVM_GENERIC_PRIVATE_MEM
> > > + bool
> > > + depends on HAVE_KVM_RESTRICTED_MEM
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 09c9cdeb773c..fc3835826ace 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > >
> > > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > > + gfn_t end)
> > > +{
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = start;
> > > + kvm->mmu_invalidate_range_end = end;
> > > + } else {
> > > + /*
> > > + * Fully tracking multiple concurrent ranges has diminishing
> > > + * returns. Keep things simple and just find the minimal range
> > > + * which includes the current and new ranges. As there won't be
> > > + * enough information to subtract a range after its invalidate
> > > + * completes, any ranges invalidated concurrently will
> > > + * accumulate and persist until all outstanding invalidates
> > > + * complete.
> > > + */
> > > + kvm->mmu_invalidate_range_start =
> > > + min(kvm->mmu_invalidate_range_start, start);
> > > + kvm->mmu_invalidate_range_end =
> > > + max(kvm->mmu_invalidate_range_end, end);
> > > + }
> > > +}
> > > +
> > > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + /*
> > > + * The count increase must become visible at unlock time as no
> > > + * spte can be established without taking the mmu_lock and
> > > + * count is also read inside the mmu_lock critical section.
> > > + */
> > > + kvm->mmu_invalidate_in_progress++;
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + mark_invalidate_in_progress(kvm, start, end);
> > > + update_invalidate_range(kvm, start, end);
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + /*
> > > + * This sequence increase will notify the kvm page fault that
> > > + * the page that is going to be mapped in the spte could have
> > > + * been freed.
> > > + */
> > > + kvm->mmu_invalidate_seq++;
> > > + smp_wmb();
> > > + /*
> > > + * The above sequence increase must be visible before the
> > > + * below count decrease, which is ensured by the smp_wmb above
> > > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > + */
> > > + kvm->mmu_invalidate_in_progress--;
> > > +}
> > > +
> > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > > {
> > > @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > }
> > >
> > > -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > > - gfn_t end)
> > > -{
> > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > - kvm->mmu_invalidate_range_start = start;
> > > - kvm->mmu_invalidate_range_end = end;
> > > - } else {
> > > - /*
> > > - * Fully tracking multiple concurrent ranges has diminishing
> > > - * returns. Keep things simple and just find the minimal range
> > > - * which includes the current and new ranges. As there won't be
> > > - * enough information to subtract a range after its invalidate
> > > - * completes, any ranges invalidated concurrently will
> > > - * accumulate and persist until all outstanding invalidates
> > > - * complete.
> > > - */
> > > - kvm->mmu_invalidate_range_start =
> > > - min(kvm->mmu_invalidate_range_start, start);
> > > - kvm->mmu_invalidate_range_end =
> > > - max(kvm->mmu_invalidate_range_end, end);
> > > - }
> > > -}
> > > -
> > > -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > - /*
> > > - * The count increase must become visible at unlock time as no
> > > - * spte can be established without taking the mmu_lock and
> > > - * count is also read inside the mmu_lock critical section.
> > > - */
> > > - kvm->mmu_invalidate_in_progress++;
> > > -}
> > > -
> > > static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > > update_invalidate_range(kvm, range->start, range->end);
> > > return kvm_unmap_gfn_range(kvm, range);
> > > }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > - mark_invalidate_in_progress(kvm, start, end);
> > > - update_invalidate_range(kvm, start, end);
> > > -}
> > > -
> > > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > const struct mmu_notifier_range *range)
> > > {
> > > @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > return 0;
> > > }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > - /*
> > > - * This sequence increase will notify the kvm page fault that
> > > - * the page that is going to be mapped in the spte could have
> > > - * been freed.
> > > - */
> > > - kvm->mmu_invalidate_seq++;
> > > - smp_wmb();
> > > - /*
> > > - * The above sequence increase must be visible before the
> > > - * below count decrease, which is ensured by the smp_wmb above
> > > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > - */
> > > - kvm->mmu_invalidate_in_progress--;
> > > -}
> > > -
> > > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > const struct mmu_notifier_range *range)
> > > {
> > > @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> > >
> > > #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> > >
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > +
> > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + struct kvm_gfn_range gfn_range;
> > > + struct kvm_memory_slot *slot;
> > > + struct kvm_memslots *slots;
> > > + struct kvm_memslot_iter iter;
> > > + int i;
> > > + int r = 0;
> > > +
> > > + gfn_range.pte = __pte(0);
> > > + gfn_range.may_block = true;
> > > +
> > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > + slots = __kvm_memslots(kvm, i);
> > > +
> > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > + slot = iter.slot;
> > > + gfn_range.start = max(start, slot->base_gfn);
> > > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > + if (gfn_range.start >= gfn_range.end)
> > > + continue;
> > > + gfn_range.slot = slot;
> > > +
> > > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > + }
> > > + }
> > > +
> > > + if (r)
> > > + kvm_flush_remote_tlbs(kvm);
> > > +}
> > > +
> > > +#define KVM_MEM_ATTR_SHARED 0x0001
> > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > > + bool is_private)
> > > +{
> > > + gfn_t start, end;
> > > + unsigned long i;
> > > + void *entry;
> > > + int idx;
> > > + int r = 0;
> > > +
> > > + if (size == 0 || gpa + size < gpa)
> > > + return -EINVAL;
> > > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > > + return -EINVAL;
> > > +
> > > + start = gpa >> PAGE_SHIFT;
> > > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > > +
> > > + /*
> > > + * Guest memory defaults to private, kvm->mem_attr_array only stores
> > > + * shared memory.
> > > + */
> > > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > > +
> > > + idx = srcu_read_lock(&kvm->srcu);
> > > + KVM_MMU_LOCK(kvm);
> > > + kvm_mmu_invalidate_begin(kvm, start, end);
> > > +
> > > + for (i = start; i < end; i++) {
> > > + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > + GFP_KERNEL_ACCOUNT));
> > > + if (r)
> > > + goto err;
> > > + }
> > > +
> > > + kvm_unmap_mem_range(kvm, start, end);
> >
> > The lock is held by KVM_MMU_LOCK(), so how about doing
> > kvm_mmu_invalidate_begin() after changing the xarray:
> >
> > kvm_mmu_invalidate_begin(kvm, start, end);
> > kvm_unmap_mem_range(kvm, start, end);
> > kvm_mmu_invalidate_end(kvm, start, end);
> >
> > Also, the error handling path then doesn't need to care about it.
>
> The mem_attr_array is consumed in the page fault handler (i.e.
> kvm_mem_is_private() in patch 08) so it should also be protected by
> kvm_mmu_invalidate_begin/end(). E.g. if we change the mem_attr_array here
> after the page fault handler has read the mem_attr_array, then
> mmu_invalidate_retry_gfn() should return 1 to let the page fault handler
> retry the fault.

You're right!
Even if the changes are undone by the error handling path, we still need
to make sure that users of mem_attr_array retry the fault, because a user
may have read some "stale" data (it is "stale" since the xarray gets
recovered in the error case).
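
For reference, the pairing being described looks roughly like this from the
consumer side (a sketch only, not code from the series: kvm_mem_is_private()
is the patch-08 helper and mmu_invalidate_retry() is the existing sequence
check; everything else is illustrative):

static int faultin_private_example(struct kvm *kvm, gfn_t gfn)
{
	unsigned long mmu_seq;
	bool is_private;

	/* Snapshot the sequence before consulting mem_attr_array. */
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();	/* pairs with the smp_wmb() in kvm_mmu_invalidate_end() */

	is_private = kvm_mem_is_private(kvm, gfn);

	/* ... resolve and pin the backing page outside mmu_lock ... */

	KVM_MMU_LOCK(kvm);
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		/*
		 * A conversion (or any other invalidation) ran in between,
		 * so the is_private snapshot may be stale: retry the fault.
		 */
		KVM_MMU_UNLOCK(kvm);
		return -EAGAIN;
	}
	/* Safe to install the mapping based on is_private. */
	KVM_MMU_UNLOCK(kvm);
	return 0;
}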

>
> >
> > > +
> > > + goto ret;
> > > +err:
> > > + for (; i > start; i--)
> > > + xa_erase(&kvm->mem_attr_array, i);
> >
> > the start index should be covered as well; considering that i is
> > unsigned long and start can be 0, another variable j may be
> > needed for this.
>
> Ah, right!
>
> Thanks,
> Chao
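
For reference, a corrected unwind along the lines suggested above would be
something like (sketch only, with j as another unsigned long, which also
avoids the wrap-around when start is 0):

err:
	for (j = start; j < i; j++)
		xa_erase(&kvm->mem_attr_array, j);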
> >
> > > +ret:
> > > + kvm_mmu_invalidate_end(kvm, start, end);
> > > + KVM_MMU_UNLOCK(kvm);
> > > + srcu_read_unlock(&kvm->srcu, idx);
> > > +
> > > + return r;
> > > +}
> > > +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> > > +
> > > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > > static int kvm_pm_notifier_call(struct notifier_block *bl,
> > > unsigned long state,
> > > @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > spin_lock_init(&kvm->mn_invalidate_lock);
> > > rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > > xa_init(&kvm->vcpu_array);
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > + xa_init(&kvm->mem_attr_array);
> > > +#endif
> > >
> > > INIT_LIST_HEAD(&kvm->gpc_list);
> > > spin_lock_init(&kvm->gpc_lock);
> > > @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > > kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> > > kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> > > }
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > + xa_destroy(&kvm->mem_attr_array);
> > > +#endif
> > > cleanup_srcu_struct(&kvm->irq_srcu);
> > > cleanup_srcu_struct(&kvm->srcu);
> > > kvm_arch_free_vm(kvm);
> > > @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > > }
> > > }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > > {
> > > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > break;
> > > }
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > + case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > > + struct kvm_enc_region region;
> > > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > +
> > > + if (!kvm_arch_has_private_mem(kvm))
> > > + goto arch_vm_ioctl;
> > > +
> > > + r = -EFAULT;
> > > + if (copy_from_user(&region, argp, sizeof(region)))
> > > + goto out;
> > > +
> > > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > > + region.size, set);
> > > + break;
> > > + }
> > > +#endif
> > > case KVM_GET_DIRTY_LOG: {
> > > struct kvm_dirty_log log;
> > >
> > > @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp,
> > > r = kvm_vm_ioctl_get_stats_fd(kvm);
> > > break;
> > > default:
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > +arch_vm_ioctl:
> > > +#endif
> > > r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > > }
> > > out:
> > > --
> > > 2.25.1
> > >
> > >

2022-11-09 16:39:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> Vishal Annapurve <[email protected]> wrote:
>
> > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <[email protected]> wrote:
> > >
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > >
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > > - 01: mm change, target for mm tree
> > > - 02-08: KVM change, target for KVM tree
> > >
> > > Given KVM is the only current user for the mm part, I have chatted with
> > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > reviewed-by/acked-by is still expected from the mm people.
> > >
> > > The patches have been verified in Intel TDX environment, but Vishal has
> > > done an excellent work on the selftests[4] which are dedicated for this
> > > series, making it possible to test this series without innovative
> > > hardware and fancy steps of building a VM environment. See Test section
> > > below for more info.
> > >
> > >
> > > Introduction
> > > ============
> > > KVM userspace being able to crash the host is horrible. Under current
> > > KVM architecture, all guest memory is inherently accessible from KVM
> > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > series is to provide a solution to align mm and KVM, on a userspace
> > > inaccessible approach of exposing guest memory.
> > >
> > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > table). This requires guest memory being mmaped into KVM userspace, but
> > > this is also the source where the mentioned crash issue can happen. In
> > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > memory doesn't have to be mmaped into KVM userspace.
> > >
> > > This series introduces fd-based guest memory which will not be mmaped
> > > into KVM userspace. KVM populates secondary page table by using a
> >
> > With no mappings in place for userspace VMM, IIUC, looks like the host
> > kernel will not be able to find the culprit userspace process in case
> > of Machine check error on guest private memory. As implemented in
> > hwpoison_user_mappings, host kernel tries to look at the processes
> > which have mapped the pfns with hardware error.
> >
> > Is there a modification needed in mce handling logic of the host
> > kernel to immediately send a signal to the vcpu thread accessing
> > faulting pfn backing guest private memory?
>
> mce_register_decode_chain() can be used. The MCE physical address
> (p->mce_addr) includes the host key id in addition to the real physical
> address. By searching the hkids used by KVM, we can determine whether the
> page is assigned to a guest TD or not. If yes, send SIGBUS.
>
> kvm_machine_check() can be enhanced for KVM specific use. This is before
> memory_failure() is called, though.
>
> any other ideas?

That's too KVM-centric. It will not work for other possible users of
restricted memfd.

I tried to find a way to get it right: we need to give the restricted memfd
code info about the corrupted page so it can invalidate its users. On the
next request of the page the user will see an error. In case of KVM, the
error will likely escalate to SIGBUS.

The problem is that core-mm code that handles memory failure knows nothing
about restricted memfd. It only sees that the page belongs to a normal
memfd.

AFAICS, there's no way to get it intercepted from the shim level. shmem
code has to be patched. shmem_error_remove_page() has to call into
restricted memfd code.

Hugh, are you okay with this? Or maybe you have a better idea?

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-11-14 13:31:53

by Alex Bennée

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM


Chao Peng <[email protected]> writes:

<snip>
> Introduction
> ============
> KVM userspace being able to crash the host is horrible. Under current
> KVM architecture, all guest memory is inherently accessible from KVM
> userspace and is exposed to the mentioned crash issue. The goal of this
> series is to provide a solution to align mm and KVM, on a userspace
> inaccessible approach of exposing guest memory.
>
> Normally, KVM populates secondary page table (e.g. EPT) by using a host
> virtual address (hva) from core mm page table (e.g. x86 userspace page
> table). This requires guest memory being mmaped into KVM userspace, but
> this is also the source where the mentioned crash issue can happen. In
> theory, apart from those 'shared' memory for device emulation etc, guest
> memory doesn't have to be mmaped into KVM userspace.
>
> This series introduces fd-based guest memory which will not be mmaped
> into KVM userspace. KVM populates secondary page table by using a
> fd/offset pair backed by a memory file system. The fd can be created
> from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> directly interact with them with newly introduced in-kernel interface,
> therefore remove the KVM userspace from the path of accessing/mmaping
> the guest memory.
>
> Kirill had a patch [2] to address the same issue in a different way. It
> tracks guest encrypted memory at the 'struct page' level and relies on
> HWPOISON to reject the userspace access. The patch has been discussed in
> several online and offline threads and resulted in a design document [3]
> which is also the original proposal for this series. Later this patch
> series evolved as more comments received in community but the major
> concepts in [3] still hold true so recommend reading.
>
> The patch series may also be useful for other usages, for example, pure
> software approach may use it to harden itself against unintentional
> access to guest memory. This series is designed with these usages in
> mind but doesn't have code directly support them and extension might be
> needed.

There are a couple of additional use cases where having a consistent
memory interface with the kernel would be useful.

- Xen DomU guests providing other domains with VirtIO backends

Xen by default doesn't give other domains special access to a domains
memory. The guest can grant access to regions of its memory to other
domains for this purpose.

- pKVM on ARM

Similar to Xen, pKVM moves the management of the page tables into the
hypervisor and again doesn't allow those domains to share memory by
default.

- VirtIO loopback

This allows for VirtIO devices for the host kernel to be serviced by
backends running in userspace. Obviously the memory userspace is
allowed to access is strictly limited to the buffers and queues
because giving userspace unrestricted access to the host kernel would
have consequences.

All of these VirtIO backends work with vhost-user which uses memfds to
pass references to guest memory from the VMM to the backend
implementation.

> mm change
> =========
> Introduces a new memfd_restricted system call which can create memory
> file that is restricted from userspace access via normal MMU operations
> like read(), write() or mmap() etc and the only way to use it is
> passing it to a third kernel module like KVM and relying on it to
> access the fd through the newly added restrictedmem kernel interface.
> The restrictedmem interface bridges the memory file subsystems
> (tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
> bi-directional communication between them.
>
>
> KVM change
> ==========
> Extends the KVM memslot to provide guest private (encrypted) memory from
> a fd. With this extension, a single memslot can maintain both private
> memory through private fd (restricted_fd/restricted_offset) and shared
> (unencrypted) memory through userspace mmaped host virtual address
> (userspace_addr). For a particular guest page, the corresponding page in
> KVM memslot can be only either private or shared and only one of the
> shared/private parts of the memslot is visible to guest. For how this
> new extension is used in QEMU, please refer to kvm_set_phys_mem() in
> below TDX-enabled QEMU repo.
>
> Introduces new KVM_EXIT_MEMORY_FAULT exit to allow userspace to get the
> chance on decision-making for shared <-> private memory conversion. The
> exit can be an implicit conversion in KVM page fault handler or an
> explicit conversion from guest OS.
>
> Extends existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
> convert a guest page between private <-> shared. The data maintained in
> these ioctls tells the truth whether a guest page is private or shared
> and this information will be used in KVM page fault handler to decide
> whether the private or the shared part of the memslot is visible to
> guest.
>
<snip>

--
Alex Bennée

2022-11-15 15:22:57

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

On Wed, Nov 09, 2022 at 06:54:04PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> > On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> > Vishal Annapurve <[email protected]> wrote:
> >
> > > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <[email protected]> wrote:
> > > >
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > >
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > > - 01: mm change, target for mm tree
> > > > - 02-08: KVM change, target for KVM tree
> > > >
> > > > Given KVM is the only current user for the mm part, I have chatted with
> > > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > > reviewed-by/acked-by is still expected from the mm people.
> > > >
> > > > The patches have been verified in Intel TDX environment, but Vishal has
> > > > done an excellent work on the selftests[4] which are dedicated for this
> > > > series, making it possible to test this series without innovative
> > > > hardware and fancy steps of building a VM environment. See Test section
> > > > below for more info.
> > > >
> > > >
> > > > Introduction
> > > > ============
> > > > KVM userspace being able to crash the host is horrible. Under current
> > > > KVM architecture, all guest memory is inherently accessible from KVM
> > > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > > series is to provide a solution to align mm and KVM, on a userspace
> > > > inaccessible approach of exposing guest memory.
> > > >
> > > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > > table). This requires guest memory being mmaped into KVM userspace, but
> > > > this is also the source where the mentioned crash issue can happen. In
> > > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > > memory doesn't have to be mmaped into KVM userspace.
> > > >
> > > > This series introduces fd-based guest memory which will not be mmaped
> > > > into KVM userspace. KVM populates secondary page table by using a
> > >
> > > With no mappings in place for userspace VMM, IIUC, looks like the host
> > > kernel will not be able to find the culprit userspace process in case
> > > of Machine check error on guest private memory. As implemented in
> > > hwpoison_user_mappings, host kernel tries to look at the processes
> > > which have mapped the pfns with hardware error.
> > >
> > > Is there a modification needed in mce handling logic of the host
> > > kernel to immediately send a signal to the vcpu thread accessing
> > > faulting pfn backing guest private memory?
> >
> > mce_register_decode_chain() can be used. MCE physical address(p->mce_addr)
> > includes host key id in addition to real physical address. By searching used
> > hkid by KVM, we can determine if the page is assigned to guest TD or not. If
> > yes, send SIGBUS.
> >
> > kvm_machine_check() can be enhanced for KVM specific use. This is before
> > memory_failure() is called, though.
> >
> > any other ideas?
>
> That's too KVM-centric. It will not work for other possible users of
> restricted memfd.
>
> I tried to find a way to get it right: we need to get restricted memfd
> code info about corrupted page so it can invalidate its users. On the next
> request of the page the user will see an error. In case of KVM, the error
> will likely escalate to SIGBUS.
>
> The problem is that core-mm code that handles memory failure knows nothing
> about restricted memfd. It only sees that the page belongs to a normal
> memfd.
>
> AFAICS, there's no way to get it intercepted from the shim level. shmem
> code has to be patched. shmem_error_remove_page() has to call into
> restricted memfd code.
>
> Hugh, are you okay with this? Or maybe you have a better idea?

Okay, here is what I've come up with. It doesn't touch shmem code, but
hooks up directly into memory-failure.c. It is still ugly, but should be
tolerable.

restrictedmem_error_page() loops over all restrictedmem inodes. It is
slow, but memory failure is not a hot path (I hope).

Only build-tested. Chao, could you hook up ->error for KVM and get it
tested?

diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
index 9c37c3ea3180..c2700c5daa43 100644
--- a/include/linux/restrictedmem.h
+++ b/include/linux/restrictedmem.h
@@ -12,6 +12,8 @@ struct restrictedmem_notifier_ops {
pgoff_t start, pgoff_t end);
void (*invalidate_end)(struct restrictedmem_notifier *notifier,
pgoff_t start, pgoff_t end);
+ void (*error)(struct restrictedmem_notifier *notifier,
+ pgoff_t start, pgoff_t end);
};

struct restrictedmem_notifier {
@@ -34,6 +36,8 @@ static inline bool file_is_restrictedmem(struct file *file)
return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
}

+void restrictedmem_error_page(struct page *page, struct address_space *mapping);
+
#else

static inline void restrictedmem_register_notifier(struct file *file,
@@ -57,6 +61,11 @@ static inline bool file_is_restrictedmem(struct file *file)
return false;
}

+static inline void restrictedmem_error_page(struct page *page,
+ struct address_space *mapping)
+{
+}
+
#endif /* CONFIG_RESTRICTEDMEM */

#endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e7ac570dda75..ee85e46c6992 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -62,6 +62,7 @@
#include <linux/page-isolation.h>
#include <linux/pagewalk.h>
#include <linux/shmem_fs.h>
+#include <linux/restrictedmem.h>
#include "swap.h"
#include "internal.h"
#include "ras/ras_event.h"
@@ -939,6 +940,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
goto out;
}

+ restrictedmem_error_page(p, mapping);
+
/*
* The shmem page is kept in page cache instead of truncating
* so is expected to have an extra refcount after error-handling.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index e5bf8907e0f8..0dcdff0d8055 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -29,6 +29,18 @@ static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
mutex_unlock(&data->lock);
}

+static void restrictedmem_notifier_error(struct restrictedmem_data *data,
+ pgoff_t start, pgoff_t end)
+{
+ struct restrictedmem_notifier *notifier;
+
+ mutex_lock(&data->lock);
+ list_for_each_entry(notifier, &data->notifiers, list) {
+ notifier->ops->error(notifier, start, end);
+ }
+ mutex_unlock(&data->lock);
+}
+
static int restrictedmem_release(struct inode *inode, struct file *file)
{
struct restrictedmem_data *data = inode->i_mapping->private_data;
@@ -248,3 +260,30 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
return 0;
}
EXPORT_SYMBOL_GPL(restrictedmem_get_page);
+
+void restrictedmem_error_page(struct page *page, struct address_space *mapping)
+{
+ struct super_block *sb = restrictedmem_mnt->mnt_sb;
+ struct inode *inode, *next;
+
+ if (!shmem_mapping(mapping))
+ return;
+
+ spin_lock(&sb->s_inode_list_lock);
+ list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+ struct restrictedmem_data *data = inode->i_mapping->private_data;
+ struct file *memfd = data->memfd;
+
+ if (memfd->f_mapping == mapping) {
+ pgoff_t start, end;
+
+ spin_unlock(&sb->s_inode_list_lock);
+
+ start = page->index;
+ end = start + thp_nr_pages(page);
+ restrictedmem_notifier_error(data, start, end);
+ return;
+ }
+ }
+ spin_unlock(&sb->s_inode_list_lock);
+}
--
Kiryl Shutsemau / Kirill A. Shutemov

2022-11-16 05:08:19

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

On Mon, Nov 14, 2022 at 11:43:37AM +0000, Alex Bennée wrote:
>
> Chao Peng <[email protected]> writes:
>
> <snip>
> > Introduction
> > ============
> > KVM userspace being able to crash the host is horrible. Under current
> > KVM architecture, all guest memory is inherently accessible from KVM
> > userspace and is exposed to the mentioned crash issue. The goal of this
> > series is to provide a solution to align mm and KVM, on a userspace
> > inaccessible approach of exposing guest memory.
> >
> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > table). This requires guest memory being mmaped into KVM userspace, but
> > this is also the source where the mentioned crash issue can happen. In
> > theory, apart from those 'shared' memory for device emulation etc, guest
> > memory doesn't have to be mmaped into KVM userspace.
> >
> > This series introduces fd-based guest memory which will not be mmaped
> > into KVM userspace. KVM populates secondary page table by using a
> > fd/offset pair backed by a memory file system. The fd can be created
> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> > directly interact with them with newly introduced in-kernel interface,
> > therefore remove the KVM userspace from the path of accessing/mmaping
> > the guest memory.
> >
> > Kirill had a patch [2] to address the same issue in a different way. It
> > tracks guest encrypted memory at the 'struct page' level and relies on
> > HWPOISON to reject the userspace access. The patch has been discussed in
> > several online and offline threads and resulted in a design document [3]
> > which is also the original proposal for this series. Later this patch
> > series evolved as more comments received in community but the major
> > concepts in [3] still hold true so recommend reading.
> >
> > The patch series may also be useful for other usages, for example, pure
> > software approach may use it to harden itself against unintentional
> > access to guest memory. This series is designed with these usages in
> > mind but doesn't have code directly support them and extension might be
> > needed.
>
> There are a couple of additional use cases where having a consistent
> memory interface with the kernel would be useful.

Thanks very much for the info. But I'm not so confident that the current
memfd_restricted() implementation can be useful for all these usages.

>
> - Xen DomU guests providing other domains with VirtIO backends
>
> Xen by default doesn't give other domains special access to a domains
> memory. The guest can grant access to regions of its memory to other
> domains for this purpose.

I'm trying to form my understanding of how this could work and what the
benefit is for a DomU guest to provide memory through memfd_restricted().
AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
but I assume VirtIO backends are still in DomU userspace and need to access
that memory, right?

>
> - pKVM on ARM
>
> Similar to Xen, pKVM moves the management of the page tables into the
> hypervisor and again doesn't allow those domains to share memory by
> default.

Right, we already had some discussions on this in the past versions.

>
> - VirtIO loopback
>
> This allows for VirtIO devices for the host kernel to be serviced by
> backends running in userspace. Obviously the memory userspace is
> allowed to access is strictly limited to the buffers and queues
> because giving userspace unrestricted access to the host kernel would
> have consequences.

Okay, but normal memfd_create() should work for it, right? And
memfd_restricted() instead may not work as it unmaps the memory from
userspace.

>
> All of these VirtIO backends work with vhost-user which uses memfds to
> pass references to guest memory from the VMM to the backend
> implementation.

These sound to me like the places where normal memfd_create() applies.
VirtIO backends work on mmap-ed memory, which currently is not the case
for memfd_restricted(). memfd_restricted() has a different design purpose:
it unmaps the memory from userspace and employs some kernel callbacks so
that other kernel modules can make use of the memory through these
callbacks instead of a userspace virtual address.
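
As a rough illustration of that split, an in-kernel consumer would look
something like the sketch below. This is not code from the series; the
interface names come from the restrictedmem header quoted elsewhere in this
thread, while the restrictedmem_get_page() argument list and the
notifier->ops wiring are assumptions:

struct my_backend {
	struct restrictedmem_notifier notifier;
	struct file *memfd;
};

static void my_invalidate_start(struct restrictedmem_notifier *nb,
				pgoff_t start, pgoff_t end)
{
	/* Tear down any secondary mappings built from [start, end). */
}

static void my_invalidate_end(struct restrictedmem_notifier *nb,
			      pgoff_t start, pgoff_t end)
{
}

static const struct restrictedmem_notifier_ops my_notifier_ops = {
	.invalidate_start = my_invalidate_start,
	.invalidate_end   = my_invalidate_end,
};

static int my_attach(struct my_backend *b, int fd)
{
	b->memfd = fget(fd);
	if (!b->memfd)
		return -EBADF;
	if (!file_is_restrictedmem(b->memfd)) {
		fput(b->memfd);
		return -EINVAL;
	}
	b->notifier.ops = &my_notifier_ops;	/* field name assumed */
	restrictedmem_register_notifier(b->memfd, &b->notifier);
	return 0;
}

static int my_map_offset(struct my_backend *b, pgoff_t offset)
{
	struct page *page;
	int order, ret;

	/* Resolve offset -> page with no userspace mapping involved. */
	ret = restrictedmem_get_page(b->memfd, offset, &page, &order);
	if (ret)
		return ret;
	/* ... hand the pfn/order to the secondary MMU ... */
	put_page(page);
	return 0;
}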

Chao
>
> > mm change
> > =========
> > Introduces a new memfd_restricted system call which can create memory
> > file that is restricted from userspace access via normal MMU operations
> > like read(), write() or mmap() etc and the only way to use it is
> > passing it to a third kernel module like KVM and relying on it to
> > access the fd through the newly added restrictedmem kernel interface.
> > The restrictedmem interface bridges the memory file subsystems
> > (tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
> > bi-directional communication between them.
> >
> >
> > KVM change
> > ==========
> > Extends the KVM memslot to provide guest private (encrypted) memory from
> > a fd. With this extension, a single memslot can maintain both private
> > memory through private fd (restricted_fd/restricted_offset) and shared
> > (unencrypted) memory through userspace mmaped host virtual address
> > (userspace_addr). For a particular guest page, the corresponding page in
> > KVM memslot can be only either private or shared and only one of the
> > shared/private parts of the memslot is visible to guest. For how this
> > new extension is used in QEMU, please refer to kvm_set_phys_mem() in
> > below TDX-enabled QEMU repo.
> >
> > Introduces new KVM_EXIT_MEMORY_FAULT exit to allow userspace to get the
> > chance on decision-making for shared <-> private memory conversion. The
> > exit can be an implicit conversion in KVM page fault handler or an
> > explicit conversion from guest OS.
> >
> > Extends existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
> > convert a guest page between private <-> shared. The data maintained in
> > these ioctls tells the truth whether a guest page is private or shared
> > and this information will be used in KVM page fault handler to decide
> > whether the private or the shared part of the memslot is visible to
> > guest.
> >
> <snip>
>
> --
> Alex Bennée

2022-11-16 10:13:40

by Alex Bennée

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM


Chao Peng <[email protected]> writes:

> On Mon, Nov 14, 2022 at 11:43:37AM +0000, Alex Bennée wrote:
>>
>> Chao Peng <[email protected]> writes:
>>
>> <snip>
>> > Introduction
>> > ============
>> > KVM userspace being able to crash the host is horrible. Under current
>> > KVM architecture, all guest memory is inherently accessible from KVM
>> > userspace and is exposed to the mentioned crash issue. The goal of this
>> > series is to provide a solution to align mm and KVM, on a userspace
>> > inaccessible approach of exposing guest memory.
>> >
>> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
>> > virtual address (hva) from core mm page table (e.g. x86 userspace page
>> > table). This requires guest memory being mmaped into KVM userspace, but
>> > this is also the source where the mentioned crash issue can happen. In
>> > theory, apart from those 'shared' memory for device emulation etc, guest
>> > memory doesn't have to be mmaped into KVM userspace.
>> >
>> > This series introduces fd-based guest memory which will not be mmaped
>> > into KVM userspace. KVM populates secondary page table by using a
>> > fd/offset pair backed by a memory file system. The fd can be created
>> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
>> > directly interact with them with newly introduced in-kernel interface,
>> > therefore remove the KVM userspace from the path of accessing/mmaping
>> > the guest memory.
>> >
>> > Kirill had a patch [2] to address the same issue in a different way. It
>> > tracks guest encrypted memory at the 'struct page' level and relies on
>> > HWPOISON to reject the userspace access. The patch has been discussed in
>> > several online and offline threads and resulted in a design document [3]
>> > which is also the original proposal for this series. Later this patch
>> > series evolved as more comments received in community but the major
>> > concepts in [3] still hold true so recommend reading.
>> >
>> > The patch series may also be useful for other usages, for example, pure
>> > software approach may use it to harden itself against unintentional
>> > access to guest memory. This series is designed with these usages in
>> > mind but doesn't have code directly support them and extension might be
>> > needed.
>>
>> There are a couple of additional use cases where having a consistent
>> memory interface with the kernel would be useful.
>
> Thanks very much for the info. But I'm not so confident that the current
> memfd_restricted() implementation can be useful for all these usages.
>
>>
>> - Xen DomU guests providing other domains with VirtIO backends
>>
>> Xen by default doesn't give other domains special access to a domains
>> memory. The guest can grant access to regions of its memory to other
>> domains for this purpose.
>
> I'm trying to form my understanding on how this could work and what's
> the benefit for a DomU guest to provide memory through memfd_restricted().
> AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
> but I assume VirtIO backends are still in DomU userspace and need to access
> that memory, right?

They need access to parts of the memory. At the moment you run your
VirtIO domains in Dom0 and give them access to the whole of a DomU's
address space - however in the Xen model the guest's memory is by default
inaccessible to other domains on the system. The DomU guest uses the Xen
grant model to expose portions of its address space to other domains -
namely for the VirtIO queues themselves and any pages containing buffers
involved in the VirtIO transaction. My thought was that this looks like a
guest memory interface which is mostly inaccessible (private) with some
holes in it where memory is being explicitly shared with other domains.

What I want to achieve is a common userspace API with defined semantics
for what happens when private and shared regions are accessed. Because
having each hypervisor/confidential computing architecture define its
own special API for accessing this memory is just a recipe for
fragmentation and makes sharing common VirtIO backends impossible.

>
>>
>> - pKVM on ARM
>>
>> Similar to Xen, pKVM moves the management of the page tables into the
>> hypervisor and again doesn't allow those domains to share memory by
>> default.
>
> Right, we already had some discussions on this in the past versions.
>
>>
>> - VirtIO loopback
>>
>> This allows for VirtIO devices for the host kernel to be serviced by
>> backends running in userspace. Obviously the memory userspace is
>> allowed to access is strictly limited to the buffers and queues
>> because giving userspace unrestricted access to the host kernel would
>> have consequences.
>
> Okay, but normal memfd_create() should work for it, right? And
> memfd_restricted() instead may not work as it unmaps the memory from
> userspace.
>
>>
>> All of these VirtIO backends work with vhost-user which uses memfds to
>> pass references to guest memory from the VMM to the backend
>> implementation.
>
> Sounds to me these are the places where normal memfd_create() can act on.
> VirtIO backends work on the mmap-ed memory which currently is not the
> case for memfd_restricted(). memfd_restricted() has different design
> purpose that unmaps the memory from userspace and employs some kernel
> callbacks so other kernel modules can make use of the memory with these
> callbacks instead of userspace virtual address.

Maybe my understanding is backwards then. Are you saying a guest starts
with all its memory exposed and then selectively unmaps the private
regions? Is this driven by the VMM or the guest itself?

--
Alex Bennée

2022-11-16 22:55:30

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

On Tue, Oct 25, 2022, Chao Peng wrote:
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> + bool is_private)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + int idx;
> + int r = 0;
> +
> + if (size == 0 || gpa + size < gpa)
> + return -EINVAL;
> + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> + return -EINVAL;
> +
> + start = gpa >> PAGE_SHIFT;
> + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + /*
> + * Guest memory defaults to private, kvm->mem_attr_array only stores
> + * shared memory.
> + */
> + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> + idx = srcu_read_lock(&kvm->srcu);
> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm, start, end);
> +
> + for (i = start; i < end; i++) {
> + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> + GFP_KERNEL_ACCOUNT));
> + if (r)
> + goto err;
> + }
> +
> + kvm_unmap_mem_range(kvm, start, end);
> +
> + goto ret;
> +err:
> + for (; i > start; i--)
> + xa_erase(&kvm->mem_attr_array, i);

I don't think deleting previous entries is correct. To unwind, the correct thing
to do is restore the original values. E.g. if userspace is mapping a large
range as shared, and some of the previous entries were shared, deleting them would
incorrectly "convert" those entries to private.

Tracking the previous state likely isn't the best approach, e.g. it would require
speculatively allocating extra memory for a rare condition that is likely going to
lead to OOM anyways.

Instead of trying to unwind, what about updating the ioctl() params such that
retrying with the updated addr+size would Just Work? E.g.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 55b07aae67cc..f1de592a1a06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1015,15 +1015,12 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,

kvm_unmap_mem_range(kvm, start, end, attr);

- goto ret;
-err:
- for (; i > start; i--)
- xa_erase(&kvm->mem_attr_array, i);
-ret:
kvm_mmu_invalidate_end(kvm, start, end);
KVM_MMU_UNLOCK(kvm);
srcu_read_unlock(&kvm->srcu, idx);

+ <update gpa and size>
+
return r;
}
#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
@@ -4989,6 +4986,8 @@ static long kvm_vm_ioctl(struct file *filp,

r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
region.size, set);
+ if (copy_to_user(argp, &region, sizeof(region)) && !r)
+ r = -EFAULT;
break;
}
#endif
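
On the userspace side, that would let the caller see how far a failed
conversion got and retry just the remainder, e.g. (sketch only; whether the
region is written back on every exit path is an assumption here):

	struct kvm_enc_region region = { .addr = gpa, .size = size };

	if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region) < 0) {
		/*
		 * region.addr/region.size now describe the not-yet-converted
		 * remainder, so the caller can free up memory and retry just
		 * that part instead of guessing where the ioctl stopped.
		 */
	}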

2022-11-17 13:48:17

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

On Wed, Nov 16, 2022 at 10:24:11PM +0000, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Chao Peng wrote:
> > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > + bool is_private)
> > +{
> > + gfn_t start, end;
> > + unsigned long i;
> > + void *entry;
> > + int idx;
> > + int r = 0;
> > +
> > + if (size == 0 || gpa + size < gpa)
> > + return -EINVAL;
> > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > + return -EINVAL;
> > +
> > + start = gpa >> PAGE_SHIFT;
> > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > + /*
> > + * Guest memory defaults to private, kvm->mem_attr_array only stores
> > + * shared memory.
> > + */
> > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > +
> > + idx = srcu_read_lock(&kvm->srcu);
> > + KVM_MMU_LOCK(kvm);
> > + kvm_mmu_invalidate_begin(kvm, start, end);
> > +
> > + for (i = start; i < end; i++) {
> > + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > + GFP_KERNEL_ACCOUNT));
> > + if (r)
> > + goto err;
> > + }
> > +
> > + kvm_unmap_mem_range(kvm, start, end);
> > +
> > + goto ret;
> > +err:
> > + for (; i > start; i--)
> > + xa_erase(&kvm->mem_attr_array, i);
>
> I don't think deleting previous entries is correct. To unwind, the correct thing
> to do is restore the original values. E.g. if userspace is mapping a large
> range as shared, and some of the previous entries were shared, deleting them would
> incorrectly "convert" those entries to private.

Ah, right!

>
> Tracking the previous state likely isn't the best approach, e.g. it would require
> speculatively allocating extra memory for a rare condition that is likely going to
> lead to OOM anyways.

Agree.

>
> Instead of trying to unwind, what about updating the ioctl() params such that
> retrying with the updated addr+size would Just Work? E.g.

Looks good to me. Thanks!

Chao
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 55b07aae67cc..f1de592a1a06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1015,15 +1015,12 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>
> kvm_unmap_mem_range(kvm, start, end, attr);
>
> - goto ret;
> -err:
> - for (; i > start; i--)
> - xa_erase(&kvm->mem_attr_array, i);
> -ret:
> kvm_mmu_invalidate_end(kvm, start, end);
> KVM_MMU_UNLOCK(kvm);
> srcu_read_unlock(&kvm->srcu, idx);
>
> + <update gpa and size>
> +
> return r;
> }
> #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> @@ -4989,6 +4986,8 @@ static long kvm_vm_ioctl(struct file *filp,
>
> r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> region.size, set);
> + if (copy_to_user(argp, &region, sizeof(region)) && !r)
> + r = -EFAULT;
> break;
> }
> #endif

2022-11-17 14:42:02

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

On Wed, Nov 16, 2022 at 09:40:23AM +0000, Alex Bennée wrote:
>
> Chao Peng <[email protected]> writes:
>
> > On Mon, Nov 14, 2022 at 11:43:37AM +0000, Alex Bennée wrote:
> >>
> >> Chao Peng <[email protected]> writes:
> >>
> >> <snip>
> >> > Introduction
> >> > ============
> >> > KVM userspace being able to crash the host is horrible. Under current
> >> > KVM architecture, all guest memory is inherently accessible from KVM
> >> > userspace and is exposed to the mentioned crash issue. The goal of this
> >> > series is to provide a solution to align mm and KVM, on a userspace
> >> > inaccessible approach of exposing guest memory.
> >> >
> >> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> >> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> >> > table). This requires guest memory being mmaped into KVM userspace, but
> >> > this is also the source where the mentioned crash issue can happen. In
> >> > theory, apart from those 'shared' memory for device emulation etc, guest
> >> > memory doesn't have to be mmaped into KVM userspace.
> >> >
> >> > This series introduces fd-based guest memory which will not be mmaped
> >> > into KVM userspace. KVM populates secondary page table by using a
> >> > fd/offset pair backed by a memory file system. The fd can be created
> >> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> >> > directly interact with them with newly introduced in-kernel interface,
> >> > therefore remove the KVM userspace from the path of accessing/mmaping
> >> > the guest memory.
> >> >
> >> > Kirill had a patch [2] to address the same issue in a different way. It
> >> > tracks guest encrypted memory at the 'struct page' level and relies on
> >> > HWPOISON to reject the userspace access. The patch has been discussed in
> >> > several online and offline threads and resulted in a design document [3]
> >> > which is also the original proposal for this series. Later this patch
> >> > series evolved as more comments received in community but the major
> >> > concepts in [3] still hold true so recommend reading.
> >> >
> >> > The patch series may also be useful for other usages, for example, pure
> >> > software approach may use it to harden itself against unintentional
> >> > access to guest memory. This series is designed with these usages in
> >> > mind but doesn't have code directly support them and extension might be
> >> > needed.
> >>
> >> There are a couple of additional use cases where having a consistent
> >> memory interface with the kernel would be useful.
> >
> > Thanks very much for the info. But I'm not so confident that the current
> > memfd_restricted() implementation can be useful for all these usages.
> >
> >>
> >> - Xen DomU guests providing other domains with VirtIO backends
> >>
> >> Xen by default doesn't give other domains special access to a domains
> >> memory. The guest can grant access to regions of its memory to other
> >> domains for this purpose.
> >
> > I'm trying to form my understanding on how this could work and what's
> > the benefit for a DomU guest to provide memory through memfd_restricted().
> > AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
> > but I assume VirtIO backends are still in DomU userspace and need to access
> > that memory, right?
>
> They need access to parts of the memory. At the moment you run your
> VirtIO domains in the Dom0 and give them access to the whole of a DomU's
> address space - however the Xen model is by default the guests memory is
> inaccessible to other domains on the system. The DomU guest uses the Xen
> grant model to expose portions of its address space to other domains -
> namely for the VirtIO queues themselves and any pages containing buffers
> involved in the VirtIO transaction. My thought was that looks like a
> guest memory interface which is mostly inaccessible (private) with some
> holes in it where memory is being explicitly shared with other domains.

Yes, similar in conception. For KVM, memfd_restricted() is used by the host
OS, and the guest issues conversions between private and shared for its
memory ranges. This is similar to a Xen DomU guest granting its memory to
other domains. Likewise, I guess that to make memfd_restricted() really
useful for Xen, it should be used in the VirtIO backend domain (i.e. the
equivalent of the host position for KVM).

>
> What I want to achieve is a common userspace API with defined semantics
> for what happens when private and shared regions are accessed. Because
> having each hypervisor/confidential computing architecture define its
> own special API for accessing this memory is just a recipe for
> fragmentation and makes sharing common VirtIO backends impossible.

Yes, I agree. That's interesting to explore.

>
> >
> >>
> >> - pKVM on ARM
> >>
> >> Similar to Xen, pKVM moves the management of the page tables into the
> >> hypervisor and again doesn't allow those domains to share memory by
> >> default.
> >
> > Right, we already had some discussions on this in the past versions.
> >
> >>
> >> - VirtIO loopback
> >>
> >> This allows for VirtIO devices for the host kernel to be serviced by
> >> backends running in userspace. Obviously the memory userspace is
> >> allowed to access is strictly limited to the buffers and queues
> >> because giving userspace unrestricted access to the host kernel would
> >> have consequences.
> >
> > Okay, but normal memfd_create() should work for it, right? And
> > memfd_restricted() instead may not work as it unmaps the memory from
> > userspace.
> >
> >>
> >> All of these VirtIO backends work with vhost-user which uses memfds to
> >> pass references to guest memory from the VMM to the backend
> >> implementation.
> >
> > Sounds to me these are the places where normal memfd_create() can act on.
> > VirtIO backends work on the mmap-ed memory which currently is not the
> > case for memfd_restricted(). memfd_restricted() has different design
> > purpose that unmaps the memory from userspace and employs some kernel
> > callbacks so other kernel modules can make use of the memory with these
> > callbacks instead of userspace virtual address.
>
> Maybe my understanding is backwards then. Are you saying a guest starts
> with all its memory exposed and then selectively unmaps the private
> regions? Is this driven by the VMM or the guest itself?

For confidential computing usages, the guest normally starts with all guest
memory being private, i.e. it cannot be accessed by the host. The memory
lives in memfd_restricted() memory and is not exposed to the host userspace
VMM like QEMU. The guest can then selectively convert its private
sub-regions (e.g. the VirtIO queues in the guest VirtIO frontend driver) to
shared so the host backend driver in QEMU can see them. When this happens,
a new shared mapping is established in KVM and the memory is provided from
normal mmap-able memory, and QEMU can then do whatever it needs for the
device emulation.
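
In VMM terms that flow would be handled roughly by the fragment below, as
part of the vcpu exit-handling switch. This is only a sketch: the
KVM_EXIT_MEMORY_FAULT exit and the KVM_MEMORY_ENCRYPT_*_REGION ioctls are
from this series, but the exit-struct field and flag names used here are
assumptions, not the exact uAPI:

	case KVM_EXIT_MEMORY_FAULT: {
		/* field/flag names assumed for illustration */
		struct kvm_enc_region region = {
			.addr = run->memory.gpa,
			.size = run->memory.size,
		};
		int req = (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
			  KVM_MEMORY_ENCRYPT_REG_REGION :
			  KVM_MEMORY_ENCRYPT_UNREG_REGION;

		/* flip the attribute for the faulting range, then re-enter */
		ioctl(vm_fd, req, &region);
		break;
	}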

Thanks,
Chao
>
> --
> Alex Bennée