2022-12-02 07:09:14

by Chao Peng

Subject: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

This patch series implements KVM guest private memory for confidential
computing scenarios like Intel TDX[1]. If a TDX host accesses
TDX-protected guest memory, a machine check can be raised and can crash
the running host system, which is terrible for multi-tenant
configurations. Host accesses include those from KVM userspace such as
QEMU. This series addresses KVM-userspace-induced crashes by introducing
new mm and KVM interfaces so that KVM userspace can still manage guest
memory via an fd-based approach, but can never access the guest memory
content.

The patch series touches both core mm and KVM code. I would appreciate
it if Andrew/Hugh and Paolo/Sean could review and pick these patches.
Any other reviews are always welcome.
- 01: mm change, targeted for the mm tree
- 02-09: KVM changes, targeted for the KVM tree

Given that KVM is currently the only user of the mm part, I have chatted
with Paolo and he is OK with merging the mm change through the KVM tree,
but a reviewed-by/acked-by is still expected from the mm people.

The patches have been verified in an Intel TDX environment. In addition,
Vishal has done excellent work on the selftests[4] dedicated to this
series, making it possible to test the series without special hardware
or elaborate steps to build a VM environment. See the Test section
below for more info.


Introduction
============
KVM userspace being able to crash the host is horrible. Under the
current KVM architecture, all guest memory is inherently accessible from
KVM userspace and is therefore exposed to the crash issue mentioned
above. The goal of this series is to align mm and KVM on an approach
that exposes guest memory to KVM without making it accessible to
userspace.

Normally, KVM populates the secondary page table (e.g. EPT) by using a
host virtual address (hva) resolved through the core mm page table (e.g.
the x86 userspace page table). This requires guest memory to be mmaped
into KVM userspace, which is also where the crash issue mentioned above
originates. In theory, apart from the 'shared' memory used for device
emulation etc., guest memory does not have to be mmaped into KVM
userspace at all.
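
For reference, a minimal sketch of the conventional hva-based setup this
paragraph describes: userspace mmap()s anonymous memory and registers its
hva with KVM, which is exactly the userspace mapping this series wants to
avoid for private memory (error handling omitted):

#include <stddef.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Conventional memslot: guest RAM is mmaped into KVM userspace. */
static int register_shared_ram(int vm_fd, __u64 gpa, __u64 size)
{
        void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct kvm_userspace_memory_region region = {
                .slot = 0,
                .flags = 0,
                .guest_phys_addr = gpa,
                .memory_size = size,
                /* This hva is what makes guest memory reachable from userspace. */
                .userspace_addr = (__u64)(unsigned long)hva,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}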

This series introduces fd-based guest memory that is never mmaped into
KVM userspace. KVM populates the secondary page table by using an
fd/offset pair backed by a memory file system. The fd can be created
from a supported memory filesystem like tmpfs/hugetlbfs, and KVM
interacts with it directly through a newly introduced in-kernel
interface, thereby removing KVM userspace from the path of
accessing/mmaping the guest memory.

Kirill had a patch [2] that addressed the same issue in a different way.
It tracks guest encrypted memory at the 'struct page' level and relies
on HWPOISON to reject userspace access. That patch was discussed in
several online and offline threads and resulted in a design document [3],
which is also the original proposal for this series. This series has
since evolved as more comments were received from the community, but the
major concepts in [3] still hold true, so it is recommended reading.

The patch series may also be useful for other use cases; for example, a
pure software approach could use it to harden itself against
unintentional access to guest memory. This series is designed with such
usages in mind but does not contain code that directly supports them, so
extensions might be needed.


mm change
=========
Introduces a new memfd_restricted() system call that creates a memory
file which is restricted from userspace access through normal paths like
read(), write() or mmap(); the only way to use it is to pass it to an
in-kernel consumer like KVM, which accesses the fd through the newly
added restrictedmem kernel interface. The restrictedmem interface
bridges the memory file subsystems (tmpfs/hugetlbfs etc.) and their
users (KVM in this case) and provides bi-directional communication
between them.
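
For illustration, a minimal userspace sketch of creating such a
restricted memfd, assuming the syscall takes a single flags argument and
using a placeholder syscall number (the real number is assigned by the
syscall table entries in patch 01):

#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451       /* placeholder; see patch 01 */
#endif

/* Create an fd whose memory cannot be read()/written()/mmaped by userspace. */
static int create_restricted_memfd(void)
{
        /* Assumption: the only argument is a flags word, currently 0. */
        return syscall(__NR_memfd_restricted, 0);
}

The returned fd is then handed to KVM through the memslot extension
described in the next section.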


KVM change
==========
Extends the KVM memslot to provide guest private (encrypted) memory from
an fd. With this extension, a single memslot can maintain both private
memory through a private fd (restricted_fd/restricted_offset) and shared
(unencrypted) memory through a userspace-mmaped host virtual address
(userspace_addr). For a particular guest page, the corresponding page in
the KVM memslot can only be either private or shared, and only one of
the shared/private parts of the memslot is visible to the guest. For how
this new extension is used in QEMU, please refer to kvm_set_phys_mem()
in the TDX-enabled QEMU repo below.
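
As an illustration (not QEMU's actual implementation), a minimal sketch
of registering such a memslot from userspace, assuming uapi headers
updated by this series and that KVM_MEM_PRIVATE has been enabled as done
in patch 09 (error handling omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* One memslot carrying both a shared (hva) and a private (fd) backing. */
static int register_private_slot(int vm_fd, __u64 gpa, __u64 size,
                                 void *shared_hva, int restricted_fd)
{
        struct kvm_userspace_memory_region_ext region_ext = {
                .region = {
                        .slot = 0,
                        .flags = KVM_MEM_PRIVATE,
                        .guest_phys_addr = gpa,
                        .memory_size = size,
                        /* shared part, accessible to userspace */
                        .userspace_addr = (__u64)(unsigned long)shared_hva,
                },
                /* private part, provided by a restrictedmem fd */
                .restricted_fd = restricted_fd,
                .restricted_offset = 0,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext);
}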

Introduces a new KVM_EXIT_MEMORY_FAULT exit to give userspace the chance
to make decisions about shared <-> private memory conversion. The exit
can be triggered by an implicit conversion in the KVM page fault handler
or an explicit conversion request from the guest OS.
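
A rough sketch of the userspace side of such an exit; convert_memory()
is a hypothetical helper that flips the attribute via
KVM_SET_MEMORY_ATTRIBUTES (see below) and adjusts the backing stores,
and the uapi fields are those added by this series:

#include <stdbool.h>
#include <linux/kvm.h>

/* Hypothetical helper, not part of this series. */
void convert_memory(__u64 gpa, __u64 size, bool to_private);

static void handle_memory_fault_exit(struct kvm_run *run)
{
        if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                bool to_private = run->memory.flags &
                                  KVM_MEMORY_EXIT_FLAG_PRIVATE;

                /*
                 * Convert [gpa, gpa + size) to the access type the guest
                 * wants, then re-enter the vcpu (KVM_RUN) to retry the fault.
                 */
                convert_memory(run->memory.gpa, run->memory.size, to_private);
        }
}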

Introduces a new KVM ioctl, KVM_SET_MEMORY_ATTRIBUTES, to maintain
whether a page is private or shared. This ioctl allows userspace to
convert a page between private <-> shared. The maintained data is the
source of truth for whether a guest page is private or shared, and this
information is used in the KVM page fault handler to decide whether the
private or the shared part of the memslot is visible to the guest.
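
For illustration, a minimal sketch of the conversion ioctl, assuming the
struct kvm_memory_attributes layout used by this series
(address/size/attributes/flags) and updated uapi headers:

#include <stdbool.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Mark [gpa, gpa + size) private (or shared) in KVM's view. */
static int set_range_private(int vm_fd, __u64 gpa, __u64 size, bool private)
{
        struct kvm_memory_attributes attrs = {
                .address = gpa,
                .size = size,
                .attributes = private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
                .flags = 0,     /* currently unused, must be zero */
        };

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}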


Test
====
Two kinds of tests were run:
- Selftests [4] from Vishal and VM boot tests in a non-TDX environment
Code is also in the repo below: https://github.com/chao-p/linux/tree/privmem-v10

- Functional tests in a TDX-capable environment
Tested the new functionality in a TDX environment. Code repos:
Linux: https://github.com/chao-p/linux/tree/privmem-v10-tdx
QEMU: https://github.com/chao-p/qemu/tree/privmem-v10

An example QEMU command line for TDX test:
-object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
-machine confidential-guest-support=tdx \
-object memory-backend-memfd-private,id=ram1,size=${mem} \
-machine memory-backend=ram1


TODO
====
- Page accounting and limiting for encrypted memory
- hugetlbfs support


Changelog
=========
v10:
- mm: hook up restricted_memfd to memory failure and route it to
kernel users through .error() callback.
- mm: call invalidate() notifier only for FALLOC_FL_PUNCH_HOLE, i.e.
not for allocation.
- KVM: introduce new ioctl KVM_SET_MEMORY_ATTRIBUTES for memory
conversion instead of reusing KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
- KVM: refine gfn-based mmu_notifier_retry() mechanism.
- KVM: improve lpage_info updating code.
- KVM: fix the bug in private memory handling that a private fault may
fall into a non-private memslot.
- KVM: handle memory machine check error for fd-based memory.
v9:
- mm: move inaccessible memfd into a separate syscall.
- mm: return page instead of pfn_t for inaccessible_get_pfn and remove
inaccessible_put_pfn.
- KVM: rename inaccessible/private to restricted and CONFIG change to
make the code friendly to pKVM.
- KVM: add invalidate_begin/end pair to fix race contention and revise
the lock protection for invalidation path.
- KVM: optimize setting lpage_info for > 2M level by direct accessing
lower level's result.
- KVM: avoid loading the xarray in kvm_mmu_max_mapping_level() and instead
let the caller pass in is_private.
- KVM: API doc improvement.
v8:
- mm: redesign the mm part by introducing a shim layer (inaccessible_memfd)
in memfd to avoid touching the memory file systems directly.
- mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and
causes confusion in this series; it will be sent out separately.
- doc: exclude the man page change; it's not a kernel patch and will
be sent out separately.
- KVM: adapt to use the new mm inaccessible_memfd interface.
- KVM: update lpage_info when setting mem_attr_array to support
large page.
- KVM: change from xa_store_range to xa_store for mem_attr_array because
xa_store_range overrides all entries, which is not the intended
behavior for us.
- KVM: refine the mmu_invalidate_retry_gfn mechanism for private page.
- KVM: reorganize KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and private page
handling code suggested by Sean.
v7:
- mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
- KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
private/shared info.
- KVM: use similar sync mechanism between zap/page fault paths as
mmu_notifier for memfile_notifier based invalidation.
v6:
- mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
feature consistency among all memfile_notifier users and get rid of
internal flags like SHM_F_INACCESSIBLE.
- mm: make pfn_ops callbacks members of memfile_backing_store
and then refer to it directly in memfile_notifier.
- mm: remove backing store unregister.
- mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
- KVM: reorganize patch sequence for page fault handling and private
memory enabling.
v5:
- Add a man page for the MFD_INACCESSIBLE flag and improve the KVM API doc
for the new memslot extensions.
- mm: introduce memfile_{un}register_backing_store to allow a memory
backing store to register/unregister itself with memfile_notifier.
- mm: remove F_SEAL_INACCESSIBLE, use in-kernel flag
(SHM_F_INACCESSIBLE for shmem) instead.
- mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
MFD_INACCESSIBLE memory.
- KVM: remove the overlap check for mapping the same file+offset into
multiple gfns due to performance considerations; a warning is added to
the documentation instead.
v4:
- mm: rename memfd_ops to memfile_notifier and separate it from
memfd.c to standalone memfile-notifier.c.
- KVM: move pfn_ops to per-memslot scope from per-vm scope and allow
registering multiple memslots to the same memory backing store.
- KVM: add a 'kvm' reference in memslot so that we can recover kvm in
memfile_notifier handlers.
- KVM: add 'private_' prefix for the new fields in memslot.
- KVM: reshape the 'type' to 'flag' for kvm_memory_exit
v3:
- Remove 'RFC' prefix.
- Fix race condition between memfile_notifier handlers and kvm destroy.
- mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
setting F_SEAL_INACCESSIBLE when the fd is created.
- KVM: add the shared part of the memslot back to make private/shared
pages live in one memslot.

Reference
=========
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/[email protected]/T/
[3] Original design proposal:
https://lore.kernel.org/all/[email protected]/
[4] Selftest:
https://lore.kernel.org/all/[email protected]/


Chao Peng (8):
KVM: Introduce per-page memory attributes
KVM: Extend the memslot to support fd-based private memory
KVM: Add KVM_EXIT_MEMORY_FAULT exit
KVM: Use gfn instead of hva for mmu_notifier_retry
KVM: Unmap existing mappings when change the memory attributes
KVM: Update lpage info when private/shared memory are mixed
KVM: Handle page fault for private memory
KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
mm: Introduce memfd_restricted system call to create restricted user
memory

Documentation/virt/kvm/api.rst | 125 ++++++-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/kvm_host.h | 9 +
arch/x86/kvm/Kconfig | 3 +
arch/x86/kvm/mmu/mmu.c | 205 ++++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 14 +-
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
arch/x86/kvm/x86.c | 17 +-
include/linux/kvm_host.h | 103 +++++-
include/linux/restrictedmem.h | 71 ++++
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/kvm.h | 53 +++
include/uapi/linux/magic.h | 1 +
kernel/sys_ni.c | 3 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/memory-failure.c | 3 +
mm/restrictedmem.c | 318 +++++++++++++++++
virt/kvm/Kconfig | 6 +
virt/kvm/kvm_main.c | 469 +++++++++++++++++++++----
23 files changed, 1323 insertions(+), 93 deletions(-)
create mode 100644 include/linux/restrictedmem.h
create mode 100644 mm/restrictedmem.c


base-commit: df0bb47baa95aad133820b149851d5b94cbc6790
--
2.25.1


2022-12-02 07:09:39

by Chao Peng

Subject: [PATCH v10 8/9] KVM: Handle page fault for private memory

A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
hva-based shared memory. Architecture code (like TDX code) can tell
whether the ongoing fault is private or not. This patch adds an
'is_private' field to kvm_page_fault to indicate this, and architecture
code is expected to set it.

The page fault handling logic for such a memslot differs depending on
whether the fault is private or shared. KVM checks whether 'is_private'
matches the host's view of the page (maintained in mem_attr_array).
- On a successful match, a private pfn is obtained with
restrictedmem_get_page() and a shared pfn is obtained with the existing
get_user_pages().
- On a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
userspace. Userspace can then convert the memory between private/shared
in the host's view and retry the fault.

Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
include/linux/kvm_host.h | 30 ++++++++++++++++
5 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2190fd8c95c0..b1953ebc012e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,

int kvm_mmu_max_mapping_level(struct kvm *kvm,
const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level)
+ int max_level, bool is_private)
{
struct kvm_lpage_info *linfo;
int host_level;
@@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
break;
}

+ if (is_private)
+ return max_level;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;

@@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* level, which will be used to do precise, accurate accounting.
*/
fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
- fault->gfn, fault->max_level);
+ fault->gfn, fault->max_level,
+ fault->is_private);
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;

@@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
}

+static inline u8 order_to_level(int order)
+{
+ BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+ return PG_LEVEL_1G;
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+ return PG_LEVEL_2M;
+
+ return PG_LEVEL_4K;
+}
+
+static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+ if (fault->is_private)
+ vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+ else
+ vcpu->run->memory.flags = 0;
+ vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+ vcpu->run->memory.size = PAGE_SIZE;
+ return RET_PF_USER;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ int order;
+ struct kvm_memory_slot *slot = fault->slot;
+
+ if (!kvm_slot_can_be_private(slot))
+ return kvm_do_memory_fault_exit(vcpu, fault);
+
+ if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
+ return RET_PF_RETRY;
+
+ fault->max_level = min(order_to_level(order), fault->max_level);
+ fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+ return RET_PF_CONTINUE;
+}
+
static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
@@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return RET_PF_EMULATE;
}

+ if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
+ return kvm_do_memory_fault_exit(vcpu, fault);
+
+ if (fault->is_private)
+ return kvm_faultin_pfn_private(vcpu, fault);
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
fault->write, &fault->map_writable,
@@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
return -EIO;
}

+ if (r == RET_PF_USER)
+ return 0;
+
if (r < 0)
return r;
if (r != RET_PF_EMULATE)
@@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
*/
if (sp->role.direct &&
sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
- PG_LEVEL_NUM)) {
+ PG_LEVEL_NUM,
+ false)) {
kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);

if (kvm_available_flush_tlb_with_range())
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index dbaf6755c5a7..5ccf08183b00 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -189,6 +189,7 @@ struct kvm_page_fault {

/* Derived from mmu and global state. */
const bool is_tdp;
+ const bool is_private;
const bool nx_huge_page_workaround_enabled;

/*
@@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
* RET_PF_RETRY: let CPU fault again on the address.
* RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
* RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
* RET_PF_FIXED: The faulting entry has been fixed.
* RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
*
@@ -253,6 +255,7 @@ enum {
RET_PF_RETRY,
RET_PF_EMULATE,
RET_PF_INVALID,
+ RET_PF_USER,
RET_PF_FIXED,
RET_PF_SPURIOUS,
};
@@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,

int kvm_mmu_max_mapping_level(struct kvm *kvm,
const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level);
+ int max_level, bool is_private);
void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);

@@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);

+#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+ WARN_ON_ONCE(1);
+ return -EOPNOTSUPP;
+}
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
TRACE_DEFINE_ENUM(RET_PF_RETRY);
TRACE_DEFINE_ENUM(RET_PF_EMULATE);
TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
TRACE_DEFINE_ENUM(RET_PF_FIXED);
TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 771210ce5181..8ba1a4afc546 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
continue;

max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
- iter.gfn, PG_LEVEL_NUM);
+ iter.gfn, PG_LEVEL_NUM, false);
if (max_mapping_level < iter.level)
continue;

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 25099c94e770..153842bb33df 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
}
#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */

+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+ return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
+ KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+#else
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+ return false;
+}
+
+#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
+
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+ int ret;
+ struct page *page;
+ pgoff_t index = gfn - slot->base_gfn +
+ (slot->restricted_offset >> PAGE_SHIFT);
+
+ ret = restrictedmem_get_page(slot->restricted_file, index,
+ &page, order);
+ *pfn = page_to_pfn(page);
+ return ret;
+}
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
#endif
--
2.25.1

2022-12-02 07:10:58

by Chao Peng

Subject: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

Currently, in the mmu_notifier invalidate path, the hva range is recorded
and then checked against by mmu_notifier_retry_hva() in the page fault
handling path. However, for the to-be-introduced private memory, a page
fault may not have an associated hva, so checking the gfn (gpa) makes
more sense.

For existing hva-based shared memory, gfn is expected to work as well.
The only downside is that when multiple gfns alias a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 +++++---
include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
virt/kvm/kvm_main.c | 32 +++++++++++++++++++++++---------
3 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4736d7849c60..e2c70b5afa3e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;

return fault->slot &&
- mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+ mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
}

static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)

write_lock(&kvm->mmu_lock);

- kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+ kvm_mmu_invalidate_begin(kvm);
+
+ kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);

flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);

@@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
gfn_end - gfn_start);

- kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+ kvm_mmu_invalidate_end(kvm);

write_unlock(&kvm->mmu_lock);
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02347e386ea2..3d69484d2704 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
- unsigned long mmu_invalidate_range_start;
- unsigned long mmu_invalidate_range_end;
+ gfn_t mmu_invalidate_range_start;
+ gfn_t mmu_invalidate_range_end;
#endif
struct list_head devices;
u64 manual_dirty_log_protect;
@@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
#endif

-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
- unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
- unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm);
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm);

long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
@@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
return 0;
}

-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
unsigned long mmu_seq,
- unsigned long hva)
+ gfn_t gfn)
{
lockdep_assert_held(&kvm->mmu_lock);
/*
@@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
* that might be being invalidated. Note that it may include some false
* positives, due to shortcuts when handing concurrent invalidations.
*/
- if (unlikely(kvm->mmu_invalidate_in_progress) &&
- hva >= kvm->mmu_invalidate_range_start &&
- hva < kvm->mmu_invalidate_range_end)
- return 1;
+ if (unlikely(kvm->mmu_invalidate_in_progress)) {
+ /*
+ * Dropping mmu_lock after bumping mmu_invalidate_in_progress
+ * but before updating the range is a KVM bug.
+ */
+ if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
+ kvm->mmu_invalidate_range_end == INVALID_GPA))
+ return 1;
+
+ if (gfn >= kvm->mmu_invalidate_range_start &&
+ gfn < kvm->mmu_invalidate_range_end)
+ return 1;
+ }
+
if (kvm->mmu_invalidate_seq != mmu_seq)
return 1;
return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b882eb2c76a2..ad55dfbc75d7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,

typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);

-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
- unsigned long end);
-
+typedef void (*on_lock_fn_t)(struct kvm *kvm);
typedef void (*on_unlock_fn_t)(struct kvm *kvm);

struct kvm_hva_range {
@@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
locked = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
- range->on_lock(kvm, range->start, range->end);
+ range->on_lock(kvm);
+
if (IS_KVM_NULL_FN(range->handler))
break;
}
@@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
}

-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
- unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
{
/*
* The count increase must become visible at unlock time as no
@@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
* count is also read inside the mmu_lock critical section.
*/
kvm->mmu_invalidate_in_progress++;
+
+ if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+ kvm->mmu_invalidate_range_start = INVALID_GPA;
+ kvm->mmu_invalidate_range_end = INVALID_GPA;
+ }
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
if (likely(kvm->mmu_invalidate_in_progress == 1)) {
kvm->mmu_invalidate_range_start = start;
kvm->mmu_invalidate_range_end = end;
@@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
}
}

+static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+ return kvm_unmap_gfn_range(kvm, range);
+}
+
static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
@@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
.start = range->start,
.end = range->end,
.pte = __pte(0),
- .handler = kvm_unmap_gfn_range,
+ .handler = kvm_mmu_unmap_gfn_range,
.on_lock = kvm_mmu_invalidate_begin,
.on_unlock = kvm_arch_guest_memory_reclaimed,
.flush_on_ret = true,
@@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
return 0;
}

-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
- unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
{
/*
* This sequence increase will notify the kvm page fault that
--
2.25.1

2022-12-02 07:38:25

by Chao Peng

Subject: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

In memory encryption usage, guest memory may be encrypted with a special
key and can be accessed only by the guest itself. We call such memory
private memory. Allowing userspace to access guest private memory is of
no value and can sometimes cause problems. This new KVM memslot
extension allows guest private memory to be provided through a
restrictedmem-backed file descriptor (fd), and userspace is restricted
from accessing the bookmarked memory in the fd.

This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
additional KVM memslot fields, restricted_fd/restricted_offset, to allow
userspace to instruct KVM to provide guest memory through restricted_fd.
'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
and the size is 'memory_size'.

The extended memslot can still have the userspace_addr (hva). When in
use, a single memslot can maintain both private memory through
restricted_fd and shared memory through userspace_addr. Whether the
private or the shared part is visible to the guest is maintained by
other KVM code.

A restrictedmem_notifier field is also added to the memslot structure to
allow the restricted_fd's backing store to notify KVM of memory changes,
so KVM can then invalidate its page table entries or handle memory errors.

Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added,
and right now it is selected on X86_64 only.

To make future maintenance easy, internally use a binary-compatible
alias struct kvm_user_mem_region to handle both the normal and the
'_ext' variants.

Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
---
Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
arch/x86/kvm/Kconfig | 2 ++
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 8 ++++--
include/uapi/linux/kvm.h | 28 +++++++++++++++++++
virt/kvm/Kconfig | 3 +++
virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------
7 files changed, 114 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index bb2f709c0900..99352170c130 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
:Capability: KVM_CAP_USER_MEMORY
:Architectures: all
:Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
:Returns: 0 on success, -1 on error

::
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
__u64 userspace_addr; /* start of the userspace allocated memory */
};

+ struct kvm_userspace_memory_region_ext {
+ struct kvm_userspace_memory_region region;
+ __u64 restricted_offset;
+ __u32 restricted_fd;
+ __u32 pad1;
+ __u64 pad2[14];
+ };
+
/* for kvm_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+ #define KVM_MEM_PRIVATE (1UL << 2)

This ioctl allows the user to create, modify or delete a guest physical
memory slot. Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
be identical. This allows large pages in the guest to be backed by large
pages in the host.

-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
-writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
-use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only. In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext struct includes all fields of
+kvm_userspace_memory_region struct, while also adds additional fields for some
+other features. See below description of flags field for more information.
+It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
+ within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
+
+- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
+ read-only. In this case, writes to this memory will be posted to userspace as
+ KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+ KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
+ memory backed by a file descriptor(fd) and userspace access to the fd may be
+ restricted. Userspace should use restricted_fd/restricted_offset in the
+ kvm_userspace_memory_region_ext to instruct KVM to provide private memory
+ to guest. Userspace should guarantee not to map the same host physical address
+ indicated by restricted_fd/restricted_offset to different guest physical
+ addresses within multiple memslots. Failed to do this may result undefined
+ behavior.

When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
the memory region are automatically reflected into the guest. For example, an
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index a8e379a3afee..690cb21010e7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,8 @@ config KVM
select INTERVAL_TREE
select HAVE_KVM_PM_NOTIFIER if PM
select HAVE_KVM_MEMORY_ATTRIBUTES
+ select HAVE_KVM_RESTRICTED_MEM if X86_64
+ select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7f850dfb4086..9a07380f8d3c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
}

for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- struct kvm_userspace_memory_region m;
+ struct kvm_user_mem_region m;

m.slot = id | (i << 16);
m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a784e2b06625..02347e386ea2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@

#include <asm/kvm_host.h>
#include <linux/kvm_dirty_ring.h>
+#include <linux/restrictedmem.h>

#ifndef KVM_MAX_VCPU_IDS
#define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -585,6 +586,9 @@ struct kvm_memory_slot {
u32 flags;
short id;
u16 as_id;
+ struct file *restricted_file;
+ loff_t restricted_offset;
+ struct restrictedmem_notifier notifier;
};

static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
@@ -1123,9 +1127,9 @@ enum kvm_mr_change {
};

int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_user_mem_region *mem);
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_user_mem_region *mem);
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5d0941acb5bb..13bff963b8b0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
};

+struct kvm_userspace_memory_region_ext {
+ struct kvm_userspace_memory_region region;
+ __u64 restricted_offset;
+ __u32 restricted_fd;
+ __u32 pad1;
+ __u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/*
+ * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
+ * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
+ * all fields from the top-level "extended" region.
+ */
+struct kvm_user_mem_region {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size;
+ __u64 userspace_addr;
+ __u64 restricted_offset;
+ __u32 restricted_fd;
+ __u32 pad1;
+ __u64 pad2[14];
+};
+#endif
+
/*
* The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
* other bits are reserved for kvm internal use which are defined in
@@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
*/
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+#define KVM_MEM_PRIVATE (1UL << 2)

/* for KVM_IRQ_LINE */
struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index effdea5dd4f0..d605545d6dd1 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK

config HAVE_KVM_PM_NOTIFIER
bool
+
+config HAVE_KVM_RESTRICTED_MEM
+ bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7f0f5e9f2406..b882eb2c76a2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}

-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

@@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
* Must be called holding kvm->slots_lock for write.
*/
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_user_mem_region *mem)
{
struct kvm_memory_slot *old, *new;
struct kvm_memslots *slots;
@@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);

int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_user_mem_region *mem)
{
int r;

@@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(kvm_set_memory_region);

static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
- struct kvm_userspace_memory_region *mem)
+ struct kvm_user_mem_region *mem)
{
if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
return -EINVAL;
@@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
return fd;
}

+#define SANITY_CHECK_MEM_REGION_FIELD(field) \
+do { \
+ BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
+ offsetof(struct kvm_userspace_memory_region, field)); \
+ BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
+ sizeof_field(struct kvm_userspace_memory_region, field)); \
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
+do { \
+ BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
+ offsetof(struct kvm_userspace_memory_region_ext, field)); \
+ BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
+ sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+ SANITY_CHECK_MEM_REGION_FIELD(slot);
+ SANITY_CHECK_MEM_REGION_FIELD(flags);
+ SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+ SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+ SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+ SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
+ SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
+}
+
static long kvm_vm_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
case KVM_SET_USER_MEMORY_REGION: {
- struct kvm_userspace_memory_region kvm_userspace_mem;
+ struct kvm_user_mem_region mem;
+ unsigned long size = sizeof(struct kvm_userspace_memory_region);
+
+ kvm_sanity_check_user_mem_region_alias();

r = -EFAULT;
- if (copy_from_user(&kvm_userspace_mem, argp,
- sizeof(kvm_userspace_mem)))
+ if (copy_from_user(&mem, argp, size))
+ goto out;
+
+ r = -EINVAL;
+ if (mem.flags & KVM_MEM_PRIVATE)
goto out;

- r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+ r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
break;
}
case KVM_GET_DIRTY_LOG: {
--
2.25.1

2022-12-02 07:38:46

by Chao Peng

Subject: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

Unmap the existing guest mappings when the memory attribute is changed
between shared and private. This is needed because shared pages and
private pages come from different backends; unmapping the existing ones
gives the page fault handler a chance to re-populate the mappings
according to the new attribute.

Only architectures with private memory support need this, and a
supporting architecture is expected to override the weak
kvm_arch_has_private_mem().

Also, during the window between the memory attribute change and the
unmapping, a page fault may happen in the same memory range and could
cause incorrect page state. Invoke the kvm_mmu_invalidate_* helpers so
the page fault handler retries during this time frame.

Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 7 +-
virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
2 files changed, 116 insertions(+), 59 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d69484d2704..3331c0c92838 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif

-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -264,6 +263,8 @@ struct kvm_gfn_range {
bool may_block;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+
+#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
@@ -785,11 +786,12 @@ struct kvm {

#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
struct mmu_notifier mmu_notifier;
+#endif
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
gfn_t mmu_invalidate_range_start;
gfn_t mmu_invalidate_range_end;
-#endif
+
struct list_head devices;
u64 manual_dirty_log_protect;
struct dentry *debugfs_dentry;
@@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
int kvm_arch_post_init_vm(struct kvm *kvm);
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_has_private_mem(struct kvm *kvm);

#ifndef __KVM_HAVE_ARCH_VM_ALLOC
/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ad55dfbc75d7..4e1e1e113bf0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
}
EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);

+void kvm_mmu_invalidate_begin(struct kvm *kvm)
+{
+ /*
+ * The count increase must become visible at unlock time as no
+ * spte can be established without taking the mmu_lock and
+ * count is also read inside the mmu_lock critical section.
+ */
+ kvm->mmu_invalidate_in_progress++;
+
+ if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+ kvm->mmu_invalidate_range_start = INVALID_GPA;
+ kvm->mmu_invalidate_range_end = INVALID_GPA;
+ }
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
+ if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+ kvm->mmu_invalidate_range_start = start;
+ kvm->mmu_invalidate_range_end = end;
+ } else {
+ /*
+ * Fully tracking multiple concurrent ranges has diminishing
+ * returns. Keep things simple and just find the minimal range
+ * which includes the current and new ranges. As there won't be
+ * enough information to subtract a range after its invalidate
+ * completes, any ranges invalidated concurrently will
+ * accumulate and persist until all outstanding invalidates
+ * complete.
+ */
+ kvm->mmu_invalidate_range_start =
+ min(kvm->mmu_invalidate_range_start, start);
+ kvm->mmu_invalidate_range_end =
+ max(kvm->mmu_invalidate_range_end, end);
+ }
+}
+
+void kvm_mmu_invalidate_end(struct kvm *kvm)
+{
+ /*
+ * This sequence increase will notify the kvm page fault that
+ * the page that is going to be mapped in the spte could have
+ * been freed.
+ */
+ kvm->mmu_invalidate_seq++;
+ smp_wmb();
+ /*
+ * The above sequence increase must be visible before the
+ * below count decrease, which is ensured by the smp_wmb above
+ * in conjunction with the smp_rmb in mmu_invalidate_retry().
+ */
+ kvm->mmu_invalidate_in_progress--;
+}
+
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
{
@@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
}

-void kvm_mmu_invalidate_begin(struct kvm *kvm)
-{
- /*
- * The count increase must become visible at unlock time as no
- * spte can be established without taking the mmu_lock and
- * count is also read inside the mmu_lock critical section.
- */
- kvm->mmu_invalidate_in_progress++;
-
- if (likely(kvm->mmu_invalidate_in_progress == 1)) {
- kvm->mmu_invalidate_range_start = INVALID_GPA;
- kvm->mmu_invalidate_range_end = INVALID_GPA;
- }
-}
-
-void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
-{
- WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
-
- if (likely(kvm->mmu_invalidate_in_progress == 1)) {
- kvm->mmu_invalidate_range_start = start;
- kvm->mmu_invalidate_range_end = end;
- } else {
- /*
- * Fully tracking multiple concurrent ranges has diminishing
- * returns. Keep things simple and just find the minimal range
- * which includes the current and new ranges. As there won't be
- * enough information to subtract a range after its invalidate
- * completes, any ranges invalidated concurrently will
- * accumulate and persist until all outstanding invalidates
- * complete.
- */
- kvm->mmu_invalidate_range_start =
- min(kvm->mmu_invalidate_range_start, start);
- kvm->mmu_invalidate_range_end =
- max(kvm->mmu_invalidate_range_end, end);
- }
-}
-
static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
@@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
return 0;
}

-void kvm_mmu_invalidate_end(struct kvm *kvm)
-{
- /*
- * This sequence increase will notify the kvm page fault that
- * the page that is going to be mapped in the spte could have
- * been freed.
- */
- kvm->mmu_invalidate_seq++;
- smp_wmb();
- /*
- * The above sequence increase must be visible before the
- * below count decrease, which is ensured by the smp_wmb above
- * in conjunction with the smp_rmb in mmu_invalidate_retry().
- */
- kvm->mmu_invalidate_in_progress--;
-}
-
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
@@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
return 0;
}

+bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
+{
+ return false;
+}
+
static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
{
struct kvm *kvm = kvm_arch_alloc_vm();
@@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
return 0;
}

+static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_gfn_range gfn_range;
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ int i;
+ int r = 0;
+
+ gfn_range.pte = __pte(0);
+ gfn_range.may_block = true;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+ slot = iter.slot;
+ gfn_range.start = max(start, slot->base_gfn);
+ gfn_range.end = min(end, slot->base_gfn + slot->npages);
+ if (gfn_range.start >= gfn_range.end)
+ continue;
+ gfn_range.slot = slot;
+
+ r |= kvm_unmap_gfn_range(kvm, &gfn_range);
+ }
+ }
+
+ if (r)
+ kvm_flush_remote_tlbs(kvm);
+}
+
static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
struct kvm_memory_attributes *attrs)
{
gfn_t start, end;
unsigned long i;
void *entry;
+ int idx;
u64 supported_attrs = kvm_supported_mem_attributes(kvm);

- /* flags is currently not used. */
+ /* 'flags' is currently not used. */
if (attrs->flags)
return -EINVAL;
if (attrs->attributes & ~supported_attrs)
@@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,

entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;

+ if (kvm_arch_has_private_mem(kvm)) {
+ KVM_MMU_LOCK(kvm);
+ kvm_mmu_invalidate_begin(kvm);
+ kvm_mmu_invalidate_range_add(kvm, start, end);
+ KVM_MMU_UNLOCK(kvm);
+ }
+
mutex_lock(&kvm->lock);
for (i = start; i < end; i++)
if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
@@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
break;
mutex_unlock(&kvm->lock);

+ if (kvm_arch_has_private_mem(kvm)) {
+ idx = srcu_read_lock(&kvm->srcu);
+ KVM_MMU_LOCK(kvm);
+ if (i > start)
+ kvm_unmap_mem_range(kvm, start, i);
+ kvm_mmu_invalidate_end(kvm);
+ KVM_MMU_UNLOCK(kvm);
+ srcu_read_unlock(&kvm->srcu, idx);
+ }
+
attrs->address = i << PAGE_SHIFT;
attrs->size = (end - i) << PAGE_SHIFT;

--
2.25.1

2022-12-05 09:49:33

by Fuad Tabba

Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

Hi Chao,

On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <[email protected]> wrote:
>
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
>
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
>
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through restricted_fd
> and shared memory through userspace_addr. Whether the private or shared
> part is visible to guest is maintained by other KVM code.
>
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM the memory change,
> KVM then can invalidate its page table entries or handle memory errors.
>
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
>
> To make future maintenance easy, internally use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.
>
> Co-developed-by: Yu Zhang <[email protected]>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> Reviewed-by: Fuad Tabba <[email protected]>
> Tested-by: Fuad Tabba <[email protected]>

V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
patch series anymore. Any reason you removed it, or is it just an
omission?

[*] https://lore.kernel.org/linux-mm/[email protected]/

Thanks,
/fuad

> return -EINVAL;
> @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> return fd;
> }
>
> +#define SANITY_CHECK_MEM_REGION_FIELD(field) \
> +do { \
> + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> + offsetof(struct kvm_userspace_memory_region, field)); \
> + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> + sizeof_field(struct kvm_userspace_memory_region, field)); \
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
> +do { \
> + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> + offsetof(struct kvm_userspace_memory_region_ext, field)); \
> + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> + SANITY_CHECK_MEM_REGION_FIELD(slot);
> + SANITY_CHECK_MEM_REGION_FIELD(flags);
> + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> + SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> +}
> +
> static long kvm_vm_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> break;
> }
> case KVM_SET_USER_MEMORY_REGION: {
> - struct kvm_userspace_memory_region kvm_userspace_mem;
> + struct kvm_user_mem_region mem;
> + unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +
> + kvm_sanity_check_user_mem_region_alias();
>
> r = -EFAULT;
> - if (copy_from_user(&kvm_userspace_mem, argp,
> - sizeof(kvm_userspace_mem)))
> + if (copy_from_user(&mem, argp, size))
> + goto out;
> +
> + r = -EINVAL;
> + if (mem.flags & KVM_MEM_PRIVATE)
> goto out;
>
> - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> + r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> break;
> }
> case KVM_GET_DIRTY_LOG: {
> --
> 2.25.1
>
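For illustration, the extended layout documented above is intended to be
consumed from userspace roughly as in the sketch below. This is not part of
the patch: it assumes a linux/kvm.h carrying the uapi additions from this
series, an fd obtained from memfd_restricted(), and placeholder values for
the slot id and guest physical address; KVM only accepts KVM_MEM_PRIVATE once
later patches in the series wire it up.

  /* Hypothetical helper, not from the series: register one memslot whose
   * private half is backed by restricted_fd and whose shared half is the
   * ordinary host mapping at shared_hva.
   */
  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>   /* assumed to include the _ext struct and flag */

  static int set_private_slot(int vm_fd, int restricted_fd,
                              uint64_t gpa, uint64_t size, void *shared_hva)
  {
          struct kvm_userspace_memory_region_ext ext;

          memset(&ext, 0, sizeof(ext));
          ext.region.slot            = 0;                 /* placeholder slot id */
          ext.region.flags           = KVM_MEM_PRIVATE;
          ext.region.guest_phys_addr = gpa;
          ext.region.memory_size     = size;
          ext.region.userspace_addr  = (uintptr_t)shared_hva; /* shared part */
          ext.restricted_fd          = restricted_fd;         /* private part */
          ext.restricted_offset      = 0;

          /* With KVM_MEM_PRIVATE set, KVM copies the larger _ext layout. */
          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
  }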

2022-12-05 09:53:21

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

Hi Chao,

On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <[email protected]> wrote:
>
> Currently in the mmu_notifier invalidate path, the hva range is recorded
> and then checked against by mmu_notifier_retry_hva() in the page fault
> handling path. However, for the to-be-introduced private memory, a page
> fault may not have an associated hva; checking the gfn (gpa) makes more
> sense.
>
> For existing hva-based shared memory, gfn is expected to also work. The
> only downside is that when aliasing multiple gfns to a single hva, the
> current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected to be small.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 8 +++++---
> include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
> virt/kvm/kvm_main.c | 32 +++++++++++++++++++++++---------
> 3 files changed, 49 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4736d7849c60..e2c70b5afa3e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> return true;
>
> return fault->slot &&
> - mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> + mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> }
>
> static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
> write_lock(&kvm->mmu_lock);
>
> - kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> + kvm_mmu_invalidate_begin(kvm);
> +
> + kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
>
> flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>
> @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> gfn_end - gfn_start);
>
> - kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> + kvm_mmu_invalidate_end(kvm);
>
> write_unlock(&kvm->mmu_lock);
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 02347e386ea2..3d69484d2704 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -787,8 +787,8 @@ struct kvm {
> struct mmu_notifier mmu_notifier;
> unsigned long mmu_invalidate_seq;
> long mmu_invalidate_in_progress;
> - unsigned long mmu_invalidate_range_start;
> - unsigned long mmu_invalidate_range_end;
> + gfn_t mmu_invalidate_range_start;
> + gfn_t mmu_invalidate_range_end;
> #endif
> struct list_head devices;
> u64 manual_dirty_log_protect;
> @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> #endif
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> - unsigned long end);
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> - unsigned long end);
> +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_invalidate_end(struct kvm *kvm);
>
> long kvm_arch_dev_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg);
> @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> return 0;
> }
>
> -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> unsigned long mmu_seq,
> - unsigned long hva)
> + gfn_t gfn)
> {
> lockdep_assert_held(&kvm->mmu_lock);
> /*
> @@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> * that might be being invalidated. Note that it may include some false

nit: "might be" (or) "is being"

> * positives, due to shortcuts when handing concurrent invalidations.

nit: handling

> */
> - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> - hva >= kvm->mmu_invalidate_range_start &&
> - hva < kvm->mmu_invalidate_range_end)
> - return 1;
> + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> + /*
> + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> + * but before updating the range is a KVM bug.
> + */
> + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> + kvm->mmu_invalidate_range_end == INVALID_GPA))

INVALID_GPA is an x86-specific define in
arch/x86/include/asm/kvm_host.h, so this doesn't build on other
architectures. The obvious fix is to move it to
include/linux/kvm_host.h.

Cheers,
/fuad

> + return 1;
> +
> + if (gfn >= kvm->mmu_invalidate_range_start &&
> + gfn < kvm->mmu_invalidate_range_end)
> + return 1;
> + }
> +
> if (kvm->mmu_invalidate_seq != mmu_seq)
> return 1;
> return 0;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b882eb2c76a2..ad55dfbc75d7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
>
> typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> - unsigned long end);
> -
> +typedef void (*on_lock_fn_t)(struct kvm *kvm);
> typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>
> struct kvm_hva_range {
> @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> locked = true;
> KVM_MMU_LOCK(kvm);
> if (!IS_KVM_NULL_FN(range->on_lock))
> - range->on_lock(kvm, range->start, range->end);
> + range->on_lock(kvm);
> +
> if (IS_KVM_NULL_FN(range->handler))
> break;
> }
> @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> - unsigned long end)
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> {
> /*
> * The count increase must become visible at unlock time as no
> @@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> * count is also read inside the mmu_lock critical section.
> */
> kvm->mmu_invalidate_in_progress++;
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = INVALID_GPA;
> + kvm->mmu_invalidate_range_end = INVALID_GPA;
> + }
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> kvm->mmu_invalidate_range_start = start;
> kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> }
> }
>
> +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> + kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> + return kvm_unmap_gfn_range(kvm, range);
> +}
> +
> static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> const struct mmu_notifier_range *range)
> {
> @@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> .start = range->start,
> .end = range->end,
> .pte = __pte(0),
> - .handler = kvm_unmap_gfn_range,
> + .handler = kvm_mmu_unmap_gfn_range,
> .on_lock = kvm_mmu_invalidate_begin,
> .on_unlock = kvm_arch_guest_memory_reclaimed,
> .flush_on_ret = true,
> @@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> return 0;
> }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> - unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
> {
> /*
> * This sequence increase will notify the kvm page fault that
> --
> 2.25.1
>
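As an aside, the protocol the changelog above relies on (bump an in-progress
count, record the gfn range being invalidated, and have the page-fault path
retry when its gfn overlaps that range or when the sequence count has moved)
can be modelled with a few lines of self-contained C. This is a simplified
illustration of the idea only, not the kernel code; locking, memory barriers
and the WARN on a missing range are omitted:

  /* Self-contained model of the begin/range_add/end + retry_gfn protocol. */
  #include <assert.h>
  #include <stdbool.h>

  typedef unsigned long long gfn_t;
  #define NO_RANGE ((gfn_t)-1)

  struct inv_state {
          unsigned long seq;
          long in_progress;
          gfn_t start, end;       /* [start, end): union of active ranges */
  };

  static void invalidate_begin(struct inv_state *s)
  {
          if (s->in_progress++ == 0)
                  s->start = s->end = NO_RANGE;
  }

  static void invalidate_range_add(struct inv_state *s, gfn_t start, gfn_t end)
  {
          if (s->start == NO_RANGE) {
                  s->start = start;
                  s->end = end;
          } else {                /* grow the union, like the kernel helper */
                  if (start < s->start)
                          s->start = start;
                  if (end > s->end)
                          s->end = end;
          }
  }

  static void invalidate_end(struct inv_state *s)
  {
          s->seq++;               /* forces retries that raced with us */
          s->in_progress--;
  }

  /* Page-fault side: snapshot seq before faulting, then check the window. */
  static bool retry_gfn(struct inv_state *s, unsigned long snap_seq, gfn_t gfn)
  {
          if (s->in_progress && s->start != NO_RANGE &&
              gfn >= s->start && gfn < s->end)
                  return true;
          return s->seq != snap_seq;
  }

  int main(void)
  {
          struct inv_state s = { 0 };
          unsigned long snap = s.seq;

          invalidate_begin(&s);
          invalidate_range_add(&s, 0x100, 0x200);
          assert(retry_gfn(&s, snap, 0x180));   /* inside the window: retry */
          assert(!retry_gfn(&s, snap, 0x300));  /* outside: no retry yet */
          invalidate_end(&s);
          assert(retry_gfn(&s, snap, 0x300));   /* seq moved: retry */
          return 0;
  }

The real code keeps these updates under mmu_lock and folds overlapping
notifier ranges in the same way; the model only shows why a fault that raced
with an invalidation is forced to retry.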

2022-12-06 12:42:24

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

On Mon, Dec 05, 2022 at 09:23:49AM +0000, Fuad Tabba wrote:
> Hi Chao,
>
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <[email protected]> wrote:
> >
> > Currently in the mmu_notifier invalidate path, the hva range is recorded
> > and then checked against by mmu_notifier_retry_hva() in the page fault
> > handling path. However, for the to-be-introduced private memory, a page
> > fault may not have an associated hva; checking the gfn (gpa) makes more
> > sense.
> >
> > For existing hva-based shared memory, gfn is expected to also work. The
> > only downside is that when aliasing multiple gfns to a single hva, the
> > current algorithm of checking multiple ranges could result in a much
> > larger range being rejected. Such aliasing should be uncommon, so the
> > impact is expected to be small.
> >
> > Suggested-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 8 +++++---
> > include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
> > virt/kvm/kvm_main.c | 32 +++++++++++++++++++++++---------
> > 3 files changed, 49 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4736d7849c60..e2c70b5afa3e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> > return true;
> >
> > return fault->slot &&
> > - mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > + mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> > }
> >
> > static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >
> > write_lock(&kvm->mmu_lock);
> >
> > - kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> > + kvm_mmu_invalidate_begin(kvm);
> > +
> > + kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> >
> > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >
> > @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> > gfn_end - gfn_start);
> >
> > - kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> > + kvm_mmu_invalidate_end(kvm);
> >
> > write_unlock(&kvm->mmu_lock);
> > }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 02347e386ea2..3d69484d2704 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -787,8 +787,8 @@ struct kvm {
> > struct mmu_notifier mmu_notifier;
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > - unsigned long mmu_invalidate_range_start;
> > - unsigned long mmu_invalidate_range_end;
> > + gfn_t mmu_invalidate_range_start;
> > + gfn_t mmu_invalidate_range_end;
> > #endif
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > #endif
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > - unsigned long end);
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > - unsigned long end);
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> > +void kvm_mmu_invalidate_end(struct kvm *kvm);
> >
> > long kvm_arch_dev_ioctl(struct file *filp,
> > unsigned int ioctl, unsigned long arg);
> > @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> > return 0;
> > }
> >
> > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> > unsigned long mmu_seq,
> > - unsigned long hva)
> > + gfn_t gfn)
> > {
> > lockdep_assert_held(&kvm->mmu_lock);
> > /*
> > @@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > * that might be being invalidated. Note that it may include some false
>
> nit: "might be" (or) "is being"
>
> > * positives, due to shortcuts when handing concurrent invalidations.
>
> nit: handling

Both are in existing code, but I can fix them either way.

>
> > */
> > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > - hva >= kvm->mmu_invalidate_range_start &&
> > - hva < kvm->mmu_invalidate_range_end)
> > - return 1;
> > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > + /*
> > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > + * but before updating the range is a KVM bug.
> > + */
> > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > + kvm->mmu_invalidate_range_end == INVALID_GPA))
>
> INVALID_GPA is an x86-specific define in
> arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> architectures. The obvious fix is to move it to
> include/linux/kvm_host.h.

Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident this is
the correct choice for other architectures, but after searching it has not
been used by other architectures, so it should be safe to make it common.

Thanks,
Chao
>
> Cheers,
> /fuad
>
> > + return 1;
> > +
> > + if (gfn >= kvm->mmu_invalidate_range_start &&
> > + gfn < kvm->mmu_invalidate_range_end)
> > + return 1;
> > + }
> > +
> > if (kvm->mmu_invalidate_seq != mmu_seq)
> > return 1;
> > return 0;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index b882eb2c76a2..ad55dfbc75d7 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> >
> > typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> >
> > -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> > - unsigned long end);
> > -
> > +typedef void (*on_lock_fn_t)(struct kvm *kvm);
> > typedef void (*on_unlock_fn_t)(struct kvm *kvm);
> >
> > struct kvm_hva_range {
> > @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> > locked = true;
> > KVM_MMU_LOCK(kvm);
> > if (!IS_KVM_NULL_FN(range->on_lock))
> > - range->on_lock(kvm, range->start, range->end);
> > + range->on_lock(kvm);
> > +
> > if (IS_KVM_NULL_FN(range->handler))
> > break;
> > }
> > @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > - unsigned long end)
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > {
> > /*
> > * The count increase must become visible at unlock time as no
> > @@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > * count is also read inside the mmu_lock critical section.
> > */
> > kvm->mmu_invalidate_in_progress++;
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> > if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > kvm->mmu_invalidate_range_start = start;
> > kvm->mmu_invalidate_range_end = end;
> > @@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > }
> > }
> >
> > +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > + kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > + return kvm_unmap_gfn_range(kvm, range);
> > +}
> > +
> > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > const struct mmu_notifier_range *range)
> > {
> > @@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > .start = range->start,
> > .end = range->end,
> > .pte = __pte(0),
> > - .handler = kvm_unmap_gfn_range,
> > + .handler = kvm_mmu_unmap_gfn_range,
> > .on_lock = kvm_mmu_invalidate_begin,
> > .on_unlock = kvm_arch_guest_memory_reclaimed,
> > .flush_on_ret = true,
> > @@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > return 0;
> > }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > - unsigned long end)
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > {
> > /*
> > * This sequence increase will notify the kvm page fault that
> > --
> > 2.25.1
> >

2022-12-06 12:47:09

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

Hi Chao,

On Tue, Dec 6, 2022 at 11:58 AM Chao Peng <[email protected]> wrote:
>
> On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> > Hi Chao,
> >
> > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <[email protected]> wrote:
> > >
> > > In memory encryption usage, guest memory may be encrypted with a special
> > > key and can be accessed only by the guest itself. We call such memory
> > > private memory. It is of little value, and can sometimes cause problems,
> > > to allow userspace to access guest private memory. This new KVM memslot
> > > extension allows guest private memory to be provided through a
> > > restrictedmem-backed file descriptor (fd), with userspace restricted
> > > from accessing the bookmarked memory in the fd.
> > >
> > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > additional KVM memslot fields, restricted_fd/restricted_offset, to allow
> > > userspace to instruct KVM to provide guest memory through restricted_fd.
> > > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > > and the size is 'memory_size'.
> > >
> > > The extended memslot can still have the userspace_addr (hva). When in
> > > use, a single memslot can maintain both private memory through
> > > restricted_fd and shared memory through userspace_addr. Whether the
> > > private or shared part is visible to the guest is maintained by other
> > > KVM code.
> > >
> > > A restrictedmem_notifier field is also added to the memslot structure to
> > > allow the restricted_fd's backing store to notify KVM of memory changes,
> > > so KVM can then invalidate its page table entries or handle memory
> > > errors.
> > >
> > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added,
> > > and right now it is selected on X86_64 only.
> > >
> > > To make future maintenance easy, internally use a binary-compatible
> > > alias struct kvm_user_mem_region to handle both the normal and the
> > > '_ext' variants.
> > >
> > > Co-developed-by: Yu Zhang <[email protected]>
> > > Signed-off-by: Yu Zhang <[email protected]>
> > > Signed-off-by: Chao Peng <[email protected]>
> > > Reviewed-by: Fuad Tabba <[email protected]>
> > > Tested-by: Fuad Tabba <[email protected]>
> >
> > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > patch series anymore. Any reason you removed it, or is it just an
> > omission?
>
> We had some discussion in v9 [1] about adding generic memory attributes
> ioctls, and KVM_CAP_PRIVATE_MEM can be implemented as a new
> KVM_MEMORY_ATTRIBUTE_PRIVATE flag via the KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
> ioctl [2]. The API doc has been updated:
>
> +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2]
> https://lore.kernel.org/linux-mm/[email protected]/

I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
and my Reviewed/Tested-by still apply.

Cheers,
/fuad

>
> Thanks,
> Chao
> >
> > [*] https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Thanks,
> > /fuad
> >
> > > ---
> > > Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > > arch/x86/kvm/Kconfig | 2 ++
> > > arch/x86/kvm/x86.c | 2 +-
> > > include/linux/kvm_host.h | 8 ++++--
> > > include/uapi/linux/kvm.h | 28 +++++++++++++++++++
> > > virt/kvm/Kconfig | 3 +++
> > > virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------
> > > 7 files changed, 114 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index bb2f709c0900..99352170c130 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > > :Capability: KVM_CAP_USER_MEMORY
> > > :Architectures: all
> > > :Type: vm ioctl
> > > -:Parameters: struct kvm_userspace_memory_region (in)
> > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > > :Returns: 0 on success, -1 on error
> > >
> > > ::
> > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > };
> > >
> > > + struct kvm_userspace_memory_region_ext {
> > > + struct kvm_userspace_memory_region region;
> > > + __u64 restricted_offset;
> > > + __u32 restricted_fd;
> > > + __u32 pad1;
> > > + __u64 pad2[14];
> > > + };
> > > +
> > > /* for kvm_memory_region::flags */
> > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > > #define KVM_MEM_READONLY (1UL << 1)
> > > + #define KVM_MEM_PRIVATE (1UL << 2)
> > >
> > > This ioctl allows the user to create, modify or delete a guest physical
> > > memory slot. Bits 0-15 of "slot" specify the slot id and this value
> > > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > > be identical. This allows large pages in the guest to be backed by large
> > > pages in the host.
> > >
> > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> > > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> > > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > > -to make a new slot read-only. In this case, writes to this memory will be
> > > -posted to userspace as KVM_EXIT_MMIO exits.
> > > +The kvm_userspace_memory_region_ext struct includes all fields of the
> > > +kvm_userspace_memory_region struct and adds additional fields for some other
> > > +features. See the description of the flags field below for more information.
> > > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > > +
> > > +The flags field supports the following flags:
> > > +
> > > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > > + within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > > +
> > > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > > + read-only. In this case, writes to this memory will be posted to userspace as
> > > + KVM_EXIT_MMIO exits.
> > > +
> > > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > > + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate that a new slot has
> > > + private memory backed by a file descriptor (fd) and that userspace access to
> > > + the fd may be restricted. Userspace should use restricted_fd/restricted_offset
> > > + in the kvm_userspace_memory_region_ext to instruct KVM to provide private
> > > + memory to the guest. Userspace should guarantee not to map the same host
> > > + physical address indicated by restricted_fd/restricted_offset to different
> > > + guest physical addresses within multiple memslots. Failing to do so may
> > > + result in undefined behavior.
> > >
> > > When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > > the memory region are automatically reflected into the guest. For example, an
> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > index a8e379a3afee..690cb21010e7 100644
> > > --- a/arch/x86/kvm/Kconfig
> > > +++ b/arch/x86/kvm/Kconfig
> > > @@ -50,6 +50,8 @@ config KVM
> > > select INTERVAL_TREE
> > > select HAVE_KVM_PM_NOTIFIER if PM
> > > select HAVE_KVM_MEMORY_ATTRIBUTES
> > > + select HAVE_KVM_RESTRICTED_MEM if X86_64
> > > + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > > help
> > > Support hosting fully virtualized guest machines using hardware
> > > virtualization extensions. You will need a fairly recent
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 7f850dfb4086..9a07380f8d3c 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > > }
> > >
> > > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > - struct kvm_userspace_memory_region m;
> > > + struct kvm_user_mem_region m;
> > >
> > > m.slot = id | (i << 16);
> > > m.flags = 0;
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a784e2b06625..02347e386ea2 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -44,6 +44,7 @@
> > >
> > > #include <asm/kvm_host.h>
> > > #include <linux/kvm_dirty_ring.h>
> > > +#include <linux/restrictedmem.h>
> > >
> > > #ifndef KVM_MAX_VCPU_IDS
> > > #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > > u32 flags;
> > > short id;
> > > u16 as_id;
> > > + struct file *restricted_file;
> > > + loff_t restricted_offset;
> > > + struct restrictedmem_notifier notifier;
> > > };
> > >
> > > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > > };
> > >
> > > int kvm_set_memory_region(struct kvm *kvm,
> > > - const struct kvm_userspace_memory_region *mem);
> > > + const struct kvm_user_mem_region *mem);
> > > int __kvm_set_memory_region(struct kvm *kvm,
> > > - const struct kvm_userspace_memory_region *mem);
> > > + const struct kvm_user_mem_region *mem);
> > > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > > int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index 5d0941acb5bb..13bff963b8b0 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > };
> > >
> > > +struct kvm_userspace_memory_region_ext {
> > > + struct kvm_userspace_memory_region region;
> > > + __u64 restricted_offset;
> > > + __u32 restricted_fd;
> > > + __u32 pad1;
> > > + __u64 pad2[14];
> > > +};
> > > +
> > > +#ifdef __KERNEL__
> > > +/*
> > > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > > + * all fields from the top-level "extended" region.
> > > + */
> > > +struct kvm_user_mem_region {
> > > + __u32 slot;
> > > + __u32 flags;
> > > + __u64 guest_phys_addr;
> > > + __u64 memory_size;
> > > + __u64 userspace_addr;
> > > + __u64 restricted_offset;
> > > + __u32 restricted_fd;
> > > + __u32 pad1;
> > > + __u64 pad2[14];
> > > +};
> > > +#endif
> > > +
> > > /*
> > > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > > * other bits are reserved for kvm internal use which are defined in
> > > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > > */
> > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > > #define KVM_MEM_READONLY (1UL << 1)
> > > +#define KVM_MEM_PRIVATE (1UL << 2)
> > >
> > > /* for KVM_IRQ_LINE */
> > > struct kvm_irq_level {
> > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > index effdea5dd4f0..d605545d6dd1 100644
> > > --- a/virt/kvm/Kconfig
> > > +++ b/virt/kvm/Kconfig
> > > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> > >
> > > config HAVE_KVM_PM_NOTIFIER
> > > bool
> > > +
> > > +config HAVE_KVM_RESTRICTED_MEM
> > > + bool
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 7f0f5e9f2406..b882eb2c76a2 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > > }
> > > }
> > >
> > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > > {
> > > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > >
> > > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > > * Must be called holding kvm->slots_lock for write.
> > > */
> > > int __kvm_set_memory_region(struct kvm *kvm,
> > > - const struct kvm_userspace_memory_region *mem)
> > > + const struct kvm_user_mem_region *mem)
> > > {
> > > struct kvm_memory_slot *old, *new;
> > > struct kvm_memslots *slots;
> > > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > >
> > > int kvm_set_memory_region(struct kvm *kvm,
> > > - const struct kvm_userspace_memory_region *mem)
> > > + const struct kvm_user_mem_region *mem)
> > > {
> > > int r;
> > >
> > > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > > EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> > >
> > > static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > > - struct kvm_userspace_memory_region *mem)
> > > + struct kvm_user_mem_region *mem)
> > > {
> > > if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > > return -EINVAL;
> > > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > > return fd;
> > > }
> > >
> > > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \
> > > +do { \
> > > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > > + offsetof(struct kvm_userspace_memory_region, field)); \
> > > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > > + sizeof_field(struct kvm_userspace_memory_region, field)); \
> > > +} while (0)
> > > +
> > > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
> > > +do { \
> > > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > > + offsetof(struct kvm_userspace_memory_region_ext, field)); \
> > > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
> > > +} while (0)
> > > +
> > > +static void kvm_sanity_check_user_mem_region_alias(void)
> > > +{
> > > + SANITY_CHECK_MEM_REGION_FIELD(slot);
> > > + SANITY_CHECK_MEM_REGION_FIELD(flags);
> > > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > > + SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > > +}
> > > +
> > > static long kvm_vm_ioctl(struct file *filp,
> > > unsigned int ioctl, unsigned long arg)
> > > {
> > > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > > break;
> > > }
> > > case KVM_SET_USER_MEMORY_REGION: {
> > > - struct kvm_userspace_memory_region kvm_userspace_mem;
> > > + struct kvm_user_mem_region mem;
> > > + unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > > +
> > > + kvm_sanity_check_user_mem_region_alias();
> > >
> > > r = -EFAULT;
> > > - if (copy_from_user(&kvm_userspace_mem, argp,
> > > - sizeof(kvm_userspace_mem)))
> > > + if (copy_from_user(&mem, argp, size))
> > > + goto out;
> > > +
> > > + r = -EINVAL;
> > > + if (mem.flags & KVM_MEM_PRIVATE)
> > > goto out;
> > >
> > > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > break;
> > > }
> > > case KVM_GET_DIRTY_LOG: {
> > > --
> > > 2.25.1
> > >
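For reference, the gpa-to-file-offset relationship described in the changelog
quoted above ('guest_phys_addr' is mapped at restricted_offset and the size is
'memory_size') is a simple linear translation. The helper below is illustrative
only and is not part of the series; it assumes the uapi _ext struct added by
the patch and that the gpa has already been validated to fall inside the slot:

  #include <linux/kvm.h>  /* assumed to provide kvm_userspace_memory_region_ext */

  /* Offset into restricted_fd that backs 'gpa' in a KVM_MEM_PRIVATE slot. */
  static inline __u64 gpa_to_restricted_offset(
                  const struct kvm_userspace_memory_region_ext *ext, __u64 gpa)
  {
          return ext->restricted_offset + (gpa - ext->region.guest_phys_addr);
  }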

2022-12-06 13:02:27

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> Hi Chao,
>
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <[email protected]> wrote:
> >
> > In memory encryption usage, guest memory may be encrypted with a special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It is of little value, and can sometimes cause problems,
> > to allow userspace to access guest private memory. This new KVM memslot
> > extension allows guest private memory to be provided through a
> > restrictedmem-backed file descriptor (fd), with userspace restricted
> > from accessing the bookmarked memory in the fd.
> >
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields, restricted_fd/restricted_offset, to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> >
> > The extended memslot can still have the userspace_addr (hva). When in
> > use, a single memslot can maintain both private memory through
> > restricted_fd and shared memory through userspace_addr. Whether the
> > private or shared part is visible to the guest is maintained by other
> > KVM code.
> >
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM of memory changes,
> > so KVM can then invalidate its page table entries or handle memory
> > errors.
> >
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added,
> > and right now it is selected on X86_64 only.
> >
> > To make future maintenance easy, internally use a binary-compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> >
> > Co-developed-by: Yu Zhang <[email protected]>
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > Reviewed-by: Fuad Tabba <[email protected]>
> > Tested-by: Fuad Tabba <[email protected]>
>
> V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> patch series anymore. Any reason you removed it, or is it just an
> omission?

We had some discussion in v9 [1] about adding generic memory attributes
ioctls, and KVM_CAP_PRIVATE_MEM can be implemented as a new
KVM_MEMORY_ATTRIBUTE_PRIVATE flag via the KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
ioctl [2]. The API doc has been updated:

+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+ KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2]
https://lore.kernel.org/linux-mm/[email protected]/
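In userspace terms, the capability check therefore becomes a query of the
supported attribute mask before KVM_MEM_PRIVATE is used. A rough sketch, with
the caveat that the ioctl name and the calling convention (the kernel filling
a __u64 bitmask) are taken from this series and should be treated as
assumptions:

  #include <sys/ioctl.h>
  #include <linux/kvm.h>   /* assumed to carry the memory-attributes uapi */

  /* Returns non-zero if KVM_MEM_PRIVATE memslots can be used on this VM. */
  static int vm_supports_private_mem(int vm_fd)
  {
          __u64 attrs = 0;

          if (ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &attrs))
                  return 0;

          return !!(attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE);
  }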

Thanks,
Chao
>
> [*] https://lore.kernel.org/linux-mm/[email protected]/
>
> Thanks,
> /fuad
>
> > ---
> > Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > arch/x86/kvm/Kconfig | 2 ++
> > arch/x86/kvm/x86.c | 2 +-
> > include/linux/kvm_host.h | 8 ++++--
> > include/uapi/linux/kvm.h | 28 +++++++++++++++++++
> > virt/kvm/Kconfig | 3 +++
> > virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------
> > 7 files changed, 114 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index bb2f709c0900..99352170c130 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > :Capability: KVM_CAP_USER_MEMORY
> > :Architectures: all
> > :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > :Returns: 0 on success, -1 on error
> >
> > ::
> > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> > };
> >
> > + struct kvm_userspace_memory_region_ext {
> > + struct kvm_userspace_memory_region region;
> > + __u64 restricted_offset;
> > + __u32 restricted_fd;
> > + __u32 pad1;
> > + __u64 pad2[14];
> > + };
> > +
> > /* for kvm_memory_region::flags */
> > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > #define KVM_MEM_READONLY (1UL << 1)
> > + #define KVM_MEM_PRIVATE (1UL << 2)
> >
> > This ioctl allows the user to create, modify or delete a guest physical
> > memory slot. Bits 0-15 of "slot" specify the slot id and this value
> > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > be identical. This allows large pages in the guest to be backed by large
> > pages in the host.
> >
> > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > -to make a new slot read-only. In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +The kvm_userspace_memory_region_ext struct includes all fields of the
> > +kvm_userspace_memory_region struct and adds additional fields for some other
> > +features. See the description of the flags field below for more information.
> > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > +
> > +The flags field supports the following flags:
> > +
> > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > + within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > +
> > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > + read-only. In this case, writes to this memory will be posted to userspace as
> > + KVM_EXIT_MMIO exits.
> > +
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate that a new slot has
> > + private memory backed by a file descriptor (fd) and that userspace access to
> > + the fd may be restricted. Userspace should use restricted_fd/restricted_offset
> > + in the kvm_userspace_memory_region_ext to instruct KVM to provide private
> > + memory to the guest. Userspace should guarantee not to map the same host
> > + physical address indicated by restricted_fd/restricted_offset to different
> > + guest physical addresses within multiple memslots. Failing to do so may
> > + result in undefined behavior.
> >
> > When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > the memory region are automatically reflected into the guest. For example, an
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index a8e379a3afee..690cb21010e7 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -50,6 +50,8 @@ config KVM
> > select INTERVAL_TREE
> > select HAVE_KVM_PM_NOTIFIER if PM
> > select HAVE_KVM_MEMORY_ATTRIBUTES
> > + select HAVE_KVM_RESTRICTED_MEM if X86_64
> > + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > help
> > Support hosting fully virtualized guest machines using hardware
> > virtualization extensions. You will need a fairly recent
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 7f850dfb4086..9a07380f8d3c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > }
> >
> > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > - struct kvm_userspace_memory_region m;
> > + struct kvm_user_mem_region m;
> >
> > m.slot = id | (i << 16);
> > m.flags = 0;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a784e2b06625..02347e386ea2 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -44,6 +44,7 @@
> >
> > #include <asm/kvm_host.h>
> > #include <linux/kvm_dirty_ring.h>
> > +#include <linux/restrictedmem.h>
> >
> > #ifndef KVM_MAX_VCPU_IDS
> > #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > u32 flags;
> > short id;
> > u16 as_id;
> > + struct file *restricted_file;
> > + loff_t restricted_offset;
> > + struct restrictedmem_notifier notifier;
> > };
> >
> > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > };
> >
> > int kvm_set_memory_region(struct kvm *kvm,
> > - const struct kvm_userspace_memory_region *mem);
> > + const struct kvm_user_mem_region *mem);
> > int __kvm_set_memory_region(struct kvm *kvm,
> > - const struct kvm_userspace_memory_region *mem);
> > + const struct kvm_user_mem_region *mem);
> > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 5d0941acb5bb..13bff963b8b0 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> > };
> >
> > +struct kvm_userspace_memory_region_ext {
> > + struct kvm_userspace_memory_region region;
> > + __u64 restricted_offset;
> > + __u32 restricted_fd;
> > + __u32 pad1;
> > + __u64 pad2[14];
> > +};
> > +
> > +#ifdef __KERNEL__
> > +/*
> > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > + * all fields from the top-level "extended" region.
> > + */
> > +struct kvm_user_mem_region {
> > + __u32 slot;
> > + __u32 flags;
> > + __u64 guest_phys_addr;
> > + __u64 memory_size;
> > + __u64 userspace_addr;
> > + __u64 restricted_offset;
> > + __u32 restricted_fd;
> > + __u32 pad1;
> > + __u64 pad2[14];
> > +};
> > +#endif
> > +
> > /*
> > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > * other bits are reserved for kvm internal use which are defined in
> > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > */
> > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > #define KVM_MEM_READONLY (1UL << 1)
> > +#define KVM_MEM_PRIVATE (1UL << 2)
> >
> > /* for KVM_IRQ_LINE */
> > struct kvm_irq_level {
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index effdea5dd4f0..d605545d6dd1 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> >
> > config HAVE_KVM_PM_NOTIFIER
> > bool
> > +
> > +config HAVE_KVM_RESTRICTED_MEM
> > + bool
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 7f0f5e9f2406..b882eb2c76a2 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > }
> > }
> >
> > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > {
> > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> >
> > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > * Must be called holding kvm->slots_lock for write.
> > */
> > int __kvm_set_memory_region(struct kvm *kvm,
> > - const struct kvm_userspace_memory_region *mem)
> > + const struct kvm_user_mem_region *mem)
> > {
> > struct kvm_memory_slot *old, *new;
> > struct kvm_memslots *slots;
> > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> >
> > int kvm_set_memory_region(struct kvm *kvm,
> > - const struct kvm_userspace_memory_region *mem)
> > + const struct kvm_user_mem_region *mem)
> > {
> > int r;
> >
> > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> >
> > static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > - struct kvm_userspace_memory_region *mem)
> > + struct kvm_user_mem_region *mem)
> > {
> > if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > return -EINVAL;
> > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > return fd;
> > }
> >
> > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \
> > +do { \
> > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > + offsetof(struct kvm_userspace_memory_region, field)); \
> > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > + sizeof_field(struct kvm_userspace_memory_region, field)); \
> > +} while (0)
> > +
> > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
> > +do { \
> > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > + offsetof(struct kvm_userspace_memory_region_ext, field)); \
> > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
> > +} while (0)
> > +
> > +static void kvm_sanity_check_user_mem_region_alias(void)
> > +{
> > + SANITY_CHECK_MEM_REGION_FIELD(slot);
> > + SANITY_CHECK_MEM_REGION_FIELD(flags);
> > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > + SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > +}
> > +
> > static long kvm_vm_ioctl(struct file *filp,
> > unsigned int ioctl, unsigned long arg)
> > {
> > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > break;
> > }
> > case KVM_SET_USER_MEMORY_REGION: {
> > - struct kvm_userspace_memory_region kvm_userspace_mem;
> > + struct kvm_user_mem_region mem;
> > + unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > +
> > + kvm_sanity_check_user_mem_region_alias();
> >
> > r = -EFAULT;
> > - if (copy_from_user(&kvm_userspace_mem, argp,
> > - sizeof(kvm_userspace_mem)))
> > + if (copy_from_user(&mem, argp, size))
> > + goto out;
> > +
> > + r = -EINVAL;
> > + if (mem.flags & KVM_MEM_PRIVATE)
> > goto out;
> >
> > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > break;
> > }
> > case KVM_GET_DIRTY_LOG: {
> > --
> > 2.25.1
> >
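The 'binary compatible alias' approach described in the changelog, and
enforced above with BUILD_BUG_ON() in kvm_sanity_check_user_mem_region_alias(),
can also be shown as a standalone sketch. The struct names below are stand-ins
rather than the real uapi types, and _Static_assert plays the role that
BUILD_BUG_ON plays in the kernel:

  #include <stddef.h>
  #include <stdint.h>

  struct uapi_region {             /* stand-in for kvm_userspace_memory_region */
          uint32_t slot;
          uint32_t flags;
          uint64_t guest_phys_addr;
          uint64_t memory_size;
          uint64_t userspace_addr;
  };

  struct uapi_region_ext {         /* stand-in for the '_ext' variant */
          struct uapi_region region;
          uint64_t restricted_offset;
          uint32_t restricted_fd;
          uint32_t pad1;
          uint64_t pad2[14];
  };

  struct kernel_alias {            /* stand-in for kvm_user_mem_region */
          uint32_t slot;
          uint32_t flags;
          uint64_t guest_phys_addr;
          uint64_t memory_size;
          uint64_t userspace_addr;
          uint64_t restricted_offset;
          uint32_t restricted_fd;
          uint32_t pad1;
          uint64_t pad2[14];
  };

  /* Each aliased field must sit at the same offset with the same size. */
  #define CHECK_FIELD(a, b, f)                                            \
          _Static_assert(offsetof(a, f) == offsetof(b, f) &&              \
                         sizeof(((a *)0)->f) == sizeof(((b *)0)->f),      \
                         "layout mismatch: " #f)

  CHECK_FIELD(struct kernel_alias, struct uapi_region, slot);
  CHECK_FIELD(struct kernel_alias, struct uapi_region, flags);
  CHECK_FIELD(struct kernel_alias, struct uapi_region, guest_phys_addr);
  CHECK_FIELD(struct kernel_alias, struct uapi_region, memory_size);
  CHECK_FIELD(struct kernel_alias, struct uapi_region, userspace_addr);
  CHECK_FIELD(struct kernel_alias, struct uapi_region_ext, restricted_offset);
  CHECK_FIELD(struct kernel_alias, struct uapi_region_ext, restricted_fd);

  int main(void)
  {
          return 0;   /* the checks above are purely compile-time */
  }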

2022-12-06 16:04:07

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

Hi,

On Tue, Dec 6, 2022 at 12:01 PM Chao Peng <[email protected]> wrote:
>
> On Mon, Dec 05, 2022 at 09:23:49AM +0000, Fuad Tabba wrote:
> > Hi Chao,
> >
> > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <[email protected]> wrote:
> > >
> > > Currently in the mmu_notifier invalidate path, the hva range is recorded
> > > and then checked against by mmu_notifier_retry_hva() in the page fault
> > > handling path. However, for the to-be-introduced private memory, a page
> > > fault may not have an associated hva; checking the gfn (gpa) makes more
> > > sense.
> > >
> > > For existing hva-based shared memory, gfn is expected to also work. The
> > > only downside is that when aliasing multiple gfns to a single hva, the
> > > current algorithm of checking multiple ranges could result in a much
> > > larger range being rejected. Such aliasing should be uncommon, so the
> > > impact is expected to be small.
> > >
> > > Suggested-by: Sean Christopherson <[email protected]>
> > > Signed-off-by: Chao Peng <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 8 +++++---
> > > include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
> > > virt/kvm/kvm_main.c | 32 +++++++++++++++++++++++---------
> > > 3 files changed, 49 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 4736d7849c60..e2c70b5afa3e 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> > > return true;
> > >
> > > return fault->slot &&
> > > - mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > > + mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> > > }
> > >
> > > static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > >
> > > write_lock(&kvm->mmu_lock);
> > >
> > > - kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> > > + kvm_mmu_invalidate_begin(kvm);
> > > +
> > > + kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> > >
> > > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> > >
> > > @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > > kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> > > gfn_end - gfn_start);
> > >
> > > - kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> > > + kvm_mmu_invalidate_end(kvm);
> > >
> > > write_unlock(&kvm->mmu_lock);
> > > }
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 02347e386ea2..3d69484d2704 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -787,8 +787,8 @@ struct kvm {
> > > struct mmu_notifier mmu_notifier;
> > > unsigned long mmu_invalidate_seq;
> > > long mmu_invalidate_in_progress;
> > > - unsigned long mmu_invalidate_range_start;
> > > - unsigned long mmu_invalidate_range_end;
> > > + gfn_t mmu_invalidate_range_start;
> > > + gfn_t mmu_invalidate_range_end;
> > > #endif
> > > struct list_head devices;
> > > u64 manual_dirty_log_protect;
> > > @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > > #endif
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > - unsigned long end);
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > > - unsigned long end);
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm);
> > >
> > > long kvm_arch_dev_ioctl(struct file *filp,
> > > unsigned int ioctl, unsigned long arg);
> > > @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> > > return 0;
> > > }
> > >
> > > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> > > unsigned long mmu_seq,
> > > - unsigned long hva)
> > > + gfn_t gfn)
> > > {
> > > lockdep_assert_held(&kvm->mmu_lock);
> > > /*
> > > @@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > > * that might be being invalidated. Note that it may include some false
> >
> > nit: "might be" (or) "is being"
> >
> > > * positives, due to shortcuts when handing concurrent invalidations.
> >
> > nit: handling
>
> Both are in existing code, but I can fix them either way.

That was just a nit, please feel free to ignore it, especially if it
might cause headaches in the future with merges.
>
> >
> > > */
> > > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > - hva >= kvm->mmu_invalidate_range_start &&
> > > - hva < kvm->mmu_invalidate_range_end)
> > > - return 1;
> > > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > + /*
> > > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > + * but before updating the range is a KVM bug.
> > > + */
> > > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > + kvm->mmu_invalidate_range_end == INVALID_GPA))
> >
> > INVALID_GPA is an x86-specific define in
> > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > architectures. The obvious fix is to move it to
> > include/linux/kvm_host.h.
>
> Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident this
> is the correct choice for other architectures, but after a search it has
> not been used by any other architecture, so it should be safe to make it
> common.

With this fixed,

Reviewed-by: Fuad Tabba <[email protected]>
And, with the necessary work to port it to arm64 (tested on qemu/arm64):
Tested-by: Fuad Tabba <[email protected]>

Cheers,
/fuad


>
> Thanks,
> Chao
> >
> > Cheers,
> > /fuad
> >
> > > + return 1;
> > > +
> > > + if (gfn >= kvm->mmu_invalidate_range_start &&
> > > + gfn < kvm->mmu_invalidate_range_end)
> > > + return 1;
> > > + }
> > > +
> > > if (kvm->mmu_invalidate_seq != mmu_seq)
> > > return 1;
> > > return 0;
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index b882eb2c76a2..ad55dfbc75d7 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> > >
> > > typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> > >
> > > -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> > > - unsigned long end);
> > > -
> > > +typedef void (*on_lock_fn_t)(struct kvm *kvm);
> > > typedef void (*on_unlock_fn_t)(struct kvm *kvm);
> > >
> > > struct kvm_hva_range {
> > > @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> > > locked = true;
> > > KVM_MMU_LOCK(kvm);
> > > if (!IS_KVM_NULL_FN(range->on_lock))
> > > - range->on_lock(kvm, range->start, range->end);
> > > + range->on_lock(kvm);
> > > +
> > > if (IS_KVM_NULL_FN(range->handler))
> > > break;
> > > }
> > > @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > - unsigned long end)
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > {
> > > /*
> > > * The count increase must become visible at unlock time as no
> > > @@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > * count is also read inside the mmu_lock critical section.
> > > */
> > > kvm->mmu_invalidate_in_progress++;
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > > if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > kvm->mmu_invalidate_range_start = start;
> > > kvm->mmu_invalidate_range_end = end;
> > > @@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > }
> > > }
> > >
> > > +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > +{
> > > + kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > + return kvm_unmap_gfn_range(kvm, range);
> > > +}
> > > +
> > > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > const struct mmu_notifier_range *range)
> > > {
> > > @@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > .start = range->start,
> > > .end = range->end,
> > > .pte = __pte(0),
> > > - .handler = kvm_unmap_gfn_range,
> > > + .handler = kvm_mmu_unmap_gfn_range,
> > > .on_lock = kvm_mmu_invalidate_begin,
> > > .on_unlock = kvm_arch_guest_memory_reclaimed,
> > > .flush_on_ret = true,
> > > @@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > return 0;
> > > }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > > - unsigned long end)
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > {
> > > /*
> > > * This sequence increase will notify the kvm page fault that
> > > --
> > > 2.25.1
> > >

2022-12-07 07:35:50

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

On Tue, Dec 06, 2022 at 07:56:23PM +0800,
Chao Peng <[email protected]> wrote:

> > > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > - hva >= kvm->mmu_invalidate_range_start &&
> > > - hva < kvm->mmu_invalidate_range_end)
> > > - return 1;
> > > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > + /*
> > > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > + * but before updating the range is a KVM bug.
> > > + */
> > > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > + kvm->mmu_invalidate_range_end == INVALID_GPA))
> >
> > INVALID_GPA is an x86-specific define in
> > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > architectures. The obvious fix is to move it to
> > include/linux/kvm_host.h.
>
> Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident this
> is the correct choice for other architectures, but after a search it has
> not been used by any other architecture, so it should be safe to make it
> common.

INVALID_GPA is defined as all bits set to 1. Please note the "~" (tilde).

#define INVALID_GPA (~(gpa_t)0)
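
If it does get moved, something like the below should work as the
common fallback (untested sketch; the #ifndef guard is just my
assumption so that an architecture defining its own INVALID_GPA keeps
its value):

	/* Untested sketch for include/linux/kvm_host.h */
	#ifndef INVALID_GPA
	#define INVALID_GPA	(~(gpa_t)0)
	#endif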
--
Isaku Yamahata <[email protected]>

2022-12-07 08:56:37

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> Unmap the existing guest mappings when the memory attribute is changed
> between shared and private. This is needed because shared pages and
> private pages come from different backends; unmapping the existing ones
> gives the page fault handler a chance to re-populate the mappings
> according to the new attribute.
>
> Only architectures with private memory support need this, and a
> supporting architecture is expected to override the weak
> kvm_arch_has_private_mem().
>
> Also, during the attribute change and the unmapping time frame, a page
> fault may happen in the same memory range and could install incorrect
> page state, so invoke the kvm_mmu_invalidate_* helpers to make the page
> fault handler retry during this window.
>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/linux/kvm_host.h | 7 +-
> virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> 2 files changed, 116 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3d69484d2704..3331c0c92838 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> struct kvm_gfn_range {
> struct kvm_memory_slot *slot;
> gfn_t start;
> @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> bool may_block;
> };
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -785,11 +786,12 @@ struct kvm {
>
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> struct mmu_notifier mmu_notifier;
> +#endif
> unsigned long mmu_invalidate_seq;
> long mmu_invalidate_in_progress;
> gfn_t mmu_invalidate_range_start;
> gfn_t mmu_invalidate_range_end;
> -#endif
> +
> struct list_head devices;
> u64 manual_dirty_log_protect;
> struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
> #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ad55dfbc75d7..4e1e1e113bf0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> }
> EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> +{
> + /*
> + * The count increase must become visible at unlock time as no
> + * spte can be established without taking the mmu_lock and
> + * count is also read inside the mmu_lock critical section.
> + */
> + kvm->mmu_invalidate_in_progress++;
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = INVALID_GPA;
> + kvm->mmu_invalidate_range_end = INVALID_GPA;
> + }
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = start;
> + kvm->mmu_invalidate_range_end = end;
> + } else {
> + /*
> + * Fully tracking multiple concurrent ranges has diminishing
> + * returns. Keep things simple and just find the minimal range
> + * which includes the current and new ranges. As there won't be
> + * enough information to subtract a range after its invalidate
> + * completes, any ranges invalidated concurrently will
> + * accumulate and persist until all outstanding invalidates
> + * complete.
> + */
> + kvm->mmu_invalidate_range_start =
> + min(kvm->mmu_invalidate_range_start, start);
> + kvm->mmu_invalidate_range_end =
> + max(kvm->mmu_invalidate_range_end, end);
> + }
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
> +{
> + /*
> + * This sequence increase will notify the kvm page fault that
> + * the page that is going to be mapped in the spte could have
> + * been freed.
> + */
> + kvm->mmu_invalidate_seq++;
> + smp_wmb();
> + /*
> + * The above sequence increase must be visible before the
> + * below count decrease, which is ensured by the smp_wmb above
> + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> + */
> + kvm->mmu_invalidate_in_progress--;
> +}
> +
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> {
> @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> -{
> - /*
> - * The count increase must become visible at unlock time as no
> - * spte can be established without taking the mmu_lock and
> - * count is also read inside the mmu_lock critical section.
> - */
> - kvm->mmu_invalidate_in_progress++;
> -
> - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> - kvm->mmu_invalidate_range_start = INVALID_GPA;
> - kvm->mmu_invalidate_range_end = INVALID_GPA;
> - }
> -}
> -
> -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> -
> - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> - kvm->mmu_invalidate_range_start = start;
> - kvm->mmu_invalidate_range_end = end;
> - } else {
> - /*
> - * Fully tracking multiple concurrent ranges has diminishing
> - * returns. Keep things simple and just find the minimal range
> - * which includes the current and new ranges. As there won't be
> - * enough information to subtract a range after its invalidate
> - * completes, any ranges invalidated concurrently will
> - * accumulate and persist until all outstanding invalidates
> - * complete.
> - */
> - kvm->mmu_invalidate_range_start =
> - min(kvm->mmu_invalidate_range_start, start);
> - kvm->mmu_invalidate_range_end =
> - max(kvm->mmu_invalidate_range_end, end);
> - }
> -}
> -
> static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> return 0;
> }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm)
> -{
> - /*
> - * This sequence increase will notify the kvm page fault that
> - * the page that is going to be mapped in the spte could have
> - * been freed.
> - */
> - kvm->mmu_invalidate_seq++;
> - smp_wmb();
> - /*
> - * The above sequence increase must be visible before the
> - * below count decrease, which is ensured by the smp_wmb above
> - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> - */
> - kvm->mmu_invalidate_in_progress--;
> -}
> -
> static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> const struct mmu_notifier_range *range)
> {
> @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> return 0;
> }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> + return false;
> +}
> +
> static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> {
> struct kvm *kvm = kvm_arch_alloc_vm();
> @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> return 0;
> }
>
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + struct kvm_gfn_range gfn_range;
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> + int r = 0;
> +
> + gfn_range.pte = __pte(0);
> + gfn_range.may_block = true;
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> + slot = iter.slot;
> + gfn_range.start = max(start, slot->base_gfn);
> + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> + if (gfn_range.start >= gfn_range.end)
> + continue;
> + gfn_range.slot = slot;
> +
> + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> + }
> + }
> +
> + if (r)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> struct kvm_memory_attributes *attrs)
> {
> gfn_t start, end;
> unsigned long i;
> void *entry;
> + int idx;
> u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>
> - /* flags is currently not used. */
> + /* 'flags' is currently not used. */
> if (attrs->flags)
> return -EINVAL;
> if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
> entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm);
> + kvm_mmu_invalidate_range_add(kvm, start, end);

Nit: this works for KVM_MEMORY_ATTRIBUTE_PRIVATE, but the invalidation
should also be necessary for attribute changes of:

KVM_MEMORY_ATTRIBUTE_READ
KVM_MEMORY_ATTRIBUTE_WRITE
KVM_MEMORY_ATTRIBUTE_EXECUTE

> + KVM_MMU_UNLOCK(kvm);
> + }
> +
> mutex_lock(&kvm->lock);
> for (i = start; i < end; i++)
> if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> break;
> mutex_unlock(&kvm->lock);
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + idx = srcu_read_lock(&kvm->srcu);
> + KVM_MMU_LOCK(kvm);
> + if (i > start)
> + kvm_unmap_mem_range(kvm, start, i);
> + kvm_mmu_invalidate_end(kvm);

Ditto.

> + KVM_MMU_UNLOCK(kvm);
> + srcu_read_unlock(&kvm->srcu, idx);
> + }
> +
> attrs->address = i << PAGE_SHIFT;
> attrs->size = (end - i) << PAGE_SHIFT;
>
> --
> 2.25.1
>
>

2022-12-07 15:22:01

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Tue, Dec 06, 2022 at 12:39:18PM +0000, Fuad Tabba wrote:
> Hi Chao,
>
> On Tue, Dec 6, 2022 at 11:58 AM Chao Peng <[email protected]> wrote:
> >
> > On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> > > Hi Chao,
> > >
> > > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <[email protected]> wrote:
> > > >
> > > > In memory encryption usage, guest memory may be encrypted with a
> > > > special key and can be accessed only by the guest itself. We call
> > > > such memory private memory. It is pointless, and can sometimes cause
> > > > problems, to allow userspace to access guest private memory. This new
> > > > KVM memslot extension allows guest private memory to be provided
> > > > through a restrictedmem-backed file descriptor (fd), and userspace is
> > > > restricted from accessing the bookmarked memory in the fd.
> > > >
> > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds
> > > > two additional KVM memslot fields, restricted_fd/restricted_offset,
> > > > to allow userspace to instruct KVM to provide guest memory through
> > > > restricted_fd. 'guest_phys_addr' is mapped at the restricted_offset
> > > > of restricted_fd and the size is 'memory_size'.
> > > >
> > > > The extended memslot can still have the userspace_addr (hva). When in
> > > > use, a single memslot can maintain both private memory through
> > > > restricted_fd and shared memory through userspace_addr. Whether the
> > > > private or the shared part is visible to the guest is maintained by
> > > > other KVM code.
> > > >
> > > > A restrictedmem_notifier field is also added to the memslot structure
> > > > to allow the restricted_fd's backing store to notify KVM of memory
> > > > changes; KVM can then invalidate its page table entries or handle
> > > > memory errors.
> > > >
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > >
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > >
> > > > Co-developed-by: Yu Zhang <[email protected]>
> > > > Signed-off-by: Yu Zhang <[email protected]>
> > > > Signed-off-by: Chao Peng <[email protected]>
> > > > Reviewed-by: Fuad Tabba <[email protected]>
> > > > Tested-by: Fuad Tabba <[email protected]>
> > >
> > > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > > patch series anymore. Any reason you removed it, or is it just an
> > > omission?
> >
> > We had some discussion in v9 [1] to add generic memory attributes ioctls
> > and KVM_CAP_PRIVATE_MEM can be implemented as a new
> > KVM_MEMORY_ATTRIBUTE_PRIVATE flag via KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
> > ioctl [2]. The api doc has been updated:
> >
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) ??
> >
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> > [2]
> > https://lore.kernel.org/linux-mm/[email protected]/
>
> I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
> and my Reviewed/Tested-by still apply.

Thanks for the info.
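
FWIW, once KVM_MEM_PRIVATE is actually accepted by
KVM_SET_USER_MEMORY_REGION (that happens later in the series), the
expected userspace flow looks roughly like the sketch below. It is
untested and just for illustration; the struct/flag names come from the
uapi additions in this series, error handling is omitted and details
may still change:

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Rough sketch: back a slot's private half with a restrictedmem fd. */
	static int add_private_slot(int vm_fd, int restricted_fd, __u64 gpa,
				    __u64 size, void *shared_hva)
	{
		struct kvm_userspace_memory_region_ext ext;
		__u64 attrs = 0;

		/* Usable only if the PRIVATE attribute is reported as supported. */
		if (ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &attrs) < 0 ||
		    !(attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE))
			return -1;

		memset(&ext, 0, sizeof(ext));
		ext.region.slot = 0;
		ext.region.flags = KVM_MEM_PRIVATE;
		ext.region.guest_phys_addr = gpa;
		ext.region.memory_size = size;
		ext.region.userspace_addr = (__u64)(unsigned long)shared_hva; /* shared part */
		ext.restricted_fd = restricted_fd;	/* private part */
		ext.restricted_offset = 0;

		return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
	}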

Chao
>
> Cheers,
> /fuad
>
> >
> > Thanks,
> > Chao
> > >
> > > [*] https://lore.kernel.org/linux-mm/[email protected]/
> > >
> > > Thanks,
> > > /fuad
> > >
> > > > ---
> > > > Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > > > arch/x86/kvm/Kconfig | 2 ++
> > > > arch/x86/kvm/x86.c | 2 +-
> > > > include/linux/kvm_host.h | 8 ++++--
> > > > include/uapi/linux/kvm.h | 28 +++++++++++++++++++
> > > > virt/kvm/Kconfig | 3 +++
> > > > virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------
> > > > 7 files changed, 114 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index bb2f709c0900..99352170c130 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > > > :Capability: KVM_CAP_USER_MEMORY
> > > > :Architectures: all
> > > > :Type: vm ioctl
> > > > -:Parameters: struct kvm_userspace_memory_region (in)
> > > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > > > :Returns: 0 on success, -1 on error
> > > >
> > > > ::
> > > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > > };
> > > >
> > > > + struct kvm_userspace_memory_region_ext {
> > > > + struct kvm_userspace_memory_region region;
> > > > + __u64 restricted_offset;
> > > > + __u32 restricted_fd;
> > > > + __u32 pad1;
> > > > + __u64 pad2[14];
> > > > + };
> > > > +
> > > > /* for kvm_memory_region::flags */
> > > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > > > #define KVM_MEM_READONLY (1UL << 1)
> > > > + #define KVM_MEM_PRIVATE (1UL << 2)
> > > >
> > > > This ioctl allows the user to create, modify or delete a guest physical
> > > > memory slot. Bits 0-15 of "slot" specify the slot id and this value
> > > > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > > > be identical. This allows large pages in the guest to be backed by large
> > > > pages in the host.
> > > >
> > > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > > > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> > > > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> > > > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > > > -to make a new slot read-only. In this case, writes to this memory will be
> > > > -posted to userspace as KVM_EXIT_MMIO exits.
> > > > +kvm_userspace_memory_region_ext struct includes all fields of
> > > > +kvm_userspace_memory_region struct, while also adds additional fields for some
> > > > +other features. See below description of flags field for more information.
> > > > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > > > +
> > > > +The flags field supports following flags:
> > > > +
> > > > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > > > + within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > > > +
> > > > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > > > + read-only. In this case, writes to this memory will be posted to userspace as
> > > > + KVM_EXIT_MMIO exits.
> > > > +
> > > > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > > > + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > > > + memory backed by a file descriptor(fd) and userspace access to the fd may be
> > > > + restricted. Userspace should use restricted_fd/restricted_offset in the
> > > > + kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > > > + to guest. Userspace should guarantee not to map the same host physical address
> > > > + indicated by restricted_fd/restricted_offset to different guest physical
> > > > + addresses within multiple memslots. Failed to do this may result undefined
> > > > + behavior.
> > > >
> > > > When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > > > the memory region are automatically reflected into the guest. For example, an
> > > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > > index a8e379a3afee..690cb21010e7 100644
> > > > --- a/arch/x86/kvm/Kconfig
> > > > +++ b/arch/x86/kvm/Kconfig
> > > > @@ -50,6 +50,8 @@ config KVM
> > > > select INTERVAL_TREE
> > > > select HAVE_KVM_PM_NOTIFIER if PM
> > > > select HAVE_KVM_MEMORY_ATTRIBUTES
> > > > + select HAVE_KVM_RESTRICTED_MEM if X86_64
> > > > + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > > > help
> > > > Support hosting fully virtualized guest machines using hardware
> > > > virtualization extensions. You will need a fairly recent
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 7f850dfb4086..9a07380f8d3c 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > > > }
> > > >
> > > > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > > - struct kvm_userspace_memory_region m;
> > > > + struct kvm_user_mem_region m;
> > > >
> > > > m.slot = id | (i << 16);
> > > > m.flags = 0;
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index a784e2b06625..02347e386ea2 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -44,6 +44,7 @@
> > > >
> > > > #include <asm/kvm_host.h>
> > > > #include <linux/kvm_dirty_ring.h>
> > > > +#include <linux/restrictedmem.h>
> > > >
> > > > #ifndef KVM_MAX_VCPU_IDS
> > > > #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > > > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > > > u32 flags;
> > > > short id;
> > > > u16 as_id;
> > > > + struct file *restricted_file;
> > > > + loff_t restricted_offset;
> > > > + struct restrictedmem_notifier notifier;
> > > > };
> > > >
> > > > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > > > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > > > };
> > > >
> > > > int kvm_set_memory_region(struct kvm *kvm,
> > > > - const struct kvm_userspace_memory_region *mem);
> > > > + const struct kvm_user_mem_region *mem);
> > > > int __kvm_set_memory_region(struct kvm *kvm,
> > > > - const struct kvm_userspace_memory_region *mem);
> > > > + const struct kvm_user_mem_region *mem);
> > > > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > > > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > > > int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > > index 5d0941acb5bb..13bff963b8b0 100644
> > > > --- a/include/uapi/linux/kvm.h
> > > > +++ b/include/uapi/linux/kvm.h
> > > > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > > };
> > > >
> > > > +struct kvm_userspace_memory_region_ext {
> > > > + struct kvm_userspace_memory_region region;
> > > > + __u64 restricted_offset;
> > > > + __u32 restricted_fd;
> > > > + __u32 pad1;
> > > > + __u64 pad2[14];
> > > > +};
> > > > +
> > > > +#ifdef __KERNEL__
> > > > +/*
> > > > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > > > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > > > + * all fields from the top-level "extended" region.
> > > > + */
> > > > +struct kvm_user_mem_region {
> > > > + __u32 slot;
> > > > + __u32 flags;
> > > > + __u64 guest_phys_addr;
> > > > + __u64 memory_size;
> > > > + __u64 userspace_addr;
> > > > + __u64 restricted_offset;
> > > > + __u32 restricted_fd;
> > > > + __u32 pad1;
> > > > + __u64 pad2[14];
> > > > +};
> > > > +#endif
> > > > +
> > > > /*
> > > > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > > > * other bits are reserved for kvm internal use which are defined in
> > > > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > > > */
> > > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > > > #define KVM_MEM_READONLY (1UL << 1)
> > > > +#define KVM_MEM_PRIVATE (1UL << 2)
> > > >
> > > > /* for KVM_IRQ_LINE */
> > > > struct kvm_irq_level {
> > > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > > index effdea5dd4f0..d605545d6dd1 100644
> > > > --- a/virt/kvm/Kconfig
> > > > +++ b/virt/kvm/Kconfig
> > > > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> > > >
> > > > config HAVE_KVM_PM_NOTIFIER
> > > > bool
> > > > +
> > > > +config HAVE_KVM_RESTRICTED_MEM
> > > > + bool
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index 7f0f5e9f2406..b882eb2c76a2 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > > > }
> > > > }
> > > >
> > > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > > > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > > > {
> > > > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > > >
> > > > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > > > * Must be called holding kvm->slots_lock for write.
> > > > */
> > > > int __kvm_set_memory_region(struct kvm *kvm,
> > > > - const struct kvm_userspace_memory_region *mem)
> > > > + const struct kvm_user_mem_region *mem)
> > > > {
> > > > struct kvm_memory_slot *old, *new;
> > > > struct kvm_memslots *slots;
> > > > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > > EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > > >
> > > > int kvm_set_memory_region(struct kvm *kvm,
> > > > - const struct kvm_userspace_memory_region *mem)
> > > > + const struct kvm_user_mem_region *mem)
> > > > {
> > > > int r;
> > > >
> > > > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > > > EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> > > >
> > > > static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > > > - struct kvm_userspace_memory_region *mem)
> > > > + struct kvm_user_mem_region *mem)
> > > > {
> > > > if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > > > return -EINVAL;
> > > > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > > > return fd;
> > > > }
> > > >
> > > > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \
> > > > +do { \
> > > > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > > > + offsetof(struct kvm_userspace_memory_region, field)); \
> > > > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > > > + sizeof_field(struct kvm_userspace_memory_region, field)); \
> > > > +} while (0)
> > > > +
> > > > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
> > > > +do { \
> > > > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > > > + offsetof(struct kvm_userspace_memory_region_ext, field)); \
> > > > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > > > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
> > > > +} while (0)
> > > > +
> > > > +static void kvm_sanity_check_user_mem_region_alias(void)
> > > > +{
> > > > + SANITY_CHECK_MEM_REGION_FIELD(slot);
> > > > + SANITY_CHECK_MEM_REGION_FIELD(flags);
> > > > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > > > + SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > > > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > > > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > > > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > > > +}
> > > > +
> > > > static long kvm_vm_ioctl(struct file *filp,
> > > > unsigned int ioctl, unsigned long arg)
> > > > {
> > > > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > > > break;
> > > > }
> > > > case KVM_SET_USER_MEMORY_REGION: {
> > > > - struct kvm_userspace_memory_region kvm_userspace_mem;
> > > > + struct kvm_user_mem_region mem;
> > > > + unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > > > +
> > > > + kvm_sanity_check_user_mem_region_alias();
> > > >
> > > > r = -EFAULT;
> > > > - if (copy_from_user(&kvm_userspace_mem, argp,
> > > > - sizeof(kvm_userspace_mem)))
> > > > + if (copy_from_user(&mem, argp, size))
> > > > + goto out;
> > > > +
> > > > + r = -EINVAL;
> > > > + if (mem.flags & KVM_MEM_PRIVATE)
> > > > goto out;
> > > >
> > > > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > > > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > > break;
> > > > }
> > > > case KVM_GET_DIRTY_LOG: {
> > > > --
> > > > 2.25.1
> > > >

2022-12-07 15:49:01

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

On Tue, Dec 06, 2022 at 10:34:11PM -0800, Isaku Yamahata wrote:
> On Tue, Dec 06, 2022 at 07:56:23PM +0800,
> Chao Peng <[email protected]> wrote:
>
> > > > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > > - hva >= kvm->mmu_invalidate_range_start &&
> > > > - hva < kvm->mmu_invalidate_range_end)
> > > > - return 1;
> > > > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > > + /*
> > > > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > > + * but before updating the range is a KVM bug.
> > > > + */
> > > > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > > + kvm->mmu_invalidate_range_end == INVALID_GPA))
> > >
> > > INVALID_GPA is an x86-specific define in
> > > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > > architectures. The obvious fix is to move it to
> > > include/linux/kvm_host.h.
> >
> > Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident this
> > is the correct choice for other architectures, but after a search it has
> > not been used by any other architecture, so it should be safe to make it
> > common.
>
> INVALID_GPA is defined as all bit 1. Please notice "~" (tilde).
>
> #define INVALID_GPA (~(gpa_t)0)

Thanks for mentioning that. It still looks right to move it to include/linux/kvm_host.h.
Chao
> --
> Isaku Yamahata <[email protected]>

2022-12-07 17:32:12

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

Hi,

On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <[email protected]> wrote:
>
> Unmap the existing guest mappings when the memory attribute is changed
> between shared and private. This is needed because shared pages and
> private pages come from different backends; unmapping the existing ones
> gives the page fault handler a chance to re-populate the mappings
> according to the new attribute.
>
> Only architectures with private memory support need this, and a
> supporting architecture is expected to override the weak
> kvm_arch_has_private_mem().

This kind of ties into the discussion of being able to share memory in
place. For pKVM for example, shared and private memory would have the
same backend, and the unmapping wouldn't be needed.

So I guess that, instead of kvm_arch_has_private_mem(), the check could
be done differently, e.g., with a different function, say
kvm_arch_private_notify_attribute_change() (but maybe with a friendlier
name than what I suggested :) )?
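
Something along these lines, purely as a placeholder sketch (the name
and the bool return value are only my suggestion, nothing from this
series):

	/*
	 * Placeholder: let the architecture decide whether an attribute
	 * change needs the unmap + invalidate dance. pKVM, where shared
	 * and private memory share a backend, could return false here.
	 */
	bool __weak kvm_arch_attr_change_needs_unmap(struct kvm *kvm)
	{
		return true;
	}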

Thanks,
/fuad

>
> Also, during the attribute change and the unmapping time frame, a page
> fault may happen in the same memory range and could install incorrect
> page state, so invoke the kvm_mmu_invalidate_* helpers to make the page
> fault handler retry during this window.
>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/linux/kvm_host.h | 7 +-
> virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> 2 files changed, 116 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3d69484d2704..3331c0c92838 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> struct kvm_gfn_range {
> struct kvm_memory_slot *slot;
> gfn_t start;
> @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> bool may_block;
> };
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -785,11 +786,12 @@ struct kvm {
>
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> struct mmu_notifier mmu_notifier;
> +#endif
> unsigned long mmu_invalidate_seq;
> long mmu_invalidate_in_progress;
> gfn_t mmu_invalidate_range_start;
> gfn_t mmu_invalidate_range_end;
> -#endif
> +
> struct list_head devices;
> u64 manual_dirty_log_protect;
> struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
> #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ad55dfbc75d7..4e1e1e113bf0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> }
> EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> +{
> + /*
> + * The count increase must become visible at unlock time as no
> + * spte can be established without taking the mmu_lock and
> + * count is also read inside the mmu_lock critical section.
> + */
> + kvm->mmu_invalidate_in_progress++;
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = INVALID_GPA;
> + kvm->mmu_invalidate_range_end = INVALID_GPA;
> + }
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = start;
> + kvm->mmu_invalidate_range_end = end;
> + } else {
> + /*
> + * Fully tracking multiple concurrent ranges has diminishing
> + * returns. Keep things simple and just find the minimal range
> + * which includes the current and new ranges. As there won't be
> + * enough information to subtract a range after its invalidate
> + * completes, any ranges invalidated concurrently will
> + * accumulate and persist until all outstanding invalidates
> + * complete.
> + */
> + kvm->mmu_invalidate_range_start =
> + min(kvm->mmu_invalidate_range_start, start);
> + kvm->mmu_invalidate_range_end =
> + max(kvm->mmu_invalidate_range_end, end);
> + }
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
> +{
> + /*
> + * This sequence increase will notify the kvm page fault that
> + * the page that is going to be mapped in the spte could have
> + * been freed.
> + */
> + kvm->mmu_invalidate_seq++;
> + smp_wmb();
> + /*
> + * The above sequence increase must be visible before the
> + * below count decrease, which is ensured by the smp_wmb above
> + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> + */
> + kvm->mmu_invalidate_in_progress--;
> +}
> +
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> {
> @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> -{
> - /*
> - * The count increase must become visible at unlock time as no
> - * spte can be established without taking the mmu_lock and
> - * count is also read inside the mmu_lock critical section.
> - */
> - kvm->mmu_invalidate_in_progress++;
> -
> - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> - kvm->mmu_invalidate_range_start = INVALID_GPA;
> - kvm->mmu_invalidate_range_end = INVALID_GPA;
> - }
> -}
> -
> -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> -
> - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> - kvm->mmu_invalidate_range_start = start;
> - kvm->mmu_invalidate_range_end = end;
> - } else {
> - /*
> - * Fully tracking multiple concurrent ranges has diminishing
> - * returns. Keep things simple and just find the minimal range
> - * which includes the current and new ranges. As there won't be
> - * enough information to subtract a range after its invalidate
> - * completes, any ranges invalidated concurrently will
> - * accumulate and persist until all outstanding invalidates
> - * complete.
> - */
> - kvm->mmu_invalidate_range_start =
> - min(kvm->mmu_invalidate_range_start, start);
> - kvm->mmu_invalidate_range_end =
> - max(kvm->mmu_invalidate_range_end, end);
> - }
> -}
> -
> static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> return 0;
> }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm)
> -{
> - /*
> - * This sequence increase will notify the kvm page fault that
> - * the page that is going to be mapped in the spte could have
> - * been freed.
> - */
> - kvm->mmu_invalidate_seq++;
> - smp_wmb();
> - /*
> - * The above sequence increase must be visible before the
> - * below count decrease, which is ensured by the smp_wmb above
> - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> - */
> - kvm->mmu_invalidate_in_progress--;
> -}
> -
> static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> const struct mmu_notifier_range *range)
> {
> @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> return 0;
> }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> + return false;
> +}
> +
> static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> {
> struct kvm *kvm = kvm_arch_alloc_vm();
> @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> return 0;
> }
>
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + struct kvm_gfn_range gfn_range;
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> + int r = 0;
> +
> + gfn_range.pte = __pte(0);
> + gfn_range.may_block = true;
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> + slot = iter.slot;
> + gfn_range.start = max(start, slot->base_gfn);
> + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> + if (gfn_range.start >= gfn_range.end)
> + continue;
> + gfn_range.slot = slot;
> +
> + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> + }
> + }
> +
> + if (r)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> struct kvm_memory_attributes *attrs)
> {
> gfn_t start, end;
> unsigned long i;
> void *entry;
> + int idx;
> u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>
> - /* flags is currently not used. */
> + /* 'flags' is currently not used. */
> if (attrs->flags)
> return -EINVAL;
> if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
> entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm);
> + kvm_mmu_invalidate_range_add(kvm, start, end);
> + KVM_MMU_UNLOCK(kvm);
> + }
> +
> mutex_lock(&kvm->lock);
> for (i = start; i < end; i++)
> if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> break;
> mutex_unlock(&kvm->lock);
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + idx = srcu_read_lock(&kvm->srcu);
> + KVM_MMU_LOCK(kvm);
> + if (i > start)
> + kvm_unmap_mem_range(kvm, start, i);
> + kvm_mmu_invalidate_end(kvm);
> + KVM_MMU_UNLOCK(kvm);
> + srcu_read_unlock(&kvm->srcu, idx);
> + }
> +
> attrs->address = i << PAGE_SHIFT;
> attrs->size = (end - i) << PAGE_SHIFT;
>
> --
> 2.25.1
>

2022-12-08 02:55:34

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v10 8/9] KVM: Handle page fault for private memory

On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> hva-based shared memory. Architecture code (like TDX code) can tell
> whether the ongoing fault is private or not. This patch adds an
> 'is_private' field to kvm_page_fault to indicate this, and architecture
> code is expected to set it.
>
> To handle page faults for such a memslot, the handling logic differs
> depending on whether the fault is private or shared. KVM checks if
> 'is_private' matches the host's view of the page (maintained in
> mem_attr_array).
> - On a successful match, the private pfn is obtained with
> restrictedmem_get_page() and the shared pfn is obtained with the
> existing get_user_pages().
> - On a mismatch, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> userspace. Userspace can then convert the memory between private/shared
> in the host's view and retry the fault.
>
> Co-developed-by: Yu Zhang <[email protected]>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> arch/x86/kvm/mmu/mmutrace.h | 1 +
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> include/linux/kvm_host.h | 30 ++++++++++++++++
> 5 files changed, 105 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2190fd8c95c0..b1953ebc012e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>
> int kvm_mmu_max_mapping_level(struct kvm *kvm,
> const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level)
> + int max_level, bool is_private)
> {
> struct kvm_lpage_info *linfo;
> int host_level;
> @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> break;
> }
>
> + if (is_private)
> + return max_level;

The lpage mixed information is already saved, so is it possible
to query info->disallow_lpage without caring about 'is_private'?

> +
> if (max_level == PG_LEVEL_4K)
> return PG_LEVEL_4K;
>
> @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> * level, which will be used to do precise, accurate accounting.
> */
> fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> - fault->gfn, fault->max_level);
> + fault->gfn, fault->max_level,
> + fault->is_private);
> if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> return;
>
> @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> }
>
> +static inline u8 order_to_level(int order)
> +{
> + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> + return PG_LEVEL_1G;
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> + return PG_LEVEL_2M;
> +
> + return PG_LEVEL_4K;
> +}
> +
> +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + if (fault->is_private)
> + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> + else
> + vcpu->run->memory.flags = 0;
> + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> + vcpu->run->memory.size = PAGE_SIZE;
> + return RET_PF_USER;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + int order;
> + struct kvm_memory_slot *slot = fault->slot;
> +
> + if (!kvm_slot_can_be_private(slot))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> + return RET_PF_RETRY;
> +
> + fault->max_level = min(order_to_level(order), fault->max_level);
> + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> + return RET_PF_CONTINUE;
> +}
> +
> static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_memory_slot *slot = fault->slot;
> @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return RET_PF_EMULATE;
> }
>
> + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + if (fault->is_private)
> + return kvm_faultin_pfn_private(vcpu, fault);
> +
> async = false;
> fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> fault->write, &fault->map_writable,
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> return -EIO;
> }
>
> + if (r == RET_PF_USER)
> + return 0;
> +
> if (r < 0)
> return r;
> if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> */
> if (sp->role.direct &&
> sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> - PG_LEVEL_NUM)) {
> + PG_LEVEL_NUM,
> + false)) {
> kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>
> if (kvm_available_flush_tlb_with_range())
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index dbaf6755c5a7..5ccf08183b00 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -189,6 +189,7 @@ struct kvm_page_fault {
>
> /* Derived from mmu and global state. */
> const bool is_tdp;
> + const bool is_private;
> const bool nx_huge_page_workaround_enabled;
>
> /*
> @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * RET_PF_RETRY: let CPU fault again on the address.
> * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> + * RET_PF_USER: need to exit to userspace to handle this fault.
> * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> *
> @@ -253,6 +255,7 @@ enum {
> RET_PF_RETRY,
> RET_PF_EMULATE,
> RET_PF_INVALID,
> + RET_PF_USER,
> RET_PF_FIXED,
> RET_PF_SPURIOUS,
> };
> @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
> int kvm_mmu_max_mapping_level(struct kvm *kvm,
> const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level);
> + int max_level, bool is_private);
> void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
>
> @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + WARN_ON_ONCE(1);
> + return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..2d7555381955 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> TRACE_DEFINE_ENUM(RET_PF_RETRY);
> TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> TRACE_DEFINE_ENUM(RET_PF_INVALID);
> +TRACE_DEFINE_ENUM(RET_PF_USER);
> TRACE_DEFINE_ENUM(RET_PF_FIXED);
> TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 771210ce5181..8ba1a4afc546 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> continue;
>
> max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> - iter.gfn, PG_LEVEL_NUM);
> + iter.gfn, PG_LEVEL_NUM, false);
> if (max_mapping_level < iter.level)
> continue;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> }
> #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + int ret;
> + struct page *page;
> + pgoff_t index = gfn - slot->base_gfn +
> + (slot->restricted_offset >> PAGE_SHIFT);
> +
> + ret = restrictedmem_get_page(slot->restricted_file, index,
> + &page, order);
> + *pfn = page_to_pfn(page);
> + return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif
> --
> 2.25.1
>
>

2022-12-08 08:43:04

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On 12/2/2022 2:13 PM, Chao Peng wrote:

..

> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
>

From the patch implementation, I have no idea why
HAVE_KVM_RESTRICTED_MEM is needed.

2022-12-08 11:21:51

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

On Wed, Dec 07, 2022 at 05:16:34PM +0000, Fuad Tabba wrote:
> Hi,
>
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <[email protected]> wrote:
> >
> > Unmap the existing guest mappings when the memory attribute is changed
> > between shared and private. This is needed because shared pages and
> > private pages come from different backends; unmapping the existing ones
> > gives the page fault handler a chance to re-populate the mappings
> > according to the new attribute.
> >
> > Only architectures with private memory support need this, and a
> > supporting architecture is expected to override the weak
> > kvm_arch_has_private_mem().
>
> This kind of ties into the discussion of being able to share memory in
> place. For pKVM for example, shared and private memory would have the
> same backend, and the unmapping wouldn't be needed.
>
> So I guess that, instead of kvm_arch_has_private_mem(), the check could
> be done differently, e.g., with a different function, say
> kvm_arch_private_notify_attribute_change() (but maybe with a friendlier
> name than what I suggested :) )?

Besides controlling the unmapping here, kvm_arch_has_private_mem() is
also used to gate the memslot KVM_MEM_PRIVATE flag in patch 09. I know
the unmapping is confirmed unnecessary for pKVM, but how about
KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
If the answer is the latter, then yes, we should use a different check
here that only applies to confidential usages.
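
Just for illustration (this is not code from the series, and the field
and constant below are placeholders), the expectation is that an
architecture which wants KVM_MEM_PRIVATE overrides the weak helper,
e.g. keyed off its VM type:

	/* Illustrative only: x86/TDX would key this off its protected VM type. */
	bool kvm_arch_has_private_mem(struct kvm *kvm)
	{
		return kvm->arch.vm_type == KVM_X86_PROTECTED_VM;
	}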

Thanks,
Chao
>
> Thanks,
> /fuad
>
> >
> > Also, during the attribute change and the unmapping time frame, a page
> > fault may happen in the same memory range and could install incorrect
> > page state, so invoke the kvm_mmu_invalidate_* helpers to make the page
> > fault handler retry during this window.
> >
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > include/linux/kvm_host.h | 7 +-
> > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > 2 files changed, 116 insertions(+), 59 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3d69484d2704..3331c0c92838 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > struct kvm_gfn_range {
> > struct kvm_memory_slot *slot;
> > gfn_t start;
> > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > bool may_block;
> > };
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -785,11 +786,12 @@ struct kvm {
> >
> > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > struct mmu_notifier mmu_notifier;
> > +#endif
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > gfn_t mmu_invalidate_range_start;
> > gfn_t mmu_invalidate_range_end;
> > -#endif
> > +
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > struct dentry *debugfs_dentry;
> > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > int kvm_arch_post_init_vm(struct kvm *kvm);
> > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ad55dfbc75d7..4e1e1e113bf0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > }
> > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > +{
> > + /*
> > + * The count increase must become visible at unlock time as no
> > + * spte can be established without taking the mmu_lock and
> > + * count is also read inside the mmu_lock critical section.
> > + */
> > + kvm->mmu_invalidate_in_progress++;
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = start;
> > + kvm->mmu_invalidate_range_end = end;
> > + } else {
> > + /*
> > + * Fully tracking multiple concurrent ranges has diminishing
> > + * returns. Keep things simple and just find the minimal range
> > + * which includes the current and new ranges. As there won't be
> > + * enough information to subtract a range after its invalidate
> > + * completes, any ranges invalidated concurrently will
> > + * accumulate and persist until all outstanding invalidates
> > + * complete.
> > + */
> > + kvm->mmu_invalidate_range_start =
> > + min(kvm->mmu_invalidate_range_start, start);
> > + kvm->mmu_invalidate_range_end =
> > + max(kvm->mmu_invalidate_range_end, end);
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > +{
> > + /*
> > + * This sequence increase will notify the kvm page fault that
> > + * the page that is going to be mapped in the spte could have
> > + * been freed.
> > + */
> > + kvm->mmu_invalidate_seq++;
> > + smp_wmb();
> > + /*
> > + * The above sequence increase must be visible before the
> > + * below count decrease, which is ensured by the smp_wmb above
> > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > + */
> > + kvm->mmu_invalidate_in_progress--;
> > +}
> > +
> > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > {
> > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > -{
> > - /*
> > - * The count increase must become visible at unlock time as no
> > - * spte can be established without taking the mmu_lock and
> > - * count is also read inside the mmu_lock critical section.
> > - */
> > - kvm->mmu_invalidate_in_progress++;
> > -
> > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > - }
> > -}
> > -
> > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > -
> > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > - kvm->mmu_invalidate_range_start = start;
> > - kvm->mmu_invalidate_range_end = end;
> > - } else {
> > - /*
> > - * Fully tracking multiple concurrent ranges has diminishing
> > - * returns. Keep things simple and just find the minimal range
> > - * which includes the current and new ranges. As there won't be
> > - * enough information to subtract a range after its invalidate
> > - * completes, any ranges invalidated concurrently will
> > - * accumulate and persist until all outstanding invalidates
> > - * complete.
> > - */
> > - kvm->mmu_invalidate_range_start =
> > - min(kvm->mmu_invalidate_range_start, start);
> > - kvm->mmu_invalidate_range_end =
> > - max(kvm->mmu_invalidate_range_end, end);
> > - }
> > -}
> > -
> > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > return 0;
> > }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > -{
> > - /*
> > - * This sequence increase will notify the kvm page fault that
> > - * the page that is going to be mapped in the spte could have
> > - * been freed.
> > - */
> > - kvm->mmu_invalidate_seq++;
> > - smp_wmb();
> > - /*
> > - * The above sequence increase must be visible before the
> > - * below count decrease, which is ensured by the smp_wmb above
> > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > - */
> > - kvm->mmu_invalidate_in_progress--;
> > -}
> > -
> > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > const struct mmu_notifier_range *range)
> > {
> > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > return 0;
> > }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > + return false;
> > +}
> > +
> > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > {
> > struct kvm *kvm = kvm_arch_alloc_vm();
> > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > return 0;
> > }
> >
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + struct kvm_gfn_range gfn_range;
> > + struct kvm_memory_slot *slot;
> > + struct kvm_memslots *slots;
> > + struct kvm_memslot_iter iter;
> > + int i;
> > + int r = 0;
> > +
> > + gfn_range.pte = __pte(0);
> > + gfn_range.may_block = true;
> > +
> > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > + slots = __kvm_memslots(kvm, i);
> > +
> > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > + slot = iter.slot;
> > + gfn_range.start = max(start, slot->base_gfn);
> > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > + if (gfn_range.start >= gfn_range.end)
> > + continue;
> > + gfn_range.slot = slot;
> > +
> > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > + }
> > + }
> > +
> > + if (r)
> > + kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > struct kvm_memory_attributes *attrs)
> > {
> > gfn_t start, end;
> > unsigned long i;
> > void *entry;
> > + int idx;
> > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> >
> > - /* flags is currently not used. */
> > + /* 'flags' is currently not used. */
> > if (attrs->flags)
> > return -EINVAL;
> > if (attrs->attributes & ~supported_attrs)
> > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >
> > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> >
> > + if (kvm_arch_has_private_mem(kvm)) {
> > + KVM_MMU_LOCK(kvm);
> > + kvm_mmu_invalidate_begin(kvm);
> > + kvm_mmu_invalidate_range_add(kvm, start, end);
> > + KVM_MMU_UNLOCK(kvm);
> > + }
> > +
> > mutex_lock(&kvm->lock);
> > for (i = start; i < end; i++)
> > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > break;
> > mutex_unlock(&kvm->lock);
> >
> > + if (kvm_arch_has_private_mem(kvm)) {
> > + idx = srcu_read_lock(&kvm->srcu);
> > + KVM_MMU_LOCK(kvm);
> > + if (i > start)
> > + kvm_unmap_mem_range(kvm, start, i);
> > + kvm_mmu_invalidate_end(kvm);
> > + KVM_MMU_UNLOCK(kvm);
> > + srcu_read_unlock(&kvm->srcu, idx);
> > + }
> > +
> > attrs->address = i << PAGE_SHIFT;
> > attrs->size = (end - i) << PAGE_SHIFT;
> >
> > --
> > 2.25.1
> >

2022-12-08 11:52:35

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 8/9] KVM: Handle page fault for private memory

On Thu, Dec 08, 2022 at 10:29:18AM +0800, Yuan Yao wrote:
> On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > hva-based shared memory. Architecture code (like TDX code) can tell
> > whether the on-going fault is private or not. This patch adds a
> > 'is_private' field to kvm_page_fault to indicate this and architecture
> > code is expected to set it.
> >
> > To handle page fault for such memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> > - For a successful match, private pfn is obtained with
> > restrictedmem_get_page() and shared pfn is obtained with existing
> > get_user_pages().
> > - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > userspace. Userspace then can convert memory between private/shared
> > in host's view and retry the fault.
> >
> > Co-developed-by: Yu Zhang <[email protected]>
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> > arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> > arch/x86/kvm/mmu/mmutrace.h | 1 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> > include/linux/kvm_host.h | 30 ++++++++++++++++
> > 5 files changed, 105 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2190fd8c95c0..b1953ebc012e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> >
> > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level)
> > + int max_level, bool is_private)
> > {
> > struct kvm_lpage_info *linfo;
> > int host_level;
> > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > break;
> > }
> >
> > + if (is_private)
> > + return max_level;
>
> lpage mixed information already saved, so is that possible
> to query info->disallow_lpage without care 'is_private' ?

Actually we already queried info->disallow_lpage just before this
sentence. The check is needed because later in the function we call
host_pfn_mapping_level() which is shared memory specific.
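
To spell that out, a rough sketch of the control flow being described
(not the real mmu.c code; the wrapper below is invented for
illustration):

        /*
         * host_pfn_mapping_level() derives the level from the host
         * userspace page tables via the slot's hva. fd-backed private
         * memory has no hva to walk, so private faults must return
         * before reaching it.
         */
        static int mapping_level_sketch(struct kvm *kvm, gfn_t gfn,
                                        const struct kvm_memory_slot *slot,
                                        bool is_private, int max_level)
        {
                if (is_private)
                        return max_level;

                return host_pfn_mapping_level(kvm, gfn, slot);
        }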

Thanks,
Chao
>
> > +
> > if (max_level == PG_LEVEL_4K)
> > return PG_LEVEL_4K;
> >
> > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > * level, which will be used to do precise, accurate accounting.
> > */
> > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > - fault->gfn, fault->max_level);
> > + fault->gfn, fault->max_level,
> > + fault->is_private);
> > if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > return;
> >
> > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > }
> >
> > +static inline u8 order_to_level(int order)
> > +{
> > + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > + return PG_LEVEL_1G;
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > + return PG_LEVEL_2M;
> > +
> > + return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > +{
> > + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > + if (fault->is_private)
> > + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > + else
> > + vcpu->run->memory.flags = 0;
> > + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > + vcpu->run->memory.size = PAGE_SIZE;
> > + return RET_PF_USER;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > +{
> > + int order;
> > + struct kvm_memory_slot *slot = fault->slot;
> > +
> > + if (!kvm_slot_can_be_private(slot))
> > + return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > + return RET_PF_RETRY;
> > +
> > + fault->max_level = min(order_to_level(order), fault->max_level);
> > + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > + return RET_PF_CONTINUE;
> > +}
> > +
> > static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> > struct kvm_memory_slot *slot = fault->slot;
> > @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return RET_PF_EMULATE;
> > }
> >
> > + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > + return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > + if (fault->is_private)
> > + return kvm_faultin_pfn_private(vcpu, fault);
> > +
> > async = false;
> > fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > fault->write, &fault->map_writable,
> > @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > return -EIO;
> > }
> >
> > + if (r == RET_PF_USER)
> > + return 0;
> > +
> > if (r < 0)
> > return r;
> > if (r != RET_PF_EMULATE)
> > @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > */
> > if (sp->role.direct &&
> > sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > - PG_LEVEL_NUM)) {
> > + PG_LEVEL_NUM,
> > + false)) {
> > kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> >
> > if (kvm_available_flush_tlb_with_range())
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index dbaf6755c5a7..5ccf08183b00 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -189,6 +189,7 @@ struct kvm_page_fault {
> >
> > /* Derived from mmu and global state. */
> > const bool is_tdp;
> > + const bool is_private;
> > const bool nx_huge_page_workaround_enabled;
> >
> > /*
> > @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > * RET_PF_RETRY: let CPU fault again on the address.
> > * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> > * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > + * RET_PF_USER: need to exit to userspace to handle this fault.
> > * RET_PF_FIXED: The faulting entry has been fixed.
> > * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> > *
> > @@ -253,6 +255,7 @@ enum {
> > RET_PF_RETRY,
> > RET_PF_EMULATE,
> > RET_PF_INVALID,
> > + RET_PF_USER,
> > RET_PF_FIXED,
> > RET_PF_SPURIOUS,
> > };
> > @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level);
> > + int max_level, bool is_private);
> > void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> >
> > @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > + WARN_ON_ONCE(1);
> > + return -EOPNOTSUPP;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > index ae86820cef69..2d7555381955 100644
> > --- a/arch/x86/kvm/mmu/mmutrace.h
> > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> > TRACE_DEFINE_ENUM(RET_PF_RETRY);
> > TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> > TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > +TRACE_DEFINE_ENUM(RET_PF_USER);
> > TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 771210ce5181..8ba1a4afc546 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > continue;
> >
> > max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > - iter.gfn, PG_LEVEL_NUM);
> > + iter.gfn, PG_LEVEL_NUM, false);
> > if (max_mapping_level < iter.level)
> > continue;
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 25099c94e770..153842bb33df 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > }
> > #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> > + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +}
> > +#else
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > + return false;
> > +}
> > +
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > + int ret;
> > + struct page *page;
> > + pgoff_t index = gfn - slot->base_gfn +
> > + (slot->restricted_offset >> PAGE_SHIFT);
> > +
> > + ret = restrictedmem_get_page(slot->restricted_file, index,
> > + &page, order);
> > + *pfn = page_to_pfn(page);
> > + return ret;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> > #endif
> > --
> > 2.25.1
> >
> >

2022-12-08 13:04:35

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

On Wed, Dec 07, 2022 at 04:13:14PM +0800, Yuan Yao wrote:
> On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> > Unmap the existing guest mappings when memory attribute is changed
> > between shared and private. This is needed because shared pages and
> > private pages are from different backends, unmapping existing ones
> > gives a chance for page fault handler to re-populate the mappings
> > according to the new attribute.
> >
> > Only architecture has private memory support needs this and the
> > supported architecture is expected to rewrite the weak
> > kvm_arch_has_private_mem().
> >
> > Also, during memory attribute changing and the unmapping time frame,
> > page fault handler may happen in the same memory range and can cause
> > incorrect page state, invoke kvm_mmu_invalidate_* helpers to let the
> > page fault handler retry during this time frame.
> >
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > include/linux/kvm_host.h | 7 +-
> > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > 2 files changed, 116 insertions(+), 59 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3d69484d2704..3331c0c92838 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > struct kvm_gfn_range {
> > struct kvm_memory_slot *slot;
> > gfn_t start;
> > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > bool may_block;
> > };
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -785,11 +786,12 @@ struct kvm {
> >
> > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > struct mmu_notifier mmu_notifier;
> > +#endif
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > gfn_t mmu_invalidate_range_start;
> > gfn_t mmu_invalidate_range_end;
> > -#endif
> > +
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > struct dentry *debugfs_dentry;
> > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > int kvm_arch_post_init_vm(struct kvm *kvm);
> > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ad55dfbc75d7..4e1e1e113bf0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > }
> > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > +{
> > + /*
> > + * The count increase must become visible at unlock time as no
> > + * spte can be established without taking the mmu_lock and
> > + * count is also read inside the mmu_lock critical section.
> > + */
> > + kvm->mmu_invalidate_in_progress++;
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = start;
> > + kvm->mmu_invalidate_range_end = end;
> > + } else {
> > + /*
> > + * Fully tracking multiple concurrent ranges has diminishing
> > + * returns. Keep things simple and just find the minimal range
> > + * which includes the current and new ranges. As there won't be
> > + * enough information to subtract a range after its invalidate
> > + * completes, any ranges invalidated concurrently will
> > + * accumulate and persist until all outstanding invalidates
> > + * complete.
> > + */
> > + kvm->mmu_invalidate_range_start =
> > + min(kvm->mmu_invalidate_range_start, start);
> > + kvm->mmu_invalidate_range_end =
> > + max(kvm->mmu_invalidate_range_end, end);
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > +{
> > + /*
> > + * This sequence increase will notify the kvm page fault that
> > + * the page that is going to be mapped in the spte could have
> > + * been freed.
> > + */
> > + kvm->mmu_invalidate_seq++;
> > + smp_wmb();
> > + /*
> > + * The above sequence increase must be visible before the
> > + * below count decrease, which is ensured by the smp_wmb above
> > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > + */
> > + kvm->mmu_invalidate_in_progress--;
> > +}
> > +
> > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > {
> > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > -{
> > - /*
> > - * The count increase must become visible at unlock time as no
> > - * spte can be established without taking the mmu_lock and
> > - * count is also read inside the mmu_lock critical section.
> > - */
> > - kvm->mmu_invalidate_in_progress++;
> > -
> > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > - }
> > -}
> > -
> > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > -
> > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > - kvm->mmu_invalidate_range_start = start;
> > - kvm->mmu_invalidate_range_end = end;
> > - } else {
> > - /*
> > - * Fully tracking multiple concurrent ranges has diminishing
> > - * returns. Keep things simple and just find the minimal range
> > - * which includes the current and new ranges. As there won't be
> > - * enough information to subtract a range after its invalidate
> > - * completes, any ranges invalidated concurrently will
> > - * accumulate and persist until all outstanding invalidates
> > - * complete.
> > - */
> > - kvm->mmu_invalidate_range_start =
> > - min(kvm->mmu_invalidate_range_start, start);
> > - kvm->mmu_invalidate_range_end =
> > - max(kvm->mmu_invalidate_range_end, end);
> > - }
> > -}
> > -
> > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > return 0;
> > }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > -{
> > - /*
> > - * This sequence increase will notify the kvm page fault that
> > - * the page that is going to be mapped in the spte could have
> > - * been freed.
> > - */
> > - kvm->mmu_invalidate_seq++;
> > - smp_wmb();
> > - /*
> > - * The above sequence increase must be visible before the
> > - * below count decrease, which is ensured by the smp_wmb above
> > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > - */
> > - kvm->mmu_invalidate_in_progress--;
> > -}
> > -
> > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > const struct mmu_notifier_range *range)
> > {
> > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > return 0;
> > }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > + return false;
> > +}
> > +
> > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > {
> > struct kvm *kvm = kvm_arch_alloc_vm();
> > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > return 0;
> > }
> >
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + struct kvm_gfn_range gfn_range;
> > + struct kvm_memory_slot *slot;
> > + struct kvm_memslots *slots;
> > + struct kvm_memslot_iter iter;
> > + int i;
> > + int r = 0;
> > +
> > + gfn_range.pte = __pte(0);
> > + gfn_range.may_block = true;
> > +
> > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > + slots = __kvm_memslots(kvm, i);
> > +
> > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > + slot = iter.slot;
> > + gfn_range.start = max(start, slot->base_gfn);
> > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > + if (gfn_range.start >= gfn_range.end)
> > + continue;
> > + gfn_range.slot = slot;
> > +
> > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > + }
> > + }
> > +
> > + if (r)
> > + kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > struct kvm_memory_attributes *attrs)
> > {
> > gfn_t start, end;
> > unsigned long i;
> > void *entry;
> > + int idx;
> > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> >
> > - /* flags is currently not used. */
> > + /* 'flags' is currently not used. */
> > if (attrs->flags)
> > return -EINVAL;
> > if (attrs->attributes & ~supported_attrs)
> > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >
> > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> >
> > + if (kvm_arch_has_private_mem(kvm)) {
> > + KVM_MMU_LOCK(kvm);
> > + kvm_mmu_invalidate_begin(kvm);
> > + kvm_mmu_invalidate_range_add(kvm, start, end);
>
> Nit: this works for KVM_MEMORY_ATTRIBUTE_PRIVATE, but
> the invalidation should be necessary yet for attribute change of:
>
> KVM_MEMORY_ATTRIBUTE_READ
> KVM_MEMORY_ATTRIBUTE_WRITE
> KVM_MEMORY_ATTRIBUTE_EXECUTE

The unmapping is only needed for confidential usages, which use
KVM_MEMORY_ATTRIBUTE_PRIVATE exclusively; the other flags are defined
here for other usages like pKVM. As Fuad commented in a different reply,
pKVM supports in-place remapping, so unmapping is unnecessary there.
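
If the non-PRIVATE attributes ever do need to skip the unmap path, one
possible refinement (purely illustrative, not part of this series) would
be to key the invalidation on whether the PRIVATE bit actually flips:

        /* Illustrative only: does this update flip private<->shared anywhere? */
        static bool range_toggles_private(struct kvm *kvm, gfn_t start,
                                          gfn_t end, u64 new_attrs)
        {
                bool new_private = new_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
                gfn_t gfn;

                for (gfn = start; gfn < end; gfn++) {
                        if (kvm_mem_is_private(kvm, gfn) != new_private)
                                return true;
                }

                return false;
        }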

Thanks,
Chao
>
> > + KVM_MMU_UNLOCK(kvm);
> > + }
> > +
> > mutex_lock(&kvm->lock);
> > for (i = start; i < end; i++)
> > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > break;
> > mutex_unlock(&kvm->lock);
> >
> > + if (kvm_arch_has_private_mem(kvm)) {
> > + idx = srcu_read_lock(&kvm->srcu);
> > + KVM_MMU_LOCK(kvm);
> > + if (i > start)
> > + kvm_unmap_mem_range(kvm, start, i);
> > + kvm_mmu_invalidate_end(kvm);
>
> Ditto.
>
> > + KVM_MMU_UNLOCK(kvm);
> > + srcu_read_unlock(&kvm->srcu, idx);
> > + }
> > +
> > attrs->address = i << PAGE_SHIFT;
> > attrs->size = (end - i) << PAGE_SHIFT;
> >
> > --
> > 2.25.1
> >
> >

2022-12-08 13:07:41

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> On 12/2/2022 2:13 PM, Chao Peng wrote:
>
> ..
>
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> >
>
> From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
> needed.

The reason is that we want KVM to further control the feature enabling.
An opt-in CONFIG_RESTRICTEDMEM can cause problems if a user sets it for
unsupported architectures.

Here is the original discussion:
https://lore.kernel.org/all/[email protected]/

Thanks,
Chao

2022-12-09 05:54:16

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

On Thu, Dec 08, 2022 at 07:20:43PM +0800, Chao Peng wrote:
> On Wed, Dec 07, 2022 at 04:13:14PM +0800, Yuan Yao wrote:
> > On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> > > Unmap the existing guest mappings when memory attribute is changed
> > > between shared and private. This is needed because shared pages and
> > > private pages are from different backends, unmapping existing ones
> > > gives a chance for page fault handler to re-populate the mappings
> > > according to the new attribute.
> > >
> > > Only architecture has private memory support needs this and the
> > > supported architecture is expected to rewrite the weak
> > > kvm_arch_has_private_mem().
> > >
> > > Also, during memory attribute changing and the unmapping time frame,
> > > page fault handler may happen in the same memory range and can cause
> > > incorrect page state, invoke kvm_mmu_invalidate_* helpers to let the
> > > page fault handler retry during this time frame.
> > >
> > > Signed-off-by: Chao Peng <[email protected]>
> > > ---
> > > include/linux/kvm_host.h | 7 +-
> > > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > > 2 files changed, 116 insertions(+), 59 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 3d69484d2704..3331c0c92838 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > #endif
> > >
> > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > struct kvm_gfn_range {
> > > struct kvm_memory_slot *slot;
> > > gfn_t start;
> > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > > bool may_block;
> > > };
> > > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +
> > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > @@ -785,11 +786,12 @@ struct kvm {
> > >
> > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > struct mmu_notifier mmu_notifier;
> > > +#endif
> > > unsigned long mmu_invalidate_seq;
> > > long mmu_invalidate_in_progress;
> > > gfn_t mmu_invalidate_range_start;
> > > gfn_t mmu_invalidate_range_end;
> > > -#endif
> > > +
> > > struct list_head devices;
> > > u64 manual_dirty_log_protect;
> > > struct dentry *debugfs_dentry;
> > > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > > int kvm_arch_post_init_vm(struct kvm *kvm);
> > > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > >
> > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > > /*
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index ad55dfbc75d7..4e1e1e113bf0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > >
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > +{
> > > + /*
> > > + * The count increase must become visible at unlock time as no
> > > + * spte can be established without taking the mmu_lock and
> > > + * count is also read inside the mmu_lock critical section.
> > > + */
> > > + kvm->mmu_invalidate_in_progress++;
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = start;
> > > + kvm->mmu_invalidate_range_end = end;
> > > + } else {
> > > + /*
> > > + * Fully tracking multiple concurrent ranges has diminishing
> > > + * returns. Keep things simple and just find the minimal range
> > > + * which includes the current and new ranges. As there won't be
> > > + * enough information to subtract a range after its invalidate
> > > + * completes, any ranges invalidated concurrently will
> > > + * accumulate and persist until all outstanding invalidates
> > > + * complete.
> > > + */
> > > + kvm->mmu_invalidate_range_start =
> > > + min(kvm->mmu_invalidate_range_start, start);
> > > + kvm->mmu_invalidate_range_end =
> > > + max(kvm->mmu_invalidate_range_end, end);
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > +{
> > > + /*
> > > + * This sequence increase will notify the kvm page fault that
> > > + * the page that is going to be mapped in the spte could have
> > > + * been freed.
> > > + */
> > > + kvm->mmu_invalidate_seq++;
> > > + smp_wmb();
> > > + /*
> > > + * The above sequence increase must be visible before the
> > > + * below count decrease, which is ensured by the smp_wmb above
> > > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > + */
> > > + kvm->mmu_invalidate_in_progress--;
> > > +}
> > > +
> > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > > {
> > > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > -{
> > > - /*
> > > - * The count increase must become visible at unlock time as no
> > > - * spte can be established without taking the mmu_lock and
> > > - * count is also read inside the mmu_lock critical section.
> > > - */
> > > - kvm->mmu_invalidate_in_progress++;
> > > -
> > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > - }
> > > -}
> > > -
> > > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > -
> > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > - kvm->mmu_invalidate_range_start = start;
> > > - kvm->mmu_invalidate_range_end = end;
> > > - } else {
> > > - /*
> > > - * Fully tracking multiple concurrent ranges has diminishing
> > > - * returns. Keep things simple and just find the minimal range
> > > - * which includes the current and new ranges. As there won't be
> > > - * enough information to subtract a range after its invalidate
> > > - * completes, any ranges invalidated concurrently will
> > > - * accumulate and persist until all outstanding invalidates
> > > - * complete.
> > > - */
> > > - kvm->mmu_invalidate_range_start =
> > > - min(kvm->mmu_invalidate_range_start, start);
> > > - kvm->mmu_invalidate_range_end =
> > > - max(kvm->mmu_invalidate_range_end, end);
> > > - }
> > > -}
> > > -
> > > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > return 0;
> > > }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > -{
> > > - /*
> > > - * This sequence increase will notify the kvm page fault that
> > > - * the page that is going to be mapped in the spte could have
> > > - * been freed.
> > > - */
> > > - kvm->mmu_invalidate_seq++;
> > > - smp_wmb();
> > > - /*
> > > - * The above sequence increase must be visible before the
> > > - * below count decrease, which is ensured by the smp_wmb above
> > > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > - */
> > > - kvm->mmu_invalidate_in_progress--;
> > > -}
> > > -
> > > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > const struct mmu_notifier_range *range)
> > > {
> > > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > > return 0;
> > > }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > {
> > > struct kvm *kvm = kvm_arch_alloc_vm();
> > > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > > return 0;
> > > }
> > >
> > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + struct kvm_gfn_range gfn_range;
> > > + struct kvm_memory_slot *slot;
> > > + struct kvm_memslots *slots;
> > > + struct kvm_memslot_iter iter;
> > > + int i;
> > > + int r = 0;
> > > +
> > > + gfn_range.pte = __pte(0);
> > > + gfn_range.may_block = true;
> > > +
> > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > + slots = __kvm_memslots(kvm, i);
> > > +
> > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > + slot = iter.slot;
> > > + gfn_range.start = max(start, slot->base_gfn);
> > > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > + if (gfn_range.start >= gfn_range.end)
> > > + continue;
> > > + gfn_range.slot = slot;
> > > +
> > > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > + }
> > > + }
> > > +
> > > + if (r)
> > > + kvm_flush_remote_tlbs(kvm);
> > > +}
> > > +
> > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > struct kvm_memory_attributes *attrs)
> > > {
> > > gfn_t start, end;
> > > unsigned long i;
> > > void *entry;
> > > + int idx;
> > > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > >
> > > - /* flags is currently not used. */
> > > + /* 'flags' is currently not used. */
> > > if (attrs->flags)
> > > return -EINVAL;
> > > if (attrs->attributes & ~supported_attrs)
> > > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >
> > > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > >
> > > + if (kvm_arch_has_private_mem(kvm)) {
> > > + KVM_MMU_LOCK(kvm);
> > > + kvm_mmu_invalidate_begin(kvm);
> > > + kvm_mmu_invalidate_range_add(kvm, start, end);
> >
> > Nit: this works for KVM_MEMORY_ATTRIBUTE_PRIVATE, but
> > the invalidation should be necessary yet for attribute change of:
> >
> > KVM_MEMORY_ATTRIBUTE_READ
> > KVM_MEMORY_ATTRIBUTE_WRITE
> > KVM_MEMORY_ATTRIBUTE_EXECUTE
>
> The unmapping is only needed for confidential usages, which use
> KVM_MEMORY_ATTRIBUTE_PRIVATE exclusively; the other flags are defined
> here for other usages like pKVM. As Fuad commented in a different
> reply, pKVM supports in-place remapping, so unmapping is unnecessary
> there.

Ah, I see. That's fine with me, thanks.

>
> Thanks,
> Chao
> >
> > > + KVM_MMU_UNLOCK(kvm);
> > > + }
> > > +
> > > mutex_lock(&kvm->lock);
> > > for (i = start; i < end; i++)
> > > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > break;
> > > mutex_unlock(&kvm->lock);
> > >
> > > + if (kvm_arch_has_private_mem(kvm)) {
> > > + idx = srcu_read_lock(&kvm->srcu);
> > > + KVM_MMU_LOCK(kvm);
> > > + if (i > start)
> > > + kvm_unmap_mem_range(kvm, start, i);
> > > + kvm_mmu_invalidate_end(kvm);
> >
> > Ditto.
> >
> > > + KVM_MMU_UNLOCK(kvm);
> > > + srcu_read_unlock(&kvm->srcu, idx);
> > > + }
> > > +
> > > attrs->address = i << PAGE_SHIFT;
> > > attrs->size = (end - i) << PAGE_SHIFT;
> > >
> > > --
> > > 2.25.1
> > >
> > >

2022-12-09 06:06:51

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v10 8/9] KVM: Handle page fault for private memory

On Thu, Dec 08, 2022 at 07:23:46PM +0800, Chao Peng wrote:
> On Thu, Dec 08, 2022 at 10:29:18AM +0800, Yuan Yao wrote:
> > On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> > > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > > hva-based shared memory. Architecture code (like TDX code) can tell
> > > whether the on-going fault is private or not. This patch adds a
> > > 'is_private' field to kvm_page_fault to indicate this and architecture
> > > code is expected to set it.
> > >
> > > To handle page fault for such memslot, the handling logic is different
> > > depending on whether the fault is private or shared. KVM checks if
> > > 'is_private' matches the host's view of the page (maintained in
> > > mem_attr_array).
> > > - For a successful match, private pfn is obtained with
> > > restrictedmem_get_page() and shared pfn is obtained with existing
> > > get_user_pages().
> > > - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > > userspace. Userspace then can convert memory between private/shared
> > > in host's view and retry the fault.
> > >
> > > Co-developed-by: Yu Zhang <[email protected]>
> > > Signed-off-by: Yu Zhang <[email protected]>
> > > Signed-off-by: Chao Peng <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> > > arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> > > arch/x86/kvm/mmu/mmutrace.h | 1 +
> > > arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> > > include/linux/kvm_host.h | 30 ++++++++++++++++
> > > 5 files changed, 105 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 2190fd8c95c0..b1953ebc012e 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> > >
> > > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > const struct kvm_memory_slot *slot, gfn_t gfn,
> > > - int max_level)
> > > + int max_level, bool is_private)
> > > {
> > > struct kvm_lpage_info *linfo;
> > > int host_level;
> > > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > break;
> > > }
> > >
> > > + if (is_private)
> > > + return max_level;
> >
> > lpage mixed information already saved, so is that possible
> > to query info->disallow_lpage without care 'is_private' ?
>
> Actually we already queried info->disallow_lpage just before this
> sentence. The check is needed because later in the function we call
> host_pfn_mapping_level() which is shared memory specific.

You're right. We can't get the mapping level info for a private page
from host_pfn_mapping_level().

>
> Thanks,
> Chao
> >
> > > +
> > > if (max_level == PG_LEVEL_4K)
> > > return PG_LEVEL_4K;
> > >
> > > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > > * level, which will be used to do precise, accurate accounting.
> > > */
> > > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > > - fault->gfn, fault->max_level);
> > > + fault->gfn, fault->max_level,
> > > + fault->is_private);
> > > if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > > return;
> > >
> > > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > > }
> > >
> > > +static inline u8 order_to_level(int order)
> > > +{
> > > + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > > +
> > > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > > + return PG_LEVEL_1G;
> > > +
> > > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > > + return PG_LEVEL_2M;
> > > +
> > > + return PG_LEVEL_4K;
> > > +}
> > > +
> > > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > > + struct kvm_page_fault *fault)
> > > +{
> > > + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > > + if (fault->is_private)
> > > + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > > + else
> > > + vcpu->run->memory.flags = 0;
> > > + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > > + vcpu->run->memory.size = PAGE_SIZE;
> > > + return RET_PF_USER;
> > > +}
> > > +
> > > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > + struct kvm_page_fault *fault)
> > > +{
> > > + int order;
> > > + struct kvm_memory_slot *slot = fault->slot;
> > > +
> > > + if (!kvm_slot_can_be_private(slot))
> > > + return kvm_do_memory_fault_exit(vcpu, fault);
> > > +
> > > + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > > + return RET_PF_RETRY;
> > > +
> > > + fault->max_level = min(order_to_level(order), fault->max_level);
> > > + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > > + return RET_PF_CONTINUE;
> > > +}
> > > +
> > > static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > {
> > > struct kvm_memory_slot *slot = fault->slot;
> > > @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > return RET_PF_EMULATE;
> > > }
> > >
> > > + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > > + return kvm_do_memory_fault_exit(vcpu, fault);
> > > +
> > > + if (fault->is_private)
> > > + return kvm_faultin_pfn_private(vcpu, fault);
> > > +
> > > async = false;
> > > fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > > fault->write, &fault->map_writable,
> > > @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > > return -EIO;
> > > }
> > >
> > > + if (r == RET_PF_USER)
> > > + return 0;
> > > +
> > > if (r < 0)
> > > return r;
> > > if (r != RET_PF_EMULATE)
> > > @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > > */
> > > if (sp->role.direct &&
> > > sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > > - PG_LEVEL_NUM)) {
> > > + PG_LEVEL_NUM,
> > > + false)) {
> > > kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> > >
> > > if (kvm_available_flush_tlb_with_range())
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index dbaf6755c5a7..5ccf08183b00 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -189,6 +189,7 @@ struct kvm_page_fault {
> > >
> > > /* Derived from mmu and global state. */
> > > const bool is_tdp;
> > > + const bool is_private;
> > > const bool nx_huge_page_workaround_enabled;
> > >
> > > /*
> > > @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > > * RET_PF_RETRY: let CPU fault again on the address.
> > > * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> > > * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > > + * RET_PF_USER: need to exit to userspace to handle this fault.
> > > * RET_PF_FIXED: The faulting entry has been fixed.
> > > * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> > > *
> > > @@ -253,6 +255,7 @@ enum {
> > > RET_PF_RETRY,
> > > RET_PF_EMULATE,
> > > RET_PF_INVALID,
> > > + RET_PF_USER,
> > > RET_PF_FIXED,
> > > RET_PF_SPURIOUS,
> > > };
> > > @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >
> > > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > const struct kvm_memory_slot *slot, gfn_t gfn,
> > > - int max_level);
> > > + int max_level, bool is_private);
> > > void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > > void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> > >
> > > @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >
> > > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > > +{
> > > + WARN_ON_ONCE(1);
> > > + return -EOPNOTSUPP;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > > +
> > > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > > index ae86820cef69..2d7555381955 100644
> > > --- a/arch/x86/kvm/mmu/mmutrace.h
> > > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> > > TRACE_DEFINE_ENUM(RET_PF_RETRY);
> > > TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> > > TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > > +TRACE_DEFINE_ENUM(RET_PF_USER);
> > > TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > > TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 771210ce5181..8ba1a4afc546 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > > continue;
> > >
> > > max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > > - iter.gfn, PG_LEVEL_NUM);
> > > + iter.gfn, PG_LEVEL_NUM, false);
> > > if (max_mapping_level < iter.level)
> > > continue;
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 25099c94e770..153842bb33df 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > > }
> > > #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> > >
> > > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> > > + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > > +}
> > > +#else
> > > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > > +
> > > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > > +{
> > > + int ret;
> > > + struct page *page;
> > > + pgoff_t index = gfn - slot->base_gfn +
> > > + (slot->restricted_offset >> PAGE_SHIFT);
> > > +
> > > + ret = restrictedmem_get_page(slot->restricted_file, index,
> > > + &page, order);
> > > + *pfn = page_to_pfn(page);
> > > + return ret;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > > +
> > > #endif
> > > --
> > > 2.25.1
> > >
> > >

2022-12-09 06:45:42

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

On Tue, Dec 06, 2022 at 03:48:50PM +0000, Fuad Tabba wrote:
...
> >
> > > > */
> > > > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > > - hva >= kvm->mmu_invalidate_range_start &&
> > > > - hva < kvm->mmu_invalidate_range_end)
> > > > - return 1;
> > > > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > > + /*
> > > > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > > + * but before updating the range is a KVM bug.
> > > > + */
> > > > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > > + kvm->mmu_invalidate_range_end == INVALID_GPA))
> > >
> > > INVALID_GPA is an x86-specific define in
> > > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > > architectures. The obvious fix is to move it to
> > > include/linux/kvm_host.h.
> >
> > Hmm, INVALID_GPA is defined as ZERO for x86, not 100% confident this is
> > correct choice for other architectures, but after search it has not been
> > used for other architectures, so should be safe to make it common.

Yu has posted a patch for this:
https://lore.kernel.org/all/[email protected]/

There is a GPA_INVALID in include/linux/kvm_types.h and I see ARM has
already been using it, so it sounds like that is exactly what I need.
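
For illustration, the retry check would then read roughly as below
(sketch only, assuming GPA_INVALID ends up generic and the range fields
are compared against the faulting gfn as patch05 intends):

        if (unlikely(kvm->mmu_invalidate_in_progress)) {
                /*
                 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
                 * but before updating the range is a KVM bug.
                 */
                if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == GPA_INVALID ||
                                 kvm->mmu_invalidate_range_end == GPA_INVALID))
                        return 1;

                if (gfn >= kvm->mmu_invalidate_range_start &&
                    gfn < kvm->mmu_invalidate_range_end)
                        return 1;
        }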

Chao
>
> With this fixed,
>
> Reviewed-by: Fuad Tabba <[email protected]>
> And the necessary work to port to arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba <[email protected]>
>
> Cheers,
> /fuad

2022-12-09 09:12:30

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

Hi,

On Thu, Dec 8, 2022 at 11:18 AM Chao Peng <[email protected]> wrote:
>
> On Wed, Dec 07, 2022 at 05:16:34PM +0000, Fuad Tabba wrote:
> > Hi,
> >
> > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <[email protected]> wrote:
> > >
> > > Unmap the existing guest mappings when memory attribute is changed
> > > between shared and private. This is needed because shared pages and
> > > private pages are from different backends, unmapping existing ones
> > > gives a chance for page fault handler to re-populate the mappings
> > > according to the new attribute.
> > >
> > > Only architecture has private memory support needs this and the
> > > supported architecture is expected to rewrite the weak
> > > kvm_arch_has_private_mem().
> >
> > This kind of ties into the discussion of being able to share memory in
> > place. For pKVM for example, shared and private memory would have the
> > same backend, and the unmapping wouldn't be needed.
> >
> > So I guess that, instead of kvm_arch_has_private_mem(), can the check
> > be done differently, e.g., with a different function, say
> > kvm_arch_private_notify_attribute_change() (but maybe with a more
> > friendly name than what I suggested :) )?
>
> Besides controlling the unmapping here, kvm_arch_has_private_mem() is
> also used to gate the memslot KVM_MEM_PRIVATE flag in patch09. I know
> unmapping is confirmed unnecessary for pKVM, but how about
> KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
> If the answer is the latter, then yes we should use a different check
> which only works for confidential usages here.

I think it makes sense for pKVM to use the same flag (KVM_MEM_PRIVATE)
and not to add another one.

Thank you,
/fuad



>
> Thanks,
> Chao
> >
> > Thanks,
> > /fuad
> >
> > >
> > > Also, during the memory attribute change and the unmapping time frame,
> > > page faults may happen in the same memory range and can cause
> > > incorrect page state. Invoke the kvm_mmu_invalidate_* helpers so the
> > > page fault handler retries during this time frame.
> > >
> > > Signed-off-by: Chao Peng <[email protected]>
> > > ---
> > > include/linux/kvm_host.h | 7 +-
> > > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > > 2 files changed, 116 insertions(+), 59 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 3d69484d2704..3331c0c92838 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > #endif
> > >
> > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > struct kvm_gfn_range {
> > > struct kvm_memory_slot *slot;
> > > gfn_t start;
> > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > > bool may_block;
> > > };
> > > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +
> > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > @@ -785,11 +786,12 @@ struct kvm {
> > >
> > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > struct mmu_notifier mmu_notifier;
> > > +#endif
> > > unsigned long mmu_invalidate_seq;
> > > long mmu_invalidate_in_progress;
> > > gfn_t mmu_invalidate_range_start;
> > > gfn_t mmu_invalidate_range_end;
> > > -#endif
> > > +
> > > struct list_head devices;
> > > u64 manual_dirty_log_protect;
> > > struct dentry *debugfs_dentry;
> > > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > > int kvm_arch_post_init_vm(struct kvm *kvm);
> > > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > >
> > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > > /*
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index ad55dfbc75d7..4e1e1e113bf0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > >
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > +{
> > > + /*
> > > + * The count increase must become visible at unlock time as no
> > > + * spte can be established without taking the mmu_lock and
> > > + * count is also read inside the mmu_lock critical section.
> > > + */
> > > + kvm->mmu_invalidate_in_progress++;
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = start;
> > > + kvm->mmu_invalidate_range_end = end;
> > > + } else {
> > > + /*
> > > + * Fully tracking multiple concurrent ranges has diminishing
> > > + * returns. Keep things simple and just find the minimal range
> > > + * which includes the current and new ranges. As there won't be
> > > + * enough information to subtract a range after its invalidate
> > > + * completes, any ranges invalidated concurrently will
> > > + * accumulate and persist until all outstanding invalidates
> > > + * complete.
> > > + */
> > > + kvm->mmu_invalidate_range_start =
> > > + min(kvm->mmu_invalidate_range_start, start);
> > > + kvm->mmu_invalidate_range_end =
> > > + max(kvm->mmu_invalidate_range_end, end);
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > +{
> > > + /*
> > > + * This sequence increase will notify the kvm page fault that
> > > + * the page that is going to be mapped in the spte could have
> > > + * been freed.
> > > + */
> > > + kvm->mmu_invalidate_seq++;
> > > + smp_wmb();
> > > + /*
> > > + * The above sequence increase must be visible before the
> > > + * below count decrease, which is ensured by the smp_wmb above
> > > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > + */
> > > + kvm->mmu_invalidate_in_progress--;
> > > +}
> > > +
> > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > > {
> > > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > -{
> > > - /*
> > > - * The count increase must become visible at unlock time as no
> > > - * spte can be established without taking the mmu_lock and
> > > - * count is also read inside the mmu_lock critical section.
> > > - */
> > > - kvm->mmu_invalidate_in_progress++;
> > > -
> > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > - }
> > > -}
> > > -
> > > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > -
> > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > - kvm->mmu_invalidate_range_start = start;
> > > - kvm->mmu_invalidate_range_end = end;
> > > - } else {
> > > - /*
> > > - * Fully tracking multiple concurrent ranges has diminishing
> > > - * returns. Keep things simple and just find the minimal range
> > > - * which includes the current and new ranges. As there won't be
> > > - * enough information to subtract a range after its invalidate
> > > - * completes, any ranges invalidated concurrently will
> > > - * accumulate and persist until all outstanding invalidates
> > > - * complete.
> > > - */
> > > - kvm->mmu_invalidate_range_start =
> > > - min(kvm->mmu_invalidate_range_start, start);
> > > - kvm->mmu_invalidate_range_end =
> > > - max(kvm->mmu_invalidate_range_end, end);
> > > - }
> > > -}
> > > -
> > > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > return 0;
> > > }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > -{
> > > - /*
> > > - * This sequence increase will notify the kvm page fault that
> > > - * the page that is going to be mapped in the spte could have
> > > - * been freed.
> > > - */
> > > - kvm->mmu_invalidate_seq++;
> > > - smp_wmb();
> > > - /*
> > > - * The above sequence increase must be visible before the
> > > - * below count decrease, which is ensured by the smp_wmb above
> > > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > - */
> > > - kvm->mmu_invalidate_in_progress--;
> > > -}
> > > -
> > > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > const struct mmu_notifier_range *range)
> > > {
> > > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > > return 0;
> > > }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > {
> > > struct kvm *kvm = kvm_arch_alloc_vm();
> > > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > > return 0;
> > > }
> > >
> > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + struct kvm_gfn_range gfn_range;
> > > + struct kvm_memory_slot *slot;
> > > + struct kvm_memslots *slots;
> > > + struct kvm_memslot_iter iter;
> > > + int i;
> > > + int r = 0;
> > > +
> > > + gfn_range.pte = __pte(0);
> > > + gfn_range.may_block = true;
> > > +
> > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > + slots = __kvm_memslots(kvm, i);
> > > +
> > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > + slot = iter.slot;
> > > + gfn_range.start = max(start, slot->base_gfn);
> > > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > + if (gfn_range.start >= gfn_range.end)
> > > + continue;
> > > + gfn_range.slot = slot;
> > > +
> > > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > + }
> > > + }
> > > +
> > > + if (r)
> > > + kvm_flush_remote_tlbs(kvm);
> > > +}
> > > +
> > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > struct kvm_memory_attributes *attrs)
> > > {
> > > gfn_t start, end;
> > > unsigned long i;
> > > void *entry;
> > > + int idx;
> > > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > >
> > > - /* flags is currently not used. */
> > > + /* 'flags' is currently not used. */
> > > if (attrs->flags)
> > > return -EINVAL;
> > > if (attrs->attributes & ~supported_attrs)
> > > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >
> > > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > >
> > > + if (kvm_arch_has_private_mem(kvm)) {
> > > + KVM_MMU_LOCK(kvm);
> > > + kvm_mmu_invalidate_begin(kvm);
> > > + kvm_mmu_invalidate_range_add(kvm, start, end);
> > > + KVM_MMU_UNLOCK(kvm);
> > > + }
> > > +
> > > mutex_lock(&kvm->lock);
> > > for (i = start; i < end; i++)
> > > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > break;
> > > mutex_unlock(&kvm->lock);
> > >
> > > + if (kvm_arch_has_private_mem(kvm)) {
> > > + idx = srcu_read_lock(&kvm->srcu);
> > > + KVM_MMU_LOCK(kvm);
> > > + if (i > start)
> > > + kvm_unmap_mem_range(kvm, start, i);
> > > + kvm_mmu_invalidate_end(kvm);
> > > + KVM_MMU_UNLOCK(kvm);
> > > + srcu_read_unlock(&kvm->srcu, idx);
> > > + }
> > > +
> > > attrs->address = i << PAGE_SHIFT;
> > > attrs->size = (end - i) << PAGE_SHIFT;
> > >
> > > --
> > > 2.25.1
> > >

2022-12-09 09:35:38

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v10 8/9] KVM: Handle page fault for private memory

Hi,

On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <[email protected]> wrote:
>
> A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> hva-based shared memory. Architecture code (like TDX code) can tell
> whether the ongoing fault is private or not. This patch adds an
> 'is_private' field to kvm_page_fault to indicate this, and architecture
> code is expected to set it.
>
> To handle page fault for such memslot, the handling logic is different
> depending on whether the fault is private or shared. KVM checks if
> 'is_private' matches the host's view of the page (maintained in
> mem_attr_array).
> - For a successful match, private pfn is obtained with
> restrictedmem_get_page() and shared pfn is obtained with existing
> get_user_pages().
> - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> userspace. Userspace then can convert memory between private/shared
> in host's view and retry the fault.
>
> Co-developed-by: Yu Zhang <[email protected]>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> arch/x86/kvm/mmu/mmutrace.h | 1 +
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> include/linux/kvm_host.h | 30 ++++++++++++++++
> 5 files changed, 105 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2190fd8c95c0..b1953ebc012e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>
> int kvm_mmu_max_mapping_level(struct kvm *kvm,
> const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level)
> + int max_level, bool is_private)
> {
> struct kvm_lpage_info *linfo;
> int host_level;
> @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> break;
> }
>
> + if (is_private)
> + return max_level;
> +
> if (max_level == PG_LEVEL_4K)
> return PG_LEVEL_4K;
>
> @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> * level, which will be used to do precise, accurate accounting.
> */
> fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> - fault->gfn, fault->max_level);
> + fault->gfn, fault->max_level,
> + fault->is_private);
> if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> return;
>
> @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> }
>
> +static inline u8 order_to_level(int order)
> +{
> + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> + return PG_LEVEL_1G;
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> + return PG_LEVEL_2M;
> +
> + return PG_LEVEL_4K;
> +}
> +
> +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + if (fault->is_private)
> + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> + else
> + vcpu->run->memory.flags = 0;
> + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;

nit: As in previous patches, use helpers (for this and other similar
shifts in this patch)?
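
e.g., assuming KVM's existing gfn_to_gpa() helper is what is meant here, a
minimal sketch:

	/* Sketch: use the helper rather than open-coding the shift. */
	vcpu->run->memory.gpa = gfn_to_gpa(fault->gfn);

(and similarly for the other gfn-to-gpa shifts in the patch)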

> + vcpu->run->memory.size = PAGE_SIZE;
> + return RET_PF_USER;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + int order;
> + struct kvm_memory_slot *slot = fault->slot;
> +
> + if (!kvm_slot_can_be_private(slot))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> + return RET_PF_RETRY;
> +
> + fault->max_level = min(order_to_level(order), fault->max_level);
> + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> + return RET_PF_CONTINUE;
> +}
> +
> static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_memory_slot *slot = fault->slot;
> @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return RET_PF_EMULATE;
> }
>
> + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + if (fault->is_private)
> + return kvm_faultin_pfn_private(vcpu, fault);
> +
> async = false;
> fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> fault->write, &fault->map_writable,
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> return -EIO;
> }
>
> + if (r == RET_PF_USER)
> + return 0;
> +
> if (r < 0)
> return r;
> if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> */
> if (sp->role.direct &&
> sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> - PG_LEVEL_NUM)) {
> + PG_LEVEL_NUM,
> + false)) {
> kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>
> if (kvm_available_flush_tlb_with_range())
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index dbaf6755c5a7..5ccf08183b00 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -189,6 +189,7 @@ struct kvm_page_fault {
>
> /* Derived from mmu and global state. */
> const bool is_tdp;
> + const bool is_private;
> const bool nx_huge_page_workaround_enabled;
>
> /*
> @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * RET_PF_RETRY: let CPU fault again on the address.
> * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> + * RET_PF_USER: need to exit to userspace to handle this fault.
> * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> *
> @@ -253,6 +255,7 @@ enum {
> RET_PF_RETRY,
> RET_PF_EMULATE,
> RET_PF_INVALID,
> + RET_PF_USER,
> RET_PF_FIXED,
> RET_PF_SPURIOUS,
> };
> @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
> int kvm_mmu_max_mapping_level(struct kvm *kvm,
> const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level);
> + int max_level, bool is_private);
> void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
>
> @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + WARN_ON_ONCE(1);
> + return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..2d7555381955 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> TRACE_DEFINE_ENUM(RET_PF_RETRY);
> TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> TRACE_DEFINE_ENUM(RET_PF_INVALID);
> +TRACE_DEFINE_ENUM(RET_PF_USER);
> TRACE_DEFINE_ENUM(RET_PF_FIXED);
> TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 771210ce5181..8ba1a4afc546 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> continue;
>
> max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> - iter.gfn, PG_LEVEL_NUM);
> + iter.gfn, PG_LEVEL_NUM, false);
> if (max_mapping_level < iter.level)
> continue;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> }
> #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + int ret;
> + struct page *page;
> + pgoff_t index = gfn - slot->base_gfn +
> + (slot->restricted_offset >> PAGE_SHIFT);
> +
> + ret = restrictedmem_get_page(slot->restricted_file, index,
> + &page, order);
> + *pfn = page_to_pfn(page);
> + return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif
> --
> 2.25.1
>

With my limited understanding of x86 code:
Reviewed-by: Fuad Tabba <[email protected]>

The common code in kvm_host.h was used in the port to arm64, and the
x86 fault handling code was used as a guide to how it should be done
in pKVM (with similar code added there). So with these caveats in
mind:
Tested-by: Fuad Tabba <[email protected]>

Cheers,
/fuad

2022-12-12 07:38:07

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

On Fri, Dec 09, 2022 at 08:57:31AM +0000, Fuad Tabba wrote:
> Hi,
>
> On Thu, Dec 8, 2022 at 11:18 AM Chao Peng <[email protected]> wrote:
> >
> > On Wed, Dec 07, 2022 at 05:16:34PM +0000, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <[email protected]> wrote:
> > > >
> > > > Unmap the existing guest mappings when the memory attribute is changed
> > > > between shared and private. This is needed because shared pages and
> > > > private pages come from different backends; unmapping the existing ones
> > > > gives the page fault handler a chance to re-populate the mappings
> > > > according to the new attribute.
> > > >
> > > > Only architectures with private memory support need this, and such an
> > > > architecture is expected to override the weak
> > > > kvm_arch_has_private_mem().
> > >
> > > This kind of ties into the discussion of being able to share memory in
> > > place. For pKVM for example, shared and private memory would have the
> > > same backend, and the unmapping wouldn't be needed.
> > >
> > > So I guess that, instead of kvm_arch_has_private_mem(), can the check
> > > be done differently, e.g., with a different function, say
> > > kvm_arch_private_notify_attribute_change() (but maybe with a more
> > > friendly name than what I suggested :) )?
> >
> > Besides controlling the unmapping here, kvm_arch_has_private_mem() is
> > also used to gate the memslot KVM_MEM_PRIVATE flag in patch09. I know
> > unmapping is confirmed unnecessary for pKVM, but how about
> > KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
> > If the answer is the latter, then yes we should use a different check
> > which only works for confidential usages here.
>
> I think it makes sense for pKVM to use the same flag (KVM_MEM_PRIVATE)
> and not to add another one.

Thanks for the reply.
Chao
>
> Thank you,
> /fuad
>
>
>
> >
> > Thanks,
> > Chao
> > >
> > > Thanks,
> > > /fuad
> > >
> > > >
> > > > Also, during the memory attribute change and the unmapping time frame,
> > > > page faults may happen in the same memory range and can cause
> > > > incorrect page state. Invoke the kvm_mmu_invalidate_* helpers so the
> > > > page fault handler retries during this time frame.
> > > >
> > > > Signed-off-by: Chao Peng <[email protected]>
> > > > ---
> > > > include/linux/kvm_host.h | 7 +-
> > > > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > > > 2 files changed, 116 insertions(+), 59 deletions(-)
> > > >
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 3d69484d2704..3331c0c92838 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > > #endif
> > > >
> > > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > > struct kvm_gfn_range {
> > > > struct kvm_memory_slot *slot;
> > > > gfn_t start;
> > > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > > > bool may_block;
> > > > };
> > > > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > +
> > > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > @@ -785,11 +786,12 @@ struct kvm {
> > > >
> > > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > > struct mmu_notifier mmu_notifier;
> > > > +#endif
> > > > unsigned long mmu_invalidate_seq;
> > > > long mmu_invalidate_in_progress;
> > > > gfn_t mmu_invalidate_range_start;
> > > > gfn_t mmu_invalidate_range_end;
> > > > -#endif
> > > > +
> > > > struct list_head devices;
> > > > u64 manual_dirty_log_protect;
> > > > struct dentry *debugfs_dentry;
> > > > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > > > int kvm_arch_post_init_vm(struct kvm *kvm);
> > > > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > > > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > > >
> > > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > > > /*
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index ad55dfbc75d7..4e1e1e113bf0 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > > > }
> > > > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > > >
> > > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > > +{
> > > > + /*
> > > > + * The count increase must become visible at unlock time as no
> > > > + * spte can be established without taking the mmu_lock and
> > > > + * count is also read inside the mmu_lock critical section.
> > > > + */
> > > > + kvm->mmu_invalidate_in_progress++;
> > > > +
> > > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > > + }
> > > > +}
> > > > +
> > > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > > +{
> > > > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > > +
> > > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > + kvm->mmu_invalidate_range_start = start;
> > > > + kvm->mmu_invalidate_range_end = end;
> > > > + } else {
> > > > + /*
> > > > + * Fully tracking multiple concurrent ranges has diminishing
> > > > + * returns. Keep things simple and just find the minimal range
> > > > + * which includes the current and new ranges. As there won't be
> > > > + * enough information to subtract a range after its invalidate
> > > > + * completes, any ranges invalidated concurrently will
> > > > + * accumulate and persist until all outstanding invalidates
> > > > + * complete.
> > > > + */
> > > > + kvm->mmu_invalidate_range_start =
> > > > + min(kvm->mmu_invalidate_range_start, start);
> > > > + kvm->mmu_invalidate_range_end =
> > > > + max(kvm->mmu_invalidate_range_end, end);
> > > > + }
> > > > +}
> > > > +
> > > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > > +{
> > > > + /*
> > > > + * This sequence increase will notify the kvm page fault that
> > > > + * the page that is going to be mapped in the spte could have
> > > > + * been freed.
> > > > + */
> > > > + kvm->mmu_invalidate_seq++;
> > > > + smp_wmb();
> > > > + /*
> > > > + * The above sequence increase must be visible before the
> > > > + * below count decrease, which is ensured by the smp_wmb above
> > > > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > > + */
> > > > + kvm->mmu_invalidate_in_progress--;
> > > > +}
> > > > +
> > > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > > > {
> > > > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > > }
> > > >
> > > > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > > -{
> > > > - /*
> > > > - * The count increase must become visible at unlock time as no
> > > > - * spte can be established without taking the mmu_lock and
> > > > - * count is also read inside the mmu_lock critical section.
> > > > - */
> > > > - kvm->mmu_invalidate_in_progress++;
> > > > -
> > > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > > - }
> > > > -}
> > > > -
> > > > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > > -{
> > > > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > > -
> > > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > - kvm->mmu_invalidate_range_start = start;
> > > > - kvm->mmu_invalidate_range_end = end;
> > > > - } else {
> > > > - /*
> > > > - * Fully tracking multiple concurrent ranges has diminishing
> > > > - * returns. Keep things simple and just find the minimal range
> > > > - * which includes the current and new ranges. As there won't be
> > > > - * enough information to subtract a range after its invalidate
> > > > - * completes, any ranges invalidated concurrently will
> > > > - * accumulate and persist until all outstanding invalidates
> > > > - * complete.
> > > > - */
> > > > - kvm->mmu_invalidate_range_start =
> > > > - min(kvm->mmu_invalidate_range_start, start);
> > > > - kvm->mmu_invalidate_range_end =
> > > > - max(kvm->mmu_invalidate_range_end, end);
> > > > - }
> > > > -}
> > > > -
> > > > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > {
> > > > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > > return 0;
> > > > }
> > > >
> > > > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > > -{
> > > > - /*
> > > > - * This sequence increase will notify the kvm page fault that
> > > > - * the page that is going to be mapped in the spte could have
> > > > - * been freed.
> > > > - */
> > > > - kvm->mmu_invalidate_seq++;
> > > > - smp_wmb();
> > > > - /*
> > > > - * The above sequence increase must be visible before the
> > > > - * below count decrease, which is ensured by the smp_wmb above
> > > > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > > - */
> > > > - kvm->mmu_invalidate_in_progress--;
> > > > -}
> > > > -
> > > > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > > const struct mmu_notifier_range *range)
> > > > {
> > > > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > > > return 0;
> > > > }
> > > >
> > > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > > +{
> > > > + return false;
> > > > +}
> > > > +
> > > > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > > {
> > > > struct kvm *kvm = kvm_arch_alloc_vm();
> > > > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > > > return 0;
> > > > }
> > > >
> > > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > > +{
> > > > + struct kvm_gfn_range gfn_range;
> > > > + struct kvm_memory_slot *slot;
> > > > + struct kvm_memslots *slots;
> > > > + struct kvm_memslot_iter iter;
> > > > + int i;
> > > > + int r = 0;
> > > > +
> > > > + gfn_range.pte = __pte(0);
> > > > + gfn_range.may_block = true;
> > > > +
> > > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > > + slots = __kvm_memslots(kvm, i);
> > > > +
> > > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > > + slot = iter.slot;
> > > > + gfn_range.start = max(start, slot->base_gfn);
> > > > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > > + if (gfn_range.start >= gfn_range.end)
> > > > + continue;
> > > > + gfn_range.slot = slot;
> > > > +
> > > > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > > + }
> > > > + }
> > > > +
> > > > + if (r)
> > > > + kvm_flush_remote_tlbs(kvm);
> > > > +}
> > > > +
> > > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > struct kvm_memory_attributes *attrs)
> > > > {
> > > > gfn_t start, end;
> > > > unsigned long i;
> > > > void *entry;
> > > > + int idx;
> > > > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > > >
> > > > - /* flags is currently not used. */
> > > > + /* 'flags' is currently not used. */
> > > > if (attrs->flags)
> > > > return -EINVAL;
> > > > if (attrs->attributes & ~supported_attrs)
> > > > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > >
> > > > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > > >
> > > > + if (kvm_arch_has_private_mem(kvm)) {
> > > > + KVM_MMU_LOCK(kvm);
> > > > + kvm_mmu_invalidate_begin(kvm);
> > > > + kvm_mmu_invalidate_range_add(kvm, start, end);
> > > > + KVM_MMU_UNLOCK(kvm);
> > > > + }
> > > > +
> > > > mutex_lock(&kvm->lock);
> > > > for (i = start; i < end; i++)
> > > > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > break;
> > > > mutex_unlock(&kvm->lock);
> > > >
> > > > + if (kvm_arch_has_private_mem(kvm)) {
> > > > + idx = srcu_read_lock(&kvm->srcu);
> > > > + KVM_MMU_LOCK(kvm);
> > > > + if (i > start)
> > > > + kvm_unmap_mem_range(kvm, start, i);
> > > > + kvm_mmu_invalidate_end(kvm);
> > > > + KVM_MMU_UNLOCK(kvm);
> > > > + srcu_read_unlock(&kvm->srcu, idx);
> > > > + }
> > > > +
> > > > attrs->address = i << PAGE_SHIFT;
> > > > attrs->size = (end - i) << PAGE_SHIFT;
> > > >
> > > > --
> > > > 2.25.1
> > > >

2022-12-12 07:51:59

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 8/9] KVM: Handle page fault for private memory

On Fri, Dec 09, 2022 at 09:01:04AM +0000, Fuad Tabba wrote:
> Hi,
>
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <[email protected]> wrote:
> >
> > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > hva-based shared memory. Architecture code (like TDX code) can tell
> > whether the ongoing fault is private or not. This patch adds an
> > 'is_private' field to kvm_page_fault to indicate this, and architecture
> > code is expected to set it.
> >
> > To handle page fault for such memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> > - For a successful match, private pfn is obtained with
> > restrictedmem_get_page() and shared pfn is obtained with existing
> > get_user_pages().
> > - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > userspace. Userspace then can convert memory between private/shared
> > in host's view and retry the fault.
> >
> > Co-developed-by: Yu Zhang <[email protected]>
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> > arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> > arch/x86/kvm/mmu/mmutrace.h | 1 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> > include/linux/kvm_host.h | 30 ++++++++++++++++
> > 5 files changed, 105 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2190fd8c95c0..b1953ebc012e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> >
> > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level)
> > + int max_level, bool is_private)
> > {
> > struct kvm_lpage_info *linfo;
> > int host_level;
> > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > break;
> > }
> >
> > + if (is_private)
> > + return max_level;
> > +
> > if (max_level == PG_LEVEL_4K)
> > return PG_LEVEL_4K;
> >
> > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > * level, which will be used to do precise, accurate accounting.
> > */
> > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > - fault->gfn, fault->max_level);
> > + fault->gfn, fault->max_level,
> > + fault->is_private);
> > if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > return;
> >
> > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > }
> >
> > +static inline u8 order_to_level(int order)
> > +{
> > + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > + return PG_LEVEL_1G;
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > + return PG_LEVEL_2M;
> > +
> > + return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > +{
> > + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > + if (fault->is_private)
> > + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > + else
> > + vcpu->run->memory.flags = 0;
> > + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
>
> nit: As in previous patches, use helpers (for this and other similar
> shifts in this patch)?

Agreed.

>
> > + vcpu->run->memory.size = PAGE_SIZE;
> > + return RET_PF_USER;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > +{
> > + int order;
> > + struct kvm_memory_slot *slot = fault->slot;
> > +
> > + if (!kvm_slot_can_be_private(slot))
> > + return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > + return RET_PF_RETRY;
> > +
> > + fault->max_level = min(order_to_level(order), fault->max_level);
> > + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > + return RET_PF_CONTINUE;
> > +}
> > +
> > static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> > struct kvm_memory_slot *slot = fault->slot;
> > @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return RET_PF_EMULATE;
> > }
> >
> > + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > + return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > + if (fault->is_private)
> > + return kvm_faultin_pfn_private(vcpu, fault);
> > +
> > async = false;
> > fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > fault->write, &fault->map_writable,
> > @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > return -EIO;
> > }
> >
> > + if (r == RET_PF_USER)
> > + return 0;
> > +
> > if (r < 0)
> > return r;
> > if (r != RET_PF_EMULATE)
> > @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > */
> > if (sp->role.direct &&
> > sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > - PG_LEVEL_NUM)) {
> > + PG_LEVEL_NUM,
> > + false)) {
> > kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> >
> > if (kvm_available_flush_tlb_with_range())
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index dbaf6755c5a7..5ccf08183b00 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -189,6 +189,7 @@ struct kvm_page_fault {
> >
> > /* Derived from mmu and global state. */
> > const bool is_tdp;
> > + const bool is_private;
> > const bool nx_huge_page_workaround_enabled;
> >
> > /*
> > @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > * RET_PF_RETRY: let CPU fault again on the address.
> > * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> > * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > + * RET_PF_USER: need to exit to userspace to handle this fault.
> > * RET_PF_FIXED: The faulting entry has been fixed.
> > * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> > *
> > @@ -253,6 +255,7 @@ enum {
> > RET_PF_RETRY,
> > RET_PF_EMULATE,
> > RET_PF_INVALID,
> > + RET_PF_USER,
> > RET_PF_FIXED,
> > RET_PF_SPURIOUS,
> > };
> > @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level);
> > + int max_level, bool is_private);
> > void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> >
> > @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > + WARN_ON_ONCE(1);
> > + return -EOPNOTSUPP;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > index ae86820cef69..2d7555381955 100644
> > --- a/arch/x86/kvm/mmu/mmutrace.h
> > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> > TRACE_DEFINE_ENUM(RET_PF_RETRY);
> > TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> > TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > +TRACE_DEFINE_ENUM(RET_PF_USER);
> > TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 771210ce5181..8ba1a4afc546 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > continue;
> >
> > max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > - iter.gfn, PG_LEVEL_NUM);
> > + iter.gfn, PG_LEVEL_NUM, false);
> > if (max_mapping_level < iter.level)
> > continue;
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 25099c94e770..153842bb33df 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > }
> > #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> > + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +}
> > +#else
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > + return false;
> > +}
> > +
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > + int ret;
> > + struct page *page;
> > + pgoff_t index = gfn - slot->base_gfn +
> > + (slot->restricted_offset >> PAGE_SHIFT);
> > +
> > + ret = restrictedmem_get_page(slot->restricted_file, index,
> > + &page, order);
> > + *pfn = page_to_pfn(page);
> > + return ret;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> > #endif
> > --
> > 2.25.1
> >
>
> With my limited understanding of x86 code:
> Reviewed-by: Fuad Tabba <[email protected]>
>
> The common code in kvm_host.h was used in the port to arm64, and the
> x86 fault handling code was used as a guide to how it should be done
> in pKVM (with similar code added there). So with these caveats in
> mind:
> Tested-by: Fuad Tabba <[email protected]>
>
> Cheers,
> /fuad

2022-12-13 12:34:37

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On 12/8/2022 7:30 PM, Chao Peng wrote:
> On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
>> On 12/2/2022 2:13 PM, Chao Peng wrote:
>>
>> ..
>>
>>> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
>>> and right now it is selected on X86_64 only.
>>>
>>
>> From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
>> needed.
>
> The reason is that we want KVM to further control enabling of the feature. An
> opt-in CONFIG_RESTRICTEDMEM could cause problems if users set it for
> unsupported architectures.

HAVE_KVM_RESTRICTED_MEM is not used in this patch. It's better to
introduce it in the patch that actually uses it.

> Here is the original discussion:
> https://lore.kernel.org/all/[email protected]/
>
> Thanks,
> Chao

2022-12-14 00:16:34

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

On Fri, 2022-12-02 at 14:13 +0800, Chao Peng wrote:
>  
> - /* flags is currently not used. */
> + /* 'flags' is currently not used. */
>   if (attrs->flags)
>   return -EINVAL;

Unintended code change.

2022-12-19 08:48:53

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

On Tue, Dec 13, 2022 at 11:51:25PM +0000, Huang, Kai wrote:
> On Fri, 2022-12-02 at 14:13 +0800, Chao Peng wrote:
> >
> > - /* flags is currently not used. */
> > + /* 'flags' is currently not used. */
> >   if (attrs->flags)
> >   return -EINVAL;
>
> Unintended code change.

Yeah!

Chao

2022-12-19 08:59:25

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Tue, Dec 13, 2022 at 08:04:14PM +0800, Xiaoyao Li wrote:
> On 12/8/2022 7:30 PM, Chao Peng wrote:
> > On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> > > On 12/2/2022 2:13 PM, Chao Peng wrote:
> > >
> > > ..
> > >
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > >
> > >
> > > From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
> > > needed.
> >
> > The reason is that we want KVM to further control enabling of the feature. An
> > opt-in CONFIG_RESTRICTEDMEM could cause problems if users set it for
> > unsupported architectures.
>
> HAVE_KVM_RESTRICTED_MEM is not used in this patch. It's better to introduce
> it in the patch that actually uses it.

It is 'used' in this patch by reverse-selecting RESTRICTEDMEM in
arch/x86/kvm/Kconfig, which gives people a sense of where
restrictedmem_notifier comes from. Introducing the config together with
the other private/restricted memslot changes can also help architectures
that add support in the future better identify what they need to do. But
those points are trivial, and moving it to patch 08 also sounds good to me.

Thanks,
Chao
>
> > Here is the original discussion:
> > https://lore.kernel.org/all/[email protected]/
> >
> > Thanks,
> > Chao

2022-12-19 15:05:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow

valueless?

I can't parse that.

> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.

bookmarked?

> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
>
> The extended memslot can still have the userspace_addr(hva). When use, a

"When un use, ..."

...

> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index a8e379a3afee..690cb21010e7 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,8 @@ config KVM
> select INTERVAL_TREE
> select HAVE_KVM_PM_NOTIFIER if PM
> select HAVE_KVM_MEMORY_ATTRIBUTES
> + select HAVE_KVM_RESTRICTED_MEM if X86_64
> + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM

Those deps here look weird.

RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
it.

Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
X86_64 - you need that functionality when the respective guest support
is enabled in KVM.

Then, looking forward into your patchset, I'm not sure you even
need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
less Kconfig items because we have waay too many.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-20 08:25:42

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Mon, Dec 19, 2022 at 03:36:28PM +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
>
> valueless?
>
> I can't parse that.

It's unnecessary and ...

>
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
>
> bookmarked?

userspace is restricted from accessing the memory content in the fd.

>
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> >
> > The extended memslot can still have the userspace_addr(hva). When use, a
>
> "When un use, ..."

When both userspace_addr and restricted_fd/offset were used, ...

>
> ...
>
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index a8e379a3afee..690cb21010e7 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -50,6 +50,8 @@ config KVM
> > select INTERVAL_TREE
> > select HAVE_KVM_PM_NOTIFIER if PM
> > select HAVE_KVM_MEMORY_ATTRIBUTES
> > + select HAVE_KVM_RESTRICTED_MEM if X86_64
> > + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
>
> Those deps here look weird.
>
> RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
> it.

RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.

>
> Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
> X86_64 - you need that functionality when the respective guest support
> is enabled in KVM.

Letting the actual feature (e.g. TDX or pKVM) select it, or adding a
dependency, sounds like a viable and clearer solution. Sean, let me know
your opinion.

>
> Then, looking forward into your patchset, I'm not sure you even
> need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
> CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
> less Kconfig items because we have waay too many.

The only reason to add another HAVE_KVM_RESTRICTED_MEM is that some code
only works for 64-bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to
enforce that.

[*] https://lore.kernel.org/all/[email protected]/

Thanks,
Chao
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2022-12-20 10:17:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.

Which basically means that RESTRICTEDMEM should simply depend on KVM.
Because you can't know upfront whether KVM will run a TDX guest or a SNP
guest and so on.

Which then means that RESTRICTEDMEM will practically end up always
enabled in KVM HV configs.

> The only reason to add another HAVE_KVM_RESTRICTED_MEM is that some code
> only works for 64-bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to
> enforce that.

This is what I mean with "we have too many Kconfig items". :-\

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-21 13:54:53

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Tue, Dec 20, 2022 at 10:55:44AM +0100, Borislav Petkov wrote:
> On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> > RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.
>
> Which basically means that RESTRICTEDMEM should simply depend on KVM.
> Because you can't know upfront whether KVM will run a TDX guest or a SNP
> guest and so on.
>
> Which then means that RESTRICTEDMEM will practically end up always
> enabled in KVM HV configs.

That's right, CONFIG_RESTRICTEDMEM is always selected for supported KVM
architectures (currently x86_64).

>
> > The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
> > works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
> > that.
>
> This is what I mean with "we have too many Kconfig items". :-\

Yes, I agree. One way to remove it is probably to additionally check
CONFIG_64BIT instead.
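
i.e., roughly (a hypothetical sketch; whether this is nicer than one more
Kconfig symbol is exactly the trade-off being discussed):

	/* Hypothetical sketch: guard the restrictedmem helpers on CONFIG_64BIT
	 * directly rather than introducing HAVE_KVM_RESTRICTED_MEM.
	 */
	#if defined(CONFIG_RESTRICTEDMEM) && defined(CONFIG_64BIT)
	static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
						     gfn_t gfn, kvm_pfn_t *pfn,
						     int *order)
	{
		/* ... body as in the current patch ... */
	}
	#endif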

Thanks,
Chao
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2023-01-05 11:58:41

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
>
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
>
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through restricted_fd
> and shared memory through userspace_addr. Whether the private or shared
> part is visible to guest is maintained by other KVM code.
>
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM the memory change,
> KVM then can invalidate its page table entries or handle memory errors.
>
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
>
> To make future maintenance easy, internally use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.

Feels a bit hacky IMHO, and more like a completely new feature than
an extension.

Why not just add a new ioctl? The commit message does not address
the most essential design here.

BR, Jarkko

2023-01-06 09:59:16

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> >
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> >
> > The extended memslot can still have the userspace_addr(hva). When use, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to guest is maintained by other KVM code.
> >
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM the memory change,
> > KVM then can invalidate its page table entries or handle memory errors.
> >
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> >
> > To make future maintenance easy, internally use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
>
> Feels bit hacky IMHO, and more like a completely new feature than
> an extension.
>
> Why not just add a new ioctl? The commit message does not address
> the most essential design here.

Yes, people can always choose to add a new ioctl for this kind of change
and the balance point here is we want to also avoid 'too many ioctls' if
the functionalities are similar. The '_ext' variant reuses all the
existing fields in the 'normal' variant and most importantly KVM
internally can reuse most of the code. I certainly can add some words in
the commit message to explain this design choice.

Thanks,
Chao
>
> BR, Jarkko

2023-01-09 20:17:35

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Fri, Jan 06, 2023, Chao Peng wrote:
> On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > To make future maintenance easy, internally use a binary compatible
> > > alias struct kvm_user_mem_region to handle both the normal and the
> > > '_ext' variants.
> >
> > Feels bit hacky IMHO, and more like a completely new feature than
> > an extension.
> >
> > Why not just add a new ioctl? The commit message does not address
> > the most essential design here.
>
> Yes, people can always choose to add a new ioctl for this kind of change
> and the balance point here is we want to also avoid 'too many ioctls' if
> the functionalities are similar. The '_ext' variant reuses all the
> existing fields in the 'normal' variant and most importantly KVM
> internally can reuse most of the code. I certainly can add some words in
> the commit message to explain this design choice.

After seeing the userspace side of this, I agree with Jarkko; overloading
KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
itself.

It feels absolutely ridiculous, but I think the best option is to do:

#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
                                         struct kvm_userspace_memory_region2)

/* for KVM_SET_USER_MEMORY_REGION2 */
struct kvm_user_mem_region2 {
        __u32 slot;
        __u32 flags;
        __u64 guest_phys_addr;
        __u64 memory_size;
        __u64 userspace_addr;
        __u64 restricted_offset;
        __u32 restricted_fd;
        __u32 pad1;
        __u64 pad2[14];
}

And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.
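
For illustration only, the userspace side then boils down to something like the
below. The struct/ioctl names follow the proposal above and aren't final, and
vm_fd, shared_hva and restricted_fd are placeholders:

        struct kvm_userspace_memory_region2 region = {
                .slot                   = 0,
                .flags                  = KVM_MEM_PRIVATE,
                .guest_phys_addr        = 0x100000000ull,
                .memory_size            = 0x20000000ull,
                .userspace_addr         = (__u64)shared_hva,    /* shared pages */
                .restricted_fd          = restricted_fd,        /* private pages */
                .restricted_offset      = 0,
        };

        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region))
                err(1, "KVM_SET_USER_MEMORY_REGION2");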

Regarding the userspace side of things, please include Vishal's selftests in v11,
it's impossible to properly review the uAPI changes without seeing the userspace
side of things. I'm in the process of reviewing Vishal's v2[*], I'll try to
massage it into a set of patches that you can incorporate into your series.

[*] https://lore.kernel.org/all/[email protected]

2023-01-10 09:59:57

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> On Fri, Jan 06, 2023, Chao Peng wrote:
> > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > >
> > > Feels bit hacky IMHO, and more like a completely new feature than
> > > an extension.
> > >
> > > Why not just add a new ioctl? The commit message does not address
> > > the most essential design here.
> >
> > Yes, people can always choose to add a new ioctl for this kind of change
> > and the balance point here is we want to also avoid 'too many ioctls' if
> > the functionalities are similar. The '_ext' variant reuses all the
> > existing fields in the 'normal' variant and most importantly KVM
> > internally can reuse most of the code. I certainly can add some words in
> > the commit message to explain this design choice.
>
> After seeing the userspace side of this, I agree with Jarkko; overloading
> KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
> bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> itself.

How is the size validation being bogus? I don't quite follow. Then we
will use kvm_userspace_memory_region2 as the KVM internal alias, right?
I see similar examples use different functions to handle different
versions but it does look easier if we use alias for this function.

>
> It feels absolutely ridiculous, but I think the best option is to do:
>
> #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> struct kvm_userspace_memory_region2)

Just interesting, is 0x49 a safe number we can use?

>
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_user_mem_region2 {
> __u32 slot;
> __u32 flags;
> __u64 guest_phys_addr;
> __u64 memory_size;
> __u64 userspace_addr;
> __u64 restricted_offset;
> __u32 restricted_fd;
> __u32 pad1;
> __u64 pad2[14];
> }
>
> And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.

Okay, agree from KVM userspace API perspective this is more consistent
with similar existing examples. I see several of them.

I think we will also need a CAP_KVM_SET_USER_MEMORY_REGION2 for this new
ioctl.
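
E.g. userspace could then probe for it with KVM_CHECK_EXTENSION before using
the new ioctl (the capability name below is just my placeholder until it gets
defined):

        /* Placeholder cap name; KVM_CHECK_EXTENSION is the standard probe. */
        bool have_region2 = ioctl(vm_fd, KVM_CHECK_EXTENSION,
                                  KVM_CAP_SET_USER_MEMORY_REGION2) > 0;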

>
> Regarding the userspace side of things, please include Vishal's selftests in v11,
> it's impossible to properly review the uAPI changes without seeing the userspace
> side of things. I'm in the process of reviewing Vishal's v2[*], I'll try to
> massage it into a set of patches that you can incorporate into your series.

Previously I included Vishal's selftests in the github repo, but did not
include them in this patch series. It's OK for me to incorporate them
directly into this series and review them together, if Vishal is fine with that.

Chao
>
> [*] https://lore.kernel.org/all/[email protected]

2023-01-10 23:14:04

by Vishal Annapurve

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Tue, Jan 10, 2023 at 1:19 AM Chao Peng <[email protected]> wrote:
> >
> > Regarding the userspace side of things, please include Vishal's selftests in v11,
> > it's impossible to properly review the uAPI changes without seeing the userspace
> > side of things. I'm in the process of reviewing Vishal's v2[*], I'll try to
> > massage it into a set of patches that you can incorporate into your series.
>
> Previously I included Vishal's selftests in the github repo, but not
> include them in this patch series. It's OK for me to incorporate them
> directly into this series and review together if Vishal is fine.
>

Yeah, I am ok with incorporating selftest patches into this series and
reviewing them together.

Regards,
Vishal

> Chao
> >
> > [*] https://lore.kernel.org/all/[email protected]

2023-01-13 23:14:08

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Tue, Jan 10, 2023, Chao Peng wrote:
> On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > To make future maintenance easy, internally use a binary compatible
> > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > '_ext' variants.
> > > >
> > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > an extension.
> > > >
> > > > Why not just add a new ioctl? The commit message does not address
> > > > the most essential design here.
> > >
> > > Yes, people can always choose to add a new ioctl for this kind of change
> > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > the functionalities are similar. The '_ext' variant reuses all the
> > > existing fields in the 'normal' variant and most importantly KVM
> > > internally can reuse most of the code. I certainly can add some words in
> > > the commit message to explain this design choice.
> >
> > After seeing the userspace side of this, I agree with Jarkko; overloading
> > KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
> > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > itself.
>
> How is the size validation being bogus? I don't quite follow.

The ioctl() magic embeds the size of the payload (struct kvm_userspace_memory_region
in this case) in the ioctl() number, and that information is visible to userspace
via _IOC_SIZE(). Attempting to take a larger size can mess up sanity checks,
e.g. KVM selftests get tripped up on this assert if KVM_SET_USER_MEMORY_REGION is
passed an "extended" struct.

#define kvm_do_ioctl(fd, cmd, arg)                                              \
({                                                                              \
        kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd));   \
        ioctl(fd, cmd, arg);                                                    \
})
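
To spell it out with a stand-alone illustration (not code from the series): the
payload size is baked into the ioctl number itself, so the old number simply
cannot describe the larger struct:

        /* _IOC_SIZE() recovers the size that _IOW() encoded into the number. */
        _Static_assert(_IOC_SIZE(KVM_SET_USER_MEMORY_REGION) ==
                       sizeof(struct kvm_userspace_memory_region),
                       "payload size is part of the ioctl number");

        /*
         * Passing the larger '_ext' struct with the old number therefore trips
         * the kvm_static_assert() above, since sizeof(*arg) != _IOC_SIZE(cmd).
         */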

> Then we will use kvm_userspace_memory_region2 as the KVM internal alias,
> right?

Yep.

> I see similar examples use different functions to handle different versions
> but it does look easier if we use alias for this function.
>
> >
> > It feels absolutely ridiculous, but I think the best option is to do:
> >
> > #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> > struct kvm_userspace_memory_region2)
>
> Just interesting, is 0x49 a safe number we can use?

Yes? So long as it's not used by KVM, it's safe. AFAICT, it's unused.

2023-01-13 23:35:34

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -785,11 +786,12 @@ struct kvm {
>
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> struct mmu_notifier mmu_notifier;
> +#endif
> unsigned long mmu_invalidate_seq;
> long mmu_invalidate_in_progress;
> gfn_t mmu_invalidate_range_start;
> gfn_t mmu_invalidate_range_end;
> -#endif

Blech. The existing code is a bit ugly, and trying to extend it for this use case
makes things even worse.

Rather than use the base MMU_NOTIFIER Kconfig and an arbitrary define, I think we
should first add a proper Kconfig, e.g. KVM_GENERIC_MMU_NOTIFIER, to replace the
combination. E.g.:

config KVM_GENERIC_MMU_NOTIFIER
        select MMU_NOTIFIER
        bool

and then all architectures that currently #define KVM_ARCH_WANT_MMU_NOTIFIER can
simply select the Kconfig, which is everything except s390. "GENERIC" again because
s390 does select MMU_NOTIFIER and actually registers its own notifier for s390's
version of protected VMs (at least, I think that's what its "pv" stands for).

And then later down the line in this series, when the attributes and private mem
needs to tie into the notifiers, we can do:


config KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_GENERIC_MMU_NOTIFIER
        bool

I.e. that way this patch doesn't need to partially expose KVM's notifier stuff
and can instead just keep the soon-to-be-existing KVM_GENERIC_MMU_NOTIFIER.

Taking a dependency on KVM_GENERIC_MMU_NOTIFIER for KVM_GENERIC_MEMORY_ATTRIBUTES
makes sense, because AFAICT, changing any type of attribute, e.g. RWX bits, is
going to necessitate unmapping the affected gfn range.
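
As a rough sketch (assuming the option names above), the arch side then reduces
to a couple of selects, e.g. a hypothetical arch/x86/kvm/Kconfig fragment:

config KVM
        tristate "Kernel-based Virtual Machine (KVM) support"
        select KVM_GENERIC_MMU_NOTIFIER
        select KVM_GENERIC_MEMORY_ATTRIBUTES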

> struct list_head devices;
> u64 manual_dirty_log_protect;
> struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);

The reference to private memory belongs in a later patch. More below.

> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + struct kvm_gfn_range gfn_range;
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> + int r = 0;

The return from kvm_unmap_gfn_range() is a bool, this should be:

bool flush = false;

> +
> + gfn_range.pte = __pte(0);
> + gfn_range.may_block = true;
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> + slot = iter.slot;
> + gfn_range.start = max(start, slot->base_gfn);
> + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> + if (gfn_range.start >= gfn_range.end)
> + continue;
> + gfn_range.slot = slot;
> +
> + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> + }
> + }
> +
> + if (r)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> struct kvm_memory_attributes *attrs)
> {
> gfn_t start, end;
> unsigned long i;
> void *entry;
> + int idx;
> u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>
> - /* flags is currently not used. */
> + /* 'flags' is currently not used. */

Kind of a spurious change.

> if (attrs->flags)
> return -EINVAL;
> if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
> entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>
> + if (kvm_arch_has_private_mem(kvm)) {

I think we should assume that any future attributes will necessitate unmapping
and invalidation, i.e. drop the private mem check. That allows introducing
kvm_arch_has_private_mem() in a later patch that is more directly related to
private memory.

> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm);
> + kvm_mmu_invalidate_range_add(kvm, start, end);
> + KVM_MMU_UNLOCK(kvm);
> + }
> +
> mutex_lock(&kvm->lock);
> for (i = start; i < end; i++)
> if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> break;
> mutex_unlock(&kvm->lock);
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + idx = srcu_read_lock(&kvm->srcu);

Mostly for reference, this goes away if slots_lock is used instead of kvm->lock.

> + KVM_MMU_LOCK(kvm);
> + if (i > start)
> + kvm_unmap_mem_range(kvm, start, i);
> + kvm_mmu_invalidate_end(kvm);
> + KVM_MMU_UNLOCK(kvm);
> + srcu_read_unlock(&kvm->srcu, idx);
> + }
> +
> attrs->address = i << PAGE_SHIFT;
> attrs->size = (end - i) << PAGE_SHIFT;
>
> --
> 2.25.1
>

2023-01-14 00:34:41

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 8/9] KVM: Handle page fault for private memory

On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> return -EIO;
> }
>
> + if (r == RET_PF_USER)
> + return 0;
> +
> if (r < 0)
> return r;
> if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> */
> if (sp->role.direct &&
> sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> - PG_LEVEL_NUM)) {
> + PG_LEVEL_NUM,
> + false)) {

Passing %false is incorrect. It might not cause problems because KVM currently
doesn't allow modifying private memslots (that likely needs to change to allow
dirty logging), but it's wrong since nothing guarantees KVM is operating on SPTEs
for shared memory.

One option would be to take the patches from the TDX series that add a "private"
flag to the shadow page role, but I'd rather not add the role until it's truly
necessary.

For now, I think we can do this without impacting performance of guests that don't
support private memory.

int kvm_mmu_max_mapping_level(struct kvm *kvm,
                              const struct kvm_memory_slot *slot, gfn_t gfn,
                              int max_level)
{
        bool is_private = kvm_slot_can_be_private(slot) &&
                          kvm_mem_is_private(kvm, gfn);

        return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
}

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> }
> #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{

This code, i.e. the generic KVM changes, belongs in a separate patch. It'll be
small, but I want to separate x86's page fault changes from the restrictedmem
support being added to common KVM.

This should also short-circuit based on CONFIG_HAVE_KVM_RESTRICTED_MEM, though
I would name that CONFIG_KVM_PRIVATE_MEMORY since in KVM's world, it's all about
private vs. shared at this time.

> + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + int ret;
> + struct page *page;
> + pgoff_t index = gfn - slot->base_gfn +
> + (slot->restricted_offset >> PAGE_SHIFT);
> +
> + ret = restrictedmem_get_page(slot->restricted_file, index,
> + &page, order);

This needs to handle errors. If "ret" is non-zero, "page" is garbage.

> + *pfn = page_to_pfn(page);
> + return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif
> --
> 2.25.1
>
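
E.g. something like this untested sketch (the exact fix is up to you, the point
is just that a restrictedmem_get_page() failure can't be ignored):

static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
                                             gfn_t gfn, kvm_pfn_t *pfn, int *order)
{
        pgoff_t index = gfn - slot->base_gfn +
                        (slot->restricted_offset >> PAGE_SHIFT);
        struct page *page;
        int ret;

        ret = restrictedmem_get_page(slot->restricted_file, index, &page, order);
        if (ret)
                return ret;

        *pfn = page_to_pfn(page);
        return 0;
}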

2023-01-14 01:39:14

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Fri, Dec 02, 2022, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.
>
> The patch series touches both core mm and KVM code. I appreciate
> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> reviews are always welcome.
> - 01: mm change, target for mm tree
> - 02-09: KVM change, target for KVM tree

A version with all of my feedback, plus reworked versions of Vishal's selftest,
is available here:

[email protected]:sean-jc/linux.git x86/upm_base_support

It compiles and passes the selftest, but it's otherwise barely tested. There are
a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
a WIP.

As for next steps, can you (handwaving all of the TDX folks) take a look at what
I pushed and see if there's anything horrifically broken, and that it still works
for TDX?

Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
(and I mean that).

On my side, the two things on my mind are (a) tests and (b) downstream dependencies
(SEV and TDX). For tests, I want to build a list of tests that are required for
merging so that the criteria for merging are clear, and so that if the list is large
(haven't thought much yet), the work of writing and running tests can be distributed.

Regarding downstream dependencies, before this lands, I want to pull in all the
TDX and SNP series and see how everything fits together. Specifically, I want to
make sure that we don't end up with a uAPI that necessitates ugly code, and that we
don't miss an opportunity to make things simpler. The patches in the SNP series to
add "legacy" SEV support for UPM in particular made me slightly rethink some minor
details. Nothing remotely major, but something that needs attention since it'll
be uAPI.

I'm off Monday, so it'll be at least Tuesday before I make any more progress on
my side.

Thanks!

2023-01-16 15:30:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> > - 01: mm change, target for mm tree
> > - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
> [email protected]:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?

Minor build fix:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6eb5336ccc65..4a9e9fa2552a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7211,8 +7211,8 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
         int level;
         bool mixed;
 
-        lockdep_assert_held_write(kvm->mmu_lock);
-        lockdep_assert_held(kvm->slots_lock);
+        lockdep_assert_held_write(&kvm->mmu_lock);
+        lockdep_assert_held(&kvm->slots_lock);
 
         /*
          * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 467916943c73..4ef60ba7eb1d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2304,7 +2304,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-        lockdep_assert_held(kvm->mmu_lock);
+        lockdep_assert_held(&kvm->mmu_lock);
 
         return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
--
Kiryl Shutsemau / Kirill A. Shutemov

2023-01-17 13:17:52

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Fri, Jan 13, 2023 at 10:37:39PM +0000, Sean Christopherson wrote:
> On Tue, Jan 10, 2023, Chao Peng wrote:
> > On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > > To make future maintenance easy, internally use a binary compatible
> > > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > > '_ext' variants.
> > > > >
> > > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > > an extension.
> > > > >
> > > > > Why not just add a new ioctl? The commit message does not address
> > > > > the most essential design here.
> > > >
> > > > Yes, people can always choose to add a new ioctl for this kind of change
> > > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > > the functionalities are similar. The '_ext' variant reuses all the
> > > > existing fields in the 'normal' variant and most importantly KVM
> > > > internally can reuse most of the code. I certainly can add some words in
> > > > the commit message to explain this design choice.
> > >
> > > After seeing the userspace side of this, I agree with Jarkko; overloading
> > > KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
> > > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > > itself.
> >
> > How is the size validation being bogus? I don't quite follow.
>
> The ioctl() magic embeds the size of the payload (struct kvm_userspace_memory_region
> in this case) in the ioctl() number, and that information is visible to userspace
> via _IOC_SIZE(). Attempting to take a larger size can mess up sanity checks,
> e.g. KVM selftests get tripped up on this assert if KVM_SET_USER_MEMORY_REGION is
> passed an "extended" struct.
>
> #define kvm_do_ioctl(fd, cmd, arg) \
> ({ \
> kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd)); \
> ioctl(fd, cmd, arg); \
> })

Got it. Thanks for the explanation.

Chao

2023-01-17 13:50:58

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> > - 01: mm change, target for mm tree
> > - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
> [email protected]:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.

Thanks very much for doing this. Almost all of your comments are well
received, except for two cases that need more discussion, which I have
replied to individually.

>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?

I have integrated this into my local TDX repo with some changes (as I
replied individually), and the new code basically still works with TDX.

I have also asked other TDX folks to take a look.

>
> Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> (and I mean that).
>
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX). For tests, I want to build a lists of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
>
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together. Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler. The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details. Nothing remotely major, but something that needs attention since it'll
> be uAPI.
>
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.

Appreciate your effort. As for the next steps, if you see something we
can do in parallel, feel free to let me know.

Thanks,
Chao

2023-01-17 15:35:51

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

Hi Sean,

On Sat, Jan 14, 2023 at 12:38 AM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> > - 01: mm change, target for mm tree
> > - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
> [email protected]:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
>
> Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> (and I mean that).

Thanks for sharing this. I've had a look at the patches, and have
ported them to work with pKVM. At a high level, the new interface
seems fine and it works with the arm64/pKVM port. I have a couple of
comments regarding some of the details, but they can wait until v11 is
posted.

Cheers,
/fuad



> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX). For tests, I want to build a lists of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
>
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together. Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler. The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details. Nothing remotely major, but something that needs attention since it'll
> be uAPI.
>
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.
>
> Thanks!

2023-01-19 11:56:21

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Sat, Jan 14, 2023 at 12:37:59AM +0000,
Sean Christopherson <[email protected]> wrote:

> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> > - 01: mm change, target for mm tree
> > - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
> [email protected]:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
>
> Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> (and I mean that).
>
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX). For tests, I want to build a lists of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
>
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together. Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler. The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details. Nothing remotely major, but something that needs attention since it'll
> be uAPI.

Although I'm still debugging with TDX KVM, I needed the following.
kvm_faultin_pfn() is called without mmu_lock held; the race to change
private/shared is handled by mmu_seq. Maybe a dedicated function only for
kvm_faultin_pfn() is needed.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02be5e1cba1e..38699ca75ab8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2322,7 +2322,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-        lockdep_assert_held(&kvm->mmu_lock);
+        // lockdep_assert_held(&kvm->mmu_lock);
 
         return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }


--
Isaku Yamahata <[email protected]>

2023-01-19 16:12:27

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> Sean Christopherson <[email protected]> wrote:
>
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > >
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > > - 01: mm change, target for mm tree
> > > - 02-09: KVM change, target for KVM tree
> >
> > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > is available here:
> >
> > [email protected]:sean-jc/linux.git x86/upm_base_support
> >
> > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > a WIP.
> >
> > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > I pushed and see if there's anything horrifically broken, and that it still works
> > for TDX?
> >
> > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> > (and I mean that).
> >
> > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > (SEV and TDX). For tests, I want to build a lists of tests that are required for
> > merging so that the criteria for merging are clear, and so that if the list is large
> > (haven't thought much yet), the work of writing and running tests can be distributed.
> >
> > Regarding downstream dependencies, before this lands, I want to pull in all the
> > TDX and SNP series and see how everything fits together. Specifically, I want to
> > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > don't miss an opportunity to make things simpler. The patches in the SNP series to
> > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > details. Nothing remotely major, but something that needs attention since it'll
> > be uAPI.
>
> Although I'm still debuging with TDX KVM, I needed the following.
> kvm_faultin_pfn() is called without mmu_lock held. the race to change
> private/shared is handled by mmu_seq. Maybe dedicated function only for
> kvm_faultin_pfn().

Gah, you're not on the other thread where this was discussed[*]. Simply deleting
the lockdep assertion is safe; for guest types that rely on the attributes to
define shared vs. private, KVM rechecks the attributes under the protection of
mmu_seq.

I'll get a fixed version pushed out today.

[*] https://lore.kernel.org/all/[email protected]
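
In case it helps, the idea is the usual invalidation-sequence pattern, roughly
(a sketch of the concept, not the exact code in the branch):

        /* Outside mmu_lock: snapshot the sequence, then read the attribute. */
        mmu_seq = kvm->mmu_invalidate_seq;
        smp_rmb();
        fault->is_private = kvm_mem_is_private(kvm, fault->gfn);

        /* ... resolve the pfn without holding mmu_lock ... */

        /* Under mmu_lock: if an invalidation raced with the lookup, retry. */
        write_lock(&kvm->mmu_lock);
        if (mmu_invalidate_retry(kvm, mmu_seq)) {
                write_unlock(&kvm->mmu_lock);
                return RET_PF_RETRY;
        }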

2023-01-19 23:15:48

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Thu, Jan 19, 2023 at 03:25:08PM +0000,
Sean Christopherson <[email protected]> wrote:

> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > Sean Christopherson <[email protected]> wrote:
> >
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > >
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > > - 01: mm change, target for mm tree
> > > > - 02-09: KVM change, target for KVM tree
> > >
> > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > is available here:
> > >
> > > [email protected]:sean-jc/linux.git x86/upm_base_support
> > >
> > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > a WIP.
> > >
> > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > I pushed and see if there's anything horrifically broken, and that it still works
> > > for TDX?
> > >
> > > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> > > (and I mean that).
> > >
> > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > (SEV and TDX). For tests, I want to build a lists of tests that are required for
> > > merging so that the criteria for merging are clear, and so that if the list is large
> > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > >
> > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > TDX and SNP series and see how everything fits together. Specifically, I want to
> > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > don't miss an opportunity to make things simpler. The patches in the SNP series to
> > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > details. Nothing remotely major, but something that needs attention since it'll
> > > be uAPI.
> >
> > Although I'm still debuging with TDX KVM, I needed the following.
> > kvm_faultin_pfn() is called without mmu_lock held. the race to change
> > private/shared is handled by mmu_seq. Maybe dedicated function only for
> > kvm_faultin_pfn().
>
> Gah, you're not on the other thread where this was discussed[*]. Simply deleting
> the lockdep assertion is safe, for guest types that rely on the attributes to
> define shared vs. private, KVM rechecks the attributes under the protection of
> mmu_seq.
>
> I'll get a fixed version pushed out today.
>
> [*] https://lore.kernel.org/all/[email protected]

Now I have TDX KVM working. I've uploaded it at the following locations.
It's rebased to v6.2-rc3.
[email protected]:yamahata/linux.git tdx/upm
[email protected]:yamahata/qemu.git tdx/upm

kvm_mmu_do_page_fault() needs the following change.
kvm_mem_is_private() queries mem_attr_array. kvm_faultin_pfn() also uses
kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
make sense. This change would belong to TDX KVM patches, though.

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72b0da8e27e0..f45ac438bbf4 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -430,7 +430,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                 .max_level = vcpu->kvm->arch.tdp_max_page_level,
                 .req_level = PG_LEVEL_4K,
                 .goal_level = PG_LEVEL_4K,
-                .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+                .is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),
         };
         int r;
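
For context: in the TDX patches the private/shared distinction is carried in
the GPA itself via the shared bit, which is why the mem_attr_array lookup is
replaced here. Roughly, and only as an illustration since the real helper and
kvm_gfn_shared_mask() live in the TDX series, not in this one:

static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
{
        /* TDX repurposes one GPA bit to mark shared mappings. */
        gpa_t shared_mask = gfn_to_gpa(kvm_gfn_shared_mask(kvm));

        return shared_mask && !(gpa & shared_mask);
}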


--
Isaku Yamahata <[email protected]>

2023-01-21 00:06:03

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> On Fri, Jan 06, 2023, Chao Peng wrote:
> > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > >
> > > Feels bit hacky IMHO, and more like a completely new feature than
> > > an extension.
> > >
> > > Why not just add a new ioctl? The commit message does not address
> > > the most essential design here.
> >
> > Yes, people can always choose to add a new ioctl for this kind of change
> > and the balance point here is we want to also avoid 'too many ioctls' if
> > the functionalities are similar. The '_ext' variant reuses all the
> > existing fields in the 'normal' variant and most importantly KVM
> > internally can reuse most of the code. I certainly can add some words in
> > the commit message to explain this design choice.
>
> After seeing the userspace side of this, I agree with Jarkko; overloading
> KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
> bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> itself.
>
> It feels absolutely ridiculous, but I think the best option is to do:
>
> #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> struct kvm_userspace_memory_region2)
>
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_user_mem_region2 {
> __u32 slot;
> __u32 flags;
> __u64 guest_phys_addr;
> __u64 memory_size;
> __u64 userspace_addr;
> __u64 restricted_offset;
> __u32 restricted_fd;
> __u32 pad1;
> __u64 pad2[14];
> }
>
> And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.
>
> Regarding the userspace side of things, please include Vishal's selftests in v11,
> it's impossible to properly review the uAPI changes without seeing the userspace
> side of things. I'm in the process of reviewing Vishal's v2[*], I'll try to
> massage it into a set of patches that you can incorporate into your series.
>
> [*] https://lore.kernel.org/all/[email protected]

+1

BR, Jarkko

2023-01-21 00:06:03

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

On Tue, Jan 10, 2023 at 05:14:32PM +0800, Chao Peng wrote:
> On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > To make future maintenance easy, internally use a binary compatible
> > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > '_ext' variants.
> > > >
> > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > an extension.
> > > >
> > > > Why not just add a new ioctl? The commit message does not address
> > > > the most essential design here.
> > >
> > > Yes, people can always choose to add a new ioctl for this kind of change
> > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > the functionalities are similar. The '_ext' variant reuses all the
> > > existing fields in the 'normal' variant and most importantly KVM
> > > internally can reuse most of the code. I certainly can add some words in
> > > the commit message to explain this design choice.
> >
> > After seeing the userspace side of this, I agree with Jarkko; overloading
> > KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
> > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > itself.
>
> How is the size validation being bogus? I don't quite follow. Then we
> will use kvm_userspace_memory_region2 as the KVM internal alias, right?
> I see similar examples use different functions to handle different
> versions but it does look easier if we use alias for this function.
>
> >
> > It feels absolutely ridiculous, but I think the best option is to do:
> >
> > #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> > struct kvm_userspace_memory_region2)
>
> Just interesting, is 0x49 a safe number we can use?
>
> >
> > /* for KVM_SET_USER_MEMORY_REGION2 */
> > struct kvm_user_mem_region2 {
> > __u32 slot;
> > __u32 flags;
> > __u64 guest_phys_addr;
> > __u64 memory_size;
> > __u64 userspace_addr;
> > __u64 restricted_offset;
> > __u32 restricted_fd;
> > __u32 pad1;
> > __u64 pad2[14];
> > }
> >
> > And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.
>
> Okay, agree from KVM userspace API perspective this is more consistent
> with similar existing examples. I see several of them.
>
> I think we will also need a CAP_KVM_SET_USER_MEMORY_REGION2 for this new
> ioctl.

The current API in the patch set is trivial for C userspace, but for a
more "constrained" language such as Rust, a new ioctl would be easier to
adapt to.

> >
> > Regarding the userspace side of things, please include Vishal's selftests in v11,
> > it's impossible to properly review the uAPI changes without seeing the userspace
> > side of things. I'm in the process of reviewing Vishal's v2[*], I'll try to
> > massage it into a set of patches that you can incorporate into your series.
>
> Previously I included Vishal's selftests in the github repo, but not
> include them in this patch series. It's OK for me to incorporate them
> directly into this series and review together if Vishal is fine.
>
> Chao
> >
> > [*] https://lore.kernel.org/all/[email protected]

BR, Jarkko

2023-01-24 01:28:04

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> Sean Christopherson <[email protected]> wrote:
>
> > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > Sean Christopherson <[email protected]> wrote:
> > >
> > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > This patch series implements KVM guest private memory for confidential
> > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > crash the running host system, this is terrible for multi-tenant
> > > > > configurations. The host accesses include those from KVM userspace like
> > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > via a fd-based approach, but it can never access the guest memory
> > > > > content.
> > > > >
> > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > reviews are always welcome.
> > > > > - 01: mm change, target for mm tree
> > > > > - 02-09: KVM change, target for KVM tree
> > > >
> > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > is available here:
> > > >
> > > > [email protected]:sean-jc/linux.git x86/upm_base_support
> > > >
> > > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > a WIP.
> > > >
> > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > for TDX?
> > > >
> > > > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> > > > (and I mean that).
> > > >
> > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > (SEV and TDX). For tests, I want to build a lists of tests that are required for
> > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > >
> > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > TDX and SNP series and see how everything fits together. Specifically, I want to
> > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > don't miss an opportunity to make things simpler. The patches in the SNP series to
> > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > details. Nothing remotely major, but something that needs attention since it'll
> > > > be uAPI.
> > >
> > > Although I'm still debuging with TDX KVM, I needed the following.
> > > kvm_faultin_pfn() is called without mmu_lock held. the race to change
> > > private/shared is handled by mmu_seq. Maybe dedicated function only for
> > > kvm_faultin_pfn().
> >
> > Gah, you're not on the other thread where this was discussed[*]. Simply deleting
> > the lockdep assertion is safe, for guest types that rely on the attributes to
> > define shared vs. private, KVM rechecks the attributes under the protection of
> > mmu_seq.
> >
> > I'll get a fixed version pushed out today.
> >
> > [*] https://lore.kernel.org/all/[email protected]
>
> Now I have tdx kvm working. I've uploaded at the followings.
> It's rebased to v6.2-rc3.
> [email protected]:yamahata/linux.git tdx/upm
> [email protected]:yamahata/qemu.git tdx/upm

And I finally got a working, building version updated and pushed out (again to):

[email protected]:sean-jc/linux.git x86/upm_base_support

Took longer than expected to get the memslot restrictions sussed out. I'm done
working on the code for now; my plan is to come back to it+TDX+SNP in 2-3 weeks
to resolve any remaining todos (that no one else tackles) and to do the whole
"merge the world" exercise.

> kvm_mmu_do_page_fault() needs the following change.
> kvm_mem_is_private() queries mem_attr_array. kvm_faultin_pfn() also uses
> kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> make sense. This change would belong to TDX KVM patches, though.

Yeah, SNP needs similar treatment. Sorting that out is high up on the todo list.

2023-01-24 16:11:03

by Liam Merwick

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On 14/01/2023 00:37, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
>> This patch series implements KVM guest private memory for confidential
>> computing scenarios like Intel TDX[1]. If a TDX host accesses
>> TDX-protected guest memory, machine check can happen which can further
>> crash the running host system, this is terrible for multi-tenant
>> configurations. The host accesses include those from KVM userspace like
>> QEMU. This series addresses KVM userspace induced crash by introducing
>> new mm and KVM interfaces so KVM userspace can still manage guest memory
>> via a fd-based approach, but it can never access the guest memory
>> content.
>>
>> The patch series touches both core mm and KVM code. I appreciate
>> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
>> reviews are always welcome.
>> - 01: mm change, target for mm tree
>> - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
> [email protected]:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>

When running LTP (https://github.com/linux-test-project/ltp) on the v10
bits (and also with Sean's branch above) I encounter the following NULL
pointer dereference with testcases/kernel/syscalls/madvise/madvise01
(100% reproducible).

It appears that in restrictedmem_error_page()
inode->i_mapping->private_data is NULL
in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
but I don't know why.


[ 5365.177168] BUG: kernel NULL pointer dereference, address: 0000000000000028
[ 5365.178881] #PF: supervisor read access in kernel mode
[ 5365.180006] #PF: error_code(0x0000) - not-present page
[ 5365.181322] PGD 8000000109dad067 P4D 8000000109dad067 PUD 107707067 PMD 0
[ 5365.183474] Oops: 0000 [#1] PREEMPT SMP PTI
[ 5365.184792] CPU: 0 PID: 22086 Comm: madvise01 Not tainted 6.1.0-1.el8.seanjcupm.x86_64 #1
[ 5365.186572] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.5.1 06/16/2021
[ 5365.188816] RIP: 0010:restrictedmem_error_page+0xc7/0x1b0
[ 5365.190081] Code: 99 00 48 8b 55 00 48 8b 02 48 8d 8a e8 fe ff ff 48 2d 18 01 00 00 48 39 d5 0f 84 8a 00 00 00 48 8b 51 30 48 8b 92 b8 00 00 00 <48> 8b 4a 28 4c 39 b1 d8 00 00 00 74 22 48 8b 88 18 01 00 00 48 8d
[ 5365.193984] RSP: 0018:ffff9b7343c07d80 EFLAGS: 00010206
[ 5365.195142] RAX: ffff8e5b410cfc70 RBX: 0000000000000001 RCX: ffff8e5b4048e580
[ 5365.196888] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 5365.198399] RBP: ffff8e5b410cfd88 R08: 0000000000000000 R09: 0000000000000000
[ 5365.200200] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5365.201843] R13: ffff8e5b410cfd80 R14: ffff8e5b47cc7618 R15: ffffd49d44c05080
[ 5365.203472] FS: 00007fc96de9b5c0(0000) GS:ffff8e5deda00000(0000) knlGS:0000000000000000
[ 5365.205485] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5365.206791] CR2: 0000000000000028 CR3: 000000012047e002 CR4: 0000000000170ef0
[ 5365.208131] Call Trace:
[ 5365.208752] <TASK>
[ 5365.209229] me_pagecache_clean+0x58/0x100
[ 5365.210196] identify_page_state+0x84/0xd0
[ 5365.211180] memory_failure+0x231/0x8b0
[ 5365.212148] madvise_inject_error.cold+0x8d/0xa4
[ 5365.213317] do_madvise+0x363/0x3a0
[ 5365.214177] __x64_sys_madvise+0x2c/0x40
[ 5365.215159] do_syscall_64+0x3f/0xa0
[ 5365.216016] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 5365.217130] RIP: 0033:0x7fc96d8399ab
[ 5365.217953] Code: 73 01 c3 48 8b 0d dd 54 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 54 38 00 f7 d8 64 89 01 48
[ 5365.222323] RSP: 002b:00007fff62a99b18 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[ 5365.224026] RAX: ffffffffffffffda RBX: 000000000041ce00 RCX: 00007fc96d8399ab
[ 5365.225375] RDX: 0000000000000064 RSI: 000000000000a000 RDI: 00007fc96de8e000
[ 5365.226999] RBP: 00007fc96de9b540 R08: 0000000000000001 R09: 0000000000415c80
[ 5365.228641] R10: 0000000000000001 R11: 0000000000000206 R12: 0000000000000008
[ 5365.230074] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Regards,
Liam

> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
>
> Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> (and I mean that).
>
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX). For tests, I want to build a list of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
>
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together. Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler. The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details. Nothing remotely major, but something that needs attention since it'll
> be uAPI.
>
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.
>
> Thanks!
>


2023-01-25 00:20:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Tue, Jan 24, 2023, Liam Merwick wrote:
> On 14/01/2023 00:37, Sean Christopherson wrote:
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > >
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > > - 01: mm change, target for mm tree
> > > - 02-09: KVM change, target for KVM tree
> >
> > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > is available here:
> >
> > [email protected]:sean-jc/linux.git x86/upm_base_support
> >
> > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > a WIP.
> >
>
> When running LTP (https://github.com/linux-test-project/ltp) on the v10
> bits (and also with Sean's branch above) I encounter the following NULL
> pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> (100% reproducible).
>
> It appears that in restrictedmem_error_page() inode->i_mapping->private_data
> is NULL
> in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
> but I don't know why.

Kirill, can you take a look? Or pass the buck to someone who can? :-)

2023-01-25 13:01:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> On Tue, Jan 24, 2023, Liam Merwick wrote:
> > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > >
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > > - 01: mm change, target for mm tree
> > > > - 02-09: KVM change, target for KVM tree
> > >
> > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > is available here:
> > >
> > > [email protected]:sean-jc/linux.git x86/upm_base_support
> > >
> > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > a WIP.
> > >
> >
> > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > bits (and also with Sean's branch above) I encounter the following NULL
> > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > (100% reproducible).
> >
> > It appears that in restrictedmem_error_page() inode->i_mapping->private_data
> > is NULL
> > in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
> > but I don't know why.
>
> Kirill, can you take a look? Or pass the buck to someone who can? :-)

The patch below should help.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 15c52301eeb9..39ada985c7c0 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)

spin_lock(&sb->s_inode_list_lock);
list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
- struct restrictedmem *rm = inode->i_mapping->private_data;
struct restrictedmem_notifier *notifier;
- struct file *memfd = rm->memfd;
+ struct restrictedmem *rm;
unsigned long index;
+ struct file *memfd;

- if (memfd->f_mapping != mapping)
+ if (atomic_read(&inode->i_count))
continue;

+ spin_lock(&inode->i_lock);
+ if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
+
+ rm = inode->i_mapping->private_data;
+ memfd = rm->memfd;
+
+ if (memfd->f_mapping != mapping) {
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
+ spin_unlock(&inode->i_lock);
+
xa_for_each_range(&rm->bindings, index, notifier, start, end)
notifier->ops->error(notifier, start, end);
break;
--
Kiryl Shutsemau / Kirill A. Shutemov

2023-01-25 16:03:26

by Liam Merwick

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On 25/01/2023 12:53, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
>> On Tue, Jan 24, 2023, Liam Merwick wrote:
>>> On 14/01/2023 00:37, Sean Christopherson wrote:
>>>> On Fri, Dec 02, 2022, Chao Peng wrote:
...
>>>
>>> When running LTP (https://github.com/linux-test-project/ltp) on the v10
>>> bits (and also with Sean's branch above) I encounter the following NULL
>>> pointer dereference with testcases/kernel/syscalls/madvise/madvise01
>>> (100% reproducible).
>>>
>>> It appears that in restrictedmem_error_page() inode->i_mapping->private_data
>>> is NULL
>>> in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
>>> but I don't know why.
>>
>> Kirill, can you take a look? Or pass the buck to someone who can? :-)
>
> The patch below should help.

Thanks, this works for me.

Regards,
Liam

>
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..39ada985c7c0 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
>
> spin_lock(&sb->s_inode_list_lock);
> list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> - struct restrictedmem *rm = inode->i_mapping->private_data;
> struct restrictedmem_notifier *notifier;
> - struct file *memfd = rm->memfd;
> + struct restrictedmem *rm;
> unsigned long index;
> + struct file *memfd;
>
> - if (memfd->f_mapping != mapping)
> + if (atomic_read(&inode->i_count))
> continue;
>
> + spin_lock(&inode->i_lock);
> + if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> +
> + rm = inode->i_mapping->private_data;
> + memfd = rm->memfd;
> +
> + if (memfd->f_mapping != mapping) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> + spin_unlock(&inode->i_lock);
> +
> xa_for_each_range(&rm->bindings, index, notifier, start, end)
> notifier->ops->error(notifier, start, end);
> break;


2023-02-08 12:24:32

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Tue, Jan 24, 2023 at 01:27:50AM +0000,
Sean Christopherson <[email protected]> wrote:

> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <[email protected]> wrote:
> >
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > > Sean Christopherson <[email protected]> wrote:
> > > >
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > >
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > > - 01: mm change, target for mm tree
> > > > > > - 02-09: KVM change, target for KVM tree
> > > > >
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > >
> > > > > [email protected]:sean-jc/linux.git x86/upm_base_support
> > > > >
> > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > >
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > > for TDX?
> > > > >
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> > > > > (and I mean that).
> > > > >
> > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > > (SEV and TDX). For tests, I want to build a lists of tests that are required for
> > > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > >
> > > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > > TDX and SNP series and see how everything fits together. Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > > don't miss an opportunity to make things simpler. The patches in the SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > > details. Nothing remotely major, but something that needs attention since it'll
> > > > > be uAPI.
> > > >
> > > > Although I'm still debuging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held. the race to change
> > > > private/shared is handled by mmu_seq. Maybe dedicated function only for
> > > > kvm_faultin_pfn().
> > >
> > > Gah, you're not on the other thread where this was discussed[*]. Simply deleting
> > > the lockdep assertion is safe, for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > >
> > > I'll get a fixed version pushed out today.
> > >
> > > [*] https://lore.kernel.org/all/[email protected]
> >
> > Now I have tdx kvm working. I've uploaded at the followings.
> > It's rebased to v6.2-rc3.
> > [email protected]:yamahata/linux.git tdx/upm
> > [email protected]:yamahata/qemu.git tdx/upm
>
> And I finally got a working, building version updated and pushed out (again to):
>
> [email protected]:sean-jc/linux.git x86/upm_base_support
>

Ok, I rebased TDX part to the updated branch.
[email protected]:yamahata/linux.git tdx/upm
[email protected]:yamahata/qemu.git tdx/upm

Now it's v6.2-rc7 based.
QEMU needs more patches to avoid registering a memory slot for SMM.
--
Isaku Yamahata <[email protected]>

2023-02-13 13:11:31

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <[email protected]> wrote:
> >
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > > Sean Christopherson <[email protected]> wrote:
> > > >
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > >
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > > - 01: mm change, target for mm tree
> > > > > > - 02-09: KVM change, target for KVM tree
> > > > >
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > >
> > > > > [email protected]:sean-jc/linux.git x86/upm_base_support
> > > > >
> > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > >
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > > for TDX?
> > > > >
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> > > > > (and I mean that).
> > > > >
> > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > > (SEV and TDX). For tests, I want to build a lists of tests that are required for
> > > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > >
> > > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > > TDX and SNP series and see how everything fits together. Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > > don't miss an opportunity to make things simpler. The patches in the SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > > details. Nothing remotely major, but something that needs attention since it'll
> > > > > be uAPI.
> > > >
> > > > Although I'm still debuging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held. the race to change
> > > > private/shared is handled by mmu_seq. Maybe dedicated function only for
> > > > kvm_faultin_pfn().
> > >
> > > Gah, you're not on the other thread where this was discussed[*]. Simply deleting
> > > the lockdep assertion is safe, for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > >
> > > I'll get a fixed version pushed out today.
> > >
> > > [*] https://lore.kernel.org/all/[email protected]
> >
> > Now I have tdx kvm working. I've uploaded at the followings.
> > It's rebased to v6.2-rc3.
> > [email protected]:yamahata/linux.git tdx/upm
> > [email protected]:yamahata/qemu.git tdx/upm
>
> And I finally got a working, building version updated and pushed out (again to):
>
> [email protected]:sean-jc/linux.git x86/upm_base_support
>
> Took longer than expected to get the memslot restrictions sussed out. I'm done
> working on the code for now, my plan is to come back to it+TDX+SNP in 2-3 weeks
> to resolve any remaining todos (that no one else tackles) and to do the whole
> "merge the world" exercise.
>
> > kvm_mmu_do_page_fault() needs the following change.
> > kvm_mem_is_private() queries mem_attr_array. kvm_faultin_pfn() also uses
> > kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> > make sense. This change would belong to TDX KVM patches, though.
>
> Yeah, SNP needs similar treatment. Sorting that out is high up on the todo list.

Hi Sean,

We've rebased the SEV+SNP support onto your updated UPM base support
tree and things seem to be working okay, but we needed some fixups on
top of the base support to get things working, along with 1 workaround
for an issue that hasn't been root-caused yet:

https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip

*stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
*workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
*fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
*fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
*fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
*fixup (upm_base_support): KVM: selftests: CoCo compilation fixes

We plan to post an updated RFC for v8 soon, but also wanted to share
the staging tree in case you end up looking at the UPM integration aspects
before then.

-Mike

2023-02-16 05:14:25

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

Hi,

On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.

Sorry for jumping late.

Unless I'm missing something, hibernation will also cause a machine check
when there is TDX-protected memory in the system. When hibernation creates
the memory snapshot it essentially walks all physical pages and saves their
contents, so for TDX memory this will trigger a machine check, right?

> Documentation/virt/kvm/api.rst | 125 ++++++-
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/x86/include/asm/kvm_host.h | 9 +
> arch/x86/kvm/Kconfig | 3 +
> arch/x86/kvm/mmu/mmu.c | 205 ++++++++++-
> arch/x86/kvm/mmu/mmu_internal.h | 14 +-
> arch/x86/kvm/mmu/mmutrace.h | 1 +
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> arch/x86/kvm/x86.c | 17 +-
> include/linux/kvm_host.h | 103 +++++-
> include/linux/restrictedmem.h | 71 ++++
> include/linux/syscalls.h | 1 +
> include/uapi/asm-generic/unistd.h | 5 +-
> include/uapi/linux/kvm.h | 53 +++
> include/uapi/linux/magic.h | 1 +
> kernel/sys_ni.c | 3 +
> mm/Kconfig | 4 +
> mm/Makefile | 1 +
> mm/memory-failure.c | 3 +
> mm/restrictedmem.c | 318 +++++++++++++++++
> virt/kvm/Kconfig | 6 +
> virt/kvm/kvm_main.c | 469 +++++++++++++++++++++----
> 23 files changed, 1323 insertions(+), 93 deletions(-)
> create mode 100644 include/linux/restrictedmem.h
> create mode 100644 mm/restrictedmem.c

--
Sincerely yours,
Mike.

2023-02-16 09:43:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On 16.02.23 06:13, Mike Rapoport wrote:
> Hi,
>
> On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
>> This patch series implements KVM guest private memory for confidential
>> computing scenarios like Intel TDX[1]. If a TDX host accesses
>> TDX-protected guest memory, machine check can happen which can further
>> crash the running host system, this is terrible for multi-tenant
>> configurations. The host accesses include those from KVM userspace like
>> QEMU. This series addresses KVM userspace induced crash by introducing
>> new mm and KVM interfaces so KVM userspace can still manage guest memory
>> via a fd-based approach, but it can never access the guest memory
>> content.
>
> Sorry for jumping late.
>
> Unless I'm missing something, hibernation will also cause a machine check
> when there is TDX-protected memory in the system. When hibernation creates
> the memory snapshot it essentially walks all physical pages and saves their
> contents, so for TDX memory this will trigger a machine check, right?

I recall bringing that up in the past (also memory access due to kdump,
/proc/kcore) and was told that the main focus for now is preventing
unprivileged users from crashing the system, that is, not mapping such
memory into user space (e.g., QEMU). In the long run, we'll want to
handle such pages also properly in the other events where the kernel
might access them.

--
Thanks,

David / dhildenb


2023-02-21 12:20:13

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

> Hi Sean,
>
> We've rebased the SEV+SNP support onto your updated UPM base support
> tree and things seem to be working okay, but we needed some fixups on
> top of the base support to get things working, along with 1 workaround
> for an issue that hasn't been root-caused yet:
>
> https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
>
> *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check

What I'm seeing is that Slot#3 gets added first and then deleted. When it
gets added, Slot#0 already has the same range bound to restrictedmem, so it
triggers the exclusivity check. This check is exactly what the current code
is for.

> *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations

As many kernel APIs treat 'end' as exclusive, I would rather keep using
exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind
and notifier callbacks) but fix it internally in the restrictedmem. E.g.
all the places where xarray API needs a 'last'/'max' we use 'end - 1'.
See below for the change.

> *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations

Subtracting slot->restrictedmem.index for start/end in
restrictedmem_get_gfn_range() is the correct fix.
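
For clarity, a rough sketch of that calculation, using the field names from the
posted series (the real helper may differ in detail):

	static void restrictedmem_get_gfn_range(struct kvm_memory_slot *slot,
						pgoff_t start, pgoff_t end,
						struct kvm_gfn_range *range)
	{
		/* Convert file offsets (in pages) into GFNs relative to the slot. */
		range->start = slot->base_gfn + start - slot->restrictedmem.index;
		range->end = slot->base_gfn + end - slot->restrictedmem.index;
		range->slot = slot;
	}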

> *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
>
> We plan to post an updated RFC for v8 soon, but also wanted to share
> the staging tree in case you end up looking at the UPM integration aspects
> before then.
>
> -Mike

This is the restrictedmem fix to solve 'end' being stored and checked in xarray:

--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
*/
down_read(&rm->lock);

- xa_for_each_range(&rm->bindings, index, notifier, start, end)
+ xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
notifier->ops->invalidate_start(notifier, start, end);

ret = memfd->f_op->fallocate(memfd, mode, offset, len);

- xa_for_each_range(&rm->bindings, index, notifier, start, end)
+ xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
notifier->ops->invalidate_end(notifier, start, end);

up_read(&rm->lock);
@@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
}
spin_unlock(&inode->i_lock);

- xa_for_each_range(&rm->bindings, index, notifier, start, end)
+ xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
notifier->ops->error(notifier, start, end);
break;
}
@@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
if (exclusive != rm->exclusive)
goto out_unlock;

- if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
+ if (exclusive &&
+ xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
goto out_unlock;
}

- xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
+ xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
rm->exclusive = exclusive;
ret = 0;
out_unlock:
@@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
struct restrictedmem *rm = file->f_mapping->private_data;

down_write(&rm->lock);
- xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
+ xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
synchronize_rcu();
up_write(&rm->lock);
}

2023-02-22 21:53:27

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Thu, Feb 16, 2023, David Hildenbrand wrote:
> On 16.02.23 06:13, Mike Rapoport wrote:
> > Hi,
> >
> > On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> >
> > Sorry for jumping late.
> >
> > Unless I'm missing something, hibernation will also cause an machine check
> > when there is TDX-protected memory in the system. When the hibernation
> > creates memory snapshot it essentially walks all physical pages and saves
> > their contents, so for TDX memory this will trigger machine check, right?

For hibernation specifically, I think that should be handled elsewhere as hibernation
is simply incompatible with TDX, SNP, pKVM, etc. without paravirtualizing the
guest, as none of those technologies support auto-export a la s390. I suspect
the right approach is to disallow hibernation if KVM is running any protected guests.
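
A minimal sketch of what that could look like, reusing the existing PM notifier
chain; kvm_nr_protected_guests is a made-up placeholder for however KVM ends up
tracking such VMs:

	static int kvm_pm_notifier_call(struct notifier_block *nb,
					unsigned long action, void *unused)
	{
		/* Veto hibernation while any protected guest exists. */
		if (action == PM_HIBERNATION_PREPARE &&
		    atomic_read(&kvm_nr_protected_guests))
			return NOTIFY_BAD;
		return NOTIFY_DONE;
	}

	static struct notifier_block kvm_pm_nb = {
		.notifier_call = kvm_pm_notifier_call,
	};

	/* e.g. registered from kvm_init() via register_pm_notifier(&kvm_pm_nb) */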

> I recall bringing that up in the past (also memory access due to kdump,
> /proc/kcore) and was told that the main focus for now is preventing
> unprivileged users from crashing the system, that is, not mapping such
> memory into user space (e.g., QEMU). In the long run, we'll want to handle
> such pages also properly in the other events where the kernel might access
> them.

Ya, unless someone strongly objects, the plan is to essentially treat "attacks"
from privileged users as out of scope for initial support, and then iterate
as needed to fix/enable more features.

FWIW, read accesses, e.g. kdump, should be ok for TDX and SNP as they both play
nice with "bad" reads. pKVM is a different beast though as I believe any access
to guest private memory will fault. But my understanding is that this series
would be a big step forward for pKVM, which currently doesn't have any safeguards.

2023-03-23 01:30:08

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > Hi Sean,
> >
> > We've rebased the SEV+SNP support onto your updated UPM base support
> > tree and things seem to be working okay, but we needed some fixups on
> > top of the base support get things working, along with 1 workaround
> > for an issue that hasn't been root-caused yet:
> >
> > https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> >
> > *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> > *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
>
> What I'm seeing is that Slot#3 gets added first and then deleted. When it
> gets added, Slot#0 already has the same range bound to restrictedmem, so it
> triggers the exclusivity check. This check is exactly what the current code
> is for.

With the following change in QEMU, we no longer trigger this check:

diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 20da121374..849b5de469 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
MCH_HOST_BRIDGE_SMRAM_C_SIZE);
+ memory_region_set_enabled(&mch->open_high_smram, false);
memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
&mch->open_high_smram, 1);
- memory_region_set_enabled(&mch->open_high_smram, false);

I'm not sure if QEMU is actually doing something wrong here though or if
this check is putting tighter restrictions on userspace than what was
expected before. Will look into it more.

>
> > *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
>
> As many kernel APIs treat 'end' as exclusive, I would rather keep using
> exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind
> and notifier callbacks) but fix it internally in the restrictedmem. E.g.
> all the places where xarray API needs a 'last'/'max' we use 'end - 1'.
> See below for the change.

Yes I did feel like I was fighting the kernel a bit on that; your
suggestion seems like it would be a better fit.

>
> > *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
>
> Subtracting slot->restrictedmem.index for start/end in
> restrictedmem_get_gfn_range() is the correct fix.
>
> > *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> >
> > We plan to post an updated RFC for v8 soon, but also wanted to share
> > the staging tree in case you end up looking at the UPM integration aspects
> > before then.
> >
> > -Mike
>
> This is the restrictedmem fix to solve 'end' being stored and checked in xarray:

Looks good.

Thanks!

-Mike

>
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
> */
> down_read(&rm->lock);
>
> - xa_for_each_range(&rm->bindings, index, notifier, start, end)
> + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> notifier->ops->invalidate_start(notifier, start, end);
>
> ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>
> - xa_for_each_range(&rm->bindings, index, notifier, start, end)
> + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> notifier->ops->invalidate_end(notifier, start, end);
>
> up_read(&rm->lock);
> @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
> }
> spin_unlock(&inode->i_lock);
>
> - xa_for_each_range(&rm->bindings, index, notifier, start, end)
> + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> notifier->ops->error(notifier, start, end);
> break;
> }
> @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> if (exclusive != rm->exclusive)
> goto out_unlock;
>
> - if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> + if (exclusive &&
> + xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
> goto out_unlock;
> }
>
> - xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> + xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
> rm->exclusive = exclusive;
> ret = 0;
> out_unlock:
> @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> struct restrictedmem *rm = file->f_mapping->private_data;
>
> down_write(&rm->lock);
> - xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> + xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
> synchronize_rcu();
> up_write(&rm->lock);
> }

2023-03-24 02:26:15

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Wed, Mar 22, 2023 at 08:27:37PM -0500, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > > Hi Sean,
> > >
> > > We've rebased the SEV+SNP support onto your updated UPM base support
> > > tree and things seem to be working okay, but we needed some fixups on
> > > top of the base support get things working, along with 1 workaround
> > > for an issue that hasn't been root-caused yet:
> > >
> > > https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> > >
> > > *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> > > *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
> >
> > What I'm seeing is Slot#3 gets added first and then deleted. When it's
> > gets added, Slot#0 already has the same range bound to restrictedmem so
> > trigger the exclusive check. This check is exactly the current code for.
>
> With the following change in QEMU, we no longer trigger this check:
>
> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
> index 20da121374..849b5de469 100644
> --- a/hw/pci-host/q35.c
> +++ b/hw/pci-host/q35.c
> @@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
> memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
> mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
> MCH_HOST_BRIDGE_SMRAM_C_SIZE);
> + memory_region_set_enabled(&mch->open_high_smram, false);
> memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
> &mch->open_high_smram, 1);
> - memory_region_set_enabled(&mch->open_high_smram, false);
>
> I'm not sure if QEMU is actually doing something wrong here though or if
> this check is putting tighter restrictions on userspace than what was
> expected before. Will look into it more.

I don't think the above QEMU change is acceptable upstream. It may break
functionality for 'normal' VMs.

The UPM check does put a tighter restriction on userspace: you can't bind
the same fd range to more than one memslot. SMRAM in QEMU, however,
violates this restriction. The right 'fix' is to disable SMM in QEMU for
UPM usages rather than trying to work around it. There is more discussion
in the link below:

https://lore.kernel.org/all/[email protected]/
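
For UPM testing, disabling SMM on the command line should keep QEMU from
creating the SMRAM memslots in the first place; the invocation below is only an
illustration and depends on the rest of the machine configuration:

	qemu-system-x86_64 -machine q35,smm=off ...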

Chao

>
> >
> > > *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > > *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> >
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using
> > exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind
> > and notifier callbacks) but fix it internally in the restrictedmem. E.g.
> > all the places where xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
>
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.
>
> >
> > > *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
> >
> > Subtracting slot->restrictedmem.index for start/end in
> > restrictedmem_get_gfn_range() is the correct fix.
> >
> > > *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> > >
> > > We plan to post an updated RFC for v8 soon, but also wanted to share
> > > the staging tree in case you end up looking at the UPM integration aspects
> > > before then.
> > >
> > > -Mike
> >
> > This is the restrictedmem fix to solve 'end' being stored and checked in xarray:
>
> Looks good.
>
> Thanks!
>
> -Mike
>
> >
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
> > */
> > down_read(&rm->lock);
> >
> > - xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> > notifier->ops->invalidate_start(notifier, start, end);
> >
> > ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> >
> > - xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> > notifier->ops->invalidate_end(notifier, start, end);
> >
> > up_read(&rm->lock);
> > @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
> > }
> > spin_unlock(&inode->i_lock);
> >
> > - xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> > notifier->ops->error(notifier, start, end);
> > break;
> > }
> > @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> > if (exclusive != rm->exclusive)
> > goto out_unlock;
> >
> > - if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> > + if (exclusive &&
> > + xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
> > goto out_unlock;
> > }
> >
> > - xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> > + xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
> > rm->exclusive = exclusive;
> > ret = 0;
> > out_unlock:
> > @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> > struct restrictedmem *rm = file->f_mapping->private_data;
> >
> > down_write(&rm->lock);
> > - xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> > + xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
> > synchronize_rcu();
> > up_write(&rm->lock);
> > }

2023-04-12 22:10:31

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Wed, Mar 22, 2023, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > > *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > > *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> >
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using
> > exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind
> > and notifier callbacks) but fix it internally in the restrictedmem. E.g.
> > all the places where xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
>
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.

Comically belated +1, XArray is the odd one here.

2023-04-13 01:14:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > This patch series implements KVM guest private memory for confidential
> > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > crash the running host system, this is terrible for multi-tenant
> > > > > configurations. The host accesses include those from KVM userspace like
> > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > via a fd-based approach, but it can never access the guest memory
> > > > > content.
> > > > >
> > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > reviews are always welcome.
> > > > > - 01: mm change, target for mm tree
> > > > > - 02-09: KVM change, target for KVM tree
> > > >
> > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > is available here:
> > > >
> > > > [email protected]:sean-jc/linux.git x86/upm_base_support
> > > >
> > > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > a WIP.
> > > >
> > >
> > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > bits (and also with Sean's branch above) I encounter the following NULL
> > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > (100% reproducible).
> > >
> > > It appears that in restrictedmem_error_page()
> > > inode->i_mapping->private_data is NULL in the
> > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > don't know why.
> >
> > Kirill, can you take a look? Or pass the buck to someone who can? :-)
>
> The patch below should help.
>
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..39ada985c7c0 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
>
> spin_lock(&sb->s_inode_list_lock);
> list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> - struct restrictedmem *rm = inode->i_mapping->private_data;
> struct restrictedmem_notifier *notifier;
> - struct file *memfd = rm->memfd;
> + struct restrictedmem *rm;
> unsigned long index;
> + struct file *memfd;
>
> - if (memfd->f_mapping != mapping)
> + if (atomic_read(&inode->i_count))

Kirill, should this be

if (!atomic_read(&inode->i_count))
continue;

i.e. skip unreferenced inodes, not skip referenced inodes?

2023-04-13 16:12:37

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Wed, Apr 12, 2023 at 06:07:28PM -0700, Sean Christopherson wrote:
> On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> > On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> > > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > >
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > > - 01: mm change, target for mm tree
> > > > > > - 02-09: KVM change, target for KVM tree
> > > > >
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > >
> > > > > [email protected]:sean-jc/linux.git x86/upm_base_support
> > > > >
> > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > >
> > > >
> > > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > > bits (and also with Sean's branch above) I encounter the following NULL
> > > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > > (100% reproducible).
> > > >
> > > > It appears that in restrictedmem_error_page()
> > > > inode->i_mapping->private_data is NULL in the
> > > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > > don't know why.
> > >
> > > Kirill, can you take a look? Or pass the buck to someone who can? :-)
> >
> > The patch below should help.
> >
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > index 15c52301eeb9..39ada985c7c0 100644
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> >
> > spin_lock(&sb->s_inode_list_lock);
> > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> > - struct restrictedmem *rm = inode->i_mapping->private_data;
> > struct restrictedmem_notifier *notifier;
> > - struct file *memfd = rm->memfd;
> > + struct restrictedmem *rm;
> > unsigned long index;
> > + struct file *memfd;
> >
> > - if (memfd->f_mapping != mapping)
> > + if (atomic_read(&inode->i_count))
>
> Kirill, should this be
>
> if (!atomic_read(&inode->i_count))
> continue;
>
> i.e. skip unreferenced inodes, not skip referenced inodes?

Ouch. Yes.

But looking at other instances of s_inodes usage, I think we can drop the
check altogether. An inode cannot be completely freed until it is removed
from the s_inodes list.

While at it, replace list_for_each_entry_safe() with
list_for_each_entry(), as we don't remove anything from the list.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 55e99e6c09a1..8e8a4420d3d1 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -194,22 +194,19 @@ static int restricted_error_remove_page(struct address_space *mapping,
struct page *page)
{
struct super_block *sb = restrictedmem_mnt->mnt_sb;
- struct inode *inode, *next;
+ struct inode *inode;
pgoff_t start, end;

start = page->index;
end = start + thp_nr_pages(page);

spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct restrictedmem_notifier *notifier;
struct restrictedmem *rm;
unsigned long index;
struct file *memfd;

- if (atomic_read(&inode->i_count))
- continue;
-
spin_lock(&inode->i_lock);
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
spin_unlock(&inode->i_lock);
--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-17 14:55:55

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <[email protected]> wrote:
> >
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > > Sean Christopherson <[email protected]> wrote:
> > > >
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > >
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > > - 01: mm change, target for mm tree
> > > > > > - 02-09: KVM change, target for KVM tree
> > > > >
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > >
> > > > > [email protected]:sean-jc/linux.git x86/upm_base_support
> > > > >
> > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > >
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > > for TDX?
> > > > >
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> > > > > (and I mean that).
> > > > >
> > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > > (SEV and TDX). For tests, I want to build a lists of tests that are required for
> > > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > >
> > > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > > TDX and SNP series and see how everything fits together. Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > > don't miss an opportunity to make things simpler. The patches in the SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > > details. Nothing remotely major, but something that needs attention since it'll
> > > > > be uAPI.
> > > >
> > > > Although I'm still debuging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held. the race to change
> > > > private/shared is handled by mmu_seq. Maybe dedicated function only for
> > > > kvm_faultin_pfn().
> > >
> > > Gah, you're not on the other thread where this was discussed[*]. Simply deleting
> > > the lockdep assertion is safe, for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > >
> > > I'll get a fixed version pushed out today.
> > >
> > > [*] https://lore.kernel.org/all/[email protected]
> >
> > Now I have tdx kvm working. I've uploaded at the followings.
> > It's rebased to v6.2-rc3.
> > [email protected]:yamahata/linux.git tdx/upm
> > [email protected]:yamahata/qemu.git tdx/upm
>
> And I finally got a working, building version updated and pushed out (again to):
>
> [email protected]:sean-jc/linux.git x86/upm_base_support
>
> Took longer than expected to get the memslot restrictions sussed out. I'm done
> working on the code for now, my plan is to come back to it+TDX+SNP in 2-3 weeks
> to resolve any remaining todos (that no one else tackles) and to do the whole
> "merge the world" exercise.

Hi Sean,

In case you have started working on the code again, I have a branch [1]
originally planned as the v11 candidate, in which I believe I addressed all
the discussions we had for v10 except the very latest one [2] and integrated
all the newly added selftests from Ackerley and myself. The branch was
based on your original upm_base_support and then rebased onto your
kvm-x86/mmu head. Feel free to take anything you think is useful (most of
the changes are trivial, but there are also some bug fixes).

[1] https://github.com/chao-p/linux/commits/privmem-v11.6
[2] https://lore.kernel.org/all/[email protected]/

Chao
>
> > kvm_mmu_do_page_fault() needs the following change.
> > kvm_mem_is_private() queries mem_attr_array. kvm_faultin_pfn() also uses
> > kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> > make sense. This change would belong to TDX KVM patches, though.
>
> Yeah, SNP needs similar treatment. Sorting that out is high up on the todo list.

2023-04-17 15:02:43

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

On Mon, Apr 17, 2023, Chao Peng wrote:
> In case you have started working on the code again, I have a branch [1]
> originally planned as the v11 candidate, in which I believe I addressed all
> the discussions we had for v10 except the very latest one [2] and integrated
> all the newly added selftests from Ackerley and myself. The branch was
> based on your original upm_base_support and then rebased onto your
> kvm-x86/mmu head. Feel free to take anything you think is useful (most of
> the changes are trivial, but there are also some bug fixes).

Nice! I am going to work on splicing together the various series this week, and
I'll make sure to grab your work.

Thanks much!

2023-04-17 15:50:30

by Sean Christopherson

[permalink] [raw]
Subject: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

What do y'all think about renaming "restrictedmem" to "guardedmem"?

I want to start referring to the code/patches by its syscall/implementation name
instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
and not just the non-KVM code, and (c) will likely be confusing for future reviewers
since there's nothing in the code that mentions "UPM" in any way.

But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
already used to refer to "reserved memory".

Renaming the syscall to "guardedmem"...

1. Allows for a shorthand and namespace, "gmem", that isn't already in use by
the kernel (see "reserved memory above").

2. Provides a stronger hint as to its purpose. "Restricted" conveys that the
allocated memory is limited in some way, but doesn't capture how the memory
is restricted, e.g. "restricted" could just as easily mean that the allocation
can be restricted to certain types of backing stores or something. "Guarded"
on the other hand captures that the memory has extra defenses of some form.

3. Is shorter to type and speak. Work smart, not hard :-)

4. Isn't totally wrong for the KVM use case if someone assumes the "g" means
"guest" when reading mail and whatnot.


P.S. I trimmed the Cc/To substantially for this particular discussion to avoid
spamming folks that don't (yet) care about this stuff with another potentially
lengthy thread. Feel free to add (back) any people/lists.

2023-04-17 15:51:50

by David Hildenbrand

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On 17.04.23 17:40, Sean Christopherson wrote:
> What do y'all think about renaming "restrictedmem" to "guardedmem"?

Yeay, let's add more confusion :D

If we're renaming anyway, I'd appreciate it if we could find terminology that
looks/sounds less horrible.

>
> I want to start referring to the code/patches by its syscall/implementation name
> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> since there's nothing in the code that mentions "UPM" in any way.
>
> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> already used to refer to "reserved memory".
>
> Renaming the syscall to "guardedmem"...

restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...

--
Thanks,

David / dhildenb

2023-04-17 16:42:25

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Mon, Apr 17, 2023, David Hildenbrand wrote:
> On 17.04.23 17:40, Sean Christopherson wrote:
> > I want to start referring to the code/patches by its syscall/implementation name
> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > since there's nothing in the code that mentions "UPM" in any way.
> >
> > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > already used to refer to "reserved memory".
> >
> > Renaming the syscall to "guardedmem"...
>
> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...

I'm definitely open to other suggestions, but I suspect it's going to be difficult
to be more precise than something like "guarded".

E.g. we discussed "unmappable" at one point, but the memory can still be mapped,
just not via mmap(). And it's not just about mappings, e.g. read() and its many
variants are all disallowed too, despite the kernel direct map still being live
(modulo SNP requirements).

2023-04-17 17:16:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On 17.04.23 18:40, Sean Christopherson wrote:
> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>>> I want to start referring to the code/patches by its syscall/implementation name
>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>> since there's nothing in the code that mentions "UPM" in any way.
>>>
>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>> already used to refer to "reserved memory".
>>>
>>> Renaming the syscall to "guardedmem"...
>>
>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>
> I'm definitely open to other suggestions, but I suspect it's going to be difficult
> to be more precise than something like "guarded".

Guardedmem is just as bad as restrictedmem IMHO, sorry.


Restricted: what's restricted? How does the restriction manifest?
secretmem also has its restrictions/limitations (pinning), so why does
that one not fall under the same category?

Make a stranger guess what "restrictedmem" is and I can guarantee that
it has nothing to do with the concept we're introducing here.


Guarded: what's guarded? From whom? For which purpose? How does the
"guarding" manifest?

Again, make a stranger guess what "guardedmem" is and I can guarantee
that it has nothing to do with the concept we're introducing here.

If anything, the guess might be "guarded storage" [1] on s390x, which, of
course, has nothing to do with the concept here. (storage on s390x is
just the dinosaur slang for memory)


Often, if we fail to find a good name, the concept is either unclear or
not well defined.

So what are the characteristics we want to generalize under that new
name? We want to have an fd, that

(a) cannot be mapped into user space (mmap)
(b) cannot be accessed using ordinary system calls (read/write)
(c) can still be managed like other fds (fallocate, future NUMA
policies?)
(d) can be consumed by some special entities that are allowed to
read/write/map.

So the fd content is inaccessible using the ordinary POSIX syscalls.
It's only accessible by special entities (e.g., KVM).

Most probably I am forgetting something. But maybe that will help to
find a more expressive name. Maybe :)
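
To make (a)-(d) concrete, a minimal userspace sketch, assuming the
memfd_restricted() syscall from this series (the syscall number below is a
placeholder, not a final ABI number, and error handling is omitted), could
look like:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451        /* placeholder, arch/series dependent */
#endif

int main(void)
{
        int fd = syscall(__NR_memfd_restricted, 0);
        char buf[16];
        void *p;

        /* (c) still manageable like any other fd, e.g. sized via fallocate() */
        fallocate(fd, 0, 0, 2 * 1024 * 1024);

        /* (a) mmap() is expected to fail ... */
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                printf("mmap: %s\n", strerror(errno));

        /* (b) ... and so are ordinary read()/write() */
        if (read(fd, buf, sizeof(buf)) < 0)
                printf("read: %s\n", strerror(errno));

        /* (d) the fd would instead be handed to KVM via a memslot */
        close(fd);
        return 0;
}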

>
> E.g. we discussed "unmappable" at one point, but the memory can still be mapped,
> just not via mmap(). And it's not just about mappings, e.g. read() and its many
> variants are all disallowed too, despite the kernel direct map still being live
> (modulo SNP requirements).
>

[1] https://man.archlinux.org/man/s390_guarded_storage.2.en

--
Thanks,

David / dhildenb

2023-04-17 17:16:10

by Ackerley Tng

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

Sean Christopherson <[email protected]> writes:

> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>> > I want to start referring to the code/patches by its
>> syscall/implementation name
>> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the
>> broader effort
>> > and not just the non-KVM code, and (c) will likely be confusing for
>> future reviewers
>> > since there's nothing in the code that mentions "UPM" in any way.
>> >
>> > But typing out restrictedmem is quite tedious, and git grep shows
>> that "rmem" is
>> > already used to refer to "reserved memory".
>> >
>> > Renaming the syscall to "guardedmem"...

>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask
>> me ...

> I'm definitely open to other suggestions, but I suspect it's going to be
> difficult
> to be more precise than something like "guarded".

> E.g. we discussed "unmappable" at one point, but the memory can still be
> mapped,
> just not via mmap(). And it's not just about mappings, e.g. read() and
> its many
> variants are all disallowed too, despite the kernel direct map still
> being live
> (modulo SNP requirements).

I'm for renaming the concept because restrictedmem is quite a
mouthful. :)

How about "concealedmem" or "obscuredmem" to highlight the idea of this
memory being hidden/unreadable/unmappable from userspace?

Guarded is better than restricted but doesn't really highlight how/in
what way it is being guarded.

2023-04-17 18:29:27

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Mon, Apr 17, 2023, Ackerley Tng wrote:
> Sean Christopherson <[email protected]> writes:
>
> > On Mon, Apr 17, 2023, David Hildenbrand wrote:
> > > On 17.04.23 17:40, Sean Christopherson wrote:
> > > > I want to start referring to the code/patches by its
> > > syscall/implementation name
> > > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to
> > > the broader effort
> > > > and not just the non-KVM code, and (c) will likely be confusing
> > > for future reviewers
> > > > since there's nothing in the code that mentions "UPM" in any way.
> > > >
> > > > But typing out restrictedmem is quite tedious, and git grep shows
> > > that "rmem" is

Your mail client appears to be wrapping too aggressively and mangling quotes. I'm
guessing gmail is to blame?

> > > > already used to refer to "reserved memory".
> > > >
> > > > Renaming the syscall to "guardedmem"...
>
> > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask
> > > me ...
>
> > I'm definitely open to other suggestions, but I suspect it's going to be
> > difficult
> > to be more precise than something like "guarded".
>
> > E.g. we discussed "unmappable" at one point, but the memory can still be
> > mapped,
> > just not via mmap(). And it's not just about mappings, e.g. read() and
> > its many
> > variants are all disallowed too, despite the kernel direct map still
> > being live
> > (modulo SNP requirements).
>
> I'm for renaming the concept because restrictedmem is quite a
> mouthful. :)
>
> How about "concealedmem" or "obscuredmem" to highlight the idea of this
> memory being hidden/unreadable/unmappable from userspace?

I'm hesitant to use something like "concealed" because it's too close to secretmem,
e.g. might be misconstrued as concealing the memory from anything _but_ the process
that creates the file.

Obscured has similar problems, and obscure often suggests that something is unclear,
as opposed to outright unreachable.

The other aspect of hidden/concealed/etc is that the memory isn't necessarily
concealed from the user. Long term, I hope to get to the point where even "normal"
VMs use restricted/guarded/??? memory, e.g. to guard (heh) against _unintentional_
access from userspace. In that use case, the memory isn't truly concealed, especially
if the user is both the "admin" and the consumer.

Though by that argument, "guarded" is also a poor choice. And now I'm remembering
how we ended up with "restricted"...

> Guarded is better than restricted but doesn't really highlight how/in
> what way it is being guarded.

Ya, though in practice I think it's infeasible for us to get a name that is both
precise and succinct.

2023-04-17 19:27:15

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Mon, Apr 17, 2023, David Hildenbrand wrote:
> On 17.04.23 18:40, Sean Christopherson wrote:
> > On Mon, Apr 17, 2023, David Hildenbrand wrote:
> > > On 17.04.23 17:40, Sean Christopherson wrote:
> > > > I want to start referring to the code/patches by its syscall/implementation name
> > > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > > > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > > > since there's nothing in the code that mentions "UPM" in any way.
> > > >
> > > > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > > > already used to refer to "reserved memory".
> > > >
> > > > Renaming the syscall to "guardedmem"...
> > >
> > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
> >
> > I'm definitely open to other suggestions, but I suspect it's going to be difficult
> > to be more precise than something like "guarded".
>
> Guardedmem is just as bad as restrictedmem IMHO, sorry.
>
>
> Restricted: what's restricted? how does the restriction manifest? secretmem
> also has it's restrictions/limitations (pinning), why does that one not fall
> under the same category?
>
> Make a stranger guess what "restrictedmem" is and I can guarantee that it
> has nothing to do with the concept we're introducing here.
>
>
> Guarded: what's guarded? From whom? For which purpose? How does the
> "guarding" manifest?

I completely agree that "guarded" lacks precision, but as above, I've pretty much
given up hope of being super precise. I actually like "restricted", I just don't
like that I can't shorten the name.

Hmm, maybe that won't be a huge problem in practice. I can't say I've ever heard
any use "rmem" in verbale or written communication, it's primarily just "rmem" in
code that we can't use, and I don't mind having to use restrictedmem for the namespace.
So maybe we can use "rmem", just not in code?

Or, we could pretend we're pirates and call it arrrmem!, which is definitely going
to be how I refer to it in my internal dialogue if we keep "restricted" :-)

> Again, make a stranger guess what "guardedmem" is and I can guarantee that
> it has nothing to do with the concept we're introducing here.
>
> If, at all, the guess might be "guarded storage" [1] on s390x, which, of
> course, has nothing to do with the concept here.

Oof, and guarded storage is even documented in Documentation/virt/kvm/api.rst.

> (storage on s390x is just the dinosaur slang for memory)
>
>
> Often, if we fail to find a good name, the concept is either unclear or not
> well defined.
>
> So what are the characteristics we want to generalize under that new name?
> We want to have an fd, that
>
> (a) cannot be mapped into user space (mmap)
> (b) cannot be accessed using ordinary system calls (read/write)
> (c) can still be managed like other fds (fallocate, future NUMA
> policies?)
> (d) can be consumed by some special entities that are allowed to
> read/write/map.
>
> So the fd content is inaccessible using the ordinary POSIX syscalls. It's
> only accessible by special entities (e.g., KVM).
>
> Most probably I am forgetting something. But maybe that will help to find a
> more expressive name. Maybe :)

Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
and depending on the use case, the memory may not actually be concealed from the
user that controls the VMM.

Restricted - "rmem" collides with "reserved memory" in code.

Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.

Inaccessible - Many of the same problems as "hidden".

Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
the memory is still mappable, just not via mmap().

Secured - I'm not getting anywhere near this one :-)

2023-04-18 09:04:10

by Fuad Tabba

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Mon, Apr 17, 2023 at 8:16 PM Sean Christopherson <[email protected]> wrote:
....

> > So the fd content is inaccessible using the ordinary POSIX syscalls. It's
> > only accessible by special entities (e.g., KVM).
> >
> > Most probably I am forgetting something. But maybe that will help to find a
> > more expressive name. Maybe :)
>
> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> and depending on the use case, the memory may not actually be concealed from the
> user that controls the VMM.
>
> Restricted - "rmem" collides with "reserved memory" in code.
>
> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
>
> Inaccessible - Many of the same problems as "hidden".
>
> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> the memory is still mappable, just not via mmap().
>
> Secured - I'm not getting anywhere near this one :-)

How about "protected" ;)? _ducks_

To me the name doesn't matter much, but fwiw I have developed a liking
to "restricted", more than the previous "private", since of all of the
one-word suggestions I think it captures most of what it's trying to
do.

Cheers,
/fuad

2023-04-18 09:16:38

by David Hildenbrand

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On 17.04.23 21:16, Sean Christopherson wrote:
> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 18:40, Sean Christopherson wrote:
>>> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>>>> On 17.04.23 17:40, Sean Christopherson wrote:
>>>>> I want to start referring to the code/patches by its syscall/implementation name
>>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>>>> since there's nothing in the code that mentions "UPM" in any way.
>>>>>
>>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>>>> already used to refer to "reserved memory".
>>>>>
>>>>> Renaming the syscall to "guardedmem"...
>>>>
>>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>>>
>>> I'm definitely open to other suggestions, but I suspect it's going to be difficult
>>> to be more precise than something like "guarded".
>>
>> Guardedmem is just as bad as restrictedmem IMHO, sorry.
>>
>>
>> Restricted: what's restricted? how does the restriction manifest? secretmem
>> also has it's restrictions/limitations (pinning), why does that one not fall
>> under the same category?
>>
>> Make a stranger guess what "restrictedmem" is and I can guarantee that it
>> has nothing to do with the concept we're introducing here.
>>
>>
>> Guarded: what's guarded? From whom? For which purpose? How does the
>> "guarding" manifest?
>
> I completely agree that "guarded" lacks precision, but as above, I've pretty much
> given up hope of being super precise. I actually like "restricted", I just don't
> like that I can't shorten the name.
>
> Hmm, maybe that won't be a huge problem in practice. I can't say I've ever heard
> any use "rmem" in verbale or written communication, it's primarily just "rmem" in
> code that we can't use, and I don't mind having to use restrictedmem for the namespace.
> So maybe we can use "rmem", just not in code?
>
> Or, we could pretend we're pirates and call it arrrmem!, which is definitely going
> to be how I refer to it in my internal dialogue if we keep "restricted" :-)

:)

>
>> Again, make a stranger guess what "guardedmem" is and I can guarantee that
>> it has nothing to do with the concept we're introducing here.
>>
>> If, at all, the guess might be "guarded storage" [1] on s390x, which, of
>> course, has nothing to do with the concept here.
>
> Oof, and guarded storage is even documented in Documentation/virt/kvm/api.rst.
>
>> (storage on s390x is just the dinosaur slang for memory)
>>
>>
>> Often, if we fail to find a good name, the concept is either unclear or not
>> well defined.
>>
>> So what are the characteristics we want to generalize under that new name?
>> We want to have an fd, that
>>
>> (a) cannot be mapped into user space (mmap)
>> (b) cannot be accessed using ordinary system calls (read/write)
>> (c) can still be managed like other fds (fallocate, future NUMA
>> policies?)
>> (d) can be consumed by some special entities that are allowed to
>> read/write/map.
>>
>> So the fd content is inaccessible using the ordinary POSIX syscalls. It's
>> only accessible by special entities (e.g., KVM).
>>
>> Most probably I am forgetting something. But maybe that will help to find a
>> more expressive name. Maybe :)
>
> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> and depending on the use case, the memory may not actually be concealed from the
> user that controls the VMM.
>
> Restricted - "rmem" collides with "reserved memory" in code.
>
> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
>
> Inaccessible - Many of the same problems as "hidden".
>
> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> the memory is still mappable, just not via mmap().
>
> Secured - I'm not getting anywhere near this one :-)

The thing about "secretmem" that I kind-of like (a little) is that it
explains what it's used for: storing secrets. We don't call it
"unmapped" memory because we unmap it from the directmap or "unpinnable"
memory or "inaccessible" memory ... or even "restricted" because it has
restrictions ... how the secrets are protected is kind of an
implementation detail.

So instead of describing *why*/*how* restrictedmem is the weird kid
(restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather
describe what it's used for?

I know, I know, "there are other use cases where it will be used outside
of VM context". I really don't care. "memfd_vm" / "vm_mem" would be sooo
(feel free to add some more o's here) much easier to get. It's a special
fd to be used to back VM memory. Depending on the VM type
(encrypted/protected/whatever), restrictions might apply (not able to
mmap, not able to read/write ...). For example, there really is no need
to disallow mmap/read/write when using that memory to back a simple VM
where all we want to do is avoid user-space page tables.

--
Thanks,

David / dhildenb

2023-04-18 17:09:47

by Ackerley Tng

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

Sean Christopherson <[email protected]> writes:

> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>> > I want to start referring to the code/patches by its
>> syscall/implementation name
>> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the
>> broader effort
>> > and not just the non-KVM code, and (c) will likely be confusing for
>> future reviewers
>> > since there's nothing in the code that mentions "UPM" in any way.
>> >
>> > But typing out restrictedmem is quite tedious, and git grep shows
>> that "rmem" is
>> > already used to refer to "reserved memory".
>> >
>> > Renaming the syscall to "guardedmem"...

>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask
>> me ...

> I'm definitely open to other suggestions, but I suspect it's going to be
> difficult
> to be more precise than something like "guarded".

> E.g. we discussed "unmappable" at one point, but the memory can still be
> mapped,
> just not via mmap(). And it's not just about mappings, e.g. read() and
> its many
> variants are all disallowed too, despite the kernel direct map still
> being live
> (modulo SNP requirements).

How about "opaque"?

I think opaque captures the idea of enforced information hiding from the
user(space), and that the contents can only be manipulated via internal
(kernel) functions.

2023-04-19 00:57:25

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Tue, Apr 18, 2023, David Hildenbrand wrote:
> On 17.04.23 21:16, Sean Christopherson wrote:
> > Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> > and depending on the use case, the memory may not actually be concealed from the
> > user that controls the VMM.
> >
> > Restricted - "rmem" collides with "reserved memory" in code.
> >
> > Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
> >
> > Inaccessible - Many of the same problems as "hidden".
> >
> > Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> > the memory is still mappable, just not via mmap().
> >
> > Secured - I'm not getting anywhere near this one :-)
>
> The think about "secretmem" that I kind-of like (a little) is that it
> explains what it's used for: storing secrets. We don't call it "unmapped"
> memory because we unmap it from the directmap or "unpinnable" memory or
> "inaccessible" memory ... or even "restricted" because it has restrictions
> ... how the secrets are protected is kind of an implementation detail.
>
> So instead of describing *why*/*how* restrictedmem is the weird kid
> (restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather
> describe what it's used for?
>
> I know, I know, "there are other use cases where it will be used outside of
> VM context". I really don't care.

Heh, we originally proposed F_SEAL_GUEST, but that was also sub-optimal[1] ;-)

> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
> much easier to get. It's a special fd to be used to back VM memory. Depending
> on the VM type (encrypted/protected/whatever), restrictions might apply (not
> able to mmap, not able to read/write ...). For example, there really is no
> need to disallow mmap/read/write when using that memory to back a simple VM
> where all we want to do is avoid user-space page tables.

In seriousness, I do agree with Jason's very explicit objection[2] against naming
a non-KVM uAPI "guest", or any variation thereof.

An alternative that we haven't considered since the very early days is making the
uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall. Looking at the
code for "pure shim" implementation[3], that's actually not that crazy of an idea.

I don't know that I love the idea of burying this in KVM, but there are benefits
to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
that I created).

The big benefit is that the layer of indirection goes away. That simplifies things
like enhancing restrictedmem to allow host userspace access for debug purposes,
batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
access, likely the whole "share with a device" story if/when we get there, etc.

The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
that's likely to be true in practice regardless of where the code lands, and
(b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
someone reinventing a similar solution.

If we can get Gunyah on board and they don't need substantial changes to the
restrictedmem implementation, then I'm all for continuing on the path we're on.
But if Gunyah wants to do their own thing, and the lightweight shim approach is
viable, then it's awfully tempting to put this all behind a KVM ioctl().

[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]
[3] https://lore.kernel.org/all/ZDiCG%[email protected]
[4] https://lore.kernel.org/all/Y%[email protected]
[5] https://lore.kernel.org/all/[email protected]

2023-04-19 07:26:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On 19.04.23 02:47, Sean Christopherson wrote:
> On Tue, Apr 18, 2023, David Hildenbrand wrote:
>> On 17.04.23 21:16, Sean Christopherson wrote:
>>> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
>>> and depending on the use case, the memory may not actually be concealed from the
>>> user that controls the VMM.
>>>
>>> Restricted - "rmem" collides with "reserved memory" in code.
>>>
>>> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
>>>
>>> Inaccessible - Many of the same problems as "hidden".
>>>
>>> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
>>> the memory is still mappable, just not via mmap().
>>>
>>> Secured - I'm not getting anywhere near this one :-)
>>
>> The think about "secretmem" that I kind-of like (a little) is that it
>> explains what it's used for: storing secrets. We don't call it "unmapped"
>> memory because we unmap it from the directmap or "unpinnable" memory or
>> "inaccessible" memory ... or even "restricted" because it has restrictions
>> ... how the secrets are protected is kind of an implementation detail.
>>
>> So instead of describing *why*/*how* restrictedmem is the weird kid
>> (restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather
>> describe what it's used for?
>>
>> I know, I know, "there are other use cases where it will be used outside of
>> VM context". I really don't care.
>
> Heh, we originally proposed F_SEAL_GUEST, but that was also sub-optimal[1] ;-)
>
>> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
>> much easier to get. It's a special fd to be used to back VM memory. Depending
>> on the VM type (encrypted/protected/whatever), restrictions might apply (not
>> able to mmap, not able to read/write ...). For example, there really is no
>> need to disallow mmap/read/write when using that memory to back a simple VM
>> where all we want to do is avoid user-space page tables.
>
> In seriousness, I do agree with Jason's very explicit objection[2] against naming
> a non-KVM uAPI "guest", or any variation thereof.

While I agree, it's all better than the naming we use right now ...


Let me throw "tee_mem" / "memfd_tee" into the picture. That could
eventually capture what we want to have.

Or "coco_mem" / "memfd_coco".

Of course, both expect that people know the terminology (just like what
"vm" stands for), but it's IMHO significantly better than
restricted/guarded/opaque/whatsoever.

Again, expresses what it's used for, not why it behaves in weird ways.


>
> An alternative that we haven't considered since the very early days is making the
> uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall. Looking at the
> code for "pure shim" implementation[3], that's actually not that crazy of an idea.

Yes.

>
> I don't know that I love the idea of burying this in KVM, but there are benefits
> to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
> that I created).

Yes, it's all better than jumping through hoops to come up with a bad
name like "restrictedmem".

>
> The big benefit is that the layer of indirection goes away. That simplifies things
> like enhancing restrictedmem to allow host userspace access for debug purposes,
> batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
> access, likely the whole "share with a device" story if/when we get there, etc.
>
> The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
> that's likely to be true in practice regardless of where the code lands, and

Yes.

> (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> someone reinventing a similar solution.

I agree. But if it's as simple as providing an ioctl for that hypervisor
that simply wires up the existing implementation, it's not too bad.

>
> If we can get Gunyah on board and they don't need substantial changes to the
> restrictedmem implementation, then I'm all for continuing on the path we're on.
> But if Gunyah wants to do their own thing, and the lightweight shim approach is
> viable, then it's awfully tempting to put this all behind a KVM ioctl().

Right. Or we still succeed in finding a name that's not as bad as what
we had so far.

--
Thanks,

David / dhildenb

2023-04-19 15:31:47

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Wed, Apr 19, 2023, David Hildenbrand wrote:
> On 19.04.23 02:47, Sean Christopherson wrote:
> > On Tue, Apr 18, 2023, David Hildenbrand wrote:
> > > "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
> > > much easier to get. It's a special fd to be used to back VM memory. Depending
> > > on the VM type (encrypted/protected/whatever), restrictions might apply (not
> > > able to mmap, not able to read/write ...). For example, there really is no
> > > need to disallow mmap/read/write when using that memory to back a simple VM
> > > where all we want to do is avoid user-space page tables.
> >
> > In seriousness, I do agree with Jason's very explicit objection[2] against naming
> > a non-KVM uAPI "guest", or any variation thereof.
>
> While I agree, it's all better than the naming we use right now ...
>
>
> Let me throw "tee_mem" / "memfd_tee" into the picture. That could eventually
> catch what we want to have.
>
> Or "coco_mem" / "memfd_coco".
>
> Of course, both expect that people know the terminology (just like what "vm"
> stands for), but it's IMHO significantly better than
> restricted/guarded/opaque/whatsoever.
>
> Again, expresses what it's used for, not why it behaves in weird ways.

I don't want to explicitly tie this to trusted execution or confidential compute,
as there is value in backing "normal" guests with memory that cannot be accessed
by the host userspace without jumping through a few extra hoops, e.g. to add a
layer of protection against data corruption due to host userspace bugs.

> > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> > someone reinventing a similar solution.
>
> I agree. But if it's as simple as providing an ioctl for that hypervisor
> that simply wires up the existing implementation, it's not too bad.

Yeah, my mind was wandering in this direction too. The absolute worst case
scenario seems to be that we do end up creating a generic syscall that is a
superset of KVM's functionality, in which case KVM would end up with an ioctl()
that is just a redirect/wrapper.

2023-04-19 15:32:05

by David Hildenbrand

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On 19.04.23 17:17, Sean Christopherson wrote:
> On Wed, Apr 19, 2023, David Hildenbrand wrote:
>> On 19.04.23 02:47, Sean Christopherson wrote:
>>> On Tue, Apr 18, 2023, David Hildenbrand wrote:
>>>> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
>>>> much easier to get. It's a special fd to be used to back VM memory. Depending
>>>> on the VM type (encrypted/protected/whatever), restrictions might apply (not
>>>> able to mmap, not able to read/write ...). For example, there really is no
>>>> need to disallow mmap/read/write when using that memory to back a simple VM
>>>> where all we want to do is avoid user-space page tables.
>>>
>>> In seriousness, I do agree with Jason's very explicit objection[2] against naming
>>> a non-KVM uAPI "guest", or any variation thereof.
>>
>> While I agree, it's all better than the naming we use right now ...
>>
>>
>> Let me throw "tee_mem" / "memfd_tee" into the picture. That could eventually
>> catch what we want to have.
>>
>> Or "coco_mem" / "memfd_coco".
>>
>> Of course, both expect that people know the terminology (just like what "vm"
>> stands for), but it's IMHO significantly better than
>> restricted/guarded/opaque/whatsoever.
>>
>> Again, expresses what it's used for, not why it behaves in weird ways.
>
> I don't want to explicitly tie this to trusted execution or confidential compute,
> as there is value in backing "normal" guests with memory that cannot be accessed
> by the host userspace without jumping through a few extra hoops, e.g. to add a
> layer of protection against data corruption due to host userspace bugs.

Nothing speaks against using tee_mem for the same purpose I guess. I
like the sound of it after all. :)

--
Thanks,

David / dhildenb

2023-04-22 01:33:56

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

+Christian and Hugh

On Wed, Apr 19, 2023, David Hildenbrand wrote:
> On 19.04.23 02:47, Sean Christopherson wrote:
> > An alternative that we haven't considered since the very early days is making the
> > uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall. Looking at the
> > code for "pure shim" implementation[3], that's actually not that crazy of an idea.
>
> Yes.
>
> >
> > I don't know that I love the idea of burying this in KVM, but there are benefits
> > to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
> > that I created).
>
> Yes, it's all better than jumping through hoops to come up with a bad name
> like "restrictedmem".
>
> >
> > The big benefit is that the layer of indirection goes away. That simplifies things
> > like enhancing restrictedmem to allow host userspace access for debug purposes,
> > batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
> > access, likely the whole "share with a device" story if/when we get there, etc.
> >
> > The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
> > that's likely to be true in practice regardless of where the code lands, and
>
> Yes.
>
> > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> > someone reinventing a similar solution.
>
> I agree. But if it's as simple as providing an ioctl for that hypervisor
> that simply wires up the existing implementation, it's not too bad.
>
> >
> > If we can get Gunyah on board and they don't need substantial changes to the
> > restrictedmem implementation, then I'm all for continuing on the path we're on.
> > But if Gunyah wants to do their own thing, and the lightweight shim approach is
> > viable, then it's awfully tempting to put this all behind a KVM ioctl().
>
> Right. Or we still succeed in finding a name that's not as bad as what we
> had so far.

Okie dokie. So as not to bury the lede:

I think we should provide a KVM ioctl() and have KVM provide its own low-level
implementation, i.e. not wrap shmem. As much as I don't want to have KVM serving
up memory to userspace, by trying to keep KVM out of the memory management business,
we're actually making things *more* complex and harder to maintain (and merge).

Hugh said something to this effect quite early on[1], it just unfortunately took
me a long time to understand the benefits.

: QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
: the fallocate syscall interface itself) to allocate and free the memory,
: ioctl for initializing some of it too. KVM in control of whether that
: fd can be read or written or mmap'ed or whatever, no need to prevent it
: in shmem.c, no need for flags, seals, notifications to and fro because
: KVM is already in control and knows the history. If shmem actually has
: value, call into it underneath - somewhat like SysV SHM, and /dev/zero
: mmap, and i915/gem make use of it underneath. If shmem has nothing to
: add, just allocate and free kernel memory directly, recorded in your
: own xarray.

Christian also suggested that we stop trying to be lazy and create a proper API[2].

: I find this hard to navigate tbh and it feels like taking a shortcut to
: avoid building a proper api. If you only care about a specific set of
: operations specific to memfd restricted that need to be available to
: in-kernel consumers, I wonder if you shouldn't just go one step further
: than your proposal below and build a dedicated minimal ops api. Idk,
: sketching like a madman on a drawing board here with no claim to
: feasibility from a mm perspective whatsoever

The key point from Hugh is that, except for a few minor things that are trivially
easy to replicate, the things that make shmem "shmem" don't provide any value for
the KVM use case:

- We have no plans to support swap, and migration support is dubious at best.
Swap in particular brings in a lot of complexity for no benefit (to us). That
complexity doesn't mean depending on shmem is inherently bad, but it does mean
rolling our own implementation is highly unlikely to result in reinventing
shmem's wheel.

- Sharing a backing file between arbitrary processes is _unwanted_ for the initial
use cases. There may come a time when mutually trusted VMs can share "private"
data, but (a) that's a distant future problem and (b) it'll likely require even
more KVM control over the memory.

- All of the interfaces for read/write/mmap/etc. are dead weight from our
perspective. Even worse, we have to actively avoid those interfaces. That
can kinda sorta be done without polluting shmem code by using a shim, but
that has problems of its own (see below).

And Christian pointed out several flaws with wrapping shmem:

- Implementing a partial overlay filesystem leads to inconsistencies because
only some of the ops are changed, e.g. poking at the inode_operations or
super_operations will show shmem stuff, whereas address_space_operations and
file_operations will show restrictedmem stuff.


- Usurping just f_ops and a_ops without creating a full blown filesystem
avoids the partial overlay issues, but partially overriding ops isn't really
supported (because it's weird and brittle), e.g. blindly forwarding a
fallocate() after restrictedmem does its thing "works", but only because
we very carefully and deliberately designed restrictedmem on top of shmem.

On top of the points raised by Hugh and Christian, wrapping shmem isn't really
any simpler, just different. E.g. there's a very subtle bug in the shim variant:
by passing SGP_WRITE, shmem skips zeroing the page because restrictedmem is telling
shmem that it (restrictedmem) will immediately write the page. For TDX and SNP,
that's a "feature" because in those cases trusted firmware will zero (or init)
memory when it's assigned to the guest, but it's a nasty flaw for other use cases.

I'm not saying that we'll magically avoid such bugs by avoiding shmem, just pointing
out that using shmem requires understanding exactly how shmem works, i.e. using
shmem isn't necessarily any easier than building directly on filemap and/or folio
APIs. And I gotta imagine it will be a similar story if/when we want to add
hugetlbfs support. Building on filemap/folio will definitely have its own challenges,
but after prototyping an implementation I can confidently say that none of the
problems will be insurmountable. KVM has _way_ more complex code in its memslot
and MMU implementations.

And another benefit of building on filemap and/or folio APIs is that, because we
would be reinventing the wheel to some extent, when we do inevitably run into
problems, it will be easier to get help solving those problems because (a) we won't
be doing weird things no one wants to deal with and (b) because the problems will
likely be things others have already dealt with.

The other angle I've been looking at is whether or not having KVM provide its
own implementation will lead to maintenance problems in the future, specifically
if we get to the point where we want to support "fancy" things like swap and
migration. For migration, I'm quite confident that a dedicated KVM ioctl() versus
wrapping shmem would be at worst a wash, and more than likely simpler if KVM owns
everything. E.g. migrating pages under TDX and SNP requires invoking magic
instructions, and so we'd be overriding ->migrate_folio() no matter what.

As for swap, I think we should put a stake in the ground and say that KVM will
never support swap for KVM's ioctl(). Sooo much of the infrastructure around
swap/reclaim is tied to userspace mappings, e.g. knowing which pages are LRU and/or
cold. I poked around a bit to see how we could avoid reinventing all of that
infrastructure for fd-only memory, and the best idea I could come up with is
basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
allow "mapping" fd-only memory, but ensure that memory is never actually present
from hardware's perspective.

But on top of the various problems with that approach, the only use cases I can
think of for using fd-only to back non-confidential VMs is to guard against spurious
writes/reads to guest memory and/or avoid memory overhead for mapping guest
memory into the user page tables. Avoiding memory overhead is completely defeated
if the page tables are populated PROT_NONE, which just leaves the "harden against
guest data corruption use case". And for that specific use case, _if_ swap is
desirable, it would be far, far easier to teach the kernel to allow KVM to follow
PROT_NONE (as Kirill's series did), as opposed to trying to teach the kernel and/or
KVM how to swap fd-only memory.

In other words, fd-only memory is purely for slice-of-hardware VMs. If someone
wants to overcommit VMs, then they use KVM's traditional API for mapping memory
into the guest.

Regarding putting this in KVM, as mentioned (way) above in the quote, the worst
case scenario of making this a KVM ioctl() seems to be that KVM will end up with
an ioctl() that redirects to common kernel code. On the flip side, implementing
this in KVM gives us a much clearer path to getting all of this merged. There will
be a few small non-KVM patches, but they will either be trivial, e.g. exporting
APIs (that aren't contentious) or not strictly required, e.g. adding a flag to
mark an address space as completely unmigratable (for improved performance). I.e.
the non-KVM patches will be small and be very actionable for their maintainers,
and other than that we control our own destiny.

And on the topic of control, making this a KVM ioctl() and implementation gives
KVM a _lot_ of control. E.g. we can make the file size immutable to simplify the
implementation, bind the fd to a single VM at creation time, add KVM-defined flags
for controlling hugepage behavior, etc.
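
One hypothetical shape for such a VM-scoped interface (the ioctl and struct
names below mirror the guest_memfd uAPI that eventually landed upstream and
are illustrative stand-ins, not necessarily the POC's exact interface; headers
new enough to define them are assumed, and error handling is omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Create a gmem fd bound to @vm_fd at creation time, then point a memslot at
 * (gmem_fd, offset) instead of a host virtual address. Depending on the VM
 * type, a userspace_addr for the shared counterpart may also be needed. */
static int add_private_slot(int vm_fd, __u64 gpa, __u64 size)
{
        struct kvm_create_guest_memfd gmem = {
                .size = size,
                .flags = 0,
        };
        int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

        struct kvm_userspace_memory_region2 region = {
                .slot = 0,
                .flags = KVM_MEM_GUEST_MEMFD,
                .guest_phys_addr = gpa,
                .memory_size = size,
                .guest_memfd = gmem_fd,
                .guest_memfd_offset = 0,
        };
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

        return gmem_fd;
}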

Last but not least, I think forcing us KVM folks to get our hands at least a little
dirty with MM and FS stuff would be a *good* thing. KVM has had more than a few
bugs that we missed in no small part because most of the people that work on KVM
have almost zero knowledge of MM and FS, especially at the boundaries between
those two.

Note, I implemented the POC on top of the filemap APIs because that was much faster
and less error prone than re-implementing xarray management myself. We may or may
not want to actually make the kernel at large aware of these allocations, i.e. it
may be better to follow Hugh's suggestion and use the folio APIs directly instead
of piggybacking filemap+folios. E.g. I _think_ migration becomes a complete non-issue
if the pages are "raw" allocations and never get marked as LRU-friendly. What I'm
not so sure about is if there is anything substantial that we'd lose out on by not
reusing the filemap stuff.

The POC also doesn't play nice with Ackerley's patches to allocate files on a
user-defined mount. AIUI, that's largely a non-issue as something like fbind()
would provide a superset of that functionality and more than one maintainer has
expressed (handwavy) support for a generic fbind().

Code is available here if folks want to take a look before any kind of formal
posting:

https://github.com/sean-jc/linux.git x86/kvm_gmem_solo

[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
[3] https://lore.kernel.org/linux-mm/[email protected]

2023-04-23 13:17:42

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Mon Apr 17, 2023 at 6:40 PM EEST, Sean Christopherson wrote:
> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>
> I want to start referring to the code/patches by its syscall/implementation name
> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> since there's nothing in the code that mentions "UPM" in any way.
>
> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> already used to refer to "reserved memory".
>
> Renaming the syscall to "guardedmem"...
>
> 1. Allows for a shorthand and namespace, "gmem", that isn't already in use by
> the kernel (see "reserved memory above").
>
> 2. Provides a stronger hint as to its purpose. "Restricted" conveys that the
> allocated memory is limited in some way, but doesn't capture how the memory
> is restricted, e.g. "restricted" could just as easily mean that the allocation
> can be restricted to certain types of backing stores or something. "Guarded"
> on the other hand captures that the memory has extra defenses of some form.
>
> 3. Is shorter to type and speak. Work smart, not hard :-)
>
> 4. Isn't totally wrong for the KVM use case if someone assumes the "g" means
> "guest" when reading mail and whatnot.
>
>
> P.S. I trimmed the Cc/To substantially for this particular discussion to avoid
> spamming folks that don't (yet) care about this stuff with another potentially
> lengthy thread. Feel free to add (back) any people/lists.

I guess 'guarded' could be a good noun in the sense that it does not
get easily mixed up with anything pre-existing, and it does give an idea
of the purpose.

BR, Jarkko

2023-04-23 13:33:50

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
> On 17.04.23 17:40, Sean Christopherson wrote:
> > What do y'all think about renaming "restrictedmem" to "guardedmem"?
>
> Yeay, let's add more confusion :D
>
> If we're at renaming, I'd appreciate if we could find a terminology that
> does look/sound less horrible.
>
> >
> > I want to start referring to the code/patches by its syscall/implementation name
> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > since there's nothing in the code that mentions "UPM" in any way.
> >
> > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > already used to refer to "reserved memory".
> >
> > Renaming the syscall to "guardedmem"...
>
> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...

In the world of TEEs and confidential computing it is fairly common to
call memory areas enclaves, even outside the SGX context. So in that sense
enclave memory would be the most correct terminology.

BR, Jarkko

2023-05-05 19:47:54

by Ackerley Tng

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)


Hi Sean,

Thanks for implementing this POC!

I’ve started porting the selftests (both Chao’s and those I added [1]).

guest mem seems to cover the use cases that have been discussed and
proposed so far, but I still need to figure out how gmem can work with

+ hugetlbfs
+ specification of/storing memory policy (for NUMA node bindings)
+ memory accounting - we may need to account for memory used separately,
so that guest mem shows up separately on /proc/meminfo and similar
places.

One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
not cleaned up, and hence it is possible to crash the host kernel in the
following way

1. Create a KVM VM
2. Create a guest mem fd on that VM
3. Create a memslot with the guest mem fd (hence binding the fd to the
VM)
4. Close/destroy the KVM VM
5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
when it tries to do invalidation.

I then tried to clean up the gmem->kvm pointer during unbinding when the
KVM VM is destroyed.

That works, but then I realized there’s a simpler way to use the pointer
after freeing:

1. Create a KVM VM
2. Create a guest mem fd on that VM
3. Close/destroy the KVM VM
4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
when it tries to do invalidation.
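
Roughly, the simpler sequence above looks like this (sketch only; the ioctl
and struct names follow the guest_memfd uAPI that later landed upstream,
where the lifetime issue is handled, and merely stand in for the POC's
interface; error handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        int kvm_fd = open("/dev/kvm", O_RDWR);

        /* 1. Create a KVM VM */
        int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

        /* 2. Create a guest mem fd on that VM; gmem->kvm now points at the VM */
        struct kvm_create_guest_memfd gmem = { .size = 2 * 1024 * 1024 };
        int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

        /* 3. Close/destroy the KVM VM; the gmem fd outlives it */
        close(vm_fd);
        close(kvm_fd);

        /* 4. PUNCH_HOLE triggers invalidation, which chases the stale gmem->kvm */
        fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, 2 * 1024 * 1024);

        close(gmem_fd);
        return 0;
}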

Perhaps binding should mean setting the gmem->kvm pointer in addition to
gmem->bindings. This makes binding and unbinding symmetric and avoids
the use-after-frees described above.

This also means that creating a guest mem fd is no longer dependent on
the VM. Perhaps we can make creating a gmem fd a system ioctl (like
KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl?

[1]
https://lore.kernel.org/all/[email protected]/T/

Ackerley

2023-05-05 20:11:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On 23.04.23 15:28, Jarkko Sakkinen wrote:
> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>>> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>>
>> Yeay, let's add more confusion :D
>>
>> If we're at renaming, I'd appreciate if we could find a terminology that
>> does look/sound less horrible.
>>
>>>
>>> I want to start referring to the code/patches by its syscall/implementation name
>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>> since there's nothing in the code that mentions "UPM" in any way.
>>>
>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>> already used to refer to "reserved memory".
>>>
>>> Renaming the syscall to "guardedmem"...
>>
>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>
> In the world of TEE's and confidential computing it is fairly common to
> call memory areas enclaves, even outside SGX context. So in that sense
> enclave memory would be the most correct terminology.

I was also thinking along the lines of isolated_mem or imem ...
essentially, isolated from (unprivileged) user space.

... if we still want to have a common syscall for it.

--
Thanks,

David / dhildenb

2023-05-06 00:59:04

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Fri, May 05, 2023, Ackerley Tng wrote:
>
> Hi Sean,
>
> Thanks for implementing this POC!
>
> I’ve started porting the selftests (both Chao’s and those I added [1]).
>
> guest mem seems to cover the use cases that have been discussed and
> proposed so far, but I still need to figure out how gmem can work with
>
> + hugetlbfs
> + specification of/storing memory policy (for NUMA node bindings)
> + memory accounting - we may need to account for memory used separately,
> so that guest mem shows up separately on /proc/meminfo and similar
> places.
>
> One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> not cleaned up, and hence it is possible to crash the host kernel in the
> following way
>
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Create a memslot with the guest mem fd (hence binding the fd to the
> VM)
> 4. Close/destroy the KVM VM
> 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> when it tries to do invalidation.
>
> I then tried to clean up the gmem->kvm pointer during unbinding when the
> KVM VM is destroyed.
>
> That works, but then I realized there’s a simpler way to use the pointer
> after freeing:
>
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Close/destroy the KVM VM
> 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> when it tries to do invalidation.
>
> Perhaps binding should mean setting the gmem->kvm pointer in addition to
> gmem->bindings. This makes binding and unbinding symmetric and avoids
> the use-after-frees described above.

Hrm, that would work, though it's a bit convoluted, e.g. would require detecting
when the last binding is being removed. A similar (also ugly) solution would be
to nullify gmem->kvm when KVM dies.

I don't love either approach because it means a file created in the context
of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
that it can't do anything with except close(). I doubt that matters in practice
though, e.g. when the VM dies, all memory can be freed so that the file ends up
being little more than a shell. And if we go that route, there's no need to grab
a reference to the file during bind, KVM can just grab a longterm reference when
the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).

Blech, another wart is that I believe gmem would need to do __module_get() during
file creation to prevent kvm.ko from being unloaded after the last VM dies. Ah,
but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose
it's not _that_ ugly.

Exchanging references (at binding or at creation) doesn't work, because that
creates a circular dependency, i.e. gmem and KVM would pin each other.

A "proper" refcounting approach, where the file pins KVM and not vice versa, gets
nasty because of how KVM's memslots work. The least awful approach I can think of
would be to delete the associated memslot(s) when the file is released, possibly
via deferred work to avoid deadlock issues. Not the prettiest thing ever and in
some ways that'd yield an even worse ABI.

Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s
call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(),
i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away. The easy
solution for that one is to add another synchronize_srcu_expedited() after unbinding.

> This also means that creating a guest mem fd is no longer dependent on
> the VM. Perhaps we can make creating a gmem fd a system ioctl (like
> KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl?

My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
and not a common syscall. If the file isn't tightly coupled to a single VM, then
punching a hole is further complicated by needing to deal with invalidating multiple
regions that are bound to different @kvm instances. It's not super complex, but
AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
having one VM own the memory will complicate things even if/when we get to the point where
VMs can share "private" memory, and the gmem code would still need to deal with
grabbing a module reference.
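
For reference, the simpler use-after-free sequence described above boils down to
roughly the following userspace snippet. This is only a sketch: the gmem creation
ioctl and its struct are modeled on the POC and are defined here as guarded
placeholders, and error handling is omitted.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

#ifndef KVM_CREATE_GUEST_MEMFD          /* placeholder if the header lacks it */
struct kvm_create_guest_memfd {
        __u64 size;
        __u64 flags;
        __u64 reserved[6];
};
#define KVM_CREATE_GUEST_MEMFD  _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
#endif

int main(void)
{
        int kvm = open("/dev/kvm", O_RDWR);
        int vm = ioctl(kvm, KVM_CREATE_VM, 0);
        struct kvm_create_guest_memfd args = { .size = 0x400000 };
        int gmem = ioctl(vm, KVM_CREATE_GUEST_MEMFD, &args);

        close(vm);      /* last reference: the VM is destroyed, gmem->kvm dangles */
        close(kvm);

        /* PUNCH_HOLE invalidates through the stale gmem->kvm pointer. */
        fallocate(gmem, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 0x400000);
        return 0;
}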

2023-05-06 01:32:40

by Vishal Annapurve

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Fri, May 5, 2023 at 5:55 PM Sean Christopherson <[email protected]> wrote:
>
> ...
> My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> and not a common syscall. If the file isn't tightly coupled to a single VM, then
> punching a hole is further complicated by needing to deal with invalidating multiple
> regions that are bound to different @kvm instances. It's not super complex, but
> AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> having one VM own the memory will complicate even if/when we get to the point where
> VMs can share "private" memory, and the gmem code would still need to deal with
> grabbing a module reference.

Copyless migration would be a scenario where "private" memory may need
to be shared between source and target VMs depending on how migration
support is implemented.

Regards,
Vishal

2023-05-06 07:57:36

by Vlastimil Babka

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On 5/5/23 22:00, David Hildenbrand wrote:
> On 23.04.23 15:28, Jarkko Sakkinen wrote:
>> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
>>> On 17.04.23 17:40, Sean Christopherson wrote:
>>>> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>>>
>>> Yeay, let's add more confusion :D
>>>
>>> If we're at renaming, I'd appreciate if we could find a terminology that
>>> does look/sound less horrible.
>>>
>>>>
>>>> I want to start referring to the code/patches by its syscall/implementation name
>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>>> since there's nothing in the code that mentions "UPM" in any way.
>>>>
>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>>> already used to refer to "reserved memory".
>>>>
>>>> Renaming the syscall to "guardedmem"...
>>>
>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>>
>> In the world of TEE's and confidential computing it is fairly common to
>> call memory areas enclaves, even outside SGX context. So in that sense
>> enclave memory would be the most correct terminology.
>
> I was also thinking along the lines of isolated_mem or imem ...
> essentially, isolated from (unprivileged) user space.
>
> ... if we still want to have a common syscall for it.

I'm a fan of the ioctl, if it has a chance of working out.

2023-05-06 09:49:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On 06.05.23 09:44, Vlastimil Babka wrote:
> On 5/5/23 22:00, David Hildenbrand wrote:
>> On 23.04.23 15:28, Jarkko Sakkinen wrote:
>>> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
>>>> On 17.04.23 17:40, Sean Christopherson wrote:
>>>>> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>>>>
>>>> Yeay, let's add more confusion :D
>>>>
>>>> If we're at renaming, I'd appreciate if we could find a terminology that
>>>> does look/sound less horrible.
>>>>
>>>>>
>>>>> I want to start referring to the code/patches by its syscall/implementation name
>>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>>>> since there's nothing in the code that mentions "UPM" in any way.
>>>>>
>>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>>>> already used to refer to "reserved memory".
>>>>>
>>>>> Renaming the syscall to "guardedmem"...
>>>>
>>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>>>
>>> In the world of TEE's and confidential computing it is fairly common to
>>> call memory areas enclaves, even outside SGX context. So in that sense
>>> enclave memory would be the most correct terminology.
>>
>> I was also thinking along the lines of isolated_mem or imem ...
>> essentially, isolated from (unprivileged) user space.
>>
>> ... if we still want to have a common syscall for it.
>
> I'm fan of the ioctl, if it has a chance of working out.
Yes, me too.

--
Thanks,

David / dhildenb

2023-05-09 13:01:51

by Chao Peng

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Fri, May 05, 2023 at 07:39:36PM +0000, Ackerley Tng wrote:
>
> Hi Sean,
>
> Thanks for implementing this POC!
>
> I’ve started porting the selftests (both Chao’s and those I added [1]).

Hi Sean/Ackerley,

Thanks for doing that. Overall making gmem a KVM ioctl() looks good to
me and it should also play nice with Intel TDX. Besides what Ackerley
mentioned below, I think we haven't discussed device assignment, which
will need to be supported before too long. The current VFIO_IOMMU_MAP_DMA
consumes a virtual address, so that needs to be fixed for fd-based memory
anyway, and the fix doesn't look related to whether this is a syscall()
or a KVM ioctl(). There will be some initialization sequence dependency,
e.g. if gmem ends up being a VM-scoped ioctl() then the VM needs to be
created before we can map fd-based memory in VFIO, but that doesn't sound
like an issue at all.
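
Just to illustrate the shape of that fix, a purely hypothetical fd/offset-based
cousin of VFIO_IOMMU_MAP_DMA might look something like the layout below; neither
the struct nor such an ioctl exists today, every name here is a placeholder:

#include <linux/types.h>

struct vfio_iommu_type1_dma_map_fd {    /* placeholder, not a real uAPI */
        __u32   argsz;
        __u32   flags;
        __s32   fd;             /* gmem/restrictedmem fd instead of a vaddr */
        __u32   pad;
        __u64   offset;         /* offset into the fd */
        __u64   iova;           /* IO virtual address */
        __u64   size;           /* bytes to map */
};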

I also see Vlastimil/David expressed their preference for the ioctl. So maybe
we can move forward with your current PoC. Do you already have a plan to
post a formal version?

Chao

>
> guest mem seems to cover the use cases that have been discussed and
> proposed so far, but I still need to figure out how gmem can work with
>
> + hugetlbfs
> + specification of/storing memory policy (for NUMA node bindings)
> + memory accounting - we may need to account for memory used separately,
> so that guest mem shows up separately on /proc/meminfo and similar
> places.
>
> One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> not cleaned up, and hence it is possible to crash the host kernel in the
> following way
>
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Create a memslot with the guest mem fd (hence binding the fd to the
> VM)
> 4. Close/destroy the KVM VM
> 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> when it tries to do invalidation.
>
> I then tried to clean up the gmem->kvm pointer during unbinding when the
> KVM VM is destroyed.
>
> That works, but then I realized there’s a simpler way to use the pointer
> after freeing:
>
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Close/destroy the KVM VM
> 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> when it tries to do invalidation.
>
> Perhaps binding should mean setting the gmem->kvm pointer in addition to
> gmem->bindings. This makes binding and unbinding symmetric and avoids
> the use-after-frees described above.
>
> This also means that creating a guest mem fd is no longer dependent on
> the VM. Perhaps we can make creating a gmem fd a system ioctl (like
> KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl?
>
> [1]
> https://lore.kernel.org/all/[email protected]/T/
>
> Ackerley

2023-05-10 17:35:19

by Vishal Annapurve

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <[email protected]> wrote:
>
> ...
> cold. I poked around a bit to see how we could avoid reinventing all of that
> infrastructure for fd-only memory, and the best idea I could come up with is
> basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> allow "mapping" fd-only memory, but ensure that memory is never actually present
> from hardware's perspective.
>

I am most likely missing a lot of context and possibly venturing
into an infeasible/already shot down direction here. But I would still
like to get this discussed before we move on.

I am wondering if it would make sense to implement
restricted_mem/guest_mem file to expose both private and shared memory
regions, in line with Kirill's original proposal now that the file
implementation is controlled by KVM.

Thinking from userspace perspective:
1) Userspace creates guest mem files and is able to mmap them but all
accesses to these files result in faults as no memory is allowed to
be mapped into userspace VMM pagetables.
2) Userspace registers mmaped HVA ranges with KVM with additional
KVM_MEM_PRIVATE flag
3) Userspace converts memory attributes and this memory conversion
allows userspace to access shared ranges of the file because those are
allowed to be faulted in from guest_mem. Shared to private conversion
unmaps the file ranges from userspace VMM pagetables.
4) Granularity of userspace pagetable mappings for shared ranges will
have to be dictated by KVM guest_mem file implementation.

Caveat here is that once private pages are mapped into userspace view.

Benefits here:
1) Userspace view remains consistent while still being able to use HVA ranges
2) It would be possible to use HVA based APIs from userspace to do
things like binding.
3) Double allocation wouldn't be a concern since hva ranges and gpa
ranges possibly map to the same HPA ranges.

>
> Code is available here if folks want to take a look before any kind of formal
> posting:
>
> https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
>
> [1] https://lore.kernel.org/all/[email protected]
> [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> [3] https://lore.kernel.org/linux-mm/[email protected]

2023-05-10 20:46:51

by Vishal Annapurve

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Wed, May 10, 2023 at 10:26 AM Vishal Annapurve <[email protected]> wrote:
>
> On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <[email protected]> wrote:
> >
> > ...
> > cold. I poked around a bit to see how we could avoid reinventing all of that
> > infrastructure for fd-only memory, and the best idea I could come up with is
> > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > from hardware's perspective.
> >
>
> I am most likely missing a lot of context here and possibly venturing
> into an infeasible/already shot down direction here. But I would still
> like to get this discussed here before we move on.
>
> I am wondering if it would make sense to implement
> restricted_mem/guest_mem file to expose both private and shared memory
> regions, inline with Kirill's original proposal now that the file
> implementation is controlled by KVM.
>
> Thinking from userspace perspective:
> 1) Userspace creates guest mem files and is able to mmap them but all
> accesses to these files result into faults as no memory is allowed to
> be mapped into userspace VMM pagetables.
> 2) Userspace registers mmaped HVA ranges with KVM with additional
> KVM_MEM_PRIVATE flag
> 3) Userspace converts memory attributes and this memory conversion
> allows userspace to access shared ranges of the file because those are
> allowed to be faulted in from guest_mem. Shared to private conversion
> unmaps the file ranges from userspace VMM pagetables.
> 4) Granularity of userspace pagetable mappings for shared ranges will
> have to be dictated by KVM guest_mem file implementation.
>
> Caveat here is that once private pages are mapped into userspace view.
>
> Benefits here:
> 1) Userspace view remains consistent while still being able to use HVA ranges
> 2) It would be possible to use HVA based APIs from userspace to do
> things like binding.
> 3) Double allocation wouldn't be a concern since hva ranges and gpa
> ranges possibly map to the same HPA ranges.
>

I realize now that VFIO IOMMU mappings are not associated with
userspace MMU in any way. So this approach does leave the gap of not
being able to handle modifications of IOMMU mappings when HVA mappings
are modified in userspace page tables. I guess this could be a good
enough reason to give up on this option.


> >
> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> >
> > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > [1] https://lore.kernel.org/all/[email protected]
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/[email protected]

2023-05-10 21:57:19

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Wed, May 10, 2023, Vishal Annapurve wrote:
> On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <[email protected]> wrote:
> >
> > ...
> > cold. I poked around a bit to see how we could avoid reinventing all of that
> > infrastructure for fd-only memory, and the best idea I could come up with is
> > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > from hardware's perspective.
> >
>
> I am most likely missing a lot of context here and possibly venturing
> into an infeasible/already shot down direction here.

Both :-)

> But I would still like to get this discussed here before we move on.
>
> I am wondering if it would make sense to implement
> restricted_mem/guest_mem file to expose both private and shared memory
> regions, inline with Kirill's original proposal now that the file
> implementation is controlled by KVM.
>
> Thinking from userspace perspective:
> 1) Userspace creates guest mem files and is able to mmap them but all
> accesses to these files result into faults as no memory is allowed to
> be mapped into userspace VMM pagetables.

Never mapping anything into the userspace page table is infeasible. Technically
it's doable, but it'd effectively require all of the work of an fd-based approach
(and probably significantly more), _and_ it'd require touching core mm code.

VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
the abstraction provided to userspace by mmap(), mprotect() etc. Among many other
things, a VMA describes properties of what is mapped, e.g. hugetlbfs versus
anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
RWX protections, etc. But a VMA doesn't track the physical address, that info
is all managed through the userspace page tables.

To make it possible to allow userspace to mmap() but not access memory (without
redoing how the kernel fundamentally manages virtual=>physical mappings), the
simplest approach is to install PTEs into userspace page tables, but never mark
them Present in hardware, i.e. prevent actually accessing the backing memory.
This is exactly what Kirill's series in link [3] below implemented.

Issues that led to us abandoning the "map with special !Present PTEs" approach:

- Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
is inefficient and inflexible compared to software defined structures, especially
for the expected use cases for CoCo guests.

- The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
let alone a 1:1 pfn:gfn mapping.

- Does not work for memory that isn't backed by 'struct page', e.g. if devices
gain support for exposing encrypted memory regions to guests.

- Poking into the VMAs to convert memory would likely be less performant due
to using infrastructure that is much "heavier", e.g. would require taking
mmap_lock for write.

In short, shoehorning this into mmap() requires fighting how the kernel works at
pretty much every step, and in the end, adding e.g. fbind() is a lot easier.

> 2) Userspace registers mmaped HVA ranges with KVM with additional
> KVM_MEM_PRIVATE flag
> 3) Userspace converts memory attributes and this memory conversion
> allows userspace to access shared ranges of the file because those are
> allowed to be faulted in from guest_mem. Shared to private conversion
> unmaps the file ranges from userspace VMM pagetables.
> 4) Granularity of userspace pagetable mappings for shared ranges will
> have to be dictated by KVM guest_mem file implementation.
>
> Caveat here is that once private pages are mapped into userspace view.
>
> Benefits here:
> 1) Userspace view remains consistent while still being able to use HVA ranges
> 2) It would be possible to use HVA based APIs from userspace to do
> things like binding.
> 3) Double allocation wouldn't be a concern since hva ranges and gpa
> ranges possibly map to the same HPA ranges.

#3 isn't entirely correct. If a different process (call it "B") maps shared memory,
and then the guest converts that memory from shared to private, the backing pages
for the previously shared mapping will still be mapped by process B unless userspace
ensures that process B also unmaps on conversion.

#3 is also a limiter. E.g. if a guest is primarily backed by 1GiB pages, keeping
the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
and possibly even if the guest converts a few MiB of memory.

> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> >
> > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > [1] https://lore.kernel.org/all/[email protected]
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/[email protected]

2023-05-10 23:35:51

by Vishal Annapurve

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, May 10, 2023, Vishal Annapurve wrote:
> > On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > ...
> > > cold. I poked around a bit to see how we could avoid reinventing all of that
> > > infrastructure for fd-only memory, and the best idea I could come up with is
> > > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > > from hardware's perspective.
> > >
> >
> > I am most likely missing a lot of context here and possibly venturing
> > into an infeasible/already shot down direction here.
>
> Both :-)
>
> > But I would still like to get this discussed here before we move on.
> >
> > I am wondering if it would make sense to implement
> > restricted_mem/guest_mem file to expose both private and shared memory
> > regions, inline with Kirill's original proposal now that the file
> > implementation is controlled by KVM.
> >
> > Thinking from userspace perspective:
> > 1) Userspace creates guest mem files and is able to mmap them but all
> > accesses to these files result into faults as no memory is allowed to
> > be mapped into userspace VMM pagetables.
>
> Never mapping anything into the userspace page table is infeasible. Technically
> it's doable, but it'd effectively require all of the work of an fd-based approach
> (and probably significantly more), _and_ it'd require touching core mm code.
>
> VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
> the abstraction provided to userspace by mmap(), mprotect() etc. Among many other
> things, a VMA describes properties of what is mapped, e.g. hugetblfs versus
> anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
> RWX protections, etc. But a VMA doesn't track the physical address, that info
> is all managed through the userspace page tables.
>
> To make it possible to allow userspace to mmap() but not access memory (without
> redoing how the kernel fundamentally manages virtual=>physical mappings), the
> simplest approach is to install PTEs into userspace page tables, but never mark
> them Present in hardware, i.e. prevent actually accessing the backing memory.
> This is is exactly what Kirill's series in link [3] below implemented.
>

Maybe it's simpler to do when mmaped regions are backed with files.

I see that shmem has fault handlers for accesses to VMA regions
associated with the files. In theory a file implementation can always
choose to not allocate physical pages for such faults (similar to
F_SEAL_FAULT_AUTOALLOCATE that was discussed earlier).

> Issues that led to us abandoning the "map with special !Present PTEs" approach:
>
> - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> is inefficient and inflexible compared to software defined structures, especially
> for the expected use cases for CoCo guests.
>
> - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> let alone a 1:1 pfn:gfn mapping.

Maybe KVM can ensure that each page of the guest_mem file is
associated with a single memslot. HVAs when they are registered can be
associated with offsets into guest_mem files.

>
> - Does not work for memory that isn't backed by 'struct page', e.g. if devices
> gain support for exposing encrypted memory regions to guests.
>
> - Poking into the VMAs to convert memory would be likely be less performant due
> to using infrastructure that is much "heavier", e.g. would require taking
> mmap_lock for write.

Converting memory doesn't necessarily need to poke holes into VMAs, but
rather just unmap page tables, just like what would happen when mmapped
files are hole-punched to free the backing file offsets.

>
> In short, shoehorning this into mmap() requires fighting how the kernel works at
> pretty much every step, and in the end, adding e.g. fbind() is a lot easier.
>
> > 2) Userspace registers mmaped HVA ranges with KVM with additional
> > KVM_MEM_PRIVATE flag
> > 3) Userspace converts memory attributes and this memory conversion
> > allows userspace to access shared ranges of the file because those are
> > allowed to be faulted in from guest_mem. Shared to private conversion
> > unmaps the file ranges from userspace VMM pagetables.
> > 4) Granularity of userspace pagetable mappings for shared ranges will
> > have to be dictated by KVM guest_mem file implementation.
> >
> > Caveat here is that once private pages are mapped into userspace view.
> >
> > Benefits here:
> > 1) Userspace view remains consistent while still being able to use HVA ranges
> > 2) It would be possible to use HVA based APIs from userspace to do
> > things like binding.
> > 3) Double allocation wouldn't be a concern since hva ranges and gpa
> > ranges possibly map to the same HPA ranges.
>
> #3 isn't entirely correct. If a different process (call it "B") maps shared memory,
> and then the guest converts that memory from shared to private, the backing pages
> for the previously shared mapping will still be mapped by process B unless userspace
> also ensures process B also unmaps on conversion.
>

This should ideally be handled by something like unmap_mapping_range().
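
Something along these lines, as a minimal illustration; unmap_mapping_range() is
existing kernel infrastructure, while the wrapper below and where it gets called
from (a shared->private conversion path) are assumptions:

static void kvm_gmem_zap_user_mappings(struct file *file, pgoff_t start,
                                       pgoff_t end)
{
        loff_t first = (loff_t)start << PAGE_SHIFT;
        loff_t len = (loff_t)(end - start) << PAGE_SHIFT;

        /* even_cows == 1: also zap private COW copies of these pages. */
        unmap_mapping_range(file->f_mapping, first, len, 1);
}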

> #3 is also a limiter. E.g. if a guest is primarly backed by 1GiB pages, keeping
> the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> and possibly even if the guest converts a few MiB of memory.

This caveat can maybe be lived with, as shared ranges most likely will
not be backed by 1G pages anyways, possibly causing IO performance to
get hit. This possibly needs more discussion about conversion
granularity used by guests.

>
> > > Code is available here if folks want to take a look before any kind of formal
> > > posting:
> > >
> > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> > >
> > > [1] https://lore.kernel.org/all/[email protected]
> > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > > [3] https://lore.kernel.org/linux-mm/[email protected]

2023-05-11 20:41:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Wed, May 10, 2023, Vishal Annapurve wrote:
> On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <[email protected]> wrote:
> > > But I would still like to get this discussed here before we move on.
> > >
> > > I am wondering if it would make sense to implement
> > > restricted_mem/guest_mem file to expose both private and shared memory
> > > regions, inline with Kirill's original proposal now that the file
> > > implementation is controlled by KVM.
> > >
> > > Thinking from userspace perspective:
> > > 1) Userspace creates guest mem files and is able to mmap them but all
> > > accesses to these files result into faults as no memory is allowed to
> > > be mapped into userspace VMM pagetables.
> >
> > Never mapping anything into the userspace page table is infeasible. Technically
> > it's doable, but it'd effectively require all of the work of an fd-based approach
> > (and probably significantly more), _and_ it'd require touching core mm code.
> >
> > VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
> > the abstraction provided to userspace by mmap(), mprotect() etc. Among many other
> > things, a VMA describes properties of what is mapped, e.g. hugetblfs versus
> > anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
> > RWX protections, etc. But a VMA doesn't track the physical address, that info
> > is all managed through the userspace page tables.
> >
> > To make it possible to allow userspace to mmap() but not access memory (without
> > redoing how the kernel fundamentally manages virtual=>physical mappings), the
> > simplest approach is to install PTEs into userspace page tables, but never mark
> > them Present in hardware, i.e. prevent actually accessing the backing memory.
> > This is is exactly what Kirill's series in link [3] below implemented.
> >
>
> Maybe it's simpler to do when mmaped regions are backed with files.
>
> I see that shmem has fault handlers for accesses to VMA regions
> associated with the files, In theory a file implementation can always
> choose to not allocate physical pages for such faults (similar to
> F_SEAL_FAULT_AUTOALLOCATE that was discussed earlier).

Ah, you're effectively suggesting a hybrid model where the file is the single
source of truth for what's private versus shared, and KVM gets pfns through
direct communication with the backing store via the file descriptor, but userspace
can still control things via mmap() and friends.

If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas,
then we're back at Kirill's series, because otherwise there's no easy way for KVM
to retrieve the pfn.

A form of this was also discussed, though I don't know how much of the discussion
happened on-list.

KVM actually does something like this for s390's Ultravisor (UV), which is quite
a bit like TDX (UV is a trusted intermediary) except that it handles faults much,
much more gracefully. Specifically, when the untrusted host attempts to access a
secure page, a fault occurs and the kernel responds by telling UV to export the
page. The fault is gracefully handled even for kernel accesses
(see do_secure_storage_access()). The kernel does BUG() if the export fails when
handling fault from kernel context, but my understanding is that export can fail
if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_
success, and goes straight to BUG()/panic() if something goes wrong.

On the guest side, accesses to exported (swapped) secure pages generate intercepts
and KVM faults in the page. To do so, KVM freezes the page/folio refcount, tells
the UV to import the page, and then unfreezes the page/folio. But very crucially,
when _anything_ in the untrusted host attempts to access the secure page, the
above fault handling for untrusted host accesses kicks in. In other words, the
guest can cause thrash, but can't bring down the host.

TDX on the other hand silently poisons memory, i.e. doesn't even generate a
synchronous fault. Thus the kernel needs to be 100% perfect on preventing _any_
accesses to private memory from the host, and doing that is non-trivial and
invasive.

SNP does synchronously fault, but automatically converting in the #PF handler
got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the
guest can trigger concurrent RMP modifications. So the end result ends up being
the same as TDX, host accesses need to be completely prevented.

Again, this is all doable, but costly. And IMO, provides very little value.

Allowing things like mbind() is nice-to-have at best, as implementing fbind()
isn't straightforward but is arguably valuable to have irrespective of this
discussion, e.g. to allow userspace to say "use this policy regardless of what
process maps the file".
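
For concreteness, a purely hypothetical sketch of what such an fbind() could look
like; no such syscall exists, the signature below simply mirrors mbind(2) with an
fd/offset pair replacing the virtual address range:

/* Existing: policy follows a virtual address range in the calling process.
 *   long mbind(void *addr, unsigned long len, int mode,
 *              const unsigned long *nodemask, unsigned long maxnode,
 *              unsigned int flags);
 *
 * Hypothetical: policy follows the file range, regardless of who maps it.
 */
long fbind(int fd, loff_t offset, size_t len, int mode,
           const unsigned long *nodemask, unsigned long maxnode,
           unsigned int flags);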

Using a common memory pool (same physical page is used for both shared and private)
is a similar story. There are plenty of existing controls to limit userspace/guest
memory usage and to deal with OOM scenarios, so barring egregious host accounting
and/or limiting bugs, which would affect _all_ VM types, the worst case scenario
is that a VM is terminated because host userspace is buggy. On the flip side, using
a common pool brings complexity into the kernel, as backing stores would need to
be taught to deny access to a subset of pages in their mappings, and in multiple
paths, e.g. faults, read()/write() and similar, page migration, swap, etc.

[*] https://lore.kernel.org/linux-mm/[email protected]

> > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> >
> > - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> > is inefficient and inflexible compared to software defined structures, especially
> > for the expected use cases for CoCo guests.
> >
> > - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> > let alone a 1:1 pfn:gfn mapping.
>
> Maybe KVM can ensure that each page of the guest_mem file is
> associated with a single memslot.

This is a hard NAK. Guest physical address space is guaranteed to have holes
and/or be discontiguous, e.g. for the PCI hole at the top of lower memory. Allowing
only a single binding would prevent userspace from backing all (or large chunks)
of guest memory with a single file.

> HVAs when they are registered can be associated with offsets into guest_mem files.

Enforcing 1:1 associations is doable if KVM inserts a shim/interposer, e.g. essentially
implements the exclusivity bits of restrictedmem. But that's adding even more
complexity.

> > - Does not work for memory that isn't backed by 'struct page', e.g. if devices
> > gain support for exposing encrypted memory regions to guests.
> >
> > - Poking into the VMAs to convert memory would be likely be less performant due
> > to using infrastructure that is much "heavier", e.g. would require taking
> > mmap_lock for write.
>
> Converting memory doesn't necessarily need to poke holes into VMA, but
> rather just unmap pagetables just like what would happen when mmapped
> files are punched to free the backing file offsets.

Sorry, bad choice of word on my part. I didn't intend to imply poking holes, in
this case I used "poking" to mean "modifying". munmap(), mprotect(), etc. all
require modifying VMAs, which means taking mmap_lock for write.

> > In short, shoehorning this into mmap() requires fighting how the kernel works at
> > pretty much every step, and in the end, adding e.g. fbind() is a lot easier.
> >
> > > 2) Userspace registers mmaped HVA ranges with KVM with additional
> > > KVM_MEM_PRIVATE flag
> > > 3) Userspace converts memory attributes and this memory conversion
> > > allows userspace to access shared ranges of the file because those are
> > > allowed to be faulted in from guest_mem. Shared to private conversion
> > > unmaps the file ranges from userspace VMM pagetables.
> > > 4) Granularity of userspace pagetable mappings for shared ranges will
> > > have to be dictated by KVM guest_mem file implementation.
> > >
> > > Caveat here is that once private pages are mapped into userspace view.
> > >
> > > Benefits here:
> > > 1) Userspace view remains consistent while still being able to use HVA ranges
> > > 2) It would be possible to use HVA based APIs from userspace to do
> > > things like binding.
> > > 3) Double allocation wouldn't be a concern since hva ranges and gpa
> > > ranges possibly map to the same HPA ranges.
> >
> > #3 isn't entirely correct. If a different process (call it "B") maps shared memory,
> > and then the guest converts that memory from shared to private, the backing pages
> > for the previously shared mapping will still be mapped by process B unless userspace
> > also ensures process B also unmaps on conversion.
> >
>
> This should be ideally handled by something like: unmap_mapping_range()

That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but
not for a generic VMA-based implementation. If the file isn't the single source
of truth, then forcing all mappings to go away simply can't work.

> > #3 is also a limiter. E.g. if a guest is primarly backed by 1GiB pages, keeping
> > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> > and possibly even if the guest converts a few MiB of memory.
>
> This caveat maybe can be lived with as shared ranges most likely will
> not be backed by 1G pages anyways, possibly causing IO performance to
> get hit. This possibly needs more discussion about conversion
> granularity used by guests.

Yes, it's not the end of the world. My point is that separating shared and private
memory provides more flexibility. Maybe that flexibility never ends up being
super important, but at the same time we shouldn't willingly paint ourselves into
a corner.

2023-05-12 00:33:12

by Michael Roth

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
>
> Code is available here if folks want to take a look before any kind of formal
> posting:
>
> https://github.com/sean-jc/linux.git x86/kvm_gmem_solo

Hi Sean,

I've been working on getting the SNP patches ported to this but I'm having
some trouble working out a reasonable scheme for how to work the
RMPUPDATE hooks into the proposed design.

One of the main things is kvm_gmem_punch_hole(): this can free pages
back to the host whenever userspace feels like it. Pages that are still
marked private in the RMP table will blow up the host if they aren't returned
to the normal state before handing them back to the kernel. So I'm trying to
add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
e.g.:

static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
                                loff_t len)
{
        struct kvm_gmem *gmem = file->private_data;
        pgoff_t start = offset >> PAGE_SHIFT;
        pgoff_t end = (offset + len) >> PAGE_SHIFT;
        struct kvm *kvm = gmem->kvm;

        /*
         * Bindings must be stable across invalidation to ensure the start+end
         * are balanced.
         */
        filemap_invalidate_lock(file->f_mapping);
        kvm_gmem_invalidate_begin(kvm, gmem, start, end);

        /* Handle arch-specific cleanups before releasing pages */
        kvm_arch_gmem_invalidate(kvm, gmem, start, end);
        truncate_inode_pages_range(file->f_mapping, offset, offset + len);

        kvm_gmem_invalidate_end(kvm, gmem, start, end);
        filemap_invalidate_unlock(file->f_mapping);

        return 0;
}

But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
the page in its intended state in the RMP table prior to mapping it into the
guest's NPT. Currently I'm calling that hook via
kvm_vm_ioctl_set_mem_attributes(), just after kvm->mem_attr_array is updated
based on the ioctl. The reasoning there is that KVM MMU can then rely on the
existing mmu_invalidate_seq logic to ensure both the state in the
mem_attr_array and the RMP table are in sync and up-to-date once MMU lock is
acquired and MMU is ready to map it, or retry #NPF otherwise.

But for kvm_gmem_punch_hole(), kvm_vm_ioctl_set_mem_attributes() can potentially
result in something like the following sequence if I implement things as above:

  CPU0: kvm_gmem_punch_hole():
          kvm_gmem_invalidate_begin()
          kvm_arch_gmem_invalidate()          // set pages to default/shared state in RMP table before free'ing
  CPU1: kvm_vm_ioctl_set_mem_attributes():
          kvm_arch_gmem_set_mem_attributes()  // maliciously set pages to private in RMP table
  CPU0: truncate_inode_pages_range()          // HOST BLOWS UP TOUCHING PRIVATE PAGES
        kvm_arch_gmem_invalidate_end()

One quick and lazy solution is to rely on the fact that
kvm_vm_ioctl_set_mem_attributes() holds the kvm->slots_lock throughout the
entire begin()/end() portion of the invalidation sequence, and to similarly
hold the kvm->slots_lock throughout the begin()/end() sequence in
kvm_gmem_punch_hole() to prevent any interleaving.

But I'd imagine overloading kvm->slots_lock is not the proper approach. Would
introducing a similar mutex to keep these operations grouped/atomic be
a reasonable approach to you, or should we be doing something else entirely
here?

Keep in mind that RMP updates can't be done while holding KVM->mmu_lock
spinlock, because we also need to unmap pages from the directmap, which can
lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
need to work around.

Thanks!

-Mike

[1] https://lore.kernel.org/linux-coco/[email protected]/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a

>
> [1] https://lore.kernel.org/all/[email protected]
> [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> [3] https://lore.kernel.org/linux-mm/[email protected]

2023-05-12 18:17:45

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Thu, May 11, 2023, Michael Roth wrote:
> On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> >
> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> >
> > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
>
> Hi Sean,
>
> I've been working on getting the SNP patches ported to this but I'm having
> some trouble working out a reasonable scheme for how to work the
> RMPUPDATE hooks into the proposed design.
>
> One of the main things is kvm_gmem_punch_hole(): this is can free pages
> back to the host whenever userspace feels like it. Pages that are still
> marked private in the RMP table will blow up the host if they aren't returned
> to the normal state before handing them back to the kernel. So I'm trying to
> add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
> e.g.:
>
> static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
> loff_t len)
> {
> struct kvm_gmem *gmem = file->private_data;
> pgoff_t start = offset >> PAGE_SHIFT;
> pgoff_t end = (offset + len) >> PAGE_SHIFT;
> struct kvm *kvm = gmem->kvm;
>
> /*
> * Bindings must stable across invalidation to ensure the start+end
> * are balanced.
> */
> filemap_invalidate_lock(file->f_mapping);
> kvm_gmem_invalidate_begin(kvm, gmem, start, end);
>
> /* Handle arch-specific cleanups before releasing pages */
> kvm_arch_gmem_invalidate(kvm, gmem, start, end);
> truncate_inode_pages_range(file->f_mapping, offset, offset + len);
>
> kvm_gmem_invalidate_end(kvm, gmem, start, end);
> filemap_invalidate_unlock(file->f_mapping);
>
> return 0;
> }
>
> But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
> the page in its intended state in the RMP table prior to mapping it into the
> guest's NPT.

IMO, this approach is wrong. kvm->mem_attr_array is the source of truth for whether
userspace wants _guest_ physical pages mapped private vs. shared, but the attributes
array has zero insight into the _host_ physical pages. I.e. SNP shouldn't hook
kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally
wrong.

A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves
memslots, but it's a good analogy for KVM internals). KVM needs to zap all mappings
for the old memslot gfn, but KVM does not create mappings for the new memslot gfn.
Same for changing attributes; unmap, but never map.

As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant
NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from
establishing a new NPT mapping. And mmu_invalidate_in_progress will reach '0' only
after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM
can create new mappings only when both kvm->mem_attr_array and any relevant
guest_mem bindings have reached steady state.

That leaves the question of when/where to do RMP updates. Off the cuff, I think
RMP updates (and I _think_ also TDX page conversions) should _always_ be done in
the context of either (a) file truncation (make host owned, a.k.a. TDX reclaim)
or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio().
Under the hood, even though the gfn is the same, the backing pfn is different, i.e.
installing a shared mapping should _never_ need to touch the RMP because pages
coming from the normal (non-guest_mem) pool must already be host owned.
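
To make (b) concrete, a rough sketch of hooking the RMP update into folio
allocation; kvm_gmem_get_folio() follows the POC's shape, but the body below and
the kvm_arch_gmem_prepare() hook are assumptions:

static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
{
        struct folio *folio;

        folio = filemap_grab_folio(inode->i_mapping, index);
        if (IS_ERR_OR_NULL(folio))
                return folio;

        if (!folio_test_uptodate(folio)) {
                /*
                 * Freshly allocated backing page: transition it to guest
                 * owned (e.g. RMPUPDATE on SNP) exactly once, here.
                 */
                kvm_arch_gmem_prepare(folio);
                folio_mark_uptodate(folio);
        }

        return folio;
}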

> Currently I'm calling that hook via kvm_vm_ioctl_set_mem_attributes(), just
> after kvm->mem_attr_array is updated based on the ioctl. The reasoning there
> is that KVM MMU can then rely on the existing mmu_invalidate_seq logic to
> ensure both the state in the mem_attr_array and the RMP table are in sync and
> up-to-date once MMU lock is acquired and MMU is ready to map it, or retry
> #NPF otherwise.
>
> But for kvm_gmem_punch_hole(), kvm_vm_ioctl_set_mem_attributes() can potentially
> result in something like the following sequence if I implement things as above:
>
> CPU0: kvm_gmem_punch_hole():
> kvm_gmem_invalidate_begin()
> kvm_arch_gmem_invalidate() // set pages to default/shared state in RMP table before free'ing
> CPU1: kvm_vm_ioctl_set_mem_attributes():
> kvm_arch_gmem_set_mem_attributes() // maliciously set pages to private in RMP table
> CPU0: truncate_inode_pages_range() // HOST BLOWS UP TOUCHING PRIVATE PAGES
> kvm_arch_gmem_invalidate_end()
>
> One quick and lazy solution is to rely on the fact that
> kvm_vm_ioctl_set_mem_attributes() holds the kvm->slots_lock throughout the
> entire begin()/end() portion of the invalidation sequence, and to similarly
> hold the kvm->slots_lock throughout the begin()/end() sequence in
> kvm_gmem_punch_hole() to prevent any interleaving.
>
> But I'd imagine overloading kvm->slots_lock is not the proper approach. But
> would introducing a similar mutex to keep these operations grouped/atomic be
> a reasonable approach to you, or should we be doing something else entirely
> here?
>
> Keep in mind that RMP updates can't be done while holding KVM->mmu_lock
> spinlock, because we also need to unmap pages from the directmap, which can
> lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
> need to work around.
>
> Thanks!
>
> -Mike
>
> [1] https://lore.kernel.org/linux-coco/[email protected]/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a
>
> >
> > [1] https://lore.kernel.org/all/[email protected]
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/[email protected]

2023-05-16 00:01:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Fri, May 05, 2023, Sean Christopherson wrote:
> On Fri, May 05, 2023, Ackerley Tng wrote:
> > One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> > not cleaned up, and hence it is possible to crash the host kernel in the
> > following way
> >
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Create a memslot with the guest mem fd (hence binding the fd to the
> > VM)
> > 4. Close/destroy the KVM VM
> > 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> > when it tries to do invalidation.
> >
> > I then tried to clean up the gmem->kvm pointer during unbinding when the
> > KVM VM is destroyed.
> >
> > That works, but then I realized there’s a simpler way to use the pointer
> > after freeing:
> >
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Close/destroy the KVM VM
> > 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> > when it tries to do invalidation.
> >
> > Perhaps binding should mean setting the gmem->kvm pointer in addition to
> > gmem->bindings. This makes binding and unbinding symmetric and avoids
> > the use-after-frees described above.
>
> Hrm, that would work, though it's a bit convoluted, e.g. would require detecting
> when the last binding is being removed. A similar (also ugly) solution would be
> to nullify gmem->kvm when KVM dies.
>
> I don't love either approach idea because it means a file created in the context
> of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> that it can't do anything with except close(). I doubt that matters in practice
> though, e.g. when the VM dies, all memory can be freed so that the file ends up
> being little more than a shell. And if we go that route, there's no need to grab
> a reference to the file during bind, KVM can just grab a longterm reference when
> the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
>
> Blech, another wart is that I believe gmem would need to do __module_get() during
> file creation to prevent kvm.ko from being unloaded after the last VM dies. Ah,
> but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose
> it's not _that_ ugly.
>
> Exchanging references (at binding or at creation) doesn't work, because that
> creates a circular dependency, i.e. gmem and KVM would pin each other.
>
> A "proper" refcounting approach, where the file pins KVM and not vice versa, gets
> nasty because of how KVM's memslots work. The least awful approach I can think of
> would be to delete the associated memslot(s) when the file is released, possibly
> via deferred work to avoid deadlock issues. Not the prettiest thing ever and in
> some ways that'd yield an even worse ABI.

Circling back to this. Pending testing, the "proper" refcounting approach seems
to be the way to go. KVM's existing memslots actually work this way, e.g. if
userspace does munmap() on an active memslot, KVM zaps any PTEs but the memslot
stays alive. A similar approach can be taken here, the only wrinkle is that the
association between gmem and the memslot is stronger than between VMAs and memslots,
specifically KVM binds the file and not simply the file descriptor. This is
necessary because not binding to an exact file would let userspace install a
different file at the file descriptor.

That's solvable without having to delete memslots though, e.g. by protecting the
file pointer in the memslot with RCU and directly bumping the refcount in the two
places where KVM needs to get at gmem (the file) via the memslot (unbind and
faulting in a page). E.g.

static struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
{
        struct file *file;

        rcu_read_lock();

        file = rcu_dereference(slot->gmem.file);
        if (file && !get_file_rcu(file))
                file = NULL;
        rcu_read_unlock();

        return file;
}

The gotcha is that ->release could race with memslot deletion, as kvm_gmem_unbind()
won't be able to differentiate between "file was deleted" and "file is currently
being deleted". That's easy enough to deal with though, kvm_gmem_release() can
take slots_lock to prevent the memslot from going away when nullifying and
invalidating ranges for the memslot.
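
Roughly, for illustration (the function names follow the POC, but the body and the
nullify helper below are assumptions):

static int kvm_gmem_release(struct inode *inode, struct file *file)
{
        struct kvm_gmem *gmem = file->private_data;
        struct kvm *kvm = gmem->kvm;

        /*
         * Prevent concurrent memslot deletion/unbind from racing with the
         * nullification and invalidation below.
         */
        mutex_lock(&kvm->slots_lock);

        kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul);
        kvm_gmem_nullify_bindings(gmem);        /* rcu_assign_pointer(slot->gmem.file, NULL) */
        kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul);

        mutex_unlock(&kvm->slots_lock);

        return 0;
}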

> Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s
> call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(),
> i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away. The easy
> solution for that one is to add another synchronize_srcu_expedited() after unbinding.

There's a bug here, but not the one I pointed out. Acquiring kvm->srcu doesn't
provide any protection, the binding already has a pointer to the memslot, i.e.
isn't doing an SRCU-protected lookup in the memslots. The actual protection is
provided by the filemap invalidate lock, which prevents unbinding a memslot until
all invalidations complete, i.e. acquiring kvm->srcu in the punch hole path is
completely unnecessary.

2023-05-19 01:24:54

by Vishal Annapurve

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Thu, May 11, 2023 at 1:22 PM Sean Christopherson <[email protected]> wrote:
> ...
> Ah, you're effectively suggesting a hybrid model where the file is the single
> source of truth for what's private versus shared, ad KVM gets pfns through
> direct communication with the backing store via the file descriptor, but userspace
> can still control things via mmap() and friends.
>
> If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas,
> then we're back at Kirill's series, because otherwise there's no easy way for KVM
> to retrieve the pfn.
>

Yeah, I was hinting towards using the backdoor, where KVM still gets
private pfns via HVAs.

> A form of this was also discussed, though I don't know how much of the discussion
> happened on-list.
>
> KVM actually does something like this for s390's Ultravisor (UV), which is quite
> a bit like TDX (UV is a trusted intermediary) except that it handles faults much,
> much more gracefully. Specifically, when the untrusted host attempts to access a
> secure page, a fault occurs and the kernel responds by telling UV to export the
> page. The fault is gracefully handled even even for kernel accesses
> (see do_secure_storage_access()). The kernel does BUG() if the export fails when
> handling fault from kernel context, but my understanding is that export can fail
> if and only if there's a fatal error elsewhere, i.e. the UV essentialy _ensures_
> success, and goes straight to BUG()/panic() if something goes wrong.
>
> On the guest side, accesses to exported (swapped) secure pages generate intercepts
> and KVM faults in the page. To do so, KVM freezes the page/folio refcount, tells
> the UV to import the page, and then unfreezes the page/folio. But very crucially,
> when _anything_ in the untrusted host attempts to access the secure page, the
> above fault handling for untrusted host accesses kicks in. In other words, the
> guest can cause thrash, but can't bring down the host.
>

Yeah, this is very similar to what I was trying to propose. Except in
this case, the backing store, i.e. guest_mem, will have to leave the fault
unhandled for untrusted host accesses to private ranges of the
guest_mem file.

> TDX on the other hand silently poisons memory, i.e. doesn't even generate a
> synchronous fault. Thus the kernel needs to be 100% perfect on preventing _any_
> accesses to private memory from the host, and doing that is non-trivial and
> invasive.
>
> SNP does synchronously fault, but the automatically converting in the #PF handler
> got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the
> guest can trigger concurrent RMP modifications. So the end result ends up being
> the same as TDX, host accesses need to be completely prevented.
>
> Again, this is all doable, but costly. And IMO, provides very little value.

With this hybrid approach, with backdoor access to pfns from KVM, do
we see a scenario where the host can bypass the guest_mem restrictions
and still be able to access the private ranges using HVA ranges? One
possibility is that these pages are mapped in the IOMMU (when they are
shared) and then get converted to private without getting unmapped
from IOMMU. Maybe KVM can disallow converting the ranges which are
pinned for DMA (not sure if there is a way to do that).
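
One illustrative way to approximate that check; folio_maybe_dma_pinned() only
detects FOLL_PIN users such as VFIO, and the helper itself is hypothetical:

static bool kvm_gmem_range_maybe_pinned(struct address_space *mapping,
                                        pgoff_t start, pgoff_t end)
{
        pgoff_t index;

        for (index = start; index < end; index++) {
                struct folio *folio = filemap_get_folio(mapping, index);

                if (IS_ERR_OR_NULL(folio))
                        continue;

                if (folio_maybe_dma_pinned(folio)) {
                        folio_put(folio);
                        return true;    /* refuse shared->private conversion */
                }
                folio_put(folio);
        }

        return false;
}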

A few additional benefits here:
1) Possibly handle the pkvm usecase in this series without the need
for additional modifications.
2) Handling UPM for normal VMs possibly could get simpler as this
hybrid approach can allow preserving the contents across conversions.

>
> Allowing things like mbind() is nice-to-have at best, as implementing fbind()
> isn't straightforward and arguably valuable to have irrespective of this
> discussion, e.g. to allow userspace to say "use this policy regardless of what
> process maps the file".
>

Agreed, having mbind supported is not a significant gain given the cost here.

> Using a common memory pool (same physical page is used for both shared and private)
> is a similar story. There are plenty of existing controls to limit userspace/guest
> memory usage and to deal with OOM scenarios, so barring egregious host accounting
> and/or limiting bugs, which would affect _all_ VM types, the worst case scenario
> is that a VM is terminated because host userspace is buggy. On the slip side, using
> a common pool brings complexity into the kernel, as backing stores would need to
> be taught to deny access to a subset of pages in their mappings, and in multiple
> paths, e.g. faults, read()/write() and similar, page migration, swap, etc.

In this case the backing store that needs to be modified would just be
guest_mem though.

>
> [*] https://lore.kernel.org/linux-mm/[email protected]
>
> > > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> > >
> > > - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> > > is inefficient and inflexible compared to software defined structures, especially
> > > for the expected use cases for CoCo guests.
> > >
> > > - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> > > let alone a 1:1 pfn:gfn mapping.
> >
> > Maybe KVM can ensure that each page of the guest_mem file is
> > associated with a single memslot.
>
> This is a hard NAK. Guest physical address space is guaranteed to have holes
> and/or be discontiguous, for the PCI hole at the top of lower memory. Allowing
> only a single binding would prevent userspace from backing all (or large chunks)
> of guest memory with a single file.
>

Poor choice of words from my side. I meant to suggest that KVM can
ensure that ANY page of the guest_mem file is associated with at most
one memslot.

> ...
> That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but
> not for a generic VMA-based implementation. If the file isn't the single source
> of truth, then forcing all mappings to go away simply can't work.
>
> > > #3 is also a limiter. E.g. if a guest is primarly backed by 1GiB pages, keeping
> > > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> > > and possibly even if the guest converts a few MiB of memory.
> >
> > This caveat maybe can be lived with as shared ranges most likely will
> > not be backed by 1G pages anyways, possibly causing IO performance to
> > get hit. This possibly needs more discussion about conversion
> > granularity used by guests.
>
> Yes, it's not the end of the world. My point is that separating shared and private
> memory provides more flexibility. Maybe that flexibility never ends up being
> super important, but at the same time we shouldn't willingly paint ourselves into
> a corner.

There are some performance implications here with the split approach.

This flexibility actually comes with the cost of effectively managing
double allocation. As the granularity of mappings increases for the
shared memory, it gets difficult to cap the amount of double
allocation, so effectively it comes down to always using a smaller
granularity for shared memory, and also for the private memory of
converted ranges. In general, performance requirements will always
push for higher mapping granularities depending on the scale of
usage.

In general, private memory (and also the shared memory for the respective
converted ranges) will always need to be hole punched to ensure that
double allocation doesn't happen. So, even if this is something for
the future, using hugetlbfs pages for backing private memory with the
split model effectively makes it impossible to cap the double
allocation. I am not sure if 1G pages can be handled better with
the hybrid model, but maybe it's worth checking.

The split shared/private mem approach also increases the uncertainties
around memory management in general, where the same amount of memory
which was available earlier is first freed to the system and then
allocated back from the system. E.g. even if hugepages were around
when private memory was initially allocated, further allocations keep
increasing the chances of not being able to use a huge page to
back the memory, even if the whole huge page is private/shared.

2023-05-22 13:54:03

by Michael Roth

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> On Thu, May 11, 2023, Michael Roth wrote:
> > On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> > >
> > > Code is available here if folks want to take a look before any kind of formal
> > > posting:
> > >
> > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > Hi Sean,
> >
> > I've been working on getting the SNP patches ported to this but I'm having
> > some trouble working out a reasonable scheme for how to work the
> > RMPUPDATE hooks into the proposed design.
> >
> > One of the main things is kvm_gmem_punch_hole(): this can free pages
> > back to the host whenever userspace feels like it. Pages that are still
> > marked private in the RMP table will blow up the host if they aren't returned
> > to the normal state before handing them back to the kernel. So I'm trying to
> > add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
> > e.g.:
> >
> > static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
> >                                 loff_t len)
> > {
> >         struct kvm_gmem *gmem = file->private_data;
> >         pgoff_t start = offset >> PAGE_SHIFT;
> >         pgoff_t end = (offset + len) >> PAGE_SHIFT;
> >         struct kvm *kvm = gmem->kvm;
> >
> >         /*
> >          * Bindings must be stable across invalidation to ensure the start+end
> >          * are balanced.
> >          */
> >         filemap_invalidate_lock(file->f_mapping);
> >         kvm_gmem_invalidate_begin(kvm, gmem, start, end);
> >
> >         /* Handle arch-specific cleanups before releasing pages */
> >         kvm_arch_gmem_invalidate(kvm, gmem, start, end);
> >         truncate_inode_pages_range(file->f_mapping, offset, offset + len);
> >
> >         kvm_gmem_invalidate_end(kvm, gmem, start, end);
> >         filemap_invalidate_unlock(file->f_mapping);
> >
> >         return 0;
> > }
> >
> > But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
> > the page in its intended state in the RMP table prior to mapping it into the
> > guest's NPT.
>
> IMO, this approach is wrong. kvm->mem_attr_array is the source of truth for whether
> userspace wants _guest_ physical pages mapped private vs. shared, but the attributes
> array has zero insight into the _host_ physical pages. I.e. SNP shouldn't hook
> kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally
> wrong.
>
> A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves
> memslots, but it's a good analogy for KVM internals). KVM needs to zap all mappings
> for the old memslot gfn, but KVM does not create mappings for the new memslot gfn.
> Same for changing attributes; unmap, but never map.
>
> As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant
> NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from
> establishing a new NPT mapping. And mmu_invalidate_in_progress will reach '0' only
> after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM
> can create new mappings only when both kvm->mem_attr_array and any relevant
> guest_mem bindings have reached steady state.
>
> That leaves the question of when/where to do RMP updates. Off the cuff, I think
> RMP updates (and I _think_ also TDX page conversions) should _always_ be done in
> the context of either (a) file truncation (make host owned, a.k.a. TDX reclaim)
> or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio().
> Under the hood, even though the gfn is the same, the backing pfn is different, i.e.
> installing a shared mapping should _never_ need to touch the RMP because pages
> coming from the normal (non-guest_mem) pool must already be host owned.

Hi Sean, thanks for the suggestions.

I reworked things based on this approach and things seems to work out
pretty nicely for SNP.

I needed to add the hook to kvm_gmem_get_pfn() instead of
kvm_gmem_get_folio() because SNP needs to know the GFN in order to mark
the page as private in the RMP table, but otherwise I think things are
the same as what you had in mind. One downside to this approach is that,
since the hook always gets called during kvm_gmem_get_pfn(), we need to do
an extra RMP lookup to determine whether or not that page has already been
set to private state, vs. being able to assume it's already been put in
the expected state; but it's only a memory access, so not a huge
overhead. Not sure if that would be a concern or not on the TDX side
though.
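
Roughly, the extra check amounts to something like the sketch below; the
helper names and signatures are placeholders for whatever the SNP host
series ends up providing, not the actual hook:

        /*
         * Sketch only: avoid a redundant RMPUPDATE if the pfn is already
         * private.  rmp_entry_is_private()/rmp_make_private() are placeholder
         * names, not the real SNP API.
         */
        static int kvm_gmem_prepare_private_pfn(kvm_pfn_t pfn, gfn_t gfn, int level)
        {
                if (rmp_entry_is_private(pfn))
                        return 0;

                return rmp_make_private(pfn, gfn_to_gpa(gfn), level);
        }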

I put together a tree with some fixups that are needed against the
kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
preparing the initial private state as suggested above, and a
platform-configurable mask that the x86 MMU code can use for determining
whether a fault is for private vs. shared pages.

KVM: x86: Determine shared/private faults using a configurable mask
^ for TDX we could trivially add an inverted analogue of the mask/logic
KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
KVM: x86: Add platform hooks for private memory invalidations
KVM: x86: Add platform hook for initializing private memory
*fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
*fixup (kvm_gmem_solo): KVM: selftests: update kvm_create_guest_memfd struct usage

https://github.com/mdroth/linux/commits/kvm_gmem_solo_x86

I'm hoping these are similarly usable for TDX, but could use some input
from TDX folks on that aspect.
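
Roughly, the mask idea boils down to something like the sketch below (the
field and helper names here are illustrative, not lifted from the branch):

        /*
         * Illustrative only: a per-VM mask identifying the "private" bit in
         * the page fault error code.  SNP would set it to the encrypted-access
         * PFERR bit; TDX could instead key off the shared bit in the GPA with
         * inverted logic.
         */
        static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
        {
                if (!kvm->arch.mmu_private_fault_mask)
                        return false;

                return !!(err & kvm->arch.mmu_private_fault_mask);
        }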

> >
> > Keep in mind that RMP updates can't be done while holding KVM->mmu_lock
> > spinlock, because we also need to unmap pages from the directmap, which can
> > lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
> > need to work around.

This concern also ends up going away since GFP_RECLAIM also has similar
issues when called under kvm->mmu_lock, so having the hook in
kvm_gmem_get_pfn() sort of guarantees we wouldn't hit issues with this.

-Mike

> >
> > Thanks!
> >
> > -Mike
> >
> > [1] https://lore.kernel.org/linux-coco/[email protected]/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a
> >
> > >
> > > [1] https://lore.kernel.org/all/[email protected]
> > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > > [3] https://lore.kernel.org/linux-mm/[email protected]

2023-05-22 17:11:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Mon, May 22, 2023, Michael Roth wrote:
> On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > On Thu, May 11, 2023, Michael Roth wrote:
> I put together a tree with some fixups that are needed against the
> kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> preparing the initial private state as suggested above, and a
> platform-configurable mask that the x86 MMU code can use for determining
> whether a fault is for private vs. shared pages.
>
> KVM: x86: Determine shared/private faults using a configurable mask
> ^ for TDX we could trivially add an inverted analogue of the mask/logic
> KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> KVM: x86: Add platform hooks for private memory invalidations

Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
a "hook" in the form of kvm_unmap_gfn_range(). We'd probably just need a
kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
being zapped is private.

That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
if and only if there's an overlapping memslot. I'll chew on that a bit to see if
there's a way to cleanly handle that case without another hook. I think it's worth
mapping out exactly what we want unbind() to look like anyways, e.g. right now the
code subtly relies on private memslots being immutable.

> KVM: x86: Add platform hook for initializing private memory

This should also be unnecessary. The call to kvm_gmem_get_pfn() is from arch
code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().

The only reason to add another arch hook would be if we wanted to convert the
RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
waiting until #NPF. But I think I would rather add a generic ioctl() to allow
userspace to effectively prefault guest memory, e.g. to setup the RMP before
running a vCPU. Such an ioctl() would potentially be useful in other scenarios,
e.g. on the dest during live migration to reduce jitter.
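
Purely as a strawman, the uAPI could look something like the below; the
name, ioctl number, and layout are all made up for illustration:

        /* Strawman sketch only; nothing like this exists yet. */
        struct kvm_prefault_memory {
                __u64 gpa;
                __u64 size;
                __u64 flags;            /* e.g. a PRIVATE flag to preconvert the RMP */
                __u64 reserved[5];
        };

        #define KVM_PREFAULT_MEMORY     _IOWR(KVMIO, 0xd0, struct kvm_prefault_memory)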

> *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations

There was another bug in this path. The math for handling a non-zero offset into
the file was wrong. The code now looks like:

xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
        struct kvm_gfn_range gfn_range = {
                .start = slot->base_gfn + start - slot->gmem.index,
                .end = slot->base_gfn + min(end - slot->gmem.index, slot->npages),
                .slot = slot,
                .pte = __pte(0),
                .may_block = true,
        };

        if (WARN_ON_ONCE(start < slot->gmem.index ||
                         end > slot->gmem.index + slot->npages))
                continue;

        kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);

        flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
}

2023-05-23 01:09:32

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Mon, May 22, 2023, Michael Roth wrote:
> On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> > On Mon, May 22, 2023, Michael Roth wrote:
> > > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > > On Thu, May 11, 2023, Michael Roth wrote:
> > > I put together a tree with some fixups that are needed against the
> > > kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> > > preparing the initial private state as suggested above, and a
> > > platform-configurable mask that the x86 MMU code can use for determining
> > > whether a fault is for private vs. shared pages.
> > >
> > > KVM: x86: Determine shared/private faults using a configurable mask
> > > ^ for TDX we could trivially add an inverted analogue of the mask/logic
> > > KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> > > KVM: x86: Add platform hooks for private memory invalidations
> >
> > Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
> > a "hook" in the form of kvm_unmap_gfn_range(). We'd probably just need a
> > kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
> > being zapped is private.
>
> kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so
> it might be tricky to tie RMP updates into that path.

Gah, I caught the mmu_lock issue before the end of my email, but forgot to go back
and rethink the first half.

> > That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
> > if and only if there's an overlapping memslot. I'll chew on that a bit to see if
> > there's a way to cleanly handle that case without another hook. I think it's worth
> > mapping out exactly what we want unbind() to look like anyways, e.g. right now the
> > code subtly relies on private memslots being immutable.
>
> I thought the direction you sort of driving at was to completely decouple
> RMP updates for physical pages from the KVM MMU map/unmap paths since the
> life-cycles of those backing pages and associated RMP state are somewhat
> separate from the state of the GFNs and kvm->mem_attr_array. It seems to
> make sense when dealing with things like this unbind() case.
>
> There's also cases like userspaces that opt to not discard memory after
> conversions because they highly favor performance over memory usage. In
> those cases it would make sense to defer marking the pages as shared in
> the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
> KVM MMU invalidation path after a conversion.

Hmm, right. I got overzealous in my desire to avoid new hooks.

> > > KVM: x86: Add platform hook for initializing private memory
> >
> > This should also be unnecessary. The call to kvm_gmem_get_pfn() is from arch
> > code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
> > e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().
>
> I think that approach would work fine. The way I was thinking of things
> is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the
> page before mapping it into the guest, so moving it out into an explicit
> call should work just as well. That would also drop the need for the
> __kvm_gmem_get_pfn() stuff I needed to add for the initial case where we
> need to access the PFN prior to making it private.
>
> >
> > The only reason to add another arch hook would be if we wanted to convert the
> > RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
> > waiting until #NPF. But I think I would rather add a generic ioctl() to allow
> > userspace to effectively prefault guest memory, e.g. to setup the RMP before
> > running a vCPU. Such an ioctl() would potentially be useful in other scenarios,
> > e.g. on the dest during live migration to reduce jitter.
>
> Agreed, deferring the RMPUPDATE until it's actually needed would give us
> more flexibility on optimizing for things like lazy-acceptance.
>
> For less-common scenarios like preallocation it makes sense to make that
> an opt-in sort of thing for userspace to configure explicitly.
>
> >
> > > *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
> >
> > There was another bug in this path. The math for handling a non-zero offset into
> > the file was wrong. The code now looks like:
> >
> > xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> >         struct kvm_gfn_range gfn_range = {
> >                 .start = slot->base_gfn + start - slot->gmem.index,
>
> Sorry if I'm missing something here, but isn't there a risk that:
>
> start - slot->gmem.index
>
> would be less than zero? E.g. starting GFN was 0, but current slot is bound
> at some non-zero offset in the same gmem instance. I guess the warning below
> should catch that, but it seems like a real scenario.

Heh, only if there's a testcase for it. Assuming start >= the slot offset does
seem broken, e.g. if the range-to-invalidate overlaps multiple slots, later slots
will have index==slot->gmem.index > start.

> Since 'index' corresponds to the gmem offset of the current slot, is there any
> reason not to do something like this?:
>
> .start = slot->base_gfn + index - slot->gmem.index,
>
> But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
> we can just simplify to this?:
>
> .start = slot->base_gfn,

No, e.g. if start is partway through a memslot, there's no need to invalidate
the entire memslot. I'll stare at this tomorrow when my brain is hopefully a
bit more functional, I suspect there is a min() and/or max() needed somewhere.

2023-05-23 01:11:43

by Michael Roth

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> On Mon, May 22, 2023, Michael Roth wrote:
> > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > On Thu, May 11, 2023, Michael Roth wrote:
> > I put together a tree with some fixups that are needed against the
> > kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> > preparing the initial private state as suggested above, and a
> > platform-configurable mask that the x86 MMU code can use for determining
> > whether a fault is for private vs. shared pages.
> >
> > KVM: x86: Determine shared/private faults using a configurable mask
> > ^ for TDX we could trivially add an inverted analogue of the mask/logic
> > KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> > KVM: x86: Add platform hooks for private memory invalidations
>
> Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
> a "hook" in the form of kvm_unmap_gfn_range(). We'd probably just need a
> kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
> being zapped is private.

kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so
it might be tricky to tie RMP updates into that path.

>
> That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
> if and only if there's an overlapping memslot. I'll chew on that a bit to see if
> there's a way to cleanly handle that case without another hook. I think it's worth
> mapping out exactly what we want unbind() to look like anyways, e.g. right now the
> code subtly relies on private memslots being immutable.

I thought the direction you sort of driving at was to completely decouple
RMP updates for physical pages from the KVM MMU map/unmap paths since the
life-cycles of those backing pages and associated RMP state are somewhat
separate from the state of the GFNs and kvm->mem_attr_array. It seems to
make sense when dealing with things like this unbind() case.

There's also cases like userspaces that opt to not discard memory after
conversions because they highly favor performance over memory usage. In
those cases it would make sense to defer marking the pages as shared in
the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
KVM MMU invalidation path after a conversion.

>
> > KVM: x86: Add platform hook for initializing private memory
>
> This should also be unnecessary. The call to kvm_gmem_get_pfn() is from arch
> code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
> e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().

I think that approach would work fine. The way I was thinking of things
is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the
page before mapping it into the guest, so moving it out into an explicit
call should work just as well. That would also drop the need for the
__kvm_gmem_get_pfn() stuff I needed to add for the initial case where we
need to access the PFN prior to making it private.

>
> The only reason to add another arch hook would be if we wanted to convert the
> RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
> waiting until #NPF. But I think I would rather add a generic ioctl() to allow
> userspace to effectively prefault guest memory, e.g. to setup the RMP before
> running a vCPU. Such an ioctl() would potentially be useful in other scenarios,
> e.g. on the dest during live migration to reduce jitter.

Agreed, deferring the RMPUPDATE until it's actually needed would give us
more flexibility on optimizing for things like lazy-acceptance.

For less-common scenarios like preallocation it makes sense to make that
an opt-in sort of thing for userspace to configure explicitly.

>
> > *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
>
> There was another bug in this path. The math for handling a non-zero offset into
> the file was wrong. The code now looks like:
>
> xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
>         struct kvm_gfn_range gfn_range = {
>                 .start = slot->base_gfn + start - slot->gmem.index,

Sorry if I'm missing something here, but isn't there a risk that:

start - slot->gmem.index

would be less than zero? E.g. starting GFN was 0, but current slot is bound
at some non-zero offset in the same gmem instance. I guess the warning below
should catch that, but it seems like a real scenario.

Since 'index' corresponds to the gmem offset of the current slot, is there any
reason not to do something like this?:

.start = slot->base_gfn + index - slot->gmem.index,

But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
we can just simplify to this?:

.start = slot->base_gfn,

-Mike

>                 .end = slot->base_gfn + min(end - slot->gmem.index, slot->npages),
>                 .slot = slot,
>                 .pte = __pte(0),
>                 .may_block = true,
>         };
>
>         if (WARN_ON_ONCE(start < slot->gmem.index ||
>                          end > slot->gmem.index + slot->npages))
>                 continue;
>
>         kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
>
>         flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
> }

2023-06-06 19:20:45

by Ackerley Tng

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)


I've ported selftests from Chao and me [1] while working on hugetlb support
for guest_mem [2].

In the process, I found some bugs and have some suggestions for guest_mem.
Please see separate commits at [3].

Here are some highlights/questions:

+ "KVM: guest_mem: Explain the use of the uptodate flag for gmem"
  + Generally, the uptodate flag means that the contents of this page match
    the backing store. Since gmem is memory-backed, does "uptodate" for gmem
    mean "zeroed"?
+ "KVM: guest_mem: Don't re-mark accessed after getting a folio" and
  "KVM: guest_mem: Don't set dirty flag for folio"
  + Do we need to folio_mark_accessed(), when it was created with
    FGP_ACCESSED?
  + What is the significance of these LRU flags when gmem doesn't support
    swapping/eviction?
+ "KVM: guest_mem: Align so that at least 1 page is allocated"
  + Bug in current implementation: without this alignment, fallocate() of a
    size less than the gmem page size will result in no allocation at all
  + Both shmem and hugetlbfs perform this alignment
+ "KVM: guest_mem: Add alignment checks"
  + Implemented the alignment checks for guest_mem because hugetlb on gmem
    would hit a BUG_ON without this check
+ "KVM: guest_mem: Prevent overflows in kvm_gmem_invalidate_begin()"
  + Sean fixed a bug in the offset-to-gfn conversion in
    kvm_gmem_invalidate_begin() earlier, adding a WARN_ON_ONCE()
  + Code will always hit WARN_ON_ONCE() when the entire file is closed and
    all offsets are invalidated, so WARN_ON_ONCE() should be removed
  + Vishal noticed that the conversion might result in an overflow, so I
    fixed that
+ And of course, hugetlb support! Please let me know what you think of the
  approach proposed at [2].

[1] https://lore.kernel.org/all/[email protected]/T/
[2] https://lore.kernel.org/lkml/[email protected]/T/
[3] https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1

2023-06-07 00:01:04

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Tue, Jun 06, 2023, Ackerley Tng wrote:
>
> I've ported selftests from Chao and I [1] while working on hugetlb support
> for guest_mem [2].
>
> In the process, I found some bugs and have some suggestions for guest_mem.
> Please see separate commits at [3].
>
> Here are some highlights/questions:
>
> + "KVM: guest_mem: Explain the use of the uptodate flag for gmem"
> + Generally, the uptodate flag means that the contents of this page match the
> backing store. Since gmem is memory-backed, does "uptodate" for gmem mean
> "zeroed"?

Don't read too much into the code, my POC was very much a "beat on it until it
works" scenario.

> + "KVM: guest_mem: Don't re-mark accessed after getting a folio" and "KVM:
> guest_mem: Don't set dirty flag for folio"
> + Do we need to folio_mark_accessed(), when it was created with
> FGP_ACCESSED?

Probably not. And as you note below, it's all pretty nonsensical anyways.

> + What is the significance of these LRU flags when gmem doesn't support
> swapping/eviction?

Likely none. I used the filemap APIs in my POC because it was easy, not because
it was necessarily the best approach, i.e. that the folios/pages show up in the
LRUs is an unwanted side effect, not a feature. If guest_memfd only needs a small
subset of the filemap support, going with a true from-scratch implementation on
top of xarray might be cleaner overall, e.g. would avoid the need for a new flag
to say "this folio can't be migrated even though it's on the LRUs".

> + "KVM: guest_mem: Align so that at least 1 page is allocated"
> + Bug in current implementation: without this alignment, fallocate() of
> a size less than the gmem page size will result in no allocation at all

I'm not convinced this is a bug. I don't see any reason to allow allocating and
punching holes in sub-page granularity.

> + Both shmem and hugetlbfs perform this alignment
> + "KVM: guest_mem: Add alignment checks"
> + Implemented the alignment checks for guest_mem because hugetlb on gmem
> would hit a BUG_ON without this check
> + "KVM: guest_mem: Prevent overflows in kvm_gmem_invalidate_begin()"
> + Sean fixed a bug in the offset-to-gfn conversion in
> kvm_gmem_invalidate_begin() earlier, adding a WARN_ON_ONCE()

As Mike pointed out, there's likely still a bug here[*]. I was planning on
diving into that last week, but that never happened. If you or anyone else can
take a peek and/or write a testcase, that would be awesome.

: Heh, only if there's a testcase for it. Assuming start >= the slot offset does
: seem broken, e.g. if the range-to-invalidate overlaps multiple slots, later slots
: will have index==slot->gmem.index > start.
:
: > Since 'index' corresponds to the gmem offset of the current slot, is there any
: > reason not to do something like this?:
: >
: > .start = slot->base_gfn + index - slot->gmem.index,
: >
: > But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
: > we can just simplify to this?:
: >
: > .start = slot->base_gfn,
:
: No, e.g. if start is partway through a memslot, there's no need to invalidate
: the entire memslot. I'll stare at this tomorrow when my brain is hopefully a
: bit more functional, I suspect there is a min() and/or max() needed somewhere.

[*] https://lore.kernel.org/all/[email protected]

> + Code will always hit WARN_ON_ONCE() when the entire file is closed and
> all offsets are invalidated, so WARN_ON_ONCE() should be removed
> + Vishal noticed that the conversion might result in an overflow, so I
> fixed that
> + And of course, hugetlb support! Please let me know what you think of the
> approach proposed at [2].
>
> [1] https://lore.kernel.org/all/[email protected]/T/
> [2] https://lore.kernel.org/lkml/[email protected]/T/
> [3] https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1

2023-06-08 17:20:05

by Ackerley Tng

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

Sean Christopherson <[email protected]> writes:

> ...

> Probably not. And as you note below, it's all pretty nonsensical anyways.

>> + What is the significance of these LRU flags when gmem doesn't support
>>   swapping/eviction?

> Likely none. I used the filemap APIs in my POC because it was easy, not
> because it was necessarily the best approach, i.e. that the folios/pages
> show up in the LRUs is an unwanted side effect, not a feature. If
> guest_memfd only needs a small subset of the filemap support, going with a
> true from-scratch implementation on top of xarray might be cleaner overall,
> e.g. would avoid the need for a new flag to say "this folio can't be
> migrated even though it's on the LRUs".

For hugetlb support on gmem, using an xarray in place of a filemap should
work fine. Page migration could come up in the future - perhaps migration
code works better with filemap? Not sure.


>> + "KVM: guest_mem: Align so that at least 1 page is allocated"
>>   + Bug in current implementation: without this alignment, fallocate() of a
>>     size less than the gmem page size will result in no allocation at all

> I'm not convinced this is a bug. I don't see any reason to allow allocating
> and punching holes in sub-page granularity.


I looked at the code more closely, and you're right. len is checked to be 4K
aligned. When userspace requests a gmem page size larger than 4K (gmem THP),
the allocation loop still does the right thing.

This issue only arises for hugetlb pages. I'll rebase the next revision of the
hugetlb series accordingly.


>> + Both shmem and hugetlbfs perform this alignment
>> + "KVM: guest_mem: Add alignment checks"
>>   + Implemented the alignment checks for guest_mem because hugetlb on gmem
>>     would hit a BUG_ON without this check
>> + "KVM: guest_mem: Prevent overflows in kvm_gmem_invalidate_begin()"
>>   + Sean fixed a bug in the offset-to-gfn conversion in
>>     kvm_gmem_invalidate_begin() earlier, adding a WARN_ON_ONCE()

> As Mike pointed out, there's likely still a bug here[*]. I was planning on
> diving into that last week, but that never happened. If you or anyone else
> can take a peek and/or write a testcase, that would be awesome.

> : Heh, only if there's a testcase for it. Assuming start >= the slot offset
> : does seem broken, e.g. if the range-to-invalidate overlaps multiple slots,
> : later slots will have index==slot->gmem.index > start.
> :
> : > Since 'index' corresponds to the gmem offset of the current slot, is
> : > there any reason not to do something like this?:
> : >
> : > .start = slot->base_gfn + index - slot->gmem.index,
> : >
> : > But then, if that's the case, wouldn't index == slot->gmem.index?
> : > Suggesting we can just simplify to this?:
> : >
> : > .start = slot->base_gfn,
> :
> : No, e.g. if start is partway through a memslot, there's no need to
> : invalidate the entire memslot. I'll stare at this tomorrow when my brain is
> : hopefully a bit more functional, I suspect there is a min() and/or max()
> : needed somewhere.

> [*] https://lore.kernel.org/all/[email protected]


I think I have fixed this, please see "KVM: guest_mem: Prevent overflows in
kvm_gmem_invalidate_begin()" [1].

This patch does take into account that start could be greater than
slot->gmem.index, when userspace chooses to punch holes beginning in the
middle of the memslot.

The process could be split into figuring out file indices, then GFNs:

1. Figure out the start and end in terms of index in the file
   + index_start: taking max(start, slot->gmem.index)
     + start will only be greater than slot->gmem.index but not greater than
       the end of the slot, due to the nature of xa_for_each_range()
   + index_end: taking min(end, slot->gmem.index + slot->npages)
2. Convert indices to GFNs

This also prevents overflows as described at [1].
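
In code, the clamping described above works out to roughly the following (a
simplified sketch of the idea, not the exact diff in [1]):

        /* Inside the xa_for_each_range() loop: clamp to the slot, then convert. */
        pgoff_t index_start = max(start, slot->gmem.index);
        pgoff_t index_end = min(end, slot->gmem.index + slot->npages);

        struct kvm_gfn_range gfn_range = {
                .start = slot->base_gfn + (index_start - slot->gmem.index),
                .end = slot->base_gfn + (index_end - slot->gmem.index),
                .slot = slot,
                .may_block = true,
        };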

>> ...

[1] https://github.com/googleprodkernel/linux-cc/commit/bcc304e3657a998b8f61aa1b841754fbb90d8994

2023-07-13 23:03:10

by Ackerley Tng

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

Sean Christopherson <[email protected]> writes:

> On Fri, May 05, 2023, Ackerley Tng wrote:
>>
>> Hi Sean,
>>
>> Thanks for implementing this POC!
>>
>> ... snip ...
>>
>
> I don't love either approach because it means a file created in the context
> of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> that it can't do anything with except close(). I doubt that matters in practice
> though, e.g. when the VM dies, all memory can be freed so that the file ends up
> being little more than a shell. And if we go that route, there's no need to grab
> a reference to the file during bind, KVM can just grab a longterm reference when
> the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
>
> ... snip ...
>
> My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> and not a common syscall. If the file isn't tightly coupled to a single VM, then
> punching a hole is further complicated by needing to deal with invalidating multiple
> regions that are bound to different @kvm instances. It's not super complex, but
> AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> having one VM own the memory will complicate even if/when we get to the point where
> VMs can share "private" memory, and the gmem code would still need to deal with
> grabbing a module reference.

I’d like to follow up on this discussion about a guest_mem file
outliving the VM and whether to have a VM-scoped ioctl or a KVM ioctl.

Here's a POC of delayed binding of a guest_mem file to a memslot, where
the guest_mem file outlives the VM [1].

I also hope to raise some points before we do the first integration of
guest_mem patches!


A use case for guest_mem inodes outliving the VM is when the host VMM
needs to be upgraded. The guest_mem inode is passed between two VMs on
the same host machine and all memory associated with the inode needs to
be retained.

To support the above use case, binding of memslots is delayed until
first use, so that the following inode passing flow can be used:

1. Source (old version of host VMM) process passes guest_mem inodes to
destination (new version of host VMM) process via unix sockets.
2. Destination process initializes memslots identical to source process.
3. Destination process invokes ioctl to migrate guest_mem inode over to
destination process by unbinding all memslots from the source VM and
binding them to the destination VM. (The kvm pointer is updated in
this process)

Without delayed binding, step 2 will fail since initialization of
memslots would check and find that the kvm pointer in the guest_mem
inode points to the kvm in the source process.


These two patches contain the meat of the changes required to support
delayed binding:

https://github.com/googleprodkernel/linux-cc/commit/93b31a006ef2e4dbe1ef0ec5d2534ca30f3bf60c
https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df

Some things to highlight for the approach set out in these two patches:

1. Previously, closing the guest_mem file in userspace was taken to mean
that all associated memory is to be removed and cleared. With these
two patches, each memslot also holds a reference to the file (and
therefore inode) and so even if the host VMM closes the fd, the VM
will be able to continue to function.

This is desirable to userspace since closing the file should not be
interpreted as a command to clear memory. This is aligned with the
way tmpfs files are used with KVM before guest_mem: when the file is
closed in userspace, the memory contents are still mapped and can
still be used by the VM. fallocate(PUNCH_HOLE) is how userspace
should command memory to be removed, just like munmap() would be used
to remove memory from use by KVM.

2. Creating a guest mem file no longer depends on a specific VM and
hence the guest_mem creation ioctl can be a system ioctl instead of a
VM specific ioctl. This will also address Chao's concern at [3].


I also separated cleaning up files vs inodes in
https://github.com/googleprodkernel/linux-cc/commit/0f5aa18910c515141e57e05c4cc791022047a242,
which I believe is more aligned with how files and inodes are cleaned up
in FSes. This alignment makes it easier to extend gmem to hugetlb, for
one. While working on this, I was also wondering if we should perhaps be
storing the inode pointer in slot->gmem instead of the file pointer? The
memory is associated with an inode->mapping rather than the file. Are we
binding to a userspace handle on the inode (store file pointer), or are
we really referencing the inode (store inode pointer)?

The branch [1] doesn't handle the bug Sean previously mentioned at [2]:
Need to take a reference on the KVM module, so that even if guest_mem
files are not bound to any VM, the KVM module cannot be unloaded. If the
KVM module can be unloaded while guest_mem files are open, then
userspace may be able to crash the kernel by invoking guest_mem
functions that had been unloaded together with KVM.


[1] https://github.com/googleprodkernel/linux-cc/tree/gmem-delayed-binding
[2] https://lore.kernel.org/lkml/ZFWli2%[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/

2023-07-14 20:00:04

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Thu, Jul 13, 2023, Ackerley Tng wrote:
> Sean Christopherson <[email protected]> writes:
>
> > On Fri, May 05, 2023, Ackerley Tng wrote:
> >>
> >> Hi Sean,
> >>
> >> Thanks for implementing this POC!
> >>
> >> ... snip ...
> >>
> >
> > I don't love either approach because it means a file created in the context
> > of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> > that it can't do anything with except close(). I doubt that matters in practice
> > though, e.g. when the VM dies, all memory can be freed so that the file ends up
> > being little more than a shell. And if we go that route, there's no need to grab
> > a reference to the file during bind, KVM can just grab a longterm reference when
> > the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
> >
> > ... snip ...
> >
> > My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> > and not a common syscall. If the file isn't tightly coupled to a single VM, then
> > punching a hole is further complicated by needing to deal with invalidating multiple
> > regions that are bound to different @kvm instances. It's not super complex, but
> > AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> > having one VM own the memory will complicate even if/when we get to the point where
> > VMs can share "private" memory, and the gmem code would still need to deal with
> > grabbing a module reference.
>
> I’d like to follow up on this discussion about a guest_mem file
> outliving the VM and whether to have a VM-scoped ioctl or a KVM ioctl.
>
> Here's a POC of delayed binding of a guest_mem file to a memslot, where
> the guest_mem file outlives the VM [1].
>
> I also hope to raise some points before we do the first integration of
> guest_mem patches!
>
> A use case for guest_mem inodes outliving the VM is when the host VMM
> needs to be upgraded.

Translation: to support intra-host migration, a.k.a. KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM

> The guest_mem inode is passed between two VMs on the same host machine and
> all memory associated with the inode needs to be retained.
>
> To support the above use case, binding of memslots is delayed until
> first use, so that the following inode passing flow can be used:
>
> 1. Source (old version of host VMM) process passes guest_mem inodes to
> destination (new version of host VMM) process via unix sockets.
> 2. Destination process initializes memslots identical to source process.
> 3. Destination process invokes ioctl to migrate guest_mem inode over to
> destination process by unbinding all memslots from the source VM and
> binding them to the destination VM. (The kvm pointer is updated in
> this process)
>
> Without delayed binding, step 2 will fail since initialization of
> memslots would check and find that the kvm pointer in the guest_mem
> inode points to the kvm in the source process.
>
>
> These two patches contain the meat of the changes required to support
> delayed binding:
>
> https://github.com/googleprodkernel/linux-cc/commit/93b31a006ef2e4dbe1ef0ec5d2534ca30f3bf60c
> https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df
>
> Some things to highlight for the approach set out in these two patches:
>
> 1. Previously, closing the guest_mem file in userspace was taken to mean
> that all associated memory is to be removed and cleared. With these
> two patches, each memslot also holds a reference to the file (and
> therefore inode) and so even if the host VMM closes the fd, the VM
> will be able to continue to function.
>
> This is desirable to userspace since closing the file should not be
> interpreted as a command to clear memory.

100% agreed. However, after more thought since we first discussed this, I think
that a deferred binding is the wrong way to solve this particular problem. More
below.

> This is aligned with the
> way tmpfs files are used with KVM before guest_mem: when the file is
> closed in userspace, the memory contents are still mapped and can
> still be used by the VM. fallocate(PUNCH_HOLE) is how userspace
> should command memory to be removed, just like munmap() would be used
> to remove memory from use by KVM.
>
> 2. Creating a guest mem file no longer depends on a specific VM and
> hence the guest_mem creation ioctl can be a system ioctl instead of a
> VM specific ioctl. This will also address Chao's concern at [3].

That concern is a non-issue for QEMU as memory backends are created after
accelerators are initialized, and AFAICT Google's VMM behaves similarly.

And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
to play nice with the required ordering. If userspace can get at /dev/kvm, then
it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
the only dependencies for KVM_CREATE_VM should be known/resolved long before the
VMM knows it wants to use gmem.

Using a non-KVM syscall would eliminate any such dependencies, but in my very
strong opinion, that is not a good reason to go with a syscall.

> I also separated cleaning up files vs inodes in
> https://github.com/googleprodkernel/linux-cc/commit/0f5aa18910c515141e57e05c4cc791022047a242,
> which I believe is more aligned with how files and inodes are cleaned up
> in FSes.

I didn't take these, though I am in the process of incorporating parts of the
underlying feedback (see below).

> This alignment makes it easier to extend gmem to hugetlb, for one.

I am not convinced that utilizing hugetlb is the best way to provide 1GiB support
in gmem. I'm not necessarily against it, but there's a fair bit of uncertainty
around the future of hugetlb, and there are fundamental aspects of hugetlb that
may be non-goals for the gmem use case, e.g. mapping the memory with 1GiB pages
in the userspace page tables likely isn't necessary, and might even be
undesirable.

And practically speaking, the changes aren't _that_ invasive, i.e. punting any
necessary refactoring should not substantially increase the size/complexity of
hugetlb support (*if* we end up adding it).

That said, I do think we'll implement .evict_inode() very early on in order to
support SNP and TDX, because (in addition to PUNCH_HOLE) that's when the backing
memory will be freed and thus reclaimed, i.e. unassigned in the RMP (SNP) / zeroed
with the shared key ID (TDX).

> While working on this, I was also wondering if we should perhaps be
> storing the inode pointer in slot->gmem instead of the file pointer? The
> memory is associated with an inode->mapping rather than the file. Are we
> binding to a userspace handle on the inode (store file pointer), or are
> we really referencing the inode (store inode pointer)?

Conceptually, I think KVM should bind to the file. The inode is effectively
the raw underlying physical storage, while the file is the VM's view of that
storage.

Practically, I think that gives us a clean, intuitive way to handle intra-host
migration. Rather than transfer ownership of the file, instantiate a new file
for the target VM, using the gmem inode from the source VM, i.e. create a hard
link. That'd probably require new uAPI, but I don't think that will be hugely
problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped
until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
memslots/bindings are identical), but that should be easy enough to enforce.
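
Purely as a strawman for the new uAPI (the name, number, and layout below
are invented for this sketch):

        /* Strawman: create a new guest_memfd for the target VM that shares the
         * source guest_memfd's inode, i.e. a "hard link" to the same memory. */
        struct kvm_link_guest_memfd {
                __u32 fd;               /* existing guest_memfd from the source VM */
                __u32 flags;
                __u64 reserved[6];
        };

        #define KVM_LINK_GUEST_MEMFD    _IOWR(KVMIO, 0xd1, struct kvm_link_guest_memfd)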

That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
the memory and the *contents* of memory to outlive the VM, i.e. be effectively
transfered to the new target VM. And we'll maintain the invariant that each
guest_memfd is bound 1:1 with a single VM.

As above, that should also help us draw the line between mapping memory into a
VM (file), and freeing/reclaiming the memory (inode).

There will be extra complexity/overhead as we'll have to play nice with the
possibility of multiple files per inode, e.g. to zap mappings across all files
when punching a hole, but the extra complexity is quite small, e.g. we can use
address_space.private_list to keep track of the guest_memfd instances associated
with the inode.
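
As a rough sketch of that bookkeeping (the structure members and the
invalidate call are illustrative, not an actual implementation):

        /*
         * Illustrative only: one inode, multiple guest_memfd files.  Each
         * kvm_gmem instance is assumed to be linked into the inode's
         * address_space.private_list; punching a hole walks the list and zaps
         * every VM's bindings.
         */
        static void kvm_gmem_zap_all_files(struct inode *inode, pgoff_t start,
                                           pgoff_t end)
        {
                struct kvm_gmem *gmem;

                list_for_each_entry(gmem, &inode->i_mapping->private_list, entry)
                        kvm_gmem_invalidate_begin(gmem->kvm, gmem, start, end);
        }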

Setting aside TDX and SNP for the moment, as it's not clear how they'll support
memory that is "private" but shared between multiple VMs, I think per-VM files
would work well for sharing gmem between two VMs. E.g. would allow a given page
to be bound to a different gfn for each VM, would allow having different permissions
for each file (e.g. to allow fallocate() only from the original owner).

2023-07-15 00:40:55

by Vishal Annapurve

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Fri, Jul 14, 2023 at 12:29 PM Sean Christopherson <[email protected]> wrote:
> ...
> And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
> the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
> to play nice with the required ordering. If userspace can get at /dev/kvm, then
> it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
> the only dependencies for KVM_CREATE_VM should be known/resolved long before the
> VMM knows it wants to use gmem.

I am not sure about the benefits of tying gmem creation to any given
kvm instance. I think the most important requirement here is that a
given gmem range is always tied to a single VM - This can be enforced
when memslots are bound to the gmem files.

I believe "Required ordering" is that gmem files are created first and
then supplied while creating the memslots whose gpa ranges can
generate private memory accesses.
Is there any other ordering we want to enforce here?

> ...
> Practically, I think that gives us a clean, intuitive way to handle intra-host
> migration. Rather than transfer ownership of the file, instantiate a new file
> for the target VM, using the gmem inode from the source VM, i.e. create a hard
> link. That'd probably require new uAPI, but I don't think that will be hugely
> problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped
> until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
> memslots/bindings are identical), but that should be easy enough to enforce.
>
> That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
> the memory and the *contents* of memory to outlive the VM, i.e. be effectively
> transfered to the new target VM. And we'll maintain the invariant that each
> guest_memfd is bound 1:1 with a single VM.
>
> As above, that should also help us draw the line between mapping memory into a
> VM (file), and freeing/reclaiming the memory (inode).
>
> There will be extra complexity/overhead as we'll have to play nice with the
> possibility of multiple files per inode, e.g. to zap mappings across all files
> when punching a hole, but the extra complexity is quite small, e.g. we can use
> address_space.private_list to keep track of the guest_memfd instances associated
> with the inode.

Are we talking about a different usecase of sharing gmem fd across VMs
other than intra-host migration?
If not, ideally only one of the files should be catering to the guest
memory mappings at any given time. i.e. any inode should be ideally
bound to (through the file) a single kvm instance, as we are
planning to ensure that guest_memfd can't be mapped until
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM is invoked on the target side.

2023-07-15 00:49:08

by Sean Christopherson

[permalink] [raw]
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

On Fri, Jul 14, 2023, Vishal Annapurve wrote:
> On Fri, Jul 14, 2023 at 12:29 PM Sean Christopherson <[email protected]> wrote:
> > ...
> > And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
> > the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
> > to play nice with the required ordering. If userspace can get at /dev/kvm, then
> > it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
> > the only dependencies for KVM_CREATE_VM should be known/resolved long before the
> > VMM knows it wants to use gmem.
>
> I am not sure about the benefits of tying gmem creation to any given
> kvm instance.

IMO, making gmem->kvm immutable is very nice to have, e.g. gmem->kvm will always be
valid and the refcounting rules are fairly straightforward.

> I think the most important requirement here is that a given gmem range is always
> tied to a single VM

I'm not convinced that that requirement will always hold true (see below).

> This can be enforced when memslots are bound to the gmem files.

Yeah, but TBH, waiting until the guest faults in memory to detect an invalid memslot
is gross. And looking more closely, taking filemap_invalidate_lock(), i.e. taking
a semaphore for write, in the page fault path is a complete non-starter. The
"if (existing_slot == slot)" check is likely a non-starter, because KVM handles
FLAGS_ONLY memslot updates, e.g. toggling dirty logging, by duplicating and
replacing the memslot, not by updating the live memslot.

> I believe "Required ordering" is that gmem files are created first and
> then supplied while creating the memslots whose gpa ranges can
> generate private memory accesses.
> Is there any other ordering we want to enforce here?

I wasn't talking about enforcing arbitrary ordering, I was simply talking about
what userspace literally needs to be able to do KVM_CREATE_GUEST_MEMFD.

> > Practically, I think that gives us a clean, intuitive way to handle intra-host
> > migration. Rather than transfer ownership of the file, instantiate a new file
> > for the target VM, using the gmem inode from the source VM, i.e. create a hard
> > link. That'd probably require new uAPI, but I don't think that will be hugely
> > problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped
> > until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
> > memslots/bindings are identical), but that should be easy enough to enforce.
> >
> > That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
> > the memory and the *contents* of memory to outlive the VM, i.e. be effectively
> > transfered to the new target VM. And we'll maintain the invariant that each
> > guest_memfd is bound 1:1 with a single VM.
> >
> > As above, that should also help us draw the line between mapping memory into a
> > VM (file), and freeing/reclaiming the memory (inode).
> >
> > There will be extra complexity/overhead as we'll have to play nice with the
> > possibility of multiple files per inode, e.g. to zap mappings across all files
> > when punching a hole, but the extra complexity is quite small, e.g. we can use
> > address_space.private_list to keep track of the guest_memfd instances associated
> > with the inode.
>
> Are we talking about a different usecase of sharing gmem fd across VMs
> other than intra-host migration?

Well, I am :-) I don't want to build all of this on an assumption that we'll
never ever want to share a guest_memfd across multiple VMs. E.g. SEV (and SEV-ES?)
already have the migration helper concept, and I've heard more than a few rumblings
of TDX utilizing helper TDs. IMO, it's not far fetched at all to think that there
will eventually be a need to let multiple VMs share a guest_memfd.

> If not, ideally only one of the files should be catering to the guest
> memory mappings at any given time. i.e. any inode should be ideally
> bound to (through the file) a single kvm instance,

Why? Honest question, what does it buy us?

For TDX and SNP intra-host migration, it should be easy enough to ensure the new
VM can't create mappings before migration, and that the old VM can't create mappings
or run after migration. I don't see that being any harder if the source and
dest use different files.

FWIW, it might be easier to hold off on this discussion until I post the RFC
(which is going to happen on Monday at this point), as then we'll have actual code
to discuss.

> as we are planning to ensure that guest_memfd can't be mapped until
> KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM is invoked on the target side.