This patch series implements KVM guest private memory for confidential
computing scenarios like Intel TDX[1]. If a TDX host accesses
TDX-protected guest memory, a machine check can occur, which can in turn
crash the running host system. This is unacceptable for multi-tenant
configurations. The host accesses include those from KVM userspace like
QEMU. This series addresses KVM userspace induced crashes by introducing
new mm and KVM interfaces so that KVM userspace can still manage guest
memory via an fd-based approach, but can never access the guest memory
content.
The patch series touches both core mm and KVM code. I would appreciate
it if Andrew/Hugh and Paolo/Sean could review and pick up these patches.
Any other reviews are always welcome.
- 01: mm change, target for mm tree
- 02-08: KVM change, target for KVM tree
Given that KVM is currently the only user of the mm part, I have talked
with Paolo and he is OK with merging the mm change through the KVM tree,
but Reviewed-by/Acked-by tags are still expected from the mm people.
The patches have been verified in an Intel TDX environment. In addition,
Vishal has done excellent work on the selftests[4] dedicated to this
series, making it possible to test the series without special hardware
or the elaborate steps of building a VM environment. See the Test
section below for more info.
Compared to the previous version, this version redesigns the mm part of
the code and excludes F_SEAL_AUTO_ALLOCATE and the man page changes from
this series. See the Changelog section below for more info.
Introduction
============
KVM userspace being able to crash the host is horrible. Under the
current KVM architecture, all guest memory is inherently accessible from
KVM userspace and is exposed to the crash issue mentioned above. The
goal of this series is to provide a solution that aligns mm and KVM on a
userspace-inaccessible approach to exposing guest memory.
Normally, KVM populates the secondary page table (e.g. EPT) by using a
host virtual address (hva) from the core mm page table (e.g. the x86
userspace page table). This requires guest memory to be mmaped into KVM
userspace, which is also where the crash issue mentioned above
originates. In theory, apart from the 'shared' memory used for device
emulation etc., guest memory doesn't have to be mmaped into KVM
userspace.
This series introduces fd-based guest memory which will not be mmaped
into KVM userspace. KVM populates the secondary page table by using an
fd/offset pair backed by a memory file system. The fd can be created
from a supported memory filesystem like tmpfs/hugetlbfs, and KVM can
interact with it directly through a newly introduced in-kernel
interface, thereby removing KVM userspace from the path of
accessing/mmaping the guest memory.
Kirill had a patch [2] to address the same issue in a different way. It
tracks guest encrypted memory at the 'struct page' level and relies on
HWPOISON to reject userspace access. The patch has been discussed in
several online and offline threads and resulted in a design document [3]
which is also the original proposal for this series. This patch series
has since evolved as more comments were received from the community, but
the major concepts in [3] still hold true, so reading it is recommended.
The patch series may also be useful for other use cases. For example, a
pure software approach may use it to harden itself against unintentional
access to guest memory. This series is designed with these use cases in
mind but doesn't contain code that directly supports them, and
extensions might be needed.
mm change
=========
Introduces a new userspace MFD_INACCESSIBLE flag for memfd_create() so
that a memory fd created with this flag cannot be accessed with read(),
write(), mmap() etc., i.e. via normal MMU operations. The only way to
use it is to pass it to a third kernel module like KVM and rely on that
module to access the fd through the newly added inaccessible_memfd
kernel interface. The inaccessible_memfd interface bridges the memory
file subsystems (e.g. tmpfs/hugetlbfs) and their users (KVM in this
case) and provides bi-directional communication between them.
KVM change
==========
Extends the KVM memslot to provide guest private (encrypted) memory from
an fd. With this extension, a single memslot can maintain both private
memory through a private fd (private_fd/private_offset) and shared
(unencrypted) memory through a userspace mmaped host virtual address
(userspace_addr). For a particular guest page, the corresponding page in
the KVM memslot can only be either private or shared, and only one of
the shared/private parts of the memslot is visible to the guest.
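As a rough illustration of the dual-backed memslot described above, the
following userspace C model (not kernel code) shows how a gfn could
resolve either to a page index in the private fd or to a host virtual
address. The struct and helper names (model_memslot,
model_private_index, model_shared_hva) and the 4K page size are
assumptions made for this sketch; only the field names and the index
arithmetic follow this series.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4K base pages */

/* Hypothetical userspace model of the extended memslot. */
struct model_memslot {
	uint64_t base_gfn;       /* first guest frame covered by the slot */
	uint64_t npages;         /* number of guest pages in the slot */
	uint64_t userspace_addr; /* hva backing the shared part */
	int private_fd;          /* fd backing the private part */
	uint64_t private_offset; /* byte offset into private_fd */
};

/* True if the gfn falls inside the slot. */
static bool model_gfn_in_slot(const struct model_memslot *slot, uint64_t gfn)
{
	return gfn >= slot->base_gfn && gfn - slot->base_gfn < slot->npages;
}

/* Page index into private_fd for a gfn that is currently private. */
static uint64_t model_private_index(const struct model_memslot *slot,
				    uint64_t gfn)
{
	return gfn - slot->base_gfn +
	       (slot->private_offset >> MODEL_PAGE_SHIFT);
}

/* Host virtual address for a gfn that is currently shared. */
static uint64_t model_shared_hva(const struct model_memslot *slot,
				 uint64_t gfn)
{
	return slot->userspace_addr +
	       ((gfn - slot->base_gfn) << MODEL_PAGE_SHIFT);
}
```

Both lookups are always computable for any gfn in the slot; which one is
actually used for a given page depends on whether that page is currently
private or shared.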
Introduces a new KVM_EXIT_MEMORY_FAULT exit to give userspace a chance
to make decisions on shared <-> private memory conversion. The exit can
be caused by an implicit conversion in the KVM page fault handler or an
explicit conversion from the guest OS.
Extends the existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
convert a guest page between private <-> shared. The data saved by these
ioctls is the source of truth for whether a guest page is private or
shared, and this information is used in the KVM page fault handler to
decide whether the private or the shared part of the memslot is visible
to the guest.
Test
====
Ran two kinds of tests:
- Selftests [4] from Vishal and VM boot tests in a non-TDX environment
The code is also available in this repo: https://github.com/chao-p/linux/tree/privmem-v8
- Functional tests in a TDX-capable environment
Tested the new functionality in a TDX environment. Code repos:
Linux: https://github.com/chao-p/linux/tree/privmem-v8-tdx
QEMU: https://github.com/chao-p/qemu/tree/privmem-v8
An example QEMU command line for TDX test:
-object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
-machine confidential-guest-support=tdx \
-object memory-backend-memfd-private,id=ram1,size=${mem} \
-machine memory-backend=ram1
TODO
====
- Page accounting and limiting for encrypted memory
- hugetlbfs support
Changelog
=========
v8:
- mm: redesign mm part by introducing a shim layer (inaccessible_memfd)
in memfd to avoid touching the memory file systems directly.
- mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and
causes confusion in this series; it will be sent out separately.
- doc: exclude the man page change, as it's not a kernel patch and will
be sent out separately.
- KVM: adapt to use the new mm inaccessible_memfd interface.
- KVM: update lpage_info when setting mem_attr_array to support
large page.
- KVM: change from xa_store_range to xa_store for mem_attr_array because
xa_store_range overrides all entries, which is not the intended
behavior for us.
- KVM: refine the mmu_invalidate_retry_gfn mechanism for private page.
- KVM: reorganize KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and private page
handling code as suggested by Sean.
v7:
- mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
- KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
private/shared info.
- KVM: use similar sync mechanism between zap/page fault paths as
mmu_notifier for memfile_notifier based invalidation.
v6:
- mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
feature consistency among all memfile_notifier users and get rid of
internal flags like SHM_F_INACCESSIBLE.
- mm: make pfn_ops callbacks members of memfile_backing_store
and then refer to it directly in memfile_notifier.
- mm: remove backing store unregister.
- mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
- KVM: reorganize patch sequence for page fault handling and private
memory enabling.
v5:
- Add man page for MFD_INACCESSIBLE flag and improve KVM API doc for
the new memslot extensions.
- mm: introduce memfile_{un}register_backing_store to allow a memory
backing store to register/unregister itself with memfile_notifier.
- mm: remove F_SEAL_INACCESSIBLE, use in-kernel flag
(SHM_F_INACCESSIBLE for shmem) instead.
- mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
MFD_INACCESSIBLE memory.
- KVM: remove the overlap check for mapping the same file+offset into
multiple gfns due to performance considerations; this is warned about
in the documentation.
v4:
- mm: rename memfd_ops to memfile_notifier and separate it from
memfd.c to standalone memfile-notifier.c.
- KVM: move pfn_ops to per-memslot scope from per-vm scope and allow
registering multiple memslots to the same memory backing store.
- KVM: add a 'kvm' reference in memslot so that we can recover kvm in
memfile_notifier handlers.
- KVM: add 'private_' prefix for the new fields in memslot.
- KVM: reshape the 'type' to 'flag' for kvm_memory_exit
v3:
- Remove 'RFC' prefix.
- Fix race condition between memfile_notifier handlers and kvm destroy.
- mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
setting F_SEAL_INACCESSIBLE when the fd is created.
- KVM: add the shared part of the memslot back to make private/shared
pages live in one memslot.
Reference
=========
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/[email protected]/T/
[3] Original design proposal:
https://lore.kernel.org/all/[email protected]/
[4] Selftest:
https://lore.kernel.org/all/[email protected]/
Chao Peng (7):
KVM: Extend the memslot to support fd-based private memory
KVM: Add KVM_EXIT_MEMORY_FAULT exit
KVM: Use gfn instead of hva for mmu_notifier_retry
KVM: Register/unregister the guest private memory regions
KVM: Update lpage info when private/shared memory are mixed
KVM: Handle page fault for private memory
KVM: Enable and expose KVM_MEM_PRIVATE
Kirill A. Shutemov (1):
mm/memfd: Introduce userspace inaccessible memfd
Documentation/virt/kvm/api.rst | 78 +++++++--
arch/x86/include/asm/kvm_host.h | 9 +
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu.h | 2 -
arch/x86/kvm/mmu/mmu.c | 175 +++++++++++++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 18 ++
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/x86.c | 4 +-
include/linux/kvm_host.h | 86 ++++++++--
include/linux/memfd.h | 24 +++
include/uapi/linux/kvm.h | 37 +++++
include/uapi/linux/magic.h | 1 +
include/uapi/linux/memfd.h | 1 +
mm/Makefile | 2 +-
mm/memfd.c | 25 ++-
mm/memfd_inaccessible.c | 219 +++++++++++++++++++++++++
virt/kvm/Kconfig | 3 +
virt/kvm/kvm_main.c | 282 +++++++++++++++++++++++++++++---
18 files changed, 912 insertions(+), 56 deletions(-)
create mode 100644 mm/memfd_inaccessible.c
base-commit: 372d07084593dc7a399bf9bee815711b1fb1bcf2
--
2.25.1
Currently, in the mmu_notifier invalidate path, the hva range is
recorded and then checked against by mmu_invalidate_retry_hva() in the
page fault path. However, for the soon-to-be-introduced private memory,
a page fault may not have an hva associated with it, so checking the gfn
(gpa) makes more sense.
For the existing non-private memory case, gfn is expected to continue to
work. The only downside is that when multiple gfns alias a single hva,
the current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.
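The retry check and the "larger range" effect can be sketched with a
small userspace C model (not kernel code). The model_* names are
hypothetical. The merging of concurrent ranges into a single bounding
range and the decrement of the in-progress count are assumptions about
the surrounding code that this diff does not show in full.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical userspace model of the per-VM invalidation state. */
struct model_kvm {
	unsigned long mmu_invalidate_seq;
	long mmu_invalidate_in_progress;
	gfn_t mmu_invalidate_range_start;
	gfn_t mmu_invalidate_range_end;
};

static void model_invalidate_begin(struct model_kvm *kvm, gfn_t start,
				   gfn_t end)
{
	kvm->mmu_invalidate_in_progress++;
	if (kvm->mmu_invalidate_in_progress == 1) {
		kvm->mmu_invalidate_range_start = start;
		kvm->mmu_invalidate_range_end = end;
	} else {
		/*
		 * Assumption: concurrent invalidations are folded into one
		 * bounding range, which may cover gfns that nobody is
		 * actually invalidating (false positives).
		 */
		if (start < kvm->mmu_invalidate_range_start)
			kvm->mmu_invalidate_range_start = start;
		if (end > kvm->mmu_invalidate_range_end)
			kvm->mmu_invalidate_range_end = end;
	}
}

static void model_invalidate_end(struct model_kvm *kvm)
{
	/* The sequence bump tells page faults that something changed. */
	kvm->mmu_invalidate_seq++;
	kvm->mmu_invalidate_in_progress--;
}

/* Returns true if a fault that sampled mmu_seq earlier must be retried. */
static bool model_invalidate_retry_gfn(const struct model_kvm *kvm,
				       unsigned long mmu_seq, gfn_t gfn)
{
	if (kvm->mmu_invalidate_in_progress &&
	    gfn >= kvm->mmu_invalidate_range_start &&
	    gfn < kvm->mmu_invalidate_range_end)
		return true;
	return kvm->mmu_invalidate_seq != mmu_seq;
}
```

In this model, two concurrent invalidations of [0x10, 0x20) and
[0x100, 0x110) also reject a fault on gfn 0x80, which is the kind of
over-rejection described above.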
This patch also fixes a bug in kvm_zap_gfn_range(), which already passes
gfns when calling kvm_mmu_invalidate_begin/end() even though these
functions accept hvas in the current code.
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
include/linux/kvm_host.h | 18 +++++++---------
virt/kvm/kvm_main.c | 45 ++++++++++++++++++++++++++--------------
3 files changed, 39 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e418ef3ecfcb..08abad4f3e6f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4203,7 +4203,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;
return fault->slot &&
- mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+ mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
}
static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index eac1787b899b..2125b50f6345 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -776,8 +776,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
- unsigned long mmu_invalidate_range_start;
- unsigned long mmu_invalidate_range_end;
+ gfn_t mmu_invalidate_range_start;
+ gfn_t mmu_invalidate_range_end;
#endif
struct list_head devices;
u64 manual_dirty_log_protect;
@@ -1366,10 +1366,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
#endif
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
- unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
- unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end);
long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
@@ -1938,9 +1936,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
return 0;
}
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
unsigned long mmu_seq,
- unsigned long hva)
+ gfn_t gfn)
{
lockdep_assert_held(&kvm->mmu_lock);
/*
@@ -1950,8 +1948,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
* positives, due to shortcuts when handing concurrent invalidations.
*/
if (unlikely(kvm->mmu_invalidate_in_progress) &&
- hva >= kvm->mmu_invalidate_range_start &&
- hva < kvm->mmu_invalidate_range_end)
+ gfn >= kvm->mmu_invalidate_range_start &&
+ gfn < kvm->mmu_invalidate_range_end)
return 1;
if (kvm->mmu_invalidate_seq != mmu_seq)
return 1;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 12dc0dc57b06..fa9dd2d2c001 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -540,8 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
- unsigned long end);
+typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
typedef void (*on_unlock_fn_t)(struct kvm *kvm);
@@ -628,7 +627,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
locked = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
- range->on_lock(kvm, range->start, range->end);
+ range->on_lock(kvm, gfn_range.start,
+ gfn_range.end);
if (IS_KVM_NULL_FN(range->handler))
break;
}
@@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
}
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
- unsigned long end)
+static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
+ gfn_t end)
{
- /*
- * The count increase must become visible at unlock time as no
- * spte can be established without taking the mmu_lock and
- * count is also read inside the mmu_lock critical section.
- */
- kvm->mmu_invalidate_in_progress++;
if (likely(kvm->mmu_invalidate_in_progress == 1)) {
kvm->mmu_invalidate_range_start = start;
kvm->mmu_invalidate_range_end = end;
@@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
}
}
+static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ /*
+ * The count increase must become visible at unlock time as no
+ * spte can be established without taking the mmu_lock and
+ * count is also read inside the mmu_lock critical section.
+ */
+ kvm->mmu_invalidate_in_progress++;
+}
+
+static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ update_invalidate_range(kvm, range->start, range->end);
+ return kvm_unmap_gfn_range(kvm, range);
+}
+
+void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ mark_invalidate_in_progress(kvm, start, end);
+ update_invalidate_range(kvm, start, end);
+}
+
static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
@@ -752,8 +768,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
.start = range->start,
.end = range->end,
.pte = __pte(0),
- .handler = kvm_unmap_gfn_range,
- .on_lock = kvm_mmu_invalidate_begin,
+ .handler = kvm_mmu_handle_gfn_range,
+ .on_lock = mark_invalidate_in_progress,
.on_unlock = kvm_arch_guest_memory_reclaimed,
.flush_on_ret = true,
.may_block = mmu_notifier_range_blockable(range),
@@ -791,8 +807,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
return 0;
}
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
- unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
{
/*
* This sequence increase will notify the kvm page fault that
--
2.25.1
When private and shared memory are mixed within a large page, the
lpage_info may not be accurate and should be updated with this mixed
info. A large page that contains mixed pages can't really be mapped as a
large page since its private and shared pages come from different
physical memory.
This patch updates lpage_info when the private/shared memory attribute
is changed. If both private and shared pages are within a large page
region, it can't be mapped as a large page. It's a bit challenging to
track the mixed info in a 'count'-like variable, so this patch instead
reserves a bit in disallow_lpage to indicate that a large page includes
mixed private/shared pages.
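The idea of sharing one word between a flag bit and a reference count
can be sketched in a standalone C model (not kernel code). The model_*
names are hypothetical, and the model uses an unsigned 32-bit field and
rejects out-of-range count updates instead of warning, which is a
simplification made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 31 flags a mixed large page, the low 31 bits hold the count. */
#define MODEL_LPAGE_MIXED     (1U << 31)
#define MODEL_LPAGE_COUNT_MAX ((1U << 31) - 1)

/* Hypothetical stand-in for struct kvm_lpage_info. */
struct model_lpage_info {
	uint32_t disallow_lpage;
};

/* Set or clear the mixed bit without touching the count. */
static void model_update_mixed(struct model_lpage_info *linfo, bool mixed)
{
	if (mixed)
		linfo->disallow_lpage |= MODEL_LPAGE_MIXED;
	else
		linfo->disallow_lpage &= ~MODEL_LPAGE_MIXED;
}

/*
 * Adjust the count by a signed delta without touching the mixed bit.
 * Returns false and leaves the field unchanged if the count would go
 * negative or spill into the mixed bit.
 */
static bool model_update_count(struct model_lpage_info *linfo, int count)
{
	int64_t cur = linfo->disallow_lpage & MODEL_LPAGE_COUNT_MAX;
	int64_t next = cur + count;

	if (next < 0 || next > (int64_t)MODEL_LPAGE_COUNT_MAX)
		return false;

	linfo->disallow_lpage =
		(linfo->disallow_lpage & MODEL_LPAGE_MIXED) | (uint32_t)next;
	return true;
}

/* A large page mapping is allowed only if neither part is set. */
static bool model_lpage_allowed(const struct model_lpage_info *linfo)
{
	return linfo->disallow_lpage == 0;
}
```

Because existing users only test whether the field is non-zero, setting
the mixed bit disallows the large page without any change on their side.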
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 +++
arch/x86/kvm/mmu/mmu.c | 119 +++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 2 +
include/linux/kvm_host.h | 17 +++++
virt/kvm/kvm_main.c | 11 ++-
5 files changed, 154 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cfad6ba1a70a..85119ed9527a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -38,6 +38,7 @@
#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
#define __KVM_HAVE_ZAP_GFN_RANGE
+#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
#define KVM_MAX_VCPUS 1024
@@ -945,6 +946,13 @@ struct kvm_vcpu_arch {
#endif
};
+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits will be used as a reference count for other users.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
+#define KVM_LPAGE_COUNT_MAX ((1U << 31) - 1)
+
struct kvm_lpage_info {
int disallow_lpage;
};
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 08abad4f3e6f..a0f198cede3d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -762,11 +762,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
{
struct kvm_lpage_info *linfo;
int i;
+ int disallow_count;
for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
+
+ disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+ WARN_ON(disallow_count + count < 0 ||
+ disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
linfo->disallow_lpage += count;
- WARN_ON(linfo->disallow_lpage < 0);
}
}
@@ -6894,3 +6899,115 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_lpage_recovery_thread)
kthread_stop(kvm->arch.nx_lpage_recovery_thread);
}
+
+static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+ XA_STATE(xas, &kvm->mem_attr_array, start);
+ gfn_t gfn = start;
+ void *entry;
+ bool shared, private;
+ bool mixed = false;
+
+ if (attr == KVM_MEM_ATTR_SHARED) {
+ shared = true;
+ private = false;
+ } else {
+ shared = false;
+ private = true;
+ }
+
+ rcu_read_lock();
+ entry = xas_load(&xas);
+ while (gfn < end) {
+ if (xas_retry(&xas, entry))
+ continue;
+
+ KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+ if (entry)
+ private = true;
+ else
+ shared = true;
+
+ if (private && shared) {
+ mixed = true;
+ goto out;
+ }
+
+ entry = xas_next(&xas);
+ gfn++;
+ }
+out:
+ rcu_read_unlock();
+ return mixed;
+}
+
+static inline void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
+{
+ if (mixed)
+ linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+ else
+ linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static void update_mem_lpage_info(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+ unsigned long lpage_start, lpage_end;
+ unsigned long gfn, pages, mask;
+ int level;
+
+ for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ pages = KVM_PAGES_PER_HPAGE(level);
+ mask = ~(pages - 1);
+ lpage_start = start & mask;
+ lpage_end = (end - 1) & mask;
+
+ /*
+ * We only need to scan the head and tail page, for middle pages
+ * we know they are not mixed.
+ */
+ update_mixed(lpage_info_slot(lpage_start, slot, level),
+ mem_attr_is_mixed(kvm, attr, lpage_start,
+ lpage_start + pages));
+
+ if (lpage_start == lpage_end)
+ return;
+
+ for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
+ update_mixed(lpage_info_slot(gfn, slot, level), false);
+
+ update_mixed(lpage_info_slot(lpage_end, slot, level),
+ mem_attr_is_mixed(kvm, attr, lpage_end,
+ lpage_end + pages));
+ }
+}
+
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ int i;
+
+ WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
+ "Unsupported mem attribute.\n");
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+ slot = iter.slot;
+ start = max(start, slot->base_gfn);
+ end = min(end, slot->base_gfn + slot->npages);
+ if (WARN_ON_ONCE(start >= end))
+ continue;
+
+ update_mem_lpage_info(kvm, slot, attr, start, end);
+ }
+ }
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 081f62ccc9a1..ef11cda6f13f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12321,6 +12321,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
linfo[lpages - 1].disallow_lpage = 1;
ugfn = slot->userspace_addr >> PAGE_SHIFT;
+ if (kvm_slot_can_be_private(slot))
+ ugfn |= slot->private_offset >> PAGE_SHIFT;
/*
* If the gfn and userspace address are not aligned wrt each
* other, disable large page support for this slot.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d65690cae80b..fd36ce6597ad 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2277,4 +2277,21 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+
+#define KVM_MEM_ATTR_SHARED 0x0001
+#define KVM_MEM_ATTR_PRIVATE 0x0002
+
+#ifdef __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end);
+#else
+static inline void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+}
+#endif
+
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de5cce8c82c7..97d893f7482c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -938,13 +938,13 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
-#define KVM_MEM_ATTR_SHARED 0x0001
static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
bool is_private)
{
gfn_t start, end;
unsigned long index;
void *entry;
+ int attr;
int r;
if (size == 0 || gpa + size < gpa)
@@ -959,7 +959,13 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
* Guest memory defaults to private, kvm->mem_attr_array only stores
* shared memory.
*/
- entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
+ if (is_private) {
+ attr = KVM_MEM_ATTR_PRIVATE;
+ entry = NULL;
+ } else {
+ attr = KVM_MEM_ATTR_SHARED;
+ entry = xa_mk_value(KVM_MEM_ATTR_SHARED);
+ }
for (index = start; index < end; index++) {
r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
@@ -969,6 +975,7 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
}
kvm_zap_gfn_range(kvm, start, end);
+ kvm_arch_update_mem_attr(kvm, attr, start, end);
return r;
err:
--
2.25.1
A memslot with KVM_MEM_PRIVATE set can include both fd-based private
memory and hva-based shared memory. Architecture code (like TDX code)
can tell whether the ongoing fault is private or not. This patch adds an
'is_private' field to kvm_page_fault to indicate this, and architecture
code is expected to set it.
To handle a page fault for such a memslot, the handling logic differs
depending on whether the fault is private or shared. KVM checks whether
'is_private' matches the host's view of the page (which is maintained in
mem_attr_array).
- For a successful match, the private pfn is obtained with
inaccessible_get_pfn() from the private fd and the shared pfn is
obtained with the existing get_user_pages().
- For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
userspace. Userspace can then convert the memory between
private/shared in the host's view and retry the access.
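The match/mismatch decision above can be sketched as a userspace C model
(not kernel code). All model_* names are hypothetical, a plain bool
array stands in for mem_attr_array, and the model assumes the slot can
be private. As in this series, an absent 'shared' entry means the page
is private.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_GFNS 64 /* assumption: tiny guest for illustration */

enum model_fault_action {
	MODEL_FAULT_PRIVATE_FD, /* take the pfn from the private fd */
	MODEL_FAULT_SHARED_HVA, /* take the pfn through the hva path */
	MODEL_FAULT_EXIT_USER,  /* exit to userspace with a memory fault */
};

/* Hypothetical stand-in for mem_attr_array: only shared pages are set. */
struct model_vm {
	bool shared[MODEL_NR_GFNS];
};

/* Host's view: a page is private unless it was recorded as shared. */
static bool model_mem_is_private(const struct model_vm *vm, uint64_t gfn)
{
	assert(gfn < MODEL_NR_GFNS);
	return !vm->shared[gfn];
}

/* Decide how to handle a fault given the architecture's is_private hint. */
static enum model_fault_action model_handle_fault(const struct model_vm *vm,
						  uint64_t gfn,
						  bool fault_is_private)
{
	if (fault_is_private != model_mem_is_private(vm, gfn))
		return MODEL_FAULT_EXIT_USER;

	return fault_is_private ? MODEL_FAULT_PRIVATE_FD
				: MODEL_FAULT_SHARED_HVA;
}

/* What userspace does after the exit: update the host's view. */
static void model_set_mem_attr(struct model_vm *vm, uint64_t gfn,
			       bool is_private)
{
	assert(gfn < MODEL_NR_GFNS);
	vm->shared[gfn] = !is_private;
}
```

In this model a shared access to a page the host still considers private
yields MODEL_FAULT_EXIT_USER; once userspace converts the page with
model_set_mem_attr(), retrying the same access takes the shared path.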
Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 54 ++++++++++++++++++++++++++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 18 +++++++++++
arch/x86/kvm/mmu/mmutrace.h | 1 +
include/linux/kvm_host.h | 24 +++++++++++++++
4 files changed, 96 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a0f198cede3d..81ab20003824 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3028,6 +3028,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
break;
}
+ if (kvm_mem_is_private(kvm, gfn))
+ return max_level;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
@@ -4127,6 +4130,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
}
+static inline u8 order_to_level(int order)
+{
+ BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+ return PG_LEVEL_1G;
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+ return PG_LEVEL_2M;
+
+ return PG_LEVEL_4K;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
+{
+ int order;
+ struct kvm_memory_slot *slot = fault->slot;
+
+ if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
+ return RET_PF_RETRY;
+
+ fault->max_level = min(order_to_level(order), fault->max_level);
+ fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+ return RET_PF_CONTINUE;
+}
+
static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
@@ -4159,6 +4188,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return RET_PF_EMULATE;
}
+ if (kvm_slot_can_be_private(slot) &&
+ fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+ vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+ if (fault->is_private)
+ vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+ else
+ vcpu->run->memory.flags = 0;
+ vcpu->run->memory.padding = 0;
+ vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+ vcpu->run->memory.size = PAGE_SIZE;
+ return RET_PF_USER;
+ }
+
+ if (fault->is_private)
+ return kvm_faultin_pfn_private(fault);
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
fault->write, &fault->map_writable,
@@ -4267,7 +4312,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+
+ if (fault->is_private)
+ kvm_private_mem_put_pfn(fault->slot, fault->pfn);
+ else
+ kvm_release_pfn_clean(fault->pfn);
return r;
}
@@ -5543,6 +5592,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
return -EIO;
}
+ if (r == RET_PF_USER)
+ return 0;
+
if (r < 0)
return r;
if (r != RET_PF_EMULATE)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 582def531d4d..a55e352246a7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -188,6 +188,7 @@ struct kvm_page_fault {
/* Derived from mmu and global state. */
const bool is_tdp;
+ const bool is_private;
const bool nx_huge_page_workaround_enabled;
/*
@@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
* RET_PF_RETRY: let CPU fault again on the address.
* RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
* RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
* RET_PF_FIXED: The faulting entry has been fixed.
* RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
*
@@ -252,6 +254,7 @@ enum {
RET_PF_RETRY,
RET_PF_EMULATE,
RET_PF_INVALID,
+ RET_PF_USER,
RET_PF_FIXED,
RET_PF_SPURIOUS,
};
@@ -318,4 +321,19 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+#ifndef CONFIG_HAVE_KVM_PRIVATE_MEM
+static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+ WARN_ON_ONCE(1);
+ return -EOPNOTSUPP;
+}
+
+static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
+ kvm_pfn_t pfn)
+{
+ WARN_ON_ONCE(1);
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
TRACE_DEFINE_ENUM(RET_PF_RETRY);
TRACE_DEFINE_ENUM(RET_PF_EMULATE);
TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
TRACE_DEFINE_ENUM(RET_PF_FIXED);
TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fd36ce6597ad..b9906cdf468b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2292,6 +2292,30 @@ static inline void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
}
#endif
+static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+ int ret;
+ pfn_t pfnt;
+ pgoff_t index = gfn - slot->base_gfn +
+ (slot->private_offset >> PAGE_SHIFT);
+
+ ret = inaccessible_get_pfn(slot->private_file, index, &pfnt, order);
+ *pfn = pfn_t_to_pfn(pfnt);
+ return ret;
+}
+
+static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
+ kvm_pfn_t pfn)
+{
+ inaccessible_put_pfn(slot->private_file, pfn_to_pfn_t(pfn));
+}
+
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+ return !xa_load(&kvm->mem_attr_array, gfn);
+}
+
#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
#endif
--
2.25.1
Expose KVM_MEM_PRIVATE and the memslot fields private_fd/offset to
userspace. KVM will register/unregister the private memslot with the
fd-based memory backing store and respond to invalidation events from
inaccessible_notifier by zapping the existing memory mappings in the
secondary page table.
Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
by architecture code, which can turn it on by overriding the default
kvm_arch_has_private_mem().
A 'kvm' reference is added to the memslot structure since in the
inaccessible_notifier callback we can only obtain a memslot reference,
but 'kvm' is needed to do the zapping.
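The invalidation callback has to translate a page-offset range in the
backing file into a gfn range clamped to the memslot before zapping. A
userspace C model of that translation (not kernel code) might look like
the sketch below. The model_* names and the 4K page size are
assumptions, and the early "no overlap" guard is an addition of this
model rather than part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4K base pages */

/* Hypothetical subset of the memslot fields used by the translation. */
struct model_slot {
	uint64_t base_gfn;
	uint64_t npages;
	uint64_t private_offset; /* byte offset into the private file */
};

/*
 * Translate the file page range [start, end) into the gfn range it
 * covers inside the slot. Returns false if nothing in the slot is hit.
 */
static bool model_pgoff_to_gfn_range(const struct model_slot *slot,
				     uint64_t start, uint64_t end,
				     uint64_t *start_gfn, uint64_t *end_gfn)
{
	uint64_t base_pgoff = slot->private_offset >> MODEL_PAGE_SHIFT;
	uint64_t s = slot->base_gfn;
	uint64_t e = slot->base_gfn + slot->npages;

	/* Model-only guard: ranges that miss the slot entirely are empty. */
	if (end <= base_pgoff || start >= base_pgoff + slot->npages)
		return false;

	/* Clamp the head and tail to the part of the file the slot maps. */
	if (start > base_pgoff)
		s = slot->base_gfn + start - base_pgoff;
	if (end < base_pgoff + slot->npages)
		e = slot->base_gfn + end - base_pgoff;

	if (s >= e)
		return false;

	*start_gfn = s;
	*end_gfn = e;
	return true;
}
```

With a slot at base_gfn 0x100 covering 16 pages and a private_offset of
two pages, invalidating file pages [4, 8) maps to gfns [0x102, 0x106),
while a range covering the whole file is clamped to the slot itself.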
Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 116 +++++++++++++++++++++++++++++++++++++--
2 files changed, 111 insertions(+), 6 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b9906cdf468b..cb4eefac709c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -589,6 +589,7 @@ struct kvm_memory_slot {
struct file *private_file;
loff_t private_offset;
struct inaccessible_notifier notifier;
+ struct kvm *kvm;
};
static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 97d893f7482c..87e239d35b96 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -983,6 +983,57 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
xa_erase(&kvm->mem_attr_array, index);
return r;
}
+
+static void kvm_private_notifier_invalidate(struct inaccessible_notifier *notifier,
+ pgoff_t start, pgoff_t end)
+{
+ struct kvm_memory_slot *slot = container_of(notifier,
+ struct kvm_memory_slot,
+ notifier);
+ unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT;
+ gfn_t start_gfn = slot->base_gfn;
+ gfn_t end_gfn = slot->base_gfn + slot->npages;
+
+
+ if (start > base_pgoff)
+ start_gfn = slot->base_gfn + start - base_pgoff;
+
+ if (end < base_pgoff + slot->npages)
+ end_gfn = slot->base_gfn + end - base_pgoff;
+
+ if (start_gfn >= end_gfn)
+ return;
+
+ kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn);
+}
+
+static struct inaccessible_notifier_ops kvm_private_notifier_ops = {
+ .invalidate = kvm_private_notifier_invalidate,
+};
+
+static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
+{
+ slot->notifier.ops = &kvm_private_notifier_ops;
+ inaccessible_register_notifier(slot->private_file, &slot->notifier);
+}
+
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
+{
+ inaccessible_unregister_notifier(slot->private_file, &slot->notifier);
+}
+
+#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
+
+static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
+{
+ WARN_ON_ONCE(1);
+}
+
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
+{
+ WARN_ON_ONCE(1);
+}
+
#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
@@ -1029,6 +1080,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
/* This does not remove the slot from struct kvm_memslots data structures */
static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
+ if (slot->flags & KVM_MEM_PRIVATE) {
+ kvm_private_mem_unregister(slot);
+ fput(slot->private_file);
+ }
+
kvm_destroy_dirty_bitmap(slot);
kvm_arch_free_memslot(kvm, slot);
@@ -1600,10 +1656,16 @@ bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
return false;
}
-static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+ const struct kvm_user_mem_region *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+ if (kvm_arch_has_private_mem(kvm))
+ valid_flags |= KVM_MEM_PRIVATE;
+#endif
+
#ifdef __KVM_HAVE_READONLY_MEM
valid_flags |= KVM_MEM_READONLY;
#endif
@@ -1679,6 +1741,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
{
int r;
+ if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+ kvm_private_mem_register(new);
+
/*
* If dirty logging is disabled, nullify the bitmap; the old bitmap
* will be freed on "commit". If logging is enabled in both old and
@@ -1707,6 +1772,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
kvm_destroy_dirty_bitmap(new);
+ if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+ kvm_private_mem_unregister(new);
+
return r;
}
@@ -2004,7 +2072,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int as_id, id;
int r;
- r = check_memory_region_flags(mem);
+ r = check_memory_region_flags(kvm, mem);
if (r)
return r;
@@ -2023,6 +2091,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
!access_ok((void __user *)(unsigned long)mem->userspace_addr,
mem->memory_size))
return -EINVAL;
+ if (mem->flags & KVM_MEM_PRIVATE &&
+ (mem->private_offset & (PAGE_SIZE - 1) ||
+ mem->private_offset > U64_MAX - mem->memory_size))
+ return -EINVAL;
if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
return -EINVAL;
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -2061,6 +2133,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
return -EINVAL;
} else { /* Modify an existing slot. */
+ /* Private memslots are immutable, they can only be deleted. */
+ if (mem->flags & KVM_MEM_PRIVATE)
+ return -EINVAL;
if ((mem->userspace_addr != old->userspace_addr) ||
(npages != old->npages) ||
((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2089,10 +2164,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
new->npages = npages;
new->flags = mem->flags;
new->userspace_addr = mem->userspace_addr;
+ if (mem->flags & KVM_MEM_PRIVATE) {
+ new->private_file = fget(mem->private_fd);
+ if (!new->private_file) {
+ r = -EINVAL;
+ goto out;
+ }
+ new->private_offset = mem->private_offset;
+ }
+
+ new->kvm = kvm;
r = kvm_set_memslot(kvm, old, new, change);
if (r)
- kfree(new);
+ goto out;
+
+ return 0;
+
+out:
+ if (new->private_file)
+ fput(new->private_file);
+ kfree(new);
return r;
}
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -4747,16 +4839,28 @@ static long kvm_vm_ioctl(struct file *filp,
}
case KVM_SET_USER_MEMORY_REGION: {
struct kvm_user_mem_region mem;
- unsigned long size = sizeof(struct kvm_userspace_memory_region);
+ unsigned int flags_offset = offsetof(typeof(mem), flags);
+ unsigned long size;
+ u32 flags;
kvm_sanity_check_user_mem_region_alias();
+ memset(&mem, 0, sizeof(mem));
+
r = -EFAULT;
- if (copy_from_user(&mem, argp, size))
+ if (get_user(flags, (u32 __user *)(argp + flags_offset)))
+ goto out;
+
+ if (flags & KVM_MEM_PRIVATE)
+ size = sizeof(struct kvm_userspace_memory_region_ext);
+ else
+ size = sizeof(struct kvm_userspace_memory_region);
+
+ if (copy_from_user(&mem, argp, size))
goto out;
r = -EINVAL;
- if (mem.flags & KVM_MEM_PRIVATE)
+ if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
goto out;
r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
--
2.25.1
If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister
guest private memory regions through the
KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctls. The patch reuses the existing
SEV ioctl numbers, but the address in the region is a gpa in the
KVM_PRIVATE_MEM case and an hva in the SEV case. Which usage the ioctls
serve is determined by the newly added kvm_arch_has_private_mem(). An
architecture that supports KVM_PRIVATE_MEM should override this function.
The current implementation defaults all memory to private. Only the
shared memory regions are stored, in an xarray, for memory efficiency.
Zapping the existing memory mappings is a side effect of these two
ioctls when they are defined.
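The "default private, store only shared" bookkeeping can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel code: a small bitmap stands in for the xarray, and SKETCH_PAGE_SHIFT, SKETCH_MAX_GFN, region_to_gfn_range(), gfn_is_private() and set_range_attr() are hypothetical names. The validation mirrors the checks in kvm_vm_ioctl_set_mem_attr() (non-zero size, no wraparound, page alignment).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_MAX_GFN    1024 /* toy guest size, an assumption of this sketch */

/* One bit per gfn; a set bit means "shared". All zero, so all private. */
static unsigned char shared_bits[SKETCH_MAX_GFN / 8];

/* Validate [gpa, gpa + size) and convert it to a half-open gfn range. */
static int region_to_gfn_range(uint64_t gpa, uint64_t size,
			       uint64_t *start, uint64_t *end)
{
	assert(start && end);
	if (size == 0 || gpa + size < gpa)
		return -EINVAL;
	if ((gpa | size) & (SKETCH_PAGE_SIZE - 1))
		return -EINVAL;

	*start = gpa >> SKETCH_PAGE_SHIFT;
	*end = (gpa + size) >> SKETCH_PAGE_SHIFT;
	return 0;
}

/* Anything without a "shared" entry is treated as private. */
static bool gfn_is_private(uint64_t gfn)
{
	if (gfn >= SKETCH_MAX_GFN)
		return true;
	return !(shared_bits[gfn / 8] & (1u << (gfn % 8)));
}

/* Mark a region private (clear entries) or shared (set entries). */
static int set_range_attr(uint64_t gpa, uint64_t size, bool is_private)
{
	uint64_t start, end, gfn;
	int r = region_to_gfn_range(gpa, size, &start, &end);

	if (r)
		return r;
	if (end > SKETCH_MAX_GFN)
		return -EINVAL;

	for (gfn = start; gfn < end; gfn++) {
		if (is_private)
			shared_bits[gfn / 8] &= (unsigned char)~(1u << (gfn % 8));
		else
			shared_bits[gfn / 8] |= (unsigned char)(1u << (gfn % 8));
	}
	return 0;
}
```

A bitmap is used only to keep the sketch short; a sparse structure such as the xarray avoids paying for the private majority of guest memory.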
Signed-off-by: Chao Peng <[email protected]>
---
Documentation/virt/kvm/api.rst | 17 ++++++--
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.h | 2 -
include/linux/kvm_host.h | 13 ++++++
virt/kvm/kvm_main.c | 73 +++++++++++++++++++++++++++++++++
5 files changed, 100 insertions(+), 6 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1a6c003b2a0b..c0f800d04ffc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4715,10 +4715,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
This ioctl can be used to register a guest memory region which may
contain encrypted data (e.g. guest RAM, SMRAM etc).
-It is used in the SEV-enabled guest. When encryption is enabled, a guest
-memory region may contain encrypted data. The SEV memory encryption
-engine uses a tweak such that two identical plaintext pages, each at
-different locations will have differing ciphertexts. So swapping or
+Currently this ioctl supports registering memory regions for two usages:
+private memory and SEV-encrypted memory.
+
+When private memory is enabled, this ioctl is used to register a guest private
+memory region, and the addr/size of kvm_enc_region represents a guest physical
+address (GPA) range. In this usage, this ioctl zaps the existing guest memory
+mappings in KVM that fall into the region.
+
+When SEV-encrypted memory is enabled, this ioctl is used to register guest
+memory region which may contain encrypted data for a SEV-enabled guest. The
+addr/size of kvm_enc_region represents userspace address (HVA). The SEV
+memory encryption engine uses a tweak such that two identical plaintext pages,
+each at different locations will have differing ciphertexts. So swapping or
moving ciphertext of those pages will not result in plaintext being
swapped. So relocating (or migrating) physical backing pages for the SEV
guest will require some additional steps.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c96c43c313a..cfad6ba1a70a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
#include <asm/hyperv-tlfs.h>
#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ZAP_GFN_RANGE
#define KVM_MAX_VCPUS 1024
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6bdaacb6faa0..c94b620bf94b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -211,8 +211,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
return -(u32)fault & errcode;
}
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2125b50f6345..d65690cae80b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
#endif
+#ifdef __KVM_HAVE_ZAP_GFN_RANGE
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+#else
+static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start,
+ gfn_t gfn_end)
+{
+}
+#endif
+
enum {
OUTSIDE_GUEST_MODE,
IN_GUEST_MODE,
@@ -795,6 +804,9 @@ struct kvm {
struct notifier_block pm_notifier;
#endif
char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+ struct xarray mem_attr_array;
+#endif
};
#define kvm_err(fmt, ...) \
@@ -1454,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
int kvm_arch_post_init_vm(struct kvm *kvm);
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_has_private_mem(struct kvm *kvm);
#ifndef __KVM_HAVE_ARCH_VM_ALLOC
/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa9dd2d2c001..de5cce8c82c7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -937,6 +937,47 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+#define KVM_MEM_ATTR_SHARED 0x0001
+static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
+ bool is_private)
+{
+ gfn_t start, end;
+ unsigned long index;
+ void *entry;
+ int r;
+
+ if (size == 0 || gpa + size < gpa)
+ return -EINVAL;
+ if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
+ return -EINVAL;
+
+ start = gpa >> PAGE_SHIFT;
+ end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
+
+ /*
+ * Guest memory defaults to private, kvm->mem_attr_array only stores
+ * shared memory.
+ */
+ entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
+
+ for (index = start; index < end; index++) {
+ r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
+ GFP_KERNEL_ACCOUNT));
+ if (r)
+ goto err;
+ }
+
+ kvm_zap_gfn_range(kvm, start, end);
+
+ return r;
+err:
+ for (; index > start; index--)
+ xa_erase(&kvm->mem_attr_array, index);
+ return r;
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
static int kvm_pm_notifier_call(struct notifier_block *bl,
unsigned long state,
@@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
spin_lock_init(&kvm->mn_invalidate_lock);
rcuwait_init(&kvm->mn_memslots_update_rcuwait);
xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+ xa_init(&kvm->mem_attr_array);
+#endif
INIT_LIST_HEAD(&kvm->gpc_list);
spin_lock_init(&kvm->gpc_lock);
@@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+ xa_destroy(&kvm->mem_attr_array);
+#endif
cleanup_srcu_struct(&kvm->irq_srcu);
cleanup_srcu_struct(&kvm->srcu);
kvm_arch_free_vm(kvm);
@@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}
+bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
+{
+ return false;
+}
+
static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
break;
}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+ case KVM_MEMORY_ENCRYPT_REG_REGION:
+ case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+ struct kvm_enc_region region;
+ bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
+
+ if (!kvm_arch_has_private_mem(kvm))
+ goto arch_vm_ioctl;
+
+ r = -EFAULT;
+ if (copy_from_user(&region, argp, sizeof(region)))
+ goto out;
+
+ r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
+ region.size, set);
+ break;
+ }
+#endif
case KVM_GET_DIRTY_LOG: {
struct kvm_dirty_log log;
@@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_get_stats_fd(kvm);
break;
default:
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+arch_vm_ioctl:
+#endif
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
}
out:
--
2.25.1
This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error happened in KVM at the guest memory range
[gpa, gpa+size). The flags field carries additional information that
userspace needs to handle the error. Currently bit 0 is defined as
'private memory': '1' indicates the error was caused by a private memory
access and '0' indicates it was caused by a shared memory access.
When private memory is enabled, KVM uses this new exit to hand
shared <-> private memory conversions to userspace in memory encryption
usage. In such usage there are typically two kinds of memory
conversion:
- explicit conversion: happens when guest explicitly calls into KVM
to map a range (as private or shared), KVM then exits to userspace
to do the map/unmap operations.
- implicit conversion: happens in KVM page fault handler where KVM
exits to userspace for an implicit conversion when the page is in a
different state than requested (private or shared).
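As an illustration of how userspace might consume this exit, the sketch below classifies a fault from its flags and range. It is one plausible policy, not the behaviour of any particular VMM: struct memory_fault_exit is a local copy of the layout added to struct kvm_run in this patch, and SKETCH_MEMORY_EXIT_FLAG_PRIVATE, enum fault_action and classify_memory_fault() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of bit 0 as defined by this patch. */
#define SKETCH_MEMORY_EXIT_FLAG_PRIVATE (UINT32_C(1) << 0)

/* Local copy of the 'memory' member this patch adds to struct kvm_run. */
struct memory_fault_exit {
	uint32_t flags;
	uint32_t padding;
	uint64_t gpa;
	uint64_t size;
};

enum fault_action {
	FAULT_CONVERT_TO_SHARED,
	FAULT_CONVERT_TO_PRIVATE,
	FAULT_REJECT,
};

/*
 * Decide what to do with a memory fault exit. A fault flagged as a
 * private access is answered by converting the range to private, and
 * a shared access by converting it to shared (assumed policy).
 */
static enum fault_action
classify_memory_fault(const struct memory_fault_exit *exit_info)
{
	assert(exit_info);

	/* Unknown flag bits: do not guess at their meaning. */
	if (exit_info->flags & ~SKETCH_MEMORY_EXIT_FLAG_PRIVATE)
		return FAULT_REJECT;

	/* Empty or wrapping ranges cannot be converted. */
	if (exit_info->size == 0 ||
	    exit_info->gpa + exit_info->size < exit_info->gpa)
		return FAULT_REJECT;

	if (exit_info->flags & SKETCH_MEMORY_EXIT_FLAG_PRIVATE)
		return FAULT_CONVERT_TO_PRIVATE;
	return FAULT_CONVERT_TO_SHARED;
}
```

A real VMM would follow the decision with the register/unregister ioctls from the previous patch and then re-enter the vCPU so that KVM retries the access.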
Suggested-by: Sean Christopherson <[email protected]>
Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
include/uapi/linux/kvm.h | 9 +++++++++
2 files changed, 32 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c1fac1e9f820..1a6c003b2a0b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6638,6 +6638,29 @@ array field represents return values. The userspace should update the return
values of SBI call before resuming the VCPU. For more details on RISC-V SBI
spec refer, https://github.com/riscv/riscv-sbi-doc.
+::
+
+ /* KVM_EXIT_MEMORY_FAULT */
+ struct {
+ #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
+ __u32 flags;
+ __u32 padding;
+ __u64 gpa;
+ __u64 size;
+ } memory;
+
+If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
+encountered a memory error which is not handled by KVM kernel module and
+userspace may choose to handle it. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
+ private memory access when the bit is set otherwise the memory error is
+ caused by shared memory access when the bit is clear.
+
+'gpa' and 'size' indicate the memory range the error occurs at. The userspace
+may handle the error and return to KVM to retry the previous memory access.
+
::
/* KVM_EXIT_NOTIFY */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3ef462fb3b2a..0c8db7b7c138 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -300,6 +300,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_RISCV_SBI 35
#define KVM_EXIT_RISCV_CSR 36
#define KVM_EXIT_NOTIFY 37
+#define KVM_EXIT_MEMORY_FAULT 38
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -538,6 +539,14 @@ struct kvm_run {
#define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
__u32 flags;
} notify;
+ /* KVM_EXIT_MEMORY_FAULT */
+ struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
+ __u32 flags;
+ __u32 padding;
+ __u64 gpa;
+ __u64 size;
+ } memory;
/* Fix the size of the union. */
char padding[256];
};
--
2.25.1
From: "Kirill A. Shutemov" <[email protected]>
KVM can use memfd-provided memory for guest memory. For normal userspace
accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
virtual address space and then tells KVM to use the virtual address to
setup the mapping in the secondary page table (e.g. EPT).
With confidential computing technologies like Intel TDX, the
memfd-provided memory may be encrypted with a special key for a special
software domain (e.g. KVM guest) and is not expected to be directly
accessed by userspace. More precisely, userspace access to such encrypted
memory may lead to a host crash, so it should be prevented.
This patch introduces userspace inaccessible memfd (created with
MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
ordinary MMU access (e.g. read/write/mmap) but can be accessed via an
in-kernel interface, so KVM can directly interact with core-mm without
the need to map the memory into KVM userspace.
It provides the semantics required for KVM guest private (encrypted)
memory support: a file descriptor with this flag set is going to be used
as the source of guest memory in confidential computing environments such
as Intel TDX/AMD SEV.
KVM userspace is still in charge of the lifecycle of the memfd. It
should pass the opened fd to KVM. KVM uses the kernel APIs newly added
in this patch to obtain the physical memory address and then populate
the secondary page table entries.
The userspace inaccessible memfd can be fallocate-ed and hole-punched
from userspace. When hole-punching happens, KVM is notified through
inaccessible_notifier and then gets a chance to remove any mapped entries
of the range in the secondary page tables.
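The notifier pattern described here can be sketched in plain C as below. This is a single-threaded illustration under assumptions: the kernel code uses struct list_head and a mutex, while the sketch uses a singly linked list and no locking. All names (struct range_notifier, struct backing_store, struct recorder, store_register() and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Recover the enclosing object from a pointer to one of its members. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct range_notifier {
	struct range_notifier *next;
	void (*invalidate)(struct range_notifier *n,
			   unsigned long start, unsigned long end);
};

/* Stand-in for the per-file data that owns the notifier list. */
struct backing_store {
	struct range_notifier *head;
};

static void store_register(struct backing_store *s, struct range_notifier *n)
{
	assert(s && n && n->invalidate);
	n->next = s->head;
	s->head = n;
}

static void store_unregister(struct backing_store *s, struct range_notifier *n)
{
	struct range_notifier **pp;

	/* Walk the links so the head needs no special case. */
	for (pp = &s->head; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			n->next = NULL;
			return;
		}
	}
}

/* Called when a range goes away, e.g. after a hole punch. */
static void store_invalidate(struct backing_store *s,
			     unsigned long start, unsigned long end)
{
	struct range_notifier *n;

	for (n = s->head; n; n = n->next)
		n->invalidate(n, start, end);
}

/* Example consumer: embeds the notifier and records what it was told. */
struct recorder {
	struct range_notifier notifier;
	unsigned long start, end;
	int calls;
};

static void record_invalidate(struct range_notifier *n,
			      unsigned long start, unsigned long end)
{
	struct recorder *r = sketch_container_of(n, struct recorder, notifier);

	r->start = start;
	r->end = end;
	r->calls++;
}
```

Embedding the notifier in the consumer's own structure is what lets the callback find its context without a separate lookup, which is the same reason the KVM patch embeds the notifier in the memslot.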
The userspace inaccessible memfd itself is implemented as a shim layer
on top of real memory file systems like tmpfs/hugetlbfs, but this patch
only implements tmpfs. The allocated memory is currently marked as
unmovable and unevictable; this is required for current confidential
usage, but might change in the future.
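Because only tmpfs is implemented, the memfd_create() changes in this patch reject MFD_INACCESSIBLE together with MFD_HUGETLB, and they also reject it together with MFD_ALLOW_SEALING. The sketch below restates those checks as a standalone function. The SKETCH_MFD_* constants copy the values from the patch and check_memfd_flags() is a hypothetical name; the huge page size encoding bits that the real syscall accepts with MFD_HUGETLB are ignored here.

```c
#include <assert.h>
#include <errno.h>

/* Values as defined in include/uapi/linux/memfd.h by this patch. */
#define SKETCH_MFD_CLOEXEC       0x0001U
#define SKETCH_MFD_ALLOW_SEALING 0x0002U
#define SKETCH_MFD_HUGETLB       0x0004U
#define SKETCH_MFD_INACCESSIBLE  0x0008U

#define SKETCH_MFD_ALL_FLAGS (SKETCH_MFD_CLOEXEC | SKETCH_MFD_ALLOW_SEALING | \
			      SKETCH_MFD_HUGETLB | SKETCH_MFD_INACCESSIBLE)

/* Return 0 if the flag combination is accepted, -EINVAL otherwise. */
static int check_memfd_flags(unsigned int flags)
{
	if (flags & ~SKETCH_MFD_ALL_FLAGS)
		return -EINVAL;

	/* Sealing is disallowed on an inaccessible memfd. */
	if ((flags & SKETCH_MFD_INACCESSIBLE) &&
	    (flags & SKETCH_MFD_ALLOW_SEALING))
		return -EINVAL;

	/* hugetlb backing is not implemented yet. */
	if ((flags & SKETCH_MFD_INACCESSIBLE) &&
	    (flags & SKETCH_MFD_HUGETLB))
		return -EINVAL;

	return 0;
}

/* Usage: combinations the checks accept and reject. */
static void check_memfd_flags_examples(void)
{
	assert(check_memfd_flags(SKETCH_MFD_INACCESSIBLE |
				 SKETCH_MFD_CLOEXEC) == 0);
	assert(check_memfd_flags(SKETCH_MFD_INACCESSIBLE |
				 SKETCH_MFD_HUGETLB) == -EINVAL);
}
```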
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/memfd.h | 24 ++++
include/uapi/linux/magic.h | 1 +
include/uapi/linux/memfd.h | 1 +
mm/Makefile | 2 +-
mm/memfd.c | 25 ++++-
mm/memfd_inaccessible.c | 219 +++++++++++++++++++++++++++++++++++++
6 files changed, 270 insertions(+), 2 deletions(-)
create mode 100644 mm/memfd_inaccessible.c
diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..334ddff08377 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
#define __LINUX_MEMFD_H
#include <linux/file.h>
+#include <linux/pfn_t.h>
#ifdef CONFIG_MEMFD_CREATE
extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
}
#endif
+struct inaccessible_notifier;
+
+struct inaccessible_notifier_ops {
+ void (*invalidate)(struct inaccessible_notifier *notifier,
+ pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+ struct list_head list;
+ const struct inaccessible_notifier_ops *ops;
+};
+
+void inaccessible_register_notifier(struct file *file,
+ struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+ struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+ int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
#endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
#define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
+#define INACCESSIBLE_MAGIC 0x494e4143 /* "INAC" */
#endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
#define MFD_CLOEXEC 0x0001U
#define MFD_ALLOW_SEALING 0x0002U
#define MFD_HUGETLB 0x0004U
+#define MFD_INACCESSIBLE 0x0008U
/*
* Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..f82e5d4b4388 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
obj-$(CONFIG_ZONE_DEVICE) += memremap.o
obj-$(CONFIG_HMM_MIRROR) += hmm.o
-obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..1853a90f49ff 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+ MFD_INACCESSIBLE)
SYSCALL_DEFINE2(memfd_create,
const char __user *, uname,
@@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
return -EINVAL;
}
+ /* Disallow sealing when MFD_INACCESSIBLE is set. */
+ if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
+ return -EINVAL;
+
+ /* TODO: add hugetlb support */
+ if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
+ return -EINVAL;
+
/* length includes terminating zero */
len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
if (len <= 0)
@@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
*file_seals &= ~F_SEAL_SEAL;
}
+ if (flags & MFD_INACCESSIBLE) {
+ struct file *inaccessible_file;
+
+ inaccessible_file = memfd_mkinaccessible(file);
+ if (IS_ERR(inaccessible_file)) {
+ error = PTR_ERR(inaccessible_file);
+ goto err_file;
+ }
+
+ file = inaccessible_file;
+ }
+
fd_install(fd, file);
kfree(name);
return fd;
+err_file:
+ fput(file);
err_fd:
put_unused_fd(fd);
err_name:
diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
new file mode 100644
index 000000000000..2d33cbdd9282
--- /dev/null
+++ b/mm/memfd_inaccessible.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "linux/sbitmap.h"
+#include <linux/memfd.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+
+struct inaccessible_data {
+ struct mutex lock;
+ struct file *memfd;
+ struct list_head notifiers;
+};
+
+static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
+ pgoff_t start, pgoff_t end)
+{
+ struct inaccessible_notifier *notifier;
+
+ mutex_lock(&data->lock);
+ list_for_each_entry(notifier, &data->notifiers, list) {
+ notifier->ops->invalidate(notifier, start, end);
+ }
+ mutex_unlock(&data->lock);
+}
+
+static int inaccessible_release(struct inode *inode, struct file *file)
+{
+ struct inaccessible_data *data = inode->i_mapping->private_data;
+
+ fput(data->memfd);
+ kfree(data);
+ return 0;
+}
+
+static long inaccessible_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len)
+{
+ struct inaccessible_data *data = file->f_mapping->private_data;
+ struct file *memfd = data->memfd;
+ int ret;
+
+ if (mode & FALLOC_FL_PUNCH_HOLE) {
+ if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+ return -EINVAL;
+ }
+
+ ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+ inaccessible_notifier_invalidate(data, offset, offset + len);
+ return ret;
+}
+
+static const struct file_operations inaccessible_fops = {
+ .release = inaccessible_release,
+ .fallocate = inaccessible_fallocate,
+};
+
+static int inaccessible_getattr(struct user_namespace *mnt_userns,
+ const struct path *path, struct kstat *stat,
+ u32 request_mask, unsigned int query_flags)
+{
+ struct inode *inode = d_inode(path->dentry);
+ struct inaccessible_data *data = inode->i_mapping->private_data;
+ struct file *memfd = data->memfd;
+
+ return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+ request_mask, query_flags);
+}
+
+static int inaccessible_setattr(struct user_namespace *mnt_userns,
+ struct dentry *dentry, struct iattr *attr)
+{
+ struct inode *inode = d_inode(dentry);
+ struct inaccessible_data *data = inode->i_mapping->private_data;
+ struct file *memfd = data->memfd;
+ int ret;
+
+ if (attr->ia_valid & ATTR_SIZE) {
+ if (memfd->f_inode->i_size)
+ return -EPERM;
+
+ if (!PAGE_ALIGNED(attr->ia_size))
+ return -EINVAL;
+ }
+
+ ret = memfd->f_inode->i_op->setattr(mnt_userns,
+ file_dentry(memfd), attr);
+ return ret;
+}
+
+static const struct inode_operations inaccessible_iops = {
+ .getattr = inaccessible_getattr,
+ .setattr = inaccessible_setattr,
+};
+
+static int inaccessible_init_fs_context(struct fs_context *fc)
+{
+ if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
+ return -ENOMEM;
+
+ fc->s_iflags |= SB_I_NOEXEC;
+ return 0;
+}
+
+static struct file_system_type inaccessible_fs = {
+ .owner = THIS_MODULE,
+ .name = "[inaccessible]",
+ .init_fs_context = inaccessible_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static struct vfsmount *inaccessible_mnt;
+
+static __init int inaccessible_init(void)
+{
+ inaccessible_mnt = kern_mount(&inaccessible_fs);
+ if (IS_ERR(inaccessible_mnt))
+ return PTR_ERR(inaccessible_mnt);
+ return 0;
+}
+fs_initcall(inaccessible_init);
+
+struct file *memfd_mkinaccessible(struct file *memfd)
+{
+ struct inaccessible_data *data;
+ struct address_space *mapping;
+ struct inode *inode;
+ struct file *file;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (!data)
+ return ERR_PTR(-ENOMEM);
+
+ data->memfd = memfd;
+ mutex_init(&data->lock);
+ INIT_LIST_HEAD(&data->notifiers);
+
+ inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
+ if (IS_ERR(inode)) {
+ kfree(data);
+ return ERR_CAST(inode);
+ }
+
+ inode->i_mode |= S_IFREG;
+ inode->i_op = &inaccessible_iops;
+ inode->i_mapping->private_data = data;
+
+ file = alloc_file_pseudo(inode, inaccessible_mnt,
+ "[memfd:inaccessible]", O_RDWR,
+ &inaccessible_fops);
+ if (IS_ERR(file)) {
+ iput(inode);
+ kfree(data);
+ }
+
+ file->f_flags |= O_LARGEFILE;
+
+ mapping = memfd->f_mapping;
+ mapping_set_unevictable(mapping);
+ mapping_set_gfp_mask(mapping,
+ mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+ return file;
+}
+
+void inaccessible_register_notifier(struct file *file,
+ struct inaccessible_notifier *notifier)
+{
+ struct inaccessible_data *data = file->f_mapping->private_data;
+
+ mutex_lock(&data->lock);
+ list_add(&notifier->list, &data->notifiers);
+ mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
+
+void inaccessible_unregister_notifier(struct file *file,
+ struct inaccessible_notifier *notifier)
+{
+ struct inaccessible_data *data = file->f_mapping->private_data;
+
+ mutex_lock(&data->lock);
+ list_del(&notifier->list);
+ mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+ int *order)
+{
+ struct inaccessible_data *data = file->f_mapping->private_data;
+ struct file *memfd = data->memfd;
+ struct page *page;
+ int ret;
+
+ ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+ if (ret)
+ return ret;
+
+ *pfn = page_to_pfn_t(page);
+ *order = thp_order(compound_head(page));
+ SetPageUptodate(page);
+ unlock_page(page);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+ struct page *page = pfn_t_to_page(pfn);
+
+ if (WARN_ON_ONCE(!page))
+ return;
+
+ put_page(page);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
--
2.25.1
On Thu, Sep 15, 2022 at 10:29:08PM +0800, Chao Peng wrote:
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> + private memory access when the bit is set otherwise the memory error is
> + caused by shared memory access when the bit is clear.
s/set otherwise/set. Otherwise,
Thanks.
--
An old man doll... just what I always wanted! - Clara
On Fri, Sep 16, 2022 at 04:17:48PM +0700, Bagas Sanjaya wrote:
> On Thu, Sep 15, 2022 at 10:29:08PM +0800, Chao Peng wrote:
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > + private memory access when the bit is set otherwise the memory error is
> > + caused by shared memory access when the bit is clear.
>
> s/set otherwise/set. Otherwise,
Thanks.
>
> Thanks.
>
> --
> An old man doll... just what I always wanted! - Clara
On 15.09.22 16:29, Chao Peng wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
>
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.
Initially my thought was that this whole inaccessible thing is TDX
specific and there is no need to force that on other mechanisms. That's
why I suggested to not expose this to user space but handle the notifier
requirements internally.
IIUC now, protected KVM has similar demands. Either access (read/write)
of guest RAM would result in a fault and possibly crash the hypervisor
(at least not the whole machine IIUC).
>
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
With secretmem we decided to not add such "concept switch" flags and
instead use a dedicated syscall.
What about memfd_inaccessible()? Especially, sealing and hugetlb are not
even supported and it might take a while to support either.
>
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
>
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
>
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier it then gets chance to remove any mapped entries
> of the range in the secondary page tables.
>
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implemented tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/linux/memfd.h | 24 ++++
> include/uapi/linux/magic.h | 1 +
> include/uapi/linux/memfd.h | 1 +
> mm/Makefile | 2 +-
> mm/memfd.c | 25 ++++-
> mm/memfd_inaccessible.c | 219 +++++++++++++++++++++++++++++++++++++
> 6 files changed, 270 insertions(+), 2 deletions(-)
> create mode 100644 mm/memfd_inaccessible.c
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..334ddff08377 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -3,6 +3,7 @@
> #define __LINUX_MEMFD_H
>
> #include <linux/file.h>
> +#include <linux/pfn_t.h>
>
> #ifdef CONFIG_MEMFD_CREATE
> extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
> }
> #endif
>
> +struct inaccessible_notifier;
> +
> +struct inaccessible_notifier_ops {
> + void (*invalidate)(struct inaccessible_notifier *notifier,
> + pgoff_t start, pgoff_t end);
> +};
> +
> +struct inaccessible_notifier {
> + struct list_head list;
> + const struct inaccessible_notifier_ops *ops;
> +};
> +
> +void inaccessible_register_notifier(struct file *file,
> + struct inaccessible_notifier *notifier);
> +void inaccessible_unregister_notifier(struct file *file,
> + struct inaccessible_notifier *notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> + int *order);
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd);
> +
> #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..9d066be3d7e8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
> #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
> #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
> +#define INACCESSIBLE_MAGIC 0x494e4143 /* "INAC" */
[...]
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> + int *order)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> + struct page *page;
> + int ret;
> +
> + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> + if (ret)
> + return ret;
> +
> + *pfn = page_to_pfn_t(page);
> + *order = thp_order(compound_head(page));
> + SetPageUptodate(page);
> + unlock_page(page);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> + struct page *page = pfn_t_to_page(pfn);
> +
> + if (WARN_ON_ONCE(!page))
> + return;
> +
> + put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
Sorry, I missed your reply regarding get/put interface.
https://lore.kernel.org/linux-mm/[email protected]/
"We have a design assumption that somedays this can even support
non-page based backing stores."
As long as there is no such user in sight (especially since getting the
memfd to even allocate such memory would require bigger changes), I
prefer to keep it simple here and work on pages/folios. No need to
over-complicate it for now.
--
Thanks,
David / dhildenb
+Will, Marc and Fuad (apologies if I missed other pKVM folks)
On Mon, Sep 19, 2022, David Hildenbrand wrote:
> On 15.09.22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> >
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
>
> Initially my thought was that this whole inaccessible thing is TDX specific
> and there is no need to force that on other mechanisms. That's why I
> suggested to not expose this to user space but handle the notifier
> requirements internally.
>
> IIUC now, protected KVM has similar demands. Either access (read/write) of
> guest RAM would result in a fault and possibly crash the hypervisor (at
> least not the whole machine IIUC).
Yep. The missing piece for pKVM is the ability to convert from shared to private
while preserving the contents, e.g. to hand off a large buffer (hundreds of MiB)
for processing in the protected VM. Thoughts on this at the bottom.
> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
>
> With secretmem we decided to not add such "concept switch" flags and instead
> use a dedicated syscall.
>
I have no personal preference whatsoever between a flag and a dedicated syscall,
but a dedicated syscall does seem like it would give the kernel a bit more
flexibility.
> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> even supported and it might take a while to support either.
Don't know about sealing, but hugetlb support for "inaccessible" memory needs to
come sooner rather than later. "inaccessible" in quotes because we might want to
choose a less binary name, e.g. "restricted"?
Regarding pKVM's use case, with the shim approach I believe this can be done by
allowing userspace to mmap() the "hidden" memfd, but with a ton of restrictions
piled on top.
My first thought was to make the uAPI a set of KVM ioctls so that KVM could
tightly control usage without taking on too much complexity in the kernel, but
working through things, routing the behavior through the shim itself might not be
all that horrific.
IIRC, we discarded the idea of allowing userspace to map the "private" fd because
things got too complex, but with the shim it doesn't seem _that_ bad.
E.g. on the memfd side:
1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
mapping is all or nothing.
2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
the restricted memfd.
3. Add notifier hooks to allow downstream users to further restrict things.
4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
one shot.
5. Require that there are no outstanding references at munmap(). Or if this
can't be guaranteed by userspace, maybe add some way for userspace to wait
until it's ok to convert to private? E.g. so that get_pfn() doesn't need
to do an expensive check every time.
static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* The mapping must cover the whole file, starting at offset 0. */
	if (vma->vm_pgoff)
		return -EINVAL;
	if ((vma->vm_end - vma->vm_start) != <file size>)
		return -EINVAL;

	mutex_lock(&data->lock);
	/* At most one mapping is allowed at any time. */
	if (data->has_mapping) {
		r = -EINVAL;
		goto err;
	}
	list_for_each_entry(notifier, &data->notifiers, list) {
		r = notifier->ops->mmap_start(notifier, ...);
		if (r)
			goto abort;
	}
	/* Every notifier accepted the mapping, so complete it for all of them. */
	list_for_each_entry(notifier, &data->notifiers, list)
		notifier->ops->mmap_end(notifier, ...);
	data->has_mapping = true;
	mutex_unlock(&data->lock);
	return 0;
abort:
	/* Unwind only the notifiers whose mmap_start() already succeeded. */
	list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
		notifier->ops->mmap_abort(notifier, ...);
err:
	mutex_unlock(&data->lock);
	return r;
}

static void memfd_restricted_close(struct vm_area_struct *vma)
{
	mutex_lock(...);
	/*
	 * Destroy the memfd and disable all future accesses if there are
	 * outstanding refcounts (or other unsatisfied restrictions?).
	 */
	if (<outstanding references> || ???)
		memfd_restricted_destroy(...);
	else
		data->has_mapping = false;
	mutex_unlock(...);
}

static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
{
	return -EINVAL;
}

static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
{
	return -EINVAL;
}
Then on the KVM side, its mmap_start() + mmap_end() sequence would:
1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
memory into the guest (after pre-boot phase).
2. Be mutually exclusive with shared<=>private conversions, and is allowed if
and only if the entire gfn range of the associated memslot is shared.
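To make the two KVM-side rules above a bit more concrete, here is a rough userspace model in C. Everything in it (the vm_type enum, struct memslot_model, model_mmap_start()) is hypothetical and only illustrates the checks; it is not actual KVM code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical VM types; only the distinction matters for this sketch. */
enum vm_type { VM_TYPE_PKVM, VM_TYPE_TDX, VM_TYPE_SEV_SNP };

/* Hypothetical memslot model: one flag per guest page, true == private. */
struct memslot_model {
	const bool *gfn_private;
	size_t npages;
};

/* Rule 2: mapping is allowed only if every gfn in the memslot is shared. */
bool memslot_all_shared(const struct memslot_model *slot)
{
	for (size_t i = 0; i < slot->npages; i++)
		if (slot->gfn_private[i])
			return false;
	return true;
}

/* Returns 0 if mmap_start() may proceed, -1 otherwise. */
int model_mmap_start(enum vm_type type, const struct memslot_model *slot)
{
	/* Rule 1: TDX and SEV-SNP can't take non-zero memory after pre-boot. */
	if (type == VM_TYPE_TDX || type == VM_TYPE_SEV_SNP)
		return -1;
	return memslot_all_shared(slot) ? 0 : -1;
}
```

A single private gfn anywhere in the memslot is enough to reject the mapping, which is what makes it mutually exclusive with conversions.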
(please excuse any formatting disasters. my internet went out as I was composing this, and i did my best to rescue it.)
On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
>> On 15.09.22 16:29, Chao Peng wrote:
>> > From: "Kirill A. Shutemov" <[email protected]>
>> >
>> > KVM can use memfd-provided memory for guest memory. For normal userspace
>> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
>> > virtual address space and then tells KVM to use the virtual address to
>> > setup the mapping in the secondary page table (e.g. EPT).
>> >
>> > With confidential computing technologies like Intel TDX, the
>> > memfd-provided memory may be encrypted with special key for special
>> > software domain (e.g. KVM guest) and is not expected to be directly
>> > accessed by userspace. Precisely, userspace access to such encrypted
>> > memory may lead to host crash so it should be prevented.
>>
>> Initially my thought was that this whole inaccessible thing is TDX specific
>> and there is no need to force that on other mechanisms. That's why I
>> suggested to not expose this to user space but handle the notifier
>> requirements internally.
>>
>> IIUC now, protected KVM has similar demands. Either access (read/write) of
>> guest RAM would result in a fault and possibly crash the hypervisor (at
>> least not the whole machine IIUC).
>
> Yep. The missing piece for pKVM is the ability to convert from shared to
> private while preserving the contents, e.g. to hand off a large buffer
> (hundreds of MiB) for processing in the protected VM. Thoughts on this at
> the bottom.
>
>> > This patch introduces userspace inaccessible memfd (created with
>> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
>> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
>> > in-kernel interface so KVM can directly interact with core-mm without
>> > the need to map the memory into KVM userspace.
>>
>> With secretmem we decided to not add such "concept switch" flags and instead
>> use a dedicated syscall.
>>
>
> I have no personal preference whatsoever between a flag and a dedicated syscall,
> but a dedicated syscall does seem like it would give the kernel a bit more
> flexibility.
The third option is a device node, e.g. /dev/kvm_secretmem or /dev/kvm_tdxmem or similar. But if we need flags or other details in the future, maybe this isn't ideal.
>
>> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
>> even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory
> needs to come sooner than later. "inaccessible" in quotes because we might
> want to choose a less binary name, e.g. "restricted"?.
>
> Regarding pKVM's use case, with the shim approach I believe this can be done by
> allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM
> could tightly control usage without taking on too much complexity in the
> kernel, but working through things, routing the behavior through the shim
> itself might not be all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" fd
> because things got too complex, but with the shim it doesn't seem _that_ bad.
What's the exact use case? Is it just to pre-populate the memory?
>
> E.g. on the memfd side:
>
> 1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> mapping is all or nothing.
>
> 2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> the restricted memfd.
>
> 3. Add notifier hooks to allow downstream users to further restrict things.
>
> 4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> one shot.
>
> 5. Require that there are no outstanding references at munmap(). Or if this
> can't be guaranteed by userspace, maybe add some way for userspace to wait
> until it's ok to convert to private? E.g. so that get_pfn() doesn't need
> to do an expensive check every time.
Hmm. I haven't looked at the code to see if this would really work, but I think this could be done more in line with how the rest of the kernel works by using the rmap infrastructure. When the pKVM memfd is in not-yet-private mode, just let it be mmapped as usual (but don't allow any form of GUP or pinning). Then have an ioctl to switch to shared mode that takes locks or sets flags so that no new faults can be serviced and does unmap_mapping_range.
As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.
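A very rough userspace model of that mode switch, in C. This assumes the switch goes from the mmappable (not-yet-private) state to the private state, and all names (struct shim_model, shim_model_fault(), shim_model_make_private()) are made up for illustration. The counter only stands in for what unmap_mapping_range() would do in the kernel.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-file state for the shim. */
struct shim_model {
	bool is_private;	/* set once by the mode-switch ioctl */
	int nr_mapped;		/* pages currently mapped into userspace */
};

/* Fault path: refuse to install new userspace mappings once private. */
int shim_model_fault(struct shim_model *s)
{
	if (s->is_private)
		return -EFAULT;
	s->nr_mapped++;
	return 0;
}

/* Mode-switch ioctl: block new faults first, then drop existing mappings. */
int shim_model_make_private(struct shim_model *s)
{
	if (s->is_private)
		return -EINVAL;
	s->is_private = true;	/* stands in for "takes locks or sets flags" */
	s->nr_mapped = 0;	/* stands in for unmap_mapping_range() */
	return 0;
}
```

The ordering is the point of the sketch: faults are blocked before the existing mappings are torn down, so nothing can be re-mapped in between.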
On Thursday, September 22, 2022 5:11 AM, Andy Lutomirski wrote:
> Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible
> memfd
>
> (please excuse any formatting disasters. my internet went out as I was
> composing this, and i did my best to rescue it.)
>
> On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> > +Will, Marc and Fuad (apologies if I missed other pKVM folks)
> >
> > On Mon, Sep 19, 2022, David Hildenbrand wrote:
> >> On 15.09.22 16:29, Chao Peng wrote:
> >> > From: "Kirill A. Shutemov" <[email protected]>
> >> >
> >> > KVM can use memfd-provided memory for guest memory. For normal
> >> > userspace accessible memory, KVM userspace (e.g. QEMU) mmaps the
> >> > memfd into its virtual address space and then tells KVM to use the
> >> > virtual address to setup the mapping in the secondary page table
> >> > (e.g. EPT).
> >> >
> >> > With confidential computing technologies like Intel TDX, the
> >> > memfd-provided memory may be encrypted with special key for special
> >> > software domain (e.g. KVM guest) and is not expected to be directly
> >> > accessed by userspace. Precisely, userspace access to such
> >> > encrypted memory may lead to host crash so it should be prevented.
> >>
> >> Initially my thought was that this whole inaccessible thing is TDX
> >> specific and there is no need to force that on other mechanisms.
> >> That's why I suggested to not expose this to user space but handle
> >> the notifier requirements internally.
> >>
> >> IIUC now, protected KVM has similar demands. Either access
> >> (read/write) of guest RAM would result in a fault and possibly crash
> >> the hypervisor (at least not the whole machine IIUC).
> >
> > Yep. The missing piece for pKVM is the ability to convert from shared
> > to private while preserving the contents, e.g. to hand off a large
> > buffer (hundreds of MiB) for processing in the protected VM. Thoughts
> > on this at the bottom.
> >
> >> > This patch introduces userspace inaccessible memfd (created with
> >> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace
> >> > through ordinary MMU access (e.g. read/write/mmap) but can be
> >> > accessed via in-kernel interface so KVM can directly interact with
> >> > core-mm without the need to map the memory into KVM userspace.
> >>
> >> With secretmem we decided to not add such "concept switch" flags and
> >> instead use a dedicated syscall.
> >>
> >
> > I have no personal preference whatsoever between a flag and a
> > dedicated syscall, but a dedicated syscall does seem like it would
> > give the kernel a bit more flexibility.
>
> The third option is a device node, e.g. /dev/kvm_secretmem or
> /dev/kvm_tdxmem or similar. But if we need flags or other details in the
> future, maybe this isn't ideal.
>
> >
> >> What about memfd_inaccessible()? Especially, sealing and hugetlb are
> >> not even supported and it might take a while to support either.
> >
> > Don't know about sealing, but hugetlb support for "inaccessible"
> > memory needs to come sooner than later. "inaccessible" in quotes
> > because we might want to choose a less binary name, e.g.
> > "restricted"?.
> >
> > Regarding pKVM's use case, with the shim approach I believe this can
> > be done by allowing userspace mmap() the "hidden" memfd, but with a
> > ton of restrictions piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly control usage without taking on too much
> > complexity in the kernel, but working through things, routing the
> > behavior through the shim itself might not be all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because things got too complex, but with the shim it doesn't seem
> > _that_ bad.
>
> What's the exact use case? Is it just to pre-populate the memory?
To add one more use case here: for TDX live migration support, on the destination
side we map the private fd during migration to store the encrypted private memory
data sent from the source. At the end of migration, we unmap it and make it
inaccessible before resuming the TD.
On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> + int *order)
Would it be better to remove "order" from this interface?
Some callers only need the pfn and shouldn't have to bother with defining and
passing something they don't use. Callers that do need the "order" can easily
get it via thp_order(pfn_to_page(pfn)) on their own.
On Thu, Sep 22, 2022, Wang, Wei W wrote:
> On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > + int *order)
>
> Better to remove "order" from this interface?
Hard 'no'.
> Some callers only need to get pfn, and no need to bother with
> defining and inputting something unused. For callers who need the "order",
> can easily get it via thp_order(pfn_to_page(pfn)) on their own.
That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
struct page is a transparent huge page. That might be true for the current
implementation, but it most certainly will not always be true.
KVM originally did things like this, where there was dedicated code for THP vs.
HugeTLB, and it was a mess. The goal here is very much to avoid repeating those
mistakes. Have the backing store _tell_ KVM how big the mapping is, don't force
KVM to rediscover the info on its own.
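As an illustration of "the backing store tells KVM", here is a small userspace sketch in C. The struct backing_page type, model_get_pfn() and model_mapping_size() are hypothetical names, and 4 KiB base pages are assumed. The consumer never looks at a struct page and simply trusts the reported order.

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumes 4 KiB base pages */

/* Hypothetical backing-store page descriptor. */
struct backing_page {
	unsigned long pfn;
	int order;	/* log2 of the number of base pages in the mapping */
};

/* The backing store reports both the pfn and the mapping order. */
int model_get_pfn(const struct backing_page *bp, unsigned long *pfn, int *order)
{
	if (!bp)
		return -1;
	*pfn = bp->pfn;
	*order = bp->order;
	return 0;
}

/* The consumer derives the mapping size without inspecting struct page. */
size_t model_mapping_size(int order)
{
	return (size_t)1 << (MODEL_PAGE_SHIFT + order);
}
```

Whether the order came from THP, HugeTLB, or something with no struct page at all is invisible to the caller, which is the property being argued for.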
On Thu, Sep 22, 2022 at 07:49:18PM +0000, Sean Christopherson wrote:
> On Thu, Sep 22, 2022, Wang, Wei W wrote:
> > On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > + int *order)
> >
> > Better to remove "order" from this interface?
>
> Hard 'no'.
>
> > Some callers only need to get pfn, and no need to bother with
> > defining and inputting something unused. For callers who need the "order",
> > can easily get it via thp_order(pfn_to_page(pfn)) on their own.
>
> That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
> struct page is a transparent huge page. That might be true for the current
> implementation, but it most certainly will not always be true.
>
> KVM originally did things like this, where there was dedicated code for THP vs.
> HugeTLB, and it was a mess. The goal here is very much to avoid repeating those
> mistakes. Have the backing store _tell_ KVM how big the mapping is, don't force
> KVM to rediscover the info on its own.
I guess we can allow the order pointer to be NULL to cover callers that don't
need to know the order. Is it useful?
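Roughly what I have in mind, as a userspace sketch in C. The struct lookup_result type and model_get_pfn_opt() are hypothetical stand-ins, not the actual inaccessible_get_pfn() implementation.

```c
#include <stddef.h>

/* Hypothetical stand-in for the backing store lookup result. */
struct lookup_result {
	unsigned long pfn;
	int order;
};

/* order may be NULL when the caller does not care about the mapping size. */
int model_get_pfn_opt(const struct lookup_result *res, unsigned long *pfn,
		      int *order)
{
	if (!res || !pfn)
		return -1;
	*pfn = res->pfn;
	if (order)
		*order = res->order;
	return 0;
}
```

Callers that want the order still get it from the backing store; callers that don't simply pass NULL.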
--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..9d066be3d7e8 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> > #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
> > #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> > #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
> > +#define INACCESSIBLE_MAGIC 0x494e4143 /* "INAC" */
>
>
> [...]
>
> > +
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > + int *order)
> > +{
> > + struct inaccessible_data *data = file->f_mapping->private_data;
> > + struct file *memfd = data->memfd;
> > + struct page *page;
> > + int ret;
> > +
> > + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > + if (ret)
> > + return ret;
> > +
> > + *pfn = page_to_pfn_t(page);
> > + *order = thp_order(compound_head(page));
> > + SetPageUptodate(page);
> > + unlock_page(page);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > +
> > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > +{
> > + struct page *page = pfn_t_to_page(pfn);
> > +
> > + if (WARN_ON_ONCE(!page))
> > + return;
> > +
> > + put_page(page);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
>
> Sorry, I missed your reply regarding get/put interface.
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> "We have a design assumption that somedays this can even support non-page
> based backing stores."
>
> As long as there is no such user in sight (especially how to get the memfd
> from even allocating such memory which will require bigger changes), I
> prefer to keep it simple here and work on pages/folios. No need to
> over-complicate it for now.
Sean, Paolo, what is your take on this? Do you have a concrete use case for a
pageless backend for the mechanism in sight? Maybe DAX?
--
Kiryl Shutsemau / Kirill A. Shutemov
Hi,
On Fri, Sep 23, 2022 at 1:53 AM Kirill A . Shutemov
<[email protected]> wrote:
>
> On Thu, Sep 22, 2022 at 07:49:18PM +0000, Sean Christopherson wrote:
> > On Thu, Sep 22, 2022, Wang, Wei W wrote:
> > > On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > > + int *order)
> > >
> > > Better to remove "order" from this interface?
> >
> > Hard 'no'.
> >
> > > Some callers only need to get pfn, and no need to bother with
> > > defining and inputting something unused. For callers who need the "order",
> > > can easily get it via thp_order(pfn_to_page(pfn)) on their own.
> >
> > That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
> > struct page is a transparent huge page. That might be true for the current
> > implementation, but it most certainly will not always be true.
> >
> > KVM originally did things like this, where there was dedicated code for THP vs.
> > HugeTLB, and it was a mess. The goal here is very much to avoid repeating those
> > mistakes. Have the backing store _tell_ KVM how big the mapping is, don't force
> > KVM to rediscover the info on its own.
>
> I guess we can allow order pointer to be NULL to cover caller that don't
> need to know the order. Is it useful?
I think that would be useful. In pKVM we don't need to know the order,
and I had to use a dummy variable when porting V7.
Cheers,
/fuad
> --
> Kiryl Shutsemau / Kirill A. Shutemov
Hi,
On Mon, Sep 19, 2022 at 8:10 PM Sean Christopherson <[email protected]> wrote:
>
> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
> > On 15.09.22 16:29, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" <[email protected]>
> > >
> > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > virtual address space and then tells KVM to use the virtual address to
> > > setup the mapping in the secondary page table (e.g. EPT).
> > >
> > > With confidential computing technologies like Intel TDX, the
> > > memfd-provided memory may be encrypted with special key for special
> > > software domain (e.g. KVM guest) and is not expected to be directly
> > > accessed by userspace. Precisely, userspace access to such encrypted
> > > memory may lead to host crash so it should be prevented.
> >
> > Initially my thought was that this whole inaccessible thing is TDX specific
> > and there is no need to force that on other mechanisms. That's why I
> > suggested to not expose this to user space but handle the notifier
> > requirements internally.
> >
> > IIUC now, protected KVM has similar demands. Either access (read/write) of
> > guest RAM would result in a fault and possibly crash the hypervisor (at
> > least not the whole machine IIUC).
>
> Yep. The missing piece for pKVM is the ability to convert from shared to private
> while preserving the contents, e.g. to hand off a large buffer (hundreds of MiB)
> for processing in the protected VM. Thoughts on this at the bottom.
Just wanted to mention that for pKVM (arm64), this wouldn't crash the
hypervisor. A userspace access would crash the userspace process since
the hypervisor would inject a fault back. Because of that making it
inaccessible from userspace is good to have, but not really vital for
pKVM. What is important for pKVM is that the guest private memory is
not GUP'able by the host. This is because if it were, it might be
possible for a malicious userspace process (e.g., a malicious vmm) to
trick the host kernel into accessing guest private memory in a context
where it isn’t prepared to handle the fault injected by the
hypervisor. This of course might crash the host.
> > > This patch introduces userspace inaccessible memfd (created with
> > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > in-kernel interface so KVM can directly interact with core-mm without
> > > the need to map the memory into KVM userspace.
> >
> > With secretmem we decided to not add such "concept switch" flags and instead
> > use a dedicated syscall.
> >
>
> I have no personal preference whatsoever between a flag and a dedicated syscall,
> but a dedicated syscall does seem like it would give the kernel a bit more
> flexibility.
>
> > What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> > even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory needs to
> come sooner than later. "inaccessible" in quotes because we might want to choose
> a less binary name, e.g. "restricted"?.
>
> Regarding pKVM's use case, with the shim approach I believe this can be done by
> allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM could
> tightly control usage without taking on too much complexity in the kernel, but
> working through things, routing the behavior through the shim itself might not be
> all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> things got too complex, but with the shim it doesn't seem _that_ bad.
>
> E.g. on the memfd side:
>
> 1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> mapping is all or nothing.
>
> 2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> the restricted memfd.
>
> 3. Add notifier hooks to allow downstream users to further restrict things.
>
> 4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> one shot.
>
> 5. Require that there are no outstanding references at munmap(). Or if this
> can't be guaranteed by userspace, maybe add some way for userspace to wait
> until it's ok to convert to private? E.g. so that get_pfn() doesn't need
> to do an expensive check every time.
>
> static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> {
> if (vma->vm_pgoff)
> return -EINVAL;
>
> if ((vma->vm_end - vma->vm_start) != <file size>)
> return -EINVAL;
>
> mutex_lock(&data->lock);
>
> if (data->has_mapping) {
> r = -EINVAL;
> goto err;
> }
> list_for_each_entry(notifier, &data->notifiers, list) {
> r = notifier->ops->mmap_start(notifier, ...);
> if (r)
> goto abort;
> }
>
> notifier->ops->mmap_end(notifier, ...);
> mutex_unlock(&data->lock);
> return 0;
>
> abort:
> list_for_each_entry_continue_reverse(notifier &data->notifiers, list)
> notifier->ops->mmap_abort(notifier, ...);
> err:
> mutex_unlock(&data->lock);
> return r;
> }
>
> static void memfd_restricted_close(struct vm_area_struct *vma)
> {
> mutex_lock(...);
>
> /*
> * Destroy the memfd and disable all future accesses if there are
> * outstanding refcounts (or other unsatisfied restrictions?).
> */
> if (<outstanding references> || ???)
> memfd_restricted_destroy(...);
> else
> data->has_mapping = false;
>
> mutex_unlock(...);
> }
>
> static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> {
> return -EINVAL;
> }
>
> static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> {
> return -EINVAL;
> }
>
> Then on the KVM side, its mmap_start() + mmap_end() sequence would:
>
> 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> memory into the guest (after pre-boot phase).
>
> 2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> and only if the entire gfn range of the associated memslot is shared.
In general I think that this would work with pKVM. However, limiting
private<->shared conversions to the granularity of a whole memslot
might be difficult to handle in pKVM, since the guest doesn't have the
concept of memslots. For example, in pKVM right now, when a guest
shares back its restricted DMA pool with the host it does so at the
page-level. pKVM would also need a way to make an fd accessible again
when shared back, which I think isn't possible with this patch.
You were initially considering a KVM ioctl for mapping, which might be
better suited for this since KVM knows which pages are shared and
which ones are private. So routing things through KVM might simplify
things and allow it to enforce all the necessary restrictions (e.g.,
private memory cannot be mapped). What do you think?
Thanks,
/fuad
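To make the memfd-side rules quoted above easier to reason about, here is a minimal userspace model of rules 1, 2 and 5. It is a sketch, not kernel code: `struct restricted_memfd` and the `model_*` functions are hypothetical names, the notifier hooks (rule 3) and VMA split rejection (rule 4) are left out, and returning -EINVAL once the file is destroyed is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-file state in the quoted sketch. */
struct restricted_memfd {
	size_t size;        /* file size in bytes */
	bool has_mapping;   /* rule 1: at most one mapping */
	unsigned int refs;  /* references handed out by get_pfn() */
	bool destroyed;     /* set when close finds outstanding references */
};

/* Rule 1: only offset 0 and the full file size are accepted, and only once. */
static int model_mmap(struct restricted_memfd *m, size_t pgoff, size_t len)
{
	if (m->destroyed)
		return -EINVAL;
	if (pgoff != 0 || len != m->size)
		return -EINVAL;
	if (m->has_mapping)
		return -EINVAL;
	m->has_mapping = true;
	return 0;
}

/* Rule 2: no new references while a userspace mapping exists. */
static int model_get_pfn(struct restricted_memfd *m)
{
	if (m->destroyed || m->has_mapping)
		return -EINVAL;
	m->refs++;
	return 0;
}

static void model_put_pfn(struct restricted_memfd *m)
{
	if (m->refs)
		m->refs--;
}

/* Rule 5: closing the mapping with references outstanding disables the file. */
static void model_close(struct restricted_memfd *m)
{
	if (m->refs)
		m->destroyed = true;
	else
		m->has_mapping = false;
}
```

In this model a partial mmap and a second mmap both fail, get_pfn fails while the mapping exists, and a close with a reference still held leaves the file permanently unusable.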
Hi,
<...>
> > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly control usage without taking on too much complexity in
> > the kernel, but working through things, routing the behavior through
> > the shim itself might not be all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because things got too complex, but with the shim it doesn't seem
> > _that_ bad.
>
> What's the exact use case? Is it just to pre-populate the memory?
Prepopulate memory and access memory that could go back and forth from
being shared to being private.
Cheers,
/fuad
> >
> > E.g. on the memfd side:
> >
> > 1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> > mapping is all or nothing.
> >
> > 2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> > the restricted memfd.
> >
> > 3. Add notifier hooks to allow downstream users to further restrict things.
> >
> > 4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> > one shot.
> >
> > 5. Require that there are no outstanding references at munmap(). Or if this
> > can't be guaranteed by userspace, maybe add some way for userspace to wait
> > until it's ok to convert to private? E.g. so that get_pfn() doesn't need
> > to do an expensive check every time.
>
> Hmm. I haven't looked at the code to see if this would really work, but I
> think this could be done more in line with how the rest of the kernel works
> by using the rmap infrastructure. When the pKVM memfd is in not-yet-private
> mode, just let it be mmapped as usual (but don't allow any form of GUP or
> pinning). Then have an ioctl to switch to shared mode that takes locks or
> sets flags so that no new faults can be serviced and does
> unmap_mapping_range.
>
> As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.
Hi Chao,
On Thu, Sep 15, 2022 at 3:38 PM Chao Peng <[email protected]> wrote:
>
> If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the
> guest private memory regions through KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> ioctls. The patch reuses the existing SEV ioctl numbers, but differs in
> that the address in the region is a gpa in the KVM_PRIVATE_MEM case and
> an hva in the SEV case. Which usage the ioctls serve is determined by the
> newly added kvm_arch_has_private_mem(). An architecture that supports
> KVM_PRIVATE_MEM should override this function.
>
> The current implementation defaults all memory to private. The shared
> memory regions are stored in an xarray for memory efficiency. Zapping
> existing memory mappings is also a side effect of these two ioctls when
> defined.
>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 17 ++++++--
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu.h | 2 -
> include/linux/kvm_host.h | 13 ++++++
> virt/kvm/kvm_main.c | 73 +++++++++++++++++++++++++++++++++
> 5 files changed, 100 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 1a6c003b2a0b..c0f800d04ffc 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> This ioctl can be used to register a guest memory region which may
> contain encrypted data (e.g. guest RAM, SMRAM etc).
>
> -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> -memory region may contain encrypted data. The SEV memory encryption
> -engine uses a tweak such that two identical plaintext pages, each at
> -different locations will have differing ciphertexts. So swapping or
> +Currently this ioctl supports registering memory regions for two usages:
> +private memory and SEV-encrypted memory.
> +
> +When private memory is enabled, this ioctl is used to register guest private
> +memory region and the addr/size of kvm_enc_region represents guest physical
> +address (GPA). In this usage, this ioctl zaps the existing guest memory
> +mappings in KVM that fallen into the region.
> +
> +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> +memory region which may contain encrypted data for a SEV-enabled guest. The
> +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> +memory encryption engine uses a tweak such that two identical plaintext pages,
> +each at different locations will have differing ciphertexts. So swapping or
> moving ciphertext of those pages will not result in plaintext being
> swapped. So relocating (or migrating) physical backing pages for the SEV
> guest will require some additional steps.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2c96c43c313a..cfad6ba1a70a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -37,6 +37,7 @@
> #include <asm/hyperv-tlfs.h>
>
> #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ZAP_GFN_RANGE
>
> #define KVM_MAX_VCPUS 1024
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 6bdaacb6faa0..c94b620bf94b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -211,8 +211,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> return -(u32)fault & errcode;
> }
>
> -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> -
> int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>
> int kvm_mmu_post_init_vm(struct kvm *kvm);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2125b50f6345..d65690cae80b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> #endif
>
> +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> +#else
> +static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start
> + gfn_t gfn_end)
Missing a comma after gfn_start.
Cheers,
/fuad
> +{
> +}
> +#endif
> +
> enum {
> OUTSIDE_GUEST_MODE,
> IN_GUEST_MODE,
> @@ -795,6 +804,9 @@ struct kvm {
> struct notifier_block pm_notifier;
> #endif
> char stats_id[KVM_STATS_NAME_SIZE];
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> + struct xarray mem_attr_array;
> +#endif
> };
>
> #define kvm_err(fmt, ...) \
> @@ -1454,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
> #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fa9dd2d2c001..de5cce8c82c7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -937,6 +937,47 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
> #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +#define KVM_MEM_ATTR_SHARED 0x0001
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> + bool is_private)
> +{
> + gfn_t start, end;
> + unsigned long index;
> + void *entry;
> + int r;
> +
> + if (size == 0 || gpa + size < gpa)
> + return -EINVAL;
> + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> + return -EINVAL;
> +
> + start = gpa >> PAGE_SHIFT;
> + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + /*
> + * Guest memory defaults to private, kvm->mem_attr_array only stores
> + * shared memory.
> + */
> + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> + for (index = start; index < end; index++) {
> + r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
> + GFP_KERNEL_ACCOUNT));
> + if (r)
> + goto err;
> + }
> +
> + kvm_zap_gfn_range(kvm, start, end);
> +
> + return r;
> +err:
> + for (; index > start; index--)
> + xa_erase(&kvm->mem_attr_array, index);
> + return r;
> +}
> +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> static int kvm_pm_notifier_call(struct notifier_block *bl,
> unsigned long state,
> @@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> + xa_init(&kvm->mem_attr_array);
> +#endif
>
> INIT_LIST_HEAD(&kvm->gpc_list);
> spin_lock_init(&kvm->gpc_lock);
> @@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> + xa_destroy(&kvm->mem_attr_array);
> +#endif
> cleanup_srcu_struct(&kvm->irq_srcu);
> cleanup_srcu_struct(&kvm->srcu);
> kvm_arch_free_vm(kvm);
> @@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> }
> }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> + return false;
> +}
> +
> static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> @@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> break;
> }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> + case KVM_MEMORY_ENCRYPT_REG_REGION:
> + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> + struct kvm_enc_region region;
> + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> + if (!kvm_arch_has_private_mem(kvm))
> + goto arch_vm_ioctl;
> +
> + r = -EFAULT;
> + if (copy_from_user(&region, argp, sizeof(region)))
> + goto out;
> +
> + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> + region.size, set);
> + break;
> + }
> +#endif
> case KVM_GET_DIRTY_LOG: {
> struct kvm_dirty_log log;
>
> @@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_get_stats_fd(kvm);
> break;
> default:
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +arch_vm_ioctl:
> +#endif
> r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> }
> out:
> --
> 2.25.1
>
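The range handling at the top of kvm_vm_ioctl_set_mem_attr() in the patch quoted above can be tried out in isolation. The following is a userspace sketch under stated assumptions: 4 KiB pages, a hypothetical helper name `gpa_range_to_gfns`, and a boolean result in place of -EINVAL. Because gpa and size are already known to be page aligned when the end is computed, the sketch uses `(gpa + size) >> PAGE_SHIFT`, which should give the same value as the patch's `(gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT` for aligned input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel takes these from its own headers. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

typedef uint64_t gpa_t;
typedef uint64_t gfn_t;

/*
 * Convert the byte range [gpa, gpa + size) into the half-open gfn range
 * [*start, *end). Returns false for the inputs the patch rejects:
 * an empty range, a range that wraps around, or a misaligned gpa/size.
 */
static bool gpa_range_to_gfns(gpa_t gpa, gpa_t size, gfn_t *start, gfn_t *end)
{
	if (size == 0 || gpa + size < gpa)
		return false;
	if ((gpa | size) & (PAGE_SIZE - 1))
		return false;

	*start = gpa >> PAGE_SHIFT;
	*end = (gpa + size) >> PAGE_SHIFT;
	return true;
}
```

For example, gpa 0x1000 with size 0x2000 covers gfns 1 and 2, so the helper reports start 1 and end 3.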
On 23.09.22 02:58, Kirill A . Shutemov wrote:
> On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
>>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>>> index 6325d1d0e90f..9d066be3d7e8 100644
>>> --- a/include/uapi/linux/magic.h
>>> +++ b/include/uapi/linux/magic.h
>>> @@ -101,5 +101,6 @@
>>> #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
>>> #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
>>> #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
>>> +#define INACCESSIBLE_MAGIC 0x494e4143 /* "INAC" */
>>
>>
>> [...]
>>
>>> +
>>> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
>>> + int *order)
>>> +{
>>> + struct inaccessible_data *data = file->f_mapping->private_data;
>>> + struct file *memfd = data->memfd;
>>> + struct page *page;
>>> + int ret;
>>> +
>>> + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + *pfn = page_to_pfn_t(page);
>>> + *order = thp_order(compound_head(page));
>>> + SetPageUptodate(page);
>>> + unlock_page(page);
>>> +
>>> + return 0;
>>> +}
>>> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
>>> +
>>> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
>>> +{
>>> + struct page *page = pfn_t_to_page(pfn);
>>> +
>>> + if (WARN_ON_ONCE(!page))
>>> + return;
>>> +
>>> + put_page(page);
>>> +}
>>> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
>>
>> Sorry, I missed your reply regarding get/put interface.
>>
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> "We have a design assumption that somedays this can even support non-page
>> based backing stores."
>>
>> As long as there is no such user in sight (especially how to get the memfd
>> from even allocating such memory which will require bigger changes), I
>> prefer to keep it simple here and work on pages/folios. No need to
>> over-complicate it for now.
>
> Sean, Paolo, what is your take on this? Do you have a concrete use case of
> a pageless backend for the mechanism in sight? Maybe DAX?
The problem I'm having with this is how to actually get such memory into
the memory backend (that triggers notifiers), and what the semantics even
are for memory that is not managed by the buddy allocator.
A memfd with fixed PFNs doesn't make much sense.
When using DAX, what happens on shared<->private conversion?
Which "type" is supposed to use DAX, and which is not?
In other words, I'm missing too many details on the bigger picture of how
this would work at all to see why it makes sense to prepare for it right
now.
--
Thanks,
David / dhildenb
On Mon, Sep 26, 2022 at 11:36:34AM +0100, Fuad Tabba wrote:
...
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 2125b50f6345..d65690cae80b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > #endif
> >
> > +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> > +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> > +#else
> > +static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start
> > + gfn_t gfn_end)
>
> Missing a comma after gfn_start.
Good catch, thanks!
Chao
>
> Cheers,
> /fuad
On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly
> > control usage without taking on too much complexity in the kernel, but
> > working through things, routing the behavior through the shim itself might not be
> > all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
> >
> > E.g. on the memfd side:
> >
> > 1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> > mapping is all or nothing.
> >
> > 2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> > the restricted memfd.
> >
> > 3. Add notifier hooks to allow downstream users to further restrict things.
> >
> > 4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> > one shot.
> >
> > 5. Require that there are no outstanding references at munmap(). Or if this
> > can't be guaranteed by userspace, maybe add some way for userspace to wait
> > until it's ok to convert to private? E.g. so that get_pfn() doesn't need
> > to do an expensive check every time.
> >
> > static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> > {
> > if (vma->vm_pgoff)
> > return -EINVAL;
> >
> > if ((vma->vm_end - vma->vm_start) != <file size>)
> > return -EINVAL;
> >
> > mutex_lock(&data->lock);
> >
> > if (data->has_mapping) {
> > r = -EINVAL;
> > goto err;
> > }
> > list_for_each_entry(notifier, &data->notifiers, list) {
> > r = notifier->ops->mmap_start(notifier, ...);
> > if (r)
> > goto abort;
> > }
> >
> > notifier->ops->mmap_end(notifier, ...);
> > mutex_unlock(&data->lock);
> > return 0;
> >
> > abort:
> > list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
> > notifier->ops->mmap_abort(notifier, ...);
> > err:
> > mutex_unlock(&data->lock);
> > return r;
> > }
> >
> > static void memfd_restricted_close(struct vm_area_struct *vma)
> > {
> > mutex_lock(...);
> >
> > /*
> > * Destroy the memfd and disable all future accesses if there are
> > * outstanding refcounts (or other unsatisfied restrictions?).
> > */
> > if (<outstanding references> || ???)
> > memfd_restricted_destroy(...);
> > else
> > data->has_mapping = false;
> >
> > mutex_unlock(...);
> > }
> >
> > static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> > {
> > return -EINVAL;
> > }
> >
> > static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> > {
> > return -EINVAL;
> > }
> >
> > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> >
> > 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > memory into the guest (after pre-boot phase).
> >
> > 2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > and only if the entire gfn range of the associated memslot is shared.
>
> In general I think that this would work with pKVM. However, limiting
> private<->shared conversions to the granularity of a whole memslot
> might be difficult to handle in pKVM, since the guest doesn't have the
> concept of memslots. For example, in pKVM right now, when a guest
> shares back its restricted DMA pool with the host it does so at the
> page-level. pKVM would also need a way to make an fd accessible again
> when shared back, which I think isn't possible with this patch.
But does pKVM really want to mmap/munmap a new region at page
granularity? As I see it, that can cause VMA fragmentation if
conversions are frequent. Even with a KVM ioctl for mapping, as
mentioned below, I think the same issue remains.
>
> You were initially considering a KVM ioctl for mapping, which might be
> better suited for this since KVM knows which pages are shared and
> which ones are private. So routing things through KVM might simplify
> things and allow it to enforce all the necessary restrictions (e.g.,
> private memory cannot be mapped). What do you think?
>
> Thanks,
> /fuad
On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> On 23.09.22 02:58, Kirill A . Shutemov wrote:
> > On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > > > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > > > index 6325d1d0e90f..9d066be3d7e8 100644
> > > > --- a/include/uapi/linux/magic.h
> > > > +++ b/include/uapi/linux/magic.h
> > > > @@ -101,5 +101,6 @@
> > > > #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
> > > > #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> > > > #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
> > > > +#define INACCESSIBLE_MAGIC 0x494e4143 /* "INAC" */
> > >
> > >
> > > [...]
> > >
> > > > +
> > > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > > + int *order)
> > > > +{
> > > > + struct inaccessible_data *data = file->f_mapping->private_data;
> > > > + struct file *memfd = data->memfd;
> > > > + struct page *page;
> > > > + int ret;
> > > > +
> > > > + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + *pfn = page_to_pfn_t(page);
> > > > + *order = thp_order(compound_head(page));
> > > > + SetPageUptodate(page);
> > > > + unlock_page(page);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > > > +
> > > > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > > > +{
> > > > + struct page *page = pfn_t_to_page(pfn);
> > > > +
> > > > + if (WARN_ON_ONCE(!page))
> > > > + return;
> > > > +
> > > > + put_page(page);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> > >
> > > Sorry, I missed your reply regarding get/put interface.
> > >
> > > https://lore.kernel.org/linux-mm/[email protected]/
> > >
> > > "We have a design assumption that somedays this can even support non-page
> > > based backing stores."
> > >
> > > As long as there is no such user in sight (especially how to get the memfd
> > > from even allocating such memory which will require bigger changes), I
> > > prefer to keep it simple here and work on pages/folios. No need to
> > > over-complicate it for now.
> >
> > > Sean, Paolo, what is your take on this? Do you have a concrete use case of
> > > a pageless backend for the mechanism in sight? Maybe DAX?
>
> The problem I'm having with this is how to actually get such memory into the
> memory backend (that triggers notifiers) and what the semantics are at all
> with memory that is not managed by the buddy.
>
> memfd with fixed PFNs doesn't make too much sense.
What do you mean by "fixed PFN"? It is as fixed as struct page/folio, no?
A PFN covers more possible backends.
> When using DAX, what happens with the shared <->private conversion? Which
> "type" is supposed to use dax, which not?
>
> In other word, I'm missing too many details on the bigger picture of how
> this would work at all to see why it makes sense right now to prepare for
> that.
IIUC, KVM doesn't really care about pages or folios. It needs a PFN to
populate the SEPT. Returning a page/folio would make KVM take additional
steps to extract the PFN, which is one more place to have a bug.
--
Kiryl Shutsemau / Kirill A. Shutemov
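The PFN-versus-page argument above can be illustrated with a toy flat memory map. This is only a userspace sketch: `toy_page`, `toy_mem_map`, `NR_FRAMES` and the `toy_*` functions are hypothetical, and the real page_to_pfn() depends on the kernel's memory model. It shows the extra conversion a page-returning interface would leave to the caller, compared with one that returns the frame number directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a flat memory map: one descriptor per frame. */
struct toy_page { unsigned long flags; };

#define NR_FRAMES 16
static struct toy_page toy_mem_map[NR_FRAMES];

/* The extra step a page-returning interface pushes onto the caller. */
static unsigned long toy_page_to_pfn(const struct toy_page *page)
{
	return (unsigned long)(page - toy_mem_map);
}

/* Interface A: hand back the descriptor; the caller converts it. */
static struct toy_page *toy_get_page(size_t offset)
{
	return offset < NR_FRAMES ? &toy_mem_map[offset] : NULL;
}

/* Interface B: hand back the frame number directly; 0 on success. */
static int toy_get_pfn(size_t offset, unsigned long *pfn)
{
	if (offset >= NR_FRAMES)
		return -1;
	*pfn = offset;
	return 0;
}
```

Both interfaces lead to the same frame number for a valid offset; the difference is only where the conversion happens and who has to get it right.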
On 26.09.22 16:48, Kirill A. Shutemov wrote:
> On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
>> On 23.09.22 02:58, Kirill A . Shutemov wrote:
>>> On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
>>>>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>>>>> index 6325d1d0e90f..9d066be3d7e8 100644
>>>>> --- a/include/uapi/linux/magic.h
>>>>> +++ b/include/uapi/linux/magic.h
>>>>> @@ -101,5 +101,6 @@
>>>>> #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
>>>>> #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
>>>>> #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
>>>>> +#define INACCESSIBLE_MAGIC 0x494e4143 /* "INAC" */
>>>>
>>>>
>>>> [...]
>>>>
>>>>> +
>>>>> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
>>>>> + int *order)
>>>>> +{
>>>>> + struct inaccessible_data *data = file->f_mapping->private_data;
>>>>> + struct file *memfd = data->memfd;
>>>>> + struct page *page;
>>>>> + int ret;
>>>>> +
>>>>> + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
>>>>> + if (ret)
>>>>> + return ret;
>>>>> +
>>>>> + *pfn = page_to_pfn_t(page);
>>>>> + *order = thp_order(compound_head(page));
>>>>> + SetPageUptodate(page);
>>>>> + unlock_page(page);
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
>>>>> +
>>>>> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
>>>>> +{
>>>>> + struct page *page = pfn_t_to_page(pfn);
>>>>> +
>>>>> + if (WARN_ON_ONCE(!page))
>>>>> + return;
>>>>> +
>>>>> + put_page(page);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
>>>>
>>>> Sorry, I missed your reply regarding get/put interface.
>>>>
>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> "We have a design assumption that somedays this can even support non-page
>>>> based backing stores."
>>>>
>>>> As long as there is no such user in sight (especially how to get the memfd
>>>> from even allocating such memory which will require bigger changes), I
>>>> prefer to keep it simple here and work on pages/folios. No need to
>>>> over-complicate it for now.
>>>
>>> Sean, Paolo, what is your take on this? Do you have a concrete use case of
>>> a pageless backend for the mechanism in sight? Maybe DAX?
>>
>> The problem I'm having with this is how to actually get such memory into the
>> memory backend (that triggers notifiers) and what the semantics are at all
>> with memory that is not managed by the buddy.
>>
>> memfd with fixed PFNs doesn't make too much sense.
>
> What do you mean by "fixed PFN". It is as fixed as struct page/folio, no?
> PFN covers more possible backends.
For DAX, you usually bypass the buddy and map /dev/mem or a devdax
device, in contrast to an ordinary memfd, which allocates memory via the
buddy. That's the difference I see, and I wonder how it could work.
>
>> When using DAX, what happens with the shared <->private conversion? Which
>> "type" is supposed to use dax, which not?
>>
>> In other word, I'm missing too many details on the bigger picture of how
>> this would work at all to see why it makes sense right now to prepare for
>> that.
>
> IIUC, KVM doesn't really care about pages or folios. They need PFN to
> populate SEPT. Returning page/folio would make KVM do additional steps to
> extract PFN and one more place to have a bug.
Fair enough. Smells KVM specific, though.
--
Thanks,
David / dhildenb
Hi,
On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <[email protected]> wrote:
>
> On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > > piled on top.
> > >
> > > My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly
> > > control usage without taking on too much complexity in the kernel, but
> > > working through things, routing the behavior through the shim itself might not be
> > > all that horrific.
> > >
> > > IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> > > things got too complex, but with the shim it doesn't seem _that_ bad.
> > >
> > > E.g. on the memfd side:
> > >
> > > 1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> > > mapping is all or nothing.
> > >
> > > 2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> > > the restricted memfd.
> > >
> > > 3. Add notifier hooks to allow downstream users to further restrict things.
> > >
> > > 4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> > > one shot.
> > >
> > > 5. Require that there are no outstanding references at munmap(). Or if this
> > > can't be guaranteed by userspace, maybe add some way for userspace to wait
> > > until it's ok to convert to private? E.g. so that get_pfn() doesn't need
> > > to do an expensive check every time.
> > >
> > > static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> > > {
> > > if (vma->vm_pgoff)
> > > return -EINVAL;
> > >
> > > if ((vma->vm_end - vma->vm_start) != <file size>)
> > > return -EINVAL;
> > >
> > > mutex_lock(&data->lock);
> > >
> > > if (data->has_mapping) {
> > > r = -EINVAL;
> > > goto err;
> > > }
> > > list_for_each_entry(notifier, &data->notifiers, list) {
> > > r = notifier->ops->mmap_start(notifier, ...);
> > > if (r)
> > > goto abort;
> > > }
> > >
> > > notifier->ops->mmap_end(notifier, ...);
> > > mutex_unlock(&data->lock);
> > > return 0;
> > >
> > > abort:
> > > list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
> > > notifier->ops->mmap_abort(notifier, ...);
> > > err:
> > > mutex_unlock(&data->lock);
> > > return r;
> > > }
> > >
> > > static void memfd_restricted_close(struct vm_area_struct *vma)
> > > {
> > > mutex_lock(...);
> > >
> > > /*
> > > * Destroy the memfd and disable all future accesses if there are
> > > * outstanding refcounts (or other unsatisfied restrictions?).
> > > */
> > > if (<outstanding references> || ???)
> > > memfd_restricted_destroy(...);
> > > else
> > > data->has_mapping = false;
> > >
> > > mutex_unlock(...);
> > > }
> > >
> > > static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> > > {
> > > return -EINVAL;
> > > }
> > >
> > > static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> > > {
> > > return -EINVAL;
> > > }
> > >
> > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > >
> > > 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > memory into the guest (after pre-boot phase).
> > >
> > > 2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > and only if the entire gfn range of the associated memslot is shared.
> >
> > In general I think that this would work with pKVM. However, limiting
> > private<->shared conversions to the granularity of a whole memslot
> > might be difficult to handle in pKVM, since the guest doesn't have the
> > concept of memslots. For example, in pKVM right now, when a guest
> > shares back its restricted DMA pool with the host it does so at the
> > page-level. pKVM would also need a way to make an fd accessible again
> > when shared back, which I think isn't possible with this patch.
>
> But does pKVM really want to mmap/munmap a new region at the page-level,
> that can cause VMA fragmentation if the conversion is frequent as I see.
> Even with a KVM ioctl for mapping as mentioned below, I think there will
> be the same issue.
pKVM doesn't really need to unmap the memory. What is really important
is that the memory is not GUP'able. Having private memory mapped and
then accessed by a misbehaving/malicious process will reinject a fault
into the misbehaving process.
Cheers,
/fuad
> >
> > You were initially considering a KVM ioctl for mapping, which might be
> > better suited for this since KVM knows which pages are shared and
> > which ones are private. So routing things through KVM might simplify
> > things and allow it to enforce all the necessary restrictions (e.g.,
> > private memory cannot be mapped). What do you think?
> >
> > Thanks,
> > /fuad
On Mon, Sep 26, 2022, Fuad Tabba wrote:
> Hi,
>
> On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <[email protected]> wrote:
> >
> > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > >
> > > > 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > memory into the guest (after pre-boot phase).
> > > >
> > > > 2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > and only if the entire gfn range of the associated memslot is shared.
> > >
> > > In general I think that this would work with pKVM. However, limiting
> > > private<->shared conversions to the granularity of a whole memslot
> > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > concept of memslots. For example, in pKVM right now, when a guest
> > > shares back its restricted DMA pool with the host it does so at the
> > > page-level.
Y'all are killing me :-)
Isn't the guest enlightened? E.g. can't you tell the guest "thou shalt share at
granularity X"? With KVM's newfangled scalable memslots and per-vCPU MRU slot,
X doesn't even have to be that high to get reasonable performance, e.g. assuming
the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
work just fine in KVM.
> > > pKVM would also need a way to make an fd accessible again
> > > when shared back, which I think isn't possible with this patch.
> >
> > But does pKVM really want to mmap/munmap a new region at the page-level,
> > that can cause VMA fragmentation if the conversion is frequent as I see.
> > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > be the same issue.
>
> pKVM doesn't really need to unmap the memory. What is really important
> is that the memory is not GUP'able.
Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
otherwise KVM wouldn't be able to get the PFN to map into guest memory.
The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
the end result is the same.
Emphasis above because pKVM still needs to unmap the memory _somewhere_. IIUC, the
current approach is to do that only in the stage-2 page tables, i.e. only in the
context of the hypervisor. Which is also the source of the gup() problems; the
untrusted kernel is blissfully unaware that the memory is inaccessible.
Any approach that moves some of that information into the untrusted kernel so that
the kernel can protect itself will incur fragmentation in the VMAs. Well, unless
all of guest memory becomes unguppable, but that's likely not a viable option.
On Mon, Sep 26, 2022, David Hildenbrand wrote:
> On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > When using DAX, what happens with the shared <->private conversion? Which
> > > "type" is supposed to use dax, which not?
> > >
> > > In other words, I'm missing too many details on the bigger picture of how
> > > this would work at all to see why it makes sense right now to prepare for
> > > that.
> >
> > IIUC, KVM doesn't really care about pages or folios. They need PFN to
> > populate SEPT. Returning page/folio would make KVM do additional steps to
> > extract PFN and one more place to have a bug.
>
> Fair enough. Smells KVM specific, though.
TL;DR: I'm good with either approach, though providing a "struct page" might avoid
refactoring the API in the nearish future.
Playing devil's advocate for a second, the counter argument is that KVM is the
only user for the foreseeable future.
That said, it might make sense to return a "struct page" from the core API and
force KVM to do page_to_pfn(). KVM already does that for HVA-based memory, so
it's not exactly new code.
More importantly, KVM may actually need/want the "struct page" in the not-too-distant
future to support mapping non-refcounted "struct page" memory into the guest. The
ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get handed a
"struct page" that _isn't_ refcounted[*]. Once the lack of mmu_notifier integration
is fixed, the remaining issue is that KVM doesn't currently have a way to determine
whether or not it holds a reference to the page. Instead, KVM assumes that if the
page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().
KVM's current workaround for this is to refuse to map these pages into the guest,
i.e. KVM simply forces its assumption that normal pages are refcounted to be true.
To remove that workaround, the likely solution will be to pass around a tuple of
page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".
At that point, getting handed a "struct page" from the core API would be a good
thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
refcounted page.
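A rough userspace model of the page+pfn tuple idea, purely as a sketch: struct page here is a stand-in so the example compiles, and struct pfn_and_page, holds_reference() and release_pfn() are hypothetical names, not existing KVM interfaces. The point is only that a non-NULL "page" tells the holder whether there is a reference to drop, without probing the PFN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct page; models only a reference count. */
struct page {
	int refcount;
};

/*
 * Hypothetical tuple: "page" is non-NULL if and only if "pfn" is backed by
 * a refcounted struct page that the holder took a reference on.
 */
struct pfn_and_page {
	uint64_t pfn;
	struct page *page;
};

static bool holds_reference(const struct pfn_and_page *p)
{
	return p->page != NULL;
}

/* Drop the reference only if one is held; never guess from the PFN. */
static void release_pfn(struct pfn_and_page *p)
{
	if (holds_reference(p)) {
		assert(p->page->refcount > 0);
		p->page->refcount--;
	}
	p->page = NULL;
}
```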
Note, I still want the order to be provided by the API so that KVM doesn't need
to run through a bunch of helpers to try and figure out the allowed mapping size.
[*] https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=PRt73YmFzP6Jw@mail.gmail.com
On Tue, Sep 27, 2022 at 11:23:24PM +0000, Sean Christopherson wrote:
> On Mon, Sep 26, 2022, David Hildenbrand wrote:
> > On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > > When using DAX, what happens with the shared <->private conversion? Which
> > > > "type" is supposed to use dax, which not?
> > > >
> > > > In other words, I'm missing too many details on the bigger picture of how
> > > > this would work at all to see why it makes sense right now to prepare for
> > > > that.
> > >
> > > IIUC, KVM doesn't really care about pages or folios. They need PFN to
> > > populate SEPT. Returning page/folio would make KVM do additional steps to
> > > extract PFN and one more place to have a bug.
> >
> > Fair enough. Smells KVM specific, though.
>
> TL;DR: I'm good with either approach, though providing a "struct page" might avoid
> refactoring the API in the nearish future.
>
> Playing devil's advocate for a second, the counter argument is that KVM is the
> only user for the foreseeable future.
>
> That said, it might make sense to return a "struct page" from the core API and
> force KVM to do page_to_pfn(). KVM already does that for HVA-based memory, so
> it's not exactly new code.
Core MM is moving away from struct page in favour of struct folio, so we
can make the interface return a folio.
But that would require more work on the KVM side.
folio_pfn(folio) + offset % folio_nr_pages(folio) would give you the
base-page-size PFN for a given offset. I guess it is not too hard.
It also gives KVM the ability to populate multiple EPT entries for a
non-zero-order folio and save a few cycles.
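As a sketch of that calculation, here is a userspace model. struct mock_folio and the mock_* helpers are stand-ins invented for this example, and it assumes the folio is naturally aligned in the file, so that offset % nr_pages is the page's index within the folio.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct folio: first PFN plus order (2^order base pages). */
struct mock_folio {
	uint64_t first_pfn;
	unsigned int order;
};

/* Models folio_nr_pages(): number of base pages in the folio. */
static uint64_t mock_folio_nr_pages(const struct mock_folio *folio)
{
	return 1ULL << folio->order;
}

/*
 * PFN of the base page backing file page offset "offset". Note that '%'
 * binds tighter than '+', matching the expression in the text.
 */
static uint64_t mock_pfn_for_offset(const struct mock_folio *folio,
				    uint64_t offset)
{
	assert(folio->order < 64);
	return folio->first_pfn + offset % mock_folio_nr_pages(folio);
}
```

For an order-9 folio (512 base pages) starting at PFN 0x1000, file offset 517 lands on page 5 of the folio, i.e. PFN 0x1005.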
Does it work for you?
> More importantly, KVM may actually need/want the "struct page" in the not-too-distant
> future to support mapping non-refcounted "struct page" memory into the guest. The
> ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get handed a
> "struct page" that _isn't_ refcounted[*]. Once the lack of mmu_notifier integration
> is fixed, the remaining issue is that KVM doesn't currently have a way to determine
> whether or not it holds a reference to the page. Instead, KVM assumes that if the
> page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().
>
> KVM's current workaround for this is to refuse to map these pages into the guest,
> i.e. KVM simply forces its assumption that normal pages are refcounted to be true.
> To remove that workaround, the likely solution will be to pass around a tuple of
> page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".
>
> At that point, getting handed a "struct page" from the core API would be a good
> thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
> refcounted page.
>
> Note, I still want the order to be provided by the API so that KVM doesn't need
> to run through a bunch of helpers to try and figure out the allowed mapping size.
>
> [*] https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=PRt73YmFzP6Jw@mail.gmail.com
These non-refcounted "struct page"s confuse me.
IIUC (probably not), the idea is to share a buffer between host and guest
and avoid double buffering in page cache on the guest ("guest shadow
buffer" means page cache, right?). Don't we already have DAX interfaces to
bypass guest page cache?
And do you think it would need to be handled at the inaccessible API level, or
is it a KVM-only thing that uses the inaccessible API for some use cases?
--
Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Sep 15, 2022 at 10:29:11PM +0800,
Chao Peng <[email protected]> wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 08abad4f3e6f..a0f198cede3d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
...
> @@ -6894,3 +6899,115 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> if (kvm->arch.nx_lpage_recovery_thread)
> kthread_stop(kvm->arch.nx_lpage_recovery_thread);
> }
> +
> +static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
> + gfn_t start, gfn_t end)
> +{
> + XA_STATE(xas, &kvm->mem_attr_array, start);
> + gfn_t gfn = start;
> + void *entry;
> + bool shared, private;
> + bool mixed = false;
> +
> + if (attr == KVM_MEM_ATTR_SHARED) {
> + shared = true;
> + private = false;
> + } else {
> + shared = false;
> + private = true;
> + }
We don't have to care whether the target is shared or private. We only need
to check whether all entries in the range are the same.
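A minimal userspace sketch of that "same or not" check, assuming a flat array in place of the xarray (attrs[] and range_is_mixed() are hypothetical names for this example only): compare every entry in the range against the first one, without knowing which attribute is being set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * attrs[i] models the attribute stored for gfn i (e.g. 0 = shared,
 * 1 = private). Returns true if [start, end) holds more than one value.
 */
static bool range_is_mixed(const unsigned char *attrs, size_t start,
			   size_t end)
{
	size_t gfn;

	assert(start < end);
	/* Only indices inside [start, end) are read. */
	for (gfn = start + 1; gfn < end; gfn++) {
		if (attrs[gfn] != attrs[start])
			return true;
	}
	return false;
}
```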
> +
> + rcu_read_lock();
> + entry = xas_load(&xas);
> + while (gfn < end) {
> + if (xas_retry(&xas, entry))
> + continue;
> +
> + KVM_BUG_ON(gfn != xas.xa_index, kvm);
> +
> + if (entry)
> + private = true;
> + else
> + shared = true;
> +
> + if (private && shared) {
> + mixed = true;
> + goto out;
> + }
> +
> + entry = xas_next(&xas);
> + gfn++;
> + }
> +out:
> + rcu_read_unlock();
> + return mixed;
> +}
> +
> +static inline void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> +{
> + if (mixed)
> + linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> + else
> + linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static void update_mem_lpage_info(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned int attr,
> + gfn_t start, gfn_t end)
> +{
> + unsigned long lpage_start, lpage_end;
> + unsigned long gfn, pages, mask;
> + int level;
> +
> + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> + pages = KVM_PAGES_PER_HPAGE(level);
> + mask = ~(pages - 1);
> + lpage_start = start & mask;
> + lpage_end = (end - 1) & mask;
> +
> + /*
> + * We only need to scan the head and tail page, for middle pages
> + * we know they are not mixed.
> + */
> + update_mixed(lpage_info_slot(lpage_start, slot, level),
> + mem_attr_is_mixed(kvm, attr, lpage_start,
> + lpage_start + pages));
> +
> + if (lpage_start == lpage_end)
> + return;
> +
> + for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> + update_mixed(lpage_info_slot(gfn, slot, level), false);
For the >2M case, we don't have to check every entry; just check the lower-level lpage info.
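Roughly, as a userspace model (struct child_summary and parent_is_mixed() are hypothetical, and the sketch assumes the level - 1 flags are already up to date): a large page is mixed if any of its children is itself mixed, or if the children are each uniform but do not all carry the same attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-child summary at level - 1. */
struct child_summary {
	bool mixed;		/* child's own mixed flag */
	unsigned char attr;	/* attribute of the child's first gfn */
};

static bool parent_is_mixed(const struct child_summary *children,
			    size_t nr_children)
{
	size_t i;

	assert(nr_children > 0);
	for (i = 0; i < nr_children; i++) {
		/* A mixed child makes the parent mixed. */
		if (children[i].mixed)
			return true;
		/* Uniform children with different attributes do too. */
		if (children[i].attr != children[0].attr)
			return true;
	}
	return false;
}
```

With 512 children per level this inspects at most 512 summaries instead of every 4K entry under a 1G page.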
> +
> + update_mixed(lpage_info_slot(lpage_end, slot, level),
> + mem_attr_is_mixed(kvm, attr, lpage_end,
> + lpage_end + pages));
> + }
> +}
> +
> +void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
> + gfn_t start, gfn_t end)
> +{
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> +
> + WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> + "Unsupported mem attribute.\n");
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> + slot = iter.slot;
> + start = max(start, slot->base_gfn);
> + end = min(end, slot->base_gfn + slot->npages);
> + if (WARN_ON_ONCE(start >= end))
> + continue;
> +
> + update_mem_lpage_info(kvm, slot, attr, start, end);
> + }
> + }
> +}
Here is my updated version.
bool kvm_mem_attr_is_mixed(struct kvm_memory_slot *slot, gfn_t gfn, int level)
{
	gfn_t pages = KVM_PAGES_PER_HPAGE(level);
	gfn_t mask = ~(pages - 1);
	struct kvm_lpage_info *linfo = lpage_info_slot(gfn & mask, slot, level);

	WARN_ON_ONCE(level == PG_LEVEL_4K);
	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
}

#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM_ATTR
static void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
{
	if (mixed)
		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
	else
		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
}

static bool __mem_attr_is_mixed(struct kvm *kvm, gfn_t start, gfn_t end)
{
	XA_STATE(xas, &kvm->mem_attr_array, start);
	bool mixed = false;
	gfn_t gfn = start;
	void *s_entry;
	void *entry;

	rcu_read_lock();
	s_entry = xas_load(&xas);
	entry = s_entry;
	while (gfn < end) {
		if (xas_retry(&xas, entry))
			continue;

		KVM_BUG_ON(gfn != xas.xa_index, kvm);

		entry = xas_next(&xas);
		if (entry != s_entry) {
			mixed = true;
			break;
		}
		gfn++;
	}
	rcu_read_unlock();
	return mixed;
}

static bool mem_attr_is_mixed(struct kvm *kvm,
			      struct kvm_memory_slot *slot, int level,
			      gfn_t start, gfn_t end)
{
	struct kvm_lpage_info *child_linfo;
	unsigned long child_pages;
	bool mixed = false;
	unsigned long gfn;
	void *entry;

	if (WARN_ON_ONCE(level == PG_LEVEL_4K))
		return false;

	if (level == PG_LEVEL_2M)
		return __mem_attr_is_mixed(kvm, start, end);

	/* This assumes that level - 1 is already updated. */
	rcu_read_lock();
	child_pages = KVM_PAGES_PER_HPAGE(level - 1);
	entry = xa_load(&kvm->mem_attr_array, start);
	for (gfn = start; gfn < end; gfn += child_pages) {
		child_linfo = lpage_info_slot(gfn, slot, level - 1);
		if (child_linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED) {
			mixed = true;
			break;
		}
		if (xa_load(&kvm->mem_attr_array, gfn) != entry) {
			mixed = true;
			break;
		}
	}
	rcu_read_unlock();
	return mixed;
}

static void update_mem_lpage_info(struct kvm *kvm,
				  struct kvm_memory_slot *slot,
				  unsigned int attr,
				  gfn_t start, gfn_t end)
{
	unsigned long lpage_start, lpage_end;
	unsigned long gfn, pages, mask;
	int level;

	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
		pages = KVM_PAGES_PER_HPAGE(level);
		mask = ~(pages - 1);
		lpage_start = start & mask;
		lpage_end = (end - 1) & mask;

		/*
		 * We only need to scan the head and tail page, for middle pages
		 * we know they are not mixed.
		 */
		update_mixed(lpage_info_slot(lpage_start, slot, level),
			     mem_attr_is_mixed(kvm, slot, level,
					       lpage_start, lpage_start + pages));

		if (lpage_start == lpage_end)
			return;

		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
			update_mixed(lpage_info_slot(gfn, slot, level), false);

		update_mixed(lpage_info_slot(lpage_end, slot, level),
			     mem_attr_is_mixed(kvm, slot, level,
					       lpage_end, lpage_end + pages));
	}
}

void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
			      gfn_t start, gfn_t end)
{
	struct kvm_memory_slot *slot;
	struct kvm_memslots *slots;
	struct kvm_memslot_iter iter;
	int idx;
	int i;

	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
		  "Unsupported mem attribute.\n");

	idx = srcu_read_lock(&kvm->srcu);
	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
		slots = __kvm_memslots(kvm, i);

		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
			slot = iter.slot;
			start = max(start, slot->base_gfn);
			end = min(end, slot->base_gfn + slot->npages);
			if (WARN_ON_ONCE(start >= end))
				continue;

			update_mem_lpage_info(kvm, slot, attr, start, end);
		}
	}
	srcu_read_unlock(&kvm->srcu, idx);
}
#endif
--
Isaku Yamahata <[email protected]>
On Thu, Sep 29, 2022 at 09:52:06AM -0700, Isaku Yamahata wrote:
> On Thu, Sep 15, 2022 at 10:29:11PM +0800,
> Chao Peng <[email protected]> wrote:
>
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 08abad4f3e6f..a0f198cede3d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> ...
> > @@ -6894,3 +6899,115 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > if (kvm->arch.nx_lpage_recovery_thread)
> > kthread_stop(kvm->arch.nx_lpage_recovery_thread);
> > }
> > +
> > +static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
> > + gfn_t start, gfn_t end)
> > +{
> > + XA_STATE(xas, &kvm->mem_attr_array, start);
> > + gfn_t gfn = start;
> > + void *entry;
> > + bool shared, private;
> > + bool mixed = false;
> > +
> > + if (attr == KVM_MEM_ATTR_SHARED) {
> > + shared = true;
> > + private = false;
> > + } else {
> > + shared = false;
> > + private = true;
> > + }
>
> We don't have to care the target is shared or private. We need to check
> only same or not.
There is an optimization opportunity if we know what we are going to set: we
can return 'mixed = true' as soon as we find the first entry with the opposite
attr, i.e. it's unnecessary to check all the child page attrs in one large page
to reach a conclusion.
After a further look, the code can be refined as below:
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7255,17 +7255,9 @@ static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
XA_STATE(xas, &kvm->mem_attr_array, start);
gfn_t gfn = start;
void *entry;
- bool shared, private;
+ bool shared = attr == KVM_MEM_ATTR_SHARED;
bool mixed = false;
- if (attr == KVM_MEM_ATTR_SHARED) {
- shared = true;
- private = false;
- } else {
- shared = false;
- private = true;
- }
-
rcu_read_lock();
entry = xas_load(&xas);
while (gfn < end) {
@@ -7274,12 +7266,7 @@ static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
KVM_BUG_ON(gfn != xas.xa_index, kvm);
- if (entry)
- private = true;
- else
- shared = true;
-
- if (private && shared) {
+ if ((entry && !shared) || (!entry && shared)) {
mixed = true;
goto out;
}
@@ -7320,8 +7307,7 @@ static void update_mem_lpage_info(struct kvm *kvm,
* we know they are not mixed.
*/
update_mixed(lpage_info_slot(lpage_start, slot, level),
- mem_attr_is_mixed(kvm, attr, lpage_start,
- lpage_start + pages));
+ mem_attr_is_mixed(kvm, attr, lpage_start, start));
if (lpage_start == lpage_end)
return;
@@ -7330,7 +7316,7 @@ static void update_mem_lpage_info(struct kvm *kvm,
update_mixed(lpage_info_slot(gfn, slot, level), false);
update_mixed(lpage_info_slot(lpage_end, slot, level),
- mem_attr_is_mixed(kvm, attr, lpage_end,
+ mem_attr_is_mixed(kvm, attr, end,
lpage_end + pages));
}
}
>
> > +
> > + rcu_read_lock();
> > + entry = xas_load(&xas);
> > + while (gfn < end) {
> > + if (xas_retry(&xas, entry))
> > + continue;
> > +
> > + KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > +
> > + if (entry)
> > + private = true;
> > + else
> > + shared = true;
> > +
> > + if (private && shared) {
> > + mixed = true;
> > + goto out;
> > + }
> > +
> > + entry = xas_next(&xas);
> > + gfn++;
> > + }
> > +out:
> > + rcu_read_unlock();
> > + return mixed;
> > +}
> > +
> > +static inline void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> > +{
> > + if (mixed)
> > + linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > + else
> > + linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static void update_mem_lpage_info(struct kvm *kvm,
> > + struct kvm_memory_slot *slot,
> > + unsigned int attr,
> > + gfn_t start, gfn_t end)
> > +{
> > + unsigned long lpage_start, lpage_end;
> > + unsigned long gfn, pages, mask;
> > + int level;
> > +
> > + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > + pages = KVM_PAGES_PER_HPAGE(level);
> > + mask = ~(pages - 1);
> > + lpage_start = start & mask;
> > + lpage_end = (end - 1) & mask;
> > +
> > + /*
> > + * We only need to scan the head and tail page, for middle pages
> > + * we know they are not mixed.
> > + */
> > + update_mixed(lpage_info_slot(lpage_start, slot, level),
> > + mem_attr_is_mixed(kvm, attr, lpage_start,
> > + lpage_start + pages));
> > +
> > + if (lpage_start == lpage_end)
> > + return;
> > +
> > + for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> > + update_mixed(lpage_info_slot(gfn, slot, level), false);
>
>
> For >2M case, we don't have to check all entry. just check lower level case.
Sounds good, we can reduce some scanning.
Thanks,
Chao
>
> > +
> > + update_mixed(lpage_info_slot(lpage_end, slot, level),
> > + mem_attr_is_mixed(kvm, attr, lpage_end,
> > + lpage_end + pages));
> > + }
> > +}
> > +
> > +void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
> > + gfn_t start, gfn_t end)
> > +{
> > + struct kvm_memory_slot *slot;
> > + struct kvm_memslots *slots;
> > + struct kvm_memslot_iter iter;
> > + int i;
> > +
> > + WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> > + "Unsupported mem attribute.\n");
> > +
> > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > + slots = __kvm_memslots(kvm, i);
> > +
> > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > + slot = iter.slot;
> > + start = max(start, slot->base_gfn);
> > + end = min(end, slot->base_gfn + slot->npages);
> > + if (WARN_ON_ONCE(start >= end))
> > + continue;
> > +
> > + update_mem_lpage_info(kvm, slot, attr, start, end);
> > + }
> > + }
> > +}
>
>
> Here is my updated version.
>
> bool kvm_mem_attr_is_mixed(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> {
> gfn_t pages = KVM_PAGES_PER_HPAGE(level);
> gfn_t mask = ~(pages - 1);
> struct kvm_lpage_info *linfo = lpage_info_slot(gfn & mask, slot, level);
>
> WARN_ON_ONCE(level == PG_LEVEL_4K);
> return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> }
>
> #ifdef CONFIG_HAVE_KVM_PRIVATE_MEM_ATTR
> static void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> {
> if (mixed)
> linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> else
> linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> }
>
> static bool __mem_attr_is_mixed(struct kvm *kvm, gfn_t start, gfn_t end)
> {
> XA_STATE(xas, &kvm->mem_attr_array, start);
> bool mixed = false;
> gfn_t gfn = start;
> void *s_entry;
> void *entry;
>
> rcu_read_lock();
> s_entry = xas_load(&xas);
> entry = s_entry;
> while (gfn < end) {
> if (xas_retry(&xas, entry))
> continue;
>
> KVM_BUG_ON(gfn != xas.xa_index, kvm);
>
> entry = xas_next(&xas);
> if (entry != s_entry) {
> mixed = true;
> break;
> }
> gfn++;
> }
> rcu_read_unlock();
> return mixed;
> }
>
> static bool mem_attr_is_mixed(struct kvm *kvm,
> struct kvm_memory_slot *slot, int level,
> gfn_t start, gfn_t end)
> {
> struct kvm_lpage_info *child_linfo;
> unsigned long child_pages;
> bool mixed = false;
> unsigned long gfn;
> void *entry;
>
> if (WARN_ON_ONCE(level == PG_LEVEL_4K))
> return false;
>
> if (level == PG_LEVEL_2M)
> return __mem_attr_is_mixed(kvm, start, end);
>
> /* This assumes that level - 1 is already updated. */
> rcu_read_lock();
> child_pages = KVM_PAGES_PER_HPAGE(level - 1);
> entry = xa_load(&kvm->mem_attr_array, start);
> for (gfn = start; gfn < end; gfn += child_pages) {
> child_linfo = lpage_info_slot(gfn, slot, level - 1);
> if (child_linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED) {
> mixed = true;
> break;
> }
> if (xa_load(&kvm->mem_attr_array, gfn) != entry) {
> mixed = true;
> break;
> }
> }
> rcu_read_unlock();
> return mixed;
> }
>
> static void update_mem_lpage_info(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> unsigned int attr,
> gfn_t start, gfn_t end)
> {
> unsigned long lpage_start, lpage_end;
> unsigned long gfn, pages, mask;
> int level;
>
> for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> pages = KVM_PAGES_PER_HPAGE(level);
> mask = ~(pages - 1);
> lpage_start = start & mask;
> lpage_end = (end - 1) & mask;
>
> /*
> * We only need to scan the head and tail page, for middle pages
> * we know they are not mixed.
> */
> update_mixed(lpage_info_slot(lpage_start, slot, level),
> mem_attr_is_mixed(kvm, slot, level,
> lpage_start, lpage_start + pages));
>
> if (lpage_start == lpage_end)
> return;
>
> for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> update_mixed(lpage_info_slot(gfn, slot, level), false);
>
> update_mixed(lpage_info_slot(lpage_end, slot, level),
> mem_attr_is_mixed(kvm, slot, level,
> lpage_end, lpage_end + pages));
> }
> }
>
> void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
> gfn_t start, gfn_t end)
> {
> struct kvm_memory_slot *slot;
> struct kvm_memslots *slots;
> struct kvm_memslot_iter iter;
> int idx;
> int i;
>
> WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> "Unsupported mem attribute.\n");
>
> idx = srcu_read_lock(&kvm->srcu);
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> slots = __kvm_memslots(kvm, i);
>
> kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> slot = iter.slot;
> start = max(start, slot->base_gfn);
> end = min(end, slot->base_gfn + slot->npages);
> if (WARN_ON_ONCE(start >= end))
> continue;
>
> update_mem_lpage_info(kvm, slot, attr, start, end);
> }
> }
> srcu_read_unlock(&kvm->srcu, idx);
> }
> #endif
>
>
> --
> Isaku Yamahata <[email protected]>
Hi,
On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > Hi,
> >
> > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <[email protected]> wrote:
> > >
> > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > >
> > > > > 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > > memory into the guest (after pre-boot phase).
> > > > >
> > > > > 2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > > and only if the entire gfn range of the associated memslot is shared.
> > > >
> > > > In general I think that this would work with pKVM. However, limiting
> > > > private<->shared conversions to the granularity of a whole memslot
> > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > shares back its restricted DMA pool with the host it does so at the
> > > > page-level.
>
> Y'all are killing me :-)
:D
> Isn't the guest enlightened? E.g. can't you tell the guest "thou shalt share at
> granularity X"? With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> X doesn't even have to be that high to get reasonable performance, e.g. assuming
> the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> work just fine in KVM.
The guest is potentially enlightened, but the host doesn't necessarily
know which memslot the guest might want to share back, since it
doesn't know where the guest might want to place the DMA pool. If I
understand this correctly, for this to work, all memslots would need
to be the same size and sharing would always need to happen at that
granularity.
Moreover, for something like a small DMA pool this might scale, but
I'm not sure about potential future workloads (e.g., multimedia
in-place sharing).
>
> > > > pKVM would also need a way to make an fd accessible again
> > > > when shared back, which I think isn't possible with this patch.
> > >
> > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > be the same issue.
> >
> > pKVM doesn't really need to unmap the memory. What is really important
> > is that the memory is not GUP'able.
>
> Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> otherwise KVM wouldn't be able to get the PFN to map into guest memory.
>
> The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
> strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> the end result is the same.
>
> Emphasis above because pKVM still needs to unmap the memory _somewhere_. IIUC, the
> current approach is to do that only in the stage-2 page tables, i.e. only in the
> context of the hypervisor. Which is also the source of the gup() problems; the
> untrusted kernel is blissfully unaware that the memory is inaccessible.
>
> Any approach that moves some of that information into the untrusted kernel so that
> the kernel can protect itself will incur fragmentation in the VMAs. Well, unless
> all of guest memory becomes unguppable, but that's likely not a viable option.
Actually, for pKVM, there is no need for the guest memory to be
GUP'able at all if we use the new inaccessible_get_pfn(). This of
course goes back to what I'd mentioned before in v7; it seems that
representing the memslot memory as a file descriptor should be
orthogonal to whether the memory is shared or private, rather than a
private_fd for private memory and the userspace_addr for shared
memory. The host can then map or unmap the shared/private memory using
the fd, which allows it more freedom in even choosing to unmap shared
memory when not needed, for example.
Cheers,
/fuad
On Fri, Sep 30, 2022 at 05:14:00PM +0100, Fuad Tabba wrote:
> Hi,
>
> <...>
>
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index 000000000000..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
> > @@ -0,0 +1,219 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/memfd.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +
> > +struct inaccessible_data {
> > + struct mutex lock;
> > + struct file *memfd;
> > + struct list_head notifiers;
> > +};
> > +
> > +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> > + pgoff_t start, pgoff_t end)
> > +{
> > + struct inaccessible_notifier *notifier;
> > +
> > + mutex_lock(&data->lock);
> > + list_for_each_entry(notifier, &data->notifiers, list) {
> > + notifier->ops->invalidate(notifier, start, end);
> > + }
> > + mutex_unlock(&data->lock);
> > +}
> > +
> > +static int inaccessible_release(struct inode *inode, struct file *file)
> > +{
> > + struct inaccessible_data *data = inode->i_mapping->private_data;
> > +
> > + fput(data->memfd);
> > + kfree(data);
> > + return 0;
> > +}
> > +
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > + loff_t offset, loff_t len)
> > +{
> > + struct inaccessible_data *data = file->f_mapping->private_data;
> > + struct file *memfd = data->memfd;
> > + int ret;
> > +
> > + if (mode & FALLOC_FL_PUNCH_HOLE) {
> > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > + return -EINVAL;
> > + }
> > +
> > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>
> I think that shmem_file_operations.fallocate is only set if
> CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
> initialization that fallocate is set, or maybe a config dependency, or
> can we count on it always being enabled?
It is already there:
config MEMFD_CREATE
def_bool TMPFS || HUGETLBFS
And we reject inaccessible memfd_create() for HUGETLBFS.
But if we go with a separate syscall, yes, we need the dependency.
> > + inaccessible_notifier_invalidate(data, offset, offset + len);
> > + return ret;
> > +}
> > +
>
> <...>
>
> > +void inaccessible_register_notifier(struct file *file,
> > + struct inaccessible_notifier *notifier)
> > +{
> > + struct inaccessible_data *data = file->f_mapping->private_data;
> > +
> > + mutex_lock(&data->lock);
> > + list_add(&notifier->list, &data->notifiers);
> > + mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
>
> If the memfd wasn't marked as inaccessible, or more generally
> speaking, if the file isn't a memfd_inaccessible file, this ends up
> accessing an uninitialized pointer for the notifier list. Should there
> be a check for that here, and have this function return an error if
> that's not the case?
I think it falls into the "don't do that" category. The caller of
inaccessible_register_notifier() has to know what file it operates on, no?
--
Kiryl Shutsemau / Kirill A. Shutemov
Hi,
<...>
> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index 000000000000..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c
> @@ -0,0 +1,219 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/memfd.h>
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +
> +struct inaccessible_data {
> + struct mutex lock;
> + struct file *memfd;
> + struct list_head notifiers;
> +};
> +
> +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> + pgoff_t start, pgoff_t end)
> +{
> + struct inaccessible_notifier *notifier;
> +
> + mutex_lock(&data->lock);
> + list_for_each_entry(notifier, &data->notifiers, list) {
> + notifier->ops->invalidate(notifier, start, end);
> + }
> + mutex_unlock(&data->lock);
> +}
> +
> +static int inaccessible_release(struct inode *inode, struct file *file)
> +{
> + struct inaccessible_data *data = inode->i_mapping->private_data;
> +
> + fput(data->memfd);
> + kfree(data);
> + return 0;
> +}
> +
> +static long inaccessible_fallocate(struct file *file, int mode,
> + loff_t offset, loff_t len)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> + int ret;
> +
> + if (mode & FALLOC_FL_PUNCH_HOLE) {
> + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> + return -EINVAL;
> + }
> +
> + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
I think that shmem_file_operations.fallocate is only set if
CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
initialization that fallocate is set, or maybe a config dependency, or
can we count on it always being enabled?
> + inaccessible_notifier_invalidate(data, offset, offset + len);
> + return ret;
> +}
> +
<...>
> +void inaccessible_register_notifier(struct file *file,
> + struct inaccessible_notifier *notifier)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> +
> + mutex_lock(&data->lock);
> + list_add(&notifier->list, &data->notifiers);
> + mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
If the memfd wasn't marked as inaccessible, or more generally
speaking, if the file isn't a memfd_inaccessible file, this ends up
accessing an uninitialized pointer for the notifier list. Should there
be a check for that here, and have this function return an error if
that's not the case?
Thanks,
/fuad
> +
> +void inaccessible_unregister_notifier(struct file *file,
> + struct inaccessible_notifier *notifier)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> +
> + mutex_lock(&data->lock);
> + list_del(&notifier->list);
> + mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> + int *order)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> + struct page *page;
> + int ret;
> +
> + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> + if (ret)
> + return ret;
> +
> + *pfn = page_to_pfn_t(page);
> + *order = thp_order(compound_head(page));
> + SetPageUptodate(page);
> + unlock_page(page);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> + struct page *page = pfn_t_to_page(pfn);
> +
> + if (WARN_ON_ONCE(!page))
> + return;
> +
> + put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>
Hi
On Fri, Sep 30, 2022 at 5:23 PM Kirill A . Shutemov
<[email protected]> wrote:
>
> On Fri, Sep 30, 2022 at 05:14:00PM +0100, Fuad Tabba wrote:
> > Hi,
> >
> > <...>
> >
> > > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > > new file mode 100644
> > > index 000000000000..2d33cbdd9282
> > > --- /dev/null
> > > +++ b/mm/memfd_inaccessible.c
> > > @@ -0,0 +1,219 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include "linux/sbitmap.h"
> > > +#include <linux/memfd.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/pseudo_fs.h>
> > > +#include <linux/shmem_fs.h>
> > > +#include <uapi/linux/falloc.h>
> > > +#include <uapi/linux/magic.h>
> > > +
> > > +struct inaccessible_data {
> > > + struct mutex lock;
> > > + struct file *memfd;
> > > + struct list_head notifiers;
> > > +};
> > > +
> > > +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> > > + pgoff_t start, pgoff_t end)
> > > +{
> > > + struct inaccessible_notifier *notifier;
> > > +
> > > + mutex_lock(&data->lock);
> > > + list_for_each_entry(notifier, &data->notifiers, list) {
> > > + notifier->ops->invalidate(notifier, start, end);
> > > + }
> > > + mutex_unlock(&data->lock);
> > > +}
> > > +
> > > +static int inaccessible_release(struct inode *inode, struct file *file)
> > > +{
> > > + struct inaccessible_data *data = inode->i_mapping->private_data;
> > > +
> > > + fput(data->memfd);
> > > + kfree(data);
> > > + return 0;
> > > +}
> > > +
> > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > + loff_t offset, loff_t len)
> > > +{
> > > + struct inaccessible_data *data = file->f_mapping->private_data;
> > > + struct file *memfd = data->memfd;
> > > + int ret;
> > > +
> > > + if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > + return -EINVAL;
> > > + }
> > > +
> > > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> >
> > I think that shmem_file_operations.fallocate is only set if
> > CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
> > initialization that fallocate is set, or maybe a config dependency, or
> > can we count on it always being enabled?
>
> It is already there:
>
> config MEMFD_CREATE
> def_bool TMPFS || HUGETLBFS
>
> And we reject inaccessible memfd_create() for HUGETLBFS.
>
> But if we go with a separate syscall, yes, we need the dependency.
I missed that, thanks.
>
> > > + inaccessible_notifier_invalidate(data, offset, offset + len);
> > > + return ret;
> > > +}
> > > +
> >
> > <...>
> >
> > > +void inaccessible_register_notifier(struct file *file,
> > > + struct inaccessible_notifier *notifier)
> > > +{
> > > + struct inaccessible_data *data = file->f_mapping->private_data;
> > > +
> > > + mutex_lock(&data->lock);
> > > + list_add(&notifier->list, &data->notifiers);
> > > + mutex_unlock(&data->lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> >
> > If the memfd wasn't marked as inaccessible, or more generally
> > speaking, if the file isn't a memfd_inaccessible file, this ends up
> > accessing an uninitialized pointer for the notifier list. Should there
> > be a check for that here, and have this function return an error if
> > that's not the case?
>
> I think it is "don't do that" category. inaccessible_register_notifier()
> caller has to know what file it operates on, no?
The thing is, you could oops the kernel from userspace. For that, all
you have to do is a memfd_create without the MFD_INACCESSIBLE,
followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
I ran into this using my port of this patch series to arm64.
Cheers,
/fuad
> --
> Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > I think it is "don't do that" category. inaccessible_register_notifier()
> > caller has to know what file it operates on, no?
>
> The thing is, you could oops the kernel from userspace. For that, all
> you have to do is a memfd_create without the MFD_INACCESSIBLE,
> followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> I ran into this using my port of this patch series to arm64.
My point is that it has to be handled on a different level. KVM has to
reject private_fd if it is not inaccessible. It should be trivial by
checking file->f_inode->i_sb->s_magic.
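
To make the suggested check concrete, here is a minimal userspace sketch. The struct definitions are stand-ins that model only the fields named above, and DEMO_INACCESSIBLE_MAGIC is a hypothetical value chosen for illustration; the real check would compare against whatever magic the inaccessible pseudo filesystem actually registers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures; only the fields used here. */
struct super_block { unsigned long s_magic; };
struct inode { struct super_block *i_sb; };
struct file { struct inode *f_inode; };

/* Hypothetical magic numbers, for illustration only. */
#define DEMO_INACCESSIBLE_MAGIC 0x494e4143UL
#define DEMO_OTHER_FS_MAGIC     0x01021994UL

/* True only if the file lives on the (modelled) inaccessible filesystem. */
static bool demo_file_is_inaccessible(const struct file *file)
{
	return file && file->f_inode && file->f_inode->i_sb &&
	       file->f_inode->i_sb->s_magic == DEMO_INACCESSIBLE_MAGIC;
}

/* Returns 0 if the fd may back a private memslot, -EINVAL otherwise. */
static int demo_check_private_file(const struct file *file)
{
	return demo_file_is_inaccessible(file) ? 0 : -EINVAL;
}
```

With a check of this shape in the memslot setup path, a plain memfd passed as private_fd would be refused before any notifier is registered on it.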
--
Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Sep 15, 2022 at 10:29:13PM +0800, Chao Peng wrote:
> Expose KVM_MEM_PRIVATE and memslot fields private_fd/offset to
> userspace. KVM will register/unregister private memslots with the
> fd-based memory backing store and respond to invalidation events from
> inaccessible_notifier to zap the existing memory mappings in the
> secondary page table.
>
> Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> by architecture code which can turn it on by overriding the default
> kvm_arch_has_private_mem().
>
> A 'kvm' reference is added in memslot structure since in
> inaccessible_notifier callback we can only obtain a memslot reference
> but 'kvm' is needed to do the zapping.
>
> Co-developed-by: Yu Zhang <[email protected]>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_free_memslot':
kvm_main.c:(.text+0x1385): undefined reference to `inaccessible_unregister_notifier'
ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_set_memslot':
kvm_main.c:(.text+0x1b86): undefined reference to `inaccessible_register_notifier'
ld: kvm_main.c:(.text+0x1c85): undefined reference to `inaccessible_unregister_notifier'
ld: arch/x86/kvm/mmu/mmu.o: in function `kvm_faultin_pfn':
mmu.c:(.text+0x1e38): undefined reference to `inaccessible_get_pfn'
ld: arch/x86/kvm/mmu/mmu.o: in function `direct_page_fault':
mmu.c:(.text+0x67ca): undefined reference to `inaccessible_put_pfn'
make: *** [Makefile:1169: vmlinux] Error 1
I attached kernel config for reproduction.
The problem is that CONFIG_MEMFD_CREATE does not get enabled:
mm/Makefile:obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
BR, Jarkko
Hi,
On Mon, Oct 3, 2022 at 12:01 PM Kirill A. Shutemov <[email protected]> wrote:
>
> On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > > I think it is "don't do that" category. inaccessible_register_notifier()
> > > caller has to know what file it operates on, no?
> >
> > The thing is, you could oops the kernel from userspace. For that, all
> > you have to do is a memfd_create without the MFD_INACCESSIBLE,
> > followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> > I ran into this using my port of this patch series to arm64.
>
> My point is that it has to be handled on a different level. KVM has to
> reject private_fd if it is not inaccessible. It should be trivial by
> checking file->f_inode->i_sb->s_magic.
Yes, that makes sense.
Thanks,
/fuad
> --
> Kiryl Shutsemau / Kirill A. Shutemov
Hi,
<...>
> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index 000000000000..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c
<...>
> +struct file *memfd_mkinaccessible(struct file *memfd)
> +{
> + struct inaccessible_data *data;
> + struct address_space *mapping;
> + struct inode *inode;
> + struct file *file;
> +
> + data = kzalloc(sizeof(*data), GFP_KERNEL);
> + if (!data)
> + return ERR_PTR(-ENOMEM);
> +
> + data->memfd = memfd;
> + mutex_init(&data->lock);
> + INIT_LIST_HEAD(&data->notifiers);
> +
> + inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> + if (IS_ERR(inode)) {
> + kfree(data);
> + return ERR_CAST(inode);
> + }
> +
> + inode->i_mode |= S_IFREG;
> + inode->i_op = &inaccessible_iops;
> + inode->i_mapping->private_data = data;
> +
> + file = alloc_file_pseudo(inode, inaccessible_mnt,
> + "[memfd:inaccessible]", O_RDWR,
> + &inaccessible_fops);
> + if (IS_ERR(file)) {
> + iput(inode);
> + kfree(data);
I think this might be missing a return at this point.
> + }
> +
> + file->f_flags |= O_LARGEFILE;
> +
> + mapping = memfd->f_mapping;
> + mapping_set_unevictable(mapping);
> + mapping_set_gfp_mask(mapping,
> + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> + return file;
> +}
Thanks,
/fuad
> +
> +void inaccessible_register_notifier(struct file *file,
> + struct inaccessible_notifier *notifier)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> +
> + mutex_lock(&data->lock);
> + list_add(&notifier->list, &data->notifiers);
> + mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> +
> +void inaccessible_unregister_notifier(struct file *file,
> + struct inaccessible_notifier *notifier)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> +
> + mutex_lock(&data->lock);
> + list_del(&notifier->list);
> + mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> + int *order)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> + struct page *page;
> + int ret;
> +
> + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> + if (ret)
> + return ret;
> +
> + *pfn = page_to_pfn_t(page);
> + *order = thp_order(compound_head(page));
> + SetPageUptodate(page);
> + unlock_page(page);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> + struct page *page = pfn_t_to_page(pfn);
> +
> + if (WARN_ON_ONCE(!page))
> + return;
> +
> + put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>
Hi,
On Thu, Sep 15, 2022 at 3:37 PM Chao Peng <[email protected]> wrote:
>
> Expose KVM_MEM_PRIVATE and memslot fields private_fd/offset to
> userspace. KVM will register/unregister private memslots with the
> fd-based memory backing store and respond to invalidation events from
> inaccessible_notifier to zap the existing memory mappings in the
> secondary page table.
>
> Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> by architecture code which can turn it on by overriding the default
> kvm_arch_has_private_mem().
>
> A 'kvm' reference is added in memslot structure since in
> inaccessible_notifier callback we can only obtain a memslot reference
> but 'kvm' is needed to do the zapping.
>
> Co-developed-by: Yu Zhang <[email protected]>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 116 +++++++++++++++++++++++++++++++++++++--
> 2 files changed, 111 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b9906cdf468b..cb4eefac709c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -589,6 +589,7 @@ struct kvm_memory_slot {
> struct file *private_file;
> loff_t private_offset;
> struct inaccessible_notifier notifier;
> + struct kvm *kvm;
> };
>
> static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 97d893f7482c..87e239d35b96 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -983,6 +983,57 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> xa_erase(&kvm->mem_attr_array, index);
> return r;
> }
> +
> +static void kvm_private_notifier_invalidate(struct inaccessible_notifier *notifier,
> + pgoff_t start, pgoff_t end)
> +{
> + struct kvm_memory_slot *slot = container_of(notifier,
> + struct kvm_memory_slot,
> + notifier);
> + unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT;
> + gfn_t start_gfn = slot->base_gfn;
> + gfn_t end_gfn = slot->base_gfn + slot->npages;
> +
> +
> + if (start > base_pgoff)
> + start_gfn = slot->base_gfn + start - base_pgoff;
> +
> + if (end < base_pgoff + slot->npages)
> + end_gfn = slot->base_gfn + end - base_pgoff;
> +
> + if (start_gfn >= end_gfn)
> + return;
> +
> + kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn);
> +}
> +
> +static struct inaccessible_notifier_ops kvm_private_notifier_ops = {
> + .invalidate = kvm_private_notifier_invalidate,
> +};
> +
> +static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
> +{
> + slot->notifier.ops = &kvm_private_notifier_ops;
> + inaccessible_register_notifier(slot->private_file, &slot->notifier);
> +}
> +
> +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> +{
> + inaccessible_unregister_notifier(slot->private_file, &slot->notifier);
> +}
> +
> +#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
> +static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
> +{
> + WARN_ON_ONCE(1);
> +}
> +
> +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> +{
> + WARN_ON_ONCE(1);
> +}
> +
> #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
>
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> @@ -1029,6 +1080,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> /* This does not remove the slot from struct kvm_memslots data structures */
> static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> {
> + if (slot->flags & KVM_MEM_PRIVATE) {
> + kvm_private_mem_unregister(slot);
> + fput(slot->private_file);
> + }
> +
> kvm_destroy_dirty_bitmap(slot);
>
> kvm_arch_free_memslot(kvm, slot);
> @@ -1600,10 +1656,16 @@ bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> return false;
> }
>
> -static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> +static int check_memory_region_flags(struct kvm *kvm,
> + const struct kvm_user_mem_region *mem)
> {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> + if (kvm_arch_has_private_mem(kvm))
> + valid_flags |= KVM_MEM_PRIVATE;
> +#endif
> +
> #ifdef __KVM_HAVE_READONLY_MEM
> valid_flags |= KVM_MEM_READONLY;
> #endif
> @@ -1679,6 +1741,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
> {
> int r;
>
> + if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> + kvm_private_mem_register(new);
> +
From the discussion I had with Kirill in the first patch *, should
this check that the private_fd is inaccessible?
[*] https://lore.kernel.org/all/[email protected]/
Cheers,
/fuad
> /*
> * If dirty logging is disabled, nullify the bitmap; the old bitmap
> * will be freed on "commit". If logging is enabled in both old and
> @@ -1707,6 +1772,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
> if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
> kvm_destroy_dirty_bitmap(new);
>
> + if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> + kvm_private_mem_unregister(new);
> +
> return r;
> }
>
> @@ -2004,7 +2072,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> int as_id, id;
> int r;
>
> - r = check_memory_region_flags(mem);
> + r = check_memory_region_flags(kvm, mem);
> if (r)
> return r;
>
> @@ -2023,6 +2091,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> mem->memory_size))
> return -EINVAL;
> + if (mem->flags & KVM_MEM_PRIVATE &&
> + (mem->private_offset & (PAGE_SIZE - 1) ||
> + mem->private_offset > U64_MAX - mem->memory_size))
> + return -EINVAL;
> if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> return -EINVAL;
> if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2061,6 +2133,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> return -EINVAL;
> } else { /* Modify an existing slot. */
> + /* Private memslots are immutable, they can only be deleted. */
> + if (mem->flags & KVM_MEM_PRIVATE)
> + return -EINVAL;
> if ((mem->userspace_addr != old->userspace_addr) ||
> (npages != old->npages) ||
> ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2089,10 +2164,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
> new->npages = npages;
> new->flags = mem->flags;
> new->userspace_addr = mem->userspace_addr;
> + if (mem->flags & KVM_MEM_PRIVATE) {
> + new->private_file = fget(mem->private_fd);
> + if (!new->private_file) {
> + r = -EINVAL;
> + goto out;
> + }
> + new->private_offset = mem->private_offset;
> + }
> +
> + new->kvm = kvm;
>
> r = kvm_set_memslot(kvm, old, new, change);
> if (r)
> - kfree(new);
> + goto out;
> +
> + return 0;
> +
> +out:
> + if (new->private_file)
> + fput(new->private_file);
> + kfree(new);
> return r;
> }
> EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> @@ -4747,16 +4839,28 @@ static long kvm_vm_ioctl(struct file *filp,
> }
> case KVM_SET_USER_MEMORY_REGION: {
> struct kvm_user_mem_region mem;
> - unsigned long size = sizeof(struct kvm_userspace_memory_region);
> + unsigned int flags_offset = offsetof(typeof(mem), flags);
> + unsigned long size;
> + u32 flags;
>
> kvm_sanity_check_user_mem_region_alias();
>
> + memset(&mem, 0, sizeof(mem));
> +
> r = -EFAULT;
> - if (copy_from_user(&mem, argp, size);
> + if (get_user(flags, (u32 __user *)(argp + flags_offset)))
> + goto out;
> +
> + if (flags & KVM_MEM_PRIVATE)
> + size = sizeof(struct kvm_userspace_memory_region_ext);
> + else
> + size = sizeof(struct kvm_userspace_memory_region);
> +
> + if (copy_from_user(&mem, argp, size))
> goto out;
>
> r = -EINVAL;
> - if (mem.flags & KVM_MEM_PRIVATE)
> + if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
> goto out;
>
> r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> --
> 2.25.1
>
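
The range clamping in the quoted kvm_private_notifier_invalidate() can be tried out in isolation. The sketch below is a userspace model, assuming start and end are page offsets into the backing file as the pgoff_t type suggests; struct demo_slot and demo_pgoff_range_to_gfns() are names invented for this example and carry only the fields the callback reads.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;
typedef unsigned long pgoff_t;

/* Hypothetical reduced memslot: only what the clamping logic needs. */
struct demo_slot {
	gfn_t base_gfn;           /* first guest frame covered by the slot */
	unsigned long npages;     /* slot length in pages */
	unsigned long base_pgoff; /* private_offset >> PAGE_SHIFT */
};

/*
 * Clamp the file page range [start, end) to the slot and translate it to
 * guest frame numbers. Returns false when nothing in the slot is covered.
 */
static bool demo_pgoff_range_to_gfns(const struct demo_slot *slot,
				     pgoff_t start, pgoff_t end,
				     gfn_t *start_gfn, gfn_t *end_gfn)
{
	gfn_t s = slot->base_gfn;
	gfn_t e = slot->base_gfn + slot->npages;

	if (start > slot->base_pgoff)
		s = slot->base_gfn + start - slot->base_pgoff;

	if (end < slot->base_pgoff + slot->npages)
		e = slot->base_gfn + end - slot->base_pgoff;

	if (s >= e)
		return false;

	*start_gfn = s;
	*end_gfn = e;
	return true;
}
```

For a slot with base_gfn 0x100, 16 pages and base_pgoff 8, invalidating file pages [10, 12) yields gfns [0x102, 0x104), while a range that starts past the end of the slot, such as [30, 40), yields nothing to zap.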
On Thu, Oct 06, 2022 at 09:50:28AM +0100, Fuad Tabba wrote:
> Hi,
>
> <...>
>
>
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index 000000000000..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
>
> <...>
>
> > +struct file *memfd_mkinaccessible(struct file *memfd)
> > +{
> > + struct inaccessible_data *data;
> > + struct address_space *mapping;
> > + struct inode *inode;
> > + struct file *file;
> > +
> > + data = kzalloc(sizeof(*data), GFP_KERNEL);
> > + if (!data)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + data->memfd = memfd;
> > + mutex_init(&data->lock);
> > + INIT_LIST_HEAD(&data->notifiers);
> > +
> > + inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> > + if (IS_ERR(inode)) {
> > + kfree(data);
> > + return ERR_CAST(inode);
> > + }
> > +
> > + inode->i_mode |= S_IFREG;
> > + inode->i_op = &inaccessible_iops;
> > + inode->i_mapping->private_data = data;
> > +
> > + file = alloc_file_pseudo(inode, inaccessible_mnt,
> > + "[memfd:inaccessible]", O_RDWR,
> > + &inaccessible_fops);
> > + if (IS_ERR(file)) {
> > + iput(inode);
> > + kfree(data);
>
> I think this might be missing a return at this point.
Good catch! Thanks!
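
For readers following along, the shape of the corrected error path can be modelled in userspace. This is only a sketch: the demo_* helpers imitate the kernel's ERR_PTR convention, demo_alloc_file() is a hypothetical allocator that can be forced to fail, and returning the error pointer directly is an assumption about how the fix would look, not the posted patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Minimal userspace imitations of the kernel's ERR_PTR helpers. */
#define DEMO_MAX_ERRNO 4095

static inline void *demo_err_ptr(long err) { return (void *)err; }
static inline long demo_ptr_err(const void *p) { return (long)p; }
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_file { int flags; };

/* Hypothetical allocator; 'fail' forces the error branch. */
static struct demo_file *demo_alloc_file(int fail)
{
	struct demo_file *f;

	if (fail)
		return demo_err_ptr(-ENFILE);

	f = calloc(1, sizeof(*f));
	return f ? f : demo_err_ptr(-ENOMEM);
}

static struct demo_file *demo_mkinaccessible(int fail)
{
	struct demo_file *file = demo_alloc_file(fail);

	if (demo_is_err(file)) {
		/* The inode and private data would be released here. */
		return file; /* without this, the code below dereferences an error pointer */
	}

	file->flags |= 1; /* only reached with a valid pointer */
	return file;
}
```

Without the early return, the flag update after the error branch would run on an encoded errno value rather than on a valid file.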
--
Kiryl Shutsemau / Kirill A. Shutemov
On Tue, Oct 04, 2022 at 05:55:28PM +0300, Jarkko Sakkinen wrote:
> On Thu, Sep 15, 2022 at 10:29:13PM +0800, Chao Peng wrote:
> > Expose KVM_MEM_PRIVATE and memslot fields private_fd/offset to
> > userspace. KVM will register/unregister private memslots with the
> > fd-based memory backing store and respond to invalidation events from
> > inaccessible_notifier to zap the existing memory mappings in the
> > secondary page table.
> >
> > Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> > by architecture code which can turn it on by overriding the default
> > kvm_arch_has_private_mem().
> >
> > A 'kvm' reference is added in memslot structure since in
> > inaccessible_notifier callback we can only obtain a memslot reference
> > but 'kvm' is needed to do the zapping.
> >
> > Co-developed-by: Yu Zhang <[email protected]>
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
>
> ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_free_memslot':
> kvm_main.c:(.text+0x1385): undefined reference to `inaccessible_unregister_notifier'
> ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_set_memslot':
> kvm_main.c:(.text+0x1b86): undefined reference to `inaccessible_register_notifier'
> ld: kvm_main.c:(.text+0x1c85): undefined reference to `inaccessible_unregister_notifier'
> ld: arch/x86/kvm/mmu/mmu.o: in function `kvm_faultin_pfn':
> mmu.c:(.text+0x1e38): undefined reference to `inaccessible_get_pfn'
> ld: arch/x86/kvm/mmu/mmu.o: in function `direct_page_fault':
> mmu.c:(.text+0x67ca): undefined reference to `inaccessible_put_pfn'
> make: *** [Makefile:1169: vmlinux] Error 1
>
> I attached kernel config for reproduction.
>
> The problem is that CONFIG_MEMFD_CREATE does not get enabled:
>
> mm/Makefile:obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
Thanks for reporting. Yes, there is a dependency issue that needs to be fixed.
Chao
On Thu, Oct 06, 2022 at 09:55:31AM +0100, Fuad Tabba wrote:
> Hi,
>
> On Thu, Sep 15, 2022 at 3:37 PM Chao Peng <[email protected]> wrote:
> >
> > Expose KVM_MEM_PRIVATE and memslot fields private_fd/offset to
> > userspace. KVM will register/unregister private memslots with the
> > fd-based memory backing store and respond to invalidation events from
> > inaccessible_notifier to zap the existing memory mappings in the
> > secondary page table.
> >
> > Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> > by architecture code which can turn it on by overriding the default
> > kvm_arch_has_private_mem().
> >
> > A 'kvm' reference is added in memslot structure since in
> > inaccessible_notifier callback we can only obtain a memslot reference
> > but 'kvm' is needed to do the zapping.
> >
> > Co-developed-by: Yu Zhang <[email protected]>
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > include/linux/kvm_host.h | 1 +
> > virt/kvm/kvm_main.c | 116 +++++++++++++++++++++++++++++++++++++--
> > 2 files changed, 111 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index b9906cdf468b..cb4eefac709c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -589,6 +589,7 @@ struct kvm_memory_slot {
> > struct file *private_file;
> > loff_t private_offset;
> > struct inaccessible_notifier notifier;
> > + struct kvm *kvm;
> > };
> >
> > static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 97d893f7482c..87e239d35b96 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -983,6 +983,57 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > xa_erase(&kvm->mem_attr_array, index);
> > return r;
> > }
> > +
> > +static void kvm_private_notifier_invalidate(struct inaccessible_notifier *notifier,
> > + pgoff_t start, pgoff_t end)
> > +{
> > + struct kvm_memory_slot *slot = container_of(notifier,
> > + struct kvm_memory_slot,
> > + notifier);
> > + unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT;
> > + gfn_t start_gfn = slot->base_gfn;
> > + gfn_t end_gfn = slot->base_gfn + slot->npages;
> > +
> > +
> > + if (start > base_pgoff)
> > + start_gfn = slot->base_gfn + start - base_pgoff;
> > +
> > + if (end < base_pgoff + slot->npages)
> > + end_gfn = slot->base_gfn + end - base_pgoff;
> > +
> > + if (start_gfn >= end_gfn)
> > + return;
> > +
> > + kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn);
> > +}
> > +
> > +static struct inaccessible_notifier_ops kvm_private_notifier_ops = {
> > + .invalidate = kvm_private_notifier_invalidate,
> > +};
> > +
> > +static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
> > +{
> > + slot->notifier.ops = &kvm_private_notifier_ops;
> > + inaccessible_register_notifier(slot->private_file, &slot->notifier);
> > +}
> > +
> > +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> > +{
> > + inaccessible_unregister_notifier(slot->private_file, &slot->notifier);
> > +}
> > +
> > +#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
> > +
> > +static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
> > +{
> > + WARN_ON_ONCE(1);
> > +}
> > +
> > +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> > +{
> > + WARN_ON_ONCE(1);
> > +}
> > +
> > #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> >
> > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > @@ -1029,6 +1080,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> > /* This does not remove the slot from struct kvm_memslots data structures */
> > static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> > {
> > + if (slot->flags & KVM_MEM_PRIVATE) {
> > + kvm_private_mem_unregister(slot);
> > + fput(slot->private_file);
> > + }
> > +
> > kvm_destroy_dirty_bitmap(slot);
> >
> > kvm_arch_free_memslot(kvm, slot);
> > @@ -1600,10 +1656,16 @@ bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > return false;
> > }
> >
> > -static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > +static int check_memory_region_flags(struct kvm *kvm,
> > + const struct kvm_user_mem_region *mem)
> > {
> > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> >
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > + if (kvm_arch_has_private_mem(kvm))
> > + valid_flags |= KVM_MEM_PRIVATE;
> > +#endif
> > +
> > #ifdef __KVM_HAVE_READONLY_MEM
> > valid_flags |= KVM_MEM_READONLY;
> > #endif
> > @@ -1679,6 +1741,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
> > {
> > int r;
> >
> > + if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> > + kvm_private_mem_register(new);
> > +
>
> From the discussion I had with Kirill in the first patch *, should
> this check that the private_fd is inaccessible?
Yes I can add a check in KVM code, see below for where I want to add it.
>
> [*] https://lore.kernel.org/all/[email protected]/
>
> Cheers,
> /fuad
>
> > /*
> > * If dirty logging is disabled, nullify the bitmap; the old bitmap
> > * will be freed on "commit". If logging is enabled in both old and
> > @@ -1707,6 +1772,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
> > if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
> > kvm_destroy_dirty_bitmap(new);
> >
> > + if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> > + kvm_private_mem_unregister(new);
> > +
> > return r;
> > }
> >
> > @@ -2004,7 +2072,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > int as_id, id;
> > int r;
> >
> > - r = check_memory_region_flags(mem);
> > + r = check_memory_region_flags(kvm, mem);
> > if (r)
> > return r;
> >
> > @@ -2023,6 +2091,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> > mem->memory_size))
> > return -EINVAL;
> > + if (mem->flags & KVM_MEM_PRIVATE &&
> > + (mem->private_offset & (PAGE_SIZE - 1) ||
> > + mem->private_offset > U64_MAX - mem->memory_size))
> > + return -EINVAL;
> > if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > return -EINVAL;
> > if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > @@ -2061,6 +2133,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> > return -EINVAL;
> > } else { /* Modify an existing slot. */
> > + /* Private memslots are immutable, they can only be deleted. */
> > + if (mem->flags & KVM_MEM_PRIVATE)
> > + return -EINVAL;
> > if ((mem->userspace_addr != old->userspace_addr) ||
> > (npages != old->npages) ||
> > ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > @@ -2089,10 +2164,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > new->npages = npages;
> > new->flags = mem->flags;
> > new->userspace_addr = mem->userspace_addr;
> > + if (mem->flags & KVM_MEM_PRIVATE) {
> > + new->private_file = fget(mem->private_fd);
> > + if (!new->private_file) {
> > + r = -EINVAL;
The check will go here.
> > + goto out;
> > + }
> > + new->private_offset = mem->private_offset;
> > + }
> > +
> > + new->kvm = kvm;
> >
> > r = kvm_set_memslot(kvm, old, new, change);
> > if (r)
> > - kfree(new);
> > + goto out;
> > +
> > + return 0;
> > +
> > +out:
> > + if (new->private_file)
> > + fput(new->private_file);
> > + kfree(new);
> > return r;
> > }
> > EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > @@ -4747,16 +4839,28 @@ static long kvm_vm_ioctl(struct file *filp,
> > }
> > case KVM_SET_USER_MEMORY_REGION: {
> > struct kvm_user_mem_region mem;
> > - unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > + unsigned int flags_offset = offsetof(typeof(mem), flags);
> > + unsigned long size;
> > + u32 flags;
> >
> > kvm_sanity_check_user_mem_region_alias();
> >
> > + memset(&mem, 0, sizeof(mem));
> > +
> > r = -EFAULT;
> > - if (copy_from_user(&mem, argp, size);
> > + if (get_user(flags, (u32 __user *)(argp + flags_offset)))
> > + goto out;
> > +
> > + if (flags & KVM_MEM_PRIVATE)
> > + size = sizeof(struct kvm_userspace_memory_region_ext);
> > + else
> > + size = sizeof(struct kvm_userspace_memory_region);
> > +
> > + if (copy_from_user(&mem, argp, size))
> > goto out;
> >
> > r = -EINVAL;
> > - if (mem.flags & KVM_MEM_PRIVATE)
> > + if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
> > goto out;
> >
> > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > --
> > 2.25.1
> >
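
As a side note on the private_offset validation in the quoted __kvm_set_memory_region() hunk, the two conditions (page alignment, and no wraparound when the region size is added) can be exercised on their own. This is a userspace sketch; DEMO_PAGE_SIZE assumes 4 KiB pages and demo_private_range_valid() is a name made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/*
 * Mirrors the quoted check: the offset must be page aligned, and
 * private_offset + memory_size must not wrap past the top of a u64.
 */
static bool demo_private_range_valid(uint64_t private_offset,
				     uint64_t memory_size)
{
	if (private_offset & (DEMO_PAGE_SIZE - 1))
		return false;

	if (private_offset > UINT64_MAX - memory_size)
		return false;

	return true;
}
```

Writing the overflow test as a comparison against UINT64_MAX - memory_size avoids computing the sum at all, so the check itself cannot wrap.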
Hi,
On Thu, Sep 15, 2022 at 3:38 PM Chao Peng <[email protected]> wrote:
>
> If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the
> guest private memory regions through KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> ioctls. The patch reuses the existing SEV ioctl numbers, but the
> address in the region is a gpa in the KVM_PRIVATE_MEM case and an hva
> in the SEV case. Which usage the ioctls serve is determined by the newly
> added kvm_arch_has_private_mem(). An architecture that supports
> KVM_PRIVATE_MEM should override this function.
>
> The current implementation defaults all memory to private. The shared
> memory regions are stored in an xarray for memory efficiency, and
> zapping existing memory mappings is also a side effect of these two
> ioctls when defined.
>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 17 ++++++--
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu.h | 2 -
> include/linux/kvm_host.h | 13 ++++++
> virt/kvm/kvm_main.c | 73 +++++++++++++++++++++++++++++++++
> 5 files changed, 100 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 1a6c003b2a0b..c0f800d04ffc 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> This ioctl can be used to register a guest memory region which may
> contain encrypted data (e.g. guest RAM, SMRAM etc).
>
> -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> -memory region may contain encrypted data. The SEV memory encryption
> -engine uses a tweak such that two identical plaintext pages, each at
> -different locations will have differing ciphertexts. So swapping or
> +Currently this ioctl supports registering memory regions for two usages:
> +private memory and SEV-encrypted memory.
> +
> +When private memory is enabled, this ioctl is used to register guest private
> +memory region and the addr/size of kvm_enc_region represents guest physical
> +address (GPA). In this usage, this ioctl zaps the existing guest memory
> +mappings in KVM that fall into the region.
> +
> +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> +memory region which may contain encrypted data for a SEV-enabled guest. The
> +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> +memory encryption engine uses a tweak such that two identical plaintext pages,
> +each at different locations will have differing ciphertexts. So swapping or
> moving ciphertext of those pages will not result in plaintext being
> swapped. So relocating (or migrating) physical backing pages for the SEV
> guest will require some additional steps.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2c96c43c313a..cfad6ba1a70a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -37,6 +37,7 @@
> #include <asm/hyperv-tlfs.h>
>
> #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ZAP_GFN_RANGE
>
> #define KVM_MAX_VCPUS 1024
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 6bdaacb6faa0..c94b620bf94b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -211,8 +211,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> return -(u32)fault & errcode;
> }
>
> -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> -
> int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>
> int kvm_mmu_post_init_vm(struct kvm *kvm);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2125b50f6345..d65690cae80b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> #endif
>
> +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> +#else
> +static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start,
> + gfn_t gfn_end)
> +{
> +}
> +#endif
> +
> enum {
> OUTSIDE_GUEST_MODE,
> IN_GUEST_MODE,
> @@ -795,6 +804,9 @@ struct kvm {
> struct notifier_block pm_notifier;
> #endif
> char stats_id[KVM_STATS_NAME_SIZE];
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> + struct xarray mem_attr_array;
> +#endif
> };
>
> #define kvm_err(fmt, ...) \
> @@ -1454,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
> #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fa9dd2d2c001..de5cce8c82c7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -937,6 +937,47 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
> #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +#define KVM_MEM_ATTR_SHARED 0x0001
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> + bool is_private)
> +{
I wonder if this ioctl should be implemented as an arch-specific
ioctl. In this patch it performs some actions that pKVM might not need
or might want to do differently.
pKVM tracks the sharing status in the stage-2 page table's software
bits, so it can avoid the overhead of using mem_attr_array.
Also, this ioctl calls kvm_zap_gfn_range(), as does the invalidation
notifier (introduced in patch 8). For pKVM, the kind of zapping (or
the information conveyed to the hypervisor) might need to be different
depending on the cause; whether it's invalidation or change of sharing
status.
Thanks,
/fuad
> + gfn_t start, end;
> + unsigned long index;
> + void *entry;
> + int r;
> +
> + if (size == 0 || gpa + size < gpa)
> + return -EINVAL;
> + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> + return -EINVAL;
> +
> + start = gpa >> PAGE_SHIFT;
> + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + /*
> + * Guest memory defaults to private, kvm->mem_attr_array only stores
> + * shared memory.
> + */
> + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> + for (index = start; index < end; index++) {
> + r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
> + GFP_KERNEL_ACCOUNT));
> + if (r)
> + goto err;
> + }
> +
> + kvm_zap_gfn_range(kvm, start, end);
> +
> + return r;
> +err:
> + for (; index > start; index--)
> + xa_erase(&kvm->mem_attr_array, index);
> + return r;
> +}
> +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
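The argument checks and the "default private, store only shared" bookkeeping in kvm_vm_ioctl_set_mem_attr() can be tried out in userspace. The sketch below is an assumption-laden model rather than the patch's implementation: a flat `bool` array replaces the xarray, `NR_GFNS` is a made-up toy guest size, and `set_mem_attr()`/`mem_is_private()` are hypothetical names. For page-aligned inputs, `(gpa + size) >> PAGE_SHIFT` gives the same exclusive end gfn as the patch's `(gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)
#define NR_GFNS    64		/* toy guest size, hypothetical */

/* Only shared pages are recorded; an absent (false) entry means private. */
static bool shared[NR_GFNS];

static int set_mem_attr(uint64_t gpa, uint64_t size, bool is_private)
{
	uint64_t start, end, i;

	/* Reject empty ranges and ranges that wrap around the address space. */
	if (size == 0 || gpa + size < gpa)
		return -EINVAL;
	/* Both the base and the length must be page aligned. */
	if ((gpa | size) & (PAGE_SIZE - 1))
		return -EINVAL;

	start = gpa >> PAGE_SHIFT;
	end = (gpa + size) >> PAGE_SHIFT;	/* exclusive */
	if (end > NR_GFNS)			/* limit of this toy model only */
		return -EINVAL;

	for (i = start; i < end; i++)
		shared[i] = !is_private;

	return 0;
}

static bool mem_is_private(uint64_t gfn)
{
	/* Anything not explicitly marked shared is private. */
	return gfn >= NR_GFNS || !shared[gfn];
}
```

With this model, the REG ioctl corresponds to `is_private == true` and the UNREG ioctl to `is_private == false`, matching the `set` variable in the ioctl handler.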
> +
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> static int kvm_pm_notifier_call(struct notifier_block *bl,
> unsigned long state,
> @@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> + xa_init(&kvm->mem_attr_array);
> +#endif
>
> INIT_LIST_HEAD(&kvm->gpc_list);
> spin_lock_init(&kvm->gpc_lock);
> @@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> + xa_destroy(&kvm->mem_attr_array);
> +#endif
> cleanup_srcu_struct(&kvm->irq_srcu);
> cleanup_srcu_struct(&kvm->srcu);
> kvm_arch_free_vm(kvm);
> @@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> }
> }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> + return false;
> +}
> +
> static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> @@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> break;
> }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> + case KVM_MEMORY_ENCRYPT_REG_REGION:
> + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> + struct kvm_enc_region region;
> + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> + if (!kvm_arch_has_private_mem(kvm))
> + goto arch_vm_ioctl;
> +
> + r = -EFAULT;
> + if (copy_from_user(&region, argp, sizeof(region)))
> + goto out;
> +
> + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> + region.size, set);
> + break;
> + }
> +#endif
> case KVM_GET_DIRTY_LOG: {
> struct kvm_dirty_log log;
>
> @@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_get_stats_fd(kvm);
> break;
> default:
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +arch_vm_ioctl:
> +#endif
> r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> }
> out:
> --
> 2.25.1
>
On Tue, Oct 11, 2022 at 10:48:58AM +0100, Fuad Tabba wrote:
> Hi,
>
> On Thu, Sep 15, 2022 at 3:38 PM Chao Peng <[email protected]> wrote:
> >
> > If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the
> > guest private memory regions through KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> > ioctls. The patch reuses the existing SEV ioctl numbers, but the address
> > in the region is a gpa for the KVM_PRIVATE_MEM case and an hva for the
> > SEV case. Which usage the ioctls serve is determined by the newly
> > added kvm_arch_has_private_mem(). Architectures that support
> > KVM_PRIVATE_MEM should override this function.
> >
> > The current implementation defaults all memory to private. The shared
> > memory regions are stored in an xarray for memory efficiency. Zapping
> > existing memory mappings is also a side effect of these two ioctls when
> > defined.
> >
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > Documentation/virt/kvm/api.rst | 17 ++++++--
> > arch/x86/include/asm/kvm_host.h | 1 +
> > arch/x86/kvm/mmu.h | 2 -
> > include/linux/kvm_host.h | 13 ++++++
> > virt/kvm/kvm_main.c | 73 +++++++++++++++++++++++++++++++++
> > 5 files changed, 100 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 1a6c003b2a0b..c0f800d04ffc 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> > This ioctl can be used to register a guest memory region which may
> > contain encrypted data (e.g. guest RAM, SMRAM etc).
> >
> > -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> > -memory region may contain encrypted data. The SEV memory encryption
> > -engine uses a tweak such that two identical plaintext pages, each at
> > -different locations will have differing ciphertexts. So swapping or
> > +Currently this ioctl supports registering memory regions for two usages:
> > +private memory and SEV-encrypted memory.
> > +
> > +When private memory is enabled, this ioctl is used to register guest private
> > +memory region and the addr/size of kvm_enc_region represents guest physical
> > +address (GPA). In this usage, this ioctl zaps the existing guest memory
> > +mappings in KVM that fall into the region.
> > +
> > +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> > +memory region which may contain encrypted data for a SEV-enabled guest. The
> > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> > +memory encryption engine uses a tweak such that two identical plaintext pages,
> > +each at different locations will have differing ciphertexts. So swapping or
> > moving ciphertext of those pages will not result in plaintext being
> > swapped. So relocating (or migrating) physical backing pages for the SEV
> > guest will require some additional steps.
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 2c96c43c313a..cfad6ba1a70a 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -37,6 +37,7 @@
> > #include <asm/hyperv-tlfs.h>
> >
> > #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > +#define __KVM_HAVE_ZAP_GFN_RANGE
> >
> > #define KVM_MAX_VCPUS 1024
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 6bdaacb6faa0..c94b620bf94b 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -211,8 +211,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > return -(u32)fault & errcode;
> > }
> >
> > -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> > -
> > int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
> >
> > int kvm_mmu_post_init_vm(struct kvm *kvm);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 2125b50f6345..d65690cae80b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > #endif
> >
> > +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> > +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> > +#else
> > +static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start,
> > + gfn_t gfn_end)
> > +{
> > +}
> > +#endif
> > +
> > enum {
> > OUTSIDE_GUEST_MODE,
> > IN_GUEST_MODE,
> > @@ -795,6 +804,9 @@ struct kvm {
> > struct notifier_block pm_notifier;
> > #endif
> > char stats_id[KVM_STATS_NAME_SIZE];
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > + struct xarray mem_attr_array;
> > +#endif
> > };
> >
> > #define kvm_err(fmt, ...) \
> > @@ -1454,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > int kvm_arch_post_init_vm(struct kvm *kvm);
> > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fa9dd2d2c001..de5cce8c82c7 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -937,6 +937,47 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >
> > #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +#define KVM_MEM_ATTR_SHARED 0x0001
> > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > + bool is_private)
> > +{
>
> I wonder if this ioctl should be implemented as an arch-specific
> ioctl. In this patch it performs some actions that pKVM might not need
> or might want to do differently.
I think it's doable. We can provide something like mem_attr_array in
common code and let arch code decide whether to use it. Currently
mem_attr_array is defined in struct kvm; if those bytes are
unnecessary for pKVM it can even be moved to the arch definition, but that
also loses the potential code sharing for confidential usages on other
architectures, e.g. if ARM also supports such usage. Or it can be
provided through a different CONFIG_ instead of
CONFIG_HAVE_KVM_PRIVATE_MEM.
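One way to picture "common code provides it, arch code decides whether to use it" is an optional per-arch hook with a generic fallback. The sketch below is only an illustration under that assumption: `struct arch_attr_ops`, `vm_set_mem_attr()` and the `pkvm_like_*` names are hypothetical, and KVM itself would more likely use a weak symbol or a separate CONFIG_ option, as discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vm;

/* Optional arch hook; a NULL pointer means "use the common implementation". */
struct arch_attr_ops {
	int (*set_mem_attr)(struct vm *vm, uint64_t start, uint64_t end,
			    bool is_private);
};

struct vm {
	const struct arch_attr_ops *ops;
	unsigned int generic_updates;	/* updates handled by common code */
	unsigned int arch_updates;	/* updates handled by the arch hook */
};

/* Common path: this is where a mem_attr_array-style store would be updated. */
static int generic_set_mem_attr(struct vm *vm, uint64_t start, uint64_t end,
				bool is_private)
{
	(void)start; (void)end; (void)is_private;
	vm->generic_updates++;
	return 0;
}

/* Arch path: e.g. an arch that tracks sharing state in its own page tables. */
static int pkvm_like_set_mem_attr(struct vm *vm, uint64_t start, uint64_t end,
				  bool is_private)
{
	(void)start; (void)end; (void)is_private;
	vm->arch_updates++;
	return 0;
}

static const struct arch_attr_ops pkvm_like_ops = {
	.set_mem_attr = pkvm_like_set_mem_attr,
};

/* Dispatch: prefer the arch hook when one is registered. */
static int vm_set_mem_attr(struct vm *vm, uint64_t start, uint64_t end,
			   bool is_private)
{
	if (vm->ops && vm->ops->set_mem_attr)
		return vm->ops->set_mem_attr(vm, start, end, is_private);
	return generic_set_mem_attr(vm, start, end, is_private);
}
```

An arch that leaves `ops` unset gets the common bookkeeping for free, while an arch with its own tracking never touches the common store.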
Thanks,
Chao
>
> pKVM tracks the sharing status in the stage-2 page table's software
> bits, so it can avoid the overhead of using mem_attr_array.
>
> Also, this ioctl calls kvm_zap_gfn_range(), as does the invalidation
> notifier (introduced in patch 8). For pKVM, the kind of zapping (or
> the information conveyed to the hypervisor) might need to be different
> depending on the cause; whether it's invalidation or change of sharing
> status.
>
> Thanks,
> /fuad
>
>
> > + gfn_t start, end;
> > + unsigned long index;
> > + void *entry;
> > + int r;
> > +
> > + if (size == 0 || gpa + size < gpa)
> > + return -EINVAL;
> > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > + return -EINVAL;
> > +
> > + start = gpa >> PAGE_SHIFT;
> > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > + /*
> > + * Guest memory defaults to private, kvm->mem_attr_array only stores
> > + * shared memory.
> > + */
> > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > +
> > + for (index = start; index < end; index++) {
> > + r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
> > + GFP_KERNEL_ACCOUNT));
> > + if (r)
> > + goto err;
> > + }
> > +
> > + kvm_zap_gfn_range(kvm, start, end);
> > +
> > + return r;
> > +err:
> > + for (; index > start; index--)
> > + xa_erase(&kvm->mem_attr_array, index);
> > + return r;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> > +
> > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > static int kvm_pm_notifier_call(struct notifier_block *bl,
> > unsigned long state,
> > @@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > spin_lock_init(&kvm->mn_invalidate_lock);
> > rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > + xa_init(&kvm->mem_attr_array);
> > +#endif
> >
> > INIT_LIST_HEAD(&kvm->gpc_list);
> > spin_lock_init(&kvm->gpc_lock);
> > @@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> > kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> > }
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > + xa_destroy(&kvm->mem_attr_array);
> > +#endif
> > cleanup_srcu_struct(&kvm->irq_srcu);
> > cleanup_srcu_struct(&kvm->srcu);
> > kvm_arch_free_vm(kvm);
> > @@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > }
> > }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > + return false;
> > +}
> > +
> > static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > {
> > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > @@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > break;
> > }
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > + case KVM_MEMORY_ENCRYPT_REG_REGION:
> > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > + struct kvm_enc_region region;
> > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > +
> > + if (!kvm_arch_has_private_mem(kvm))
> > + goto arch_vm_ioctl;
> > +
> > + r = -EFAULT;
> > + if (copy_from_user(&region, argp, sizeof(region)))
> > + goto out;
> > +
> > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > + region.size, set);
> > + break;
> > + }
> > +#endif
> > case KVM_GET_DIRTY_LOG: {
> > struct kvm_dirty_log log;
> >
> > @@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
> > r = kvm_vm_ioctl_get_stats_fd(kvm);
> > break;
> > default:
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +arch_vm_ioctl:
> > +#endif
> > r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > }
> > out:
> > --
> > 2.25.1
> >
On Fri, Sep 30, 2022 at 05:19:00PM +0100, Fuad Tabba wrote:
> Hi,
>
> On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <[email protected]> wrote:
> > > >
> > > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > > >
> > > > > > 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > > > memory into the guest (after pre-boot phase).
> > > > > >
> > > > > > 2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > > > and only if the entire gfn range of the associated memslot is shared.
> > > > >
> > > > > In general I think that this would work with pKVM. However, limiting
> > > > > private<->shared conversions to the granularity of a whole memslot
> > > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > > shares back its restricted DMA pool with the host it does so at the
> > > > > page-level.
> >
> > Y'all are killing me :-)
>
> :D
>
> > Isn't the guest enlightened? E.g. can't you tell the guest "thou shalt share at
> > granularity X"? With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> > X doesn't even have to be that high to get reasonable performance, e.g. assuming
> > the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> > work just fine in KVM.
>
> The guest is potentially enlightened, but the host doesn't necessarily
> know which memslot the guest might want to share back, since it
> doesn't know where the guest might want to place the DMA pool. If I
> understand this correctly, for this to work, all memslots would need
> to be the same size and sharing would always need to happen at that
> granularity.
>
> Moreover, for something like a small DMA pool this might scale, but
> I'm not sure about potential future workloads (e.g., multimedia
> in-place sharing).
>
> >
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page level?
> > > > As I see it, that can cause VMA fragmentation if conversions are frequent.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs to unmap the memory _somewhere_. IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor. Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs. Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
>
> Actually, for pKVM, there is no need for the guest memory to be
> GUP'able at all if we use the new inaccessible_get_pfn().
If pKVM can use inaccessible_get_pfn() to get the pfn and can avoid GUP (I
think that is the major concern?), do you see any other gaps in the
existing API?
> This of
> course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be
> orthogonal to whether the memory is shared or private, rather than a
> private_fd for private memory and the userspace_addr for shared
> memory. The host can then map or unmap the shared/private memory using
> the fd, which allows it more freedom in even choosing to unmap shared
> memory when not needed, for example.
Using both private_fd and userspace_addr is only needed in TDX and other
confidential computing scenarios; pKVM may use only private_fd if the fd
can also be mmapped as a whole to userspace, as Sean suggested.
Thanks,
Chao
>
> Cheers,
> /fuad
On Thu, Sep 15, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a0f198cede3d..81ab20003824 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3028,6 +3028,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> break;
> }
>
> + if (kvm_mem_is_private(kvm, gfn))
Rather than reload the Xarray info, which is unnecessary overhead, pass in
@is_private. The caller must hold mmu_lock, i.e. invalidations from
private<->shared conversions will be stalled and will zap the new SPTE if the
state is changed.
E.g.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d68944f07b4b..44eea47697d8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3072,8 +3072,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* Enforce the iTLB multihit workaround after capturing the requested
* level, which will be used to do precise, accurate accounting.
*/
- fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
- fault->gfn, fault->max_level);
+ fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, fault->gfn,
+ fault->max_level, fault->is_private);
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;
@@ -6460,7 +6460,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
*/
if (sp->role.direct &&
sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
- PG_LEVEL_NUM)) {
+ PG_LEVEL_NUM, false)) {
kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
if (kvm_available_flush_tlb_with_range())
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 7670c13ce251..9acdf72537ce 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
}
+static inline bool is_private_spte(u64 spte)
+{
+ /* FIXME: Query C-bit/S-bit for SEV/TDX. */
+ return false;
+}
+
static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
int level)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 672f0432d777..69ba00157e90 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1767,8 +1767,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
if (iter.gfn < start || iter.gfn >= end)
continue;
- max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
- iter.gfn, PG_LEVEL_NUM);
+ max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
+ PG_LEVEL_NUM,
+ is_private_spte(iter.old_spte));
if (max_mapping_level < iter.level)
continue;
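The idea of looking up the private/shared state once at fault time and passing it down as @is_private can be sketched in userspace as follows. Everything here is a hypothetical stand-in: `lookup_is_private()` simulates the xarray load with a toy rule, `host_mapping_level()` is a fixed placeholder, and the policy that private pages keep `max_level` while shared pages are clamped by the host mapping is an assumption, since the quoted hunk does not show what the real function does for private memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { LEVEL_4K = 1, LEVEL_2M = 2, LEVEL_1G = 3 };

static unsigned int attr_lookups;	/* counts simulated xarray loads */

/* Stand-in for the attribute lookup; toy rule: odd gfns are private. */
static bool lookup_is_private(uint64_t gfn)
{
	attr_lookups++;
	return (gfn & 1) != 0;
}

/* Placeholder for the level at which the host maps a shared page. */
static int host_mapping_level(uint64_t gfn)
{
	(void)gfn;
	return LEVEL_4K;
}

struct page_fault {
	uint64_t gfn;
	int max_level;
	bool is_private;
	int req_level;
};

/* Takes the cached flag instead of repeating the lookup. */
static int max_mapping_level(uint64_t gfn, int max_level, bool is_private)
{
	int host_level;

	if (is_private)
		return max_level;	/* assumed policy, see lead-in */

	host_level = host_mapping_level(gfn);
	return host_level < max_level ? host_level : max_level;
}

static void handle_fault(struct page_fault *f)
{
	/* Exactly one lookup per fault; the result is reused from here on. */
	f->is_private = lookup_is_private(f->gfn);
	f->req_level = max_mapping_level(f->gfn, f->max_level, f->is_private);
}
```

The counter makes the saving visible: each fault costs one lookup regardless of how many helpers later need the flag.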
Hi,
> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > +#define KVM_MEM_ATTR_SHARED 0x0001
> > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > > + bool is_private)
> > > +{
> >
> > I wonder if this ioctl should be implemented as an arch-specific
> > ioctl. In this patch it performs some actions that pKVM might not need
> > or might want to do differently.
>
> I think it's doable. We can provide something like mem_attr_array in
> common code and let arch code decide whether to use it. Currently
> mem_attr_array is defined in struct kvm; if those bytes are
> unnecessary for pKVM it can even be moved to the arch definition, but that
> also loses the potential code sharing for confidential usages on other
> architectures, e.g. if ARM also supports such usage. Or it can be
> provided through a different CONFIG_ instead of
> CONFIG_HAVE_KVM_PRIVATE_MEM.
This sounds good. Thank you.
/fuad
> Thanks,
> Chao
> >
> > pKVM tracks the sharing status in the stage-2 page table's software
> > bits, so it can avoid the overhead of using mem_attr_array.
> >
> > Also, this ioctl calls kvm_zap_gfn_range(), as does the invalidation
> > notifier (introduced in patch 8). For pKVM, the kind of zapping (or
> > the information conveyed to the hypervisor) might need to be different
> > depending on the cause; whether it's invalidation or change of sharing
> > status.
>
> >
> > Thanks,
> > /fuad
> >
> >
> > > + gfn_t start, end;
> > > + unsigned long index;
> > > + void *entry;
> > > + int r;
> > > +
> > > + if (size == 0 || gpa + size < gpa)
> > > + return -EINVAL;
> > > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > > + return -EINVAL;
> > > +
> > > + start = gpa >> PAGE_SHIFT;
> > > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > > +
> > > + /*
> > > + * Guest memory defaults to private, kvm->mem_attr_array only stores
> > > + * shared memory.
> > > + */
> > > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > > +
> > > + for (index = start; index < end; index++) {
> > > + r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
> > > + GFP_KERNEL_ACCOUNT));
> > > + if (r)
> > > + goto err;
> > > + }
> > > +
> > > + kvm_zap_gfn_range(kvm, start, end);
> > > +
> > > + return r;
> > > +err:
> > > + for (; index > start; index--)
> > > + xa_erase(&kvm->mem_attr_array, index);
> > > + return r;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> > > +
> > > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > > static int kvm_pm_notifier_call(struct notifier_block *bl,
> > > unsigned long state,
> > > @@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > spin_lock_init(&kvm->mn_invalidate_lock);
> > > rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > > xa_init(&kvm->vcpu_array);
> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > + xa_init(&kvm->mem_attr_array);
> > > +#endif
> > >
> > > INIT_LIST_HEAD(&kvm->gpc_list);
> > > spin_lock_init(&kvm->gpc_lock);
> > > @@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > > kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> > > kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> > > }
> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > + xa_destroy(&kvm->mem_attr_array);
> > > +#endif
> > > cleanup_srcu_struct(&kvm->irq_srcu);
> > > cleanup_srcu_struct(&kvm->srcu);
> > > kvm_arch_free_vm(kvm);
> > > @@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > > }
> > > }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > > {
> > > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > > @@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > break;
> > > }
> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > + case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > > + struct kvm_enc_region region;
> > > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > +
> > > + if (!kvm_arch_has_private_mem(kvm))
> > > + goto arch_vm_ioctl;
> > > +
> > > + r = -EFAULT;
> > > + if (copy_from_user(&region, argp, sizeof(region)))
> > > + goto out;
> > > +
> > > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > > + region.size, set);
> > > + break;
> > > + }
> > > +#endif
> > > case KVM_GET_DIRTY_LOG: {
> > > struct kvm_dirty_log log;
> > >
> > > @@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
> > > r = kvm_vm_ioctl_get_stats_fd(kvm);
> > > break;
> > > default:
> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > +arch_vm_ioctl:
> > > +#endif
> > > r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > > }
> > > out:
> > > --
> > > 2.25.1
> > >
Hi,
> >
> > Actually, for pKVM, there is no need for the guest memory to be
> > GUP'able at all if we use the new inaccessible_get_pfn().
>
> If pKVM can use inaccessible_get_pfn() to get the pfn and can avoid GUP (I
> think that is the major concern?), do you see any other gaps in the
> existing API?
Actually for this part no, there aren't any gaps and
inaccessible_get_pfn() is sufficient.
> > This of
> > course goes back to what I'd mentioned before in v7; it seems that
> > representing the memslot memory as a file descriptor should be
> > orthogonal to whether the memory is shared or private, rather than a
> > private_fd for private memory and the userspace_addr for shared
> > memory. The host can then map or unmap the shared/private memory using
> > the fd, which allows it more freedom in even choosing to unmap shared
> > memory when not needed, for example.
>
> Using both private_fd and userspace_addr is only needed in TDX and other
> confidential computing scenarios; pKVM may use only private_fd if the fd
> can also be mmapped as a whole to userspace, as Sean suggested.
That does work in practice, for now at least, and is what I do in my
current port. However, I'm concerned about the naming and about how the
API is defined, as implied by the name and the documentation. Calling
the field private_fd implies that it should not be mapped, which is also
what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
would be mis/ab-using this interface, and that future changes could
cause unforeseen issues for pKVM.
Maybe this could be renamed to something like "guest_fp", with the
documentation specifying that it can be restricted, e.g., instead of "the
content of the private memory is invisible to userspace" something
along the lines of "the content of the guest memory may be restricted
to userspace".
What do you think?
Cheers,
/fuad
>
> Thanks,
> Chao
> >
> > Cheers,
> > /fuad
On 9/15/22 16:29, Chao Peng wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
>
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.
>
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
>
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
>
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
>
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier it then gets chance to remove any mapped entries
> of the range in the secondary page tables.
>
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implemented tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
...
> +static long inaccessible_fallocate(struct file *file, int mode,
> + loff_t offset, loff_t len)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> + int ret;
> +
> + if (mode & FALLOC_FL_PUNCH_HOLE) {
> + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> + return -EINVAL;
> + }
> +
> + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> + inaccessible_notifier_invalidate(data, offset, offset + len);
Wonder if invalidate should precede the actual hole punch, otherwise we open
a window where the page tables point to memory no longer valid?
> + return ret;
> +}
> +
...
> +
> +static struct file_system_type inaccessible_fs = {
> + .owner = THIS_MODULE,
> + .name = "[inaccessible]",
Dunno where exactly is this name visible, but shouldn't it better be
"[memfd:inaccessible]"?
> + .init_fs_context = inaccessible_init_fs_context,
> + .kill_sb = kill_anon_super,
> +};
> +
On Mon, Oct 17, 2022 at 11:31:19AM +0100, Fuad Tabba wrote:
> Hi,
>
> > >
> > > Actually, for pKVM, there is no need for the guest memory to be
> > > GUP'able at all if we use the new inaccessible_get_pfn().
> >
> > If pKVM can use inaccessible_get_pfn() to get pfn and can avoid GUP (I
> > think that is the major concern?), do you see any other gap from
> > existing API?
>
> Actually for this part no, there aren't any gaps and
> inaccessible_get_pfn() is sufficient.
Thanks for the confirmation.
>
> > > This of
> > > course goes back to what I'd mentioned before in v7; it seems that
> > > representing the memslot memory as a file descriptor should be
> > > orthogonal to whether the memory is shared or private, rather than a
> > > private_fd for private memory and the userspace_addr for shared
> > > memory. The host can then map or unmap the shared/private memory using
> > > the fd, which allows it more freedom in even choosing to unmap shared
> > > memory when not needed, for example.
> >
> > Using both private_fd and userspace_addr is only needed in TDX and other
> > confidential computing scenarios, pKVM may only use private_fd if the fd
> > can also be mmaped as a whole to userspace as Sean suggested.
>
> That does work in practice, for now at least, and is what I do in my
> current port. However, the naming and how the API is defined as
> implied by the name and the documentation. By calling the field
> private_fd, it does imply that it should not be mapped, which is also
> what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> would be mis/ab-using this interface, and that future changes could
> cause unforeseen issues for pKVM.
That is fair enough. We can change the naming and the documentation.
>
> Maybe renaming this to something like "guest_fp", and specifying in
> the documentation that it can be restricted, e.g., instead of "the
> content of the private memory is invisible to userspace" something
> along the lines of "the content of the guest memory may be restricted
> to userspace".
Some other candidates in my mind:
- restricted_fd: to pair with the mm side restricted_memfd
- protected_fd: as Sean suggested before
- fd: how it is interpreted depends on memslot.flags.
Thanks,
Chao
>
> What do you think?
>
> Cheers,
> /fuad
>
> >
> > Thanks,
> > Chao
> > >
> > > Cheers,
> > > /fuad
On Fri, Oct 14, 2022 at 06:57:20PM +0000, Sean Christopherson wrote:
> On Thu, Sep 15, 2022, Chao Peng wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a0f198cede3d..81ab20003824 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3028,6 +3028,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > break;
> > }
> >
> > + if (kvm_mem_is_private(kvm, gfn))
>
> Rather than reload the Xarray info, which is unnecessary overhead, pass in
> @is_private. The caller must hold mmu_lock, i.e. invalidations from
> private<->shared conversions will be stalled and will zap the new SPTE if the
> state is changed.
Makes sense. That should be easy to query for TDX/SEV.
Chao
>
> E.g.
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d68944f07b4b..44eea47697d8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3072,8 +3072,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> * Enforce the iTLB multihit workaround after capturing the requested
> * level, which will be used to do precise, accurate accounting.
> */
> - fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> - fault->gfn, fault->max_level);
> + fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, fault->gfn,
> + fault->max_level, fault->is_private);
> if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> return;
>
> @@ -6460,7 +6460,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> */
> if (sp->role.direct &&
> sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> - PG_LEVEL_NUM)) {
> + PG_LEVEL_NUM, false)) {
> kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>
> if (kvm_available_flush_tlb_with_range())
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 7670c13ce251..9acdf72537ce 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
> return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
> }
>
> +static inline bool is_private_spte(u64 spte)
> +{
> + /* FIXME: Query C-bit/S-bit for SEV/TDX. */
> + return false;
> +}
> +
> static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
> int level)
> {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 672f0432d777..69ba00157e90 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1767,8 +1767,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> if (iter.gfn < start || iter.gfn >= end)
> continue;
>
> - max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> - iter.gfn, PG_LEVEL_NUM);
> + max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
> + PG_LEVEL_NUM,
> + is_private_spte(iter.old_spte));
> if (max_mapping_level < iter.level)
> continue;
>
On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> On 9/15/22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> >
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
> >
> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
> >
> > It provides semantics required for KVM guest private(encrypted) memory
> > support that a file descriptor with this flag set is going to be used as
> > the source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV.
> >
> > KVM userspace is still in charge of the lifecycle of the memfd. It
> > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > in this patch to obtain the physical memory address and then populate
> > the secondary page table entries.
> >
> > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > from userspace. When hole-punching happens, KVM can get notified through
> > inaccessible_notifier it then gets chance to remove any mapped entries
> > of the range in the secondary page tables.
> >
> > The userspace inaccessible memfd itself is implemented as a shim layer
> > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > only implemented tmpfs. The allocated memory is currently marked as
> > unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
>
> ...
>
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > + loff_t offset, loff_t len)
> > +{
> > + struct inaccessible_data *data = file->f_mapping->private_data;
> > + struct file *memfd = data->memfd;
> > + int ret;
> > +
> > + if (mode & FALLOC_FL_PUNCH_HOLE) {
> > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > + return -EINVAL;
> > + }
> > +
> > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > + inaccessible_notifier_invalidate(data, offset, offset + len);
>
> Wonder if invalidate should precede the actual hole punch, otherwise we open
> a window where the page tables point to memory no longer valid?
Yes, you are right. Thanks for catching this.
> > + return ret;
> > +}
> > +
>
> ...
>
> > +
> > +static struct file_system_type inaccessible_fs = {
> > + .owner = THIS_MODULE,
> > + .name = "[inaccessible]",
>
> Dunno where exactly is this name visible, but shouldn't it better be
> "[memfd:inaccessible]"?
Maybe. And skip brackets.
>
> > + .init_fs_context = inaccessible_init_fs_context,
> > + .kill_sb = kill_anon_super,
> > +};
> > +
>
--
Kiryl Shutsemau / Kirill A. Shutemov
On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
>> On 9/15/22 16:29, Chao Peng wrote:
>>> From: "Kirill A. Shutemov" <[email protected]>
>>>
>>> KVM can use memfd-provided memory for guest memory. For normal userspace
>>> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
>>> virtual address space and then tells KVM to use the virtual address to
>>> setup the mapping in the secondary page table (e.g. EPT).
>>>
>>> With confidential computing technologies like Intel TDX, the
>>> memfd-provided memory may be encrypted with special key for special
>>> software domain (e.g. KVM guest) and is not expected to be directly
>>> accessed by userspace. Precisely, userspace access to such encrypted
>>> memory may lead to host crash so it should be prevented.
>>>
>>> This patch introduces userspace inaccessible memfd (created with
>>> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
>>> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
>>> in-kernel interface so KVM can directly interact with core-mm without
>>> the need to map the memory into KVM userspace.
>>>
>>> It provides semantics required for KVM guest private(encrypted) memory
>>> support that a file descriptor with this flag set is going to be used as
>>> the source of guest memory in confidential computing environments such
>>> as Intel TDX/AMD SEV.
>>>
>>> KVM userspace is still in charge of the lifecycle of the memfd. It
>>> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
>>> in this patch to obtain the physical memory address and then populate
>>> the secondary page table entries.
>>>
>>> The userspace inaccessible memfd can be fallocate-ed and hole-punched
>>> from userspace. When hole-punching happens, KVM can get notified through
>>> inaccessible_notifier it then gets chance to remove any mapped entries
>>> of the range in the secondary page tables.
>>>
>>> The userspace inaccessible memfd itself is implemented as a shim layer
>>> on top of real memory file systems like tmpfs/hugetlbfs but this patch
>>> only implemented tmpfs. The allocated memory is currently marked as
>>> unmovable and unevictable, this is required for current confidential
>>> usage. But in future this might be changed.
>>>
>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> Signed-off-by: Chao Peng <[email protected]>
>>> ---
>>
>> ...
>>
>>> +static long inaccessible_fallocate(struct file *file, int mode,
>>> + loff_t offset, loff_t len)
>>> +{
>>> + struct inaccessible_data *data = file->f_mapping->private_data;
>>> + struct file *memfd = data->memfd;
>>> + int ret;
>>> +
>>> + if (mode & FALLOC_FL_PUNCH_HOLE) {
>>> + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
>>> + return -EINVAL;
>>> + }
>>> +
>>> + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>>> + inaccessible_notifier_invalidate(data, offset, offset + len);
>>
>> Wonder if invalidate should precede the actual hole punch, otherwise we open
>> a window where the page tables point to memory no longer valid?
>
> Yes, you are right. Thanks for catching this.
I also noticed this. But then I thought the memory would be zeroed
(hole punched) anyway before this call?
>
>>> + return ret;
>>> +}
>>> +
>>
>> ...
>>
>>> +
>>> +static struct file_system_type inaccessible_fs = {
>>> + .owner = THIS_MODULE,
>>> + .name = "[inaccessible]",
>>
>> Dunno where exactly is this name visible, but shouldn't it better be
>> "[memfd:inaccessible]"?
>
> Maybe. And skip brackets.
>
>>
>>> + .init_fs_context = inaccessible_init_fs_context,
>>> + .kill_sb = kill_anon_super,
>>> +};
>>> +
>>
>
Hi,
> > > Using both private_fd and userspace_addr is only needed in TDX and other
> > > confidential computing scenarios, pKVM may only use private_fd if the fd
> > > can also be mmaped as a whole to userspace as Sean suggested.
> >
> > That does work in practice, for now at least, and is what I do in my
> > current port. However, the naming and how the API is defined as
> > implied by the name and the documentation. By calling the field
> > private_fd, it does imply that it should not be mapped, which is also
> > what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> > would be mis/ab-using this interface, and that future changes could
> > cause unforeseen issues for pKVM.
>
> That is fairly enough. We can change the naming and the documents.
>
> >
> > Maybe renaming this to something like "guest_fp", and specifying in
> > the documentation that it can be restricted, e.g., instead of "the
> > content of the private memory is invisible to userspace" something
> > along the lines of "the content of the guest memory may be restricted
> > to userspace".
>
> Some other candidates in my mind:
> - restricted_fd: to pair with the mm side restricted_memfd
> - protected_fd: as Sean suggested before
> - fd: how it's explained relies on the memslot.flag.
All these sound good, since they all capture the potential use cases.
Restricted might be the most logical choice if that's going to also
become the name for the mem_fd.
Thanks,
/fuad
> Thanks,
> Chao
> >
> > What do you think?
> >
> > Cheers,
> > /fuad
> >
> > >
> > > Thanks,
> > > Chao
> > > >
> > > > Cheers,
> > > > /fuad
On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > On 9/15/22 16:29, Chao Peng wrote:
> > > > From: "Kirill A. Shutemov" <[email protected]>
> > > >
> > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > virtual address space and then tells KVM to use the virtual address to
> > > > setup the mapping in the secondary page table (e.g. EPT).
> > > >
> > > > With confidential computing technologies like Intel TDX, the
> > > > memfd-provided memory may be encrypted with special key for special
> > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > memory may lead to host crash so it should be prevented.
> > > >
> > > > This patch introduces userspace inaccessible memfd (created with
> > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > the need to map the memory into KVM userspace.
> > > >
> > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > support that a file descriptor with this flag set is going to be used as
> > > > the source of guest memory in confidential computing environments such
> > > > as Intel TDX/AMD SEV.
> > > >
> > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > in this patch to obtain the physical memory address and then populate
> > > > the secondary page table entries.
> > > >
> > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > of the range in the secondary page tables.
> > > >
> > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > unmovable and unevictable, this is required for current confidential
> > > > usage. But in future this might be changed.
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > Signed-off-by: Chao Peng <[email protected]>
> > > > ---
> > >
> > > ...
> > >
> > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > + loff_t offset, loff_t len)
> > > > +{
> > > > + struct inaccessible_data *data = file->f_mapping->private_data;
> > > > + struct file *memfd = data->memfd;
> > > > + int ret;
> > > > +
> > > > + if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > + inaccessible_notifier_invalidate(data, offset, offset + len);
> > >
> > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > a window where the page tables point to memory no longer valid?
> >
> > Yes, you are right. Thanks for catching this.
>
> I also noticed this. But then thought the memory would be anyways zeroed
> (hole punched) before this call?
Hole punching can free pages, provided that offset/len cover full pages.
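To make the "full page" condition concrete, here is a small sketch that counts how many whole pages fall inside a punched range. Per fallocate(2), whole blocks in the range are removed while partial blocks at the edges are only zeroed. The helper name pages_freed_by_punch() and the explicit page_size parameter are assumptions made for illustration; this is plain user-space arithmetic, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of whole pages inside [offset, offset + len).
 * Only these pages can actually be freed by a hole punch; a partial page
 * at either edge is merely zeroed. page_size must be a power of two.
 */
static uint64_t pages_freed_by_punch(uint64_t offset, uint64_t len,
				     uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	/* Round the start up and the end down to page boundaries. */
	uint64_t first = (offset + page_size - 1) & ~(page_size - 1);
	uint64_t last = (offset + len) & ~(page_size - 1);

	return last > first ? (last - first) / page_size : 0;
}
```

With 4K pages, a punch at offset 100 of length 4096 straddles two pages without covering either, so nothing is freed. The patch sidesteps that case by rejecting unaligned offset/len with -EINVAL.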
--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Oct 17, 2022, Fuad Tabba wrote:
> Hi,
>
> > > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > > +#define KVM_MEM_ATTR_SHARED 0x0001
> > > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > > > + bool is_private)
> > > > +{
> > >
> > > I wonder if this ioctl should be implemented as an arch-specific
> > > ioctl. In this patch it performs some actions that pKVM might not need
> > > or might want to do differently.
> >
> > I think it's doable. We can provide the mem_attr_array kind thing in
> > common code and let arch code decide to use it or not. Currently
> > mem_attr_array is defined in the struct kvm, if those bytes are
> > unnecessary for pKVM it can even be moved to arch definition, but that
> > also loses the potential code sharing for confidential usages in other
> > non-x86 architectures, e.g. if ARM also supports such usage. Or it can be
> > provided through a different CONFIG_ instead of
> > CONFIG_HAVE_KVM_PRIVATE_MEM.
>
> This sounds good. Thank you.
I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
something. I highly doubt there will be any non-x86 users for multiple years,
if ever, but it would allow testing the private memory stuff on ARM (and any other
non-x86 arch) without needing full pKVM support and with only minor KVM
modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
trivial.
[*] https://lore.kernel.org/all/[email protected]
On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs to unmap the memory _somewhere_. IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor. Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs. Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
>
> Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> all if we use the new inaccessible_get_pfn().
Ya, I was referring to pKVM without UPM / inaccessible memory.
Jumping back to blocking gup(), what about using the same tricks as secretmem to
block gup()? E.g. compare vm_ops to block regular gup() and a_ops to block fast
gup() on struct page? With a Kconfig that's selected by pKVM (which would also
need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero
performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
controversial.
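As a rough illustration of the secretmem-style check, the sketch below models the two identity comparisons in user space: the VMA's vm_ops pointer for regular gup() and the page mapping's a_ops pointer for fast gup(). Every type and name here (model_vma, model_page, inaccessible_vm_ops, inaccessible_aops, and so on) is a hypothetical stand-in, not an existing kernel symbol, and the real hooks would sit in the gup() paths behind the Kconfig.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's ops tables and objects. */
struct model_vm_ops { int unused; };
struct model_a_ops { int unused; };
struct model_vma { const struct model_vm_ops *vm_ops; };
struct model_mapping { const struct model_a_ops *a_ops; };
struct model_page { const struct model_mapping *mapping; };

/* The ops tables that would identify inaccessible-but-mappable memory. */
static const struct model_vm_ops inaccessible_vm_ops = { 0 };
static const struct model_a_ops inaccessible_aops = { 0 };

/* Regular gup(): the VMA is at hand, so compare its vm_ops pointer. */
static bool vma_is_inaccessible(const struct model_vma *vma)
{
	return vma->vm_ops == &inaccessible_vm_ops;
}

/* Fast gup(): only the page is at hand, so compare the mapping's a_ops. */
static bool page_is_inaccessible(const struct model_page *page)
{
	return page->mapping && page->mapping->a_ops == &inaccessible_aops;
}

/* gup() would refuse the pin when either check fires. */
static bool gup_allowed(const struct model_vma *vma,
			const struct model_page *page)
{
	return !vma_is_inaccessible(vma) && !page_is_inaccessible(page);
}
```

Both checks are pointer comparisons, which is why the cost is small even before adding a dedicated page flag.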
I suspect the fast gup() path could even be optimized to avoid the page_mapping()
lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
selected. I'm guessing pKVM isn't expected to be deployed on massive NUMA systems
anytime soon, so there should be plenty of page flags to go around.
Blocking gup() instead of trying to play refcount games when converting back to
private would eliminate the need to put heavy restrictions on mapping, as the goal
of those was purely to simplify the KVM implementation, e.g. the "one mapping per
memslot" thing would go away entirely.
> This of course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be orthogonal to
> whether the memory is shared or private, rather than a private_fd for private
> memory and the userspace_addr for shared memory.
I also explored the idea of backing any guest memory with an fd, but came to
the conclusion that private memory needs a separate handle[1], at least on x86.
For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
TDX steal GPA bits to differentiate private vs. shared), the two types need to be
treated as separate mappings[2]. Post-boot, converting is lossy in both directions,
so even conceptually they are two distinct pages that just happen to share (some)
GPA bits.
To allow conversions, i.e. changing which mapping to use, without memslot updates,
KVM needs to let userspace provide both mappings in a single memslot. So while
fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory,
KVM would still need a dedicated private handle.
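A minimal sketch of what "both mappings in a single memslot" could look like, assuming a slot that carries the existing userspace_addr for the shared view plus a private_fd and a file offset for the private view. The struct and helper names (model_memslot, resolve_backing) and the private_offset field as used here are illustrative assumptions, not the layout from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memslot model carrying both backings at once. */
struct model_memslot {
	uint64_t base_gfn;
	uint64_t npages;
	uint64_t userspace_addr;	/* shared view: host virtual address */
	int private_fd;			/* private view: inaccessible memfd */
	uint64_t private_offset;	/* assumed offset of the slot in that fd */
};

enum backing_kind { BACKING_HVA, BACKING_FD };

struct backing {
	enum backing_kind kind;
	uint64_t hva;		/* valid when kind == BACKING_HVA */
	int fd;			/* valid when kind == BACKING_FD */
	uint64_t offset;	/* valid when kind == BACKING_FD */
};

/* Pick the backing for a gfn; a conversion only flips is_private. */
static struct backing resolve_backing(const struct model_memslot *slot,
				      uint64_t gfn, bool is_private,
				      uint64_t page_size)
{
	assert(gfn >= slot->base_gfn && gfn - slot->base_gfn < slot->npages);

	uint64_t delta = (gfn - slot->base_gfn) * page_size;
	struct backing b = { 0 };

	if (is_private) {
		b.kind = BACKING_FD;
		b.fd = slot->private_fd;
		b.offset = slot->private_offset + delta;
	} else {
		b.kind = BACKING_HVA;
		b.hva = slot->userspace_addr + delta;
		b.fd = -1;
	}
	return b;
}
```

The slot itself is never updated on conversion; the same gfn resolves to a different backing depending only on the per-page private/shared state.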
For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing
userspace_addr, but since the private_fd is going to be added for x86, I think it
makes sense to use that instead of adding generic fd-based memory for pKVM's use
case (which is arguably still "private" memory but with special semantics).
[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]
> The host can then map or unmap the shared/private memory using the fd, which
> allows it more freedom in even choosing to unmap shared memory when not
> needed, for example.
On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
<[email protected]> wrote:
>
> On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > From: "Kirill A. Shutemov" <[email protected]>
> > > > >
> > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > >
> > > > > With confidential computing technologies like Intel TDX, the
> > > > > memfd-provided memory may be encrypted with special key for special
> > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > memory may lead to host crash so it should be prevented.
> > > > >
> > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > the need to map the memory into KVM userspace.
> > > > >
> > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > support that a file descriptor with this flag set is going to be used as
> > > > > the source of guest memory in confidential computing environments such
> > > > > as Intel TDX/AMD SEV.
> > > > >
> > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > in this patch to obtain the physical memory address and then populate
> > > > > the secondary page table entries.
> > > > >
> > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > > of the range in the secondary page tables.
> > > > >
> > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > unmovable and unevictable, this is required for current confidential
> > > > > usage. But in future this might be changed.
> > > > >
> > > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > > Signed-off-by: Chao Peng <[email protected]>
> > > > > ---
> > > >
> > > > ...
> > > >
> > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > + loff_t offset, loff_t len)
> > > > > +{
> > > > > + struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > + struct file *memfd = data->memfd;
> > > > > + int ret;
> > > > > +
> > > > > + if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > + inaccessible_notifier_invalidate(data, offset, offset + len);
> > > >
> > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > a window where the page tables point to memory no longer valid?
> > >
> > > Yes, you are right. Thanks for catching this.
> >
> > I also noticed this. But then thought the memory would be anyways zeroed
> > (hole punched) before this call?
>
> Hole punching can free pages, given that offset/len covers full page.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
I think moving this notifier_invalidate before fallocate may not solve
the problem completely. Is it possible that between invalidate and
fallocate, KVM tries to handle the page fault for the guest VM from
another vcpu and uses the pages to be freed to back gpa ranges? Should
hole punching here also update mem_attr first to say that KVM should
consider the corresponding gpa ranges to be no longer backed by
inaccessible memfd?
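One way to reason about this window is the begin/end bracketing that KVM's mmu_notifier integration already uses. The invalidation marks itself in progress before the pages go away and bumps a sequence count when it finishes, and a fault that raced with it retries rather than installing a stale mapping. The following is a minimal user-space model of that idea, not kernel code. struct inval_state and its helpers are hypothetical names, and a real implementation would also need locking and memory barriers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of invalidation tracking for one memfd range. */
struct inval_state {
	unsigned long seq;	/* bumped each time an invalidation ends */
	int in_progress;	/* > 0 while a hole punch is in flight */
};

/* Called before the backing pages are freed. */
static void invalidate_begin(struct inval_state *s)
{
	s->in_progress++;
}

/* Called after the hole punch has completed. */
static void invalidate_end(struct inval_state *s)
{
	s->seq++;
	s->in_progress--;
}

/* Fault path: snapshot the sequence before looking up the pfn. */
static unsigned long fault_snapshot(const struct inval_state *s)
{
	return s->seq;
}

/* Fault path: true if the pfn may be stale and the fault must retry. */
static bool fault_must_retry(const struct inval_state *s, unsigned long snap)
{
	return s->in_progress > 0 || s->seq != snap;
}
```

In this model a vcpu fault that looked up a pfn before or during the punch sees either in_progress > 0 or a changed seq, so it drops the pfn and retries against the post-punch state.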
On Thu, Sep 15, 2022 at 8:04 PM Chao Peng <[email protected]> wrote:
>
> From: "Kirill A. Shutemov" <[email protected]>
>
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
>
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with a special key for a special
> software domain (e.g. a KVM guest) and is not expected to be directly
> accessed by userspace. More precisely, userspace access to such encrypted
> memory may lead to a host crash, so it should be prevented.
>
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
>
> It provides the semantics required for KVM guest private (encrypted)
> memory support: a file descriptor with this flag set is going to be used
> as the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
>
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
>
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM gets notified through
> inaccessible_notifier and then has a chance to remove any mapped entries
> of the range from the secondary page tables.
>
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs, but this patch
> only implements tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, which is required for current confidential
> usage. This might change in the future.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/linux/memfd.h | 24 ++++
> include/uapi/linux/magic.h | 1 +
> include/uapi/linux/memfd.h | 1 +
> mm/Makefile | 2 +-
> mm/memfd.c | 25 ++++-
> mm/memfd_inaccessible.c | 219 +++++++++++++++++++++++++++++++++++++
> 6 files changed, 270 insertions(+), 2 deletions(-)
> create mode 100644 mm/memfd_inaccessible.c
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..334ddff08377 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -3,6 +3,7 @@
> #define __LINUX_MEMFD_H
>
> #include <linux/file.h>
> +#include <linux/pfn_t.h>
>
> #ifdef CONFIG_MEMFD_CREATE
> extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
> }
> #endif
>
> +struct inaccessible_notifier;
> +
> +struct inaccessible_notifier_ops {
> + void (*invalidate)(struct inaccessible_notifier *notifier,
> + pgoff_t start, pgoff_t end);
> +};
> +
> +struct inaccessible_notifier {
> + struct list_head list;
> + const struct inaccessible_notifier_ops *ops;
> +};
> +
> +void inaccessible_register_notifier(struct file *file,
> + struct inaccessible_notifier *notifier);
> +void inaccessible_unregister_notifier(struct file *file,
> + struct inaccessible_notifier *notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> + int *order);
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd);
> +
> #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..9d066be3d7e8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
> #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
> #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
> +#define INACCESSIBLE_MAGIC 0x494e4143 /* "INAC" */
>
> #endif /* __LINUX_MAGIC_H__ */
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
> #define MFD_CLOEXEC 0x0001U
> #define MFD_ALLOW_SEALING 0x0002U
> #define MFD_HUGETLB 0x0004U
> +#define MFD_INACCESSIBLE 0x0008U
>
> /*
> * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..f82e5d4b4388 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
> obj-$(CONFIG_ZONE_DEVICE) += memremap.o
> obj-$(CONFIG_HMM_MIRROR) += hmm.o
> -obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> +obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
> obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
> obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 08f5f8304746..1853a90f49ff 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> + MFD_INACCESSIBLE)
>
> SYSCALL_DEFINE2(memfd_create,
> const char __user *, uname,
> @@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
> return -EINVAL;
> }
>
> + /* Disallow sealing when MFD_INACCESSIBLE is set. */
> + if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
> + return -EINVAL;
> +
> + /* TODO: add hugetlb support */
> + if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
> + return -EINVAL;
> +
> /* length includes terminating zero */
> len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> if (len <= 0)
> @@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
> *file_seals &= ~F_SEAL_SEAL;
> }
>
> + if (flags & MFD_INACCESSIBLE) {
> + struct file *inaccessible_file;
> +
> + inaccessible_file = memfd_mkinaccessible(file);
> + if (IS_ERR(inaccessible_file)) {
> + error = PTR_ERR(inaccessible_file);
> + goto err_file;
> + }
> +
> + file = inaccessible_file;
> + }
> +
> fd_install(fd, file);
> kfree(name);
> return fd;
>
> +err_file:
> + fput(file);
> err_fd:
> put_unused_fd(fd);
> err_name:
> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index 000000000000..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c
> @@ -0,0 +1,219 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/memfd.h>
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +
> +struct inaccessible_data {
> + struct mutex lock;
> + struct file *memfd;
> + struct list_head notifiers;
> +};
> +
> +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> + pgoff_t start, pgoff_t end)
> +{
> + struct inaccessible_notifier *notifier;
> +
> + mutex_lock(&data->lock);
> + list_for_each_entry(notifier, &data->notifiers, list) {
> + notifier->ops->invalidate(notifier, start, end);
> + }
> + mutex_unlock(&data->lock);
> +}
> +
> +static int inaccessible_release(struct inode *inode, struct file *file)
> +{
> + struct inaccessible_data *data = inode->i_mapping->private_data;
> +
> + fput(data->memfd);
> + kfree(data);
> + return 0;
> +}
> +
> +static long inaccessible_fallocate(struct file *file, int mode,
> + loff_t offset, loff_t len)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> + int ret;
> +
> + if (mode & FALLOC_FL_PUNCH_HOLE) {
> + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> + return -EINVAL;
> + }
> +
> + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> + inaccessible_notifier_invalidate(data, offset, offset + len);
> + return ret;
> +}
> +
> +static const struct file_operations inaccessible_fops = {
> + .release = inaccessible_release,
> + .fallocate = inaccessible_fallocate,
> +};
> +
> +static int inaccessible_getattr(struct user_namespace *mnt_userns,
> + const struct path *path, struct kstat *stat,
> + u32 request_mask, unsigned int query_flags)
> +{
> + struct inode *inode = d_inode(path->dentry);
> + struct inaccessible_data *data = inode->i_mapping->private_data;
> + struct file *memfd = data->memfd;
> +
> + return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> + request_mask, query_flags);
> +}
> +
> +static int inaccessible_setattr(struct user_namespace *mnt_userns,
> + struct dentry *dentry, struct iattr *attr)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct inaccessible_data *data = inode->i_mapping->private_data;
> + struct file *memfd = data->memfd;
> + int ret;
> +
> + if (attr->ia_valid & ATTR_SIZE) {
> + if (memfd->f_inode->i_size)
> + return -EPERM;
> +
> + if (!PAGE_ALIGNED(attr->ia_size))
> + return -EINVAL;
> + }
> +
> + ret = memfd->f_inode->i_op->setattr(mnt_userns,
> + file_dentry(memfd), attr);
> + return ret;
> +}
> +
> +static const struct inode_operations inaccessible_iops = {
> + .getattr = inaccessible_getattr,
> + .setattr = inaccessible_setattr,
> +};
> +
> +static int inaccessible_init_fs_context(struct fs_context *fc)
> +{
> + if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
> + return -ENOMEM;
> +
> + fc->s_iflags |= SB_I_NOEXEC;
> + return 0;
> +}
> +
> +static struct file_system_type inaccessible_fs = {
> + .owner = THIS_MODULE,
> + .name = "[inaccessible]",
> + .init_fs_context = inaccessible_init_fs_context,
> + .kill_sb = kill_anon_super,
> +};
> +
> +static struct vfsmount *inaccessible_mnt;
> +
> +static __init int inaccessible_init(void)
> +{
> + inaccessible_mnt = kern_mount(&inaccessible_fs);
> + if (IS_ERR(inaccessible_mnt))
> + return PTR_ERR(inaccessible_mnt);
> + return 0;
> +}
> +fs_initcall(inaccessible_init);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd)
> +{
> + struct inaccessible_data *data;
> + struct address_space *mapping;
> + struct inode *inode;
> + struct file *file;
> +
> + data = kzalloc(sizeof(*data), GFP_KERNEL);
> + if (!data)
> + return ERR_PTR(-ENOMEM);
> +
> + data->memfd = memfd;
> + mutex_init(&data->lock);
> + INIT_LIST_HEAD(&data->notifiers);
> +
> + inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> + if (IS_ERR(inode)) {
> + kfree(data);
> + return ERR_CAST(inode);
> + }
> +
> + inode->i_mode |= S_IFREG;
> + inode->i_op = &inaccessible_iops;
> + inode->i_mapping->private_data = data;
> +
> + file = alloc_file_pseudo(inode, inaccessible_mnt,
> + "[memfd:inaccessible]", O_RDWR,
> + &inaccessible_fops);
> + if (IS_ERR(file)) {
> + iput(inode);
> + kfree(data);
> + return file;
> + }
> +
> + file->f_flags |= O_LARGEFILE;
> +
> + mapping = memfd->f_mapping;
> + mapping_set_unevictable(mapping);
> + mapping_set_gfp_mask(mapping,
> + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> + return file;
> +}
> +
> +void inaccessible_register_notifier(struct file *file,
> + struct inaccessible_notifier *notifier)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> +
> + mutex_lock(&data->lock);
> + list_add(&notifier->list, &data->notifiers);
> + mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> +
> +void inaccessible_unregister_notifier(struct file *file,
> + struct inaccessible_notifier *notifier)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> +
> + mutex_lock(&data->lock);
> + list_del(&notifier->list);
> + mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> + int *order)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> + struct page *page;
> + int ret;
> +
> + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> + if (ret)
> + return ret;
> +
> + *pfn = page_to_pfn_t(page);
> + *order = thp_order(compound_head(page));
> + SetPageUptodate(page);
> + unlock_page(page);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> + struct page *page = pfn_t_to_page(pfn);
> +
> + if (WARN_ON_ONCE(!page))
> + return;
> +
> + put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>
In the context of userspace inaccessible memfd, what would be a
suggested way to enforce NUMA memory policy for physical memory
allocation? mbind[1] won't work here in absence of virtual address
range.
[1] https://github.com/chao-p/qemu/blob/privmem-v8/backends/hostmem.c#L382
On Mon, Oct 17, 2022 at 10:17:45PM +0000, Sean Christopherson wrote:
> On Mon, Oct 17, 2022, Fuad Tabba wrote:
> > Hi,
> >
> > > > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > > > +#define KVM_MEM_ATTR_SHARED 0x0001
> > > > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > > > > + bool is_private)
> > > > > +{
> > > >
> > > > I wonder if this ioctl should be implemented as an arch-specific
> > > > ioctl. In this patch it performs some actions that pKVM might not need
> > > > or might want to do differently.
> > >
> > > I think it's doable. We can provide the mem_attr_array kind thing in
> > > common code and let arch code decide to use it or not. Currently
> > > mem_attr_array is defined in the struct kvm, if those bytes are
> > > unnecessary for pKVM it can even be moved to arch definition, but that
> > > also loses the potential code sharing for confidential usages in other
> > > > non-x86 architectures, e.g. if ARM also supports such usage. Or it can be
> > > provided through a different CONFIG_ instead of
> > > CONFIG_HAVE_KVM_PRIVATE_MEM.
> >
> > This sounds good. Thank you.
>
> I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
> something. I highly doubt there will be any non-x86 users for multiple years,
> if ever, but it would allow testing the private memory stuff on ARM (and any other
> non-x86 arch) without needing full pKVM support and with only minor KVM
> modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
> trivial.
CONFIG_KVM_GENERIC_PRIVATE_MEM looks good to me.
Thanks,
Chao
>
> [*] https://lore.kernel.org/all/[email protected]
On Mon, Oct 17, 2022 at 08:05:10PM +0100, Fuad Tabba wrote:
> Hi,
>
> > > > Using both private_fd and userspace_addr is only needed in TDX and other
> > > > confidential computing scenarios, pKVM may only use private_fd if the fd
> > > > can also be mmaped as a whole to userspace as Sean suggested.
> > >
> > > That does work in practice, for now at least, and is what I do in my
> > > > current port. However, my concern is with the naming and with how the
> > > > API is defined, as implied by the name and the documentation. By calling the field
> > > private_fd, it does imply that it should not be mapped, which is also
> > > what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> > > would be mis/ab-using this interface, and that future changes could
> > > cause unforeseen issues for pKVM.
> >
> > That is fair enough. We can change the naming and the documents.
> >
> > >
> > > Maybe renaming this to something like "guest_fp", and specifying in
> > > the documentation that it can be restricted, e.g., instead of "the
> > > content of the private memory is invisible to userspace" something
> > > along the lines of "the content of the guest memory may be restricted
> > > to userspace".
> >
> > Some other candidates in my mind:
> > - restricted_fd: to pair with the mm side restricted_memfd
> > - protected_fd: as Sean suggested before
> > - fd: how it's explained relies on the memslot.flag.
>
> All these sound good, since they all capture the potential use cases.
> Restricted might be the most logical choice if that's going to also
> become the name for the mem_fd.
Thanks, I will use 'restricted' for them. e.g.:
- memfd_restricted() syscall
- restricted_fd
- restricted_offset
The memslot flags will still be KVM_MEM_PRIVATE, since I think pKVM will
create its own one?
Chao
>
> Thanks,
> /fuad
>
> > Thanks,
> > Chao
> > >
> > > What do you think?
> > >
> > > Cheers,
> > > /fuad
> > >
> > > >
> > > > Thanks,
> > > > Chao
> > > > >
> > > > > Cheers,
> > > > > /fuad
Hi,
On Tue, Oct 18, 2022 at 1:34 AM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > > pKVM would also need a way to make an fd accessible again
> > > > > > when shared back, which I think isn't possible with this patch.
> > > > >
> > > > > But does pKVM really want to mmap/munmap a new region at the page level?
> > > > > As I see it, that can cause VMA fragmentation if conversions are frequent.
> > > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > > be the same issue.
> > > >
> > > > pKVM doesn't really need to unmap the memory. What is really important
> > > > is that the memory is not GUP'able.
> > >
> > > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> > >
> > > The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
> > > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > > the end result is the same.
> > >
> > > > Emphasis above because pKVM still needs to unmap the memory _somewhere_. IIUC, the
> > > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > > context of the hypervisor. Which is also the source of the gup() problems; the
> > > untrusted kernel is blissfully unaware that the memory is inaccessible.
> > >
> > > Any approach that moves some of that information into the untrusted kernel so that
> > > the kernel can protect itself will incur fragmentation in the VMAs. Well, unless
> > > all of guest memory becomes unguppable, but that's likely not a viable option.
> >
> > Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> > all if we use the new inaccessible_get_pfn().
>
> Ya, I was referring to pKVM without UPM / inaccessible memory.
>
> Jumping back to blocking gup(), what about using the same tricks as secretmem to
> block gup()? E.g. compare vm_ops to block regular gup() and a_ops to block fast
> gup() on struct page? With a Kconfig that's selected by pKVM (which would also
> need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero
> performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
> controversial.
>
> I suspect the fast gup() path could even be optimized to avoid the page_mapping()
> lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
> selected. I'm guessing pKVM isn't expected to be deployed on massive NUMA systems
> anytime soon, so there should be plenty of page flags to go around.
>
> Blocking gup() instead of trying to play refcount games when converting back to
> private would eliminate the need to put heavy restrictions on mapping, as the goal
> of those were purely to simplify the KVM implementation, e.g. the "one mapping per
> memslot" thing would go away entirely.
My implementation of mmap for inaccessible_fops was setting VM_PFNMAP.
That said, I realized that that might be adding an unnecessary
restriction, and now have changed it to do it the secretmem way.
That's straightforward and works well.
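
A rough userspace model of the vm_ops comparison that the secretmem
approach relies on is shown below. It is only a sketch: the model_*
types and names are hypothetical stand-ins for the kernel's VMA and
vm_operations structures, and it is not the code from the pKVM port.
The idea is that the gup() path refuses any VMA whose vm_ops pointer
identifies it as belonging to the inaccessible file.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for vm_operations_struct and vm_area_struct. */
struct model_vm_ops { int unused; };
struct model_vma { const struct model_vm_ops *vm_ops; };

/* One ops table per file type; identity of the pointer is what matters. */
static const struct model_vm_ops model_inaccessible_vm_ops = { 0 };
static const struct model_vm_ops model_shmem_vm_ops = { 0 };

/* Mirrors the shape of a "vma_is_<type>()" helper: compare vm_ops. */
static bool model_vma_is_inaccessible(const struct model_vma *vma)
{
	return vma->vm_ops == &model_inaccessible_vm_ops;
}

/* Returns 0 if the VMA may be pinned, -EFAULT if gup must be refused. */
static int model_gup_check_vma(const struct model_vma *vma)
{
	return model_vma_is_inaccessible(vma) ? -EFAULT : 0;
}
```

Because the check is a single pointer comparison, it adds no per-VMA
state and no VM_PFNMAP-style restriction on the mapping itself.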
> > This of course goes back to what I'd mentioned before in v7; it seems that
> > representing the memslot memory as a file descriptor should be orthogonal to
> > whether the memory is shared or private, rather than a private_fd for private
> > memory and the userspace_addr for shared memory.
>
> I also explored the idea of backing any guest memory with an fd, but came to
> the conclusion that private memory needs a separate handle[1], at least on x86.
>
> For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
> TDX steal GPA bits to differentiate private vs. shared), the two types need to be
> treated as separate mappings[2]. Post-boot, converting is lossy in both directions,
> so even conceptually they are two distinct pages that just happen to share (some)
> GPA bits.
>
> To allow conversions, i.e. changing which mapping to use, without memslot updates,
> KVM needs to let userspace provide both mappings in a single memslot. So while
> fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory,
> KVM would still need a dedicated private handle.
>
> For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing
> userspace_addr, but since the private_fd is going to be added for x86, I think it
> makes sense to use that instead of adding generic fd-based memory for pKVM's use
> case (which is arguably still "private" memory but with special semantics).
>
> [1] https://lore.kernel.org/all/[email protected]
> [2] https://lore.kernel.org/all/[email protected]
As long as the API does not impose this limit, which would imply pKVM
is misusing it, then I agree. I think that's why renaming it to
something like "restricted" might be clearer.
Thanks,
/fuad
> > > This sounds good. Thank you.
> >
> > I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
> > something. I highly doubt there will be any non-x86 users for multiple years,
> > if ever, but it would allow testing the private memory stuff on ARM (and any other
> > non-x86 arch) without needing full pKVM support and with only minor KVM
> > modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
> > trivial.
>
> CONFIG_KVM_GENERIC_PRIVATE_MEM looks good to me.
That sounds good to me, and just keeping the xarray isn't really an
issue for pKVM. We could end up using it instead of some of the other
structures we use for tracking.
Cheers,
/fuad
> Thanks,
> Chao
> >
> > [*] https://lore.kernel.org/all/[email protected]
On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> <[email protected]> wrote:
> >
> > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > + loff_t offset, loff_t len)
> > > > > > +{
> > > > > > + struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > + struct file *memfd = data->memfd;
> > > > > > + int ret;
> > > > > > +
> > > > > > + if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > + return -EINVAL;
> > > > > > + }
> > > > > > +
> > > > > > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > + inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > >
> > > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > > a window where the page tables point to memory no longer valid?
> > > >
> > > > Yes, you are right. Thanks for catching this.
> > >
> > > I also noticed this. But then thought the memory would be anyways zeroed
> > > (hole punched) before this call?
> >
> > Hole punching can free pages, given that offset/len covers full page.
> >
> > --
> > Kiryl Shutsemau / Kirill A. Shutemov
>
> I think moving this notifier_invalidate before fallocate may not solve
> the problem completely. Is it possible that between invalidate and
> fallocate, KVM tries to handle the page fault for the guest VM from
> another vcpu and uses the pages to be freed to back gpa ranges? Should
> hole punching here also update mem_attr first to say that KVM should
> consider the corresponding gpa ranges to be no more backed by
> inaccessible memfd?
We rely on external synchronization to prevent this. See code around
mmu_invalidate_retry_hva().
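
Roughly, the pattern is a sequence-count check: the fault path snapshots
a counter before looking up the page and rechecks it before installing
the mapping. The userspace model below is a simplified sketch of that
idea only. All model_* names are hypothetical, it handles a single
invalidation at a time, and it omits the locking the real code depends
on.

```c
#include <stdbool.h>

/* Hypothetical model of the invalidation state tracked per VM. */
struct model_kvm {
	unsigned long invalidate_seq;	/* bumped when an invalidation ends */
	int invalidate_in_progress;	/* nonzero between begin and end */
	unsigned long range_start;	/* [start, end) being invalidated */
	unsigned long range_end;
};

static void model_invalidate_begin(struct model_kvm *kvm,
				   unsigned long start, unsigned long end)
{
	kvm->invalidate_in_progress++;
	kvm->range_start = start;
	kvm->range_end = end;
}

static void model_invalidate_end(struct model_kvm *kvm)
{
	kvm->invalidate_seq++;
	kvm->invalidate_in_progress--;
}

/* Fault path, step 1: snapshot the sequence before the page lookup. */
static unsigned long model_fault_snapshot(const struct model_kvm *kvm)
{
	return kvm->invalidate_seq;
}

/*
 * Fault path, step 2: true means the looked-up page may be stale, so
 * the fault handler must drop it and retry instead of mapping it.
 */
static bool model_fault_must_retry(const struct model_kvm *kvm,
				   unsigned long seq, unsigned long addr)
{
	if (kvm->invalidate_in_progress &&
	    addr >= kvm->range_start && addr < kvm->range_end)
		return true;
	return kvm->invalidate_seq != seq;
}
```

A fault that took its snapshot before the invalidation started is forced
to retry, either because its address falls in the in-progress range or
because the sequence has moved on by the time it rechecks.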
--
Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Oct 19, 2022, Fuad Tabba wrote:
> > > > This sounds good. Thank you.
> > >
> > > I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
> > > something. I highly doubt there will be any non-x86 users for multiple years,
> > > if ever, but it would allow testing the private memory stuff on ARM (and any other
> > > non-x86 arch) without needing full pKVM support and with only minor KVM
> > > modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
> > > trivial.
> >
> > CONFIG_KVM_GENERIC_PRIVATE_MEM looks good to me.
>
> That sounds good to me, and just keeping the xarray isn't really an
> issue for pKVM.
The xarray won't exist for pKVM if the #ifdefs in this patch are changed from
CONFIG_HAVE_KVM_PRIVATE_MEM => CONFIG_KVM_GENERIC_PRIVATE_MEM.
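
As a sketch of what that gating means in practice, here is a userspace
model with hypothetical model_* types standing in for struct kvm and the
xarray. It assumes the attribute array is the only thing guarded by the
new Kconfig. When the macro is not defined, the field and its storage
are simply absent.

```c
#include <stdbool.h>

/*
 * Build with -DCONFIG_KVM_GENERIC_PRIVATE_MEM to model an architecture
 * that selects the generic implementation; leave it undefined to model
 * one (e.g. pKVM) that does not.
 */
struct model_xarray { void *head; };	/* stand-in for struct xarray */

struct model_kvm {
	int nr_memslots;
#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
	struct model_xarray mem_attr_array;	/* per-gfn attributes */
#endif
};

/* Reports whether the generic attribute tracking was compiled in. */
static bool model_has_mem_attr_array(void)
{
#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
	return true;
#else
	return false;
#endif
}
```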
> We could end up using it instead of some of the other
> structures we use for tracking.
I don't think pKVM should hijack the xarray for other purposes. At best, it will
be confusing, at worst we'll end up with a mess if ARM ever supports the "generic"
implementation.
On Wed, Oct 19, 2022 at 5:09 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Oct 19, 2022, Fuad Tabba wrote:
> > > > > This sounds good. Thank you.
> > > >
> > > > I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
> > > > something. I highly doubt there will be any non-x86 users for multiple years,
> > > > if ever, but it would allow testing the private memory stuff on ARM (and any other
> > > > non-x86 arch) without needing full pKVM support and with only minor KVM
> > > > modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
> > > > trivial.
> > >
> > > CONFIG_KVM_GENERIC_PRIVATE_MEM looks good to me.
> >
> > That sounds good to me, and just keeping the xarray isn't really an
> > issue for pKVM.
>
> The xarray won't exist for pKVM if the #ifdefs in this patch are changed from
> CONFIG_HAVE_KVM_PRIVATE_MEM => CONFIG_KVM_GENERIC_PRIVATE_MEM.
>
> > We could end up using it instead of some of the other
> > structures we use for tracking.
>
> I don't think pKVM should hijack the xarray for other purposes. At best, it will
> be confusing, at worst we'll end up with a mess if ARM ever supports the "generic"
> implementation.
Definitely wasn't referring to hijacking it for other purposes, which
is the main reason I wanted to clarify the documentation and the
naming of private_fd. Anyway, I'm glad to see that we're in agreement.
Once I've tightened the screws, I'll share the pKVM port as an RFC as
well.
Cheers,
/fuad
On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov
<[email protected]> wrote:
>
> On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> > <[email protected]> wrote:
> > >
> > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > > + loff_t offset, loff_t len)
> > > > > > > +{
> > > > > > > + struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > > + struct file *memfd = data->memfd;
> > > > > > > + int ret;
> > > > > > > +
> > > > > > > + if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > > + return -EINVAL;
> > > > > > > + }
> > > > > > > +
> > > > > > > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > > + inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > > >
> > > > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > > > a window where the page tables point to memory no longer valid?
> > > > >
> > > > > Yes, you are right. Thanks for catching this.
> > > >
> > > > I also noticed this. But then thought the memory would be anyways zeroed
> > > > (hole punched) before this call?
> > >
> > > Hole punching can free pages, given that offset/len covers full page.
> > >
> > > --
> > > Kiryl Shutsemau / Kirill A. Shutemov
> >
> > I think moving this notifier_invalidate before fallocate may not solve
> > the problem completely. Is it possible that between invalidate and
> > fallocate, KVM tries to handle the page fault for the guest VM from
> > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > hole punching here also update mem_attr first to say that KVM should
> > consider the corresponding gpa ranges to be no more backed by
> > inaccessible memfd?
>
> We rely on external synchronization to prevent this. See code around
> mmu_invalidate_retry_hva().
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
ranges that are being invalidated are retried till invalidation is
complete. In this case, is it possible that KVM tries to serve the
page fault after inaccessible_notifier_invalidate() has completed but
before fallocate() punches the hole in the file?

e.g.

  inaccessible_notifier_invalidate(...)
  ... (system event preempting this control flow, giving a window for
  the guest to retry accessing the gfn range which was invalidated)
  fallocate(.., PUNCH_HOLE..)
>
> In the context of userspace inaccessible memfd, what would be a
> suggested way to enforce NUMA memory policy for physical memory
> allocation? mbind[1] won't work here in absence of virtual address
> range.
How about set_mempolicy():
https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
Chao
>
> [1] https://github.com/chao-p/qemu/blob/privmem-v8/backends/hostmem.c#L382
On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov
> <[email protected]> wrote:
> >
> > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> > > <[email protected]> wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > > > From: "Kirill A. Shutemov" <[email protected]>
> > > > > > > >
> > > > > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > > > >
> > > > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > > > memfd-provided memory may be encrypted with special key for special
> > > > > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > > > > memory may lead to host crash so it should be prevented.
> > > > > > > >
> > > > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > > > > the need to map the memory into KVM userspace.
> > > > > > > >
> > > > > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > > > > support that a file descriptor with this flag set is going to be used as
> > > > > > > > the source of guest memory in confidential computing environments such
> > > > > > > > as Intel TDX/AMD SEV.
> > > > > > > >
> > > > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > > > > in this patch to obtain the physical memory address and then populate
> > > > > > > > the secondary page table entries.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > > > > > of the range in the secondary page tables.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > > > > unmovable and unevictable, this is required for current confidential
> > > > > > > > usage. But in future this might be changed.
> > > > > > > >
> > > > > > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > > > > > Signed-off-by: Chao Peng <[email protected]>
> > > > > > > > ---
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > > > + loff_t offset, loff_t len)
> > > > > > > > +{
> > > > > > > > + struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > > > + struct file *memfd = data->memfd;
> > > > > > > > + int ret;
> > > > > > > > +
> > > > > > > > + if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > > > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > > > + return -EINVAL;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > > > + inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > > > >
> > > > > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > > > > a window where the page tables point to memory no longer valid?
> > > > > >
> > > > > > Yes, you are right. Thanks for catching this.
> > > > >
> > > > > I also noticed this. But then thought the memory would be anyways zeroed
> > > > > (hole punched) before this call?
> > > >
> > > > Hole punching can free pages, given that offset/len covers full page.
> > > >
> > > > --
> > > > Kiryl Shutsemau / Kirill A. Shutemov
> > >
> > > I think moving this notifier_invalidate before fallocate may not solve
> > > the problem completely. Is it possible that between invalidate and
> > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > hole punching here also update mem_attr first to say that KVM should
> > > consider the corresponding gpa ranges to be no more backed by
> > > inaccessible memfd?
> >
> > We rely on external synchronization to prevent this. See code around
> > mmu_invalidate_retry_hva().
> >
> > --
> > Kiryl Shutsemau / Kirill A. Shutemov
>
> IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> ranges that are being invalidated are retried till invalidation is
> complete. In this case, is it possible that KVM tries to serve the
> page fault after inaccessible_notifier_invalidate is complete but
> before fallocate could punch hole into the files?
> e.g.
> inaccessible_notifier_invalidate(...)
> ... (system event preempting this control flow, giving a window for
> the guest to retry accessing the gfn range which was invalidated)
> fallocate(.., PUNCH_HOLE..)
This looks like something that can happen. It sounds to me like the
solution just needs to follow the mmu_notifier's way of using an
invalidate_start/end pair:

  invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
                          zap KVM page table entries;
  fallocate()
  invalidate_end()    --> kvm->mmu_invalidate_in_progress--;

Then, during the invalidate_start/end window, mmu_invalidate_retry_gfn
checks 'mmu_invalidate_in_progress' and prevents repopulating the same
page in the KVM page table:

  if (kvm->mmu_invalidate_in_progress)
          return 1; /* retry */
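
To make the retry check above concrete, here is a minimal userspace
sketch, not kernel code. The struct and helper names are hypothetical.
The sequence counter is an assumption, modelled loosely on how the
existing mmu_notifier path is understood to work, so that a fault which
sampled its state before a window opened is also retried once that
window has closed.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. The two fields mirror names used in this thread;
 * the struct and the helper names below are assumptions.
 */
struct kvm_model {
	unsigned long mmu_invalidate_seq;   /* bumped each time a window closes */
	long mmu_invalidate_in_progress;    /* > 0 while a window is open */
};

/* Open the window; real code would zap the secondary page table entries here. */
static void invalidate_start(struct kvm_model *kvm)
{
	kvm->mmu_invalidate_in_progress++;
}

/* Close the window after the backing pages have been freed. */
static void invalidate_end(struct kvm_model *kvm)
{
	assert(kvm->mmu_invalidate_in_progress > 0);
	kvm->mmu_invalidate_seq++;
	kvm->mmu_invalidate_in_progress--;
}

/*
 * Fault path: 'seq' is the value of mmu_invalidate_seq sampled before the
 * page was looked up. Returns true when the fault has to be retried.
 */
static bool invalidate_retry(const struct kvm_model *kvm, unsigned long seq)
{
	if (kvm->mmu_invalidate_in_progress)
		return true;                    /* a hole punch is in flight */
	return kvm->mmu_invalidate_seq != seq;  /* one completed meanwhile */
}
```

A fault handler would sample mmu_invalidate_seq, look up the page, and
call invalidate_retry() before installing the mapping. The real helpers
also take locks and check the affected range, which this sketch leaves
out.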
Thanks,
Chao
On Fri, Oct 21, 2022, Chao Peng wrote:
> >
> > In the context of userspace inaccessible memfd, what would be a
> > suggested way to enforce NUMA memory policy for physical memory
> > allocation? mbind[1] won't work here in absence of virtual address
> > range.
>
> How about set_mempolicy():
> https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
Andy Lutomirski brought this up in an off-list discussion way back when the whole
private-fd thing was first being proposed.
: The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses. If
: we want to support them for TDX private memory, we either need TDX private
: memory to have an HVA or we need file-based equivalents. Arguably we should add
: fmove_pages and fbind syscalls anyway, since the current API is quite awkward
: even for tools like numactl.
On Fri, Oct 21, 2022, Chao Peng wrote:
> On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> > On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov <[email protected]> wrote:
> > >
> > > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > > I think moving this notifier_invalidate before fallocate may not solve
> > > > the problem completely. Is it possible that between invalidate and
> > > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > > hole punching here also update mem_attr first to say that KVM should
> > > > consider the corresponding gpa ranges to be no more backed by
> > > > inaccessible memfd?
> > >
> > > We rely on external synchronization to prevent this. See code around
> > > mmu_invalidate_retry_hva().
> > >
> > > --
> > > Kiryl Shutsemau / Kirill A. Shutemov
> >
> > IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> > ranges that are being invalidated are retried till invalidation is
> > complete. In this case, is it possible that KVM tries to serve the
> > page fault after inaccessible_notifier_invalidate is complete but
> > before fallocate could punch hole into the files?
It's not just the page fault edge case. In the more straightforward scenario
where the memory is already mapped into the guest, freeing pages back to the kernel
before they are removed from the guest will lead to use-after-free.
> > e.g.
> > inaccessible_notifier_invalidate(...)
> > ... (system event preempting this control flow, giving a window for
> > the guest to retry accessing the gfn range which was invalidated)
> > fallocate(.., PUNCH_HOLE..)
>
> This looks like something that can happen.
> It sounds to me like the solution just needs
> to follow the mmu_notifier's way of using an invalidate_start/end pair.
>
> invalidate_start() --> kvm->mmu_invalidate_in_progress++;
> zap KVM page table entries;
> fallocate()
> invalidate_end() --> kvm->mmu_invalidate_in_progress--;
>
> Then during invalidate_start/end time window mmu_invalidate_retry_gfn
> checks 'mmu_invalidate_in_progress' and prevents repopulating the same
> page in KVM page table.
Yes, if it's not safe to invalidate after making the change (fallocate()), then
the change needs to be bookended by a start+end pair. The mmu_notifier's unpaired
invalidate() hook works by zapping the primary MMU's PTEs before invalidate(), but
frees the underlying physical page _after_ invalidate().
And the only reason the unpaired invalidate() exists is because there are secondary
MMUs that reuse the primary MMU's page tables, e.g. shared virtual addressing, in
which case bookending doesn't work because the secondary MMU can't remove PTEs, it
can only flush its TLBs.
For this case, the whole point is to not create PTEs in the primary MMU, so there
should never be a use case that _needs_ an unpaired invalidate().
TL;DR: a start+end pair is likely the simplest solution.
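
As an illustration of that ordering, below is a small single-threaded
userspace model, not the patch's code. All names in it are
hypothetical. A callback stands in for another vCPU that faults between
the zap and the free, which is the window described earlier in the
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

struct model {
	bool page_present[NPAGES];   /* page is allocated in the backing file */
	bool spte_present[NPAGES];   /* secondary page table maps that page */
	int invalidate_in_progress;  /* > 0 between start and end */
};

/* Fault path: maps a page unless a window is open or the page is gone. */
static bool fault_in(struct model *m, size_t idx)
{
	if (m->invalidate_in_progress || !m->page_present[idx])
		return false;            /* caller has to retry */
	m->spte_present[idx] = true;
	return true;
}

/* Stand-in for another vCPU faulting on page 2 in the middle of the punch. */
static bool racing_fault(struct model *m)
{
	return fault_in(m, 2);
}

/*
 * Punch [start, end). 'racer' runs between the zap and the free.
 * Returns true if the racer managed to install a mapping.
 */
static bool punch_hole(struct model *m, size_t start, size_t end,
		       bool (*racer)(struct model *))
{
	bool raced = false;
	size_t i;

	assert(start <= end && end <= NPAGES);

	m->invalidate_in_progress++;            /* invalidate_start() */
	for (i = start; i < end; i++)
		m->spte_present[i] = false;     /* zap before freeing */

	if (racer)
		raced = racer(m);

	for (i = start; i < end; i++)
		m->page_present[i] = false;     /* the fallocate() step */

	m->invalidate_in_progress--;            /* invalidate_end() */
	return raced;
}

/* Invariant: no secondary mapping may point at a freed page. */
static bool no_dangling_mapping(const struct model *m)
{
	for (size_t i = 0; i < NPAGES; i++)
		if (m->spte_present[i] && !m->page_present[i])
			return false;
	return true;
}
```

With the bracket in place, punch_hole(&m, 0, 4, racing_fault) returns
false and no_dangling_mapping() holds afterwards. Removing the
in-progress check from fault_in() lets the racer remap page 2 just
before it is freed.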
On 24.10.22 16:59, Kirill A . Shutemov wrote:
> On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
>> On Fri, Oct 21, 2022, Chao Peng wrote:
>>>>
>>>> In the context of userspace inaccessible memfd, what would be a
>>>> suggested way to enforce NUMA memory policy for physical memory
>>>> allocation? mbind[1] won't work here in absence of virtual address
>>>> range.
>>>
>>> How about set_mempolicy():
>>> https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
>>
>> Andy Lutomirski brought this up in an off-list discussion way back when the whole
>> private-fd thing was first being proposed.
>>
>> : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses. If
>> : we want to support them for TDX private memory, we either need TDX private
>> : memory to have an HVA or we need file-based equivalents. Arguably we should add
>> : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
>> : even for tools like numactl.
>
> Yeah, we definitely have gaps in the API wrt NUMA, but I don't think it
> should be addressed in the initial submission.
>
> BTW, it is not a regression compared to old KVM slots, if the memory is
> backed by memfd or other file:
>
> MBIND(2)
> The specified policy will be ignored for any MAP_SHARED mappings in the
> specified memory range. Rather the pages will be allocated according to
> the memory policy of the thread that caused the page to be allocated.
> Again, this may not be the thread that called mbind().
IIRC, that documentation is imprecise/incorrect especially when it comes
to memfd. Page faults in shared mappings will similarly obey the set
mbind() policy when allocating new pages.
QEMU relies on that.
The "fun" begins when we have multiple mappings, and only some have a
policy set ... or if we have already allocated the pages.
--
Thanks,
David / dhildenb
On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
> On Fri, Oct 21, 2022, Chao Peng wrote:
> > >
> > > In the context of userspace inaccessible memfd, what would be a
> > > suggested way to enforce NUMA memory policy for physical memory
> > > allocation? mbind[1] won't work here in absence of virtual address
> > > range.
> >
> > How about set_mempolicy():
> > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
>
> Andy Lutomirski brought this up in an off-list discussion way back when the whole
> private-fd thing was first being proposed.
>
> : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses. If
> : we want to support them for TDX private memory, we either need TDX private
> : memory to have an HVA or we need file-based equivalents. Arguably we should add
> : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
> : even for tools like numactl.
Yeah, we definitely have gaps in the API wrt NUMA, but I don't think it
should be addressed in the initial submission.
BTW, it is not a regression compared to old KVM slots, if the memory is
backed by memfd or other file:
MBIND(2)
The specified policy will be ignored for any MAP_SHARED mappings in the
specified memory range. Rather the pages will be allocated according to
the memory policy of the thread that caused the page to be allocated.
Again, this may not be the thread that called mbind().
It is not clear how to define fbind(2) semantics, considering that multiple
processes may compete for the same region of page cache.
Should it be per-inode or per-fd? Or maybe per-range in inode/fd?
fmove_pages(2) should be relatively straightforward, since it is
best-effort and does not guarantee that the page will not be moved
somewhere else just after return from the syscall.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Oct 24, 2022 at 8:30 PM Kirill A . Shutemov
<[email protected]> wrote:
>
> On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
> > On Fri, Oct 21, 2022, Chao Peng wrote:
> > > >
> > > > In the context of userspace inaccessible memfd, what would be a
> > > > suggested way to enforce NUMA memory policy for physical memory
> > > > allocation? mbind[1] won't work here in absence of virtual address
> > > > range.
> > >
> > > How about set_mempolicy():
> > > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> >
> > Andy Lutomirski brought this up in an off-list discussion way back when the whole
> > private-fd thing was first being proposed.
> >
> > : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses. If
> > : we want to support them for TDX private memory, we either need TDX private
> > : memory to have an HVA or we need file-based equivalents. Arguably we should add
> > : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
> > : even for tools like numactl.
>
> Yeah, we definitely have gaps in the API wrt NUMA, but I don't think it
> should be addressed in the initial submission.
>
> BTW, it is not a regression compared to old KVM slots, if the memory is
> backed by memfd or other file:
>
> MBIND(2)
> The specified policy will be ignored for any MAP_SHARED mappings in the
> specified memory range. Rather the pages will be allocated according to
> the memory policy of the thread that caused the page to be allocated.
> Again, this may not be the thread that called mbind().
>
> It is not clear how to define fbind(2) semantics, considering that multiple
> processes may compete for the same region of page cache.
>
> Should it be per-inode or per-fd? Or maybe per-range in inode/fd?
>
David's analysis on mempolicy with shmem seems to be right. set_policy
on virtual address range does seem to change the shared policy for the
inode irrespective of the mapping type.
Maybe having a way to set the NUMA policy per range in the inode would
be on par with what we can do today via mbind() on virtual address ranges.
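
A rough sketch of what a per-range lookup could look like, purely for
discussion. Nothing here is an existing kernel interface: the struct
names, the flat array, and NODE_DEFAULT are all hypothetical, and a
real implementation would more likely use an interval tree keyed by
page offset.

```c
#include <assert.h>
#include <stddef.h>

#define NODE_DEFAULT (-1)   /* hypothetical: fall back to the task policy */

/* One policy entry covering page offsets [start, end) of the file. */
struct range_policy {
	unsigned long start;
	unsigned long end;
	int node;           /* preferred NUMA node for this range */
};

/* Hypothetical per-inode table of non-overlapping ranges. */
struct inode_policy {
	const struct range_policy *ranges;
	size_t nr;
};

/* Return the node to allocate from for 'pgoff', or NODE_DEFAULT. */
static int policy_node_for(const struct inode_policy *p, unsigned long pgoff)
{
	assert(p != NULL);

	for (size_t i = 0; i < p->nr; i++)
		if (pgoff >= p->ranges[i].start && pgoff < p->ranges[i].end)
			return p->ranges[i].node;

	return NODE_DEFAULT;
}
```

The allocation path for a page at a given file offset would consult
this table first and fall back to the faulting thread's policy when no
range matches, which keeps the semantics independent of which process
or fd triggered the allocation.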
> fmove_pages(2) should be relatively straightforward, since it is
> best-effort and does not guarantee that the page will not be moved
> somewhere else just after return from the syscall.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov