An SEV guest requires its pages to be pinned in host physical memory,
because migration of encrypted pages is not supported. The memory
encryption scheme uses the physical address of the memory being
encrypted: if guest pages are moved by the host, content decrypted in
the guest would be incorrect, corrupting the guest's memory.
For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
encrypted or when the guest is done using them, so it should treat all
guest pages as encrypted until the guest is destroyed.
The actual pinning management is handled by vendor code via new
kvm_x86_ops hooks. The MMU calls into vendor code to pin pages on
demand. The pinning metadata is stored in the architecture-specific
memslot area, and the guest pages are unpinned in the memslot freeing
path.
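As a rough illustration of the bookkeeping this series adds, here is a
minimal userspace C sketch of the per-memslot pinning metadata (a pinned
bitmap plus a pfn array). The names and the pin_count stand-in for page
reference counts are invented for this sketch and are not KVM symbols:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative model of the arch memslot pinning metadata:
 * one bit and one pfn slot per guest page in the memslot. */
struct slot_meta {
    unsigned long npages;
    unsigned char *pinned;   /* stand-in for the pinned bitmap */
    unsigned long *pfns;     /* pfn recorded when the page was pinned */
    unsigned long pin_count; /* stand-in for the page references taken */
};

static int slot_meta_alloc(struct slot_meta *m, unsigned long npages)
{
    m->npages = npages;
    m->pin_count = 0;
    m->pinned = calloc(npages, 1);
    m->pfns = calloc(npages, sizeof(*m->pfns));
    if (!m->pinned || !m->pfns)
        return -1;
    return 0;
}

/* Pin on demand: called when the MMU maps rel_gfn to pfn. */
static void slot_pin_page(struct slot_meta *m, unsigned long rel_gfn,
                          unsigned long pfn)
{
    if (m->pinned[rel_gfn]) {
        if (m->pfns[rel_gfn] == pfn)
            return;           /* already pinned at this pfn */
        m->pin_count--;       /* drop the stale pin (put_page) */
    }
    m->pinned[rel_gfn] = 1;
    m->pfns[rel_gfn] = pfn;
    m->pin_count++;           /* get_page on the new pfn */
}

/* Memslot free path: unpin everything that was pinned. */
static void slot_free(struct slot_meta *m)
{
    for (unsigned long i = 0; i < m->npages; i++)
        if (m->pinned[i])
            m->pin_count--;   /* put_page */
    free(m->pinned);
    free(m->pfns);
}
```

The real code additionally flushes caches and handles hugepage levels,
but the bitmap/pfn bookkeeping follows this shape.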
This work initially started with [1], where the idea was to track
pinned pages using a software bit in the SPTE. That turned out not to
be feasible, for the following reason: the pinned-SPTE information is
stored in the shadow pages (SPs), and the way the current MMU is
designed, the full MMU context (aka the roots) gets dropped many times,
even when the CR0.WP bit is flipped. Dropping the roots causes a huge
amount of SP alloc/remove churn, and the pinning information stored in
the SPs at the child levels is lost along with them. Without that
information, it is not possible to decide whether to re-pin a page, or
to unpin pages during guest shutdown.
[1] https://patchwork.kernel.org/project/kvm/cover/[email protected]/
Nikunj A Dadhania (4):
KVM: x86/mmu: Add hook to pin PFNs on demand in MMU
KVM: SVM: Add pinning metadata in the arch memslot
KVM: SVM: Implement demand page pinning
KVM: SEV: Carve out routine for allocation of pages
Sean Christopherson (2):
KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV/TDX
KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
arch/x86/include/asm/kvm-x86-ops.h | 3 +
arch/x86/include/asm/kvm_host.h | 9 +
arch/x86/kvm/mmu.h | 3 +
arch/x86/kvm/mmu/mmu.c | 41 +++
arch/x86/kvm/mmu/tdp_mmu.c | 7 +
arch/x86/kvm/svm/sev.c | 423 +++++++++++++++++++----------
arch/x86/kvm/svm/svm.c | 4 +
arch/x86/kvm/svm/svm.h | 9 +-
arch/x86/kvm/x86.c | 11 +-
9 files changed, 359 insertions(+), 151 deletions(-)
--
2.32.0
Use vendor code via kvm_x86_ops hooks for pinning.
Signed-off-by: Nikunj A Dadhania <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 3 +++
arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++++
4 files changed, 13 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index f658bb4dbb74..a96c52a99a04 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -87,6 +87,7 @@ KVM_X86_OP(set_tss_addr)
KVM_X86_OP(set_identity_map_addr)
KVM_X86_OP(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP(pin_spte)
KVM_X86_OP_NULL(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0677b9ea01c9..1263a16dd588 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1417,6 +1417,8 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
+ void (*pin_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1d275e9d76b5..62dda588eb99 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2977,6 +2977,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return ret;
direct_pte_prefetch(vcpu, it.sptep);
+ if (!is_error_noslot_pfn(fault->pfn) && !kvm_is_reserved_pfn(fault->pfn))
+ static_call_cond(kvm_x86_pin_spte)(vcpu->kvm, base_gfn,
+ it.level, fault->pfn);
++vcpu->stat.pf_fixed;
return ret;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7b1bc816b7c3..b7578fa02e9f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -467,6 +467,13 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
(!is_present || !is_dirty_spte(new_spte) || pfn_changed))
kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ /*
+ * Call the vendor code to handle the pinning
+ */
+ if (is_present && is_leaf)
+ static_call_cond(kvm_x86_pin_spte)(kvm, gfn, level,
+ spte_to_pfn(new_spte));
+
/*
* Recursively handle child PTs if the change removed a subtree from
* the paging structure.
--
2.32.0
From: Sean Christopherson <[email protected]>
Introduce a helper to directly (pun intended) fault in a TDP page
without having to go through the full page fault path. This allows
SEV/TDX to pin pages before booting the guest, provides the resulting
pfn to vendor code in case it is needed in the future, and allows the
RET_PF_* enums to stay in mmu.c where they belong.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Nikunj A Dadhania <[email protected]>
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e9fbb2c8bbe2..0595891dd834 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -202,6 +202,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return vcpu->arch.mmu->page_fault(vcpu, &fault);
}
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u32 error_code, int max_level);
+
/*
* Currently, we have two sorts of write-protection, a) the first one
* write-protects guest page to sync the guest modification, b) another one is
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 62dda588eb99..de5d390e0dcc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4111,6 +4111,44 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return direct_page_fault(vcpu, fault);
}
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u32 error_code, int max_level)
+{
+ int r;
+ struct kvm_page_fault fault = (struct kvm_page_fault) {
+ .addr = gpa,
+ .error_code = error_code,
+ .exec = error_code & PFERR_FETCH_MASK,
+ .write = error_code & PFERR_WRITE_MASK,
+ .present = error_code & PFERR_PRESENT_MASK,
+ .rsvd = error_code & PFERR_RSVD_MASK,
+ .user = error_code & PFERR_USER_MASK,
+ .prefetch = false,
+ .is_tdp = true,
+ .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
+ };
+
+ if (mmu_topup_memory_caches(vcpu, false))
+ return KVM_PFN_ERR_FAULT;
+
+ /*
+ * Loop on the page fault path to handle the case where an mmu_notifier
+ * invalidation triggers RET_PF_RETRY. In the normal page fault path,
+ * KVM needs to resume the guest in case the invalidation changed any
+ * of the page fault properties, i.e. the gpa or error code. For this
+ * path, the gpa and error code are fixed by the caller, and the caller
+ * expects failure if and only if the page fault can't be fixed.
+ */
+ do {
+ fault.max_level = max_level;
+ fault.req_level = PG_LEVEL_4K;
+ fault.goal_level = PG_LEVEL_4K;
+ r = direct_page_fault(vcpu, &fault);
+ } while (r == RET_PF_RETRY && !is_error_noslot_pfn(fault.pfn));
+ return fault.pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
static void nonpaging_init_context(struct kvm_mmu *context)
{
context->page_fault = nonpaging_page_fault;
--
2.32.0
From: Sean Christopherson <[email protected]>
Pin the memory for the data being passed to launch_update_data(),
because it gets encrypted before the guest is first run and must not
be moved afterwards; moving it would corrupt it.
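The translation this pinning path relies on -- resolving an hva to a gpa
through the memslot that covers it -- is simple offset arithmetic. A
userspace sketch with simplified types (the UNMAPPED sentinel and the
flat slot array are inventions of this sketch):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define UNMAPPED   (~0ULL)

struct memslot {
    unsigned long long base_gfn;     /* first guest frame of the slot */
    unsigned long userspace_addr;    /* hva where the slot is mapped */
    unsigned long npages;
};

/* Find the slot covering hva and add the intra-slot offset to the
 * slot's base gpa; return UNMAPPED if no slot covers the hva. */
static unsigned long long hva_to_gpa(const struct memslot *slots, int n,
                                     unsigned long hva)
{
    for (int i = 0; i < n; i++) {
        const struct memslot *s = &slots[i];
        unsigned long size = s->npages << PAGE_SHIFT;

        if (hva >= s->userspace_addr && hva < s->userspace_addr + size)
            return (s->base_gfn << PAGE_SHIFT) +
                   (hva - s->userspace_addr);
    }
    return UNMAPPED;
}
```

As the review below notes, a linear scan like this is the weak point of
the patch; the arithmetic per slot is the part that carries over.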
Signed-off-by: Sean Christopherson <[email protected]>
[ * Changed hva_to_gva() to take an extra argument and return gpa_t.
* Updated sev_pin_memory_in_mmu() error handling.
 * As pinning/unpinning pages is handled within MMU, removed
{get,put}_user(). ]
Signed-off-by: Nikunj A Dadhania <[email protected]>
---
arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 119 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 14aeccfc500b..1ae714e83a3c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -22,6 +22,7 @@
#include <asm/trapnr.h>
#include <asm/fpu/xcr.h>
+#include "mmu.h"
#include "x86.h"
#include "svm.h"
#include "svm_ops.h"
@@ -490,6 +491,110 @@ static unsigned long get_num_contig_pages(unsigned long idx,
return pages;
}
+#define SEV_PFERR_RO (PFERR_USER_MASK)
+#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
+
+static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
+ unsigned long hva)
+{
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *memslot;
+ int bkt;
+
+ kvm_for_each_memslot(memslot, bkt, slots) {
+ if (hva >= memslot->userspace_addr &&
+ hva < memslot->userspace_addr +
+ (memslot->npages << PAGE_SHIFT))
+ return memslot;
+ }
+
+ return NULL;
+}
+
+static gpa_t hva_to_gpa(struct kvm *kvm, unsigned long hva, bool *ro)
+{
+ struct kvm_memory_slot *memslot;
+ gpa_t gpa_offset;
+
+ memslot = hva_to_memslot(kvm, hva);
+ if (!memslot)
+ return UNMAPPED_GVA;
+
+ *ro = !!(memslot->flags & KVM_MEM_READONLY);
+ gpa_offset = hva - memslot->userspace_addr;
+ return ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
+}
+
+static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
+ unsigned long size,
+ unsigned long *npages)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct kvm_vcpu *vcpu;
+ struct page **pages;
+ unsigned long i;
+ u32 error_code;
+ kvm_pfn_t pfn;
+ int idx, ret = 0;
+ gpa_t gpa;
+ bool ro;
+
+ pages = sev_alloc_pages(sev, addr, size, npages);
+ if (IS_ERR(pages))
+ return pages;
+
+ vcpu = kvm_get_vcpu(kvm, 0);
+ if (mutex_lock_killable(&vcpu->mutex)) {
+ kvfree(pages);
+ return ERR_PTR(-EINTR);
+ }
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&kvm->srcu);
+
+ kvm_mmu_load(vcpu);
+
+ for (i = 0; i < *npages; i++, addr += PAGE_SIZE) {
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ gpa = hva_to_gpa(kvm, addr, &ro);
+ if (gpa == UNMAPPED_GVA) {
+ ret = -EFAULT;
+ break;
+ }
+
+ error_code = ro ? SEV_PFERR_RO : SEV_PFERR_RW;
+
+ /*
+ * Fault in the page and sev_pin_page() will handle the
+ * pinning
+ */
+ pfn = kvm_mmu_map_tdp_page(vcpu, gpa, error_code, PG_LEVEL_4K);
+ if (is_error_noslot_pfn(pfn)) {
+ ret = -EFAULT;
+ break;
+ }
+ pages[i] = pfn_to_page(pfn);
+ }
+
+ kvm_mmu_unload(vcpu);
+ srcu_read_unlock(&kvm->srcu, idx);
+ vcpu_put(vcpu);
+ mutex_unlock(&vcpu->mutex);
+
+ if (!ret)
+ return pages;
+
+ kvfree(pages);
+ return ERR_PTR(ret);
+}
+
static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
{
unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i;
@@ -510,15 +615,21 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
vaddr_end = vaddr + size;
/* Lock the user memory. */
- inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
+ if (atomic_read(&kvm->online_vcpus))
+ inpages = sev_pin_memory_in_mmu(kvm, vaddr, size, &npages);
+ else
+ inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
if (IS_ERR(inpages))
return PTR_ERR(inpages);
/*
* Flush (on non-coherent CPUs) before LAUNCH_UPDATE encrypts pages in
* place; the cache may contain the data that was written unencrypted.
+ * Flushing is automatically handled if the pages can be pinned in the
+ * MMU.
*/
- sev_clflush_pages(inpages, npages);
+ if (!atomic_read(&kvm->online_vcpus))
+ sev_clflush_pages(inpages, npages);
data.reserved = 0;
data.handle = sev->handle;
@@ -553,8 +664,13 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
set_page_dirty_lock(inpages[i]);
mark_page_accessed(inpages[i]);
}
+
/* unlock the user pages */
- sev_unpin_memory(kvm, inpages, npages);
+ if (atomic_read(&kvm->online_vcpus))
+ kvfree(inpages);
+ else
+ sev_unpin_memory(kvm, inpages, npages);
+
return ret;
}
--
2.32.0
Use the memslot metadata to store the pinned data along with the pfns.
This improves the SEV guest startup time from O(n) to a constant by
deferring guest page pinning until the pages are used to satisfy nested
page faults. The page references are dropped in the memslot free path.
Remove the enc_region structure definition and the code that did the
upfront pinning, as they are no longer needed now that demand pinning
is supported.
Leave svm_register_enc_region() and svm_unregister_enc_region() as
stubs, since QEMU depends on this API.
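One detail worth illustrating is how a single pin request at a hugepage
level fans out over the 4K-granular pinned bitmap. The sketch below
models only that fan-out; the names and the return value are invented
for illustration, and the real pinning hook also swaps stale pfns and
takes page references:

```c
#include <assert.h>

#define PG_LEVEL_4K 1
#define PG_LEVEL_2M 2
#define PG_LEVEL_1G 3
/* Each level up covers 9 more address bits: 1, 512, 512*512 pages. */
#define PAGES_PER_HPAGE(level) (1UL << (((level) - PG_LEVEL_4K) * 9))

/* Mark every 4K page covered by a mapping at 'level' as pinned,
 * returning how many pages were newly pinned. */
static unsigned long pin_range(unsigned char *bitmap, unsigned long rel_gfn,
                               int level)
{
    unsigned long npages = PAGES_PER_HPAGE(level);
    unsigned long newly_pinned = 0;

    for (unsigned long i = 0; i < npages; i++) {
        if (!bitmap[rel_gfn + i]) {
            bitmap[rel_gfn + i] = 1;  /* set_bit + get_page in the patch */
            newly_pinned++;
        }
    }
    return newly_pinned;
}
```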
Signed-off-by: Nikunj A Dadhania <[email protected]>
---
arch/x86/kvm/svm/sev.c | 208 ++++++++++++++++-------------------------
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/svm/svm.h | 3 +-
3 files changed, 81 insertions(+), 131 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index d972ab4956d4..a962bed97a0b 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -66,14 +66,6 @@ static unsigned int nr_asids;
static unsigned long *sev_asid_bitmap;
static unsigned long *sev_reclaim_asid_bitmap;
-struct enc_region {
- struct list_head list;
- unsigned long npages;
- struct page **pages;
- unsigned long uaddr;
- unsigned long size;
-};
-
/* Called with the sev_bitmap_lock held, or on shutdown */
static int sev_flush_asids(int min_asid, int max_asid)
{
@@ -257,8 +249,6 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
if (ret)
goto e_free;
- INIT_LIST_HEAD(&sev->regions_list);
-
return 0;
e_free:
@@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
src->handle = 0;
src->pages_locked = 0;
src->enc_context_owner = NULL;
-
- list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
}
static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
@@ -1861,115 +1849,13 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
int svm_register_enc_region(struct kvm *kvm,
struct kvm_enc_region *range)
{
- struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
- struct enc_region *region;
- int ret = 0;
-
- if (!sev_guest(kvm))
- return -ENOTTY;
-
- /* If kvm is mirroring encryption context it isn't responsible for it */
- if (is_mirroring_enc_context(kvm))
- return -EINVAL;
-
- if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
- return -EINVAL;
-
- region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
- if (!region)
- return -ENOMEM;
-
- mutex_lock(&kvm->lock);
- region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, 1);
- if (IS_ERR(region->pages)) {
- ret = PTR_ERR(region->pages);
- mutex_unlock(&kvm->lock);
- goto e_free;
- }
-
- region->uaddr = range->addr;
- region->size = range->size;
-
- list_add_tail(®ion->list, &sev->regions_list);
- mutex_unlock(&kvm->lock);
-
- /*
- * The guest may change the memory encryption attribute from C=0 -> C=1
- * or vice versa for this memory range. Lets make sure caches are
- * flushed to ensure that guest data gets written into memory with
- * correct C-bit.
- */
- sev_clflush_pages(region->pages, region->npages);
-
- return ret;
-
-e_free:
- kfree(region);
- return ret;
-}
-
-static struct enc_region *
-find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
-{
- struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
- struct list_head *head = &sev->regions_list;
- struct enc_region *i;
-
- list_for_each_entry(i, head, list) {
- if (i->uaddr == range->addr &&
- i->size == range->size)
- return i;
- }
-
- return NULL;
-}
-
-static void __unregister_enc_region_locked(struct kvm *kvm,
- struct enc_region *region)
-{
- sev_unpin_memory(kvm, region->pages, region->npages);
- list_del(®ion->list);
- kfree(region);
+ return 0;
}
int svm_unregister_enc_region(struct kvm *kvm,
struct kvm_enc_region *range)
{
- struct enc_region *region;
- int ret;
-
- /* If kvm is mirroring encryption context it isn't responsible for it */
- if (is_mirroring_enc_context(kvm))
- return -EINVAL;
-
- mutex_lock(&kvm->lock);
-
- if (!sev_guest(kvm)) {
- ret = -ENOTTY;
- goto failed;
- }
-
- region = find_enc_region(kvm, range);
- if (!region) {
- ret = -EINVAL;
- goto failed;
- }
-
- /*
- * Ensure that all guest tagged cache entries are flushed before
- * releasing the pages back to the system for use. CLFLUSH will
- * not do this, so issue a WBINVD.
- */
- wbinvd_on_all_cpus();
-
- __unregister_enc_region_locked(kvm, region);
-
- mutex_unlock(&kvm->lock);
return 0;
-
-failed:
- mutex_unlock(&kvm->lock);
- return ret;
}
int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
@@ -2018,7 +1904,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
mirror_sev->fd = source_sev->fd;
mirror_sev->es_active = source_sev->es_active;
mirror_sev->handle = source_sev->handle;
- INIT_LIST_HEAD(&mirror_sev->regions_list);
ret = 0;
/*
@@ -2038,8 +1923,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
void sev_vm_destroy(struct kvm *kvm)
{
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
- struct list_head *head = &sev->regions_list;
- struct list_head *pos, *q;
WARN_ON(sev->num_mirrored_vms);
@@ -2066,18 +1949,6 @@ void sev_vm_destroy(struct kvm *kvm)
*/
wbinvd_on_all_cpus();
- /*
- * if userspace was terminated before unregistering the memory regions
- * then lets unpin all the registered memory.
- */
- if (!list_empty(head)) {
- list_for_each_safe(pos, q, head) {
- __unregister_enc_region_locked(kvm,
- list_entry(pos, struct enc_region, list));
- cond_resched();
- }
- }
-
sev_unbind_asid(kvm, sev->handle);
sev_asid_free(sev);
}
@@ -2946,13 +2817,90 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
}
+void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn)
+{
+ struct kvm_arch_memory_slot *aslot;
+ struct kvm_memory_slot *slot;
+ gfn_t rel_gfn, pin_pfn;
+ unsigned long npages;
+ kvm_pfn_t old_pfn;
+ int i;
+
+ if (!sev_guest(kvm))
+ return;
+
+ if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
+ return;
+
+ /* Tested till 1GB pages */
+ if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
+ return;
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!slot || !slot->arch.pfns)
+ return;
+
+ /*
+ * Use relative gfn index within the memslot for the bitmap as well as
+ * the pfns array
+ */
+ rel_gfn = gfn - slot->base_gfn;
+ aslot = &slot->arch;
+ pin_pfn = pfn;
+ npages = KVM_PAGES_PER_HPAGE(level);
+
+ /* Pin the page, KVM doesn't yet support page migration. */
+ for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
+ if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
+ old_pfn = aslot->pfns[rel_gfn];
+ if (old_pfn == pin_pfn)
+ continue;
+
+ put_page(pfn_to_page(old_pfn));
+ }
+
+ set_bit(rel_gfn, aslot->pinned_bitmap);
+ aslot->pfns[rel_gfn] = pin_pfn;
+ get_page(pfn_to_page(pin_pfn));
+ }
+
+ /*
+ * Flush any cached lines of the page being added since "ownership" of
+ * it will be transferred from the host to an encrypted guest.
+ */
+ clflush_cache_range(__va(pfn << PAGE_SHIFT), page_level_size(level));
+}
+
void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
struct kvm_arch_memory_slot *aslot = &slot->arch;
+ kvm_pfn_t *pfns;
+ gfn_t gfn;
+ int i;
if (!sev_guest(kvm))
return;
+ if (!aslot->pinned_bitmap || !slot->arch.pfns)
+ goto out;
+
+ pfns = aslot->pfns;
+
+ /*
+ * Iterate the memslot to find the pinned pfn using the bitmap and drop
+ * the pfn stored.
+ */
+ for (i = 0, gfn = slot->base_gfn; i < slot->npages; i++, gfn++) {
+ if (test_and_clear_bit(i, aslot->pinned_bitmap)) {
+ if (WARN_ON(!pfns[i]))
+ continue;
+
+ put_page(pfn_to_page(pfns[i]));
+ }
+ }
+
+out:
if (aslot->pinned_bitmap) {
kvfree(aslot->pinned_bitmap);
aslot->pinned_bitmap = NULL;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3fb19974f719..22535c680b3f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4743,6 +4743,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.alloc_memslot_metadata = sev_alloc_memslot_metadata,
.free_memslot = sev_free_memslot,
+ .pin_spte = sev_pin_spte,
};
static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index b2f8b3b52680..c731bc91ea8f 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -77,7 +77,6 @@ struct kvm_sev_info {
unsigned int handle; /* SEV firmware handle */
int fd; /* SEV device fd */
unsigned long pages_locked; /* Number of pages locked */
- struct list_head regions_list; /* List of registered regions */
u64 ap_jump_table; /* SEV-ES AP Jump Table address */
struct kvm *enc_context_owner; /* Owner of copied encryption context */
unsigned long num_mirrored_vms; /* Number of VMs sharing this ASID */
@@ -648,5 +647,7 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
struct kvm_memory_slot *new);
void sev_free_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot);
+void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn);
#endif
--
2.32.0
An AMD SEV guest requires its pages to be pinned in host physical
memory. The memory encryption scheme uses the physical address of the
memory being encrypted, so if guest pages are moved, the decrypted
content would be incorrect, corrupting the guest's memory.
For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
encrypted or when the guest is done using them, so it should treat all
guest pages as encrypted until the guest is destroyed.
The KVM MMU needs to track the pinned pages and their corresponding
pfns so that they can be unpinned in the guest destroy path.
Signed-off-by: Nikunj A Dadhania <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 7 +++++
arch/x86/kvm/svm/sev.c | 49 ++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 3 ++
arch/x86/kvm/svm/svm.h | 6 ++++
arch/x86/kvm/x86.c | 11 ++++++-
6 files changed, 77 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index a96c52a99a04..da03250f503c 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -87,6 +87,8 @@ KVM_X86_OP(set_tss_addr)
KVM_X86_OP(set_identity_map_addr)
KVM_X86_OP(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP(alloc_memslot_metadata)
+KVM_X86_OP(free_memslot)
KVM_X86_OP(pin_spte)
KVM_X86_OP_NULL(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1263a16dd588..c235597f8442 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -927,6 +927,8 @@ struct kvm_arch_memory_slot {
struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
unsigned short *gfn_track[KVM_PAGE_TRACK_MAX];
+ unsigned long *pinned_bitmap;
+ kvm_pfn_t *pfns;
};
/*
@@ -1417,6 +1419,11 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
+ int (*alloc_memslot_metadata)(struct kvm *kvm,
+ const struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new);
+ void (*free_memslot)(struct kvm *kvm,
+ struct kvm_memory_slot *slot);
void (*pin_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
kvm_pfn_t pfn);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6a22798eaaee..d972ab4956d4 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2945,3 +2945,52 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
}
+
+void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ struct kvm_arch_memory_slot *aslot = &slot->arch;
+
+ if (!sev_guest(kvm))
+ return;
+
+ if (aslot->pinned_bitmap) {
+ kvfree(aslot->pinned_bitmap);
+ aslot->pinned_bitmap = NULL;
+ }
+
+ if (aslot->pfns) {
+ kvfree(aslot->pfns);
+ aslot->pfns = NULL;
+ }
+}
+
+int sev_alloc_memslot_metadata(struct kvm *kvm,
+ const struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new)
+{
+ struct kvm_arch_memory_slot *aslot = &new->arch;
+ unsigned long pinned_bytes = new->npages * sizeof(kvm_pfn_t);
+
+ if (!sev_guest(kvm))
+ return 0;
+
+ if (old && old->arch.pinned_bitmap && old->arch.pfns) {
+ WARN_ON(old->npages != new->npages);
+ aslot->pinned_bitmap = old->arch.pinned_bitmap;
+ aslot->pfns = old->arch.pfns;
+ return 0;
+ }
+
+ aslot->pfns = kvcalloc(new->npages, sizeof(*aslot->pfns),
+ GFP_KERNEL_ACCOUNT);
+ if (!aslot->pfns)
+ return -ENOMEM;
+
+ aslot->pinned_bitmap = kvzalloc(pinned_bytes, GFP_KERNEL_ACCOUNT);
+ if (!aslot->pinned_bitmap) {
+ kvfree(aslot->pfns);
+ aslot->pfns = NULL;
+ return -ENOMEM;
+ }
+ return 0;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 46bcc706f257..3fb19974f719 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4740,6 +4740,9 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.complete_emulated_msr = svm_complete_emulated_msr,
.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+
+ .alloc_memslot_metadata = sev_alloc_memslot_metadata,
+ .free_memslot = sev_free_memslot,
};
static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9f153c59f2c8..b2f8b3b52680 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -643,4 +643,10 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm);
void __svm_sev_es_vcpu_run(unsigned long vmcb_pa);
void __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs);
+int sev_alloc_memslot_metadata(struct kvm *kvm,
+ const struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new);
+void sev_free_memslot(struct kvm *kvm,
+ struct kvm_memory_slot *slot);
+
#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 76b4803dd3bd..9e07e2ef8885 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11694,6 +11694,7 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
}
kvm_page_track_free_memslot(slot);
+ static_call_cond(kvm_x86_free_memslot)(kvm, slot);
}
int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
@@ -11719,6 +11720,7 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
}
static int kvm_alloc_memslot_metadata(struct kvm *kvm,
+ const struct kvm_memory_slot *old,
struct kvm_memory_slot *slot)
{
unsigned long npages = slot->npages;
@@ -11771,8 +11773,15 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
if (kvm_page_track_create_memslot(kvm, slot, npages))
goto out_free;
+ if (kvm_x86_ops.alloc_memslot_metadata &&
+ static_call(kvm_x86_alloc_memslot_metadata)(kvm, old, slot))
+ goto out_free_page_track;
+
return 0;
+out_free_page_track:
+ kvm_page_track_free_memslot(slot);
+
out_free:
memslot_rmap_free(slot);
@@ -11805,7 +11814,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
enum kvm_mr_change change)
{
if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
- return kvm_alloc_memslot_metadata(kvm, new);
+ return kvm_alloc_memslot_metadata(kvm, old, new);
if (change == KVM_MR_FLAGS_ONLY)
memcpy(&new->arch, &old->arch, sizeof(old->arch));
--
2.32.0
Create a separate routine, sev_alloc_pages(), for allocating SEV pages.
This will be used by the MMU-based pinning that follows.
While at it, validate the number of pages before the RLIMIT check and
use kzalloc() instead of kmalloc().
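The range-to-page-count math that sev_alloc_pages() keeps from
sev_pin_memory() is worth spelling out: the number of pages spanned by
[uaddr, uaddr + ulen) depends on the alignment of the range, not just
its length. A small sketch (span_npages() is an invented name; the
PAGE_* values are the usual x86 ones):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Number of pages touched by [uaddr, uaddr + ulen), or -1 for an
 * empty or wrapping range (the same sanity check the patch keeps). */
static long span_npages(unsigned long uaddr, unsigned long ulen)
{
    unsigned long first, last;

    if (ulen == 0 || uaddr + ulen < uaddr)
        return -1;

    first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
    last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
    return (long)(last - first + 1);
}
```

A one-byte buffer straddling a page boundary counts as two pages, which
is why the RLIMIT_MEMLOCK accounting must use this count rather than
ulen / PAGE_SIZE.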
Signed-off-by: Nikunj A Dadhania <[email protected]>
---
arch/x86/kvm/svm/sev.c | 44 +++++++++++++++++++++++++++---------------
1 file changed, 28 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a962bed97a0b..14aeccfc500b 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -368,19 +368,13 @@ static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
return ret;
}
-static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
- unsigned long ulen, unsigned long *n,
- int write)
+static void *sev_alloc_pages(struct kvm_sev_info *sev, unsigned long uaddr,
+ unsigned long ulen, unsigned long *n)
{
- struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
- unsigned long npages, size;
- int npinned;
unsigned long locked, lock_limit;
- struct page **pages;
+ unsigned long npages, size;
unsigned long first, last;
- int ret;
-
- lockdep_assert_held(&kvm->lock);
+ struct page **pages;
if (ulen == 0 || uaddr + ulen < uaddr)
return ERR_PTR(-EINVAL);
@@ -390,6 +384,9 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
npages = (last - first + 1);
+ if (WARN_ON_ONCE(npages > INT_MAX))
+ return ERR_PTR(-EINVAL);
+
locked = sev->pages_locked + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
@@ -397,19 +394,34 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
return ERR_PTR(-ENOMEM);
}
- if (WARN_ON_ONCE(npages > INT_MAX))
- return ERR_PTR(-EINVAL);
-
/* Avoid using vmalloc for smaller buffers. */
size = npages * sizeof(struct page *);
if (size > PAGE_SIZE)
pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
else
- pages = kmalloc(size, GFP_KERNEL_ACCOUNT);
+ pages = kzalloc(size, GFP_KERNEL_ACCOUNT);
- if (!pages)
- return ERR_PTR(-ENOMEM);
+ *n = pages ? npages : 0;
+ return pages;
+}
+static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
+ unsigned long ulen, unsigned long *n,
+ int write)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ unsigned long npages, locked;
+ struct page **pages;
+ int npinned;
+ int ret;
+
+ lockdep_assert_held(&kvm->lock);
+
+ pages = sev_alloc_pages(sev, uaddr, ulen, &npages);
+ if (IS_ERR(pages))
+ return pages;
+
+ locked = sev->pages_locked + npages;
/* Pin the user virtual address. */
npinned = pin_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages);
if (npinned != npages) {
--
2.32.0
Hi Nikunj,
On 18.01.2022 12:06, Nikunj A Dadhania wrote:
> From: Sean Christopherson <[email protected]>
>
> Pin the memory for the data being passed to launch_update_data()
> because it gets encrypted before the guest is first run and must
> not be moved which would corrupt it.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> [ * Changed hva_to_gva() to take an extra argument and return gpa_t.
> * Updated sev_pin_memory_in_mmu() error handling.
> * As pinning/unpining pages is handled within MMU, removed
> {get,put}_user(). ]
> Signed-off-by: Nikunj A Dadhania <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 119 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 14aeccfc500b..1ae714e83a3c 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -22,6 +22,7 @@
> #include <asm/trapnr.h>
> #include <asm/fpu/xcr.h>
>
> +#include "mmu.h"
> #include "x86.h"
> #include "svm.h"
> #include "svm_ops.h"
> @@ -490,6 +491,110 @@ static unsigned long get_num_contig_pages(unsigned long idx,
> return pages;
> }
>
> +#define SEV_PFERR_RO (PFERR_USER_MASK)
> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
> +
> +static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
> + unsigned long hva)
> +{
> + struct kvm_memslots *slots = kvm_memslots(kvm);
> + struct kvm_memory_slot *memslot;
> + int bkt;
> +
> + kvm_for_each_memslot(memslot, bkt, slots) {
> + if (hva >= memslot->userspace_addr &&
> + hva < memslot->userspace_addr +
> + (memslot->npages << PAGE_SHIFT))
> + return memslot;
> + }
> +
> + return NULL;
> +}
We have kvm_for_each_memslot_in_hva_range() now, please don't do a linear
search through memslots.
You might need to move the aforementioned macro from kvm_main.c to some
header file, though.
> +static gpa_t hva_to_gpa(struct kvm *kvm, unsigned long hva, bool *ro)
> +{
> + struct kvm_memory_slot *memslot;
> + gpa_t gpa_offset;
> +
> + memslot = hva_to_memslot(kvm, hva);
> + if (!memslot)
> + return UNMAPPED_GVA;
> +
> + *ro = !!(memslot->flags & KVM_MEM_READONLY);
> + gpa_offset = hva - memslot->userspace_addr;
> + return ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
> +}
> +
> +static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
> + unsigned long size,
> + unsigned long *npages)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct kvm_vcpu *vcpu;
> + struct page **pages;
> + unsigned long i;
> + u32 error_code;
> + kvm_pfn_t pfn;
> + int idx, ret = 0;
> + gpa_t gpa;
> + bool ro;
> +
> + pages = sev_alloc_pages(sev, addr, size, npages);
> + if (IS_ERR(pages))
> + return pages;
> +
> + vcpu = kvm_get_vcpu(kvm, 0);
> + if (mutex_lock_killable(&vcpu->mutex)) {
> + kvfree(pages);
> + return ERR_PTR(-EINTR);
> + }
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&kvm->srcu);
> +
> + kvm_mmu_load(vcpu);
> +
> + for (i = 0; i < *npages; i++, addr += PAGE_SIZE) {
> + if (signal_pending(current)) {
> + ret = -ERESTARTSYS;
> + break;
> + }
> +
> + if (need_resched())
> + cond_resched();
> +
> + gpa = hva_to_gpa(kvm, addr, &ro);
> + if (gpa == UNMAPPED_GVA) {
> + ret = -EFAULT;
> + break;
> + }
This function is going to have worst case O(n²) complexity if called with
the whole VM memory (or O(n * log(n)) when hva_to_memslot() is modified
to use kvm_for_each_memslot_in_hva_range()).
That's really bad for something that can be done in O(n) time - look how
kvm_for_each_memslot_in_gfn_range() does it over gfns.
Thanks,
Maciej
On 18.01.2022 16:00, Maciej S. Szmigiero wrote:
> Hi Nikunj,
>
> On 18.01.2022 12:06, Nikunj A Dadhania wrote:
>> From: Sean Christopherson <[email protected]>
>>
>> Pin the memory for the data being passed to launch_update_data()
>> because it gets encrypted before the guest is first run and must
>> not be moved which would corrupt it.
>>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> [ * Changed hva_to_gva() to take an extra argument and return gpa_t.
>> * Updated sev_pin_memory_in_mmu() error handling.
>> * As pinning/unpining pages is handled within MMU, removed
>> {get,put}_user(). ]
>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>> ---
>> arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 119 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 14aeccfc500b..1ae714e83a3c 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -22,6 +22,7 @@
>> #include <asm/trapnr.h>
>> #include <asm/fpu/xcr.h>
>> +#include "mmu.h"
>> #include "x86.h"
>> #include "svm.h"
>> #include "svm_ops.h"
>> @@ -490,6 +491,110 @@ static unsigned long get_num_contig_pages(unsigned long idx,
>> return pages;
>> }
>> +#define SEV_PFERR_RO (PFERR_USER_MASK)
>> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
>> +
>> +static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
>> + unsigned long hva)
>> +{
>> + struct kvm_memslots *slots = kvm_memslots(kvm);
>> + struct kvm_memory_slot *memslot;
>> + int bkt;
>> +
>> + kvm_for_each_memslot(memslot, bkt, slots) {
>> + if (hva >= memslot->userspace_addr &&
>> + hva < memslot->userspace_addr +
>> + (memslot->npages << PAGE_SHIFT))
>> + return memslot;
>> + }
>> +
>> + return NULL;
>> +}
>
> We have kvm_for_each_memslot_in_hva_range() now, please don't do a linear
> search through memslots.
> You might need to move the aforementioned macro from kvm_main.c to some
> header file, though.
Besides performance considerations I can't see the code here taking into
account the fact that a hva can map to multiple memslots (they can overlap
in the host address space).
Thanks,
Maciej
Hi Maciej,
On 1/18/2022 8:30 PM, Maciej S. Szmigiero wrote:
> Hi Nikunj,
>
> On 18.01.2022 12:06, Nikunj A Dadhania wrote:
>> From: Sean Christopherson <[email protected]>
>>
>> Pin the memory for the data being passed to launch_update_data()
>> because it gets encrypted before the guest is first run and must
>> not be moved which would corrupt it.
>>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> [ * Changed hva_to_gva() to take an extra argument and return gpa_t.
>> * Updated sev_pin_memory_in_mmu() error handling.
>> * As pinning/unpining pages is handled within MMU, removed
>> {get,put}_user(). ]
>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>> ---
>> arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 119 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 14aeccfc500b..1ae714e83a3c 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -22,6 +22,7 @@
>> #include <asm/trapnr.h>
>> #include <asm/fpu/xcr.h>
>> +#include "mmu.h"
>> #include "x86.h"
>> #include "svm.h"
>> #include "svm_ops.h"
>> @@ -490,6 +491,110 @@ static unsigned long get_num_contig_pages(unsigned long idx,
>> return pages;
>> }
>> +#define SEV_PFERR_RO (PFERR_USER_MASK)
>> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
>> +
>> +static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
>> + unsigned long hva)
>> +{
>> + struct kvm_memslots *slots = kvm_memslots(kvm);
>> + struct kvm_memory_slot *memslot;
>> + int bkt;
>> +
>> + kvm_for_each_memslot(memslot, bkt, slots) {
>> + if (hva >= memslot->userspace_addr &&
>> + hva < memslot->userspace_addr +
>> + (memslot->npages << PAGE_SHIFT))
>> + return memslot;
>> + }
>> +
>> + return NULL;
>> +}
>
> We have kvm_for_each_memslot_in_hva_range() now, please don't do a linear
> search through memslots.
> You might need to move the aforementioned macro from kvm_main.c to some
> header file, though.
Sure, let me try optimizing with this newly added macro.
>
>> +static gpa_t hva_to_gpa(struct kvm *kvm, unsigned long hva, bool *ro)
>> +{
>> + struct kvm_memory_slot *memslot;
>> + gpa_t gpa_offset;
>> +
>> + memslot = hva_to_memslot(kvm, hva);
>> + if (!memslot)
>> + return UNMAPPED_GVA;
>> +
>> + *ro = !!(memslot->flags & KVM_MEM_READONLY);
>> + gpa_offset = hva - memslot->userspace_addr;
>> + return ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
>> +}
>> +
>> +static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
>> + unsigned long size,
>> + unsigned long *npages)
>> +{
>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> + struct kvm_vcpu *vcpu;
>> + struct page **pages;
>> + unsigned long i;
>> + u32 error_code;
>> + kvm_pfn_t pfn;
>> + int idx, ret = 0;
>> + gpa_t gpa;
>> + bool ro;
>> +
>> + pages = sev_alloc_pages(sev, addr, size, npages);
>> + if (IS_ERR(pages))
>> + return pages;
>> +
>> + vcpu = kvm_get_vcpu(kvm, 0);
>> + if (mutex_lock_killable(&vcpu->mutex)) {
>> + kvfree(pages);
>> + return ERR_PTR(-EINTR);
>> + }
>> +
>> + vcpu_load(vcpu);
>> + idx = srcu_read_lock(&kvm->srcu);
>> +
>> + kvm_mmu_load(vcpu);
>> +
>> + for (i = 0; i < *npages; i++, addr += PAGE_SIZE) {
>> + if (signal_pending(current)) {
>> + ret = -ERESTARTSYS;
>> + break;
>> + }
>> +
>> + if (need_resched())
>> + cond_resched();
>> +
>> + gpa = hva_to_gpa(kvm, addr, &ro);
>> + if (gpa == UNMAPPED_GVA) {
>> + ret = -EFAULT;
>> + break;
>> + }
>
> This function is going to have worst case O(n²) complexity if called with
> the whole VM memory (or O(n * log(n)) when hva_to_memslot() is modified
> to use kvm_for_each_memslot_in_hva_range()).
I understand your concern and will address it. BTW, this is called for a small
fragment of VM memory (<10MB) that needs to be pinned before guest execution
starts.
> That's really bad for something that can be done in O(n) time - look how
> kvm_for_each_memslot_in_gfn_range() does it over gfns.
>
I saw one use of kvm_for_each_memslot_in_gfn_range() in __kvm_zap_rmaps(), and
that too calls slot_handle_level_range() which has a for_each_slot_rmap_range().
How would that be O(n) ?
kvm_for_each_memslot_in_gfn_range() {
...
slot_handle_level_range()
...
}
slot_handle_level_range() {
...
for_each_slot_rmap_range() {
...
}
...
}
Regards,
Nikunj
On 1/18/2022 10:59 PM, Maciej S. Szmigiero wrote:
> On 18.01.2022 16:00, Maciej S. Szmigiero wrote:
>> Hi Nikunj,
>>
>> On 18.01.2022 12:06, Nikunj A Dadhania wrote:
>>> From: Sean Christopherson <[email protected]>
>>>
>>> Pin the memory for the data being passed to launch_update_data()
>>> because it gets encrypted before the guest is first run and must
>>> not be moved which would corrupt it.
>>>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> [ * Changed hva_to_gva() to take an extra argument and return gpa_t.
>>> * Updated sev_pin_memory_in_mmu() error handling.
>>> * As pinning/unpining pages is handled within MMU, removed
>>> {get,put}_user(). ]
>>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>>> ---
>>> arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++++++++++++++++++++++++-
>>> 1 file changed, 119 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>> index 14aeccfc500b..1ae714e83a3c 100644
>>> --- a/arch/x86/kvm/svm/sev.c
>>> +++ b/arch/x86/kvm/svm/sev.c
>>> @@ -22,6 +22,7 @@
>>> #include <asm/trapnr.h>
>>> #include <asm/fpu/xcr.h>
>>> +#include "mmu.h"
>>> #include "x86.h"
>>> #include "svm.h"
>>> #include "svm_ops.h"
>>> @@ -490,6 +491,110 @@ static unsigned long get_num_contig_pages(unsigned long idx,
>>> return pages;
>>> }
>>> +#define SEV_PFERR_RO (PFERR_USER_MASK)
>>> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
>>> +
>>> +static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
>>> + unsigned long hva)
>>> +{
>>> + struct kvm_memslots *slots = kvm_memslots(kvm);
>>> + struct kvm_memory_slot *memslot;
>>> + int bkt;
>>> +
>>> + kvm_for_each_memslot(memslot, bkt, slots) {
>>> + if (hva >= memslot->userspace_addr &&
>>> + hva < memslot->userspace_addr +
>>> + (memslot->npages << PAGE_SHIFT))
>>> + return memslot;
>>> + }
>>> +
>>> + return NULL;
>>> +}
>>
>> We have kvm_for_each_memslot_in_hva_range() now, please don't do a linear
>> search through memslots.
>> You might need to move the aforementioned macro from kvm_main.c to some
>> header file, though.
>
> Besides performance considerations I can't see the code here taking into
> account the fact that a hva can map to multiple memslots (they can overlap
> in the host address space).
You are right, I was returning at the first match. It looks like if I switch to
using kvm_for_each_memslot_in_hva_range() it should take care of overlapping
hvas. Is this understanding correct?
Regards
Nikunj
On 19.01.2022 07:33, Nikunj A. Dadhania wrote:
> Hi Maciej,
>
> On 1/18/2022 8:30 PM, Maciej S. Szmigiero wrote:
>> Hi Nikunj,
>>
>> On 18.01.2022 12:06, Nikunj A Dadhania wrote:
>>> From: Sean Christopherson <[email protected]>
>>>
>>> Pin the memory for the data being passed to launch_update_data()
>>> because it gets encrypted before the guest is first run and must
>>> not be moved which would corrupt it.
>>>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> [ * Changed hva_to_gva() to take an extra argument and return gpa_t.
>>> * Updated sev_pin_memory_in_mmu() error handling.
>>> * As pinning/unpining pages is handled within MMU, removed
>>> {get,put}_user(). ]
>>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>>> ---
>>> arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++++++++++++++++++++++++-
>>> 1 file changed, 119 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>> index 14aeccfc500b..1ae714e83a3c 100644
>>> --- a/arch/x86/kvm/svm/sev.c
>>> +++ b/arch/x86/kvm/svm/sev.c
>>> @@ -22,6 +22,7 @@
>>> #include <asm/trapnr.h>
>>> #include <asm/fpu/xcr.h>
>>> +#include "mmu.h"
>>> #include "x86.h"
>>> #include "svm.h"
>>> #include "svm_ops.h"
>>> @@ -490,6 +491,110 @@ static unsigned long get_num_contig_pages(unsigned long idx,
>>> return pages;
>>> }
>>> +#define SEV_PFERR_RO (PFERR_USER_MASK)
>>> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
>>> +
>>> +static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
>>> + unsigned long hva)
>>> +{
>>> + struct kvm_memslots *slots = kvm_memslots(kvm);
>>> + struct kvm_memory_slot *memslot;
>>> + int bkt;
>>> +
>>> + kvm_for_each_memslot(memslot, bkt, slots) {
>>> + if (hva >= memslot->userspace_addr &&
>>> + hva < memslot->userspace_addr +
>>> + (memslot->npages << PAGE_SHIFT))
>>> + return memslot;
>>> + }
>>> +
>>> + return NULL;
>>> +}
>>
>> We have kvm_for_each_memslot_in_hva_range() now, please don't do a linear
>> search through memslots.
>> You might need to move the aforementioned macro from kvm_main.c to some
>> header file, though.
>
> Sure, let me try optimizing with this newly added macro.
>>
>>> +static gpa_t hva_to_gpa(struct kvm *kvm, unsigned long hva, bool *ro)
>>> +{
>>> + struct kvm_memory_slot *memslot;
>>> + gpa_t gpa_offset;
>>> +
>>> + memslot = hva_to_memslot(kvm, hva);
>>> + if (!memslot)
>>> + return UNMAPPED_GVA;
>>> +
>>> + *ro = !!(memslot->flags & KVM_MEM_READONLY);
>>> + gpa_offset = hva - memslot->userspace_addr;
>>> + return ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
>>> +}
>>> +
>>> +static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
>>> + unsigned long size,
>>> + unsigned long *npages)
>>> +{
>>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>> + struct kvm_vcpu *vcpu;
>>> + struct page **pages;
>>> + unsigned long i;
>>> + u32 error_code;
>>> + kvm_pfn_t pfn;
>>> + int idx, ret = 0;
>>> + gpa_t gpa;
>>> + bool ro;
>>> +
>>> + pages = sev_alloc_pages(sev, addr, size, npages);
>>> + if (IS_ERR(pages))
>>> + return pages;
>>> +
>>> + vcpu = kvm_get_vcpu(kvm, 0);
>>> + if (mutex_lock_killable(&vcpu->mutex)) {
>>> + kvfree(pages);
>>> + return ERR_PTR(-EINTR);
>>> + }
>>> +
>>> + vcpu_load(vcpu);
>>> + idx = srcu_read_lock(&kvm->srcu);
>>> +
>>> + kvm_mmu_load(vcpu);
>>> +
>>> + for (i = 0; i < *npages; i++, addr += PAGE_SIZE) {
>>> + if (signal_pending(current)) {
>>> + ret = -ERESTARTSYS;
>>> + break;
>>> + }
>>> +
>>> + if (need_resched())
>>> + cond_resched();
>>> +
>>> + gpa = hva_to_gpa(kvm, addr, &ro);
>>> + if (gpa == UNMAPPED_GVA) {
>>> + ret = -EFAULT;
>>> + break;
>>> + }
>>
>> This function is going to have worst case O(n²) complexity if called with
>> the whole VM memory (or O(n * log(n)) when hva_to_memslot() is modified
>> to use kvm_for_each_memslot_in_hva_range()).
>
> I understand your concern and will address it. BTW, this is called for a small
> fragment of VM memory( <10MB), that needs to be pinned before the guest execution
> starts.
I understand it is a relatively small memory area now, but a rewrite of
this patch that makes use of kvm_for_each_memslot_in_hva_range() while
taking care of other considerations (like overlapping hva) will also
solve the performance issue.
>> That's really bad for something that can be done in O(n) time - look how
>> kvm_for_each_memslot_in_gfn_range() does it over gfns.
>>
>
> I saw one use of kvm_for_each_memslot_in_gfn_range() in __kvm_zap_rmaps(), and
> that too calls slot_handle_level_range() which has a for_each_slot_rmap_range().
> How would that be O(n) ?
>
> kvm_for_each_memslot_in_gfn_range() {
> ...
> slot_handle_level_range()
> ...
> }
>
> slot_handle_level_range() {
> ...
> for_each_slot_rmap_range() {
> ...
> }
> ...
> }
kvm_for_each_memslot_in_gfn_range() iterates over gfns, which are unique,
so at most one memslot is returned per gfn (and if a memslot covers
multiple gfns in the requested range it will be returned just once).
for_each_slot_rmap_range() then iterates over rmaps covering that
*single* memslot: look at slot_rmap_walk_next() - the memslot under
iteration is not advanced.
So each memslot returned by kvm_for_each_memslot_in_gfn_range() is
iterated over just once by the aforementioned macro.
>> Besides performance considerations I can't see the code here taking into
>> account the fact that a hva can map to multiple memslots (they can overlap
>> in the host address space).
>
> You are right I was returning at the first match, looks like if I switch to using
> kvm_for_each_memslot_in_hva_range() it should take care of overlapping hva,
> is this understanding correct ?
Let's say that the requested range of hva for sev_pin_memory_in_mmu() to
handle is 0x1000 - 0x2000.
If there are three memslots:
1: hva 0x1000 - 0x2000 -> gpa 0x1000 - 0x2000
2: hva 0x1000 - 0x2000 -> gpa 0x2000 - 0x3000
3: hva 0x2000 - 0x3000 -> gpa 0x3000 - 0x4000
then kvm_for_each_memslot_in_hva_range() will return the first two,
essentially covering the hva range of 0x1000 - 0x2000 twice.
If such hva aliases are permitted the code has to be ready for this case
and handle it sensibly:
If you need to return just a single struct page per hva AND / OR pin
operations aren't idempotent then it has to keep track which hva were
already processed.
Another, and probably the easiest option would be to simply disallow
such overlapping memslots in the requested range and make
KVM_SEV_LAUNCH_UPDATE_DATA ioctl return something like EINVAL in this
case - if that would be acceptable semantics for this ioctl.
In any case, the main loop in sev_pin_memory_in_mmu() will probably
need to be built around a kvm_for_each_memslot_in_hva_range() call,
which will then solve the performance issue, too.
> Regards,
> Nikunj
Thanks,
Maciej
On 1/20/2022 12:22 AM, Maciej S. Szmigiero wrote:
> On 19.01.2022 07:33, Nikunj A. Dadhania wrote:
>> On 1/18/2022 8:30 PM, Maciej S. Szmigiero wrote:
>>> On 18.01.2022 12:06, Nikunj A Dadhania wrote:
>>>> +static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
>>>> + unsigned long size,
>>>> + unsigned long *npages)
>>>> +{
>>>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>> + struct kvm_vcpu *vcpu;
>>>> + struct page **pages;
>>>> + unsigned long i;
>>>> + u32 error_code;
>>>> + kvm_pfn_t pfn;
>>>> + int idx, ret = 0;
>>>> + gpa_t gpa;
>>>> + bool ro;
>>>> +
>>>> + pages = sev_alloc_pages(sev, addr, size, npages);
>>>> + if (IS_ERR(pages))
>>>> + return pages;
>>>> +
>>>> + vcpu = kvm_get_vcpu(kvm, 0);
>>>> + if (mutex_lock_killable(&vcpu->mutex)) {
>>>> + kvfree(pages);
>>>> + return ERR_PTR(-EINTR);
>>>> + }
>>>> +
>>>> + vcpu_load(vcpu);
>>>> + idx = srcu_read_lock(&kvm->srcu);
>>>> +
>>>> + kvm_mmu_load(vcpu);
>>>> +
>>>> + for (i = 0; i < *npages; i++, addr += PAGE_SIZE) {
>>>> + if (signal_pending(current)) {
>>>> + ret = -ERESTARTSYS;
>>>> + break;
>>>> + }
>>>> +
>>>> + if (need_resched())
>>>> + cond_resched();
>>>> +
>>>> + gpa = hva_to_gpa(kvm, addr, &ro);
>>>> + if (gpa == UNMAPPED_GVA) {
>>>> + ret = -EFAULT;
>>>> + break;
>>>> + }
>>>
>>> This function is going to have worst case O(n²) complexity if called with
>>> the whole VM memory (or O(n * log(n)) when hva_to_memslot() is modified
>>> to use kvm_for_each_memslot_in_hva_range()).
>>
>> I understand your concern and will address it. BTW, this is called for a small
>> fragment of VM memory( <10MB), that needs to be pinned before the guest execution
>> starts.
>
> I understand it is a relatively small memory area now, but a rewrite of
> this patch that makes use of kvm_for_each_memslot_in_hva_range() while
> taking care of other considerations (like overlapping hva) will also
> solve the performance issue.
>
>>> That's really bad for something that can be done in O(n) time - look how
>>> kvm_for_each_memslot_in_gfn_range() does it over gfns.
>>>
>>
>> I saw one use of kvm_for_each_memslot_in_gfn_range() in __kvm_zap_rmaps(), and
>> that too calls slot_handle_level_range() which has a for_each_slot_rmap_range().
>> How would that be O(n) ?
>>
>> kvm_for_each_memslot_in_gfn_range() {
>> ...
>> slot_handle_level_range()
>> ...
>> }
>>
>> slot_handle_level_range() {
>> ...
>> for_each_slot_rmap_range() {
>> ...
>> }
>> ...
>> }
>
> kvm_for_each_memslot_in_gfn_range() iterates over gfns, which are unique,
> so at most one memslot is returned per gfn (and if a memslot covers
> multiple gfns in the requested range it will be returned just once).
>
> for_each_slot_rmap_range() then iterates over rmaps covering that
> *single* memslot: look at slot_rmap_walk_next() - the memslot under
> iteration is not advanced.
>
> So each memslot returned by kvm_for_each_memslot_in_gfn_range() is
> iterated over just once by the aforementioned macro.
>
>>> Besides performance considerations I can't see the code here taking into
>>> account the fact that a hva can map to multiple memslots (they can overlap
>>> in the host address space).
>>
>> You are right I was returning at the first match, looks like if I switch to using kvm_for_each_memslot_in_hva_range() it should take care of overlapping hva, is this understanding correct ?
>
> Let's say that the requested range of hva for sev_pin_memory_in_mmu() to
> handle is 0x1000 - 0x2000.
>
> If there are three memslots:
> 1: hva 0x1000 - 0x2000 -> gpa 0x1000 - 0x2000
> 2: hva 0x1000 - 0x2000 -> gpa 0x2000 - 0x3000
> 3: hva 0x2000 - 0x3000 -> gpa 0x3000 - 0x4000
>
> then kvm_for_each_memslot_in_hva_range() will return the first two,
> essentially covering the hva range of 0x1000 - 0x2000 twice.
>
> If such hva aliases are permitted the code has to be ready for this case
> and handle it sensibly:
> If you need to return just a single struct page per a hva AND / OR pin
> operations aren't idempotent then it has to keep track which hva were
> already processed.
>
> Another, and probably the easiest option would be to simply disallow
> such overlapping memslots in the requested range and make
> KVM_SEV_LAUNCH_UPDATE_DATA ioctl return something like EINVAL in this
> case - if that would be acceptable semantics for this ioctl.
>
> In any case, the main loop in sev_pin_memory_in_mmu() will probably
> need to be built around a kvm_for_each_memslot_in_hva_range() call,
> which will then solve the performance issue, too.
Sure, I have already tried it out and have the walk implemented using a
kvm_for_each_memslot_in_hva_range() call.
Regards
Nikunj
On Tue, Jan 18, 2022 at 4:07 AM Nikunj A Dadhania <[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> Pin the memory for the data being passed to launch_update_data()
> because it gets encrypted before the guest is first run and must
> not be moved which would corrupt it.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> [ * Changed hva_to_gva() to take an extra argument and return gpa_t.
> * Updated sev_pin_memory_in_mmu() error handling.
> * As pinning/unpining pages is handled within MMU, removed
> {get,put}_user(). ]
> Signed-off-by: Nikunj A Dadhania <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 119 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 14aeccfc500b..1ae714e83a3c 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -22,6 +22,7 @@
> #include <asm/trapnr.h>
> #include <asm/fpu/xcr.h>
>
> +#include "mmu.h"
> #include "x86.h"
> #include "svm.h"
> #include "svm_ops.h"
> @@ -490,6 +491,110 @@ static unsigned long get_num_contig_pages(unsigned long idx,
> return pages;
> }
>
> +#define SEV_PFERR_RO (PFERR_USER_MASK)
> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
> +
> +static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
> + unsigned long hva)
> +{
> + struct kvm_memslots *slots = kvm_memslots(kvm);
> + struct kvm_memory_slot *memslot;
> + int bkt;
> +
> + kvm_for_each_memslot(memslot, bkt, slots) {
> + if (hva >= memslot->userspace_addr &&
> + hva < memslot->userspace_addr +
> + (memslot->npages << PAGE_SHIFT))
> + return memslot;
> + }
> +
> + return NULL;
> +}
> +
> +static gpa_t hva_to_gpa(struct kvm *kvm, unsigned long hva, bool *ro)
> +{
> + struct kvm_memory_slot *memslot;
> + gpa_t gpa_offset;
> +
> + memslot = hva_to_memslot(kvm, hva);
> + if (!memslot)
> + return UNMAPPED_GVA;
> +
> + *ro = !!(memslot->flags & KVM_MEM_READONLY);
> + gpa_offset = hva - memslot->userspace_addr;
> + return ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
> +}
> +
> +static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
> + unsigned long size,
> + unsigned long *npages)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct kvm_vcpu *vcpu;
> + struct page **pages;
> + unsigned long i;
> + u32 error_code;
> + kvm_pfn_t pfn;
> + int idx, ret = 0;
> + gpa_t gpa;
> + bool ro;
> +
> + pages = sev_alloc_pages(sev, addr, size, npages);
> + if (IS_ERR(pages))
> + return pages;
> +
> + vcpu = kvm_get_vcpu(kvm, 0);
> + if (mutex_lock_killable(&vcpu->mutex)) {
> + kvfree(pages);
> + return ERR_PTR(-EINTR);
> + }
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&kvm->srcu);
> +
> + kvm_mmu_load(vcpu);
> +
> + for (i = 0; i < *npages; i++, addr += PAGE_SIZE) {
> + if (signal_pending(current)) {
> + ret = -ERESTARTSYS;
> + break;
> + }
> +
> + if (need_resched())
> + cond_resched();
> +
> + gpa = hva_to_gpa(kvm, addr, &ro);
> + if (gpa == UNMAPPED_GVA) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + error_code = ro ? SEV_PFERR_RO : SEV_PFERR_RW;
> +
> + /*
> + * Fault in the page and sev_pin_page() will handle the
> + * pinning
> + */
> + pfn = kvm_mmu_map_tdp_page(vcpu, gpa, error_code, PG_LEVEL_4K);
> + if (is_error_noslot_pfn(pfn)) {
> + ret = -EFAULT;
> + break;
> + }
> + pages[i] = pfn_to_page(pfn);
> + }
> +
> + kvm_mmu_unload(vcpu);
> + srcu_read_unlock(&kvm->srcu, idx);
> + vcpu_put(vcpu);
> + mutex_unlock(&vcpu->mutex);
> +
> + if (!ret)
> + return pages;
> +
> + kvfree(pages);
> + return ERR_PTR(ret);
> +}
> +
> static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
> {
> unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i;
> @@ -510,15 +615,21 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
> vaddr_end = vaddr + size;
>
> /* Lock the user memory. */
> - inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
> + if (atomic_read(&kvm->online_vcpus))
> + inpages = sev_pin_memory_in_mmu(kvm, vaddr, size, &npages);
IIUC we can only use sev_pin_memory_in_mmu() when there is an online
vCPU, because that means the MMU has been set up enough to use?
Can we add a variable and a comment to help explain that?
bool mmu_usable = atomic_read(&kvm->online_vcpus) > 0;
> + else
> + inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
So I am confused about this case. Since svm_register_enc_region() is
now a NOOP, how can a user ensure that memory remains pinned from
sev_launch_update_data() to when the memory would be demand pinned?
Before, users could call svm_register_enc_region(), which pins the region,
then sev_launch_update_data(), and then the VM could run and the data from
sev_launch_update_data() would never have moved. I don't think that
same guarantee holds here?
> if (IS_ERR(inpages))
> return PTR_ERR(inpages);
>
> /*
> * Flush (on non-coherent CPUs) before LAUNCH_UPDATE encrypts pages in
> * place; the cache may contain the data that was written unencrypted.
> + * Flushing is automatically handled if the pages can be pinned in the
> + * MMU.
> */
> - sev_clflush_pages(inpages, npages);
> + if (!atomic_read(&kvm->online_vcpus))
> + sev_clflush_pages(inpages, npages);
>
> data.reserved = 0;
> data.handle = sev->handle;
> @@ -553,8 +664,13 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
> set_page_dirty_lock(inpages[i]);
> mark_page_accessed(inpages[i]);
> }
> +
> /* unlock the user pages */
> - sev_unpin_memory(kvm, inpages, npages);
> + if (atomic_read(&kvm->online_vcpus))
> + kvfree(inpages);
> + else
> + sev_unpin_memory(kvm, inpages, npages);
> +
> return ret;
> }
>
> --
> 2.32.0
>
On 1/20/2022 9:47 PM, Peter Gonda wrote:
> On Tue, Jan 18, 2022 at 4:07 AM Nikunj A Dadhania <[email protected]> wrote:
>>
>> From: Sean Christopherson <[email protected]>
>>
>> Pin the memory for the data being passed to launch_update_data()
>> because it gets encrypted before the guest is first run and must
>> not be moved which would corrupt it.
>>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> [ * Changed hva_to_gva() to take an extra argument and return gpa_t.
>> * Updated sev_pin_memory_in_mmu() error handling.
>> * As pinning/unpining pages is handled within MMU, removed
>> {get,put}_user(). ]
>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>> ---
>> arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 119 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 14aeccfc500b..1ae714e83a3c 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -22,6 +22,7 @@
>> #include <asm/trapnr.h>
>> #include <asm/fpu/xcr.h>
>>
>> +#include "mmu.h"
>> #include "x86.h"
>> #include "svm.h"
>> #include "svm_ops.h"
>> @@ -490,6 +491,110 @@ static unsigned long get_num_contig_pages(unsigned long idx,
>> return pages;
>> }
>>
>> +#define SEV_PFERR_RO (PFERR_USER_MASK)
>> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
>> +
>> +static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
>> + unsigned long hva)
>> +{
>> + struct kvm_memslots *slots = kvm_memslots(kvm);
>> + struct kvm_memory_slot *memslot;
>> + int bkt;
>> +
>> + kvm_for_each_memslot(memslot, bkt, slots) {
>> + if (hva >= memslot->userspace_addr &&
>> + hva < memslot->userspace_addr +
>> + (memslot->npages << PAGE_SHIFT))
>> + return memslot;
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +static gpa_t hva_to_gpa(struct kvm *kvm, unsigned long hva, bool *ro)
>> +{
>> + struct kvm_memory_slot *memslot;
>> + gpa_t gpa_offset;
>> +
>> + memslot = hva_to_memslot(kvm, hva);
>> + if (!memslot)
>> + return UNMAPPED_GVA;
>> +
>> + *ro = !!(memslot->flags & KVM_MEM_READONLY);
>> + gpa_offset = hva - memslot->userspace_addr;
>> + return ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
>> +}
>> +
>> +static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
>> + unsigned long size,
>> + unsigned long *npages)
>> +{
>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> + struct kvm_vcpu *vcpu;
>> + struct page **pages;
>> + unsigned long i;
>> + u32 error_code;
>> + kvm_pfn_t pfn;
>> + int idx, ret = 0;
>> + gpa_t gpa;
>> + bool ro;
>> +
>> + pages = sev_alloc_pages(sev, addr, size, npages);
>> + if (IS_ERR(pages))
>> + return pages;
>> +
>> + vcpu = kvm_get_vcpu(kvm, 0);
>> + if (mutex_lock_killable(&vcpu->mutex)) {
>> + kvfree(pages);
>> + return ERR_PTR(-EINTR);
>> + }
>> +
>> + vcpu_load(vcpu);
>> + idx = srcu_read_lock(&kvm->srcu);
>> +
>> + kvm_mmu_load(vcpu);
>> +
>> + for (i = 0; i < *npages; i++, addr += PAGE_SIZE) {
>> + if (signal_pending(current)) {
>> + ret = -ERESTARTSYS;
>> + break;
>> + }
>> +
>> + if (need_resched())
>> + cond_resched();
>> +
>> + gpa = hva_to_gpa(kvm, addr, &ro);
>> + if (gpa == UNMAPPED_GVA) {
>> + ret = -EFAULT;
>> + break;
>> + }
>> +
>> + error_code = ro ? SEV_PFERR_RO : SEV_PFERR_RW;
>> +
>> + /*
>> + * Fault in the page and sev_pin_page() will handle the
>> + * pinning
>> + */
>> + pfn = kvm_mmu_map_tdp_page(vcpu, gpa, error_code, PG_LEVEL_4K);
>> + if (is_error_noslot_pfn(pfn)) {
>> + ret = -EFAULT;
>> + break;
>> + }
>> + pages[i] = pfn_to_page(pfn);
>> + }
>> +
>> + kvm_mmu_unload(vcpu);
>> + srcu_read_unlock(&kvm->srcu, idx);
>> + vcpu_put(vcpu);
>> + mutex_unlock(&vcpu->mutex);
>> +
>> + if (!ret)
>> + return pages;
>> +
>> + kvfree(pages);
>> + return ERR_PTR(ret);
>> +}
>> +
>> static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>> {
>> unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i;
>> @@ -510,15 +615,21 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>> vaddr_end = vaddr + size;
>>
>> /* Lock the user memory. */
>> - inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
>> + if (atomic_read(&kvm->online_vcpus))
>> + inpages = sev_pin_memory_in_mmu(kvm, vaddr, size, &npages);
>
> IIUC we can only use the sev_pin_memory_in_mmu() when there is an
> online vCPU because that means the MMU has been setup enough to use?
> Can we add a variable and a comment to help explain that?
>
> bool mmu_usable = atomic_read(&kvm->online_vcpus) > 0;
Sure, will add comment and the variable.
>
>> + else
>> + inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
>
> So I am confused about this case. Since svm_register_enc_region() is
> now a NOOP how can a user ensure that memory remains pinned from
> sev_launch_update_data() to when the memory would be demand pinned?
>
> Before users could svm_register_enc_region() which pins the region,
> then sev_launch_update_data(), then the VM could run and the data from
> sev_launch_update_data() would have never moved. I don't think that
> same guarantee is held here?
Yes, you are right. One way is to error out of this call if the MMU is not set up.
The other would require us to maintain a list of all memory pinned via sev_pin_memory()
and unpin it in the destroy path.
>> if (IS_ERR(inpages))
>> return PTR_ERR(inpages);
>>
>> /*
>> * Flush (on non-coherent CPUs) before LAUNCH_UPDATE encrypts pages in
>> * place; the cache may contain the data that was written unencrypted.
>> + * Flushing is automatically handled if the pages can be pinned in the
>> + * MMU.
>> */
>> - sev_clflush_pages(inpages, npages);
>> + if (!atomic_read(&kvm->online_vcpus))
>> + sev_clflush_pages(inpages, npages);
>>
>> data.reserved = 0;
>> data.handle = sev->handle;
>> @@ -553,8 +664,13 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>> set_page_dirty_lock(inpages[i]);
>> mark_page_accessed(inpages[i]);
>> }
>> +
>> /* unlock the user pages */
>> - sev_unpin_memory(kvm, inpages, npages);
>> + if (atomic_read(&kvm->online_vcpus))
>> + kvfree(inpages);
>> + else
>> + sev_unpin_memory(kvm, inpages, npages);
And we would not unpin here in this case.
Regards
Nikunj
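For reference, the hva_to_memslot()/hva_to_gpa() walk that sev_pin_memory_in_mmu() relies on can be modeled outside the kernel. This is a hypothetical userspace sketch: model_memslot, model_hva_to_gpa() and MODEL_UNMAPPED_GVA are made-up stand-ins for the KVM structures, and only the address arithmetic is mirrored:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define MODEL_UNMAPPED_GVA (~(uint64_t)0)

/* Simplified stand-in for a KVM memslot: an HVA range mapped to a GFN range. */
struct model_memslot {
	uint64_t userspace_addr;	/* start HVA */
	uint64_t base_gfn;		/* first guest frame number */
	uint64_t npages;
	int readonly;			/* models KVM_MEM_READONLY */
};

/* Mirror hva_to_memslot() + hva_to_gpa(): linear search, then offset math. */
uint64_t model_hva_to_gpa(const struct model_memslot *slots, size_t n,
			  uint64_t hva, int *ro)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct model_memslot *s = &slots[i];
		uint64_t size = s->npages << PAGE_SHIFT;

		if (hva >= s->userspace_addr && hva < s->userspace_addr + size) {
			*ro = s->readonly;
			return (s->base_gfn << PAGE_SHIFT) +
			       (hva - s->userspace_addr);
		}
	}
	return MODEL_UNMAPPED_GVA;
}
```

The readonly flag plays the role of KVM_MEM_READONLY when the caller picks SEV_PFERR_RO vs SEV_PFERR_RW.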
On Thu, Jan 20, 2022 at 9:08 PM Nikunj A. Dadhania <[email protected]> wrote:
>
> On 1/20/2022 9:47 PM, Peter Gonda wrote:
> > On Tue, Jan 18, 2022 at 4:07 AM Nikunj A Dadhania <[email protected]> wrote:
> >>
> >> From: Sean Christopherson <[email protected]>
> >>
> >> Pin the memory for the data being passed to launch_update_data()
> >> because it gets encrypted before the guest is first run and must
> >> not be moved which would corrupt it.
> >>
> >> Signed-off-by: Sean Christopherson <[email protected]>
> >> [ * Changed hva_to_gva() to take an extra argument and return gpa_t.
> >> * Updated sev_pin_memory_in_mmu() error handling.
> >> * As pinning/unpining pages is handled within MMU, removed
> >> {get,put}_user(). ]
> >> Signed-off-by: Nikunj A Dadhania <[email protected]>
> >> ---
> >> arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++++++++++++++++++++++++-
> >> 1 file changed, 119 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> >> index 14aeccfc500b..1ae714e83a3c 100644
> >> --- a/arch/x86/kvm/svm/sev.c
> >> +++ b/arch/x86/kvm/svm/sev.c
> >> @@ -22,6 +22,7 @@
> >> #include <asm/trapnr.h>
> >> #include <asm/fpu/xcr.h>
> >>
> >> +#include "mmu.h"
> >> #include "x86.h"
> >> #include "svm.h"
> >> #include "svm_ops.h"
> >> @@ -490,6 +491,110 @@ static unsigned long get_num_contig_pages(unsigned long idx,
> >> return pages;
> >> }
> >>
> >> +#define SEV_PFERR_RO (PFERR_USER_MASK)
> >> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
> >> +
> >> +static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
> >> + unsigned long hva)
> >> +{
> >> + struct kvm_memslots *slots = kvm_memslots(kvm);
> >> + struct kvm_memory_slot *memslot;
> >> + int bkt;
> >> +
> >> + kvm_for_each_memslot(memslot, bkt, slots) {
> >> + if (hva >= memslot->userspace_addr &&
> >> + hva < memslot->userspace_addr +
> >> + (memslot->npages << PAGE_SHIFT))
> >> + return memslot;
> >> + }
> >> +
> >> + return NULL;
> >> +}
> >> +
> >> +static gpa_t hva_to_gpa(struct kvm *kvm, unsigned long hva, bool *ro)
> >> +{
> >> + struct kvm_memory_slot *memslot;
> >> + gpa_t gpa_offset;
> >> +
> >> + memslot = hva_to_memslot(kvm, hva);
> >> + if (!memslot)
> >> + return UNMAPPED_GVA;
> >> +
> >> + *ro = !!(memslot->flags & KVM_MEM_READONLY);
> >> + gpa_offset = hva - memslot->userspace_addr;
> >> + return ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
> >> +}
> >> +
> >> +static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
> >> + unsigned long size,
> >> + unsigned long *npages)
> >> +{
> >> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> + struct kvm_vcpu *vcpu;
> >> + struct page **pages;
> >> + unsigned long i;
> >> + u32 error_code;
> >> + kvm_pfn_t pfn;
> >> + int idx, ret = 0;
> >> + gpa_t gpa;
> >> + bool ro;
> >> +
> >> + pages = sev_alloc_pages(sev, addr, size, npages);
> >> + if (IS_ERR(pages))
> >> + return pages;
> >> +
> >> + vcpu = kvm_get_vcpu(kvm, 0);
> >> + if (mutex_lock_killable(&vcpu->mutex)) {
> >> + kvfree(pages);
> >> + return ERR_PTR(-EINTR);
> >> + }
> >> +
> >> + vcpu_load(vcpu);
> >> + idx = srcu_read_lock(&kvm->srcu);
> >> +
> >> + kvm_mmu_load(vcpu);
> >> +
> >> + for (i = 0; i < *npages; i++, addr += PAGE_SIZE) {
> >> + if (signal_pending(current)) {
> >> + ret = -ERESTARTSYS;
> >> + break;
> >> + }
> >> +
> >> + if (need_resched())
> >> + cond_resched();
> >> +
> >> + gpa = hva_to_gpa(kvm, addr, &ro);
> >> + if (gpa == UNMAPPED_GVA) {
> >> + ret = -EFAULT;
> >> + break;
> >> + }
> >> +
> >> + error_code = ro ? SEV_PFERR_RO : SEV_PFERR_RW;
> >> +
> >> + /*
> >> + * Fault in the page and sev_pin_page() will handle the
> >> + * pinning
> >> + */
> >> + pfn = kvm_mmu_map_tdp_page(vcpu, gpa, error_code, PG_LEVEL_4K);
> >> + if (is_error_noslot_pfn(pfn)) {
> >> + ret = -EFAULT;
> >> + break;
> >> + }
> >> + pages[i] = pfn_to_page(pfn);
> >> + }
> >> +
> >> + kvm_mmu_unload(vcpu);
> >> + srcu_read_unlock(&kvm->srcu, idx);
> >> + vcpu_put(vcpu);
> >> + mutex_unlock(&vcpu->mutex);
> >> +
> >> + if (!ret)
> >> + return pages;
> >> +
> >> + kvfree(pages);
> >> + return ERR_PTR(ret);
> >> +}
> >> +
> >> static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >> {
> >> unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i;
> >> @@ -510,15 +615,21 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >> vaddr_end = vaddr + size;
> >>
> >> /* Lock the user memory. */
> >> - inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
> >> + if (atomic_read(&kvm->online_vcpus))
> >> + inpages = sev_pin_memory_in_mmu(kvm, vaddr, size, &npages);
> >
> > IIUC we can only use the sev_pin_memory_in_mmu() when there is an
> > online vCPU because that means the MMU has been setup enough to use?
> > Can we add a variable and a comment to help explain that?
> >
> > bool mmu_usable = atomic_read(&kvm->online_vcpus) > 0;
>
> Sure, will add comment and the variable.
>
> >
> >> + else
> >> + inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
> >
> > So I am confused about this case. Since svm_register_enc_region() is
> > now a NOOP how can a user ensure that memory remains pinned from
> > sev_launch_update_data() to when the memory would be demand pinned?
> >
> > Before users could svm_register_enc_region() which pins the region,
> > then sev_launch_update_data(), then the VM could run and the data from
> > sev_launch_update_data() would have never moved. I don't think that
> > same guarantee is held here?
>
> Yes, you are right. One way is to error out of this call if the MMU is not set up.
> The other would require us to maintain a list of all memory pinned via sev_pin_memory()
> and unpin it in the destroy path.
Got it. So we'll probably still need regions_list to track those
pinned regions and free them on destruction.
Also similar changes are probably needed in sev_receive_update_data()?
>
> >> if (IS_ERR(inpages))
> >> return PTR_ERR(inpages);
> >>
> >> /*
> >> * Flush (on non-coherent CPUs) before LAUNCH_UPDATE encrypts pages in
> >> * place; the cache may contain the data that was written unencrypted.
> >> + * Flushing is automatically handled if the pages can be pinned in the
> >> + * MMU.
> >> */
> >> - sev_clflush_pages(inpages, npages);
> >> + if (!atomic_read(&kvm->online_vcpus))
> >> + sev_clflush_pages(inpages, npages);
> >>
> >> data.reserved = 0;
> >> data.handle = sev->handle;
> >> @@ -553,8 +664,13 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >> set_page_dirty_lock(inpages[i]);
> >> mark_page_accessed(inpages[i]);
> >> }
> >> +
> >> /* unlock the user pages */
> >> - sev_unpin_memory(kvm, inpages, npages);
> >> + if (atomic_read(&kvm->online_vcpus))
> >> + kvfree(inpages);
>
> >> + else
> >> + sev_unpin_memory(kvm, inpages, npages);
>
> And not unpin here in this case.
>
> Regards
> Nikunj
On 1/21/2022 9:30 PM, Peter Gonda wrote:
> On Thu, Jan 20, 2022 at 9:08 PM Nikunj A. Dadhania <[email protected]> wrote:
>>
>> On 1/20/2022 9:47 PM, Peter Gonda wrote:
>>> On Tue, Jan 18, 2022 at 4:07 AM Nikunj A Dadhania <[email protected]> wrote:
>>>> static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>>> {
>>>> unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i;
>>>> @@ -510,15 +615,21 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>>> vaddr_end = vaddr + size;
>>>>
>>>> /* Lock the user memory. */
>>>> - inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
>>>> + if (atomic_read(&kvm->online_vcpus))
>>>> + inpages = sev_pin_memory_in_mmu(kvm, vaddr, size, &npages);
>>>
>>> IIUC we can only use the sev_pin_memory_in_mmu() when there is an
>>> online vCPU because that means the MMU has been setup enough to use?
>>> Can we add a variable and a comment to help explain that?
>>>
>>> bool mmu_usable = atomic_read(&kvm->online_vcpus) > 0;
>>
>> Sure, will add comment and the variable.
>>
>>>
>>>> + else
>>>> + inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
>>>
>>> So I am confused about this case. Since svm_register_enc_region() is
>>> now a NOOP how can a user ensure that memory remains pinned from
>>> sev_launch_update_data() to when the memory would be demand pinned?
>>>
>>> Before users could svm_register_enc_region() which pins the region,
>>> then sev_launch_update_data(), then the VM could run and the data from
>>> sev_launch_update_data() would have never moved. I don't think that
>>> same guarantee is held here?
>>
>> Yes, you are right. One way is to error out of this call if the MMU is not set up.
>> The other would require us to maintain a list of all memory pinned via sev_pin_memory()
>> and unpin it in the destroy path.
>
> Got it. So we'll probably still need regions_list to track those
> pinned regions and free them on destruction.
>
Yes, I will have to bring that structure back.
> Also similar changes are probably needed in sev_receive_update_data()?
Right, there are multiple locations where sev_pin_memory() is used; I will go through each
case and make changes. Alternatively, add to the region_list in sev_pin_memory() and free it
in the destruction path.
Regards
Nikunj
On Tue, Jan 18, 2022 at 4:07 AM Nikunj A Dadhania <[email protected]> wrote:
>
> Use the memslot metadata to store the pinned data along with the pfns.
> This improves the SEV guest startup time from O(n) to a constant by
> deferring guest page pinning until the pages are used to satisfy nested
> page faults. The page reference will be dropped in the memslot free
> path.
>
> Remove the enc_region structure definition and the code which did
> upfront pinning, as they are no longer needed in view of the demand
> pinning support.
>
> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
> since qemu is dependent on this API.
>
> Signed-off-by: Nikunj A Dadhania <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 208 ++++++++++++++++-------------------------
> arch/x86/kvm/svm/svm.c | 1 +
> arch/x86/kvm/svm/svm.h | 3 +-
> 3 files changed, 81 insertions(+), 131 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index d972ab4956d4..a962bed97a0b 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -66,14 +66,6 @@ static unsigned int nr_asids;
> static unsigned long *sev_asid_bitmap;
> static unsigned long *sev_reclaim_asid_bitmap;
>
> -struct enc_region {
> - struct list_head list;
> - unsigned long npages;
> - struct page **pages;
> - unsigned long uaddr;
> - unsigned long size;
> -};
> -
> /* Called with the sev_bitmap_lock held, or on shutdown */
> static int sev_flush_asids(int min_asid, int max_asid)
> {
> @@ -257,8 +249,6 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (ret)
> goto e_free;
>
> - INIT_LIST_HEAD(&sev->regions_list);
> -
> return 0;
>
> e_free:
> @@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
> src->handle = 0;
> src->pages_locked = 0;
> src->enc_context_owner = NULL;
> -
> - list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
I think we need to move the pinned SPTE entries into the target, and
repin the pages in the target here. Otherwise the pages will be
unpinned when the source is cleaned up. Have you thought about how
this could be done?
> }
>
> static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
> @@ -1861,115 +1849,13 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
> int svm_register_enc_region(struct kvm *kvm,
> struct kvm_enc_region *range)
> {
> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> - struct enc_region *region;
> - int ret = 0;
> -
> - if (!sev_guest(kvm))
> - return -ENOTTY;
> -
> - /* If kvm is mirroring encryption context it isn't responsible for it */
> - if (is_mirroring_enc_context(kvm))
> - return -EINVAL;
> -
> - if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
> - return -EINVAL;
> -
> - region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
> - if (!region)
> - return -ENOMEM;
> -
> - mutex_lock(&kvm->lock);
> - region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, 1);
> - if (IS_ERR(region->pages)) {
> - ret = PTR_ERR(region->pages);
> - mutex_unlock(&kvm->lock);
> - goto e_free;
> - }
> -
> - region->uaddr = range->addr;
> - region->size = range->size;
> -
> - list_add_tail(®ion->list, &sev->regions_list);
> - mutex_unlock(&kvm->lock);
> -
> - /*
> - * The guest may change the memory encryption attribute from C=0 -> C=1
> - * or vice versa for this memory range. Lets make sure caches are
> - * flushed to ensure that guest data gets written into memory with
> - * correct C-bit.
> - */
> - sev_clflush_pages(region->pages, region->npages);
> -
> - return ret;
> -
> -e_free:
> - kfree(region);
> - return ret;
> -}
> -
> -static struct enc_region *
> -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
> -{
> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> - struct list_head *head = &sev->regions_list;
> - struct enc_region *i;
> -
> - list_for_each_entry(i, head, list) {
> - if (i->uaddr == range->addr &&
> - i->size == range->size)
> - return i;
> - }
> -
> - return NULL;
> -}
> -
> -static void __unregister_enc_region_locked(struct kvm *kvm,
> - struct enc_region *region)
> -{
> - sev_unpin_memory(kvm, region->pages, region->npages);
> - list_del(®ion->list);
> - kfree(region);
> + return 0;
> }
>
> int svm_unregister_enc_region(struct kvm *kvm,
> struct kvm_enc_region *range)
> {
> - struct enc_region *region;
> - int ret;
> -
> - /* If kvm is mirroring encryption context it isn't responsible for it */
> - if (is_mirroring_enc_context(kvm))
> - return -EINVAL;
> -
> - mutex_lock(&kvm->lock);
> -
> - if (!sev_guest(kvm)) {
> - ret = -ENOTTY;
> - goto failed;
> - }
> -
> - region = find_enc_region(kvm, range);
> - if (!region) {
> - ret = -EINVAL;
> - goto failed;
> - }
> -
> - /*
> - * Ensure that all guest tagged cache entries are flushed before
> - * releasing the pages back to the system for use. CLFLUSH will
> - * not do this, so issue a WBINVD.
> - */
> - wbinvd_on_all_cpus();
> -
> - __unregister_enc_region_locked(kvm, region);
> -
> - mutex_unlock(&kvm->lock);
> return 0;
> -
> -failed:
> - mutex_unlock(&kvm->lock);
> - return ret;
> }
>
> int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> @@ -2018,7 +1904,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> mirror_sev->fd = source_sev->fd;
> mirror_sev->es_active = source_sev->es_active;
> mirror_sev->handle = source_sev->handle;
> - INIT_LIST_HEAD(&mirror_sev->regions_list);
> ret = 0;
>
> /*
> @@ -2038,8 +1923,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> void sev_vm_destroy(struct kvm *kvm)
> {
> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> - struct list_head *head = &sev->regions_list;
> - struct list_head *pos, *q;
>
> WARN_ON(sev->num_mirrored_vms);
>
> @@ -2066,18 +1949,6 @@ void sev_vm_destroy(struct kvm *kvm)
> */
> wbinvd_on_all_cpus();
>
> - /*
> - * if userspace was terminated before unregistering the memory regions
> - * then lets unpin all the registered memory.
> - */
> - if (!list_empty(head)) {
> - list_for_each_safe(pos, q, head) {
> - __unregister_enc_region_locked(kvm,
> - list_entry(pos, struct enc_region, list));
> - cond_resched();
> - }
> - }
> -
> sev_unbind_asid(kvm, sev->handle);
> sev_asid_free(sev);
> }
> @@ -2946,13 +2817,90 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
> }
>
> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn)
> +{
> + struct kvm_arch_memory_slot *aslot;
> + struct kvm_memory_slot *slot;
> + gfn_t rel_gfn, pin_pfn;
> + unsigned long npages;
> + kvm_pfn_t old_pfn;
> + int i;
> +
> + if (!sev_guest(kvm))
> + return;
> +
> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
> + return;
> +
> + /* Tested till 1GB pages */
> + if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
> + return;
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot || !slot->arch.pfns)
> + return;
> +
> + /*
> + * Use relative gfn index within the memslot for the bitmap as well as
> + * the pfns array
> + */
> + rel_gfn = gfn - slot->base_gfn;
> + aslot = &slot->arch;
> + pin_pfn = pfn;
> + npages = KVM_PAGES_PER_HPAGE(level);
> +
> + /* Pin the page, KVM doesn't yet support page migration. */
> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
> + old_pfn = aslot->pfns[rel_gfn];
> + if (old_pfn == pin_pfn)
> + continue;
> +
> + put_page(pfn_to_page(old_pfn));
> + }
> +
> + set_bit(rel_gfn, aslot->pinned_bitmap);
> + aslot->pfns[rel_gfn] = pin_pfn;
> + get_page(pfn_to_page(pin_pfn));
> + }
> +
> + /*
> + * Flush any cached lines of the page being added since "ownership" of
> + * it will be transferred from the host to an encrypted guest.
> + */
> + clflush_cache_range(__va(pfn << PAGE_SHIFT), page_level_size(level));
> +}
> +
> void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> {
> struct kvm_arch_memory_slot *aslot = &slot->arch;
> + kvm_pfn_t *pfns;
> + gfn_t gfn;
> + int i;
>
> if (!sev_guest(kvm))
> return;
>
> + if (!aslot->pinned_bitmap || !slot->arch.pfns)
> + goto out;
> +
> + pfns = aslot->pfns;
> +
> + /*
> + * Iterate the memslot to find the pinned pfn using the bitmap and drop
> + * the pfn stored.
> + */
> + for (i = 0, gfn = slot->base_gfn; i < slot->npages; i++, gfn++) {
> + if (test_and_clear_bit(i, aslot->pinned_bitmap)) {
> + if (WARN_ON(!pfns[i]))
> + continue;
> +
> + put_page(pfn_to_page(pfns[i]));
> + }
> + }
> +
> +out:
> if (aslot->pinned_bitmap) {
> kvfree(aslot->pinned_bitmap);
> aslot->pinned_bitmap = NULL;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 3fb19974f719..22535c680b3f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4743,6 +4743,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>
> .alloc_memslot_metadata = sev_alloc_memslot_metadata,
> .free_memslot = sev_free_memslot,
> + .pin_spte = sev_pin_spte,
> };
>
> static struct kvm_x86_init_ops svm_init_ops __initdata = {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index b2f8b3b52680..c731bc91ea8f 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -77,7 +77,6 @@ struct kvm_sev_info {
> unsigned int handle; /* SEV firmware handle */
> int fd; /* SEV device fd */
> unsigned long pages_locked; /* Number of pages locked */
> - struct list_head regions_list; /* List of registered regions */
> u64 ap_jump_table; /* SEV-ES AP Jump Table address */
> struct kvm *enc_context_owner; /* Owner of copied encryption context */
> unsigned long num_mirrored_vms; /* Number of VMs sharing this ASID */
> @@ -648,5 +647,7 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
> struct kvm_memory_slot *new);
> void sev_free_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot);
> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn);
>
> #endif
> --
> 2.32.0
>
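The pinned_bitmap/pfns bookkeeping that sev_pin_spte() and sev_free_memslot() implement in the patch above can be sketched as a userspace model. Everything here is a stand-in (a single bitmap word, an integer refcount array instead of struct page references, made-up model_* names); it mirrors only the pin/re-pin/unpin accounting:

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_PAGES 64
#define MAX_PFN 256

/*
 * Per-memslot pinning metadata: one bit per gfn, a pfn per gfn, and an
 * integer refcount standing in for the reference taken by get_page()
 * and dropped by put_page().
 */
unsigned long pinned_bitmap;
uint64_t pfns[SLOT_PAGES];
int refcount[MAX_PFN];

/* Mirror sev_pin_spte() for a single 4K mapping of rel_gfn -> pfn. */
void model_pin(unsigned int rel_gfn, uint64_t pfn)
{
	if (pinned_bitmap & (1UL << rel_gfn)) {
		uint64_t old_pfn = pfns[rel_gfn];

		if (old_pfn == pfn)
			return;			/* already pinned, same pfn */
		refcount[old_pfn]--;		/* put_page(old) */
	}
	pinned_bitmap |= 1UL << rel_gfn;
	pfns[rel_gfn] = pfn;
	refcount[pfn]++;			/* get_page(new) */
}

/* Mirror sev_free_memslot(): drop every reference the bitmap records. */
void model_free_slot(void)
{
	unsigned int i;

	for (i = 0; i < SLOT_PAGES; i++) {
		if (pinned_bitmap & (1UL << i)) {
			pinned_bitmap &= ~(1UL << i);
			refcount[pfns[i]]--;	/* put_page() */
		}
	}
}
```

Re-faulting the same gfn with the same pfn is a no-op; re-faulting with a different pfn drops the old reference first, which is exactly the old_pfn path in sev_pin_spte().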
Hi Peter,
On 1/25/2022 10:17 PM, Peter Gonda wrote:
>> @@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
>> src->handle = 0;
>> src->pages_locked = 0;
>> src->enc_context_owner = NULL;
>> -
>> - list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
> I think we need to move the pinned SPTE entries into the target, and
> repin the pages in the target here. Otherwise the pages will be
> unpinned when the source is cleaned up. Have you thought about how
> this could be done?
>
I am testing migration with the pinned_list, and I see that all the guest pages are
transferred/pinned on the other side during migration. I think there is an
assumption that all private pages need to be moved.
QEMU: target/i386/sev.c:bool sev_is_gfn_in_unshared_region(unsigned long gfn)
Will dig more on this.
Regards
Nikunj
On Tue, Jan 25, 2022 at 10:49 AM Nikunj A. Dadhania <[email protected]> wrote:
>
> Hi Peter
>
> On 1/25/2022 10:17 PM, Peter Gonda wrote:
> >> @@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
> >> src->handle = 0;
> >> src->pages_locked = 0;
> >> src->enc_context_owner = NULL;
> >> -
> >> - list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
> > I think we need to move the pinned SPTE entries into the target, and
> > repin the pages in the target here. Otherwise the pages will be
> > unpinned when the source is cleaned up. Have you thought about how
> > this could be done?
> >
> I am testing migration with the pinned_list, and I see that all the guest pages are
> transferred/pinned on the other side during migration. I think there is an
> assumption that all private pages need to be moved.
>
> QEMU: target/i386/sev.c:bool sev_is_gfn_in_unshared_region(unsigned long gfn)
>
> Will dig more on this.
The code you linked appears to be for a remote migration. This
function is for an "intra-host" migration, meaning we are just moving
the VM's memory and state to a new userspace VMM on the same host, not
to an entirely new host.
>
> Regards
> Nikunj
On 18.01.22 12:06, Nikunj A Dadhania wrote:
> Use the memslot metadata to store the pinned data along with the pfns.
> This improves the SEV guest startup time from O(n) to a constant by
> deferring guest page pinning until the pages are used to satisfy nested
> page faults. The page reference will be dropped in the memslot free
> path.
>
> Remove the enc_region structure definition and the code which did
> upfront pinning, as they are no longer needed in view of the demand
> pinning support.
>
> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
> since qemu is dependent on this API.
>
> Signed-off-by: Nikunj A Dadhania <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 208 ++++++++++++++++-------------------------
> arch/x86/kvm/svm/svm.c | 1 +
> arch/x86/kvm/svm/svm.h | 3 +-
> 3 files changed, 81 insertions(+), 131 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index d972ab4956d4..a962bed97a0b 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -66,14 +66,6 @@ static unsigned int nr_asids;
> static unsigned long *sev_asid_bitmap;
> static unsigned long *sev_reclaim_asid_bitmap;
>
> -struct enc_region {
> - struct list_head list;
> - unsigned long npages;
> - struct page **pages;
> - unsigned long uaddr;
> - unsigned long size;
> -};
> -
> /* Called with the sev_bitmap_lock held, or on shutdown */
> static int sev_flush_asids(int min_asid, int max_asid)
> {
> @@ -257,8 +249,6 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (ret)
> goto e_free;
>
> - INIT_LIST_HEAD(&sev->regions_list);
> -
> return 0;
>
> e_free:
> @@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
> src->handle = 0;
> src->pages_locked = 0;
> src->enc_context_owner = NULL;
> -
> - list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
> }
>
> static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
> @@ -1861,115 +1849,13 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
> int svm_register_enc_region(struct kvm *kvm,
> struct kvm_enc_region *range)
> {
> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> - struct enc_region *region;
> - int ret = 0;
> -
> - if (!sev_guest(kvm))
> - return -ENOTTY;
> -
> - /* If kvm is mirroring encryption context it isn't responsible for it */
> - if (is_mirroring_enc_context(kvm))
> - return -EINVAL;
> -
> - if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
> - return -EINVAL;
> -
> - region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
> - if (!region)
> - return -ENOMEM;
> -
> - mutex_lock(&kvm->lock);
> - region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, 1);
> - if (IS_ERR(region->pages)) {
> - ret = PTR_ERR(region->pages);
> - mutex_unlock(&kvm->lock);
> - goto e_free;
> - }
> -
> - region->uaddr = range->addr;
> - region->size = range->size;
> -
> - list_add_tail(®ion->list, &sev->regions_list);
> - mutex_unlock(&kvm->lock);
> -
> - /*
> - * The guest may change the memory encryption attribute from C=0 -> C=1
> - * or vice versa for this memory range. Lets make sure caches are
> - * flushed to ensure that guest data gets written into memory with
> - * correct C-bit.
> - */
> - sev_clflush_pages(region->pages, region->npages);
> -
> - return ret;
> -
> -e_free:
> - kfree(region);
> - return ret;
> -}
> -
> -static struct enc_region *
> -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
> -{
> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> - struct list_head *head = &sev->regions_list;
> - struct enc_region *i;
> -
> - list_for_each_entry(i, head, list) {
> - if (i->uaddr == range->addr &&
> - i->size == range->size)
> - return i;
> - }
> -
> - return NULL;
> -}
> -
> -static void __unregister_enc_region_locked(struct kvm *kvm,
> - struct enc_region *region)
> -{
> - sev_unpin_memory(kvm, region->pages, region->npages);
> - list_del(®ion->list);
> - kfree(region);
> + return 0;
> }
>
> int svm_unregister_enc_region(struct kvm *kvm,
> struct kvm_enc_region *range)
> {
> - struct enc_region *region;
> - int ret;
> -
> - /* If kvm is mirroring encryption context it isn't responsible for it */
> - if (is_mirroring_enc_context(kvm))
> - return -EINVAL;
> -
> - mutex_lock(&kvm->lock);
> -
> - if (!sev_guest(kvm)) {
> - ret = -ENOTTY;
> - goto failed;
> - }
> -
> - region = find_enc_region(kvm, range);
> - if (!region) {
> - ret = -EINVAL;
> - goto failed;
> - }
> -
> - /*
> - * Ensure that all guest tagged cache entries are flushed before
> - * releasing the pages back to the system for use. CLFLUSH will
> - * not do this, so issue a WBINVD.
> - */
> - wbinvd_on_all_cpus();
> -
> - __unregister_enc_region_locked(kvm, region);
> -
> - mutex_unlock(&kvm->lock);
> return 0;
> -
> -failed:
> - mutex_unlock(&kvm->lock);
> - return ret;
> }
>
> int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> @@ -2018,7 +1904,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> mirror_sev->fd = source_sev->fd;
> mirror_sev->es_active = source_sev->es_active;
> mirror_sev->handle = source_sev->handle;
> - INIT_LIST_HEAD(&mirror_sev->regions_list);
> ret = 0;
>
> /*
> @@ -2038,8 +1923,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> void sev_vm_destroy(struct kvm *kvm)
> {
> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> - struct list_head *head = &sev->regions_list;
> - struct list_head *pos, *q;
>
> WARN_ON(sev->num_mirrored_vms);
>
> @@ -2066,18 +1949,6 @@ void sev_vm_destroy(struct kvm *kvm)
> */
> wbinvd_on_all_cpus();
>
> - /*
> - * if userspace was terminated before unregistering the memory regions
> - * then lets unpin all the registered memory.
> - */
> - if (!list_empty(head)) {
> - list_for_each_safe(pos, q, head) {
> - __unregister_enc_region_locked(kvm,
> - list_entry(pos, struct enc_region, list));
> - cond_resched();
> - }
> - }
> -
> sev_unbind_asid(kvm, sev->handle);
> sev_asid_free(sev);
> }
> @@ -2946,13 +2817,90 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
> }
>
> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn)
> +{
> + struct kvm_arch_memory_slot *aslot;
> + struct kvm_memory_slot *slot;
> + gfn_t rel_gfn, pin_pfn;
> + unsigned long npages;
> + kvm_pfn_t old_pfn;
> + int i;
> +
> + if (!sev_guest(kvm))
> + return;
> +
> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
> + return;
> +
> + /* Tested till 1GB pages */
> + if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
> + return;
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot || !slot->arch.pfns)
> + return;
> +
> + /*
> + * Use relative gfn index within the memslot for the bitmap as well as
> + * the pfns array
> + */
> + rel_gfn = gfn - slot->base_gfn;
> + aslot = &slot->arch;
> + pin_pfn = pfn;
> + npages = KVM_PAGES_PER_HPAGE(level);
> +
> + /* Pin the page, KVM doesn't yet support page migration. */
> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
> + old_pfn = aslot->pfns[rel_gfn];
> + if (old_pfn == pin_pfn)
> + continue;
> +
> + put_page(pfn_to_page(old_pfn));
> + }
> +
> + set_bit(rel_gfn, aslot->pinned_bitmap);
> + aslot->pfns[rel_gfn] = pin_pfn;
> + get_page(pfn_to_page(pin_pfn));
I assume this is to replace KVM_MEMORY_ENCRYPT_REG_REGION, which ends up
calling svm_register_enc_region()->sev_pin_memory(), correct?
sev_pin_memory() correctly checks the RLIMIT_MEMLOCK and uses
pin_user_pages_fast().
I have to strongly assume that sev_pin_memory() is *wrong* as is because
it's supposed to supply FOLL_LONGTERM -- after all we're pinning these
pages possibly forever.
I might be wrong but
1. You are missing the RLIMIT_MEMLOCK check
2. get_page() is the wrong way of long-term pinning a page. You would
have to mimic what pin_user_pages_fast(FOLL_LONGTERM) does to eventually
get it right (e.g., migrate the page off of MIGRATE_CMA or ZONE_MOVABLE).
--
Thanks,
David / dhildenb
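For point 1 above, the accounting sev_pin_memory() does against RLIMIT_MEMLOCK (ignoring the CAP_IPC_LOCK override) boils down to a bounded-sum check. A minimal sketch, with all quantities in pages and model_may_pin() a made-up name:

```c
#include <assert.h>

/*
 * The pages already locked for the VM plus the new request must not
 * exceed the memlock limit. Reject on unsigned overflow as well, since
 * a wrapped sum would otherwise pass the comparison.
 */
int model_may_pin(unsigned long locked, unsigned long npages,
		  unsigned long lock_limit_pages)
{
	unsigned long total = locked + npages;

	if (total < locked)		/* request overflowed: reject */
		return 0;
	return total <= lock_limit_pages;
}
```

Any demand-pinning replacement for sev_pin_memory() would need an equivalent check somewhere on the pin path, which get_page() alone does not provide.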
On 1/25/2022 11:29 PM, Peter Gonda wrote:
> On Tue, Jan 25, 2022 at 10:49 AM Nikunj A. Dadhania <[email protected]> wrote:
>>
>> Hi Peter
>>
>> On 1/25/2022 10:17 PM, Peter Gonda wrote:
>>>> @@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
>>>> src->handle = 0;
>>>> src->pages_locked = 0;
>>>> src->enc_context_owner = NULL;
>>>> -
>>>> - list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
>>> I think we need to move the pinned SPTE entries into the target, and
>>> repin the pages in the target here. Otherwise the pages will be
>>> unpinned when the source is cleaned up. Have you thought about how
>>> this could be done?
Right, copying just the list doesn't look to be sufficient.
In the destination kvm context, we will have to go over the source list of
pinned regions and pin them. Roughly something like the below:

struct list_head *head = &src->pinned_regions_list;
struct list_head *pos, *q;
struct pinned_region *new, *old;

if (!list_empty(head)) {
	list_for_each_safe(pos, q, head) {
		old = list_entry(pos, struct pinned_region, list);
		/* alloc new region and initialize with old */
		new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
		new->uaddr = old->uaddr;
		new->len = old->len;
		new->npages = old->npages;
		/* pin memory */
		new->pages = sev_pin_memory(kvm, new->uaddr, new->npages);
		list_add_tail(&new->list, &dst->pinned_regions_list);
		...
	}
}
>>>
>> I am testing migration with the pinned_list, and I see that all the guest pages
>> are transferred/pinned on the other side during migration. I think there is an
>> assumption that all private pages need to be moved.
>>
>> QEMU: target/i386/sev.c:bool sev_is_gfn_in_unshared_region(unsigned long gfn)
>>
>> Will dig more on this.
>
> The code you linked appears to be for a remote migration.
Yes, that is correct.
> This
> function is for an "intra-host" migration, meaning we are just moving
> the VM's memory and state to a new userspace VMM on the same host, not
> an entirely new host.
Regards
Nikunj
On 1/26/2022 4:16 PM, David Hildenbrand wrote:
> On 18.01.22 12:06, Nikunj A Dadhania wrote:
>> Use the memslot metadata to store the pinned data along with the pfns.
>> This improves the SEV guest startup time from O(n) to a constant by
>> deferring guest page pinning until the pages are used to satisfy nested
>> page faults. The page reference will be dropped in the memslot free
>> path.
>>
>> Remove the enc_region structure definition and the code which did
>> upfront pinning, as they are no longer needed in view of the demand
>> pinning support.
>>
>> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
>> since qemu is dependent on this API.
>>
>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>> ---
>> arch/x86/kvm/svm/sev.c | 208 ++++++++++++++++-------------------------
>> arch/x86/kvm/svm/svm.c | 1 +
>> arch/x86/kvm/svm/svm.h | 3 +-
>> 3 files changed, 81 insertions(+), 131 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index d972ab4956d4..a962bed97a0b 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -66,14 +66,6 @@ static unsigned int nr_asids;
>> static unsigned long *sev_asid_bitmap;
>> static unsigned long *sev_reclaim_asid_bitmap;
>>
>> -struct enc_region {
>> - struct list_head list;
>> - unsigned long npages;
>> - struct page **pages;
>> - unsigned long uaddr;
>> - unsigned long size;
>> -};
>> -
>> /* Called with the sev_bitmap_lock held, or on shutdown */
>> static int sev_flush_asids(int min_asid, int max_asid)
>> {
>> @@ -257,8 +249,6 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
>> if (ret)
>> goto e_free;
>>
>> - INIT_LIST_HEAD(&sev->regions_list);
>> -
>> return 0;
>>
>> e_free:
>> @@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
>> src->handle = 0;
>> src->pages_locked = 0;
>> src->enc_context_owner = NULL;
>> -
>> - list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
>> }
>>
>> static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
>> @@ -1861,115 +1849,13 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
>> int svm_register_enc_region(struct kvm *kvm,
>> struct kvm_enc_region *range)
>> {
>> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> - struct enc_region *region;
>> - int ret = 0;
>> -
>> - if (!sev_guest(kvm))
>> - return -ENOTTY;
>> -
>> - /* If kvm is mirroring encryption context it isn't responsible for it */
>> - if (is_mirroring_enc_context(kvm))
>> - return -EINVAL;
>> -
>> - if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
>> - return -EINVAL;
>> -
>> - region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
>> - if (!region)
>> - return -ENOMEM;
>> -
>> - mutex_lock(&kvm->lock);
>> - region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, 1);
>> - if (IS_ERR(region->pages)) {
>> - ret = PTR_ERR(region->pages);
>> - mutex_unlock(&kvm->lock);
>> - goto e_free;
>> - }
>> -
>> - region->uaddr = range->addr;
>> - region->size = range->size;
>> -
>> - list_add_tail(®ion->list, &sev->regions_list);
>> - mutex_unlock(&kvm->lock);
>> -
>> - /*
>> - * The guest may change the memory encryption attribute from C=0 -> C=1
>> - * or vice versa for this memory range. Lets make sure caches are
>> - * flushed to ensure that guest data gets written into memory with
>> - * correct C-bit.
>> - */
>> - sev_clflush_pages(region->pages, region->npages);
>> -
>> - return ret;
>> -
>> -e_free:
>> - kfree(region);
>> - return ret;
>> -}
>> -
>> -static struct enc_region *
>> -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
>> -{
>> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> - struct list_head *head = &sev->regions_list;
>> - struct enc_region *i;
>> -
>> - list_for_each_entry(i, head, list) {
>> - if (i->uaddr == range->addr &&
>> - i->size == range->size)
>> - return i;
>> - }
>> -
>> - return NULL;
>> -}
>> -
>> -static void __unregister_enc_region_locked(struct kvm *kvm,
>> - struct enc_region *region)
>> -{
>> - sev_unpin_memory(kvm, region->pages, region->npages);
>> - list_del(®ion->list);
>> - kfree(region);
>> + return 0;
>> }
>>
>> int svm_unregister_enc_region(struct kvm *kvm,
>> struct kvm_enc_region *range)
>> {
>> - struct enc_region *region;
>> - int ret;
>> -
>> - /* If kvm is mirroring encryption context it isn't responsible for it */
>> - if (is_mirroring_enc_context(kvm))
>> - return -EINVAL;
>> -
>> - mutex_lock(&kvm->lock);
>> -
>> - if (!sev_guest(kvm)) {
>> - ret = -ENOTTY;
>> - goto failed;
>> - }
>> -
>> - region = find_enc_region(kvm, range);
>> - if (!region) {
>> - ret = -EINVAL;
>> - goto failed;
>> - }
>> -
>> - /*
>> - * Ensure that all guest tagged cache entries are flushed before
>> - * releasing the pages back to the system for use. CLFLUSH will
>> - * not do this, so issue a WBINVD.
>> - */
>> - wbinvd_on_all_cpus();
>> -
>> - __unregister_enc_region_locked(kvm, region);
>> -
>> - mutex_unlock(&kvm->lock);
>> return 0;
>> -
>> -failed:
>> - mutex_unlock(&kvm->lock);
>> - return ret;
>> }
>>
>> int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>> @@ -2018,7 +1904,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>> mirror_sev->fd = source_sev->fd;
>> mirror_sev->es_active = source_sev->es_active;
>> mirror_sev->handle = source_sev->handle;
>> - INIT_LIST_HEAD(&mirror_sev->regions_list);
>> ret = 0;
>>
>> /*
>> @@ -2038,8 +1923,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>> void sev_vm_destroy(struct kvm *kvm)
>> {
>> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> - struct list_head *head = &sev->regions_list;
>> - struct list_head *pos, *q;
>>
>> WARN_ON(sev->num_mirrored_vms);
>>
>> @@ -2066,18 +1949,6 @@ void sev_vm_destroy(struct kvm *kvm)
>> */
>> wbinvd_on_all_cpus();
>>
>> - /*
>> - * if userspace was terminated before unregistering the memory regions
>> - * then lets unpin all the registered memory.
>> - */
>> - if (!list_empty(head)) {
>> - list_for_each_safe(pos, q, head) {
>> - __unregister_enc_region_locked(kvm,
>> - list_entry(pos, struct enc_region, list));
>> - cond_resched();
>> - }
>> - }
>> -
>> sev_unbind_asid(kvm, sev->handle);
>> sev_asid_free(sev);
>> }
>> @@ -2946,13 +2817,90 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
>> }
>>
>> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>> + kvm_pfn_t pfn)
>> +{
>> + struct kvm_arch_memory_slot *aslot;
>> + struct kvm_memory_slot *slot;
>> + gfn_t rel_gfn, pin_pfn;
>> + unsigned long npages;
>> + kvm_pfn_t old_pfn;
>> + int i;
>> +
>> + if (!sev_guest(kvm))
>> + return;
>> +
>> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
>> + return;
>> +
>> + /* Tested till 1GB pages */
>> + if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
>> + return;
>> +
>> + slot = gfn_to_memslot(kvm, gfn);
>> + if (!slot || !slot->arch.pfns)
>> + return;
>> +
>> + /*
>> + * Use relative gfn index within the memslot for the bitmap as well as
>> + * the pfns array
>> + */
>> + rel_gfn = gfn - slot->base_gfn;
>> + aslot = &slot->arch;
>> + pin_pfn = pfn;
>> + npages = KVM_PAGES_PER_HPAGE(level);
>> +
>> + /* Pin the page, KVM doesn't yet support page migration. */
>> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
>> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
>> + old_pfn = aslot->pfns[rel_gfn];
>> + if (old_pfn == pin_pfn)
>> + continue;
>> +
>> + put_page(pfn_to_page(old_pfn));
>> + }
>> +
>> + set_bit(rel_gfn, aslot->pinned_bitmap);
>> + aslot->pfns[rel_gfn] = pin_pfn;
>> + get_page(pfn_to_page(pin_pfn));
>
>
> I assume this is to replace KVM_MEMORY_ENCRYPT_REG_REGION, which ends up
> calling svm_register_enc_region()->sev_pin_memory(), correct?
Yes, that is correct.
>
> sev_pin_memory() correctly checks the RLIMIT_MEMLOCK and uses
> pin_user_pages_fast().
>
> I have to strongly assume that sev_pin_memory() is *wrong* as is because
> it's supposed to supply FOLL_LONGTERM -- after all we're pinning these
> pages possibly forever.
>
>
> I might be wrong but
>
> 1. You are missing the RLIMIT_MEMLOCK check
Yes, I will add this check during the enc_region registration.
> 2. get_page() is the wong way of long-term pinning a page. You would
> have to mimic what pin_user_pages_fast(FOLL_LONGTERM) does to eventually
> get it right (e.g., migrate the page off of MIGRATE_CMA or ZONE_MOVABLE).
Let me go through this and I will come back. Thanks for pointing this out.
Regards
Nikunj
On 28.01.22 07:57, Nikunj A. Dadhania wrote:
> On 1/26/2022 4:16 PM, David Hildenbrand wrote:
>> On 18.01.22 12:06, Nikunj A Dadhania wrote:
>>> Use the memslot metadata to store the pinned data along with the pfns.
>>> This improves the SEV guest startup time from O(n) to a constant by
>>> deferring guest page pinning until the pages are used to satisfy nested
>>> page faults. The page reference will be dropped in the memslot free
>>> path.
>>>
>>> Remove the enc_region structure definition and the code which did
>>> upfront pinning, as they are no longer needed in view of the demand
>>> pinning support.
>>>
>>> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
>>> since qemu is dependent on this API.
>>>
>>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>>> ---
>>> arch/x86/kvm/svm/sev.c | 208 ++++++++++++++++-------------------------
>>> arch/x86/kvm/svm/svm.c | 1 +
>>> arch/x86/kvm/svm/svm.h | 3 +-
>>> 3 files changed, 81 insertions(+), 131 deletions(-)
>>>
>>> [... quoted diff snipped; the sev_pin_spte() hunk under discussion is retained below ...]
>>> @@ -2946,13 +2817,90 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>>> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
>>> }
>>>
>>> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>>> + kvm_pfn_t pfn)
>>> +{
>>> + struct kvm_arch_memory_slot *aslot;
>>> + struct kvm_memory_slot *slot;
>>> + gfn_t rel_gfn, pin_pfn;
>>> + unsigned long npages;
>>> + kvm_pfn_t old_pfn;
>>> + int i;
>>> +
>>> + if (!sev_guest(kvm))
>>> + return;
>>> +
>>> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
>>> + return;
>>> +
>>> + /* Tested till 1GB pages */
>>> + if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
>>> + return;
>>> +
>>> + slot = gfn_to_memslot(kvm, gfn);
>>> + if (!slot || !slot->arch.pfns)
>>> + return;
>>> +
>>> + /*
>>> + * Use relative gfn index within the memslot for the bitmap as well as
>>> + * the pfns array
>>> + */
>>> + rel_gfn = gfn - slot->base_gfn;
>>> + aslot = &slot->arch;
>>> + pin_pfn = pfn;
>>> + npages = KVM_PAGES_PER_HPAGE(level);
>>> +
>>> + /* Pin the page, KVM doesn't yet support page migration. */
>>> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
>>> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
>>> + old_pfn = aslot->pfns[rel_gfn];
>>> + if (old_pfn == pin_pfn)
>>> + continue;
>>> +
>>> + put_page(pfn_to_page(old_pfn));
>>> + }
>>> +
>>> + set_bit(rel_gfn, aslot->pinned_bitmap);
>>> + aslot->pfns[rel_gfn] = pin_pfn;
>>> + get_page(pfn_to_page(pin_pfn));
>>
>>
>> I assume this is to replace KVM_MEMORY_ENCRYPT_REG_REGION, which ends up
>> calling svm_register_enc_region()->sev_pin_memory(), correct?
>
> Yes, that is correct.
>>
>> sev_pin_memory() correctly checks the RLIMIT_MEMLOCK and uses
>> pin_user_pages_fast().
>>
>> I have to strongly assume that sev_pin_memory() is *wrong* as is because
>> it's supposed to supply FOLL_LONGTERM -- after all we're pinning these
>> pages possibly forever.
>>
>>
>> I might be wrong but
>>
>> 1. You are missing the RLIMIT_MEMLOCK check
>
> Yes, I will add this check during the enc_region registration.
>
>> 2. get_page() is the wrong way of long-term pinning a page. You would
>> have to mimic what pin_user_pages_fast(FOLL_LONGTERM) does to eventually
>> get it right (e.g., migrate the page off of MIGRATE_CMA or ZONE_MOVABLE).
>
> Let me go through this and I will come back. Thanks for pointing this out.
I assume the "issue" is that KVM uses the mmu notifier and does a simple
get_user_pages() to obtain the references, dropping the reference when
the entry is invalidated via an mmu notifier call. So once you intend to
long-term pin, it's already too late.
If you could teach KVM to do a long-term pin when stumbling over these
special encrypted memory regions (requires a proper matching
unpin_user_pages() call from KVM), then you could "take over" that pin
by get_page(), and let KVM do the ordinary put_page(), while you would
do the unpin_user_pages().
--
Thanks,
David / dhildenb
On 1/28/2022 1:57 PM, David Hildenbrand wrote:
> On 28.01.22 07:57, Nikunj A. Dadhania wrote:
>> On 1/26/2022 4:16 PM, David Hildenbrand wrote:
>>> On 18.01.22 12:06, Nikunj A Dadhania wrote:
>>>> Use the memslot metadata to store the pinned data along with the pfns.
>>>> This improves the SEV guest startup time from O(n) to a constant by
>>>> deferring guest page pinning until the pages are used to satisfy nested
>>>> page faults. The page reference will be dropped in the memslot free
>>>> path.
>>>>
>>>> Remove the enc_region structure definition and the code which did
>>>> upfront pinning, as they are no longer needed in view of the demand
>>>> pinning support.
>>>>
>>>> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
>>>> since qemu is dependent on this API.
>>>>
>>>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>>>> ---
>>>> arch/x86/kvm/svm/sev.c | 208 ++++++++++++++++-------------------------
>>>> arch/x86/kvm/svm/svm.c | 1 +
>>>> arch/x86/kvm/svm/svm.h | 3 +-
>>>> 3 files changed, 81 insertions(+), 131 deletions(-)
>>>>
>>>> [... quoted diff snipped; the sev_pin_spte() hunk under discussion is retained below ...]
>>>> @@ -2946,13 +2817,90 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>>>> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
>>>> }
>>>>
>>>> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>>>> + kvm_pfn_t pfn)
>>>> +{
>>>> + struct kvm_arch_memory_slot *aslot;
>>>> + struct kvm_memory_slot *slot;
>>>> + gfn_t rel_gfn, pin_pfn;
>>>> + unsigned long npages;
>>>> + kvm_pfn_t old_pfn;
>>>> + int i;
>>>> +
>>>> + if (!sev_guest(kvm))
>>>> + return;
>>>> +
>>>> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
>>>> + return;
>>>> +
>>>> + /* Tested till 1GB pages */
>>>> + if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
>>>> + return;
>>>> +
>>>> + slot = gfn_to_memslot(kvm, gfn);
>>>> + if (!slot || !slot->arch.pfns)
>>>> + return;
>>>> +
>>>> + /*
>>>> + * Use relative gfn index within the memslot for the bitmap as well as
>>>> + * the pfns array
>>>> + */
>>>> + rel_gfn = gfn - slot->base_gfn;
>>>> + aslot = &slot->arch;
>>>> + pin_pfn = pfn;
>>>> + npages = KVM_PAGES_PER_HPAGE(level);
>>>> +
>>>> + /* Pin the page, KVM doesn't yet support page migration. */
>>>> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
>>>> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
>>>> + old_pfn = aslot->pfns[rel_gfn];
>>>> + if (old_pfn == pin_pfn)
>>>> + continue;
>>>> +
>>>> + put_page(pfn_to_page(old_pfn));
>>>> + }
>>>> +
>>>> + set_bit(rel_gfn, aslot->pinned_bitmap);
>>>> + aslot->pfns[rel_gfn] = pin_pfn;
>>>> + get_page(pfn_to_page(pin_pfn));
>>>
>>>
>>> I assume this is to replace KVM_MEMORY_ENCRYPT_REG_REGION, which ends up
>>> calling svm_register_enc_region()->sev_pin_memory(), correct?
>>
>> Yes, that is correct.
>>>
>>> sev_pin_memory() correctly checks the RLIMIT_MEMLOCK and uses
>>> pin_user_pages_fast().
>>>
>>> I have to strongly assume that sev_pin_memory() is *wrong* as is because
>>> it's supposed to supply FOLL_LONGTERM -- after all we're pinning these
>>> pages possibly forever.
>>>
>>>
>>> I might be wrong but
>>>
>>> 1. You are missing the RLIMIT_MEMLOCK check
>>
>> Yes, I will add this check during the enc_region registration.
>>
>>> 2. get_page() is the wrong way of long-term pinning a page. You would
>>> have to mimic what pin_user_pages_fast(FOLL_LONGTERM) does to eventually
>>> get it right (e.g., migrate the page off of MIGRATE_CMA or ZONE_MOVABLE).
>>
>> Let me go through this and I will come back. Thanks for pointing this out.
>
> I assume the "issue" is that KVM uses the mmu notifier and does a simple
> get_user_pages() to obtain the references, dropping the reference when
> the entry is invalidated via an mmu notifier call. So once you intend to
> long-term pin, it's already too late.
>
> If you could teach KVM to do a long-term pin when stumbling over these
> special encrypted memory regions (requires a proper matching
> unpin_user_pages() call from KVM), then you could "take over" that pin
> by get_page(), and let KVM do the ordinary put_page(), while you would
> do the unpin_user_pages().
>
The fault path looks like this in KVM x86 mmu code:
direct_page_fault()
 -> kvm_faultin_pfn()
    -> __gfn_to_pfn_memslot()
       -> hva_to_pfn()
          -> hva_to_pfn_{slow,fast}()
             -> get_user_pages_*()   <<<<==== This is where the reference is taken
The next step is to create the mappings, which is done in the functions below:
 -> kvm_tdp_mmu_map() / __direct_map()
Within these functions (patch 1/6), I call sev_pin_spte() to take an extra
reference and pin the page using get_page().
Is it possible to use pin_user_pages(FOLL_LONGTERM) here? Wouldn't that
be equivalent to the "take over" solution that you are suggesting?
The reference is released with put_page() when direct_page_fault() completes.
Later, when the SEV VM is shutting down, I can call unpin_user_pages() for
the pinned pages.
Regards
Nikunj
On 28.01.22 12:04, Nikunj A. Dadhania wrote:
> On 1/28/2022 1:57 PM, David Hildenbrand wrote:
>> On 28.01.22 07:57, Nikunj A. Dadhania wrote:
>>> On 1/26/2022 4:16 PM, David Hildenbrand wrote:
>>>> On 18.01.22 12:06, Nikunj A Dadhania wrote:
>>>>> Use the memslot metadata to store the pinned data along with the pfns.
>>>>> This improves the SEV guest startup time from O(n) to a constant by
>>>>> deferring guest page pinning until the pages are used to satisfy nested
>>>>> page faults. The page reference will be dropped in the memslot free
>>>>> path.
>>>>>
>>>>> Remove the enc_region structure definition and the code which did
>>>>> upfront pinning, as they are no longer needed in view of the demand
>>>>> pinning support.
>>>>>
>>>>> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
>>>>> since qemu is dependent on this API.
>>>>>
>>>>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>>>>> ---
>>>>> arch/x86/kvm/svm/sev.c | 208 ++++++++++++++++-------------------------
>>>>> arch/x86/kvm/svm/svm.c | 1 +
>>>>> arch/x86/kvm/svm/svm.h | 3 +-
>>>>> 3 files changed, 81 insertions(+), 131 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>>>> index d972ab4956d4..a962bed97a0b 100644
>>>>> --- a/arch/x86/kvm/svm/sev.c
>>>>> +++ b/arch/x86/kvm/svm/sev.c
>>>>> @@ -66,14 +66,6 @@ static unsigned int nr_asids;
>>>>> static unsigned long *sev_asid_bitmap;
>>>>> static unsigned long *sev_reclaim_asid_bitmap;
>>>>>
>>>>> -struct enc_region {
>>>>> - struct list_head list;
>>>>> - unsigned long npages;
>>>>> - struct page **pages;
>>>>> - unsigned long uaddr;
>>>>> - unsigned long size;
>>>>> -};
>>>>> -
>>>>> /* Called with the sev_bitmap_lock held, or on shutdown */
>>>>> static int sev_flush_asids(int min_asid, int max_asid)
>>>>> {
>>>>> @@ -257,8 +249,6 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>>>> if (ret)
>>>>> goto e_free;
>>>>>
>>>>> - INIT_LIST_HEAD(&sev->regions_list);
>>>>> -
>>>>> return 0;
>>>>>
>>>>> e_free:
>>>>> @@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
>>>>> src->handle = 0;
>>>>> src->pages_locked = 0;
>>>>> src->enc_context_owner = NULL;
>>>>> -
>>>>> - list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
>>>>> }
>>>>>
>>>>> static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
>>>>> @@ -1861,115 +1849,13 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
>>>>> int svm_register_enc_region(struct kvm *kvm,
>>>>> struct kvm_enc_region *range)
>>>>> {
>>>>> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>>> - struct enc_region *region;
>>>>> - int ret = 0;
>>>>> -
>>>>> - if (!sev_guest(kvm))
>>>>> - return -ENOTTY;
>>>>> -
>>>>> - /* If kvm is mirroring encryption context it isn't responsible for it */
>>>>> - if (is_mirroring_enc_context(kvm))
>>>>> - return -EINVAL;
>>>>> -
>>>>> - if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
>>>>> - return -EINVAL;
>>>>> -
>>>>> - region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
>>>>> - if (!region)
>>>>> - return -ENOMEM;
>>>>> -
>>>>> - mutex_lock(&kvm->lock);
>>>>> - region->pages = sev_pin_memory(kvm, range->addr, range->size, &region->npages, 1);
>>>>> - if (IS_ERR(region->pages)) {
>>>>> - ret = PTR_ERR(region->pages);
>>>>> - mutex_unlock(&kvm->lock);
>>>>> - goto e_free;
>>>>> - }
>>>>> -
>>>>> - region->uaddr = range->addr;
>>>>> - region->size = range->size;
>>>>> -
>>>>> - list_add_tail(&region->list, &sev->regions_list);
>>>>> - mutex_unlock(&kvm->lock);
>>>>> -
>>>>> - /*
>>>>> - * The guest may change the memory encryption attribute from C=0 -> C=1
>>>>> - * or vice versa for this memory range. Lets make sure caches are
>>>>> - * flushed to ensure that guest data gets written into memory with
>>>>> - * correct C-bit.
>>>>> - */
>>>>> - sev_clflush_pages(region->pages, region->npages);
>>>>> -
>>>>> - return ret;
>>>>> -
>>>>> -e_free:
>>>>> - kfree(region);
>>>>> - return ret;
>>>>> -}
>>>>> -
>>>>> -static struct enc_region *
>>>>> -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
>>>>> -{
>>>>> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>>> - struct list_head *head = &sev->regions_list;
>>>>> - struct enc_region *i;
>>>>> -
>>>>> - list_for_each_entry(i, head, list) {
>>>>> - if (i->uaddr == range->addr &&
>>>>> - i->size == range->size)
>>>>> - return i;
>>>>> - }
>>>>> -
>>>>> - return NULL;
>>>>> -}
>>>>> -
>>>>> -static void __unregister_enc_region_locked(struct kvm *kvm,
>>>>> - struct enc_region *region)
>>>>> -{
>>>>> - sev_unpin_memory(kvm, region->pages, region->npages);
>>>>> - list_del(&region->list);
>>>>> - kfree(region);
>>>>> + return 0;
>>>>> }
>>>>>
>>>>> int svm_unregister_enc_region(struct kvm *kvm,
>>>>> struct kvm_enc_region *range)
>>>>> {
>>>>> - struct enc_region *region;
>>>>> - int ret;
>>>>> -
>>>>> - /* If kvm is mirroring encryption context it isn't responsible for it */
>>>>> - if (is_mirroring_enc_context(kvm))
>>>>> - return -EINVAL;
>>>>> -
>>>>> - mutex_lock(&kvm->lock);
>>>>> -
>>>>> - if (!sev_guest(kvm)) {
>>>>> - ret = -ENOTTY;
>>>>> - goto failed;
>>>>> - }
>>>>> -
>>>>> - region = find_enc_region(kvm, range);
>>>>> - if (!region) {
>>>>> - ret = -EINVAL;
>>>>> - goto failed;
>>>>> - }
>>>>> -
>>>>> - /*
>>>>> - * Ensure that all guest tagged cache entries are flushed before
>>>>> - * releasing the pages back to the system for use. CLFLUSH will
>>>>> - * not do this, so issue a WBINVD.
>>>>> - */
>>>>> - wbinvd_on_all_cpus();
>>>>> -
>>>>> - __unregister_enc_region_locked(kvm, region);
>>>>> -
>>>>> - mutex_unlock(&kvm->lock);
>>>>> return 0;
>>>>> -
>>>>> -failed:
>>>>> - mutex_unlock(&kvm->lock);
>>>>> - return ret;
>>>>> }
>>>>>
>>>>> int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>>>>> @@ -2018,7 +1904,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>>>>> mirror_sev->fd = source_sev->fd;
>>>>> mirror_sev->es_active = source_sev->es_active;
>>>>> mirror_sev->handle = source_sev->handle;
>>>>> - INIT_LIST_HEAD(&mirror_sev->regions_list);
>>>>> ret = 0;
>>>>>
>>>>> /*
>>>>> @@ -2038,8 +1923,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>>>>> void sev_vm_destroy(struct kvm *kvm)
>>>>> {
>>>>> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>>> - struct list_head *head = &sev->regions_list;
>>>>> - struct list_head *pos, *q;
>>>>>
>>>>> WARN_ON(sev->num_mirrored_vms);
>>>>>
>>>>> @@ -2066,18 +1949,6 @@ void sev_vm_destroy(struct kvm *kvm)
>>>>> */
>>>>> wbinvd_on_all_cpus();
>>>>>
>>>>> - /*
>>>>> - * if userspace was terminated before unregistering the memory regions
>>>>> - * then lets unpin all the registered memory.
>>>>> - */
>>>>> - if (!list_empty(head)) {
>>>>> - list_for_each_safe(pos, q, head) {
>>>>> - __unregister_enc_region_locked(kvm,
>>>>> - list_entry(pos, struct enc_region, list));
>>>>> - cond_resched();
>>>>> - }
>>>>> - }
>>>>> -
>>>>> sev_unbind_asid(kvm, sev->handle);
>>>>> sev_asid_free(sev);
>>>>> }
>>>>> @@ -2946,13 +2817,90 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>>>>> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
>>>>> }
>>>>>
>>>>> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>>>>> + kvm_pfn_t pfn)
>>>>> +{
>>>>> + struct kvm_arch_memory_slot *aslot;
>>>>> + struct kvm_memory_slot *slot;
>>>>> + gfn_t rel_gfn, pin_pfn;
>>>>> + unsigned long npages;
>>>>> + kvm_pfn_t old_pfn;
>>>>> + int i;
>>>>> +
>>>>> + if (!sev_guest(kvm))
>>>>> + return;
>>>>> +
>>>>> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
>>>>> + return;
>>>>> +
>>>>> + /* Tested till 1GB pages */
>>>>> + if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
>>>>> + return;
>>>>> +
>>>>> + slot = gfn_to_memslot(kvm, gfn);
>>>>> + if (!slot || !slot->arch.pfns)
>>>>> + return;
>>>>> +
>>>>> + /*
>>>>> + * Use relative gfn index within the memslot for the bitmap as well as
>>>>> + * the pfns array
>>>>> + */
>>>>> + rel_gfn = gfn - slot->base_gfn;
>>>>> + aslot = &slot->arch;
>>>>> + pin_pfn = pfn;
>>>>> + npages = KVM_PAGES_PER_HPAGE(level);
>>>>> +
>>>>> + /* Pin the page, KVM doesn't yet support page migration. */
>>>>> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
>>>>> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
>>>>> + old_pfn = aslot->pfns[rel_gfn];
>>>>> + if (old_pfn == pin_pfn)
>>>>> + continue;
>>>>> +
>>>>> + put_page(pfn_to_page(old_pfn));
>>>>> + }
>>>>> +
>>>>> + set_bit(rel_gfn, aslot->pinned_bitmap);
>>>>> + aslot->pfns[rel_gfn] = pin_pfn;
>>>>> + get_page(pfn_to_page(pin_pfn));
>>>>
>>>>
>>>> I assume this is to replace KVM_MEMORY_ENCRYPT_REG_REGION, which ends up
>>>> calling svm_register_enc_region()->sev_pin_memory(), correct?
>>>
>>> Yes, that is correct.
>>>>
>>>> sev_pin_memory() correctly checks the RLIMIT_MEMLOCK and uses
>>>> pin_user_pages_fast().
>>>>
>>>> I have to strongly assume that sev_pin_memory() is *wrong* as is because
>>>> it's supposed to supply FOLL_LONGTERM -- after all we're pinning these
>>>> pages possibly forever.
>>>>
>>>>
>>>> I might be wrong but
>>>>
>>>> 1. You are missing the RLIMIT_MEMLOCK check
>>>
>>> Yes, I will add this check during the enc_region registration.
>>>
>>>> 2. get_page() is the wrong way of long-term pinning a page. You would
>>>> have to mimic what pin_user_pages_fast(FOLL_LONGTERM) does to eventually
>>>> get it right (e.g., migrate the page off of MIGRATE_CMA or ZONE_MOVABLE).
>>>
>>> Let me go through this and I will come back. Thanks for pointing this out.
>>
>> I assume the "issue" is that KVM uses the mmu notifier and does a simple
>> get_user_pages() to obtain the references, dropping the reference when
>> the entry is invalidated via an mmu notifier call. So once you intend to
>> long-term pin, it's already too late.
>>
>> If you could teach KVM to do a long-term pin when stumbling over these
>> special encrypted memory regions (requires a proper matching
>> unpin_user_pages() call from KVM), then you could "take over" that pin
>> by get_page(), and let KVM do the ordinary put_page(), while you would
>> do the unpin_user_pages().
>>
>
> The fault path looks like this in KVM x86 mmu code:
>
> direct_page_fault()
> -> kvm_faultin_pfn()
> -> __gfn_to_pfn_memslot()
> -> hva_to_pfn()
> -> hva_to_pfn_{slow,fast}()
> -> get_user_pages_*() <<<<==== This is where the
> reference is taken
>
> Next step is to create the mappings which is done in below functions:
>
> -> kvm_tdp_mmu_map() / __direct_map()
>
> -> Within this function (patch 1/6), I call sev_pin_spte to take an extra
> reference to pin it using get_page.
>
> Is it possible to use pin_user_pages(FOLL_LONGTERM) here? Wouldn't that
> be equivalent to "take over" solution that you are suggesting?
>
The issue is that pin_user_pages(FOLL_LONGTERM) might have to migrate
the page, which will fail if there is already an additional reference
from get_user_pages_*().
--
Thanks,
David / dhildenb
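[Editor's note: one possible shape of the "take over the pin" scheme David sketches above, as kernel-style pseudocode. This is illustrative only, not from any tree; the helpers hva_to_pfn_longterm(), sev_pin_spte_takeover() and sev_unpin_slot() are hypothetical names.]

```c
/*
 * Sketch of the "take over the pin" idea (hypothetical helpers).
 *
 * KVM's fault path takes a real long-term pin instead of a plain
 * reference.  FOLL_PIN | FOLL_LONGTERM may first migrate the page
 * off ZONE_MOVABLE / MIGRATE_CMA.
 */
static kvm_pfn_t hva_to_pfn_longterm(unsigned long addr, bool write_fault)
{
	struct page *page;
	unsigned int flags = FOLL_LONGTERM | (write_fault ? FOLL_WRITE : 0);

	if (pin_user_pages_fast(addr, 1, flags, &page) != 1)
		return KVM_PFN_ERR_FAULT;
	return page_to_pfn(page);
}

/*
 * Vendor hook: take over the pin with an extra get_page() reference.
 * KVM's ordinary put_page() on zap/invalidate then balances this
 * get_page(), leaving the long-term pin itself intact.
 */
static void sev_pin_spte_takeover(struct kvm_arch_memory_slot *aslot,
				  gfn_t rel_gfn, kvm_pfn_t pfn)
{
	get_page(pfn_to_page(pfn));
	set_bit(rel_gfn, aslot->pinned_bitmap);
	aslot->pfns[rel_gfn] = pfn;
}

/* Memslot free path: release the long-term pins properly. */
static void sev_unpin_slot(struct kvm_arch_memory_slot *aslot,
			   unsigned long npages)
{
	unsigned long i;

	for_each_set_bit(i, aslot->pinned_bitmap, npages)
		unpin_user_page(pfn_to_page(aslot->pfns[i]));
}
```

The reference accounting balances out: pin_user_pages_fast() takes the pin, get_page() adds one plain reference that KVM's later put_page() drops, and unpin_user_page() releases the pin at memslot teardown.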
On 28.01.22 12:08, David Hildenbrand wrote:
> On 28.01.22 12:04, Nikunj A. Dadhania wrote:
>> On 1/28/2022 1:57 PM, David Hildenbrand wrote:
>>> On 28.01.22 07:57, Nikunj A. Dadhania wrote:
>>>> On 1/26/2022 4:16 PM, David Hildenbrand wrote:
>>>>> On 18.01.22 12:06, Nikunj A Dadhania wrote:
>>>>>> Use the memslot metadata to store the pinned data along with the pfns.
>>>>>> This improves the SEV guest startup time from O(n) to a constant by
>>>>>> deferring guest page pinning until the pages are used to satisfy nested
>>>>>> page faults. The page reference will be dropped in the memslot free
>>>>>> path.
>>>>>>
>>>>>> Remove the enc_region structure definition and the code which did
>>>>>> upfront pinning, as they are no longer needed in view of the demand
>>>>>> pinning support.
>>>>>>
>>>>>> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
>>>>>> since qemu is dependent on this API.
>>>>>>
>>>>>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>>>>>> ---
>>>>>> [...]
>>>>>>
>>>>>> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>>>>>> + kvm_pfn_t pfn)
>>>>>> +{
>>>>>> + struct kvm_arch_memory_slot *aslot;
>>>>>> + struct kvm_memory_slot *slot;
>>>>>> + gfn_t rel_gfn, pin_pfn;
>>>>>> + unsigned long npages;
>>>>>> + kvm_pfn_t old_pfn;
>>>>>> + int i;
>>>>>> +
>>>>>> + if (!sev_guest(kvm))
>>>>>> + return;
>>>>>> +
>>>>>> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
>>>>>> + return;
>>>>>> +
>>>>>> + /* Tested till 1GB pages */
>>>>>> + if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
>>>>>> + return;
>>>>>> +
>>>>>> + slot = gfn_to_memslot(kvm, gfn);
>>>>>> + if (!slot || !slot->arch.pfns)
>>>>>> + return;
>>>>>> +
>>>>>> + /*
>>>>>> + * Use relative gfn index within the memslot for the bitmap as well as
>>>>>> + * the pfns array
>>>>>> + */
>>>>>> + rel_gfn = gfn - slot->base_gfn;
>>>>>> + aslot = &slot->arch;
>>>>>> + pin_pfn = pfn;
>>>>>> + npages = KVM_PAGES_PER_HPAGE(level);
>>>>>> +
>>>>>> + /* Pin the page, KVM doesn't yet support page migration. */
>>>>>> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
>>>>>> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
>>>>>> + old_pfn = aslot->pfns[rel_gfn];
>>>>>> + if (old_pfn == pin_pfn)
>>>>>> + continue;
>>>>>> +
>>>>>> + put_page(pfn_to_page(old_pfn));
>>>>>> + }
>>>>>> +
>>>>>> + set_bit(rel_gfn, aslot->pinned_bitmap);
>>>>>> + aslot->pfns[rel_gfn] = pin_pfn;
>>>>>> + get_page(pfn_to_page(pin_pfn));
>>>>>
>>>>>
>>>>> I assume this is to replace KVM_MEMORY_ENCRYPT_REG_REGION, which ends up
>>>>> calling svm_register_enc_region()->sev_pin_memory(), correct?
>>>>
>>>> Yes, that is correct.
>>>>>
>>>>> sev_pin_memory() correctly checks the RLIMIT_MEMLOCK and uses
>>>>> pin_user_pages_fast().
>>>>>
>>>>> I have to strongly assume that sev_pin_memory() is *wrong* as is because
>>>>> it's supposed to supply FOLL_LONGTERM -- after all we're pinning these
>>>>> pages possibly forever.
>>>>>
>>>>>
>>>>> I might be wrong but
>>>>>
>>>>> 1. You are missing the RLIMIT_MEMLOCK check
>>>>
>>>> Yes, I will add this check during the enc_region registration.
>>>>
>>>>> 2. get_page() is the wrong way of long-term pinning a page. You would
>>>>> have to mimic what pin_user_pages_fast(FOLL_LONGTERM) does to eventually
>>>>> get it right (e.g., migrate the page off of MIGRATE_CMA or ZONE_MOVABLE).
>>>>
>>>> Let me go through this and I will come back. Thanks for pointing this out.
>>>
>>> I assume the "issue" is that KVM uses the mmu notifier and does a simple
>>> get_user_pages() to obtain the references, dropping the reference when
>>> the entry is invalidated via an mmu notifier call. So once you intend to
>>> long-term pin, it's already too late.
>>>
>>> If you could teach KVM to do a long-term pin when stumbling over these
>>> special encrypted memory regions (requires a proper matching
>>> unpin_user_pages() call from KVM), then you could "take over" that pin
>>> by get_page(), and let KVM do the ordinary put_page(), while you would
>>> do the unpin_user_pages().
>>>
>>
>> The fault path looks like this in KVM x86 mmu code:
>>
>> direct_page_fault()
>> -> kvm_faultin_pfn()
>> -> __gfn_to_pfn_memslot()
>> -> hva_to_pfn()
>> -> hva_to_pfn_{slow,fast}()
>> -> get_user_pages_*() <<<<==== This is where the
>> reference is taken
>>
>> Next step is to create the mappings which is done in below functions:
>>
>> -> kvm_tdp_mmu_map() / __direct_map()
>>
>> -> Within this function (patch 1/6), I call sev_pin_spte to take an extra
>> reference to pin it using get_page.
>>
>> Is it possible to use pin_user_pages(FOLL_LONGTERM) here? Wouldn't that
>> be equivalent to "take over" solution that you are suggesting?
>>
>
> The issue is that pin_user_pages(FOLL_LONGTERM) might have to migrate
> the page, which will fail if there is already an additional reference
> from get_user_pages_*().
>
Minor addition: hva_to_pfn_{slow,fast}() *don't* take a reference,
because we neither supply FOLL_GET nor FOLL_PIN. GUP users that rely on
memory notifiers don't require references.
I don't know what the implications would be if you FOLL_PIN |
FOLL_LONGTERM after already having a reference via
hva_to_pfn_{slow,fast}() in your hand in the callpath. Migration code
would effectively want to unmap the old page and call mmu notifiers to
properly invalidate the KVM MMU ...
In an ideal world, you'd really do a FOLL_PIN | FOLL_LONGTERM right away,
not doing the get_user_pages_*() first.
--
Thanks,
David / dhildenb
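[Editor's note: the distinction David draws above can be summarized in a short sketch. The GUP flags are the real ones; the surrounding calls are illustrative kernel-style pseudocode, not a compilable unit.]

```c
struct page *page;

/* Neither FOLL_GET nor FOLL_PIN: lookup only, no reference taken.
 * Safe only because KVM tears down its mappings via mmu notifiers. */
get_user_pages(addr, 1, 0, &page);

/* FOLL_GET: a plain reference, dropped with put_page().  Not a "pin"
 * as far as the core MM (e.g. page migration) is concerned. */
get_user_page_fast_only(addr, FOLL_WRITE, &page);

/* FOLL_PIN | FOLL_LONGTERM: a real long-term pin.  May migrate the
 * page off ZONE_MOVABLE/CMA first, and must be released with
 * unpin_user_page() -- this is what the ideal fault path would do. */
pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &page);
```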
On 1/31/2022 5:26 PM, David Hildenbrand wrote:
> On 28.01.22 12:08, David Hildenbrand wrote:
>> On 28.01.22 12:04, Nikunj A. Dadhania wrote:
>>> On 1/28/2022 1:57 PM, David Hildenbrand wrote:
>>>> On 28.01.22 07:57, Nikunj A. Dadhania wrote:
>>>>> On 1/26/2022 4:16 PM, David Hildenbrand wrote:
>>>>>> On 18.01.22 12:06, Nikunj A Dadhania wrote:
>>>>>>> [...]
>>>>>>
>>>>>>
>>>>>> I assume this is to replace KVM_MEMORY_ENCRYPT_REG_REGION, which ends up
>>>>>> calling svm_register_enc_region()->sev_pin_memory(), correct?
>>>>>
>>>>> Yes, that is correct.
>>>>>>
>>>>>> sev_pin_memory() correctly checks the RLIMIT_MEMLOCK and uses
>>>>>> pin_user_pages_fast().
>>>>>>
>>>>>> I have to strongly assume that sev_pin_memory() is *wrong* as is because
>>>>>> it's supposed to supply FOLL_LONGTERM -- after all we're pinning these
>>>>>> pages possibly forever.
>>>>>>
>>>>>>
>>>>>> I might be wrong but
>>>>>>
>>>>>> 1. You are missing the RLIMIT_MEMLOCK check
>>>>>
>>>>> Yes, I will add this check during the enc_region registration.
>>>>>
>>>>>> 2. get_page() is the wrong way of long-term pinning a page. You would
>>>>>> have to mimic what pin_user_pages_fast(FOLL_LONGTERM) does to eventually
>>>>>> get it right (e.g., migrate the page off of MIGRATE_CMA or ZONE_MOVABLE).
>>>>>
>>>>> Let me go through this and I will come back. Thanks for pointing this out.
>>>>
>>>> I assume the "issue" is that KVM uses the mmu notifier and does a simple
>>>> get_user_pages() to obtain the references, dropping the reference when
>>>> the entry is invalidated via an mmu notifier call. So once you intend to
>>>> long-term pin, it's already too late.
>>>>
>>>> If you could teach KVM to do a long-term pin when stumbling over these
>>>> special encrypted memory regions (requires a proper matching
>>>> unpin_user_pages() call from KVM), then you could "take over" that pin
>>>> by get_page(), and let KVM do the ordinary put_page(), while you would
>>>> do the unpin_user_pages().
>>>>
>>>
>>> The fault path looks like this in KVM x86 mmu code:
>>>
>>> direct_page_fault()
>>> -> kvm_faultin_pfn()
>>> -> __gfn_to_pfn_memslot()
>>> -> hva_to_pfn()
>>> -> hva_to_pfn_{slow,fast}()
>>> -> get_user_pages_*() <<<<==== This is where the
>>> reference is taken
>>>
>>> Next step is to create the mappings which is done in below functions:
>>>
>>> -> kvm_tdp_mmu_map() / __direct_map()
>>>
>>> -> Within this function (patch 1/6), I call sev_pin_spte to take an extra
>>> reference to pin it using get_page.
>>>
>>> Is it possible to use pin_user_pages(FOLL_LONGTERM) here? Wouldn't that
>>> be equivalent to "take over" solution that you are suggesting?
>>>
>>
>> The issue is that pin_user_pages(FOLL_LONGTERM) might have to migrate
>> the page, which will fail if there is already an additional reference
>> from get_user_pages_*().
>>
>
> Minor addition: hva_to_pfn_{slow,fast}() *don't* take a reference,
hva_to_pfn_fast() does take a reference; I am not able to find it in _slow() though.
->get_user_page_fast_only()
-> get_user_pages_fast_only()
...
gup_flags |= FOLL_GET | FOLL_FAST_ONLY;
...
> because we neither supply FOLL_GET nor FOLL_PIN. GUP users that rely on
> memory notifiers don't require references.
>
> I don't know what the implications would be if you FOLL_PIN |
> FOLL_LONGTERM after already having a reference via
> hva_to_pfn_{slow,fast}() in your hand in the callpath. Migration code
> would effectively want to unmap the old page and call mmu notifiers to
> properly invalidate the KVM MMU ...
>
> In an ideal world, you'd really do a FOLL_PIN | FOLL_LONGTERM right away,
> not doing the get_user_pages_*() first.
>
I am thinking along the same lines:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eff3ef64722b..fd7c878ab03d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2379,9 +2379,10 @@ static inline int check_user_page_hwpoison(unsigned long addr)
* only part that runs if we can in atomic context.
*/
static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
- bool *writable, kvm_pfn_t *pfn)
+ bool *writable, kvm_pfn_t *pfn, bool pin_longterm)
{
struct page *page[1];
+ bool ret;
/*
* Fast pin a writable pfn only if it is a write fault request
@@ -2391,7 +2392,12 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
if (!(write_fault || writable))
return false;
- if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+ if (!pin_longterm)
+ ret = get_user_page_fast_only(addr, FOLL_WRITE, page);
+ else
+ ret = pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_LONGTERM, page);
+
+ if (ret) {
*pfn = page_to_pfn(page[0]);
And the pin_longterm could be determined using a memslot flag:
#define KVM_MEMSLOT_LONGTERM (1UL << 17)
Regards
Nikunj
On 31.01.22 13:18, Nikunj A. Dadhania wrote:
> On 1/31/2022 5:26 PM, David Hildenbrand wrote:
>> On 28.01.22 12:08, David Hildenbrand wrote:
>>> On 28.01.22 12:04, Nikunj A. Dadhania wrote:
>>>> On 1/28/2022 1:57 PM, David Hildenbrand wrote:
>>>>> On 28.01.22 07:57, Nikunj A. Dadhania wrote:
>>>>>> On 1/26/2022 4:16 PM, David Hildenbrand wrote:
>>>>>>> On 18.01.22 12:06, Nikunj A Dadhania wrote:
>>>>>>>> Use the memslot metadata to store the pinned data along with the pfns.
>>>>>>>> This improves the SEV guest startup time from O(n) to a constant by
>>>>>>>> deferring guest page pinning until the pages are used to satisfy nested
>>>>>>>> page faults. The page reference will be dropped in the memslot free
>>>>>>>> path.
>>>>>>>>
>>>>>>>> Remove the enc_region structure definition and the code which did
>>>>>>>> upfront pinning, as they are no longer needed in view of the demand
>>>>>>>> pinning support.
>>>>>>>>
>>>>>>>> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
>>>>>>>> since qemu is dependent on this API.
>>>>>>>>
>>>>>>>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>>>>>>>> ---
>>>>>>>> arch/x86/kvm/svm/sev.c | 208 ++++++++++++++++-------------------------
>>>>>>>> arch/x86/kvm/svm/svm.c | 1 +
>>>>>>>> arch/x86/kvm/svm/svm.h | 3 +-
>>>>>>>> 3 files changed, 81 insertions(+), 131 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>>>>>>> index d972ab4956d4..a962bed97a0b 100644
>>>>>>>> --- a/arch/x86/kvm/svm/sev.c
>>>>>>>> +++ b/arch/x86/kvm/svm/sev.c
>>>>>>>> @@ -66,14 +66,6 @@ static unsigned int nr_asids;
>>>>>>>> static unsigned long *sev_asid_bitmap;
>>>>>>>> static unsigned long *sev_reclaim_asid_bitmap;
>>>>>>>>
>>>>>>>> -struct enc_region {
>>>>>>>> - struct list_head list;
>>>>>>>> - unsigned long npages;
>>>>>>>> - struct page **pages;
>>>>>>>> - unsigned long uaddr;
>>>>>>>> - unsigned long size;
>>>>>>>> -};
>>>>>>>> -
>>>>>>>> /* Called with the sev_bitmap_lock held, or on shutdown */
>>>>>>>> static int sev_flush_asids(int min_asid, int max_asid)
>>>>>>>> {
>>>>>>>> @@ -257,8 +249,6 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>>>>>>> if (ret)
>>>>>>>> goto e_free;
>>>>>>>>
>>>>>>>> - INIT_LIST_HEAD(&sev->regions_list);
>>>>>>>> -
>>>>>>>> return 0;
>>>>>>>>
>>>>>>>> e_free:
>>>>>>>> @@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
>>>>>>>> src->handle = 0;
>>>>>>>> src->pages_locked = 0;
>>>>>>>> src->enc_context_owner = NULL;
>>>>>>>> -
>>>>>>>> - list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
>>>>>>>> }
>>>>>>>>
>>>>>>>> static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
>>>>>>>> @@ -1861,115 +1849,13 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
>>>>>>>> int svm_register_enc_region(struct kvm *kvm,
>>>>>>>> struct kvm_enc_region *range)
>>>>>>>> {
>>>>>>>> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>>>>>> - struct enc_region *region;
>>>>>>>> - int ret = 0;
>>>>>>>> -
>>>>>>>> - if (!sev_guest(kvm))
>>>>>>>> - return -ENOTTY;
>>>>>>>> -
>>>>>>>> - /* If kvm is mirroring encryption context it isn't responsible for it */
>>>>>>>> - if (is_mirroring_enc_context(kvm))
>>>>>>>> - return -EINVAL;
>>>>>>>> -
>>>>>>>> - if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
>>>>>>>> - return -EINVAL;
>>>>>>>> -
>>>>>>>> - region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
>>>>>>>> - if (!region)
>>>>>>>> - return -ENOMEM;
>>>>>>>> -
>>>>>>>> - mutex_lock(&kvm->lock);
>>>>>>>> - region->pages = sev_pin_memory(kvm, range->addr, range->size, &region->npages, 1);
>>>>>>>> - if (IS_ERR(region->pages)) {
>>>>>>>> - ret = PTR_ERR(region->pages);
>>>>>>>> - mutex_unlock(&kvm->lock);
>>>>>>>> - goto e_free;
>>>>>>>> - }
>>>>>>>> -
>>>>>>>> - region->uaddr = range->addr;
>>>>>>>> - region->size = range->size;
>>>>>>>> -
>>>>>>>> - list_add_tail(&region->list, &sev->regions_list);
>>>>>>>> - mutex_unlock(&kvm->lock);
>>>>>>>> -
>>>>>>>> - /*
>>>>>>>> - * The guest may change the memory encryption attribute from C=0 -> C=1
>>>>>>>> - * or vice versa for this memory range. Lets make sure caches are
>>>>>>>> - * flushed to ensure that guest data gets written into memory with
>>>>>>>> - * correct C-bit.
>>>>>>>> - */
>>>>>>>> - sev_clflush_pages(region->pages, region->npages);
>>>>>>>> -
>>>>>>>> - return ret;
>>>>>>>> -
>>>>>>>> -e_free:
>>>>>>>> - kfree(region);
>>>>>>>> - return ret;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>> -static struct enc_region *
>>>>>>>> -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
>>>>>>>> -{
>>>>>>>> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>>>>>> - struct list_head *head = &sev->regions_list;
>>>>>>>> - struct enc_region *i;
>>>>>>>> -
>>>>>>>> - list_for_each_entry(i, head, list) {
>>>>>>>> - if (i->uaddr == range->addr &&
>>>>>>>> - i->size == range->size)
>>>>>>>> - return i;
>>>>>>>> - }
>>>>>>>> -
>>>>>>>> - return NULL;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>> -static void __unregister_enc_region_locked(struct kvm *kvm,
>>>>>>>> - struct enc_region *region)
>>>>>>>> -{
>>>>>>>> - sev_unpin_memory(kvm, region->pages, region->npages);
>>>>>>>> - list_del(&region->list);
>>>>>>>> - kfree(region);
>>>>>>>> + return 0;
>>>>>>>> }
>>>>>>>>
>>>>>>>> int svm_unregister_enc_region(struct kvm *kvm,
>>>>>>>> struct kvm_enc_region *range)
>>>>>>>> {
>>>>>>>> - struct enc_region *region;
>>>>>>>> - int ret;
>>>>>>>> -
>>>>>>>> - /* If kvm is mirroring encryption context it isn't responsible for it */
>>>>>>>> - if (is_mirroring_enc_context(kvm))
>>>>>>>> - return -EINVAL;
>>>>>>>> -
>>>>>>>> - mutex_lock(&kvm->lock);
>>>>>>>> -
>>>>>>>> - if (!sev_guest(kvm)) {
>>>>>>>> - ret = -ENOTTY;
>>>>>>>> - goto failed;
>>>>>>>> - }
>>>>>>>> -
>>>>>>>> - region = find_enc_region(kvm, range);
>>>>>>>> - if (!region) {
>>>>>>>> - ret = -EINVAL;
>>>>>>>> - goto failed;
>>>>>>>> - }
>>>>>>>> -
>>>>>>>> - /*
>>>>>>>> - * Ensure that all guest tagged cache entries are flushed before
>>>>>>>> - * releasing the pages back to the system for use. CLFLUSH will
>>>>>>>> - * not do this, so issue a WBINVD.
>>>>>>>> - */
>>>>>>>> - wbinvd_on_all_cpus();
>>>>>>>> -
>>>>>>>> - __unregister_enc_region_locked(kvm, region);
>>>>>>>> -
>>>>>>>> - mutex_unlock(&kvm->lock);
>>>>>>>> return 0;
>>>>>>>> -
>>>>>>>> -failed:
>>>>>>>> - mutex_unlock(&kvm->lock);
>>>>>>>> - return ret;
>>>>>>>> }
>>>>>>>>
>>>>>>>> int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>>>>>>>> @@ -2018,7 +1904,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>>>>>>>> mirror_sev->fd = source_sev->fd;
>>>>>>>> mirror_sev->es_active = source_sev->es_active;
>>>>>>>> mirror_sev->handle = source_sev->handle;
>>>>>>>> - INIT_LIST_HEAD(&mirror_sev->regions_list);
>>>>>>>> ret = 0;
>>>>>>>>
>>>>>>>> /*
>>>>>>>> @@ -2038,8 +1923,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>>>>>>>> void sev_vm_destroy(struct kvm *kvm)
>>>>>>>> {
>>>>>>>> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>>>>>> - struct list_head *head = &sev->regions_list;
>>>>>>>> - struct list_head *pos, *q;
>>>>>>>>
>>>>>>>> WARN_ON(sev->num_mirrored_vms);
>>>>>>>>
>>>>>>>> @@ -2066,18 +1949,6 @@ void sev_vm_destroy(struct kvm *kvm)
>>>>>>>> */
>>>>>>>> wbinvd_on_all_cpus();
>>>>>>>>
>>>>>>>> - /*
>>>>>>>> - * if userspace was terminated before unregistering the memory regions
>>>>>>>> - * then lets unpin all the registered memory.
>>>>>>>> - */
>>>>>>>> - if (!list_empty(head)) {
>>>>>>>> - list_for_each_safe(pos, q, head) {
>>>>>>>> - __unregister_enc_region_locked(kvm,
>>>>>>>> - list_entry(pos, struct enc_region, list));
>>>>>>>> - cond_resched();
>>>>>>>> - }
>>>>>>>> - }
>>>>>>>> -
>>>>>>>> sev_unbind_asid(kvm, sev->handle);
>>>>>>>> sev_asid_free(sev);
>>>>>>>> }
>>>>>>>> @@ -2946,13 +2817,90 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>>>>>>>> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
>>>>>>>> }
>>>>>>>>
>>>>>>>> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>>>>>>>> + kvm_pfn_t pfn)
>>>>>>>> +{
>>>>>>>> + struct kvm_arch_memory_slot *aslot;
>>>>>>>> + struct kvm_memory_slot *slot;
>>>>>>>> + gfn_t rel_gfn, pin_pfn;
>>>>>>>> + unsigned long npages;
>>>>>>>> + kvm_pfn_t old_pfn;
>>>>>>>> + int i;
>>>>>>>> +
>>>>>>>> + if (!sev_guest(kvm))
>>>>>>>> + return;
>>>>>>>> +
>>>>>>>> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
>>>>>>>> + return;
>>>>>>>> +
>>>>>>>> + /* Tested till 1GB pages */
>>>>>>>> + if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
>>>>>>>> + return;
>>>>>>>> +
>>>>>>>> + slot = gfn_to_memslot(kvm, gfn);
>>>>>>>> + if (!slot || !slot->arch.pfns)
>>>>>>>> + return;
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * Use relative gfn index within the memslot for the bitmap as well as
>>>>>>>> + * the pfns array
>>>>>>>> + */
>>>>>>>> + rel_gfn = gfn - slot->base_gfn;
>>>>>>>> + aslot = &slot->arch;
>>>>>>>> + pin_pfn = pfn;
>>>>>>>> + npages = KVM_PAGES_PER_HPAGE(level);
>>>>>>>> +
>>>>>>>> + /* Pin the page, KVM doesn't yet support page migration. */
>>>>>>>> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
>>>>>>>> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
>>>>>>>> + old_pfn = aslot->pfns[rel_gfn];
>>>>>>>> + if (old_pfn == pin_pfn)
>>>>>>>> + continue;
>>>>>>>> +
>>>>>>>> + put_page(pfn_to_page(old_pfn));
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + set_bit(rel_gfn, aslot->pinned_bitmap);
>>>>>>>> + aslot->pfns[rel_gfn] = pin_pfn;
>>>>>>>> + get_page(pfn_to_page(pin_pfn));
>>>>>>>
>>>>>>>
>>>>>>> I assume this is to replace KVM_MEMORY_ENCRYPT_REG_REGION, which ends up
>>>>>>> calling svm_register_enc_region()->sev_pin_memory(), correct?
>>>>>>
>>>>>> Yes, that is correct.
>>>>>>>
>>>>>>> sev_pin_memory() correctly checks the RLIMIT_MEMLOCK and uses
>>>>>>> pin_user_pages_fast().
>>>>>>>
>>>>>>> I have to strongly assume that sev_pin_memory() is *wrong* as is because
>>>>>>> it's supposed to supply FOLL_LONGTERM -- after all we're pinning these
>>>>>>> pages possibly forever.
>>>>>>>
>>>>>>>
>>>>>>> I might be wrong but
>>>>>>>
>>>>>>> 1. You are missing the RLIMIT_MEMLOCK check
>>>>>>
>>>>>> Yes, I will add this check during the enc_region registration.
>>>>>>
>>>>>>> 2. get_page() is the wrong way of long-term pinning a page. You would
>>>>>>> have to mimic what pin_user_pages_fast(FOLL_LONGTERM) does to eventually
>>>>>>> get it right (e.g., migrate the page off of MIGRATE_CMA or ZONE_MOVABLE).
>>>>>>
>>>>>> Let me go through this and I will come back. Thanks for pointing this out.
>>>>>
>>>>> I assume the "issue" is that KVM uses mmu notifier and does a simple
>>>>> get_user_pages() to obtain the references, to drop the reference when
>>>>> the entry is invalidated via a mmu notifier call. So once you intend to
>>>>> long-term pin, it's already too late.
>>>>>
>>>>> If you could teach KVM to do a long-term pin when stumbling over these
>>>>> special encrypted memory regions (requires a proper matching
>>>>> unpin_user_pages() call from KVM), then you could "take over" that pin
>>>>> by get_page(), and let KVM do the ordinary put_page(), while you would
>>>>> do the unpin_user_pages().
>>>>>
>>>>
>>>> The fault path looks like this in KVM x86 mmu code:
>>>>
>>>> direct_page_fault()
>>>> -> kvm_faultin_pfn()
>>>> -> __gfn_to_pfn_memslot()
>>>> -> hva_to_pfn()
>>>> -> hva_to_pfn_{slow,fast}()
>>>> -> get_user_pages_*() <<<<==== This is where the
>>>> reference is taken
>>>>
>>>> Next step is to create the mappings which is done in below functions:
>>>>
>>>> -> kvm_tdp_mmu_map() / __direct_map()
>>>>
>>>> -> Within this function (patch 1/6), I call sev_pin_spte to take an extra
>>>> reference to pin it using get_page.
>>>>
>>>> Is it possible to use pin_user_pages(FOLL_LONGTERM) here? Wouldn't that
>>>> be equivalent to "take over" solution that you are suggesting?
>>>>
>>>
>>> The issue is that pin_user_pages(FOLL_LONGTERM) might have to migrate
>>> the page, which will fail if there is already an additional reference
>>> from get_user_pages_*().
>>>
>>
>> Minor addition: hva_to_pfn_{slow,fast}() *don't* take a reference,
>
> hva_to_pfn_fast() does take a reference; I am not able to find it in _slow() though.
Ah, my fault, you're correct and my memory is wrong.
>
> ->get_user_page_fast_only()
> -> get_user_pages_fast_only()
> ...
> gup_flags |= FOLL_GET | FOLL_FAST_ONLY;
> ...
__get_user_pages_locked() has
if (pages && !(flags & FOLL_PIN))
        flags |= FOLL_GET;
I could have sworn we'd have code to look up a page without the need to
grab a reference for MMU notifier purposes in KVM's MMU.
But looking into the details, I think we simply get a reference, map the
page, and then release the reference.
--
Thanks,
David / dhildenb
On Tue, Jan 18, 2022, Nikunj A Dadhania wrote:
> Use the memslot metadata to store the pinned data along with the pfns.
> This improves the SEV guest startup time from O(n) to a constant by
> deferring guest page pinning until the pages are used to satisfy nested
> page faults. The page reference will be dropped in the memslot free
> path.
>
> Remove the enc_region structure definition and the code which did
> upfront pinning, as they are no longer needed in view of the demand
> pinning support.
>
> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
> since qemu is dependent on this API.
>
> Signed-off-by: Nikunj A Dadhania <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 208 ++++++++++++++++-------------------------
> arch/x86/kvm/svm/svm.c | 1 +
> arch/x86/kvm/svm/svm.h | 3 +-
> 3 files changed, 81 insertions(+), 131 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index d972ab4956d4..a962bed97a0b 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -66,14 +66,6 @@ static unsigned int nr_asids;
> static unsigned long *sev_asid_bitmap;
> static unsigned long *sev_reclaim_asid_bitmap;
>
> -struct enc_region {
> - struct list_head list;
> - unsigned long npages;
> - struct page **pages;
> - unsigned long uaddr;
> - unsigned long size;
> -};
> -
> /* Called with the sev_bitmap_lock held, or on shutdown */
> static int sev_flush_asids(int min_asid, int max_asid)
> {
> @@ -257,8 +249,6 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (ret)
> goto e_free;
>
> - INIT_LIST_HEAD(&sev->regions_list);
> -
> return 0;
>
> e_free:
> @@ -1637,8 +1627,6 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
> src->handle = 0;
> src->pages_locked = 0;
> src->enc_context_owner = NULL;
> -
> - list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
> }
>
> static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
> @@ -1861,115 +1849,13 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
> int svm_register_enc_region(struct kvm *kvm,
> struct kvm_enc_region *range)
> {
> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> - struct enc_region *region;
> - int ret = 0;
> -
> - if (!sev_guest(kvm))
> - return -ENOTTY;
> -
> - /* If kvm is mirroring encryption context it isn't responsible for it */
> - if (is_mirroring_enc_context(kvm))
> - return -EINVAL;
> -
> - if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
> - return -EINVAL;
> -
> - region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
> - if (!region)
> - return -ENOMEM;
> -
> - mutex_lock(&kvm->lock);
> - region->pages = sev_pin_memory(kvm, range->addr, range->size, &region->npages, 1);
> - if (IS_ERR(region->pages)) {
> - ret = PTR_ERR(region->pages);
> - mutex_unlock(&kvm->lock);
> - goto e_free;
> - }
> -
> - region->uaddr = range->addr;
> - region->size = range->size;
> -
> - list_add_tail(&region->list, &sev->regions_list);
> - mutex_unlock(&kvm->lock);
> -
> - /*
> - * The guest may change the memory encryption attribute from C=0 -> C=1
> - * or vice versa for this memory range. Lets make sure caches are
> - * flushed to ensure that guest data gets written into memory with
> - * correct C-bit.
> - */
> - sev_clflush_pages(region->pages, region->npages);
> -
> - return ret;
> -
> -e_free:
> - kfree(region);
> - return ret;
> -}
> -
> -static struct enc_region *
> -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
> -{
> - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> - struct list_head *head = &sev->regions_list;
> - struct enc_region *i;
> -
> - list_for_each_entry(i, head, list) {
> - if (i->uaddr == range->addr &&
> - i->size == range->size)
> - return i;
> - }
> -
> - return NULL;
> -}
> -
> -static void __unregister_enc_region_locked(struct kvm *kvm,
> - struct enc_region *region)
> -{
> - sev_unpin_memory(kvm, region->pages, region->npages);
> - list_del(&region->list);
> - kfree(region);
> + return 0;
> }
>
> int svm_unregister_enc_region(struct kvm *kvm,
> struct kvm_enc_region *range)
> {
> - struct enc_region *region;
> - int ret;
> -
> - /* If kvm is mirroring encryption context it isn't responsible for it */
> - if (is_mirroring_enc_context(kvm))
> - return -EINVAL;
> -
> - mutex_lock(&kvm->lock);
> -
> - if (!sev_guest(kvm)) {
> - ret = -ENOTTY;
> - goto failed;
> - }
> -
> - region = find_enc_region(kvm, range);
> - if (!region) {
> - ret = -EINVAL;
> - goto failed;
> - }
> -
> - /*
> - * Ensure that all guest tagged cache entries are flushed before
> - * releasing the pages back to the system for use. CLFLUSH will
> - * not do this, so issue a WBINVD.
> - */
> - wbinvd_on_all_cpus();
> -
> - __unregister_enc_region_locked(kvm, region);
> -
> - mutex_unlock(&kvm->lock);
> return 0;
> -
> -failed:
> - mutex_unlock(&kvm->lock);
> - return ret;
> }
>
> int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> @@ -2018,7 +1904,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> mirror_sev->fd = source_sev->fd;
> mirror_sev->es_active = source_sev->es_active;
> mirror_sev->handle = source_sev->handle;
> - INIT_LIST_HEAD(&mirror_sev->regions_list);
> ret = 0;
>
> /*
> @@ -2038,8 +1923,6 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> void sev_vm_destroy(struct kvm *kvm)
> {
> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> - struct list_head *head = &sev->regions_list;
> - struct list_head *pos, *q;
>
> WARN_ON(sev->num_mirrored_vms);
>
> @@ -2066,18 +1949,6 @@ void sev_vm_destroy(struct kvm *kvm)
> */
> wbinvd_on_all_cpus();
>
> - /*
> - * if userspace was terminated before unregistering the memory regions
> - * then lets unpin all the registered memory.
> - */
> - if (!list_empty(head)) {
> - list_for_each_safe(pos, q, head) {
> - __unregister_enc_region_locked(kvm,
> - list_entry(pos, struct enc_region, list));
> - cond_resched();
> - }
> - }
> -
> sev_unbind_asid(kvm, sev->handle);
> sev_asid_free(sev);
> }
> @@ -2946,13 +2817,90 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
> }
>
> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn)
> +{
> + struct kvm_arch_memory_slot *aslot;
> + struct kvm_memory_slot *slot;
> + gfn_t rel_gfn, pin_pfn;
> + unsigned long npages;
> + kvm_pfn_t old_pfn;
> + int i;
> +
> + if (!sev_guest(kvm))
> + return;
> +
> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
> + return;
> +
> + /* Tested till 1GB pages */
> + if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
> + return;
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot || !slot->arch.pfns)
> + return;
> +
> + /*
> + * Use relative gfn index within the memslot for the bitmap as well as
> + * the pfns array
> + */
> + rel_gfn = gfn - slot->base_gfn;
> + aslot = &slot->arch;
> + pin_pfn = pfn;
> + npages = KVM_PAGES_PER_HPAGE(level);
> +
> + /* Pin the page, KVM doesn't yet support page migration. */
> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
> + old_pfn = aslot->pfns[rel_gfn];
> + if (old_pfn == pin_pfn)
> + continue;
> +
> + put_page(pfn_to_page(old_pfn));
You need to flush the old pfn using VMPAGE_FLUSH before doing put_page.
Normally, this should not happen. But if the user-level VMM is
malicious, then it could just munmap() the region (not the memslot);
mmap() it again; let the guest VM touch the page and you will see this
path get executed.
Clearly, this will slow down the faulting path if this happens. So,
alternatively, you can register a hook in mmu_notifier and shoot a flush
there according to the bitmap. Either way should work.
> + }
> +
> + set_bit(rel_gfn, aslot->pinned_bitmap);
> + aslot->pfns[rel_gfn] = pin_pfn;
> + get_page(pfn_to_page(pin_pfn));
> + }
> +
> + /*
> + * Flush any cached lines of the page being added since "ownership" of
> + * it will be transferred from the host to an encrypted guest.
> + */
> + clflush_cache_range(__va(pfn << PAGE_SHIFT), page_level_size(level));
> +}
> +
> void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> {
> struct kvm_arch_memory_slot *aslot = &slot->arch;
> + kvm_pfn_t *pfns;
> + gfn_t gfn;
> + int i;
>
> if (!sev_guest(kvm))
> return;
>
> + if (!aslot->pinned_bitmap || !slot->arch.pfns)
> + goto out;
> +
> + pfns = aslot->pfns;
> +
> + /*
> + * Iterate the memslot to find the pinned pfn using the bitmap and drop
> + * the pfn stored.
> + */
> + for (i = 0, gfn = slot->base_gfn; i < slot->npages; i++, gfn++) {
> + if (test_and_clear_bit(i, aslot->pinned_bitmap)) {
> + if (WARN_ON(!pfns[i]))
> + continue;
> +
> + put_page(pfn_to_page(pfns[i]));
Here, you get lucky that you don't have to flush the cache. However,
this is because sev_free_memslots is called after the
kvm_arch_destroy_vm, which flushes the cache system-wide.
> + }
> + }
> +
> +out:
> if (aslot->pinned_bitmap) {
> kvfree(aslot->pinned_bitmap);
> aslot->pinned_bitmap = NULL;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 3fb19974f719..22535c680b3f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4743,6 +4743,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>
> .alloc_memslot_metadata = sev_alloc_memslot_metadata,
> .free_memslot = sev_free_memslot,
> + .pin_spte = sev_pin_spte,
> };
>
> static struct kvm_x86_init_ops svm_init_ops __initdata = {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index b2f8b3b52680..c731bc91ea8f 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -77,7 +77,6 @@ struct kvm_sev_info {
> unsigned int handle; /* SEV firmware handle */
> int fd; /* SEV device fd */
> unsigned long pages_locked; /* Number of pages locked */
> - struct list_head regions_list; /* List of registered regions */
> u64 ap_jump_table; /* SEV-ES AP Jump Table address */
> struct kvm *enc_context_owner; /* Owner of copied encryption context */
> unsigned long num_mirrored_vms; /* Number of VMs sharing this ASID */
> @@ -648,5 +647,7 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
> struct kvm_memory_slot *new);
> void sev_free_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot);
> +void sev_pin_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn);
>
> #endif
> --
> 2.32.0
>
On Tue, Jan 18, 2022, Nikunj A Dadhania wrote:
> SEV guest requires the guest's pages to be pinned in host physical
> memory as migration of encrypted pages is not supported. The memory
> encryption scheme uses the physical address of the memory being
> encrypted. If guest pages are moved by the host, content decrypted in
> the guest would be incorrect thereby corrupting guest's memory.
>
> For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
> encrypted and when the guest is done using those pages. Hypervisor
> should treat all the guest pages as encrypted until the guest is
> destroyed.
"Hypervisor should treat all the guest pages as encrypted until they are
deallocated or the guest is destroyed".
Note: in general, the guest VM could ask the user-level VMM to free the
page by either freeing the memslot or freeing the pages (munmap(2)).
>
> Actual pinning management is handled by vendor code via new
> kvm_x86_ops hooks. MMU calls in to vendor code to pin the page on
> demand. Metadata of the pinning is stored in architecture specific
> memslot area. During the memslot freeing path guest pages are
> unpinned.
"During the memslot freeing path and deallocation path"
>
> Initially started with [1], where the idea was to store the pinning
> information using the software bit in the SPTE to track the pinned
> page. That is not feasible for the following reason:
>
> The pinned SPTE information gets stored in the shadow pages(SP). The
> way the current MMU is designed, the full MMU context gets dropped
> multiple times even when the CR0.WP bit gets flipped. Due to
> dropping of the MMU context (aka roots), there is a huge amount of SP
> alloc/remove churn. Pinned information stored in the SP gets lost
> during the dropping of the root and subsequent SPs at the child levels.
> Without this information, making decisions about re-pinning pages or
> unpinning during guest shutdown will not be possible.
>
> [1] https://patchwork.kernel.org/project/kvm/cover/[email protected]/
>
A general feedback: I really like this patch set and I think doing
memory pinning at fault path in kernel and storing the metadata in
memslot is the right thing to do.
This basically solves all the problems triggered by the KVM-based API
that trusts the user-level VMM to do the memory pinning.
Thanks.
> Nikunj A Dadhania (4):
> KVM: x86/mmu: Add hook to pin PFNs on demand in MMU
> KVM: SVM: Add pinning metadata in the arch memslot
> KVM: SVM: Implement demand page pinning
> KVM: SEV: Carve out routine for allocation of pages
>
> Sean Christopherson (2):
> KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV/TDX
> KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
>
> arch/x86/include/asm/kvm-x86-ops.h | 3 +
> arch/x86/include/asm/kvm_host.h | 9 +
> arch/x86/kvm/mmu.h | 3 +
> arch/x86/kvm/mmu/mmu.c | 41 +++
> arch/x86/kvm/mmu/tdp_mmu.c | 7 +
> arch/x86/kvm/svm/sev.c | 423 +++++++++++++++++++----------
> arch/x86/kvm/svm/svm.c | 4 +
> arch/x86/kvm/svm/svm.h | 9 +-
> arch/x86/kvm/x86.c | 11 +-
> 9 files changed, 359 insertions(+), 151 deletions(-)
>
> --
> 2.32.0
>
On 3/7/2022 1:18 AM, Mingwei Zhang wrote:
> On Tue, Jan 18, 2022, Nikunj A Dadhania wrote:
>> Use the memslot metadata to store the pinned data along with the pfns.
>> This improves the SEV guest startup time from O(n) to a constant by
>> deferring guest page pinning until the pages are used to satisfy nested
>> page faults. The page reference will be dropped in the memslot free
>> path.
>>
>> Remove the enc_region structure definition and the code which did
>> upfront pinning, as they are no longer needed in view of the demand
>> pinning support.
>>
>> Leave svm_register_enc_region() and svm_unregister_enc_region() as stubs
>> since qemu is dependent on this API.
>>
>> Signed-off-by: Nikunj A Dadhania <[email protected]>
>> ---
>> +
>> + /* Pin the page, KVM doesn't yet support page migration. */
>> + for (i = 0; i < npages; i++, rel_gfn++, pin_pfn++) {
>> + if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
>> + old_pfn = aslot->pfns[rel_gfn];
>> + if (old_pfn == pin_pfn)
>> + continue;
>> +
>> + put_page(pfn_to_page(old_pfn));
>
> You need to flush the old pfn using VMPAGE_FLUSH before doing put_page.
> Normally, this should not happen. But if the user-level VMM is
> malicious, then it could just munmap() the region (not the memslot);
> mmap() it again; let the guest VM touch the page and you will see this
> path get executed.
>
> Clearly, this will slow down the faulting path if this happens. So,
> alternatively, you can register a hook in mmu_notifier and shoot a flush
> there according to the bitmap. Either way should work.
>
We can call sev_flush_guest_memory() before the put_page().
>> + }
>> +
>> + set_bit(rel_gfn, aslot->pinned_bitmap);
>> + aslot->pfns[rel_gfn] = pin_pfn;
>> + get_page(pfn_to_page(pin_pfn));
>> + }
>> +
>> + /*
>> + * Flush any cached lines of the page being added since "ownership" of
>> + * it will be transferred from the host to an encrypted guest.
>> + */
>> + clflush_cache_range(__va(pfn << PAGE_SHIFT), page_level_size(level));
>> +}
>> +
>> void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>> {
>> struct kvm_arch_memory_slot *aslot = &slot->arch;
>> + kvm_pfn_t *pfns;
>> + gfn_t gfn;
>> + int i;
>>
>> if (!sev_guest(kvm))
>> return;
>>
>> + if (!aslot->pinned_bitmap || !slot->arch.pfns)
>> + goto out;
>> +
>> + pfns = aslot->pfns;
>> +
>> + /*
>> + * Iterate the memslot to find the pinned pfn using the bitmap and drop
>> + * the pfn stored.
>> + */
>> + for (i = 0, gfn = slot->base_gfn; i < slot->npages; i++, gfn++) {
>> + if (test_and_clear_bit(i, aslot->pinned_bitmap)) {
>> + if (WARN_ON(!pfns[i]))
>> + continue;
>> +
>> + put_page(pfn_to_page(pfns[i]));
>
> Here, you get lucky that you don't have to flush the cache. However,
> this is because sev_free_memslots is called after the
> kvm_arch_destroy_vm, which flushes the cache system-wide.
I have added wbinvd_on_all_cpus() just before the iteration in my new version.
Regards
Nikunj
On 3/7/2022 1:37 AM, Mingwei Zhang wrote:
> On Tue, Jan 18, 2022, Nikunj A Dadhania wrote:
>> SEV guest requires the guest's pages to be pinned in host physical
>> memory as migration of encrypted pages is not supported. The memory
>> encryption scheme uses the physical address of the memory being
>> encrypted. If guest pages are moved by the host, content decrypted in
>> the guest would be incorrect thereby corrupting guest's memory.
>>
>> For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
>> encrypted and when the guest is done using those pages. Hypervisor
>> should treat all the guest pages as encrypted until the guest is
>> destroyed.
> "Hypervisor should treat all the guest pages as encrypted until they are
> deallocated or the guest is destroyed".
>
> Note: in general, the guest VM could ask the user-level VMM to free the
> page by either freeing the memslot or freeing the pages (munmap(2)).
>
Sure, will update
>>
>> Actual pinning management is handled by vendor code via new
>> kvm_x86_ops hooks. MMU calls in to vendor code to pin the page on
>> demand. Metadata of the pinning is stored in architecture specific
>> memslot area. During the memslot freeing path guest pages are
>> unpinned.
>
> "During the memslot freeing path and deallocation path"
Sure.
>
>>
>> Initially started with [1], where the idea was to store the pinning
>> information using the software bit in the SPTE to track the pinned
>> page. That is not feasible for the following reason:
>>
>> The pinned SPTE information gets stored in the shadow pages(SP). The
>> way the current MMU is designed, the full MMU context gets dropped
>> multiple times even when the CR0.WP bit gets flipped. Due to
>> dropping of the MMU context (aka roots), there is a huge amount of SP
>> alloc/remove churn. Pinned information stored in the SP gets lost
>> during the dropping of the root and subsequent SPs at the child levels.
>> Without this information, making decisions about re-pinning pages or
>> unpinning during guest shutdown will not be possible.
>>
>> [1] https://patchwork.kernel.org/project/kvm/cover/[email protected]/
>>
>
> A general feedback: I really like this patch set and I think doing
> memory pinning at fault path in kernel and storing the metadata in
> memslot is the right thing to do.
>
> This basically solves all the problems triggered by the KVM-based API
> that trusts the user-level VMM to do the memory pinning.
>
Thanks for the feedback.
Regards
Nikunj