2021-06-24 04:00:12

by David Stevens

Subject: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
follow_pte in gfn_to_pfn. However, the resolved pfns may not have
associated struct pages, so they should not be passed to pfn_to_page.
This series removes such calls from the x86 and arm64 secondary MMU. To
do this, this series modifies gfn_to_pfn to return a struct page in
addition to a pfn, if the hva was resolved by gup. This allows the
caller to call put_page only when necessitated by gup.

This series provides a helper function that unwraps the new return type
of gfn_to_pfn to reproduce the old behavior. As I have no hardware to
test the powerpc/mips changes, the helper is used there to keep the
changes minimally invasive. Additionally, as gfn_to_page and
gfn_to_pfn_cache are not integrated with the mmu notifier, they cannot
easily be changed over to use only pfns.
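
For reference, the new return type and the unwrap helper introduced in
patch 2 look roughly like this (simplified):

struct kvm_pfn_page {
	kvm_pfn_t pfn;
	struct page *page;	/* NULL unless the hva was resolved by gup */
};

kvm_pfn_t kvm_pfn_page_unwrap(struct kvm_pfn_page pfnpg)
{
	if (pfnpg.page)
		return pfnpg.pfn;

	/* No gup reference exists; take the transient reference that
	 * hva_to_pfn_remapped used to take. */
	kvm_get_pfn(pfnpg.pfn);
	return pfnpg.pfn;
}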

This addresses CVE-2021-22543 on x86 and arm64.

David Stevens (6):
KVM: x86/mmu: release audited pfns
KVM: mmu: also return page from gfn_to_pfn
KVM: x86/mmu: avoid struct page in MMU
KVM: arm64/mmu: avoid struct page in MMU
KVM: mmu: remove over-aggressive warnings
drm/i915/gvt: use gfn_to_pfn's page instead of pfn

arch/arm64/kvm/mmu.c | 42 +++++----
arch/mips/kvm/mmu.c | 3 +-
arch/powerpc/kvm/book3s.c | 3 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 5 +-
arch/powerpc/kvm/book3s_64_mmu_radix.c | 5 +-
arch/powerpc/kvm/book3s_hv_uvmem.c | 4 +-
arch/powerpc/kvm/e500_mmu_host.c | 2 +-
arch/x86/kvm/mmu/mmu.c | 60 ++++++------
arch/x86/kvm/mmu/mmu_audit.c | 13 ++-
arch/x86/kvm/mmu/mmu_internal.h | 3 +-
arch/x86/kvm/mmu/paging_tmpl.h | 36 +++++---
arch/x86/kvm/mmu/tdp_mmu.c | 7 +-
arch/x86/kvm/mmu/tdp_mmu.h | 4 +-
arch/x86/kvm/x86.c | 9 +-
drivers/gpu/drm/i915/gvt/gtt.c | 12 ++-
drivers/gpu/drm/i915/gvt/hypercall.h | 3 +-
drivers/gpu/drm/i915/gvt/kvmgt.c | 12 +--
drivers/gpu/drm/i915/gvt/mpt.h | 8 +-
include/linux/kvm_host.h | 27 ++++--
include/linux/kvm_types.h | 5 +
virt/kvm/kvm_main.c | 123 +++++++++++++------------
21 files changed, 212 insertions(+), 174 deletions(-)

--
2.32.0.93.g670b81a890-goog


2021-06-24 04:00:24

by David Stevens

Subject: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

From: David Stevens <[email protected]>

Return a struct kvm_pfn_page containing both a pfn and an optional
struct page from the gfn_to_pfn family of functions. This differentiates
the gup and follow_fault_pfn cases, which allows callers that only need
a pfn to avoid touching the page struct in the latter case. For callers
that need a struct page, introduce a helper function that unwraps a
struct kvm_pfn_page into a struct page. This helper makes the call to
kvm_get_pfn which had previously been in hva_to_pfn_remapped.

For now, wrap all calls to gfn_to_pfn functions in the new helper
function. Callers which don't need the page struct will be updated in
follow-up patches.
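
As an illustration of the two caller styles (a sketch assembled from
this series, not verbatim from any single hunk):

	/* Unconverted caller: behavior identical to before this series. */
	pfn = kvm_pfn_page_unwrap(gfn_to_pfn(kvm, gfn));

	/* Converted caller: release a reference only when gup took one. */
	struct kvm_pfn_page pfnpg = gfn_to_pfn(kvm, gfn);
	...
	if (pfnpg.page)
		put_page(pfnpg.page);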

Signed-off-by: David Stevens <[email protected]>
---
arch/arm64/kvm/mmu.c | 5 +-
arch/mips/kvm/mmu.c | 3 +-
arch/powerpc/kvm/book3s.c | 3 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 5 +-
arch/powerpc/kvm/book3s_64_mmu_radix.c | 5 +-
arch/powerpc/kvm/book3s_hv_uvmem.c | 4 +-
arch/powerpc/kvm/e500_mmu_host.c | 2 +-
arch/x86/kvm/mmu/mmu.c | 11 ++-
arch/x86/kvm/mmu/mmu_audit.c | 2 +-
arch/x86/kvm/x86.c | 2 +-
drivers/gpu/drm/i915/gvt/kvmgt.c | 2 +-
include/linux/kvm_host.h | 27 ++++--
include/linux/kvm_types.h | 5 +
virt/kvm/kvm_main.c | 121 +++++++++++++------------
14 files changed, 109 insertions(+), 88 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c10207fed2f3..896b3644b36f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -933,8 +933,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
*/
smp_rmb();

- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
- write_fault, &writable, NULL);
+ pfn = kvm_pfn_page_unwrap(__gfn_to_pfn_memslot(memslot, gfn, false,
+ NULL, write_fault,
+ &writable, NULL));
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 6d1f68cf4edf..f4e5e48bc6bf 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -630,7 +630,8 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
smp_rmb();

/* Slow path - ask KVM core whether we can access this GPA */
- pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writeable);
+ pfn = kvm_pfn_page_unwrap(gfn_to_pfn_prot(kvm, gfn,
+ write_fault, &writeable));
if (is_error_noslot_pfn(pfn)) {
err = -EFAULT;
goto out;
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 2b691f4d1f26..2dff01d0632a 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -417,7 +417,8 @@ kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
return pfn;
}

- return gfn_to_pfn_prot(vcpu->kvm, gfn, writing, writable);
+ return kvm_pfn_page_unwrap(gfn_to_pfn_prot(vcpu->kvm, gfn,
+ writing, writable));
}
EXPORT_SYMBOL_GPL(kvmppc_gpa_to_pfn);

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2d9193cd73be..ba094b9f87a9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -590,8 +590,9 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
write_ok = true;
} else {
/* Call KVM generic code to do the slow-path check */
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
- writing, &write_ok, NULL);
+ pfn = kvm_pfn_page_unwrap(
+ __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+ writing, &write_ok, NULL));
if (is_error_noslot_pfn(pfn))
return -EFAULT;
page = NULL;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index d909c069363e..e7892f148222 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -821,8 +821,9 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long pfn;

/* Call KVM generic code to do the slow-path check */
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
- writing, upgrade_p, NULL);
+ pfn = kvm_pfn_page_unwrap(
+ __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+ writing, upgrade_p, NULL));
if (is_error_noslot_pfn(pfn))
return -EFAULT;
page = NULL;
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..0eee0d98b055 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -890,7 +890,7 @@ static unsigned long kvmppc_share_page(struct kvm *kvm, unsigned long gpa,

retry:
mutex_unlock(&kvm->arch.uvmem_lock);
- pfn = gfn_to_pfn(kvm, gfn);
+ pfn = kvm_pfn_page_unwrap(gfn_to_pfn(kvm, gfn));
if (is_error_noslot_pfn(pfn))
goto out;

@@ -1075,7 +1075,7 @@ int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn)
unsigned long pfn;
int ret = U_SUCCESS;

- pfn = gfn_to_pfn(kvm, gfn);
+ pfn = kvm_pfn_page_unwrap(gfn_to_pfn(kvm, gfn));
if (is_error_noslot_pfn(pfn))
return -EFAULT;

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 7f16afc331ef..be7ca5788a05 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -446,7 +446,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,

if (likely(!pfnmap)) {
tsize_pages = 1UL << (tsize + 10 - PAGE_SHIFT);
- pfn = gfn_to_pfn_memslot(slot, gfn);
+ pfn = kvm_pfn_page_unwrap(gfn_to_pfn_memslot(slot, gfn));
if (is_error_noslot_pfn(pfn)) {
if (printk_ratelimit())
pr_err("%s: real page not found for gfn %lx\n",
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8d5876dfc6b7..84913677c404 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2619,7 +2619,7 @@ static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
if (!slot)
return KVM_PFN_ERR_FAULT;

- return gfn_to_pfn_memslot_atomic(slot, gfn);
+ return kvm_pfn_page_unwrap(gfn_to_pfn_memslot_atomic(slot, gfn));
}

static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
@@ -3694,8 +3694,9 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
}

async = false;
- *pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async,
- write, writable, hva);
+ *pfn = kvm_pfn_page_unwrap(__gfn_to_pfn_memslot(slot, gfn, false,
+ &async, write,
+ writable, hva));
if (!async)
return false; /* *pfn has correct page already */

@@ -3709,8 +3710,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
return true;
}

- *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
- write, writable, hva);
+ *pfn = kvm_pfn_page_unwrap(__gfn_to_pfn_memslot(slot, gfn, false, NULL,
+ write, writable, hva));
return false;
}

diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index 97ff184084b4..3f983dc6e0f1 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -111,7 +111,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
return;

gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
- pfn = kvm_vcpu_gfn_to_pfn_atomic(vcpu, gfn);
+ pfn = kvm_pfn_page_unwrap(kvm_vcpu_gfn_to_pfn_atomic(vcpu, gfn));

if (is_error_pfn(pfn))
return;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6d425310054b..d31797e0cb6e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7341,7 +7341,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* retry instruction -> write #PF -> emulation fail -> retry
* instruction -> ...
*/
- pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+ pfn = kvm_pfn_page_unwrap(gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)));

/*
* If the instruction failed on the error pfn, it can not be fixed,
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 65ff43cfc0f7..b829ff67e3d9 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1938,7 +1938,7 @@ static unsigned long kvmgt_gfn_to_pfn(unsigned long handle, unsigned long gfn)

info = (struct kvmgt_guest_info *)handle;

- pfn = gfn_to_pfn(info->kvm, gfn);
+ pfn = kvm_pfn_page_unwrap(gfn_to_pfn(info->kvm, gfn));
if (is_error_noslot_pfn(pfn))
return INTEL_GVT_INVALID_ADDR;

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8583ed3ff344..d3731790a967 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -138,6 +138,8 @@ static inline bool is_error_page(struct page *page)
return IS_ERR(page);
}

+#define KVM_PFN_PAGE_ERR(e) ((struct kvm_pfn_page) { .pfn = (e), .page = NULL })
+
#define KVM_REQUEST_MASK GENMASK(7,0)
#define KVM_REQUEST_NO_WAKEUP BIT(8)
#define KVM_REQUEST_WAIT BIT(9)
@@ -795,14 +797,19 @@ void kvm_release_page_clean(struct page *page);
void kvm_release_page_dirty(struct page *page);
void kvm_set_page_accessed(struct page *page);

-kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
-kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
- bool *writable);
-kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool *async, bool write_fault,
- bool *writable, hva_t *hva);
+struct kvm_pfn_page gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
+struct kvm_pfn_page gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn,
+ bool write_fault, bool *writable);
+struct kvm_pfn_page gfn_to_pfn_memslot(struct kvm_memory_slot *slot,
+ gfn_t gfn);
+struct kvm_pfn_page gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot,
+ gfn_t gfn);
+struct kvm_pfn_page __gfn_to_pfn_memslot(struct kvm_memory_slot *slot,
+ gfn_t gfn, bool atomic, bool *async,
+ bool write_fault, bool *writable,
+ hva_t *hva);
+
+kvm_pfn_t kvm_pfn_page_unwrap(struct kvm_pfn_page pfnpg);

void kvm_release_pfn_clean(kvm_pfn_t pfn);
void kvm_release_pfn_dirty(kvm_pfn_t pfn);
@@ -883,8 +890,8 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn);

struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
-kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
-kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+struct kvm_pfn_page kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
+struct kvm_pfn_page kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map,
struct gfn_to_pfn_cache *cache, bool atomic);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index a7580f69dda0..471f4b329f46 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -45,6 +45,11 @@ typedef u64 hfn_t;

typedef hfn_t kvm_pfn_t;

+struct kvm_pfn_page {
+ kvm_pfn_t pfn;
+ struct page *page;
+};
+
struct gfn_to_hva_cache {
u64 generation;
gpa_t gpa;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..898e90be4d0e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1982,9 +1982,8 @@ static inline int check_user_page_hwpoison(unsigned long addr)
* only part that runs if we can in atomic context.
*/
static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
- bool *writable, kvm_pfn_t *pfn)
+ bool *writable, struct kvm_pfn_page *pfnpg)
{
- struct page *page[1];

/*
* Fast pin a writable pfn only if it is a write fault request
@@ -1994,8 +1993,8 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
if (!(write_fault || writable))
return false;

- if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
- *pfn = page_to_pfn(page[0]);
+ if (get_user_page_fast_only(addr, FOLL_WRITE, &pfnpg->page)) {
+ pfnpg->pfn = page_to_pfn(pfnpg->page);

if (writable)
*writable = true;
@@ -2010,10 +2009,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
* 1 indicates success, -errno is returned if error is detected.
*/
static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
- bool *writable, kvm_pfn_t *pfn)
+ bool *writable, struct kvm_pfn_page *pfnpg)
{
unsigned int flags = FOLL_HWPOISON;
- struct page *page;
int npages = 0;

might_sleep();
@@ -2026,7 +2024,7 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
if (async)
flags |= FOLL_NOWAIT;

- npages = get_user_pages_unlocked(addr, 1, &page, flags);
+ npages = get_user_pages_unlocked(addr, 1, &pfnpg->page, flags);
if (npages != 1)
return npages;

@@ -2036,11 +2034,11 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,

if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
*writable = true;
- put_page(page);
- page = wpage;
+ put_page(pfnpg->page);
+ pfnpg->page = wpage;
}
}
- *pfn = page_to_pfn(page);
+ pfnpg->pfn = page_to_pfn(pfnpg->page);
return npages;
}

@@ -2058,7 +2056,7 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
static int hva_to_pfn_remapped(struct vm_area_struct *vma,
unsigned long addr, bool *async,
bool write_fault, bool *writable,
- kvm_pfn_t *p_pfn)
+ struct kvm_pfn_page *p_pfn)
{
kvm_pfn_t pfn;
pte_t *ptep;
@@ -2094,22 +2092,12 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
*writable = pte_write(*ptep);
pfn = pte_pfn(*ptep);

- /*
- * Get a reference here because callers of *hva_to_pfn* and
- * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
- * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
- * set, but the kvm_get_pfn/kvm_release_pfn_clean pair will
- * simply do nothing for reserved pfns.
- *
- * Whoever called remap_pfn_range is also going to call e.g.
- * unmap_mapping_range before the underlying pages are freed,
- * causing a call to our MMU notifier.
- */
- kvm_get_pfn(pfn);
-
out:
pte_unmap_unlock(ptep, ptl);
- *p_pfn = pfn;
+
+ p_pfn->pfn = pfn;
+ p_pfn->page = NULL;
+
return 0;
}

@@ -2127,30 +2115,30 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* 2): @write_fault = false && @writable, @writable will tell the caller
* whether the mapping is writable.
*/
-static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
- bool write_fault, bool *writable)
+static struct kvm_pfn_page hva_to_pfn(unsigned long addr, bool atomic,
+ bool *async, bool write_fault, bool *writable)
{
struct vm_area_struct *vma;
- kvm_pfn_t pfn = 0;
+ struct kvm_pfn_page pfnpg = {};
int npages, r;

/* we can do it either atomically or asynchronously, not both */
BUG_ON(atomic && async);

- if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
- return pfn;
+ if (hva_to_pfn_fast(addr, write_fault, writable, &pfnpg))
+ return pfnpg;

if (atomic)
- return KVM_PFN_ERR_FAULT;
+ return KVM_PFN_PAGE_ERR(KVM_PFN_ERR_FAULT);

- npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
+ npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfnpg);
if (npages == 1)
- return pfn;
+ return pfnpg;

mmap_read_lock(current->mm);
if (npages == -EHWPOISON ||
(!async && check_user_page_hwpoison(addr))) {
- pfn = KVM_PFN_ERR_HWPOISON;
+ pfnpg.pfn = KVM_PFN_ERR_HWPOISON;
goto exit;
}

@@ -2158,26 +2146,27 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
vma = find_vma_intersection(current->mm, addr, addr + 1);

if (vma == NULL)
- pfn = KVM_PFN_ERR_FAULT;
+ pfnpg.pfn = KVM_PFN_ERR_FAULT;
else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
- r = hva_to_pfn_remapped(vma, addr, async, write_fault, writable, &pfn);
+ r = hva_to_pfn_remapped(vma, addr, async, write_fault, writable, &pfnpg);
if (r == -EAGAIN)
goto retry;
if (r < 0)
- pfn = KVM_PFN_ERR_FAULT;
+ pfnpg.pfn = KVM_PFN_ERR_FAULT;
} else {
if (async && vma_is_valid(vma, write_fault))
*async = true;
- pfn = KVM_PFN_ERR_FAULT;
+ pfnpg.pfn = KVM_PFN_ERR_FAULT;
}
exit:
mmap_read_unlock(current->mm);
- return pfn;
+ return pfnpg;
}

-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool *async, bool write_fault,
- bool *writable, hva_t *hva)
+struct kvm_pfn_page __gfn_to_pfn_memslot(struct kvm_memory_slot *slot,
+ gfn_t gfn, bool atomic, bool *async,
+ bool write_fault, bool *writable,
+ hva_t *hva)
{
unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);

@@ -2187,13 +2176,13 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
if (addr == KVM_HVA_ERR_RO_BAD) {
if (writable)
*writable = false;
- return KVM_PFN_ERR_RO_FAULT;
+ return KVM_PFN_PAGE_ERR(KVM_PFN_ERR_RO_FAULT);
}

if (kvm_is_error_hva(addr)) {
if (writable)
*writable = false;
- return KVM_PFN_NOSLOT;
+ return KVM_PFN_PAGE_ERR(KVM_PFN_NOSLOT);
}

/* Do not map writable pfn in the readonly memslot. */
@@ -2207,44 +2196,55 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
}
EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);

-kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
- bool *writable)
+struct kvm_pfn_page gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn,
+ bool write_fault, bool *writable)
{
return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
write_fault, writable, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);

-kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+struct kvm_pfn_page gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
{
return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);

-kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
+struct kvm_pfn_page gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot,
+ gfn_t gfn)
{
return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);

-kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
+struct kvm_pfn_page kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
{
return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);

-kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+struct kvm_pfn_page gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
{
return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn);

-kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+struct kvm_pfn_page kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn);

+kvm_pfn_t kvm_pfn_page_unwrap(struct kvm_pfn_page pfnpg)
+{
+ if (pfnpg.page)
+ return pfnpg.pfn;
+
+ kvm_get_pfn(pfnpg.pfn);
+ return pfnpg.pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_pfn_page_unwrap);
+
int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
struct page **pages, int nr_pages)
{
@@ -2277,11 +2277,14 @@ static struct page *kvm_pfn_to_page(kvm_pfn_t pfn)

struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
{
- kvm_pfn_t pfn;
+ struct kvm_pfn_page pfnpg;
+
+ pfnpg = gfn_to_pfn(kvm, gfn);

- pfn = gfn_to_pfn(kvm, gfn);
+ if (pfnpg.page)
+ return pfnpg.page;

- return kvm_pfn_to_page(pfn);
+ return kvm_pfn_to_page(kvm_pfn_page_unwrap(pfnpg));
}
EXPORT_SYMBOL_GPL(gfn_to_page);

@@ -2304,7 +2307,7 @@ static void kvm_cache_gfn_to_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
{
kvm_release_pfn(cache->pfn, cache->dirty, cache);

- cache->pfn = gfn_to_pfn_memslot(slot, gfn);
+ cache->pfn = kvm_pfn_page_unwrap(gfn_to_pfn_memslot(slot, gfn));
cache->gfn = gfn;
cache->dirty = false;
cache->generation = gen;
@@ -2335,7 +2338,7 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
} else {
if (atomic)
return -EAGAIN;
- pfn = gfn_to_pfn_memslot(slot, gfn);
+ pfn = kvm_pfn_page_unwrap(gfn_to_pfn_memslot(slot, gfn));
}
if (is_error_noslot_pfn(pfn))
return -EINVAL;
@@ -2435,11 +2438,11 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);

struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
{
- kvm_pfn_t pfn;
+ struct kvm_pfn_page pfnpg;

- pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
+ pfnpg = kvm_vcpu_gfn_to_pfn(vcpu, gfn);

- return kvm_pfn_to_page(pfn);
+ return kvm_pfn_to_page(kvm_pfn_page_unwrap(pfnpg));
}
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_page);

--
2.32.0.93.g670b81a890-goog

2021-06-24 04:00:25

by David Stevens

Subject: [PATCH 1/6] KVM: x86/mmu: release audited pfns

From: David Stevens <[email protected]>

Signed-off-by: David Stevens <[email protected]>
---
arch/x86/kvm/mmu/mmu_audit.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index cedc17b2f60e..97ff184084b4 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -121,6 +121,8 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
"ent %llxn", vcpu->arch.mmu->root_level, pfn,
hpa, *sptep);
+
+ kvm_release_pfn_clean(pfn);
}

static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
--
2.32.0.93.g670b81a890-goog

2021-06-24 04:00:53

by David Stevens

Subject: [PATCH 4/6] KVM: arm64/mmu: avoid struct page in MMU

From: David Stevens <[email protected]>

Avoid converting pfns returned by follow_fault_pfn to struct pages in
order to transiently take a reference. The reference was originally
taken to match the reference taken by gup. However, pfns returned by
follow_fault_pfn may not have a struct page set up for reference
counting.
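
The transient reference in question was the one taken at the end of
hva_to_pfn_remapped (removed in patch 2); a sketch of the old code:

	pfn = pte_pfn(*ptep);
	/* kvm_get_pfn() calls get_page(pfn_to_page(pfn)) for
	 * non-reserved pfns, which is unsafe when the pfn has no
	 * struct page set up for reference counting. */
	kvm_get_pfn(pfn);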

Signed-off-by: David Stevens <[email protected]>
---
arch/arm64/kvm/mmu.c | 43 +++++++++++++++++++++++--------------------
1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 896b3644b36f..a741972cb75f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -779,17 +779,17 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
*/
static unsigned long
transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
- unsigned long hva, kvm_pfn_t *pfnp,
+ unsigned long hva, struct kvm_pfn_page *pfnpgp,
phys_addr_t *ipap)
{
- kvm_pfn_t pfn = *pfnp;
+ kvm_pfn_t pfn = pfnpgp->pfn;

/*
* Make sure the adjustment is done only for THP pages. Also make
* sure that the HVA and IPA are sufficiently aligned and that the
* block map is contained within the memslot.
*/
- if (kvm_is_transparent_hugepage(pfn) &&
+ if (pfnpgp->page && kvm_is_transparent_hugepage(pfn) &&
fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
/*
* The address we faulted on is backed by a transparent huge
@@ -810,10 +810,11 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
* page accordingly.
*/
*ipap &= PMD_MASK;
- kvm_release_pfn_clean(pfn);
+ put_page(pfnpgp->page);
pfn &= ~(PTRS_PER_PMD - 1);
- kvm_get_pfn(pfn);
- *pfnp = pfn;
+ pfnpgp->pfn = pfn;
+ pfnpgp->page = pfn_to_page(pfnpgp->pfn);
+ get_page(pfnpgp->page);

return PMD_SIZE;
}
@@ -836,7 +837,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct vm_area_struct *vma;
short vma_shift;
gfn_t gfn;
- kvm_pfn_t pfn;
+ struct kvm_pfn_page pfnpg;
bool logging_active = memslot_is_logging(memslot);
unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
unsigned long vma_pagesize, fault_granule;
@@ -933,17 +934,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
*/
smp_rmb();

- pfn = kvm_pfn_page_unwrap(__gfn_to_pfn_memslot(memslot, gfn, false,
- NULL, write_fault,
- &writable, NULL));
- if (pfn == KVM_PFN_ERR_HWPOISON) {
+ pfnpg = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+ write_fault, &writable, NULL);
+ if (pfnpg.pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
}
- if (is_error_noslot_pfn(pfn))
+ if (is_error_noslot_pfn(pfnpg.pfn))
return -EFAULT;

- if (kvm_is_device_pfn(pfn)) {
+ if (kvm_is_device_pfn(pfnpg.pfn)) {
device = true;
force_pte = true;
} else if (logging_active && !write_fault) {
@@ -968,16 +968,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
*/
if (vma_pagesize == PAGE_SIZE && !force_pte)
vma_pagesize = transparent_hugepage_adjust(memslot, hva,
- &pfn, &fault_ipa);
+ &pfnpg, &fault_ipa);
if (writable)
prot |= KVM_PGTABLE_PROT_W;

if (fault_status != FSC_PERM && !device)
- clean_dcache_guest_page(pfn, vma_pagesize);
+ clean_dcache_guest_page(pfnpg.pfn, vma_pagesize);

if (exec_fault) {
prot |= KVM_PGTABLE_PROT_X;
- invalidate_icache_guest_page(pfn, vma_pagesize);
+ invalidate_icache_guest_page(pfnpg.pfn, vma_pagesize);
}

if (device)
@@ -994,20 +994,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
} else {
ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
- __pfn_to_phys(pfn), prot,
+ __pfn_to_phys(pfnpg.pfn), prot,
memcache);
}

/* Mark the page dirty only if the fault is handled successfully */
if (writable && !ret) {
- kvm_set_pfn_dirty(pfn);
+ if (pfnpg.page)
+ kvm_set_pfn_dirty(pfnpg.pfn);
mark_page_dirty_in_slot(kvm, memslot, gfn);
}

out_unlock:
spin_unlock(&kvm->mmu_lock);
- kvm_set_pfn_accessed(pfn);
- kvm_release_pfn_clean(pfn);
+ if (pfnpg.page) {
+ kvm_set_pfn_accessed(pfnpg.pfn);
+ put_page(pfnpg.page);
+ }
return ret != -EAGAIN ? ret : 0;
}

--
2.32.0.93.g670b81a890-goog

2021-06-24 04:01:10

by David Stevens

Subject: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

From: David Stevens <[email protected]>

Avoid converting pfns returned by follow_fault_pfn to struct pages in
order to transiently take a reference. The reference was originally
taken to match the reference taken by gup. However, pfns returned by
follow_fault_pfn may not have a struct page set up for reference
counting.

Signed-off-by: David Stevens <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 56 +++++++++++++++++++--------------
arch/x86/kvm/mmu/mmu_audit.c | 13 ++++----
arch/x86/kvm/mmu/mmu_internal.h | 3 +-
arch/x86/kvm/mmu/paging_tmpl.h | 36 ++++++++++++---------
arch/x86/kvm/mmu/tdp_mmu.c | 7 +++--
arch/x86/kvm/mmu/tdp_mmu.h | 4 +--
arch/x86/kvm/x86.c | 9 +++---
7 files changed, 73 insertions(+), 55 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 84913677c404..8fa4a4a411ba 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2610,16 +2610,16 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
return ret;
}

-static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
- bool no_dirty_log)
+static struct kvm_pfn_page pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu,
+ gfn_t gfn, bool no_dirty_log)
{
struct kvm_memory_slot *slot;

slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, no_dirty_log);
if (!slot)
- return KVM_PFN_ERR_FAULT;
+ return KVM_PFN_PAGE_ERR(KVM_PFN_ERR_FAULT);

- return kvm_pfn_page_unwrap(gfn_to_pfn_memslot_atomic(slot, gfn));
+ return gfn_to_pfn_memslot_atomic(slot, gfn);
}

static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
@@ -2748,7 +2748,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,

int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
int max_level, kvm_pfn_t *pfnp,
- bool huge_page_disallowed, int *req_level)
+ struct page *page, bool huge_page_disallowed,
+ int *req_level)
{
struct kvm_memory_slot *slot;
kvm_pfn_t pfn = *pfnp;
@@ -2760,6 +2761,9 @@ int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
if (unlikely(max_level == PG_LEVEL_4K))
return PG_LEVEL_4K;

+ if (!page)
+ return PG_LEVEL_4K;
+
if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn))
return PG_LEVEL_4K;

@@ -2814,7 +2818,8 @@ void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
}

static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
- int map_writable, int max_level, kvm_pfn_t pfn,
+ int map_writable, int max_level,
+ const struct kvm_pfn_page *pfnpg,
bool prefault, bool is_tdp)
{
bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();
@@ -2826,11 +2831,12 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
int level, req_level, ret;
gfn_t gfn = gpa >> PAGE_SHIFT;
gfn_t base_gfn = gfn;
+ kvm_pfn_t pfn = pfnpg->pfn;

if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
return RET_PF_RETRY;

- level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
+ level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn, pfnpg->page,
huge_page_disallowed, &req_level);

trace_kvm_mmu_spte_requested(gpa, level, pfn);
@@ -3672,8 +3678,8 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
}

static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
- gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
- bool write, bool *writable)
+ gpa_t cr2_or_gpa, struct kvm_pfn_page *pfnpg,
+ hva_t *hva, bool write, bool *writable)
{
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
bool async;
@@ -3688,17 +3694,16 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,

/* Don't expose private memslots to L2. */
if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
- *pfn = KVM_PFN_NOSLOT;
+ *pfnpg = KVM_PFN_PAGE_ERR(KVM_PFN_NOSLOT);
*writable = false;
return false;
}

async = false;
- *pfn = kvm_pfn_page_unwrap(__gfn_to_pfn_memslot(slot, gfn, false,
- &async, write,
- writable, hva));
+ *pfnpg = __gfn_to_pfn_memslot(slot, gfn, false, &async,
+ write, writable, hva);
if (!async)
- return false; /* *pfn has correct page already */
+ return false; /* *pfnpg has correct page already */

if (!prefault && kvm_can_do_async_pf(vcpu)) {
trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
@@ -3710,8 +3715,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
return true;
}

- *pfn = kvm_pfn_page_unwrap(__gfn_to_pfn_memslot(slot, gfn, false, NULL,
- write, writable, hva));
+ *pfnpg = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
+ write, writable, hva);
return false;
}

@@ -3723,7 +3728,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,

gfn_t gfn = gpa >> PAGE_SHIFT;
unsigned long mmu_seq;
- kvm_pfn_t pfn;
+ struct kvm_pfn_page pfnpg;
hva_t hva;
int r;

@@ -3743,11 +3748,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva,
+ if (try_async_pf(vcpu, prefault, gfn, gpa, &pfnpg, &hva,
write, &map_writable))
return RET_PF_RETRY;

- if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
+ if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa,
+ gfn, pfnpg.pfn, ACC_ALL, &r))
return r;

r = RET_PF_RETRY;
@@ -3757,7 +3763,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
else
write_lock(&vcpu->kvm->mmu_lock);

- if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva))
+ if (!is_noslot_pfn(pfnpg.pfn) &&
+ mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva))
goto out_unlock;
r = make_mmu_pages_available(vcpu);
if (r)
@@ -3765,17 +3772,18 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,

if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
- pfn, prefault);
+ &pfnpg, prefault);
else
- r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
- prefault, is_tdp);
+ r = __direct_map(vcpu, gpa, error_code, map_writable, max_level,
+ &pfnpg, prefault, is_tdp);

out_unlock:
if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(pfn);
+ if (pfnpg.page)
+ put_page(pfnpg.page);
return r;
}

diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index 3f983dc6e0f1..72b470b892da 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -94,7 +94,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
{
struct kvm_mmu_page *sp;
gfn_t gfn;
- kvm_pfn_t pfn;
+ struct kvm_pfn_page pfnpg;
hpa_t hpa;

sp = sptep_to_sp(sptep);
@@ -111,18 +111,19 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
return;

gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
- pfn = kvm_pfn_page_unwrap(kvm_vcpu_gfn_to_pfn_atomic(vcpu, gfn));
+ pfnpg = kvm_vcpu_gfn_to_pfn_atomic(vcpu, gfn);

- if (is_error_pfn(pfn))
+ if (is_error_pfn(pfnpg.pfn))
return;

- hpa = pfn << PAGE_SHIFT;
+ hpa = pfnpg.pfn << PAGE_SHIFT;
if ((*sptep & PT64_BASE_ADDR_MASK) != hpa)
audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
- "ent %llxn", vcpu->arch.mmu->root_level, pfn,
+ "ent %llxn", vcpu->arch.mmu->root_level, pfnpg.pfn,
hpa, *sptep);

- kvm_release_pfn_clean(pfn);
+ if (pfnpg.page)
+ put_page(pfnpg.page);
}

static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d64ccb417c60..db4d878fde4e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -154,7 +154,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
kvm_pfn_t pfn, int max_level);
int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
int max_level, kvm_pfn_t *pfnp,
- bool huge_page_disallowed, int *req_level);
+ struct page *page, bool huge_page_disallowed,
+ int *req_level);
void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
kvm_pfn_t *pfnp, int *goal_levelp);

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 823a5919f9fa..db13efd4b62d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -535,7 +535,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
{
unsigned pte_access;
gfn_t gfn;
- kvm_pfn_t pfn;
+ struct kvm_pfn_page pfnpg;

if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
return false;
@@ -545,19 +545,20 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
gfn = gpte_to_gfn(gpte);
pte_access = sp->role.access & FNAME(gpte_access)(gpte);
FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
- pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
+ pfnpg = pte_prefetch_gfn_to_pfn(vcpu, gfn,
no_dirty_log && (pte_access & ACC_WRITE_MASK));
- if (is_error_pfn(pfn))
+ if (is_error_pfn(pfnpg.pfn))
return false;

/*
* we call mmu_set_spte() with host_writable = true because
* pte_prefetch_gfn_to_pfn always gets a writable pfn.
*/
- mmu_set_spte(vcpu, spte, pte_access, false, PG_LEVEL_4K, gfn, pfn,
+ mmu_set_spte(vcpu, spte, pte_access, false, PG_LEVEL_4K, gfn, pfnpg.pfn,
true, true);

- kvm_release_pfn_clean(pfn);
+ if (pfnpg.page)
+ put_page(pfnpg.page);
return true;
}

@@ -637,8 +638,8 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
*/
static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
struct guest_walker *gw, u32 error_code,
- int max_level, kvm_pfn_t pfn, bool map_writable,
- bool prefault)
+ int max_level, struct kvm_pfn_page *pfnpg,
+ bool map_writable, bool prefault)
{
bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();
bool write_fault = error_code & PFERR_WRITE_MASK;
@@ -649,6 +650,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
unsigned int direct_access, access;
int top_level, level, req_level, ret;
gfn_t base_gfn = gw->gfn;
+ kvm_pfn_t pfn = pfnpg->pfn;

direct_access = gw->pte_access;

@@ -695,7 +697,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
}

level = kvm_mmu_hugepage_adjust(vcpu, gw->gfn, max_level, &pfn,
- huge_page_disallowed, &req_level);
+ pfnpg->page, huge_page_disallowed,
+ &req_level);

trace_kvm_mmu_spte_requested(addr, gw->level, pfn);

@@ -801,7 +804,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
bool user_fault = error_code & PFERR_USER_MASK;
struct guest_walker walker;
int r;
- kvm_pfn_t pfn;
+ struct kvm_pfn_page pfnpg;
hva_t hva;
unsigned long mmu_seq;
bool map_writable, is_self_change_mapping;
@@ -853,11 +856,12 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
+ if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfnpg, &hva,
write_fault, &map_writable))
return RET_PF_RETRY;

- if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
+ if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfnpg.pfn,
+ walker.pte_access, &r))
return r;

/*
@@ -866,7 +870,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
*/
if (write_fault && !(walker.pte_access & ACC_WRITE_MASK) &&
!is_write_protection(vcpu) && !user_fault &&
- !is_noslot_pfn(pfn)) {
+ !is_noslot_pfn(pfnpg.pfn)) {
walker.pte_access |= ACC_WRITE_MASK;
walker.pte_access &= ~ACC_USER_MASK;

@@ -882,20 +886,22 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,

r = RET_PF_RETRY;
write_lock(&vcpu->kvm->mmu_lock);
- if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva))
+ if (!is_noslot_pfn(pfnpg.pfn) &&
+ mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva))
goto out_unlock;

kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
r = make_mmu_pages_available(vcpu);
if (r)
goto out_unlock;
- r = FNAME(fetch)(vcpu, addr, &walker, error_code, max_level, pfn,
+ r = FNAME(fetch)(vcpu, addr, &walker, error_code, max_level, &pfnpg,
map_writable, prefault);
kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);

out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(pfn);
+ if (pfnpg.page)
+ put_page(pfnpg.page);
return r;
}

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 237317b1eddd..b0e6d63f0fe1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -960,8 +960,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
* page tables and SPTEs to translate the faulting guest physical address.
*/
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
- int map_writable, int max_level, kvm_pfn_t pfn,
- bool prefault)
+ int map_writable, int max_level,
+ const struct kvm_pfn_page *pfnpg, bool prefault)
{
bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();
bool write = error_code & PFERR_WRITE_MASK;
@@ -976,13 +976,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
gfn_t gfn = gpa >> PAGE_SHIFT;
int level;
int req_level;
+ kvm_pfn_t pfn = pfnpg->pfn;

if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
return RET_PF_RETRY;
if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
return RET_PF_RETRY;

- level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
+ level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn, pfnpg->page,
huge_page_disallowed, &req_level);

trace_kvm_mmu_spte_requested(gpa, level, pfn);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 5fdf63090451..f78681b9dcb7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -52,8 +52,8 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);

int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
- int map_writable, int max_level, kvm_pfn_t pfn,
- bool prefault);
+ int map_writable, int max_level,
+ const struct kvm_pfn_page *pfnpg, bool prefault);

bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d31797e0cb6e..86d66c765190 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7311,7 +7311,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type)
{
gpa_t gpa = cr2_or_gpa;
- kvm_pfn_t pfn;
+ struct kvm_pfn_page pfnpg;

if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
@@ -7341,16 +7341,17 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* retry instruction -> write #PF -> emulation fail -> retry
* instruction -> ...
*/
- pfn = kvm_pfn_page_unwrap(gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)));
+ pfnpg = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));

/*
* If the instruction failed on the error pfn, it can not be fixed,
* report the error to userspace.
*/
- if (is_error_noslot_pfn(pfn))
+ if (is_error_noslot_pfn(pfnpg.pfn))
return false;

- kvm_release_pfn_clean(pfn);
+ if (pfnpg.page)
+ put_page(pfnpg.page);

/* The instructions are well-emulated on direct mmu. */
if (vcpu->arch.mmu->direct_map) {
--
2.32.0.93.g670b81a890-goog

2021-06-24 04:01:18

by David Stevens

Subject: [PATCH 6/6] drm/i915/gvt: use gfn_to_pfn's page instead of pfn

Return struct page instead of pfn from gfn_to_mfn. This function is only
used to determine if the page is a transparent hugepage, to enable 2MB
huge gtt shadowing. Returning the page directly avoids the risk of
calling pfn_to_page on a VM_IO|VM_PFNMAP pfn.

This change also properly releases the reference on the page returned by
gfn_to_pfn.

Signed-off-by: David Stevens <[email protected]>
---
drivers/gpu/drm/i915/gvt/gtt.c | 12 ++++++++----
drivers/gpu/drm/i915/gvt/hypercall.h | 3 ++-
drivers/gpu/drm/i915/gvt/kvmgt.c | 12 ++++--------
drivers/gpu/drm/i915/gvt/mpt.h | 8 ++++----
4 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
index 9478c132d7b6..b2951c560582 100644
--- a/drivers/gpu/drm/i915/gvt/gtt.c
+++ b/drivers/gpu/drm/i915/gvt/gtt.c
@@ -1160,16 +1160,20 @@ static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
struct intel_gvt_gtt_entry *entry)
{
struct intel_gvt_gtt_pte_ops *ops = vgpu->gvt->gtt.pte_ops;
- unsigned long pfn;
+ struct page *page;
+ bool is_trans_huge;

if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
return 0;

- pfn = intel_gvt_hypervisor_gfn_to_mfn(vgpu, ops->get_pfn(entry));
- if (pfn == INTEL_GVT_INVALID_ADDR)
+ page = intel_gvt_hypervisor_gfn_to_mfn_page(vgpu, ops->get_pfn(entry));
+ if (!page)
return -EINVAL;

- return PageTransHuge(pfn_to_page(pfn));
+ is_trans_huge = PageTransHuge(page);
+ put_page(page);
+
+ return is_trans_huge;
}

static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
diff --git a/drivers/gpu/drm/i915/gvt/hypercall.h b/drivers/gpu/drm/i915/gvt/hypercall.h
index b79da5124f83..017190ff52d5 100644
--- a/drivers/gpu/drm/i915/gvt/hypercall.h
+++ b/drivers/gpu/drm/i915/gvt/hypercall.h
@@ -60,7 +60,8 @@ struct intel_gvt_mpt {
unsigned long len);
int (*write_gpa)(unsigned long handle, unsigned long gpa, void *buf,
unsigned long len);
- unsigned long (*gfn_to_mfn)(unsigned long handle, unsigned long gfn);
+ struct page *(*gfn_to_mfn_page)(unsigned long handle,
+ unsigned long gfn);

int (*dma_map_guest_page)(unsigned long handle, unsigned long gfn,
unsigned long size, dma_addr_t *dma_addr);
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index b829ff67e3d9..1e97ae813ed0 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1928,21 +1928,16 @@ static int kvmgt_inject_msi(unsigned long handle, u32 addr, u16 data)
return -EFAULT;
}

-static unsigned long kvmgt_gfn_to_pfn(unsigned long handle, unsigned long gfn)
+static struct page *kvmgt_gfn_to_page(unsigned long handle, unsigned long gfn)
{
struct kvmgt_guest_info *info;
- kvm_pfn_t pfn;

if (!handle_valid(handle))
- return INTEL_GVT_INVALID_ADDR;
+ return NULL;

info = (struct kvmgt_guest_info *)handle;

- pfn = kvm_pfn_page_unwrap(gfn_to_pfn(info->kvm, gfn));
- if (is_error_noslot_pfn(pfn))
- return INTEL_GVT_INVALID_ADDR;
-
- return pfn;
+ return gfn_to_pfn(info->kvm, gfn).page;
}

static int kvmgt_dma_map_guest_page(unsigned long handle, unsigned long gfn,
@@ -2112,7 +2108,7 @@ static const struct intel_gvt_mpt kvmgt_mpt = {
.disable_page_track = kvmgt_page_track_remove,
.read_gpa = kvmgt_read_gpa,
.write_gpa = kvmgt_write_gpa,
- .gfn_to_mfn = kvmgt_gfn_to_pfn,
+ .gfn_to_mfn_page = kvmgt_gfn_to_page,
.dma_map_guest_page = kvmgt_dma_map_guest_page,
.dma_unmap_guest_page = kvmgt_dma_unmap_guest_page,
.dma_pin_guest_page = kvmgt_dma_pin_guest_page,
diff --git a/drivers/gpu/drm/i915/gvt/mpt.h b/drivers/gpu/drm/i915/gvt/mpt.h
index 550a456e936f..9169b83cf0f6 100644
--- a/drivers/gpu/drm/i915/gvt/mpt.h
+++ b/drivers/gpu/drm/i915/gvt/mpt.h
@@ -214,17 +214,17 @@ static inline int intel_gvt_hypervisor_write_gpa(struct intel_vgpu *vgpu,
}

/**
- * intel_gvt_hypervisor_gfn_to_mfn - translate a GFN to MFN
+ * intel_gvt_hypervisor_gfn_to_mfn_page - translate a GFN to MFN page
* @vgpu: a vGPU
* @gpfn: guest pfn
*
* Returns:
- * MFN on success, INTEL_GVT_INVALID_ADDR if failed.
+ * struct page* on success, NULL if failed.
*/
-static inline unsigned long intel_gvt_hypervisor_gfn_to_mfn(
+static inline struct page *intel_gvt_hypervisor_gfn_to_mfn_page(
struct intel_vgpu *vgpu, unsigned long gfn)
{
- return intel_gvt_host.mpt->gfn_to_mfn(vgpu->handle, gfn);
+ return intel_gvt_host.mpt->gfn_to_mfn_page(vgpu->handle, gfn);
}

/**
--
2.32.0.93.g670b81a890-goog

2021-06-24 04:02:19

by David Stevens

Subject: [PATCH 5/6] KVM: mmu: remove over-aggressive warnings

From: David Stevens <[email protected]>

Remove two warnings that require ref counts for pages to be non-zero, as
mapped pfns from follow_pfn may not have an initialized ref count.

Signed-off-by: David Stevens <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 7 -------
virt/kvm/kvm_main.c | 2 +-
2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8fa4a4a411ba..19249ad4b5b8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -546,13 +546,6 @@ static int mmu_spte_clear_track_bits(u64 *sptep)

pfn = spte_to_pfn(old_spte);

- /*
- * KVM does not hold the refcount of the page used by
- * kvm mmu, before reclaiming the page, we should
- * unmap it from mmu first.
- */
- WARN_ON(!kvm_is_reserved_pfn(pfn) && !page_count(pfn_to_page(pfn)));
-
if (is_accessed_spte(old_spte))
kvm_set_pfn_accessed(pfn);

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 898e90be4d0e..671361f30476 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -168,7 +168,7 @@ bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
* the device has been pinned, e.g. by get_user_pages(). WARN if the
* page_count() is zero to help detect bad usage of this helper.
*/
- if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
+ if (!pfn_valid(pfn) || !page_count(pfn_to_page(pfn)))
return false;

return is_zone_device_page(pfn_to_page(pfn));
--
2.32.0.93.g670b81a890-goog

2021-06-24 04:31:38

by David Stevens

Subject: Re: [PATCH 6/6] drm/i915/gvt: use gfn_to_pfn's page instead of pfn

Please ignore this last patch. It was put together as an afterthought
and wasn't properly tested.

-David

On Thu, Jun 24, 2021 at 12:59 PM David Stevens <[email protected]> wrote:
>
> Return struct page instead of pfn from gfn_to_mfn. This function is only
> used to determine if the page is a transparent hugepage, to enable 2MB
> huge gtt shadowing. Returning the page directly avoids the risk of
> calling pfn_to_page on a VM_IO|VM_PFNMAP pfn.
>
> [full patch snipped]

2021-06-24 06:59:28

by Paolo Bonzini

Subject: Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

On 24/06/21 05:57, David Stevens wrote:
> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
> associated struct pages, so they should not be passed to pfn_to_page.
> This series removes such calls from the x86 and arm64 secondary MMU. To
> do this, this series modifies gfn_to_pfn to return a struct page in
> addition to a pfn, if the hva was resolved by gup. This allows the
> caller to call put_page only when necessitated by gup.
>
> This series provides a helper function that unwraps the new return type
> of gfn_to_pfn to reproduce the old behavior. As I have no hardware to
> test the powerpc/mips changes, the helper is used there to keep the
> changes minimally invasive. Additionally, as gfn_to_page and
> gfn_to_pfn_cache are not integrated with the mmu notifier, they cannot
> easily be changed over to use only pfns.
>
> This addresses CVE-2021-22543 on x86 and arm64.

Thank you very much for this. I agree that it makes sense to have a
minimal change; I had similar changes almost ready, but was stuck with
deadlocks in the gfn_to_pfn_cache case. In retrospect I should have
posted something similar to your patches.

I have started reviewing the patches, and they look good. I will try to
include them in 5.13.

Paolo

2021-06-24 07:32:01

by Paolo Bonzini

Subject: Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

On 24/06/21 05:57, David Stevens wrote:
> static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> - int map_writable, int max_level, kvm_pfn_t pfn,
> + int map_writable, int max_level,
> + const struct kvm_pfn_page *pfnpg,
> bool prefault, bool is_tdp)
> {
> bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();

So the main difference with my tentative patches is that here I was
passing the struct by value. I'll try to clean this up for 5.15, but
for now it's all good like this. Thanks again!

Paolo

2021-06-24 08:46:21

by Nicholas Piggin

Subject: Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> From: David Stevens <[email protected]>

Changelog? This looks like a bug, should it have a Fixes: tag?

Thanks,
Nick

>
> Signed-off-by: David Stevens <[email protected]>
> ---
> [patch snipped]

2021-06-24 08:53:49

by Nicholas Piggin

Subject: Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> From: David Stevens <[email protected]>
>
> Return a struct kvm_pfn_page containing both a pfn and an optional
> struct page from the gfn_to_pfn family of functions. This differentiates
> the gup and follow_fault_pfn cases, which allows callers that only need
> a pfn to avoid touching the page struct in the latter case. For callers
> that need a struct page, introduce a helper function that unwraps a
> struct kvm_pfn_page into a struct page. This helper makes the call to
> kvm_get_pfn which had previously been in hva_to_pfn_remapped.
>
> For now, wrap all calls to gfn_to_pfn functions in the new helper
> function. Callers which don't need the page struct will be updated in
> follow-up patches.

Hmm. You mean callers that do need the page will be updated? Normally
if there will be leftover users that don't need the struct page then
you would go the other way and keep the old call the same, and add a new
one (gfn_to_pfn_page) just for those that need it.

Most kernel code I look at passes back multiple values by updating
pointers to structs or variables rather than by returning a struct. I
suppose that's not really a big deal, though, and is a matter of taste.

Thanks,
Nick

2021-06-24 08:59:29

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> From: David Stevens <[email protected]>
> out_unlock:
> if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> read_unlock(&vcpu->kvm->mmu_lock);
> else
> write_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(pfn);
> + if (pfnpg.page)
> + put_page(pfnpg.page);
> return r;
> }

How about

kvm_release_pfn_page_clean(pfnpg);
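
i.e. something like this, roughly (a sketch only, assuming the
kvm_pfn_page type from patch 2):

static inline void kvm_release_pfn_page_clean(struct kvm_pfn_page pfnpg)
{
	/* Only gup-backed results carry a page reference to drop. */
	if (pfnpg.page)
		put_page(pfnpg.page);
}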

Thanks,
Nick

2021-06-24 09:40:57

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

Hi David,

On Thu, 24 Jun 2021 04:57:45 +0100,
David Stevens <[email protected]> wrote:
>
> From: David Stevens <[email protected]>
>
> Return a struct kvm_pfn_page containing both a pfn and an optional
> struct page from the gfn_to_pfn family of functions. This differentiates
> the gup and follow_fault_pfn cases, which allows callers that only need
> a pfn to avoid touching the page struct in the latter case. For callers
> that need a struct page, introduce a helper function that unwraps a
> struct kvm_pfn_page into a struct page. This helper makes the call to
> kvm_get_pfn which had previously been in hva_to_pfn_remapped.
>
> For now, wrap all calls to gfn_to_pfn functions in the new helper
> function. Callers which don't need the page struct will be updated in
> follow-up patches.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> arch/arm64/kvm/mmu.c | 5 +-
> arch/mips/kvm/mmu.c | 3 +-
> arch/powerpc/kvm/book3s.c | 3 +-
> arch/powerpc/kvm/book3s_64_mmu_hv.c | 5 +-
> arch/powerpc/kvm/book3s_64_mmu_radix.c | 5 +-
> arch/powerpc/kvm/book3s_hv_uvmem.c | 4 +-
> arch/powerpc/kvm/e500_mmu_host.c | 2 +-
> arch/x86/kvm/mmu/mmu.c | 11 ++-
> arch/x86/kvm/mmu/mmu_audit.c | 2 +-
> arch/x86/kvm/x86.c | 2 +-
> drivers/gpu/drm/i915/gvt/kvmgt.c | 2 +-
> include/linux/kvm_host.h | 27 ++++--
> include/linux/kvm_types.h | 5 +
> virt/kvm/kvm_main.c | 121 +++++++++++++------------
> 14 files changed, 109 insertions(+), 88 deletions(-)
>

[...]

> +kvm_pfn_t kvm_pfn_page_unwrap(struct kvm_pfn_page pfnpg)
> +{
> + if (pfnpg.page)
> + return pfnpg.pfn;
> +
> + kvm_get_pfn(pfnpg.pfn);
> + return pfnpg.pfn;
> +}
> +EXPORT_SYMBOL_GPL(kvm_pfn_page_unwrap);

I'd really like to see a tiny bit of documentation explaining that
calls to kvm_pfn_page_unwrap() are not idempotent.
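
Something along these lines, say (wording is only a sketch):

/*
 * Note: kvm_pfn_page_unwrap() is not idempotent. When the kvm_pfn_page
 * carries no struct page (the follow_pte case), every call takes an
 * extra reference via kvm_get_pfn(). Unwrap a given kvm_pfn_page at
 * most once, and balance that call with kvm_release_pfn_*() as usual.
 */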

Otherwise, looks good to me.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-06-24 09:44:29

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

On 24/06/21 10:52, Nicholas Piggin wrote:
>> For now, wrap all calls to gfn_to_pfn functions in the new helper
>> function. Callers which don't need the page struct will be updated in
>> follow-up patches.
> Hmm. You mean callers that do need the page will be updated? Normally
> if there will be leftover users that don't need the struct page then
> you would go the other way and keep the old call the same, and add a new
> one (gfn_to_pfn_page) just for those that need it.

Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so
it's a good idea to move the short name to the common case and the ugly
kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one. In fact I'm not
sure there should be any kvm_pfn_page_unwrap in the end.
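
Side by side, the intended uses would be roughly (a sketch; the helpers
are the ones from patch 2):

/* Converted caller: the short name, working with the pair directly. */
static void example_converted(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_pfn_page pfnpg = gfn_to_pfn(kvm, gfn);

	/* ... use pfnpg.pfn ... */

	if (pfnpg.page)
		put_page(pfnpg.page);
}

/* Unconverted caller: old behavior, deliberately ugly to spot. */
static void example_unconverted(struct kvm *kvm, gfn_t gfn)
{
	kvm_pfn_t pfn = kvm_pfn_page_unwrap(gfn_to_pfn(kvm, gfn));

	/* ... use pfn ... */

	kvm_release_pfn_clean(pfn);
}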

Paolo

2021-06-24 09:47:22

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns

On 24/06/21 10:43, Nicholas Piggin wrote:
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> From: David Stevens <[email protected]>
>
> Changelog? This looks like a bug; should it have a Fixes: tag?

Probably has been there forever... The best way to fix the bug would be
to nuke mmu_audit.c, which I've threatened to do many times but never
followed up on.

Paolo

> Thanks,
> Nick
>
>>
>> Signed-off-by: David Stevens <[email protected]>
>> ---
>> arch/x86/kvm/mmu/mmu_audit.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
>> index cedc17b2f60e..97ff184084b4 100644
>> --- a/arch/x86/kvm/mmu/mmu_audit.c
>> +++ b/arch/x86/kvm/mmu/mmu_audit.c
>> @@ -121,6 +121,8 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
>> audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
>> "ent %llxn", vcpu->arch.mmu->root_level, pfn,
>> hpa, *sptep);
>> +
>> + kvm_release_pfn_clean(pfn);
>> }
>>
>> static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
>> --
>> 2.32.0.93.g670b81a890-goog
>>
>>
>

2021-06-24 09:59:47

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

Excerpts from Paolo Bonzini's message of June 24, 2021 7:42 pm:
> On 24/06/21 10:52, Nicholas Piggin wrote:
>>> For now, wrap all calls to gfn_to_pfn functions in the new helper
>>> function. Callers which don't need the page struct will be updated in
>>> follow-up patches.
>> Hmm. You mean callers that do need the page will be updated? Normally
>> if there will be leftover users that don't need the struct page then
>> you would go the other way and keep the old call the same, and add a new
>> one (gfn_to_pfn_page) just for those that need it.
>
> Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so
> it's a good idea to move the short name to the common case and the ugly
> kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one. In fact I'm not
> sure there should be any kvm_pfn_page_unwrap in the end.

If all callers were updated that is one thing, but from the changelog
it sounds like that would not happen and there would be some gfn_to_pfn
users left over.

But yes in the end you would either need to make gfn_to_pfn never return
a page found via follow_pte, or change all callers to the new way. If
the plan is for the latter then I guess that's fine.

Thanks,
Nick

2021-06-24 10:07:32

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

On Thu, 24 Jun 2021 09:58:00 +0100,
Nicholas Piggin <[email protected]> wrote:
>
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> > From: David Stevens <[email protected]>
> > out_unlock:
> > if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> > read_unlock(&vcpu->kvm->mmu_lock);
> > else
> > write_unlock(&vcpu->kvm->mmu_lock);
> > - kvm_release_pfn_clean(pfn);
> > + if (pfnpg.page)
> > + put_page(pfnpg.page);
> > return r;
> > }
>
> How about
>
> kvm_release_pfn_page_clean(pfnpg);

I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
doesn't mark the page 'clean'. I find put_page() more correct.

Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
bad at naming things that I could just as well call it 'bob()'.

M.

--
Without deviation from the norm, progress is not possible.

2021-06-24 10:15:22

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

On 24/06/21 11:57, Nicholas Piggin wrote:
>> Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so
>> it's a good idea to move the short name to the common case and the ugly
>> kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one. In fact I'm not
>> sure there should be any kvm_pfn_page_unwrap in the end.
>
> If all callers were updated that is one thing, but from the changelog
> it sounds like that would not happen and there would be some gfn_to_pfn
> users left over.

In these patches there are, so yeah, the plan is to always change the
callers to the new way.

> But yes in the end you would either need to make gfn_to_pfn never return
> a page found via follow_pte, or change all callers to the new way. If
> the plan is for the latter then I guess that's fine.

2021-06-24 10:19:13

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

On 24/06/21 12:06, Marc Zyngier wrote:
> On Thu, 24 Jun 2021 09:58:00 +0100,
> Nicholas Piggin <[email protected]> wrote:
>>
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>> From: David Stevens <[email protected]>
>>> out_unlock:
>>> if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
>>> read_unlock(&vcpu->kvm->mmu_lock);
>>> else
>>> write_unlock(&vcpu->kvm->mmu_lock);
>>> - kvm_release_pfn_clean(pfn);
>>> + if (pfnpg.page)
>>> + put_page(pfnpg.page);
>>> return r;
>>> }
>>
>> How about
>>
>> kvm_release_pfn_page_clean(pfnpg);
>
> I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
> doesn't mark the page 'clean'. I find put_page() more correct.
>
> Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
> bad at naming things that I could just as well call it 'bob()'.

The best way to go would be to get rid of kvm_release_pfn_clean() and
always go through a pfn_page. Then we could or could not introduce
wrappers kvm_put_pfn_page{,_dirty}.
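
As a rough sketch (the dirty variant is assumed to mirror
kvm_release_pfn_dirty()):

static inline void kvm_put_pfn_page(struct kvm_pfn_page pfnpg)
{
	if (pfnpg.page)
		put_page(pfnpg.page);
}

static inline void kvm_put_pfn_page_dirty(struct kvm_pfn_page pfnpg)
{
	if (pfnpg.page) {
		SetPageDirty(pfnpg.page);
		put_page(pfnpg.page);
	}
}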

I think for now it's best to limit the churn since these patches will go
in the stable releases too, and clean up the resulting API once we have
a clear idea of how all architectures are using kvm_pfn_page.

Paolo

2021-06-24 10:20:29

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

Excerpts from Nicholas Piggin's message of June 24, 2021 7:57 pm:
> Excerpts from Paolo Bonzini's message of June 24, 2021 7:42 pm:
>> On 24/06/21 10:52, Nicholas Piggin wrote:
>>>> For now, wrap all calls to gfn_to_pfn functions in the new helper
>>>> function. Callers which don't need the page struct will be updated in
>>>> follow-up patches.
>>> Hmm. You mean callers that do need the page will be updated? Normally
>>> if there will be leftover users that don't need the struct page then
>>> you would go the other way and keep the old call the same, and add a new
>>> one (gfn_to_pfn_page) just for those that need it.
>>
>> Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so
>> it's a good idea to move the short name to the common case and the ugly
>> kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one. In fact I'm not
>> sure there should be any kvm_pfn_page_unwrap in the end.
>
> If all callers were updated that is one thing, but from the changelog
> it sounds like that would not happen and there would be some gfn_to_pfn
> users left over.
>
> But yes in the end you would either need to make gfn_to_pfn never return
> a page found via follow_pte, or change all callers to the new way. If
> the plan is for the latter then I guess that's fine.

Actually in that case anyway I don't see the need -- the existence of
gfn_to_pfn is enough to know it might be buggy. It can just as easily
be grepped for as kvm_pfn_page_unwrap. And are gfn_to_page cases also
vulnerable to the same issue?

So I think it could be marked deprecated or something if not everything
will be converted in the one series, and then we wouldn't need to touch
all that arch code with this patch.

Thanks,
Nick

2021-06-24 10:22:28

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

On 24/06/21 12:17, Nicholas Piggin wrote:
>> If all callers were updated that is one thing, but from the changelog
>> it sounds like that would not happen and there would be some gfn_to_pfn
>> users left over.
>>
>> But yes in the end you would either need to make gfn_to_pfn never return
>> a page found via follow_pte, or change all callers to the new way. If
>> the plan is for the latter then I guess that's fine.
>
> Actually in that case anyway I don't see the need -- the existence of
> gfn_to_pfn is enough to know it might be buggy. It can just as easily
> be grepped for as kvm_pfn_page_unwrap.

Sure, but that would leave us with longer function names
(gfn_to_pfn_page* instead of gfn_to_pfn*). So the "safe" use is the one
that looks worse and the unsafe use is the one that looks safe.

> And are gfn_to_page cases also
> vulnerable to the same issue?

No, they're just broken for the VM_IO|VM_PFNMAP case.

Paolo

> So I think it could be marked deprecated or something if not everything
> will be converted in the one series, and then we wouldn't need to touch
> all that arch code with this patch.

2021-06-24 10:35:25

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
> associated struct pages, so they should not be passed to pfn_to_page.
> This series removes such calls from the x86 and arm64 secondary MMU. To
> do this, this series modifies gfn_to_pfn to return a struct page in
> addition to a pfn, if the hva was resolved by gup. This allows the
> caller to call put_page only when necessitated by gup.
>
> This series provides a helper function that unwraps the new return type
> of gfn_to_pfn to provide behavior identical to the old behavior. As I
> have no hardware to test powerpc/mips changes, the function is used
> there for minimally invasive changes. Additionally, as gfn_to_page and
> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
> easily changed over to only use pfns.
>
> This addresses CVE-2021-22543 on x86 and arm64.

Does this fix the problem? (untested, as I don't have a POC setup at hand,
but at least in concept)

I have no problem with improving the API, and the direction of your
series is probably good. But there seems to be a lot of unfixed arch
code and broken APIs left to fix after your series too. This might be
the most suitable patch to backport, and a base for your series, which
can take more time to convert to the new APIs.

Thanks,
Nick

---

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..e208c279d903 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2104,13 +2104,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* Whoever called remap_pfn_range is also going to call e.g.
* unmap_mapping_range before the underlying pages are freed,
* causing a call to our MMU notifier.
+ *
+ * Certain IO or PFNMAP mappings can be backed with valid
+ * struct pages, but be allocated without refcounting e.g.,
+ * tail pages of non-compound higher order allocations, which
+ * would then underflow the refcount when the caller does the
+ * required put_page. Don't allow those pages here.
*/
- kvm_get_pfn(pfn);
+ if (!kvm_try_get_pfn(pfn))
+ r = -EFAULT;

out:
pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
- return 0;
+
+ return r;
}

/*
@@ -2487,6 +2495,13 @@ void kvm_set_pfn_accessed(kvm_pfn_t pfn)
}
EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);

+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+ if (kvm_is_reserved_pfn(pfn))
+ return 1;
+ return get_page_unless_zero(pfn_to_page(pfn));
+}
+
void kvm_get_pfn(kvm_pfn_t pfn)
{
if (!kvm_is_reserved_pfn(pfn))

2021-06-24 10:44:32

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 4/6] KVM: arm64/mmu: avoid struct page in MMU

On Thu, 24 Jun 2021 04:57:47 +0100,
David Stevens <[email protected]> wrote:
>
> From: David Stevens <[email protected]>
>
> Avoid converting pfns returned by follow_fault_pfn to struct pages to
> transiently take a reference. The reference was originally taken to
> match the reference taken by gup. However, pfns returned by
> follow_fault_pfn may not have a struct page set up for reference
> counting.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> arch/arm64/kvm/mmu.c | 43 +++++++++++++++++++++++--------------------
> 1 file changed, 23 insertions(+), 20 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 896b3644b36f..a741972cb75f 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c

[...]

> @@ -968,16 +968,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> */
> if (vma_pagesize == PAGE_SIZE && !force_pte)
> vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> - &pfn, &fault_ipa);
> + &pfnpg, &fault_ipa);
> if (writable)
> prot |= KVM_PGTABLE_PROT_W;
>
> if (fault_status != FSC_PERM && !device)
> - clean_dcache_guest_page(pfn, vma_pagesize);
> + clean_dcache_guest_page(pfnpg.pfn, vma_pagesize);
>
> if (exec_fault) {
> prot |= KVM_PGTABLE_PROT_X;
> - invalidate_icache_guest_page(pfn, vma_pagesize);
> + invalidate_icache_guest_page(pfnpg.pfn, vma_pagesize);

This is going to clash with what is currently in -next, specially with
MTE.

Paolo, if you really want to take this in 5.13, can you please push a
stable branch based on -rc4 or older so that I can merge it in and
test it?

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-06-24 10:44:51

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

Excerpts from Marc Zyngier's message of June 24, 2021 8:06 pm:
> On Thu, 24 Jun 2021 09:58:00 +0100,
> Nicholas Piggin <[email protected]> wrote:
>>
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> > From: David Stevens <[email protected]>
>> > out_unlock:
>> > if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
>> > read_unlock(&vcpu->kvm->mmu_lock);
>> > else
>> > write_unlock(&vcpu->kvm->mmu_lock);
>> > - kvm_release_pfn_clean(pfn);
>> > + if (pfnpg.page)
>> > + put_page(pfnpg.page);
>> > return r;
>> > }
>>
>> How about
>>
>> kvm_release_pfn_page_clean(pfnpg);
>
> I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
> doesn't mark the page 'clean'. I find put_page() more correct.
>
> Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
> bad at naming things that I could just as well call it 'bob()'.

That seems like a fine name to me. A little better than bob.

Thanks,
Nick

2021-06-24 10:45:34

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

Excerpts from Paolo Bonzini's message of June 24, 2021 8:21 pm:
> On 24/06/21 12:17, Nicholas Piggin wrote:
>>> If all callers were updated that is one thing, but from the changelog
>>> it sounds like that would not happen and there would be some gfn_to_pfn
>>> users left over.
>>>
>>> But yes in the end you would either need to make gfn_to_pfn never return
>>> a page found via follow_pte, or change all callers to the new way. If
>>> the plan is for the latter then I guess that's fine.
>>
>> Actually in that case anyway I don't see the need -- the existence of
>> gfn_to_pfn is enough to know it might be buggy. It can just as easily
>> be grepped for as kvm_pfn_page_unwrap.
>
> Sure, but that would leave us with longer function names
> (gfn_to_pfn_page* instead of gfn_to_pfn*). So the "safe" use is the one
> that looks worse and the unsafe use is the one that looks safe.

The churn isn't justified because of function name length. Choose g2pp()
if you want a non-descriptive but short name.

The existing name isn't good anyway because it not only looks up a pfn
but also a page, and more importantly it gets a ref on the page. The
name should be changed if you introduce a new API.

>> And are gfn_to_page cases also
>> vulnerable to the same issue?
>
> No, they're just broken for the VM_IO|VM_PFNMAP case.

Do you mean they aren't vulnerable, or that they are vulnerable but
also broken in other cases?

Thanks,
Nick

2021-06-24 11:44:48

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
>> associated struct pages, so they should not be passed to pfn_to_page.
>> This series removes such calls from the x86 and arm64 secondary MMU. To
>> do this, this series modifies gfn_to_pfn to return a struct page in
>> addition to a pfn, if the hva was resolved by gup. This allows the
>> caller to call put_page only when necessitated by gup.
>>
>> This series provides a helper function that unwraps the new return type
>> of gfn_to_pfn to provide behavior identical to the old behavior. As I
>> have no hardware to test powerpc/mips changes, the function is used
>> there for minimally invasive changes. Additionally, as gfn_to_page and
>> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
>> easily changed over to only use pfns.
>>
>> This addresses CVE-2021-22543 on x86 and arm64.
>
> Does this fix the problem? (untested, as I don't have a POC setup at hand,
> but at least in concept)

This one actually compiles at least. Unfortunately I don't have much
time in the near future to test, and I only just found out about this
CVE a few hours ago.

---


It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).
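
In miniature, the failure mode is (illustrative only):

/*
 * Suppose P is a valid page with page_count(P) == 0, e.g. the tail
 * page of a non-compound higher-order allocation.
 *
 *	get_page(P);	// refcount 0 -> 1
 *	put_page(P);	// refcount 1 -> 0 frees P, corrupting the
 *			//   enclosing allocation
 *
 * get_page_unless_zero(P) returns false instead, so such a mapping
 * is now refused with -EFAULT.
 */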

---
virt/kvm/kvm_main.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
return true;
}

+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+ if (kvm_is_reserved_pfn(pfn))
+ return 1;
+ return get_page_unless_zero(pfn_to_page(pfn));
+}
+
static int hva_to_pfn_remapped(struct vm_area_struct *vma,
unsigned long addr, bool *async,
bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* Whoever called remap_pfn_range is also going to call e.g.
* unmap_mapping_range before the underlying pages are freed,
* causing a call to our MMU notifier.
+ *
+ * Certain IO or PFNMAP mappings can be backed with valid
+ * struct pages, but be allocated without refcounting e.g.,
+ * tail pages of non-compound higher order allocations, which
+ * would then underflow the refcount when the caller does the
+ * required put_page. Don't allow those pages here.
*/
- kvm_get_pfn(pfn);
+ if (!kvm_try_get_pfn(pfn))
+ r = -EFAULT;

out:
pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
- return 0;
+
+ return r;
}

/*
--
2.23.0

2021-06-24 12:01:23

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

On 24/06/21 13:42, Nicholas Piggin wrote:
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> + if (kvm_is_reserved_pfn(pfn))
> + return 1;

So a !pfn_valid pfn would always return true here. Yeah, this should work and is
certainly appealing!
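
(For reference, a simplified sketch of kvm_is_reserved_pfn(): any pfn
without a valid struct page is treated as reserved, which is why the
!pfn_valid case always succeeds here.)

bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
{
	if (pfn_valid(pfn))
		return PageReserved(pfn_to_page(pfn));

	return true;	/* no struct page at all: treat as reserved */
}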

Paolo


> + return get_page_unless_zero(pfn_to_page(pfn));
> +}
> +
> static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> unsigned long addr, bool *async,
> bool write_fault, bool *writable,
> @@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> * Whoever called remap_pfn_range is also going to call e.g.
> * unmap_mapping_range before the underlying pages are freed,
> * causing a call to our MMU notifier.
> + *
> + * Certain IO or PFNMAP mappings can be backed with valid
> + * struct pages, but be allocated without refcounting e.g.,
> + * tail pages of non-compound higher order allocations, which
> + * would then underflow the refcount when the caller does the
> + * required put_page. Don't allow those pages here.
> */
> - kvm_get_pfn(pfn);
> + if (!kvm_try_get_pfn(pfn))
> + r = -EFAULT;
>
> out:
> pte_unmap_unlock(ptep, ptl);
> *p_pfn = pfn;
> - return 0;
> +
> + return r;
> }
>
> /*
>

2021-06-24 12:44:41

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

On 24/06/21 13:42, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>>> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
>>> associated struct pages, so they should not be passed to pfn_to_page.
>>> This series removes such calls from the x86 and arm64 secondary MMU. To
>>> do this, this series modifies gfn_to_pfn to return a struct page in
>>> addition to a pfn, if the hva was resolved by gup. This allows the
>>> caller to call put_page only when necessitated by gup.
>>>
>>> This series provides a helper function that unwraps the new return type
>>> of gfn_to_pfn to provide behavior identical to the old behavior. As I
>>> have no hardware to test powerpc/mips changes, the function is used
>>> there for minimally invasive changes. Additionally, as gfn_to_page and
>>> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
>>> easily changed over to only use pfns.
>>>
>>> This addresses CVE-2021-22543 on x86 and arm64.
>>
>> Does this fix the problem? (untested, as I don't have a POC setup at hand,
>> but at least in concept)
>
> This one actually compiles at least. Unfortunately I don't have much
> time in the near future to test, and I only just found out about this
> CVE a few hours ago.

And it also works (the reproducer gets an infinite stream of userspace
exits and especially does not crash). We can still go for David's
solution later since MMU notifiers are able to deal with these pages, but
it's a very nice patch for stable kernels.

If you provide a Signed-off-by, I can integrate it.

Paolo

> ---
>
>
> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations). These
> host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
> of APIs, which take a reference to the page, which takes it from 0 to 1.
> When the reference is dropped, this will free the page incorrectly.
>
> Fix this by only taking a reference on the page if it was non-zero,
> which indicates it is participating in normal refcounting (and can be
> released with put_page).
>
> ---
> virt/kvm/kvm_main.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6a6bc7af0e28..46fb042837d2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
> return true;
> }
>
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> + if (kvm_is_reserved_pfn(pfn))
> + return 1;
> + return get_page_unless_zero(pfn_to_page(pfn));
> +}
> +
> static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> unsigned long addr, bool *async,
> bool write_fault, bool *writable,
> @@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> * Whoever called remap_pfn_range is also going to call e.g.
> * unmap_mapping_range before the underlying pages are freed,
> * causing a call to our MMU notifier.
> + *
> + * Certain IO or PFNMAP mappings can be backed with valid
> + * struct pages, but be allocated without refcounting e.g.,
> + * tail pages of non-compound higher order allocations, which
> + * would then underflow the refcount when the caller does the
> + * required put_page. Don't allow those pages here.
> */
> - kvm_get_pfn(pfn);
> + if (!kvm_try_get_pfn(pfn))
> + r = -EFAULT;
>
> out:
> pte_unmap_unlock(ptep, ptl);
> *p_pfn = pfn;
> - return 0;
> +
> + return r;
> }
>
> /*
>

2021-06-24 13:00:14

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

Excerpts from Paolo Bonzini's message of June 24, 2021 10:41 pm:
> On 24/06/21 13:42, Nicholas Piggin wrote:
>> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>>>> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
>>>> associated struct pages, so they should not be passed to pfn_to_page.
>>>> This series removes such calls from the x86 and arm64 secondary MMU. To
>>>> do this, this series modifies gfn_to_pfn to return a struct page in
>>>> addition to a pfn, if the hva was resolved by gup. This allows the
>>>> caller to call put_page only when necessitated by gup.
>>>>
>>>> This series provides a helper function that unwraps the new return type
>>>> of gfn_to_pfn to provide behavior identical to the old behavior. As I
>>>> have no hardware to test powerpc/mips changes, the function is used
>>>> there for minimally invasive changes. Additionally, as gfn_to_page and
>>>> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
>>>> easily changed over to only use pfns.
>>>>
>>>> This addresses CVE-2021-22543 on x86 and arm64.
>>>
>>> Does this fix the problem? (untested, as I don't have a POC setup at hand,
>>> but at least in concept)
>>
>> This one actually compiles at least. Unfortunately I don't have much
>> time in the near future to test, and I only just found out about this
>> CVE a few hours ago.
>
> And it also works (the reproducer gets an infinite stream of userspace
> exits and especially does not crash). We can still go for David's
> solution later since MMU notifiers are able to deal with these pages, but
> it's a very nice patch for stable kernels.

Oh nice, thanks for testing. How's this?

Thanks,
Nick

---

KVM: Fix page ref underflow for regions with valid but non-refcounted pages

It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

Signed-off-by: Nicholas Piggin <[email protected]>
---
virt/kvm/kvm_main.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
return true;
}

+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+ if (kvm_is_reserved_pfn(pfn))
+ return 1;
+ return get_page_unless_zero(pfn_to_page(pfn));
+}
+
static int hva_to_pfn_remapped(struct vm_area_struct *vma,
unsigned long addr, bool *async,
bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* Whoever called remap_pfn_range is also going to call e.g.
* unmap_mapping_range before the underlying pages are freed,
* causing a call to our MMU notifier.
+ *
+ * Certain IO or PFNMAP mappings can be backed with valid
+ * struct pages, but be allocated without refcounting e.g.,
+ * tail pages of non-compound higher order allocations, which
+ * would then underflow the refcount when the caller does the
+ * required put_page. Don't allow those pages here.
*/
- kvm_get_pfn(pfn);
+ if (!kvm_try_get_pfn(pfn))
+ r = -EFAULT;

out:
pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
- return 0;
+
+ return r;
}

/*
--
2.23.0

2021-06-24 15:38:22

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns

On Thu, Jun 24, 2021, Paolo Bonzini wrote:
> On 24/06/21 10:43, Nicholas Piggin wrote:
> > Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> > > From: David Stevens <[email protected]>
> >
> > Changelog? This looks like a bug; should it have a Fixes: tag?
>
> Probably has been there forever... The best way to fix the bug would be to
> nuke mmu_audit.c, which I've threatened to do many times but never followed
> up on.

Yar. It has only survived because it hasn't required any maintenance.

2021-06-24 15:39:44

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

On 24/06/21 14:57, Nicholas Piggin wrote:
> KVM: Fix page ref underflow for regions with valid but non-refcounted pages

It doesn't really fix the underflow; it disallows mapping them in the
first place. Since in principle things can break, I'd rather be
explicit, so let's go with "KVM: do not allow mapping valid but
non-reference-counted pages".

> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations). These
> host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
> of APIs, which take a reference to the page, which takes it from 0 to 1.
> When the reference is dropped, this will free the page incorrectly.
>
> Fix this by only taking a reference on the page if it was non-zero,

s/on the page/on valid pages/ (makes clear that invalid pages are fine
without refcounting).

Thank you *so* much, I'm awful at Linux mm.

Paolo

> which indicates it is participating in normal refcounting (and can be
> released with put_page).
>
> Signed-off-by: Nicholas Piggin<[email protected]>

2021-06-25 00:21:42

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

Excerpts from Paolo Bonzini's message of June 25, 2021 1:35 am:
> On 24/06/21 14:57, Nicholas Piggin wrote:
>> KVM: Fix page ref underflow for regions with valid but non-refcounted pages
>
> It doesn't really fix the underflow; it disallows mapping them in the
> first place. Since in principle things can break, I'd rather be
> explicit, so let's go with "KVM: do not allow mapping valid but
> non-reference-counted pages".
>
>> It's possible to create a region which maps valid but non-refcounted
>> pages (e.g., tail pages of non-compound higher order allocations). These
>> host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
>> of APIs, which take a reference to the page, which takes it from 0 to 1.
>> When the reference is dropped, this will free the page incorrectly.
>>
>> Fix this by only taking a reference on the page if it was non-zero,
>
> s/on the page/on valid pages/ (makes clear that invalid pages are fine
> without refcounting).

That seems okay; you can adjust the title or changelog as you like.

> Thank you *so* much, I'm awful at Linux mm.

Glad to help. It's easy to see why you were taking this approach: the
API really does need to be improved, and even a subsystem as
intertwined with mm as KVM shouldn't _really_ be doing this kind of
trick (and it should go away when the old API is removed).

Thanks,
Nick

2021-06-25 07:47:07

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU



On 24.06.21 14:57, Nicholas Piggin wrote:
> Excerpts from Paolo Bonzini's message of June 24, 2021 10:41 pm:
>> On 24/06/21 13:42, Nicholas Piggin wrote:
>>> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>>>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>>>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>>>>> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
>>>>> associated struct pages, so they should not be passed to pfn_to_page.
>>>>> This series removes such calls from the x86 and arm64 secondary MMU. To
>>>>> do this, this series modifies gfn_to_pfn to return a struct page in
>>>>> addition to a pfn, if the hva was resolved by gup. This allows the
>>>>> caller to call put_page only when necessitated by gup.
>>>>>
>>>>> This series provides a helper function that unwraps the new return type
>>>>> of gfn_to_pfn to provide behavior identical to the old behavior. As I
>>>>> have no hardware to test powerpc/mips changes, the function is used
>>>>> there for minimally invasive changes. Additionally, as gfn_to_page and
>>>>> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
>>>>> easily changed over to only use pfns.
>>>>>
>>>>> This addresses CVE-2021-22543 on x86 and arm64.
>>>>
>>>> Does this fix the problem? (untested, as I don't have a POC setup at hand,
>>>> but at least in concept)
>>>
>>> This one actually compiles at least. Unfortunately I don't have much
>>> time in the near future to test, and I only just found out about this
>>> CVE a few hours ago.
>>
>> And it also works (the reproducer gets an infinite stream of userspace
>> exits and especially does not crash). We can still go for David's
>> solution later since MMU notifiers are able to deal with these pages, but
>> it's a very nice patch for stable kernels.
>
> Oh nice, thanks for testing. How's this?
>
> Thanks,
> Nick
>
> ---
>
> KVM: Fix page ref underflow for regions with valid but non-refcounted pages
>
> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations). These
> host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
> of APIs, which take a reference to the page, which takes it from 0 to 1.
> When the reference is dropped, this will free the page incorrectly.
>
> Fix this by only taking a reference on the page if it was non-zero,
> which indicates it is participating in normal refcounting (and can be
> released with put_page).
>
> Signed-off-by: Nicholas Piggin <[email protected]>
> ---
> virt/kvm/kvm_main.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6a6bc7af0e28..46fb042837d2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
> return true;
> }
>
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> + if (kvm_is_reserved_pfn(pfn))
> + return 1;
> + return get_page_unless_zero(pfn_to_page(pfn));
> +}
> +
> static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> unsigned long addr, bool *async,
> bool write_fault, bool *writable,
> @@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> * Whoever called remap_pfn_range is also going to call e.g.
> * unmap_mapping_range before the underlying pages are freed,
> * causing a call to our MMU notifier.
> + *
> + * Certain IO or PFNMAP mappings can be backed with valid
> + * struct pages, but be allocated without refcounting e.g.,
> + * tail pages of non-compound higher order allocations, which
> + * would then underflow the refcount when the caller does the
> + * required put_page. Don't allow those pages here.
> */
> - kvm_get_pfn(pfn);
> + if (!kvm_try_get_pfn(pfn))
> + r = -EFAULT;
>

Right. That should also take care of s390 (pin_guest_page in vsie.c
which calls gfn_to_page).
FWIW, the current API is really hard to follow as it does not tell
which functions take a reference and which don't.

Anyway, this patch (with cc stable?)

Reviewed-by: Christian Borntraeger <[email protected]>

> out:
> pte_unmap_unlock(ptep, ptl);
> *p_pfn = pfn;
> - return 0;
> +
> + return r;
> }
>
> /*
>