From: David Stevens <[email protected]>
This patch series adds support for mapping VM_IO and VM_PFNMAP memory
that is backed by struct pages that aren't currently being refcounted
(e.g. tail pages of non-compound higher order allocations) into the
guest.
Our use case is virtio-gpu blob resources [1], which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device.
This feature currently does not work on systems using the amdgpu driver,
as that driver allocates non-compound higher order pages via
ttm_pool_alloc_page.
Currently, the gfn_to_pfn functions require being able to pin the target
pfn, so they fail when the pfn returned by follow_pte isn't a
ref-counted page. However, the KVM secondary MMUs do not require that
the pfn be pinned, since they are integrated with the mmu notifier API.
This series adds a new set of gfn_to_pfn_noref functions which parallel
the gfn_to_pfn functions but do not pin the pfn. The new functions
return the page from gup if it was present, so callers can use it and
call put_page when done.
This series updates x86 and arm64 secondary MMUs to the new API. Other
MMUs can likely be updated without too much difficulty, but I am not
familiar with them and have no way to test them. On the other hand,
updating the rest of KVM would require replacing all usages of
kvm_vcpu_map with the gfn_to_pfn_cache, which is not at all easy [2].
[1] https://patchwork.kernel.org/project/dri-devel/cover/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
v5 -> v6:
- rebase on kvm next branch
- rename gfn_to_pfn_page to gfn_to_pfn_noref
- fix uninitialized outparam in error case of __kvm_faultin_pfn
- add kvm_release_pfn_noref_clean for releasing pfn/page pair
v4 -> v5:
- rebase on kvm next branch again
v3 -> v4:
- rebase on kvm next branch again
- Add some more context to a comment in ensure_pfn_ref
v2 -> v3:
- rebase on kvm next branch
v1 -> v2:
- Introduce new gfn_to_pfn_page functions instead of modifying the
behavior of existing gfn_to_pfn functions, to make the change less
invasive.
- Drop changes to mmu_audit.c
- Include Nicholas Piggin's patch to avoid corrupting the refcount in the
follow_pte case, and use it in the deprecated gfn_to_pfn functions.
- Rebase on kvm/next
David Stevens (4):
KVM: mmu: introduce new gfn_to_pfn_noref functions
KVM: x86/mmu: use gfn_to_pfn_noref
KVM: arm64/mmu: use gfn_to_pfn_noref
KVM: mmu: remove over-aggressive warnings
arch/arm64/kvm/mmu.c | 21 ++--
arch/x86/kvm/mmu/mmu.c | 29 ++---
arch/x86/kvm/mmu/mmu_internal.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 7 +-
arch/x86/kvm/x86.c | 5 +-
include/linux/kvm_host.h | 18 +++
virt/kvm/kvm_main.c | 214 +++++++++++++++++++++++---------
virt/kvm/kvm_mm.h | 6 +-
virt/kvm/pfncache.c | 12 +-
9 files changed, 220 insertions(+), 93 deletions(-)
--
2.40.0.348.gf938b09366-goog
From: David Stevens <[email protected]>
Switch the x86 mmu to the new gfn_to_pfn_noref functions. This allows IO
and PFNMAP mappings backed with valid struct pages but without
refcounting (e.g. tail pages of non-compound higher order allocations)
to be mapped into the guest.
Signed-off-by: David Stevens <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 19 ++++++++++---------
arch/x86/kvm/mmu/mmu_internal.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 7 ++++---
arch/x86/kvm/x86.c | 5 +++--
4 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 144c5a01cd77..86b74e7bccfa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3114,7 +3114,7 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (unlikely(fault->max_level == PG_LEVEL_4K))
return;
- if (is_error_noslot_pfn(fault->pfn))
+ if (!fault->page)
return;
if (kvm_slot_dirty_track_enabled(slot))
@@ -4224,6 +4224,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (is_guest_mode(vcpu)) {
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
+ fault->page = NULL;
fault->map_writable = false;
return RET_PF_CONTINUE;
}
@@ -4239,9 +4240,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
}
async = false;
- fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
- fault->write, &fault->map_writable,
- &fault->hva);
+ fault->pfn = __gfn_to_pfn_noref_memslot(slot, fault->gfn, false, false, &async,
+ fault->write, &fault->map_writable,
+ &fault->hva, &fault->page);
if (!async)
return RET_PF_CONTINUE; /* *pfn has correct page already */
@@ -4261,9 +4262,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* to wait for IO. Note, gup always bails if it is unable to quickly
* get a page and a fatal signal, i.e. SIGKILL, is pending.
*/
- fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
- fault->write, &fault->map_writable,
- &fault->hva);
+ fault->pfn = __gfn_to_pfn_noref_memslot(slot, fault->gfn, false, true, NULL,
+ fault->write, &fault->map_writable,
+ &fault->hva, &fault->page);
return RET_PF_CONTINUE;
}
@@ -4349,7 +4350,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_noref_clean(fault->pfn, fault->page);
return r;
}
@@ -4427,7 +4428,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
out_unlock:
read_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_noref_clean(fault->pfn, fault->page);
return r;
}
#endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2cbb155c686c..6ee34a2d0e13 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -239,6 +239,7 @@ struct kvm_page_fault {
unsigned long mmu_seq;
kvm_pfn_t pfn;
hva_t hva;
+ struct page *page;
bool map_writable;
/*
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index a056f2773dd9..e4e54e372721 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -525,6 +525,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
unsigned pte_access;
gfn_t gfn;
kvm_pfn_t pfn;
+ struct page *page;
if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
return false;
@@ -540,12 +541,12 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (!slot)
return false;
- pfn = gfn_to_pfn_memslot_atomic(slot, gfn);
+ pfn = gfn_to_pfn_noref_memslot_atomic(slot, gfn, &page);
if (is_error_pfn(pfn))
return false;
mmu_set_spte(vcpu, slot, spte, pte_access, gfn, pfn, NULL);
- kvm_release_pfn_clean(pfn);
+ kvm_release_pfn_noref_clean(pfn, page);
return true;
}
@@ -830,7 +831,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_noref_clean(fault->pfn, fault->page);
return r;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 237c483b1230..53a8c9e776e5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8458,6 +8458,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
{
gpa_t gpa = cr2_or_gpa;
kvm_pfn_t pfn;
+ struct page *page;
if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
@@ -8487,7 +8488,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* retry instruction -> write #PF -> emulation fail -> retry
* instruction -> ...
*/
- pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+ pfn = gfn_to_pfn_noref(vcpu->kvm, gpa_to_gfn(gpa), &page);
/*
* If the instruction failed on the error pfn, it can not be fixed,
@@ -8496,7 +8497,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (is_error_noslot_pfn(pfn))
return false;
- kvm_release_pfn_clean(pfn);
+ kvm_release_pfn_noref_clean(pfn, page);
/* The instructions are well-emulated on direct mmu. */
if (vcpu->arch.mmu->root_role.direct) {
--
2.40.0.348.gf938b09366-goog
From: David Stevens <[email protected]>
Switch the arm64 mmu to the new gfn_to_pfn_noref functions. This allows
IO and PFNMAP mappings backed with valid struct pages but without
refcounting (e.g. tail pages of non-compound higher order allocations)
to be mapped into the guest.
Signed-off-by: David Stevens <[email protected]>
---
arch/arm64/kvm/mmu.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7113587222ff..0fd726e82a19 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1082,7 +1082,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
static unsigned long
transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
unsigned long hva, kvm_pfn_t *pfnp,
- phys_addr_t *ipap)
+ struct page **page, phys_addr_t *ipap)
{
kvm_pfn_t pfn = *pfnp;
@@ -1091,7 +1091,8 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
* sure that the HVA and IPA are sufficiently aligned and that the
* block map is contained within the memslot.
*/
- if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
+ if (*page &&
+ fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
get_user_mapping_size(kvm, hva) >= PMD_SIZE) {
/*
* The address we faulted on is backed by a transparent huge
@@ -1112,10 +1113,11 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
* page accordingly.
*/
*ipap &= PMD_MASK;
- kvm_release_pfn_clean(pfn);
+ kvm_release_page_clean(*page);
pfn &= ~(PTRS_PER_PMD - 1);
- get_page(pfn_to_page(pfn));
*pfnp = pfn;
+ *page = pfn_to_page(pfn);
+ get_page(*page);
return PMD_SIZE;
}
@@ -1201,6 +1203,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
short vma_shift;
gfn_t gfn;
kvm_pfn_t pfn;
+ struct page *page;
bool logging_active = memslot_is_logging(memslot);
unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
unsigned long vma_pagesize, fault_granule;
@@ -1301,8 +1304,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
*/
smp_rmb();
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
- write_fault, &writable, NULL);
+ pfn = __gfn_to_pfn_noref_memslot(memslot, gfn, false, false, NULL,
+ write_fault, &writable, NULL, &page);
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
@@ -1348,7 +1351,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vma_pagesize = fault_granule;
else
vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
- hva, &pfn,
+ hva,
+ &pfn, &page,
&fault_ipa);
}
@@ -1395,8 +1399,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
out_unlock:
read_unlock(&kvm->mmu_lock);
- kvm_set_pfn_accessed(pfn);
- kvm_release_pfn_clean(pfn);
+ kvm_release_pfn_noref_clean(pfn, page);
return ret != -EAGAIN ? ret : 0;
}
--
2.40.0.348.gf938b09366-goog
From: David Stevens <[email protected]>
Introduce new gfn_to_pfn_noref functions that parallel existing
gfn_to_pfn functions. These functions can be used when the caller does
not need to maintain a reference to the returned pfn (i.e. when usage is
guarded by a mmu_notifier). The noref functions take an out parameter
that is used to return the struct page if the hva was resolved via gup.
The caller needs to drop its reference to such a returned page.
Signed-off-by: David Stevens <[email protected]>
---
include/linux/kvm_host.h | 18 ++++
virt/kvm/kvm_main.c | 209 ++++++++++++++++++++++++++++-----------
virt/kvm/kvm_mm.h | 6 +-
virt/kvm/pfncache.c | 12 ++-
4 files changed, 188 insertions(+), 57 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 90edc16d37e5..146f220cc25b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1162,8 +1162,22 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool interruptible, bool *async,
bool write_fault, bool *writable, hva_t *hva);
+kvm_pfn_t gfn_to_pfn_noref(struct kvm *kvm, gfn_t gfn, struct page **page);
+kvm_pfn_t gfn_to_pfn_noref_prot(struct kvm *kvm, gfn_t gfn,
+ bool write_fault, bool *writable,
+ struct page **page);
+kvm_pfn_t gfn_to_pfn_noref_memslot(const struct kvm_memory_slot *slot,
+ gfn_t gfn, struct page **page);
+kvm_pfn_t gfn_to_pfn_noref_memslot_atomic(const struct kvm_memory_slot *slot,
+ gfn_t gfn, struct page **page);
+kvm_pfn_t __gfn_to_pfn_noref_memslot(const struct kvm_memory_slot *slot,
+ gfn_t gfn, bool atomic, bool interruptible,
+ bool *async, bool write_fault, bool *writable,
+ hva_t *hva, struct page **page);
+
void kvm_release_pfn_clean(kvm_pfn_t pfn);
void kvm_release_pfn_dirty(kvm_pfn_t pfn);
+void kvm_release_pfn_noref_clean(kvm_pfn_t pfn, struct page *page);
void kvm_set_pfn_dirty(kvm_pfn_t pfn);
void kvm_set_pfn_accessed(kvm_pfn_t pfn);
@@ -1242,6 +1256,10 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_noref_atomic(struct kvm_vcpu *vcpu, gfn_t gfn,
+ struct page **page);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_noref(struct kvm_vcpu *vcpu, gfn_t gfn,
+ struct page **page);
int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty);
unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f40b72eb0e7b..007dd984eeea 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2484,9 +2484,9 @@ static inline int check_user_page_hwpoison(unsigned long addr)
* only part that runs if we can in atomic context.
*/
static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
- bool *writable, kvm_pfn_t *pfn)
+ bool *writable, kvm_pfn_t *pfn,
+ struct page **page)
{
- struct page *page[1];
/*
* Fast pin a writable pfn only if it is a write fault request
@@ -2497,7 +2497,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
return false;
if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
- *pfn = page_to_pfn(page[0]);
+ *pfn = page_to_pfn(*page);
if (writable)
*writable = true;
@@ -2512,10 +2512,10 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
* 1 indicates success, -errno is returned if error is detected.
*/
static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
- bool interruptible, bool *writable, kvm_pfn_t *pfn)
+ bool interruptible, bool *writable, kvm_pfn_t *pfn,
+ struct page **page)
{
unsigned int flags = FOLL_HWPOISON;
- struct page *page;
int npages;
might_sleep();
@@ -2530,7 +2530,7 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
if (interruptible)
flags |= FOLL_INTERRUPTIBLE;
- npages = get_user_pages_unlocked(addr, 1, &page, flags);
+ npages = get_user_pages_unlocked(addr, 1, page, flags);
if (npages != 1)
return npages;
@@ -2540,11 +2540,11 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
*writable = true;
- put_page(page);
- page = wpage;
+ put_page(*page);
+ *page = wpage;
}
}
- *pfn = page_to_pfn(page);
+ *pfn = page_to_pfn(*page);
return npages;
}
@@ -2559,16 +2559,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
return true;
}
-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
- struct page *page = kvm_pfn_to_refcounted_page(pfn);
-
- if (!page)
- return 1;
-
- return get_page_unless_zero(page);
-}
-
static int hva_to_pfn_remapped(struct vm_area_struct *vma,
unsigned long addr, bool write_fault,
bool *writable, kvm_pfn_t *p_pfn)
@@ -2607,26 +2597,6 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
*writable = pte_write(*ptep);
pfn = pte_pfn(*ptep);
- /*
- * Get a reference here because callers of *hva_to_pfn* and
- * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
- * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
- * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
- * simply do nothing for reserved pfns.
- *
- * Whoever called remap_pfn_range is also going to call e.g.
- * unmap_mapping_range before the underlying pages are freed,
- * causing a call to our MMU notifier.
- *
- * Certain IO or PFNMAP mappings can be backed with valid
- * struct pages, but be allocated without refcounting e.g.,
- * tail pages of non-compound higher order allocations, which
- * would then underflow the refcount when the caller does the
- * required put_page. Don't allow those pages here.
- */
- if (!kvm_try_get_pfn(pfn))
- r = -EFAULT;
-
out:
pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
@@ -2643,6 +2613,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* host page is not in the memory
* @write_fault: whether we should get a writable host page
* @writable: whether it allows to map a writable host page for !@write_fault
+ * @page: outparam for the refcounted page associated with the pfn, if any
*
* The function will map a writable host page for these two cases:
* 1): @write_fault = true
@@ -2650,23 +2621,25 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* whether the mapping is writable.
*/
kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
- bool *async, bool write_fault, bool *writable)
+ bool *async, bool write_fault, bool *writable,
+ struct page **page)
{
struct vm_area_struct *vma;
kvm_pfn_t pfn;
int npages, r;
+ *page = NULL;
/* we can do it either atomically or asynchronously, not both */
BUG_ON(atomic && async);
- if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
+ if (hva_to_pfn_fast(addr, write_fault, writable, &pfn, page))
return pfn;
if (atomic)
return KVM_PFN_ERR_FAULT;
npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
- writable, &pfn);
+ writable, &pfn, page);
if (npages == 1)
return pfn;
if (npages == -EINTR)
@@ -2700,9 +2673,37 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
return pfn;
}
-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool interruptible, bool *async,
- bool write_fault, bool *writable, hva_t *hva)
+/*
+ * Helper function for managing refcounts of pfn returned by hva_to_pfn.
+ * @pfn: pfn returned by hva_to_pfn
+ * @page: page outparam from hva_to_pfn
+ *
+ * In cases where access to the pfn resolved by hva_to_pfn isn't protected by
+ * our MMU notifier, if the pfn was resolved by hva_to_pfn_remapped instead of
+ * gup, then its refcount needs to be bumped.
+ *
+ * Certain IO or PFNMAP mappings can be backed with valid struct pages, but be
+ * allocated without refcounting e.g., tail pages of non-compound higher order
+ * allocations, which would then underflow the refcount when the caller does the
+ * required put_page. Don't allow those pages here.
+ */
+kvm_pfn_t kvm_try_get_refcounted_page_ref(kvm_pfn_t pfn, struct page *page)
+{
+ /* If @page is valid, KVM already has a reference to the pfn/page. */
+ if (page || is_error_pfn(pfn))
+ return pfn;
+
+ page = kvm_pfn_to_refcounted_page(pfn);
+ if (!page || get_page_unless_zero(page))
+ return pfn;
+
+ return KVM_PFN_ERR_FAULT;
+}
+
+kvm_pfn_t __gfn_to_pfn_noref_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
+ bool atomic, bool interruptible, bool *async,
+ bool write_fault, bool *writable, hva_t *hva,
+ struct page **page)
{
unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
@@ -2728,47 +2729,134 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
}
return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
- writable);
+ writable, page);
+}
+EXPORT_SYMBOL_GPL(__gfn_to_pfn_noref_memslot);
+
+kvm_pfn_t gfn_to_pfn_noref_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
+ bool *writable, struct page **page)
+{
+ return __gfn_to_pfn_noref_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
+ NULL, write_fault, writable, NULL, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_noref_prot);
+
+kvm_pfn_t gfn_to_pfn_noref_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
+ struct page **page)
+{
+ return __gfn_to_pfn_noref_memslot(slot, gfn, false, false, NULL, true,
+ NULL, NULL, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_noref_memslot);
+
+kvm_pfn_t gfn_to_pfn_noref_memslot_atomic(const struct kvm_memory_slot *slot,
+ gfn_t gfn, struct page **page)
+{
+ return __gfn_to_pfn_noref_memslot(slot, gfn, true, false, NULL, true, NULL,
+ NULL, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_noref_memslot_atomic);
+
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_noref_atomic(struct kvm_vcpu *vcpu, gfn_t gfn,
+ struct page **page)
+{
+ return gfn_to_pfn_noref_memslot_atomic(
+ kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, page);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_noref_atomic);
+
+kvm_pfn_t gfn_to_pfn_noref(struct kvm *kvm, gfn_t gfn, struct page **page)
+{
+ return gfn_to_pfn_noref_memslot(gfn_to_memslot(kvm, gfn), gfn, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_noref);
+
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_noref(struct kvm_vcpu *vcpu, gfn_t gfn,
+ struct page **page)
+{
+ return gfn_to_pfn_noref_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn),
+ gfn, page);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_noref);
+
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
+ bool atomic, bool interruptible, bool *async,
+ bool write_fault, bool *writable, hva_t *hva)
+{
+ struct page *page;
+ kvm_pfn_t pfn;
+
+ pfn = __gfn_to_pfn_noref_memslot(slot, gfn, atomic, interruptible, async,
+ write_fault, writable, hva, &page);
+
+ return kvm_try_get_refcounted_page_ref(pfn, page);
}
EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable)
{
- return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
- NULL, write_fault, writable, NULL);
+ struct page *page;
+ kvm_pfn_t pfn;
+
+ pfn = gfn_to_pfn_noref_prot(kvm, gfn, write_fault, writable, &page);
+
+ return kvm_try_get_refcounted_page_ref(pfn, page);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
- NULL, NULL);
+ struct page *page;
+ kvm_pfn_t pfn;
+
+ pfn = gfn_to_pfn_noref_memslot(slot, gfn, &page);
+
+ return kvm_try_get_refcounted_page_ref(pfn, page);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
- NULL, NULL);
+ struct page *page;
+ kvm_pfn_t pfn;
+
+ pfn = gfn_to_pfn_noref_memslot_atomic(slot, gfn, &page);
+
+ return kvm_try_get_refcounted_page_ref(pfn, page);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
{
- return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
+ struct page *page;
+ kvm_pfn_t pfn;
+
+ pfn = kvm_vcpu_gfn_to_pfn_noref_atomic(vcpu, gfn, &page);
+
+ return kvm_try_get_refcounted_page_ref(pfn, page);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
{
- return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
+ struct page *page;
+ kvm_pfn_t pfn;
+
+ pfn = gfn_to_pfn_noref(kvm, gfn, &page);
+
+ return kvm_try_get_refcounted_page_ref(pfn, page);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
- return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
+ struct page *page;
+ kvm_pfn_t pfn;
+
+ pfn = kvm_vcpu_gfn_to_pfn_noref(vcpu, gfn, &page);
+
+ return kvm_try_get_refcounted_page_ref(pfn, page);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn);
@@ -2925,6 +3013,17 @@ void kvm_release_pfn_clean(kvm_pfn_t pfn)
}
EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
+void kvm_release_pfn_noref_clean(kvm_pfn_t pfn, struct page *page)
+{
+ if (is_error_noslot_pfn(pfn))
+ return;
+
+ kvm_set_pfn_accessed(pfn);
+ if (page)
+ put_page(page);
+}
+EXPORT_SYMBOL_GPL(kvm_release_pfn_noref_clean);
+
void kvm_release_page_dirty(struct page *page)
{
WARN_ON(is_error_page(page));
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 180f1a09e6ba..a4072cc5a189 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -3,6 +3,8 @@
#ifndef __KVM_MM_H__
#define __KVM_MM_H__ 1
+#include <linux/mm_types.h>
+
/*
* Architectures can choose whether to use an rwlock or spinlock
* for the mmu_lock. These macros, for use in common code
@@ -21,7 +23,9 @@
#endif /* KVM_HAVE_MMU_RWLOCK */
kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
- bool *async, bool write_fault, bool *writable);
+ bool *async, bool write_fault, bool *writable,
+ struct page **page);
+kvm_pfn_t kvm_try_get_refcounted_page_ref(kvm_pfn_t pfn, struct page *page);
#ifdef CONFIG_HAVE_KVM_PFNCACHE
void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2d6aba677830..e25d3af969f4 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -144,6 +144,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
void *new_khva = NULL;
unsigned long mmu_seq;
+ struct page *page;
lockdep_assert_held(&gpc->refresh_lock);
@@ -183,10 +184,19 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
}
/* We always request a writeable mapping */
- new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+ new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL, &page);
if (is_error_noslot_pfn(new_pfn))
goto out_error;
+ /*
+ * Filter out pages that support refcounting but which aren't
+ * currently being refcounted. Some KVM MMUs support such pages, and
+ * although we could support them here, KVM internals more generally
+ * don't. Reject them here for consistency.
+ */
+ if (kvm_try_get_refcounted_page_ref(new_pfn, page) != new_pfn)
+ goto out_error;
+
/*
* Obtain a new kernel mapping if KVM itself will access the
* pfn. Note, kmap() and memremap() can both sleep, so this
--
2.40.0.348.gf938b09366-goog
From: David Stevens <[email protected]>
Remove two warnings that require ref counts for pages to be non-zero, as
mapped pfns from follow_pfn may not have an initialized ref count.
Signed-off-by: David Stevens <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 10 ----------
virt/kvm/kvm_main.c | 5 ++---
2 files changed, 2 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 86b74e7bccfa..46b3d6c0ff27 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -555,7 +555,6 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
kvm_pfn_t pfn;
u64 old_spte = *sptep;
int level = sptep_to_sp(sptep)->role.level;
- struct page *page;
if (!is_shadow_present_pte(old_spte) ||
!spte_has_volatile_bits(old_spte))
@@ -570,15 +569,6 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
pfn = spte_to_pfn(old_spte);
- /*
- * KVM doesn't hold a reference to any pages mapped into the guest, and
- * instead uses the mmu_notifier to ensure that KVM unmaps any pages
- * before they are reclaimed. Sanity check that, if the pfn is backed
- * by a refcounted page, the refcount is elevated.
- */
- page = kvm_pfn_to_refcounted_page(pfn);
- WARN_ON(page && !page_count(page));
-
if (is_accessed_spte(old_spte))
kvm_set_pfn_accessed(pfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 007dd984eeea..a80070cb04d7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -165,10 +165,9 @@ bool kvm_is_zone_device_page(struct page *page)
/*
* The metadata used by is_zone_device_page() to determine whether or
* not a page is ZONE_DEVICE is guaranteed to be valid if and only if
- * the device has been pinned, e.g. by get_user_pages(). WARN if the
- * page_count() is zero to help detect bad usage of this helper.
+ * the device has been pinned, e.g. by get_user_pages().
*/
- if (WARN_ON_ONCE(!page_count(page)))
+ if (!page_count(page))
return false;
return is_zone_device_page(page);
--
2.40.0.348.gf938b09366-goog
+Peter
On Thu, Mar 30, 2023, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Introduce new gfn_to_pfn_noref functions that parallel existing
> gfn_to_pfn functions. These functions can be used when the caller does
> not need to maintain a reference to the returned pfn (i.e. when usage is
> guarded by a mmu_notifier). The noref functions take an out parameter
> that is used to return the struct page if the hva was resolved via gup.
> The caller needs to drop its reference to such a returned page.
I dislike the "noref" name and the approach itself (of providing an entirely
separate set of APIs). Using "noref" is confusing because the callers do actually
get a reference to the page (if a refcounted page is found).
As for the approach, I really, really don't want to end up with yet more APIs
for getting PFNs from GFNs. We already have far too many. In the short term,
I think we'll need to carry multiple sets of APIs, as converting all architectures
to any new API will be too much for a single series. But I want to have line of
sight to converging on a single, as-small-as-possible set of APIs, and I think/hope
it should be possible to make the old APIs, e.g. gfn_to_pfn(), shims around
the new APIs.
And since this series is essentially overhauling the gfn_to_pfn APIs, I think it's
the right series to take on refactoring the APIs to clean up the growing flag
problem. There was a bit of discussion back when "interruptible" support was
added (https://lore.kernel.org/all/[email protected]), but it got punted
because it wasn't necessary, and because there wasn't immediate agreement on what
exactly the APIs should look like.
Overhauling the APIs would also let us clean up things like async #PF, specifically
replacing the unintuitive "*async = true" logic with something like this:
if ((flags & FOLL_NOWAIT) && vma_is_valid(vma, flags & FOLL_WRITE))
pfn = KVM_PFN_ERR_FAULT_MINOR;
else
pfn = KVM_PFN_ERR_FAULT;
Lastly, I think there's also an opportunity here to harden KVM's interaction with
mmu_notifiers, and to dedup arch code in KVM. Specifically, even when the proposed
"allow_unsafe_kmap" is true, KVM should either (a) be "in" an mmu_notifier sequence
or (b) _want_ to grab a reference. And when KVM does NOT want a reference, the core
API can/should immediately drop the reference even before returning.
My thought is to provide an "entirely" new API, named something like kvm_follow_pfn()
to somewhat mirror follow_{pfn,pte,phys}(). Ideally something to pair with gup()
would be nice, but having a dedicated KVM helper to get _only_ struct page memory
doesn't work well because KVM almost never wants only struct page memory.
As for the flags vs. bools debate (see link above), I think the best approach is
a mix of the two. Specifically, reuse the FOLL_* flags as-is for inputs, and use
booleans for outputs. I don't _think_ there are any input bools/flags that don't
map 1:1 with existing FOLL_* flags.
As a very, *very* rough sketch, provide APIs that look a bit like this.
kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
{
kvm_pfn_t pfn;
if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll->mmu_seq))
return KVM_PFN_ERR_FAULT;
pfn = ???;
if (foll->page && !(foll->flags & FOLL_GET))
put_page(foll->page);
return pfn;
}
kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
{
struct kvm_follow_pfn foll = {
.flags = FOLL_GET | FOLL_WRITE,
};
<more stuff here?>
foll.slot = ???;
if (!foll.slot || foll.slot->flags & KVM_MEMSLOT_INVALID)
return KVM_HVA_ERR_BAD;
if (memslot_is_readonly(foll.slot))
return KVM_HVA_ERR_RO_BAD;
return __kvm_follow_pfn(&foll);
}
and a few partially converted users
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 67e2ac799aa7..5eaf0395ed87 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -550,12 +550,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
flush = true;
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+ if (is_refcounted_page_pte(old_spte))
+ kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
}
if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
flush = true;
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ if (is_refcounted_page_pte(old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
}
return flush;
@@ -4278,6 +4280,10 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
+ struct kvm_follow_pfn foll = {
+ .mmu_seq = fault->mmu_seq,
+ .gfn = fault->gfn,
+ };
struct kvm_memory_slot *slot = fault->slot;
bool async;
@@ -4309,12 +4315,16 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}
- async = false;
- fault->pfn = __gfn_to_pfn_noref_memslot(slot, fault->gfn, false, false, &async,
- fault->write, &fault->map_writable,
- &fault->hva, &fault->page);
- if (!async)
- return RET_PF_CONTINUE; /* *pfn has correct page already */
+ foll.flags = FOLL_NOWAIT;
+ if (fault->write)
+ foll.flags |= FOLL_WRITE;
+
+ fault->pfn = __kvm_follow_pfn(&foll);
+ if (!is_error_noslot_pfn(fault->pfn))
+ goto success;
+
+ if (!is_fault_minor_pfn(fault->pfn))
+ return RET_PF_CONTINUE;
if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
trace_kvm_try_async_get_page(fault->addr, fault->gfn);
@@ -4332,9 +4342,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* to wait for IO. Note, gup always bails if it is unable to quickly
* get a page and a fatal signal, i.e. SIGKILL, is pending.
*/
- fault->pfn = __gfn_to_pfn_noref_memslot(slot, fault->gfn, false, true, NULL,
- fault->write, &fault->map_writable,
- &fault->hva, &fault->page);
+ foll.flags |= FOLL_INTERRUPTIBLE;
+ foll.flags &= ~FOLL_NOWAIT;
+
+ fault->pfn = kvm_follow_pfn(&foll);
+ if (!is_error_noslot_pfn(fault->pfn))
+ goto success;
+
+ return RET_PF_CONTINUE;
+success:
+ fault->hva = foll.hva;
+ fault->page = foll.page;
+ fault->map_writable = foll.writable;
return RET_PF_CONTINUE;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 360eaa24456f..0bae253c88dd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2663,9 +2663,10 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
if (r < 0)
pfn = KVM_PFN_ERR_FAULT;
} else {
- if (async && vma_is_valid(vma, write_fault))
- *async = true;
- pfn = KVM_PFN_ERR_FAULT;
+ if ((flags & FOLL_NOWAIT) && vma_is_valid(vma, flags & FOLL_WRITE))
+ pfn = KVM_PFN_ERR_FAULT_MINOR;
+ else
+ pfn = KVM_PFN_ERR_FAULT;
}
exit:
mmap_read_unlock(current->mm);
@@ -2732,6 +2733,30 @@ kvm_pfn_t __gfn_to_pfn_noref_memslot(const struct kvm_memory_slot *slot, gfn_t g
}
EXPORT_SYMBOL_GPL(__gfn_to_pfn_noref_memslot);
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
+{
+ kvm_pfn_t pfn;
+
+ if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll->mmu_seq))
+ return KVM_PFN_ERR_FAULT;
+
+ pfn = __gfn_to_pfn_noref_memslot(...);
+
+ if (foll->page && !(foll->flags & FOLL_GET))
+ put_page(foll->page);
+
+ return pfn;
+}
+
+kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
+{
+ struct kvm_follow_pfn foll = {
+ .flags = FOLL_GET | FOLL_WRITE,
+ };
+
+ return __kvm_follow_pfn(&foll);
+}
+
kvm_pfn_t gfn_to_pfn_noref_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable, struct page **page)
{
@@ -2910,25 +2935,23 @@ void kvm_release_pfn(kvm_pfn_t pfn, bool dirty)
int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
{
+ struct page *page;
kvm_pfn_t pfn;
void *hva = NULL;
- struct page *page = KVM_UNMAPPED_PAGE;
if (!map)
return -EINVAL;
- pfn = gfn_to_pfn(vcpu->kvm, gfn);
+ pfn = kvm_follow_pfn(vcpu->kvm, gfn, &page);
if (is_error_noslot_pfn(pfn))
return -EINVAL;
- if (pfn_valid(pfn)) {
- page = pfn_to_page(pfn);
+ if (page)
hva = kmap(page);
#ifdef CONFIG_HAS_IOMEM
- } else {
+ else if (allow_unsafe_kmap)
hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
#endif
- }
if (!hva)
return -EFAULT;
On Thu, Mar 30, 2023, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Remove two warnings that require ref counts for pages to be non-zero, as
> mapped pfns from follow_pfn may not have an initialized ref count.
This patch needs to be moved earlier, e.g. if just this patch is reverted, these
WARNs will fire on a guest with non-refcounted memory.
The shortlog and changelog also need to be reworded. The shortlog in particular
is misleading, as the WARNs aren't overly aggressive _in the current code base_,
but rather are invalidated by KVM allowing non-refcounted struct page memory to
be mapped into the guest.
Lastly, as I mentioned in previous versions, I would like to keep the sanity
check if possible. But this time, I have a concrete idea :-)
When installing a SPTE that points at a refcounted page, set a flag stating as
much. Then use the flag to assert that the page has an elevated refcount whenever
KVM is operating on the page. It'll require some additional plumbing changes,
e.g. to tell make_spte() that the pfn is refcounted, but the actual code should be
straightforward.
Actually, we should make that a requirement to allow an arch to get non-refcounted
struct page memory: the arch must be able to keep track which pages are/aren't
refcounted. That'll disallow your GPU use case with 32-bit x86 host kernels (we're
out of software bits in PAE SPTEs), but I can't imagine anyone cares. Then I
believe we can make that support mutually exclusive with kvm_pfn_to_refcounted_page(),
because all of the kvm_follow_pfn() users will know (and remember) that the pfn
is backed by a refcounted page.
On Mon, May 22, 2023 at 01:46:41PM -0700, Sean Christopherson wrote:
> +Peter
>
> On Thu, Mar 30, 2023, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Introduce new gfn_to_pfn_noref functions that parallel existing
> > gfn_to_pfn functions. These functions can be used when the caller does
> > not need to maintain a reference to the returned pfn (i.e. when usage is
> > guarded by a mmu_notifier). The noref functions take an out parameter
> > that is used to return the struct page if the hva was resolved via gup.
> The caller needs to drop its reference to such a returned page.
>
> I dislike the "noref" name and the approach itself (of providing an entirely
> separate set of APIs). Using "noref" is confusing because the callers do actually
> get a reference to the page (if a refcounted page is found).
>
> As for the approach, I really, really don't want to end up with yet more APIs
> for getting PFNs from GFNs. We already have far too many. In the short term,
> I think we'll need to carry multiple sets of APIs, as converting all architectures
> to any new API will be too much for a single series. But I want to have line of
> sight to converging on a single, as-small-as-possible set of APIs, and I think/hope
> it should be possible to make the old APIs, e.g. gfn_to_pfn(), to be shims around
> the new APIs.
>
> And since this series is essentially overhauling the gfn_to_pfn APIs, I think it's
> the right series to take on refactoring the APIs to clean up the growing flag
> problem. There was a bit of discussion back when "interruptible" support was
> added (https://lore.kernel.org/all/[email protected]), but it got punted
> because it wasn't necessary, and because there wasn't immediate agreement on what
> exactly the APIs should look like.
>
> Overhauling the APIs would also let us clean up things like async #PF, specifically
> replacing the unintuitive "*async = true" logic with something like this:
>
> if ((flags & FOLL_NOWAIT) && vma_is_valid(vma, flags & FOLL_WRITE))
> pfn = KVM_PFN_ERR_FAULT_MINOR;
> else
> pfn = KVM_PFN_ERR_FAULT;
>
> Lastly, I think there's also an opportunity here to harden KVM's interaction with
> mmu_notifiers, and to dedup arch code in KVM . Specifically, even when the proposed
> "allow_unsafe_kmap" is true, KVM should either (a) be "in" an mmu_notifier sequence
> or (b) _want_ to grab a reference. And when KVM does NOT want a reference, the core
> API can/should immediately drop the reference even before returning.
>
> My thought is it provide an "entirely" new API, named something like kvm_follow_pfn()
> to somewhat mirror follow_{pfn,pte,phys}(). Ideally something to pair with gup()
> would be nice, but having a dedicated KVM helper to get _only_ struct page memory
> doesn't work well because KVM almost never wants only struct page memory.
>
> As for the flags vs. bools debate (see link above), I think the best approach is
> a mix of the two. Specifically, reuse the FOLL_* flags as-is for inputs, and use
> booleans for outputs. I don't _think_ there are any input bools/flags that don't
> map 1:1 with existing FOLL_* flags.
>
> As a very, *very* rough sketch, provide APIs that look a bit like this.
Unifying ref vs nonref cases does look a bit cleaner to me too.
>
> kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> {
> kvm_pfn_t pfn;
>
> if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll.mmu_seq))
IMHO we may not want to rely on mmu_seq==0 either, whether because of an unlucky
initial mmu_seq that happens to be zero, or to avoid overflows?
I'd say we can stick with FOLL_GET in this case to identify ref vs nonref
and always treat mmu_seq as a pure random number.
> return KVM_PFN_ERR_FAULT;
>
> pfn = ???;
>
> if (foll->page && !(foll->flags & FOLL_GET))
> put_page(foll->page);
>
> return pfn;
> }
>
> kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
> {
> struct kvm_follow_pfn foll = {
> .flags = FOLL_GET | FOLL_WRITE,
> };
>
> <more stuff here?>
>
> foll.slot = ???;
> if (!foll.slot || foll.slot->flags & KVM_MEMSLOT_INVALID)
> return KVM_HVA_ERR_BAD;
>
> if (memslot_is_readonly(foll.slot))
> return KVM_HVA_ERR_RO_BAD;
>
> return __kvm_follow_pfn(&foll);
> }
>
> and a few partially converted users
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 67e2ac799aa7..5eaf0395ed87 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -550,12 +550,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>
> if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> flush = true;
> - kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> + if (is_refcounted_page_pte(old_spte))
One question is how to implement is_refcounted_page_pte() here to identify
non-refcountable pages.
IIUC those pages are mostly identical to a normal page (so !PG_reserved),
but they always have page_ref_count(page)==0, am I right? I got that roughly
from reading f8be156be1 only though, so I could miss a lot of things...
When thinking about that, I'm also wondering whether we can trivially allow
kvm to support such mapping (without overhaul of the kvm pfn API) by
something like this:
===8<===
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 51e4882d0873..467acbac1a96 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -192,7 +192,13 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
page = pfn_to_page(pfn);
if (!PageReserved(page))
- return page;
+ /*
+ * When page_ref_count(page)==0 it might be a special page
+ * that does not support refcounting. Treat such pages the
+ * same as normal reserved (e.g. MMIO) pages by returning
+ * NULL, so they're exempt from refcounting.
+ */
+ return page_ref_count(page) == 0 ? NULL : page;
/* The ZERO_PAGE(s) is marked PG_reserved, but is refcounted. */
if (is_zero_pfn(pfn))
===8<===
So that we treat those special pages the same as normal PFNMAP ones by
skipping all refcountings on inc/dec. This is based on the fact that kvm
should always hold at least 1 ref on a normal page so a normal page should
never hit ref==0 here, but again I could miss something somewhere..
> + kvm_set_page_accessed(pfn_to_page(spte_to_pfn));
> }
>
> if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
> flush = true;
> - kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> + if (is_refcounted_page_pte(old_spte))
> + kvm_set_page_dirty(pfn_to_page(spte_to_pfn));
> }
>
> return flush;
> @@ -4278,6 +4280,10 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>
> static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> + struct kvm_follow_pfn foll = {
> + .mmu_seq = fault->mmu_seq,
> + .gfn = fault->gfn,
> + };
> struct kvm_memory_slot *slot = fault->slot;
> bool async;
>
> @@ -4309,12 +4315,16 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return RET_PF_EMULATE;
> }
>
> - async = false;
> - fault->pfn = __gfn_to_pfn_noref_memslot(slot, fault->gfn, false, false, &async,
> - fault->write, &fault->map_writable,
> - &fault->hva, &fault->page);
> - if (!async)
> - return RET_PF_CONTINUE; /* *pfn has correct page already */
> + foll.flags = FOLL_NOWAIT;
> + if (fault->write)
> + foll.flags |= FOLL_WRITE;
> +
> + fault->pfn = __kvm_follow_pfn(&foll);
> + if (!is_error_noslot_pfn(fault->pfn))
> + goto success;
> +
> + if (!is_fault_minor_pfn(fault->pfn))
> + return RET_PF_CONTINUE;
>
> if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
> trace_kvm_try_async_get_page(fault->addr, fault->gfn);
> @@ -4332,9 +4342,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> * to wait for IO. Note, gup always bails if it is unable to quickly
> * get a page and a fatal signal, i.e. SIGKILL, is pending.
> */
> - fault->pfn = __gfn_to_pfn_noref_memslot(slot, fault->gfn, false, true, NULL,
> - fault->write, &fault->map_writable,
> - &fault->hva, &fault->page);
> + foll.flags |= FOLL_INTERRUPTIBLE;
> + foll.flags &= ~FOLL_NOWAIT;
> +
> + fault->pfn = kvm_follow_pfn(&foll);
> + if (!is_error_noslot_pfn(fault->pfn))
> + goto success;
> +
> + return RET_PF_CONTINUE;
> +success:
> + fault->hva = foll.hva;
> + fault->page = foll.page;
> + fault->map_writable = foll.writable;
> return RET_PF_CONTINUE;
> }
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 360eaa24456f..0bae253c88dd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2663,9 +2663,10 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> if (r < 0)
> pfn = KVM_PFN_ERR_FAULT;
> } else {
> - if (async && vma_is_valid(vma, write_fault))
> - *async = true;
> - pfn = KVM_PFN_ERR_FAULT;
> + if ((flags & FOLL_NOWAIT) && vma_is_valid(vma, flags & FOLL_WRITE))
> + pfn = KVM_PFN_ERR_FAULT_MINOR;
> + else
> + pfn = KVM_PFN_ERR_FAULT;
> }
> exit:
> mmap_read_unlock(current->mm);
> @@ -2732,6 +2733,30 @@ kvm_pfn_t __gfn_to_pfn_noref_memslot(const struct kvm_memory_slot *slot, gfn_t g
> }
> EXPORT_SYMBOL_GPL(__gfn_to_pfn_noref_memslot);
>
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> +{
> + kvm_pfn_t pfn;
> +
> + if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll.mmu_seq))
> + return KVM_PFN_ERR_FAULT;
> +
> + pfn = __gfn_to_pfn_noref_memslot(...);
> +
> + if (foll->page && !(foll->flags & FOLL_GET))
> + put_page(foll->page);
> +
> + return pfn;
> +}
> +
> +kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
> +{
> + struct kvm_follow_pfn foll = {
> + .flags = FOLL_GET | FOLL_WRITE,
> + };
> +
> + return __kvm_follow_pfn(&foll);
> +}
> +
> kvm_pfn_t gfn_to_pfn_noref_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable, struct page **page)
> {
> @@ -2910,25 +2935,23 @@ void kvm_release_pfn(kvm_pfn_t pfn, bool dirty)
>
> int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
> {
> + struct page *page;
> kvm_pfn_t pfn;
> void *hva = NULL;
> - struct page *page = KVM_UNMAPPED_PAGE;
>
> if (!map)
> return -EINVAL;
>
> - pfn = gfn_to_pfn(vcpu->kvm, gfn);
> + pfn = kvm_follow_pfn(vcpu->kvm, gfn, &page)
> if (is_error_noslot_pfn(pfn))
> return -EINVAL;
>
> - if (pfn_valid(pfn)) {
> - page = pfn_to_page(pfn);
> + if (page)
> hva = kmap(page);
> #ifdef CONFIG_HAS_IOMEM
> - } else {
> + else if (allow_unsafe_kmap)
> hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
> #endif
> - }
>
> if (!hva)
> return -EFAULT;
>
--
Peter Xu
On Wed, May 24, 2023, Peter Xu wrote:
> On Mon, May 22, 2023 at 01:46:41PM -0700, Sean Christopherson wrote:
> > As for the flags vs. bools debate (see link above), I think the best approach is
> > a mix of the two. Specifically, reuse the FOLL_* flags as-is for inputs, and use
> > booleans for outputs. I don't _think_ there are any input bools/flags that don't
> > map 1:1 with existing FOLL_* flags.
> >
> > As a very, *very* rough sketch, provide APIs that look a bit like this.
>
> Unifying ref vs nonref cases does look a bit cleaner to me too.
>
> >
> > kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > {
> > kvm_pfn_t pfn;
> >
> > if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll.mmu_seq))
>
> IMHO we may not want to rely on mmu_seq==0 either for unlucky very initial
> mmu_seq being zero, or avoid overflows?
I was thinking we could initialize mmu_seq to '1' and make it a u64 to avoid
overflow.
> I'd say we can stick with FOLL_GET in this case to identify ref vs nonref
> and always assume mmu_seq a pure random number.
The intent of checking mmu_seq is to flag cases where the caller doesn't specify
FOLL_GET and isn't protected by mmu_invalidate_seq, i.e. isn't tapped into the
mmu_notifiers. I.e. this is a sanity check, not functionally necessary.
>
> > return KVM_PFN_ERR_FAULT;
> >
> > pfn = ???;
> >
> > if (foll->page && !(foll->flags & FOLL_GET))
> > put_page(foll->page);
> >
> > return pfn;
> > }
> >
> > kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
> > {
> > struct kvm_follow_pfn foll = {
> > .flags = FOLL_GET | FOLL_WRITE,
> > };
> >
> > <more stuff here?>
> >
> > foll.slot = ???;
> > if (!foll.slot || foll.slot->flags & KVM_MEMSLOT_INVALID)
> > return KVM_HVA_ERR_BAD;
> >
> > if (memslot_is_readonly(foll.slot))
> > return KVM_HVA_ERR_RO_BAD;
> >
> > return __kvm_follow_pfn(&foll);
> > }
> >
> > and a few partially converted users
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 67e2ac799aa7..5eaf0395ed87 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -550,12 +550,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> >
> > if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> > flush = true;
> > - kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > + if (is_refcounted_page_pte(old_spte))
>
> One question is how to impl is_refcounted_page_pte() here to identify
> non-refcountable pages.
KVM would use a software available bit in its PTEs to explicitly track which SPTEs
point at refcounted pages. E.g. I think bit 59 is available for EPT and 64-bit
paging. PAE paging doesn't have high available bits, which is why I called out
that this would have to be 64-bit only.
> IIUC those pages are mostly identical to a normal page (so !PG_reserved)
> but it has page_ref_count(page)==0 always, am I right? I got that roughly
> from reading f8be156be1 only though, so I could miss a lot of things..
>
> When thinking about that, I'm also wondering whether we can trivially allow
> kvm to support such mapping (without overhaul of the kvm pfn API) by
> something like this:
>
> ===8<===
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 51e4882d0873..467acbac1a96 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -192,7 +192,13 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
>
> page = pfn_to_page(pfn);
> if (!PageReserved(page))
> - return page;
> + /*
> + * When page_ref_count(page)==0 it might be speical page
> + * that do not support refcounting. Treating them the same
> + * as normal reserved (e.g. MMIO) pages by returning NULL,
> + * so they're exempt of refcounting.
> + */
> + return page_ref_count(page) == 0 ? NULL : page;
Heh, because I got burned by this recently, using page_ref_count() is wrong. This
needs to be page_count() so that tail pages of refcounted compound pages are
properly identified.
>
> /* The ZERO_PAGE(s) is marked PG_reserved, but is refcounted. */
> if (is_zero_pfn(pfn))
> ===8<===
>
> So that we treat those special pages the same as normal PFNMAP ones by
> skipping all refcountings on inc/dec. This is based on the fact that kvm
> should always hold at least 1 ref on a normal page so a normal page should
> never hit ref==0 here, but again I could miss something somewhere..
This would "work" from a functionality perspective, and might be acceptable as an
out-of-tree patch to unblock the ChromeOS use case, but I don't want to rely on
this heuristic on the backend in KVM because it will suppress any and all
use-after-free bugs in KVM's MMU (see patch 4 of this series). I really want to
go in the opposite direction and harden KVM against MMU bugs, e.g. I'm planning
on posting the below (which is how I learned about page_count() vs. page_ref_count()).
Today, KVM gets partial protection from check_new_page_bad(), which detects *some*
cases where KVM marks a page dirty after the page is freed. But it's racy, and
the detection occurs well after the fact since it fires only when the page is
re-allocated.
If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
because KVM would drop its assertions and also skip dirtying pages, i.e. would
effectively suppress the latent detection by check_new_page_bad().
Author: Sean Christopherson <[email protected]>
Date: Wed May 17 13:26:54 2023 -0700
KVM: Assert that a page's refcount is elevated when marking accessed/dirty
Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.
Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free. Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occurred, but yelling
does help detect, triage, and debug use-after-free bugs.
Note, the assertion must use page_count(), NOT page_ref_count()! For
hugepages, the returned struct page may be a tailpage and thus not have
its own refcount.
Signed-off-by: Sean Christopherson <[email protected]>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1abb331ea68..64f18697096c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2882,6 +2882,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
static bool kvm_is_ad_tracked_page(struct page *page)
{
+ /*
+ * Assert that KVM isn't attempting to mark a freed page as Accessed or
+ * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
+ * (typically) doesn't pin pages that are mapped in KVM's MMU, and
+ * instead relies on mmu_notifiers to know when a mapping needs to be
+ * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
+ * KVM returns from its mmu_notifier, i.e. the page should have an
+ * elevated refcount at this point even though KVM doesn't hold a
+ * reference of its own.
+ */
+ if (WARN_ON_ONCE(!page_count(page)))
+ return false;
+
/*
* Per page-flags.h, pages tagged PG_reserved "should in general not be
* touched (e.g. set dirty) except by its owner".
On Wed, May 24, 2023 at 09:46:13AM -0700, Sean Christopherson wrote:
> On Wed, May 24, 2023, Peter Xu wrote:
> > On Mon, May 22, 2023 at 01:46:41PM -0700, Sean Christopherson wrote:
> > > As for the flags vs. bools debate (see link above), I think the best approach is
> > > a mix of the two. Specifically, reuse the FOLL_* flags as-is for inputs, and use
> > > booleans for outputs. I don't _think_ there are any input bools/flags that don't
> > > map 1:1 with existing FOLL_* flags.
> > >
> > > As a very, *very* rough sketch, provide APIs that look a bit like this.
> >
> > Unifying ref vs nonref cases does look a bit cleaner to me too.
> >
> > >
> > > kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > > {
> > > kvm_pfn_t pfn;
> > >
> > > if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll.mmu_seq))
> >
> > IMHO we may not want to rely on mmu_seq==0 either for unlucky very initial
> > mmu_seq being zero, or avoid overflows?
>
> I was thinking we could initialize mmu_seq to '1' and make it a u64 to avoid
> overflow.
Yeah, that's fine too.
>
> > I'd say we can stick with FOLL_GET in this case to identify ref vs nonref
> > and always assume mmu_seq a pure random number.
>
> The intent of checking mmu_seq is to flag cases where the caller doesn't specify
> FOLL_GET and isn't protected by mmu_invalidate_seq, i.e. isn't tapped into the
> mmu_notifiers. I.e. this is a sanity check, not functionally necessary.
>
> >
> > > return KVM_PFN_ERR_FAULT;
> > >
> > > pfn = ???;
> > >
> > > if (foll->page && !(foll->flags & FOLL_GET))
> > > put_page(foll->page);
> > >
> > > return pfn;
> > > }
> > >
> > > kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
> > > {
> > > struct kvm_follow_pfn foll = {
> > > .flags = FOLL_GET | FOLL_WRITE,
> > > };
> > >
> > > <more stuff here?>
> > >
> > > foll.slot = ???;
> > > if (!foll.slot || foll.slot->flags & KVM_MEMSLOT_INVALID)
> > > return KVM_HVA_ERR_BAD;
> > >
> > > if (memslot_is_readonly(foll.slot))
> > > return KVM_HVA_ERR_RO_BAD;
> > >
> > > return __kvm_follow_pfn(&foll);
> > > }
> > >
> > > and a few partially converted users
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 67e2ac799aa7..5eaf0395ed87 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -550,12 +550,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> > >
> > > if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> > > flush = true;
> > > - kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > > + if (is_refcounted_page_pte(old_spte))
> >
> > One question is how to impl is_refcounted_page_pte() here to identify
> > non-refcountable pages.
>
> KVM would use a software available bit in its PTEs to explicitly track which SPTEs
> point at refcounted pages. E.g. I think bit 59 is available for EPT and 64-bit
> paging. PAE paging doesn't have high available bits, which is why I called out
> that this would have to be 64-bit only.
>
> > IIUC those pages are mostly identical to a normal page (so !PG_reserved)
> > but it has page_ref_count(page)==0 always, am I right? I got that roughly
> > from reading f8be156be1 only though, so I could miss a lot of things..
> >
> > When thinking about that, I'm also wondering whether we can trivially allow
> > kvm to support such mapping (without overhaul of the kvm pfn API) by
> > something like this:
> >
> > ===8<===
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 51e4882d0873..467acbac1a96 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -192,7 +192,13 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
> >
> > page = pfn_to_page(pfn);
> > if (!PageReserved(page))
> > - return page;
> > + /*
> > + * When page_ref_count(page)==0 it might be speical page
> > + * that do not support refcounting. Treating them the same
> > + * as normal reserved (e.g. MMIO) pages by returning NULL,
> > + * so they're exempt of refcounting.
> > + */
> > + return page_ref_count(page) == 0 ? NULL : page;
>
> Heh, because I got burned by this recently, using page_ref_count() is wrong. This
> needs to be page_count() so that tail pages of refcounted compound pages are
> properly identified.
:-D
Actually when I was replying I explicitly didn't use page_count() to make
sure we're reading the tail page, but I just noticed that's exactly how we
identify the special page with a PageCompound()==true tail page.
Yeah, if we'd like that it needs to be page_count()==0.
>
> >
> > /* The ZERO_PAGE(s) is marked PG_reserved, but is refcounted. */
> > if (is_zero_pfn(pfn))
> > ===8<===
> >
> > So that we treat those special pages the same as normal PFNMAP ones by
> > skipping all refcountings on inc/dec. This is based on the fact that kvm
> > should always hold at least 1 ref on a normal page so a normal page should
> > never hit ref==0 here, but again I could miss something somewhere..
>
> This would "work" from a functionality perspective, and might be acceptable as an
> out-of-tree patch to unblock the ChromeOS use case, but I don't want to rely on
> this heuristic on the backend in KVM because it will suppress any and all
> use-after-free bugs in KVM's MMU (see patch 4 of this series). I really want to
> go in the opposite direction and harden KVM against MMU bugs, e.g. I'm planning
> on posting the below (which is how I learned about page_count() vs. page_ref_count()).
>
> Today, KVM gets partial protection from check_new_page_bad(), which detects *some*
> cases where KVM marks a page dirty after the page is freed. But it's racy, and
> the detection occurs well after the fact since it fires only when the page is
> re-allocated.
>
> If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
> because KVM would drop its assertions and also skip dirtying pages, i.e. would
> effectively suppress the latent detection by check_new_page_bad().
So it's probably that I totally have no idea what the attributes of those
special pages are, so I don't understand enough about why we need to handle
those pages differently from e.g. PFNMAP pages, and also the benefits.
I think what I can tell is that they're pages that don't have the
PageCompound bit set on either head or tails, yet they still form a
higher-order allocation. Is there an example of how these pages are used
and allocated? Why would we need those pages, and do these pages need
to be set dirty/accessed after all?
>
> Author: Sean Christopherson <[email protected]>
> Date: Wed May 17 13:26:54 2023 -0700
>
> KVM: Assert that a page's refcount is elevated when marking accessed/dirty
>
> Assert that a page's refcount is elevated, i.e. that _something_ holds a
> reference to the page, when KVM marks a page as accessed and/or dirty.
> KVM typically doesn't hold a reference to pages that are mapped into the
> guest, e.g. to allow page migration, compaction, swap, etc., and instead
> relies on mmu_notifiers to react to changes in the primary MMU.
>
> Incorrect handling of mmu_notifier events (or similar mechanisms) can
> result in KVM keeping a mapping beyond the lifetime of the backing page,
> i.e. can (and often does) result in use-after-free. Yelling if KVM marks
> a freed page as accessed/dirty doesn't prevent badness as KVM usually
> only does A/D updates when unmapping memory from the guest, i.e. the
> assertion fires well after an underlying bug has occurred, but yelling
> does help detect, triage, and debug use-after-free bugs.
>
> Note, the assertion must use page_count(), NOT page_ref_count()! For
> hugepages, the returned struct page may be a tailpage and thus not have
> its own refcount.
>
> Signed-off-by: Sean Christopherson <[email protected]>
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1abb331ea68..64f18697096c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2882,6 +2882,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
>
> static bool kvm_is_ad_tracked_page(struct page *page)
> {
> + /*
> + * Assert that KVM isn't attempting to mark a freed page as Accessed or
> + * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
> + * (typically) doesn't pin pages that are mapped in KVM's MMU, and
> + * instead relies on mmu_notifiers to know when a mapping needs to be
> + * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
> + * KVM returns from its mmu_notifier, i.e. the page should have an
> + * elevated refcount at this point even though KVM doesn't hold a
> + * reference of its own.
> + */
> + if (WARN_ON_ONCE(!page_count(page)))
> + return false;
> +
> /*
> * Per page-flags.h, pages tagged PG_reserved "should in general not be
> * touched (e.g. set dirty) except by its owner".
>
This looks like a good thing to have, indeed. But again, it doesn't seem
specific to the pages we're discussing here, say, !Compound &&
refcount==0 ones.
--
Peter Xu
On Wed, May 24, 2023, Peter Xu wrote:
> On Wed, May 24, 2023 at 09:46:13AM -0700, Sean Christopherson wrote:
> > If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
> > because KVM would drop its assertions and also skip dirtying pages, i.e. would
> > effectively suppress the latent detection by check_new_page_bad().
>
> So it's probably that I totally lack understanding of the attributes of
> those special pages, so I don't understand well enough why we need to
> handle those pages differently from e.g. PFNMAP pages, and what the
> benefits are.
>
> I think what I can tell is that they're pages that don't have the
> PageCompound bit set on either head or tails, however they're still
> higher-order large pages. Is there an example of how these pages are used
> and allocated? Why would we need those pages, and do these pages need to
> be set dirty/accessed at all?
The use case David is interested in is where an AMD GPU driver kmallocs() a
chunk of memory, lets it be mmap()'d by userspace, and userspace then maps it
into the guest for a virtual (passthrough?) GPU. For all intents and purposes,
it's normal memory, just not refcounted.
> > static bool kvm_is_ad_tracked_page(struct page *page)
> > {
> > + /*
> > + * Assert that KVM isn't attempting to mark a freed page as Accessed or
> > + * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
> > + * (typically) doesn't pin pages that are mapped in KVM's MMU, and
> > + * instead relies on mmu_notifiers to know when a mapping needs to be
> > + * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
> > + * KVM returns from its mmu_notifier, i.e. the page should have an
> > + * elevated refcount at this point even though KVM doesn't hold a
> > + * reference of its own.
> > + */
> > + if (WARN_ON_ONCE(!page_count(page)))
> > + return false;
> > +
> > /*
> > * Per page-flags.h, pages tagged PG_reserved "should in general not be
> > * touched (e.g. set dirty) except by its owner".
> >
>
> This looks like a good thing to have, indeed. But again, it doesn't seem
> specific to the pages we're discussing here, say, !Compound &&
> refcount==0 ones.
The problem is that if KVM ignores refcount==0 pages, then KVM can't distinguish
between the legitimate[*] refcount==0 AMD GPU case and a buggy refcount==0
use-after-free scenario. I don't want to make that sacrifice as the legitimate
!refcounted use case is a very specific use case, whereas consuming refcounted
memory is ubiquitous (outside of maybe AWS).
[*] Consuming !refcounted pages is safe only for flows that are tied into the
mmu_notifiers. The current proposal/plan is to add an off-by-default module
param that lets userspace opt in to kmap() use of !refcounted memory, e.g.
this case and PFNMAP memory.
On Wed, May 24, 2023 at 11:29:45AM -0700, Sean Christopherson wrote:
> On Wed, May 24, 2023, Peter Xu wrote:
> > On Wed, May 24, 2023 at 09:46:13AM -0700, Sean Christopherson wrote:
> > > If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
> > > because KVM would drop its assertions and also skip dirtying pages, i.e. would
> > > effectively suppress the latent detection by check_new_page_bad().
> >
> > So it's probably that I totally lack understanding of the attributes of
> > those special pages, so I don't understand well enough why we need to
> > handle those pages differently from e.g. PFNMAP pages, and what the
> > benefits are.
> >
> > I think what I can tell is that they're pages that don't have the
> > PageCompound bit set on either head or tails, however they're still
> > higher-order large pages. Is there an example of how these pages are used
> > and allocated? Why would we need those pages, and do these pages need to
> > be set dirty/accessed at all?
>
> The use case David is interested in is where an AMD GPU driver kmallocs() a
> chunk of memory, lets it be mmap()'d by userspace, and userspace then maps it
> into the guest for a virtual (passthrough?) GPU. For all intents and purposes,
> it's normal memory, just not refcounted.
I'm not familiar enough with kmalloc, but I think kmalloc for large
chunks works the same as alloc_pages, and I thought the result should
already be a compound page. If that memory needs to be mmap()ed to a
userspace app, then I assume it would mostly go through kmalloc_large().
kmalloc -> kmalloc_large -> __kmalloc_large_node:
flags |= __GFP_COMP;
Then, when the new page is allocated and being prepared (prep_new_page):
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
I assume prep_compound_page() will make PageCompound return true for the
returned pages. So I know I'm still missing something, but I'm not sure
where, because IIRC we're talking about !PageCompound pages.
>
> > > static bool kvm_is_ad_tracked_page(struct page *page)
> > > {
> > > + /*
> > > + * Assert that KVM isn't attempting to mark a freed page as Accessed or
> > > + * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
> > > + * (typically) doesn't pin pages that are mapped in KVM's MMU, and
> > > + * instead relies on mmu_notifiers to know when a mapping needs to be
> > > + * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
> > > + * KVM returns from its mmu_notifier, i.e. the page should have an
> > > + * elevated refcount at this point even though KVM doesn't hold a
> > > + * reference of its own.
> > > + */
> > > + if (WARN_ON_ONCE(!page_count(page)))
> > > + return false;
> > > +
> > > /*
> > > * Per page-flags.h, pages tagged PG_reserved "should in general not be
> > > * touched (e.g. set dirty) except by its owner".
> > >
> >
> > This looks like a good thing to have, indeed. But again, it doesn't seem
> > specific to the pages we're discussing here, say, !Compound &&
> > refcount==0 ones.
>
> The problem is that if KVM ignores refcount==0 pages, then KVM can't distinguish
> between the legitimate[*] refcount==0 AMD GPU case and a buggy refcount==0
> use-after-free scenario. I don't want to make that sacrifice as the legitimate
> !refcounted use case is a very specific use case, whereas consuming refcounted
> memory is ubiquitous (outside of maybe AWS).
>
> [*] Consuming !refcounted pages is safe only for flows that are tied into the
> mmu_notifiers. The current proposal/plan is to add an off-by-default module
> param that lets userspace opt in to kmap() use of !refcounted memory, e.g.
> this case and PFNMAP memory.
I see.

I think you mentioned that we can use one special bit in the shadow pte to
mark such special pages. Does that mean your above patch will still cover
what you want to protect even if we use that trick? Because then
kvm_is_ad_tracked_page() would only be called when we're sure the special
bit is not set. IOW, we can already rule out these pages, and the
page_count()==0 check here can still be helpful for tracking KVM bugs?
--
Peter Xu
On Wed, May 24, 2023, Peter Xu wrote:
> On Wed, May 24, 2023 at 11:29:45AM -0700, Sean Christopherson wrote:
> > On Wed, May 24, 2023, Peter Xu wrote:
> > > On Wed, May 24, 2023 at 09:46:13AM -0700, Sean Christopherson wrote:
> > > > If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
> > > > because KVM would drop its assertions and also skip dirtying pages, i.e. would
> > > > effectively suppress the latent detection by check_new_page_bad().
> > >
> > > So it's probably that I totally lack understanding of the attributes of
> > > those special pages, so I don't understand well enough why we need to
> > > handle those pages differently from e.g. PFNMAP pages, and what the
> > > benefits are.
> > >
> > > I think what I can tell is that they're pages that don't have the
> > > PageCompound bit set on either head or tails, however they're still
> > > higher-order large pages. Is there an example of how these pages are used
> > > and allocated? Why would we need those pages, and do these pages need to
> > > be set dirty/accessed at all?
> >
> > The use case David is interested in is where an AMD GPU driver kmallocs() a
> > chunk of memory, lets it be mmap()'d by userspace, and userspace then maps it
> > into the guest for a virtual (passthrough?) GPU. For all intents and purposes,
> > it's normal memory, just not refcounted.
>
> I'm not familiar enough with kmalloc, but I think kmalloc for large
> chunks works the same as alloc_pages, and I thought the result should
> already be a compound page. If that memory needs to be mmap()ed to a
> userspace app, then I assume it would mostly go through kmalloc_large().
Sorry, by "kmalloc()" I was handwaving at all of the variations of kernel allocated
memory. From a separate thread[*], looks like the actual usage is a direct call to
alloc_pages() that deliberately doesn't set __GFP_COMP. Note, I'm pretty sure the
comment about "mapping pages directly into userspace" being illegal really means
something like "don't allow these pages to be gup()'d or mapped via standard mmap()".
IIUC, ttm_pool_alloc() fills tt->pages and then ttm_bo_vm_fault_reserved() does
vmf_insert_pfn_prot() to shove the pfn into userspace.
static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
					unsigned int order)
{
	unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
	struct ttm_pool_dma *dma;
	struct page *p;
	void *vaddr;

	/* Don't set the __GFP_COMP flag for higher order allocations.
	 * Mapping pages directly into an userspace process and calling
	 * put_page() on a TTM allocated page is illegal.
	 */
	if (order)
		gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
			__GFP_KSWAPD_RECLAIM;

	if (!pool->use_dma_alloc) {
		p = alloc_pages(gfp_flags, order);
		if (p)
			p->private = order;
		return p;
	}
[*] https://lore.kernel.org/all/[email protected]
> kmalloc -> kmalloc_large -> __kmalloc_large_node:
>
> flags |= __GFP_COMP;
>
> Then, when the new page is allocated and being prepared (prep_new_page):
>
> if (order && (gfp_flags & __GFP_COMP))
> prep_compound_page(page, order);
>
> I assume prep_compound_page() will make PageCompound return true for the
> returned pages. So I know I'm still missing something, but I'm not sure
> where, because IIRC we're talking about !PageCompound pages.
Yeah, they're !PageCompound().
> > > > static bool kvm_is_ad_tracked_page(struct page *page)
> > > > {
> > > > + /*
> > > > + * Assert that KVM isn't attempting to mark a freed page as Accessed or
> > > > + * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
> > > > + * (typically) doesn't pin pages that are mapped in KVM's MMU, and
> > > > + * instead relies on mmu_notifiers to know when a mapping needs to be
> > > > + * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
> > > > + * KVM returns from its mmu_notifier, i.e. the page should have an
> > > > + * elevated refcount at this point even though KVM doesn't hold a
> > > > + * reference of its own.
> > > > + */
> > > > + if (WARN_ON_ONCE(!page_count(page)))
> > > > + return false;
> > > > +
> > > > /*
> > > > * Per page-flags.h, pages tagged PG_reserved "should in general not be
> > > > * touched (e.g. set dirty) except by its owner".
> > > >
> > >
> > > This looks like a good thing to have, indeed. But again, it doesn't seem
> > > specific to the pages we're discussing here, say, !Compound &&
> > > refcount==0 ones.
> >
> > The problem is that if KVM ignores refcount==0 pages, then KVM can't distinguish
> > between the legitimate[*] refcount==0 AMD GPU case and a buggy refcount==0
> > use-after-free scenario. I don't want to make that sacrifice as the legitimate
> > !refcounted use case is a very specific use case, whereas consuming refcounted
> > memory is ubiquitous (outside of maybe AWS).
> >
> > [*] Consuming !refcounted pages is safe only for flows that are tied into the
> > mmu_notifiers. The current proposal/plan is to add an off-by-default module
> > param that lets userspace opt in to kmap() use of !refcounted memory, e.g.
> > this case and PFNMAP memory.
>
> I see.
>
> I think you mentioned that we can use one special bit in the shadow pte to
> mark such special pages. Does that mean your above patch will still cover
> what you want to protect even if we use that trick? Because then
> kvm_is_ad_tracked_page() would only be called when we're sure the special
> bit is not set. IOW, we can already rule out these pages, and the
> page_count()==0 check here can still be helpful for tracking KVM bugs?
Yep, exactly. FWIW, I was thinking that the SPTE bit would flag refcounted pages,
not these "special" pages, but either way would work. All that matters is that
KVM tracks whether or not the page was refcounted when KVM installed the SPTE.