2023-07-04 08:07:25

by David Stevens

Subject: [PATCH v7 0/8] KVM: allow mapping non-refcounted pages

From: David Stevens <[email protected]>

This patch series adds support for mapping VM_IO and VM_PFNMAP memory
that is backed by struct pages that aren't currently being refcounted
(e.g. tail pages of non-compound higher order allocations) into the
guest.

Our use case is virtio-gpu blob resources [1], which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device.
This feature currently does not work on systems using the amdgpu driver,
as that driver allocates non-compound higher order pages via
ttm_pool_alloc_page.

First, this series replaces the __gfn_to_pfn_memslot API with a more
extensible __kvm_follow_pfn API. The updated API packs
__gfn_to_pfn_memslot's arguments into a struct and, where possible,
folds the boolean arguments into a single FOLL_ flags argument. The
refactoring does not change any behavior, except as noted in the PPC
change.

When introduced in the refactoring, __kvm_follow_pfn implies FOLL_GET
to preserve existing behavior. From there, the API is extended to
support mapping non-refcounted pages by only taking a page reference
when FOLL_GET is set.
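
As a rough sketch of what this means for a call site (condensed from
the conversions in the individual patches, not a complete example):

    /* Old API: positional bools and out-parameters. */
    pfn = __gfn_to_pfn_memslot(slot, gfn, false, false, NULL,
                               write_fault, &writable, &hva);

    /* New API: inputs in a struct, bools folded into FOLL_ flags. */
    struct kvm_follow_pfn foll = {
        .slot = slot,
        .gfn = gfn,
        .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
        .allow_write_mapping = true,
    };
    pfn = __kvm_follow_pfn(&foll);
    /* Outputs come back through the struct. */
    hva = foll.hva;
    writable = foll.writable;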

This series only adds support for non-refcounted pages to the x86 MMU.
Other MMUs can likely be updated without too much difficulty, but it is
not needed at this point. Updating other parts of KVM (e.g. pfncache) is
not straightforward [2].

[1]
https://patchwork.kernel.org/project/dri-devel/cover/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

v6 -> v7:
- Replace __gfn_to_pfn_memslot with a more flexible __kvm_follow_pfn,
and extend that API to support non-refcounted pages.
v5 -> v6:
- rebase on kvm next branch
- rename gfn_to_pfn_page to gfn_to_pfn_noref
- fix uninitialized outparam in error case of __kvm_faultin_pfn
- add kvm_release_pfn_noref_clean for releasing pfn/page pair
v4 -> v5:
- rebase on kvm next branch again
v3 -> v4:
- rebase on kvm next branch again
- Add some more context to a comment in ensure_pfn_ref
v2 -> v3:
- rebase on kvm next branch
v1 -> v2:
- Introduce new gfn_to_pfn_page functions instead of modifying the
behavior of existing gfn_to_pfn functions, to make the change less
invasive.
- Drop changes to mmu_audit.c
- Include Nicholas Piggin's patch to avoid corrupting refcount in the
follow_pte case, and use it in deprecated gfn_to_pfn functions.
- Rebase on kvm/next
David Stevens (7):
KVM: Introduce __kvm_follow_pfn function
KVM: Make __kvm_follow_pfn not imply FOLL_GET
KVM: x86/mmu: Migrate to __kvm_follow_pfn
KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
KVM: arm64: Migrate to __kvm_follow_pfn
KVM: PPC: Migrate to __kvm_follow_pfn
KVM: remove __gfn_to_pfn_memslot

Sean Christopherson (1):
KVM: Assert that a page's refcount is elevated when marking
accessed/dirty

arch/arm64/kvm/mmu.c | 25 +--
arch/powerpc/include/asm/kvm_book3s.h | 2 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 38 ++---
arch/powerpc/kvm/book3s_64_mmu_radix.c | 50 +++---
arch/powerpc/kvm/book3s_hv_nested.c | 4 +-
arch/x86/kvm/mmu/mmu.c | 77 ++++++---
arch/x86/kvm/mmu/mmu_internal.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 9 +-
arch/x86/kvm/mmu/spte.c | 4 +-
arch/x86/kvm/mmu/spte.h | 12 +-
arch/x86/kvm/mmu/tdp_mmu.c | 22 +--
include/linux/kvm_host.h | 26 +++
virt/kvm/kvm_main.c | 210 +++++++++++++------------
virt/kvm/kvm_mm.h | 3 +-
virt/kvm/pfncache.c | 8 +-
15 files changed, 282 insertions(+), 209 deletions(-)

--
2.41.0.255.g8b1d071c50-goog



2023-07-04 08:07:49

by David Stevens

Subject: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

From: David Stevens <[email protected]>

Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
__kvm_follow_pfn refactors the old API's arguments into a struct and,
where possible, combines the boolean arguments into a single flags
argument.
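
One behavioral mapping worth spelling out (taken from the
__gfn_to_pfn_memslot compatibility wrapper below): the old async
out-parameter becomes FOLL_NOWAIT plus a new KVM_PFN_ERR_NEEDS_IO
return value, which the wrapper translates back for existing callers:

    if (async)
        foll.flags |= FOLL_NOWAIT;
    pfn = __kvm_follow_pfn(&foll);
    if (pfn == KVM_PFN_ERR_NEEDS_IO) {
        *async = true;
        pfn = KVM_PFN_ERR_FAULT;
    }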

Signed-off-by: David Stevens <[email protected]>
---
include/linux/kvm_host.h | 16 ++++
virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
virt/kvm/kvm_mm.h | 3 +-
virt/kvm/pfncache.c | 8 +-
4 files changed, 122 insertions(+), 76 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d3ac7720da9..ef2763c2b12e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -97,6 +97,7 @@
#define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
#define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
#define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
+#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)

/*
* error pfns indicate that the gfn is in slot but faild to
@@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
void kvm_release_page_clean(struct page *page);
void kvm_release_page_dirty(struct page *page);

+struct kvm_follow_pfn {
+ const struct kvm_memory_slot *slot;
+ gfn_t gfn;
+ unsigned int flags;
+ bool atomic;
+ /* Allow a read fault to create a writeable mapping. */
+ bool allow_write_mapping;
+
+ /* Outputs of __kvm_follow_pfn */
+ hva_t hva;
+ bool writable;
+};
+
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
+
kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 371bd783ff2b..b13f22861d2f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
* true indicates success, otherwise false is returned. It's also the
* only part that runs if we can in atomic context.
*/
-static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
- bool *writable, kvm_pfn_t *pfn)
+static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
struct page *page[1];
+ bool write_fault = foll->flags & FOLL_WRITE;

/*
* Fast pin a writable pfn only if it is a write fault request
* or the caller allows to map a writable pfn for a read fault
* request.
*/
- if (!(write_fault || writable))
+ if (!(write_fault || foll->allow_write_mapping))
return false;

- if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+ if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
*pfn = page_to_pfn(page[0]);
-
- if (writable)
- *writable = true;
+ foll->writable = foll->allow_write_mapping;
return true;
}

@@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
* The slow path to get the pfn of the specified host virtual address,
* 1 indicates success, -errno is returned if error is detected.
*/
-static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
- bool interruptible, bool *writable, kvm_pfn_t *pfn)
+static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
- unsigned int flags = FOLL_HWPOISON;
+ unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
struct page *page;
int npages;

might_sleep();

- if (writable)
- *writable = write_fault;
-
- if (write_fault)
- flags |= FOLL_WRITE;
- if (async)
- flags |= FOLL_NOWAIT;
- if (interruptible)
- flags |= FOLL_INTERRUPTIBLE;
-
- npages = get_user_pages_unlocked(addr, 1, &page, flags);
+ npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
if (npages != 1)
return npages;

+ foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
+
/* map read fault as writable if possible */
- if (unlikely(!write_fault) && writable) {
+ if (unlikely(!foll->writable) && foll->allow_write_mapping) {
struct page *wpage;

- if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
- *writable = true;
+ if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
+ foll->writable = true;
put_page(page);
page = wpage;
}
@@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
return get_page_unless_zero(page);
}

-static int hva_to_pfn_remapped(struct vm_area_struct *vma,
- unsigned long addr, bool write_fault,
- bool *writable, kvm_pfn_t *p_pfn)
+static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
+ kvm_pfn_t *p_pfn)
{
kvm_pfn_t pfn;
pte_t *ptep;
spinlock_t *ptl;
+ bool write_fault = foll->flags & FOLL_WRITE;
int r;

- r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+ r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
if (r) {
/*
* get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
* not call the fault handler, so do it here.
*/
bool unlocked = false;
- r = fixup_user_fault(current->mm, addr,
+ r = fixup_user_fault(current->mm, foll->hva,
(write_fault ? FAULT_FLAG_WRITE : 0),
&unlocked);
if (unlocked)
@@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
if (r)
return r;

- r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+ r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
if (r)
return r;
}
@@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
goto out;
}

- if (writable)
- *writable = pte_write(*ptep);
+ foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
pfn = pte_pfn(*ptep);

/*
@@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* 2): @write_fault = false && @writable, @writable will tell the caller
* whether the mapping is writable.
*/
-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
- bool *async, bool write_fault, bool *writable)
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
{
struct vm_area_struct *vma;
kvm_pfn_t pfn;
int npages, r;

/* we can do it either atomically or asynchronously, not both */
- BUG_ON(atomic && async);
+ BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));

- if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
+ if (hva_to_pfn_fast(foll, &pfn))
return pfn;

- if (atomic)
+ if (foll->atomic)
return KVM_PFN_ERR_FAULT;

- npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
- writable, &pfn);
+ npages = hva_to_pfn_slow(foll, &pfn);
if (npages == 1)
return pfn;
if (npages == -EINTR)
@@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,

mmap_read_lock(current->mm);
if (npages == -EHWPOISON ||
- (!async && check_user_page_hwpoison(addr))) {
+ (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
pfn = KVM_PFN_ERR_HWPOISON;
goto exit;
}

retry:
- vma = vma_lookup(current->mm, addr);
+ vma = vma_lookup(current->mm, foll->hva);

if (vma == NULL)
pfn = KVM_PFN_ERR_FAULT;
else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
- r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
+ r = hva_to_pfn_remapped(vma, foll, &pfn);
if (r == -EAGAIN)
goto retry;
if (r < 0)
pfn = KVM_PFN_ERR_FAULT;
} else {
- if (async && vma_is_valid(vma, write_fault))
- *async = true;
- pfn = KVM_PFN_ERR_FAULT;
+ if ((foll->flags & FOLL_NOWAIT) &&
+ vma_is_valid(vma, foll->flags & FOLL_WRITE))
+ pfn = KVM_PFN_ERR_NEEDS_IO;
+ else
+ pfn = KVM_PFN_ERR_FAULT;
}
exit:
mmap_read_unlock(current->mm);
return pfn;
}

-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool interruptible, bool *async,
- bool write_fault, bool *writable, hva_t *hva)
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
{
- unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
-
- if (hva)
- *hva = addr;
+ foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
+ foll->flags & FOLL_WRITE);

- if (addr == KVM_HVA_ERR_RO_BAD) {
- if (writable)
- *writable = false;
+ if (foll->hva == KVM_HVA_ERR_RO_BAD)
return KVM_PFN_ERR_RO_FAULT;
- }

- if (kvm_is_error_hva(addr)) {
- if (writable)
- *writable = false;
+ if (kvm_is_error_hva(foll->hva))
return KVM_PFN_NOSLOT;
- }

- /* Do not map writable pfn in the readonly memslot. */
- if (writable && memslot_is_readonly(slot)) {
- *writable = false;
- writable = NULL;
- }
+ if (memslot_is_readonly(foll->slot))
+ foll->allow_write_mapping = false;
+
+ return hva_to_pfn(foll);
+}
+EXPORT_SYMBOL_GPL(__kvm_follow_pfn);

- return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
- writable);
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
+ bool atomic, bool interruptible, bool *async,
+ bool write_fault, bool *writable, hva_t *hva)
+{
+ kvm_pfn_t pfn;
+ struct kvm_follow_pfn foll = {
+ .slot = slot,
+ .gfn = gfn,
+ .flags = 0,
+ .atomic = atomic,
+ .allow_write_mapping = !!writable,
+ };
+
+ if (write_fault)
+ foll.flags |= FOLL_WRITE;
+ if (async)
+ foll.flags |= FOLL_NOWAIT;
+ if (interruptible)
+ foll.flags |= FOLL_INTERRUPTIBLE;
+
+ pfn = __kvm_follow_pfn(&foll);
+ if (pfn == KVM_PFN_ERR_NEEDS_IO) {
+ *async = true;
+ pfn = KVM_PFN_ERR_FAULT;
+ }
+ if (hva)
+ *hva = foll.hva;
+ if (writable)
+ *writable = foll.writable;
+ return pfn;
}
EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);

kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable)
{
- return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
- NULL, write_fault, writable, NULL);
+ kvm_pfn_t pfn;
+ struct kvm_follow_pfn foll = {
+ .slot = gfn_to_memslot(kvm, gfn),
+ .gfn = gfn,
+ .flags = write_fault ? FOLL_WRITE : 0,
+ .allow_write_mapping = !!writable,
+ };
+ pfn = __kvm_follow_pfn(&foll);
+ if (writable)
+ *writable = foll.writable;
+ return pfn;
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);

kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
- NULL, NULL);
+ struct kvm_follow_pfn foll = {
+ .slot = slot,
+ .gfn = gfn,
+ .flags = FOLL_WRITE,
+ };
+ return __kvm_follow_pfn(&foll);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);

kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
- NULL, NULL);
+ struct kvm_follow_pfn foll = {
+ .slot = slot,
+ .gfn = gfn,
+ .flags = FOLL_WRITE,
+ .atomic = true,
+ };
+ return __kvm_follow_pfn(&foll);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);

diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 180f1a09e6ba..ed896aee5396 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -20,8 +20,7 @@
#define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock)
#endif /* KVM_HAVE_MMU_RWLOCK */

-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
- bool *async, bool write_fault, bool *writable);
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);

#ifdef CONFIG_HAVE_KVM_PFNCACHE
void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2d6aba677830..e3fefa753a51 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
void *new_khva = NULL;
unsigned long mmu_seq;
+ struct kvm_follow_pfn foll = {
+ .slot = gpc->memslot,
+ .gfn = gpa_to_gfn(gpc->gpa),
+ .flags = FOLL_WRITE,
+ .hva = gpc->uhva,
+ };

lockdep_assert_held(&gpc->refresh_lock);

@@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
}

/* We always request a writeable mapping */
- new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+ new_pfn = hva_to_pfn(&foll);
if (is_error_noslot_pfn(new_pfn))
goto out_error;

--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:08:53

by David Stevens

Subject: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

From: David Stevens <[email protected]>

Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
callers to resolve a gfn when the associated pfn has a valid struct page
that isn't being actively refcounted (e.g. tail pages of non-compound
higher order pages). For a caller to safely omit FOLL_GET, all usages of
the returned pfn must be guarded by an mmu notifier.

This also adds an is_refcounted_page output to struct kvm_follow_pfn
that is set when the returned pfn has an associated struct page with a
valid refcount. Callers that don't pass FOLL_GET should remember this
value and use it to avoid places like kvm_is_ad_tracked_page that
assume a non-zero refcount.
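
A minimal sketch of the intended calling convention for a caller that
omits FOLL_GET (hypothetical call site; the real conversion is the x86
fault path in patch 5/8):

    struct kvm_follow_pfn foll = {
        .slot = slot,
        .gfn = gfn,
        .flags = write_fault ? FOLL_WRITE : 0,  /* no FOLL_GET */
        /* All uses of the resulting pfn sit under an mmu notifier. */
        .guarded_by_mmu_notifier = true,
        .allow_write_mapping = true,
    };

    pfn = __kvm_follow_pfn(&foll);
    ...
    /* Only touch the struct page if it is actually refcounted. */
    if (foll.is_refcounted_page)
        kvm_set_page_accessed(pfn_to_page(pfn));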

Signed-off-by: David Stevens <[email protected]>
---
include/linux/kvm_host.h | 10 ++++++
virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
virt/kvm/pfncache.c | 2 +-
3 files changed, 47 insertions(+), 32 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ef2763c2b12e..a45308c7d2d9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
void kvm_release_page_clean(struct page *page);
void kvm_release_page_dirty(struct page *page);

+void kvm_set_page_accessed(struct page *page);
+void kvm_set_page_dirty(struct page *page);
+
struct kvm_follow_pfn {
const struct kvm_memory_slot *slot;
gfn_t gfn;
@@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
bool atomic;
/* Allow a read fault to create a writeable mapping. */
bool allow_write_mapping;
+ /*
+ * Usage of the returned pfn will be guarded by an mmu notifier. Must
+ * be true if FOLL_GET is not set.
+ */
+ bool guarded_by_mmu_notifier;

/* Outputs of __kvm_follow_pfn */
hva_t hva;
bool writable;
+ /* True if the returned pfn is for a page with a valid refcount. */
+ bool is_refcounted_page;
};

kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b13f22861d2f..0f7b41f220b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2502,6 +2502,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
*pfn = page_to_pfn(page[0]);
foll->writable = foll->allow_write_mapping;
+ foll->is_refcounted_page = true;
+ if (!(foll->flags & FOLL_GET))
+ put_page(page[0]);
return true;
}

@@ -2525,6 +2528,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
return npages;

foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
+ foll->is_refcounted_page = true;

/* map read fault as writable if possible */
if (unlikely(!foll->writable) && foll->allow_write_mapping) {
@@ -2537,6 +2541,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
}
}
*pfn = page_to_pfn(page);
+ if (!(foll->flags & FOLL_GET))
+ put_page(page);
return npages;
}

@@ -2551,16 +2557,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
return true;
}

-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
- struct page *page = kvm_pfn_to_refcounted_page(pfn);
-
- if (!page)
- return 1;
-
- return get_page_unless_zero(page);
-}
-
static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
kvm_pfn_t *p_pfn)
{
@@ -2568,6 +2564,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
pte_t *ptep;
spinlock_t *ptl;
bool write_fault = foll->flags & FOLL_WRITE;
+ struct page *page;
int r;

r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
@@ -2599,24 +2596,27 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
pfn = pte_pfn(*ptep);

/*
- * Get a reference here because callers of *hva_to_pfn* and
- * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
- * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
- * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
- * simply do nothing for reserved pfns.
- *
- * Whoever called remap_pfn_range is also going to call e.g.
- * unmap_mapping_range before the underlying pages are freed,
- * causing a call to our MMU notifier.
+ * Now deal with reference counting. If kvm_pfn_to_refcounted_page
+ * returns NULL, then there's no refcount to worry about.
*
- * Certain IO or PFNMAP mappings can be backed with valid
- * struct pages, but be allocated without refcounting e.g.,
- * tail pages of non-compound higher order allocations, which
- * would then underflow the refcount when the caller does the
- * required put_page. Don't allow those pages here.
+ * Otherwise, certain IO or PFNMAP mappings can be backed with valid
+ * struct pages but be allocated without refcounting e.g., tail pages of
+ * non-compound higher order allocations. If FOLL_GET is set and we
+ * increment such a refcount, then when that pfn is eventually passed to
+ * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
+ * freed. Therefore don't allow those pages here when FOLL_GET is set.
*/
- if (!kvm_try_get_pfn(pfn))
+ page = kvm_pfn_to_refcounted_page(pfn);
+ if (!page)
+ goto out;
+
+ if (get_page_unless_zero(page)) {
+ foll->is_refcounted_page = true;
+ if (!(foll->flags & FOLL_GET))
+ put_page(page);
+ } else if (foll->flags & FOLL_GET) {
r = -EFAULT;
+ }

out:
pte_unmap_unlock(ptep, ptl);
@@ -2693,6 +2693,9 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)

kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
{
+ if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll->guarded_by_mmu_notifier))
+ return KVM_PFN_ERR_FAULT;
+
foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
foll->flags & FOLL_WRITE);

@@ -2717,7 +2720,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
struct kvm_follow_pfn foll = {
.slot = slot,
.gfn = gfn,
- .flags = 0,
+ .flags = FOLL_GET,
.atomic = atomic,
.allow_write_mapping = !!writable,
};
@@ -2749,7 +2752,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
struct kvm_follow_pfn foll = {
.slot = gfn_to_memslot(kvm, gfn),
.gfn = gfn,
- .flags = write_fault ? FOLL_WRITE : 0,
+ .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
.allow_write_mapping = !!writable,
};
pfn = __kvm_follow_pfn(&foll);
@@ -2764,7 +2767,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
struct kvm_follow_pfn foll = {
.slot = slot,
.gfn = gfn,
- .flags = FOLL_WRITE,
+ .flags = FOLL_GET | FOLL_WRITE,
};
return __kvm_follow_pfn(&foll);
}
@@ -2775,7 +2778,7 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
struct kvm_follow_pfn foll = {
.slot = slot,
.gfn = gfn,
- .flags = FOLL_WRITE,
+ .flags = FOLL_GET | FOLL_WRITE,
.atomic = true,
};
return __kvm_follow_pfn(&foll);
@@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
return !PageReserved(page);
}

-static void kvm_set_page_dirty(struct page *page)
+void kvm_set_page_dirty(struct page *page)
{
if (kvm_is_ad_tracked_page(page))
SetPageDirty(page);
}
+EXPORT_SYMBOL_GPL(kvm_set_page_dirty);

-static void kvm_set_page_accessed(struct page *page)
+void kvm_set_page_accessed(struct page *page)
{
if (kvm_is_ad_tracked_page(page))
mark_page_accessed(page);
}
+EXPORT_SYMBOL_GPL(kvm_set_page_accessed);

void kvm_release_page_clean(struct page *page)
{
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index e3fefa753a51..87caafce3dd0 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -147,7 +147,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
struct kvm_follow_pfn foll = {
.slot = gpc->memslot,
.gfn = gpa_to_gfn(gpc->gpa),
- .flags = FOLL_WRITE,
+ .flags = FOLL_WRITE | FOLL_GET,
.hva = gpc->uhva,
};

--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:08:55

by David Stevens

Subject: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

From: David Stevens <[email protected]>

Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
memory into the guest that is backed by un-refcounted struct pages - for
example, higher order non-compound pages allocated by the amdgpu driver
via ttm_pool_alloc_page.

The bulk of this change is tracking the is_refcounted_page flag so that
non-refcounted pages don't trigger page_count() == 0 warnings. This is
done by storing the flag in an unused bit in the sptes.
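
In concrete terms (condensed from the spte.h and mmu.c hunks below), a
leaf SPTE records whether its backing page is refcounted in an
otherwise-unused bit, and the accessed/dirty helpers check it before
touching the struct page:

    #define SPTE_MMU_PAGE_REFCOUNTED BIT_ULL(59)

    static inline bool is_refcounted_page_pte(u64 spte)
    {
        return spte & SPTE_MMU_PAGE_REFCOUNTED;
    }

    /* e.g. when dropping a dirty SPTE: */
    if (is_dirty_spte(old_spte) && is_refcounted_page_pte(old_spte))
        kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));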

Signed-off-by: David Stevens <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
arch/x86/kvm/mmu/mmu_internal.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
arch/x86/kvm/mmu/spte.c | 4 ++-
arch/x86/kvm/mmu/spte.h | 12 ++++++++-
arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
6 files changed, 62 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e44ab512c3a1..b1607e314497 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)

if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
flush = true;
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+ if (is_refcounted_page_pte(old_spte))
+ kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
}

if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
flush = true;
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ if (is_refcounted_page_pte(old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
}

return flush;
@@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
* before they are reclaimed. Sanity check that, if the pfn is backed
* by a refcounted page, the refcount is elevated.
*/
- page = kvm_pfn_to_refcounted_page(pfn);
- WARN_ON(page && !page_count(page));
+ if (is_refcounted_page_pte(old_spte)) {
+ page = kvm_pfn_to_refcounted_page(pfn);
+ WARN_ON(!page || !page_count(page));
+ }

- if (is_accessed_spte(old_spte))
- kvm_set_pfn_accessed(pfn);
+ if (is_refcounted_page_pte(old_spte)) {
+ if (is_accessed_spte(old_spte))
+ kvm_set_page_accessed(pfn_to_page(pfn));

- if (is_dirty_spte(old_spte))
- kvm_set_pfn_dirty(pfn);
+ if (is_dirty_spte(old_spte))
+ kvm_set_page_dirty(pfn_to_page(pfn));
+ }

return old_spte;
}
@@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep)
* Capture the dirty status of the page, so that it doesn't get
* lost when the SPTE is marked for access tracking.
*/
- if (is_writable_pte(spte))
- kvm_set_pfn_dirty(spte_to_pfn(spte));
+ if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));

spte = mark_spte_for_access_track(spte);
mmu_spte_update_no_track(sptep, spte);
@@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
{
bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
(unsigned long *)sptep);
- if (was_writable && !spte_ad_enabled(*sptep))
- kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+ if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));

return was_writable;
}
@@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
bool host_writable = !fault || fault->map_writable;
bool prefetch = !fault || fault->prefetch;
bool write_fault = fault && fault->write;
+ bool is_refcounted = !fault || fault->is_refcounted_page;

pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
*sptep, write_fault, gfn);
@@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
}

wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
- true, host_writable, &spte);
+ true, host_writable, is_refcounted, &spte);

if (*sptep == spte) {
ret = RET_PF_SPURIOUS;
@@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
struct kvm_follow_pfn foll = {
.slot = slot,
.gfn = fault->gfn,
- .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+ .flags = fault->write ? FOLL_WRITE : 0,
.allow_write_mapping = true,
+ .guarded_by_mmu_notifier = true,
};

/*
@@ -4317,6 +4325,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
fault->map_writable = false;
+ fault->is_refcounted_page = false;
return RET_PF_CONTINUE;
}
/*
@@ -4366,6 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
success:
fault->hva = foll.hva;
fault->map_writable = foll.writable;
+ fault->is_refcounted_page = foll.is_refcounted_page;
return RET_PF_CONTINUE;
}

@@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault

out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ if (fault->is_refcounted_page)
+ kvm_set_page_accessed(pfn_to_page(fault->pfn));
return r;
}

@@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,

out_unlock:
read_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ if (fault->is_refcounted_page)
+ kvm_set_page_accessed(pfn_to_page(fault->pfn));
return r;
}
#endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d39af5639ce9..55790085884f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -240,6 +240,7 @@ struct kvm_page_fault {
kvm_pfn_t pfn;
hva_t hva;
bool map_writable;
+ bool is_refcounted_page;

/*
* Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0662e0278e70..3284e7bd9619 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -829,7 +829,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault

out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ if (fault->is_refcounted_page)
+ kvm_set_page_accessed(pfn_to_page(fault->pfn));
return r;
}

@@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
*/
static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
{
- bool host_writable;
+ bool host_writable, is_refcounted;
gpa_t first_pte_gpa;
u64 *sptep, spte;
struct kvm_memory_slot *slot;
@@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
sptep = &sp->spt[i];
spte = *sptep;
host_writable = spte & shadow_host_writable_mask;
+ // TODO: is this correct?
+ is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
make_spte(vcpu, sp, slot, pte_access, gfn,
spte_to_pfn(spte), spte, true, false,
- host_writable, &spte);
+ host_writable, is_refcounted, &spte);

return mmu_spte_update(sptep, spte);
}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cf2c6426a6fc..46c681dc45e6 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte)
+ bool host_writable, bool is_refcounted, u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,

if (level > PG_LEVEL_4K)
spte |= PT_PAGE_SIZE_MASK;
+ else if (is_refcounted)
+ spte |= SPTE_MMU_PAGE_REFCOUNTED;

if (shadow_memtype_mask)
spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..be93dd061ae3 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -95,6 +95,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
/* Defined only to keep the above static asserts readable. */
#undef SHADOW_ACC_TRACK_SAVED_MASK

+/*
+ * Indicates that the SPTE refers to a page with a valid refcount.
+ */
+#define SPTE_MMU_PAGE_REFCOUNTED BIT_ULL(59)
+
/*
* Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
* the memslots generation and is derived as follows:
@@ -332,6 +337,11 @@ static inline bool is_dirty_spte(u64 spte)
return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
}

+static inline bool is_refcounted_page_pte(u64 spte)
+{
+ return spte & SPTE_MMU_PAGE_REFCOUNTED;
+}
+
static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
int level)
{
@@ -462,7 +472,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte);
+ bool host_writable, bool is_refcounted, u64 *new_spte);
u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
union kvm_mmu_page_role role, int index);
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 512163d52194..a9b1b14d2e26 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -474,6 +474,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+ bool is_refcounted = is_refcounted_page_pte(old_spte);

WARN_ON(level > PT64_ROOT_MAX_LEVEL);
WARN_ON(level < PG_LEVEL_4K);
@@ -538,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
if (is_leaf != was_leaf)
kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);

- if (was_leaf && is_dirty_spte(old_spte) &&
+ if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
(!is_present || !is_dirty_spte(new_spte) || pfn_changed))
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));

/*
* Recursively handle child PTs if the change removed a subtree from
@@ -552,9 +553,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
(is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);

- if (was_leaf && is_accessed_spte(old_spte) &&
+ if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
(!is_present || !is_accessed_spte(new_spte) || pfn_changed))
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+ kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
}

/*
@@ -988,8 +989,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
- fault->pfn, iter->old_spte, fault->prefetch, true,
- fault->map_writable, &new_spte);
+ fault->pfn, iter->old_spte, fault->prefetch, true,
+ fault->map_writable, fault->is_refcounted_page,
+ &new_spte);

if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
@@ -1205,8 +1207,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
* Capture the dirty status of the page, so that it doesn't get
* lost when the SPTE is marked for access tracking.
*/
- if (is_writable_pte(iter->old_spte))
- kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
+ if (is_writable_pte(iter->old_spte) &&
+ is_refcounted_page_pte(iter->old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));

new_spte = mark_spte_for_access_track(iter->old_spte);
iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
@@ -1626,7 +1629,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
iter.old_spte,
iter.old_spte & ~dbit);
- kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+ if (is_refcounted_page_pte(iter.old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
}

rcu_read_unlock();
--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:24:21

by David Stevens

Subject: [PATCH v7 7/8] KVM: PPC: Migrate to __kvm_follow_pfn

From: David Stevens <[email protected]>

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn. As part of the
refactoring, remove the redundant calls to get_user_page_fast_only,
since the check for !async && !atomic was removed from the KVM generic
code in commit b9b33da2aa74. Also, remove the kvm_ro parameter because
the KVM generic code handles RO memslots.
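
Schematically, each converted fault path goes from the open-coded
fast/slow split to a single call (condensed from the
book3s_64_mmu_hv.c hunk below):

    /* Before: manual fast path, then the generic slow path. */
    if (get_user_page_fast_only(hva, FOLL_WRITE, &page))
        write_ok = true;
    else
        pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
                                   writing, &write_ok, NULL);

    /* After: one call; the generic code already tries the fast path
     * and refuses writable mappings of read-only memslots. */
    struct kvm_follow_pfn foll = {
        .slot = memslot,
        .gfn = gfn,
        .flags = FOLL_GET | (writing ? FOLL_WRITE : 0),
        .allow_write_mapping = true,
    };
    pfn = __kvm_follow_pfn(&foll);
    write_ok = foll.writable;
    hva = foll.hva;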

Signed-off-by: David Stevens <[email protected]>
---
I have checked that this patch compiles, but I don't have the hardware
to test it myself.

arch/powerpc/include/asm/kvm_book3s.h | 2 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 38 +++++++++-----------
arch/powerpc/kvm/book3s_64_mmu_radix.c | 50 +++++++++++---------------
arch/powerpc/kvm/book3s_hv_nested.c | 4 +--
4 files changed, 38 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..bf48c511e700 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -202,7 +202,7 @@ extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested,
extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long gpa,
struct kvm_memory_slot *memslot,
- bool writing, bool kvm_ro,
+ bool writing,
pte_t *inserted_pte, unsigned int *levelp);
extern int kvmppc_init_vm_radix(struct kvm *kvm);
extern void kvmppc_free_radix(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f765d5ad436..9a4715e73937 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -523,6 +523,9 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
unsigned long rcbits;
long mmio_update;
pte_t pte, *ptep;
+ struct kvm_follow_pfn foll = {
+ .allow_write_mapping = true,
+ };

if (kvm_is_radix(kvm))
return kvmppc_book3s_radix_page_fault(vcpu, ea, dsisr);
@@ -599,29 +602,20 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
page = NULL;
writing = (dsisr & DSISR_ISSTORE) != 0;
/* If writing != 0, then the HPTE must allow writing, if we get here */
- write_ok = writing;
- hva = gfn_to_hva_memslot(memslot, gfn);

- /*
- * Do a fast check first, since __gfn_to_pfn_memslot doesn't
- * do it with !atomic && !async, which is how we call it.
- * We always ask for write permission since the common case
- * is that the page is writable.
- */
- if (get_user_page_fast_only(hva, FOLL_WRITE, &page)) {
- write_ok = true;
- } else {
- /* Call KVM generic code to do the slow-path check */
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
- writing, &write_ok, NULL);
- if (is_error_noslot_pfn(pfn))
- return -EFAULT;
- page = NULL;
- if (pfn_valid(pfn)) {
- page = pfn_to_page(pfn);
- if (PageReserved(page))
- page = NULL;
- }
+ foll.slot = memslot;
+ foll.gfn = gfn;
+ foll.flags = FOLL_GET | (writing ? FOLL_WRITE : 0);
+ pfn = __kvm_follow_pfn(&foll);
+ if (is_error_noslot_pfn(pfn))
+ return -EFAULT;
+ page = NULL;
+ write_ok = foll.writable;
+ hva = foll.hva;
+ if (pfn_valid(pfn)) {
+ page = pfn_to_page(pfn);
+ if (PageReserved(page))
+ page = NULL;
}

/*
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..339d1efcb6c9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -815,47 +815,39 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested, bool writing,
int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long gpa,
struct kvm_memory_slot *memslot,
- bool writing, bool kvm_ro,
+ bool writing,
pte_t *inserted_pte, unsigned int *levelp)
{
struct kvm *kvm = vcpu->kvm;
struct page *page = NULL;
unsigned long mmu_seq;
- unsigned long hva, gfn = gpa >> PAGE_SHIFT;
- bool upgrade_write = false;
- bool *upgrade_p = &upgrade_write;
+ unsigned long hva, pfn, gfn = gpa >> PAGE_SHIFT;
+ bool upgrade_write;
pte_t pte, *ptep;
unsigned int shift, level;
int ret;
bool large_enable;
+ struct kvm_follow_pfn foll = {
+ .slot = memslot,
+ .gfn = gfn,
+ .flags = FOLL_GET | (writing ? FOLL_WRITE : 0),
+ .allow_write_mapping = true,
+ };

/* used to check for invalidations in progress */
mmu_seq = kvm->mmu_invalidate_seq;
smp_rmb();

- /*
- * Do a fast check first, since __gfn_to_pfn_memslot doesn't
- * do it with !atomic && !async, which is how we call it.
- * We always ask for write permission since the common case
- * is that the page is writable.
- */
- hva = gfn_to_hva_memslot(memslot, gfn);
- if (!kvm_ro && get_user_page_fast_only(hva, FOLL_WRITE, &page)) {
- upgrade_write = true;
- } else {
- unsigned long pfn;
-
- /* Call KVM generic code to do the slow-path check */
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
- writing, upgrade_p, NULL);
- if (is_error_noslot_pfn(pfn))
- return -EFAULT;
- page = NULL;
- if (pfn_valid(pfn)) {
- page = pfn_to_page(pfn);
- if (PageReserved(page))
- page = NULL;
- }
+ pfn = __kvm_follow_pfn(&foll);
+ if (is_error_noslot_pfn(pfn))
+ return -EFAULT;
+ page = NULL;
+ hva = foll.hva;
+ upgrade_write = foll.writable;
+ if (pfn_valid(pfn)) {
+ page = pfn_to_page(pfn);
+ if (PageReserved(page))
+ page = NULL;
}

/*
@@ -944,7 +936,6 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
struct kvm_memory_slot *memslot;
long ret;
bool writing = !!(dsisr & DSISR_ISSTORE);
- bool kvm_ro = false;

/* Check for unusual errors */
if (dsisr & DSISR_UNSUPP_MMU) {
@@ -997,7 +988,6 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
ea, DSISR_ISSTORE | DSISR_PROTFAULT);
return RESUME_GUEST;
}
- kvm_ro = true;
}

/* Failed to set the reference/change bits */
@@ -1015,7 +1005,7 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,

/* Try to insert a pte */
ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot, writing,
- kvm_ro, NULL, NULL);
+ NULL, NULL);

if (ret == 0 || ret == -EAGAIN)
ret = RESUME_GUEST;
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 377d0b4a05ee..6d531051df04 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1497,7 +1497,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
unsigned long n_gpa, gpa, gfn, perm = 0UL;
unsigned int shift, l1_shift, level;
bool writing = !!(dsisr & DSISR_ISSTORE);
- bool kvm_ro = false;
long int ret;

if (!gp->l1_gr_to_hr) {
@@ -1577,7 +1576,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
ea, DSISR_ISSTORE | DSISR_PROTFAULT);
return RESUME_GUEST;
}
- kvm_ro = true;
}

/* 2. Find the host pte for this L1 guest real address */
@@ -1599,7 +1597,7 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) {
/* No suitable pte found -> try to insert a mapping */
ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot,
- writing, kvm_ro, &pte, &level);
+ writing, &pte, &level);
if (ret == -EAGAIN)
return RESUME_GUEST;
else if (ret)
--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:32:33

by David Stevens

Subject: [PATCH v7 8/8] KVM: remove __gfn_to_pfn_memslot

From: David Stevens <[email protected]>

All callers have been migrated to __kvm_follow_pfn.

Signed-off-by: David Stevens <[email protected]>
---
virt/kvm/kvm_main.c | 33 ---------------------------------
1 file changed, 33 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f7b41f220b6..5b5afd70f239 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2712,39 +2712,6 @@ kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
}
EXPORT_SYMBOL_GPL(__kvm_follow_pfn);

-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool interruptible, bool *async,
- bool write_fault, bool *writable, hva_t *hva)
-{
- kvm_pfn_t pfn;
- struct kvm_follow_pfn foll = {
- .slot = slot,
- .gfn = gfn,
- .flags = FOLL_GET,
- .atomic = atomic,
- .allow_write_mapping = !!writable,
- };
-
- if (write_fault)
- foll.flags |= FOLL_WRITE;
- if (async)
- foll.flags |= FOLL_NOWAIT;
- if (interruptible)
- foll.flags |= FOLL_INTERRUPTIBLE;
-
- pfn = __kvm_follow_pfn(&foll);
- if (pfn == KVM_PFN_ERR_NEEDS_IO) {
- *async = true;
- pfn = KVM_PFN_ERR_FAULT;
- }
- if (hva)
- *hva = foll.hva;
- if (writable)
- *writable = foll.writable;
- return pfn;
-}
-EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
-
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable)
{
--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:32:34

by David Stevens

Subject: [PATCH v7 6/8] KVM: arm64: Migrate to __kvm_follow_pfn

From: David Stevens <[email protected]>

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.

Signed-off-by: David Stevens <[email protected]>
---
arch/arm64/kvm/mmu.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6db9ef288ec3..c706530d304d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1334,7 +1334,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
unsigned long fault_status)
{
int ret = 0;
- bool write_fault, writable, force_pte = false;
+ bool write_fault = kvm_is_write_fault(vcpu);
+ bool force_pte = false;
bool exec_fault, mte_allowed;
bool device = false;
unsigned long mmu_seq;
@@ -1342,16 +1343,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
struct vm_area_struct *vma;
short vma_shift;
- gfn_t gfn;
kvm_pfn_t pfn;
bool logging_active = memslot_is_logging(memslot);
unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
+ struct kvm_follow_pfn foll = {
+ .slot = memslot,
+ .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
+ .allow_write_mapping = true,
+ };

fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
- write_fault = kvm_is_write_fault(vcpu);
exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
VM_BUG_ON(write_fault && exec_fault);

@@ -1425,7 +1429,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
fault_ipa &= ~(vma_pagesize - 1);

- gfn = fault_ipa >> PAGE_SHIFT;
+ foll.gfn = fault_ipa >> PAGE_SHIFT;
mte_allowed = kvm_vma_mte_allowed(vma);

/* Don't use the VMA after the unlock -- it may have vanished */
@@ -1433,7 +1437,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,

/*
* Read mmu_invalidate_seq so that KVM can detect if the results of
- * vma_lookup() or __gfn_to_pfn_memslot() become stale prior to
+ * vma_lookup() or __kvm_follow_pfn() become stale prior to
* acquiring kvm->mmu_lock.
*
* Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
@@ -1442,8 +1446,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
mmu_seq = vcpu->kvm->mmu_invalidate_seq;
mmap_read_unlock(current->mm);

- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
- write_fault, &writable, NULL);
+ pfn = __kvm_follow_pfn(&foll);
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
@@ -1468,7 +1471,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* Only actually map the page as writable if this was a write
* fault.
*/
- writable = false;
+ foll.writable = false;
}

if (exec_fault && device)
@@ -1508,7 +1511,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
}
}

- if (writable)
+ if (foll.writable)
prot |= KVM_PGTABLE_PROT_W;

if (exec_fault)
@@ -1534,9 +1537,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
KVM_PGTABLE_WALK_SHARED);

/* Mark the page dirty only if the fault is handled successfully */
- if (writable && !ret) {
+ if (foll.writable && !ret) {
kvm_set_pfn_dirty(pfn);
- mark_page_dirty_in_slot(kvm, memslot, gfn);
+ mark_page_dirty_in_slot(kvm, memslot, foll.gfn);
}

out_unlock:
--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:38:43

by David Stevens

Subject: [PATCH v7 1/8] KVM: Assert that a page's refcount is elevated when marking accessed/dirty

From: Sean Christopherson <[email protected]>

Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.

Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free. Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occurred, but yelling
does help detect, triage, and debug use-after-free bugs.

Note, the assertion must use page_count(), NOT page_ref_count()! For
hugepages, the returned struct page may be a tailpage and thus not have
its own refcount.
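
To make that concrete, a rough sketch of the difference (illustrative
only, not part of the patch):

    /*
     * page_ref_count(page) reads page->_refcount directly; for a tail
     * page of a compound allocation that is typically zero.
     * page_count(page) resolves to the head page/folio refcount, which
     * is the count that is actually elevated while the page is pinned.
     */
    WARN_ON_ONCE(!page_count(page));      /* what the patch asserts   */
    WARN_ON_ONCE(!page_ref_count(page));  /* would misfire on tails   */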

Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b838c8f71349..371bd783ff2b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2885,6 +2885,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);

static bool kvm_is_ad_tracked_page(struct page *page)
{
+ /*
+ * Assert that KVM isn't attempting to mark a freed page as Accessed or
+ * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
+ * (typically) doesn't pin pages that are mapped in KVM's MMU, and
+ * instead relies on mmu_notifiers to know when a mapping needs to be
+ * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
+ * KVM returns from its mmu_notifier, i.e. the page should have an
+ * elevated refcount at this point even though KVM doesn't hold a
+ * reference of its own.
+ */
+ if (WARN_ON_ONCE(!page_count(page)))
+ return false;
+
/*
* Per page-flags.h, pages tagged PG_reserved "should in general not be
* touched (e.g. set dirty) except by its owner".
--
2.41.0.255.g8b1d071c50-goog


2023-07-05 03:53:28

by Yu Zhang

Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> * The slow path to get the pfn of the specified host virtual address,
> * 1 indicates success, -errno is returned if error is detected.
> */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> - unsigned int flags = FOLL_HWPOISON;
> + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> struct page *page;
> int npages;
>
> might_sleep();
>
> - if (writable)
> - *writable = write_fault;
> -
> - if (write_fault)
> - flags |= FOLL_WRITE;
> - if (async)
> - flags |= FOLL_NOWAIT;
> - if (interruptible)
> - flags |= FOLL_INTERRUPTIBLE;
> -
> - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> if (npages != 1)
> return npages;
>
> + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
> /* map read fault as writable if possible */
> - if (unlikely(!write_fault) && writable) {
> + if (unlikely(!foll->writable) && foll->allow_write_mapping) {

I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.

> struct page *wpage;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> - *writable = true;
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> + foll->writable = true;
> put_page(page);
> page = wpage;
> }
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> return get_page_unless_zero(page);
> }
>
...

> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> + bool atomic, bool interruptible, bool *async,
> + bool write_fault, bool *writable, hva_t *hva)
> +{
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = 0,
> + .atomic = atomic,
> + .allow_write_mapping = !!writable,
> + };
> +
> + if (write_fault)
> + foll.flags |= FOLL_WRITE;
> + if (async)
> + foll.flags |= FOLL_NOWAIT;
> + if (interruptible)
> + foll.flags |= FOLL_INTERRUPTIBLE;
> +
> + pfn = __kvm_follow_pfn(&foll);
> + if (pfn == KVM_PFN_ERR_NEEDS_IO) {

Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
seems unnecessary.

> + *async = true;
> + pfn = KVM_PFN_ERR_FAULT;
> + }
> + if (hva)
> + *hva = foll.hva;
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>

B.R.
Yu

2023-07-05 07:56:15

by Yu Zhang

Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

>
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
No one calls these two routines in this patch. How about moving this
change to [PATCH v7 5/8] "KVM: x86/mmu: Don't pass FOLL_GET to
__kvm_follow_pfn"?

...

> @@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
> return !PageReserved(page);
> }
>
> -static void kvm_set_page_dirty(struct page *page)
> +void kvm_set_page_dirty(struct page *page)
> {
> if (kvm_is_ad_tracked_page(page))
> SetPageDirty(page);
> }
> +EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
>
> -static void kvm_set_page_accessed(struct page *page)
> +void kvm_set_page_accessed(struct page *page)
> {
> if (kvm_is_ad_tracked_page(page))
> mark_page_accessed(page);
> }
> +EXPORT_SYMBOL_GPL(kvm_set_page_accessed);

Same here.

B.R.
Yu

2023-07-05 09:19:43

by Zhi Wang

Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Tue, 4 Jul 2023 16:50:47 +0900
David Stevens <[email protected]> wrote:

> From: David Stevens <[email protected]>
>
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 16 ++++
> virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> virt/kvm/kvm_mm.h | 3 +-
> virt/kvm/pfncache.c | 8 +-
> 4 files changed, 122 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
> #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
>
> /*
> * error pfns indicate that the gfn is in slot but faild to
> @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> void kvm_release_page_clean(struct page *page);
> void kvm_release_page_dirty(struct page *page);
>
> +struct kvm_follow_pfn {
> + const struct kvm_memory_slot *slot;
> + gfn_t gfn;
> + unsigned int flags;
> + bool atomic;
> + /* Allow a read fault to create a writeable mapping. */
> + bool allow_write_mapping;
> +
> + /* Outputs of __kvm_follow_pfn */
> + hva_t hva;
> + bool writable;
> +};
> +
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> +
> kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 371bd783ff2b..b13f22861d2f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> * true indicates success, otherwise false is returned. It's also the
> * only part that runs if we can in atomic context.
> */
> -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *pfn)
> +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> struct page *page[1];
> + bool write_fault = foll->flags & FOLL_WRITE;
>
> /*
> * Fast pin a writable pfn only if it is a write fault request
> * or the caller allows to map a writable pfn for a read fault
> * request.
> */
> - if (!(write_fault || writable))
> + if (!(write_fault || foll->allow_write_mapping))
> return false;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> *pfn = page_to_pfn(page[0]);
> -
> - if (writable)
> - *writable = true;
> + foll->writable = foll->allow_write_mapping;
> return true;
> }
>
> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> * The slow path to get the pfn of the specified host virtual address,
> * 1 indicates success, -errno is returned if error is detected.
> */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> - unsigned int flags = FOLL_HWPOISON;
> + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> struct page *page;
> int npages;
>
> might_sleep();
>
> - if (writable)
> - *writable = write_fault;
> -
> - if (write_fault)
> - flags |= FOLL_WRITE;
> - if (async)
> - flags |= FOLL_NOWAIT;
> - if (interruptible)
> - flags |= FOLL_INTERRUPTIBLE;
> -
> - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> if (npages != 1)
> return npages;
>
> + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
> /* map read fault as writable if possible */
> - if (unlikely(!write_fault) && writable) {
> + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> struct page *wpage;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> - *writable = true;
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> + foll->writable = true;
> put_page(page);
> page = wpage;
> }
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> return get_page_unless_zero(page);
> }
>
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> - unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> + kvm_pfn_t *p_pfn)
> {
> kvm_pfn_t pfn;
> pte_t *ptep;
> spinlock_t *ptl;
> + bool write_fault = foll->flags & FOLL_WRITE;
> int r;
>
> - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> if (r) {
> /*
> * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> * not call the fault handler, so do it here.
> */
> bool unlocked = false;
> - r = fixup_user_fault(current->mm, addr,
> + r = fixup_user_fault(current->mm, foll->hva,
> (write_fault ? FAULT_FLAG_WRITE : 0),
> &unlocked);
> if (unlocked)
> @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> if (r)
> return r;
>
> - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> if (r)
> return r;
> }
> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> goto out;
> }
>
> - if (writable)
> - *writable = pte_write(*ptep);
> + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> pfn = pte_pfn(*ptep);
>
> /*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> * 2): @write_fault = false && @writable, @writable will tell the caller
> * whether the mapping is writable.
> */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> {
> struct vm_area_struct *vma;
> kvm_pfn_t pfn;
> int npages, r;
>
> /* we can do it either atomically or asynchronously, not both */
> - BUG_ON(atomic && async);
> + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>
> - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> + if (hva_to_pfn_fast(foll, &pfn))
> return pfn;
>
> - if (atomic)
> + if (foll->atomic)
> return KVM_PFN_ERR_FAULT;
>
> - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> - writable, &pfn);
> + npages = hva_to_pfn_slow(foll, &pfn);
> if (npages == 1)
> return pfn;
> if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>
> mmap_read_lock(current->mm);
> if (npages == -EHWPOISON ||
> - (!async && check_user_page_hwpoison(addr))) {
> + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> pfn = KVM_PFN_ERR_HWPOISON;
> goto exit;
> }
>
> retry:
> - vma = vma_lookup(current->mm, addr);
> + vma = vma_lookup(current->mm, foll->hva);
>
> if (vma == NULL)
> pfn = KVM_PFN_ERR_FAULT;
> else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> + r = hva_to_pfn_remapped(vma, foll, &pfn);
> if (r == -EAGAIN)
> goto retry;
> if (r < 0)
> pfn = KVM_PFN_ERR_FAULT;
> } else {
> - if (async && vma_is_valid(vma, write_fault))
> - *async = true;
> - pfn = KVM_PFN_ERR_FAULT;
> + if ((foll->flags & FOLL_NOWAIT) &&
> + vma_is_valid(vma, foll->flags & FOLL_WRITE))
> + pfn = KVM_PFN_ERR_NEEDS_IO;
> + else
> + pfn = KVM_PFN_ERR_FAULT;
> }
> exit:
> mmap_read_unlock(current->mm);
> return pfn;
> }
>
> -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> - bool atomic, bool interruptible, bool *async,
> - bool write_fault, bool *writable, hva_t *hva)
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> {
> - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> -
> - if (hva)
> - *hva = addr;
> + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> + foll->flags & FOLL_WRITE);
>
> - if (addr == KVM_HVA_ERR_RO_BAD) {
> - if (writable)
> - *writable = false;
> + if (foll->hva == KVM_HVA_ERR_RO_BAD)
> return KVM_PFN_ERR_RO_FAULT;
> - }
>

Can you explain why the update of foll->writable to false (previously
*writable = false) is omitted here?

In the caller where the struct kvm_follow_pfn is initialized, e.g.
__gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not initialized.
IIUC, they expect __kvm_follow_pfn() to update it and return .writable to
the upper caller.

As it is one of the outputs, it would be better to either initialize it in
the caller or update it in __kvm_follow_pfn(). Otherwise
__gfn_to_pfn_memslot()/gfn_to_pfn_prot() will return random stack data to
the caller via bool *writable. That doesn't sound nice.

BTW: it seems both "writable" and "writeable" are used in this patch.
Perhaps we can make the spelling consistent.

> - if (kvm_is_error_hva(addr)) {
> - if (writable)
> - *writable = false;
> + if (kvm_is_error_hva(foll->hva))
> return KVM_PFN_NOSLOT;
> - }
>
> - /* Do not map writable pfn in the readonly memslot. */
> - if (writable && memslot_is_readonly(slot)) {
> - *writable = false;
> - writable = NULL;
> - }
> + if (memslot_is_readonly(foll->slot))
> + foll->allow_write_mapping = false;
> +
> + return hva_to_pfn(foll);
> +}
> +EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
>
> - return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> - writable);
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> + bool atomic, bool interruptible, bool *async,
> + bool write_fault, bool *writable, hva_t *hva)
> +{
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = 0,
> + .atomic = atomic,
> + .allow_write_mapping = !!writable,
> + };
> +
> + if (write_fault)
> + foll.flags |= FOLL_WRITE;
> + if (async)
> + foll.flags |= FOLL_NOWAIT;
> + if (interruptible)
> + foll.flags |= FOLL_INTERRUPTIBLE;
> +
> + pfn = __kvm_follow_pfn(&foll);
> + if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> + *async = true;
> + pfn = KVM_PFN_ERR_FAULT;
> + }
> + if (hva)
> + *hva = foll.hva;
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>
> kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable)
> {
> - return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
> - NULL, write_fault, writable, NULL);
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = gfn_to_memslot(kvm, gfn),
> + .gfn = gfn,
> + .flags = write_fault ? FOLL_WRITE : 0,
> + .allow_write_mapping = !!writable,
> + };
> + pfn = __kvm_follow_pfn(&foll);
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>
> kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> - return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
> - NULL, NULL);
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = FOLL_WRITE,
> + };
> + return __kvm_follow_pfn(&foll);
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>
> kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> - return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
> - NULL, NULL);
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = FOLL_WRITE,
> + .atomic = true,
> + };
> + return __kvm_follow_pfn(&foll);
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index 180f1a09e6ba..ed896aee5396 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,8 +20,7 @@
> #define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock)
> #endif /* KVM_HAVE_MMU_RWLOCK */
>
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable);
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
>
> #ifdef CONFIG_HAVE_KVM_PFNCACHE
> void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 2d6aba677830..e3fefa753a51 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
> void *new_khva = NULL;
> unsigned long mmu_seq;
> + struct kvm_follow_pfn foll = {
> + .slot = gpc->memslot,
> + .gfn = gpa_to_gfn(gpc->gpa),
> + .flags = FOLL_WRITE,
> + .hva = gpc->uhva,
> + };
>
> lockdep_assert_held(&gpc->refresh_lock);
>
> @@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> }
>
> /* We always request a writeable mapping */
> - new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
> + new_pfn = hva_to_pfn(&foll);
> if (is_error_noslot_pfn(new_pfn))
> goto out_error;
>


2023-07-05 09:36:01

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, Jul 5, 2023 at 5:47 PM Zhi Wang <[email protected]> wrote:
>
> On Tue, 4 Jul 2023 16:50:47 +0900
> David Stevens <[email protected]> wrote:
>
> > From: David Stevens <[email protected]>
> >
> > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > where possible, combines the boolean arguments into a single flags
> > argument.
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > include/linux/kvm_host.h | 16 ++++
> > virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> > virt/kvm/kvm_mm.h | 3 +-
> > virt/kvm/pfncache.c | 8 +-
> > 4 files changed, 122 insertions(+), 76 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 9d3ac7720da9..ef2763c2b12e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -97,6 +97,7 @@
> > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> >
> > /*
> > * error pfns indicate that the gfn is in slot but faild to
> > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > void kvm_release_page_clean(struct page *page);
> > void kvm_release_page_dirty(struct page *page);
> >
> > +struct kvm_follow_pfn {
> > + const struct kvm_memory_slot *slot;
> > + gfn_t gfn;
> > + unsigned int flags;
> > + bool atomic;
> > + /* Allow a read fault to create a writeable mapping. */
> > + bool allow_write_mapping;
> > +
> > + /* Outputs of __kvm_follow_pfn */
> > + hva_t hva;
> > + bool writable;
> > +};
> > +
> > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > +
> > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> > bool *writable);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 371bd783ff2b..b13f22861d2f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > * true indicates success, otherwise false is returned. It's also the
> > * only part that runs if we can in atomic context.
> > */
> > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > - bool *writable, kvm_pfn_t *pfn)
> > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > {
> > struct page *page[1];
> > + bool write_fault = foll->flags & FOLL_WRITE;
> >
> > /*
> > * Fast pin a writable pfn only if it is a write fault request
> > * or the caller allows to map a writable pfn for a read fault
> > * request.
> > */
> > - if (!(write_fault || writable))
> > + if (!(write_fault || foll->allow_write_mapping))
> > return false;
> >
> > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > *pfn = page_to_pfn(page[0]);
> > -
> > - if (writable)
> > - *writable = true;
> > + foll->writable = foll->allow_write_mapping;
> > return true;
> > }
> >
> > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > * The slow path to get the pfn of the specified host virtual address,
> > * 1 indicates success, -errno is returned if error is detected.
> > */
> > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > {
> > - unsigned int flags = FOLL_HWPOISON;
> > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > struct page *page;
> > int npages;
> >
> > might_sleep();
> >
> > - if (writable)
> > - *writable = write_fault;
> > -
> > - if (write_fault)
> > - flags |= FOLL_WRITE;
> > - if (async)
> > - flags |= FOLL_NOWAIT;
> > - if (interruptible)
> > - flags |= FOLL_INTERRUPTIBLE;
> > -
> > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > if (npages != 1)
> > return npages;
> >
> > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > +
> > /* map read fault as writable if possible */
> > - if (unlikely(!write_fault) && writable) {
> > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > struct page *wpage;
> >
> > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > - *writable = true;
> > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > + foll->writable = true;
> > put_page(page);
> > page = wpage;
> > }
> > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> > return get_page_unless_zero(page);
> > }
> >
> > -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > - unsigned long addr, bool write_fault,
> > - bool *writable, kvm_pfn_t *p_pfn)
> > +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> > + kvm_pfn_t *p_pfn)
> > {
> > kvm_pfn_t pfn;
> > pte_t *ptep;
> > spinlock_t *ptl;
> > + bool write_fault = foll->flags & FOLL_WRITE;
> > int r;
> >
> > - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > if (r) {
> > /*
> > * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> > * not call the fault handler, so do it here.
> > */
> > bool unlocked = false;
> > - r = fixup_user_fault(current->mm, addr,
> > + r = fixup_user_fault(current->mm, foll->hva,
> > (write_fault ? FAULT_FLAG_WRITE : 0),
> > &unlocked);
> > if (unlocked)
> > @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > if (r)
> > return r;
> >
> > - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > if (r)
> > return r;
> > }
> > @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > goto out;
> > }
> >
> > - if (writable)
> > - *writable = pte_write(*ptep);
> > + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> > pfn = pte_pfn(*ptep);
> >
> > /*
> > @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > * 2): @write_fault = false && @writable, @writable will tell the caller
> > * whether the mapping is writable.
> > */
> > -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > - bool *async, bool write_fault, bool *writable)
> > +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> > {
> > struct vm_area_struct *vma;
> > kvm_pfn_t pfn;
> > int npages, r;
> >
> > /* we can do it either atomically or asynchronously, not both */
> > - BUG_ON(atomic && async);
> > + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
> >
> > - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> > + if (hva_to_pfn_fast(foll, &pfn))
> > return pfn;
> >
> > - if (atomic)
> > + if (foll->atomic)
> > return KVM_PFN_ERR_FAULT;
> >
> > - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> > - writable, &pfn);
> > + npages = hva_to_pfn_slow(foll, &pfn);
> > if (npages == 1)
> > return pfn;
> > if (npages == -EINTR)
> > @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> >
> > mmap_read_lock(current->mm);
> > if (npages == -EHWPOISON ||
> > - (!async && check_user_page_hwpoison(addr))) {
> > + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> > pfn = KVM_PFN_ERR_HWPOISON;
> > goto exit;
> > }
> >
> > retry:
> > - vma = vma_lookup(current->mm, addr);
> > + vma = vma_lookup(current->mm, foll->hva);
> >
> > if (vma == NULL)
> > pfn = KVM_PFN_ERR_FAULT;
> > else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> > - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> > + r = hva_to_pfn_remapped(vma, foll, &pfn);
> > if (r == -EAGAIN)
> > goto retry;
> > if (r < 0)
> > pfn = KVM_PFN_ERR_FAULT;
> > } else {
> > - if (async && vma_is_valid(vma, write_fault))
> > - *async = true;
> > - pfn = KVM_PFN_ERR_FAULT;
> > + if ((foll->flags & FOLL_NOWAIT) &&
> > + vma_is_valid(vma, foll->flags & FOLL_WRITE))
> > + pfn = KVM_PFN_ERR_NEEDS_IO;
> > + else
> > + pfn = KVM_PFN_ERR_FAULT;
> > }
> > exit:
> > mmap_read_unlock(current->mm);
> > return pfn;
> > }
> >
> > -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > - bool atomic, bool interruptible, bool *async,
> > - bool write_fault, bool *writable, hva_t *hva)
> > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > {
> > - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> > -
> > - if (hva)
> > - *hva = addr;
> > + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> > + foll->flags & FOLL_WRITE);
> >
> > - if (addr == KVM_HVA_ERR_RO_BAD) {
> > - if (writable)
> > - *writable = false;
> > + if (foll->hva == KVM_HVA_ERR_RO_BAD)
> > return KVM_PFN_ERR_RO_FAULT;
> > - }
> >
>
> Can you explain why the update of foll->writable to false (previously
> *writable = false) is omitted here?
>
> In the caller where the struct kvm_follow_pfn is initialized, e.g.
> __gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not initialized.
> IIUC, they expect __kvm_follow_pfn() to update it and return .writable to
> the upper caller.
>
> As it is one of the outputs, it would be better to either initialize it in
> the caller or update it in __kvm_follow_pfn(). Otherwise
> __gfn_to_pfn_memslot()/gfn_to_pfn_prot() will return random stack data to
> the caller via bool *writable. That doesn't sound nice.

Entries omitted from a designated initializer are initialized to zero, so
.writable does get initialized in all of the patches in this series.
That said, you're right that explicitly setting it to false is a good
idea, in case someone someday adds a caller that doesn't use an
initializer when declaring its kvm_follow_pfn.
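
For reference, a minimal standalone illustration of that C rule (the
struct below is just a stand-in for kvm_follow_pfn, not the real thing):

#include <stdio.h>
#include <stdbool.h>

struct follow_pfn_example {
	unsigned int flags;	/* input */
	bool atomic;		/* input */
	bool writable;		/* output, deliberately not named below */
};

int main(void)
{
	/*
	 * C11 6.7.9p19/p21: members not named in a designated initializer
	 * are initialized as if the object had static storage duration,
	 * i.e. to zero/false.
	 */
	struct follow_pfn_example foll = {
		.flags = 0x1,
		.atomic = true,
	};

	printf("writable = %d\n", foll.writable);	/* prints 0 */
	return 0;
}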

-David

2023-07-05 09:47:31

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
>
> > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > * The slow path to get the pfn of the specified host virtual address,
> > * 1 indicates success, -errno is returned if error is detected.
> > */
> > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > {
> > - unsigned int flags = FOLL_HWPOISON;
> > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > struct page *page;
> > int npages;
> >
> > might_sleep();
> >
> > - if (writable)
> > - *writable = write_fault;
> > -
> > - if (write_fault)
> > - flags |= FOLL_WRITE;
> > - if (async)
> > - flags |= FOLL_NOWAIT;
> > - if (interruptible)
> > - flags |= FOLL_INTERRUPTIBLE;
> > -
> > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > if (npages != 1)
> > return npages;
> >
> > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > +
> > /* map read fault as writable if possible */
> > - if (unlikely(!write_fault) && writable) {
> > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
>
> I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.

The two statements are logically equivalent, although I guess using
!(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
verbose.
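
Spelled out, with the two relevant lines from the patch annotated:

	/* Immediately before the check, the patch sets: */
	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;

	/*
	 * The other half of the condition is foll->allow_write_mapping.
	 * When allow_write_mapping is false, neither spelling of the check
	 * can be true.  When it is true, the assignment above reduces to
	 * foll->writable == (foll->flags & FOLL_WRITE), so
	 *
	 *   !foll->writable  <==>  !(foll->flags & FOLL_WRITE)
	 *
	 * and the two spellings behave identically.
	 */
	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
		/* map read fault as writable if possible */
	}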

> > struct page *wpage;
> >
> > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > - *writable = true;
> > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > + foll->writable = true;
> > put_page(page);
> > page = wpage;
> > }
> > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> > return get_page_unless_zero(page);
> > }
> >
> ...
>
> > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > + bool atomic, bool interruptible, bool *async,
> > + bool write_fault, bool *writable, hva_t *hva)
> > +{
> > + kvm_pfn_t pfn;
> > + struct kvm_follow_pfn foll = {
> > + .slot = slot,
> > + .gfn = gfn,
> > + .flags = 0,
> > + .atomic = atomic,
> > + .allow_write_mapping = !!writable,
> > + };
> > +
> > + if (write_fault)
> > + foll.flags |= FOLL_WRITE;
> > + if (async)
> > + foll.flags |= FOLL_NOWAIT;
> > + if (interruptible)
> > + foll.flags |= FOLL_INTERRUPTIBLE;
> > +
> > + pfn = __kvm_follow_pfn(&foll);
> > + if (pfn == KVM_PFN_ERR_NEEDS_IO) {
>
> Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
> if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
> Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
> seems unnecessary.

There are cases where the fault does not fall within a vma, or where
the target vma's flags don't support the fault's access permissions.
In those cases, continuing to try to resolve the fault won't cause
problems per se, but it's wasteful and a bit confusing. Having
hva_to_pfn detect whether it may be possible to resolve the fault
asynchronously, and return KVM_PFN_ERR_NEEDS_IO if so, seems like a
good idea. It also matches what the existing code does.
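
For illustration, a hypothetical async-capable caller ends up looking
roughly like this (the function name is made up; the logic mirrors what
__gfn_to_pfn_memslot does in this patch):

/*
 * Hypothetical caller: ask __kvm_follow_pfn() not to block on I/O and
 * translate KVM_PFN_ERR_NEEDS_IO back into the old *async convention.
 */
static kvm_pfn_t example_faultin_pfn(struct kvm_follow_pfn *foll, bool *async)
{
	kvm_pfn_t pfn;

	foll->flags |= FOLL_NOWAIT;	/* don't wait for I/O */
	pfn = __kvm_follow_pfn(foll);

	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
		/*
		 * The hva is backed by a vma whose flags are compatible
		 * with the fault, so an asynchronous retry can succeed;
		 * otherwise __kvm_follow_pfn() already returned
		 * KVM_PFN_ERR_FAULT and we give up immediately.
		 */
		*async = true;
		pfn = KVM_PFN_ERR_FAULT;
	}
	return pfn;
}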

-David

2023-07-05 10:47:05

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> memory into the guest that is backed by un-refcounted struct pages - for
> example, higher order non-compound pages allocated by the amdgpu driver
> via ttm_pool_alloc_page.

I guess you mean the tail pages of the higher order non-compound pages?
As for the head page, its refcount is said to be one only coincidentally [*],
so it should not be considered refcounted either. IIUC, the refcount of this
head page will be incremented and then decremented again shortly in
hva_to_pfn_remapped(), so this may not be a problem(?). But treating the
head page differently, as a refcounted one (e.g., to set the A/D flags),
is weird.

Or maybe I missed some context, e.g., can the head page be allocated to
the guest at all?


>
> The bulk of this change is tracking the is_refcounted_page flag so that
> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> done by storing the flag in an unused bit in the sptes.

Also, maybe we should mention this only works on x86-64.

>
> Signed-off-by: David Stevens <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> arch/x86/kvm/mmu/mmu_internal.h | 1 +
> arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> arch/x86/kvm/mmu/spte.c | 4 ++-
> arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> 6 files changed, 62 insertions(+), 30 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e44ab512c3a1..b1607e314497 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c

...

> @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> bool host_writable = !fault || fault->map_writable;
> bool prefetch = !fault || fault->prefetch;
> bool write_fault = fault && fault->write;
> + bool is_refcounted = !fault || fault->is_refcounted_page;

Just wondering: what if a non-refcounted page is prefetched? Or is that even
possible in practice?

...
>
> @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> */
> static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> {
> - bool host_writable;
> + bool host_writable, is_refcounted;
> gpa_t first_pte_gpa;
> u64 *sptep, spte;
> struct kvm_memory_slot *slot;
> @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> sptep = &sp->spt[i];
> spte = *sptep;
> host_writable = spte & shadow_host_writable_mask;
> + // TODO: is this correct?
> + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> make_spte(vcpu, sp, slot, pte_access, gfn,
> spte_to_pfn(spte), spte, true, false,
> - host_writable, &spte);
> + host_writable, is_refcounted, &spte);

Could we restrict that a non-refcounted page shall not be used as shadow page?

[*] https://lore.kernel.org/all/[email protected]/

B.R.
Yu

2023-07-05 10:47:27

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> memory into the guest that is backed by un-refcounted struct pages - for
> example, higher order non-compound pages allocated by the amdgpu driver
> via ttm_pool_alloc_page.
>
> The bulk of this change is tracking the is_refcounted_page flag so that
> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> done by storing the flag in an unused bit in the sptes.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> arch/x86/kvm/mmu/mmu_internal.h | 1 +
> arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> arch/x86/kvm/mmu/spte.c | 4 ++-
> arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> 6 files changed, 62 insertions(+), 30 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e44ab512c3a1..b1607e314497 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>
> if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> flush = true;
> - kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> + if (is_refcounted_page_pte(old_spte))
> + kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
> }
>
> if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
> flush = true;
> - kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> + if (is_refcounted_page_pte(old_spte))
> + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
> }
>
> return flush;
> @@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
> * before they are reclaimed. Sanity check that, if the pfn is backed
> * by a refcounted page, the refcount is elevated.
> */
> - page = kvm_pfn_to_refcounted_page(pfn);
> - WARN_ON(page && !page_count(page));
> + if (is_refcounted_page_pte(old_spte)) {
> + page = kvm_pfn_to_refcounted_page(pfn);
> + WARN_ON(!page || !page_count(page));
> + }
>
> - if (is_accessed_spte(old_spte))
> - kvm_set_pfn_accessed(pfn);
> + if (is_refcounted_page_pte(old_spte)) {
> + if (is_accessed_spte(old_spte))
> + kvm_set_page_accessed(pfn_to_page(pfn));
>
> - if (is_dirty_spte(old_spte))
> - kvm_set_pfn_dirty(pfn);
> + if (is_dirty_spte(old_spte))
> + kvm_set_page_dirty(pfn_to_page(pfn));
> + }
>
> return old_spte;
> }
> @@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep)
> * Capture the dirty status of the page, so that it doesn't get
> * lost when the SPTE is marked for access tracking.
> */
> - if (is_writable_pte(spte))
> - kvm_set_pfn_dirty(spte_to_pfn(spte));
> + if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
> + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
>
> spte = mark_spte_for_access_track(spte);
> mmu_spte_update_no_track(sptep, spte);
> @@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
> {
> bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
> (unsigned long *)sptep);
> - if (was_writable && !spte_ad_enabled(*sptep))
> - kvm_set_pfn_dirty(spte_to_pfn(*sptep));
> + if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
> + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
>
> return was_writable;
> }
> @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> bool host_writable = !fault || fault->map_writable;
> bool prefetch = !fault || fault->prefetch;
> bool write_fault = fault && fault->write;
> + bool is_refcounted = !fault || fault->is_refcounted_page;
>
> pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
> *sptep, write_fault, gfn);
> @@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> }
>
> wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
> - true, host_writable, &spte);
> + true, host_writable, is_refcounted, &spte);
>
> if (*sptep == spte) {
> ret = RET_PF_SPURIOUS;
> @@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> struct kvm_follow_pfn foll = {
> .slot = slot,
> .gfn = fault->gfn,
> - .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
> + .flags = fault->write ? FOLL_WRITE : 0,
> .allow_write_mapping = true,
> + .guarded_by_mmu_notifier = true,
> };
>
> /*
> @@ -4317,6 +4325,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> fault->slot = NULL;
> fault->pfn = KVM_PFN_NOSLOT;
> fault->map_writable = false;
> + fault->is_refcounted_page = false;
> return RET_PF_CONTINUE;
> }
> /*
> @@ -4366,6 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> success:
> fault->hva = foll.hva;
> fault->map_writable = foll.writable;
> + fault->is_refcounted_page = foll.is_refcounted_page;
> return RET_PF_CONTINUE;
> }
>
> @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> out_unlock:
> write_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> + if (fault->is_refcounted_page)
> + kvm_set_page_accessed(pfn_to_page(fault->pfn));
> return r;
> }
>
> @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>
> out_unlock:
> read_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);

Yet kvm_release_pfn() can still be triggered for kvm_vcpu_map'ed gfns.
What if the guest uses a non-refcounted page (e.g., as a vmcs12)? Although I
believe this is not going to happen in the real world...

B.R.
Yu

2023-07-05 11:42:18

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
> >
> > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > * The slow path to get the pfn of the specified host virtual address,
> > > * 1 indicates success, -errno is returned if error is detected.
> > > */
> > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > {
> > > - unsigned int flags = FOLL_HWPOISON;
> > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > struct page *page;
> > > int npages;
> > >
> > > might_sleep();
> > >
> > > - if (writable)
> > > - *writable = write_fault;
> > > -
> > > - if (write_fault)
> > > - flags |= FOLL_WRITE;
> > > - if (async)
> > > - flags |= FOLL_NOWAIT;
> > > - if (interruptible)
> > > - flags |= FOLL_INTERRUPTIBLE;
> > > -
> > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > if (npages != 1)
> > > return npages;
> > >
> > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > +
> > > /* map read fault as writable if possible */
> > > - if (unlikely(!write_fault) && writable) {
> > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> >
> > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
>
> The two statements are logically equivalent, although I guess using
> !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> verbose.

Well, as the comment says, we want to try to map a read fault as writable
whenever possible. And __gfn_to_pfn_memslot() will only set FOLL_WRITE
for write faults. So I guess using !foll->writable will not allow this.
Did I miss anything?

> > > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > > + bool atomic, bool interruptible, bool *async,
> > > + bool write_fault, bool *writable, hva_t *hva)
> > > +{
> > > + kvm_pfn_t pfn;
> > > + struct kvm_follow_pfn foll = {
> > > + .slot = slot,
> > > + .gfn = gfn,
> > > + .flags = 0,
> > > + .atomic = atomic,
> > > + .allow_write_mapping = !!writable,
> > > + };
> > > +
> > > + if (write_fault)
> > > + foll.flags |= FOLL_WRITE;
> > > + if (async)
> > > + foll.flags |= FOLL_NOWAIT;
> > > + if (interruptible)
> > > + foll.flags |= FOLL_INTERRUPTIBLE;
> > > +
> > > + pfn = __kvm_follow_pfn(&foll);
> > > + if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> >
> > Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
> > if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
> > Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
> > seems unnecessary.
>
> There are cases where the fault does not fall within a vma, or where
> the target vma's flags don't support the fault's access permissions.
> In those cases, continuing to try to resolve the fault won't cause
> problems per se, but it's wasteful and a bit confusing. Having
> hva_to_pfn detect whether it may be possible to resolve the fault
> asynchronously, and return KVM_PFN_ERR_NEEDS_IO if so, seems like a
> good idea. It also matches what the existing code does.

Got it. Sounds reasonable. And thanks! :)

B.R.
Yu

2023-07-05 12:12:46

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> callers to resolve a gfn when the associated pfn has a valid struct page
> that isn't being actively refcounted (e.g. tail pages of non-compound
> higher order pages). For a caller to safely omit FOLL_GET, all usages of
> the returned pfn must be guarded by a mmu notifier.
>
> This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> is set when the returned pfn has an associated struct page with a valid
> refcount. Callers that don't pass FOLL_GET should remember this value
> and use it to avoid places like kvm_is_ad_tracked_page that assume a
> non-zero refcount.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 10 ++++++
> virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> virt/kvm/pfncache.c | 2 +-
> 3 files changed, 47 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ef2763c2b12e..a45308c7d2d9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> void kvm_release_page_clean(struct page *page);
> void kvm_release_page_dirty(struct page *page);
>
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
> struct kvm_follow_pfn {
> const struct kvm_memory_slot *slot;
> gfn_t gfn;
> @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> bool atomic;
> /* Allow a read fault to create a writeable mapping. */
> bool allow_write_mapping;
> + /*
> + * Usage of the returned pfn will be guared by a mmu notifier. Must
> + * be true if FOLL_GET is not set.
> + */
> + bool guarded_by_mmu_notifier;

And how is that enforced? Is there any place that checks the invalidate seq?

B.R.
Yu

2023-07-05 13:31:06

by Zhi Wang

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Tue, 4 Jul 2023 16:50:48 +0900
David Stevens <[email protected]> wrote:

> From: David Stevens <[email protected]>
>
> Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> callers to resolve a gfn when the associated pfn has a valid struct page
> that isn't being actively refcounted (e.g. tail pages of non-compound
> higher order pages). For a caller to safely omit FOLL_GET, all usages of
> the returned pfn must be guarded by a mmu notifier.
>
> This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> is set when the returned pfn has an associated struct page with a valid
> refcount. Callers that don't pass FOLL_GET should remember this value
> and use it to avoid places like kvm_is_ad_tracked_page that assume a
> non-zero refcount.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 10 ++++++
> virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> virt/kvm/pfncache.c | 2 +-
> 3 files changed, 47 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ef2763c2b12e..a45308c7d2d9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> void kvm_release_page_clean(struct page *page);
> void kvm_release_page_dirty(struct page *page);
>
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
> struct kvm_follow_pfn {
> const struct kvm_memory_slot *slot;
> gfn_t gfn;
> @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> bool atomic;
> /* Allow a read fault to create a writeable mapping. */
> bool allow_write_mapping;
> + /*
> + * Usage of the returned pfn will be guared by a mmu notifier. Must
^guarded
> + * be true if FOLL_GET is not set.
> + */
> + bool guarded_by_mmu_notifier;
>
It seems nothing sets guarded_by_mmu_notifier in this patch. Is
guarded_by_mmu_notifier always equal to !(foll->flags & FOLL_GET) and set by
the caller of __kvm_follow_pfn()?

If yes, do we have to use FOLL_GET to resolve a GFN associated with a tail
page? It seems gup can tolerate gup_flags without FOLL_GET, but that is more
like a temporary workaround. I don't think it is a good idea to play tricks
with a temporary workaround; it feels like we are abusing its tolerance.

Is a flag like guarded_by_mmu_notifier (perhaps with a better name) enough to
indicate a tail page?
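
For context, I assume the guard that callers omitting FOLL_GET are expected
to implement looks roughly like the usual KVM retry pattern, something like
this (helper names taken from the existing fault paths, not from this
series):

static int example_map_unreferenced_pfn(struct kvm *kvm,
					struct kvm_follow_pfn *foll)
{
	unsigned long mmu_seq;
	kvm_pfn_t pfn;

retry:
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	pfn = __kvm_follow_pfn(foll);	/* no FOLL_GET: no reference taken */
	if (is_error_noslot_pfn(pfn))
		return -EFAULT;

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_hva(kvm, mmu_seq, foll->hva)) {
		/* An invalidation raced with the lookup; pfn may be stale. */
		write_unlock(&kvm->mmu_lock);
		goto retry;
	}
	/* Safe to install pfn into the stage-2/TDP page tables here. */
	write_unlock(&kvm->mmu_lock);
	return 0;
}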

> /* Outputs of __kvm_follow_pfn */
> hva_t hva;
> bool writable;
> + /* True if the returned pfn is for a page with a valid refcount. */
> + bool is_refcounted_page;
> };
>
> kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b13f22861d2f..0f7b41f220b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2502,6 +2502,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> *pfn = page_to_pfn(page[0]);
> foll->writable = foll->allow_write_mapping;
> + foll->is_refcounted_page = true;
> + if (!(foll->flags & FOLL_GET))
> + put_page(page[0]);
> return true;
> }
>
> @@ -2525,6 +2528,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> return npages;
>
> foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> + foll->is_refcounted_page = true;
>
> /* map read fault as writable if possible */
> if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> @@ -2537,6 +2541,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> }
> }
> *pfn = page_to_pfn(page);
> + if (!(foll->flags & FOLL_GET))
> + put_page(page);
> return npages;
> }
>
> @@ -2551,16 +2557,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
> return true;
> }
>
> -static int kvm_try_get_pfn(kvm_pfn_t pfn)
> -{
> - struct page *page = kvm_pfn_to_refcounted_page(pfn);
> -
> - if (!page)
> - return 1;
> -
> - return get_page_unless_zero(page);
> -}
> -
> static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> kvm_pfn_t *p_pfn)
> {
> @@ -2568,6 +2564,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
> pte_t *ptep;
> spinlock_t *ptl;
> bool write_fault = foll->flags & FOLL_WRITE;
> + struct page *page;
> int r;
>
> r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> @@ -2599,24 +2596,27 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
> pfn = pte_pfn(*ptep);
>
> /*
> - * Get a reference here because callers of *hva_to_pfn* and
> - * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> - * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
> - * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> - * simply do nothing for reserved pfns.
> - *
> - * Whoever called remap_pfn_range is also going to call e.g.
> - * unmap_mapping_range before the underlying pages are freed,
> - * causing a call to our MMU notifier.
> + * Now deal with reference counting. If kvm_pfn_to_refcounted_page
> + * returns NULL, then there's no refcount to worry about.
> *
> - * Certain IO or PFNMAP mappings can be backed with valid
> - * struct pages, but be allocated without refcounting e.g.,
> - * tail pages of non-compound higher order allocations, which
> - * would then underflow the refcount when the caller does the
> - * required put_page. Don't allow those pages here.
> + * Otherwise, certain IO or PFNMAP mappings can be backed with valid
> + * struct pages but be allocated without refcounting e.g., tail pages of
> + * non-compound higher order allocations. If FOLL_GET is set and we
> + * increment such a refcount, then when that pfn is eventually passed to
> + * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
> + * freed. Therefore don't allow those pages here when FOLL_GET is set.
> */
> - if (!kvm_try_get_pfn(pfn))
> + page = kvm_pfn_to_refcounted_page(pfn);
> + if (!page)
> + goto out;
> +
> + if (get_page_unless_zero(page)) {
> + foll->is_refcounted_page = true;
> + if (!(foll->flags & FOLL_GET))
> + put_page(page);
> + } else if (foll->flags & FOLL_GET) {
> r = -EFAULT;
> + }
>
> out:
> pte_unmap_unlock(ptep, ptl);
> @@ -2693,6 +2693,9 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>
> kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> {
> + if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll->guarded_by_mmu_notifier))
> + return KVM_PFN_ERR_FAULT;
> +
> foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> foll->flags & FOLL_WRITE);
>
> @@ -2717,7 +2720,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> struct kvm_follow_pfn foll = {
> .slot = slot,
> .gfn = gfn,
> - .flags = 0,
> + .flags = FOLL_GET,
> .atomic = atomic,
> .allow_write_mapping = !!writable,
> };
> @@ -2749,7 +2752,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> struct kvm_follow_pfn foll = {
> .slot = gfn_to_memslot(kvm, gfn),
> .gfn = gfn,
> - .flags = write_fault ? FOLL_WRITE : 0,
> + .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
> .allow_write_mapping = !!writable,
> };
> pfn = __kvm_follow_pfn(&foll);
> @@ -2764,7 +2767,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
> struct kvm_follow_pfn foll = {
> .slot = slot,
> .gfn = gfn,
> - .flags = FOLL_WRITE,
> + .flags = FOLL_GET | FOLL_WRITE,
> };
> return __kvm_follow_pfn(&foll);
> }
> @@ -2775,7 +2778,7 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
> struct kvm_follow_pfn foll = {
> .slot = slot,
> .gfn = gfn,
> - .flags = FOLL_WRITE,
> + .flags = FOLL_GET | FOLL_WRITE,
> .atomic = true,
> };
> return __kvm_follow_pfn(&foll);
> @@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
> return !PageReserved(page);
> }
>
> -static void kvm_set_page_dirty(struct page *page)
> +void kvm_set_page_dirty(struct page *page)
> {
> if (kvm_is_ad_tracked_page(page))
> SetPageDirty(page);
> }
> +EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
>
> -static void kvm_set_page_accessed(struct page *page)
> +void kvm_set_page_accessed(struct page *page)
> {
> if (kvm_is_ad_tracked_page(page))
> mark_page_accessed(page);
> }
> +EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
>
> void kvm_release_page_clean(struct page *page)
> {
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index e3fefa753a51..87caafce3dd0 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -147,7 +147,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> struct kvm_follow_pfn foll = {
> .slot = gpc->memslot,
> .gfn = gpa_to_gfn(gpc->gpa),
> - .flags = FOLL_WRITE,
> + .flags = FOLL_WRITE | FOLL_GET,
> .hva = gpc->uhva,
> };
>


2023-07-05 14:21:34

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > */
> > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> > {
> > - bool host_writable;
> > + bool host_writable, is_refcounted;
> > gpa_t first_pte_gpa;
> > u64 *sptep, spte;
> > struct kvm_memory_slot *slot;
> > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> > sptep = &sp->spt[i];
> > spte = *sptep;
> > host_writable = spte & shadow_host_writable_mask;
> > + // TODO: is this correct?
> > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > make_spte(vcpu, sp, slot, pte_access, gfn,
> > spte_to_pfn(spte), spte, true, false,
> > - host_writable, &spte);
> > + host_writable, is_refcounted, &spte);
>
> Could we restrict that a non-refcounted page shall not be used as shadow page?

Oh, sorry. It's not about the shadow page; it's about the guest page being
mapped as non-refcounted. Silly me...

B.R.
Yu

2023-07-06 01:56:53

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Tue, Jul 04, 2023 at 04:50:47PM +0900,
David Stevens <[email protected]> wrote:

> From: David Stevens <[email protected]>
>
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 16 ++++
> virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> virt/kvm/kvm_mm.h | 3 +-
> virt/kvm/pfncache.c | 8 +-
> 4 files changed, 122 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
> #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
>
> /*
> * error pfns indicate that the gfn is in slot but faild to
> @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> void kvm_release_page_clean(struct page *page);
> void kvm_release_page_dirty(struct page *page);
>
> +struct kvm_follow_pfn {
> + const struct kvm_memory_slot *slot;
> + gfn_t gfn;
> + unsigned int flags;
> + bool atomic;
> + /* Allow a read fault to create a writeable mapping. */
> + bool allow_write_mapping;

Maybe make the input arguments const?


> +
> + /* Outputs of __kvm_follow_pfn */
> + hva_t hva;
> + bool writable;
> +};
> +
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> +
> kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 371bd783ff2b..b13f22861d2f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> * true indicates success, otherwise false is returned. It's also the
> * only part that runs if we can in atomic context.
> */
> -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *pfn)
> +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> struct page *page[1];
> + bool write_fault = foll->flags & FOLL_WRITE;
>
> /*
> * Fast pin a writable pfn only if it is a write fault request
> * or the caller allows to map a writable pfn for a read fault
> * request.
> */
> - if (!(write_fault || writable))
> + if (!(write_fault || foll->allow_write_mapping))
> return false;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> *pfn = page_to_pfn(page[0]);
> -
> - if (writable)
> - *writable = true;
> + foll->writable = foll->allow_write_mapping;
> return true;
> }
>
> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> * The slow path to get the pfn of the specified host virtual address,
> * 1 indicates success, -errno is returned if error is detected.
> */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> - unsigned int flags = FOLL_HWPOISON;
> + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;

Although adding FOLL_GET doesn't affect the behavior of
get_user_pages_unlocked(), I wondered how this affects the next change;
it would be better to mention it in the commit message.
The get_user_pages_*() calls in hva_to_pfn_{fast, slow} imply FOLL_GET,
but __kvm_follow_pfn() doesn't imply FOLL_GET.
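
In other words, the next patch keeps FOLL_GET for the gup call itself and
then drops that reference when the __kvm_follow_pfn() caller didn't ask for
it, roughly:

	/* Condensed from hva_to_pfn_slow() as of the next patch: */
	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
	struct page *page;
	int npages;

	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
	if (npages != 1)
		return npages;
	/* ... writable handling elided ... */
	*pfn = page_to_pfn(page);
	/*
	 * gup always took a reference above; drop it again if the
	 * __kvm_follow_pfn() caller did not ask for FOLL_GET and is
	 * relying on the mmu notifier instead.
	 */
	if (!(foll->flags & FOLL_GET))
		put_page(page);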


> struct page *page;
> int npages;
>
> might_sleep();
>
> - if (writable)
> - *writable = write_fault;
> -
> - if (write_fault)
> - flags |= FOLL_WRITE;
> - if (async)
> - flags |= FOLL_NOWAIT;
> - if (interruptible)
> - flags |= FOLL_INTERRUPTIBLE;
> -
> - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> if (npages != 1)
> return npages;
>
> + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
> /* map read fault as writable if possible */
> - if (unlikely(!write_fault) && writable) {
> + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> struct page *wpage;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> - *writable = true;
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> + foll->writable = true;
> put_page(page);
> page = wpage;
> }
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> return get_page_unless_zero(page);
> }
>
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> - unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> + kvm_pfn_t *p_pfn)
> {
> kvm_pfn_t pfn;
> pte_t *ptep;
> spinlock_t *ptl;
> + bool write_fault = foll->flags & FOLL_WRITE;
> int r;
>
> - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> if (r) {
> /*
> * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> * not call the fault handler, so do it here.
> */
> bool unlocked = false;
> - r = fixup_user_fault(current->mm, addr,
> + r = fixup_user_fault(current->mm, foll->hva,
> (write_fault ? FAULT_FLAG_WRITE : 0),
> &unlocked);
> if (unlocked)
> @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> if (r)
> return r;
>
> - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> if (r)
> return r;
> }
> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> goto out;
> }
>
> - if (writable)
> - *writable = pte_write(*ptep);
> + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> pfn = pte_pfn(*ptep);
>
> /*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> * 2): @write_fault = false && @writable, @writable will tell the caller
> * whether the mapping is writable.
> */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> {
> struct vm_area_struct *vma;
> kvm_pfn_t pfn;
> int npages, r;
>
> /* we can do it either atomically or asynchronously, not both */
> - BUG_ON(atomic && async);
> + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>
> - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> + if (hva_to_pfn_fast(foll, &pfn))
> return pfn;
>
> - if (atomic)
> + if (foll->atomic)
> return KVM_PFN_ERR_FAULT;
>
> - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> - writable, &pfn);
> + npages = hva_to_pfn_slow(foll, &pfn);
> if (npages == 1)
> return pfn;
> if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>
> mmap_read_lock(current->mm);
> if (npages == -EHWPOISON ||
> - (!async && check_user_page_hwpoison(addr))) {
> + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> pfn = KVM_PFN_ERR_HWPOISON;
> goto exit;
> }
>
> retry:
> - vma = vma_lookup(current->mm, addr);
> + vma = vma_lookup(current->mm, foll->hva);
>
> if (vma == NULL)
> pfn = KVM_PFN_ERR_FAULT;
> else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> + r = hva_to_pfn_remapped(vma, foll, &pfn);
> if (r == -EAGAIN)
> goto retry;
> if (r < 0)
> pfn = KVM_PFN_ERR_FAULT;
> } else {
> - if (async && vma_is_valid(vma, write_fault))
> - *async = true;
> - pfn = KVM_PFN_ERR_FAULT;
> + if ((foll->flags & FOLL_NOWAIT) &&
> + vma_is_valid(vma, foll->flags & FOLL_WRITE))
> + pfn = KVM_PFN_ERR_NEEDS_IO;
> + else
> + pfn = KVM_PFN_ERR_FAULT;
> }
> exit:
> mmap_read_unlock(current->mm);
> return pfn;
> }
>
> -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> - bool atomic, bool interruptible, bool *async,
> - bool write_fault, bool *writable, hva_t *hva)
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> {
> - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> -
> - if (hva)
> - *hva = addr;
> + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> + foll->flags & FOLL_WRITE);
>
> - if (addr == KVM_HVA_ERR_RO_BAD) {
> - if (writable)
> - *writable = false;
> + if (foll->hva == KVM_HVA_ERR_RO_BAD)
> return KVM_PFN_ERR_RO_FAULT;
> - }
>
> - if (kvm_is_error_hva(addr)) {
> - if (writable)
> - *writable = false;
> + if (kvm_is_error_hva(foll->hva))
> return KVM_PFN_NOSLOT;
> - }
>
> - /* Do not map writable pfn in the readonly memslot. */
> - if (writable && memslot_is_readonly(slot)) {
> - *writable = false;
> - writable = NULL;
> - }
> + if (memslot_is_readonly(foll->slot))
> + foll->allow_write_mapping = false;
> +
> + return hva_to_pfn(foll);
> +}
> +EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
>
> - return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> - writable);
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> + bool atomic, bool interruptible, bool *async,
> + bool write_fault, bool *writable, hva_t *hva)
> +{
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = 0,
> + .atomic = atomic,
> + .allow_write_mapping = !!writable,
> + };
> +
> + if (write_fault)
> + foll.flags |= FOLL_WRITE;
> + if (async)
> + foll.flags |= FOLL_NOWAIT;
> + if (interruptible)
> + foll.flags |= FOLL_INTERRUPTIBLE;
> +
> + pfn = __kvm_follow_pfn(&foll);
> + if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> + *async = true;
> + pfn = KVM_PFN_ERR_FAULT;
> + }
> + if (hva)
> + *hva = foll.hva;
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>
> kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable)
> {
> - return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
> - NULL, write_fault, writable, NULL);
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = gfn_to_memslot(kvm, gfn),
> + .gfn = gfn,
> + .flags = write_fault ? FOLL_WRITE : 0,
> + .allow_write_mapping = !!writable,
> + };
> + pfn = __kvm_follow_pfn(&foll);
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>
> kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> - return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
> - NULL, NULL);
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = FOLL_WRITE,
> + };
> + return __kvm_follow_pfn(&foll);
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>
> kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> - return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
> - NULL, NULL);
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = FOLL_WRITE,
> + .atomic = true,
> + };
> + return __kvm_follow_pfn(&foll);
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index 180f1a09e6ba..ed896aee5396 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,8 +20,7 @@
> #define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock)
> #endif /* KVM_HAVE_MMU_RWLOCK */
>
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable);
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
>
> #ifdef CONFIG_HAVE_KVM_PFNCACHE
> void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 2d6aba677830..e3fefa753a51 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
> void *new_khva = NULL;
> unsigned long mmu_seq;
> + struct kvm_follow_pfn foll = {
> + .slot = gpc->memslot,
> + .gfn = gpa_to_gfn(gpc->gpa),
> + .flags = FOLL_WRITE,
> + .hva = gpc->uhva,
> + };
>
> lockdep_assert_held(&gpc->refresh_lock);
>
> @@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> }
>
> /* We always request a writeable mapping */
> - new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
> + new_pfn = hva_to_pfn(&foll);
> if (is_error_noslot_pfn(new_pfn))
> goto out_error;
>
> --
> 2.41.0.255.g8b1d071c50-goog
>

--
Isaku Yamahata <[email protected]>

2023-07-06 02:15:02

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 04, 2023 at 04:50:50PM +0900,
David Stevens <[email protected]> wrote:

> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index cf2c6426a6fc..46c681dc45e6 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> const struct kvm_memory_slot *slot,
> unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
> u64 old_spte, bool prefetch, bool can_unsync,
> - bool host_writable, u64 *new_spte)
> + bool host_writable, bool is_refcounted, u64 *new_spte)
> {
> int level = sp->role.level;
> u64 spte = SPTE_MMU_PRESENT_MASK;
> @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>
> if (level > PG_LEVEL_4K)
> spte |= PT_PAGE_SIZE_MASK;
> + else if (is_refcounted)
> + spte |= SPTE_MMU_PAGE_REFCOUNTED;

Is REFCOUNTED for 4K page only? What guarantees that large page doesn't have
FOLL_GET? or can we set the bit for large page?


>
> if (shadow_memtype_mask)
> spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,

--
Isaku Yamahata <[email protected]>

2023-07-06 05:03:22

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > memory into the guest that is backed by un-refcounted struct pages - for
> > example, higher order non-compound pages allocated by the amdgpu driver
> > via ttm_pool_alloc_page.
>
> I guess you mean the tail pages of the higher order non-compound pages?
> And as to the head page, it is said to be set to one coincidentally[*],
> and shall not be considered as refcounted. IIUC, refcount of this head
> page will be increased and decreased soon in hva_to_pfn_remapped(), so
> this may not be a problem(?). But treating this head page differently,
> as a refcounted one(e.g., to set the A/D flags), is weired.
>
> Or maybe I missed some context, e.g., can the head page be allocted to
> guest at all?

Yes, this is to allow mapping the tail pages of higher order
non-compound pages - I should have been more precise in my wording.
The head pages can already be mapped into the guest.

Treating the head and tail pages the same would require changing how
KVM behaves in a situation it supports today (rather than just adding
support for an unsupported situation). Currently, without this series,
KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
guest. When that happens, KVM sets the A/D flags. I'm not sure whether
that's actually valid behavior, nor do I know whether anyone actually
cares about it. But it's what KVM does today, and I would shy away
from modifying that behavior without good reason.

> >
> > The bulk of this change is tracking the is_refcounted_page flag so that
> > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > done by storing the flag in an unused bit in the sptes.
>
> Also, maybe we should mention this only works on x86-64.
>
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> > arch/x86/kvm/mmu/mmu_internal.h | 1 +
> > arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> > arch/x86/kvm/mmu/spte.c | 4 ++-
> > arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> > 6 files changed, 62 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e44ab512c3a1..b1607e314497 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
>
> ...
>
> > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > bool host_writable = !fault || fault->map_writable;
> > bool prefetch = !fault || fault->prefetch;
> > bool write_fault = fault && fault->write;
> > + bool is_refcounted = !fault || fault->is_refcounted_page;
>
> Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> practice?

Prefetching is still done via gfn_to_page_many_atomic, which sets
FOLL_GET. That's fixable, but it's not something this series currently
does.
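
For reference, the prefetch path in question looks roughly like this
(paraphrased from direct_pte_prefetch_many(), not the exact kernel code):

	struct page *pages[PTE_PREFETCH_NUM];
	int ret;

	/*
	 * gfn_to_page_many_atomic() takes a reference on each page it
	 * returns, so only refcounted pages can reach the prefetch path
	 * today.
	 */
	ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
	if (ret <= 0)
		return -1;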

> ...
> >
> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > */
> > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> > {
> > - bool host_writable;
> > + bool host_writable, is_refcounted;
> > gpa_t first_pte_gpa;
> > u64 *sptep, spte;
> > struct kvm_memory_slot *slot;
> > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> > sptep = &sp->spt[i];
> > spte = *sptep;
> > host_writable = spte & shadow_host_writable_mask;
> > + // TODO: is this correct?
> > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > make_spte(vcpu, sp, slot, pte_access, gfn,
> > spte_to_pfn(spte), spte, true, false,
> > - host_writable, &spte);
> > + host_writable, is_refcounted, &spte);
>
> Could we restrict that a non-refcounted page shall not be used as shadow page?

I'm not very familiar with the shadow mmu, so my response might not
make sense. But do you mean not allowing non-refcounted pages as the
guest page tables shadowed by a kvm_mmu_page? It would probably be
possible to do that, and I doubt anyone would care about the
restriction. But as far as I can tell, the guest page table is only
accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted
pages just fine.

-David

2023-07-06 05:36:00

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Thu, Jul 6, 2023 at 11:10 AM Isaku Yamahata <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900,
> David Stevens <[email protected]> wrote:
>
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index cf2c6426a6fc..46c681dc45e6 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > const struct kvm_memory_slot *slot,
> > unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
> > u64 old_spte, bool prefetch, bool can_unsync,
> > - bool host_writable, u64 *new_spte)
> > + bool host_writable, bool is_refcounted, u64 *new_spte)
> > {
> > int level = sp->role.level;
> > u64 spte = SPTE_MMU_PRESENT_MASK;
> > @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >
> > if (level > PG_LEVEL_4K)
> > spte |= PT_PAGE_SIZE_MASK;
> > + else if (is_refcounted)
> > + spte |= SPTE_MMU_PAGE_REFCOUNTED;
>
> Is REFCOUNTED for 4K page only? What guarantees that large page doesn't have
> FOLL_GET? or can we set the bit for large page?

Oh, you're right, it should apply to >4K pages as well. This was based
on stale thinking from earlier versions of this series.

-David

2023-07-06 05:36:36

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <[email protected]> wrote:
>
> On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
> > >
> > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > > * The slow path to get the pfn of the specified host virtual address,
> > > > * 1 indicates success, -errno is returned if error is detected.
> > > > */
> > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > > {
> > > > - unsigned int flags = FOLL_HWPOISON;
> > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > > struct page *page;
> > > > int npages;
> > > >
> > > > might_sleep();
> > > >
> > > > - if (writable)
> > > > - *writable = write_fault;
> > > > -
> > > > - if (write_fault)
> > > > - flags |= FOLL_WRITE;
> > > > - if (async)
> > > > - flags |= FOLL_NOWAIT;
> > > > - if (interruptible)
> > > > - flags |= FOLL_INTERRUPTIBLE;
> > > > -
> > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > > if (npages != 1)
> > > > return npages;
> > > >
> > > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > +
> > > > /* map read fault as writable if possible */
> > > > - if (unlikely(!write_fault) && writable) {
> > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > >
> > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> >
> > The two statements are logically equivalent, although I guess using
> > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > verbose.
>
> Well, as the comment says, we wanna try to map the read fault as writable
> whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> for write faults. So I guess using !foll->writable will not allow this.
> Did I miss anything?

We just set the foll->writable out parameter to be equal to
((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a =
foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have
!(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a &&
b.
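
Or, restating the same thing as code (illustration only):

	bool a = foll->flags & FOLL_WRITE;	/* this is a write fault */
	bool b = foll->allow_write_mapping;

	foll->writable = a && b;
	/*
	 * So the later "map read fault as writable" check,
	 * (!foll->writable && b), reduces to (!a && b): a read fault whose
	 * caller allows upgrading to a writable mapping.
	 */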

-David

2023-07-06 06:13:32

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Thu, Jul 6, 2023 at 10:34 AM Isaku Yamahata <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:47PM +0900,
> David Stevens <[email protected]> wrote:
>
> > From: David Stevens <[email protected]>
> >
> > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > where possible, combines the boolean arguments into a single flags
> > argument.
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > include/linux/kvm_host.h | 16 ++++
> > virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> > virt/kvm/kvm_mm.h | 3 +-
> > virt/kvm/pfncache.c | 8 +-
> > 4 files changed, 122 insertions(+), 76 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 9d3ac7720da9..ef2763c2b12e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -97,6 +97,7 @@
> > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> >
> > /*
> > * error pfns indicate that the gfn is in slot but faild to
> > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > void kvm_release_page_clean(struct page *page);
> > void kvm_release_page_dirty(struct page *page);
> >
> > +struct kvm_follow_pfn {
> > + const struct kvm_memory_slot *slot;
> > + gfn_t gfn;
> > + unsigned int flags;
> > + bool atomic;
> > + /* Allow a read fault to create a writeable mapping. */
> > + bool allow_write_mapping;
>
> Maybe, make them const for input arguments?

Unfortunately using const isn't straightforward as long as the kernel
continues to use -Wdeclaration-after-statement. If these fields were
const, then they would need to be specified in the initializer when
declaring the variable, but that's not necessarily always possible.
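
For example (an illustrative snippet, not from the series), a const member
would have to be known at the point of declaration:

	struct kvm_follow_pfn foll = {
		.slot = slot,
		/* if .gfn were const, its value would have to go here */
	};

	/* ... code that computes the gfn ... */

	foll.gfn = gfn;	/* error: assignment of read-only member 'gfn' */

and with -Wdeclaration-after-statement the declaration can't simply be
moved down past that computation.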

-David

2023-07-06 06:46:44

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Wed, Jul 5, 2023 at 8:56 PM Yu Zhang <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > callers to resolve a gfn when the associated pfn has a valid struct page
> > that isn't being actively refcounted (e.g. tail pages of non-compound
> > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > the returned pfn must be guarded by a mmu notifier.
> >
> > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > is set when the returned pfn has an associated struct page with a valid
> > refcount. Callers that don't pass FOLL_GET should remember this value
> > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > non-zero refcount.
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > include/linux/kvm_host.h | 10 ++++++
> > virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> > virt/kvm/pfncache.c | 2 +-
> > 3 files changed, 47 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ef2763c2b12e..a45308c7d2d9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > void kvm_release_page_clean(struct page *page);
> > void kvm_release_page_dirty(struct page *page);
> >
> > +void kvm_set_page_accessed(struct page *page);
> > +void kvm_set_page_dirty(struct page *page);
> > +
> > struct kvm_follow_pfn {
> > const struct kvm_memory_slot *slot;
> > gfn_t gfn;
> > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> > bool atomic;
> > /* Allow a read fault to create a writeable mapping. */
> > bool allow_write_mapping;
> > + /*
> > + * Usage of the returned pfn will be guared by a mmu notifier. Must
> > + * be true if FOLL_GET is not set.
> > + */
> > + bool guarded_by_mmu_notifier;
>
> And how? Any place to check the invalidate seq?

kvm_follow_pfn can't meaningfully validate the seq number, since the
mmu notifier locking is handled by the caller. This is more of a
sanity check that the API is being used properly, as proposed here
[1]. I did deviate from the proposal with a bool instead of some type
of integer, since the exact value of mmu_seq wouldn't be useful.

[1] https://lore.kernel.org/all/[email protected]/#t
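
To make the division of labor concrete, the expected calling pattern is
roughly the following sketch (borrowing the existing x86 fault path's
names; details elided):

	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	foll.guarded_by_mmu_notifier = true;	/* caller promises the check below */
	pfn = __kvm_follow_pfn(&foll);

	write_lock(&vcpu->kvm->mmu_lock);
	if (mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, foll.hva))
		goto retry;	/* raced with an invalidation */
	/* ... install the spte while holding mmu_lock ... */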

-David

2023-07-06 07:13:28

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang <[email protected]> wrote:
>
> On Tue, 4 Jul 2023 16:50:48 +0900
> David Stevens <[email protected]> wrote:
>
> > From: David Stevens <[email protected]>
> >
> > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > callers to resolve a gfn when the associated pfn has a valid struct page
> > that isn't being actively refcounted (e.g. tail pages of non-compound
> > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > the returned pfn must be guarded by a mmu notifier.
> >
> > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > is set when the returned pfn has an associated struct page with a valid
> > refcount. Callers that don't pass FOLL_GET should remember this value
> > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > non-zero refcount.
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > include/linux/kvm_host.h | 10 ++++++
> > virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> > virt/kvm/pfncache.c | 2 +-
> > 3 files changed, 47 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ef2763c2b12e..a45308c7d2d9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > void kvm_release_page_clean(struct page *page);
> > void kvm_release_page_dirty(struct page *page);
> >
> > +void kvm_set_page_accessed(struct page *page);
> > +void kvm_set_page_dirty(struct page *page);
> > +
> > struct kvm_follow_pfn {
> > const struct kvm_memory_slot *slot;
> > gfn_t gfn;
> > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> > bool atomic;
> > /* Allow a read fault to create a writeable mapping. */
> > bool allow_write_mapping;
> > + /*
> > + * Usage of the returned pfn will be guared by a mmu notifier. Must
> ^guarded
> > + * be true if FOLL_GET is not set.
> > + */
> > + bool guarded_by_mmu_notifier;
> >
> It seems no one sets the guraded_by_mmu_notifier in this patch. Is
> guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the
> caller of __kvm_follow_pfn()?

Yes, this is the case.

> If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> temporary solution. I don't think it is a good idea to play tricks with
> a temporary solution, more like we are abusing the toleration.

I'm not sure I understand what you're getting at. This series never
calls gup without FOLL_GET.

This series aims to provide kvm_follow_pfn as a unified API on top of
gup+follow_pte. Since one of the major clients of this API uses an mmu
notifier, it makes sense to support returning a pfn without taking a
reference. And we indeed need to do that for certain types of memory.

> Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to
> indicate a tail page?

What do you mean by "indicate a tail page"? Do you mean indicate
that the returned pfn refers to a non-refcounted page? That's specified
by is_refcounted_page.

-David

2023-07-06 07:39:50

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Thu, Jul 06, 2023 at 01:52:08PM +0900, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > From: David Stevens <[email protected]>
> > >
> > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > memory into the guest that is backed by un-refcounted struct pages - for
> > > example, higher order non-compound pages allocated by the amdgpu driver
> > > via ttm_pool_alloc_page.
> >
> > I guess you mean the tail pages of the higher order non-compound pages?
> > And as to the head page, it is said to be set to one coincidentally[*],
> > and shall not be considered as refcounted. IIUC, refcount of this head
> > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > this may not be a problem(?). But treating this head page differently,
> > as a refcounted one(e.g., to set the A/D flags), is weired.
> >
> > Or maybe I missed some context, e.g., can the head page be allocted to
> > guest at all?
>
> Yes, this is to allow mapping the tail pages of higher order
> non-compound pages - I should have been more precise in my wording.
> The head pages can already be mapped into the guest.
>
> Treating the head and tail pages would require changing how KVM
> behaves in a situation it supports today (rather than just adding
> support for an unsupported situation). Currently, without this series,
> KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> that's actually valid behavior, nor do I know whether anyone actually
> cares about it. But it's what KVM does today, and I would shy away
> from modifying that behavior without good reason.

I know the A/D status of the refcounted, VM_PFNMAP|VM_IO backed pages
will be recorded. And I have no idea if this is a necessary requirement
either.

But it feels awkward to see the head and tail pages of non-compound
allocations being treated inconsistently. After all, the head page just
happens to have a refcount of 1; it is not a real refcounted page.

So I would suggest at least mentioning this difference in behavior in
the commit message. :)

> > >
> > > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > */
> > > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> > > {
> > > - bool host_writable;
> > > + bool host_writable, is_refcounted;
> > > gpa_t first_pte_gpa;
> > > u64 *sptep, spte;
> > > struct kvm_memory_slot *slot;
> > > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> > > sptep = &sp->spt[i];
> > > spte = *sptep;
> > > host_writable = spte & shadow_host_writable_mask;
> > > + // TODO: is this correct?
> > > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> > > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > > make_spte(vcpu, sp, slot, pte_access, gfn,
> > > spte_to_pfn(spte), spte, true, false,
> > > - host_writable, &spte);
> > > + host_writable, is_refcounted, &spte);
> >
> > Could we restrict that a non-refcounted page shall not be used as shadow page?
>
> I'm not very familiar with the shadow mmu, so my response might not
> make sense. But do you mean not allowing non-refcoutned pages as the
> guest page tables shadowed by a kvm_mmu_page? It would probably be
> possible to do that, and I doubt anyone would care about the
> restriction. But as far as I can tell, the guest page table is only
> accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted
> pages just fine.

Sorry, my brain just got baked... Pls just ignore this question :)

B.R.
Yu

2023-07-06 15:18:43

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Thu, Jul 06, 2023 at 02:29:24PM +0900, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <[email protected]> wrote:
> >
> > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
> > > >
> > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > > > * The slow path to get the pfn of the specified host virtual address,
> > > > > * 1 indicates success, -errno is returned if error is detected.
> > > > > */
> > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > > > {
> > > > > - unsigned int flags = FOLL_HWPOISON;
> > > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > > > struct page *page;
> > > > > int npages;
> > > > >
> > > > > might_sleep();
> > > > >
> > > > > - if (writable)
> > > > > - *writable = write_fault;
> > > > > -
> > > > > - if (write_fault)
> > > > > - flags |= FOLL_WRITE;
> > > > > - if (async)
> > > > > - flags |= FOLL_NOWAIT;
> > > > > - if (interruptible)
> > > > > - flags |= FOLL_INTERRUPTIBLE;
> > > > > -
> > > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > > > if (npages != 1)
> > > > > return npages;
> > > > >
> > > > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > > +
> > > > > /* map read fault as writable if possible */
> > > > > - if (unlikely(!write_fault) && writable) {
> > > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > >
> > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> > >
> > > The two statements are logically equivalent, although I guess using
> > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > > verbose.
> >
> > Well, as the comment says, we wanna try to map the read fault as writable
> > whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> > for write faults. So I guess using !foll->writable will not allow this.
> > Did I miss anything?
>
> We just set the foll->writable out parameter to be equal to
> ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a =
> foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have
> !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a &&
> b.

Ouch, my bad again... I typed "!foll->writable", but missed the "!" in
my head while calculating... Thanks! :)

B.R.
Yu

2023-07-06 16:29:28

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Thu, Jul 06, 2023 at 01:52:08PM +0900,
David Stevens <[email protected]> wrote:

> On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > From: David Stevens <[email protected]>
> > >
> > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > memory into the guest that is backed by un-refcounted struct pages - for
> > > example, higher order non-compound pages allocated by the amdgpu driver
> > > via ttm_pool_alloc_page.
> >
> > I guess you mean the tail pages of the higher order non-compound pages?
> > And as to the head page, it is said to be set to one coincidentally[*],
> > and shall not be considered as refcounted. IIUC, refcount of this head
> > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > this may not be a problem(?). But treating this head page differently,
> > as a refcounted one(e.g., to set the A/D flags), is weired.
> >
> > Or maybe I missed some context, e.g., can the head page be allocted to
> > guest at all?
>
> Yes, this is to allow mapping the tail pages of higher order
> non-compound pages - I should have been more precise in my wording.
> The head pages can already be mapped into the guest.
>
> Treating the head and tail pages would require changing how KVM
> behaves in a situation it supports today (rather than just adding
> support for an unsupported situation). Currently, without this series,
> KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> that's actually valid behavior, nor do I know whether anyone actually
> cares about it. But it's what KVM does today, and I would shy away
> from modifying that behavior without good reason.
>
> > >
> > > The bulk of this change is tracking the is_refcounted_page flag so that
> > > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > > done by storing the flag in an unused bit in the sptes.
> >
> > Also, maybe we should mention this only works on x86-64.
> >
> > >
> > > Signed-off-by: David Stevens <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> > > arch/x86/kvm/mmu/mmu_internal.h | 1 +
> > > arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> > > arch/x86/kvm/mmu/spte.c | 4 ++-
> > > arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> > > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> > > 6 files changed, 62 insertions(+), 30 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index e44ab512c3a1..b1607e314497 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> >
> > ...
> >
> > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > bool host_writable = !fault || fault->map_writable;
> > > bool prefetch = !fault || fault->prefetch;
> > > bool write_fault = fault && fault->write;
> > > + bool is_refcounted = !fault || fault->is_refcounted_page;
> >
> > Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> > practice?
>
> Prefetching is still done via gfn_to_page_many_atomic, which sets
> FOLL_GET. That's fixable, but it's not something this series currently
> does.

So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent
whether the corresponding page is ref-countable or not, right?

Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
the issue, though.

Thanks,
--
Isaku Yamahata <[email protected]>

2023-07-07 02:48:24

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Fri, Jul 7, 2023 at 12:58 AM Isaku Yamahata <[email protected]> wrote:
>
> On Thu, Jul 06, 2023 at 01:52:08PM +0900,
> David Stevens <[email protected]> wrote:
>
> > On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
> > >
> > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > > From: David Stevens <[email protected]>
> > > >
> > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > > memory into the guest that is backed by un-refcounted struct pages - for
> > > > example, higher order non-compound pages allocated by the amdgpu driver
> > > > via ttm_pool_alloc_page.
> > >
> > > I guess you mean the tail pages of the higher order non-compound pages?
> > > And as to the head page, it is said to be set to one coincidentally[*],
> > > and shall not be considered as refcounted. IIUC, refcount of this head
> > > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > > this may not be a problem(?). But treating this head page differently,
> > > as a refcounted one(e.g., to set the A/D flags), is weired.
> > >
> > > Or maybe I missed some context, e.g., can the head page be allocted to
> > > guest at all?
> >
> > Yes, this is to allow mapping the tail pages of higher order
> > non-compound pages - I should have been more precise in my wording.
> > The head pages can already be mapped into the guest.
> >
> > Treating the head and tail pages would require changing how KVM
> > behaves in a situation it supports today (rather than just adding
> > support for an unsupported situation). Currently, without this series,
> > KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> > guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> > that's actually valid behavior, nor do I know whether anyone actually
> > cares about it. But it's what KVM does today, and I would shy away
> > from modifying that behavior without good reason.
> >
> > > >
> > > > The bulk of this change is tracking the is_refcounted_page flag so that
> > > > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > > > done by storing the flag in an unused bit in the sptes.
> > >
> > > Also, maybe we should mention this only works on x86-64.
> > >
> > > >
> > > > Signed-off-by: David Stevens <[email protected]>
> > > > ---
> > > > arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> > > > arch/x86/kvm/mmu/mmu_internal.h | 1 +
> > > > arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> > > > arch/x86/kvm/mmu/spte.c | 4 ++-
> > > > arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> > > > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> > > > 6 files changed, 62 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e44ab512c3a1..b1607e314497 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > >
> > > ...
> > >
> > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > > bool host_writable = !fault || fault->map_writable;
> > > > bool prefetch = !fault || fault->prefetch;
> > > > bool write_fault = fault && fault->write;
> > > > + bool is_refcounted = !fault || fault->is_refcounted_page;
> > >
> > > Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> > > practice?
> >
> > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > FOLL_GET. That's fixable, but it's not something this series currently
> > does.
>
> So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
> hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
> spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent
> whether the corresponding page is ref-countable or not, right?
>
> Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
> is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
> the issue, though.
>

direct_pte_prefetch_many and prefetch_gpte both pass NULL for the
fault parameter, so is_refcounted will evaluate to true. So the spte's
refcounted bit will get set in that case.

-David

2023-07-10 17:02:57

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Fri, Jul 07, 2023 at 10:35:02AM +0900,
David Stevens <[email protected]> wrote:

> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index e44ab512c3a1..b1607e314497 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > >
> > > > ...
> > > >
> > > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > > > bool host_writable = !fault || fault->map_writable;
> > > > > bool prefetch = !fault || fault->prefetch;
> > > > > bool write_fault = fault && fault->write;
> > > > > + bool is_refcounted = !fault || fault->is_refcounted_page;
> > > >
> > > > Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> > > > practice?
> > >
> > > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > > FOLL_GET. That's fixable, but it's not something this series currently
> > > does.
> >
> > So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
> > hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
> > spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent
> > whether the corresponding page is ref-countable or not, right?
> >
> > Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
> > is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
> > the issue, though.
> >
>
> direct_pte_prefetch_many and prefetch_gpte both pass NULL for the
> fault parameter, so is_refcounted will evaluate to true. So the spte's
> refcounted bit will get set in that case.

Oops, my bad. My point is the "unconditionally" part. Is the bit always
set, even for non-refcountable pages? Or are non-refcountable pages
simply not prefetched?
--
Isaku Yamahata <[email protected]>

2023-07-11 03:27:24

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 11, 2023 at 1:34 AM Isaku Yamahata <[email protected]> wrote:
>
> On Fri, Jul 07, 2023 at 10:35:02AM +0900,
> David Stevens <[email protected]> wrote:
>
> > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > > index e44ab512c3a1..b1607e314497 100644
> > > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > >
> > > > > ...
> > > > >
> > > > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > > > > bool host_writable = !fault || fault->map_writable;
> > > > > > bool prefetch = !fault || fault->prefetch;
> > > > > > bool write_fault = fault && fault->write;
> > > > > > + bool is_refcounted = !fault || fault->is_refcounted_page;
> > > > >
> > > > > Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> > > > > practice?
> > > >
> > > > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > > > FOLL_GET. That's fixable, but it's not something this series currently
> > > > does.
> > >
> > > So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
> > > hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
> > > spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent
> > > whether the corresponding page is ref-countable or not, right?
> > >
> > > Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
> > > is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
> > > the issue, though.
> > >
> >
> > direct_pte_prefetch_many and prefetch_gpte both pass NULL for the
> > fault parameter, so is_refcounted will evaluate to true. So the spte's
> > refcounted bit will get set in that case.
>
> Oops, my bad. My point is "unconditionally". Is the bit always set for
> non-refcountable pages? Or non-refcountable pages are not prefeched?

The bit is never set for non-refcounted pages, and is always set for
refcounted pages. The current series never prefetches non-refcounted
pages, since it continues to use the gfn_to_page_many_atomic API.

-David

2023-07-11 17:46:54

by Zhi Wang

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, 5 Jul 2023 18:08:17 +0900
David Stevens <[email protected]> wrote:

> On Wed, Jul 5, 2023 at 5:47 PM Zhi Wang <[email protected]> wrote:
> >
> > On Tue, 4 Jul 2023 16:50:47 +0900
> > David Stevens <[email protected]> wrote:
> >
> > > From: David Stevens <[email protected]>
> > >
> > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > > where possible, combines the boolean arguments into a single flags
> > > argument.
> > >
> > > Signed-off-by: David Stevens <[email protected]>
> > > ---
> > > include/linux/kvm_host.h | 16 ++++
> > > virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> > > virt/kvm/kvm_mm.h | 3 +-
> > > virt/kvm/pfncache.c | 8 +-
> > > 4 files changed, 122 insertions(+), 76 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 9d3ac7720da9..ef2763c2b12e 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -97,6 +97,7 @@
> > > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> > > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> > > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> > > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> > >
> > > /*
> > > * error pfns indicate that the gfn is in slot but faild to
> > > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > > void kvm_release_page_clean(struct page *page);
> > > void kvm_release_page_dirty(struct page *page);
> > >
> > > +struct kvm_follow_pfn {
> > > + const struct kvm_memory_slot *slot;
> > > + gfn_t gfn;
> > > + unsigned int flags;
> > > + bool atomic;
> > > + /* Allow a read fault to create a writeable mapping. */
> > > + bool allow_write_mapping;
> > > +
> > > + /* Outputs of __kvm_follow_pfn */
> > > + hva_t hva;
> > > + bool writable;
> > > +};
> > > +
> > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > > +
> > > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> > > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> > > bool *writable);
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 371bd783ff2b..b13f22861d2f 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > > * true indicates success, otherwise false is returned. It's also the
> > > * only part that runs if we can in atomic context.
> > > */
> > > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > - bool *writable, kvm_pfn_t *pfn)
> > > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > {
> > > struct page *page[1];
> > > + bool write_fault = foll->flags & FOLL_WRITE;
> > >
> > > /*
> > > * Fast pin a writable pfn only if it is a write fault request
> > > * or the caller allows to map a writable pfn for a read fault
> > > * request.
> > > */
> > > - if (!(write_fault || writable))
> > > + if (!(write_fault || foll->allow_write_mapping))
> > > return false;
> > >
> > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > > *pfn = page_to_pfn(page[0]);
> > > -
> > > - if (writable)
> > > - *writable = true;
> > > + foll->writable = foll->allow_write_mapping;
> > > return true;
> > > }
> > >
> > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > * The slow path to get the pfn of the specified host virtual address,
> > > * 1 indicates success, -errno is returned if error is detected.
> > > */
> > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > {
> > > - unsigned int flags = FOLL_HWPOISON;
> > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > struct page *page;
> > > int npages;
> > >
> > > might_sleep();
> > >
> > > - if (writable)
> > > - *writable = write_fault;
> > > -
> > > - if (write_fault)
> > > - flags |= FOLL_WRITE;
> > > - if (async)
> > > - flags |= FOLL_NOWAIT;
> > > - if (interruptible)
> > > - flags |= FOLL_INTERRUPTIBLE;
> > > -
> > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > if (npages != 1)
> > > return npages;
> > >
> > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > +
> > > /* map read fault as writable if possible */
> > > - if (unlikely(!write_fault) && writable) {
> > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > struct page *wpage;
> > >
> > > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > > - *writable = true;
> > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > > + foll->writable = true;
> > > put_page(page);
> > > page = wpage;
> > > }
> > > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> > > return get_page_unless_zero(page);
> > > }
> > >
> > > -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > > - unsigned long addr, bool write_fault,
> > > - bool *writable, kvm_pfn_t *p_pfn)
> > > +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> > > + kvm_pfn_t *p_pfn)
> > > {
> > > kvm_pfn_t pfn;
> > > pte_t *ptep;
> > > spinlock_t *ptl;
> > > + bool write_fault = foll->flags & FOLL_WRITE;
> > > int r;
> > >
> > > - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > > + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > > if (r) {
> > > /*
> > > * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> > > * not call the fault handler, so do it here.
> > > */
> > > bool unlocked = false;
> > > - r = fixup_user_fault(current->mm, addr,
> > > + r = fixup_user_fault(current->mm, foll->hva,
> > > (write_fault ? FAULT_FLAG_WRITE : 0),
> > > &unlocked);
> > > if (unlocked)
> > > @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > > if (r)
> > > return r;
> > >
> > > - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > > + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > > if (r)
> > > return r;
> > > }
> > > @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > > goto out;
> > > }
> > >
> > > - if (writable)
> > > - *writable = pte_write(*ptep);
> > > + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> > > pfn = pte_pfn(*ptep);
> > >
> > > /*
> > > @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > > * 2): @write_fault = false && @writable, @writable will tell the caller
> > > * whether the mapping is writable.
> > > */
> > > -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > > - bool *async, bool write_fault, bool *writable)
> > > +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> > > {
> > > struct vm_area_struct *vma;
> > > kvm_pfn_t pfn;
> > > int npages, r;
> > >
> > > /* we can do it either atomically or asynchronously, not both */
> > > - BUG_ON(atomic && async);
> > > + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
> > >
> > > - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> > > + if (hva_to_pfn_fast(foll, &pfn))
> > > return pfn;
> > >
> > > - if (atomic)
> > > + if (foll->atomic)
> > > return KVM_PFN_ERR_FAULT;
> > >
> > > - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> > > - writable, &pfn);
> > > + npages = hva_to_pfn_slow(foll, &pfn);
> > > if (npages == 1)
> > > return pfn;
> > > if (npages == -EINTR)
> > > @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > >
> > > mmap_read_lock(current->mm);
> > > if (npages == -EHWPOISON ||
> > > - (!async && check_user_page_hwpoison(addr))) {
> > > + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> > > pfn = KVM_PFN_ERR_HWPOISON;
> > > goto exit;
> > > }
> > >
> > > retry:
> > > - vma = vma_lookup(current->mm, addr);
> > > + vma = vma_lookup(current->mm, foll->hva);
> > >
> > > if (vma == NULL)
> > > pfn = KVM_PFN_ERR_FAULT;
> > > else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> > > - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> > > + r = hva_to_pfn_remapped(vma, foll, &pfn);
> > > if (r == -EAGAIN)
> > > goto retry;
> > > if (r < 0)
> > > pfn = KVM_PFN_ERR_FAULT;
> > > } else {
> > > - if (async && vma_is_valid(vma, write_fault))
> > > - *async = true;
> > > - pfn = KVM_PFN_ERR_FAULT;
> > > + if ((foll->flags & FOLL_NOWAIT) &&
> > > + vma_is_valid(vma, foll->flags & FOLL_WRITE))
> > > + pfn = KVM_PFN_ERR_NEEDS_IO;
> > > + else
> > > + pfn = KVM_PFN_ERR_FAULT;
> > > }
> > > exit:
> > > mmap_read_unlock(current->mm);
> > > return pfn;
> > > }
> > >
> > > -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > > - bool atomic, bool interruptible, bool *async,
> > > - bool write_fault, bool *writable, hva_t *hva)
> > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > > {
> > > - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> > > -
> > > - if (hva)
> > > - *hva = addr;
> > > + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> > > + foll->flags & FOLL_WRITE);
> > >
> > > - if (addr == KVM_HVA_ERR_RO_BAD) {
> > > - if (writable)
> > > - *writable = false;
> > > + if (foll->hva == KVM_HVA_ERR_RO_BAD)
> > > return KVM_PFN_ERR_RO_FAULT;
> > > - }
> > >
> >
> > Can you explain why updating foll->writable = false (previously *writeable
> > = false) is omitted here?
> >
> > In the caller where the struct kvm_follow_pfn is initialized, e.g.
> > __gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not initialized.
> > IIUC, they expect __kvm_follow_pfn() to update it and return .writable to
> > upper caller.
> >
> > As the one of the output, it would be better to initalize it either in the
> > caller or update it in __kvm_follow_pfn(). Or
> > __gfn_to_pfn_memslot()/gfn_to_pfn_prot() will return random data in the
> > stack to the caller via bool *writable. It doesn't sound nice.
>
> Entries omitted from an initializer are initialized to zero, so
> .writable does get initialized in all of the patches in this series
> via designated initializers. Although you're right that explicitly
> setting it to false is a good idea, in case someday someone adds a
> caller that doesn't use an initializer when declaring its
> kvm_follow_pfn.
>

Nice trick and nice to know that. :) Agreed on improving readability and
preventing a risk from the caller.
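
For completeness, a minimal illustration of the zero-initialization being
relied on here (hypothetical caller, not from the series):

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_WRITE,
	};
	/*
	 * Members not named in the initializer (.writable, .hva, ...) are
	 * zero-initialized, so foll.writable reliably starts out false.
	 */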

> -David


2023-07-11 18:10:38

by Zhi Wang

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Thu, 6 Jul 2023 15:49:39 +0900
David Stevens <[email protected]> wrote:

> On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang <[email protected]> wrote:
> >
> > On Tue, 4 Jul 2023 16:50:48 +0900
> > David Stevens <[email protected]> wrote:
> >
> > > From: David Stevens <[email protected]>
> > >
> > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > > callers to resolve a gfn when the associated pfn has a valid struct page
> > > that isn't being actively refcounted (e.g. tail pages of non-compound
> > > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > > the returned pfn must be guarded by a mmu notifier.
> > >
> > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > > is set when the returned pfn has an associated struct page with a valid
> > > refcount. Callers that don't pass FOLL_GET should remember this value
> > > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > > non-zero refcount.
> > >
> > > Signed-off-by: David Stevens <[email protected]>
> > > ---
> > > include/linux/kvm_host.h | 10 ++++++
> > > virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> > > virt/kvm/pfncache.c | 2 +-
> > > 3 files changed, 47 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index ef2763c2b12e..a45308c7d2d9 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > > void kvm_release_page_clean(struct page *page);
> > > void kvm_release_page_dirty(struct page *page);
> > >
> > > +void kvm_set_page_accessed(struct page *page);
> > > +void kvm_set_page_dirty(struct page *page);
> > > +
> > > struct kvm_follow_pfn {
> > > const struct kvm_memory_slot *slot;
> > > gfn_t gfn;
> > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> > > bool atomic;
> > > /* Allow a read fault to create a writeable mapping. */
> > > bool allow_write_mapping;
> > > + /*
> > > + * Usage of the returned pfn will be guared by a mmu notifier. Must
> > ^guarded
> > > + * be true if FOLL_GET is not set.
> > > + */
> > > + bool guarded_by_mmu_notifier;
> > >
> > It seems no one sets the guraded_by_mmu_notifier in this patch. Is
> > guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the
> > caller of __kvm_follow_pfn()?
>
> Yes, this is the case.
>
> > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > temporary solution. I don't think it is a good idea to play tricks with
> > a temporary solution, more like we are abusing the toleration.
>
> I'm not sure I understand what you're getting at. This series never
> calls gup without FOLL_GET.
>
> This series aims to provide kvm_follow_pfn as a unified API on top of
> gup+follow_pte. Since one of the major clients of this API uses an mmu
> notifier, it makes sense to support returning a pfn without taking a
> reference. And we indeed need to do that for certain types of memory.
>

I don't have a problem with taking a pfn without taking a ref. I am
questioning whether using !FOLL_GET in struct kvm_follow_pfn to indicate
taking a pfn without a ref is a good idea, when there is another flag
that actually expresses it.

I can understand that using FOLL_XXX in kvm_follow_pfn saves some
translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
flags. However, FOLL_XXX is for GUP. Using FOLL_XXX to reflect the
requirements of GUP in a code path that is going to call GUP is
reasonable.

But using FOLL_XXX for purposes that are not related to the GUP call
really feels off. Those flags can change in the future because of GUP
requirements, and then people have to figure out what is actually
happening with FOLL_GET here, since it is not actually tied to GUP calls.


> > Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to
> > indicate a tail page?
>
> What do you mean by to indicate a tail page? Do you mean to indicate
> that the returned pfn refers to non-refcounted page? That's specified
> by is_refcounted_page.
>

I figured out the reason why I got confused.

+ * Otherwise, certain IO or PFNMAP mappings can be backed with valid
+ * struct pages but be allocated without refcounting e.g., tail pages of
+ * non-compound higher order allocations. If FOLL_GET is set and we
+ * increment such a refcount, then when that pfn is eventually passed to
+ * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
+ * freed. Therefore don't allow those pages here when FOLL_GET is set.
*/

The above statements only explain the wrong behavior, they don't explain the
expected behavior. It would be better to explain that for the
mmu-notifier-guarded case (!FOLL_GET), we put back the reference taken by GUP
(a possible wording is sketched below). The FOLL_GET handling really confused
me.

-	if (!kvm_try_get_pfn(pfn))
+	page = kvm_pfn_to_refcounted_page(pfn);
+	if (!page)
+		goto out;
+
+	if (get_page_unless_zero(page)) {
+		foll->is_refcounted_page = true;
+		if (!(foll->flags & FOLL_GET))
+			put_page(page);
+	} else if (foll->flags & FOLL_GET) {
 		r = -EFAULT;
+	}

> -David
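
As an illustration only (a sketch of wording, not the actual patch text), the
kind of clarified comment being asked for might read something like:

	/*
	 * Certain IO or PFNMAP mappings can be backed by struct pages that
	 * are not refcounted (e.g. tail pages of non-compound higher order
	 * allocations).  If get_page_unless_zero() fails, the page is one of
	 * those, and it may only be returned to callers that did not ask for
	 * a reference (!FOLL_GET) and that keep the pfn alive via an mmu
	 * notifier.  If it succeeds and the caller did not ask for a
	 * reference (!FOLL_GET), the reference just taken is dropped again
	 * immediately, since the caller relies on its mmu notifier instead.
	 */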


2023-07-11 22:58:55

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Tue, Jul 11, 2023, Zhi Wang wrote:
> On Thu, 6 Jul 2023 15:49:39 +0900
> David Stevens <[email protected]> wrote:
>
> > On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang <[email protected]> wrote:
> > >
> > > On Tue, 4 Jul 2023 16:50:48 +0900
> > > David Stevens <[email protected]> wrote:
> > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > > temporary solution. I don't think it is a good idea to play tricks with
> > > a temporary solution, more like we are abusing the toleration.
> >
> > I'm not sure I understand what you're getting at. This series never
> > calls gup without FOLL_GET.
> >
> > This series aims to provide kvm_follow_pfn as a unified API on top of
> > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > notifier, it makes sense to support returning a pfn without taking a
> > reference. And we indeed need to do that for certain types of memory.
> >
>
> I don't have a problem with taking a pfn without taking a ref. I am
> questioning whether using !FOLL_GET in struct kvm_follow_pfn to indicate
> taking a pfn without a ref is a good idea, when there is another flag that
> actually expresses it.
>
> I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> flags. However, FOLL_XXX is for GUP. Using FOLL_XXX to reflect the
> requirements of GUP in the code path that is going to call GUP is
> reasonable.
>
> But using FOLL_XXX for purposes that are not related to the GUP call really
> feels off.

I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
that handles non-refcounted pages, i.e. this

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (foll->flags & FOLL_GET) {
		r = -EFAULT;
	}

should be

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (!foll->guarded_by_mmu_notifier)
		r = -EFAULT;

because it's not the desire to grab a reference that makes getting non-refcounted
pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.

Though that highlights that checking guarded_by_mmu_notifier should be done for
*all* non-refcounted pfns, not just non-refcounted struct page memory.

As for the other usage of FOLL_GET in this series (using it to conditionally do
put_page()), IMO that's very much related to the GUP call. Invoking put_page()
is a hack to workaround the fact that GUP doesn't provide a way to get the pfn
without grabbing a reference to the page. In an ideal world, KVM would NOT pass
FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.

I do think it's worth providing a helper to consolidate and document that hacky
code, e.g. add a kvm_follow_refcounted_pfn() helper.

All in all, I think the below (completely untested) is what we want?

David (and others), I am planning on doing a full review of this series "soon",
but it will likely be a few weeks until that happens. I jumped in on this
specific thread because this caught my eye and I really don't want to throw out
*all* of the FOLL_GET usage.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b5afd70f239..90d424990e0a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 	return rc == -EHWPOISON;
 }

+static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
+					   struct page *page)
+{
+	kvm_pfn_t pfn = page_to_pfn(page);
+
+	foll->is_refcounted_page = true;
+
+	/*
+	 * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
+	 * doesn't want to grab a reference, but gup() doesn't support getting
+	 * just the pfn, i.e. FOLL_GET is effectively mandatory. If that ever
+	 * changes, drop this and simply don't pass FOLL_GET to gup().
+	 */
+	if (!(foll->flags & FOLL_GET))
+		put_page(page);
+
+	return pfn;
+}
+
 /*
  * The fast path to get the writable pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned. It's also the
@@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		return false;

 	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
-		*pfn = page_to_pfn(page[0]);
 		foll->writable = foll->allow_write_mapping;
-		foll->is_refcounted_page = true;
-		if (!(foll->flags & FOLL_GET))
-			put_page(page[0]);
+
+		*pfn = kvm_follow_refcounted_pfn(foll, page[0]);
 		return true;
 	}

@@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		return npages;

 	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
-	foll->is_refcounted_page = true;

 	/* map read fault as writable if possible */
 	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
@@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 			page = wpage;
 		}
 	}
-	*pfn = page_to_pfn(page);
-	if (!(foll->flags & FOLL_GET))
-		put_page(page);
+
+	*pfn = kvm_follow_refcounted_pfn(foll, page);
 	return npages;
 }

@@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
 	if (!page)
 		goto out;

-	if (get_page_unless_zero(page)) {
-		foll->is_refcounted_page = true;
-		if (!(foll->flags & FOLL_GET))
-			put_page(page);
-	} else if (foll->flags & FOLL_GET) {
-		r = -EFAULT;
-	}
-
+	if (get_page_unless_zero(page))
+		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
 out:
 	pte_unmap_unlock(ptep, ptl);
-	*p_pfn = pfn;
+
+	if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
+	    !allow_unsafe_mappings)
+		r = -EFAULT;
+	else
+		*p_pfn = pfn;

 	return r;
 }


2023-07-19 06:47:27

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> out_unlock:
> write_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> + if (fault->is_refcounted_page)
> + kvm_set_page_accessed(pfn_to_page(fault->pfn));
For a refcounted page, now that KVM puts its ref early in kvm_faultin_pfn(),
should this kvm_set_page_accessed() be placed before unlocking mmu_lock?

Otherwise, if the user unmaps a region (which triggers kvm_unmap_gfn_range()
with mmu_lock held for write) and releases the page, and those two steps
happen after the page_count() check in kvm_set_page_accessed() but before
mark_page_accessed(), the latter may mark as accessed a page that has already
been freed or that no longer belongs to the current process.

Is it true?

> return r;
> }
>
> @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>
> out_unlock:
> read_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> + if (fault->is_refcounted_page)
> + kvm_set_page_accessed(pfn_to_page(fault->pfn));
> return r;
> }
Ditto.
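
For reference, the reordering being suggested here would look roughly like the
following (an untested sketch against the quoted hunks, assuming the
fault->is_refcounted_page field from this series):

out_unlock:
	/* mark the page accessed while mmu_lock is still held */
	if (fault->is_refcounted_page)
		kvm_set_page_accessed(pfn_to_page(fault->pfn));
	write_unlock(&vcpu->kvm->mmu_lock);
	return r;

and analogously with read_unlock() in the kvm_tdp_mmu_page_fault() hunk.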

2023-07-19 07:39:33

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Wed, Jul 19, 2023 at 3:35 PM Yan Zhao <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >
> > out_unlock:
> > write_unlock(&vcpu->kvm->mmu_lock);
> > - kvm_release_pfn_clean(fault->pfn);
> > + if (fault->is_refcounted_page)
> > + kvm_set_page_accessed(pfn_to_page(fault->pfn));
> For a refcounted page, now that KVM puts its ref early in kvm_faultin_pfn(),
> should this kvm_set_page_accessed() be placed before unlocking mmu_lock?
>
> Otherwise, if the user unmaps a region (which triggers kvm_unmap_gfn_range()
> with mmu_lock held for write) and releases the page, and those two steps
> happen after the page_count() check in kvm_set_page_accessed() but before
> mark_page_accessed(), the latter may mark as accessed a page that has already
> been freed or that no longer belongs to the current process.
>
> Is it true?

Yes, good catch. During some testing last week, I actually found this
bug thanks to the WARN_ON the first patch in this series added to
kvm_is_ad_tracked_page. I'll fix it in the next revision, after Sean
gets a chance to comment on the series.

Thanks,
David

2023-08-04 23:28:01

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Tue, Jul 04, 2023, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 16 ++++
> virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> virt/kvm/kvm_mm.h | 3 +-
> virt/kvm/pfncache.c | 8 +-
> 4 files changed, 122 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
> #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)

Hmm, ideally KVM_PFN_ERR_NEEDS_IO would be introduced in a separate prep patch,
e.g. by changing "bool *async" to "bool no_wait". At a glance, I can't tell if
that's feasible though, so consider it more of a "wish" than a request.
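
For illustration, the kind of prep change being wished for might look roughly
like the following (a hypothetical sketch, not part of this series), using
hva_to_pfn() as an example:

-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
-		     bool *async, bool write_fault, bool *writable);
+kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
+		     bool no_wait, bool write_fault, bool *writable);

Callers that used to pass a non-NULL async pointer and check whether it was
set would instead pass no_wait = true and check for a KVM_PFN_ERR_NEEDS_IO
return value.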

> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> return get_page_unless_zero(page);
> }
>
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> - unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> + kvm_pfn_t *p_pfn)

Please wrap. KVM still honors the 80 char soft limit unless there's a reason not
to, and in this case it's already wrapping

static int hva_to_pfn_remapped(struct vm_area_struct *vma,
			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)

> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> goto out;
> }
>
> - if (writable)
> - *writable = pte_write(*ptep);
> + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;

Similar to feedback in my other response, don't condition this on try_map_writable,
i.e. just do:

foll->writable = pte_write(...);

> pfn = pte_pfn(*ptep);
>
> /*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> * 2): @write_fault = false && @writable, @writable will tell the caller
> * whether the mapping is writable.
> */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> {
> struct vm_area_struct *vma;
> kvm_pfn_t pfn;
> int npages, r;
>
> /* we can do it either atomically or asynchronously, not both */
> - BUG_ON(atomic && async);
> + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>
> - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> + if (hva_to_pfn_fast(foll, &pfn))
> return pfn;
>
> - if (atomic)
> + if (foll->atomic)
> return KVM_PFN_ERR_FAULT;
>
> - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> - writable, &pfn);
> + npages = hva_to_pfn_slow(foll, &pfn);
> if (npages == 1)
> return pfn;
> if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>
> mmap_read_lock(current->mm);
> if (npages == -EHWPOISON ||
> - (!async && check_user_page_hwpoison(addr))) {
> + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {

Opportunistically align the indentation, as an added bonus that makes the line
length a few chars shorter, i.e.

	if (npages == -EHWPOISON ||
	    (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
		pfn = KVM_PFN_ERR_HWPOISON;
		goto exit;
	}

2023-08-04 23:35:21

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] KVM: allow mapping non-refcounted pages

On Tue, Jul 04, 2023, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> that is backed by struct pages that aren't currently being refcounted
> (e.g. tail pages of non-compound higher order allocations) into the
> guest.

Apologies for the slow review; I'm done with feedback for this version.

FWIW, it's probably a bit too late to catch 6.6, especially since we need acks from
ARM and PPC, but 6.7 should be very doable unless someone outright objects.

Thanks for being persistent!

2023-08-04 23:46:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Thu, Jul 06, 2023, Yu Zhang wrote:
> On Thu, Jul 06, 2023 at 02:29:24PM +0900, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <[email protected]> wrote:
> > >
> > > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
> > > > >
> > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > > > > * The slow path to get the pfn of the specified host virtual address,
> > > > > > * 1 indicates success, -errno is returned if error is detected.
> > > > > > */
> > > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > > > > {
> > > > > > - unsigned int flags = FOLL_HWPOISON;
> > > > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > > > > struct page *page;
> > > > > > int npages;
> > > > > >
> > > > > > might_sleep();
> > > > > >
> > > > > > - if (writable)
> > > > > > - *writable = write_fault;
> > > > > > -
> > > > > > - if (write_fault)
> > > > > > - flags |= FOLL_WRITE;
> > > > > > - if (async)
> > > > > > - flags |= FOLL_NOWAIT;
> > > > > > - if (interruptible)
> > > > > > - flags |= FOLL_INTERRUPTIBLE;
> > > > > > -
> > > > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > > > > if (npages != 1)
> > > > > > return npages;
> > > > > >
> > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > > > +
> > > > > > /* map read fault as writable if possible */
> > > > > > - if (unlikely(!write_fault) && writable) {
> > > > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > > >
> > > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> > > >
> > > > The two statements are logically equivalent, although I guess using
> > > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > > > verbose.
> > >
> > > Well, as the comment says, we wanna try to map the read fault as writable
> > > whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> > > for write faults. So I guess using !foll->writable will not allow this.
> > > Did I miss anything?
> >
> > We just set the foll->writable out parameter to be equal to
> > ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a =
> > foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have
> > !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a &&
> > b.
>
> Ouch, my bad again... I typed "!foll->writable", but missed the "!" in
> my head while calculating... Thanks! :)

The code is funky and confusing though. Specifically, FOLL_WRITE without
allow_write_mapping is nonsensical, and yields the even more nonsensical output
of a successful FOLL_WRITE with foll->writable==%false.

It "works" because callers only consume foll->writable when foll->allow_write_mapping
is true, but relying on that is ugly and completely unnecessary. Similarly, the
"allow" terminology is misleading. FOLL_WRITE *always* allows writable mappings.

This wasn't as much of a problem in the previous code because the lower levels took
the pointer, i.e. avoided the "allow" terminology entirely.

So we should either keep that behavior, i.e. replace "bool allow_write_mapping"
with "bool *writable", or rename allow_write_mapping to something like
opportunistically_map_writable, and then unconditionally set foll->writable
whenever KVM obtains a writable mapping, i.e. regardless of whether the original
fault was a read or a write.

My vote is for the latter. If opportunistically_map_writable is too verbose,
try_map_writable would be another option. Hmm, I'll make "try_map_writable" my
official vote.

Ah, and I also vote to use an if-elif instead of unconditionally setting foll->writable.
That makes the relationship between FOLL_WRITE and try_map_writable a bit more
obvious IMO. E.g.

static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
	struct page *page[1];

	/*
	 * Fast pin a writable pfn only if it is a write fault request
	 * or the caller allows to map a writable pfn for a read fault
	 * request.
	 */
	if (!((foll->flags & FOLL_WRITE) || foll->try_map_writable))
		return false;

	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
		*pfn = page_to_pfn(page[0]);
		foll->writable = true;
		return true;
	}

	return false;
}

/*
 * The slow path to get the pfn of the specified host virtual address,
 * 1 indicates success, -errno is returned if error is detected.
 */
static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
	struct page *page;
	int npages;

	might_sleep();

	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
	if (npages != 1)
		return npages;

	if (foll->flags & FOLL_WRITE) {
		foll->writable = true;
	} else if (foll->try_map_writable) {
		struct page *wpage;

		/* map read fault as writable if possible */
		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
			foll->writable = true;
			put_page(page);
			page = wpage;
		}
	}
	*pfn = page_to_pfn(page);
	return npages;
}


2023-08-05 01:15:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Thu, Jul 06, 2023, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > From: David Stevens <[email protected]>
> > >
> > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > memory into the guest that is backed by un-refcounted struct pages - for
> > > example, higher order non-compound pages allocated by the amdgpu driver
> > > via ttm_pool_alloc_page.
> >
> > I guess you mean the tail pages of the higher order non-compound pages?
> > And as to the head page, it is said to be set to one coincidentally[*],
> > and shall not be considered as refcounted. IIUC, refcount of this head
> > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > this may not be a problem(?). But treating this head page differently,
> > as a refcounted one (e.g., to set the A/D flags), is weird.
> >
> > Or maybe I missed some context, e.g., can the head page be allocated to
> > the guest at all?
>
> Yes, this is to allow mapping the tail pages of higher order
> non-compound pages - I should have been more precise in my wording.
> The head pages can already be mapped into the guest.

Recording for posterity (or to make an incorrect statement and get corrected),
because I recently had a conversation about the head page not actually being
refcounted. (I can't remember with whom I had the conversation, but I'm pretty
sure it wasn't an imaginary friend).

Even though whatever allocates the page doesn't explicitly refcount the head
page, __free_pages() will still do the right thing and keep the head page
around until its last reference is put. And my understanding is that even
though it's a "head" page, it's not a PG_head page, i.e. not a compound page,
and so is treated as an order-0 page when KVM invokes put_page().

void __free_pages(struct page *page, unsigned int order)
{
	/* get PageHead before we drop reference */
	int head = PageHead(page);

	if (put_page_testzero(page))		<=== will evaluate false if KVM holds a ref
		free_the_page(page, order);
	else if (!head)				<=== will be false for non-compound pages
		while (order-- > 0)
			free_the_page(page + (1 << order), order);
}
EXPORT_SYMBOL(__free_pages);