LinuxLists.cc - [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

[permalink] [raw]

Subject: [PATCH v11 1/8] KVM: Assert that a page's refcount is elevated when marking accessed/dirty

From: Sean Christopherson <[email protected]>

Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.

Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free. Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occurred, but yelling
does help detect, triage, and debug use-after-free bugs.

Note, the assertion must use page_count(), NOT page_ref_count()! For
hugepages, the returned struct page may be a tailpage and thus not have
its own refcount.

Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 10bfc88a69f7..c5e4bf7c48f9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3204,6 +3204,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);

static bool kvm_is_ad_tracked_page(struct page *page)
{
+ /*
+ * Assert that KVM isn't attempting to mark a freed page as Accessed or
+ * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
+ * (typically) doesn't pin pages that are mapped in KVM's MMU, and
+ * instead relies on mmu_notifiers to know when a mapping needs to be
+ * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
+ * KVM returns from its mmu_notifier, i.e. the page should have an
+ * elevated refcount at this point even though KVM doesn't hold a
+ * reference of its own.
+ */
+ if (WARN_ON_ONCE(!page_count(page)))
+ return false;
+
/*
* Per page-flags.h, pages tagged PG_reserved "should in general not be
* touched (e.g. set dirty) except by its owner".
--
2.44.0.rc1.240.g4c46232300-goog

2024-02-29 02:58:59

[permalink] [raw]

Subject: [PATCH v11 2/8] KVM: Relax BUG_ON argument validation

From: David Stevens <[email protected]>

hva_to_pfn() includes a check that KVM isn't trying to do an async page
fault in a situation where it can't sleep. Downgrade this check from a
BUG_ON() to a WARN_ON_ONCE(), since DoS'ing the guest (at worst) is
better than bringing down the host.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: David Stevens <[email protected]>
---
virt/kvm/kvm_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c5e4bf7c48f9..6f37d56fb2fc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2979,7 +2979,7 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
int npages, r;

/* we can do it either atomically or asynchronously, not both */
- BUG_ON(atomic && async);
+ WARN_ON_ONCE(atomic && async);

if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
return pfn;
--
2.44.0.rc1.240.g4c46232300-goog

2024-02-29 02:59:43

[permalink] [raw]

Subject: [PATCH v11 4/8] KVM: mmu: Improve handling of non-refcounted pfns

From: David Stevens <[email protected]>

KVM's handling of non-refcounted pfns has two problems:

- pfns without struct pages can be accessed without the protection of a
mmu notifier. This is unsafe because KVM cannot monitor or control
the lifespan of such pfns, so it may continue to access the pfns
after they are freed.
- struct pages without refcounting (e.g. tail pages of non-compound
higher order pages) cannot be used at all, as gfn_to_pfn does not
provide enough information for callers to be able to avoid
underflowing the refcount.

This patch extends the kvm_follow_pfn() API to properly handle these
cases:

- First, it adds FOLL_GET to the list of supported flags, to indicate
whether or not the caller actually wants to take a refcount.
- Second, it adds a guarded_by_mmu_notifier parameter that is used to
avoid returning non-refcounted pages when the caller cannot safely
use them.
- Third, it adds an is_refcounted_page output parameter so that
callers can tell whether or not a pfn has a struct page that needs
to be passed to put_page.

Since callers need to be updated on a case-by-case basis to pay
attention to is_refcounted_page, the new behavior of returning
non-refcounted pages is opt-in via the allow_non_refcounted_struct_page
parameter. Once all callers have been updated, this parameter should be
removed.

The fact that non-refcounted pfns can no longer be accessed without mmu
notifier protection by default is a breaking change. This patch provides
a module parameter that system admins can use to re-enable the previous
unsafe behavior when userspace is trusted not to migrate/free/etc
non-refcounted pfns that are mapped into the guest. There is no timeline
for updating everything in KVM to use mmu notifiers to alleviate the
need for this module parameter.

Signed-off-by: David Stevens <[email protected]>
---
include/linux/kvm_host.h | 29 +++++++++++
virt/kvm/kvm_main.c | 104 +++++++++++++++++++++++++--------------
virt/kvm/pfncache.c | 3 +-
3 files changed, 99 insertions(+), 37 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 290db5133c36..66516088bb0a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1219,10 +1219,39 @@ struct kvm_follow_pfn {
bool atomic;
/* Try to create a writable mapping even for a read fault. */
bool try_map_writable;
+ /*
+ * Usage of the returned pfn will be guared by a mmu notifier. If
+ * FOLL_GET is not set, this must be true.
+ */
+ bool guarded_by_mmu_notifier;
+ /*
+ * When false, do not return pfns for non-refcounted struct pages.
+ *
+ * This allows callers to continue to rely on the legacy behavior
+ * where pfs returned by gfn_to_pfn can be safely passed to
+ * kvm_release_pfn without worrying about corrupting the refcount of
+ * non-refcounted pages.
+ *
+ * Callers that opt into non-refcount struct pages need to track
+ * whether or not the returned pages are refcounted and avoid touching
+ * them when they are not. Some architectures may not have enough
+ * free space in PTEs to do this.
+ */
+ bool allow_non_refcounted_struct_page;

/* Outputs of kvm_follow_pfn */
hva_t hva;
bool writable;
+ /*
+ * Non-NULL if the returned pfn is for a page with a valid refcount,
+ * NULL if the returned pfn has no struct page or if the struct page is
+ * not being refcounted (e.g. tail pages of non-compound higher order
+ * allocations from IO/PFNMAP mappings).
+ *
+ * NOTE: This will still be set if FOLL_GET is not specified, but the
+ * returned page will not have an elevated refcount.
+ */
+ struct page *refcounted_page;
};

kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 575756c9c5b0..984bcf8511e7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -96,6 +96,13 @@ unsigned int halt_poll_ns_shrink;
module_param(halt_poll_ns_shrink, uint, 0644);
EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);

+/*
+ * Allow non-refcounted struct pages and non-struct page memory to
+ * be mapped without MMU notifier protection.
+ */
+static bool allow_unsafe_mappings;
+module_param(allow_unsafe_mappings, bool, 0444);
+
/*
* Ordering of locks:
*
@@ -2786,6 +2793,24 @@ static inline int check_user_page_hwpoison(unsigned long addr)
return rc == -EHWPOISON;
}

+static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *kfp,
+ struct page *page)
+{
+ kvm_pfn_t pfn = page_to_pfn(page);
+
+ /*
+ * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
+ * doesn't want to grab a reference, but gup() doesn't support getting
+ * just the pfn, i.e. FOLL_GET is effectively mandatory. If that ever
+ * changes, drop this and simply don't pass FOLL_GET to gup().
+ */
+ if (!(kfp->flags & FOLL_GET))
+ put_page(page);
+
+ kfp->refcounted_page = page;
+ return pfn;
+}
+
/*
* The fast path to get the writable pfn which will be stored in @pfn,
* true indicates success, otherwise false is returned. It's also the
@@ -2804,7 +2829,7 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *kfp, kvm_pfn_t *pfn)
return false;

if (get_user_page_fast_only(kfp->hva, FOLL_WRITE, page)) {
- *pfn = page_to_pfn(page[0]);
+ *pfn = kvm_follow_refcounted_pfn(kfp, page[0]);
kfp->writable = true;
return true;
}
@@ -2851,7 +2876,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *kfp, kvm_pfn_t *pfn)
page = wpage;
}
}
- *pfn = page_to_pfn(page);
+ *pfn = kvm_follow_refcounted_pfn(kfp, page);
return npages;
}

@@ -2866,16 +2891,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
return true;
}

-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
- struct page *page = kvm_pfn_to_refcounted_page(pfn);
-
- if (!page)
- return 1;
-
- return get_page_unless_zero(page);
-}
-
static int hva_to_pfn_remapped(struct vm_area_struct *vma,
struct kvm_follow_pfn *kfp, kvm_pfn_t *p_pfn)
{
@@ -2884,6 +2899,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
pte_t pte;
spinlock_t *ptl;
bool write_fault = kfp->flags & FOLL_WRITE;
+ struct page *page;
int r;

r = follow_pte(vma->vm_mm, kfp->hva, &ptep, &ptl);
@@ -2908,37 +2924,40 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,

pte = ptep_get(ptep);

+ kfp->writable = pte_write(pte);
+ pfn = pte_pfn(pte);
+
+ page = kvm_pfn_to_refcounted_page(pfn);
+
if (write_fault && !pte_write(pte)) {
pfn = KVM_PFN_ERR_RO_FAULT;
goto out;
}

- kfp->writable = pte_write(pte);
- pfn = pte_pfn(pte);
+ if (!page)
+ goto out;

/*
- * Get a reference here because callers of *hva_to_pfn* and
- * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
- * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
- * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
- * simply do nothing for reserved pfns.
- *
- * Whoever called remap_pfn_range is also going to call e.g.
- * unmap_mapping_range before the underlying pages are freed,
- * causing a call to our MMU notifier.
- *
- * Certain IO or PFNMAP mappings can be backed with valid
- * struct pages, but be allocated without refcounting e.g.,
- * tail pages of non-compound higher order allocations, which
- * would then underflow the refcount when the caller does the
- * required put_page. Don't allow those pages here.
+ * IO or PFNMAP mappings can be backed with valid struct pages but be
+ * allocated without refcounting. We need to detect that to make sure we
+ * only pass refcounted pages to kvm_follow_refcounted_pfn.
*/
- if (!kvm_try_get_pfn(pfn))
- r = -EFAULT;
+ if (get_page_unless_zero(page))
+ WARN_ON_ONCE(kvm_follow_refcounted_pfn(kfp, page) != pfn);

out:
pte_unmap_unlock(ptep, ptl);
- *p_pfn = pfn;
+
+ if (page && !kfp->refcounted_page &&
+ !kfp->allow_non_refcounted_struct_page) {
+ r = -EFAULT;
+ } else if (!kfp->refcounted_page &&
+ !kfp->guarded_by_mmu_notifier &&
+ !allow_unsafe_mappings) {
+ r = -EFAULT;
+ } else {
+ *p_pfn = pfn;
+ }

return r;
}
@@ -3004,6 +3023,11 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *kfp)
kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp)
{
kfp->writable = false;
+ kfp->refcounted_page = NULL;
+
+ if (WARN_ON_ONCE(!(kfp->flags & FOLL_GET) && !kfp->guarded_by_mmu_notifier))
+ return KVM_PFN_ERR_FAULT;
+
kfp->hva = __gfn_to_hva_many(kfp->slot, kfp->gfn, NULL,
kfp->flags & FOLL_WRITE);

@@ -3028,9 +3052,10 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
struct kvm_follow_pfn kfp = {
.slot = slot,
.gfn = gfn,
- .flags = 0,
+ .flags = FOLL_GET,
.atomic = atomic,
.try_map_writable = !!writable,
+ .allow_non_refcounted_struct_page = false,
};

if (write_fault)
@@ -3060,8 +3085,9 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
struct kvm_follow_pfn kfp = {
.slot = gfn_to_memslot(kvm, gfn),
.gfn = gfn,
- .flags = write_fault ? FOLL_WRITE : 0,
+ .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
.try_map_writable = !!writable,
+ .allow_non_refcounted_struct_page = false,
};
pfn = kvm_follow_pfn(&kfp);
if (writable)
@@ -3075,7 +3101,8 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
struct kvm_follow_pfn kfp = {
.slot = slot,
.gfn = gfn,
- .flags = FOLL_WRITE,
+ .flags = FOLL_GET | FOLL_WRITE,
+ .allow_non_refcounted_struct_page = false,
};
return kvm_follow_pfn(&kfp);
}
@@ -3086,8 +3113,13 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
struct kvm_follow_pfn kfp = {
.slot = slot,
.gfn = gfn,
- .flags = FOLL_WRITE,
+ .flags = FOLL_GET | FOLL_WRITE,
.atomic = true,
+ /*
+ * Setting atomic means __kvm_follow_pfn will never make it
+ * to hva_to_pfn_remapped, so this is vacuously true.
+ */
+ .allow_non_refcounted_struct_page = true,
};
return kvm_follow_pfn(&kfp);
}
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 1fb21c2ced5d..6e82062ea203 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -147,8 +147,9 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
struct kvm_follow_pfn kfp = {
.slot = gpc->memslot,
.gfn = gpa_to_gfn(gpc->gpa),
- .flags = FOLL_WRITE,
+ .flags = FOLL_GET | FOLL_WRITE,
.hva = gpc->uhva,
+ .allow_non_refcounted_struct_page = false,
};

lockdep_assert_held(&gpc->refresh_lock);
--
2.44.0.rc1.240.g4c46232300-goog

2024-02-29 02:59:47

[permalink] [raw]

Subject: [PATCH v11 3/8] KVM: mmu: Introduce kvm_follow_pfn()

From: David Stevens <[email protected]>

Introduce kvm_follow_pfn(), which will replace __gfn_to_pfn_memslot().
This initial implementation is just a refactor of the existing API which
uses a single structure for passing the arguments. The arguments are
further refactored as follows:

- The write_fault and interruptible boolean flags and the in
parameter part of async are replaced by setting FOLL_WRITE,
FOLL_INTERRUPTIBLE, and FOLL_NOWAIT respectively in a new flags
argument.
- The out parameter portion of the async parameter is now a return
value.
- The writable in/out parameter is split into a separate.
try_map_writable in parameter and writable out parameter.
- All other parameter are the same.

Upcoming changes will add the ability to get a pfn without needing to
take a ref to the underlying page.

Signed-off-by: David Stevens <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
include/linux/kvm_host.h | 18 ++++
virt/kvm/kvm_main.c | 191 +++++++++++++++++++++------------------
virt/kvm/kvm_mm.h | 3 +-
virt/kvm/pfncache.c | 10 +-
4 files changed, 131 insertions(+), 91 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..290db5133c36 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -97,6 +97,7 @@
#define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
#define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
#define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
+#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)

/*
* error pfns indicate that the gfn is in slot but faild to
@@ -1209,6 +1210,23 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
void kvm_release_page_clean(struct page *page);
void kvm_release_page_dirty(struct page *page);

+struct kvm_follow_pfn {
+ const struct kvm_memory_slot *slot;
+ gfn_t gfn;
+ /* FOLL_* flags modifying lookup behavior. */
+ unsigned int flags;
+ /* Whether this function can sleep. */
+ bool atomic;
+ /* Try to create a writable mapping even for a read fault. */
+ bool try_map_writable;
+
+ /* Outputs of kvm_follow_pfn */
+ hva_t hva;
+ bool writable;
+};
+
+kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp);
+
kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6f37d56fb2fc..575756c9c5b0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2791,8 +2791,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
* true indicates success, otherwise false is returned. It's also the
* only part that runs if we can in atomic context.
*/
-static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
- bool *writable, kvm_pfn_t *pfn)
+static bool hva_to_pfn_fast(struct kvm_follow_pfn *kfp, kvm_pfn_t *pfn)
{
struct page *page[1];

@@ -2801,14 +2800,12 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
* or the caller allows to map a writable pfn for a read fault
* request.
*/
- if (!(write_fault || writable))
+ if (!((kfp->flags & FOLL_WRITE) || kfp->try_map_writable))
return false;

- if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+ if (get_user_page_fast_only(kfp->hva, FOLL_WRITE, page)) {
*pfn = page_to_pfn(page[0]);
-
- if (writable)
- *writable = true;
+ kfp->writable = true;
return true;
}

@@ -2819,8 +2816,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
* The slow path to get the pfn of the specified host virtual address,
* 1 indicates success, -errno is returned if error is detected.
*/
-static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
- bool interruptible, bool *writable, kvm_pfn_t *pfn)
+static int hva_to_pfn_slow(struct kvm_follow_pfn *kfp, kvm_pfn_t *pfn)
{
/*
* When a VCPU accesses a page that is not mapped into the secondary
@@ -2833,32 +2829,24 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
* Note that get_user_page_fast_only() and FOLL_WRITE for now
* implicitly honor NUMA hinting faults and don't need this flag.
*/
- unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT;
+ unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT | kfp->flags;
struct page *page;
int npages;

might_sleep();

- if (writable)
- *writable = write_fault;
-
- if (write_fault)
- flags |= FOLL_WRITE;
- if (async)
- flags |= FOLL_NOWAIT;
- if (interruptible)
- flags |= FOLL_INTERRUPTIBLE;
-
- npages = get_user_pages_unlocked(addr, 1, &page, flags);
+ npages = get_user_pages_unlocked(kfp->hva, 1, &page, flags);
if (npages != 1)
return npages;

- /* map read fault as writable if possible */
- if (unlikely(!write_fault) && writable) {
+ if (kfp->flags & FOLL_WRITE) {
+ kfp->writable = true;
+ } else if (kfp->try_map_writable) {
struct page *wpage;

- if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
- *writable = true;
+ /* map read fault as writable if possible */
+ if (get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
+ kfp->writable = true;
put_page(page);
page = wpage;
}
@@ -2889,23 +2877,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
}

static int hva_to_pfn_remapped(struct vm_area_struct *vma,
- unsigned long addr, bool write_fault,
- bool *writable, kvm_pfn_t *p_pfn)
+ struct kvm_follow_pfn *kfp, kvm_pfn_t *p_pfn)
{
kvm_pfn_t pfn;
pte_t *ptep;
pte_t pte;
spinlock_t *ptl;
+ bool write_fault = kfp->flags & FOLL_WRITE;
int r;

- r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+ r = follow_pte(vma->vm_mm, kfp->hva, &ptep, &ptl);
if (r) {
/*
* get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
* not call the fault handler, so do it here.
*/
bool unlocked = false;
- r = fixup_user_fault(current->mm, addr,
+ r = fixup_user_fault(current->mm, kfp->hva,
(write_fault ? FAULT_FLAG_WRITE : 0),
&unlocked);
if (unlocked)
@@ -2913,7 +2901,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
if (r)
return r;

- r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+ r = follow_pte(vma->vm_mm, kfp->hva, &ptep, &ptl);
if (r)
return r;
}
@@ -2925,8 +2913,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
goto out;
}

- if (writable)
- *writable = pte_write(pte);
+ kfp->writable = pte_write(pte);
pfn = pte_pfn(pte);

/*
@@ -2957,38 +2944,28 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
}

/*
- * Pin guest page in memory and return its pfn.
- * @addr: host virtual address which maps memory to the guest
- * @atomic: whether this function can sleep
- * @interruptible: whether the process can be interrupted by non-fatal signals
- * @async: whether this function need to wait IO complete if the
- * host page is not in the memory
- * @write_fault: whether we should get a writable host page
- * @writable: whether it allows to map a writable host page for !@write_fault
- *
- * The function will map a writable host page for these two cases:
- * 1): @write_fault = true
- * 2): @write_fault = false && @writable, @writable will tell the caller
- * whether the mapping is writable.
+ * Convert a hva to a pfn.
+ * @kfp: args struct for the conversion
*/
-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
- bool *async, bool write_fault, bool *writable)
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *kfp)
{
struct vm_area_struct *vma;
kvm_pfn_t pfn;
int npages, r;

- /* we can do it either atomically or asynchronously, not both */
- WARN_ON_ONCE(atomic && async);
+ /*
+ * FOLL_NOWAIT is used for async page faults, which don't make sense
+ * in an atomic context where the caller can't do async resolution.
+ */
+ WARN_ON_ONCE(kfp->atomic && (kfp->flags & FOLL_NOWAIT));

- if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
+ if (hva_to_pfn_fast(kfp, &pfn))
return pfn;

- if (atomic)
+ if (kfp->atomic)
return KVM_PFN_ERR_FAULT;

- npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
- writable, &pfn);
+ npages = hva_to_pfn_slow(kfp, &pfn);
if (npages == 1)
return pfn;
if (npages == -EINTR)
@@ -2996,83 +2973,123 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,

mmap_read_lock(current->mm);
if (npages == -EHWPOISON ||
- (!async && check_user_page_hwpoison(addr))) {
+ (!(kfp->flags & FOLL_NOWAIT) && check_user_page_hwpoison(kfp->hva))) {
pfn = KVM_PFN_ERR_HWPOISON;
goto exit;
}

retry:
- vma = vma_lookup(current->mm, addr);
+ vma = vma_lookup(current->mm, kfp->hva);

if (vma == NULL)
pfn = KVM_PFN_ERR_FAULT;
else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
- r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
+ r = hva_to_pfn_remapped(vma, kfp, &pfn);
if (r == -EAGAIN)
goto retry;
if (r < 0)
pfn = KVM_PFN_ERR_FAULT;
} else {
- if (async && vma_is_valid(vma, write_fault))
- *async = true;
- pfn = KVM_PFN_ERR_FAULT;
+ if ((kfp->flags & FOLL_NOWAIT) &&
+ vma_is_valid(vma, kfp->flags & FOLL_WRITE))
+ pfn = KVM_PFN_ERR_NEEDS_IO;
+ else
+ pfn = KVM_PFN_ERR_FAULT;
}
exit:
mmap_read_unlock(current->mm);
return pfn;
}

-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool interruptible, bool *async,
- bool write_fault, bool *writable, hva_t *hva)
+kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp)
{
- unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
+ kfp->writable = false;
+ kfp->hva = __gfn_to_hva_many(kfp->slot, kfp->gfn, NULL,
+ kfp->flags & FOLL_WRITE);

- if (hva)
- *hva = addr;
-
- if (addr == KVM_HVA_ERR_RO_BAD) {
- if (writable)
- *writable = false;
+ if (kfp->hva == KVM_HVA_ERR_RO_BAD)
return KVM_PFN_ERR_RO_FAULT;
- }

- if (kvm_is_error_hva(addr)) {
- if (writable)
- *writable = false;
+ if (kvm_is_error_hva(kfp->hva))
return KVM_PFN_NOSLOT;
- }

- /* Do not map writable pfn in the readonly memslot. */
- if (writable && memslot_is_readonly(slot)) {
- *writable = false;
- writable = NULL;
- }
+ if (memslot_is_readonly(kfp->slot))
+ kfp->try_map_writable = false;
+
+ return hva_to_pfn(kfp);
+}
+EXPORT_SYMBOL_GPL(kvm_follow_pfn);
+
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
+ bool atomic, bool interruptible, bool *async,
+ bool write_fault, bool *writable, hva_t *hva)
+{
+ kvm_pfn_t pfn;
+ struct kvm_follow_pfn kfp = {
+ .slot = slot,
+ .gfn = gfn,
+ .flags = 0,
+ .atomic = atomic,
+ .try_map_writable = !!writable,
+ };
+
+ if (write_fault)
+ kfp.flags |= FOLL_WRITE;
+ if (async)
+ kfp.flags |= FOLL_NOWAIT;
+ if (interruptible)
+ kfp.flags |= FOLL_INTERRUPTIBLE;

- return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
- writable);
+ pfn = kvm_follow_pfn(&kfp);
+ if (pfn == KVM_PFN_ERR_NEEDS_IO) {
+ *async = true;
+ pfn = KVM_PFN_ERR_FAULT;
+ }
+ if (hva)
+ *hva = kfp.hva;
+ if (writable)
+ *writable = kfp.writable;
+ return pfn;
}
EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);

kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable)
{
- return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
- NULL, write_fault, writable, NULL);
+ kvm_pfn_t pfn;
+ struct kvm_follow_pfn kfp = {
+ .slot = gfn_to_memslot(kvm, gfn),
+ .gfn = gfn,
+ .flags = write_fault ? FOLL_WRITE : 0,
+ .try_map_writable = !!writable,
+ };
+ pfn = kvm_follow_pfn(&kfp);
+ if (writable)
+ *writable = kfp.writable;
+ return pfn;
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);

kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
- NULL, NULL);
+ struct kvm_follow_pfn kfp = {
+ .slot = slot,
+ .gfn = gfn,
+ .flags = FOLL_WRITE,
+ };
+ return kvm_follow_pfn(&kfp);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);

kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
- NULL, NULL);
+ struct kvm_follow_pfn kfp = {
+ .slot = slot,
+ .gfn = gfn,
+ .flags = FOLL_WRITE,
+ .atomic = true,
+ };
+ return kvm_follow_pfn(&kfp);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);

diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index ecefc7ec51af..9ba61fbb727c 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -20,8 +20,7 @@
#define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock)
#endif /* KVM_HAVE_MMU_RWLOCK */

-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
- bool *async, bool write_fault, bool *writable);
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);

#ifdef CONFIG_HAVE_KVM_PFNCACHE
void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2d6aba677830..1fb21c2ced5d 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
void *new_khva = NULL;
unsigned long mmu_seq;
+ struct kvm_follow_pfn kfp = {
+ .slot = gpc->memslot,
+ .gfn = gpa_to_gfn(gpc->gpa),
+ .flags = FOLL_WRITE,
+ .hva = gpc->uhva,
+ };

lockdep_assert_held(&gpc->refresh_lock);

@@ -182,8 +188,8 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
cond_resched();
}

- /* We always request a writeable mapping */
- new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+ /* We always request a writable mapping */
+ new_pfn = hva_to_pfn(&kfp);
if (is_error_noslot_pfn(new_pfn))
goto out_error;

--
2.44.0.rc1.240.g4c46232300-goog

2024-02-29 03:00:03

[permalink] [raw]

Subject: [PATCH v11 5/8] KVM: Migrate kvm_vcpu_map() to kvm_follow_pfn()

From: David Stevens <[email protected]>

Migrate kvm_vcpu_map() to kvm_follow_pfn(). Track is_refcounted_page so
that kvm_vcpu_unmap() know whether or not it needs to release the page.

Signed-off-by: David Stevens <[email protected]>
---
include/linux/kvm_host.h | 2 +-
virt/kvm/kvm_main.c | 24 ++++++++++++++----------
2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 66516088bb0a..59dc9fbafc08 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -295,6 +295,7 @@ struct kvm_host_map {
void *hva;
kvm_pfn_t pfn;
kvm_pfn_t gfn;
+ bool is_refcounted_page;
};

/*
@@ -1270,7 +1271,6 @@ void kvm_release_pfn_dirty(kvm_pfn_t pfn);
void kvm_set_pfn_dirty(kvm_pfn_t pfn);
void kvm_set_pfn_accessed(kvm_pfn_t pfn);

-void kvm_release_pfn(kvm_pfn_t pfn, bool dirty);
int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
int len);
int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 984bcf8511e7..17bf9fd6774e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3184,24 +3184,22 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
}
EXPORT_SYMBOL_GPL(gfn_to_page);

-void kvm_release_pfn(kvm_pfn_t pfn, bool dirty)
-{
- if (dirty)
- kvm_release_pfn_dirty(pfn);
- else
- kvm_release_pfn_clean(pfn);
-}
-
int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
{
kvm_pfn_t pfn;
void *hva = NULL;
struct page *page = KVM_UNMAPPED_PAGE;
+ struct kvm_follow_pfn kfp = {
+ .slot = gfn_to_memslot(vcpu->kvm, gfn),
+ .gfn = gfn,
+ .flags = FOLL_GET | FOLL_WRITE,
+ .allow_non_refcounted_struct_page = true,
+ };

if (!map)
return -EINVAL;

- pfn = gfn_to_pfn(vcpu->kvm, gfn);
+ pfn = kvm_follow_pfn(&kfp);
if (is_error_noslot_pfn(pfn))
return -EINVAL;

@@ -3221,6 +3219,7 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
map->hva = hva;
map->pfn = pfn;
map->gfn = gfn;
+ map->is_refcounted_page = !!kfp.refcounted_page;

return 0;
}
@@ -3244,7 +3243,12 @@ void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty)
if (dirty)
kvm_vcpu_mark_page_dirty(vcpu, map->gfn);

- kvm_release_pfn(map->pfn, dirty);
+ if (map->is_refcounted_page) {
+ if (dirty)
+ kvm_release_page_dirty(map->page);
+ else
+ kvm_release_page_clean(map->page);
+ }

map->hva = NULL;
map->page = NULL;
--
2.44.0.rc1.240.g4c46232300-goog

2024-02-29 03:00:34

[permalink] [raw]

Subject: [PATCH v11 6/8] KVM: x86: Migrate to kvm_follow_pfn()

From: David Stevens <[email protected]>

Migrate functions which need to be able to map non-refcounted struct
pages to kvm_follow_pfn(). These functions are kvm_faultin_pfn() and
reexecute_instruction(). The former requires replacing the async in/out
parameter with FOLL_NOWAIT parameter and the KVM_PFN_ERR_NEEDS_IO return
value (actually handling non-refcounted pages is complicated, so it will
be done in a followup). The latter is a straightforward refactor.

APIC related callers do not need to migrate because KVM controls the
memslot, so it will always be regular memory. Prefetch related callers
do not need to be migrated because atomic gfn_to_pfn() calls can never
make it to hva_to_pfn_remapped().

Signed-off-by: David Stevens <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 43 ++++++++++++++++++++++++++++++++----------
arch/x86/kvm/x86.c | 11 +++++++++--
virt/kvm/kvm_main.c | 11 ++++-------
3 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..bbeb0f6783d7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4331,7 +4331,14 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
- bool async;
+ struct kvm_follow_pfn kfp = {
+ .slot = slot,
+ .gfn = fault->gfn,
+ .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+ .try_map_writable = true,
+ .guarded_by_mmu_notifier = true,
+ .allow_non_refcounted_struct_page = false,
+ };

/*
* Retry the page fault if the gfn hit a memslot that is being deleted
@@ -4368,12 +4375,20 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);

- async = false;
- fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
- fault->write, &fault->map_writable,
- &fault->hva);
- if (!async)
- return RET_PF_CONTINUE; /* *pfn has correct page already */
+ kfp.flags |= FOLL_NOWAIT;
+ fault->pfn = kvm_follow_pfn(&kfp);
+
+ if (!is_error_noslot_pfn(fault->pfn))
+ goto success;
+
+ /*
+ * If kvm_follow_pfn() failed because I/O is needed to fault in the
+ * page, then either set up an asynchronous #PF to do the I/O, or if
+ * doing an async #PF isn't possible, retry kvm_follow_pfn() with
+ * I/O allowed. All other failures are fatal, i.e. retrying won't help.
+ */
+ if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
+ return RET_PF_CONTINUE;

if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
trace_kvm_try_async_get_page(fault->addr, fault->gfn);
@@ -4391,9 +4406,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* to wait for IO. Note, gup always bails if it is unable to quickly
* get a page and a fatal signal, i.e. SIGKILL, is pending.
*/
- fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
- fault->write, &fault->map_writable,
- &fault->hva);
+ kfp.flags |= FOLL_INTERRUPTIBLE;
+ kfp.flags &= ~FOLL_NOWAIT;
+ fault->pfn = kvm_follow_pfn(&kfp);
+
+ if (!is_error_noslot_pfn(fault->pfn))
+ goto success;
+
+ return RET_PF_CONTINUE;
+success:
+ fault->hva = kfp.hva;
+ fault->map_writable = kfp.writable;
return RET_PF_CONTINUE;
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 363b1c080205..f4a20e9bc7a6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8747,6 +8747,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
{
gpa_t gpa = cr2_or_gpa;
kvm_pfn_t pfn;
+ struct kvm_follow_pfn kfp;

if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
@@ -8776,7 +8777,13 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* retry instruction -> write #PF -> emulation fail -> retry
* instruction -> ...
*/
- pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+ kfp = (struct kvm_follow_pfn) {
+ .slot = gfn_to_memslot(vcpu->kvm, gpa_to_gfn(gpa)),
+ .gfn = gpa_to_gfn(gpa),
+ .flags = FOLL_GET | FOLL_WRITE,
+ .allow_non_refcounted_struct_page = true,
+ };
+ pfn = kvm_follow_pfn(&kfp);

/*
* If the instruction failed on the error pfn, it can not be fixed,
@@ -8785,7 +8792,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (is_error_noslot_pfn(pfn))
return false;

- kvm_release_pfn_clean(pfn);
+ kvm_release_page_clean(kfp.refcounted_page);

/* The instructions are well-emulated on direct mmu. */
if (vcpu->arch.mmu->root_role.direct) {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 17bf9fd6774e..24e2269339cb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3293,6 +3293,9 @@ void kvm_release_page_clean(struct page *page)
{
WARN_ON(is_error_page(page));

+ if (!page)
+ return;
+
kvm_set_page_accessed(page);
put_page(page);
}
@@ -3300,16 +3303,10 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
- struct page *page;
-
if (is_error_noslot_pfn(pfn))
return;

- page = kvm_pfn_to_refcounted_page(pfn);
- if (!page)
- return;
-
- kvm_release_page_clean(page);
+ kvm_release_page_clean(kvm_pfn_to_refcounted_page(pfn));
}
EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

--
2.44.0.rc1.240.g4c46232300-goog

2024-02-29 03:00:46

[permalink] [raw]

Subject: [PATCH v11 7/8] KVM: x86/mmu: Track if sptes refer to refcounted pages

From: David Stevens <[email protected]>

Use one of the unused bits in EPT sptes to track whether or not an spte
refers to a struct page that has a valid refcount, in preparation for
adding support for mapping such pages into guests. The new bit is used
to avoid triggering a page_count() == 0 warning and to avoid touching
A/D bits of unknown usage.

Non-EPT sptes don't have any free bits to use, so this tracking is not
possible when TDP is disabled or on 32-bit x86.

Signed-off-by: David Stevens <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 47 ++++++++++++++++++++--------------
arch/x86/kvm/mmu/paging_tmpl.h | 5 ++--
arch/x86/kvm/mmu/spte.c | 5 +++-
arch/x86/kvm/mmu/spte.h | 16 +++++++++++-
arch/x86/kvm/mmu/tdp_mmu.c | 21 ++++++++-------
include/linux/kvm_host.h | 3 +++
virt/kvm/kvm_main.c | 6 +++--
7 files changed, 69 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bbeb0f6783d7..4936a8c5829b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -541,12 +541,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)

if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
flush = true;
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+ if (is_refcounted_page_spte(old_spte))
+ kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
}

if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
flush = true;
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ if (is_refcounted_page_spte(old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
}

return flush;
@@ -578,20 +580,23 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)

pfn = spte_to_pfn(old_spte);

- /*
- * KVM doesn't hold a reference to any pages mapped into the guest, and
- * instead uses the mmu_notifier to ensure that KVM unmaps any pages
- * before they are reclaimed. Sanity check that, if the pfn is backed
- * by a refcounted page, the refcount is elevated.
- */
- page = kvm_pfn_to_refcounted_page(pfn);
- WARN_ON_ONCE(page && !page_count(page));
+ if (is_refcounted_page_spte(old_spte)) {
+ /*
+ * KVM doesn't hold a reference to any pages mapped into the
+ * guest, and instead uses the mmu_notifier to ensure that KVM
+ * unmaps any pages before they are reclaimed. Sanity check
+ * that, if the pfn is backed by a refcounted page, the
+ * refcount is elevated.
+ */
+ page = kvm_pfn_to_refcounted_page(pfn);
+ WARN_ON_ONCE(!page || !page_count(page));

- if (is_accessed_spte(old_spte))
- kvm_set_pfn_accessed(pfn);
+ if (is_accessed_spte(old_spte))
+ kvm_set_page_accessed(pfn_to_page(pfn));

- if (is_dirty_spte(old_spte))
- kvm_set_pfn_dirty(pfn);
+ if (is_dirty_spte(old_spte))
+ kvm_set_page_dirty(pfn_to_page(pfn));
+ }

return old_spte;
}
@@ -627,8 +632,8 @@ static bool mmu_spte_age(u64 *sptep)
* Capture the dirty status of the page, so that it doesn't get
* lost when the SPTE is marked for access tracking.
*/
- if (is_writable_pte(spte))
- kvm_set_pfn_dirty(spte_to_pfn(spte));
+ if (is_writable_pte(spte) && is_refcounted_page_spte(spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));

spte = mark_spte_for_access_track(spte);
mmu_spte_update_no_track(sptep, spte);
@@ -1267,8 +1272,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
{
bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
(unsigned long *)sptep);
- if (was_writable && !spte_ad_enabled(*sptep))
- kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+ if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_spte(*sptep))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));

return was_writable;
}
@@ -2946,7 +2951,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
}

wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
- true, host_writable, &spte);
+ true, host_writable, true, &spte);

if (*sptep == spte) {
ret = RET_PF_SPURIOUS;
@@ -5999,6 +6004,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,

#ifdef CONFIG_X86_64
tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
+
+ /* The SPTE_MMU_PAGE_REFCOUNTED bit is only available with EPT. */
+ if (enable_tdp)
+ shadow_refcounted_mask = SPTE_MMU_PAGE_REFCOUNTED;
#endif
/*
* max_huge_page_level reflects KVM's MMU capabilities irrespective
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 4d4e98fe4f35..c965f77ac4d5 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -902,7 +902,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
*/
static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
{
- bool host_writable;
+ bool host_writable, is_refcounted;
gpa_t first_pte_gpa;
u64 *sptep, spte;
struct kvm_memory_slot *slot;
@@ -959,10 +959,11 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
sptep = &sp->spt[i];
spte = *sptep;
host_writable = spte & shadow_host_writable_mask;
+ is_refcounted = is_refcounted_page_spte(spte);
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
make_spte(vcpu, sp, slot, pte_access, gfn,
spte_to_pfn(spte), spte, true, false,
- host_writable, &spte);
+ host_writable, is_refcounted, &spte);

return mmu_spte_update(sptep, spte);
}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..e4a458b7e185 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -39,6 +39,7 @@ u64 __read_mostly shadow_memtype_mask;
u64 __read_mostly shadow_me_value;
u64 __read_mostly shadow_me_mask;
u64 __read_mostly shadow_acc_track_mask;
+u64 __read_mostly shadow_refcounted_mask;

u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
@@ -138,7 +139,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte)
+ bool host_writable, bool is_refcounted, u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -188,6 +189,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,

if (level > PG_LEVEL_4K)
spte |= PT_PAGE_SIZE_MASK;
+ if (is_refcounted)
+ spte |= shadow_refcounted_mask;

if (shadow_memtype_mask)
spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a88..6bf0069d8db6 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -96,6 +96,13 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
/* Defined only to keep the above static asserts readable. */
#undef SHADOW_ACC_TRACK_SAVED_MASK

+/*
+ * Indicates that the SPTE refers to a page with a valid refcount. Only
+ * available for TDP SPTEs, since bits 62:52 are reserved for PAE paging,
+ * including NPT PAE.
+ */
+#define SPTE_MMU_PAGE_REFCOUNTED BIT_ULL(59)
+
/*
* Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
* the memslots generation and is derived as follows:
@@ -345,6 +352,13 @@ static inline bool is_dirty_spte(u64 spte)
return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
}

+extern u64 __read_mostly shadow_refcounted_mask;
+
+static inline bool is_refcounted_page_spte(u64 spte)
+{
+ return !shadow_refcounted_mask || (spte & shadow_refcounted_mask);
+}
+
static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
int level)
{
@@ -475,7 +489,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte);
+ bool host_writable, bool is_refcounted, u64 *new_spte);
u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
union kvm_mmu_page_role role, int index);
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6ae19b4ee5b1..ee497fb78d90 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -414,6 +414,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+ bool is_refcounted = is_refcounted_page_spte(old_spte);

WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -478,9 +479,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
if (is_leaf != was_leaf)
kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);

- if (was_leaf && is_dirty_spte(old_spte) &&
+ if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
(!is_present || !is_dirty_spte(new_spte) || pfn_changed))
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));

/*
* Recursively handle child PTs if the change removed a subtree from
@@ -492,9 +493,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
(is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);

- if (was_leaf && is_accessed_spte(old_spte) &&
+ if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
(!is_present || !is_accessed_spte(new_spte) || pfn_changed))
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+ kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
}

/*
@@ -956,8 +957,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
- fault->pfn, iter->old_spte, fault->prefetch, true,
- fault->map_writable, &new_spte);
+ fault->pfn, iter->old_spte, fault->prefetch, true,
+ fault->map_writable, true, &new_spte);

if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
@@ -1178,8 +1179,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
* Capture the dirty status of the page, so that it doesn't get
* lost when the SPTE is marked for access tracking.
*/
- if (is_writable_pte(iter->old_spte))
- kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
+ if (is_writable_pte(iter->old_spte) &&
+ is_refcounted_page_spte(iter->old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));

new_spte = mark_spte_for_access_track(iter->old_spte);
iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
@@ -1602,7 +1604,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
iter.old_spte,
iter.old_spte & ~dbit);
- kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+ if (is_refcounted_page_spte(iter.old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
}

rcu_read_unlock();
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 59dc9fbafc08..d19a418df04b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1211,6 +1211,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
void kvm_release_page_clean(struct page *page);
void kvm_release_page_dirty(struct page *page);

+void kvm_set_page_accessed(struct page *page);
+void kvm_set_page_dirty(struct page *page);
+
struct kvm_follow_pfn {
const struct kvm_memory_slot *slot;
gfn_t gfn;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 24e2269339cb..235c92830cdc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3277,17 +3277,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
return !PageReserved(page);
}

-static void kvm_set_page_dirty(struct page *page)
+void kvm_set_page_dirty(struct page *page)
{
if (kvm_is_ad_tracked_page(page))
SetPageDirty(page);
}
+EXPORT_SYMBOL_GPL(kvm_set_page_dirty);

-static void kvm_set_page_accessed(struct page *page)
+void kvm_set_page_accessed(struct page *page)
{
if (kvm_is_ad_tracked_page(page))
mark_page_accessed(page);
}
+EXPORT_SYMBOL_GPL(kvm_set_page_accessed);

void kvm_release_page_clean(struct page *page)
{
--
2.44.0.rc1.240.g4c46232300-goog

2024-02-29 03:01:00

[permalink] [raw]

Subject: [PATCH v11 8/8] KVM: x86/mmu: Handle non-refcounted pages

From: David Stevens <[email protected]>

Handle non-refcounted pages in __kvm_faultin_pfn. This allows the
host to map memory into the guest that is backed by non-refcounted
struct pages - for example, the tail pages of higher order non-compound
pages allocated by the amdgpu driver via ttm_pool_alloc_page.

Signed-off-by: David Stevens <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 24 +++++++++++++++++-------
arch/x86/kvm/mmu/mmu_internal.h | 2 ++
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
include/linux/kvm_host.h | 6 ++++--
virt/kvm/guest_memfd.c | 8 ++++----
virt/kvm/kvm_main.c | 10 ++++++++--
7 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4936a8c5829b..f9046912bb43 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2924,6 +2924,11 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
bool host_writable = !fault || fault->map_writable;
bool prefetch = !fault || fault->prefetch;
bool write_fault = fault && fault->write;
+ /*
+ * Prefetching uses gfn_to_page_many_atomic, which never gets
+ * non-refcounted pages.
+ */
+ bool is_refcounted = !fault || !!fault->accessed_page;

if (unlikely(is_noslot_pfn(pfn))) {
vcpu->stat.pf_mmio_spte_created++;
@@ -2951,7 +2956,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
}

wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
- true, host_writable, true, &spte);
+ true, host_writable, is_refcounted, &spte);

if (*sptep == spte) {
ret = RET_PF_SPURIOUS;
@@ -4319,8 +4324,8 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
return -EFAULT;
}

- r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
- &max_order);
+ r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn,
+ &fault->pfn, &fault->accessed_page, &max_order);
if (r) {
kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
return r;
@@ -4330,6 +4335,9 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
fault->max_level);
fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);

+ /* kvm_gmem_get_pfn takes a refcount, but accessed_page doesn't need it. */
+ put_page(fault->accessed_page);
+
return RET_PF_CONTINUE;
}

@@ -4339,10 +4347,10 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
struct kvm_follow_pfn kfp = {
.slot = slot,
.gfn = fault->gfn,
- .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+ .flags = fault->write ? FOLL_WRITE : 0,
.try_map_writable = true,
.guarded_by_mmu_notifier = true,
- .allow_non_refcounted_struct_page = false,
+ .allow_non_refcounted_struct_page = shadow_refcounted_mask,
};

/*
@@ -4359,6 +4367,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
fault->map_writable = false;
+ fault->accessed_page = NULL;
return RET_PF_CONTINUE;
}
/*
@@ -4422,6 +4431,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
success:
fault->hva = kfp.hva;
fault->map_writable = kfp.writable;
+ fault->accessed_page = kfp.refcounted_page;
return RET_PF_CONTINUE;
}

@@ -4510,8 +4520,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
r = direct_map(vcpu, fault);

out_unlock:
+ kvm_set_page_accessed(fault->accessed_page);
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
return r;
}

@@ -4586,8 +4596,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
r = kvm_tdp_mmu_map(vcpu, fault);

out_unlock:
+ kvm_set_page_accessed(fault->accessed_page);
read_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
return r;
}
#endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..0b05183600af 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -240,6 +240,8 @@ struct kvm_page_fault {
kvm_pfn_t pfn;
hva_t hva;
bool map_writable;
+ /* Does NOT have an elevated refcount */
+ struct page *accessed_page;

/*
* Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c965f77ac4d5..b39dce802394 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -847,8 +847,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
r = FNAME(fetch)(vcpu, fault, &walker);

out_unlock:
+ kvm_set_page_accessed(fault->accessed_page);
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
return r;
}

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ee497fb78d90..0524be7c0796 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -958,7 +958,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
else
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
fault->pfn, iter->old_spte, fault->prefetch, true,
- fault->map_writable, true, &new_spte);
+ fault->map_writable, !!fault->accessed_page,
+ &new_spte);

if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d19a418df04b..ea34eae6cfa4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2426,11 +2426,13 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)

#ifdef CONFIG_KVM_PRIVATE_MEM
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
- gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+ gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+ int *max_order);
#else
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
- kvm_pfn_t *pfn, int *max_order)
+ kvm_pfn_t *pfn, struct page **page,
+ int *max_order)
{
KVM_BUG_ON(1, kvm);
return -EIO;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0f4e0cf4f158..dabcca2ecc37 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -483,12 +483,12 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
}

int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
- gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+ gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+ int *max_order)
{
pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
struct kvm_gmem *gmem;
struct folio *folio;
- struct page *page;
struct file *file;
int r;

@@ -514,9 +514,9 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
goto out_unlock;
}

- page = folio_file_page(folio, index);
+ *page = folio_file_page(folio, index);

- *pfn = page_to_pfn(page);
+ *pfn = page_to_pfn(*page);
if (max_order)
*max_order = 0;

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 235c92830cdc..1f5d2a1e63a9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3284,11 +3284,17 @@ void kvm_set_page_dirty(struct page *page)
}
EXPORT_SYMBOL_GPL(kvm_set_page_dirty);

-void kvm_set_page_accessed(struct page *page)
+static void __kvm_set_page_accessed(struct page *page)
{
if (kvm_is_ad_tracked_page(page))
mark_page_accessed(page);
}
+
+void kvm_set_page_accessed(struct page *page)
+{
+ if (page)
+ __kvm_set_page_accessed(page);
+}
EXPORT_SYMBOL_GPL(kvm_set_page_accessed);

void kvm_release_page_clean(struct page *page)
@@ -3298,7 +3304,7 @@ void kvm_release_page_clean(struct page *page)
if (!page)
return;

- kvm_set_page_accessed(page);
+ __kvm_set_page_accessed(page);
put_page(page);
}
EXPORT_SYMBOL_GPL(kvm_release_page_clean);
--
2.44.0.rc1.240.g4c46232300-goog

2024-02-29 13:36:48

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> Our use case is virtio-gpu blob resources [1], which directly map host
> graphics buffers into the guest as "vram" for the virtio-gpu device.
> This feature currently does not work on systems using the amdgpu driver,
> as that driver allocates non-compound higher order pages via
> ttm_pool_alloc_page().

. and just as last time around that is still the problem that needs
to be fixed instead of creating a monster like this to map
non-refcounted pages.

2024-03-13 04:55:44

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> > Our use case is virtio-gpu blob resources [1], which directly map host
> > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > This feature currently does not work on systems using the amdgpu driver,
> > as that driver allocates non-compound higher order pages via
> > ttm_pool_alloc_page().
>
> .. and just as last time around that is still the problem that needs
> to be fixed instead of creating a monster like this to map
> non-refcounted pages.
>

Patches to amdgpu to have been NAKed [1] with the justification that
using non-refcounted pages is working as intended and KVM is in the
wrong for wanting to take references to pages mapped with VM_PFNMAP
[2].

The existence of the VM_PFNMAP implies that the existence of
non-refcounted pages is working as designed. We can argue about
whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
KVM should be able to handle it. Also note that this is not adding a
new source of non-refcounted pages, so it doesn't make removing
non-refcounted pages more difficult, if the kernel does decide to go
in that direction.

-David

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

2024-03-13 09:55:59

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

Am 13.03.24 um 05:55 schrieb David Stevens:
> On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <[email protected]> wrote:
>> On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
>>> Our use case is virtio-gpu blob resources [1], which directly map host
>>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>>> This feature currently does not work on systems using the amdgpu driver,
>>> as that driver allocates non-compound higher order pages via
>>> ttm_pool_alloc_page().
>> .. and just as last time around that is still the problem that needs
>> to be fixed instead of creating a monster like this to map
>> non-refcounted pages.
>>
> Patches to amdgpu to have been NAKed [1] with the justification that
> using non-refcounted pages is working as intended and KVM is in the
> wrong for wanting to take references to pages mapped with VM_PFNMAP
> [2].
>
> The existence of the VM_PFNMAP implies that the existence of
> non-refcounted pages is working as designed. We can argue about
> whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
> KVM should be able to handle it. Also note that this is not adding a
> new source of non-refcounted pages, so it doesn't make removing
> non-refcounted pages more difficult, if the kernel does decide to go
> in that direction.

Well, the meaning of VM_PFNMAP is that you should not touch the
underlying struct page the PTE is pointing to. As far as I can see this
includes grabbing a reference count.

But that isn't really the problem here. The issue is rather that KVM
assumes that by grabbing a reference count to the page that the driver
won't change the PTE to point somewhere else.. And that is simply not true.

So what KVM needs to do is to either have an MMU notifier installed so
that updates to the PTEs on the host side are reflected immediately to
the PTEs on the guest side.

Or (even better) you use hardware functionality like nested page tables
so that we don't actually need to update the guest PTEs when the host
PTEs change.

And when you have either of those two functionalities the requirement to
add a long term reference to the struct page goes away completely. So
when this is done right you don't need to grab a reference in the first
place.

Regards,
Christian.

>
> -David
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/

2024-03-13 13:34:24

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 05:55 schrieb David Stevens:
> > On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <[email protected]> wrote:
> > > On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> > > > Our use case is virtio-gpu blob resources [1], which directly map host
> > > > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > > > This feature currently does not work on systems using the amdgpu driver,
> > > > as that driver allocates non-compound higher order pages via
> > > > ttm_pool_alloc_page().
> > > .. and just as last time around that is still the problem that needs
> > > to be fixed instead of creating a monster like this to map
> > > non-refcounted pages.
> > >
> > Patches to amdgpu to have been NAKed [1] with the justification that
> > using non-refcounted pages is working as intended and KVM is in the
> > wrong for wanting to take references to pages mapped with VM_PFNMAP
> > [2].
> >
> > The existence of the VM_PFNMAP implies that the existence of
> > non-refcounted pages is working as designed. We can argue about
> > whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
> > KVM should be able to handle it. Also note that this is not adding a
> > new source of non-refcounted pages, so it doesn't make removing
> > non-refcounted pages more difficult, if the kernel does decide to go
> > in that direction.
>
> Well, the meaning of VM_PFNMAP is that you should not touch the underlying
> struct page the PTE is pointing to. As far as I can see this includes
> grabbing a reference count.
>
> But that isn't really the problem here. The issue is rather that KVM assumes
> that by grabbing a reference count to the page that the driver won't change
> the PTE to point somewhere else.. And that is simply not true.

No, KVM doesn't assume that.

> So what KVM needs to do is to either have an MMU notifier installed so that
> updates to the PTEs on the host side are reflected immediately to the PTEs
> on the guest side.

KVM already has an MMU notifier and reacts accordingly.

> Or (even better) you use hardware functionality like nested page tables so
> that we don't actually need to update the guest PTEs when the host PTEs
> change.

That's not how stage-2 page tables work.

> And when you have either of those two functionalities the requirement to add
> a long term reference to the struct page goes away completely. So when this
> is done right you don't need to grab a reference in the first place.

The KVM issue that this series is solving isn't that KVM grabs a reference, it's
that KVM assumes that any non-reserved pfn that is backed by "struct page" is
refcounted.

What Christoph is objecting to is that, in this series, KVM is explicitly adding
support for mapping non-compound (huge)pages into KVM guests. David is arguing
that Christoph's objection to _KVM_ adding support is unfair, because the real
problem is that the kernel already maps such pages into host userspace. I.e. if
the userspace mapping ceases to exist, then there are no mappings for KVM to follow
and propagate to KVM's stage-2 page tables.

2024-03-13 14:35:01

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Wed, Mar 13, 2024 at 01:55:20PM +0900, David Stevens wrote:
> On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <[email protected]> wrote:
> >
> > On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> > > Our use case is virtio-gpu blob resources [1], which directly map host
> > > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > > This feature currently does not work on systems using the amdgpu driver,
> > > as that driver allocates non-compound higher order pages via
> > > ttm_pool_alloc_page().
> >
> > .. and just as last time around that is still the problem that needs
> > to be fixed instead of creating a monster like this to map
> > non-refcounted pages.
> >
>
> Patches to amdgpu to have been NAKed [1] with the justification that
> using non-refcounted pages is working as intended and KVM is in the
> wrong for wanting to take references to pages mapped with VM_PFNMAP
> [2].

So make them not work for KVM as they should to kick the amdgpu
maintainers a***es.

2024-03-13 14:39:06

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

Am 13.03.24 um 14:34 schrieb Sean Christopherson:
> On Wed, Mar 13, 2024, Christian König wrote:
>> Am 13.03.24 um 05:55 schrieb David Stevens:
>>> On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <[email protected]> wrote:
>>>> On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
>>>>> Our use case is virtio-gpu blob resources [1], which directly map host
>>>>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>>>>> This feature currently does not work on systems using the amdgpu driver,
>>>>> as that driver allocates non-compound higher order pages via
>>>>> ttm_pool_alloc_page().
>>>> .. and just as last time around that is still the problem that needs
>>>> to be fixed instead of creating a monster like this to map
>>>> non-refcounted pages.
>>>>
>>> Patches to amdgpu to have been NAKed [1] with the justification that
>>> using non-refcounted pages is working as intended and KVM is in the
>>> wrong for wanting to take references to pages mapped with VM_PFNMAP
>>> [2].
>>>
>>> The existence of the VM_PFNMAP implies that the existence of
>>> non-refcounted pages is working as designed. We can argue about
>>> whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
>>> KVM should be able to handle it. Also note that this is not adding a
>>> new source of non-refcounted pages, so it doesn't make removing
>>> non-refcounted pages more difficult, if the kernel does decide to go
>>> in that direction.
>> Well, the meaning of VM_PFNMAP is that you should not touch the underlying
>> struct page the PTE is pointing to. As far as I can see this includes
>> grabbing a reference count.
>>
>> But that isn't really the problem here. The issue is rather that KVM assumes
>> that by grabbing a reference count to the page that the driver won't change
>> the PTE to point somewhere else.. And that is simply not true.
> No, KVM doesn't assume that.
>
>> So what KVM needs to do is to either have an MMU notifier installed so that
>> updates to the PTEs on the host side are reflected immediately to the PTEs
>> on the guest side.
> KVM already has an MMU notifier and reacts accordingly.
>
>> Or (even better) you use hardware functionality like nested page tables so
>> that we don't actually need to update the guest PTEs when the host PTEs
>> change.
> That's not how stage-2 page tables work.
>
>> And when you have either of those two functionalities the requirement to add
>> a long term reference to the struct page goes away completely. So when this
>> is done right you don't need to grab a reference in the first place.
> The KVM issue that this series is solving isn't that KVM grabs a reference, it's
> that KVM assumes that any non-reserved pfn that is backed by "struct page" is
> refcounted.

Well why does it assumes that? When you have a MMU notifier that seems
unnecessary.

> What Christoph is objecting to is that, in this series, KVM is explicitly adding
> support for mapping non-compound (huge)pages into KVM guests. David is arguing
> that Christoph's objection to _KVM_ adding support is unfair, because the real
> problem is that the kernel already maps such pages into host userspace. I.e. if
> the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> and propagate to KVM's stage-2 page tables.

And I have to agree with Christoph that this doesn't make much sense.
KVM should *never* map (huge) pages from VMAs marked with VM_PFNMAP into
KVM guests in the first place.

What it should do instead is to mirror the PFN from the host page tables
into the guest page tables. If there is a page behind that or not *must*
be completely irrelevant to KVM.

The background here is that drivers are modifying the page table on the
fly to point to either MMIO or real memory, this also includes switching
the caching attributes.

The real question is why is KVM trying to grab a page reference when
there is an MMU notifier installed.

Regards,
Christian.

2024-03-13 14:49:02

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 14:34 schrieb Sean Christopherson:
> > On Wed, Mar 13, 2024, Christian König wrote:
> > > And when you have either of those two functionalities the requirement to add
> > > a long term reference to the struct page goes away completely. So when this
> > > is done right you don't need to grab a reference in the first place.
> > The KVM issue that this series is solving isn't that KVM grabs a reference, it's
> > that KVM assumes that any non-reserved pfn that is backed by "struct page" is
> > refcounted.
>
> Well why does it assumes that? When you have a MMU notifier that seems
> unnecessary.

Indeed, it's legacy code that we're trying to clean up. It's the bulk of this
series.

> > What Christoph is objecting to is that, in this series, KVM is explicitly adding
> > support for mapping non-compound (huge)pages into KVM guests. David is arguing
> > that Christoph's objection to _KVM_ adding support is unfair, because the real
> > problem is that the kernel already maps such pages into host userspace. I.e. if
> > the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> > and propagate to KVM's stage-2 page tables.
>
> And I have to agree with Christoph that this doesn't make much sense. KVM
> should *never* map (huge) pages from VMAs marked with VM_PFNMAP into KVM
> guests in the first place.
>
> What it should do instead is to mirror the PFN from the host page tables
> into the guest page tables.

That's exactly what this series does. Christoph is objecting to KVM playing nice
with non-compound hugepages, as he feels that such mappings should not exist
*anywhere*.

I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
we should instead fix the TTM allocations. And David pointed out that that was
tried and got NAK'd.

2024-03-13 15:15:04

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

Sending that once more as text only since AMD eMail has messed that up
once more.

Regards,
Christian.

Am 13.03.24 um 16:07 schrieb Christian König:
> Am 13.03.24 um 15:48 schrieb Sean Christopherson:
>> On Wed, Mar 13, 2024, Christian König wrote:
>>> Am 13.03.24 um 14:34 schrieb Sean Christopherson:
>>>> On Wed, Mar 13, 2024, Christian König wrote:
>>>>> And when you have either of those two functionalities the requirement to add
>>>>> a long term reference to the struct page goes away completely. So when this
>>>>> is done right you don't need to grab a reference in the first place.
>>>> The KVM issue that this series is solving isn't that KVM grabs a reference, it's
>>>> that KVM assumes that any non-reserved pfn that is backed by "struct page" is
>>>> refcounted.
>>> Well why does it assumes that? When you have a MMU notifier that seems
>>> unnecessary.
>> Indeed, it's legacy code that we're trying to clean up. It's the bulk of this
>> series.
>
> Yeah, that is the right approach as far as I can see.
>
>>>> What Christoph is objecting to is that, in this series, KVM is explicitly adding
>>>> support for mapping non-compound (huge)pages into KVM guests. David is arguing
>>>> that Christoph's objection to _KVM_ adding support is unfair, because the real
>>>> problem is that the kernel already maps such pages into host userspace. I.e. if
>>>> the userspace mapping ceases to exist, then there are no mappings for KVM to follow
>>>> and propagate to KVM's stage-2 page tables.
>>> And I have to agree with Christoph that this doesn't make much sense. KVM
>>> should *never* map (huge) pages from VMAs marked with VM_PFNMAP into KVM
>>> guests in the first place.
>>>
>>> What it should do instead is to mirror the PFN from the host page tables
>>> into the guest page tables.
>> That's exactly what this series does. Christoph is objecting to KVM playing nice
>> with non-compound hugepages, as he feels that such mappings should not exist
>> *anywhere*.
>
> Well Christoph is right those mappings shouldn't exists and they also
> don't exists.
>
> What happens here is that a driver has allocated some contiguous
> memory to do DMA with. And then some page table is pointing to a PFN
> inside that memory because userspace needs to provide parameters for
> the DMA transfer.
>
> This is *not* a mapping of a non-compound hugepage, it's simply a PTE
> pointing to some PFN. It can trivially be that userspace only maps
> 4KiB of some 2MiB piece of memory the driver has allocate.
>
>> I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
>> we should instead fix the TTM allocations. And David pointed out that that was
>> tried and got NAK'd.
>
> Well as far as I can see Christoph rejects the complexity coming with
> the approach of sometimes grabbing the reference and sometimes not.
> And I have to agree that this is extremely odd.
>
> What the KVM code should do instead is to completely drop grabbing
> references to struct pages, no matter what the VMA flags are.
>
> Regards,
> Christian.

2024-03-13 15:48:13

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 15:48 schrieb Sean Christopherson:
> > On Wed, Mar 13, 2024, Christian König wrote:
> > > Am 13.03.24 um 14:34 schrieb Sean Christopherson:
> > > > What Christoph is objecting to is that, in this series, KVM is explicitly adding
> > > > support for mapping non-compound (huge)pages into KVM guests. David is arguing
> > > > that Christoph's objection to _KVM_ adding support is unfair, because the real
> > > > problem is that the kernel already maps such pages into host userspace. I.e. if
> > > > the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> > > > and propagate to KVM's stage-2 page tables.
> > > And I have to agree with Christoph that this doesn't make much sense. KVM
> > > should *never* map (huge) pages from VMAs marked with VM_PFNMAP into KVM
> > > guests in the first place.
> > >
> > > What it should do instead is to mirror the PFN from the host page tables
> > > into the guest page tables.
> > That's exactly what this series does. Christoph is objecting to KVM playing nice
> > with non-compound hugepages, as he feels that such mappings should not exist
> > *anywhere*.
>
> Well Christoph is right those mappings shouldn't exists and they also don't
> exists.
>
> What happens here is that a driver has allocated some contiguous memory to
> do DMA with. And then some page table is pointing to a PFN inside that
> memory because userspace needs to provide parameters for the DMA transfer.
>
> This is *not* a mapping of a non-compound hugepage, it's simply a PTE
> pointing to some PFN.

Yes, I know. And David knows. By "such mappings" I did not mean "huge PMD mappings
that point at non-compound pages", I meant "any mapping in the host userspace
VMAs and page tables that points at memory that is backed by a larger-than-order-0,
non-compound allocation".

And even then, the whole larger-than-order-0 mapping is not something we on the
KVM side care about, at all. The _only_ new thing KVM is trying to do in this
series is to allow mapping non-refcounted struct page memory into KVM guest.
Those details were brought up purely because they provide context on how/why such
non-refcounted pages exist.

> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
> memory the driver has allocate.
>
> > I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
> > we should instead fix the TTM allocations. And David pointed out that that was
> > tried and got NAK'd.
>
> Well as far as I can see Christoph rejects the complexity coming with the
> approach of sometimes grabbing the reference and sometimes not.

Unless I've wildly misread multiple threads, that is not Christoph's objection.
From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%[email protected]):

On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> > non-reference-counted pages"). I don't think it makes any sense whatsoever to
> > remove that code and assume every driver in existence will do the right thing.
>
> Agreed.
>
> >
> > With the cleanups done, playing nice with non-refcounted paged instead of outright
> > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> > maintenance cost.
>
> I tend to strongly disagree with that, though. We can't just let these
> non-refcounted pages spread everywhere and instead need to fix their
> usage.

> And I have to agree that this is extremely odd.

Yes, it's odd and not ideal. But with nested virtualization, KVM _must_ "map"
pfns directly into the guest via fields in the control structures that are
consumed by hardware. I.e. pfns are exposed to the guest in an "out-of-band"
structure that is NOT part of the stage-2 page tables. And wiring those up to
the MMU notifiers is extremely difficult for a variety of reasons[*].

Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
to grab a reference to the struct page while the out-of-band mapping exists, i.e.
to pin the page to prevent use-after-free. And KVM's historical ABI is to support
any refcounted page for these out-of-band mappings, regardless of whether the
page was obtained by gup() or follow_pte().

Thus, to support non-refouncted VM_PFNMAP pages without breaking existing userspace,
KVM resorts to conditionally grabbing references and disllowing non-refcounted
pages from being inserted into the out-of-band mappings.

But again, I don't think these details are relevant to Christoph's objection.

[*] https://lore.kernel.org/all/[email protected]

2024-03-13 18:42:16

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
> > [SNIP]
> > > It can trivially be that userspace only maps 4KiB of some 2MiB piece of
> > > memory the driver has allocate.
> > >
> > > > I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
> > > > we should instead fix the TTM allocations. And David pointed out that that was
> > > > tried and got NAK'd.
> > > Well as far as I can see Christoph rejects the complexity coming with the
> > > approach of sometimes grabbing the reference and sometimes not.
> > Unless I've wildly misread multiple threads, that is not Christoph's objection.
> > From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%[email protected]):
> >
> > On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<[email protected]> wrote:
> > >
> > > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> > > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> > > > non-reference-counted pages"). I don't think it makes any sense whatsoever to
> > > > remove that code and assume every driver in existence will do the right thing.
> > >
> > > Agreed.
> > >
> > > >
> > > > With the cleanups done, playing nice with non-refcounted paged instead of outright
> > > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> > > > maintenance cost.
> > >
> > > I tend to strongly disagree with that, though. We can't just let these
> > > non-refcounted pages spread everywhere and instead need to fix their
> > > usage.
>
> And I can only repeat myself that I completely agree with Christoph here.

I am so confused. If you agree with Christoph, why not fix the TTM allocations?

> > > And I have to agree that this is extremely odd.
> > Yes, it's odd and not ideal. But with nested virtualization, KVM _must_ "map"
> > pfns directly into the guest via fields in the control structures that are
> > consumed by hardware. I.e. pfns are exposed to the guest in an "out-of-band"
> > structure that is NOT part of the stage-2 page tables. And wiring those up to
> > the MMU notifiers is extremely difficult for a variety of reasons[*].
> >
> > Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
> > to grab a reference to the struct page while the out-of-band mapping exists, i.e.
> > to pin the page to prevent use-after-free.
>
> Wait a second, as far as I know this approach doesn't work correctly in all
> cases. See here as well: https://lwn.net/Articles/930667/
>
> The MMU notifier is not only to make sure that pages don't go away and
> prevent use after free, but also that PTE updates correctly wait for ongoing
> DMA transfers.
>
> Otherwise quite a bunch of other semantics doesn't work correctly either.
>
> > And KVM's historical ABI is to support
> > any refcounted page for these out-of-band mappings, regardless of whether the
> > page was obtained by gup() or follow_pte().
>
> Well see the discussion from last year the LWN article summarizes.

Oof. I suspect that in practice, no one has observed issues because the pages
in question are heavily used by the guest and don't get evicted from the page cache.

> I'm not an expert for KVM but as far as I can see what you guys do here is
> illegal and can lead to corruption.
>
> Basically you can't create a second channel to modify page content like
> that.

Well, KVM can, it just needs to honor mmu_notifier events in order to be 100%
robust.

That said, while this might be motivation to revisit tying the out-of-band mappings
to the mmu_notifier, it has no bearing on the outcome of Christoph's objection.
Because (a) AIUI, Christoph is objecting to mapping non-refcounted struct page
memory *anywhere*, and (b) in this series, KVM will explicitly disallow non-
refcounted pages from being mapped in the out-of-band structures (see below).

> > Thus, to support non-refouncted VM_PFNMAP pages without breaking existing userspace,
> > KVM resorts to conditionally grabbing references and disllowing non-refcounted
> > pages from being inserted into the out-of-band mappings.
>
> Why not only refcount the pages when out of band mappings are requested?

That's also what this series effectively does. By default, KVM will disallow
inserting *any* non-refcounted memory into the out-of-band mappings.

"By default" because there are use cases where the guest memory pool is hidden
from the kernel at boot, and is fully managed by userspace. I.e. userspace is
effectively responsible/trusted to not free/change the mappings for an active
guest.

2024-03-14 09:20:44

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

Sending that out once more since the AMD email servers have converted it
to HTML mail once more :(

Grrr,
Christian.

Am 14.03.24 um 10:18 schrieb Christian König:
> Am 13.03.24 um 18:26 schrieb Sean Christopherson:
>> On Wed, Mar 13, 2024, Christian König wrote:
>>> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
>>>> [SNIP]
>>>>> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
>>>>> memory the driver has allocate.
>>>>>
>>>>>> I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
>>>>>> we should instead fix the TTM allocations. And David pointed out that that was
>>>>>> tried and got NAK'd.
>>>>> Well as far as I can see Christoph rejects the complexity coming with the
>>>>> approach of sometimes grabbing the reference and sometimes not.
>>>> Unless I've wildly misread multiple threads, that is not Christoph's objection.
>>>> From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%[email protected]):
>>>>
>>>> On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<[email protected]> wrote:
>>>> >
>>>> > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
>>>> > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
>>>> > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
>>>> > > non-reference-counted pages"). I don't think it makes any sense whatsoever to
>>>> > > remove that code and assume every driver in existence will do the right thing.
>>>> >
>>>> > Agreed.
>>>> >
>>>> > >
>>>> > > With the cleanups done, playing nice with non-refcounted paged instead of outright
>>>> > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
>>>> > > maintenance cost.
>>>> >
>>>> > I tend to strongly disagree with that, though. We can't just let these
>>>> > non-refcounted pages spread everywhere and instead need to fix their
>>>> > usage.
>>> And I can only repeat myself that I completely agree with Christoph here.
>> I am so confused. If you agree with Christoph, why not fix the TTM allocations?
>
> Because the TTM allocation isn't broken in any way.
>
> See in some configurations TTM even uses the DMA API for those
> allocations and that is actually something Christoph coded.
>
> What Christoph is really pointing out is that absolutely nobody should
> put non-refcounted pages into a VMA, but again this isn't something
> TTM does. What TTM does instead is to work with the PFN and puts that
> into a VMA.
>
> It's just that then KVM comes along and converts the PFN back into a
> struct page again and that is essentially what causes all the
> problems, including CVE-2021-22543.
>
>>>>> And I have to agree that this is extremely odd.
>>>> Yes, it's odd and not ideal. But with nested virtualization, KVM _must_ "map"
>>>> pfns directly into the guest via fields in the control structures that are
>>>> consumed by hardware. I.e. pfns are exposed to the guest in an "out-of-band"
>>>> structure that is NOT part of the stage-2 page tables. And wiring those up to
>>>> the MMU notifiers is extremely difficult for a variety of reasons[*].
>>>>
>>>> Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
>>>> to grab a reference to the struct page while the out-of-band mapping exists, i.e.
>>>> to pin the page to prevent use-after-free.
>>> Wait a second, as far as I know this approach doesn't work correctly in all
>>> cases. See here as well:https://lwn.net/Articles/930667/
>>>
>>> The MMU notifier is not only to make sure that pages don't go away and
>>> prevent use after free, but also that PTE updates correctly wait for ongoing
>>> DMA transfers.
>>>
>>> Otherwise quite a bunch of other semantics doesn't work correctly either.
>>>
>>>> And KVM's historical ABI is to support
>>>> any refcounted page for these out-of-band mappings, regardless of whether the
>>>> page was obtained by gup() or follow_pte().
>>> Well see the discussion from last year the LWN article summarizes.
>> Oof. I suspect that in practice, no one has observed issues because the pages
>> in question are heavily used by the guest and don't get evicted from the page cache.
>
> Well in this case I strongly suggest to block the problematic cases.
>
> It just sounds like KVM never run into problems because nobody is
> doing any of those problematic cases. If that's true it should be
> doable to change the UAPI and say this specific combination is
> forbidden because it could result in data corruption.
>
>>> I'm not an expert for KVM but as far as I can see what you guys do here is
>>> illegal and can lead to corruption.
>>>
>>> Basically you can't create a second channel to modify page content like
>>> that.
>> Well, KVM can, it just needs to honor mmu_notifier events in order to be 100%
>> robust.
>
> Yeah, completely agree.
>
>> That said, while this might be motivation to revisit tying the out-of-band mappings
>> to the mmu_notifier, it has no bearing on the outcome of Christoph's objection.
>> Because (a) AIUI, Christoph is objecting to mapping non-refcounted struct page
>> memory *anywhere*, and (b) in this series, KVM will explicitly disallow non-
>> refcounted pages from being mapped in the out-of-band structures (see below).
>>
>>>> Thus, to support non-refouncted VM_PFNMAP pages without breaking existing userspace,
>>>> KVM resorts to conditionally grabbing references and disllowing non-refcounted
>>>> pages from being inserted into the out-of-band mappings.
>>> Why not only refcount the pages when out of band mappings are requested?
>> That's also what this series effectively does. By default, KVM will disallow
>> inserting *any* non-refcounted memory into the out-of-band mappings.
>
> Ok than that's basically my inconvenient. I can't really criticize the
> KVM patch because I don't really understand all the details.
>
> I'm only pointing out from a very high level view how memory
> management works and that I see some conflict with what KVM does.
>
> As far as I can tell the cleanest approach for KVM would be to have
> two completely separate handlings:
>
> 1. Using GUP to handle the out-of-band mappings, this one grabs
> references and honors all the GUP restrictions with the appropriate
> flags. When VM_PFNMAP is set then those mappings will be rejected.
> That should also avoid any trouble with file backed mappings.
>
> 2. Using follow_pte() plus an MMU notifier and this one can even
> handle VMAs with the VM_PFNMAP and VM_IO flag set.
>
>> "By default" because there are use cases where the guest memory pool is hidden
>> from the kernel at boot, and is fully managed by userspace. I.e. userspace is
>> effectively responsible/trusted to not free/change the mappings for an active
>> guest.
>
> Wow, could that potentially crash the kernel?
>
> Cheers,
> Christian.

2024-03-14 11:32:09

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Thu, Mar 14, 2024 at 6:20 PM Christian König
<[email protected]> wrote:
>
> Sending that out once more since the AMD email servers have converted it
> to HTML mail once more :(
>
> Grrr,
> Christian.
>
> Am 14.03.24 um 10:18 schrieb Christian König:
> > Am 13.03.24 um 18:26 schrieb Sean Christopherson:
> >> On Wed, Mar 13, 2024, Christian König wrote:
> >>> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
> >>>> [SNIP]
> >>>>> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
> >>>>> memory the driver has allocate.
> >>>>>
> >>>>>> I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
> >>>>>> we should instead fix the TTM allocations. And David pointed out that that was
> >>>>>> tried and got NAK'd.
> >>>>> Well as far as I can see Christoph rejects the complexity coming with the
> >>>>> approach of sometimes grabbing the reference and sometimes not.
> >>>> Unless I've wildly misread multiple threads, that is not Christoph's objection.
> >>>> From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%[email protected]):
> >>>>
> >>>> On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<[email protected]> wrote:
> >>>> >
> >>>> > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> >>>> > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> >>>> > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> >>>> > > non-reference-counted pages"). I don't think it makes any sense whatsoever to
> >>>> > > remove that code and assume every driver in existence will do the right thing.
> >>>> >
> >>>> > Agreed.
> >>>> >
> >>>> > >
> >>>> > > With the cleanups done, playing nice with non-refcounted paged instead of outright
> >>>> > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> >>>> > > maintenance cost.
> >>>> >
> >>>> > I tend to strongly disagree with that, though. We can't just let these
> >>>> > non-refcounted pages spread everywhere and instead need to fix their
> >>>> > usage.
> >>> And I can only repeat myself that I completely agree with Christoph here.
> >> I am so confused. If you agree with Christoph, why not fix the TTM allocations?
> >
> > Because the TTM allocation isn't broken in any way.
> >
> > See in some configurations TTM even uses the DMA API for those
> > allocations and that is actually something Christoph coded.
> >
> > What Christoph is really pointing out is that absolutely nobody should
> > put non-refcounted pages into a VMA, but again this isn't something
> > TTM does. What TTM does instead is to work with the PFN and puts that
> > into a VMA.
> >
> > It's just that then KVM comes along and converts the PFN back into a
> > struct page again and that is essentially what causes all the
> > problems, including CVE-2021-22543.

Does Christoph's objection come from my poorly worded cover letter and
commit messages, then? Fundamentally, what this series is doing is
allowing pfns returned by follow_pte to be mapped into KVM's shadow
MMU without inadvertently translating them into struct pages. If I'm
understand this discussion correctly, since KVM's shadow MMU is hooked
up to MMU notifiers, this shouldn't be controversial. However, my
cover letter got a little confused because KVM is currently doing
something that it sounds like it shouldn't - translating pfns returned
by follow_pte into struct pages with kvm_try_get_pfn. Because of that,
the specific type of pfns that don't work right now are pfn_valid() &&
!PG_Reserved && !page_ref_count() - what I called the non-refcounted
pages in a bad choice of words. If that's correct, then perhaps this
series should go a little bit further in modifying
hva_to_pfn_remapped, but it isn't fundamentally wrong.

-David

2024-03-14 11:52:15

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

Am 14.03.24 um 12:31 schrieb David Stevens:
> On Thu, Mar 14, 2024 at 6:20 PM Christian König
> <[email protected]> wrote:
>> Sending that out once more since the AMD email servers have converted it
>> to HTML mail once more :(
>>
>> Grrr,
>> Christian.
>>
>> Am 14.03.24 um 10:18 schrieb Christian König:
>>> Am 13.03.24 um 18:26 schrieb Sean Christopherson:
>>>> On Wed, Mar 13, 2024, Christian König wrote:
>>>>> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
>>>>>> [SNIP]
>>>>>>> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
>>>>>>> memory the driver has allocate.
>>>>>>>
>>>>>>>> I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
>>>>>>>> we should instead fix the TTM allocations. And David pointed out that that was
>>>>>>>> tried and got NAK'd.
>>>>>>> Well as far as I can see Christoph rejects the complexity coming with the
>>>>>>> approach of sometimes grabbing the reference and sometimes not.
>>>>>> Unless I've wildly misread multiple threads, that is not Christoph's objection.
>>>>>> From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%[email protected]):
>>>>>>
>>>>>> On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<[email protected]> wrote:
>>>>>> >
>>>>>> > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
>>>>>> > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
>>>>>> > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
>>>>>> > > non-reference-counted pages"). I don't think it makes any sense whatsoever to
>>>>>> > > remove that code and assume every driver in existence will do the right thing.
>>>>>> >
>>>>>> > Agreed.
>>>>>> >
>>>>>> > >
>>>>>> > > With the cleanups done, playing nice with non-refcounted paged instead of outright
>>>>>> > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
>>>>>> > > maintenance cost.
>>>>>> >
>>>>>> > I tend to strongly disagree with that, though. We can't just let these
>>>>>> > non-refcounted pages spread everywhere and instead need to fix their
>>>>>> > usage.
>>>>> And I can only repeat myself that I completely agree with Christoph here.
>>>> I am so confused. If you agree with Christoph, why not fix the TTM allocations?
>>> Because the TTM allocation isn't broken in any way.
>>>
>>> See in some configurations TTM even uses the DMA API for those
>>> allocations and that is actually something Christoph coded.
>>>
>>> What Christoph is really pointing out is that absolutely nobody should
>>> put non-refcounted pages into a VMA, but again this isn't something
>>> TTM does. What TTM does instead is to work with the PFN and puts that
>>> into a VMA.
>>>
>>> It's just that then KVM comes along and converts the PFN back into a
>>> struct page again and that is essentially what causes all the
>>> problems, including CVE-2021-22543.
> Does Christoph's objection come from my poorly worded cover letter and
> commit messages, then?

Yes, that could certainly be.

> Fundamentally, what this series is doing is
> allowing pfns returned by follow_pte to be mapped into KVM's shadow
> MMU without inadvertently translating them into struct pages.

As far as I can tell that is really the right thing to do. Yes.

> If I'm understand this discussion correctly, since KVM's shadow MMU is hooked
> up to MMU notifiers, this shouldn't be controversial. However, my
> cover letter got a little confused because KVM is currently doing
> something that it sounds like it shouldn't - translating pfns returned
> by follow_pte into struct pages with kvm_try_get_pfn. Because of that,
> the specific type of pfns that don't work right now are pfn_valid() &&
> !PG_Reserved && !page_ref_count() - what I called the non-refcounted
> pages in a bad choice of words. If that's correct, then perhaps this
> series should go a little bit further in modifying
> hva_to_pfn_remapped, but it isn't fundamentally wrong.

Completely agree. In my thinking when you go a step further and offload
grabbing the page reference to get_user_pages() then you are always on
the save side.

Because then it is no longer the responsibility of the KVM code to get
all the rules around that right, instead you are relying on a core
functionality which should (at least in theory) do the correct thing.

Regards,
Christian.

>
> -David

2024-03-14 16:17:37

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

-Christ{oph,ian} to avoid creating more noise...

On Thu, Mar 14, 2024, David Stevens wrote:
> Because of that, the specific type of pfns that don't work right now are
> pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> non-refcounted pages in a bad choice of words. If that's correct, then
> perhaps this series should go a little bit further in modifying
> hva_to_pfn_remapped, but it isn't fundamentally wrong.

Loosely related to all of this, I have a mildly ambitious idea. Well, one mildly
ambitious idea, and one crazy ambitious idea. Crazy ambitious idea first...

Something we (GCE side of Google) have been eyeballing is adding support for huge
VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
into guests using hugepages. One of the hiccups is that follow_pte() doesn't play
nice with hugepages, at all, e.g. even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
Teaching follow_pte() to play nice with hugepage probably is doing, but making
sure all existing users are aware, maybe not so much.

My first (half baked, crazy ambitious) idea is to move away from follow_pte() and
get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't need
to grab references, and replace them with a new converged API that locklessly walks
host userspace page tables, and grabs the hugepage size along the way, e.g. so that
arch code wouldn't have to do a second walk of the page tables just to get the
hugepage size.

In other words, for the common case (mmu_notifier integration, no reference needed),
route hva_to_pfn_fast() into the new API and walk the userspace page tables (probably
only for write faults, to avoid CoW compliciations) before doing anything else.

Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
converted, e.g. when stuffing physical addresses into the VMCS for nested virtualization.
But for everything else, grabbing a reference is a non-goal, i.e. actually "getting"
a user page is wasted effort and actively gets in the way.

I was initially hoping we could go super simple and use something like x86's
host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
be respected, e.g. to avoid mapping memfd_secret pages into KVM guests. I.e. the
API would need to be a formal mm-owned thing, not some homebrewed KVM implementation.

I can't tell if the payoff would be big enough to justify the effort involved, i.e.
having a single unified API for grabbing PFNs from the primary MMU might just be a
pie-in-the-sky type idea.

My second, less ambitious idea: the previously linked LWN[*] article about the
writeback issues reminded me of something that has bugged me for a long time. IIUC,
getting a writable mapping from the primary MMU marks the page/folio dirty, and that
page/folio stays dirty until the data is written back and the mapping is made read-only.
And because KVM is tapped into the mmu_notifiers, KVM will be notified *before* the
RW=>RO conversion completes, i.e. before the page/folio is marked clean.

I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
dropping any mmu_notifier-aware mapping) is completely unnecessary. If that is the
case, _and_ we can weasel our way out of calling kvm_set_page_accessed() too, then
with FOLL_GET plumbed into hva_to_pfn(), we can:

- Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of hva_to_pfn()
that aren't tied into mmu_notifiers, i.e. aren't guaranteed to drop mappings
before the page/folio is cleaned, will *know* that they hold a refcounted struct
page.

- Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely, because
KVM never needs to know if a SPTE points at a refcounted page.

In other words, double down on immediately doing put_page() after gup() if FOLL_GET
isn't specified, and naturally make all KVM MMUs compatible with pfn_valid() PFNs
that are acquired by follow_pte().

I suspect we can simply mark pages as access when a page is retrieved from the primary
MMU, as marking a page accessed when its *removed* from the guest is rather nonsensical.
E.g. if a page is mapped into the guest for a long time and it gets swapped out, marking
the page accessed when KVM drops its SPTEs in response to the swap adds no value. And
through the mmu_notifiers, KVM already plays nice with setups that use idle page
tracking to make reclaim decisions.

[*] https://lwn.net/Articles/930667

2024-03-14 17:19:51

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

+Alex, who is looking at the huge-VM_PFNMAP angle in particular.

On Thu, Mar 14, 2024, Sean Christopherson wrote:
> -Christ{oph,ian} to avoid creating more noise...
>
> On Thu, Mar 14, 2024, David Stevens wrote:
> > Because of that, the specific type of pfns that don't work right now are
> > pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> > non-refcounted pages in a bad choice of words. If that's correct, then
> > perhaps this series should go a little bit further in modifying
> > hva_to_pfn_remapped, but it isn't fundamentally wrong.
>
> Loosely related to all of this, I have a mildly ambitious idea. Well, one mildly
> ambitious idea, and one crazy ambitious idea. Crazy ambitious idea first...
>
> Something we (GCE side of Google) have been eyeballing is adding support for huge
> VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
> into guests using hugepages. One of the hiccups is that follow_pte() doesn't play
> nice with hugepages, at all, e.g. even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
> Teaching follow_pte() to play nice with hugepage probably is doing, but making
> sure all existing users are aware, maybe not so much.
>
> My first (half baked, crazy ambitious) idea is to move away from follow_pte() and
> get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't need
> to grab references, and replace them with a new converged API that locklessly walks
> host userspace page tables, and grabs the hugepage size along the way, e.g. so that
> arch code wouldn't have to do a second walk of the page tables just to get the
> hugepage size.
>
> In other words, for the common case (mmu_notifier integration, no reference needed),
> route hva_to_pfn_fast() into the new API and walk the userspace page tables (probably
> only for write faults, to avoid CoW compliciations) before doing anything else.
>
> Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
> converted, e.g. when stuffing physical addresses into the VMCS for nested virtualization.
> But for everything else, grabbing a reference is a non-goal, i.e. actually "getting"
> a user page is wasted effort and actively gets in the way.
>
> I was initially hoping we could go super simple and use something like x86's
> host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
> be respected, e.g. to avoid mapping memfd_secret pages into KVM guests. I.e. the
> API would need to be a formal mm-owned thing, not some homebrewed KVM implementation.
>
> I can't tell if the payoff would be big enough to justify the effort involved, i.e.
> having a single unified API for grabbing PFNs from the primary MMU might just be a
> pie-in-the-sky type idea.
>
> My second, less ambitious idea: the previously linked LWN[*] article about the
> writeback issues reminded me of something that has bugged me for a long time. IIUC,
> getting a writable mapping from the primary MMU marks the page/folio dirty, and that
> page/folio stays dirty until the data is written back and the mapping is made read-only.
> And because KVM is tapped into the mmu_notifiers, KVM will be notified *before* the
> RW=>RO conversion completes, i.e. before the page/folio is marked clean.
>
> I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
> dropping any mmu_notifier-aware mapping) is completely unnecessary. If that is the
> case, _and_ we can weasel our way out of calling kvm_set_page_accessed() too, then
> with FOLL_GET plumbed into hva_to_pfn(), we can:
>
> - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of hva_to_pfn()
> that aren't tied into mmu_notifiers, i.e. aren't guaranteed to drop mappings
> before the page/folio is cleaned, will *know* that they hold a refcounted struct
> page.
>
> - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely, because
> KVM never needs to know if a SPTE points at a refcounted page.
>
> In other words, double down on immediately doing put_page() after gup() if FOLL_GET
> isn't specified, and naturally make all KVM MMUs compatible with pfn_valid() PFNs
> that are acquired by follow_pte().
>
> I suspect we can simply mark pages as access when a page is retrieved from the primary
> MMU, as marking a page accessed when its *removed* from the guest is rather nonsensical.
> E.g. if a page is mapped into the guest for a long time and it gets swapped out, marking
> the page accessed when KVM drops its SPTEs in response to the swap adds no value. And
> through the mmu_notifiers, KVM already plays nice with setups that use idle page
> tracking to make reclaim decisions.
>
> [*] https://lwn.net/Articles/930667

2024-03-14 14:45:55

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Thu, Mar 14, 2024, Christian König wrote:
> Am 14.03.24 um 12:31 schrieb David Stevens:
> > On Thu, Mar 14, 2024 at 6:20 PM Christian König <[email protected]> wrote:
> > > > > > > > Well as far as I can see Christoph rejects the complexity coming with the
> > > > > > > > approach of sometimes grabbing the reference and sometimes not.
> > > > > > > Unless I've wildly misread multiple threads, that is not Christoph's objection.
> > > > > > > From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%[email protected]):
> > > > > > >
> > > > > > > On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > > > > > > > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> > > > > > > > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> > > > > > > > > non-reference-counted pages"). I don't think it makes any sense whatsoever to
> > > > > > > > > remove that code and assume every driver in existence will do the right thing.
> > > > > > > >
> > > > > > > > Agreed.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > With the cleanups done, playing nice with non-refcounted paged instead of outright
> > > > > > > > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> > > > > > > > > maintenance cost.
> > > > > > > >
> > > > > > > > I tend to strongly disagree with that, though. We can't just let these
> > > > > > > > non-refcounted pages spread everywhere and instead need to fix their
> > > > > > > > usage.
> > > > > > And I can only repeat myself that I completely agree with Christoph here.
> > > > > I am so confused. If you agree with Christoph, why not fix the TTM allocations?
> > > > Because the TTM allocation isn't broken in any way.
> > > >
> > > > See in some configurations TTM even uses the DMA API for those
> > > > allocations and that is actually something Christoph coded.
> > > >
> > > > What Christoph is really pointing out is that absolutely nobody should
> > > > put non-refcounted pages into a VMA, but again this isn't something
> > > > TTM does. What TTM does instead is to work with the PFN and puts that
> > > > into a VMA.
> > > >
> > > > It's just that then KVM comes along and converts the PFN back into a
> > > > struct page again and that is essentially what causes all the
> > > > problems, including CVE-2021-22543.
> > Does Christoph's objection come from my poorly worded cover letter and
> > commit messages, then?
>
> Yes, that could certainly be.
>
> > Fundamentally, what this series is doing is
> > allowing pfns returned by follow_pte to be mapped into KVM's shadow
> > MMU without inadvertently translating them into struct pages.
>
> As far as I can tell that is really the right thing to do. Yes.

Christoph,

Can you please confirm that you don't object to KVM using follow_pte() to get
PFNs which happen to have an associated struct page? We've gone in enough circles...

2024-03-15 17:59:25

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Thu, Mar 14, 2024, Sean Christopherson wrote:
> +Alex, who is looking at the huge-VM_PFNMAP angle in particular.

Oof, *Axel*. Sorry Axel.

> On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > -Christ{oph,ian} to avoid creating more noise...
> >
> > On Thu, Mar 14, 2024, David Stevens wrote:
> > > Because of that, the specific type of pfns that don't work right now are
> > > pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> > > non-refcounted pages in a bad choice of words. If that's correct, then
> > > perhaps this series should go a little bit further in modifying
> > > hva_to_pfn_remapped, but it isn't fundamentally wrong.
> >
> > Loosely related to all of this, I have a mildly ambitious idea. Well, one mildly
> > ambitious idea, and one crazy ambitious idea. Crazy ambitious idea first...
> >
> > Something we (GCE side of Google) have been eyeballing is adding support for huge
> > VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
> > into guests using hugepages. One of the hiccups is that follow_pte() doesn't play
> > nice with hugepages, at all, e.g. even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
> > Teaching follow_pte() to play nice with hugepage probably is doing, but making
> > sure all existing users are aware, maybe not so much.
> >
> > My first (half baked, crazy ambitious) idea is to move away from follow_pte() and
> > get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't need
> > to grab references, and replace them with a new converged API that locklessly walks
> > host userspace page tables, and grabs the hugepage size along the way, e.g. so that
> > arch code wouldn't have to do a second walk of the page tables just to get the
> > hugepage size.
> >
> > In other words, for the common case (mmu_notifier integration, no reference needed),
> > route hva_to_pfn_fast() into the new API and walk the userspace page tables (probably
> > only for write faults, to avoid CoW compliciations) before doing anything else.
> >
> > Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
> > converted, e.g. when stuffing physical addresses into the VMCS for nested virtualization.
> > But for everything else, grabbing a reference is a non-goal, i.e. actually "getting"
> > a user page is wasted effort and actively gets in the way.
> >
> > I was initially hoping we could go super simple and use something like x86's
> > host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
> > be respected, e.g. to avoid mapping memfd_secret pages into KVM guests. I.e. the
> > API would need to be a formal mm-owned thing, not some homebrewed KVM implementation.
> >
> > I can't tell if the payoff would be big enough to justify the effort involved, i.e.
> > having a single unified API for grabbing PFNs from the primary MMU might just be a
> > pie-in-the-sky type idea.
> >
> > My second, less ambitious idea: the previously linked LWN[*] article about the
> > writeback issues reminded me of something that has bugged me for a long time. IIUC,
> > getting a writable mapping from the primary MMU marks the page/folio dirty, and that
> > page/folio stays dirty until the data is written back and the mapping is made read-only.
> > And because KVM is tapped into the mmu_notifiers, KVM will be notified *before* the
> > RW=>RO conversion completes, i.e. before the page/folio is marked clean.
> >
> > I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
> > dropping any mmu_notifier-aware mapping) is completely unnecessary. If that is the
> > case, _and_ we can weasel our way out of calling kvm_set_page_accessed() too, then
> > with FOLL_GET plumbed into hva_to_pfn(), we can:
> >
> > - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of hva_to_pfn()
> > that aren't tied into mmu_notifiers, i.e. aren't guaranteed to drop mappings
> > before the page/folio is cleaned, will *know* that they hold a refcounted struct
> > page.
> >
> > - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely, because
> > KVM never needs to know if a SPTE points at a refcounted page.
> >
> > In other words, double down on immediately doing put_page() after gup() if FOLL_GET
> > isn't specified, and naturally make all KVM MMUs compatible with pfn_valid() PFNs
> > that are acquired by follow_pte().
> >
> > I suspect we can simply mark pages as access when a page is retrieved from the primary
> > MMU, as marking a page accessed when its *removed* from the guest is rather nonsensical.
> > E.g. if a page is mapped into the guest for a long time and it gets swapped out, marking
> > the page accessed when KVM drops its SPTEs in response to the swap adds no value. And
> > through the mmu_notifiers, KVM already plays nice with setups that use idle page
> > tracking to make reclaim decisions.
> >
> > [*] https://lwn.net/Articles/930667

2024-03-18 05:42:07

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Thu, Mar 14, 2024 at 12:51:40PM +0100, Christian K?nig wrote:
> > Does Christoph's objection come from my poorly worded cover letter and
> > commit messages, then?
>
> Yes, that could certainly be.

That's definitively a big part of it, but I think not the only one.

> > Fundamentally, what this series is doing is
> > allowing pfns returned by follow_pte to be mapped into KVM's shadow
> > MMU without inadvertently translating them into struct pages.
>
> As far as I can tell that is really the right thing to do. Yes.

IFF your callers don't need pages and you just want to track the
mapping in the shadow mmu and never take a refcount that is a good
thing.

But unless I completely misunderstood the series that doesn't seem
to be the case - it builds a new kvm_follow_pfn API which is another
of these weird multiplexers like get_user_pages that can to tons of
different things depending on the flags. And some of that still
grabs the refcount, right?

> Completely agree. In my thinking when you go a step further and offload
> grabbing the page reference to get_user_pages() then you are always on the
> save side.

Agreed.

> Because then it is no longer the responsibility of the KVM code to get all
> the rules around that right, instead you are relying on a core functionality
> which should (at least in theory) do the correct thing.

Exactly.

2024-03-18 13:11:25

by Paolo Bonzini

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Mon, Mar 18, 2024 at 3:07 AM Christoph Hellwig <[email protected]> wrote:
> > > Fundamentally, what this series is doing is
> > > allowing pfns returned by follow_pte to be mapped into KVM's shadow
> > > MMU without inadvertently translating them into struct pages.
> >
> > As far as I can tell that is really the right thing to do. Yes.
>
> IFF your callers don't need pages and you just want to track the
> mapping in the shadow mmu and never take a refcount that is a good
> thing.

Yes, that's the case and for everything else we can use a function
that goes from guest physical address to struct page with a reference
taken, similar to the current gfn_to_page family of functions.

> But unless I completely misunderstood the series that doesn't seem
> to be the case - it builds a new kvm_follow_pfn API which is another
> of these weird multiplexers like get_user_pages that can to tons of
> different things depending on the flags. And some of that still
> grabs the refcount, right?

Yes, for a couple reasons. First, a lot of the lookup logic is shared
by the two cases; second, it's easier for both developers and
reviewers if you first convert to the new API, and remove the refcount
in a separate commit. Also, you have to do this for every architecture
since we agree that this is the direction that all of them should move
to.

So what we could do, would be to start with something like David's
patches, and move towards forbidding the FOLL_GET flag (the case that
returns the page with elevated refcount) in the new kvm_follow_pfn()
API.

Another possibility is to have a double-underscore version that allows
FOLL_GET, and have the "clean" kvm_follow_pfn() forbid it. So you
would still have the possibility to convert to __kvm_follow_pfn() with
FOLL_GET first, and then when you remove the refcount you switch to
kvm_follow_pfn().

Paolo

> > Completely agree. In my thinking when you go a step further and offload
> > grabbing the page reference to get_user_pages() then you are always on the
> > save side.
>
> Agreed.
>
> > Because then it is no longer the responsibility of the KVM code to get all
> > the rules around that right, instead you are relying on a core functionality
> > which should (at least in theory) do the correct thing.
>
> Exactly.
>

2024-03-19 06:27:45

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Mon, Mar 18, 2024 at 02:10:55PM +0100, Paolo Bonzini wrote:
> Another possibility is to have a double-underscore version that allows
> FOLL_GET, and have the "clean" kvm_follow_pfn() forbid it. So you
> would still have the possibility to convert to __kvm_follow_pfn() with
> FOLL_GET first, and then when you remove the refcount you switch to
> kvm_follow_pfn().

That does sound much better. Then again anything that actually wants
pages (either for a good reason or historic reasons) really should be
using get/pin_user_pages anyway and not use follow_pte.

2024-03-20 20:55:01

by Axel Rasmussen

[permalink] [raw]

Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

On Fri, Mar 15, 2024 at 10:59 AM Sean Christopherson <seanjc@googlecom> wrote:
>
> On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > +Alex, who is looking at the huge-VM_PFNMAP angle in particular.
>
> Oof, *Axel*. Sorry Axel.

No worries. Believe it or not this happens frequently. :) I'm well
past being bothered about it.

>
>
> > On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > > -Christ{oph,ian} to avoid creating more noise...
> > >
> > > On Thu, Mar 14, 2024, David Stevens wrote:
> > > > Because of that, the specific type of pfns that don't work right now are
> > > > pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> > > > non-refcounted pages in a bad choice of words. If that's correct, then
> > > > perhaps this series should go a little bit further in modifying
> > > > hva_to_pfn_remapped, but it isn't fundamentally wrong.
> > >
> > > Loosely related to all of this, I have a mildly ambitious idea. Well, one mildly
> > > ambitious idea, and one crazy ambitious idea. Crazy ambitious idea first...
> > >
> > > Something we (GCE side of Google) have been eyeballing is adding support for huge
> > > VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
> > > into guests using hugepages. One of the hiccups is that follow_pte() doesn't play
> > > nice with hugepages, at all, e.g. even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
> > > Teaching follow_pte() to play nice with hugepage probably is doing, but making
> > > sure all existing users are aware, maybe not so much.

Right. The really basic idea I had was, to modify remap_pfn_range so
it can, if possible (if it sees a (sub)range which is aligned + big
enough), it can just install a leaf pud or pmd instead of always going
down to the pte level.

follow_pte is problematic though, because it returns a pte
specifically so it's unclear to me how to have it "just work" for
existing callers with an area mapped this way. So I think the idea
would be to change follow_pte to detect this case and bail out
(-EINVAL I guess), and then add some new function which can properly
support these mappings.

>
> > >
> > > My first (half baked, crazy ambitious) idea is to move away from follow_pte() and
> > > get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't need
> > > to grab references, and replace them with a new converged API that locklessly walks
> > > host userspace page tables, and grabs the hugepage size along the way, e.g. so that
> > > arch code wouldn't have to do a second walk of the page tables just to get the
> > > hugepage size.
> > >
> > > In other words, for the common case (mmu_notifier integration, no reference needed),
> > > route hva_to_pfn_fast() into the new API and walk the userspace page tables (probably
> > > only for write faults, to avoid CoW compliciations) before doing anything else.
> > >
> > > Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
> > > converted, e.g. when stuffing physical addresses into the VMCS for nested virtualization.
> > > But for everything else, grabbing a reference is a non-goal, i.e. actually "getting"
> > > a user page is wasted effort and actively gets in the way.
> > >
> > > I was initially hoping we could go super simple and use something like x86's
> > > host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
> > > be respected, e.g. to avoid mapping memfd_secret pages into KVM guests. I.e. the
> > > API would need to be a formal mm-owned thing, not some homebrewed KVM implementation.
> > >
> > > I can't tell if the payoff would be big enough to justify the effort involved, i.e.
> > > having a single unified API for grabbing PFNs from the primary MMU might just be a
> > > pie-in-the-sky type idea.

Yeah, I have been thinking about this.

One thing is, KVM is not the only caller of follow_pte today. At least
naively it would be nice if any existing callers could benefit from
this new support, not just KVM.

Another thing I noticed is, most callers don't need much; they don't
need a reference to the page, they don't even really need the pte
itself. Most callers just want a ptl held, and they want to know the
pgprot flags or whether the pte is writable or not, and they want to
know the pfn for this address. IOW follow_pte is sort of overkill for
most callers.

KVM is a bit different, it does call GUP to get the page. One other
thing is, KVM at least on x86 also cares about the "level" of the
mapping, in host_pfn_mapping_level(). That code is all fairly x86
specific so I'm not sure how to generalize it. Also I haven't looked
closely at what's going on for other architectures.

I'm not sure yet where is the right place to end up, but I at least
think it's worth trying to build some general API under mm/ which
supports these various things.

In general I'm thinking of proceeding in two steps. First,
enlightening remap_pfn_range, updating follow_pte to return -EINVAL,
and then adding some new function which tries to be somewhat close to
a drop-in replacement for existing follow_pte callers. Once I get that
proof of concept working / tested, I think the next step is figuring
out how we can extend it a bit to build some more general solution
like Sean is describing here.

> > >
> > > My second, less ambitious idea: the previously linked LWN[*] article about the
> > > writeback issues reminded me of something that has bugged me for a long time. IIUC,
> > > getting a writable mapping from the primary MMU marks the page/folio dirty, and that
> > > page/folio stays dirty until the data is written back and the mapping is made read-only.
> > > And because KVM is tapped into the mmu_notifiers, KVM will be notified *before* the
> > > RW=>RO conversion completes, i.e. before the page/folio is marked clean.
> > >
> > > I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
> > > dropping any mmu_notifier-aware mapping) is completely unnecessary. If that is the
> > > case, _and_ we can weasel our way out of calling kvm_set_page_accessed() too, then
> > > with FOLL_GET plumbed into hva_to_pfn(), we can:
> > >
> > > - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of hva_to_pfn()
> > > that aren't tied into mmu_notifiers, i.e. aren't guaranteed to drop mappings
> > > before the page/folio is cleaned, will *know* that they hold a refcounted struct
> > > page.
> > >
> > > - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely, because
> > > KVM never needs to know if a SPTE points at a refcounted page.
> > >
> > > In other words, double down on immediately doing put_page() after gup() if FOLL_GET
> > > isn't specified, and naturally make all KVM MMUs compatible with pfn_valid() PFNs
> > > that are acquired by follow_pte().
> > >
> > > I suspect we can simply mark pages as access when a page is retrieved from the primary
> > > MMU, as marking a page accessed when its *removed* from the guest is rather nonsensical.
> > > E.g. if a page is mapped into the guest for a long time and it gets swapped out, marking
> > > the page accessed when KVM drops its SPTEs in response to the swap adds no value. And
> > > through the mmu_notifiers, KVM already plays nice with setups that use idle page
> > > tracking to make reclaim decisions.
> > >
> > > [*] https://lwn.net/Articles/930667

2024-04-04 16:34:20

by Dmitry Osipenko

[permalink] [raw]

Subject: Re: [PATCH v11 8/8] KVM: x86/mmu: Handle non-refcounted pages

Hi David,

On 2/29/24 05:57, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the
> host to map memory into the guest that is backed by non-refcounted
> struct pages - for example, the tail pages of higher order non-compound
> pages allocated by the amdgpu driver via ttm_pool_alloc_page.
>
> Signed-off-by: David Stevens <[email protected]>

This patch has a problem on v6.8 kernel. Pierre-Eric of AMD found that
Qemu crashes with "kvm bad address" error when booting Ubuntu 23.10 ISO
with a disabled virtio-gpu and I was able to reproduce it. Pierre-Eric
said this problem didn't exist with v6.7 kernel and using v10 kvm
patches. Could you please take a look at this issue?

To reproduce the bug, run Qemu like this and load the Ubuntu installer:

qemu-system-x86_64 -boot d -cdrom ubuntu-23.10.1-desktop-amd64.iso -m
4G --enable-kvm -display gtk -smp 1 -machine q35

Qemu fails with "error: kvm run failed Bad address"

On the host kernel there is this warning:

------------[ cut here ]------------
WARNING: CPU: 19 PID: 11696 at mm/gup.c:229 try_grab_page+0x64/0x100
Call Trace:
<TASK>
? try_grab_page+0x64/0x100
? __warn+0x81/0x130
? try_grab_page+0x64/0x100
? report_bug+0x171/0x1a0
? handle_bug+0x3c/0x80
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? try_grab_page+0x64/0x100
follow_page_pte+0xfa/0x4b0
__get_user_pages+0xe5/0x6e0
get_user_pages_unlocked+0xe7/0x370
hva_to_pfn+0xa2/0x760 [kvm]
? free_unref_page+0xf9/0x180
kvm_faultin_pfn+0x112/0x610 [kvm]
kvm_tdp_page_fault+0x104/0x150 [kvm]
kvm_mmu_page_fault+0x298/0x860 [kvm]
kvm_arch_vcpu_ioctl_run+0xc7d/0x16b0 [kvm]
? srso_alias_return_thunk+0x5/0xfbef5
? kvm_arch_vcpu_put+0x128/0x190 [kvm]
? srso_alias_return_thunk+0x5/0xfbef5
kvm_vcpu_ioctl+0x199/0x700 [kvm]
__x64_sys_ioctl+0x94/0xd0
do_syscall_64+0x86/0x170
? kvm_on_user_return+0x64/0x90 [kvm]
? srso_alias_return_thunk+0x5/0xfbef5
? fire_user_return_notifiers+0x37/0x70
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x80/0x230
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x96/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x96/0x170
? do_syscall_64+0x96/0x170
? do_syscall_64+0x96/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x96/0x170
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x6e/0x76
---[ end trace 0000000000000000 ]---

--
Best regards,
Dmitry

2024-04-15 07:29:14