2022-04-29 16:01:29

by Mingwei Zhang

Subject: [PATCH] KVM: x86/mmu: fix potential races when walking host page table

Implement a KVM function for walking the host page table on x86 and stop
using lookup_address_in_mm(). lookup_address_in_mm() is basically
lookup_address_in_pgd() in mm. This function suffers from several issues:

- No usage of READ_ONCE(*). This allows multiple dereferences of the same
page table entry. The resulting TOCTOU problem may cause the kernel to
mistake a newly generated leaf entry for a non-leaf one and dereference
its content by using its pfn value (see the sketch below).

- Incorrect information returned. lookup_address_in_mm() returns the pte_t
pointer and the level regardless of the 'presentness' of the entry, i.e.,
even if a PXE entry is 'non-present', as long as it is not 'none', the
function still returns its level. In contrast, KVM needs the level
information of 'present' entries only. This is a clear bug and may cause
KVM to incorrectly regard a non-present PXE entry as a present large page
mapping.

lookup_address_in_mm() and its related functions are generally helpful
only for walking kernel addresses, whose mappings are mostly static and
whose page tables are not torn down. Patching this function does not
help other callers, since its return value, a PTE pointer, is NEVER safe
to dereference after the function returns.
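
To make the READ_ONCE() point concrete, here is a minimal sketch of the racy
pattern versus a single-snapshot walk. The helpers are hypothetical, cover
only the PMD step, and are not code from mm/; the racy variant also shows
the 'presentness' problem, since a non-present (but non-none) entry is
handed back just like a leaf:

	/* Racy: dereferences the same PMD entry several times. */
	static pte_t *pmd_step_racy(pmd_t *pmdp, unsigned long address)
	{
		if (pmd_none(*pmdp))				/* 1st read */
			return NULL;
		if (pmd_large(*pmdp) || !pmd_present(*pmdp))	/* 2nd read */
			return (pte_t *)pmdp;
		/*
		 * 3rd read: if the entry was promoted to a huge page after
		 * the check above, this computes a "page table" pointer from
		 * a huge-page pfn.
		 */
		return pte_offset_kernel(pmdp, address);
	}

	/* Safer: take one snapshot and derive everything from it. */
	static pte_t *pmd_step_safe(pmd_t *pmdp, unsigned long address)
	{
		pmd_t pmd = READ_ONCE(*pmdp);

		if (pmd_none(pmd) || !pmd_present(pmd) || pmd_large(pmd))
			return NULL;
		return pte_offset_kernel(&pmd, address);
	}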

Cc: Sean Christopherson <[email protected]>
Cc: Paolo Bonzini <[email protected]>

Signed-off-by: Mingwei Zhang <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 +----
arch/x86/kvm/x86.c | 70 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.h | 2 ++
3 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 904f0faff2186..6db195e5eae94 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2822,8 +2822,6 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
const struct kvm_memory_slot *slot)
{
unsigned long hva;
- pte_t *pte;
- int level;

if (!PageCompound(pfn_to_page(pfn)) && !kvm_is_zone_device_pfn(pfn))
return PG_LEVEL_4K;
@@ -2838,11 +2836,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
*/
hva = __gfn_to_hva_memslot(slot, gfn);

- pte = lookup_address_in_mm(kvm->mm, hva, &level);
- if (unlikely(!pte))
- return PG_LEVEL_4K;
-
- return level;
+ return kvm_lookup_address_level_in_mm(kvm, hva);
}

int kvm_mmu_max_mapping_level(struct kvm *kvm,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 951d0a78ccdae..61406efe4ea7f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13044,6 +13044,76 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
}
EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);

+/*
+ * Lookup the valid mapping level for a virtual address in the current mm.
+ * Return the level of the mapping if there is a present one. Otherwise, always
+ * return PG_LEVEL_NONE.
+ *
+ * Note: the information retrieved may be stale. Use it with caution.
+ */
+int kvm_lookup_address_level_in_mm(struct kvm *kvm, unsigned long address)
+{
+	pgd_t *pgdp, pgd;
+	p4d_t *p4dp, p4d;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
+	unsigned long flags;
+	int level = PG_LEVEL_NONE;
+
+	/* Disable IRQs to prevent any tear down of page tables. */
+	local_irq_save(flags);
+
+	pgdp = pgd_offset(kvm->mm, address);
+	pgd = READ_ONCE(*pgdp);
+	if (pgd_none(pgd))
+		goto out;
+
+	p4dp = p4d_offset(pgdp, address);
+	p4d = READ_ONCE(*p4dp);
+	if (p4d_none(p4d) || !p4d_present(p4d))
+		goto out;
+
+	if (p4d_large(p4d)) {
+		level = PG_LEVEL_512G;
+		goto out;
+	}
+
+	pudp = pud_offset(p4dp, address);
+	pud = READ_ONCE(*pudp);
+	if (pud_none(pud) || !pud_present(pud))
+		goto out;
+
+	if (pud_large(pud)) {
+		level = PG_LEVEL_1G;
+		goto out;
+	}
+
+	pmdp = pmd_offset(pudp, address);
+	pmd = READ_ONCE(*pmdp);
+	if (pmd_none(pmd) || !pmd_present(pmd))
+		goto out;
+
+	if (pmd_large(pmd)) {
+		level = PG_LEVEL_2M;
+		goto out;
+	}
+
+	ptep = pte_offset_map(&pmd, address);
+	pte = ptep_get(ptep);
+	if (pte_present(pte)) {
+		pte_unmap(ptep);
+		level = PG_LEVEL_4K;
+		goto out;
+	}
+	pte_unmap(ptep);
+
+out:
+	local_irq_restore(flags);
+	return level;
+}
+EXPORT_SYMBOL_GPL(kvm_lookup_address_level_in_mm);
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 588792f003345..f1cdcc8483bd0 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -454,4 +454,6 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
unsigned int port, void *data, unsigned int count,
int in);

+int kvm_lookup_address_level_in_mm(struct kvm *kvm, unsigned long address);
+
#endif

base-commit: 2a39d8b39bffdaf1a4223d0d22f07baee154c8f3
--
2.36.0.464.gb9c8b46e94-goog


2022-04-29 20:30:08

by Sean Christopherson

Subject: Re: [PATCH] KVM: x86/mmu: fix potential races when walking host page table

On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > +out:
> > + local_irq_restore(flags);
> > + return level;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_lookup_address_level_in_mm);
>
> Exporting is not needed.
>
> Thanks for writing the walk code though. I'll adapt it and integrate the
> patch.

But why are we fixing this only in KVM? I liked the idea of stealing perf's
implementation because it was a seemingly perfect fit and wouldn't introduce
new code (ignoring wrappers, etc...).

We _know_ that at least one subsystem is misusing lookup_address_in_pgd() and
given that its wrappers are exported, I highly doubt KVM is the only offender.
It really feels like we're passing the buck here by burying the fix in KVM.

2022-04-29 20:56:19

by Paolo Bonzini

Subject: Re: [PATCH] KVM: x86/mmu: fix potential races when walking host page table

On 4/29/22 16:35, Sean Christopherson wrote:
> On Fri, Apr 29, 2022, Paolo Bonzini wrote:
>>> +out:
>>> + local_irq_restore(flags);
>>> + return level;
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvm_lookup_address_level_in_mm);
>>
>> Exporting is not needed.
>>
>> Thanks for writing the walk code though. I'll adapt it and integrate the
>> patch.
>
> But why are we fixing this only in KVM? I liked the idea of stealing perf's
> implementation because it was a seemingly perfect fit and wouldn't introduce
> new code (ignoring wrappers, etc...).
>
> We _know_ that at least one subsystem is misusing lookup_address_in_pgd() and
> given that its wrappers are exported, I highly doubt KVM is the only offender.
> It really feels like we're passing the buck here by burying the fix in KVM.

There are two ways to do it:

* having a generic function in mm/. The main issue there is the lack of
a PG_LEVEL_{P4D,PUD,PMD,PTE} enum at the mm/ level. We could use
(ctz(x) - 12) / 9 to go from size to level, but it's ugly and there
could be architectures with heterogeneous page table sizes.

* having a generic function in arch/x86/. In this case KVM seems to be
the odd one that doesn't need the PTE. For example vc_slow_virt_to_phys
needs the PTE, and needs the size rather than the "level" per se.

So for now I punted, while keeping open the door for moving code from
arch/x86/kvm/ to mm/ if anyone else (even other KVM ports) need the same
logic.
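
For reference, that size-to-level arithmetic would look roughly like the
sketch below (hypothetical helper; it assumes 4KiB base pages and 9 index
bits per level, which is exactly the assumption that heterogeneous page
table sizes would break):

	/*
	 *   4KiB   -> (12 - 12) / 9 = 0
	 *   2MiB   -> (21 - 12) / 9 = 1
	 *   1GiB   -> (30 - 12) / 9 = 2
	 *   512GiB -> (39 - 12) / 9 = 3
	 *
	 * (0-based here, whereas x86's PG_LEVEL_4K is 1.)
	 */
	static inline int mapping_size_to_level(unsigned long size)
	{
		return (__builtin_ctzl(size) - 12) / 9;
	}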

Paolo

2022-05-02 05:55:32

by Sean Christopherson

Subject: Re: [PATCH] KVM: x86/mmu: fix potential races when walking host page table

On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> On 4/29/22 05:17, Mingwei Zhang wrote:
> > + ptep = pte_offset_map(&pmd, address);
> > + pte = ptep_get(ptep);
> > + if (pte_present(pte)) {
> > + pte_unmap(ptep);
> > + level = PG_LEVEL_4K;
> > + goto out;
> > + }
> > + pte_unmap(ptep);
>
> Not needed as long as PG_LEVEL_4K is returned for a non-present PTE.
>
> > +out:
> > + local_irq_restore(flags);
> > + return level;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_lookup_address_level_in_mm);
>
> Exporting is not needed.
>
> Thanks for writing the walk code though. I'll adapt it and integrate the
> patch.

Can you also remove the intermediate pointers? They are at best a wash in terms
of readability (net negative IMO), and introduce unnecessary risk by opening up
the possibility of using the wrong variable and/or re-reading an entry.

Case in point, this code subtly re-reads the upper level entries when getting
the next level down, which can run afoul of huge page promotion and use the
wrong address (huge pfn instead of page table address).
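
Spelled out for the PUD step (an annotated fragment, shown alongside the
snapshot-based alternative used in the version below):

	p4d = READ_ONCE(*p4dp);
	/* ... p4d_none()/p4d_present()/p4d_large() checks on the snapshot ... */
	pudp = pud_offset(p4dp, address);	/* re-reads *p4dp; if the entry
						 * was just promoted, this is
						 * derived from a huge pfn, not
						 * a page table address. */
	pud = READ_ONCE(*pudp);

	/* versus deriving the next table from the local snapshot only: */
	pud = READ_ONCE(*pud_offset(&p4d, address));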

The alternative is to use the p*d_offset_lockless() helper, but IMO that's
unnecessary and pointless because it just does the same thing under the hood.

E.g. (compile tested only at this point)

static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
				  const struct kvm_memory_slot *slot)
{
	unsigned long flags;
	unsigned long hva;
	int level;
	pgd_t pgd;
	p4d_t p4d;
	pud_t pud;
	pmd_t pmd;

	if (!PageCompound(pfn_to_page(pfn)) && !kvm_is_zone_device_pfn(pfn))
		return PG_LEVEL_4K;

	/*
	 * Note, using the already-retrieved memslot and __gfn_to_hva_memslot()
	 * is not solely for performance, it's also necessary to avoid the
	 * "writable" check in __gfn_to_hva_many(), which will always fail on
	 * read-only memslots due to gfn_to_hva() assuming writes. Earlier
	 * page fault steps have already verified the guest isn't writing a
	 * read-only memslot.
	 */
	hva = __gfn_to_hva_memslot(slot, gfn);

	level = PG_LEVEL_4K;

	/*
	 * Disable IRQs to block TLB shootdown and thus prevent user page tables
	 * from being freed.
	 */
	local_irq_save(flags);

	pgd = READ_ONCE(*pgd_offset(kvm->mm, hva));
	if (pgd_none(pgd))
		goto out;

	p4d = READ_ONCE(*p4d_offset(&pgd, hva));
	if (p4d_none(p4d) || !p4d_present(p4d))
		goto out;

	if (p4d_large(p4d)) {
		level = PG_LEVEL_512G;
		goto out;
	}

	pud = READ_ONCE(*pud_offset(&p4d, hva));
	if (pud_none(pud) || !pud_present(pud))
		goto out;

	if (pud_large(pud)) {
		level = PG_LEVEL_1G;
		goto out;
	}

	pmd = READ_ONCE(*pmd_offset(&pud, hva));
	if (pmd_none(pmd) || !pmd_present(pmd))
		goto out;

	if (pmd_large(pmd))
		level = PG_LEVEL_2M;
out:
	local_irq_restore(flags);
	return level;
}

2022-05-02 12:32:15

by Paolo Bonzini

Subject: Re: [PATCH] KVM: x86/mmu: fix potential races when walking host page table

On 4/29/22 05:17, Mingwei Zhang wrote:
> @@ -2838,11 +2836,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> */
> hva = __gfn_to_hva_memslot(slot, gfn);
>
> - pte = lookup_address_in_mm(kvm->mm, hva, &level);
> - if (unlikely(!pte))
> - return PG_LEVEL_4K;
> -
> - return level;
> + return kvm_lookup_address_level_in_mm(kvm, hva);
> }

The function can be just inlined in host_pfn_mapping_level.

> int kvm_mmu_max_mapping_level(struct kvm *kvm,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 951d0a78ccdae..61406efe4ea7f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13044,6 +13044,76 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
> }
> EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>
> +/*
> + * Lookup the valid mapping level for a virtual address in the current mm.
> + * Return the level of the mapping if there is a present one. Otherwise, always
> + * return PG_LEVEL_NONE.

This is a change in semantics, because host_pfn_mapping_level never
returned PG_LEVEL_NONE. Returning PG_LEVEL_4K for a non-present entry
is safe; if it happens, MMU notifiers will force a retry. If the
function is inlined in host_pfn_mapping_level, returning PG_LEVEL_4K
would allow making the semantic change in a separate patch.

In fact, kvm_mmu_hugepage_adjust will go on and set fault->req_level and
fault->goal_level to PG_LEVEL_NONE, which is wrong even if it does not
cause havoc.

> + * Note: the information retrieved may be stale. Use it with caution.

The comment should point out that mmu_notifier_retry makes it safe to use
the stale value---of course this is only true if
kvm_lookup_address_level_in_mm is used where mmu_notifier_retry is used,
and might be another point in favor of inlining.

> + ptep = pte_offset_map(&pmd, address);
> + pte = ptep_get(ptep);
> + if (pte_present(pte)) {
> + pte_unmap(ptep);
> + level = PG_LEVEL_4K;
> + goto out;
> + }
> + pte_unmap(ptep);

Not needed as long as PG_LEVEL_4K is returned for a non-present PTE.

> +out:
> + local_irq_restore(flags);
> + return level;
> +}
> +EXPORT_SYMBOL_GPL(kvm_lookup_address_level_in_mm);

Exporting is not needed.

Thanks for writing the walk code though. I'll adapt it and integrate
the patch.

Paolo

> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 588792f003345..f1cdcc8483bd0 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -454,4 +454,6 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
> unsigned int port, void *data, unsigned int count,
> int in);
>
> +int kvm_lookup_address_level_in_mm(struct kvm *kvm, unsigned long address);
> +
> #endif
>
> base-commit: 2a39d8b39bffdaf1a4223d0d22f07baee154c8f3

2022-05-02 23:29:13

by Sean Christopherson

Subject: Re: [PATCH] KVM: x86/mmu: fix potential races when walking host page table

On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> On 4/29/22 16:35, Sean Christopherson wrote:
> > On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > > > +out:
> > > > + local_irq_restore(flags);
> > > > + return level;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(kvm_lookup_address_level_in_mm);
> > >
> > > Exporting is not needed.
> > >
> > > Thanks for writing the walk code though. I'll adapt it and integrate the
> > > patch.
> >
> > But why are we fixing this only in KVM? I liked the idea of stealing perf's
> > implementation because it was a seemingly perfect fit and wouldn't introduce
> > new code (ignoring wrappers, etc...).
> >
> > We _know_ that at least one subsystem is misusing lookup_address_in_pgd() and
> > given that its wrappers are exported, I highly doubt KVM is the only offender.
> > It really feels like we're passing the buck here by burying the fix in KVM.
>
> There are two ways to do it:
>
> * having a generic function in mm/. The main issue there is the lack of a
> PG_LEVEL_{P4D,PUD,PMD,PTE} enum at the mm/ level. We could use (ctz(x) -
> 12) / 9 to go from size to level, but it's ugly and there could be
> architectures with heterogeneous page table sizes.
>
> * having a generic function in arch/x86/. In this case KVM seems to be the
> odd one that doesn't need the PTE. For example vc_slow_virt_to_phys needs
> the PTE, and needs the size rather than the "level" per se.
>
> So for now I punted, while keeping open the door for moving code from
> arch/x86/kvm/ to mm/ if anyone else (even other KVM ports) need the same
> logic.

Ugh. I was going to say that KVM is the only in-tree user that's subtly broken,
but then I saw vc_slow_virt_to_phys()... So there's at least one other use case
for walking user addresses and being able to tolerate a not-present mapping.

There are no other users of lookup_address_in_mm(), and other than SEV-ES's
dastardly use of lookup_address_in_pgd(), pgd can only ever come from init_mm or
efi_mm, i.e. can't work with user addresses anyways.

If we go the KVM-only route, can you send the below revert along with it? The least
we can do is not give others an easy way to screw up.

Until I saw the #VC crud, I was hoping we could also explicitly prevent using
lookup_address_in_pgd() with user addresses. If/when #VC is fixed, we can/should
add this:

@@ -592,6 +592,15 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,

*level = PG_LEVEL_NONE;

+ /*
+ * The below walk does not guard against user page tables being torn
+ * down, attempting to walk a user address is dangerous and likely to
+ * explode sooner or later. This helper is intended only for use with
+ * kernel-only mm_structs, e.g. init_mm and efi_mm.
+ */
+ if (WARN_ON_ONCE(address < TASK_SIZE_MAX))
+ return NULL;
+


From: Sean Christopherson <[email protected]>
Date: Fri, 29 Apr 2022 07:57:53 -0700
Subject: [PATCH] Revert "x86/mm: Introduce lookup_address_in_mm()"

Drop lookup_address_in_mm() now that KVM is providing its own variant
of lookup_address_in_pgd() that is safe for use with user addresses, e.g.
guards against page tables being torn down. A variant that provides a
non-init mm is inherently dangerous and flawed, as the only reason to use
an mm other than init_mm is to walk a userspace mapping, and
lookup_address_in_pgd() does not play nice with userspace mappings, e.g.
doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE()
to ensure an upper level entry isn't converted to a huge page between
checking the PAGE_SIZE bit and grabbing the address of the next level
down.

This reverts commit 13c72c060f1ba6f4eddd7b1c4f52a8aded43d6d9.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 4 ----
arch/x86/mm/pat/set_memory.c | 11 -----------
2 files changed, 15 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..407084d9fd99 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -559,10 +559,6 @@ static inline void update_page_count(int level, unsigned long pages) { }
extern pte_t *lookup_address(unsigned long address, unsigned int *level);
extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
unsigned int *level);
-
-struct mm_struct;
-extern pte_t *lookup_address_in_mm(struct mm_struct *mm, unsigned long address,
- unsigned int *level);
extern pmd_t *lookup_pmd_address(unsigned long address);
extern phys_addr_t slow_virt_to_phys(void *__address);
extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn,
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index abf5ed76e4b7..0656db33574d 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -638,17 +638,6 @@ pte_t *lookup_address(unsigned long address, unsigned int *level)
}
EXPORT_SYMBOL_GPL(lookup_address);

-/*
- * Lookup the page table entry for a virtual address in a given mm. Return a
- * pointer to the entry and the level of the mapping.
- */
-pte_t *lookup_address_in_mm(struct mm_struct *mm, unsigned long address,
- unsigned int *level)
-{
- return lookup_address_in_pgd(pgd_offset(mm, address), address, level);
-}
-EXPORT_SYMBOL_GPL(lookup_address_in_mm);
-
static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address,
unsigned int *level)
{

base-commit: 6f363ed2fa4c24c400acc29b659c96e4dc7930e8
--