2022-04-16 01:59:59

by Lai Jiangshan

Subject: [PATCH] kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode

From: Lai Jiangshan <[email protected]>

When NPT is enabled and L1 uses PAE paging, vcpu->arch.mmu->get_pdptr(),
which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTEs from
memory unconditionally on each call.

The guest PAE root page is not write-protected.

So mmu->get_pdptr() in FNAME(walk_addr_generic) might return different
values on successive calls, or a value different from what
mmu->get_pdptr() returned in mmu_alloc_shadow_roots().

This can cause FNAME(fetch) to install an spte in the wrong sp, or to
link an sp to the wrong parent, since FNAME(gpte_changed) can't detect
these kinds of changes.

Cache the PDPTEs and the problem is resolved. The guest is responsible
for informing the host when its PAE root page is updated, which causes
a nested vmexit, and the host updates the cache on the next nested run.

Commit e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from
guest memory") fixes the same problem for the non-nested case.

Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory")
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/kvm/svm/nested.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d736ec6514ca..a34983d2dc07 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -72,18 +72,22 @@ static void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_excep
 	}
 }
 
-static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
+static void nested_svm_cache_tdp_pdptrs(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	u64 cr3 = svm->nested.ctl.nested_cr3;
-	u64 pdpte;
+	u64 *pdptrs = vcpu->arch.mmu->pdptrs;
 	int ret;
 
-	ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), &pdpte,
-				       offset_in_page(cr3) + index * 8, 8);
+	ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), pdptrs,
+				       offset_in_page(cr3), 8 * 4);
 	if (ret)
-		return 0;
-	return pdpte;
+		memset(pdptrs, 0, 8 * 4);
+}
+
+static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
+{
+	return vcpu->arch.mmu->pdptrs[index];
 }
 
 static unsigned long nested_svm_get_tdp_cr3(struct kvm_vcpu *vcpu)
@@ -109,6 +113,8 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
 	kvm_init_shadow_npt_mmu(vcpu, X86_CR0_PG, svm->vmcb01.ptr->save.cr4,
 				svm->vmcb01.ptr->save.efer,
 				svm->nested.ctl.nested_cr3);
+	if (vcpu->arch.mmu->root_level == PT32E_ROOT_LEVEL)
+		nested_svm_cache_tdp_pdptrs(vcpu);
 	vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3;
 	vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr;
 	vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit;
--
2.19.1.6.gb485710b


2022-05-13 10:08:07

by Lai Jiangshan

Subject: Re: [PATCH] kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode

On Fri, Apr 15, 2022 at 6:33 PM Lai Jiangshan <[email protected]> wrote:
>
> From: Lai Jiangshan <[email protected]>
>
> When NPT is enabled and L1 uses PAE paging, vcpu->arch.mmu->get_pdptr(),
> which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTEs from
> memory unconditionally on each call.
>
> The guest PAE root page is not write-protected.
>
> So mmu->get_pdptr() in FNAME(walk_addr_generic) might return different
> values on successive calls, or a value different from what
> mmu->get_pdptr() returned in mmu_alloc_shadow_roots().
>
> This can cause FNAME(fetch) to install an spte in the wrong sp, or to
> link an sp to the wrong parent, since FNAME(gpte_changed) can't detect
> these kinds of changes.
>
> Cache the PDPTEs and the problem is resolved. The guest is responsible
> for informing the host when its PAE root page is updated, which causes
> a nested vmexit, and the host updates the cache on the next nested run.
>
> Commit e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from
> guest memory") fixes the same problem for the non-nested case.
>
> Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory")
> Signed-off-by: Lai Jiangshan <[email protected]>

Ping

2022-05-17 01:39:49

by Sean Christopherson

Subject: Re: [PATCH] kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode

On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> When NPT is enabled and L1 uses PAE paging, vcpu->arch.mmu->get_pdptr(),
> which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTEs from
> memory unconditionally on each call.
>
> The guest PAE root page is not write-protected.
>
> So mmu->get_pdptr() in FNAME(walk_addr_generic) might return different
> values on successive calls, or a value different from what
> mmu->get_pdptr() returned in mmu_alloc_shadow_roots().
>
> This can cause FNAME(fetch) to install an spte in the wrong sp, or to
> link an sp to the wrong parent, since FNAME(gpte_changed) can't detect
> these kinds of changes.
>
> Cache the PDPTEs and the problem is resolved. The guest is responsible
> for informing the host when its PAE root page is updated, which causes
> a nested vmexit, and the host updates the cache on the next nested run.

Hmm, no, the guest is responsible for invalidating translations that can be
cached in the TLB, but the guest is not responsible for a full reload of PDPTEs.
Per the APM, the PDPTEs can be cached like regular PTEs:

Under SVM, however, when the processor is in guest mode with PAE enabled, the
guest PDPT entries are not cached or validated at this point, but instead are
loaded and checked on demand in the normal course of address translation, just
like page directory and page table entries. Any reserved bit violations are
detected at the point of use, and result in a page-fault (#PF) exception rather
than a general-protection (#GP) exception.

So if L1 modifies a PDPTE from !PRESENT (or RESERVED) to PRESENT (and valid), then
any active L2 vCPUs should recognize the new PDPTE without a nested VM-Exit because
the old entry can't have been cached in the TLB.

In practice, snapshotting at nested VMRUN would likely work, but architecturally
it's wrong and could cause problems if L1+L2 are engaged in paravirt shenanigans,
e.g. async #PF comes to mind.

I believe the correct way to fix this is to write-protect nNPT PDPTEs like all other
shadow pages, which shouldn't be too awful to do as part of your series to route
PDPTEs through kvm_mmu_get_page().

2022-05-17 07:03:20

by Lai Jiangshan

Subject: Re: [PATCH] kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode

On Tue, May 17, 2022 at 9:02 AM Lai Jiangshan <[email protected]> wrote:
>
> On Tue, May 17, 2022 at 4:45 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <[email protected]>
> > >
> > > When NPT is enabled and L1 uses PAE paging, vcpu->arch.mmu->get_pdptr(),
> > > which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTEs from
> > > memory unconditionally on each call.
> > >
> > > The guest PAE root page is not write-protected.
> > >
> > > So mmu->get_pdptr() in FNAME(walk_addr_generic) might return different
> > > values on successive calls, or a value different from what
> > > mmu->get_pdptr() returned in mmu_alloc_shadow_roots().
> > >
> > > This can cause FNAME(fetch) to install an spte in the wrong sp, or to
> > > link an sp to the wrong parent, since FNAME(gpte_changed) can't detect
> > > these kinds of changes.
> > >
> > > Cache the PDPTEs and the problem is resolved. The guest is responsible
> > > for informing the host when its PAE root page is updated, which causes
> > > a nested vmexit, and the host updates the cache on the next nested run.
> >
> > Hmm, no, the guest is responsible for invalidating translations that can be
> > cached in the TLB, but the guest is not responsible for a full reload of PDPTEs.
> > Per the APM, the PDPTEs can be cached like regular PTEs:
> >
> > Under SVM, however, when the processor is in guest mode with PAE enabled, the
> > guest PDPT entries are not cached or validated at this point, but instead are
> > loaded and checked on demand in the normal course of address translation, just
> > like page directory and page table entries. Any reserved bit violations are
> > detected at the point of use, and result in a page-fault (#PF) exception rather
> > than a general-protection (#GP) exception.
> >
> > So if L1 modifies a PDPTE from !PRESENT (or RESERVED) to PRESENT (and valid), then
> > any active L2 vCPUs should recognize the new PDPTE without a nested VM-Exit because
> > the old entry can't have been cached in the TLB.
>
> In this case, it is still !PRESENT in the shadow page, and it will cause
> a vmexit when L2 tries to use the translation. I can't see anything wrong
> in the TLB or the vTLB (shadow pages).
>
> But I think some code is needed to reload the cached PDPTEs when
> guest_mmu->get_pdptr() returns !PRESENT, and to reload the mmu if the
> cache has changed (plus some code to avoid an infinite loop).
>
> The patch fails to reload the mmu when the cache changes, which leaves
> the problem described in the changelog only partially resolved.
>
> Maybe we need to add the mmu parameter back to load_pdptrs() for this.
>

It is still too complicated; I will try to do a direct check
in FNAME(fetch) instead of (re)using the cache.
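
Something along these lines, perhaps (only an untested sketch of the idea,
not a real patch; the helper name nested_npt_pdptr_changed() is made up):

/*
 * Sketch: re-read the guest PDPTE for the nested-NPT PAE case and compare
 * it against the value the walker used, in the same spirit as
 * FNAME(gpte_changed).
 */
static bool nested_npt_pdptr_changed(struct kvm_vcpu *vcpu, int index,
				     u64 old_pdptr)
{
	u64 cr3 = to_svm(vcpu)->nested.ctl.nested_cr3;
	u64 cur;

	/* Treat a failed read as "changed" so the caller bails and retries. */
	if (kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), &cur,
				     offset_in_page(cr3) + index * 8, 8))
		return true;

	return cur != old_pdptr;
}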

> >
> > In practice, snapshotting at nested VMRUN would likely work, but architecturally
> > it's wrong and could cause problems if L1+L2 are engaged in paravirt shenanigans,
> > e.g. async #PF comes to mind.
> >
> > I believe the correct way to fix this is to write-protect nNPT PDPTEs like all other
> > shadow pages, which shouldn't be too awful to do as part of your series to route
> > PDPTEs through kvm_mmu_get_page().
>
> In the one-off special shadow page (to be renamed one-off local shadow
> page) patchset, PAE PDPTEs are not write-protected. Write-protecting
> them results in nasty code.

2022-05-18 01:57:46

by Lai Jiangshan

Subject: Re: [PATCH] kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode

On Tue, May 17, 2022 at 4:45 AM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> > From: Lai Jiangshan <[email protected]>
> >
> > When NPT is enabled and L1 uses PAE paging, vcpu->arch.mmu->get_pdptr(),
> > which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTEs from
> > memory unconditionally on each call.
> >
> > The guest PAE root page is not write-protected.
> >
> > So mmu->get_pdptr() in FNAME(walk_addr_generic) might return different
> > values on successive calls, or a value different from what
> > mmu->get_pdptr() returned in mmu_alloc_shadow_roots().
> >
> > This can cause FNAME(fetch) to install an spte in the wrong sp, or to
> > link an sp to the wrong parent, since FNAME(gpte_changed) can't detect
> > these kinds of changes.
> >
> > Cache the PDPTEs and the problem is resolved. The guest is responsible
> > for informing the host when its PAE root page is updated, which causes
> > a nested vmexit, and the host updates the cache on the next nested run.
>
> Hmm, no, the guest is responsible for invalidating translations that can be
> cached in the TLB, but the guest is not responsible for a full reload of PDPTEs.
> Per the APM, the PDPTEs can be cached like regular PTEs:
>
> Under SVM, however, when the processor is in guest mode with PAE enabled, the
> guest PDPT entries are not cached or validated at this point, but instead are
> loaded and checked on demand in the normal course of address translation, just
> like page directory and page table entries. Any reserved bit violations are
> detected at the point of use, and result in a page-fault (#PF) exception rather
> than a general-protection (#GP) exception.
>
> So if L1 modifies a PDPTE from !PRESENT (or RESERVED) to PRESENT (and valid), then
> any active L2 vCPUs should recognize the new PDPTE without a nested VM-Exit because
> the old entry can't have been cached in the TLB.

In this case, it is still !PRESENT in the shadow page, and it will cause
a vmexit when L2 tries to use the translation. I can't see anything wrong
in the TLB or the vTLB (shadow pages).

But I think some code is needed to reload the cached PDPTEs when
guest_mmu->get_pdptr() returns !PRESENT, and to reload the mmu if the
cache has changed (plus some code to avoid an infinite loop).

The patch fails to reload the mmu when the cache changes, which leaves
the problem described in the changelog only partially resolved.

Maybe we need to add the mmu parameter back to load_pdptrs() for this.
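
Roughly like this, maybe (again only an untested sketch to show the shape
of it; the helper name is made up, and whether kvm_mmu_reset_context() is
the right way to force the reload, and how to avoid an endless retry loop,
are open questions):

/*
 * Sketch: re-read the guest PDPTEs, and if the cached copy went stale,
 * refresh the cache and force the shadow roots to be rebuilt.
 */
static void nested_svm_refresh_tdp_pdptrs(struct kvm_vcpu *vcpu)
{
	u64 old[4];

	memcpy(old, vcpu->arch.mmu->pdptrs, sizeof(old));

	nested_svm_cache_tdp_pdptrs(vcpu);

	if (memcmp(old, vcpu->arch.mmu->pdptrs, sizeof(old)))
		kvm_mmu_reset_context(vcpu);
}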

>
> In practice, snapshotting at nested VMRUN would likely work, but architecturally
> it's wrong and could cause problems if L1+L2 are engaged in paravirt shenanigans,
> e.g. async #PF comes to mind.
>
> I believe the correct way to fix this is to write-protect nNPT PDPTEs like all other
> shadow pages, which shouldn't be too awful to do as part of your series to route
> PDPTEs through kvm_mmu_get_page().

In the one-off special shadow page (to be renamed one-off local shadow
page) patchset, PAE PDPTEs are not write-protected. Write-protecting
them results in nasty code.

2022-05-26 10:01:29

by Lai Jiangshan

Subject: Re: [PATCH] kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode

On Thu, May 26, 2022 at 5:45 AM David Matlack <[email protected]> wrote:
>
> On Mon, May 16, 2022 at 2:06 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <[email protected]>
> > >
> > > When NPT is enabled and L1 uses PAE paging, vcpu->arch.mmu->get_pdptr(),
> > > which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTEs from
> > > memory unconditionally on each call.
> > >
> > > The guest PAE root page is not write-protected.
> > >
> > > So mmu->get_pdptr() in FNAME(walk_addr_generic) might return different
> > > values on successive calls, or a value different from what
> > > mmu->get_pdptr() returned in mmu_alloc_shadow_roots().
> > >
> > > This can cause FNAME(fetch) to install an spte in the wrong sp, or to
> > > link an sp to the wrong parent, since FNAME(gpte_changed) can't detect
> > > these kinds of changes.
> > >
> > > Cache the PDPTEs and the problem is resolved. The guest is responsible
> > > for informing the host when its PAE root page is updated, which causes
> > > a nested vmexit, and the host updates the cache on the next nested run.
> >
> > Hmm, no, the guest is responsible for invalidating translations that can be
> > cached in the TLB, but the guest is not responsible for a full reload of PDPTEs.
> > Per the APM, the PDPTEs can be cached like regular PTEs:
> >
> > Under SVM, however, when the processor is in guest mode with PAE enabled, the
> > guest PDPT entries are not cached or validated at this point, but instead are
> > loaded and checked on demand in the normal course of address translation, just
> > like page directory and page table entries. Any reserved bit violations are
> > detected at the point of use, and result in a page-fault (#PF) exception rather
> > than a general-protection (#GP) exception.
>
> This paragraph from the APM describes the behavior of CR3 loads while
> in SVM guest-mode. But this patch is changing how KVM emulates SVM
> host-mode (i.e. L1), right? It seems like AMD makes no guarantee about
> whether CR3 loads pre-load PDPTEs while in SVM host-mode.
> (Although the APM does say that "modern processors" do not pre-load
> PDPTEs.)

Oh, I also missed the fact that L1 is the host when emulating it.

The code is for host-mode (L1)'s nested_cr3, which uses the traditional
PAE PDPTE loading and checking.

So using caches is the only correct way, right?

If so, I can update just this patch (without adding it to the one-off
local shadow page patchset) and add some checks to see whether the
loaded cache has changed.

Or maybe I'll just drop it, since I'm not familiar enough with SVM.
I hope it at least serves as a bug report.

Thanks
Lai

2022-05-26 12:32:55

by Lai Jiangshan

Subject: Re: [PATCH] kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode

On Thu, May 26, 2022 at 5:33 PM Paolo Bonzini <[email protected]> wrote:
>
> On 5/26/22 10:38, Lai Jiangshan wrote:
> >> (Although the APM does say that "modern processors" do not pre-load
> >> PDPTEs.)
>
> This changed between the Oct 2020 and Nov 2021 versions of the APM, so I
> suppose the change was done in Zen 3.
>
> > Oh, I also missed the fact that L1 is the host when emulating it.
> >
> > The code is for host-mode (L1)'s nested_cr3, which uses the traditional
> > PAE PDPTE loading and checking.
> >
> > So using caches is the only correct way, right?
>
> The caching behavior for NPT PDPTEs does not matter too much. What
> matters is that a PDPTE with reserved bits should cause a #NPF at usage
> time rather than a VMentry failure or a #NPF immediately after VMentry.
>

Since there is an mmu->get_pdptr() call in mmu_alloc_shadow_roots(), that
can't be guaranteed right now.

It will be easier to make the #NPF happen only at usage time after the
one-off local shadow page patchset.

2022-05-26 15:14:20

by Paolo Bonzini

Subject: Re: [PATCH] kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode

On 5/26/22 10:38, Lai Jiangshan wrote:
>> (Although the APM does say that "modern processors" do not pre-load
>> PDPTEs.)

This changed between the Oct 2020 and Nov 2021 versions of the APM, so I
suppose the change was done in Zen 3.

> Oh, I also missed the fact that L1 is the host when emulating it.
>
> The code is for host-mode (L1)'s nested_cr3, which uses the traditional
> PAE PDPTE loading and checking.
>
> So using caches is the only correct way, right?

The caching behavior for NPT PDPTEs does not matter too much. What
matters is that a PDPTE with reserved bits should cause a #NPF at usage
time rather than a VMentry failure or a #NPF immediately after VMentry.

Paolo


2022-05-26 20:47:37

by David Matlack

Subject: Re: [PATCH] kvm: x86/svm/nested: Cache PDPTEs for nested NPT in PAE paging mode

On Mon, May 16, 2022 at 2:06 PM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Apr 15, 2022, Lai Jiangshan wrote:
> > From: Lai Jiangshan <[email protected]>
> >
> > When NPT is enabled and L1 uses PAE paging, vcpu->arch.mmu->get_pdptr(),
> > which is nested_svm_get_tdp_pdptr(), reads the guest NPT's PDPTEs from
> > memory unconditionally on each call.
> >
> > The guest PAE root page is not write-protected.
> >
> > So mmu->get_pdptr() in FNAME(walk_addr_generic) might return different
> > values on successive calls, or a value different from what
> > mmu->get_pdptr() returned in mmu_alloc_shadow_roots().
> >
> > This can cause FNAME(fetch) to install an spte in the wrong sp, or to
> > link an sp to the wrong parent, since FNAME(gpte_changed) can't detect
> > these kinds of changes.
> >
> > Cache the PDPTEs and the problem is resolved. The guest is responsible
> > for informing the host when its PAE root page is updated, which causes
> > a nested vmexit, and the host updates the cache on the next nested run.
>
> Hmm, no, the guest is responsible for invalidating translations that can be
> cached in the TLB, but the guest is not responsible for a full reload of PDPTEs.
> Per the APM, the PDPTEs can be cached like regular PTEs:
>
> Under SVM, however, when the processor is in guest mode with PAE enabled, the
> guest PDPT entries are not cached or validated at this point, but instead are
> loaded and checked on demand in the normal course of address translation, just
> like page directory and page table entries. Any reserved bit violations are
> detected at the point of use, and result in a page-fault (#PF) exception rather
> than a general-protection (#GP) exception.

This paragraph from the APM describes the behavior of CR3 loads while
in SVM guest-mode. But this patch is changing how KVM emulates SVM
host-mode (i.e. L1), right? It seems like AMD makes no guarantee about
whether CR3 loads pre-load PDPTEs while in SVM host-mode.
(Although the APM does say that "modern processors" do not pre-load
PDPTEs.)

>
> So if L1 modifies a PDPTE from !PRESENT (or RESERVED) to PRESENT (and valid), then
> any active L2 vCPUs should recognize the new PDPTE without a nested VM-Exit because
> the old entry can't have been cached in the TLB.
>
> In practice, snapshotting at nested VMRUN would likely work, but architecturally
> it's wrong and could cause problems if L1+L2 are engaged in paravirt shenanigans,
> e.g. async #PF comes to mind.
>
> I believe the correct way to fix this is to write-protect nNPT PDPTEs like all other
> shadow pages, which shouldn't be too awful to do as part of your series to route
> PDPTEs through kvm_mmu_get_page().