2022-07-15 23:36:33

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 0/4] KVM: x86/mmu: Memtype related cleanups

Minor cleanups for KVM's handling of the memtype that's shoved into SPTEs.

Patch 1 enforces that entry '0' of the host's IA32_PAT is configured for WB
memtype. KVM subtly relies on this behavior (silently shoves '0' into the
SPTE PAT field). Check this at KVM load time so that if that doesn't hold
true, KVM will refuse to load instead of running the guest with weird and
potentially dangerous memtypes.

Patch 2 is a pure code cleanup (ordered after patch 1 in case someone wants
to backport the PAT check).

Patch 3 adds a mask to track whether or not KVM may use a non-zero memtype
value in SPTEs. Essentially, it's an "is EPT enabled" flag without being an
explicit "is EPT enabled" flag. This avoids some minor work when not using
EPT, e.g. technically KVM could drop the RET0 implementation that's used for
SVM's get_mt_mask(), but IMO that's an unnecessary risk.

Patch 4 modifies the TDP page fault path to restrict the mapping level
based on guest MTRRs if and only if KVM might actually consume them. The
guest MTRRs are purely software constructs (not directly consumed by
hardware), and KVM only honors them when EPT is enabled (host MTRRs are
overridden by EPT) and the guest has non-coherent DMA. I doubt this will
move the needle on whether or not KVM can create huge pages, but it does
save having to do MTRR lookups on every page fault for guests without
a non-coherent DMA device attached.

Sean Christopherson (4):
KVM: x86: Reject loading KVM if host.PAT[0] != WB
KVM: x86: Drop unnecessary goto+label in kvm_arch_init()
KVM: x86/mmu: Add shadow mask for effective host MTRR memtype
KVM: x86/mmu: Restrict mapping level based on guest MTRR iff they're
used

arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++-------
arch/x86/kvm/mmu/spte.c | 21 ++++++++++++++++++---
arch/x86/kvm/mmu/spte.h | 1 +
arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
4 files changed, 58 insertions(+), 23 deletions(-)


base-commit: 8031d87aa9953ddeb047a5356ebd0b240c30f233
--
2.37.0.170.g444d1eabd0-goog


2022-07-15 23:56:21

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 1/4] KVM: x86: Reject loading KVM if host.PAT[0] != WB

Reject KVM if entry '0' in the host's IA32_PAT MSR is not programmed to
writeback (WB) memtype. KVM subtly relies on IA32_PAT entry '0' to be
programmed to WB by leaving the PAT bits in shadow paging and NPT SPTEs
as '0'. If something other than WB is in PAT[0], at _best_ guests will
suffer very poor performance, and at worst KVM will crash the system by
breaking cache-coherency expectations (e.g. using WC for guest memory).

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f389691d8c04..12199c40f2bc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9141,6 +9141,7 @@ static struct notifier_block pvclock_gtod_notifier = {
int kvm_arch_init(void *opaque)
{
struct kvm_x86_init_ops *ops = opaque;
+ u64 host_pat;
int r;

if (kvm_x86_ops.hardware_enable) {
@@ -9179,6 +9180,20 @@ int kvm_arch_init(void *opaque)
goto out;
}

+ /*
+ * KVM assumes that PAT entry '0' encodes WB memtype and simply zeroes
+ * the PAT bits in SPTEs. Bail if PAT[0] is programmed to something
+ * other than WB. Note, EPT doesn't utilize the PAT, but don't bother
+ * with an exception. PAT[0] is set to WB on RESET and also by the
+ * kernel, i.e. failure indicates a kernel bug or broken firmware.
+ */
+ if (rdmsrl_safe(MSR_IA32_CR_PAT, &host_pat) ||
+ (host_pat & GENMASK(2, 0)) != 6) {
+ pr_err("kvm: host PAT[0] is not WB\n");
+ r = -EIO;
+ goto out;
+ }
+
r = -ENOMEM;

x86_emulator_cache = kvm_alloc_emulator_cache();
--
2.37.0.170.g444d1eabd0-goog

2022-07-15 23:57:02

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 3/4] KVM: x86/mmu: Add shadow mask for effective host MTRR memtype

Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask
instead of relying on TDP being enabled, as NPT doesn't need a non-zero
mask. This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero
for NPT anyways.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 21 ++++++++++++++++++---
arch/x86/kvm/mmu/spte.h | 1 +
2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index fb1f17504138..7314d27d57a4 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -33,6 +33,7 @@ u64 __read_mostly shadow_mmio_value;
u64 __read_mostly shadow_mmio_mask;
u64 __read_mostly shadow_mmio_access_mask;
u64 __read_mostly shadow_present_mask;
+u64 __read_mostly shadow_memtype_mask;
u64 __read_mostly shadow_me_value;
u64 __read_mostly shadow_me_mask;
u64 __read_mostly shadow_acc_track_mask;
@@ -161,10 +162,10 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,

if (level > PG_LEVEL_4K)
spte |= PT_PAGE_SIZE_MASK;
- if (tdp_enabled)
+
+ if (shadow_memtype_mask)
spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
- kvm_is_mmio_pfn(pfn));
-
+ kvm_is_mmio_pfn(pfn));
if (host_writable)
spte |= shadow_host_writable_mask;
else
@@ -391,6 +392,13 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
shadow_nx_mask = 0ull;
shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+ /*
+ * EPT overrides the host MTRRs, and so KVM must program the desired
+ * memtype directly into the SPTEs. Note, this mask is just the mask
+ * of all bits that factor into the memtype, the actual memtype must be
+ * dynamically calculated, e.g. to ensure host MMIO is mapped UC.
+ */
+ shadow_memtype_mask = VMX_EPT_MT_MASK | VMX_EPT_IPAT_BIT;
shadow_acc_track_mask = VMX_EPT_RWX_MASK;
shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
shadow_mmu_writable_mask = EPT_SPTE_MMU_WRITABLE;
@@ -441,6 +449,13 @@ void kvm_mmu_reset_all_pte_masks(void)
shadow_nx_mask = PT64_NX_MASK;
shadow_x_mask = 0;
shadow_present_mask = PT_PRESENT_MASK;
+
+ /*
+ * For shadow paging and NPT, KVM uses PAT entry '0' to encode WB
+ * memtype in the SPTEs, i.e. relies on host MTRRs to provide the
+ * correct memtype (WB is the "weakest" memtype).
+ */
+ shadow_memtype_mask = 0;
shadow_acc_track_mask = 0;
shadow_me_mask = 0;
shadow_me_value = 0;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ba3dccb202bc..cabe3fbb4f39 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -147,6 +147,7 @@ extern u64 __read_mostly shadow_mmio_value;
extern u64 __read_mostly shadow_mmio_mask;
extern u64 __read_mostly shadow_mmio_access_mask;
extern u64 __read_mostly shadow_present_mask;
+extern u64 __read_mostly shadow_memtype_mask;
extern u64 __read_mostly shadow_me_value;
extern u64 __read_mostly shadow_me_mask;

--
2.37.0.170.g444d1eabd0-goog

2022-07-18 12:24:46

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: x86/mmu: Add shadow mask for effective host MTRR memtype

On Fri, 2022-07-15 at 23:00 +0000, Sean Christopherson wrote:
> Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask
> instead of relying on TDP being enabled, as NPT doesn't need a non-zero
> mask.  This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero
> for NPT anyways.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
>  arch/x86/kvm/mmu/spte.c | 21 ++++++++++++++++++---
>  arch/x86/kvm/mmu/spte.h |  1 +
>  2 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index fb1f17504138..7314d27d57a4 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -33,6 +33,7 @@ u64 __read_mostly shadow_mmio_value;
>  u64 __read_mostly shadow_mmio_mask;
>  u64 __read_mostly shadow_mmio_access_mask;
>  u64 __read_mostly shadow_present_mask;
> +u64 __read_mostly shadow_memtype_mask;
>  u64 __read_mostly shadow_me_value;
>  u64 __read_mostly shadow_me_mask;
>  u64 __read_mostly shadow_acc_track_mask;
> @@ -161,10 +162,10 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  
>         if (level > PG_LEVEL_4K)
>                 spte |= PT_PAGE_SIZE_MASK;
> -       if (tdp_enabled)
> +
> +       if (shadow_memtype_mask)
>                 spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
> -                       kvm_is_mmio_pfn(pfn));
> -
> +                                                        kvm_is_mmio_pfn(pfn));
>         if (host_writable)
>                 spte |= shadow_host_writable_mask;
>         else
> @@ -391,6 +392,13 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
>         shadow_nx_mask          = 0ull;
>         shadow_x_mask           = VMX_EPT_EXECUTABLE_MASK;
>         shadow_present_mask     = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
> +       /*
> +        * EPT overrides the host MTRRs, and so KVM must program the desired
> +        * memtype directly into the SPTEs.  Note, this mask is just the mask
> +        * of all bits that factor into the memtype, the actual memtype must be
> +        * dynamically calculated, e.g. to ensure host MMIO is mapped UC.
> +        */
> +       shadow_memtype_mask     = VMX_EPT_MT_MASK | VMX_EPT_IPAT_BIT;
>         shadow_acc_track_mask   = VMX_EPT_RWX_MASK;
>         shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
>         shadow_mmu_writable_mask  = EPT_SPTE_MMU_WRITABLE;
> @@ -441,6 +449,13 @@ void kvm_mmu_reset_all_pte_masks(void)
>         shadow_nx_mask          = PT64_NX_MASK;
>         shadow_x_mask           = 0;
>         shadow_present_mask     = PT_PRESENT_MASK;
> +
> +       /*
> +        * For shadow paging and NPT, KVM uses PAT entry '0' to encode WB
> +        * memtype in the SPTEs, i.e. relies on host MTRRs to provide the
> +        * correct memtype (WB is the "weakest" memtype).
> +        */
> +       shadow_memtype_mask     = 0;
>         shadow_acc_track_mask   = 0;
>         shadow_me_mask          = 0;
>         shadow_me_value         = 0;
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index ba3dccb202bc..cabe3fbb4f39 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -147,6 +147,7 @@ extern u64 __read_mostly shadow_mmio_value;
>  extern u64 __read_mostly shadow_mmio_mask;
>  extern u64 __read_mostly shadow_mmio_access_mask;
>  extern u64 __read_mostly shadow_present_mask;
> +extern u64 __read_mostly shadow_memtype_mask;
>  extern u64 __read_mostly shadow_me_value;
>  extern u64 __read_mostly shadow_me_mask;
>  


So if I understand correctly:


VMX:

- host MTRRs are ignored.

- all *host* MMIO ranges (which can only be VFIO's PCI BARs) are mapped UC in EPT,
but the guest can override this with its PAT to WC


- all regular memory is mapped WB + guest PAT ignored unless there is non-coherent DMA
(an older Intel IOMMU? I think current Intel IOMMUs are coherent?)


- In the case of non-coherent DMA, guest MTRRs and PAT are respected.



SVM:

- host MTRRs are respected, and can enforce UC on *host* MMIO areas.


- WB is always used in NPT, *always*, however NPT doesn't have the 'IPAT'
bit, so the guest is free to override it for its MMIO areas to any memory type as it wishes,
using its own PAT, and we do allow the guest to change IA32_PAT to any value it wishes.

(e.g VFIO's PCI bars, memory which a VFIO devices needs to access, etc)

(This reminds me that PAT is somewhat broken in regard to nesting, we ignore L2's PAT)


With all this said, it makes sense.


Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky

2022-07-18 16:23:05

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: x86/mmu: Add shadow mask for effective host MTRR memtype

On Mon, Jul 18, 2022, Maxim Levitsky wrote:
> On Fri, 2022-07-15 at 23:00 +0000, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index ba3dccb202bc..cabe3fbb4f39 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -147,6 +147,7 @@ extern u64 __read_mostly shadow_mmio_value;
> >  extern u64 __read_mostly shadow_mmio_mask;
> >  extern u64 __read_mostly shadow_mmio_access_mask;
> >  extern u64 __read_mostly shadow_present_mask;
> > +extern u64 __read_mostly shadow_memtype_mask;
> >  extern u64 __read_mostly shadow_me_value;
> >  extern u64 __read_mostly shadow_me_mask;
> > 
>
>
> So if I understand correctly:
>
>
> VMX:
>
> - host MTRRs are ignored.
>
> - all *host* MMIO ranges (which can only be VFIO's PCI BARs) are mapped UC in EPT,
> but the guest can override this with its PAT to WC
>
>
> - all regular memory is mapped WB + guest PAT ignored unless there is non-coherent DMA
> (an older Intel IOMMU? I think current Intel IOMMUs are coherent?)

Effectively, yes.

My understanding is that on x86, everything is cache-coherent by default, but devices
can set a no-snoop flag, which breaks cache coherency. But then the IOMMU, except for
old Intel IOMMUs, can block such packets, and VFIO forces the block setting in the IOMMU
when it's supported by hardware.

Note, at first glance, commit e8ae0e140c05 ("vfio: Require that devices support DMA
cache coherence") makes it seem like exposing non-coherent DMA to KVM is impossible,
but IIUC that's just enforcing that the _default_ device behavior provides coherency.
I.e. VFIO will still allow an old Intel IOMMU plus a device that sets no-snoop.

2022-07-19 18:57:15

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/4] KVM: x86/mmu: Memtype related cleanups

On 7/16/22 01:00, Sean Christopherson wrote:
> Minor cleanups for KVM's handling of the memtype that's shoved into SPTEs.
>
> Patch 1 enforces that entry '0' of the host's IA32_PAT is configured for WB
> memtype. KVM subtly relies on this behavior (silently shoves '0' into the
> SPTE PAT field). Check this at KVM load time so that if that doesn't hold
> true, KVM will refuse to load instead of running the guest with weird and
> potentially dangerous memtypes.
>
> Patch 2 is a pure code cleanup (ordered after patch 1 in case someone wants
> to backport the PAT check).
>
> Patch 3 adds a mask to track whether or not KVM may use a non-zero memtype
> value in SPTEs. Essentially, it's an "is EPT enabled" flag without being an
> explicit "is EPT enabled" flag. This avoids some minor work when not using
> EPT, e.g. technically KVM could drop the RET0 implementation that's used for
> SVM's get_mt_mask(), but IMO that's an unnecessary risk.
>
> Patch 4 modifies the TDP page fault path to restrict the mapping level
> based on guest MTRRs if and only if KVM might actually consume them. The
> guest MTRRs are purely software constructs (not directly consumed by
> hardware), and KVM only honors them when EPT is enabled (host MTRRs are
> overridden by EPT) and the guest has non-coherent DMA. I doubt this will
> move the needle on whether or not KVM can create huge pages, but it does
> save having to do MTRR lookups on every page fault for guests without
> a non-coherent DMA device attached.
>
> Sean Christopherson (4):
> KVM: x86: Reject loading KVM if host.PAT[0] != WB
> KVM: x86: Drop unnecessary goto+label in kvm_arch_init()
> KVM: x86/mmu: Add shadow mask for effective host MTRR memtype
> KVM: x86/mmu: Restrict mapping level based on guest MTRR iff they're
> used
>
> arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++-------
> arch/x86/kvm/mmu/spte.c | 21 ++++++++++++++++++---
> arch/x86/kvm/mmu/spte.h | 1 +
> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
> 4 files changed, 58 insertions(+), 23 deletions(-)
>
>
> base-commit: 8031d87aa9953ddeb047a5356ebd0b240c30f233

Queued, thanks.

Paolo