Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Reply-To: Sean Christopherson <seanjc@google.com>
Date:   Thu, 28 Apr 2022 23:34:16 +0000
Message-Id: <20220428233416.2446833-1-seanjc@google.com>
Mime-Version: 1.0
Subject: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
From:   Sean Christopherson <seanjc@google.com>
To:     Paolo Bonzini <pbonzini@redhat.com>
Cc:     Sean Christopherson <seanjc@google.com>,
        Vitaly Kuznetsov <vkuznets@redhat.com>,
        Wanpeng Li <wanpengli@tencent.com>,
        Jim Mattson <jmattson@google.com>,
        Joerg Roedel <joro@8bytes.org>, kvm@vger.kernel.org,
        linux-kernel@vger.kernel.org, Maxim Levitsky <mlevitsk@redhat.com>,
        Ben Gardon <bgardon@google.com>,
        David Matlack <dmatlack@google.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables).

On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.

Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.

A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.

Opportunistically make the now common kvm_mmu_max_gfn_host() inclusive
instead of exclusive.  The inclusive approach is somewhat silly as the
memslot and TDP MMU code want an exclusive value, but the name implies
the returned value is inclusive, and the MMIO path needs an inclusive
check.

  WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
   kvm_destroy_vm+0x162/0x2d0 [kvm]
   kvm_vm_release+0x1d/0x30 [kvm]
   __fput+0x82/0x240
   task_work_run+0x5b/0x90
   exit_to_user_mode_prepare+0xd2/0xe0
   syscall_exit_to_user_mode+0x1d/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>

Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---

Maxim, I didn't add you as Reported-by because I'm not confident this is
actually the cause of the bug you're hitting.  I wasn't able to reproduce
using your ipi_stress test, I found this horror via inspection and proved
it by hacking a selftest.

That said, I'm holding out hope that this does fix your issue.  Your
config doesn't specify host-phys-bits and IIUC you haven't been able to
repro on bare metal, which fits the kvm.MAXPHYADDR < cpu.MAXPHYADDR
scenario.  In theory if the test went rogue it could acccess a bad gfn and
cause KVM to insert an MMIO SPTE.  Fingers crossed :-)

If it does fix your case, a Reported-by for both you and Syzbot is
definitely in order.

 arch/x86/kvm/mmu.h         | 21 +++++++++++++++++++++
 arch/x86/kvm/mmu/mmu.c     | 10 ++++++++--
 arch/x86/kvm/mmu/spte.h    |  6 ------
 arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++-------
 arch/x86/kvm/x86.c         |  6 +++++-
 5 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 671cfeccf04e..d291ff3065dc 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -65,6 +65,27 @@ static __always_inline u64 rsvd_bits(int s, int e)
 	return ((2ULL << (e - s)) - 1) << s;
 }
 
+/*
+ * The number of non-reserved physical address bits irrespective of features
+ * that repurpose legal bits, e.g. MKTME.
+ */
+extern u8 __read_mostly shadow_phys_bits;
+
+static inline gfn_t kvm_mmu_max_gfn_host(void)
+{
+	/*
+	 * Disallow SPTEs (via memslots or cached MMIO) whose gfn would exceed
+	 * host.MAXPHYADDR.  Assuming KVM is running on bare metal, guest
+	 * accesses beyond host.MAXPHYADDR will hit a #PF(RSVD) and never hit
+	 * an EPT Violation/Misconfig / #NPF, and so KVM will never install a
+	 * SPTE for such addresses.  That doesn't hold true if KVM is running
+	 * as a VM itself, e.g. if the MAXPHYADDR KVM sees is less than
+	 * hardware's real MAXPHYADDR, but since KVM can't honor such behavior
+	 * on bare metal, disallow it entirely to simplify e.g. the TDP MMU.
+	 */
+	return (1ULL << (shadow_phys_bits - PAGE_SHIFT)) - 1;
+}
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 904f0faff218..c10ae25135e3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3010,9 +3010,15 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		/*
 		 * If MMIO caching is disabled, emulate immediately without
 		 * touching the shadow page tables as attempting to install an
-		 * MMIO SPTE will just be an expensive nop.
+		 * MMIO SPTE will just be an expensive nop.  Do not cache MMIO
+		 * whose gfn is greater than host.MAXPHYADDR, any guest that
+		 * generates such gfns is either malicious or in the weeds.
+		 * Note, it's possible to observe a gfn > host.MAXPHYADDR if
+		 * and only if host.MAXPHYADDR is inaccurate with respect to
+		 * hardware behavior, e.g. if KVM itself is running as a VM.
 		 */
-		if (unlikely(!enable_mmio_caching)) {
+		if (unlikely(!enable_mmio_caching) ||
+		    unlikely(fault->gfn > kvm_mmu_max_gfn_host())) {
 			*ret_val = RET_PF_EMULATE;
 			return true;
 		}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ad8ce3c5d083..43eceb821b31 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -203,12 +203,6 @@ static inline bool is_removed_spte(u64 spte)
  */
 extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
 
-/*
- * The number of non-reserved physical address bits irrespective of features
- * that repurpose legal bits, e.g. MKTME.
- */
-extern u8 __read_mostly shadow_phys_bits;
-
 static inline bool is_mmio_spte(u64 spte)
 {
 	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 566548a3efa7..a81994ad43de 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -815,14 +815,15 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	return iter->yielded;
 }
 
-static inline gfn_t tdp_mmu_max_gfn_host(void)
+static inline gfn_t tdp_mmu_max_exclusive_gfn_host(void)
 {
 	/*
-	 * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
-	 * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
-	 * and so KVM will never install a SPTE for such addresses.
+	 * Bound TDP MMU walks at host.MAXPHYADDR.  KVM disallows memslots with
+	 * a gpa range that would exceed the max gfn, and KVM does not create
+	 * MMIO SPTEs for "impossible" gfns, instead sending such accesses down
+	 * the slow emulation path every time.
 	 */
-	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+	return kvm_mmu_max_gfn_host() + 1;
 }
 
 static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
@@ -830,7 +831,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	gfn_t end = tdp_mmu_max_gfn_host();
+	gfn_t end = tdp_mmu_max_exclusive_gfn_host();
 	gfn_t start = 0;
 
 	for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
@@ -923,7 +924,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	end = min(end, tdp_mmu_max_gfn_host());
+	end = min(end, tdp_mmu_max_exclusive_gfn_host());
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 951d0a78ccda..26ea5999d72b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12078,8 +12078,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				   struct kvm_memory_slot *new,
 				   enum kvm_mr_change change)
 {
-	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
+	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
+		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn_host())
+			return -EINVAL;
+
 		return kvm_alloc_memslot_metadata(kvm, new);
+	}
 
 	if (change == KVM_MR_FLAGS_ONLY)
 		memcpy(&new->arch, &old->arch, sizeof(old->arch));

base-commit: 2a39d8b39bffdaf1a4223d0d22f07baee154c8f3
-- 
2.36.0.464.gb9c8b46e94-goog