Note, this applies on Yosry's stats series (there's a trivial-but-subtle
conflict in the TDP MMU shadow page accounting).
https://lore.kernel.org/all/[email protected]
Precisely track (via kvm_mmu_page) if a non-huge page is being forced
and use that info to avoid unnecessarily forcing smaller page sizes in
disallowed_hugepage_adjust().
KVM incorrectly assumes that the NX huge page mitigation is the only
scenario where KVM will create a non-leaf page instead of a huge page.
As a result, if the original source of huge page incompatibility goes
away, the NX mitigation is enabled, and KVM encounters a present shadow
page when attempting to install a huge page, KVM will force a smaller page
regardless of whether or not a smaller page is actually necessary to
satisfy the NX huge page mitigation.
Unnecessarily forcing small pages can result in degraded guest performance,
especially on larger VMs. The bug was originally discovered when testing
dirty log performance, as KVM would leave small pages lying around when
zapping collapsible SPTEs. That case was inadvertently fixed by commit
5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty
logging"), but other scenarios are still affected, e.g. KVM will not
rebuild a huge page if the mmu_notifier zaps a range of PTEs because the
primary MMU is creating a huge page.
v4:
- Collect reviews. [Mingwei]
- Add comment to document possible_nx_huge_pages. [Mingwei]
- Drop extra memory barriers. [Paolo]
- Document ordering provided by TDP SPTE helpers. [Paolo]
v3:
- https://lore.kernel.org/all/[email protected]
- Bug the VM if KVM attempts to double account a shadow page that
disallows an NX huge page. [David]
- Split the rename to separate patch. [Paolo]
- Rename more NX huge page variables/functions. [David]
- Combine and tweak the comments about enforcing the NX huge page
mitigation for non-paging MMUs. [Paolo, David]
- Call out that the shadow MMU holds mmu_lock for write and doesn't need
to manually handle memory ordering when accounting NX huge pages. [David]
- Add a smp_rmb() when unlinking shadow pages in the TDP MMU.
- Rename spte_to_sp() to spte_to_child_sp(). [David]
- Collect reviews. [David]
- Tweak the changelog for the final patch to call out that precise
accounting addresses real world performance bugs. [Paolo]
- Reword the changelog for the patch to (almost) always tag disallowed
NX huge pages, and call out that it doesn't fix the TDP MMU. [David]
v2: Rebase, tweak a changelog accordingly.
v1: https://lore.kernel.org/all/[email protected]
Mingwei Zhang (1):
KVM: x86/mmu: explicitly check nx_hugepage in
disallowed_hugepage_adjust()
Sean Christopherson (8):
KVM: x86/mmu: Bug the VM if KVM attempts to double count an NX huge
page
KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
MMUs
KVM: x86/mmu: Document implicit barriers/ordering in TDP MMU shared
mode
KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
SPTE
KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
pages
KVM: x86/mmu: Add helper to convert SPTE value to its shadow page
arch/x86/include/asm/kvm_host.h | 30 +++++----
arch/x86/kvm/mmu/mmu.c | 115 ++++++++++++++++++++------------
arch/x86/kvm/mmu/mmu_internal.h | 33 ++++-----
arch/x86/kvm/mmu/paging_tmpl.h | 6 +-
arch/x86/kvm/mmu/spte.c | 12 ++++
arch/x86/kvm/mmu/spte.h | 17 +++++
arch/x86/kvm/mmu/tdp_iter.h | 6 ++
arch/x86/kvm/mmu/tdp_mmu.c | 48 +++++++------
arch/x86/kvm/mmu/tdp_mmu.h | 2 +
9 files changed, 176 insertions(+), 93 deletions(-)
base-commit: 6f87cacaf4fe95288ac4cbe01a2fadc5d2b8b936
--
2.37.2.672.g94769d06f0-goog
Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
visible to other readers, i.e. before setting its SPTE. This will allow
KVM to query the flag when determining if a shadow page can be replaced
by an NX huge page without violating the rules of the mitigation.
Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible
for another CPU to see a shadow page without an up-to-date
nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated
dance.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: David Matlack <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++----------
arch/x86/kvm/mmu/mmu_internal.h | 5 ++---
arch/x86/kvm/mmu/tdp_mmu.c | 31 ++++++++++++++++++-------------
3 files changed, 38 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 04eb87f5a39d..de06c1f87635 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -802,22 +802,25 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
}
-void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- bool nx_huge_page_possible)
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
if (KVM_BUG_ON(!list_empty(&sp->possible_nx_huge_page_link), kvm))
return;
- sp->nx_huge_page_disallowed = true;
-
- if (!nx_huge_page_possible)
- return;
-
++kvm->stat.nx_lpage_splits;
list_add_tail(&sp->possible_nx_huge_page_link,
&kvm->arch.possible_nx_huge_pages);
}
+static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ bool nx_huge_page_possible)
+{
+ sp->nx_huge_page_disallowed = true;
+
+ if (nx_huge_page_possible)
+ track_possible_nx_huge_page(kvm, sp);
+}
+
static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
struct kvm_memslots *slots;
@@ -835,10 +838,8 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_mmu_gfn_allow_lpage(slot, gfn);
}
-void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- sp->nx_huge_page_disallowed = false;
-
if (list_empty(&sp->possible_nx_huge_page_link))
return;
@@ -846,6 +847,13 @@ void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
list_del_init(&sp->possible_nx_huge_page_link);
}
+static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ sp->nx_huge_page_disallowed = false;
+
+ untrack_possible_nx_huge_page(kvm, sp);
+}
+
static struct kvm_memory_slot *
gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
bool no_dirty_log)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 67879459a25c..22152241bd29 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -328,8 +328,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
-void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- bool nx_huge_page_possible);
-void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d1079fabe14c..fd38465aee9e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -403,8 +403,11 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
lockdep_assert_held_write(&kvm->mmu_lock);
list_del(&sp->link);
- if (sp->nx_huge_page_disallowed)
- unaccount_nx_huge_page(kvm, sp);
+
+ if (sp->nx_huge_page_disallowed) {
+ sp->nx_huge_page_disallowed = false;
+ untrack_possible_nx_huge_page(kvm, sp);
+ }
if (shared)
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1123,16 +1126,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
* @kvm: kvm instance
* @iter: a tdp_iter instance currently on the SPTE that should be set
* @sp: The new TDP page table to install.
- * @account_nx: True if this page table is being installed to split a
- * non-executable huge page.
* @shared: This operation is running under the MMU lock in read mode.
*
* Returns: 0 if the new page table was installed. Non-0 if the page table
* could not be installed (e.g. the atomic compare-exchange failed).
*/
static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
- struct kvm_mmu_page *sp, bool account_nx,
- bool shared)
+ struct kvm_mmu_page *sp, bool shared)
{
u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
int ret = 0;
@@ -1147,8 +1147,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
- if (account_nx)
- account_nx_huge_page(kvm, sp, true);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
tdp_account_mmu_page(kvm, sp);
@@ -1162,6 +1160,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret;
@@ -1198,9 +1197,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}
if (!is_shadow_present_pte(iter.old_spte)) {
- bool account_nx = fault->huge_page_disallowed &&
- fault->req_level >= iter.level;
-
/*
* If SPTE has been frozen by another thread, just
* give up and retry, avoiding unnecessary page table
@@ -1212,10 +1208,19 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
sp = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_child_sp(sp, &iter);
- if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
+ sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
+
+ if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
tdp_mmu_free_sp(sp);
break;
}
+
+ if (fault->huge_page_disallowed &&
+ fault->req_level >= iter.level) {
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ track_possible_nx_huge_page(kvm, sp);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ }
}
}
@@ -1503,7 +1508,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
* correctness standpoint since the translation will be the same either
* way.
*/
- ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
+ ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
if (ret)
goto out;
--
2.37.2.672.g94769d06f0-goog
Track the number of TDP MMU "shadow" pages instead of tracking the pages
themselves. With the NX huge page list manipulation moved out of the common
linking flow, eliminating the list-based tracking means the happy path of
adding a shadow page doesn't need to acquire a spinlock and can instead
inc/dec an atomic.
Keep the tracking, as the WARN during TDP MMU teardown on leaked shadow
pages is very, very useful for detecting KVM bugs.
Tracking the number of pages will also make it trivial to expose the
counter to userspace as a stat in the future, which may or may not be
desirable.
Note, the TDP MMU needs to use a separate counter (and stat if that ever
comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
TDP MMU consumes so few pages relative to shadow paging), and including TDP
MMU pages in that counter would break both the shrinker and shadow MMUs,
e.g. if a VM is using nested TDP.
Cc: Yan Zhao <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 11 +++--------
arch/x86/kvm/mmu/tdp_mmu.c | 20 +++++++++-----------
2 files changed, 12 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48e51600f1be..6c2113e6d19c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1282,6 +1282,9 @@ struct kvm_arch {
*/
bool tdp_mmu_enabled;
+ /* The number of TDP MMU pages across all roots. */
+ atomic64_t tdp_mmu_pages;
+
/*
* List of struct kvm_mmu_pages being used as roots.
* All struct kvm_mmu_pages in the list should have
@@ -1302,18 +1305,10 @@ struct kvm_arch {
*/
struct list_head tdp_mmu_roots;
- /*
- * List of struct kvmp_mmu_pages not being used as roots.
- * All struct kvm_mmu_pages in the list should have
- * tdp_mmu_page set and a tdp_mmu_root_count of 0.
- */
- struct list_head tdp_mmu_pages;
-
/*
* Protects accesses to the following fields when the MMU lock
* is held in read mode:
* - tdp_mmu_roots (above)
- * - tdp_mmu_pages (above)
* - the link field of struct kvm_mmu_pages used by the TDP MMU
* - possible_nx_huge_pages;
* - the possible_nx_huge_page_link field of struct kvm_mmu_pages used
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fd38465aee9e..92ad533f4f25 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,7 +29,6 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
kvm->arch.tdp_mmu_enabled = true;
INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
- INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
kvm->arch.tdp_mmu_zap_wq = wq;
return 1;
}
@@ -54,7 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
/* Also waits for any queued work items. */
destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
- WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
+ WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
/*
@@ -377,11 +376,13 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
kvm_account_pgtable_pages((void *)sp->spt, +1);
+ atomic64_inc(&kvm->arch.tdp_mmu_pages);
}
static void tdp_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
kvm_account_pgtable_pages((void *)sp->spt, -1);
+ atomic64_dec(&kvm->arch.tdp_mmu_pages);
}
/**
@@ -397,17 +398,17 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
bool shared)
{
tdp_unaccount_mmu_page(kvm, sp);
+
+ if (!sp->nx_huge_page_disallowed)
+ return;
+
if (shared)
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
else
lockdep_assert_held_write(&kvm->mmu_lock);
- list_del(&sp->link);
-
- if (sp->nx_huge_page_disallowed) {
- sp->nx_huge_page_disallowed = false;
- untrack_possible_nx_huge_page(kvm, sp);
- }
+ sp->nx_huge_page_disallowed = false;
+ untrack_possible_nx_huge_page(kvm, sp);
if (shared)
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1145,9 +1146,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
tdp_mmu_set_spte(kvm, iter, spte);
}
- spin_lock(&kvm->arch.tdp_mmu_pages_lock);
- list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
- spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
tdp_account_mmu_page(kvm, sp);
return 0;
--
2.37.2.672.g94769d06f0-goog
WARN and kill the VM if KVM attempts to double count an NX huge page,
i.e. attempts to re-tag a shadow page with "NX huge page disallowed".
KVM does NX huge page accounting only when linking a new shadow page, and
it should be impossible for a new shadow page to be already accounted.
E.g. even in the TDP MMU case, where vCPUs can race to install a new
shadow page, only the "winner" will account the installed page.
Kill the VM instead of continuing on as either KVM has an egregious bug,
e.g. didn't zero-initialize the data, or there's host data corruption, in
which case carrying on is dangerous, e.g. could cause silent data corruption
in the guest.
Reported-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 32b60a6b83bd..74afee3f2476 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -804,7 +804,7 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- if (sp->lpage_disallowed)
+ if (KVM_BUG_ON(sp->lpage_disallowed, kvm))
return;
++kvm->stat.nx_lpage_splits;
--
2.37.2.672.g94769d06f0-goog
On Tue, Aug 30, 2022, Sean Christopherson wrote:
> Note, this applies on Yosry's stats series (there's a trivial-but-subtle
> conflict in the TDP MMU shadow page accounting).
> https://lore.kernel.org/all/[email protected]
>
> Precisely track (via kvm_mmu_page) if a non-huge page is being forced
> and use that info to avoid unnecessarily forcing smaller page sizes in
> disallowed_hugepage_adjust().
>
> KVM incorrectly assumes that the NX huge page mitigation is the only
> scenario where KVM will create a non-leaf page instead of a huge page.
> As a result, if the original source of huge page incompatibility goes
> away, the NX mitigation is enabled, and KVM encounters a present shadow
> page when attempting to install a huge page, KVM will force a smaller page
> regardless of whether or not a smaller page is actually necessary to
> satisfy the NX huge page mitigation.
>
> Unnecessarily forcing small pages can result in degraded guest performance,
> especially on larger VMs. The bug was originally discovered when testing
> dirty log performance, as KVM would leave small pages lying around when
> zapping collapsible SPTEs. That case was inadvertently fixed by commit
> 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty
> logging"), but other scenarios are still affected, e.g. KVM will not
> rebuild a huge page if the mmu_notifier zaps a range of PTEs because the
> primary MMU is creating a huge page.
>
> v4:
> - Collect reviews. [Mingwei]
> - Add comment to document possible_nx_huge_pages. [Mingwei]
> - Drop extra memory barriers. [Paolo]
> - Document ordering provided by TDP SPTE helpers. [Paolo]
Hi Paolo and folks,
Just a gentle ping. Are we good on this version? It seems the ordering
concern in TDP MMU has been addressed.
Thanks.
-Mingwei
Sean Christopherson <[email protected]> writes:
> WARN and kill the VM if KVM attempts to double count an NX huge page,
> i.e. attempts to re-tag a shadow page with "NX huge page disallowed".
> KVM does NX huge page accounting only when linking a new shadow page, and
> it should be impossible for a new shadow page to be already accounted.
> E.g. even in the TDP MMU case, where vCPUs can race to install a new
> shadow page, only the "winner" will account the installed page.
>
> Kill the VM instead of continuing on as either KVM has an egregious bug,
> e.g. didn't zero-initialize the data, or there's host data corruption, in
> which case carrying on is dangerous, e.g. could cause silent data corruption
> in the guest.
>
> Reported-by: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 32b60a6b83bd..74afee3f2476 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -804,7 +804,7 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>
> void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - if (sp->lpage_disallowed)
> + if (KVM_BUG_ON(sp->lpage_disallowed, kvm))
> return;
>
> ++kvm->stat.nx_lpage_splits;
This patch (now in sean/for_paolo/6.1) causes nested Hyper-V guests to
break early in the boot sequence, but the fault is not related to the
Hyper-V enlightenments, e.g. even without them I see:
# ~/qemu/build/qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split -name guest=win10 -cpu host -smp 4 -m 16384 -drive file=/home/VMs/WinDev2202Eval.qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc :0 -rtc base=localtime,driftfix=slew --no-hpet -monitor stdio --no-reboot
QEMU 7.0.50 monitor - type 'help' for more information
(qemu)
error: kvm run failed Input/output error
EAX=00000020 EBX=0000ffff ECX=00000000 EDX=0000ffff
ESI=00000000 EDI=00002300 EBP=00000000 ESP=00006d8c
EIP=00000018 EFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =f000 000f0000 ffffffff 00809300
CS =cb00 000cb000 ffffffff 00809b00
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 00000000 00000000
IDT= 00000000 000003ff
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=0e 07 31 c0 b9 00 10 8d 3e 00 03 fc f3 ab 07 b8 20 00 e7 7e <cb> 0f 1f 80 00 00 00 00 6b 76 6d 20 61 50 69 43 20 00 00 00 2d 02 00 00 d9 02 00 00 00 03
KVM_GET_CLOCK failed: Input/output error
Aborted (core dumped)
(FWIW, KVM_GET_CLOCK is obviously unrelated here, KVM_BUG_ON'ed VMs are
just like that for all ioctls)
I can also see
[ 962.063025] WARNING: CPU: 2 PID: 20511 at arch/x86/kvm/mmu/mmu.c:808 account_huge_nx_page+0x2c/0xc0 [kvm]
[ 962.072654] Modules linked in: kvm_intel(E) kvm(E) qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif mlx5_ib ib_uverbs irqbypass acpi_ipmi ib_core rapl dcdbas ipmi_si mei_me intel_cstate i2c_i801 ipmi_devintf dell_smbios mei intel_uncore dell_wmi_descriptor wmi_bmof pcspkr i2c_smbus lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_core sd_mod t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 sg mgag200 drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci crct10dif_pclmul drm igb crc32_pclmul libata crc32c_intel mlxfw megaraid_sas ghash_clmulni_intel psample dca i2c_algo_bit pci_hyperv_intf wmi dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: kvm]
[ 962.148222] CPU: 2 PID: 20511 Comm: qemu-system-x86 Tainted: G I E 6.0.0-rc1+ #158
[ 962.157005] Hardware name: Dell Inc. PowerEdge R740/0WRPXK, BIOS 2.12.2 07/09/2021
[ 962.164572] RIP: 0010:account_huge_nx_page+0x2c/0xc0 [kvm]
[ 962.170101] Code: 44 00 00 41 56 48 8d 86 90 00 00 00 41 55 41 54 55 48 89 fd 53 4c 8b a6 90 00 00 00 49 39 c4 74 29 80 bf f4 9d 00 00 00 75 2b <0f> 0b b8 01 01 00 00 be 01 03 00 00 66 89 87 f4 9d 00 00 5b 5d 41
[ 962.188854] RSP: 0018:ffffbb2243e17b10 EFLAGS: 00010246
[ 962.194081] RAX: ffffa0b5d39c5790 RBX: 0000000000000600 RCX: ffffa0b5e610e018
[ 962.201212] RDX: 0000000000000001 RSI: ffffa0b5d39c5700 RDI: ffffbb2243de9000
[ 962.208346] RBP: ffffbb2243de9000 R08: 0000000000000001 R09: 0000000000000001
[ 962.215481] R10: ffffa0b4c0000000 R11: ffffa0b5d39c5700 R12: ffffbb2243df22d8
[ 962.222612] R13: ffffa0b5d3b22880 R14: 0000000000000002 R15: ffffa0b5c884b018
[ 962.229745] FS: 00007fdaf5177640(0000) GS:ffffa0b92fc40000(0000) knlGS:0000000000000000
[ 962.237830] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 962.243577] CR2: 0000000000000000 CR3: 000000014ef9e004 CR4: 00000000007726e0
[ 962.250710] PKRU: 55555554
[ 962.253422] Call Trace:
[ 962.255879] <TASK>
[ 962.257992] ept_fetch+0x504/0x5a0 [kvm]
[ 962.261959] ept_page_fault+0x2d7/0x300 [kvm]
[ 962.266362] ? kvm_mmu_slot_gfn_write_protect+0xb1/0xd0 [kvm]
[ 962.272150] ? kvm_slot_page_track_add_page+0x5b/0x90 [kvm]
[ 962.277766] ? kvm_mmu_alloc_shadow_page+0x33c/0x3c0 [kvm]
[ 962.283297] ? mmu_alloc_root+0x9d/0xf0 [kvm]
[ 962.287701] kvm_mmu_page_fault+0x258/0x290 [kvm]
[ 962.292451] vmx_handle_exit+0xe/0x40 [kvm_intel]
[ 962.297173] vcpu_enter_guest+0x665/0xfc0 [kvm]
[ 962.301741] ? vmx_check_nested_events+0x12d/0x2e0 [kvm_intel]
[ 962.307580] vcpu_run+0x33/0x250 [kvm]
[ 962.311367] kvm_arch_vcpu_ioctl_run+0xf7/0x460 [kvm]
[ 962.316456] kvm_vcpu_ioctl+0x271/0x670 [kvm]
[ 962.320843] __x64_sys_ioctl+0x87/0xc0
[ 962.324602] do_syscall_64+0x38/0x90
[ 962.328192] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 962.333252] RIP: 0033:0x7fdaf7d073fb
[ 962.336832] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd 29 0f 00 f7 d8 64 89 01 48
[ 962.355578] RSP: 002b:00007fdaf51767b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 962.363148] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fdaf7d073fb
[ 962.370286] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000c
[ 962.377417] RBP: 000055a84ef30900 R08: 000055a84d638be0 R09: 0000000000000000
[ 962.384550] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 962.391685] R13: 000055a84d65e0ce R14: 00007fdaf7c8d810 R15: 0000000000000000
[ 962.398818] </TASK>
[ 962.401009] ---[ end trace 0000000000000000 ]---
[ 1213.265975] ------------[ cut here ]------------
which can hopefully give a hint on where the real issue is ...
--
Vitaly
On Wed, Sep 21, 2022, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > WARN and kill the VM if KVM attempts to double count an NX huge page,
> > i.e. attempts to re-tag a shadow page with "NX huge page disallowed".
> > KVM does NX huge page accounting only when linking a new shadow page, and
> > it should be impossible for a new shadow page to be already accounted.
> > E.g. even in the TDP MMU case, where vCPUs can race to install a new
> > shadow page, only the "winner" will account the installed page.
> >
> > Kill the VM instead of continuing on as either KVM has an egregious bug,
> > e.g. didn't zero-initialize the data, or there's host data corruption, in
> > which case carrying on is dangerous, e.g. could cause silent data corruption
> > in the guest.
> >
> > Reported-by: David Matlack <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Reviewed-by: Mingwei Zhang <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 32b60a6b83bd..74afee3f2476 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -804,7 +804,7 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> >
> > void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > - if (sp->lpage_disallowed)
> > + if (KVM_BUG_ON(sp->lpage_disallowed, kvm))
> > return;
> >
> > ++kvm->stat.nx_lpage_splits;
>
> This patch (now in sean/for_paolo/6.1) causes nested Hyper-V guests to
> break early in the boot sequence, but the fault is not related to the
> Hyper-V enlightenments, e.g. even without them I see:
...
> [ 962.257992] ept_fetch+0x504/0x5a0 [kvm]
> [ 962.261959] ept_page_fault+0x2d7/0x300 [kvm]
> [ 962.287701] kvm_mmu_page_fault+0x258/0x290 [kvm]
> [ 962.292451] vmx_handle_exit+0xe/0x40 [kvm_intel]
> [ 962.297173] vcpu_enter_guest+0x665/0xfc0 [kvm]
> [ 962.307580] vcpu_run+0x33/0x250 [kvm]
> [ 962.311367] kvm_arch_vcpu_ioctl_run+0xf7/0x460 [kvm]
> [ 962.316456] kvm_vcpu_ioctl+0x271/0x670 [kvm]
> [ 962.320843] __x64_sys_ioctl+0x87/0xc0
> [ 962.324602] do_syscall_64+0x38/0x90
> [ 962.328192] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Ugh, past me completely forgot the basics of shadow paging[*]. The shadow MMU
can reuse existing shadow pages, whereas the TDP MMU always links in new pages.
I got turned around by the "doesn't exist" check, which only means "is there
already a _SPTE_ here", not "is there an existing SP for the target gfn+role that
can be used".
I'll drop the series from the queue, send a new pull request, and spin a v5
targeting 6.2, which amusingly will look a lot like v1...
Thanks for catching this!
[*] https://lore.kernel.org/all/Yt8uwMt%[email protected]
On Wed, Sep 21, 2022, Sean Christopherson wrote:
> On Wed, Sep 21, 2022, Vitaly Kuznetsov wrote:
> > [ 962.257992] ept_fetch+0x504/0x5a0 [kvm]
> > [ 962.261959] ept_page_fault+0x2d7/0x300 [kvm]
> > [ 962.287701] kvm_mmu_page_fault+0x258/0x290 [kvm]
> > [ 962.292451] vmx_handle_exit+0xe/0x40 [kvm_intel]
> > [ 962.297173] vcpu_enter_guest+0x665/0xfc0 [kvm]
> > [ 962.307580] vcpu_run+0x33/0x250 [kvm]
> > [ 962.311367] kvm_arch_vcpu_ioctl_run+0xf7/0x460 [kvm]
> > [ 962.316456] kvm_vcpu_ioctl+0x271/0x670 [kvm]
> > [ 962.320843] __x64_sys_ioctl+0x87/0xc0
> > [ 962.324602] do_syscall_64+0x38/0x90
> > [ 962.328192] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> Ugh, past me completely forgot the basics of shadow paging[*]. The shadow MMU
> can reuse existing shadow pages, whereas the TDP MMU always links in new pages.
>
> I got turned around by the "doesn't exist" check, which only means "is there
> already a _SPTE_ here", not "is there an existing SP for the target gfn+role that
> can be used".
>
> I'll drop the series from the queue, send a new pull request, and spin a v5
> targeting 6.2, which amusingly will look a lot like v1...
Huh. I was expecting more churn, but dropping the offending patch and then
"reworking" the series yields a very trivial overall diff.
Vitaly, can you easily re-test with the below, i.e. simply delete the KVM_BUG_ON()?
I'll still spin a v5, but assuming all is well I think this can go into 6.1 and
not get pushed out to 6.2.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 54ee48a87f81..e6f19e605979 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -804,7 +804,15 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- if (KVM_BUG_ON(!list_empty(&sp->possible_nx_huge_page_link), kvm))
+ /*
+ * If it's possible to replace the shadow page with an NX huge page,
+ * i.e. if the shadow page is the only thing currently preventing KVM
+ * from using a huge page, add the shadow page to the list of "to be
+ * zapped for NX recovery" pages. Note, the shadow page can already be
+ * on the list if KVM is reusing an existing shadow page, i.e. if KVM
+ * links a shadow page at multiple points.
+ */
+ if (!list_empty(&sp->possible_nx_huge_page_link))
return;
++kvm->stat.nx_lpage_splits;
Sean Christopherson <[email protected]> writes:
> On Wed, Sep 21, 2022, Sean Christopherson wrote:
>> On Wed, Sep 21, 2022, Vitaly Kuznetsov wrote:
>> > [ 962.257992] ept_fetch+0x504/0x5a0 [kvm]
>> > [ 962.261959] ept_page_fault+0x2d7/0x300 [kvm]
>> > [ 962.287701] kvm_mmu_page_fault+0x258/0x290 [kvm]
>> > [ 962.292451] vmx_handle_exit+0xe/0x40 [kvm_intel]
>> > [ 962.297173] vcpu_enter_guest+0x665/0xfc0 [kvm]
>> > [ 962.307580] vcpu_run+0x33/0x250 [kvm]
>> > [ 962.311367] kvm_arch_vcpu_ioctl_run+0xf7/0x460 [kvm]
>> > [ 962.316456] kvm_vcpu_ioctl+0x271/0x670 [kvm]
>> > [ 962.320843] __x64_sys_ioctl+0x87/0xc0
>> > [ 962.324602] do_syscall_64+0x38/0x90
>> > [ 962.328192] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>
>> Ugh, past me completely forgot the basics of shadow paging[*]. The shadow MMU
>> can reuse existing shadow pages, whereas the TDP MMU always links in new pages.
>>
>> I got turned around by the "doesn't exist" check, which only means "is there
>> already a _SPTE_ here", not "is there an existing SP for the target gfn+role that
>> can be used".
>>
>> I'll drop the series from the queue, send a new pull request, and spin a v5
>> targeting 6.2, which amusingly will look a lot like v1...
>
> Huh. I was expecting more churn, but dropping the offending patch and then
> "reworking" the series yields a very trivial overall diff.
>
> Vitaly, can you easily re-test with the below, i.e. simply delete the
> KVM_BUG_ON()?
This seems to work! At least, I haven't noticed anything weird when
booting my beloved Win11 + WSL2 guest.
--
Vitaly
On Wed, Sep 21, 2022, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > On Wed, Sep 21, 2022, Sean Christopherson wrote:
> >> On Wed, Sep 21, 2022, Vitaly Kuznetsov wrote:
> >> > [ 962.257992] ept_fetch+0x504/0x5a0 [kvm]
> >> > [ 962.261959] ept_page_fault+0x2d7/0x300 [kvm]
> >> > [ 962.287701] kvm_mmu_page_fault+0x258/0x290 [kvm]
> >> > [ 962.292451] vmx_handle_exit+0xe/0x40 [kvm_intel]
> >> > [ 962.297173] vcpu_enter_guest+0x665/0xfc0 [kvm]
> >> > [ 962.307580] vcpu_run+0x33/0x250 [kvm]
> >> > [ 962.311367] kvm_arch_vcpu_ioctl_run+0xf7/0x460 [kvm]
> >> > [ 962.316456] kvm_vcpu_ioctl+0x271/0x670 [kvm]
> >> > [ 962.320843] __x64_sys_ioctl+0x87/0xc0
> >> > [ 962.324602] do_syscall_64+0x38/0x90
> >> > [ 962.328192] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >>
> >> Ugh, past me completely forgot the basics of shadow paging[*]. The shadow MMU
> >> can reuse existing shadow pages, whereas the TDP MMU always links in new pages.
> >>
> >> I got turned around by the "doesn't exist" check, which only means "is there
> >> already a _SPTE_ here", not "is there an existing SP for the target gfn+role that
> >> can be used".
> >>
> >> I'll drop the series from the queue, send a new pull request, and spin a v5
> >> targeting 6.2, which amusingly will look a lot like v1...
> >
> > Huh. I was expecting more churn, but dropping the offending patch and then
> > "reworking" the series yields a very trivial overall diff.
> >
> > Vitaly, can you easily re-test with the below, i.e. simply delete the
> > KVM_BUG_ON()?
>
> This seems to work! At least, I haven't noticed anything weird when
> booting my beloved Win11 + WSL2 guest.
I finally figured out why I didn't see this in testing. It _should_ have fired
during kernel boot when testing legacy shadow paging, i.e. ept=0, as the bug requires
nothing more than executing from two GVAs pointing at the same huge 2MB GPA.
I did test ept=0, but all of my normal test systems aren't susceptible to L1TF
(KVM guest, all AMD, and ICX), i.e. don't enable the mitigation by default. I
also tested those systems with the mitigation forced on and ept=0, but never
booted a VM with that combination, and neither KUT nor selftests does the requisite
aliasing with huge pages.
Death was instantaneous once I forced the mitigation on with ept=0 and booted a VM.
*sigh*