This series refines the mmu zaps caused by EPT memory type updates when
guest MTRRs are honored.
The first 5 patches revolve around helper functions that check whether the
KVM TDP honors guest MTRRs, so that TDP zaps and page fault max_level
reduction are applied only to TDPs that honor guest MTRRs (a rough sketch
of these helpers follows below).
- The 5th patch triggers zapping of TDP leaf entries when the non-coherent
DMA device count transitions from 0 to 1 or from 1 to 0.
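The sketch below is only illustrative: the exact definition is in patch 1,
and the conditions shown (TDP enabled plus non-coherent DMA devices
present) are an assumption inferred from how the later patches use the
helpers.
static inline bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm,
						bool vm_has_noncoherent_dma)
{
	/*
	 * Guest MTRRs only matter to KVM's TDP when EPT/TDP is in use and
	 * the VM has non-coherent DMA devices; otherwise KVM does not
	 * derive TDP memtypes from guest MTRRs at all.
	 */
	return vm_has_noncoherent_dma && tdp_enabled;
}
static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
{
	return __kvm_mmu_honors_guest_mtrrs(kvm,
					    kvm_arch_has_noncoherent_dma(kvm));
}
The two-argument variant exists so that patch 5 can pass the non-coherent
DMA state explicitly while the device count is transitioning.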
The last 6 patches are fixes and optimizations for the mmu zaps that
happen when guest MTRRs are honored. Those mmu zaps are usually triggered
from all vCPUs in bursts on all GFN ranges, intending to remove stale
memtypes of TDP entries.
- The 6th patch moves the TDP zaps to the points where CR0.CD toggles and
where guest MTRRs are updated under CR0.CD=0.
- The 7th-8th patches refine KVM_X86_QUIRK_CD_NW_CLEARED by removing the
IPAT bit from the EPT memtype when CR0.CD=1 and guest MTRRs are honored.
- The 9th-11th patches optimize the mmu zaps performed when guest MTRRs
are honored by serializing vCPUs' gfn zap requests and by calculating
precise, fine-grained ranges to zap.
They are put in mtrr.c because the optimizations apply only when guest
MTRRs are honored and because they require reading guest MTRRs to compute
the fine-grained ranges.
Calls to kvm_unmap_gfn_range() are not included in the optimization,
because they are not triggered from all vCPUs in bursts and not all of
them are blockable. They usually happen at memslot removal and thus do not
affect the mmu zaps done when guest MTRRs are honored. Also, current
performance data shows no observable difference in mmu zap performance
whether the kvm_unmap_gfn_range() calls triggered by auto NUMA balancing
are turned on or off.
Reference performance data for the last 6 patches is shown below:
Base: base code before patch 6
C6-8: includes base code + patches 6 + 7 + 8
      patch 6: move TDP zaps from guest MTRRs update to CR0.CD toggling
      patch 7: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
      patch 8: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code
C9:   includes C6-8 + patch 9
      patch 9: serialize vCPUs to zap gfn when guest MTRRs are honored
C10:  includes C9 + patch 10
      patch 10: fine-grained gfn zap when guest MTRRs are honored
C11:  includes C10 + patch 11
      patch 11: split a single gfn zap range when guest MTRRs are honored
vCPUs cnt: 8, guest memory: 16G
Physical CPU frequency: 3100 MHz
     |             OVMF             |           Seabios            |
     | EPT zap cycles | EPT zap cnt | EPT zap cycles | EPT zap cnt |
Base |       3444.97M |          84 |         61.29M |          50 |
C6-8 |       4343.68M |          74 |        503.04M |          42 |*
C9   |        261.45M |          74 |        106.64M |          42 |
C10  |        157.42M |          74 |         71.04M |          42 |
C11  |         33.95M |          74 |         24.04M |          42 |
* With C6-8, the EPT zap count is reduced because some MTRR updates happen
  under CR0.CD=1 and thus no longer trigger zaps after patch 6.
  EPT zap cycles increase a bit (especially in the Seabios case) because
  concurrency is more intense when CR0.CD toggles than when guest MTRRs
  are updated.
  (Patches 7/8 have negligible performance impact.)
Changelog:
v2 --> v3:
1. Updated patch 1 about definition of honor guest MTRRs helper. (Sean)
2. Added patch 2 to use honor guest MTRRs helper in kvm_tdp_page_fault().
(Sean)
3. Remove unnecessary calculation of MTRR ranges.
(Chao Gao, Kai Huang, Sean)
4. Updated patches 3-5 to use the helper. (Chao Gao, Kai Huang, Sean)
5. Added patches 6,7 to reposition TDP zap and drop IPAT bit. (Sean)
6. Added patch 8 to prepare for patch 10's memtype calculation when
CR0.CD=1.
7. Added patches 9-11 to speed up MTRR updates / CR0.CD toggles when guest
MTRRs are honored. (Sean)
8. Dropped per-VM based MTRRs in v2 (Sean)
v1 --> v2:
1. Added a helper to skip non EPT case in patch 1
2. Added patch 2 to skip mmu zap when guest CR0.CD changes if EPT is not
enabled. (Chao Gao)
3. Added patch 3 to skip mmu zap when guest MTRR changes if EPT is not
enabled.
4. Do not mention TDX in patch 4 as the code is not merged yet (Chao Gao)
5. Added patches 5-6 to reduce EPT zap during guest bootup.
v2:
https://lore.kernel.org/all/[email protected]/
v1:
https://lore.kernel.org/all/[email protected]/
Yan Zhao (11):
KVM: x86/mmu: helpers to return if KVM honors guest MTRRs
KVM: x86/mmu: Use KVM honors guest MTRRs helper in
kvm_tdp_page_fault()
KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles
KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr
KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops
KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling
KVM: VMX: drop IPAT in memtype when CD=1 for
KVM_X86_QUIRK_CD_NW_CLEARED
KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common
code
KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
KVM: x86/mmu: fine-grained gfn zap when guest MTRRs are honored
KVM: x86/mmu: split a single gfn zap range when guest MTRRs are
honored
arch/x86/include/asm/kvm_host.h | 4 +
arch/x86/kvm/mmu.h | 7 +
arch/x86/kvm/mmu/mmu.c | 18 +-
arch/x86/kvm/mtrr.c | 286 +++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/x86.c | 25 ++-
arch/x86/kvm/x86.h | 2 +
7 files changed, 333 insertions(+), 20 deletions(-)
base-commit: 24ff4c08e5bbdd7399d45f940f10fed030dfadda
--
2.17.1
For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore-PAT (IPAT) bit from the
EPT memory type when the guest cache is disabled and non-coherent DMA
devices are present.
Before this patch, with the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT is
returned as the EPT memory type when the guest cache is disabled.
Removing the IPAT bit allows the effective memory type to honor guest PAT
values as well, which can make the effective memory type stronger than WB,
as WB is the weakest memtype. However, this change is acceptable as it
doesn't make the memory type weaker; consider that without this quirk, the
original memtype for cache disabled is UC + IPAT.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0ecf4be2c6af..c1e93678cea4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7548,8 +7548,6 @@ static int vmx_vm_init(struct kvm *kvm)
static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
- u8 cache;
-
/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
* memory aliases with conflicting memory types and sometimes MCEs.
* We have to be careful as to what are honored and when.
@@ -7576,11 +7574,10 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
- cache = MTRR_TYPE_WRBACK;
+ return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
else
- cache = MTRR_TYPE_UNCACHABLE;
-
- return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+ return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
+ VMX_EPT_IPAT_BIT;
}
return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
--
2.17.1
Call the helper to check whether guest MTRRs are honored by the KVM MMU
before doing the range calculation and zapping.
Guest MTRRs only affect TDP memtypes when TDP honors guest MTRRs; there's
no point in doing the calculation and zapping otherwise.
Suggested-by: Chao Gao <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Cc: Kai Huang <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mtrr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 3eb6e7f47e96..a67c28a56417 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -320,7 +320,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
gfn_t start, end;
- if (!tdp_enabled || !kvm_arch_has_noncoherent_dma(vcpu->kvm))
+ if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
return;
if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
--
2.17.1
Split a single gfn zap range (specifically the range [0, ~0UL)) into
smaller ranges according to the current memslot layout when guest MTRRs
are honored.
Though vCPUs have been serialized to perform kvm_zap_gfn_range() for MTRR
updates and CR0.CD toggles, the rescheduling cost caused by contention is
still huge when there are concurrent page faults taking read locks on
kvm->mmu_lock.
Splitting a single huge zap range according to the actual memslot layout
reduces unnecessary traversal and scheduling cost in the TDP MMU.
It also increases the chances for larger ranges to find existing ranges to
zap in the zap list.
Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mtrr.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index e2a097822a62..b83abd14ccb1 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -917,21 +917,40 @@ static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
gfn_t gfn_start, gfn_t gfn_end)
{
+ int idx = srcu_read_lock(&vcpu->kvm->srcu);
+ const struct kvm_memory_slot *memslot;
struct mtrr_zap_range *range;
+ struct kvm_memslot_iter iter;
+ struct kvm_memslots *slots;
+ gfn_t start, end;
+ int i;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(vcpu->kvm, i);
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
+ memslot = iter.slot;
+ start = max(gfn_start, memslot->base_gfn);
+ end = min(gfn_end, memslot->base_gfn + memslot->npages);
+ if (WARN_ON_ONCE(start >= end))
+ continue;
- range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
- if (!range)
- goto fail;
+ range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
+ if (!range)
+ goto fail;
- range->start = gfn_start;
- range->end = gfn_end;
+ range->start = start;
+ range->end = end;
- kvm_add_mtrr_zap_list(vcpu->kvm, range);
+ kvm_add_mtrr_zap_list(vcpu->kvm, range);
+ }
+ }
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
return;
fail:
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
kvm_clear_mtrr_zap_list(vcpu->kvm);
kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
}
--
2.17.1
Serialize concurrent and repeated calls of kvm_zap_gfn_range() from every
vCPU for CR0.CD toggles and MTRR updates when guest MTRRs are honored.
During guest boot-up, if guest MTRRs are honored by TDP, TDP zaps are
triggered several times by each vCPU for CR0.CD toggles and MTRR updates.
This takes unexpectedly long CPU cycles because of contention on
kvm->mmu_lock.
Therefore, introduce an mtrr_zap_list to remove duplicated zaps and an
atomic mtrr_zapping to allow only one vCPU to do the real zap work at a
time.
Suggested-by: Sean Christopherson <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +
arch/x86/kvm/mtrr.c | 141 +++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 2 +-
arch/x86/kvm/x86.h | 1 +
4 files changed, 146 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 28bd38303d70..8da1517a1513 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1444,6 +1444,10 @@ struct kvm_arch {
*/
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+
+ struct list_head mtrr_zap_list;
+ spinlock_t mtrr_zap_list_lock;
+ atomic_t mtrr_zapping;
};
struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index b35dd0bc9cad..688748e3a4d2 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -25,6 +25,8 @@
#define IA32_MTRR_DEF_TYPE_FE (1ULL << 10)
#define IA32_MTRR_DEF_TYPE_TYPE_MASK (0xff)
+static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
+ gfn_t gfn_start, gfn_t gfn_end);
static bool is_mtrr_base_msr(unsigned int msr)
{
/* MTRR base MSRs use even numbers, masks use odd numbers. */
@@ -341,7 +343,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
}
- kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
+ kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(start), gpa_to_gfn(end));
}
static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
@@ -437,6 +439,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
{
INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
+
+ if (vcpu->vcpu_id == 0) {
+ spin_lock_init(&vcpu->kvm->arch.mtrr_zap_list_lock);
+ INIT_LIST_HEAD(&vcpu->kvm->arch.mtrr_zap_list);
+ }
}
struct mtrr_iter {
@@ -740,3 +747,135 @@ void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
}
}
EXPORT_SYMBOL_GPL(kvm_mtrr_get_cd_memory_type);
+
+struct mtrr_zap_range {
+ gfn_t start;
+ /* end is exclusive */
+ gfn_t end;
+ struct list_head node;
+};
+
+static void kvm_clear_mtrr_zap_list(struct kvm *kvm)
+{
+ struct list_head *head = &kvm->arch.mtrr_zap_list;
+ struct mtrr_zap_range *tmp, *n;
+
+ spin_lock(&kvm->arch.mtrr_zap_list_lock);
+ list_for_each_entry_safe(tmp, n, head, node) {
+ list_del(&tmp->node);
+ kfree(tmp);
+ }
+ spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+}
+
+/*
+ * Add @range into kvm->arch.mtrr_zap_list and sort the list in
+ * "length" ascending + "start" descending order, so that
+ * ranges consuming more zap cycles can be dequeued later and their
+ * chances of being found duplicated are increased.
+ */
+static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
+{
+ struct list_head *head = &kvm->arch.mtrr_zap_list;
+ u64 len = range->end - range->start;
+ struct mtrr_zap_range *cur, *n;
+ bool added = false;
+
+ spin_lock(&kvm->arch.mtrr_zap_list_lock);
+
+ if (list_empty(head)) {
+ list_add(&range->node, head);
+ spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+ return;
+ }
+
+ list_for_each_entry_safe(cur, n, head, node) {
+ u64 cur_len = cur->end - cur->start;
+
+ if (len < cur_len)
+ break;
+
+ if (len > cur_len)
+ continue;
+
+ if (range->start > cur->start)
+ break;
+
+ if (range->start < cur->start)
+ continue;
+
+ /* equal len & start, no need to add */
+ added = true;
+ kfree(range);
+ break;
+ }
+
+ if (!added)
+ list_add_tail(&range->node, &cur->node);
+
+ spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+}
+
+static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
+{
+ struct list_head *head = &kvm->arch.mtrr_zap_list;
+ struct mtrr_zap_range *cur = NULL;
+
+ spin_lock(&kvm->arch.mtrr_zap_list_lock);
+
+ while (!list_empty(head)) {
+ u64 start, end;
+
+ cur = list_first_entry(head, typeof(*cur), node);
+ start = cur->start;
+ end = cur->end;
+ list_del(&cur->node);
+ kfree(cur);
+ spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+
+ kvm_zap_gfn_range(kvm, start, end);
+
+ spin_lock(&kvm->arch.mtrr_zap_list_lock);
+ }
+
+ spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+}
+
+static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
+{
+ if (atomic_cmpxchg_acquire(&kvm->arch.mtrr_zapping, 0, 1) == 0) {
+ kvm_zap_mtrr_zap_list(kvm);
+ atomic_set_release(&kvm->arch.mtrr_zapping, 0);
+ return;
+ }
+
+ while (atomic_read(&kvm->arch.mtrr_zapping))
+ cpu_relax();
+}
+
+static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
+ gfn_t gfn_start, gfn_t gfn_end)
+{
+ struct mtrr_zap_range *range;
+
+ range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
+ if (!range)
+ goto fail;
+
+ range->start = gfn_start;
+ range->end = gfn_end;
+
+ kvm_add_mtrr_zap_list(vcpu->kvm, range);
+
+ kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
+ return;
+
+fail:
+ kvm_clear_mtrr_zap_list(vcpu->kvm);
+ kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
+}
+
+void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu)
+{
+ return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 32cc8bfaa5f1..74aac14a3c0b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -943,7 +943,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
- kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+ kvm_zap_gfn_range_on_cd_toggle(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9781b4b32d68..be946aba2bf0 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -314,6 +314,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
int page_num);
void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat);
+void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu);
bool kvm_vector_hashing_enabled(void);
void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type,
--
2.17.1
Find out the guest MTRR ranges of non-default type and zap only those
ranges when guest MTRRs are enabled and the MTRR default type equals the
CR0.CD=1 memtype.
This allows fine-grained and faster gfn zaps because of precise and
shorter zap ranges, and increases the chances for concurrent vCPUs to find
existing ranges to zap in the zap list.
Incidentally, fix a typo in the original comment.
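For reference, kvm_mtrr_get_cd_memory_type() is introduced by patch 8; a
rough sketch of its behavior (an assumption that it simply mirrors the
CR0.CD=1 handling of vmx_get_mt_mask() after patch 7) is:
void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
{
	/*
	 * With the quirk, CR0.CD=1 uses WB without IPAT (patch 7), so guest
	 * PAT is still honored; without the quirk, it is UC with IPAT.
	 */
	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
		*type = MTRR_TYPE_WRBACK;
		*ipat = false;
	} else {
		*type = MTRR_TYPE_UNCACHABLE;
		*ipat = true;
	}
}
The ipat output is what lets kvm_zap_gfn_range_on_cd_toggle() below decide
whether a full zap is unavoidable.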
Suggested-by: Sean Christopherson <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mtrr.c | 108 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 106 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 688748e3a4d2..e2a097822a62 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -179,7 +179,7 @@ static struct fixed_mtrr_segment fixed_seg_table[] = {
{
.start = 0xc0000,
.end = 0x100000,
- .range_shift = 12, /* 12K */
+ .range_shift = 12, /* 4K */
.range_start = 24,
}
};
@@ -816,6 +816,67 @@ static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
spin_unlock(&kvm->arch.mtrr_zap_list_lock);
}
+/*
+ * Fixed ranges are only 256 pages in total.
+ * After balancing between reducing overhead of zap multiple ranges
+ * and increasing chances of finding duplicated ranges,
+ * just add fixed mtrr ranges as a whole to the mtrr zap list
+ * if memory type of one of them is not the specified type.
+ */
+static int prepare_zaplist_fixed_mtrr_of_non_type(struct kvm_vcpu *vcpu, u8 type)
+{
+ struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
+ struct mtrr_zap_range *range;
+ int index, seg_end;
+ u8 mem_type;
+
+ for (index = 0; index < KVM_NR_FIXED_MTRR_REGION; index++) {
+ mem_type = mtrr_state->fixed_ranges[index];
+
+ if (mem_type == type)
+ continue;
+
+ range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
+ if (!range)
+ return -ENOMEM;
+
+ seg_end = ARRAY_SIZE(fixed_seg_table) - 1;
+ range->start = gpa_to_gfn(fixed_seg_table[0].start);
+ range->end = gpa_to_gfn(fixed_seg_table[seg_end].end);
+ kvm_add_mtrr_zap_list(vcpu->kvm, range);
+ break;
+ }
+ return 0;
+}
+
+/*
+ * Add var mtrr ranges to the mtrr zap list
+ * if its memory type does not equal to type
+ */
+static int prepare_zaplist_var_mtrr_of_non_type(struct kvm_vcpu *vcpu, u8 type)
+{
+ struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
+ struct mtrr_zap_range *range;
+ struct kvm_mtrr_range *tmp;
+ u8 mem_type;
+
+ list_for_each_entry(tmp, &mtrr_state->head, node) {
+ mem_type = tmp->base & 0xff;
+ if (mem_type == type)
+ continue;
+
+ range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
+ if (!range)
+ return -ENOMEM;
+
+ var_mtrr_range(tmp, &range->start, &range->end);
+ range->start = gpa_to_gfn(range->start);
+ range->end = gpa_to_gfn(range->end);
+ kvm_add_mtrr_zap_list(vcpu->kvm, range);
+ }
+ return 0;
+}
+
static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
{
struct list_head *head = &kvm->arch.mtrr_zap_list;
@@ -875,7 +936,50 @@ static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
}
+/*
+ * Zap GFN ranges when CR0.CD toggles between 0 and 1.
+ * With noncoherent DMA present,
+ * when CR0.CD=1, TDP memtype is WB or UC + IPAT;
+ * when CR0.CD=0, TDP memtype is determined by guest MTRR.
+ * Therefore, if the cache disabled memtype is different from default memtype
+ * in guest MTRR, everything is zapped;
+ * if the cache disabled memtype is equal to default memtype in guest MTRR,
+ * only MTRR ranges of non-default-memtype are required to be zapped.
+ */
void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu)
{
- return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
+ struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
+ bool mtrr_enabled = mtrr_is_enabled(mtrr_state);
+ u8 default_type;
+ u8 cd_type;
+ bool ipat;
+
+ kvm_mtrr_get_cd_memory_type(vcpu, &cd_type, &ipat);
+
+ default_type = mtrr_enabled ? mtrr_default_type(mtrr_state) :
+ mtrr_disabled_type(vcpu);
+
+ if (cd_type != default_type || ipat)
+ return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
+
+ /*
+ * If mtrr is not enabled, it will go to zap all above if the default
+ * type does not equal to cd_type;
+ * Or it has no need to zap if the default type equals to cd_type.
+ */
+ if (mtrr_enabled) {
+ if (prepare_zaplist_fixed_mtrr_of_non_type(vcpu, default_type))
+ goto fail;
+
+ if (prepare_zaplist_var_mtrr_of_non_type(vcpu, default_type))
+ goto fail;
+
+ kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
+ }
+ return;
+fail:
+ kvm_clear_mtrr_zap_list(vcpu->kvm);
+ /* resort to zapping all on failure*/
+ kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+ return;
}
--
2.17.1
Zap KVM TDP when noncoherent DMA assignment starts (noncoherent DMA count
transitions from 0 to 1) or stops (noncoherent DMA count transitions from
1 to 0). Before the zap, test if guest MTRRs are to be honored after the
assignment starts or were honored before the assignment stops.
When there's no noncoherent DMA device, the EPT memory type is
((MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT).
When there are noncoherent DMA devices, the EPT memory type needs to honor
guest CR0.CD and MTRR settings.
So, if the noncoherent DMA count transitions between 0 and 1, EPT leaf
entries need to be zapped to clear the stale memory type.
This issue might be hidden when the device is statically assigned, as VFIO
adds/removes the MMIO regions of the noncoherent DMA device several times
during guest boot, and the current KVM MMU calls kvm_mmu_zap_all_fast() on
each memslot removal.
But if the device is hot-plugged, or if the guest has mmio_always_on for
the device, its MMIO regions may be added only once, and then there's no
path to zap the EPT entries and clear the stale memory type.
Therefore, do the EPT zapping when noncoherent DMA assignment starts/stops
to ensure stale entries are cleaned away.
Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/x86.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6693daeb5686..ac9548efa76f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13164,15 +13164,31 @@ bool noinstr kvm_arch_has_assigned_device(struct kvm *kvm)
}
EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
+static void kvm_noncoherent_dma_assignment_start_or_stop(struct kvm *kvm)
+{
+ /*
+ * Non-coherent DMA assignment and de-assignment will affect
+ * whether KVM honors guest MTRRs and cause changes in memtypes
+ * in TDP.
+ * So, specify the second parameter as true here to indicate
+ * non-coherent DMAs are/were involved and TDP zap might be
+ * necessary.
+ */
+ if (__kvm_mmu_honors_guest_mtrrs(kvm, true))
+ kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
+}
+
void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
{
- atomic_inc(&kvm->arch.noncoherent_dma_count);
+ if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1)
+ kvm_noncoherent_dma_assignment_start_or_stop(kvm);
}
EXPORT_SYMBOL_GPL(kvm_arch_register_noncoherent_dma);
void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
{
- atomic_dec(&kvm->arch.noncoherent_dma_count);
+ if (!atomic_dec_return(&kvm->arch.noncoherent_dma_count))
+ kvm_noncoherent_dma_assignment_start_or_stop(kvm);
}
EXPORT_SYMBOL_GPL(kvm_arch_unregister_noncoherent_dma);
--
2.17.1
Call the helper to check whether guest MTRRs are honored by the KVM MMU
before zapping, as the value of guest CR0.CD only affects KVM TDP memory
types when guest MTRRs are honored.
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/x86.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9e7186864542..6693daeb5686 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -942,7 +942,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
kvm_mmu_reset_context(vcpu);
if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
- kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
+ kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
}
--
2.17.1
On Fri, Jun 16, 2023 at 10:39:45AM +0800, Yan Zhao wrote:
> Serialize concurrent and repeated calls of kvm_zap_gfn_range() from every
> vCPU for CR0.CD toggles and MTRR updates when guest MTRRs are honored.
>
> During guest boot-up, if guest MTRRs are honored by TDP, TDP zaps are
> triggered several times by each vCPU for CR0.CD toggles and MTRRs updates.
> This will take unexpected longer CPU cycles because of the contention of
> kvm->mmu_lock.
>
> Therefore, introduce a mtrr_zap_list to remove duplicated zap and an atomic
> mtrr_zapping to allow only one vCPU to do the real zap work at one time.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yan Zhao <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 +
> arch/x86/kvm/mtrr.c | 141 +++++++++++++++++++++++++++++++-
> arch/x86/kvm/x86.c | 2 +-
> arch/x86/kvm/x86.h | 1 +
> 4 files changed, 146 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 28bd38303d70..8da1517a1513 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1444,6 +1444,10 @@ struct kvm_arch {
> */
> #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> struct kvm_mmu_memory_cache split_desc_cache;
> +
> + struct list_head mtrr_zap_list;
> + spinlock_t mtrr_zap_list_lock;
> + atomic_t mtrr_zapping;
> };
>
> struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index b35dd0bc9cad..688748e3a4d2 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -25,6 +25,8 @@
> #define IA32_MTRR_DEF_TYPE_FE (1ULL << 10)
> #define IA32_MTRR_DEF_TYPE_TYPE_MASK (0xff)
>
> +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> + gfn_t gfn_start, gfn_t gfn_end);
> static bool is_mtrr_base_msr(unsigned int msr)
> {
> /* MTRR base MSRs use even numbers, masks use odd numbers. */
> @@ -341,7 +343,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
> }
>
> - kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> + kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(start), gpa_to_gfn(end));
> }
>
> static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> @@ -437,6 +439,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
> {
> INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
> +
> + if (vcpu->vcpu_id == 0) {
> + spin_lock_init(&vcpu->kvm->arch.mtrr_zap_list_lock);
> + INIT_LIST_HEAD(&vcpu->kvm->arch.mtrr_zap_list);
> + }
> }
>
> struct mtrr_iter {
> @@ -740,3 +747,135 @@ void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
> }
> }
> EXPORT_SYMBOL_GPL(kvm_mtrr_get_cd_memory_type);
> +
> +struct mtrr_zap_range {
> + gfn_t start;
> + /* end is exclusive */
> + gfn_t end;
> + struct list_head node;
> +};
> +
> +static void kvm_clear_mtrr_zap_list(struct kvm *kvm)
> +{
> + struct list_head *head = &kvm->arch.mtrr_zap_list;
> + struct mtrr_zap_range *tmp, *n;
> +
> + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> + list_for_each_entry_safe(tmp, n, head, node) {
> + list_del(&tmp->node);
> + kfree(tmp);
> + }
> + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> +}
> +
> +/*
> + * Add @range into kvm->arch.mtrr_zap_list and sort the list in
> + * "length" ascending + "start" descending order, so that
> + * ranges consuming more zap cycles can be dequeued later and their
> + * chances of being found duplicated are increased.
> + */
> +static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
> +{
> + struct list_head *head = &kvm->arch.mtrr_zap_list;
> + u64 len = range->end - range->start;
> + struct mtrr_zap_range *cur, *n;
> + bool added = false;
> +
> + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> +
> + if (list_empty(head)) {
> + list_add(&range->node, head);
> + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> + return;
> + }
> +
> + list_for_each_entry_safe(cur, n, head, node) {
> + u64 cur_len = cur->end - cur->start;
> +
> + if (len < cur_len)
> + break;
> +
> + if (len > cur_len)
> + continue;
> +
> + if (range->start > cur->start)
> + break;
> +
> + if (range->start < cur->start)
> + continue;
> +
> + /* equal len & start, no need to add */
> + added = true;
Possible/worth to ignore the range already covered
by queued range ?
> + kfree(range);
> + break;
> + }
> +
> + if (!added)
> + list_add_tail(&range->node, &cur->node);
> +
> + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> +}
> +
> +static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
> +{
> + struct list_head *head = &kvm->arch.mtrr_zap_list;
> + struct mtrr_zap_range *cur = NULL;
> +
> + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> +
> + while (!list_empty(head)) {
> + u64 start, end;
> +
> + cur = list_first_entry(head, typeof(*cur), node);
> + start = cur->start;
> + end = cur->end;
> + list_del(&cur->node);
> + kfree(cur);
> + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> +
> + kvm_zap_gfn_range(kvm, start, end);
> +
> + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> + }
> +
> + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> +}
> +
> +static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
> +{
> + if (atomic_cmpxchg_acquire(&kvm->arch.mtrr_zapping, 0, 1) == 0) {
> + kvm_zap_mtrr_zap_list(kvm);
> + atomic_set_release(&kvm->arch.mtrr_zapping, 0);
> + return;
> + }
> +
> + while (atomic_read(&kvm->arch.mtrr_zapping))
> + cpu_relax();
> +}
> +
> +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> + gfn_t gfn_start, gfn_t gfn_end)
> +{
> + struct mtrr_zap_range *range;
> +
> + range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
> + if (!range)
> + goto fail;
> +
> + range->start = gfn_start;
> + range->end = gfn_end;
> +
> + kvm_add_mtrr_zap_list(vcpu->kvm, range);
> +
> + kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
> + return;
> +
> +fail:
> + kvm_clear_mtrr_zap_list(vcpu->kvm);
A very small chance race condition that incorrectly
clear the queued ranges which have not been zapped by another thread ?
Like below:
Thread A | Thread B
kvm_add_mtrr_zap_list() |
| kvm_clear_mtrr_zap_list()
kvm_zap_or_wait_mtrr_zap_list() |
Call kvm_clear_mtrr_zap_list() here looks unnecessary, other
threads(B here) who put thing in the queue will take care them well.
> + kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
> +}
> +
> +void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu)
> +{
> + return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
> +}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 32cc8bfaa5f1..74aac14a3c0b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -943,7 +943,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>
> if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> - kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> + kvm_zap_gfn_range_on_cd_toggle(vcpu);
> }
> EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
>
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 9781b4b32d68..be946aba2bf0 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -314,6 +314,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
> bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
> int page_num);
> void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat);
> +void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu);
> bool kvm_vector_hashing_enabled(void);
> void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
> int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type,
> --
> 2.17.1
>
On Fri, Jun 16, 2023 at 03:45:50PM +0800, Yuan Yao wrote:
> > +/*
> > + * Add @range into kvm->arch.mtrr_zap_list and sort the list in
> > + * "length" ascending + "start" descending order, so that
> > + * ranges consuming more zap cycles can be dequeued later and their
> > + * chances of being found duplicated are increased.
> > + */
> > +static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
> > +{
> > + struct list_head *head = &kvm->arch.mtrr_zap_list;
> > + u64 len = range->end - range->start;
> > + struct mtrr_zap_range *cur, *n;
> > + bool added = false;
> > +
> > + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > +
> > + if (list_empty(head)) {
> > + list_add(&range->node, head);
> > + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > + return;
> > + }
> > +
> > + list_for_each_entry_safe(cur, n, head, node) {
> > + u64 cur_len = cur->end - cur->start;
> > +
> > + if (len < cur_len)
> > + break;
> > +
> > + if (len > cur_len)
> > + continue;
> > +
> > + if (range->start > cur->start)
> > + break;
> > +
> > + if (range->start < cur->start)
> > + continue;
> > +
> > + /* equal len & start, no need to add */
> > + added = true;
>
> Possible/worth to ignore the range already covered
> by queued range ?
I may not get you correctly, but
the "added" here means an queued range with exactly same start + len
found, so free and drop adding the new range here.
>
> > + kfree(range);
> > + break;
> > + }
> > +
> > + if (!added)
> > + list_add_tail(&range->node, &cur->node);
> > +
> > + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > +}
> > +
> > +static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
> > +{
> > + struct list_head *head = &kvm->arch.mtrr_zap_list;
> > + struct mtrr_zap_range *cur = NULL;
> > +
> > + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > +
> > + while (!list_empty(head)) {
> > + u64 start, end;
> > +
> > + cur = list_first_entry(head, typeof(*cur), node);
> > + start = cur->start;
> > + end = cur->end;
> > + list_del(&cur->node);
> > + kfree(cur);
> > + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > +
> > + kvm_zap_gfn_range(kvm, start, end);
> > +
> > + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > + }
> > +
> > + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > +}
> > +
> > +static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
> > +{
> > + if (atomic_cmpxchg_acquire(&kvm->arch.mtrr_zapping, 0, 1) == 0) {
> > + kvm_zap_mtrr_zap_list(kvm);
> > + atomic_set_release(&kvm->arch.mtrr_zapping, 0);
> > + return;
> > + }
> > +
> > + while (atomic_read(&kvm->arch.mtrr_zapping))
> > + cpu_relax();
> > +}
> > +
> > +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> > + gfn_t gfn_start, gfn_t gfn_end)
> > +{
> > + struct mtrr_zap_range *range;
> > +
> > + range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
> > + if (!range)
> > + goto fail;
> > +
> > + range->start = gfn_start;
> > + range->end = gfn_end;
> > +
> > + kvm_add_mtrr_zap_list(vcpu->kvm, range);
> > +
> > + kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
> > + return;
> > +
> > +fail:
> > + kvm_clear_mtrr_zap_list(vcpu->kvm);
> A very small chance race condition that incorrectly
> clear the queued ranges which have not been zapped by another thread ?
> Like below:
>
> Thread A | Thread B
> kvm_add_mtrr_zap_list() |
> | kvm_clear_mtrr_zap_list()
> kvm_zap_or_wait_mtrr_zap_list() |
>
> Call kvm_clear_mtrr_zap_list() here looks unnecessary, other
> threads(B here) who put thing in the queue will take care them well.
> > + kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
Yes, if gfn_start and gfn_end here are not 0 and ~0ULL, the
kvm_clear_mtrr_zap_list() is not necessary.
Though in reality, they are always 0-~0ULL, I agree dropping the
kvm_clear_mtrr_zap_list() here is better.
Thanks!
On Fri, Jun 16, 2023 at 03:37:29PM +0800, Yan Zhao wrote:
> On Fri, Jun 16, 2023 at 03:45:50PM +0800, Yuan Yao wrote:
> > > +/*
> > > + * Add @range into kvm->arch.mtrr_zap_list and sort the list in
> > > + * "length" ascending + "start" descending order, so that
> > > + * ranges consuming more zap cycles can be dequeued later and their
> > > + * chances of being found duplicated are increased.
> > > + */
> > > +static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
> > > +{
> > > + struct list_head *head = &kvm->arch.mtrr_zap_list;
> > > + u64 len = range->end - range->start;
> > > + struct mtrr_zap_range *cur, *n;
> > > + bool added = false;
> > > +
> > > + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > > +
> > > + if (list_empty(head)) {
> > > + list_add(&range->node, head);
> > > + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > + return;
> > > + }
> > > +
> > > + list_for_each_entry_safe(cur, n, head, node) {
> > > + u64 cur_len = cur->end - cur->start;
> > > +
> > > + if (len < cur_len)
> > > + break;
> > > +
> > > + if (len > cur_len)
> > > + continue;
> > > +
> > > + if (range->start > cur->start)
> > > + break;
> > > +
> > > + if (range->start < cur->start)
> > > + continue;
> > > +
> > > + /* equal len & start, no need to add */
> > > + added = true;
> >
> > Possible/worth to ignore the range already covered
> > by queued range ?
>
> I may not get you correctly, but
> the "added" here means an queued range with exactly same start + len
> found, so free and drop adding the new range here.
I mean drop adding three B below if A already in the queue:
|------A--------|
|----B----|
|------A--------|
|----B----|
|------A--------|
|----B----|
>
> >
> > > + kfree(range);
> > > + break;
> > > + }
> > > +
> > > + if (!added)
> > > + list_add_tail(&range->node, &cur->node);
> > > +
> > > + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > +}
> > > +
> > > +static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
> > > +{
> > > + struct list_head *head = &kvm->arch.mtrr_zap_list;
> > > + struct mtrr_zap_range *cur = NULL;
> > > +
> > > + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > > +
> > > + while (!list_empty(head)) {
> > > + u64 start, end;
> > > +
> > > + cur = list_first_entry(head, typeof(*cur), node);
> > > + start = cur->start;
> > > + end = cur->end;
> > > + list_del(&cur->node);
> > > + kfree(cur);
> > > + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > +
> > > + kvm_zap_gfn_range(kvm, start, end);
> > > +
> > > + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > > + }
> > > +
> > > + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > +}
> > > +
> > > +static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
> > > +{
> > > + if (atomic_cmpxchg_acquire(&kvm->arch.mtrr_zapping, 0, 1) == 0) {
> > > + kvm_zap_mtrr_zap_list(kvm);
> > > + atomic_set_release(&kvm->arch.mtrr_zapping, 0);
> > > + return;
> > > + }
> > > +
> > > + while (atomic_read(&kvm->arch.mtrr_zapping))
> > > + cpu_relax();
> > > +}
> > > +
> > > +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> > > + gfn_t gfn_start, gfn_t gfn_end)
> > > +{
> > > + struct mtrr_zap_range *range;
> > > +
> > > + range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
> > > + if (!range)
> > > + goto fail;
> > > +
> > > + range->start = gfn_start;
> > > + range->end = gfn_end;
> > > +
> > > + kvm_add_mtrr_zap_list(vcpu->kvm, range);
> > > +
> > > + kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
> > > + return;
> > > +
> > > +fail:
> > > + kvm_clear_mtrr_zap_list(vcpu->kvm);
> > A very small chance race condition that incorrectly
> > clear the queued ranges which have not been zapped by another thread ?
> > Like below:
> >
> > Thread A | Thread B
> > kvm_add_mtrr_zap_list() |
> > | kvm_clear_mtrr_zap_list()
> > kvm_zap_or_wait_mtrr_zap_list() |
> >
> > Call kvm_clear_mtrr_zap_list() here looks unnecessary, other
> > threads(B here) who put thing in the queue will take care them well.
>
> > > + kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
>
> Yes, if gfn_start and gfn_end here are not 0 and ~0ULL, the
> kvm_clear_mtrr_zap_list() is not necessary.
> Though in reality, they are always 0-~0ULL, I agree dropping the
> kvm_clear_mtrr_zap_list() here is better.
>
> Thanks!
On Fri, Jun 16, 2023 at 04:09:17PM +0800, Yuan Yao wrote:
> On Fri, Jun 16, 2023 at 03:37:29PM +0800, Yan Zhao wrote:
> > On Fri, Jun 16, 2023 at 03:45:50PM +0800, Yuan Yao wrote:
> > > > +/*
> > > > + * Add @range into kvm->arch.mtrr_zap_list and sort the list in
> > > > + * "length" ascending + "start" descending order, so that
> > > > + * ranges consuming more zap cycles can be dequeued later and their
> > > > + * chances of being found duplicated are increased.
> > > > + */
> > > > +static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
> > > > +{
> > > > + struct list_head *head = &kvm->arch.mtrr_zap_list;
> > > > + u64 len = range->end - range->start;
> > > > + struct mtrr_zap_range *cur, *n;
> > > > + bool added = false;
> > > > +
> > > > + spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > > > +
> > > > + if (list_empty(head)) {
> > > > + list_add(&range->node, head);
> > > > + spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > > + return;
> > > > + }
> > > > +
> > > > + list_for_each_entry_safe(cur, n, head, node) {
> > > > + u64 cur_len = cur->end - cur->start;
> > > > +
> > > > + if (len < cur_len)
> > > > + break;
> > > > +
> > > > + if (len > cur_len)
> > > > + continue;
> > > > +
> > > > + if (range->start > cur->start)
> > > > + break;
> > > > +
> > > > + if (range->start < cur->start)
> > > > + continue;
> > > > +
> > > > + /* equal len & start, no need to add */
> > > > + added = true;
> > >
> > > Possible/worth to ignore the range already covered
> > > by queued range ?
> >
> > I may not get you correctly, but
> > the "added" here means an queued range with exactly same start + len
> > found, so free and drop adding the new range here.
>
> I mean drop adding three B below if A already in the queue:
>
> |------A--------|
> |----B----|
>
> |------A--------|
> |----B----|
>
> |------A--------|
> |----B----|
>
Oh, I implemented it this way in my first version.
But it complicates the logic and increases the time spent holding the
spinlock.
And since, in the zaps caused by MTRR updates and CR0.CD toggles, the
queued ranges are usually duplicated across different vCPUs and
non-overlapping within one vCPU, I finally turned to this simpler
implementation.
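Just for illustration, the kind of coverage check discussed above would
look roughly like the snippet below (a hypothetical sketch, not code from
this series or its v1), i.e. an extra scan of the queued ranges before
inserting:
	/*
	 * Hypothetical: drop @range if an already-queued range fully
	 * covers it, at the cost of more work under the spinlock.
	 */
	list_for_each_entry(cur, head, node) {
		if (cur->start <= range->start && range->end <= cur->end) {
			kfree(range);
			spin_unlock(&kvm->arch.mtrr_zap_list_lock);
			return;
		}
	}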
On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
>For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
>types when cache is disabled and non-coherent DMA are present.
>
>With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
>EPT memory type when guest cache is disabled before this patch.
>Removing the IPAT bit in this patch will allow effective memory type to
>honor PAT values as well, which will make the effective memory type
Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
which guests can benefit from this change?
>stronger than WB as WB is the weakest memtype. However, this change is
>acceptable as it doesn't make the memory type weaker,
>consider without
>this quirk, the original memtype for cache disabled is UC + IPAT.
This change impacts only the case where the quirk is enabled. Maybe you
mean:
_with_ the quirk, the original memtype for cached disabled is _WB_ + IPAT.
>
>Suggested-by: Sean Christopherson <[email protected]>
>Signed-off-by: Yan Zhao <[email protected]>
>---
> arch/x86/kvm/vmx/vmx.c | 9 +++------
> 1 file changed, 3 insertions(+), 6 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 0ecf4be2c6af..c1e93678cea4 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -7548,8 +7548,6 @@ static int vmx_vm_init(struct kvm *kvm)
>
> static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> {
>- u8 cache;
>-
> /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> * memory aliases with conflicting memory types and sometimes MCEs.
> * We have to be careful as to what are honored and when.
>@@ -7576,11 +7574,10 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>
> if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>- cache = MTRR_TYPE_WRBACK;
>+ return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
Shouldn't KVM honor guest MTRRs as well? i.e., as if CR0.CD isn't set.
> else
>- cache = MTRR_TYPE_UNCACHABLE;
>-
>- return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>+ return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
>+ VMX_EPT_IPAT_BIT;
> }
>
> return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
>--
>2.17.1
>
On Tue, Jun 20, 2023 at 11:34:43AM +0800, Chao Gao wrote:
> On Tue, Jun 20, 2023 at 10:34:29AM +0800, Yan Zhao wrote:
> >On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> >> On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> >> >For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> >> >types when cache is disabled and non-coherent DMA are present.
> >> >
> >> >With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> >> >EPT memory type when guest cache is disabled before this patch.
> >> >Removing the IPAT bit in this patch will allow effective memory type to
> >> >honor PAT values as well, which will make the effective memory type
> >>
> >> Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> >> which guests can benefit from this change?
> >This patch is actually a preparation for later patch 10 to implement
> >fine-grained zap.
> >If when CR0.CD=1 the EPT type is WB + IPAT, and
> >when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> >without IPAT, then we have to always zap all EPT entries.
>
> OK. The goal is to reduce the cost of toggling CR0.CD. The key is if KVM sets
> the IPAT, then when CR0.CD is cleared by guest, KVM has to zap _all_ EPT entries
> at least to clear IPAT.
>
> Can kvm honor guest MTRRs as well when CR0.CD=1 && with the quirk? then later
> clearing CR0.CD needn't zap _any_ EPT entry. But the optimization is exactly the
> one removed in patch 6. Maybe I miss something important.
I replied to it in another mail.
On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> >For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> >types when cache is disabled and non-coherent DMA are present.
> >
> >With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> >EPT memory type when guest cache is disabled before this patch.
> >Removing the IPAT bit in this patch will allow effective memory type to
> >honor PAT values as well, which will make the effective memory type
>
> Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> which guests can benefit from this change?
This patch is actually a preparation for later patch 10 to implement
fine-grained zap.
If when CR0.CD=1 the EPT type is WB + IPAT, and
when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
without IPAT, then we have to always zap all EPT entries.
Given removing the IPAT bit when CR0.CD=1 only makes the quirk
KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
the guest PAT overwrites it), it's still acceptable.
On the other hand, from the guest's point of view, it sees the GPA is UC
with CR0.CD=1. So, if we want to align host EPT memtype with guest's
view, we have to drop the quirk KVM_X86_QUIRK_CD_NW_CLEARED, which,
however, will impact the guest bootup performance dramatically and is
not adopted in any VM.
So, we remove the IPAT bit when CR0.CD=1 with the quirk
KVM_X86_QUIRK_CD_NW_CLEARED to make it stricter and to enable later
optimization.
>
> >stronger than WB as WB is the weakest memtype. However, this change is
> >acceptable as it doesn't make the memory type weaker,
>
> >consider without
> >this quirk, the original memtype for cache disabled is UC + IPAT.
>
> This change impacts only the case where the quirk is enabled. Maybe you
> mean:
>
> _with_ the quirk, the original memtype for cached disabled is _WB_ + IPAT.
>
Uh, I mean originally without the quirk, UC + IPAT should be returned,
which is the correct value to return.
I can explain the full background more clearly in the next version.
Thanks
>
> >
> >Suggested-by: Sean Christopherson <[email protected]>
> >Signed-off-by: Yan Zhao <[email protected]>
> >---
> > arch/x86/kvm/vmx/vmx.c | 9 +++------
> > 1 file changed, 3 insertions(+), 6 deletions(-)
> >
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index 0ecf4be2c6af..c1e93678cea4 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -7548,8 +7548,6 @@ static int vmx_vm_init(struct kvm *kvm)
> >
> > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > {
> >- u8 cache;
> >-
> > /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> > * memory aliases with conflicting memory types and sometimes MCEs.
> > * We have to be careful as to what are honored and when.
> >@@ -7576,11 +7574,10 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >
> > if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> > if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> >- cache = MTRR_TYPE_WRBACK;
> >+ return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
>
> Shouldn't KVM honor guest MTRRs as well? i.e., as if CR0.CD isn't set.
>
> > else
> >- cache = MTRR_TYPE_UNCACHABLE;
> >-
> >- return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >+ return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
> >+ VMX_EPT_IPAT_BIT;
> > }
> >
> > return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> >--
> >2.17.1
> >
On Tue, Jun 20, 2023 at 10:34:29AM +0800, Yan Zhao wrote:
>On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
>> On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
>> >For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
>> >types when cache is disabled and non-coherent DMA are present.
>> >
>> >With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
>> >EPT memory type when guest cache is disabled before this patch.
>> >Removing the IPAT bit in this patch will allow effective memory type to
>> >honor PAT values as well, which will make the effective memory type
>>
>> Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
>> which guests can benefit from this change?
>This patch is actually a preparation for later patch 10 to implement
>fine-grained zap.
>If when CR0.CD=1 the EPT type is WB + IPAT, and
>when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
>without IPAT, then we have to always zap all EPT entries.
OK. The goal is to reduce the cost of toggling CR0.CD. The key is if KVM sets
the IPAT, then when CR0.CD is cleared by guest, KVM has to zap _all_ EPT entries
at least to clear IPAT.
Can kvm honor guest MTRRs as well when CR0.CD=1 && with the quirk? then later
clearing CR0.CD needn't zap _any_ EPT entry. But the optimization is exactly the
one removed in patch 6. Maybe I miss something important.
On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> >
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index 0ecf4be2c6af..c1e93678cea4 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -7548,8 +7548,6 @@ static int vmx_vm_init(struct kvm *kvm)
> >
> > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > {
> >- u8 cache;
> >-
> > /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> > * memory aliases with conflicting memory types and sometimes MCEs.
> > * We have to be careful as to what are honored and when.
> >@@ -7576,11 +7574,10 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >
> > if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> > if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> >- cache = MTRR_TYPE_WRBACK;
> >+ return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
>
> Shouldn't KVM honor guest MTRRs as well? i.e., as if CR0.CD isn't set.
I proposed returning the guest MTRR type when CR0.CD=1.
https://lore.kernel.org/all/[email protected]/
Sean rejected it because before guest MTRRs are enabled for the first
time, we have to return WB with the quirk, otherwise VM bootup will be
super slow.
Or maybe we can return WB when guest MTRRs are not enabled and the guest
MTRR type when guest MTRRs are enabled to work around it, e.g.
if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
{
cache = !mtrr_is_enabled(mtrr_state) ? MTRR_TYPE_WRBACK :
kvm_mtrr_get_guest_memory_type(vcpu, gfn);
return cache << VMX_EPT_MT_EPTE_SHIFT;
}
...
}
But Sean does not like piling too much on top of the quirk and I think it is right
to keep the quirk simple.
"In short, I am steadfastly against any change that piles more arbitrary behavior
functional behavior on top of the quirk, especially when that behavior relies on
heuristics to try and guess what the guest is doing."
>
> > else
> >- cache = MTRR_TYPE_UNCACHABLE;
> >-
> >- return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >+ return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
> >+ VMX_EPT_IPAT_BIT;
> > }
> >
> > return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> >--
> >2.17.1
> >
On 6/20/2023 10:34 AM, Yan Zhao wrote:
> On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
>> On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
>>> For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
>>> types when cache is disabled and non-coherent DMA are present.
>>>
>>> With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
>>> EPT memory type when guest cache is disabled before this patch.
>>> Removing the IPAT bit in this patch will allow effective memory type to
>>> honor PAT values as well, which will make the effective memory type
>> Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
>> which guests can benefit from this change?
> This patch is actually a preparation for later patch 10 to implement
> fine-grained zap.
> If when CR0.CD=1 the EPT type is WB + IPAT, and
> when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> without IPAT, then we have to always zap all EPT entries.
>
> Given removing the IPAT bit when CR0.CD=1 only makes the quirk
> KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
> the guest PAT overwrites it), it's still acceptable.
Per my understanding, the reason why KVM had KVM_X86_QUIRK_CD_NW_CLEARED
is to ensure the memory type is WB to achieve better boot performance
for old OVMF.
you need to justify the original purpose is not broken by this patch.
On Sun, Jun 25, 2023 at 03:14:37PM +0800, Xiaoyao Li wrote:
> On 6/20/2023 10:34 AM, Yan Zhao wrote:
> > On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> > > On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> > > > For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> > > > types when cache is disabled and non-coherent DMA are present.
> > > >
> > > > With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> > > > EPT memory type when guest cache is disabled before this patch.
> > > > Removing the IPAT bit in this patch will allow effective memory type to
> > > > honor PAT values as well, which will make the effective memory type
> > > Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> > > which guests can benefit from this change?
> > This patch is actually a preparation for later patch 10 to implement
> > fine-grained zap.
> > If when CR0.CD=1 the EPT type is WB + IPAT, and
> > when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> > without IPAT, then we have to always zap all EPT entries.
> >
> > Given removing the IPAT bit when CR0.CD=1 only makes the quirk
> > KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
> > the guest PAT overwrites it), it's still acceptable.
>
> Per my understanding, the reason why KVM had KVM_X86_QUIRK_CD_NW_CLEARED is
> to ensure the memory type is WB to achieve better boot performance for old
> OVMF.
It works well for OVMF commit c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7,
which is from Apr 14 2015.
> you need to justify the original purpose is not broken by this patch.
Hmm, to dig into the history, the reason for this quirk is explained below:
commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
Author: Xiao Guangrong <[email protected]>
Date: Thu Jul 16 03:25:56 2015 +0800
KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED
OVMF depends on WB to boot fast, because it only clears caches after
it has set up MTRRs---which is too late.
Let's do writeback if CR0.CD is set to make it happy, similar to what
SVM is already doing.
which means WB is only a must for fast boot before OVMF has set up MTRRs.
During that period, PAT defaults to WB.
After OVMF has set up MTRRs, it is valid to honor PAT in no-fill cache
mode, per the definition of that mode: "Strict memory ordering is not
enforced unless the MTRRs are disabled and/or all memory is referenced
as uncached".
Besides, if the guest explicitly claims UC via PAT, why should KVM
return WB?
In other words, if things are still slow because of a UC value in guest
PAT, that is better fixed in the guest than worked around in KVM.
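To make the intended end state concrete, the CR0.CD=1 path in
vmx_get_mt_mask() with this patch applied looks roughly like the sketch
below (illustrative, not the exact hunk):

	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
		/*
		 * With the quirk (no-fill cache mode), still return WB, but
		 * without VMX_EPT_IPAT_BIT, so the guest PAT is honored too.
		 */
		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;

		/* Without the quirk, force UC and ignore the guest PAT. */
		return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
		       VMX_EPT_IPAT_BIT;
	}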
On Mon, Jun 26, 2023 at 08:08:20AM +0800, Yan Zhao wrote:
> On Sun, Jun 25, 2023 at 03:14:37PM +0800, Xiaoyao Li wrote:
> > On 6/20/2023 10:34 AM, Yan Zhao wrote:
> > > On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> > > > On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> > > > > For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> > > > > types when cache is disabled and non-coherent DMA are present.
> > > > >
> > > > > With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> > > > > EPT memory type when guest cache is disabled before this patch.
> > > > > Removing the IPAT bit in this patch will allow effective memory type to
> > > > > honor PAT values as well, which will make the effective memory type
> > > > Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> > > > which guests can benefit from this change?
> > > This patch is actually a preparation for later patch 10 to implement
> > > fine-grained zap.
> > > If when CR0.CD=1 the EPT type is WB + IPAT, and
> > > when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> > > without IPAT, then we have to always zap all EPT entries.
> > >
> > > Given removing the IPAT bit when CR0.CD=1 only makes the quirk
> > > KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
> > > the guest PAT overwrites it), it's still acceptable.
> >
> > Per my understanding, the reason why KVM had KVM_X86_QUIRK_CD_NW_CLEARED is
> > to ensure the memory type is WB to achieve better boot performance for old
> > OVMF.
> It works well for OVMF c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7,
> which is Apr 14 2015.
>
>
> > you need to justify the original purpose is not broken by this patch.
>
> Hmm, to dig into the history, the reason for this quirk is explained below:
>
> commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
> Author: Xiao Guangrong <[email protected]>
> Date: Thu Jul 16 03:25:56 2015 +0800
>
> KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED
>
> OVMF depends on WB to boot fast, because it only clears caches after
> it has set up MTRRs---which is too late.
>
> Let's do writeback if CR0.CD is set to make it happy, similar to what
> SVM is already doing.
>
>
> which means WB is only a must for fast boot before OVMF has set up MTRRs.
> At that period, PAT is default to WB.
>
> After OVMF setting up MTRR, according to the definition of no-fill cache
> mode, "Strict memory ordering is not enforced unless the MTRRs are
> disabled and/or all memory is referenced as uncached", it's valid to
> honor PAT in no-fill cache mode.
Does it also mean that honoring PAT in no-fill cache mode should also
happen in the non-quirk case? E.g. the effective memory type could be
WC if the EPT type is UC and the guest PAT is WC when CR0.CD=1.
> Besides, if the guest explicitly claim UC via PAT, why should KVM return
> WB?
> In other words, if it's still slow caused by a UC value in guest PAT,
> it's desired to be fixed in guest instead of a workaround in KVM.
The quirk may not work after this patch if the guest PAT is stronger
than WB when CR0.CD=1. We don't know if any guest "works correctly"
based on this quirk; I hope not. How about highlighting this in the
commit message explicitly?
Also, I agree that such an issue should be fixed in the guest, not in KVM.
On Mon, Jun 26, 2023 at 11:40:28AM +0800, Yuan Yao wrote:
> On Mon, Jun 26, 2023 at 08:08:20AM +0800, Yan Zhao wrote:
> > On Sun, Jun 25, 2023 at 03:14:37PM +0800, Xiaoyao Li wrote:
> > > On 6/20/2023 10:34 AM, Yan Zhao wrote:
> > > > On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> > > > > On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> > > > > > For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> > > > > > types when cache is disabled and non-coherent DMA are present.
> > > > > >
> > > > > > With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> > > > > > EPT memory type when guest cache is disabled before this patch.
> > > > > > Removing the IPAT bit in this patch will allow effective memory type to
> > > > > > honor PAT values as well, which will make the effective memory type
> > > > > Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> > > > > which guests can benefit from this change?
> > > > This patch is actually a preparation for later patch 10 to implement
> > > > fine-grained zap.
> > > > If when CR0.CD=1 the EPT type is WB + IPAT, and
> > > > when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> > > > without IPAT, then we have to always zap all EPT entries.
> > > >
> > > > Given removing the IPAT bit when CR0.CD=1 only makes the quirk
> > > > KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
> > > > the guest PAT overwrites it), it's still acceptable.
> > >
> > > Per my understanding, the reason why KVM had KVM_X86_QUIRK_CD_NW_CLEARED is
> > > to ensure the memory type is WB to achieve better boot performance for old
> > > OVMF.
> > It works well for OVMF c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7,
> > which is Apr 14 2015.
> >
> >
> > > you need to justify the original purpose is not broken by this patch.
> >
> > Hmm, to dig into the history, the reason for this quirk is explained below:
> >
> > commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
> > Author: Xiao Guangrong <[email protected]>
> > Date: Thu Jul 16 03:25:56 2015 +0800
> >
> > KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED
> >
> > OVMF depends on WB to boot fast, because it only clears caches after
> > it has set up MTRRs---which is too late.
> >
> > Let's do writeback if CR0.CD is set to make it happy, similar to what
> > SVM is already doing.
> >
> >
> > which means WB is only a must for fast boot before OVMF has set up MTRRs.
> > At that period, PAT is default to WB.
> >
> > After OVMF setting up MTRR, according to the definition of no-fill cache
> > mode, "Strict memory ordering is not enforced unless the MTRRs are
> > disabled and/or all memory is referenced as uncached", it's valid to
> > honor PAT in no-fill cache mode.
>
> Does it also mean that, the honor PAT in such no-fill cache mode should
> also happen for non-quirk case ? e.g. the effective memory type can be
> WC if EPT is UC + guest PAT is WC for CD=1.
No. Only the quirk KVM_X86_QUIRK_CD_NW_CLEARED indicates no-fill cache
mode (CD=1 and NW=0).
Without the quirk, UC + IPAT is desired.
>
> > Besides, if the guest explicitly claim UC via PAT, why should KVM return
> > WB?
> > In other words, if it's still slow caused by a UC value in guest PAT,
> > it's desired to be fixed in guest instead of a workaround in KVM.
>
> the quirk may not work after this patch if the guest PAT is
> stronger than WB for CD=1, we don't if any guest "works correctly" based
> on this quirk, I hope no. How about highlight this in commit message
At least for Seabios and OVMF, PAT is WB by default.
Even after MTRRs are enabled, any UC ranges are small in size and are
intended to be UC.
So I think it's ok.
> explicitly ?
Will try to explain the background and possible influence.
Thanks
>
> Also I agree that such issue should be fixed in guest not in KVM.
>
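As background for the "PAT is WB by default" point above: the
architectural power-on value of IA32_PAT is 0x0007040600070406, which
leaves PAT entry 0 (the one selected by ordinary page-table encodings)
as Write-Back. A standalone, illustrative decode of that value (not KVM
code):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		const uint64_t pat = 0x0007040600070406ULL; /* power-on IA32_PAT */
		static const char * const type[8] = {
			[0x0] = "UC", [0x1] = "WC", [0x4] = "WT",
			[0x5] = "WP", [0x6] = "WB", [0x7] = "UC-",
		};
		int i;

		/* Each PAT entry is one byte; the low 3 bits encode the type. */
		for (i = 0; i < 8; i++)
			printf("PA%d = %s\n", i, type[(pat >> (i * 8)) & 0x7]);
		return 0;
	}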
On Fri, Jun 16, 2023, Yan Zhao wrote:
> Call helper to check if guest MTRRs are honored by KVM MMU before zapping,
Nit, state the effect, not what the code literally does. The important part is
that the end result is that KVM will zap if and only if guest MTRRs are being
honored, e.g.
Zap SPTEs when CR0.CD is toggled if and only if KVM's MMU is honoring
guest MTRRs, which is the only time that KVM incorporates the guest's
CR0.CD into the final memtype.
> as values of guest CR0.CD will only affect memory types of KVM TDP when
> guest MTRRs are honored.
>
> Suggested-by: Chao Gao <[email protected]>
> Signed-off-by: Yan Zhao <[email protected]>
> ---
> arch/x86/kvm/x86.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9e7186864542..6693daeb5686 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -942,7 +942,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
> kvm_mmu_reset_context(vcpu);
>
> if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> - kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> + kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
> !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> }
> --
> 2.17.1
>
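For context, the kvm_mmu_honors_guest_mtrrs() helper used in the hunk
above is introduced earlier in the series; its exact definition is not
quoted here, so treat the body below as an assumption of what it boils
down to rather than the series' literal code:

	static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
	{
		/*
		 * Guest MTRRs only feed into the final memtype when KVM's MMU
		 * controls host memory types (EPT, i.e. shadow_memtype_mask
		 * is non-zero) and the VM has non-coherent DMA devices.
		 */
		return shadow_memtype_mask && kvm_arch_has_noncoherent_dma(kvm);
	}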
On Fri, Jun 16, 2023, Yan Zhao wrote:
> Call helper to check if guest MTRRs are honored by KVM MMU before
> calculation and zapping.
Same comment here about stating the effect, not the literal code change.
> Guest MTRRs only affect TDP memtypes when TDP honors guest MTRRs, there's
> no meaning to do the calculation and zapping otherwise.
On Fri, Jun 16, 2023, Yan Zhao wrote:
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index b35dd0bc9cad..688748e3a4d2 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -25,6 +25,8 @@
> #define IA32_MTRR_DEF_TYPE_FE (1ULL << 10)
> #define IA32_MTRR_DEF_TYPE_TYPE_MASK (0xff)
>
> +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> + gfn_t gfn_start, gfn_t gfn_end);
> static bool is_mtrr_base_msr(unsigned int msr)
> {
> /* MTRR base MSRs use even numbers, masks use odd numbers. */
> @@ -341,7 +343,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
> }
>
> - kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> + kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(start), gpa_to_gfn(end));
> }
>
> static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> @@ -437,6 +439,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
> {
> INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
> +
> + if (vcpu->vcpu_id == 0) {
Eww. This is actually unsafe, because kvm_arch_vcpu_create() is invoked without
holding kvm->lock. Oh, and vcpu_id is userspace controlled, so it's *very*
unsafe. Just initialize these in kvm_arch_init_vm().
> + spin_lock_init(&vcpu->kvm->arch.mtrr_zap_list_lock);
> + INIT_LIST_HEAD(&vcpu->kvm->arch.mtrr_zap_list);
> + }
> }
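In practice that means doing the one-time setup at VM creation instead,
e.g. as in the sketch below (the helper name kvm_mtrr_init_vm() is made
up here purely for illustration):

	/* Called from kvm_arch_init_vm(), before any vCPU exists. */
	static void kvm_mtrr_init_vm(struct kvm *kvm)
	{
		spin_lock_init(&kvm->arch.mtrr_zap_list_lock);
		INIT_LIST_HEAD(&kvm->arch.mtrr_zap_list);
	}

This avoids both the missing kvm->lock problem and the reliance on a
userspace-controlled vcpu_id.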
On Fri, Jun 16, 2023, Yan Zhao wrote:
> This series refines mmu zap caused by EPT memory type update when guest
> MTRRs are honored.
...
> Yan Zhao (11):
> KVM: x86/mmu: helpers to return if KVM honors guest MTRRs
> KVM: x86/mmu: Use KVM honors guest MTRRs helper in
> kvm_tdp_page_fault()
> KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles
> KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr
> KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops
> KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling
> KVM: VMX: drop IPAT in memtype when CD=1 for
> KVM_X86_QUIRK_CD_NW_CLEARED
> KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common
> code
> KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
> KVM: x86/mmu: fine-grained gfn zap when guest MTRRs are honored
> KVM: x86/mmu: split a single gfn zap range when guest MTRRs are
> honored
I got through the easy patches, I'll circle back for the last few patches in a
few weeks (probably 3+ weeks at this point).
On Wed, Jun 28, 2023 at 02:59:00PM -0700, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Yan Zhao wrote:
> > Call helper to check if guest MTRRs are honored by KVM MMU before zapping,
>
> Nit, state the effect, not what the code literally does. The important part is
> that the end result is that KVM will zap if and only if guest MTRRs are being
> honored, e.g.
>
> Zap SPTEs when CR0.CD is toggled if and only if KVM's MMU is honoring
> guest MTRRs, which is the only time that KVM incorporates the guest's
> CR0.CD into the final memtype.
>
Thanks! Will update it.
On Wed, Jun 28, 2023 at 04:00:55PM -0700, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Yan Zhao wrote:
> > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > index b35dd0bc9cad..688748e3a4d2 100644
> > --- a/arch/x86/kvm/mtrr.c
> > +++ b/arch/x86/kvm/mtrr.c
> > @@ -25,6 +25,8 @@
> > #define IA32_MTRR_DEF_TYPE_FE (1ULL << 10)
> > #define IA32_MTRR_DEF_TYPE_TYPE_MASK (0xff)
> >
> > +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> > + gfn_t gfn_start, gfn_t gfn_end);
> > static bool is_mtrr_base_msr(unsigned int msr)
> > {
> > /* MTRR base MSRs use even numbers, masks use odd numbers. */
> > @@ -341,7 +343,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
> > }
> >
> > - kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> > + kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(start), gpa_to_gfn(end));
> > }
> >
> > static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> > @@ -437,6 +439,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> > void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
> > {
> > INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
> > +
> > + if (vcpu->vcpu_id == 0) {
>
> Eww. This is actually unsafe, because kvm_arch_vcpu_create() is invoked without
> holding kvm->lock. Oh, and vcpu_id is userspace controlled, so it's *very*
> unsafe. Just initialize these in kvm_arch_init_vm().
Will do. Thanks!
>
> > + spin_lock_init(&vcpu->kvm->arch.mtrr_zap_list_lock);
> > + INIT_LIST_HEAD(&vcpu->kvm->arch.mtrr_zap_list);
> > + }
> > }
On Wed, Jun 28, 2023 at 04:02:05PM -0700, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Yan Zhao wrote:
> > This series refines mmu zap caused by EPT memory type update when guest
> > MTRRs are honored.
>
> ...
>
> > Yan Zhao (11):
> > KVM: x86/mmu: helpers to return if KVM honors guest MTRRs
> > KVM: x86/mmu: Use KVM honors guest MTRRs helper in
> > kvm_tdp_page_fault()
> > KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles
> > KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr
> > KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops
> > KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling
> > KVM: VMX: drop IPAT in memtype when CD=1 for
> > KVM_X86_QUIRK_CD_NW_CLEARED
> > KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common
> > code
> > KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
> > KVM: x86/mmu: fine-grained gfn zap when guest MTRRs are honored
> > KVM: x86/mmu: split a single gfn zap range when guest MTRRs are
> > honored
>
> I got through the easy patches, I'll circle back for the last few patches in a
> few weeks (probably 3+ weeks at this point).
Thanks for this heads-up.
I have addressed almost all of the comments for v3, except for where to
get the memtype when CR0.CD=1; feel free to decline my new proposal in
v4, as explained in another mail :)
v4 is available here:
https://lore.kernel.org/all/[email protected]/
Please review the new version directly.
Thanks!