This is both a new feature and a bugfix.
Bugfix description:
It was found that Windows 2016 guests on KVM crash when they have more
than 64 vCPUs, a non-flat topology (more than 1 core/thread per socket;
with more than 64 sockets Windows just ignores vCPUs above 64), and any
Hyper-V enlightenments enabled. The most common error reported is "PAGE
FAULT IN NONPAGED AREA" but I saw other messages too. Apparently,
Windows doesn't expect to run on a Hyper-V server without PV TLB flush
support, as there are no such Hyper-V servers out there (it's only
WS2016 that supports more than 64 vCPUs, AFAIR).
Adding PV TLB flush support to KVM helps: Windows 2016 guests now boot
normally (I tried '-smp 128,sockets=64,cores=1,threads=2' and
'-smp 128,sockets=8,cores=16,threads=1' but other topologies should work
too).
Feature description:
PV TLB flush helps a lot when running overcommitted. KVM gained support for
it recently but it is only available for Linux guests. Windows guests use
the emulated Hyper-V interface, so PV TLB flush needs to be added there.
I tested a WS2016 guest with 128 vCPUs running on a 12 pCPU server. The test
ran 64 threads, each doing 100 mmap()/munmap() iterations on 16384 pages with a
tiny random nanosleep in between (I used Cygwin; it would be great if
someone could point me to a good Windows-native TLB thrashing test).
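Roughly, the test looked like this (a sketch reconstructed from the
description above; the real runs were under Cygwin and details like the
sleep range and page touching are approximate):

#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS   64
#define ITERATIONS 100
#define NPAGES     16384

static void *thrash(void *arg)
{
	long psize = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)NPAGES * psize;
	struct timespec ts = { 0, 0 };
	int i, j;

	for (i = 0; i < ITERATIONS; i++) {
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return NULL;
		/* Touch every page so munmap() has TLB entries to shoot
		 * down on the other vCPUs running this mm. */
		for (j = 0; j < NPAGES; j++)
			p[(size_t)j * psize] = 1;
		munmap(p, len);

		ts.tv_nsec = rand() % 1000;	/* tiny random nanosleep */
		nanosleep(&ts, NULL);
	}
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, thrash, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);
	return 0;
}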
The results are:
Before:
real 0m44.362s
user 0m1.796s
sys 6m43.218s
After:
real 0m24.425s
user 0m1.811s
sys 0m40.625s
When running without overcommit (a single 12 vCPU guest on a 12 pCPU server),
the results of the same test are very close:
Before:
real 0m21.237s
user 0m1.531s
sys 0m19.984s
After:
real 0m21.082s
user 0m1.546s
sys 0m20.030s
Implementation details:
The implementation is very simplistic and straightforward. We ignore the
'address space' argument of the hypercalls (there is no good way to
figure out what's currently in CR3 of a running vCPU, since generally we
don't VMEXIT on guest CR3 writes) and do a full TLB flush on the
specified vCPUs. In case said vCPUs are not running, the TLB flush will
be performed upon guest entry.
Qemu (and other userspaces) needs to enable CPUID feature bits to make
Windows aware that the feature is supported. I'll post the Qemu
enablement patch separately.
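For illustration, probing for the capability from userspace would look
roughly like this (a sketch, assuming headers that already carry the
KVM_CAP_HYPERV_TLBFLUSH define from the last patch):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);

	if (kvm < 0)
		return 1;
	/* A positive return value means the hypercalls are implemented
	 * and the Hyper-V TLB flush CPUID bits may be exposed. */
	if (ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_HYPERV_TLBFLUSH) > 0)
		printf("KVM_CAP_HYPERV_TLBFLUSH supported\n");
	return 0;
}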
Patches are based on the current kvm/queue branch.
Vitaly Kuznetsov (5):
x86/hyper-v: move struct hv_flush_pcpu{,ex} definitions to common
header
KVM: x86: hyperv: use defines when parsing hypercall parameters
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}
implementation
KVM: x86: hyperv: simplistic
HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation
KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability
Documentation/virtual/kvm/api.txt | 9 +++
arch/x86/hyperv/mmu.c | 40 +++-------
arch/x86/include/asm/hyperv-tlfs.h | 20 +++++
arch/x86/kvm/hyperv.c | 154 ++++++++++++++++++++++++++++++++++---
arch/x86/kvm/trace.h | 51 ++++++++++++
arch/x86/kvm/x86.c | 1 +
include/uapi/linux/kvm.h | 1 +
7 files changed, 236 insertions(+), 40 deletions(-)
--
2.14.3
Implement HvFlushVirtualAddress{List,Space} hypercalls in a simplistic way:
do full TLB flush with KVM_REQ_TLB_FLUSH and rely on kvm_vcpu_kick()
kicking only vCPUs which are currently IN_GUEST_MODE.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kvm/hyperv.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-------
arch/x86/kvm/trace.h | 24 +++++++++++++++++++++++
2 files changed, 71 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 3cb3bb68db7e..aa866994366d 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1242,6 +1242,49 @@ int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
return kvm_hv_get_msr(vcpu, msr, pdata);
}
+static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
+ u16 rep_cnt)
+{
+ struct kvm *kvm = current_vcpu->kvm;
+ struct hv_tlb_flush flush;
+ struct kvm_vcpu *vcpu;
+ int i;
+
+ if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ trace_kvm_hv_flush_tlb(flush.processor_mask, flush.address_space,
+ flush.flags);
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
+
+ if (!(flush.flags & HV_FLUSH_ALL_PROCESSORS) &&
+ !(flush.processor_mask & BIT_ULL(hv->vp_index)))
+ continue;
+
+ /*
+ * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we
+ * can't analyze it here, flush TLB regardless of the specified
+ * address space.
+ */
+ kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+
+ /*
+ * It is very unlikely but possible that we're doing an extra
+ * kick here (e.g. if the vCPU has just entered the guest and
+ * has its TLB flushed).
+ */
+ if (vcpu != current_vcpu)
+ kvm_vcpu_kick(vcpu);
+ }
+
+ /* We always do full TLB flush, set rep_done = rep_cnt. */
+ return (u64)HV_STATUS_SUCCESS |
+ ((u64)rep_cnt << HV_HYPERCALL_REP_START_OFFSET) |
+ ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
+}
+
bool kvm_hv_hypercall_enabled(struct kvm *kvm)
{
return READ_ONCE(kvm->arch.hyperv.hv_hypercall) & HV_X64_MSR_HYPERCALL_ENABLE;
@@ -1345,12 +1388,6 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
trace_kvm_hv_hypercall(code, fast, rep_cnt, rep_idx, ingpa, outgpa);
- /* Hypercall continuation is not supported yet */
- if (rep_cnt || rep_idx) {
- ret = HV_STATUS_INVALID_HYPERCALL_CODE;
- goto set_result;
- }
-
switch (code) {
case HVCALL_NOTIFY_LONG_SPIN_WAIT:
kvm_vcpu_on_spin(vcpu, true);
@@ -1374,12 +1411,15 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
vcpu->arch.complete_userspace_io =
kvm_hv_hypercall_complete_userspace;
return 0;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST:
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE:
+ ret = kvm_hv_flush_tlb(vcpu, ingpa, rep_cnt);
+ break;
default:
ret = HV_STATUS_INVALID_HYPERCALL_CODE;
break;
}
-set_result:
kvm_hv_hypercall_set_result(vcpu, ret);
return 1;
}
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 9807c314c478..47a4fd758743 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1367,6 +1367,30 @@ TRACE_EVENT(kvm_hv_timer_state,
__entry->vcpu_id,
__entry->hv_timer_in_use)
);
+
+/*
+ * Tracepoint for kvm_hv_flush_tlb.
+ */
+TRACE_EVENT(kvm_hv_flush_tlb,
+ TP_PROTO(u64 processor_mask, u64 address_space, u64 flags),
+ TP_ARGS(processor_mask, address_space, flags),
+
+ TP_STRUCT__entry(
+ __field(u64, processor_mask)
+ __field(u64, address_space)
+ __field(u64, flags)
+ ),
+
+ TP_fast_assign(
+ __entry->processor_mask = processor_mask;
+ __entry->address_space = address_space;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("processor_mask 0x%llx address_space 0x%llx flags 0x%llx",
+ __entry->processor_mask, __entry->address_space,
+ __entry->flags)
+);
#endif /* _TRACE_KVM_H */
#undef TRACE_INCLUDE_PATH
--
2.14.3
We need a new capability to indicate support for the newly added
HvFlushVirtualAddress{List,Space}{,Ex} hypercalls. Upon seeing this
capability, userspace is supposed to announce PV TLB flush features
by setting the appropriate CPUID bits (if needed).
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
Documentation/virtual/kvm/api.txt | 9 +++++++++
arch/x86/kvm/x86.c | 1 +
include/uapi/linux/kvm.h | 1 +
3 files changed, 11 insertions(+)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 1c7958b57fe9..74e22e36b3c8 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4596,3 +4596,12 @@ Architectures: s390
This capability indicates that kvm will implement the interfaces to handle
reset, migration and nested KVM for branch prediction blocking. The stfle
facility 82 should not be provided to the guest without this capability.
+
+8.14 KVM_CAP_HYPERV_TLBFLUSH
+
+Architectures: x86
+
+This capability indicates that KVM supports paravirtualized Hyper-V TLB Flush
+hypercalls:
+HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx,
+HvFlushVirtualAddressList, HvFlushVirtualAddressListEx.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7d9a444f2f24..a34a068c4663 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2852,6 +2852,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_HYPERV_SYNIC2:
case KVM_CAP_HYPERV_VP_INDEX:
case KVM_CAP_HYPERV_EVENTFD:
+ case KVM_CAP_HYPERV_TLBFLUSH:
case KVM_CAP_PCI_SEGMENT:
case KVM_CAP_DEBUGREGS:
case KVM_CAP_X86_ROBUST_SINGLESTEP:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1065006c9bf5..30c193b43d89 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -941,6 +941,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_BPB 152
#define KVM_CAP_GET_MSR_FEATURES 153
#define KVM_CAP_HYPERV_EVENTFD 154
+#define KVM_CAP_HYPERV_TLBFLUSH 155
#ifdef KVM_CAP_IRQ_ROUTING
--
2.14.3
Avoid open-coding offsets for hypercall input parameters; we already
have defines for them.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
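For reference, the hypercall input value layout these defines encode
(paraphrased from the TLFS, not taken from the patch):

/*
 * bits  0..15  call code
 * bit   16     fast calling convention (HV_HYPERCALL_FAST_BIT)
 * bits 32..43  rep count  (HV_HYPERCALL_REP_COMP_OFFSET == 32)
 * bits 48..59  rep start  (HV_HYPERCALL_REP_START_OFFSET == 48)
 */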
arch/x86/kvm/hyperv.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 98618e397342..3cb3bb68db7e 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1339,9 +1339,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
#endif
code = param & 0xffff;
- fast = (param >> 16) & 0x1;
- rep_cnt = (param >> 32) & 0xfff;
- rep_idx = (param >> 48) & 0xfff;
+ fast = !!(param & HV_HYPERCALL_FAST_BIT);
+ rep_cnt = (param >> HV_HYPERCALL_REP_COMP_OFFSET) & 0xfff;
+ rep_idx = (param >> HV_HYPERCALL_REP_START_OFFSET) & 0xfff;
trace_kvm_hv_hypercall(code, fast, rep_cnt, rep_idx, ingpa, outgpa);
--
2.14.3
Implement HvFlushVirtualAddress{List,Space}Ex hypercalls in a simplistic
way: do full TLB flush with KVM_REQ_TLB_FLUSH and rely on kvm_vcpu_kick()
kicking only vCPUs which are currently IN_GUEST_MODE.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kvm/hyperv.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/trace.h | 27 +++++++++++++++
2 files changed, 121 insertions(+)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index aa866994366d..e72f8a67dd82 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1285,6 +1285,96 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
}
+static __always_inline int get_sparse_bank_no(u64 valid_bank_mask, int bank_no)
+{
+ int i = 0, j;
+
+ if (!(valid_bank_mask & BIT_ULL(bank_no)))
+ return -1;
+
+ for (j = 0; j < bank_no; j++)
+ if (valid_bank_mask & BIT_ULL(j))
+ i++;
+
+ return i;
+}
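+/*
+ * Example: with valid_bank_mask == 0x0a (banks 1 and 3), the guest packs
+ * them into bank_contents[0] and bank_contents[1] respectively, so
+ * get_sparse_bank_no(0x0a, 3) returns 1 while get_sparse_bank_no(0x0a, 2)
+ * returns -1 (bank 2 is not in the set).
+ */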
+
+static __always_inline int load_bank_guest(struct kvm *kvm, u64 ingpa,
+ int sparse_bank, u64 *bank_contents)
+{
+ int offset;
+
+ offset = offsetof(struct hv_tlb_flush_ex, hv_vp_set.bank_contents) +
+ sizeof(u64) * sparse_bank;
+
+ if (unlikely(kvm_read_guest(kvm, ingpa + offset,
+ bank_contents, sizeof(u64))))
+ return 1;
+
+ return 0;
+}
+
+static int kvm_hv_flush_tlb_ex(struct kvm_vcpu *current_vcpu, u64 ingpa,
+ u16 rep_cnt)
+{
+ struct kvm *kvm = current_vcpu->kvm;
+ struct hv_tlb_flush_ex flush;
+ struct kvm_vcpu *vcpu;
+ u64 bank_contents, valid_bank_mask;
+ int i, current_sparse_bank = -1;
+ u64 ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
+ return ret;
+
+ valid_bank_mask = flush.hv_vp_set.valid_bank_mask;
+
+ trace_kvm_hv_flush_tlb_ex(valid_bank_mask, flush.hv_vp_set.format,
+ flush.address_space, flush.flags);
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
+ int bank = hv->vp_index / 64, sparse_bank;
+
+ if (flush.hv_vp_set.format == HV_GENERIC_SET_SPARCE_4K) {
+ /* Check if the bank of this vCPU is in the sparse set */
+ sparse_bank = get_sparse_bank_no(valid_bank_mask, bank);
+ if (sparse_bank < 0)
+ continue;
+
+ /*
+ * Assume hv->vp_index is in ascending order and we can
+ * optimize by not reloading bank contents for every
+ * vCPU.
+ */
+ if (sparse_bank != current_sparse_bank) {
+ if (load_bank_guest(kvm, ingpa, sparse_bank,
+ &bank_contents))
+ return ret;
+ current_sparse_bank = sparse_bank;
+ }
+
+ if (!(bank_contents & BIT_ULL(hv->vp_index % 64)))
+ continue;
+ }
+
+ kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+
+ /*
+ * It is very unlikely but possible that we're doing an extra
+ * kick here (e.g. if the vCPU has just entered the guest and
+ * has its TLB flushed).
+ */
+ if (vcpu != current_vcpu)
+ kvm_vcpu_kick(vcpu);
+ }
+
+ /* We always do full TLB flush, set rep_done = rep_cnt. */
+ return (u64)HV_STATUS_SUCCESS |
+ ((u64)rep_cnt << HV_HYPERCALL_REP_START_OFFSET) |
+ ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
+}
+
bool kvm_hv_hypercall_enabled(struct kvm *kvm)
{
return READ_ONCE(kvm->arch.hyperv.hv_hypercall) & HV_X64_MSR_HYPERCALL_ENABLE;
@@ -1415,6 +1505,10 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE:
ret = kvm_hv_flush_tlb(vcpu, ingpa, rep_cnt);
break;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
+ ret = kvm_hv_flush_tlb_ex(vcpu, ingpa, rep_cnt);
+ break;
default:
ret = HV_STATUS_INVALID_HYPERCALL_CODE;
break;
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 47a4fd758743..0f997683404f 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1391,6 +1391,33 @@ TRACE_EVENT(kvm_hv_flush_tlb,
__entry->processor_mask, __entry->address_space,
__entry->flags)
);
+
+/*
+ * Tracepoint for kvm_hv_flush_tlb_ex.
+ */
+TRACE_EVENT(kvm_hv_flush_tlb_ex,
+ TP_PROTO(u64 valid_bank_mask, u64 format, u64 address_space, u64 flags),
+ TP_ARGS(valid_bank_mask, format, address_space, flags),
+
+ TP_STRUCT__entry(
+ __field(u64, valid_bank_mask)
+ __field(u64, format)
+ __field(u64, address_space)
+ __field(u64, flags)
+ ),
+
+ TP_fast_assign(
+ __entry->valid_bank_mask = valid_bank_mask;
+ __entry->format = format;
+ __entry->address_space = address_space;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("valid_bank_mask 0x%llx format 0x%llx "
+ "address_space 0x%llx flags 0x%llx",
+ __entry->valid_bank_mask, __entry->format,
+ __entry->address_space, __entry->flags)
+);
#endif /* _TRACE_KVM_H */
#undef TRACE_INCLUDE_PATH
--
2.14.3
Hyper-V TLB flush hypercall definitions will be required for KVM, so move
them to hyperv-tlfs.h. The structures also need to be renamed, as the
'_pcpu' suffix is inappropriate for a general-purpose definition.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/hyperv/mmu.c | 40 ++++++++++----------------------------
arch/x86/include/asm/hyperv-tlfs.h | 20 +++++++++++++++++++
2 files changed, 30 insertions(+), 30 deletions(-)
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 56c9ebac946f..002fc565f8f2 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -13,32 +13,12 @@
#define CREATE_TRACE_POINTS
#include <asm/trace/hyperv.h>
-/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
-struct hv_flush_pcpu {
- u64 address_space;
- u64 flags;
- u64 processor_mask;
- u64 gva_list[];
-};
-
-/* HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressListEx hypercalls */
-struct hv_flush_pcpu_ex {
- u64 address_space;
- u64 flags;
- struct {
- u64 format;
- u64 valid_bank_mask;
- u64 bank_contents[];
- } hv_vp_set;
- u64 gva_list[];
-};
-
/* Each gva in gva_list encodes up to 4096 pages to flush */
#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
-static struct hv_flush_pcpu __percpu **pcpu_flush;
+static struct hv_tlb_flush __percpu **pcpu_flush;
-static struct hv_flush_pcpu_ex __percpu **pcpu_flush_ex;
+static struct hv_tlb_flush_ex __percpu **pcpu_flush_ex;
/*
* Fills in gva_list starting from offset. Returns the number of items added.
@@ -71,7 +51,7 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
}
/* Return the number of banks in the resulting vp_set */
-static inline int cpumask_to_vp_set(struct hv_flush_pcpu_ex *flush,
+static inline int cpumask_to_vp_set(struct hv_tlb_flush_ex *flush,
const struct cpumask *cpus)
{
int cpu, vcpu, vcpu_bank, vcpu_offset, nr_bank = 1;
@@ -81,7 +61,7 @@ static inline int cpumask_to_vp_set(struct hv_flush_pcpu_ex *flush,
return 0;
/*
- * Clear all banks up to the maximum possible bank as hv_flush_pcpu_ex
+ * Clear all banks up to the maximum possible bank as hv_tlb_flush_ex
* structs are not cleared between calls, we risk flushing unneeded
* vCPUs otherwise.
*/
@@ -109,8 +89,8 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
const struct flush_tlb_info *info)
{
int cpu, vcpu, gva_n, max_gvas;
- struct hv_flush_pcpu **flush_pcpu;
- struct hv_flush_pcpu *flush;
+ struct hv_tlb_flush **flush_pcpu;
+ struct hv_tlb_flush *flush;
u64 status = U64_MAX;
unsigned long flags;
@@ -196,8 +176,8 @@ static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
const struct flush_tlb_info *info)
{
int nr_bank = 0, max_gvas, gva_n;
- struct hv_flush_pcpu_ex **flush_pcpu;
- struct hv_flush_pcpu_ex *flush;
+ struct hv_tlb_flush_ex **flush_pcpu;
+ struct hv_tlb_flush_ex *flush;
u64 status = U64_MAX;
unsigned long flags;
@@ -303,7 +283,7 @@ void hyper_alloc_mmu(void)
return;
if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
- pcpu_flush = alloc_percpu(struct hv_flush_pcpu *);
+ pcpu_flush = alloc_percpu(struct hv_tlb_flush *);
else
- pcpu_flush_ex = alloc_percpu(struct hv_flush_pcpu_ex *);
+ pcpu_flush_ex = alloc_percpu(struct hv_tlb_flush_ex *);
}
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 1c602ad4bda8..ccb1dffe0d84 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -703,4 +703,24 @@ struct hv_enlightened_vmcs {
#define HV_STIMER_AUTOENABLE (1ULL << 3)
#define HV_STIMER_SINT(config) (__u8)(((config) >> 16) & 0x0F)
+/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
+struct hv_tlb_flush {
+ u64 address_space;
+ u64 flags;
+ u64 processor_mask;
+ u64 gva_list[];
+};
+
+/* HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressListEx hypercalls */
+struct hv_tlb_flush_ex {
+ u64 address_space;
+ u64 flags;
+ struct {
+ u64 format;
+ u64 valid_bank_mask;
+ u64 bank_contents[];
+ } hv_vp_set;
+ u64 gva_list[];
+};
+
#endif
--
2.14.3
On Mon, Apr 02, 2018 at 06:10:54PM +0200, Vitaly Kuznetsov wrote:
> This is both a new feature and a bugfix.
>
> Bugfix description:
>
> It was found that Windows 2016 guests on KVM crash when they have more
> than 64 vCPUs, a non-flat topology (more than 1 core/thread per socket;
> with more than 64 sockets Windows just ignores vCPUs above 64), and any
> Hyper-V enlightenments enabled. The most common error reported is "PAGE
> FAULT IN NONPAGED AREA" but I saw other messages too. Apparently,
> Windows doesn't expect to run on a Hyper-V server without PV TLB flush
> support, as there are no such Hyper-V servers out there (it's only
> WS2016 that supports more than 64 vCPUs, AFAIR).
>
> Adding PV TLB flush support to KVM helps: Windows 2016 guests now boot
> normally (I tried '-smp 128,sockets=64,cores=1,threads=2' and
> '-smp 128,sockets=8,cores=16,threads=1' but other topologies should work
> too).
>
> Feature description:
>
> PV TLB flush helps a lot when running overcommitted. KVM gained support for
> it recently but it is only available for Linux guests. Windows guests use
> the emulated Hyper-V interface, so PV TLB flush needs to be added there.
>
> I tested a WS2016 guest with 128 vCPUs running on a 12 pCPU server. The test
> ran 64 threads, each doing 100 mmap()/munmap() iterations on 16384 pages with a
> tiny random nanosleep in between (I used Cygwin; it would be great if
> someone could point me to a good Windows-native TLB thrashing test).
>
> The results are:
> Before:
> real 0m44.362s
> user 0m1.796s
> sys 6m43.218s
>
> After:
> real 0m24.425s
> user 0m1.811s
> sys 0m40.625s
>
> When running without overcommit (a single 12 vCPU guest on a 12 pCPU server),
> the results of the same test are very close:
> Before:
> real 0m21.237s
> user 0m1.531s
> sys 0m19.984s
>
> After:
> real 0m21.082s
> user 0m1.546s
> sys 0m20.030s
I vaguely remember Denis Plotnikov (cc-d) made a similar attempt a couple
of years ago. IIRC the outcome was that win2012r2 (back then) guests
started to also use this mechanism for local TLB flushes via self-IPI,
which led to noticeable degradation on certain workloads.
Denis, do you have any details to share?
Thanks,
Roman.
Roman Kagan <[email protected]> writes:
> On Mon, Apr 02, 2018 at 06:10:54PM +0200, Vitaly Kuznetsov wrote:
>>
>> Feature description:
>>
>> PV TLB flush helps a lot when running overcommitted. KVM gained support for
>> it recently but it is only available for Linux guests. Windows guests use
>> the emulated Hyper-V interface, so PV TLB flush needs to be added there.
>>
>
> I vaguely remember Denis Plotnikov (cc-d) made a similar attempt a couple
> of years ago. IIRC the outcome was that win2012r2 (back then) guests
> started to also use this mechanism for local TLB flushes via self-IPI,
> which led to noticeable degradation on certain workloads.
>
Thanks for the feedback!
In theory it shouldn't do that. Hyper-V has different enlightenments for
local and remote TLB flushing:
Implementation Recommendations - 0x40000004
...
Bit 1: Recommend using hypercall for local TLB flushes rather
than INVLPG or MOV to CR3 instructions
Bit 2: Recommend using hypercall for remote TLB flushes
rather than inter-processor interrupts
...
I see no reason to do local flushes with a hypercall, so we'll only be
announcing the remote one.
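A sketch of the corresponding userspace side of that decision
(find_cpuid_entry() is a hypothetical helper, not an existing Qemu
function; the bit name is the one from hyperv-tlfs.h):

/* CPUID 0x40000004 (Implementation Recommendations), EAX */
struct kvm_cpuid_entry2 *ent = find_cpuid_entry(cpuid, 0x40000004);

/* Bit 2: remote TLB flushes via hypercall -- announce it. */
ent->eax |= HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED;
/* Bit 1: local flushes via hypercall -- deliberately left clear. */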
In case things are irreparably broken with WS2012, we can still have the
feature in KVM and decide whether we want to announce it to guests at a
higher level (probably even higher than Qemu, as this will just become
another 'hv_tlbflush' feature which needs to be enabled) based on the
Windows version.
--
Vitaly
2018-04-02 18:10+0200, Vitaly Kuznetsov:
> Implement HvFlushVirtualAddress{List,Space} hypercalls in a simplistic way:
> do full TLB flush with KVM_REQ_TLB_FLUSH and rely on kvm_vcpu_kick()
> kicking only vCPUs which are currently IN_GUEST_MODE.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/kvm/hyperv.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-------
> arch/x86/kvm/trace.h | 24 +++++++++++++++++++++++
> 2 files changed, 71 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 3cb3bb68db7e..aa866994366d 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1242,6 +1242,49 @@ int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> return kvm_hv_get_msr(vcpu, msr, pdata);
> }
>
> +static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
> + u16 rep_cnt)
> +{
> + struct kvm *kvm = current_vcpu->kvm;
> + struct hv_tlb_flush flush;
> + struct kvm_vcpu *vcpu;
> + int i;
> +
> + if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + trace_kvm_hv_flush_tlb(flush.processor_mask, flush.address_space,
> + flush.flags);
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
> +
> + if (!(flush.flags & HV_FLUSH_ALL_PROCESSORS) &&
> + !(flush.processor_mask & BIT_ULL(hv->vp_index)))
> + continue;
> +
> + /*
> + * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we
> + * can't analyze it here, flush TLB regardless of the specified
> + * address space.
> + */
> + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> +
> + /*
> + * It is very unlikely but possible that we're doing an extra
> + * kick here (e.g. if the vCPU has just entered the guest and
> + * has its TLB flushed).
> + */
> + if (vcpu != current_vcpu)
> + kvm_vcpu_kick(vcpu);
The spec says that
"This call guarantees that by the time control returns back to the
caller, the observable effects of all flushes on the specified virtual
processors have occurred."
Other KVM code doesn't assume that kvm_vcpu_kick() and a delay provide
that guarantee; kvm_make_all_cpus_request() waits for the target CPU to
exit before saying that the TLB has been flushed.
I am leaning towards the safer variant here as well. (Anyway, it's a
good time to figure out if we really need it.)
> + }
> +
> + /* We always do full TLB flush, set rep_done = rep_cnt. */
> + return (u64)HV_STATUS_SUCCESS |
> + ((u64)rep_cnt << HV_HYPERCALL_REP_START_OFFSET) |
Why at bits 48-59? I don't see this field in the spec.
> + ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
> +}
> +
> bool kvm_hv_hypercall_enabled(struct kvm *kvm)
> {
> return READ_ONCE(kvm->arch.hyperv.hv_hypercall) & HV_X64_MSR_HYPERCALL_ENABLE;
> @@ -1345,12 +1388,6 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>
> trace_kvm_hv_hypercall(code, fast, rep_cnt, rep_idx, ingpa, outgpa);
>
> - /* Hypercall continuation is not supported yet */
> - if (rep_cnt || rep_idx) {
> - ret = HV_STATUS_INVALID_HYPERCALL_CODE;
Hm, we should have returned HV_STATUS_INVALID_HYPERCALL_INPUT in any
case. I think it would be good to still fail in case of non-rep
hypercalls,
thanks.
Radim Krčmář <[email protected]> writes:
> 2018-04-02 18:10+0200, Vitaly Kuznetsov:
>> Implement HvFlushVirtualAddress{List,Space} hypercalls in a simplistic way:
>> do full TLB flush with KVM_REQ_TLB_FLUSH and rely on kvm_vcpu_kick()
>> kicking only vCPUs which are currently IN_GUEST_MODE.
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/kvm/hyperv.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-------
>> arch/x86/kvm/trace.h | 24 +++++++++++++++++++++++
>> 2 files changed, 71 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index 3cb3bb68db7e..aa866994366d 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -1242,6 +1242,49 @@ int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>> return kvm_hv_get_msr(vcpu, msr, pdata);
>> }
>>
>> +static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
>> + u16 rep_cnt)
>> +{
>> + struct kvm *kvm = current_vcpu->kvm;
>> + struct hv_tlb_flush flush;
>> + struct kvm_vcpu *vcpu;
>> + int i;
>> +
>> + if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
>> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
>> +
>> + trace_kvm_hv_flush_tlb(flush.processor_mask, flush.address_space,
>> + flush.flags);
>> +
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
>> +
>> + if (!(flush.flags & HV_FLUSH_ALL_PROCESSORS) &&
>> + !(flush.processor_mask & BIT_ULL(hv->vp_index)))
>> + continue;
>> +
>> + /*
>> + * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we
>> + * can't analyze it here, flush TLB regardless of the specified
>> + * address space.
>> + */
>> + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>> +
>> + /*
>> + * It is very unlikely but possible that we're doing an extra
>> + * kick here (e.g. if the vCPU has just entered the guest and
>> + * has its TLB flushed).
>> + */
>> + if (vcpu != current_vcpu)
>> + kvm_vcpu_kick(vcpu);
>
> The spec says that
>
> "This call guarantees that by the time control returns back to the
> caller, the observable effects of all flushes on the specified virtual
> processors have occurred."
>
> Other KVM code doesn't assume that kvm_vcpu_kick() and a delay provide
> that guarantee; kvm_make_all_cpus_request() waits for the target CPU to
> exit before saying that the TLB has been flushed.
>
> I am leaning towards the safer variant here as well. (Anyway, it's a
> good time to figure out if we really need it.)
Ha, it depends on how we define "observable effects" :-)
I think kvm_vcpu_kick() is enough, as the corresponding vCPU can't
actually observe the old mapping after being kicked (even if we haven't
flushed yet, the vCPU is not running). Or do you see any possible
problem with such a definition?
>
>> + }
>> +
>> + /* We always do full TLB flush, set rep_done = rep_cnt. */
>> + return (u64)HV_STATUS_SUCCESS |
>> + ((u64)rep_cnt << HV_HYPERCALL_REP_START_OFFSET) |
>
> Why at bits 48-59? I don't see this field in the spec.
>
True, it is only for 'input'. Will drop.
>> + ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
>> +}
>> +
>> bool kvm_hv_hypercall_enabled(struct kvm *kvm)
>> {
>> return READ_ONCE(kvm->arch.hyperv.hv_hypercall) & HV_X64_MSR_HYPERCALL_ENABLE;
>> @@ -1345,12 +1388,6 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>>
>> trace_kvm_hv_hypercall(code, fast, rep_cnt, rep_idx, ingpa, outgpa);
>>
>> - /* Hypercall continuation is not supported yet */
>> - if (rep_cnt || rep_idx) {
>> - ret = HV_STATUS_INVALID_HYPERCALL_CODE;
>
> Hm, we should have returned HV_STATUS_INVALID_HYPERCALL_INPUT in any
> case. I think it would be good to still fail in case of non-rep
> hypercalls,
Sure. I skimmed through the spec and didn't find any direct statement
that setting the 'rep' bits for non-rep hypercalls is forbidden, but a
guest doing so is definitely buggy.
Thanks for the review!
--
Vitaly
Vitaly Kuznetsov <[email protected]> writes:
> Radim Krčmář <[email protected]> writes:
>
>> 2018-04-02 18:10+0200, Vitaly Kuznetsov:
>>> + if (vcpu != current_vcpu)
>>> + kvm_vcpu_kick(vcpu);
>>
>> The spec says that
>>
>> "This call guarantees that by the time control returns back to the
>> caller, the observable effects of all flushes on the specified virtual
>> processors have occurred."
>>
>> Other KVM code doesn't assume that kvm_vcpu_kick() and a delay provide
>> that guarantee; kvm_make_all_cpus_request() waits for the target CPU to
>> exit before saying that the TLB has been flushed.
>>
>> I am leaning towards the safer variant here as well. (Anyway, it's a
>> good time to figure out if we really need it.)
>
> Ha, it depends on how we define "observable effects" :-)
>
> I think kvm_vcpu_kick() is enough, as the corresponding vCPU can't
> actually observe the old mapping after being kicked (even if we haven't
> flushed yet, the vCPU is not running). Or do you see any possible
> problem with such a definition?
>
Oh, now I see it myself -- native_smp_send_reschedule() only does
apic->send_IPI(), so this is indeed unsafe. We need something like
kvm_make_all_cpus_request() with a mask (and, to make it fast, we'll
probably have to pre-allocate the cpumasks).
Will do in v2, thanks!
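A rough sketch of what such a masked variant could look like (names and
details are guesses modeled on kvm_make_all_cpus_request(); ack_flush()
is the existing empty IPI callback in virt/kvm/kvm_main.c):

bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
				 unsigned long *vcpu_bitmap, cpumask_var_t tmp)
{
	struct kvm_vcpu *vcpu;
	int i, cpu, me = get_cpu();

	cpumask_clear(tmp);
	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (!test_bit(i, vcpu_bitmap))
			continue;

		kvm_make_request(req, vcpu);

		cpu = vcpu->cpu;
		if (cpu != -1 && cpu != me)
			cpumask_set_cpu(cpu, tmp);
	}

	/*
	 * Unlike a bare kvm_vcpu_kick(), smp_call_function_many() with
	 * wait == true returns only after the IPIs have been handled, so
	 * the "flushes have occurred before control returns" guarantee
	 * from the spec holds.
	 */
	if (!cpumask_empty(tmp))
		smp_call_function_many(tmp, ack_flush, NULL, true);
	put_cpu();

	return true;
}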
--
Vitaly