Changes since v2:
- Fixed a sparse warning in PATCH1 [kbuild test robot]
Description:
This is both a new feature and a bugfix.
Bugfix description:
It was found that Windows 2016 guests on KVM crash when they have > 64
vCPUs, a non-flat topology (more than 1 core/thread per socket; with > 64
sockets Windows simply ignores the vCPUs above 64) and Hyper-V
enlightenments (any) enabled. The most common error reported is "PAGE FAULT
IN NONPAGED AREA" but I saw different messages. Apparently, Windows doesn't
expect to run on a Hyper-V server without PV TLB flush support, as there
are no such Hyper-V servers out there (AFAIR only WS2016 supports > 64
vCPUs).
Adding PV TLB flush support to KVM helps: Windows 2016 guests now boot
normally (I tried '-smp 128,sockets=64,cores=1,threads=2' and
'-smp 128,sockets=8,cores=16,threads=1', but other topologies should work
too).
Feature description:
PV TLB flush helps a lot when running overcommitted. KVM gained support for
it recently but it is only available for Linux guests. Windows guests use
the emulated Hyper-V interface, so PV TLB flush needs to be added there.
I tested a WS2016 guest with 128 vCPUs running on a 12 pCPU server. The test
ran 65 threads, each doing 50 mmap()/munmap() rounds of 16384 pages with a
tiny random nanosleep in between (I used Cygwin; it would be great if
someone could point me to a good Windows-native TLB-thrashing test).
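The core of the test is nothing fancy; a minimal POSIX sketch of it is
below (this is not the exact Cygwin program I used, and touching every page
after mmap() as well as the sleep range are assumptions of the sketch):

#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NR_THREADS	65
#define NR_ROUNDS	50
#define NR_PAGES	16384

static void *thrash(void *arg)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)NR_PAGES * psz, off;
	struct timespec ts = { 0, 0 };
	int i;

	(void)arg;
	for (i = 0; i < NR_ROUNDS; i++) {
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			break;
		/* Touch every page so munmap() has TLB entries to shoot down. */
		for (off = 0; off < len; off += psz)
			p[off] = 1;
		munmap(p, len);

		/* Tiny random pause between rounds. */
		ts.tv_nsec = rand() % 100000;
		nanosleep(&ts, NULL);
	}
	return NULL;
}

int main(void)
{
	pthread_t thr[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&thr[i], NULL, thrash, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(thr[i], NULL);
	return 0;
}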
The average results are:
Before:
real 0m22.464s
user 0m0.990s
sys 1m26.3276s
After:
real 0m19.304s
user 0m0.908s
sys 0m36.249s
When running without overcommit, the results of the same test are very
close, so the feature can be enabled by default.
Implementation details:
The implementation is very simplistic and straightforward: we ignore the
'address space' argument of the hypercalls (there is no good way to figure
out what's currently in CR3 of a running vCPU as we generally don't VMEXIT
on guest CR3 writes) and do a full TLB flush on the specified vCPUs. In
case said vCPUs are not running, the TLB flush will be performed upon guest
entry.
Qemu (and other userspaces) need to enable CPUID feature bits to make
Windows aware that the feature is supported. I'll post the Qemu enablement
patch separately.
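For illustration, the userspace side boils down to probing the new
capability and, when it is there, setting the 'recommend hypercall for
remote TLB flush' bit (EAX bit 2 of Hyper-V CPUID leaf 0x40000004) in the
CPUID data built for the guest. A minimal probe sketch (not the actual Qemu
patch):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_CAP_HYPERV_TLBFLUSH
#define KVM_CAP_HYPERV_TLBFLUSH 155	/* from PATCH6 */
#endif

int main(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);

	if (kvm_fd < 0)
		return 1;

	/* > 0 means this kernel implements the TLB flush hypercalls. */
	printf("KVM_CAP_HYPERV_TLBFLUSH: %d\n",
	       ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_HYPERV_TLBFLUSH));

	/*
	 * A VMM would now set the remote TLB flush recommendation bit in
	 * the Hyper-V CPUID leaf 0x40000004 it exposes to the Windows
	 * guest.
	 */
	return 0;
}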
Patches are based on the current kvm/queue branch.
Vitaly Kuznetsov (6):
x86/hyper-v: move struct hv_flush_pcpu{,ex} definitions to common
header
KVM: x86: hyperv: use defines when parsing hypercall parameters
KVM: x86: hyperv: do rep check for each hypercall separately
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}
implementation
KVM: x86: hyperv: simplistic
HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation
KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability
Documentation/virtual/kvm/api.txt | 9 ++
arch/x86/hyperv/mmu.c | 40 ++-----
arch/x86/include/asm/hyperv-tlfs.h | 20 ++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/hyperv.c | 217 ++++++++++++++++++++++++++++++++++---
arch/x86/kvm/trace.h | 51 +++++++++
arch/x86/kvm/x86.c | 1 +
include/uapi/linux/kvm.h | 1 +
8 files changed, 297 insertions(+), 43 deletions(-)
--
2.14.3
Hyper-V TLB flush hypercall definitions will be required for KVM so move
them to hyperv-tlfs.h. The structures also need to be renamed as the
'_pcpu' suffix is invalid for a general-purpose definition.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/hyperv/mmu.c | 40 ++++++++++----------------------------
arch/x86/include/asm/hyperv-tlfs.h | 20 +++++++++++++++++++
2 files changed, 30 insertions(+), 30 deletions(-)
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 56c9ebac946f..528a1f34df96 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -13,32 +13,12 @@
#define CREATE_TRACE_POINTS
#include <asm/trace/hyperv.h>
-/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
-struct hv_flush_pcpu {
- u64 address_space;
- u64 flags;
- u64 processor_mask;
- u64 gva_list[];
-};
-
-/* HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressListEx hypercalls */
-struct hv_flush_pcpu_ex {
- u64 address_space;
- u64 flags;
- struct {
- u64 format;
- u64 valid_bank_mask;
- u64 bank_contents[];
- } hv_vp_set;
- u64 gva_list[];
-};
-
/* Each gva in gva_list encodes up to 4096 pages to flush */
#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
-static struct hv_flush_pcpu __percpu **pcpu_flush;
+static struct hv_tlb_flush * __percpu *pcpu_flush;
-static struct hv_flush_pcpu_ex __percpu **pcpu_flush_ex;
+static struct hv_tlb_flush_ex * __percpu *pcpu_flush_ex;
/*
* Fills in gva_list starting from offset. Returns the number of items added.
@@ -71,7 +51,7 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
}
/* Return the number of banks in the resulting vp_set */
-static inline int cpumask_to_vp_set(struct hv_flush_pcpu_ex *flush,
+static inline int cpumask_to_vp_set(struct hv_tlb_flush_ex *flush,
const struct cpumask *cpus)
{
int cpu, vcpu, vcpu_bank, vcpu_offset, nr_bank = 1;
@@ -81,7 +61,7 @@ static inline int cpumask_to_vp_set(struct hv_flush_pcpu_ex *flush,
return 0;
/*
- * Clear all banks up to the maximum possible bank as hv_flush_pcpu_ex
+ * Clear all banks up to the maximum possible bank as hv_tlb_flush_ex
* structs are not cleared between calls, we risk flushing unneeded
* vCPUs otherwise.
*/
@@ -109,8 +89,8 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
const struct flush_tlb_info *info)
{
int cpu, vcpu, gva_n, max_gvas;
- struct hv_flush_pcpu **flush_pcpu;
- struct hv_flush_pcpu *flush;
+ struct hv_tlb_flush **flush_pcpu;
+ struct hv_tlb_flush *flush;
u64 status = U64_MAX;
unsigned long flags;
@@ -196,8 +176,8 @@ static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
const struct flush_tlb_info *info)
{
int nr_bank = 0, max_gvas, gva_n;
- struct hv_flush_pcpu_ex **flush_pcpu;
- struct hv_flush_pcpu_ex *flush;
+ struct hv_tlb_flush_ex **flush_pcpu;
+ struct hv_tlb_flush_ex *flush;
u64 status = U64_MAX;
unsigned long flags;
@@ -303,7 +283,7 @@ void hyper_alloc_mmu(void)
return;
if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
- pcpu_flush = alloc_percpu(struct hv_flush_pcpu *);
+ pcpu_flush = alloc_percpu(struct hv_tlb_flush *);
else
- pcpu_flush_ex = alloc_percpu(struct hv_flush_pcpu_ex *);
+ pcpu_flush_ex = alloc_percpu(struct hv_tlb_flush_ex *);
}
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 1c602ad4bda8..ccb1dffe0d84 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -703,4 +703,24 @@ struct hv_enlightened_vmcs {
#define HV_STIMER_AUTOENABLE (1ULL << 3)
#define HV_STIMER_SINT(config) (__u8)(((config) >> 16) & 0x0F)
+/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
+struct hv_tlb_flush {
+ u64 address_space;
+ u64 flags;
+ u64 processor_mask;
+ u64 gva_list[];
+};
+
+/* HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressListEx hypercalls */
+struct hv_tlb_flush_ex {
+ u64 address_space;
+ u64 flags;
+ struct {
+ u64 format;
+ u64 valid_bank_mask;
+ u64 bank_contents[];
+ } hv_vp_set;
+ u64 gva_list[];
+};
+
#endif
--
2.14.3
Prepare to support TLB flush hypercalls, some of which are REP hypercalls:
replace the blanket 'hypercall continuation is not supported' check with a
per-hypercall rep check. Also, return HV_STATUS_INVALID_HYPERCALL_INPUT
instead of HV_STATUS_INVALID_HYPERCALL_CODE as it seems more appropriate.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kvm/hyperv.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 3cb3bb68db7e..67788d358200 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1309,7 +1309,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
{
u64 param, ingpa, outgpa, ret = HV_STATUS_SUCCESS;
uint16_t code, rep_idx, rep_cnt;
- bool fast, longmode;
+ bool fast, longmode, rep;
/*
* hypercall generates UD from non zero cpl and real mode
@@ -1342,28 +1342,31 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
fast = !!(param & HV_HYPERCALL_FAST_BIT);
rep_cnt = (param >> HV_HYPERCALL_REP_COMP_OFFSET) & 0xfff;
rep_idx = (param >> HV_HYPERCALL_REP_START_OFFSET) & 0xfff;
+ rep = !!(rep_cnt || rep_idx);
trace_kvm_hv_hypercall(code, fast, rep_cnt, rep_idx, ingpa, outgpa);
- /* Hypercall continuation is not supported yet */
- if (rep_cnt || rep_idx) {
- ret = HV_STATUS_INVALID_HYPERCALL_CODE;
- goto set_result;
- }
-
switch (code) {
case HVCALL_NOTIFY_LONG_SPIN_WAIT:
+ if (unlikely(rep)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
kvm_vcpu_on_spin(vcpu, true);
break;
case HVCALL_SIGNAL_EVENT:
+ if (unlikely(rep)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
ret = kvm_hvcall_signal_event(vcpu, fast, ingpa);
if (ret != HV_STATUS_INVALID_PORT_ID)
break;
/* maybe userspace knows this conn_id: fall through */
case HVCALL_POST_MESSAGE:
/* don't bother userspace if it has no way to handle it */
- if (!vcpu_to_synic(vcpu)->active) {
- ret = HV_STATUS_INVALID_HYPERCALL_CODE;
+ if (unlikely(rep || !vcpu_to_synic(vcpu)->active)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
break;
}
vcpu->run->exit_reason = KVM_EXIT_HYPERV;
--
2.14.3
Implement HvFlushVirtualAddress{List,Space}Ex hypercalls in a simplistic
way: do full TLB flush with KVM_REQ_TLB_FLUSH and kick vCPUs which are
currently IN_GUEST_MODE.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kvm/hyperv.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/trace.h | 27 ++++++++++++
2 files changed, 143 insertions(+)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index fa26af1e8b7c..7028cd58d5f4 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1301,6 +1301,108 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
}
+static __always_inline int get_sparse_bank_no(u64 valid_bank_mask, int bank_no)
+{
+ int i = 0, j;
+
+ if (!(valid_bank_mask & BIT_ULL(bank_no)))
+ return -1;
+
+ for (j = 0; j < bank_no; j++)
+ if (valid_bank_mask & BIT_ULL(j))
+ i++;
+
+ return i;
+}
+
+static __always_inline int load_bank_guest(struct kvm *kvm, u64 ingpa,
+ int sparse_bank, u64 *bank_contents)
+{
+ int offset;
+
+ offset = offsetof(struct hv_tlb_flush_ex, hv_vp_set.bank_contents) +
+ sizeof(u64) * sparse_bank;
+
+ if (unlikely(kvm_read_guest(kvm, ingpa + offset,
+ bank_contents, sizeof(u64))))
+ return 1;
+
+ return 0;
+}
+
+static int kvm_hv_flush_tlb_ex(struct kvm_vcpu *current_vcpu, u64 ingpa,
+ u16 rep_cnt)
+{
+ struct kvm *kvm = current_vcpu->kvm;
+ struct kvm_vcpu_hv *hv_current = &current_vcpu->arch.hyperv;
+ struct hv_tlb_flush_ex flush;
+ struct kvm_vcpu *vcpu;
+ u64 bank_contents, valid_bank_mask;
+ int i, cpu, me, current_sparse_bank = -1;
+ u64 ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
+ return ret;
+
+ valid_bank_mask = flush.hv_vp_set.valid_bank_mask;
+
+ trace_kvm_hv_flush_tlb_ex(valid_bank_mask, flush.hv_vp_set.format,
+ flush.address_space, flush.flags);
+
+ cpumask_clear(&hv_current->tlb_lush);
+
+ me = get_cpu();
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
+ int bank = hv->vp_index / 64, sparse_bank;
+
+ if (flush.hv_vp_set.format == HV_GENERIC_SET_SPARCE_4K) {
+ /* Check is the bank of this vCPU is in sparse set */
+ sparse_bank = get_sparse_bank_no(valid_bank_mask, bank);
+ if (sparse_bank < 0)
+ continue;
+
+ /*
+ * Assume hv->vp_index is in ascending order and we can
+ * optimize by not reloading bank contents for every
+ * vCPU.
+ */
+ if (sparse_bank != current_sparse_bank) {
+ if (load_bank_guest(kvm, ingpa, sparse_bank,
+ &bank_contents))
+ return ret;
+ current_sparse_bank = sparse_bank;
+ }
+
+ if (!(bank_contents & BIT_ULL(hv->vp_index % 64)))
+ continue;
+ }
+
+ kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+
+ /*
+ * It is possible that vCPU will migrate and we will kick wrong
+ * CPU but vCPU's TLB will anyway be flushed upon migration as
+ * we already made KVM_REQ_TLB_FLUSH request.
+ */
+ cpu = vcpu->cpu;
+ if (cpu != -1 && cpu != me && cpu_online(cpu) &&
+ kvm_arch_vcpu_should_kick(vcpu))
+ cpumask_set_cpu(cpu, &hv_current->tlb_lush);
+ }
+
+ if (!cpumask_empty(&hv_current->tlb_lush))
+ smp_call_function_many(&hv_current->tlb_lush, ack_flush,
+ NULL, true);
+
+ put_cpu();
+
+ /* We always do full TLB flush, set rep_done = rep_cnt. */
+ return (u64)HV_STATUS_SUCCESS |
+ ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
+}
+
bool kvm_hv_hypercall_enabled(struct kvm *kvm)
{
return READ_ONCE(kvm->arch.hyperv.hv_hypercall) & HV_X64_MSR_HYPERCALL_ENABLE;
@@ -1450,6 +1552,20 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
}
ret = kvm_hv_flush_tlb(vcpu, ingpa, rep_cnt);
break;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
+ if (unlikely(fast || !rep_cnt || rep_idx)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ ret = kvm_hv_flush_tlb_ex(vcpu, ingpa, rep_cnt);
+ break;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
+ if (unlikely(fast || rep)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ ret = kvm_hv_flush_tlb_ex(vcpu, ingpa, rep_cnt);
+ break;
default:
ret = HV_STATUS_INVALID_HYPERCALL_CODE;
break;
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 47a4fd758743..0f997683404f 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1391,6 +1391,33 @@ TRACE_EVENT(kvm_hv_flush_tlb,
__entry->processor_mask, __entry->address_space,
__entry->flags)
);
+
+/*
+ * Tracepoint for kvm_hv_flush_tlb_ex.
+ */
+TRACE_EVENT(kvm_hv_flush_tlb_ex,
+ TP_PROTO(u64 valid_bank_mask, u64 format, u64 address_space, u64 flags),
+ TP_ARGS(valid_bank_mask, format, address_space, flags),
+
+ TP_STRUCT__entry(
+ __field(u64, valid_bank_mask)
+ __field(u64, format)
+ __field(u64, address_space)
+ __field(u64, flags)
+ ),
+
+ TP_fast_assign(
+ __entry->valid_bank_mask = valid_bank_mask;
+ __entry->format = format;
+ __entry->address_space = address_space;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("valid_bank_mask 0x%llx format 0x%llx "
+ "address_space 0x%llx flags 0x%llx",
+ __entry->valid_bank_mask, __entry->format,
+ __entry->address_space, __entry->flags)
+);
#endif /* _TRACE_KVM_H */
#undef TRACE_INCLUDE_PATH
--
2.14.3
Implement HvFlushVirtualAddress{List,Space} hypercalls in a simplistic way:
do full TLB flush with KVM_REQ_TLB_FLUSH and kick vCPUs which are currently
IN_GUEST_MODE.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/hyperv.c | 74 ++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/trace.h | 24 +++++++++++++
3 files changed, 98 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c25775fad4ed..9ce1fb2b6af3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -476,6 +476,7 @@ struct kvm_vcpu_hv {
struct kvm_hyperv_exit exit;
struct kvm_vcpu_hv_stimer stimer[HV_SYNIC_STIMER_COUNT];
DECLARE_BITMAP(stimer_pending_bitmap, HV_SYNIC_STIMER_COUNT);
+ cpumask_t tlb_lush;
};
struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 67788d358200..fa26af1e8b7c 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1242,6 +1242,65 @@ int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
return kvm_hv_get_msr(vcpu, msr, pdata);
}
+static void ack_flush(void *_completed)
+{
+}
+
+static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
+ u16 rep_cnt)
+{
+ struct kvm *kvm = current_vcpu->kvm;
+ struct kvm_vcpu_hv *hv_current = &current_vcpu->arch.hyperv;
+ struct hv_tlb_flush flush;
+ struct kvm_vcpu *vcpu;
+ int i, cpu, me;
+
+ if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ trace_kvm_hv_flush_tlb(flush.processor_mask, flush.address_space,
+ flush.flags);
+
+ cpumask_clear(&hv_current->tlb_lush);
+
+ me = get_cpu();
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
+
+ if (!(flush.flags & HV_FLUSH_ALL_PROCESSORS) &&
+ !(flush.processor_mask & BIT_ULL(hv->vp_index)))
+ continue;
+
+ /*
+ * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we
+ * can't analyze it here, flush TLB regardless of the specified
+ * address space.
+ */
+ kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+
+ /*
+ * It is possible that vCPU will migrate and we will kick wrong
+ * CPU but vCPU's TLB will anyway be flushed upon migration as
+ * we already made KVM_REQ_TLB_FLUSH request.
+ */
+ cpu = vcpu->cpu;
+ if (cpu != -1 && cpu != me && cpu_online(cpu) &&
+ kvm_arch_vcpu_should_kick(vcpu))
+ cpumask_set_cpu(cpu, &hv_current->tlb_lush);
+ }
+
+ if (!cpumask_empty(&hv_current->tlb_lush))
+ smp_call_function_many(&hv_current->tlb_lush, ack_flush,
+ NULL, true);
+
+ put_cpu();
+
+ /* We always do full TLB flush, set rep_done = rep_cnt. */
+ return (u64)HV_STATUS_SUCCESS |
+ ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
+}
+
bool kvm_hv_hypercall_enabled(struct kvm *kvm)
{
return READ_ONCE(kvm->arch.hyperv.hv_hypercall) & HV_X64_MSR_HYPERCALL_ENABLE;
@@ -1377,12 +1436,25 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
vcpu->arch.complete_userspace_io =
kvm_hv_hypercall_complete_userspace;
return 0;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST:
+ if (unlikely(fast || !rep_cnt || rep_idx)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ ret = kvm_hv_flush_tlb(vcpu, ingpa, rep_cnt);
+ break;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE:
+ if (unlikely(fast || rep)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ ret = kvm_hv_flush_tlb(vcpu, ingpa, rep_cnt);
+ break;
default:
ret = HV_STATUS_INVALID_HYPERCALL_CODE;
break;
}
-set_result:
kvm_hv_hypercall_set_result(vcpu, ret);
return 1;
}
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 9807c314c478..47a4fd758743 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1367,6 +1367,30 @@ TRACE_EVENT(kvm_hv_timer_state,
__entry->vcpu_id,
__entry->hv_timer_in_use)
);
+
+/*
+ * Tracepoint for kvm_hv_flush_tlb.
+ */
+TRACE_EVENT(kvm_hv_flush_tlb,
+ TP_PROTO(u64 processor_mask, u64 address_space, u64 flags),
+ TP_ARGS(processor_mask, address_space, flags),
+
+ TP_STRUCT__entry(
+ __field(u64, processor_mask)
+ __field(u64, address_space)
+ __field(u64, flags)
+ ),
+
+ TP_fast_assign(
+ __entry->processor_mask = processor_mask;
+ __entry->address_space = address_space;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("processor_mask 0x%llx address_space 0x%llx flags 0x%llx",
+ __entry->processor_mask, __entry->address_space,
+ __entry->flags)
+);
#endif /* _TRACE_KVM_H */
#undef TRACE_INCLUDE_PATH
--
2.14.3
We need a new capability to indicate support for the newly added
HvFlushVirtualAddress{List,Space}{,Ex} hypercalls. Upon seeing this
capability, userspace is supposed to announce PV TLB flush features
by setting the appropriate CPUID bits (if needed).
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
Documentation/virtual/kvm/api.txt | 9 +++++++++
arch/x86/kvm/x86.c | 1 +
include/uapi/linux/kvm.h | 1 +
3 files changed, 11 insertions(+)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 1c7958b57fe9..74e22e36b3c8 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4596,3 +4596,12 @@ Architectures: s390
This capability indicates that kvm will implement the interfaces to handle
reset, migration and nested KVM for branch prediction blocking. The stfle
facility 82 should not be provided to the guest without this capability.
+
+8.14 KVM_CAP_HYPERV_TLBFLUSH
+
+Architectures: x86
+
+This capability indicates that KVM supports paravirtualized Hyper-V TLB Flush
+hypercalls:
+HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx,
+HvFlushVirtualAddressList, HvFlushVirtualAddressListEx.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 51ecd381793b..c6784ea69dc6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2864,6 +2864,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_HYPERV_SYNIC2:
case KVM_CAP_HYPERV_VP_INDEX:
case KVM_CAP_HYPERV_EVENTFD:
+ case KVM_CAP_HYPERV_TLBFLUSH:
case KVM_CAP_PCI_SEGMENT:
case KVM_CAP_DEBUGREGS:
case KVM_CAP_X86_ROBUST_SINGLESTEP:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1065006c9bf5..30c193b43d89 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -941,6 +941,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_BPB 152
#define KVM_CAP_GET_MSR_FEATURES 153
#define KVM_CAP_HYPERV_EVENTFD 154
+#define KVM_CAP_HYPERV_TLBFLUSH 155
#ifdef KVM_CAP_IRQ_ROUTING
--
2.14.3
Avoid open-coding offsets for hypercall input parameters; we already
have defines for them.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kvm/hyperv.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 98618e397342..3cb3bb68db7e 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1339,9 +1339,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
#endif
code = param & 0xffff;
- fast = (param >> 16) & 0x1;
- rep_cnt = (param >> 32) & 0xfff;
- rep_idx = (param >> 48) & 0xfff;
+ fast = !!(param & HV_HYPERCALL_FAST_BIT);
+ rep_cnt = (param >> HV_HYPERCALL_REP_COMP_OFFSET) & 0xfff;
+ rep_idx = (param >> HV_HYPERCALL_REP_START_OFFSET) & 0xfff;
trace_kvm_hv_hypercall(code, fast, rep_cnt, rep_idx, ingpa, outgpa);
--
2.14.3
Vitaly Kuznetsov <[email protected]> writes:
>
> Qemu (and other userspaces) need to enable CPUID feature bits to make
> Windows aware that the feature is supported. I'll post the Qemu enablement
> patch separately.
Radim, Paolo,
I'd like to send the Qemu enablement patch for this feature, but before I
do that I'd like to get your feedback on this series.
Thanks!
--
Vitaly
2018-04-16 13:08+0200, Vitaly Kuznetsov:
> Hyper-V TLB flush hypercall definitions will be required for KVM so move
> them to hyperv-tlfs.h. The structures also need to be renamed as the
> '_pcpu' suffix is invalid for a general-purpose definition.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/hyperv/mmu.c | 40 ++++++++++----------------------------
> arch/x86/include/asm/hyperv-tlfs.h | 20 +++++++++++++++++++
> 2 files changed, 30 insertions(+), 30 deletions(-)
>
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -703,4 +703,24 @@ struct hv_enlightened_vmcs {
> #define HV_STIMER_AUTOENABLE (1ULL << 3)
> #define HV_STIMER_SINT(config) (__u8)(((config) >> 16) & 0x0F)
>
> +/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
> +struct hv_tlb_flush {
> + u64 address_space;
> + u64 flags;
> + u64 processor_mask;
> + u64 gva_list[];
> +};
> +
> +/* HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressListEx hypercalls */
> +struct hv_tlb_flush_ex {
> + u64 address_space;
> + u64 flags;
> + struct {
> + u64 format;
> + u64 valid_bank_mask;
> + u64 bank_contents[];
> + } hv_vp_set;
> + u64 gva_list[];
Why is the gva_list there?
2018-04-16 13:08+0200, Vitaly Kuznetsov:
> Implement HvFlushVirtualAddress{List,Space} hypercalls in a simplistic way:
> do full TLB flush with KVM_REQ_TLB_FLUSH and kick vCPUs which are currently
> IN_GUEST_MODE.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> @@ -1242,6 +1242,65 @@ int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> return kvm_hv_get_msr(vcpu, msr, pdata);
> }
>
> +static void ack_flush(void *_completed)
> +{
> +}
> +
> +static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
> + u16 rep_cnt)
> +{
> + struct kvm *kvm = current_vcpu->kvm;
> + struct kvm_vcpu_hv *hv_current = &current_vcpu->arch.hyperv;
> + struct hv_tlb_flush flush;
> + struct kvm_vcpu *vcpu;
> + int i, cpu, me;
> +
> + if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + trace_kvm_hv_flush_tlb(flush.processor_mask, flush.address_space,
> + flush.flags);
> +
> + cpumask_clear(&hv_current->tlb_lush);
> +
> + me = get_cpu();
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
> +
> + if (!(flush.flags & HV_FLUSH_ALL_PROCESSORS) &&
Please add a check to prevent undefined behavior in C:
(hv->vp_index >= 64 ||
> + !(flush.processor_mask & BIT_ULL(hv->vp_index)))
> + continue;
It would also fail in the wild as shl only considers the bottom 5 bits.
> + /*
> + * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we
> + * can't analyze it here, flush TLB regardless of the specified
> + * address space.
> + */
> + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> +
> + /*
> + * It is possible that vCPU will migrate and we will kick wrong
> + * CPU but vCPU's TLB will anyway be flushed upon migration as
> + * we already made KVM_REQ_TLB_FLUSH request.
> + */
> + cpu = vcpu->cpu;
> + if (cpu != -1 && cpu != me && cpu_online(cpu) &&
> + kvm_arch_vcpu_should_kick(vcpu))
> + cpumask_set_cpu(cpu, &hv_current->tlb_lush);
> + }
> +
> + if (!cpumask_empty(&hv_current->tlb_lush))
> + smp_call_function_many(&hv_current->tlb_lush, ack_flush,
> + NULL, true);
Hm, quite a lot of code duplication with EX hypercall and also
kvm_make_all_cpus_request ... I'm thinking about making something like
kvm_make_some_cpus_request(struct kvm *kvm, unsigned int req,
bool (*predicate)(struct kvm_vcpu *vcpu))
or to implement a vp_index -> vcpu mapping and using
kvm_vcpu_request_mask(struct kvm *kvm, unsigned int req, long *vcpu_bitmap)
The latter would probably simplify logic of the EX hypercall.
What do you think?
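For illustration, the first option could be a predicate-based wrapper
built from the same primitives the series already uses; the sketch below
is hypothetical (it is not existing KVM API) and, notably, still allocates
a cpumask on every call when CPUMASK_OFFSTACK:

bool kvm_make_some_cpus_request(struct kvm *kvm, unsigned int req,
				bool (*predicate)(struct kvm_vcpu *vcpu))
{
	struct kvm_vcpu *vcpu;
	cpumask_var_t cpus;
	bool called = false;
	int i, cpu, me;

	if (!zalloc_cpumask_var(&cpus, GFP_ATOMIC))
		return false;

	me = get_cpu();
	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (predicate && !predicate(vcpu))
			continue;

		kvm_make_request(req, vcpu);

		/* Kick vCPUs which are currently in guest mode. */
		cpu = vcpu->cpu;
		if (cpu != -1 && cpu != me && cpu_online(cpu) &&
		    kvm_arch_vcpu_should_kick(vcpu))
			cpumask_set_cpu(cpu, cpus);
	}

	if (!cpumask_empty(cpus)) {
		smp_call_function_many(cpus, ack_flush, NULL, true);
		called = true;
	}
	put_cpu();

	free_cpumask_var(cpus);
	return called;
}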
2018-04-16 13:08+0200, Vitaly Kuznetsov:
> Implement HvFlushVirtualAddress{List,Space}Ex hypercalls in a simplistic
> way: do full TLB flush with KVM_REQ_TLB_FLUSH and kick vCPUs which are
> currently IN_GUEST_MODE.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> @@ -1301,6 +1301,108 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
> ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
> }
>
> +static __always_inline int get_sparse_bank_no(u64 valid_bank_mask, int bank_no)
> +{
> + int i = 0, j;
> +
> + if (!(valid_bank_mask & BIT_ULL(bank_no)))
> + return -1;
> +
> + for (j = 0; j < bank_no; j++)
> + if (valid_bank_mask & BIT_ULL(j))
> + i++;
> +
> + return i;
> +}
> +
> +static __always_inline int load_bank_guest(struct kvm *kvm, u64 ingpa,
> + int sparse_bank, u64 *bank_contents)
> +{
> + int offset;
> +
> + offset = offsetof(struct hv_tlb_flush_ex, hv_vp_set.bank_contents) +
> + sizeof(u64) * sparse_bank;
> +
> + if (unlikely(kvm_read_guest(kvm, ingpa + offset,
> + bank_contents, sizeof(u64))))
> + return 1;
> +
> + return 0;
> +}
> +
> +static int kvm_hv_flush_tlb_ex(struct kvm_vcpu *current_vcpu, u64 ingpa,
> + u16 rep_cnt)
> +{
> + struct kvm *kvm = current_vcpu->kvm;
> + struct kvm_vcpu_hv *hv_current = &current_vcpu->arch.hyperv;
> + struct hv_tlb_flush_ex flush;
> + struct kvm_vcpu *vcpu;
> + u64 bank_contents, valid_bank_mask;
> + int i, cpu, me, current_sparse_bank = -1;
> + u64 ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
> + return ret;
> +
> + valid_bank_mask = flush.hv_vp_set.valid_bank_mask;
> +
> + trace_kvm_hv_flush_tlb_ex(valid_bank_mask, flush.hv_vp_set.format,
> + flush.address_space, flush.flags);
> +
> + cpumask_clear(&hv_current->tlb_lush);
> +
> + me = get_cpu();
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
> + int bank = hv->vp_index / 64, sparse_bank;
> +
> + if (flush.hv_vp_set.format == HV_GENERIC_SET_SPARCE_4K) {
^
typo in the define
> + /* Check is the bank of this vCPU is in sparse set */
> + sparse_bank = get_sparse_bank_no(valid_bank_mask, bank);
> + if (sparse_bank < 0)
> + continue;
> +
> + /*
> + * Assume hv->vp_index is in ascending order and we can
> + * optimize by not reloading bank contents for every
> + * vCPU.
> + */
Since sparse_bank is packed, we could compute how many bank_contents we
need to load and do it with one kvm_read_guest() into a local array;
it would be faster even if hv->vp_index were in ascending order and
wouldn't take that much memory (up to 512 B).
> + if (sparse_bank != current_sparse_bank) {
> + if (load_bank_guest(kvm, ingpa, sparse_bank,
> + &bank_contents))
> + return ret;
> + current_sparse_bank = sparse_bank;
> + }
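A rough sketch of that single-read variant (illustration only, not code
from the posted series; the local array is the 'up to 512 B' mentioned
above):

u64 bank_contents[64];	/* 64 banks x 8 bytes = up to 512 B on stack */
int nr_banks = hweight64(valid_bank_mask);
int contents_offset = offsetof(struct hv_tlb_flush_ex,
			       hv_vp_set.bank_contents);

/* One guest read for all present banks instead of one read per bank. */
if (unlikely(kvm_read_guest(kvm, ingpa + contents_offset,
			    bank_contents, nr_banks * sizeof(u64))))
	return HV_STATUS_INVALID_HYPERCALL_INPUT;

/* ... and later, inside the kvm_for_each_vcpu() loop: */
sparse_bank = get_sparse_bank_no(valid_bank_mask, bank);
if (sparse_bank < 0 ||
    !(bank_contents[sparse_bank] & BIT_ULL(hv->vp_index % 64)))
	continue;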
Radim Krčmář <[email protected]> writes:
> 2018-04-16 13:08+0200, Vitaly Kuznetsov:
>> Implement HvFlushVirtualAddress{List,Space} hypercalls in a simplistic way:
>> do full TLB flush with KVM_REQ_TLB_FLUSH and kick vCPUs which are currently
>> IN_GUEST_MODE.
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> @@ -1242,6 +1242,65 @@ int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>> return kvm_hv_get_msr(vcpu, msr, pdata);
>> }
>>
>> +static void ack_flush(void *_completed)
>> +{
>> +}
>> +
>> +static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
>> + u16 rep_cnt)
>> +{
>> + struct kvm *kvm = current_vcpu->kvm;
>> + struct kvm_vcpu_hv *hv_current = &current_vcpu->arch.hyperv;
>> + struct hv_tlb_flush flush;
>> + struct kvm_vcpu *vcpu;
>> + int i, cpu, me;
>> +
>> + if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
>> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
>> +
>> + trace_kvm_hv_flush_tlb(flush.processor_mask, flush.address_space,
>> + flush.flags);
>> +
>> + cpumask_clear(&hv_current->tlb_lush);
>> +
>> + me = get_cpu();
>> +
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
>> +
>> + if (!(flush.flags & HV_FLUSH_ALL_PROCESSORS) &&
>
> Please add a check to prevent undefined behavior in C:
>
> (hv->vp_index >= 64 ||
>
>> + !(flush.processor_mask & BIT_ULL(hv->vp_index)))
>> + continue;
>
> It would also fail in the wild as shl only considers the bottom 5 bits.
>
>> + /*
>> + * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we
>> + * can't analyze it here, flush TLB regardless of the specified
>> + * address space.
>> + */
>> + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>> +
>> + /*
>> + * It is possible that vCPU will migrate and we will kick wrong
>> + * CPU but vCPU's TLB will anyway be flushed upon migration as
>> + * we already made KVM_REQ_TLB_FLUSH request.
>> + */
>> + cpu = vcpu->cpu;
>> + if (cpu != -1 && cpu != me && cpu_online(cpu) &&
>> + kvm_arch_vcpu_should_kick(vcpu))
>> + cpumask_set_cpu(cpu, &hv_current->tlb_lush);
>> + }
>> +
>> + if (!cpumask_empty(&hv_current->tlb_lush))
>> + smp_call_function_many(&hv_current->tlb_lush, ack_flush,
>> + NULL, true);
>
> Hm, quite a lot of code duplication with EX hypercall and also
> kvm_make_all_cpus_request ... I'm thinking about making something like
>
> kvm_make_some_cpus_request(struct kvm *kvm, unsigned int req,
> bool (*predicate)(struct kvm_vcpu *vcpu))
>
> or to implement a vp_index -> vcpu mapping and using
>
> kvm_vcpu_request_mask(struct kvm *kvm, unsigned int req, long *vcpu_bitmap)
>
> The latter would probably simplify logic of the EX hypercall.
>
> What do you think?
Makes sense, I'll take a look. Thanks!
--
Vitaly
Radim Krčmář <[email protected]> writes:
> 2018-04-16 13:08+0200, Vitaly Kuznetsov:
...
>
>> + /*
>> + * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we
>> + * can't analyze it here, flush TLB regardless of the specified
>> + * address space.
>> + */
>> + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>> +
>> + /*
>> + * It is possible that vCPU will migrate and we will kick wrong
>> + * CPU but vCPU's TLB will anyway be flushed upon migration as
>> + * we already made KVM_REQ_TLB_FLUSH request.
>> + */
>> + cpu = vcpu->cpu;
>> + if (cpu != -1 && cpu != me && cpu_online(cpu) &&
>> + kvm_arch_vcpu_should_kick(vcpu))
>> + cpumask_set_cpu(cpu, &hv_current->tlb_lush);
>> + }
>> +
>> + if (!cpumask_empty(&hv_current->tlb_lush))
>> + smp_call_function_many(&hv_current->tlb_lush, ack_flush,
>> + NULL, true);
>
> Hm, quite a lot of code duplication with EX hypercall and also
> kvm_make_all_cpus_request ... I'm thinking about making something like
>
> kvm_make_some_cpus_request(struct kvm *kvm, unsigned int req,
> bool (*predicate)(struct kvm_vcpu *vcpu))
>
> or to implement a vp_index -> vcpu mapping and using
>
> kvm_vcpu_request_mask(struct kvm *kvm, unsigned int req, long *vcpu_bitmap)
>
> The latter would probably simplify logic of the EX hypercall.
We really want to avoid memory allocation for the cpumask on this path and
that's what kvm_make_all_cpus_request() currently does (when
CPUMASK_OFFSTACK). A vcpu bitmap is probably OK as KVM_MAX_VCPUS is much
lower.
Making cpumask allocation avoidable leads us to the following API:
bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
long *vcpu_bitmap, cpumask_var_t tmp);
or, if we want to prettify this a little bit, we may end up with the
following pair:
bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
long *vcpu_bitmap);
bool __kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
long *vcpu_bitmap, cpumask_var_t tmp);
and from hyperv code we'll use the latter. With this, no code duplication
is required.
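To make that concrete, the non-EX handler could then shrink to something
like the sketch below (purely illustrative: the helper does not exist yet
and the exact bitmap type may end up as 'unsigned long *' rather than
'long *'):

static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
			    u16 rep_cnt)
{
	struct kvm *kvm = current_vcpu->kvm;
	struct kvm_vcpu_hv *hv_current = &current_vcpu->arch.hyperv;
	DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
	struct hv_tlb_flush flush;
	struct kvm_vcpu *vcpu;
	int i;

	if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
		return HV_STATUS_INVALID_HYPERCALL_INPUT;

	bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS);

	/* Translate the guest-visible VP mask into a vcpu index bitmap. */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;

		if ((flush.flags & HV_FLUSH_ALL_PROCESSORS) ||
		    (hv->vp_index < 64 &&
		     (flush.processor_mask & BIT_ULL(hv->vp_index))))
			__set_bit(i, vcpu_bitmap);
	}

	/* The pre-existing per-vCPU cpumask serves as 'tmp', no allocation. */
	__kvm_make_vcpus_request_mask(kvm, KVM_REQ_TLB_FLUSH, vcpu_bitmap,
				      &hv_current->tlb_lush);

	return (u64)HV_STATUS_SUCCESS |
		((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
}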
Does this look acceptable?
--
Vitaly