2022-04-02 12:43:05

by Wanpeng Li

Subject: [PATCH v2 0/5] KVM: X86: Scaling Guest OS Critical Sections with boosting

There is a semantic gap between guest and host: when a vCPU is preempted
while the guest is executing a critical section, application scalability
degrades. This series tries to bridge that gap by passing the guest
preempt_count to the host and by checking the guest irq-disable state, so
the hypervisor knows whether a vCPU is running in a critical section. The
yield-on-spin heuristics can then be smarter and boost a vCPU candidate
that is in a critical section, mitigating the preemption problem; such a
vCPU is also more likely to be a lock holder.
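
In condensed form (a simplified sketch of the guest/host pieces, not the
literal diffs below), the guest registers the physical address of its
per-cpu __preempt_count through a new MSR, and the host consults it
together with the recorded irq-disable state when choosing a
directed-yield target:

  /* Guest side (patch 4, simplified): point the host at __preempt_count. */
  if (kvm_para_has_feature(KVM_FEATURE_PREEMPT_COUNT))
          wrmsrl(MSR_KVM_PREEMPT_COUNT,
                 slow_virt_to_phys(this_cpu_ptr(&__preempt_count)) | KVM_MSR_ENABLED);

  /* Host side (patch 3, simplified): in kvm_vcpu_on_spin(), only yield to
   * vCPUs that look like boost candidates, i.e. those with irqs disabled
   * or a non-zero preempt_count. */
  if (!kvm_arch_boost_candidate(vcpu))
          continue;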

Tested on a 2-socket, 96-HT Xeon CLX server. Each VM has 96 vCPUs and
100GB of RAM; one VM runs the benchmark while the remaining VMs run
CPU-bound workloads. There is no performance regression for other
benchmarks such as UnixBench.

1VM:
                     vanilla        optimized      improved
hackbench -l 50000   28             21.45          30.5%
ebizzy -M            12189          12354          1.4%
dbench               712 MB/sec     722 MB/sec     1.4%

2VM:
                     vanilla        optimized      improved
hackbench -l 10000   29.4           26             13%
ebizzy -M            3834           4033           5%
dbench               42.3 MB/sec    44.1 MB/sec    4.3%

3VM:
                     vanilla        optimized      improved
hackbench -l 10000   47             35.46          33%
ebizzy -M            3828           4031           5%
dbench               30.5 MB/sec    31.16 MB/sec   2.3%

v1 -> v2:
* Add more comments about the irq-disable state
* Rename irq_disabled to last_guest_irq_disabled
* Rename kvm_vcpu_non_preemptable (now kvm_vcpu_is_preemptible), invert the return value, and return a bool

Wanpeng Li (5):
KVM: X86: Add MSR_KVM_PREEMPT_COUNT support
KVM: X86: Add last guest interrupt disable state support
KVM: X86: Boost vCPU which is in critical section
x86/kvm: Add MSR_KVM_PREEMPT_COUNT guest support
KVM: X86: Expose PREEMPT_COUNT CPUID feature bit to guest

Documentation/virt/kvm/cpuid.rst | 3 ++
arch/x86/include/asm/kvm_host.h | 8 ++++
arch/x86/include/uapi/asm/kvm_para.h | 2 +
arch/x86/kernel/kvm.c | 10 +++++
arch/x86/kvm/cpuid.c | 3 +-
arch/x86/kvm/x86.c | 60 ++++++++++++++++++++++++++++
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 7 ++++
8 files changed, 93 insertions(+), 1 deletion(-)

--
2.25.1


2022-04-02 16:53:49

by Wanpeng Li

Subject: [PATCH v2 4/5] x86/kvm: Add MSR_KVM_PREEMPT_COUNT guest support

From: Wanpeng Li <[email protected]>

Have the x86 guest expose its per-cpu preempt_count to the hypervisor by
writing the physical address of __preempt_count (with KVM_MSR_ENABLED set)
to MSR_KVM_PREEMPT_COUNT, so the hypervisor knows whether the guest is
running in a critical section.

Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kernel/kvm.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 79e0b8d63ffa..5b900334de6e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -366,6 +366,14 @@ static void kvm_guest_cpu_init(void)

if (has_steal_clock)
kvm_register_steal_time();
+
+ if (kvm_para_has_feature(KVM_FEATURE_PREEMPT_COUNT)) {
+ u64 pa = slow_virt_to_phys(this_cpu_ptr(&__preempt_count))
+ | KVM_MSR_ENABLED;
+ wrmsrl(MSR_KVM_PREEMPT_COUNT, pa);
+
+ pr_debug("setup pv preempt_count: cpu %d\n", smp_processor_id());
+ }
}

static void kvm_pv_disable_apf(void)
@@ -442,6 +450,8 @@ static void kvm_guest_cpu_offline(bool shutdown)
if (!shutdown)
apf_task_wake_all();
kvmclock_disable();
+ if (kvm_para_has_feature(KVM_FEATURE_PREEMPT_COUNT))
+ wrmsrl(MSR_KVM_PREEMPT_COUNT, 0);
}

static int kvm_cpu_online(unsigned int cpu)
--
2.25.1

2022-04-04 08:19:41

by Wanpeng Li

Subject: [PATCH v2 3/5] KVM: X86: Boost vCPU which is in critical section

From: Wanpeng Li <[email protected]>

There is a semantic gap between guest and host: when a vCPU is preempted
while the guest is executing a critical section, application scalability
degrades. This series tries to bridge that gap by passing the guest
preempt_count to the host and by checking the guest irq-disable state, so
the hypervisor knows whether a vCPU is running in a critical section. The
yield-on-spin heuristics can then be smarter and boost a vCPU candidate
that is in a critical section, mitigating the preemption problem; such a
vCPU is also more likely to be a lock holder.

Tested on a 2-socket, 96-HT Xeon CLX server. Each VM has 96 vCPUs and
100GB of RAM; one VM runs the benchmark while the remaining VMs run
CPU-bound workloads. There is no performance regression for other
benchmarks such as UnixBench.

1VM:
                     vanilla        optimized      improved
hackbench -l 50000   28             21.45          30.5%
ebizzy -M            12189          12354          1.4%
dbench               712 MB/sec     722 MB/sec     1.4%

2VM:
                     vanilla        optimized      improved
hackbench -l 10000   29.4           26             13%
ebizzy -M            3834           4033           5%
dbench               42.3 MB/sec    44.1 MB/sec    4.3%

3VM:
                     vanilla        optimized      improved
hackbench -l 10000   47             35.46          33%
ebizzy -M            3828           4031           5%
dbench               30.5 MB/sec    31.16 MB/sec   2.3%

Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/x86.c | 22 ++++++++++++++++++++++
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 7 +++++++
3 files changed, 30 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9aa05f79b743..b613cd2b822a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10377,6 +10377,28 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
return r;
}

+static bool kvm_vcpu_is_preemptible(struct kvm_vcpu *vcpu)
+{
+ int count;
+
+ if (!vcpu->arch.pv_pc.preempt_count_enabled)
+ return false;
+
+ if (!kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_pc.preempt_count_cache,
+ &count, sizeof(int)))
+ return !(count & ~PREEMPT_NEED_RESCHED);
+
+ return false;
+}
+
+bool kvm_arch_boost_candidate(struct kvm_vcpu *vcpu)
+{
+ if (vcpu->arch.last_guest_irq_disabled || !kvm_vcpu_is_preemptible(vcpu))
+ return true;
+
+ return false;
+}
+
static inline int complete_emulated_io(struct kvm_vcpu *vcpu)
{
int r;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9536ffa0473b..28d9e99284f1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1420,6 +1420,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
int kvm_arch_post_init_vm(struct kvm *kvm);
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_boost_candidate(struct kvm_vcpu *vcpu);

#ifndef __KVM_HAVE_ARCH_VM_ALLOC
/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 69c318fdff61..018a87af01a1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3544,6 +3544,11 @@ bool __weak kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu)
return false;
}

+bool __weak kvm_arch_boost_candidate(struct kvm_vcpu *vcpu)
+{
+ return true;
+}
+
void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
{
struct kvm *kvm = me->kvm;
@@ -3579,6 +3584,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
!kvm_arch_dy_has_pending_interrupt(vcpu) &&
!kvm_arch_vcpu_in_kernel(vcpu))
continue;
+ if (!kvm_arch_boost_candidate(vcpu))
+ continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;

--
2.25.1

2022-04-04 13:25:09

by Wanpeng Li

Subject: [PATCH v2 1/5] KVM: X86: Add MSR_KVM_PREEMPT_COUNT support

From: Wanpeng Li <[email protected]>

The x86 preempt_count is per-CPU; any non-zero value indicates that either
preemption has been disabled explicitly or that the CPU is currently
servicing some sort of interrupt. The guest exposes this value to the
hypervisor so the hypervisor knows whether the guest is running in a
critical section.
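
For reference (and relevant to the ABI concern raised later in the
thread), the preempt_count bit layout on x86 at the time of this series is
roughly the following; it is kernel-internal and not guaranteed to stay
stable:

  /*
   * Rough layout of the 32-bit per-cpu __preempt_count on x86
   * (see include/linux/preempt.h and arch/x86/include/asm/preempt.h):
   *
   *         PREEMPT_MASK: 0x000000ff  - explicit preempt_disable() nesting
   *         SOFTIRQ_MASK: 0x0000ff00  - softirq context / BH disabled
   *         HARDIRQ_MASK: 0x000f0000  - hardirq context
   *             NMI_MASK: 0x00f00000  - NMI context
   * PREEMPT_NEED_RESCHED: 0x80000000  - x86 folds an inverted need-resched
   *                                     bit into the count
   *
   * A value with no bits set outside PREEMPT_NEED_RESCHED therefore means
   * the vCPU is neither in a preempt-disabled region nor in interrupt
   * context, which is what kvm_vcpu_is_preemptible() in patch 3 tests.
   */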

Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 +++++
arch/x86/include/uapi/asm/kvm_para.h | 2 ++
arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++
3 files changed, 43 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4138939532c6..c13c9ed50903 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -852,6 +852,12 @@ struct kvm_vcpu_arch {

u64 msr_kvm_poll_control;

+ struct {
+ u64 msr_val;
+ bool preempt_count_enabled;
+ struct gfn_to_hva_cache preempt_count_cache;
+ } pv_pc;
+
/*
* Indicates the guest is trying to write a gfn that contains one or
* more of the PTEs used to translate the write itself, i.e. the access
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6e64b27b2c1e..f99fa4407604 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -36,6 +36,7 @@
#define KVM_FEATURE_MSI_EXT_DEST_ID 15
#define KVM_FEATURE_HC_MAP_GPA_RANGE 16
#define KVM_FEATURE_MIGRATION_CONTROL 17
+#define KVM_FEATURE_PREEMPT_COUNT 18

#define KVM_HINTS_REALTIME 0

@@ -58,6 +59,7 @@
#define MSR_KVM_ASYNC_PF_INT 0x4b564d06
#define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
#define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
+#define MSR_KVM_PREEMPT_COUNT 0x4b564d09

struct kvm_steal_time {
__u64 steal;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02cf0a7e1d14..f2d2e3d25230 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1456,6 +1456,7 @@ static const u32 emulated_msrs_all[] = {

MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF_INT, MSR_KVM_ASYNC_PF_ACK,
+ MSR_KVM_PREEMPT_COUNT,

MSR_IA32_TSC_ADJUST,
MSR_IA32_TSC_DEADLINE,
@@ -3442,6 +3443,25 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
}

+static int kvm_pv_enable_preempt_count(struct kvm_vcpu *vcpu, u64 data)
+{
+ u64 addr = data & ~KVM_MSR_ENABLED;
+ struct gfn_to_hva_cache *ghc = &vcpu->arch.pv_pc.preempt_count_cache;
+
+ vcpu->arch.pv_pc.preempt_count_enabled = false;
+ vcpu->arch.pv_pc.msr_val = data;
+
+ if (!(data & KVM_MSR_ENABLED))
+ return 0;
+
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, addr, sizeof(int)))
+ return 1;
+
+ vcpu->arch.pv_pc.preempt_count_enabled = true;
+
+ return 0;
+}
+
int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
bool pr = false;
@@ -3661,6 +3681,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.msr_kvm_poll_control = data;
break;

+ case MSR_KVM_PREEMPT_COUNT:
+ if (!guest_pv_has(vcpu, KVM_FEATURE_PREEMPT_COUNT))
+ return 1;
+
+ if (kvm_pv_enable_preempt_count(vcpu, data))
+ return 1;
+ break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -4001,6 +4029,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)

msr_info->data = vcpu->arch.msr_kvm_poll_control;
break;
+ case MSR_KVM_PREEMPT_COUNT:
+ if (!guest_pv_has(vcpu, KVM_FEATURE_PREEMPT_COUNT))
+ return 1;
+
+ msr_info->data = vcpu->arch.pv_pc.msr_val;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -11192,6 +11226,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)

vcpu->arch.pending_external_vector = -1;
vcpu->arch.preempted_in_kernel = false;
+ vcpu->arch.pv_pc.preempt_count_enabled = false;

#if IS_ENABLED(CONFIG_HYPERV)
vcpu->arch.hv_root_tdp = INVALID_PAGE;
--
2.25.1

2022-04-08 00:28:32

by Wanpeng Li

Subject: Re: [PATCH v2 0/5] KVM: X86: Scaling Guest OS Critical Sections with boosting

ping,
On Fri, 1 Apr 2022 at 16:10, Wanpeng Li <[email protected]> wrote:
>
> There is a semantic gap between guest and host: when a vCPU is preempted
> while the guest is executing a critical section, application scalability
> degrades. This series tries to bridge that gap by passing the guest
> preempt_count to the host and by checking the guest irq-disable state, so
> the hypervisor knows whether a vCPU is running in a critical section. The
> yield-on-spin heuristics can then be smarter and boost a vCPU candidate
> that is in a critical section, mitigating the preemption problem; such a
> vCPU is also more likely to be a lock holder.
>
> Tested on a 2-socket, 96-HT Xeon CLX server. Each VM has 96 vCPUs and
> 100GB of RAM; one VM runs the benchmark while the remaining VMs run
> CPU-bound workloads. There is no performance regression for other
> benchmarks such as UnixBench.
>
> 1VM:
>                      vanilla        optimized      improved
> hackbench -l 50000   28             21.45          30.5%
> ebizzy -M            12189          12354          1.4%
> dbench               712 MB/sec     722 MB/sec     1.4%
>
> 2VM:
>                      vanilla        optimized      improved
> hackbench -l 10000   29.4           26             13%
> ebizzy -M            3834           4033           5%
> dbench               42.3 MB/sec    44.1 MB/sec    4.3%
>
> 3VM:
>                      vanilla        optimized      improved
> hackbench -l 10000   47             35.46          33%
> ebizzy -M            3828           4031           5%
> dbench               30.5 MB/sec    31.16 MB/sec   2.3%
>
> v1 -> v2:
> * Add more comments about the irq-disable state
> * Rename irq_disabled to last_guest_irq_disabled
> * Rename kvm_vcpu_non_preemptable (now kvm_vcpu_is_preemptible), invert the return value, and return a bool
>
> Wanpeng Li (5):
> KVM: X86: Add MSR_KVM_PREEMPT_COUNT support
> KVM: X86: Add last guest interrupt disable state support
> KVM: X86: Boost vCPU which is in critical section
> x86/kvm: Add MSR_KVM_PREEMPT_COUNT guest support
> KVM: X86: Expose PREEMPT_COUNT CPUID feature bit to guest
>
> Documentation/virt/kvm/cpuid.rst | 3 ++
> arch/x86/include/asm/kvm_host.h | 8 ++++
> arch/x86/include/uapi/asm/kvm_para.h | 2 +
> arch/x86/kernel/kvm.c | 10 +++++
> arch/x86/kvm/cpuid.c | 3 +-
> arch/x86/kvm/x86.c | 60 ++++++++++++++++++++++++++++
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 7 ++++
> 8 files changed, 93 insertions(+), 1 deletion(-)
>
> --
> 2.25.1
>

2022-04-14 22:52:05

by Peter Zijlstra

Subject: Re: [PATCH v2 3/5] KVM: X86: Boost vCPU which is in critical section

On Wed, Apr 13, 2022 at 09:43:03PM +0000, Sean Christopherson wrote:
> +tglx and PeterZ
>
> On Fri, Apr 01, 2022, Wanpeng Li wrote:
> > From: Wanpeng Li <[email protected]>
> >
> > There is a semantic gap between guest and host: when a vCPU is preempted
> > while the guest is executing a critical section, application scalability
> > degrades. This series tries to bridge that gap by passing the guest
> > preempt_count to the host and by checking the guest irq-disable state, so
> > the hypervisor knows whether a vCPU is running in a critical section. The
> > yield-on-spin heuristics can then be smarter and boost a vCPU candidate
> > that is in a critical section, mitigating the preemption problem; such a
> > vCPU is also more likely to be a lock holder.
> >
> > Tested on a 2-socket, 96-HT Xeon CLX server. Each VM has 96 vCPUs and
> > 100GB of RAM; one VM runs the benchmark while the remaining VMs run
> > CPU-bound workloads. There is no performance regression for other
> > benchmarks such as UnixBench.
>
> ...
>
> > Signed-off-by: Wanpeng Li <[email protected]>
> > ---
> > arch/x86/kvm/x86.c | 22 ++++++++++++++++++++++
> > include/linux/kvm_host.h | 1 +
> > virt/kvm/kvm_main.c | 7 +++++++
> > 3 files changed, 30 insertions(+)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9aa05f79b743..b613cd2b822a 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10377,6 +10377,28 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> > return r;
> > }
> >
> > +static bool kvm_vcpu_is_preemptible(struct kvm_vcpu *vcpu)
> > +{
> > + int count;
> > +
> > + if (!vcpu->arch.pv_pc.preempt_count_enabled)
> > + return false;
> > +
> > + if (!kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_pc.preempt_count_cache,
> > + &count, sizeof(int)))
> > + return !(count & ~PREEMPT_NEED_RESCHED);
>
> As I pointed out in v1[*], this makes PREEMPT_NEED_RESCHED and really the entire
> __preempt_count to some extent, KVM guest/host ABI. That needs acks from sched
> folks, and if they're ok with it, needs to be formalized somewhere in kvm_para.h,
> not buried in the KVM host code.

Right, not going to happen. There's been plenty changes to
__preempt_count over the past years, suggesting that making it ABI will
be an incredibly bad idea.

It also only solves part of the problem; namely spinlocks, but doesn't
help at all with mutexes, which can be equally short lived, as evidenced
by the adaptive spinning mutex code etc..

Also, I'm not sure I fully understand the problem, doesn't the paravirt
spinlock code give sufficient clues?

2022-04-16 00:23:02

by Sean Christopherson

Subject: Re: [PATCH v2 3/5] KVM: X86: Boost vCPU which is in critical section

+tglx and PeterZ

On Fri, Apr 01, 2022, Wanpeng Li wrote:
> From: Wanpeng Li <[email protected]>
>
> There is a semantic gap between guest and host: when a vCPU is preempted
> while the guest is executing a critical section, application scalability
> degrades. This series tries to bridge that gap by passing the guest
> preempt_count to the host and by checking the guest irq-disable state, so
> the hypervisor knows whether a vCPU is running in a critical section. The
> yield-on-spin heuristics can then be smarter and boost a vCPU candidate
> that is in a critical section, mitigating the preemption problem; such a
> vCPU is also more likely to be a lock holder.
>
> Tested on a 2-socket, 96-HT Xeon CLX server. Each VM has 96 vCPUs and
> 100GB of RAM; one VM runs the benchmark while the remaining VMs run
> CPU-bound workloads. There is no performance regression for other
> benchmarks such as UnixBench.

...

> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> arch/x86/kvm/x86.c | 22 ++++++++++++++++++++++
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 7 +++++++
> 3 files changed, 30 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9aa05f79b743..b613cd2b822a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10377,6 +10377,28 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> return r;
> }
>
> +static bool kvm_vcpu_is_preemptible(struct kvm_vcpu *vcpu)
> +{
> + int count;
> +
> + if (!vcpu->arch.pv_pc.preempt_count_enabled)
> + return false;
> +
> + if (!kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_pc.preempt_count_cache,
> + &count, sizeof(int)))
> + return !(count & ~PREEMPT_NEED_RESCHED);

As I pointed out in v1[*], this makes PREEMPT_NEED_RESCHED and really the entire
__preempt_count to some extent, KVM guest/host ABI. That needs acks from sched
folks, and if they're ok with it, needs to be formalized somewhere in kvm_para.h,
not buried in the KVM host code.

[*] https://lore.kernel.org/all/[email protected]
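
As a purely hypothetical illustration of that point (the names below are
made up and nothing like this exists in kvm_para.h), formalizing the
contract might mean exporting an explicit mask next to the MSR definition,
so the host-side check no longer depends on the kernel-internal
PREEMPT_NEED_RESCHED:

  /* Hypothetical uapi additions -- illustrative only, not part of this series. */
  #define KVM_PREEMPT_COUNT_NEED_RESCHED	(1U << 31)
  #define KVM_PREEMPT_COUNT_MASK		(~KVM_PREEMPT_COUNT_NEED_RESCHED)

  /* kvm_vcpu_is_preemptible() would then test the guest-reported value
   * against the ABI mask rather than the scheduler's private definition: */
  return !(count & KVM_PREEMPT_COUNT_MASK);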

> +
> + return false;
> +}
> +