2021-04-21 20:59:04

by Kenta Ishiguro

Subject: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM

Dear KVM developers and maintainers,

In our research work presented last week at the VEE 2021 conference [1], we
found that many consecutive Pause-Loop-Exiting (PLE) events occur due to
three problems we have identified: 1) Linux CFS ignores hints from KVM;
2) IPI-receiver vCPUs in user mode are not boosted; 3) an IPI receiver that
has halted is always a candidate for boosting. We have introduced two
mitigations against these problems.

To solve problem (1), patch 1 increases the vruntime of the yielding vCPU
so that it passes the check `if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next,
left) < 1)` in `pick_next_entity()` when the cfs_rq's skip and next buddies
are both vCPUs in the same VM. To keep fairness, it does not prioritize the
guest VM that causes PLE; it simply improves performance by eliminating
unnecessary PLE events. We have also confirmed that `yield_to_task_fair`
is called only from KVM.
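
For reference, the check in question (abridged from kernel/sched/fair.c
around v5.6 and simplified here) looks like this:

    static struct sched_entity *
    pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
    {
            struct sched_entity *left = __pick_first_entity(cfs_rq);
            struct sched_entity *se;
            ...
            /* skip-buddy handling elided: avoid running the entity that
               just yielded, if something else can run without unfairness */
            ...
            if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                    se = cfs_rq->next;  /* honor the yield_to() hint */
            ...
            return se;
    }

wakeup_preempt_entity(cfs_rq->next, left) returns 1 when the next buddy's
vruntime is more than one wakeup granularity ahead of the leftmost entity's;
in that case CFS silently drops the next-buddy hint that KVM set via
yield_to(), and the boost is lost.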

To solve problems (2) and (3), patch 2 monitors IPI communication between
vCPUs and leverages the relationship between vCPUs to select boost
candidates. The "[PATCH] KVM: Boost vCPU candidate in user mode which is
delivering interrupt" patch
(https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
seems to be effective for (2), although it uses only the IPI-receiver
information.
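
For intuition, the core of patch 2 can be sketched as follows (a simplified
sketch, not the literal patch code; the actual implementation is below):

    /*
     * Per-vCPU bitmap of IPI senders: bit i set means vCPU i has sent this
     * vCPU an IPI since it was scheduled out (hence the 64-vCPU limit).
     */
    u64 ipi_received;

    /* IPI-send path: record the sender if the receiver is scheduled out. */
    dest_vcpu->ipi_received |= 1ULL << src_id;

    /* Directed yield: a halted or user-mode vCPU is a boost candidate only
       if the spinning vCPU has actually signaled it. */
    if (vcpu->ipi_received & (1ULL << me->vcpu_id))
            /* consider vcpu for boosting */;

This lets kvm_vcpu_on_spin() distinguish a receiver that has an IPI pending
from the spinner (problems (2) and (3)) from an unrelated halted vCPU.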

Our approach reduces the total number of PLE events by up to 87.6% with four
8-vCPU VMs in an over-subscribed scenario on Linux kernel 5.6.0. Please
find the patches below.

We would greatly appreciate your valuable feedback on our approach and
patch.

Thank you very much for your consideration.
Kenta Ishiguro

[1] Kenta Ishiguro, Naoki Yasuno, Pierre-Louis Aublin, and Kenji Kono.
"Mitigating excessive vCPU spinning in VM-agnostic KVM".
In Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference
on Virtual Execution Environments (VEE 2021).
Association for Computing Machinery, New York,
NY, USA, 139--152. https://dl.acm.org/doi/abs/10.1145/3453933.3454020

Kenta Ishiguro (2):
Prevent CFS from ignoring boost requests from KVM
Boost vCPUs based on IPI-sender and receiver information

arch/x86/kvm/lapic.c | 14 ++++++++++++++
arch/x86/kvm/vmx/vmx.c | 2 ++
include/linux/kvm_host.h | 5 +++++
kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 26 ++++++++++++++++++++++++--
5 files changed, 76 insertions(+), 2 deletions(-)

--
2.30.2


2021-04-21 22:44:52

by Kenta Ishiguro

Subject: [RFC PATCH 1/2] Prevent CFS from ignoring boost requests from KVM

This commit increases the vruntime of the yielding vCPU so that the vCPU
to be boosted is picked instead of the yielding one, when the two vCPUs are
in the same VM. This change avoids the situation where scheduling the
boosted vCPU would be considered too unfair.
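
The adjustment targets the fairness test in wakeup_preempt_entity(), which
rejects the next buddy when its vruntime is more than one wakeup granularity
ahead of the leftmost entity. A sketch of the arithmetic, assuming the
yielding vCPU is the leftmost entity:

    /* before: next->vruntime - yield->vruntime >  gran  => boost rejected */
    yield->vruntime = next->vruntime - wakeup_gran(yield);
    /* after:  next->vruntime - yield->vruntime == gran  => boost accepted */

Because the yielding vCPU's vruntime only moves forward, the VM as a whole
is not favored over other tasks on the same runqueue.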

Cc: David Hildenbrand <[email protected]>
Signed-off-by: Kenta Ishiguro <[email protected]>
---
kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..2908da3f4c77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7204,9 +7204,36 @@ static void yield_task_fair(struct rq *rq)
set_skip_buddy(se);
}

+static void deboost_yield_task_vruntime(struct sched_entity *next_se, struct sched_entity *yield_se)
+{
+ if (wakeup_preempt_entity(next_se, yield_se) < 1)
+ return;
+ yield_se->vruntime = next_se->vruntime - wakeup_gran(yield_se);
+}
+
+static void deboost_yield_task(struct sched_entity *next_se, struct sched_entity *yield_se)
+{
+ struct sched_entity *next_se_base = next_se;
+
+ if (rq_of(cfs_rq_of(yield_se)) != rq_of(cfs_rq_of(next_se)))
+ return;
+
+ for_each_sched_entity(yield_se) {
+ next_se = next_se_base;
+ for_each_sched_entity(next_se) {
+ if (cfs_rq_of(yield_se) == cfs_rq_of(next_se)) {
+ deboost_yield_task_vruntime(next_se, yield_se);
+ return;
+ }
+ }
+ }
+}
+
static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
+ struct task_struct *curr;
+ struct sched_entity *yield_se;

/* throttled hierarchies are not runnable */
if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
@@ -7215,6 +7242,10 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
/* Tell the scheduler that we'd really like pse to run next. */
set_next_buddy(se);

+ curr = rq->curr;
+ yield_se = &curr->se;
+ deboost_yield_task(se, yield_se);
+
yield_task_fair(rq);

return true;
--
2.30.2

2021-04-21 22:45:09

by Kenta Ishiguro

Subject: [RFC PATCH 2/2] Boost vCPUs based on IPI-sender and receiver information

This commit monitors IPI communication between vCPUs and leverages the
relationship between vCPUs to select boost candidates.
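
In short, the tracking state evolves as follows (a simplified summary of
the code below):

    kvm_apic_send_ipi():            /* sender side */
            if (dest_vcpu->sched_outed)
                    dest_vcpu->ipi_received |= 1 << sender_id;

    kvm_vcpu_on_spin():             /* directed yield */
            skip a vCPU unless its ipi_received has the spinner's bit set
            (marking it so that a later spin can pick it up)

    vmx_vcpu_run() / kvm_sched_in():
            vcpu->ipi_received = 0; /* pending IPIs are about to be handled */

    kvm_sched_out():
            vcpu->sched_outed = true;

The bitmap is cleared on VM entry and on sched-in because any IPI recorded
so far will be delivered once the vCPU actually runs.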

Cc: David Hildenbrand <[email protected]>
Signed-off-by: Kenta Ishiguro <[email protected]>
---
arch/x86/kvm/lapic.c | 14 ++++++++++++++
arch/x86/kvm/vmx/vmx.c | 2 ++
include/linux/kvm_host.h | 5 +++++
virt/kvm/kvm_main.c | 26 ++++++++++++++++++++++++--
4 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index cc369b9ad8f1..c8d967ddecf9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1269,6 +1269,18 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
}
EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);

+static void mark_ipi_receiver(struct kvm_lapic *apic, struct kvm_lapic_irq *irq)
+{
+ struct kvm_vcpu *dest_vcpu;
+ u64 prev_ipi_received;
+
+ dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq->dest_id);
+ if (READ_ONCE(dest_vcpu->sched_outed)) {
+ prev_ipi_received = READ_ONCE(dest_vcpu->ipi_received);
+ WRITE_ONCE(dest_vcpu->ipi_received, prev_ipi_received | (1 << apic->vcpu->vcpu_id));
+ }
+}
+
void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
{
struct kvm_lapic_irq irq;
@@ -1287,6 +1299,8 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)

trace_kvm_apic_ipi(icr_low, irq.dest_id);

+ mark_ipi_receiver(apic, &irq);
+
kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
}

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 29b40e092d13..ced50935a38b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6718,6 +6718,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
vmcs_write32(PLE_WINDOW, vmx->ple_window);
}

+ WRITE_ONCE(vcpu->ipi_received, 0);
+
/*
* We did this in prepare_switch_to_guest, because it needs to
* be within srcu_read_lock.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1b65e7204344..6726aeec03e7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -320,6 +320,11 @@ struct kvm_vcpu {
#endif
bool preempted;
bool ready;
+ bool sched_outed;
+ /*
+ * The current implementation of strict boost supports up to 64 vCPUs
+ */
+ u64 ipi_received;
struct kvm_vcpu_arch arch;
struct kvm_dirty_ring dirty_ring;
};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 383df23514b9..08e629957e7e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -413,6 +413,10 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
kvm_vcpu_set_dy_eligible(vcpu, false);
vcpu->preempted = false;
vcpu->ready = false;
+
+ vcpu->sched_outed = false;
+ vcpu->ipi_received = 0;
+
preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
}

@@ -3011,6 +3015,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
int try = 3;
int pass;
int i;
+ u64 prev_ipi_received;

kvm_vcpu_set_in_spin_loop(me, true);
/*
@@ -3031,12 +3036,25 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
continue;
if (vcpu == me)
continue;
+ prev_ipi_received = READ_ONCE(vcpu->ipi_received);
+ if (!READ_ONCE(vcpu->preempted) &&
+ !(prev_ipi_received & (1 << me->vcpu_id))) {
+ WRITE_ONCE(vcpu->ipi_received,
+ prev_ipi_received | (1 << me->vcpu_id));
+ continue;
+ }
if (rcuwait_active(&vcpu->wait) &&
!vcpu_dy_runnable(vcpu))
continue;
if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
- !kvm_arch_vcpu_in_kernel(vcpu))
- continue;
+ !kvm_arch_vcpu_in_kernel(vcpu)) {
+ prev_ipi_received = READ_ONCE(vcpu->ipi_received);
+ if (!(prev_ipi_received & (1 << me->vcpu_id))) {
+ WRITE_ONCE(vcpu->ipi_received,
+ prev_ipi_received | (1 << me->vcpu_id));
+ continue;
+ }
+ }
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;

@@ -4859,6 +4877,9 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
WRITE_ONCE(vcpu->preempted, false);
WRITE_ONCE(vcpu->ready, false);

+ WRITE_ONCE(vcpu->sched_outed, false);
+ WRITE_ONCE(vcpu->ipi_received, 0);
+
__this_cpu_write(kvm_running_vcpu, vcpu);
kvm_arch_sched_in(vcpu, cpu);
kvm_arch_vcpu_load(vcpu, cpu);
@@ -4873,6 +4894,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
}
+ WRITE_ONCE(vcpu->sched_outed, true);
kvm_arch_vcpu_put(vcpu);
__this_cpu_write(kvm_running_vcpu, NULL);
}
--
2.30.2

2021-04-22 01:31:47

by Wanpeng Li

Subject: Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM

On Thu, 22 Apr 2021 at 06:13, Kenta Ishiguro
<[email protected]> wrote:
>
> Dear KVM developers and maintainers,
>
> In our research work presented last week at the VEE 2021 conference [1], we
> found that many consecutive Pause-Loop-Exiting (PLE) events occur due to
> three problems we have identified: 1) Linux CFS ignores hints from KVM;
> 2) IPI-receiver vCPUs in user mode are not boosted; 3) an IPI receiver that
> has halted is always a candidate for boosting. We have introduced two
> mitigations against these problems.
>
> To solve problem (1), patch 1 increases the vruntime of the yielding vCPU
> so that it passes the check `if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next,
> left) < 1)` in `pick_next_entity()` when the cfs_rq's skip and next buddies
> are both vCPUs in the same VM. To keep fairness, it does not prioritize the
> guest VM that causes PLE; it simply improves performance by eliminating
> unnecessary PLE events. We have also confirmed that `yield_to_task_fair`
> is called only from KVM.
>
> To solve problems (2) and (3), patch 2 monitors IPI communication between
> vCPUs and leverages the relationship between vCPUs to select boost
> candidates. The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> delivering interrupt" patch
> (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> seems to be effective for (2), although it uses only the IPI-receiver
> information.
>
> Our approach reduces the total number of PLE events by up to 87.6% with four
> 8-vCPU VMs in an over-subscribed scenario on Linux kernel 5.6.0. Please
> find the patches below.

You should mention that this improvement mainly comes from the scheduler
hacking for your problem (1); however, a KVM task is just an ordinary task,
and the scheduler maintainers generally do not accept special treatment for
it. As for the worst case of problem (1) mentioned in your paper, I guess
it is the vCPU stacking issue; I tried to mitigate it before
(https://lore.kernel.org/kvm/[email protected]/).
For your problem (3), we evaluated hackbench, which heavily contends rq
locks and generates heavy asynchronous IPIs (reschedule IPIs); the influence
of the async IPIs is around 0.X%, so I don't expect normal workloads to feel
any effect. In addition, four 8-vCPU VMs are not suitable for a scalability
evaluation. I don't think the complexity introduced by your patch 2 is worth
it, since it achieves a similar effect to my version with the current
heuristic algorithm.

Wanpeng

2021-04-22 01:45:41

by Sean Christopherson

Subject: Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM

On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> To solve problems (2) and (3), patch 2 monitors IPI communication between
> vCPUs and leverages the relationship between vCPUs to select boost
> candidates. The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> delivering interrupt" patch
> (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> seems to be effective for (2), although it uses only the IPI-receiver
> information.

On the IPI side of things, I like the idea of explicitly tracking the IPIs,
especially if we can simplify the implementation, e.g. by losing the receiver
info and making ipi_received a bool. Maybe temporarily table Wanpeng's patch
while this approach is analyzed?

2021-04-22 01:46:23

by Sean Christopherson

Subject: Re: [RFC PATCH 2/2] Boost vCPUs based on IPI-sender and receiver information

On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> This commit monitors IPI communication between vCPUs and leverages the
> relationship between vCPUs to select boost candidates.
>
> Cc: David Hildenbrand <[email protected]>
> Signed-off-by: Kenta Ishiguro <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 14 ++++++++++++++
> arch/x86/kvm/vmx/vmx.c | 2 ++
> include/linux/kvm_host.h | 5 +++++
> virt/kvm/kvm_main.c | 26 ++++++++++++++++++++++++--
> 4 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index cc369b9ad8f1..c8d967ddecf9 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1269,6 +1269,18 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
> }
> EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
>
> +static void mark_ipi_receiver(struct kvm_lapic *apic, struct kvm_lapic_irq *irq)
> +{
> + struct kvm_vcpu *dest_vcpu;
> + u64 prev_ipi_received;
> +
> + dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq->dest_id);
> + if (READ_ONCE(dest_vcpu->sched_outed)) {

dest_vcpu needs to be checked for NULL.

> + prev_ipi_received = READ_ONCE(dest_vcpu->ipi_received);
> + WRITE_ONCE(dest_vcpu->ipi_received, prev_ipi_received | (1 << apic->vcpu->vcpu_id));
> + }
> +}
> +
> void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
> {
> struct kvm_lapic_irq irq;
> @@ -1287,6 +1299,8 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>
> trace_kvm_apic_ipi(icr_low, irq.dest_id);
>
> + mark_ipi_receiver(apic, &irq);
> +
> kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
> }
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 29b40e092d13..ced50935a38b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6718,6 +6718,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> vmcs_write32(PLE_WINDOW, vmx->ple_window);
> }
>
> + WRITE_ONCE(vcpu->ipi_received, 0);

Given that ipi_received is cleared when the vCPU is run, is there actually an
observable advantage to tracking which vCPU sent the IPI? I.e. how do the
numbers look if ipi_received is a simple bool, and kvm_vcpu_on_spin() yields to
any vCPU that has an IPI pending?

> /*
> * We did this in prepare_switch_to_guest, because it needs to
> * be within srcu_read_lock.

...

> @@ -4873,6 +4894,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
> WRITE_ONCE(vcpu->preempted, true);
> WRITE_ONCE(vcpu->ready, true);
> }
> + WRITE_ONCE(vcpu->sched_outed, true);

s/sched_outed/scheduled_out to be more grammatically correct.

It might also make sense to introduce the flag in a separate patch. Or even
better, can the existing "preempted" and "ready" be massaged so that we don't
have three flags that are tracking the same basic thing, with slightly different
semantics?
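
E.g., something like the below (purely a sketch of the idea, not a demand
for this exact shape):

    /* One scheduling-state field instead of three overlapping bools. */
    enum kvm_vcpu_sched_state {
            KVM_VCPU_SCHED_IN,      /* loaded on a CPU */
            KVM_VCPU_SCHED_OUT,     /* scheduled out voluntarily */
            KVM_VCPU_PREEMPTED,     /* scheduled out involuntarily */
    };

where today's "preempted"/"ready" pair would map to KVM_VCPU_PREEMPTED, and
"sched_outed" to anything other than KVM_VCPU_SCHED_IN.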

> kvm_arch_vcpu_put(vcpu);
> __this_cpu_write(kvm_running_vcpu, NULL);
> }
> --
> 2.30.2
>

2021-04-22 12:25:43

by Wanpeng Li

Subject: Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM

On Thu, 22 Apr 2021 at 09:45, Sean Christopherson <[email protected]> wrote:
>
> On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> > To solve problems (2) and (3), patch 2 monitors IPI communication between
> > vCPUs and leverages the relationship between vCPUs to select boost
> > candidates. The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> > delivering interrupt" patch
> > (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> > seems to be effective for (2), although it uses only the IPI-receiver
> > information.
>
> > On the IPI side of things, I like the idea of explicitly tracking the IPIs,
> especially if we can simplify the implementation, e.g. by losing the receiver
> info and making ipi_received a bool. Maybe temporarily table Wanpeng's patch
> while this approach is analyzed?

Hi all,

I evaluated my patch
(https://lore.kernel.org/kvm/[email protected]),
Kenta's patch 2, and Sean's suggestion. The testing environment is
pbzip2 in a 96-vCPU VM in an over-subscribed scenario (the host machine is
a 2-socket, 48-core, 96-HT Intel CLX box). Note: Kenta's scheduler
hacking is not applied. My patch has the most stable score and the best
performance.

Wanpeng's patch

The average: vanilla -> boost: 69.124 -> 61.975, 10.3%

* Wall Clock: 61.695359 seconds
* Wall Clock: 63.343579 seconds
* Wall Clock: 61.567513 seconds
* Wall Clock: 62.144722 seconds
* Wall Clock: 61.091442 seconds
* Wall Clock: 62.085912 seconds
* Wall Clock: 61.311954 seconds

Kenta's patch

The average: vanilla -> boost: 69.148 -> 64.567, 6.6%

* Wall Clock: 66.288113 seconds
* Wall Clock: 61.228642 seconds
* Wall Clock: 62.100524 seconds
* Wall Clock: 68.355473 seconds
* Wall Clock: 64.864608 seconds

Sean's suggestion:

The average: vanilla -> boost: 69.148 -> 66.505, 3.8%

* Wall Clock: 60.583562 seconds
* Wall Clock: 58.533960 seconds
* Wall Clock: 70.103489 seconds
* Wall Clock: 74.279028 seconds
* Wall Clock: 69.024194 seconds

I followed (almost) Sean's suggestion:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0050f39..78b5eb6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1272,6 +1272,7 @@ EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
{
struct kvm_lapic_irq irq;
+ struct kvm_vcpu *dest_vcpu;

irq.vector = icr_low & APIC_VECTOR_MASK;
irq.delivery_mode = icr_low & APIC_MODE_MASK;
@@ -1285,6 +1286,10 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
else
irq.dest_id = GET_APIC_DEST_FIELD(icr_high);

+ dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq.dest_id);
+ if (dest_vcpu)
+ WRITE_ONCE(dest_vcpu->ipi_received, true);
+
trace_kvm_apic_ipi(icr_low, irq.dest_id);

kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 303fb55..a98bf571 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9298,6 +9298,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_return();

+ WRITE_ONCE(vcpu->ipi_received, false);
+
if (unlikely(vcpu->arch.switch_db_regs)) {
set_debugreg(0, 7);
set_debugreg(vcpu->arch.eff_db[0], 0);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5ef09a4..81e39fa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -332,6 +332,8 @@ struct kvm_vcpu {
bool dy_eligible;
} spin_loop;
#endif
+
+ bool ipi_received;
bool preempted;
bool ready;
struct kvm_vcpu_arch arch;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c682f82..5098929 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -411,6 +411,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)

kvm_vcpu_set_in_spin_loop(vcpu, false);
kvm_vcpu_set_dy_eligible(vcpu, false);
+ vcpu->ipi_received = false;
vcpu->preempted = false;
vcpu->ready = false;
preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
@@ -3220,6 +3221,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
!vcpu_dy_runnable(vcpu))
continue;
if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+ !READ_ONCE(vcpu->ipi_received) &&
!kvm_arch_vcpu_in_kernel(vcpu))
continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))

2021-04-22 16:00:17

by Sean Christopherson

Subject: Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM

On Thu, Apr 22, 2021, Wanpeng Li wrote:
> On Thu, 22 Apr 2021 at 09:45, Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> > > To solve problems (2) and (3), patch 2 monitors IPI communication between
> > > vCPUs and leverages the relationship between vCPUs to select boost
> > > candidates. The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> > > delivering interrupt" patch
> > > (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> > > seems to be effective for (2), although it uses only the IPI-receiver
> > > information.
> >
> > On the IPI side of things, I like the idea of explicitly tracking the IPIs,
> > especially if we can simplify the implementation, e.g. by losing the receiver
> > info and making ipi_received a bool. Maybe temporarily table Wanpeng's patch
> > while this approach is analyzed?
>
> Hi all,
>
> I evaluated my patch

Thanks for doing the testing, much appreciated!

> (https://lore.kernel.org/kvm/[email protected]),
> Kenta's patch 2, and Sean's suggestion. The testing environment is
> pbzip2 in a 96-vCPU VM in an over-subscribed scenario (the host machine is
> a 2-socket, 48-core, 96-HT Intel CLX box).

Are the vCPUs affined in any way? How many VMs are running? Are there other
workloads on the host? Not criticising, just asking so that others can
reproduce your setup.

> Note: Kenta's scheduler hacking is not applied. My patch has the most
> stable score and the best performance.

On the other hand, Kenta's approach has the advantage of working for both Intel
and AMD. But I'm also not very familiar with AMD's AVIC, so I don't know if it's
feasible to implement a performant equivalent in svm_dy_apicv_has_pending_interrupt().

Kenta's patch is also flawed as it doesn't scale to 96 vCPUs; vCPUs 64-95 will
never get boosted.
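
To illustrate (not a fix): the bitmap updates in patch 2 are of the form

    prev_ipi_received | (1 << vcpu_id)

where the u64 can only track vCPU ids 0-63, and '1' is a 32-bit literal, so
the shift is already undefined behavior for vcpu_id >= 32.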

> Wanpeng's patch
>
> The average: vanilla -> boost: 69.124 -> 61.975, 10.3%
>
> * Wall Clock: 61.695359 seconds
> * Wall Clock: 63.343579 seconds
> * Wall Clock: 61.567513 seconds
> * Wall Clock: 62.144722 seconds
> * Wall Clock: 61.091442 seconds
> * Wall Clock: 62.085912 seconds
> * Wall Clock: 61.311954 seconds
>
> Kenta' patch
>
> The average: vanilla -> boost: 69.148 -> 64.567, 6.6%
>
> * Wall Clock: 66.288113 seconds
> * Wall Clock: 61.228642 seconds
> * Wall Clock: 62.100524 seconds
> * Wall Clock: 68.355473 seconds
> * Wall Clock: 64.864608 seconds
>
> Sean's suggestion:
>
> The average: vanilla -> boost: 69.148 -> 66.505, 3.8%
>
> * Wall Clock: 60.583562 seconds
> * Wall Clock: 58.533960 seconds
> * Wall Clock: 70.103489 seconds
> * Wall Clock: 74.279028 seconds
> * Wall Clock: 69.024194 seconds

2021-04-26 03:10:01

by Wanpeng Li

Subject: Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM

On Mon, 26 Apr 2021 at 10:56, Kenta Ishiguro
<[email protected]> wrote:
>
> Dear all,
>
> Thank you for the insightful feedback.
>
> Does Sean's suggested version of Wanpeng's patch mark a running vCPU as an IPI
> receiver? If so, I think the candidate set of vCPUs for boosting differs
> slightly between using kvm_arch_interrupt_delivery and using the boolean
> ipi_received. In the version using the boolean ipi_received, vCPUs that
> receive an IPI while running are also candidates for a boost.
> However, they have likely already responded to their IPI before they exit.

if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+ !READ_ONCE(vcpu->ipi_received) &&

There is a vcpu->preempted check here.

Wanpeng

2021-04-26 03:20:24

by Kenta Ishiguro

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM

Thank you for the reply.

My question is about the following scenario:
1. a running vCPU receives an IPI, and its ipi_received becomes true
2. the vCPU responds to the IPI
3. the vCPU exits
4. the vCPU is preempted by KVM
5. the vCPU is boosted, but it has already responded to the IPI
6. the vCPU enters the guest, and its ipi_received is cleared

In this case, I think the vcpu->preempted check does not limit the candidate vCPUs.

2021-04-26 03:59:53

by Wanpeng Li

Subject: Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM

On Mon, 26 Apr 2021 at 11:19, Kenta Ishiguro
<[email protected]> wrote:
>
> Thank you for the reply.
>
> My question is about the following scenario:
> 1. a running vCPU receives an IPI, and its ipi_received becomes true
> 2. the vCPU responds to the IPI
> 3. the vCPU exits
> 4. the vCPU is preempted by KVM
> 5. the vCPU is boosted, but it has already responded to the IPI
> 6. the vCPU enters the guest, and its ipi_received is cleared
>
> In this case, I think the vcpu->preempted check does not limit the candidate vCPUs.

Good point, you are right. However, I actually played with that code a
bit before; I have another version that adds the vcpu->preempted check
when marking the IPI receiver, and the score was not as good as expected.
Wanpeng