2021-05-19 18:02:22

by Wanpeng Li

Subject: [PATCH v4 1/5] KVM: exit halt polling on need_resched() for both book3s and generic halt-polling

From: Wanpeng Li <[email protected]>

Inspired by commit 262de4102c7bb8 (kvm: exit halt polling on need_resched()
as well): CFS_BANDWIDTH throttling will use resched_task() when there is just
one task, in order to get that task to block, and this was likely allowing
VMs to overrun their quota when halt polling. Since PPC implements its own
arch-specific halt-polling logic, we should add the need_resched() check
there as well. This patch adds a helper function to be shared between the
book3s and generic halt-polling loops.

Reviewed-by: David Matlack <[email protected]>
Reviewed-by: Venkatesh Srinivas <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Venkatesh Srinivas <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Suraj Jitindar Singh <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
v3 -> v4:
* rename to kvm_vcpu_can_poll
v2 -> v3:
* add a helper function
v1 -> v2:
* update patch description

 arch/powerpc/kvm/book3s_hv.c | 2 +-
 include/linux/kvm_host.h     | 2 ++
 virt/kvm/kvm_main.c          | 8 ++++++--
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 28a80d240b76..7360350e66ff 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3936,7 +3936,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
 				break;
 			}
 			cur = ktime_get();
-		} while (single_task_running() && ktime_before(cur, stop));
+		} while (kvm_vcpu_can_poll(cur, stop));
 
 		spin_lock(&vc->lock);
 		vc->vcore_state = VCORE_INACTIVE;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2f34487e21f2..ba682f738a25 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1583,4 +1583,6 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 /* Max number of entries allowed for each kvm dirty ring */
 #define KVM_DIRTY_RING_MAX_ENTRIES 65536
 
+bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop);
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b4feb92dc79..62522c12beba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2945,6 +2945,11 @@ update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
 		vcpu->stat.halt_poll_success_ns += poll_ns;
 }
 
+bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
+{
+	return single_task_running() && !need_resched() && ktime_before(cur, stop);
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
@@ -2973,8 +2978,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 				goto out;
 			}
 			poll_end = cur = ktime_get();
-		} while (single_task_running() && !need_resched() &&
-			ktime_before(cur, stop));
+		} while (kvm_vcpu_can_poll(cur, stop));
 	}
 
 	prepare_to_rcuwait(&vcpu->wait);
--
2.25.1



2021-05-19 18:02:42

by Wanpeng Li

Subject: [PATCH v4 2/5] KVM: X86: Bail out of direct yield in case of under-committed scenarios

From: Wanpeng Li <[email protected]>

In case of under-committed scenarios, vCPUs can be scheduled easily;
kvm_vcpu_yield_to adds extra overhead, and we can observe a lot of races
where vcpu->ready is true but the yield fails because p->state is
TASK_RUNNING. Let's bail out in such scenarios by checking the length of
the current cpu runqueue; it can be treated as a hint of under-commitment
rather than a guarantee of accuracy. The directed_yield_successful/attempted
ratio can be improved from 50+% to 80+% in the under-committed scenario.
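
For orientation, the resulting control flow is roughly the following (a
simplified sketch; the APIC-map lookup details are elided, see the real
kvm_sched_yield() in arch/x86/kvm/x86.c):

  static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
  {
          vcpu->stat.directed_yield_attempted++;

          /*
           * Under-committed host: this vCPU is the only runnable task on
           * its CPU, so skip the RCU-protected APIC-map lookup and the
           * yield attempt entirely.
           */
          if (single_task_running())
                  return;

          /*
           * ... otherwise look up the destination vCPU via the APIC map,
           * kvm_vcpu_yield_to() it, and bump directed_yield_successful if
           * the yield goes through ...
           */
  }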

Signed-off-by: Wanpeng Li <[email protected]>
---
v2 -> v3:
* update patch description
v1 -> v2:
* move the check after attempted counting
* update patch description

arch/x86/kvm/x86.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b6bca616929..dfb7c320581f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8360,6 +8360,9 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)

 	vcpu->stat.directed_yield_attempted++;
 
+	if (single_task_running())
+		goto no_yield;
+
 	rcu_read_lock();
 	map = rcu_dereference(vcpu->kvm->arch.apic_map);

--
2.25.1


2021-05-19 18:03:27

by Wanpeng Li

Subject: [PATCH v4 3/5] KVM: X86: Fix vCPU preempted state from guest's point of view

From: Wanpeng Li <[email protected]>

Commit 66570e966dd9 (kvm: x86: only provide PV features if enabled in guest's
CPUID) avoids accessing the PV TLB shootdown host-side logic when this PV
feature is not exposed to the guest. However, kvm_steal_time.preempted is not
only leveraged by the PV TLB shootdown logic; it also mitigates the lock
holder preemption issue. From the guest's point of view, the vCPU now appears
to be always preempted, since we lose the reset of kvm_steal_time.preempted
before vmentry when the PV TLB shootdown feature is not exposed. This patch
fixes it by clearing kvm_steal_time.preempted before vmentry in that case.
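
For context, the guest consumes this flag through its vcpu_is_preempted()
paravirt hook, roughly as sketched below (simplified from the guest-side code
in arch/x86/kernel/kvm.c; not part of this patch), so a ->preempted value
that is never cleared makes the vCPU look permanently preempted to the
guest's lock-holder checks:

  /*
   * Guest-side sketch: returns true while the host reports the vCPU as
   * preempted; PV spinlocks use this to decide whether spinning on the
   * current lock holder is worthwhile.
   */
  static bool guest_vcpu_is_preempted(int cpu)
  {
          struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

          return !!(READ_ONCE(st->preempted) & KVM_VCPU_PREEMPTED);
  }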

Fixes: 66570e966dd9 (kvm: x86: only provide PV features if enabled in guest's CPUID)
Reviewed-by: Sean Christopherson <[email protected]>
Cc: [email protected]
Signed-off-by: Wanpeng Li <[email protected]>
---
v1 -> v2:
* add curly braces

arch/x86/kvm/x86.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dfb7c320581f..bed7b5348c0e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3105,6 +3105,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 				       st->preempted & KVM_VCPU_FLUSH_TLB);
 		if (xchg(&st->preempted, 0) & KVM_VCPU_FLUSH_TLB)
 			kvm_vcpu_flush_tlb_guest(vcpu);
+	} else {
+		st->preempted = 0;
 	}
 
 	vcpu->arch.st.preempted = 0;
--
2.25.1


2021-05-19 18:03:54

by Wanpeng Li

Subject: [PATCH v4 4/5] KVM: X86: hyper-v: Take srcu lock when accessing kvm_memslots()

From: Wanpeng Li <[email protected]>

WARNING: suspicious RCU usage
5.13.0-rc1 #4 Not tainted
-----------------------------
./include/linux/kvm_host.h:710 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by hyperv_clock/8318:
#0: ffffb6b8cb05a7d8 (&hv->hv_lock){+.+.}-{3:3}, at: kvm_hv_invalidate_tsc_page+0x3e/0xa0 [kvm]

stack backtrace:
CPU: 3 PID: 8318 Comm: hyperv_clock Not tainted 5.13.0-rc1 #4
Call Trace:
dump_stack+0x87/0xb7
lockdep_rcu_suspicious+0xce/0xf0
kvm_write_guest_page+0x1c1/0x1d0 [kvm]
kvm_write_guest+0x50/0x90 [kvm]
kvm_hv_invalidate_tsc_page+0x79/0xa0 [kvm]
kvm_gen_update_masterclock+0x1d/0x110 [kvm]
kvm_arch_vm_ioctl+0x2a7/0xc50 [kvm]
kvm_vm_ioctl+0x123/0x11d0 [kvm]
__x64_sys_ioctl+0x3ed/0x9d0
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae

kvm_memslots() will be called by kvm_write_guest(), so we should take the srcu lock.
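
The fix follows the usual pattern for such paths, sketched below for
illustration (hypothetical wrapper name; the actual one-hunk change to
kvm_hv_invalidate_tsc_page() is in the diff):

  /*
   * Any path that reaches kvm_memslots() must hold the kvm->srcu read
   * lock around the access; kvm_write_guest() gets there via
   * gfn_to_memslot().
   */
  static int write_guest_with_srcu(struct kvm *kvm, gpa_t gpa,
                                   const void *data, unsigned long len)
  {
          int idx, ret;

          idx = srcu_read_lock(&kvm->srcu);
          ret = kvm_write_guest(kvm, gpa, data, len);
          srcu_read_unlock(&kvm->srcu, idx);

          return ret;
  }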

Fixes: e880c6ea5 (KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs)
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/hyperv.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index f98370a39936..f00830e5202f 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1172,6 +1172,7 @@ void kvm_hv_invalidate_tsc_page(struct kvm *kvm)
 {
 	struct kvm_hv *hv = to_kvm_hv(kvm);
 	u64 gfn;
+	int idx;
 
 	if (hv->hv_tsc_page_status == HV_TSC_PAGE_BROKEN ||
 	    hv->hv_tsc_page_status == HV_TSC_PAGE_UNSET ||
@@ -1190,9 +1191,16 @@ void kvm_hv_invalidate_tsc_page(struct kvm *kvm)
 	gfn = hv->hv_tsc_page >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
 
 	hv->tsc_ref.tsc_sequence = 0;
+
+	/*
+	 * Take the srcu lock as memslots will be accessed to check the gfn
+	 * cache generation against the memslots generation.
+	 */
+	idx = srcu_read_lock(&kvm->srcu);
 	if (kvm_write_guest(kvm, gfn_to_gpa(gfn),
 			    &hv->tsc_ref, sizeof(hv->tsc_ref.tsc_sequence)))
 		hv->hv_tsc_page_status = HV_TSC_PAGE_BROKEN;
+	srcu_read_unlock(&kvm->srcu, idx);
 
 out_unlock:
 	mutex_unlock(&hv->hv_lock);
--
2.25.1


2021-05-19 18:05:18

by Wanpeng Li

Subject: [PATCH v4 5/5] KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch

From: Wanpeng Li <[email protected]>

Let's treat the lapic_timer_advance_ns automatic tuning logic as hypervisor
overhead and move it before wait_lapic_expire, instead of running it between
wait_lapic_expire and the world switch; the wait duration should then be
calculated from a guest_tsc that is re-read after the overhead of the
automatic tuning logic. This patch shaves ~30+ cycles off
kvm-unit-tests/tscdeadline-latency when testing busy waits.
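
To make the effect concrete (illustrative numbers only): suppose the timer
fired 500 TSC cycles before the deadline and the tuning logic itself costs
~40 cycles. Previously the vCPU busy-waited the full 500 cycles and only then
ran the ~40 cycles of tuning, so the world switch, and hence the timer
injection, landed ~40 cycles after the deadline. With this change those ~40
cycles are spent first and guest_tsc is re-read, so only the remaining ~460
cycles are busy-waited and the deadline is hit much more closely.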

Signed-off-by: Wanpeng Li <[email protected]>
---
v3 -> v4:
* consider automatic tuning is disabled and timer did not arrive early
* add a code comment

arch/x86/kvm/lapic.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c0ebef560bd1..5d91f2367c31 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1598,11 +1598,19 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
 	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 	apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;
 
+	if (lapic_timer_advance_dynamic) {
+		adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
+		/*
+		 * If the timer fired early, reread the TSC to account for the
+		 * overhead of the above adjustment to avoid waiting longer
+		 * than is necessary.
+		 */
+		if (guest_tsc < tsc_deadline)
+			guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	}
+
 	if (guest_tsc < tsc_deadline)
 		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
-
-	if (lapic_timer_advance_dynamic)
-		adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
 }
 
 void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
--
2.25.1


2021-05-19 18:30:57

by Sean Christopherson

Subject: Re: [PATCH v4 5/5] KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch

On Tue, May 18, 2021, Wanpeng Li wrote:
> From: Wanpeng Li <[email protected]>
>
> Let's treat lapic_timer_advance_ns automatic tuning logic as hypervisor
> overhead, move it before wait_lapic_expire instead of between wait_lapic_expire
> and the world switch, the wait duration should be calculated by the
> up-to-date guest_tsc after the overhead of automatic tuning logic. This
> patch reduces ~30+ cycles for kvm-unit-tests/tscdeadline-latency when testing
> busy waits.
>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---

Reviewed-by: Sean Christopherson <[email protected]>

2021-05-19 18:33:00

by Sean Christopherson

Subject: Re: [PATCH v4 2/5] KVM: X86: Bail out of direct yield in case of under-committed scenarios

On Tue, May 18, 2021, Wanpeng Li wrote:
> From: Wanpeng Li <[email protected]>
>
> In case of under-committed scenarios, vCPU can get scheduling easily,
> kvm_vcpu_yield_to add extra overhead, we can observe a lot of race
> between vcpu->ready is true and yield fails due to p->state is
> TASK_RUNNING. Let's bail out in such scenarios by checking the length
> of current cpu runqueue, it can be treated as a hint of under-committed
> instead of guarantee of accuracy. The directed_yield_successful/attempted
> ratio can be improved from 50+% to 80+% in the under-committed scenario.

The "50+% to 80+%" comment will be a bit confusing for future readers now that
the single_task_running() case counts as an attempt. I think the new comment
would be something like "30%+ of directed-yield attempts can avoid the expensive
lookups in kvm_sched_yield() in an under-committed scenario." That would also
provide the real justification, as bumping the success ratio isn't the true goal
of this path.

> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> v2 -> v3:
> * update patch description
> v1 -> v2:
> * move the check after attempted counting
> * update patch description
>
> arch/x86/kvm/x86.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9b6bca616929..dfb7c320581f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8360,6 +8360,9 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
>
> vcpu->stat.directed_yield_attempted++;
>
> + if (single_task_running())
> + goto no_yield;
> +
> rcu_read_lock();
> map = rcu_dereference(vcpu->kvm->arch.apic_map);
>
> --
> 2.25.1
>

2021-05-19 19:05:40

by Wanpeng Li

Subject: Re: [PATCH v4 2/5] KVM: X86: Bail out of direct yield in case of under-committed scenarios

On Wed, 19 May 2021 at 03:21, Sean Christopherson <[email protected]> wrote:
>
> On Tue, May 18, 2021, Wanpeng Li wrote:
> > From: Wanpeng Li <[email protected]>
> >
> > In case of under-committed scenarios, vCPU can get scheduling easily,
> > kvm_vcpu_yield_to add extra overhead, we can observe a lot of race
> > between vcpu->ready is true and yield fails due to p->state is
> > TASK_RUNNING. Let's bail out in such scenarios by checking the length
> > of current cpu runqueue, it can be treated as a hint of under-committed
> > instead of guarantee of accuracy. The directed_yield_successful/attempted
> > ratio can be improved from 50+% to 80+% in the under-committed scenario.
>
> The "50+% to 80+%" comment will be a bit confusing for future readers now that
> the single_task_running() case counts as an attempt. I think the new comment
> would be something like "30%+ of directed-yield attempts can avoid the expensive
> lookups in kvm_sched_yield() in an under-committed scenario." That would also
> provide the real justification, as bumping the success ratio isn't the true goal
> of this path.

Looks good. Hope Paolo can update the patch description when applying. :)

"In case of under-committed scenarios, vCPU can get scheduling easily,
kvm_vcpu_yield_to add extra overhead, we can observe a lot of races
between vcpu->ready is true and yield fails due to p->state is
TASK_RUNNING. Let's bail out in such scenarios by checking the length
of current cpu runqueue, it can be treated as a hint of under-committed
instead of guaranteeing accuracy. 30%+ of directed-yield attempts can
avoid the expensive lookups in kvm_sched_yield() in an under-committed
scenario. "

2021-05-24 13:43:46

by Paolo Bonzini

Subject: Re: [PATCH v4 2/5] KVM: X86: Bail out of direct yield in case of under-committed scenarios

On 19/05/21 04:57, Wanpeng Li wrote:
> Looks good. Hope Paolo can update the patch description when applying.:)
>
> "In case of under-committed scenarios, vCPU can get scheduling easily,
> kvm_vcpu_yield_to add extra overhead, we can observe a lot of races
> between vcpu->ready is true and yield fails due to p->state is
> TASK_RUNNING. Let's bail out in such scenarios by checking the length
> of current cpu runqueue, it can be treated as a hint of under-committed
> instead of guaranteeing accuracy. 30%+ of directed-yield attempts can
> avoid the expensive lookups in kvm_sched_yield() in an under-committed
> scenario. "
>

Here is what I used:

In case of under-committed scenarios, vCPUs can be scheduled easily;
kvm_vcpu_yield_to adds extra overhead, and it is also common to see
when vcpu->ready is true but yield later failing due to p->state is
TASK_RUNNING.

Let's bail out in such scenarios by checking the length of current cpu
runqueue, which can be treated as a hint of under-committed instead of
guarantee of accuracy. 30%+ of directed-yield attempts can now avoid
the expensive lookups in kvm_sched_yield() in an under-committed scenario.

Paolo

2021-05-24 13:48:48

by Paolo Bonzini

Subject: Re: [PATCH v4 1/5] KVM: exit halt polling on need_resched() for both book3s and generic halt-polling

On 18/05/21 14:00, Wanpeng Li wrote:
> From: Wanpeng Li <[email protected]>
>
> Inspired by commit 262de4102c7bb8 (kvm: exit halt polling on need_resched()
> as well), CFS_BANDWIDTH throttling will use resched_task() when there is just
> one task to get the task to block. It was likely allowing VMs to overrun their
> quota when halt polling. Due to PPC implements an arch specific halt polling
> logic, we should add the need_resched() checking there as well. This
> patch adds a helper function that to be shared between book3s and generic
> halt-polling loop.
>
> Reviewed-by: David Matlack <[email protected]>
> Reviewed-by: Venkatesh Srinivas <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Venkatesh Srinivas <[email protected]>
> Cc: Jim Mattson <[email protected]>
> Cc: David Matlack <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Suraj Jitindar Singh <[email protected]>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> v3 -> v4:
> * rename to kvm_vcpu_can_poll
> v2 -> v3:
> * add a helper function
> v1 -> v2:
> * update patch description
>
> arch/powerpc/kvm/book3s_hv.c | 2 +-
> include/linux/kvm_host.h | 2 ++
> virt/kvm/kvm_main.c | 8 ++++++--
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 28a80d240b76..7360350e66ff 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3936,7 +3936,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
> break;
> }
> cur = ktime_get();
> - } while (single_task_running() && ktime_before(cur, stop));
> + } while (kvm_vcpu_can_poll(cur, stop));
>
> spin_lock(&vc->lock);
> vc->vcore_state = VCORE_INACTIVE;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2f34487e21f2..ba682f738a25 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1583,4 +1583,6 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop);
> +
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6b4feb92dc79..62522c12beba 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2945,6 +2945,11 @@ update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
> vcpu->stat.halt_poll_success_ns += poll_ns;
> }
>
> +bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
> +{
> + return single_task_running() && !need_resched() && ktime_before(cur, stop);
> +}
> +
> /*
> * The vCPU has executed a HLT instruction with in-kernel mode enabled.
> */
> @@ -2973,8 +2978,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> goto out;
> }
> poll_end = cur = ktime_get();
> - } while (single_task_running() && !need_resched() &&
> - ktime_before(cur, stop));
> + } while (kvm_vcpu_can_poll(cur, stop));
> }
>
> prepare_to_rcuwait(&vcpu->wait);
>

Queued all five, thanks.

Paolo

2021-05-24 13:48:50

by Wanpeng Li

Subject: Re: [PATCH v4 2/5] KVM: X86: Bail out of direct yield in case of under-committed scenarios

On Mon, 24 May 2021 at 21:42, Paolo Bonzini <[email protected]> wrote:
>
> On 19/05/21 04:57, Wanpeng Li wrote:
> > Looks good. Hope Paolo can update the patch description when applying.:)
> >
> > "In case of under-committed scenarios, vCPU can get scheduling easily,
> > kvm_vcpu_yield_to add extra overhead, we can observe a lot of races
> > between vcpu->ready is true and yield fails due to p->state is
> > TASK_RUNNING. Let's bail out in such scenarios by checking the length
> > of current cpu runqueue, it can be treated as a hint of under-committed
> > instead of guaranteeing accuracy. 30%+ of directed-yield attempts can
> > avoid the expensive lookups in kvm_sched_yield() in an under-committed
> > scenario. "
> >
>
> Here is what I used:
>
> In case of under-committed scenarios, vCPUs can be scheduled easily;
> kvm_vcpu_yield_to adds extra overhead, and it is also common to see
> when vcpu->ready is true but yield later failing due to p->state is
> TASK_RUNNING.
>
> Let's bail out in such scenarios by checking the length of current cpu
> runqueue, which can be treated as a hint of under-committed instead of
> guarantee of accuracy. 30%+ of directed-yield attempts can now avoid
> the expensive lookups in kvm_sched_yield() in an under-committed scenario.

Thanks Paolo! :)
Wanpeng