From: Wanpeng Li <[email protected]>
Inspired by commit 262de4102c7bb8 ("kvm: exit halt polling on need_resched()
as well"): CFS_BANDWIDTH throttling uses resched_task() to get a task to
block when it is the only runnable task, so halt polling was likely
allowing VMs to overrun their quota. Because PPC implements its own
arch-specific halt-polling logic, the need_resched() check should be added
there as well. This patch adds a helper function to be shared between the
book3s and generic halt-polling loops.
Cc: Ben Segall <[email protected]>
Cc: Venkatesh Srinivas <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Suraj Jitindar Singh <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
v2 -> v3:
* add a helper function
v1 -> v2:
* update patch description
arch/powerpc/kvm/book3s_hv.c | 2 +-
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 9 +++++++--
3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 28a80d240b76..360165df345b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3936,7 +3936,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
break;
}
cur = ktime_get();
- } while (single_task_running() && ktime_before(cur, stop));
+ } while (kvm_vcpu_can_block(cur, stop));
spin_lock(&vc->lock);
vc->vcore_state = VCORE_INACTIVE;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2f34487e21f2..bf4fd60c4699 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1583,4 +1583,6 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
+bool kvm_vcpu_can_block(ktime_t cur, ktime_t stop);
+
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b4feb92dc79..c81080667fd1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2945,6 +2945,12 @@ update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
vcpu->stat.halt_poll_success_ns += poll_ns;
}
+
+bool kvm_vcpu_can_block(ktime_t cur, ktime_t stop)
+{
+ return single_task_running() && !need_resched() && ktime_before(cur, stop);
+}
+
/*
* The vCPU has executed a HLT instruction with in-kernel mode enabled.
*/
@@ -2973,8 +2979,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
goto out;
}
poll_end = cur = ktime_get();
- } while (single_task_running() && !need_resched() &&
- ktime_before(cur, stop));
+ } while (kvm_vcpu_can_block(cur, stop));
}
prepare_to_rcuwait(&vcpu->wait);
--
2.25.1
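For reference, a minimal sketch of how the shared predicate bounds a
halt-polling loop. This is illustrative pseudo-kernel code, not the actual
book3s or generic loop, and poll_one_iteration() is a hypothetical stand-in
for the per-arch wakeup check:

/*
 * Illustrative sketch only: poll until another task becomes runnable, the
 * scheduler requests a reschedule (e.g. CFS bandwidth throttling raising
 * TIF_NEED_RESCHED), or the polling window expires.
 */
static bool kvm_vcpu_can_block(ktime_t cur, ktime_t stop)
{
	return single_task_running() && !need_resched() &&
	       ktime_before(cur, stop);
}

static void halt_poll_sketch(struct kvm_vcpu *vcpu, ktime_t stop)
{
	ktime_t cur;

	do {
		if (poll_one_iteration(vcpu))	/* hypothetical: wakeup pending? */
			return;
		cpu_relax();
		cur = ktime_get();
	} while (kvm_vcpu_can_block(cur, stop));
}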
From: Wanpeng Li <[email protected]>
In under-committed scenarios a vCPU can get scheduled easily, so
kvm_vcpu_yield_to() only adds overhead: we observe many races where
vcpu->ready is true yet the yield fails because p->state is TASK_RUNNING.
Bail out in such scenarios by checking the length of the current CPU's
runqueue; this is a hint that the system is under-committed, not a
guarantee of accuracy. The directed_yield_successful/attempted ratio
improves from 50+% to 80+% in the under-committed scenario.
Signed-off-by: Wanpeng Li <[email protected]>
---
v2 -> v3:
* update patch description
v1 -> v2:
* move the check after attempted counting
* update patch description
arch/x86/kvm/x86.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b6bca616929..dfb7c320581f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8360,6 +8360,9 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
vcpu->stat.directed_yield_attempted++;
+ if (single_task_running())
+ goto no_yield;
+
rcu_read_lock();
map = rcu_dereference(vcpu->kvm->arch.apic_map);
--
2.25.1
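To put the two added lines in context, here is a paraphrased sketch of the
yield path after this change (not the verbatim kvm_sched_yield(); the APIC
map lookup is elided):

/*
 * Paraphrased sketch of the directed-yield path. The single_task_running()
 * hint bails out cheaply in the under-committed case, but only after the
 * attempt has been counted, so the successful/attempted ratio stays
 * meaningful.
 */
static void kvm_sched_yield_sketch(struct kvm_vcpu *vcpu, unsigned long dest_id)
{
	struct kvm_vcpu *target = NULL;

	vcpu->stat.directed_yield_attempted++;

	/* No other task on this runqueue: likely under-committed, so the
	 * target vCPU is probably already running somewhere. */
	if (single_task_running())
		goto no_yield;

	/* ... resolve dest_id to @target via the APIC map (elided) ... */

	if (!target || !READ_ONCE(target->ready) || target == vcpu)
		goto no_yield;

	if (kvm_vcpu_yield_to(target) <= 0)
		goto no_yield;

	vcpu->stat.directed_yield_successful++;
no_yield:
	return;
}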
From: Wanpeng Li <[email protected]>
WARNING: suspicious RCU usage
5.13.0-rc1 #4 Not tainted
-----------------------------
./include/linux/kvm_host.h:710 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by hyperv_clock/8318:
#0: ffffb6b8cb05a7d8 (&hv->hv_lock){+.+.}-{3:3}, at: kvm_hv_invalidate_tsc_page+0x3e/0xa0 [kvm]
stack backtrace:
CPU: 3 PID: 8318 Comm: hyperv_clock Not tainted 5.13.0-rc1 #4
Call Trace:
dump_stack+0x87/0xb7
lockdep_rcu_suspicious+0xce/0xf0
kvm_write_guest_page+0x1c1/0x1d0 [kvm]
kvm_write_guest+0x50/0x90 [kvm]
kvm_hv_invalidate_tsc_page+0x79/0xa0 [kvm]
kvm_gen_update_masterclock+0x1d/0x110 [kvm]
kvm_arch_vm_ioctl+0x2a7/0xc50 [kvm]
kvm_vm_ioctl+0x123/0x11d0 [kvm]
__x64_sys_ioctl+0x3ed/0x9d0
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
kvm_memslots() will be called by kvm_write_guest(), so we should take the
SRCU read lock around it.
Fixes: e880c6ea5 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs")
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/hyperv.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index f98370a39936..f00830e5202f 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1172,6 +1172,7 @@ void kvm_hv_invalidate_tsc_page(struct kvm *kvm)
{
struct kvm_hv *hv = to_kvm_hv(kvm);
u64 gfn;
+ int idx;
if (hv->hv_tsc_page_status == HV_TSC_PAGE_BROKEN ||
hv->hv_tsc_page_status == HV_TSC_PAGE_UNSET ||
@@ -1190,9 +1191,16 @@ void kvm_hv_invalidate_tsc_page(struct kvm *kvm)
gfn = hv->hv_tsc_page >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
hv->tsc_ref.tsc_sequence = 0;
+
+ /*
+ * Take the srcu lock as memslots will be accessed to check the gfn
+ * cache generation against the memslots generation.
+ */
+ idx = srcu_read_lock(&kvm->srcu);
if (kvm_write_guest(kvm, gfn_to_gpa(gfn),
&hv->tsc_ref, sizeof(hv->tsc_ref.tsc_sequence)))
hv->hv_tsc_page_status = HV_TSC_PAGE_BROKEN;
+ srcu_read_unlock(&kvm->srcu, idx);
out_unlock:
mutex_unlock(&hv->hv_lock);
--
2.25.1
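More generally, the pattern the fix follows is the one sketched below; the
helper is illustrative, not kernel code, but any path that reaches
kvm_memslots() from a VM ioctl, directly or via kvm_write_guest(), needs the
same treatment:

/*
 * Illustrative only: writing guest memory from a VM ioctl path must hold
 * the kvm->srcu read lock, because kvm_write_guest() resolves the gfn
 * through kvm_memslots(), which is SRCU-protected.
 */
static int write_guest_locked(struct kvm *kvm, gpa_t gpa,
			      const void *data, unsigned long len)
{
	int idx, ret;

	idx = srcu_read_lock(&kvm->srcu);
	ret = kvm_write_guest(kvm, gpa, data, len);
	srcu_read_unlock(&kvm->srcu, idx);

	return ret;
}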
From: Wanpeng Li <[email protected]>
Treat the lapic_timer_advance_ns auto-tuning logic as hypervisor overhead:
move it before wait_lapic_expire instead of between wait_lapic_expire and
the world switch, so that the wait duration is calculated from a guest_tsc
read after the auto-tuning overhead. This patch shaves ~30+ cycles off
kvm-unit-tests/tscdeadline-latency when testing busy waits.
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/lapic.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c0ebef560bd1..552d2acf89ab 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1598,11 +1598,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;
- if (guest_tsc < tsc_deadline)
- __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
-
if (lapic_timer_advance_dynamic)
adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
+
+ guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+ if (guest_tsc < tsc_deadline)
+ __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
}
void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
--
2.25.1
From: Wanpeng Li <[email protected]>
Commit 66570e966dd9 ("kvm: x86: only provide PV features if enabled in
guest's CPUID") avoids touching the host-side PV TLB-shootdown logic when
that PV feature is not exposed to the guest. However, kvm_steal_time.preempted
is not only used by the PV TLB-shootdown logic; it also mitigates the lock
holder preemption issue. If the PV TLB-shootdown feature is not exposed, the
reset of kvm_steal_time.preempted before vmentry is lost, so from the guest's
point of view the vCPU always appears preempted. This patch fixes it by
clearing kvm_steal_time.preempted before vmentry.
Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Reviewed-by: Sean Christopherson <[email protected]>
Cc: [email protected]
Signed-off-by: Wanpeng Li <[email protected]>
---
v1 -> v2:
* add curly braces
arch/x86/kvm/x86.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dfb7c320581f..bed7b5348c0e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3105,6 +3105,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
st->preempted & KVM_VCPU_FLUSH_TLB);
if (xchg(&st->preempted, 0) & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb_guest(vcpu);
+ } else {
+ st->preempted = 0;
}
vcpu->arch.st.preempted = 0;
--
2.25.1
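For context, the guest-side consumer of this flag looks roughly like the
sketch below (paraphrased from the guest paravirt code; treat the details as
illustrative). Without the host clearing .preempted before vmentry, this
check would report the vCPU as preempted forever:

/*
 * Paraphrased guest-side sketch: vcpu_is_preempted() feeds lock holder
 * preemption heuristics, e.g. whether to keep spinning on a lock whose
 * holder runs on another vCPU.
 */
static bool __kvm_vcpu_is_preempted_sketch(long cpu)
{
	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

	/* A stale KVM_VCPU_PREEMPTED bit makes every vCPU look preempted. */
	return !!(src->preempted & KVM_VCPU_PREEMPTED);
}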
On Mon, May 17, 2021 at 7:01 AM Wanpeng Li <[email protected]> wrote:
>
> From: Wanpeng Li <[email protected]>
>
> Inspired by commit 262de4102c7bb8 ("kvm: exit halt polling on need_resched()
> as well"): CFS_BANDWIDTH throttling uses resched_task() to get a task to
> block when it is the only runnable task, so halt polling was likely
> allowing VMs to overrun their quota. Because PPC implements its own
> arch-specific halt-polling logic, the need_resched() check should be added
> there as well. This patch adds a helper function to be shared between the
> book3s and generic halt-polling loops.
>
> Cc: Ben Segall <[email protected]>
> Cc: Venkatesh Srinivas <[email protected]>
> Cc: Jim Mattson <[email protected]>
> Cc: David Matlack <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Suraj Jitindar Singh <[email protected]>
> Signed-off-by: Wanpeng Li <[email protected]>
Reviewed-by: David Matlack <[email protected]>
> ---
> v2 -> v3:
> * add a helper function
> v1 -> v2:
> * update patch description
>
> arch/powerpc/kvm/book3s_hv.c | 2 +-
> include/linux/kvm_host.h | 2 ++
> virt/kvm/kvm_main.c | 9 +++++++--
> 3 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 28a80d240b76..360165df345b 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3936,7 +3936,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
> break;
> }
> cur = ktime_get();
> - } while (single_task_running() && ktime_before(cur, stop));
> + } while (kvm_vcpu_can_block(cur, stop));
>
> spin_lock(&vc->lock);
> vc->vcore_state = VCORE_INACTIVE;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2f34487e21f2..bf4fd60c4699 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1583,4 +1583,6 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +bool kvm_vcpu_can_block(ktime_t cur, ktime_t stop);
> +
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6b4feb92dc79..c81080667fd1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2945,6 +2945,12 @@ update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
> vcpu->stat.halt_poll_success_ns += poll_ns;
> }
>
> +
> +bool kvm_vcpu_can_block(ktime_t cur, ktime_t stop)
nit: kvm_vcpu_can_poll() would be a more accurate name for this function.
> +{
> + return single_task_running() && !need_resched() && ktime_before(cur, stop);
> +}
> +
> /*
> * The vCPU has executed a HLT instruction with in-kernel mode enabled.
> */
> @@ -2973,8 +2979,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> goto out;
> }
> poll_end = cur = ktime_get();
> - } while (single_task_running() && !need_resched() &&
> - ktime_before(cur, stop));
> + } while (kvm_vcpu_can_block(cur, stop));
> }
>
> prepare_to_rcuwait(&vcpu->wait);
> --
> 2.25.1
>
On Mon, May 17, 2021, Wanpeng Li wrote:
> From: Wanpeng Li <[email protected]>
>
> Treat the lapic_timer_advance_ns auto-tuning logic as hypervisor overhead:
> move it before wait_lapic_expire instead of between wait_lapic_expire and
> the world switch, so that the wait duration is calculated from a guest_tsc
> read after the auto-tuning overhead. This patch shaves ~30+ cycles off
> kvm-unit-tests/tscdeadline-latency when testing busy waits.
>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index c0ebef560bd1..552d2acf89ab 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1598,11 +1598,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;
>
> - if (guest_tsc < tsc_deadline)
> - __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> -
> if (lapic_timer_advance_dynamic)
> adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
> +
> + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
This is redundant and unnecessary if automatic tuning is disabled, or if the
timer did not arrive early. A comment would also be helpful. E.g. I think this
would micro-optimize all paths:
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c0ebef560bd1..5d91f2367c31 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1598,11 +1598,19 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;
+ if (lapic_timer_advance_dynamic) {
+ adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
+ /*
+ * If the timer fired early, reread the TSC to account for the
+ * overhead of the above adjustment to avoid waiting longer
+ * than is necessary.
+ */
+ if (guest_tsc < tsc_deadline)
+ guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+ }
+
if (guest_tsc < tsc_deadline)
__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
-
- if (lapic_timer_advance_dynamic)
- adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
}
void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> + if (guest_tsc < tsc_deadline)
> + __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> }
>
> void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> --
> 2.25.1
>
On Mon, May 17, 2021 at 7:01 AM Wanpeng Li <[email protected]> wrote:
>
> From: Wanpeng Li <[email protected]>
>
> Inspired by commit 262de4102c7bb8 ("kvm: exit halt polling on need_resched()
> as well"): CFS_BANDWIDTH throttling uses resched_task() to get a task to
> block when it is the only runnable task, so halt polling was likely
> allowing VMs to overrun their quota. Because PPC implements its own
> arch-specific halt-polling logic, the need_resched() check should be added
> there as well. This patch adds a helper function to be shared between the
> book3s and generic halt-polling loops.
>
> Cc: Ben Segall <[email protected]>
> Cc: Venkatesh Srinivas <[email protected]>
> Cc: Jim Mattson <[email protected]>
> Cc: David Matlack <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Suraj Jitindar Singh <[email protected]>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> v2 -> v3:
> * add a helper function
> v1 -> v2:
> * update patch description
>
> arch/powerpc/kvm/book3s_hv.c | 2 +-
> include/linux/kvm_host.h | 2 ++
> virt/kvm/kvm_main.c | 9 +++++++--
> 3 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 28a80d240b76..360165df345b 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3936,7 +3936,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
> break;
> }
> cur = ktime_get();
> - } while (single_task_running() && ktime_before(cur, stop));
> + } while (kvm_vcpu_can_block(cur, stop));
>
> spin_lock(&vc->lock);
> vc->vcore_state = VCORE_INACTIVE;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2f34487e21f2..bf4fd60c4699 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1583,4 +1583,6 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +bool kvm_vcpu_can_block(ktime_t cur, ktime_t stop);
> +
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6b4feb92dc79..c81080667fd1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2945,6 +2945,12 @@ update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
> vcpu->stat.halt_poll_success_ns += poll_ns;
> }
>
> +
> +bool kvm_vcpu_can_block(ktime_t cur, ktime_t stop)
> +{
> + return single_task_running() && !need_resched() && ktime_before(cur, stop);
> +}
> +
> /*
> * The vCPU has executed a HLT instruction with in-kernel mode enabled.
> */
> @@ -2973,8 +2979,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> goto out;
> }
> poll_end = cur = ktime_get();
> - } while (single_task_running() && !need_resched() &&
> - ktime_before(cur, stop));
> + } while (kvm_vcpu_can_block(cur, stop));
> }
>
> prepare_to_rcuwait(&vcpu->wait);
> --
> 2.25.1
>
Reviewed-by: Venkatesh Srinivas <[email protected]>
On Tue, 18 May 2021 at 01:51, Sean Christopherson <[email protected]> wrote:
>
> On Mon, May 17, 2021, Wanpeng Li wrote:
> > From: Wanpeng Li <[email protected]>
> >
> > Treat the lapic_timer_advance_ns auto-tuning logic as hypervisor overhead:
> > move it before wait_lapic_expire instead of between wait_lapic_expire and
> > the world switch, so that the wait duration is calculated from a guest_tsc
> > read after the auto-tuning overhead. This patch shaves ~30+ cycles off
> > kvm-unit-tests/tscdeadline-latency when testing busy waits.
> >
> > Signed-off-by: Wanpeng Li <[email protected]>
> > ---
> > arch/x86/kvm/lapic.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index c0ebef560bd1..552d2acf89ab 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1598,11 +1598,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> > guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;
> >
> > - if (guest_tsc < tsc_deadline)
> > - __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> > -
> > if (lapic_timer_advance_dynamic)
> > adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
> > +
> > + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>
> This is redundant and unnecessary if automatic tuning is disabled, or if the
> timer did not arrive early. A comment would also be helpful. E.g. I think this
> would micro-optimize all paths:
Will do in v4, thanks.
Wanpeng
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index c0ebef560bd1..5d91f2367c31 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1598,11 +1598,19 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;
>
> + if (lapic_timer_advance_dynamic) {
> + adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
> + /*
> + * If the timer fired early, reread the TSC to account for the
> + * overhead of the above adjustment to avoid waiting longer
> + * than is necessary.
> + */
> + if (guest_tsc < tsc_deadline)
> + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> + }
> +
> if (guest_tsc < tsc_deadline)
> __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> -
> - if (lapic_timer_advance_dynamic)
> - adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
> }
>
> void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
>
> > + if (guest_tsc < tsc_deadline)
> > + __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> > }
> >
> > void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> > --
> > 2.25.1
> >
On Tue, 18 May 2021 at 00:35, David Matlack <[email protected]> wrote:
>
> On Mon, May 17, 2021 at 7:01 AM Wanpeng Li <[email protected]> wrote:
> >
> > From: Wanpeng Li <[email protected]>
> >
> > Inspired by commit 262de4102c7bb8 ("kvm: exit halt polling on need_resched()
> > as well"): CFS_BANDWIDTH throttling uses resched_task() to get a task to
> > block when it is the only runnable task, so halt polling was likely
> > allowing VMs to overrun their quota. Because PPC implements its own
> > arch-specific halt-polling logic, the need_resched() check should be added
> > there as well. This patch adds a helper function to be shared between the
> > book3s and generic halt-polling loops.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Venkatesh Srinivas <[email protected]>
> > Cc: Jim Mattson <[email protected]>
> > Cc: David Matlack <[email protected]>
> > Cc: Paul Mackerras <[email protected]>
> > Cc: Suraj Jitindar Singh <[email protected]>
> > Signed-off-by: Wanpeng Li <[email protected]>
>
> Reviewed-by: David Matlack <[email protected]>
>
> > ---
> > v2 -> v3:
> > * add a helper function
> > v1 -> v2:
> > * update patch description
> >
> > arch/powerpc/kvm/book3s_hv.c | 2 +-
> > include/linux/kvm_host.h | 2 ++
> > virt/kvm/kvm_main.c | 9 +++++++--
> > 3 files changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 28a80d240b76..360165df345b 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -3936,7 +3936,7 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
> > break;
> > }
> > cur = ktime_get();
> > - } while (single_task_running() && ktime_before(cur, stop));
> > + } while (kvm_vcpu_can_block(cur, stop));
> >
> > spin_lock(&vc->lock);
> > vc->vcore_state = VCORE_INACTIVE;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 2f34487e21f2..bf4fd60c4699 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1583,4 +1583,6 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
> > /* Max number of entries allowed for each kvm dirty ring */
> > #define KVM_DIRTY_RING_MAX_ENTRIES 65536
> >
> > +bool kvm_vcpu_can_block(ktime_t cur, ktime_t stop);
> > +
> > #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 6b4feb92dc79..c81080667fd1 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2945,6 +2945,12 @@ update_halt_poll_stats(struct kvm_vcpu *vcpu, u64 poll_ns, bool waited)
> > vcpu->stat.halt_poll_success_ns += poll_ns;
> > }
> >
> > +
> > +bool kvm_vcpu_can_block(ktime_t cur, ktime_t stop)
>
> nit: kvm_vcpu_can_poll() would be a more accurate name for this function.
Will do in v4. :)
Wanpeng