2019-05-20 08:24:18

by Wanpeng Li

[permalink] [raw]
Subject: [PATCH v4 0/5] KVM: LAPIC: Optimize timer latency further

Advance lapic timer tries to hidden the hypervisor overhead between the
host emulated timer fires and the guest awares the timer is fired. However,
it just hidden the time between apic_timer_fn/handle_preemption_timer ->
wait_lapic_expire, instead of the real position of vmentry which is
mentioned in the orignial commit d0659d946be0 ("KVM: x86: add option to
advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
the end of wait_lapic_expire and before world switch on my haswell desktop.

This patchset tries to narrow the last gap(wait_lapic_expire -> world switch),
it takes the real overhead time between apic_timer_fn/handle_preemption_timer
and before world switch into consideration when adaptively tuning timer
advancement. The patchset can reduce 40% latency (~1600+ cycles to ~1000+
cycles on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when
testing busy waits.

v3 -> v4:
* create timer_advance_ns debugfs entry iff lapic_in_kernel()
* keep if (guest_tsc < tsc_deadline) before the call to __wait_lapic_expire()

v2 -> v3:
* expose 'kvm_timer.timer_advance_ns' to userspace
* move the tracepoint below guest_exit_irqoff()
* move wait_lapic_expire() before flushing the L1

v1 -> v2:
* fix indent in patch 1/4
* remove the wait_lapic_expire() tracepoint and expose by debugfs
* move the call to wait_lapic_expire() into vmx.c and svm.c

Wanpeng Li (5):
KVM: LAPIC: Extract adaptive tune timer advancement logic
KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow
KVM: LAPIC: Expose per-vCPU timer_advance_ns to userspace
KVM: LAPIC: Delay trace advance expire delta
KVM: LAPIC: Optimize timer latency further

arch/x86/kvm/debugfs.c | 18 +++++++++++++++
arch/x86/kvm/lapic.c | 60 +++++++++++++++++++++++++++++---------------------
arch/x86/kvm/lapic.h | 3 ++-
arch/x86/kvm/svm.c | 4 ++++
arch/x86/kvm/vmx/vmx.c | 4 ++++
arch/x86/kvm/x86.c | 9 ++++----
6 files changed, 68 insertions(+), 30 deletions(-)

--
2.7.4



2019-05-20 08:24:19

by Wanpeng Li

[permalink] [raw]
Subject: [PATCH v4 3/5] KVM: LAPIC: Expose per-vCPU timer_advance_ns to userspace

From: Wanpeng Li <[email protected]>

Expose per-vCPU timer_advance_ns to userspace, so it is able to
query the auto-adjusted value.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Liran Alon <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/debugfs.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index c19c7ed..a2f3432 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -9,12 +9,22 @@
*/
#include <linux/kvm_host.h>
#include <linux/debugfs.h>
+#include "lapic.h"

bool kvm_arch_has_vcpu_debugfs(void)
{
return true;
}

+static int vcpu_get_timer_advance_ns(void *data, u64 *val)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data;
+ *val = vcpu->arch.apic->lapic_timer.timer_advance_ns;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(vcpu_timer_advance_ns_fops, vcpu_get_timer_advance_ns, NULL, "%llu\n");
+
static int vcpu_get_tsc_offset(void *data, u64 *val)
{
struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data;
@@ -51,6 +61,14 @@ int kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu)
if (!ret)
return -ENOMEM;

+ if (lapic_in_kernel(vcpu)) {
+ ret = debugfs_create_file("lapic_timer_advance_ns", 0444,
+ vcpu->debugfs_dentry,
+ vcpu, &vcpu_timer_advance_ns_fops);
+ if (!ret)
+ return -ENOMEM;
+ }
+
if (kvm_has_tsc_control) {
ret = debugfs_create_file("tsc-scaling-ratio", 0444,
vcpu->debugfs_dentry,
--
2.7.4


2019-05-20 08:24:23

by Wanpeng Li

[permalink] [raw]
Subject: [PATCH v4 5/5] KVM: LAPIC: Optimize timer latency further

From: Wanpeng Li <[email protected]>

Advance lapic timer tries to hidden the hypervisor overhead between the
host emulated timer fires and the guest awares the timer is fired. However,
it just hidden the time between apic_timer_fn/handle_preemption_timer ->
wait_lapic_expire, instead of the real position of vmentry which is
mentioned in the orignial commit d0659d946be0 ("KVM: x86: add option to
advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
the end of wait_lapic_expire and before world switch on my haswell desktop.

This patch tries to narrow the last gap(wait_lapic_expire -> world switch),
it takes the real overhead time between apic_timer_fn/handle_preemption_timer
and before world switch into consideration when adaptively tuning timer
advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles
on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
busy waits.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Liran Alon <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/lapic.c | 3 ++-
arch/x86/kvm/lapic.h | 2 +-
arch/x86/kvm/svm.c | 4 ++++
arch/x86/kvm/vmx/vmx.c | 4 ++++
arch/x86/kvm/x86.c | 3 ---
5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 6652928..db75ac5 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1531,7 +1531,7 @@ static inline void adaptive_tune_timer_advancement(struct kvm_vcpu *vcpu,
apic->lapic_timer.timer_advance_ns = timer_advance_ns;
}

-void wait_lapic_expire(struct kvm_vcpu *vcpu)
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
u64 guest_tsc, tsc_deadline;
@@ -1553,6 +1553,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
if (unlikely(!apic->lapic_timer.timer_advance_adjust_done))
adaptive_tune_timer_advancement(vcpu, apic->lapic_timer.advance_expire_delta);
}
+EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire);

static void start_sw_tscdeadline(struct kvm_lapic *apic)
{
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 3e72a25..f974a3d 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -220,7 +220,7 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)

bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);

-void wait_lapic_expire(struct kvm_vcpu *vcpu);
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);

bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
struct kvm_vcpu **dest_vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index a849dcb..e68c1b9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5632,6 +5632,10 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
clgi();
kvm_load_guest_xcr0(vcpu);

+ if (lapic_in_kernel(vcpu) &&
+ vcpu->arch.apic->lapic_timer.timer_advance_ns)
+ kvm_wait_lapic_expire(vcpu);
+
/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
* it's non-zero. Since vmentry is serialising on affected CPUs, there
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 16d2a3e..57d0e57 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6433,6 +6433,10 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)

vmx_update_hv_timer(vcpu);

+ if (lapic_in_kernel(vcpu) &&
+ vcpu->arch.apic->lapic_timer.timer_advance_ns)
+ kvm_wait_lapic_expire(vcpu);
+
/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
* it's non-zero. Since vmentry is serialising on affected CPUs, there
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac7353f..e6f05f4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7959,9 +7959,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}

trace_kvm_entry(vcpu->vcpu_id);
- if (lapic_in_kernel(vcpu) &&
- vcpu->arch.apic->lapic_timer.timer_advance_ns)
- wait_lapic_expire(vcpu);
guest_enter_irqoff();

fpregs_assert_state_consistent();
--
2.7.4


2019-05-20 09:11:25

by Wanpeng Li

[permalink] [raw]
Subject: [PATCH v4 4/5] KVM: LAPIC: Delay trace advance expire delta

From: Wanpeng Li <[email protected]>

wait_lapic_expire() call was moved above guest_enter_irqoff() because of
its tracepoint, which violated the RCU extended quiescent state invoked
by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint
below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before
VM-Enter, but trace it after VM-Exit. This can help us to move
wait_lapic_expire() just before vmentry in the later patch.

[1] Commit 8b89fe1f6c43 ("kvm: x86: move tracepoints outside extended quiescent state")
[2] https://patchwork.kernel.org/patch/7821111/

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Liran Alon <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/lapic.c | 14 +++++++-------
arch/x86/kvm/lapic.h | 1 +
arch/x86/kvm/x86.c | 4 ++++
3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 89ac82d..6652928 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1502,27 +1502,27 @@ static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
}

static inline void adaptive_tune_timer_advancement(struct kvm_vcpu *vcpu,
- u64 guest_tsc, u64 tsc_deadline)
+ s64 advance_expire_delta)
{
struct kvm_lapic *apic = vcpu->arch.apic;
u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
u64 ns;

/* too early */
- if (guest_tsc < tsc_deadline) {
- ns = (tsc_deadline - guest_tsc) * 1000000ULL;
+ if (advance_expire_delta < 0) {
+ ns = -advance_expire_delta * 1000000ULL;
do_div(ns, vcpu->arch.virtual_tsc_khz);
timer_advance_ns -= min((u32)ns,
timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
} else {
/* too late */
- ns = (guest_tsc - tsc_deadline) * 1000000ULL;
+ ns = advance_expire_delta * 1000000ULL;
do_div(ns, vcpu->arch.virtual_tsc_khz);
timer_advance_ns += min((u32)ns,
timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
}

- if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
+ if (abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
apic->lapic_timer.timer_advance_adjust_done = true;
if (unlikely(timer_advance_ns > 5000)) {
timer_advance_ns = 0;
@@ -1545,13 +1545,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
tsc_deadline = apic->lapic_timer.expired_tscdeadline;
apic->lapic_timer.expired_tscdeadline = 0;
guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
- trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
+ apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;

if (guest_tsc < tsc_deadline)
__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);

if (unlikely(!apic->lapic_timer.timer_advance_adjust_done))
- adaptive_tune_timer_advancement(vcpu, guest_tsc, tsc_deadline);
+ adaptive_tune_timer_advancement(vcpu, apic->lapic_timer.advance_expire_delta);
}

static void start_sw_tscdeadline(struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index d6d049b..3e72a25 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -32,6 +32,7 @@ struct kvm_timer {
u64 tscdeadline;
u64 expired_tscdeadline;
u32 timer_advance_ns;
+ s64 advance_expire_delta;
atomic_t pending; /* accumulated triggered timers */
bool hv_timer_in_use;
bool timer_advance_adjust_done;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3eb77e7..ac7353f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8017,6 +8017,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
++vcpu->stat.exits;

guest_exit_irqoff();
+ if (lapic_in_kernel(vcpu) &&
+ vcpu->arch.apic->lapic_timer.timer_advance_ns)
+ trace_kvm_wait_lapic_expire(vcpu->vcpu_id,
+ vcpu->arch.apic->lapic_timer.advance_expire_delta);

local_irq_enable();
preempt_enable();
--
2.7.4


2019-05-20 11:51:56

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] KVM: LAPIC: Optimize timer latency further

On 20/05/19 10:18, Wanpeng Li wrote:
> Advance lapic timer tries to hidden the hypervisor overhead between the
> host emulated timer fires and the guest awares the timer is fired. However,
> it just hidden the time between apic_timer_fn/handle_preemption_timer ->
> wait_lapic_expire, instead of the real position of vmentry which is
> mentioned in the orignial commit d0659d946be0 ("KVM: x86: add option to
> advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
> the end of wait_lapic_expire and before world switch on my haswell desktop.
>
> This patchset tries to narrow the last gap(wait_lapic_expire -> world switch),
> it takes the real overhead time between apic_timer_fn/handle_preemption_timer
> and before world switch into consideration when adaptively tuning timer
> advancement. The patchset can reduce 40% latency (~1600+ cycles to ~1000+
> cycles on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when
> testing busy waits.
>
> v3 -> v4:
> * create timer_advance_ns debugfs entry iff lapic_in_kernel()
> * keep if (guest_tsc < tsc_deadline) before the call to __wait_lapic_expire()
>
> v2 -> v3:
> * expose 'kvm_timer.timer_advance_ns' to userspace
> * move the tracepoint below guest_exit_irqoff()
> * move wait_lapic_expire() before flushing the L1
>
> v1 -> v2:
> * fix indent in patch 1/4
> * remove the wait_lapic_expire() tracepoint and expose by debugfs
> * move the call to wait_lapic_expire() into vmx.c and svm.c
>
> Wanpeng Li (5):
> KVM: LAPIC: Extract adaptive tune timer advancement logic
> KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow
> KVM: LAPIC: Expose per-vCPU timer_advance_ns to userspace
> KVM: LAPIC: Delay trace advance expire delta
> KVM: LAPIC: Optimize timer latency further
>
> arch/x86/kvm/debugfs.c | 18 +++++++++++++++
> arch/x86/kvm/lapic.c | 60 +++++++++++++++++++++++++++++---------------------
> arch/x86/kvm/lapic.h | 3 ++-
> arch/x86/kvm/svm.c | 4 ++++
> arch/x86/kvm/vmx/vmx.c | 4 ++++
> arch/x86/kvm/x86.c | 9 ++++----
> 6 files changed, 68 insertions(+), 30 deletions(-)
>

Queued, thanks (2-3 for 5.2, the rest for 5.3).

Paolo

2019-05-20 11:56:12

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] KVM: LAPIC: Delay trace advance expire delta

On Mon, 20 May 2019 at 19:14, Paolo Bonzini <[email protected]> wrote:
>
> On 20/05/19 10:18, Wanpeng Li wrote:
> > From: Wanpeng Li <[email protected]>
> >
> > wait_lapic_expire() call was moved above guest_enter_irqoff() because of
> > its tracepoint, which violated the RCU extended quiescent state invoked
> > by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint
> > below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before
> > VM-Enter, but trace it after VM-Exit. This can help us to move
> > wait_lapic_expire() just before vmentry in the later patch.
> >
> > [1] Commit 8b89fe1f6c43 ("kvm: x86: move tracepoints outside extended quiescent state")
> > [2] https://patchwork.kernel.org/patch/7821111/
>
> This is a bit confusing, since the delta is printed after the
> corresponding vmexit but the wait is done before the vmentry. I think
> we can drop the tracepoint:
>
> ------------- 8< ----------------
> From ae148d98d49b96b5222e2c78ac1b1e13cc526d71 Mon Sep 17 00:00:00 2001
> From: Paolo Bonzini <[email protected]>
> Date: Mon, 20 May 2019 13:10:01 +0200
> Subject: [PATCH] KVM: lapic: replace wait_lapic_expire tracepoint with
> restart_apic_timer
>
> wait_lapic_expire() call was moved above guest_enter_irqoff() because of
> its tracepoint, which violated the RCU extended quiescent state invoked
> by guest_enter_irqoff()[1][2].
>
> We would like to move wait_lapic_expire() just before vmentry, which would
> place wait_lapic_expire() again inside the extended quiescent state. Drop
> the tracepoint, but add instead another one that can be useful and where
> we can check the status of the adaptive tuning procedure.

https://lkml.org/lkml/2019/5/15/1435

Maybe Sean's comment is reasonable, per-vCPU debugfs entry for
adaptive tuning and wait_lapic_expire() tracepoint for hand tuning.

Regards,
Wanpeng Li

>
> [1] Commit 8b89fe1f6c43 ("kvm: x86: move tracepoints outside extended quiescent state")
> [2] https://patchwork.kernel.org/patch/7821111/
>
> Signed-off-by: Paolo Bonzini <[email protected]>
>
> ---
> arch/x86/kvm/lapic.c | 4 +++-
> arch/x86/kvm/trace.h | 15 +++++++--------
> 2 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index c12b090f4fad..8f05c1d0b486 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1545,7 +1545,6 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> apic->lapic_timer.expired_tscdeadline = 0;
> guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> - trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>
> if (guest_tsc < tsc_deadline)
> __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> @@ -1763,6 +1762,9 @@ static void start_sw_timer(struct kvm_lapic *apic)
>
> static void restart_apic_timer(struct kvm_lapic *apic)
> {
> + trace_kvm_restart_apic_timer(apic->vcpu->vcpu_id,
> + apic->lapic_timer.timer_advance_ns);
> +
> preempt_disable();
>
> if (!apic_lvtt_period(apic) && atomic_read(&apic->lapic_timer.pending))
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 4d47a2631d1f..f6e000038f3f 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -953,24 +953,23 @@
> __entry->flags)
> );
>
> -TRACE_EVENT(kvm_wait_lapic_expire,
> - TP_PROTO(unsigned int vcpu_id, s64 delta),
> - TP_ARGS(vcpu_id, delta),
> +TRACE_EVENT(kvm_restart_apic_timer,
> + TP_PROTO(unsigned int vcpu_id, u32 advance),
> + TP_ARGS(vcpu_id, advance),
>
> TP_STRUCT__entry(
> __field( unsigned int, vcpu_id )
> - __field( s64, delta )
> + __field( u32, advance )
> ),
>
> TP_fast_assign(
> __entry->vcpu_id = vcpu_id;
> - __entry->delta = delta;
> + __entry->advance = advance;
> ),
>
> - TP_printk("vcpu %u: delta %lld (%s)",
> + TP_printk("vcpu %u: advance %u",
> __entry->vcpu_id,
> - __entry->delta,
> - __entry->delta < 0 ? "early" : "late")
> + __entry->advance)
> );
>
> TRACE_EVENT(kvm_enter_smm,

2019-05-20 12:17:43

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] KVM: LAPIC: Delay trace advance expire delta

On Mon, 20 May 2019 at 19:33, Paolo Bonzini <[email protected]> wrote:
>
> On 20/05/19 13:22, Wanpeng Li wrote:
> >>
> >> We would like to move wait_lapic_expire() just before vmentry, which would
> >> place wait_lapic_expire() again inside the extended quiescent state. Drop
> >> the tracepoint, but add instead another one that can be useful and where
> >> we can check the status of the adaptive tuning procedure.
> > https://lkml.org/lkml/2019/5/15/1435
> >
> > Maybe Sean's comment is reasonable, per-vCPU debugfs entry for
> > adaptive tuning and wait_lapic_expire() tracepoint for hand tuning.
>
> Hmm, yeah, that makes sense. The location of the tracepoint is a bit
> weird, but I guess we can add a comment in the code.

Do you need me to post a new patchset? :)

Regards,
Wanpeng Li

2019-05-20 12:19:25

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] KVM: LAPIC: Delay trace advance expire delta

On Mon, 20 May 2019 at 19:41, Paolo Bonzini <[email protected]> wrote:
>
> On 20/05/19 13:36, Wanpeng Li wrote:
> >> Hmm, yeah, that makes sense. The location of the tracepoint is a bit
> >> weird, but I guess we can add a comment in the code.
> > Do you need me to post a new patchset? :)
>
> No problem. The final patch that I committed is this:
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index c12b090f4fad..f8615872ae64 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1502,27 +1502,27 @@ static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
> }
>
> static inline void adjust_lapic_timer_advance(struct kvm_vcpu *vcpu,
> - u64 guest_tsc, u64 tsc_deadline)
> + s64 advance_expire_delta)
> {
> struct kvm_lapic *apic = vcpu->arch.apic;
> u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
> u64 ns;
>
> /* too early */
> - if (guest_tsc < tsc_deadline) {
> - ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> + if (advance_expire_delta < 0) {
> + ns = -advance_expire_delta * 1000000ULL;
> do_div(ns, vcpu->arch.virtual_tsc_khz);
> timer_advance_ns -= min((u32)ns,
> timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> } else {
> /* too late */
> - ns = (guest_tsc - tsc_deadline) * 1000000ULL;
> + ns = advance_expire_delta * 1000000ULL;
> do_div(ns, vcpu->arch.virtual_tsc_khz);
> timer_advance_ns += min((u32)ns,
> timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> }
>
> - if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
> + if (abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
> apic->lapic_timer.timer_advance_adjust_done = true;
> if (unlikely(timer_advance_ns > 5000)) {
> timer_advance_ns = 0;
> @@ -1545,13 +1545,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> apic->lapic_timer.expired_tscdeadline = 0;
> guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> - trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> + apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;
>
> if (guest_tsc < tsc_deadline)
> __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
>
> if (unlikely(!apic->lapic_timer.timer_advance_adjust_done))
> - adjust_lapic_timer_advance(vcpu, guest_tsc, tsc_deadline);
> + adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
> }
>
> static void start_sw_tscdeadline(struct kvm_lapic *apic)
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index d6d049ba3045..3e72a255543d 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -32,6 +32,7 @@ struct kvm_timer {
> u64 tscdeadline;
> u64 expired_tscdeadline;
> u32 timer_advance_ns;
> + s64 advance_expire_delta;
> atomic_t pending; /* accumulated triggered timers */
> bool hv_timer_in_use;
> bool timer_advance_adjust_done;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e7e57de50a3c..35631505421c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8008,6 +8008,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> ++vcpu->stat.exits;
>
> guest_exit_irqoff();
> + if (lapic_in_kernel(vcpu)) {
> + s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
> + if (delta != S64_MIN) {
> + trace_kvm_wait_lapic_expire(vcpu->vcpu_id, delta);
> + vcpu->arch.apic->lapic_timer.advance_expire_delta = S64_MIN;
> + }
> + }
>
> local_irq_enable();
> preempt_enable();
>
> so that KVM tracks whether wait_lapic_expire was called, and do not
> invoke the tracepoint if not.

Looks good to me, thank you. :)

Regards,
Wanpeng Li

2019-05-20 14:03:22

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] KVM: LAPIC: Delay trace advance expire delta

On 20/05/19 13:22, Wanpeng Li wrote:
>>
>> We would like to move wait_lapic_expire() just before vmentry, which would
>> place wait_lapic_expire() again inside the extended quiescent state. Drop
>> the tracepoint, but add instead another one that can be useful and where
>> we can check the status of the adaptive tuning procedure.
> https://lkml.org/lkml/2019/5/15/1435
>
> Maybe Sean's comment is reasonable, per-vCPU debugfs entry for
> adaptive tuning and wait_lapic_expire() tracepoint for hand tuning.

Hmm, yeah, that makes sense. The location of the tracepoint is a bit
weird, but I guess we can add a comment in the code.

Paolo

2019-05-20 14:23:23

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] KVM: LAPIC: Delay trace advance expire delta

On 20/05/19 10:18, Wanpeng Li wrote:
> From: Wanpeng Li <[email protected]>
>
> wait_lapic_expire() call was moved above guest_enter_irqoff() because of
> its tracepoint, which violated the RCU extended quiescent state invoked
> by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint
> below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before
> VM-Enter, but trace it after VM-Exit. This can help us to move
> wait_lapic_expire() just before vmentry in the later patch.
>
> [1] Commit 8b89fe1f6c43 ("kvm: x86: move tracepoints outside extended quiescent state")
> [2] https://patchwork.kernel.org/patch/7821111/

This is a bit confusing, since the delta is printed after the
corresponding vmexit but the wait is done before the vmentry. I think
we can drop the tracepoint:

------------- 8< ----------------
From ae148d98d49b96b5222e2c78ac1b1e13cc526d71 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <[email protected]>
Date: Mon, 20 May 2019 13:10:01 +0200
Subject: [PATCH] KVM: lapic: replace wait_lapic_expire tracepoint with
restart_apic_timer

wait_lapic_expire() call was moved above guest_enter_irqoff() because of
its tracepoint, which violated the RCU extended quiescent state invoked
by guest_enter_irqoff()[1][2].

We would like to move wait_lapic_expire() just before vmentry, which would
place wait_lapic_expire() again inside the extended quiescent state. Drop
the tracepoint, but add instead another one that can be useful and where
we can check the status of the adaptive tuning procedure.

[1] Commit 8b89fe1f6c43 ("kvm: x86: move tracepoints outside extended quiescent state")
[2] https://patchwork.kernel.org/patch/7821111/

Signed-off-by: Paolo Bonzini <[email protected]>

---
arch/x86/kvm/lapic.c | 4 +++-
arch/x86/kvm/trace.h | 15 +++++++--------
2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c12b090f4fad..8f05c1d0b486 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1545,7 +1545,6 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
tsc_deadline = apic->lapic_timer.expired_tscdeadline;
apic->lapic_timer.expired_tscdeadline = 0;
guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
- trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);

if (guest_tsc < tsc_deadline)
__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
@@ -1763,6 +1762,9 @@ static void start_sw_timer(struct kvm_lapic *apic)

static void restart_apic_timer(struct kvm_lapic *apic)
{
+ trace_kvm_restart_apic_timer(apic->vcpu->vcpu_id,
+ apic->lapic_timer.timer_advance_ns);
+
preempt_disable();

if (!apic_lvtt_period(apic) && atomic_read(&apic->lapic_timer.pending))
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 4d47a2631d1f..f6e000038f3f 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -953,24 +953,23 @@
__entry->flags)
);

-TRACE_EVENT(kvm_wait_lapic_expire,
- TP_PROTO(unsigned int vcpu_id, s64 delta),
- TP_ARGS(vcpu_id, delta),
+TRACE_EVENT(kvm_restart_apic_timer,
+ TP_PROTO(unsigned int vcpu_id, u32 advance),
+ TP_ARGS(vcpu_id, advance),

TP_STRUCT__entry(
__field( unsigned int, vcpu_id )
- __field( s64, delta )
+ __field( u32, advance )
),

TP_fast_assign(
__entry->vcpu_id = vcpu_id;
- __entry->delta = delta;
+ __entry->advance = advance;
),

- TP_printk("vcpu %u: delta %lld (%s)",
+ TP_printk("vcpu %u: advance %u",
__entry->vcpu_id,
- __entry->delta,
- __entry->delta < 0 ? "early" : "late")
+ __entry->advance)
);

TRACE_EVENT(kvm_enter_smm,

2019-05-20 14:48:16

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] KVM: LAPIC: Delay trace advance expire delta

On 20/05/19 13:36, Wanpeng Li wrote:
>> Hmm, yeah, that makes sense. The location of the tracepoint is a bit
>> weird, but I guess we can add a comment in the code.
> Do you need me to post a new patchset? :)

No problem. The final patch that I committed is this:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c12b090f4fad..f8615872ae64 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1502,27 +1502,27 @@ static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
}

static inline void adjust_lapic_timer_advance(struct kvm_vcpu *vcpu,
- u64 guest_tsc, u64 tsc_deadline)
+ s64 advance_expire_delta)
{
struct kvm_lapic *apic = vcpu->arch.apic;
u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
u64 ns;

/* too early */
- if (guest_tsc < tsc_deadline) {
- ns = (tsc_deadline - guest_tsc) * 1000000ULL;
+ if (advance_expire_delta < 0) {
+ ns = -advance_expire_delta * 1000000ULL;
do_div(ns, vcpu->arch.virtual_tsc_khz);
timer_advance_ns -= min((u32)ns,
timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
} else {
/* too late */
- ns = (guest_tsc - tsc_deadline) * 1000000ULL;
+ ns = advance_expire_delta * 1000000ULL;
do_div(ns, vcpu->arch.virtual_tsc_khz);
timer_advance_ns += min((u32)ns,
timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
}

- if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
+ if (abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
apic->lapic_timer.timer_advance_adjust_done = true;
if (unlikely(timer_advance_ns > 5000)) {
timer_advance_ns = 0;
@@ -1545,13 +1545,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
tsc_deadline = apic->lapic_timer.expired_tscdeadline;
apic->lapic_timer.expired_tscdeadline = 0;
guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
- trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
+ apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;

if (guest_tsc < tsc_deadline)
__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);

if (unlikely(!apic->lapic_timer.timer_advance_adjust_done))
- adjust_lapic_timer_advance(vcpu, guest_tsc, tsc_deadline);
+ adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
}

static void start_sw_tscdeadline(struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index d6d049ba3045..3e72a255543d 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -32,6 +32,7 @@ struct kvm_timer {
u64 tscdeadline;
u64 expired_tscdeadline;
u32 timer_advance_ns;
+ s64 advance_expire_delta;
atomic_t pending; /* accumulated triggered timers */
bool hv_timer_in_use;
bool timer_advance_adjust_done;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e7e57de50a3c..35631505421c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8008,6 +8008,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
++vcpu->stat.exits;

guest_exit_irqoff();
+ if (lapic_in_kernel(vcpu)) {
+ s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
+ if (delta != S64_MIN) {
+ trace_kvm_wait_lapic_expire(vcpu->vcpu_id, delta);
+ vcpu->arch.apic->lapic_timer.advance_expire_delta = S64_MIN;
+ }
+ }

local_irq_enable();
preempt_enable();

so that KVM tracks whether wait_lapic_expire was called, and do not
invoke the tracepoint if not.

Thanks,

Paolo

2019-05-22 08:55:14

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] KVM: LAPIC: Optimize timer latency further

On Mon, 20 May 2019 at 16:18, Wanpeng Li <[email protected]> wrote:
>
> Advance lapic timer tries to hidden the hypervisor overhead between the
> host emulated timer fires and the guest awares the timer is fired. However,
> it just hidden the time between apic_timer_fn/handle_preemption_timer ->
> wait_lapic_expire, instead of the real position of vmentry which is
> mentioned in the orignial commit d0659d946be0 ("KVM: x86: add option to
> advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
> the end of wait_lapic_expire and before world switch on my haswell desktop.
>
> This patchset tries to narrow the last gap(wait_lapic_expire -> world switch),
> it takes the real overhead time between apic_timer_fn/handle_preemption_timer
> and before world switch into consideration when adaptively tuning timer
> advancement. The patchset can reduce 40% latency (~1600+ cycles to ~1000+
> cycles on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when
> testing busy waits.

Testing on a Skylake Server, w/ nohz=off, idle=poll in the guest.
Reduces average cyclictest latency from 3us to 2us.

Regards,
Wanpeng Li