2017-12-08 08:40:21

by Quan Xu

Subject: [PATCH RFC 0/7] kvm pvtimer

From: Ben Luo <[email protected]>

This patchset introduces a new paravirtualized mechanism to reduce the VM-exits
caused by guest timer accesses.

In general, a KVM guest programs its tsc-deadline timestamp into the
MSR_IA32_TSC_DEADLINE MSR. This causes a VM-exit, after which KVM handles the
timer on the guest's behalf.

KVM also always registers the timer on the CPU the vCPU is running on. Even
when the vCPU thread is rescheduled to another CPU, the timer is migrated to
the target CPU as well. When the timer expires, the timer interrupt can force
a guest-mode vCPU to VM-exit on that CPU.

When pvtimer is enabled:

- The tsc-deadline timestamp is mostly recorded in a shared page, with fewer
VM-exits. We introduce a kthread that runs periodically on a dedicated CPU to
scan the shared page and synchronize timer settings for the guest.

- Since the kthread only scans periodically, some timer events could be lost
or delayed. We program those tsc-deadline timestamps into
MSR_IA32_TSC_DEADLINE as normal, which causes a VM-exit; KVM then signals the
kthread through an IPI to program the timer, instead of registering it on the
current CPU.

- Timer interrupts are delivered to vCPUs through the posted-interrupt
mechanism, with fewer VM-exits.
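
The guest/host handoff through the shared page can be sketched in user-space C;
this is only an illustration of the protocol the series describes, with
`__atomic` builtins standing in for the kernel's xchg() and the field names
borrowed from the pvtimer_vcpu_event_info layout introduced in patch 1:

```c
#include <stdint.h>

/* Two-field shared page, as in pvtimer_vcpu_event_info. */
struct pv_event {
	uint64_t expire_tsc;    /* guest-armed deadline, 0 = none pending */
	uint64_t next_sync_tsc; /* host: when the sync kthread scans next */
};

/* Guest side: record a deadline without a VM-exit. */
static void guest_arm(struct pv_event *ev, uint64_t deadline)
{
	__atomic_exchange_n(&ev->expire_tsc, deadline, __ATOMIC_SEQ_CST);
}

/* Host kthread side: advertise the next scan time, then atomically
 * consume any pending deadline (returns 0 if none was armed). */
static uint64_t host_scan(struct pv_event *ev, uint64_t next_sync)
{
	__atomic_store_n(&ev->next_sync_tsc, next_sync, __ATOMIC_SEQ_CST);
	return __atomic_exchange_n(&ev->expire_tsc, 0, __ATOMIC_SEQ_CST);
}
```

The atomic exchange is what lets the host distinguish "deadline consumed" from
"guest re-armed concurrently" without a lock.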

Ben Luo (7):
kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR
kvm: x86: add a function to exchange value
KVM: timer: synchronize tsc-deadline timestamp for guest
KVM: timer: program timer to a dedicated CPU
KVM: timer: ignore timer migration if pvtimer is enabled
Doc/KVM: introduce a new cpuid bit for kvm pvtimer
kvm: guest: reprogram guest timer

Documentation/virtual/kvm/cpuid.txt | 4 +
arch/x86/include/asm/kvm_host.h | 5 +
arch/x86/include/asm/kvm_para.h | 9 ++
arch/x86/include/uapi/asm/kvm_para.h | 7 ++
arch/x86/kernel/apic/apic.c | 9 +-
arch/x86/kernel/kvm.c | 38 ++++++++
arch/x86/kvm/cpuid.c | 1 +
arch/x86/kvm/lapic.c | 170 +++++++++++++++++++++++++++++++++-
arch/x86/kvm/lapic.h | 11 ++
arch/x86/kvm/x86.c | 15 +++-
include/linux/kvm_host.h | 3 +
virt/kvm/kvm_main.c | 42 +++++++++
12 files changed, 308 insertions(+), 6 deletions(-)


2017-12-08 08:40:32

by Quan Xu

Subject: [PATCH RFC 1/7] kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR

From: Ben Luo <[email protected]>

The guest enables pv timer functionality by writing this MSR.
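
The MSR value layout this patch implies can be sketched as follows; this is an
illustration only, with PV_MSR_ENABLED mirroring KVM_MSR_ENABLED (bit 0) and
the remaining bits carrying the guest-physical address of the shared page,
which kvm_lapic_enable_pv_timer requires to be 4-byte aligned:

```c
#include <stdint.h>
#include <stdbool.h>

#define PV_MSR_ENABLED 1ULL	/* bit 0, mirroring KVM_MSR_ENABLED */

/* Validate an MSR_KVM_PV_TIMER_EN write the way the patch does:
 * strip the enable bit, then check 4-byte alignment of the address. */
static bool pv_timer_msr_valid(uint64_t data)
{
	uint64_t addr = data & ~PV_MSR_ENABLED;

	return (addr & 3) == 0;	/* mirrors IS_ALIGNED(addr, 4) */
}
```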

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Signed-off-by: Ben Luo <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/include/uapi/asm/kvm_para.h | 6 ++++++
arch/x86/kvm/lapic.c | 22 ++++++++++++++++++++++
arch/x86/kvm/lapic.h | 6 ++++++
arch/x86/kvm/x86.c | 8 ++++++++
5 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c73e493..641b4aa 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -684,6 +684,11 @@ struct kvm_vcpu_arch {
bool pv_unhalted;
} pv;

+ struct {
+ u64 msr_val;
+ struct gfn_to_hva_cache data;
+ } pv_timer;
+
int pending_ioapic_eoi;
int pending_external_vector;

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 554aa8f..3dd6116 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -41,6 +41,7 @@
#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
#define MSR_KVM_STEAL_TIME 0x4b564d03
#define MSR_KVM_PV_EOI_EN 0x4b564d04
+#define MSR_KVM_PV_TIMER_EN 0x4b564d05

struct kvm_steal_time {
__u64 steal;
@@ -64,6 +65,11 @@ struct kvm_clock_pairing {
#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)

+struct pvtimer_vcpu_event_info {
+ __u64 expire_tsc;
+ __u64 next_sync_tsc;
+} __attribute__((__packed__));
+
#define KVM_MAX_MMU_OP_BATCH 32

#define KVM_ASYNC_PF_ENABLED (1 << 0)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 36c90d6..55c9ba3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1991,6 +1991,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_lapic_set_base(vcpu,
vcpu->arch.apic_base | MSR_IA32_APICBASE_BSP);
vcpu->arch.pv_eoi.msr_val = 0;
+ vcpu->arch.pv_timer.msr_val = 0;
apic_update_ppr(apic);
if (vcpu->arch.apicv_active) {
kvm_x86_ops->apicv_post_state_restore(vcpu);
@@ -2478,6 +2479,27 @@ int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data)
addr, sizeof(u8));
}

+int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data)
+{
+ u64 addr = data & ~KVM_MSR_ENABLED;
+ int ret;
+
+ if (!lapic_in_kernel(vcpu))
+ return 1;
+
+ if (!IS_ALIGNED(addr, 4))
+ return 1;
+
+ vcpu->arch.pv_timer.msr_val = data;
+ if (!pv_timer_enabled(vcpu))
+ return 0;
+
+ ret = kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.pv_timer.data,
+ addr, sizeof(struct pvtimer_vcpu_event_info));
+
+ return ret;
+}
+
void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 4b9935a..539a738 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -113,6 +113,7 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
}

int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
+int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data);
void kvm_lapic_init(void);
void kvm_lapic_exit(void);

@@ -207,6 +208,11 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)
return lapic_in_kernel(vcpu) && test_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
}

+static inline bool pv_timer_enabled(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.pv_timer.msr_val & KVM_MSR_ENABLED;
+}
+
bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);

void wait_lapic_expire(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03869eb..5668774 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1025,6 +1025,7 @@ bool kvm_rdpmc(struct kvm_vcpu *vcpu)
HV_X64_MSR_STIMER0_CONFIG,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_KVM_PV_EOI_EN,
+ MSR_KVM_PV_TIMER_EN,

MSR_IA32_TSC_ADJUST,
MSR_IA32_TSCDEADLINE,
@@ -2279,6 +2280,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (kvm_lapic_enable_pv_eoi(vcpu, data))
return 1;
break;
+ case MSR_KVM_PV_TIMER_EN:
+ if (kvm_lapic_enable_pv_timer(vcpu, data))
+ return 1;
+ break;

case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
@@ -2510,6 +2515,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_KVM_PV_EOI_EN:
msr_info->data = vcpu->arch.pv_eoi.msr_val;
break;
+ case MSR_KVM_PV_TIMER_EN:
+ msr_info->data = vcpu->arch.pv_timer.msr_val;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
--
1.7.1

2017-12-08 08:40:44

by Quan Xu

Subject: [PATCH RFC 2/7] kvm: x86: add a function to exchange value

From: Ben Luo <[email protected]>

Introduce kvm_xchg_guest_cached() to exchange a value with a guest
page atomically.
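
The size-dispatched exchange at the heart of kvm_xchg_guest_cached() can be
sketched in user space; this is an illustration, with GCC/Clang `__atomic`
builtins standing in for the kernel's xchg() on the kmap'd guest page:

```c
#include <stdint.h>

/* Exchange `new` into *ptr for an operand of 1/2/4/8 bytes, returning
 * the old value; unsupported sizes return -1, mirroring the -EFAULT
 * path in the patch. */
static unsigned long xchg_sized(void *ptr, unsigned long new, int size)
{
	switch (size) {
	case 1:
		return __atomic_exchange_n((uint8_t *)ptr, (uint8_t)new,
					   __ATOMIC_SEQ_CST);
	case 2:
		return __atomic_exchange_n((uint16_t *)ptr, (uint16_t)new,
					   __ATOMIC_SEQ_CST);
	case 4:
		return __atomic_exchange_n((uint32_t *)ptr, (uint32_t)new,
					   __ATOMIC_SEQ_CST);
	case 8:
		return __atomic_exchange_n((uint64_t *)ptr, (uint64_t)new,
					   __ATOMIC_SEQ_CST);
	default:
		return (unsigned long)-1;
	}
}
```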

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Signed-off-by: Ben Luo <[email protected]>
---
include/linux/kvm_host.h | 3 +++
virt/kvm/kvm_main.c | 42 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6882538..32949ed 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -688,6 +688,9 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
void *data, int offset, unsigned long len);
int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
gpa_t gpa, unsigned long len);
+unsigned long kvm_xchg_guest_cached(struct kvm *kvm,
+ struct gfn_to_hva_cache *ghc, unsigned long offset,
+ unsigned long new, int size);
int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9deb5a2..3149e17 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2010,6 +2010,48 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
}
EXPORT_SYMBOL_GPL(kvm_read_guest_cached);

+unsigned long kvm_xchg_guest_cached(struct kvm *kvm,
+ struct gfn_to_hva_cache *ghc, unsigned long offset,
+ unsigned long new, int size)
+{
+ unsigned long r;
+ void *kva;
+ struct page *page;
+ kvm_pfn_t pfn;
+
+ WARN_ON(offset > ghc->len);
+
+ pfn = gfn_to_pfn_atomic(kvm, ghc->gpa >> PAGE_SHIFT);
+ page = kvm_pfn_to_page(pfn);
+
+ if (is_error_page(page))
+ return -EFAULT;
+
+ kva = kmap_atomic(page) + offset_in_page(ghc->gpa) + offset;
+ switch (size) {
+ case 1:
+ r = xchg((char *)kva, new);
+ break;
+ case 2:
+ r = xchg((short *)kva, new);
+ break;
+ case 4:
+ r = xchg((int *)kva, new);
+ break;
+ case 8:
+ r = xchg((long *)kva, new);
+ break;
+ default:
+ kunmap_atomic(kva);
+ return -EFAULT;
+ }
+
+ kunmap_atomic(kva);
+ mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+
+ return r;
+}
+
int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
--
1.7.1

2017-12-08 08:40:52

by Quan Xu

Subject: [PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest

From: Ben Luo <[email protected]>

In general, a KVM guest programs its tsc-deadline timestamp into the
MSR_IA32_TSC_DEADLINE MSR. This causes a VM-exit, after which KVM
handles the timer for the guest.

The tsc-deadline timestamp is mostly recorded in a shared page, with
fewer VM-exits. We introduce a kthread that runs periodically on a
dedicated CPU to scan the shared page and synchronize timer settings
for the guest.
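
The unit conversion the sync kthread performs when publishing next_sync_tsc
can be sketched as below; this mirrors the arithmetic in
kvm_apic_sync_pv_timer (remaining ns of the hrtimer times tsc_khz, divided by
10^6) and is only an illustration:

```c
#include <stdint.h>

/* Convert the hrtimer's remaining nanoseconds into guest TSC ticks:
 * ticks = ns * tsc_khz / 1e6. If the timer already fired (remaining
 * <= 0), assume a full period remains, as the patch does. */
static uint64_t ns_to_tsc_ticks(int64_t remaining_ns, uint64_t tsc_khz,
				int64_t period_ns)
{
	if (remaining_ns <= 0)
		remaining_ns += period_ns;

	return (uint64_t)remaining_ns * tsc_khz / 1000000;
}
```

For example, with a 2 GHz guest TSC (tsc_khz = 2000000) and the default
250000 ns period, a freshly restarted timer maps to 500000 ticks.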

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Signed-off-by: Ben Luo <[email protected]>
---
arch/x86/kvm/lapic.c | 138 ++++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/lapic.h | 5 ++
2 files changed, 143 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 55c9ba3..20a23bb 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -36,6 +36,10 @@
#include <asm/delay.h>
#include <linux/atomic.h>
#include <linux/jump_label.h>
+#include <linux/ktime.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/mmu_context.h>
#include "kvm_cache_regs.h"
#include "irq.h"
#include "trace.h"
@@ -70,6 +74,12 @@
#define APIC_BROADCAST 0xFF
#define X2APIC_BROADCAST 0xFFFFFFFFul

+static struct hrtimer pv_sync_timer;
+static long pv_timer_period_ns = PVTIMER_PERIOD_NS;
+static struct task_struct *pv_timer_polling_worker;
+
+module_param(pv_timer_period_ns, long, 0644);
+
static inline int apic_test_vector(int vec, void *bitmap)
{
return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
@@ -2542,8 +2552,130 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
}
}

+static enum hrtimer_restart pv_sync_timer_callback(struct hrtimer *timer)
+{
+ hrtimer_forward_now(timer, ns_to_ktime(pv_timer_period_ns));
+ wake_up_process(pv_timer_polling_worker);
+
+ return HRTIMER_RESTART;
+}
+
+void kvm_apic_sync_pv_timer(void *data)
+{
+ struct kvm_vcpu *vcpu = data;
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ unsigned long flags, this_tsc_khz = vcpu->arch.virtual_tsc_khz;
+ u64 guest_tsc, expire_tsc;
+ long rem_tsc;
+
+ if (!lapic_in_kernel(vcpu) || !pv_timer_enabled(vcpu))
+ return;
+
+ local_irq_save(flags);
+ guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+ rem_tsc = ktime_to_ns(hrtimer_get_remaining(&pv_sync_timer))
+ * this_tsc_khz;
+ if (rem_tsc <= 0)
+ rem_tsc += pv_timer_period_ns * this_tsc_khz;
+ do_div(rem_tsc, 1000000L);
+
+ /*
+ * make sure guest_tsc and rem_tsc are assigned before updating
+ * next_sync_tsc.
+ */
+ smp_wmb();
+ kvm_xchg_guest_cached(vcpu->kvm, &vcpu->arch.pv_timer.data,
+ offsetof(struct pvtimer_vcpu_event_info, next_sync_tsc),
+ guest_tsc + rem_tsc, 8);
+
+ /* make sure next_sync_tsc is visible */
+ smp_wmb();
+
+ expire_tsc = kvm_xchg_guest_cached(vcpu->kvm, &vcpu->arch.pv_timer.data,
+ offsetof(struct pvtimer_vcpu_event_info, expire_tsc),
+ 0UL, 8);
+
+ /* make sure expire_tsc is visible */
+ smp_wmb();
+
+ if (expire_tsc) {
+ if (expire_tsc > guest_tsc)
+ /*
+ * As we bind this thread to a dedicated CPU through
+ * IPI, the timer is registered on that dedicated
+ * CPU here.
+ */
+ kvm_set_lapic_tscdeadline_msr(apic->vcpu, expire_tsc);
+ else
+ /* deliver immediately if expired */
+ kvm_apic_local_deliver(apic, APIC_LVTT);
+ }
+ local_irq_restore(flags);
+}
+
+static int pv_timer_polling(void *arg)
+{
+ struct kvm *kvm;
+ struct kvm_vcpu *vcpu;
+ int i;
+ mm_segment_t oldfs = get_fs();
+
+ while (1) {
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ if (kthread_should_stop()) {
+ __set_current_state(TASK_RUNNING);
+ break;
+ }
+
+ spin_lock(&kvm_lock);
+ __set_current_state(TASK_RUNNING);
+ list_for_each_entry(kvm, &vm_list, vm_list) {
+ set_fs(USER_DS);
+ use_mm(kvm->mm);
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ kvm_apic_sync_pv_timer(vcpu);
+ }
+ unuse_mm(kvm->mm);
+ set_fs(oldfs);
+ }
+
+ spin_unlock(&kvm_lock);
+
+ schedule();
+ }
+
+ return 0;
+}
+
+static void kvm_pv_timer_init(void)
+{
+ ktime_t ktime = ktime_set(0, pv_timer_period_ns);
+
+ hrtimer_init(&pv_sync_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+ pv_sync_timer.function = &pv_sync_timer_callback;
+
+ /* kthread for pv_timer sync buffer */
+ pv_timer_polling_worker = kthread_create(pv_timer_polling, NULL,
+ "pv_timer_polling_worker/%d",
+ PVTIMER_SYNC_CPU);
+ if (IS_ERR(pv_timer_polling_worker)) {
+ pr_warn_once("kvm: failed to create thread for pv_timer\n");
+ pv_timer_polling_worker = NULL;
+ hrtimer_cancel(&pv_sync_timer);
+
+ return;
+ }
+
+ kthread_bind(pv_timer_polling_worker, PVTIMER_SYNC_CPU);
+ wake_up_process(pv_timer_polling_worker);
+ hrtimer_start(&pv_sync_timer, ktime, HRTIMER_MODE_REL);
+}
+
void kvm_lapic_init(void)
{
+ kvm_pv_timer_init();
+
/* do not patch jump label more than once per second */
jump_label_rate_limit(&apic_hw_disabled, HZ);
jump_label_rate_limit(&apic_sw_disabled, HZ);
@@ -2551,6 +2683,12 @@ void kvm_lapic_init(void)

void kvm_lapic_exit(void)
{
+ if (pv_timer_polling_worker) {
+ hrtimer_cancel(&pv_sync_timer);
+ kthread_stop(pv_timer_polling_worker);
+ pv_timer_polling_worker = NULL;
+ }
+
static_key_deferred_flush(&apic_hw_disabled);
static_key_deferred_flush(&apic_sw_disabled);
}
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 539a738..4588d59 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -16,6 +16,9 @@
#define APIC_BUS_CYCLE_NS 1
#define APIC_BUS_FREQUENCY (1000000000ULL / APIC_BUS_CYCLE_NS)

+#define PVTIMER_SYNC_CPU (NR_CPUS - 1) /* dedicated CPU */
+#define PVTIMER_PERIOD_NS 250000L /* pvtimer default period */
+
struct kvm_timer {
struct hrtimer timer;
s64 period; /* unit: ns */
@@ -213,6 +216,8 @@ static inline bool pv_timer_enabled(struct kvm_vcpu *vcpu)
return vcpu->arch.pv_timer.msr_val & KVM_MSR_ENABLED;
}

+void kvm_apic_sync_pv_timer(void *data);
+
bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);

void wait_lapic_expire(struct kvm_vcpu *vcpu);
--
1.7.1

2017-12-08 08:41:10

by Quan Xu

Subject: [PATCH RFC 5/7] KVM: timer: ignore timer migration if pvtimer is enabled

From: Ben Luo <[email protected]>

When pvtimer is enabled, KVM programs the timer on a dedicated CPU
through an IPI. Whether the vCPU runs on the dedicated CPU or on any
other CPU, the timer interrupt is delivered properly, so there is no
need to migrate the timer.

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Signed-off-by: Ben Luo <[email protected]>
---
arch/x86/kvm/lapic.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5835a27..265efe6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2282,7 +2282,7 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
{
struct hrtimer *timer;

- if (!lapic_in_kernel(vcpu))
+ if (!lapic_in_kernel(vcpu) || pv_timer_enabled(vcpu))
return;

timer = &vcpu->arch.apic->lapic_timer.timer;
--
1.7.1

2017-12-08 08:41:19

by Quan Xu

Subject: [PATCH RFC 6/7] Doc/KVM: introduce a new cpuid bit for kvm pvtimer

From: Ben Luo <[email protected]>

KVM_FEATURE_PV_TIMER lets the guest check whether pvtimer can be
enabled in the guest.

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Signed-off-by: Ben Luo <[email protected]>
---
Documentation/virtual/kvm/cpuid.txt | 4 ++++
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kvm/cpuid.c | 1 +
3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 3c65feb..b26b31c 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -54,6 +54,10 @@ KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit
|| || before enabling paravirtualized
|| || spinlock support.
------------------------------------------------------------------------------
+KVM_FEATURE_PV_TIMER || 8 || guest checks this feature bit
+ || || before enabling paravirtualized
+ || || timer support.
+------------------------------------------------------------------------------
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
|| || per-cpu warps are expected in
|| || kvmclock.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 3dd6116..46734a8 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -25,6 +25,7 @@
#define KVM_FEATURE_STEAL_TIME 5
#define KVM_FEATURE_PV_EOI 6
#define KVM_FEATURE_PV_UNHALT 7
+#define KVM_FEATURE_PV_TIMER 8

/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0099e10..e02fd23 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -593,6 +593,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
(1 << KVM_FEATURE_PV_EOI) |
+ (1 << KVM_FEATURE_PV_TIMER) |
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
(1 << KVM_FEATURE_PV_UNHALT);

--
1.7.1

2017-12-08 08:41:28

by Quan Xu

Subject: [PATCH RFC 7/7] kvm: guest: reprogram guest timer

From: Ben Luo <[email protected]>

In general, a KVM guest programs its tsc-deadline timestamp into the
MSR_IA32_TSC_DEADLINE MSR.

When pvtimer is enabled, we introduce a new mechanism to reprogram the
KVM guest timer. A kthread periodically scans the shared page on a
dedicated CPU and synchronizes timer settings for the guest. The
kthread's next scan time serves as a threshold for deciding whether to
program the tsc-deadline timestamp into the MSR_IA32_TSC_DEADLINE MSR
or into the shared page.
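
The guest-side decision (shared page versus MSR fallback) can be sketched in
user-space C; this illustrates the logic of kvm_pv_timer_next_event from this
patch, with `__atomic` builtins standing in for xchg() and PV_PADDING mirroring
the 25000-cycle safety margin:

```c
#include <stdint.h>
#include <stdbool.h>

struct pv_event {
	uint64_t expire_tsc;    /* deadline the guest wants */
	uint64_t next_sync_tsc; /* next time the host kthread scans */
};

#define PV_PADDING 25000ULL	/* don't trust the scan for deadlines this near */

/* Return true if recording the deadline in the shared page suffices
 * (no MSR write, no VM-exit); false means fall back to the MSR path. */
static bool pv_can_use_shared_page(struct pv_event *ev, uint64_t deadline,
				   uint64_t now_tsc)
{
	/* Publish the deadline atomically first, in either case. */
	__atomic_exchange_n(&ev->expire_tsc, deadline, __ATOMIC_SEQ_CST);

	/* Deadline hits before the kthread's next scan: it would be
	 * picked up too late, so use the MSR. */
	if (deadline < ev->next_sync_tsc)
		return false;

	/* Already expired or inside the safety margin: use the MSR. */
	if (deadline < now_tsc || deadline - now_tsc < PV_PADDING)
		return false;

	return true;
}
```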

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Signed-off-by: Ben Luo <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 9 +++++++++
arch/x86/kernel/apic/apic.c | 9 ++++++---
arch/x86/kernel/kvm.c | 38 ++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c373e44..109e706 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -5,6 +5,7 @@
#include <asm/processor.h>
#include <asm/alternative.h>
#include <uapi/asm/kvm_para.h>
+#include <linux/hrtimer.h>

extern void kvmclock_init(void);
extern int kvm_register_clock(char *txt);
@@ -92,6 +93,8 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
void kvm_async_pf_task_wake(u32 token);
u32 kvm_read_and_reset_pf_reason(void);
+int kvm_pv_timer_next_event(unsigned long tsc,
+ struct clock_event_device *evt);
extern void kvm_disable_steal_time(void);

#ifdef CONFIG_PARAVIRT_SPINLOCKS
@@ -126,6 +129,12 @@ static inline void kvm_disable_steal_time(void)
{
return;
}
+
+static inline int kvm_pv_timer_next_event(unsigned long tsc,
+ struct clock_event_device *evt)
+{
+ return 0;
+}
#endif

#endif /* _ASM_X86_KVM_PARA_H */
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ff89177..286c1b3 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -471,10 +471,13 @@ static int lapic_next_event(unsigned long delta,
static int lapic_next_deadline(unsigned long delta,
struct clock_event_device *evt)
{
- u64 tsc;
+ u64 tsc = rdtsc() + (((u64) delta) * TSC_DIVISOR);

- tsc = rdtsc();
- wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
+ /* TODO: undisciplined function call */
+ if (kvm_pv_timer_next_event(tsc, evt))
+ return 0;
+
+ wrmsrl(MSR_IA32_TSC_DEADLINE, tsc);
return 0;
}

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594..ec7aff1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -328,6 +328,35 @@ static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val)
apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
}

+static DEFINE_PER_CPU(int, pvtimer_enabled);
+static DEFINE_PER_CPU(struct pvtimer_vcpu_event_info,
+ pvtimer_shared_buf) = {0};
+#define PVTIMER_PADDING 25000
+int kvm_pv_timer_next_event(unsigned long tsc,
+ struct clock_event_device *evt)
+{
+ struct pvtimer_vcpu_event_info *src;
+ u64 now;
+
+ if (!this_cpu_read(pvtimer_enabled))
+ return false;
+
+ src = this_cpu_ptr(&pvtimer_shared_buf);
+ xchg((u64 *)&src->expire_tsc, tsc);
+
+ barrier();
+
+ if (tsc < src->next_sync_tsc)
+ return false;
+
+ rdtscll(now);
+ if (tsc < now || tsc - now < PVTIMER_PADDING)
+ return false;
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(kvm_pv_timer_next_event);
+
static void kvm_guest_cpu_init(void)
{
if (!kvm_para_available())
@@ -362,6 +391,15 @@ static void kvm_guest_cpu_init(void)

if (has_steal_clock)
kvm_register_steal_time();
+
+ if (kvm_para_has_feature(KVM_FEATURE_PV_TIMER)) {
+ unsigned long data;
+
+ data = slow_virt_to_phys(this_cpu_ptr(&pvtimer_shared_buf))
+ | KVM_MSR_ENABLED;
+ wrmsrl(MSR_KVM_PV_TIMER_EN, data);
+ this_cpu_write(pvtimer_enabled, 1);
+ }
}

static void kvm_pv_disable_apf(void)
--
1.7.1

2017-12-08 08:41:00

by Quan Xu

Subject: [PATCH RFC 4/7] KVM: timer: program timer to a dedicated CPU

From: Ben Luo <[email protected]>

KVM always registers the timer on the CPU the vCPU is running on.
Even when the vCPU thread is rescheduled to another CPU, the timer is
migrated to the target CPU as well. When the timer expires, the timer
interrupt can force a guest-mode vCPU to VM-exit on that CPU.

Since the kthread only scans periodically, some timer events could be
lost or delayed. We program those tsc-deadline timestamps into
MSR_IA32_TSC_DEADLINE as normal, which causes a VM-exit; KVM then
signals the kthread through an IPI to program the timer, instead of
registering it on the current CPU.

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Signed-off-by: Ben Luo <[email protected]>
---
arch/x86/kvm/lapic.c | 8 +++++++-
arch/x86/kvm/x86.c | 7 ++++++-
2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 20a23bb..5835a27 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2072,7 +2072,13 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer);

- apic_timer_expired(apic);
+
+ if (pv_timer_enabled(apic->vcpu)) {
+ kvm_apic_local_deliver(apic, APIC_LVTT);
+ if (apic_lvtt_tscdeadline(apic))
+ apic->lapic_timer.tscdeadline = 0;
+ } else
+ apic_timer_expired(apic);

if (lapic_is_periodic(apic)) {
advance_periodic_target_expiration(apic);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5668774..3cbb223 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -26,6 +26,7 @@
#include "tss.h"
#include "kvm_cache_regs.h"
#include "x86.h"
+#include "lapic.h"
#include "cpuid.h"
#include "pmu.h"
#include "hyperv.h"
@@ -2196,7 +2197,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff:
return kvm_x2apic_msr_write(vcpu, msr, data);
case MSR_IA32_TSCDEADLINE:
- kvm_set_lapic_tscdeadline_msr(vcpu, data);
+ if (pv_timer_enabled(vcpu))
+ smp_call_function_single(PVTIMER_SYNC_CPU,
+ kvm_apic_sync_pv_timer, vcpu, 0);
+ else
+ kvm_set_lapic_tscdeadline_msr(vcpu, data);
break;
case MSR_IA32_TSC_ADJUST:
if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
--
1.7.1

2017-12-08 15:06:31

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest

On Fri, Dec 08, 2017 at 04:39:46PM +0800, Quan Xu wrote:
> From: Ben Luo <[email protected]>
>
> In general, KVM guest programs tsc-deadline timestamp to
> MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and
> then KVM handles this timer for guest.
>
> The tsc-deadline timestamp is mostly recorded in share page
> with less VM-exit. We Introduce a periodically working kthread
> to scan share page and synchronize timer setting for guest
> on a dedicated CPU.

That sounds like a race. Meaning the guest may put too small window
and this 'working thread to scan' may not get to it fast enough?

Meaning we miss the deadline to inject the timer in the guest.

Or is this part of this PV MSR semantics - that it will only work
for certain amount of values and anything less than say 1ms
should not use the PV MSR?

2017-12-08 15:10:24

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH RFC 0/7] kvm pvtimer

On Fri, Dec 08, 2017 at 04:39:43PM +0800, Quan Xu wrote:
> From: Ben Luo <[email protected]>
>
> This patchset introduces a new paravirtualized mechanism to reduce VM-exit
> caused by guest timer accessing.

And how bad is this blip in arming the timer?

And how often do you get this timer to be armed? OR better yet - what
are the workloads in which you found this VMExit to be painful?

Thanks. Or better yet - what
are the workloads in which you found this VMExit to be painful?

2017-12-13 16:28:13

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH RFC 0/7] kvm pvtimer

On Wed, Dec 13, 2017 at 11:25:13PM +0800, Quan Xu wrote:
> On Fri, Dec 8, 2017 at 11:10 PM, Konrad Rzeszutek Wilk <
> [email protected]> wrote:
>
> > On Fri, Dec 08, 2017 at 04:39:43PM +0800, Quan Xu wrote:
> > > From: Ben Luo <[email protected]>
> > >
> > > This patchset introduces a new paravirtualized mechanism to reduce
> > VM-exit
> > > caused by guest timer accessing.
> >
> > And how bad is this blib in arming the timer?
> >
> > And how often do you get this timer to be armed? OR better yet - what
> > are the workloads in which you found this VMExit to be painful?
> >
> > Thanks. Or better yet - what
> > are the workloads in which you found this VMExit to be painful?
> >
>
> one painful point is from VM idle path..
> for some network req/resp services, or benchmark of process context
> switches..

So:

1) VM idle path and network req/resp services:

Does this go away if you don't hit the idle path? Meaning if you
loop without hitting HLT/MWAIT? I am assuming the issue you are facing
is the latency - that is first time the guest comes from HLT and
responds to the packet the latency is much higher than without?

And the arming of the timer?
2) process context switches.

Is that related to the 1)? That is the 'schedule' call and the process
going to sleep waiting for an interrupt or timer?

This all sounds like issues with low-CPU usage workloads where you
need low latency responses?

>
>
>
> Quan
> Alibaba Cloud

2017-12-14 01:55:14

by Quan Xu

Subject: Re: [PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest



On 2017/12/08 23:06, Konrad Rzeszutek Wilk wrote:
> On Fri, Dec 08, 2017 at 04:39:46PM +0800, Quan Xu wrote:
>> From: Ben Luo <[email protected]>
>>
>> In general, KVM guest programs tsc-deadline timestamp to
>> MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and
>> then KVM handles this timer for guest.
>>
>> The tsc-deadline timestamp is mostly recorded in share page
>> with less VM-exit. We Introduce a periodically working kthread
>> to scan share page and synchronize timer setting for guest
>> on a dedicated CPU.
> That sounds like a race. Meaning the guest may put too small window
> and this 'working thread to scan' may not get to it fast enough?
yes, you are right. So ..
> .
> Meaning we miss the deadline to inject the timer in the guest.
>
> Or is this part of this PV MSR semantics - that it will only work
> for certain amount of values and anything less than say 1ms
> should not use the PV MSR?

..
For these timers, we have to program the tsc-deadline timestamps
into MSR_IA32_TSC_DEADLINE as normal, which causes a VM-exit; KVM then
signals the working thread through an IPI to program the timer,
instead of registering it on the current CPU (patch 0004).

More detail is in patch 0007.

Quan
Alibaba Cloud

2017-12-14 02:32:47

by Quan Xu

Subject: Re: [PATCH RFC 0/7] kvm pvtimer



On 2017/12/14 00:28, Konrad Rzeszutek Wilk wrote:
> On Wed, Dec 13, 2017 at 11:25:13PM +0800, Quan Xu wrote:
>> On Fri, Dec 8, 2017 at 11:10 PM, Konrad Rzeszutek Wilk <
>> [email protected]> wrote:
>>
>>> On Fri, Dec 08, 2017 at 04:39:43PM +0800, Quan Xu wrote:
>>>> From: Ben Luo <[email protected]>
>>>>
>>>> This patchset introduces a new paravirtualized mechanism to reduce
>>> VM-exit
>>>> caused by guest timer accessing.
>>> And how bad is this blib in arming the timer?
>>>
>>> And how often do you get this timer to be armed? OR better yet - what
>>> are the workloads in which you found this VMExit to be painful?
>>>
>>> Thanks. Or better yet - what
>>> are the workloads in which you found this VMExit to be painful?
>>>
>> one painful point is from VM idle path..
>> for some network req/resp services, or benchmark of process context
>> switches..
> So:
>
> 1) VM idle path and network req/resp services:
>
> Does this go away if you don't hit the idle path? Meaning if you
> loop without hitting HLT/MWAIT?
We still hit HLT; we can use this together with
https://lkml.org/lkml/2017/8/29/279 ..
> I am assuming the issue you are facing
> is the latency - that is first time the guest comes from HLT and
> responds to the packet the latency is much higher than without?
yes,
> And the arming of the timer?
> 2) process context switches.
>
> Is that related to the 1)? That is the 'schedule' call and the process
> going to sleep waiting for an interrupt or timer?
>
> This all sounds like issues with low-CPU usage workloads where you
> need low latency responses?
Yes, it is also helpful for some timer-intensive services.

Quan

2017-12-14 11:57:00

by Paolo Bonzini

Subject: Re: [PATCH RFC 0/7] kvm pvtimer

On 13/12/2017 17:28, Konrad Rzeszutek Wilk wrote:
> 1) VM idle path and network req/resp services:
>
> Does this go away if you don't hit the idle path? Meaning if you
> loop without hitting HLT/MWAIT? I am assuming the issue you are facing
> is the latency - that is first time the guest comes from HLT and
> responds to the packet the latency is much higher than without?
>
> And the arming of the timer?
> 2) process context switches.
>
> Is that related to the 1)? That is the 'schedule' call and the process
> going to sleep waiting for an interrupt or timer?
>
> This all sounds like issues with low-CPU usage workloads where you
> need low latency responses?

Even high-CPU usage, as long as there is a small idle time. The cost of
setting the TSC deadline timer twice is about 3000 cycles.

However, I think Amazon's approach of not intercepting HLT/MWAIT/PAUSE
can recover most of the performance and it's way less intrusive.

Thanks,

Paolo

2017-12-14 12:06:24

by Quan Xu

Subject: Re: [PATCH RFC 0/7] kvm pvtimer



On 2017/12/14 19:56, Paolo Bonzini wrote:
> On 13/12/2017 17:28, Konrad Rzeszutek Wilk wrote:
>> 1) VM idle path and network req/resp services:
>>
>> Does this go away if you don't hit the idle path? Meaning if you
>> loop without hitting HLT/MWAIT? I am assuming the issue you are facing
>> is the latency - that is first time the guest comes from HLT and
>> responds to the packet the latency is much higher than without?
>>
>> And the arming of the timer?
>> 2) process context switches.
>>
>> Is that related to the 1)? That is the 'schedule' call and the process
>> going to sleep waiting for an interrupt or timer?
>>
>> This all sounds like issues with low-CPU usage workloads where you
>> need low latency responses?
> Even high-CPU usage, as long as there is a small idle time. The cost of
> setting the TSC deadline timer twice is about 3000 cycles.
>
> However, I think Amazon's approach of not intercepting HLT/MWAIT/PAUSE
> can recover most of the performance and it's way less intrusive.

  Paolo, could you share the Amazon's patch or the LML link? thanks.


Quan

> Thanks,
>
> Paolo
>

2017-12-14 13:01:25

by Paolo Bonzini

Subject: Re: [PATCH RFC 0/7] kvm pvtimer

On 14/12/2017 13:06, Quan Xu wrote:
>
>
> On 2017/12/14 19:56, Paolo Bonzini wrote:
>> On 13/12/2017 17:28, Konrad Rzeszutek Wilk wrote:
>>> 1) VM idle path and network req/resp services:
>>>
>>> Does this go away if you don't hit the idle path? Meaning if you
>>> loop without hitting HLT/MWAIT? I am assuming the issue you are facing
>>> is the latency - that is first time the guest comes from HLT and
>>> responds to the packet the latency is much higher than without?
>>>
>>> And the arming of the timer?
>>> 2) process context switches.
>>>
>>> Is that related to the 1)? That is the 'schedule' call and the process
>>> going to sleep waiting for an interrupt or timer?
>>>
>>> This all sounds like issues with low-CPU usage workloads where you
>>> need low latency responses?
>> Even high-CPU usage, as long as there is a small idle time.  The cost of
>> setting the TSC deadline timer twice is about 3000 cycles.
>>
>> However, I think Amazon's approach of not intercepting HLT/MWAIT/PAUSE
>> can recover most of the performance and it's way less intrusive.
>
> Paolo, could you share the Amazon's patch or the LML link? thanks.

Here: https://www.spinics.net/lists/kvm/msg159356.html

Paolo