LinuxLists.cc - [PATCHv9 0/8] kvm: eoi optimization support

2012-06-24 16:24:23

Subject: [PATCHv9 0/8] kvm: eoi optimization support

I'm looking at reducing the interrupt overhead for virtualized guests:
some workloads spend a large part of their time processing interrupts.

On kvm, an EOI write from the guest causes an expensive exit to host; we
avoid this using shared memory.

The patches work fine on my boxes. See individual patches
for perf tests. You need to patch qemu to whitelist the kvm feature.
qemu patch was sent separately.

The patches are against Linus's master and apply to kvm.git
cleanly. The last patch in the series, supplying the host
part, also depends on the ISR optimization patch that I
have for convenience included in the series (patch 2),
I also included a documentation patch (patch 1) - it is
here since it clarifies patch 2. This revision does not yet address
Thomas's idea of reworking the APIC page handling. Changes to this
optimization would require reworking this last patch in the series.

The rest of the patchset has not changed significantly since v2.

Thanks,
MST

changes from v8:
Document the use of reboot notifier:
it is used to make sure kexec works correctly

changes from v7:
Address Avi's comments:
make bit 1 in the new MSR reserved, require an aligned address
remove unused macros from header
replace __align by BUILD_BUG_ON and a comment
Whitespace fix

changes from v6:
Address Marcelo's comments:
marcelo's comments for isr optimization: isr_cache -> highest_*
marcelo's comments for host side eoi: rename isr_cache->highest_*
kvm eoi msr doc: fix typo pointed out by marcelo
isr optimization: add comment to address marcelo's request
kvm: don't make lapic attention check unlikely. at marcelo's request
eoi host side: remove unlikely annotations

Changes from v5:
Clear PV EOI when we cancel interrupts.
Pointed out by Gleb.
Always set ISR cache when we inject an interrupt.
Suggested by Ronen Hod.

Changes from v4:
Turn off PV EOI on each exit. Turn it back on when safe.
Suggested by Avi.
Address bug with nested interrupts pointed out by Marcelo.

Changes from v3:
Address review comments by Marcelo:
Multiple cosmetic changes eoi -> pv_eoi
Added multiple comments
Changes from v2:
Kill guest with GP on an illegal MSR value
Add documentation

Changes from v1:
Add host side patch to series
Remove kvm-specific __test_and_clear_bit, document
that x86 one does what we want already
Clear msr on cpu unplug

Michael S. Tsirkin (8):
kvm: document lapic regs field
kvm: optimize ISR lookups
kvm_para: guest side for eoi avoidance
x86/bitops: note on __test_and_clear_bit atomicity
kvm: eoi msi documentation
kvm: only sync when attention bits set
kvm: rearrange injection cancelling code
kvm: host side for eoi optimization

Documentation/virtual/kvm/msr.txt | 34 +++++++
arch/x86/include/asm/bitops.h | 7 ++
arch/x86/include/asm/kvm_host.h | 12 +++
arch/x86/include/asm/kvm_para.h | 7 ++
arch/x86/kernel/kvm.c | 57 ++++++++++-
arch/x86/kvm/cpuid.c | 1 +
arch/x86/kvm/lapic.c | 194 +++++++++++++++++++++++++++++++++++-
arch/x86/kvm/lapic.h | 11 ++
arch/x86/kvm/trace.h | 34 +++++++
arch/x86/kvm/x86.c | 20 +++-
10 files changed, 363 insertions(+), 14 deletions(-)

2012-06-24 16:25:28

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCHv9 2/8] kvm: optimize ISR lookups

We perform ISR lookups twice: during interrupt
injection and on EOI. Typical workloads only have
a single bit set there. So we can avoid ISR scans by
1. counting bits as we set/clear them in ISR
2. on set, caching the injected vector number
3. on clear, invalidating the cache

The real purpose of this is enabling PV EOI
which needs to quickly validate the vector.
But non PV guests also benefit: with this patch,
and without interrupt nesting, apic_find_highest_isr
will always return immediately without scanning ISR.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
arch/x86/kvm/lapic.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/lapic.h | 4 +++
2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 93c1574..805d887 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -107,6 +107,16 @@ static inline void apic_clear_vector(int vec, void *bitmap)
clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
}

+static inline int __apic_test_and_set_vector(int vec, void *bitmap)
+{
+ return __test_and_set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
+static inline int __apic_test_and_clear_vector(int vec, void *bitmap)
+{
+ return __test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
static inline int apic_hw_enabled(struct kvm_lapic *apic)
{
return (apic)->vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE;
@@ -210,6 +220,16 @@ static int find_highest_vector(void *bitmap)
return fls(word[word_offset << 2]) - 1 + (word_offset << 5);
}

+static u8 count_vectors(void *bitmap)
+{
+ u32 *word = bitmap;
+ int word_offset;
+ u8 count = 0;
+ for (word_offset = 0; word_offset < MAX_APIC_VECTOR >> 5; ++word_offset)
+ count += hweight32(word[word_offset << 2]);
+ return count;
+}
+
static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic)
{
apic->irr_pending = true;
@@ -242,6 +262,27 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
apic->irr_pending = true;
}

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+ if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+ ++apic->isr_count;
+ BUG_ON(apic->isr_count > MAX_APIC_VECTOR);
+ /*
+ * ISR (in service register) bit is set when injecting an interrupt.
+ * The highest vector is injected. Thus the latest bit set matches
+ * the highest bit in ISR.
+ */
+ apic->highest_isr_cache = vec;
+}
+
+static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
+{
+ if (__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
+ --apic->isr_count;
+ BUG_ON(apic->isr_count < 0);
+ apic->highest_isr_cache = -1;
+}
+
int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
@@ -273,6 +314,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
static inline int apic_find_highest_isr(struct kvm_lapic *apic)
{
int result;
+ if (!apic->isr_count)
+ return -1;
+ if (likely(apic->highest_isr_cache != -1))
+ return apic->highest_isr_cache;

result = find_highest_vector(apic->regs + APIC_ISR);
ASSERT(result == -1 || result >= 16);
@@ -492,7 +537,7 @@ static void apic_set_eoi(struct kvm_lapic *apic)
if (vector == -1)
return;

- apic_clear_vector(vector, apic->regs + APIC_ISR);
+ apic_clear_isr(vector, apic);
apic_update_ppr(apic);

if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
@@ -1081,6 +1126,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
apic_set_reg(apic, APIC_TMR + 0x10 * i, 0);
}
apic->irr_pending = false;
+ apic->isr_count = 0;
+ apic->highest_isr_cache = -1;
update_divide_count(apic);
atomic_set(&apic->lapic_timer.pending, 0);
if (kvm_vcpu_is_bsp(vcpu))
@@ -1248,7 +1295,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
if (vector == -1)
return -1;

- apic_set_vector(vector, apic->regs + APIC_ISR);
+ apic_set_isr(vector, apic);
apic_update_ppr(apic);
apic_clear_irr(vector, apic);
return vector;
@@ -1267,6 +1314,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu)
update_divide_count(apic);
start_apic_timer(apic);
apic->irr_pending = true;
+ apic->isr_count = count_vectors(apic->regs + APIC_ISR);
+ apic->highest_isr_cache = -1;
kvm_make_request(KVM_REQ_EVENT, vcpu);
}

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index d29da25..5ac9e5e 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -13,6 +13,10 @@ struct kvm_lapic {
u32 divide_count;
struct kvm_vcpu *vcpu;
bool irr_pending;
+ /* Number of bits set in ISR. */
+ s16 isr_count;
+ /* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
+ int highest_isr_cache;
/**
* APIC register page. The layout matches the register layout seen by
* the guest 1:1, because it is accessed by the vmx microcode.
--
1.6.3.2.307.gdf0b

2012-06-24 16:25:43

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCHv9 8/8] kvm: host side for eoi optimization

Implementation of PV EOI using shared memory.
This reduces the number of exits an interrupt
causes as much as by half.

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that host can detect interrupt nesting -
and if set, it can skip the EOI MSR.

There's a new MSR to set the address of said register
in guest memory. Otherwise not much changed:
- Guest EOI is not required
- Register is tested & ISR is automatically cleared on exit

For testing results see description of previous patch
'kvm_para: guest side for eoi avoidance'.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 12 +++
arch/x86/kvm/cpuid.c | 1 +
arch/x86/kvm/lapic.c | 141 +++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/lapic.h | 2 +
arch/x86/kvm/trace.h | 34 +++++++++
arch/x86/kvm/x86.c | 7 ++
6 files changed, 193 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db7c1f2..24b7647 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -175,6 +175,13 @@ enum {

/* apic attention bits */
#define KVM_APIC_CHECK_VAPIC 0
+/*
+ * The following bit is set with PV-EOI, unset on EOI.
+ * We detect PV-EOI changes by guest by comparing
+ * this bit with PV-EOI in guest memory.
+ * See the implementation in apic_update_pv_eoi.
+ */
+#define KVM_APIC_PV_EOI_PENDING 1

/*
* We don't want allocation failures within the mmu code, so we preallocate
@@ -484,6 +491,11 @@ struct kvm_vcpu_arch {
u64 length;
u64 status;
} osvw;
+
+ struct {
+ u64 msr_val;
+ struct gfn_to_hva_cache data;
+ } pv_eoi;
};

struct kvm_lpage_info {
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7df1c6d..61ccbdf 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,6 +409,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
+ (1 << KVM_FEATURE_PV_EOI) |
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);

if (sched_info_on())
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 805d887..ce87878 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -311,6 +311,54 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
irq->level, irq->trig_mode);
}

+static int pv_eoi_put_user(struct kvm_vcpu *vcpu, u8 val)
+{
+
+ return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data, &val,
+ sizeof(val));
+}
+
+static int pv_eoi_get_user(struct kvm_vcpu *vcpu, u8 *val)
+{
+
+ return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data, val,
+ sizeof(*val));
+}
+
+static inline bool pv_eoi_enabled(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.pv_eoi.msr_val & KVM_MSR_ENABLED;
+}
+
+static bool pv_eoi_get_pending(struct kvm_vcpu *vcpu)
+{
+ u8 val;
+ if (pv_eoi_get_user(vcpu, &val) < 0)
+ apic_debug("Can't read EOI MSR value: 0x%llx\n",
+ (unsigned long long)vcpi->arch.pv_eoi.msr_val);
+ return val & 0x1;
+}
+
+static void pv_eoi_set_pending(struct kvm_vcpu *vcpu)
+{
+ if (pv_eoi_put_user(vcpu, KVM_PV_EOI_ENABLED) < 0) {
+ apic_debug("Can't set EOI MSR value: 0x%llx\n",
+ (unsigned long long)vcpi->arch.pv_eoi.msr_val);
+ return;
+ }
+ __set_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
+}
+
+static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
+{
+ if (pv_eoi_put_user(vcpu, KVM_PV_EOI_DISABLED) < 0) {
+ apic_debug("Can't clear EOI MSR value: 0x%llx\n",
+ (unsigned long long)vcpi->arch.pv_eoi.msr_val);
+ return;
+ }
+ __clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
+}
+
static inline int apic_find_highest_isr(struct kvm_lapic *apic)
{
int result;
@@ -527,15 +575,18 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
}

-static void apic_set_eoi(struct kvm_lapic *apic)
+static int apic_set_eoi(struct kvm_lapic *apic)
{
int vector = apic_find_highest_isr(apic);
+
+ trace_kvm_eoi(apic, vector);
+
/*
* Not every write EOI will has corresponding ISR,
* one example is when Kernel check timer on setup_IO_APIC
*/
if (vector == -1)
- return;
+ return vector;

apic_clear_isr(vector, apic);
apic_update_ppr(apic);
@@ -550,6 +601,7 @@ static void apic_set_eoi(struct kvm_lapic *apic)
kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
}
kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+ return vector;
}

static void apic_send_ipi(struct kvm_lapic *apic)
@@ -1132,6 +1184,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
atomic_set(&apic->lapic_timer.pending, 0);
if (kvm_vcpu_is_bsp(vcpu))
vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
+ vcpu->arch.pv_eoi.msr_val = 0;
apic_update_ppr(apic);

vcpu->arch.apic_arb_prio = 0;
@@ -1332,11 +1385,51 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
}

+/*
+ * apic_sync_pv_eoi_from_guest - called on vmexit or cancel interrupt
+ *
+ * Detect whether guest triggered PV EOI since the
+ * last entry. If yes, set EOI on guests's behalf.
+ * Clear PV EOI in guest memory in any case.
+ */
+static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
+ struct kvm_lapic *apic)
+{
+ bool pending;
+ int vector;
+ /*
+ * PV EOI state is derived from KVM_APIC_PV_EOI_PENDING in host
+ * and KVM_PV_EOI_ENABLED in guest memory as follows:
+ *
+ * KVM_APIC_PV_EOI_PENDING is unset:
+ * -> host disabled PV EOI.
+ * KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is set:
+ * -> host enabled PV EOI, guest did not execute EOI yet.
+ * KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is unset:
+ * -> host enabled PV EOI, guest executed EOI.
+ */
+ BUG_ON(!pv_eoi_enabled(vcpu));
+ pending = pv_eoi_get_pending(vcpu);
+ /*
+ * Clear pending bit in any case: it will be set again on vmentry.
+ * While this might not be ideal from performance point of view,
+ * this makes sure pv eoi is only enabled when we know it's safe.
+ */
+ pv_eoi_clr_pending(vcpu);
+ if (pending)
+ return;
+ vector = apic_set_eoi(apic);
+ trace_kvm_pv_eoi(apic, vector);
+}
+
void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu)
{
u32 data;
void *vapic;

+ if (test_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention))
+ apic_sync_pv_eoi_from_guest(vcpu, vcpu->arch.apic);
+
if (!test_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention))
return;

@@ -1347,17 +1440,44 @@ void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu)
apic_set_tpr(vcpu->arch.apic, data & 0xff);
}

+/*
+ * apic_sync_pv_eoi_to_guest - called before vmentry
+ *
+ * Detect whether it's safe to enable PV EOI and
+ * if yes do so.
+ */
+static void apic_sync_pv_eoi_to_guest(struct kvm_vcpu *vcpu,
+ struct kvm_lapic *apic)
+{
+ if (!pv_eoi_enabled(vcpu) ||
+ /* IRR set or many bits in ISR: could be nested. */
+ apic->irr_pending ||
+ /* Cache not set: could be safe but we don't bother. */
+ apic->highest_isr_cache == -1 ||
+ /* Need EOI to update ioapic. */
+ kvm_ioapic_handles_vector(vcpu->kvm, apic->highest_isr_cache)) {
+ /*
+ * PV EOI was disabled by apic_sync_pv_eoi_from_guest
+ * so we need not do anything here.
+ */
+ return;
+ }
+
+ pv_eoi_set_pending(apic->vcpu);
+}
+
void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu)
{
u32 data, tpr;
int max_irr, max_isr;
- struct kvm_lapic *apic;
+ struct kvm_lapic *apic = vcpu->arch.apic;
void *vapic;

+ apic_sync_pv_eoi_to_guest(vcpu, apic);
+
if (!test_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention))
return;

- apic = vcpu->arch.apic;
tpr = apic_get_reg(apic, APIC_TASKPRI) & 0xff;
max_irr = apic_find_highest_irr(apic);
if (max_irr < 0)
@@ -1443,3 +1563,16 @@ int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)

return 0;
}
+
+int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data)
+{
+ u64 addr = data & ~KVM_MSR_ENABLED;
+ if (!IS_ALIGNED(addr, 4))
+ return 1;
+
+ vcpu->arch.pv_eoi.msr_val = data;
+ if (!pv_eoi_enabled(vcpu))
+ return 0;
+ return kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.pv_eoi.data,
+ addr);
+}
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 5ac9e5e..4af5405 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -69,4 +69,6 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
{
return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
}
+
+int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
#endif
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 911d264..851914e 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -517,6 +517,40 @@ TRACE_EVENT(kvm_apic_accept_irq,
__entry->coalesced ? " (coalesced)" : "")
);

+TRACE_EVENT(kvm_eoi,
+ TP_PROTO(struct kvm_lapic *apic, int vector),
+ TP_ARGS(apic, vector),
+
+ TP_STRUCT__entry(
+ __field( __u32, apicid )
+ __field( int, vector )
+ ),
+
+ TP_fast_assign(
+ __entry->apicid = apic->vcpu->vcpu_id;
+ __entry->vector = vector;
+ ),
+
+ TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
+);
+
+TRACE_EVENT(kvm_pv_eoi,
+ TP_PROTO(struct kvm_lapic *apic, int vector),
+ TP_ARGS(apic, vector),
+
+ TP_STRUCT__entry(
+ __field( __u32, apicid )
+ __field( int, vector )
+ ),
+
+ TP_fast_assign(
+ __entry->apicid = apic->vcpu->vcpu_id;
+ __entry->vector = vector;
+ ),
+
+ TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
+);
+
/*
* Tracepoint for nested VMRUN
*/
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cb19d3b..3913fb1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -795,6 +795,7 @@ static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
+ MSR_KVM_PV_EOI_EN,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
#ifdef CONFIG_X86_64
@@ -1653,6 +1654,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);

break;
+ case MSR_KVM_PV_EOI_EN:
+ if (kvm_lapic_enable_pv_eoi(vcpu, data))
+ return 1;
+ break;

case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
@@ -5394,6 +5399,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

cancel_injection:
kvm_x86_ops->cancel_injection(vcpu);
+ if (unlikely(vcpu->arch.apic_attention))
+ kvm_lapic_sync_from_vapic(vcpu);
out:
return r;
}
--
1.6.3.2.307.gdf0b

2012-06-24 16:25:41

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCHv9 7/8] kvm: rearrange injection cancelling code

Each time we need to cancel injection we invoke same code
(cancel_injection callback). Move it towards the end of function using
the familiar goto on error pattern.

Will make it easier to do more cleanups for PV EOI.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
arch/x86/kvm/x86.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4e582c0..cb19d3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5296,8 +5296,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

r = kvm_mmu_reload(vcpu);
if (unlikely(r)) {
- kvm_x86_ops->cancel_injection(vcpu);
- goto out;
+ goto cancel_injection;
}

preempt_disable();
@@ -5322,9 +5321,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
smp_wmb();
local_irq_enable();
preempt_enable();
- kvm_x86_ops->cancel_injection(vcpu);
r = 1;
- goto out;
+ goto cancel_injection;
}

srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
@@ -5392,6 +5390,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_lapic_sync_from_vapic(vcpu);

r = kvm_x86_ops->handle_exit(vcpu);
+ return r;
+
+cancel_injection:
+ kvm_x86_ops->cancel_injection(vcpu);
out:
return r;
}
--
1.6.3.2.307.gdf0b

2012-06-24 16:25:38

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCHv9 6/8] kvm: only sync when attention bits set

Commit eb0dc6d0368072236dcd086d7fdc17fd3c4574d4 introduced apic
attention bitmask but kvm still syncs lapic unconditionally.
As that commit suggested and in anticipation of adding more attention
bits, only sync lapic if(apic_attention).

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
arch/x86/kvm/x86.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be6d549..4e582c0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5388,7 +5388,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (unlikely(vcpu->arch.tsc_always_catchup))
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

- kvm_lapic_sync_from_vapic(vcpu);
+ if (vcpu->arch.apic_attention)
+ kvm_lapic_sync_from_vapic(vcpu);

r = kvm_x86_ops->handle_exit(vcpu);
out:
--
1.6.3.2.307.gdf0b

2012-06-24 16:25:36

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCHv9 5/8] kvm: eoi msi documentation

Document the new EOI MSR. Couldn't decide whether this change belongs
conceptually on guest or host side, so a separate patch.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
Documentation/virtual/kvm/msr.txt | 34 ++++++++++++++++++++++++++++++++++
1 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 96b41bd..5ed05b0 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -223,3 +223,37 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
steal: the amount of time in which this vCPU did not run, in
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.
+
+MSR_KVM_EOI_EN: 0x4b564d04
+ data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
+ when disabled. Bit 1 is reserved and must be zero. When PV end of
+ interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
+ physical address of a 4 byte memory area which must be in guest RAM and
+ must be zeroed.
+
+ The first, least significant bit of 4 byte memory location will be
+ written to by the hypervisor, typically at the time of interrupt
+ injection. Value of 1 means that guest can skip writing EOI to the apic
+ (using MSR or MMIO write); instead, it is sufficient to signal
+ EOI by clearing the bit in guest memory - this location will
+ later be polled by the hypervisor.
+ Value of 0 means that the EOI write is required.
+
+ It is always safe for the guest to ignore the optimization and perform
+ the APIC EOI write anyway.
+
+ Hypervisor is guaranteed to only modify this least
+ significant bit while in the current VCPU context, this means that
+ guest does not need to use either lock prefix or memory ordering
+ primitives to synchronise with the hypervisor.
+
+ However, hypervisor can set and clear this memory bit at any time:
+ therefore to make sure hypervisor does not interrupt the
+ guest and clear the least significant bit in the memory area
+ in the window between guest testing it to detect
+ whether it can skip EOI apic write and between guest
+ clearing it to signal EOI to the hypervisor,
+ guest must both read the least significant bit in the memory area and
+ clear it using a single CPU instruction, such as test and clear, or
+ compare and exchange.
+
--
1.6.3.2.307.gdf0b

2012-06-24 16:25:34

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCHv9 4/8] x86/bitops: note on __test_and_clear_bit atomicity

__test_and_clear_bit is actually atomic with respect
to the local CPU. Add a note saying that KVM on x86
relies on this behaviour so people don't accidentaly break it.
Also warn not to rely on this in portable code.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
arch/x86/include/asm/bitops.h | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a6983b2..72f5009 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -264,6 +264,13 @@ static inline int test_and_clear_bit(int nr, volatile unsigned long *addr)
* This operation is non-atomic and can be reordered.
* If two examples of this operation race, one can appear to succeed
* but actually fail. You must protect multiple accesses with a lock.
+ *
+ * Note: the operation is performed atomically with respect to
+ * the local CPU, but not other CPUs. Portable code should not
+ * rely on this behaviour.
+ * KVM relies on this behaviour on x86 for modifying memory that is also
+ * accessed from a hypervisor on the same CPU if running in a VM: don't change
+ * this without also updating arch/x86/kernel/kvm.c
*/
static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
{
--
1.6.3.2.307.gdf0b

2012-06-24 16:25:31

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCHv9 3/8] kvm_para: guest side for eoi avoidance

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
Guest tests it using a single est and clear operation - this is
necessary so that host can detect interrupt nesting - and if set, it can
skip the EOI MSR.

I run a simple microbenchmark to show exit reduction
(note: for testing, need to apply follow-up patch
'kvm: host side for eoi optimization' + a qemu patch
I posted separately, on host):

Before:

Performance counter stats for 'sleep 1s':

47,357 kvm:kvm_entry [99.98%]
0 kvm:kvm_hypercall [99.98%]
0 kvm:kvm_hv_hypercall [99.98%]
5,001 kvm:kvm_pio [99.98%]
0 kvm:kvm_cpuid [99.98%]
22,124 kvm:kvm_apic [99.98%]
49,849 kvm:kvm_exit [99.98%]
21,115 kvm:kvm_inj_virq [99.98%]
0 kvm:kvm_inj_exception [99.98%]
0 kvm:kvm_page_fault [99.98%]
22,937 kvm:kvm_msr [99.98%]
0 kvm:kvm_cr [99.98%]
0 kvm:kvm_pic_set_irq [99.98%]
0 kvm:kvm_apic_ipi [99.98%]
22,207 kvm:kvm_apic_accept_irq [99.98%]
22,421 kvm:kvm_eoi [99.98%]
0 kvm:kvm_pv_eoi [99.99%]
0 kvm:kvm_nested_vmrun [99.99%]
0 kvm:kvm_nested_intercepts [99.99%]
0 kvm:kvm_nested_vmexit [99.99%]
0 kvm:kvm_nested_vmexit_inject [99.99%]
0 kvm:kvm_nested_intr_vmexit [99.99%]
0 kvm:kvm_invlpga [99.99%]
0 kvm:kvm_skinit [99.99%]
57 kvm:kvm_emulate_insn [99.99%]
0 kvm:vcpu_match_mmio [99.99%]
0 kvm:kvm_userspace_exit [99.99%]
2 kvm:kvm_set_irq [99.99%]
2 kvm:kvm_ioapic_set_irq [99.99%]
23,609 kvm:kvm_msi_set_irq [99.99%]
1 kvm:kvm_ack_irq [99.99%]
131 kvm:kvm_mmio [99.99%]
226 kvm:kvm_fpu [100.00%]
0 kvm:kvm_age_page [100.00%]
0 kvm:kvm_try_async_get_page [100.00%]
0 kvm:kvm_async_pf_doublefault [100.00%]
0 kvm:kvm_async_pf_not_present [100.00%]
0 kvm:kvm_async_pf_ready [100.00%]
0 kvm:kvm_async_pf_completed

1.002100578 seconds time elapsed

After:

Performance counter stats for 'sleep 1s':

28,354 kvm:kvm_entry [99.98%]
0 kvm:kvm_hypercall [99.98%]
0 kvm:kvm_hv_hypercall [99.98%]
1,347 kvm:kvm_pio [99.98%]
0 kvm:kvm_cpuid [99.98%]
1,931 kvm:kvm_apic [99.98%]
29,595 kvm:kvm_exit [99.98%]
24,884 kvm:kvm_inj_virq [99.98%]
0 kvm:kvm_inj_exception [99.98%]
0 kvm:kvm_page_fault [99.98%]
1,986 kvm:kvm_msr [99.98%]
0 kvm:kvm_cr [99.98%]
0 kvm:kvm_pic_set_irq [99.98%]
0 kvm:kvm_apic_ipi [99.99%]
25,953 kvm:kvm_apic_accept_irq [99.99%]
26,132 kvm:kvm_eoi [99.99%]
26,593 kvm:kvm_pv_eoi [99.99%]
0 kvm:kvm_nested_vmrun [99.99%]
0 kvm:kvm_nested_intercepts [99.99%]
0 kvm:kvm_nested_vmexit [99.99%]
0 kvm:kvm_nested_vmexit_inject [99.99%]
0 kvm:kvm_nested_intr_vmexit [99.99%]
0 kvm:kvm_invlpga [99.99%]
0 kvm:kvm_skinit [99.99%]
284 kvm:kvm_emulate_insn [99.99%]
68 kvm:vcpu_match_mmio [99.99%]
68 kvm:kvm_userspace_exit [99.99%]
2 kvm:kvm_set_irq [99.99%]
2 kvm:kvm_ioapic_set_irq [99.99%]
28,288 kvm:kvm_msi_set_irq [99.99%]
1 kvm:kvm_ack_irq [99.99%]
131 kvm:kvm_mmio [100.00%]
588 kvm:kvm_fpu [100.00%]
0 kvm:kvm_age_page [100.00%]
0 kvm:kvm_try_async_get_page [100.00%]
0 kvm:kvm_async_pf_doublefault [100.00%]
0 kvm:kvm_async_pf_not_present [100.00%]
0 kvm:kvm_async_pf_ready [100.00%]
0 kvm:kvm_async_pf_completed

1.002039622 seconds time elapsed

We see that # of exits is almost halved.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 7 +++++
arch/x86/kernel/kvm.c | 57 ++++++++++++++++++++++++++++++++++++--
2 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 63ab166..2f7712e 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -22,6 +22,7 @@
#define KVM_FEATURE_CLOCKSOURCE2 3
#define KVM_FEATURE_ASYNC_PF 4
#define KVM_FEATURE_STEAL_TIME 5
+#define KVM_FEATURE_PV_EOI 6

/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
@@ -37,6 +38,7 @@
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
#define MSR_KVM_STEAL_TIME 0x4b564d03
+#define MSR_KVM_PV_EOI_EN 0x4b564d04

struct kvm_steal_time {
__u64 steal;
@@ -89,6 +91,11 @@ struct kvm_vcpu_pv_apf_data {
__u32 enabled;
};

+#define KVM_PV_EOI_BIT 0
+#define KVM_PV_EOI_MASK (0x1 << KVM_PV_EOI_BIT)
+#define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
+#define KVM_PV_EOI_DISABLED 0x0
+
#ifdef __KERNEL__
#include <asm/processor.h>

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e554e5a..75ab94c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -39,6 +39,8 @@
#include <asm/desc.h>
#include <asm/tlbflush.h>
#include <asm/idle.h>
+#include <asm/apic.h>
+#include <asm/apicdef.h>

static int kvmapf = 1;

@@ -283,6 +285,22 @@ static void kvm_register_steal_time(void)
cpu, __pa(st));
}

+static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
+
+static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
+{
+ /**
+ * This relies on __test_and_clear_bit to modify the memory
+ * in a way that is atomic with respect to the local CPU.
+ * The hypervisor only accesses this memory from the local CPU so
+ * there's no need for lock or memory barriers.
+ * An optimization barrier is implied in apic write.
+ */
+ if (__test_and_clear_bit(KVM_PV_EOI_BIT, &__get_cpu_var(kvm_apic_eoi)))
+ return;
+ apic->write(APIC_EOI, APIC_EOI_ACK);
+}
+
void __cpuinit kvm_guest_cpu_init(void)
{
if (!kvm_para_available())
@@ -300,11 +318,20 @@ void __cpuinit kvm_guest_cpu_init(void)
smp_processor_id());
}

+ if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
+ unsigned long pa;
+ /* Size alignment is implied but just to make it explicit. */
+ BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
+ __get_cpu_var(kvm_apic_eoi) = 0;
+ pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
+ wrmsrl(MSR_KVM_PV_EOI_EN, pa);
+ }
+
if (has_steal_clock)
kvm_register_steal_time();
}

-static void kvm_pv_disable_apf(void *unused)
+static void kvm_pv_disable_apf(void)
{
if (!__get_cpu_var(apf_reason).enabled)
return;
@@ -316,11 +343,23 @@ static void kvm_pv_disable_apf(void *unused)
smp_processor_id());
}

+static void kvm_pv_guest_cpu_reboot(void *unused)
+{
+ /*
+ * We disable PV EOI before we load a new kernel by kexec,
+ * since MSR_KVM_PV_EOI_EN stores a pointer into old kernel's memory.
+ * New kernel can re-enable when it boots.
+ */
+ if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
+ wrmsrl(MSR_KVM_PV_EOI_EN, 0);
+ kvm_pv_disable_apf();
+}
+
static int kvm_pv_reboot_notify(struct notifier_block *nb,
unsigned long code, void *unused)
{
if (code == SYS_RESTART)
- on_each_cpu(kvm_pv_disable_apf, NULL, 1);
+ on_each_cpu(kvm_pv_guest_cpu_reboot, NULL, 1);
return NOTIFY_DONE;
}

@@ -371,7 +410,9 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
static void kvm_guest_cpu_offline(void *dummy)
{
kvm_disable_steal_time();
- kvm_pv_disable_apf(NULL);
+ if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
+ wrmsrl(MSR_KVM_PV_EOI_EN, 0);
+ kvm_pv_disable_apf();
apf_task_wake_all();
}

@@ -424,6 +465,16 @@ void __init kvm_guest_init(void)
pv_time_ops.steal_clock = kvm_steal_clock;
}

+ if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
+ struct apic **drv;
+
+ for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
+ /* Should happen once for each apic */
+ WARN_ON((*drv)->eoi_write == kvm_guest_apic_eoi_write);
+ (*drv)->eoi_write = kvm_guest_apic_eoi_write;
+ }
+ }
+
#ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);
--
1.6.3.2.307.gdf0b

2012-06-24 16:25:24

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCHv9 1/8] kvm: document lapic regs field

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
arch/x86/kvm/lapic.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 6f4ce25..d29da25 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -13,6 +13,11 @@ struct kvm_lapic {
u32 divide_count;
struct kvm_vcpu *vcpu;
bool irr_pending;
+ /**
+ * APIC register page. The layout matches the register layout seen by
+ * the guest 1:1, because it is accessed by the vmx microcode.
+ * Note: Only one register, the TPR, is used by the microcode.
+ */
void *regs;
gpa_t vapic_addr;
struct page *vapic_page;
--
1.6.3.2.307.gdf0b

2012-06-25 09:41:41

by Avi Kivity

[permalink] [raw]

Subject: Re: [PATCHv9 0/8] kvm: eoi optimization support

On 06/24/2012 07:24 PM, Michael S. Tsirkin wrote:
> I'm looking at reducing the interrupt overhead for virtualized guests:
> some workloads spend a large part of their time processing interrupts.
>
> On kvm, an EOI write from the guest causes an expensive exit to host; we
> avoid this using shared memory.
>
> The patches work fine on my boxes. See individual patches
> for perf tests. You need to patch qemu to whitelist the kvm feature.
> qemu patch was sent separately.
>
> The patches are against Linus's master and apply to kvm.git
> cleanly. The last patch in the series, supplying the host
> part, also depends on the ISR optimization patch that I
> have for convenience included in the series (patch 2),
> I also included a documentation patch (patch 1) - it is
> here since it clarifies patch 2. This revision does not yet address
> Thomas's idea of reworking the APIC page handling. Changes to this
> optimization would require reworking this last patch in the series.
>
> The rest of the patchset has not changed significantly since v2.

Thanks, applied.

--
error compiling committee.c: too many arguments to function