2014-06-09 12:59:22

by Paolo Bonzini

Subject: [PATCH 00/25] KVM: x86: Speed up emulation of invalid state

This series, done in collaboration with Bandan Das, speeds up
emulation of invalid guest state by roughly a factor of 3
(as measured by realmode.flat). It brings together patches sent
as RFCs over the past 3 months, and adds a few more on top.

The speedup is fairly homogeneous across the various kinds of
instructions: some changes shave a constant number of cycles from
every instruction, while others only affect the more complex
instructions that take more clock cycles to run, and the two effects
balance out. Here are rough numbers (expressed in clock cycles on a
Sandy Bridge Xeon machine, with unrestricted_guest=0) at various
points of the series:

 jump  move arith  load store   RMW
 2300  2600  2500  2800  2800  3200   (before the series)
 1650  1950  1900  2150  2150  2600   KVM: vmx: speed up emulation of invalid guest state
  900  1250  1050  1350  1300  1700   KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
  900  1050  1050  1350  1300  1700   KVM: emulate: speed up emulated moves
  900  1050  1050  1300  1250  1400   KVM: emulate: extend memory access optimization to stores
  825  1000  1000  1250  1200  1350   KVM: emulate: do not initialize memopp
  750   950   950  1150  1050  1200   KVM: emulate: avoid per-byte copying in instruction fetches
  720   850   850  1075  1000  1100   KVM: x86: use kvm_read_guest_page for emulator accesses

The above only lists the patches where the improvement on kvm-unit-tests
became consistently identifiable and reproducible. Take these numbers
with a grain of salt: the rounding was done by hand, no standard
deviation is provided, and so on.

I tried to be quite strict and limited this series to patches that obey
the following criteria:

* either the patch is by itself a measurable improvement
(example: patch 6)

* or the patch is a really, really obvious improvement (example:
patch 17); the compiler would have to screw up badly for this not to
be the case

* or the patch is just preparatory for a subsequent measurable
improvement.

Quite a few functions disappear from the profile entirely, and others
have their cost cut by a pretty large factor (the two columns show the
cost before and after the series):

 61643          [kvm_intel]        vmx_segment_access_rights
 47504          [kvm]              vcpu_enter_guest
 34610          [kvm_intel]        rmode_segment_valid
 30312   7119   [kvm_intel]        vmx_get_segment
 27371  23363   [kvm]              x86_decode_insn
 20924  21185   [kernel.kallsyms]  copy_user_generic_string
 18775   3614   [kvm_intel]        vmx_read_guest_seg_selector
 18040   9580   [kvm]              emulator_get_segment
 16061   5791   [kvm]              do_insn_fetch (__do_insn_fetch_bytes after patches)
 15834   5530   [kvm]              kvm_read_guest (kvm_fetch_guest_virt after patches)
 15721          [kernel.kallsyms]  __srcu_read_lock
 15439   4115   [kvm]              init_emulate_ctxt
 14421  11692   [kvm]              x86_emulate_instruction
 12498          [kernel.kallsyms]  __srcu_read_unlock
 12385  11779   [kvm]              __linearize
 12385  13194   [kvm]              decode_operand
  7408   5574   [kvm]              x86_emulate_insn
  6447          [kvm]              kvm_lapic_find_highest_irr
  6390          [kvm_intel]        vmx_handle_exit
  5598   3418   [kvm_intel]        vmx_interrupt_allowed

Honorable mentions among things that I tried and didn't have the effect
I hoped for: using __get_user/__put_user to read memory operands, and
simplifying linearize.


Patches 1-6 are various low-hanging fruit, which alone provide a
2-2.5x speedup (higher on simpler instructions).

Patches 7-12 make the emulator cache the host virtual address of memory
operands, thus avoiding walking the page tables twice.
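
The idea in miniature, as a stand-alone toy (made-up names, not the
actual KVM code): an uncached read-modify-write translates the guest
address once for the read and once more for the write, while caching
the host address reuses the first translation.

    #include <stdio.h>

    static int walks;                       /* simulated page-table walks */

    static unsigned char *translate(unsigned char *mem, unsigned long gva)
    {
            walks++;                        /* every call stands for a walk */
            return &mem[gva];               /* identity mapping in the toy */
    }

    int main(void)
    {
            unsigned char mem[16] = { 0 };
            unsigned char *hva, v;

            /* uncached RMW: the read walks, then the write walks again */
            v = *translate(mem, 3);
            *translate(mem, 3) = v + 1;
            printf("uncached: %d walks\n", walks);  /* 2 */

            /* cached RMW: translate once and keep the host pointer */
            walks = 0;
            hva = translate(mem, 3);
            *hva += 1;
            printf("cached:   %d walks\n", walks);  /* 1 */
            return 0;
    }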

Patches 13-18 avoid unnecessarily wasting time in the memset that
clears x86_emulate_ctxt.

Patches 19-22 speed up operand fetching.

Patches 23-25 are the loose ends.

Bandan Das (6):
KVM: emulate: move init_decode_cache to emulate.c
KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks
KVM: emulate: cleanup decode_modrm
KVM: emulate: clean up initializations in init_decode_cache
KVM: emulate: rework seg_override
KVM: emulate: do not initialize memopp

Paolo Bonzini (19):
KVM: vmx: speed up emulation of invalid guest state
KVM: x86: return all bits from get_interrupt_shadow
KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
KVM: emulate: move around some checks
KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())"
KVM: emulate: speed up emulated moves
KVM: emulate: simplify writeback
KVM: emulate: abstract handling of memory operands
KVM: export mark_page_dirty_in_slot
KVM: emulate: introduce memory_prepare callback to speed up memory access
KVM: emulate: activate memory access optimization
KVM: emulate: extend memory access optimization to stores
KVM: emulate: speed up do_insn_fetch
KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
KVM: emulate: avoid per-byte copying in instruction fetches
KVM: emulate: put pointers in the fetch_cache
KVM: x86: use kvm_read_guest_page for emulator accesses
KVM: emulate: simplify BitOp handling
KVM: emulate: fix harmless typo in MMX decoding

arch/x86/include/asm/kvm_emulate.h | 59 ++++-
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/emulate.c | 481 ++++++++++++++++++++++---------------
arch/x86/kvm/svm.c | 6 +-
arch/x86/kvm/trace.h | 6 +-
arch/x86/kvm/vmx.c | 9 +-
arch/x86/kvm/x86.c | 147 +++++++++---
include/linux/kvm_host.h | 6 +
virt/kvm/kvm_main.c | 17 +-
9 files changed, 473 insertions(+), 260 deletions(-)

--
1.8.3.1


2014-06-09 12:59:27

by Paolo Bonzini

Subject: [PATCH 02/25] KVM: x86: return all bits from get_interrupt_shadow

For the next patch we will need to know the full state of the
interrupt shadow; we will then set KVM_REQ_EVENT when one bit
is cleared.

However, right now get_interrupt_shadow only returns the one
corresponding to the emulated instruction, or an unconditional
0 if the emulated instruction does not have an interrupt shadow.
This is confusing and does not allow us to check for cleared
bits as mentioned above.

Clean the callback up, and modify toggle_interruptibility to
match the comment above the call. As a small result, the
call to set_interrupt_shadow will be skipped in the common
case where int_shadow == 0 && mask == 0.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/svm.c | 6 +++---
arch/x86/kvm/vmx.c | 4 ++--
arch/x86/kvm/x86.c | 10 +++++-----
4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 63e020be3da7..0b140dc65bee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -717,7 +717,7 @@ struct kvm_x86_ops {
int (*handle_exit)(struct kvm_vcpu *vcpu);
void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
void (*set_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
- u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
+ u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu);
void (*patch_hypercall)(struct kvm_vcpu *vcpu,
unsigned char *hypercall_addr);
void (*set_irq)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ec8366c5cfea..175702d51176 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -486,14 +486,14 @@ static int is_external_interrupt(u32 info)
return info == (SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR);
}

-static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
u32 ret = 0;

if (svm->vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK)
- ret |= KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS;
- return ret & mask;
+ ret = KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS;
+ return ret;
}

static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a2ae11d162fe..57ed3aa0f5d3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1940,7 +1940,7 @@ static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
vmcs_writel(GUEST_RFLAGS, rflags);
}

-static u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+static u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
{
u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
int ret = 0;
@@ -1950,7 +1950,7 @@ static u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
if (interruptibility & GUEST_INTR_STATE_MOV_SS)
ret |= KVM_X86_SHADOW_INT_MOV_SS;

- return ret & mask;
+ return ret;
}

static void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 451d6acea808..70faa1089b75 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2974,9 +2974,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
vcpu->arch.interrupt.pending && !vcpu->arch.interrupt.soft;
events->interrupt.nr = vcpu->arch.interrupt.nr;
events->interrupt.soft = 0;
- events->interrupt.shadow =
- kvm_x86_ops->get_interrupt_shadow(vcpu,
- KVM_X86_SHADOW_INT_MOV_SS | KVM_X86_SHADOW_INT_STI);
+ events->interrupt.shadow = kvm_x86_ops->get_interrupt_shadow(vcpu);

events->nmi.injected = vcpu->arch.nmi_injected;
events->nmi.pending = vcpu->arch.nmi_pending != 0;
@@ -4857,7 +4855,7 @@ static const struct x86_emulate_ops emulate_ops = {

static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
{
- u32 int_shadow = kvm_x86_ops->get_interrupt_shadow(vcpu, mask);
+ u32 int_shadow = kvm_x86_ops->get_interrupt_shadow(vcpu);
/*
* an sti; sti; sequence only disable interrupts for the first
* instruction. So, if the last instruction, be it emulated or
@@ -4865,7 +4863,9 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
* means that the last instruction is an sti. We should not
* leave the flag on in this case. The same goes for mov ss
*/
- if (!(int_shadow & mask))
+ if (int_shadow & mask)
+ mask = 0;
+ if (unlikely(int_shadow || mask))
kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
}

--
1.8.3.1

2014-06-09 12:59:37

by Paolo Bonzini

Subject: [PATCH 10/25] KVM: emulate: introduce memory_prepare callback to speed up memory access

Emulating a RMW instruction currently walks the page tables
twice. To avoid this, store the host virtual address and access
it directly.

In fact, it turns out that the optimizations we can do
actually benefit all memory accesses.
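
Roughly how the emulator ends up consuming the new callback pair once
the following patches wire it up (condensed from patch 11; the read
cache and the retry after a failed __copy_from_user are omitted):

    rc = ctxt->ops->memory_prepare(ctxt, gva, size, &ctxt->exception,
                                   true, &op->opaque, &op->hva);
    if (rc != X86EMUL_CONTINUE)
            return rc;

    if (likely(!kvm_is_error_hva(op->hva))) {
            /* fast path: access guest memory through the host mapping */
            rc = read_from_user(ctxt, op->hva, &op->val, size);
            if (!write)
                    ctxt->ops->memory_finish(ctxt, op->opaque, op->hva);
    } else {
            /* fall back to the usual, fully emulated access */
            rc = read_emulated(ctxt, gva, &op->val, size);
    }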

Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 26 +++++++++++++++
arch/x86/kvm/x86.c | 67 ++++++++++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 5 +++
virt/kvm/kvm_main.c | 5 ---
4 files changed, 98 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 46725e8aa8cb..1aa2adf0bb1a 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -167,6 +167,32 @@ struct x86_emulate_ops {
const void *new,
unsigned int bytes,
struct x86_exception *fault);
+
+ /*
+ * memory_prepare: Prepare userspace access fastpath.
+ * @addr: [IN ] Linear address to access.
+ * @bytes: [IN ] Number of bytes to access.
+ * @write: [IN ] True if *p_hva will be written to.
+ * @p_opaque: [OUT] Value passed back to memory_finish.
+ * @p_hva: [OUT] Host virtual address for __copy_from/to_user.
+ */
+ int (*memory_prepare)(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr,
+ unsigned int bytes,
+ struct x86_exception *exception,
+ bool write,
+ void **p_opaque,
+ unsigned long *p_hva);
+
+ /*
+ * memory_finish: Complete userspace access fastpath.
+ * @opaque: [OUT] Value passed back from memory_prepare.
+ * @hva: [OUT] Host virtual address computed in memory_prepare.
+ */
+ void (*memory_finish)(struct x86_emulate_ctxt *ctxt,
+ void *p_opaque,
+ unsigned long p_hva);
+
void (*invlpg)(struct x86_emulate_ctxt *ctxt, ulong addr);

int (*pio_in_emulated)(struct x86_emulate_ctxt *ctxt,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e64b8b5cb6cd..02678c2f3721 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4279,6 +4279,71 @@ static const struct read_write_emulator_ops write_emultor = {
.write = true,
};

+static int emulator_memory_prepare(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr,
+ unsigned int bytes,
+ struct x86_exception *exception,
+ bool write,
+ void **p_opaque,
+ unsigned long *p_hva)
+{
+ struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+ struct kvm_memory_slot *memslot;
+ int ret;
+ gpa_t gpa;
+ gfn_t gfn;
+ unsigned long hva;
+
+ if (unlikely(((addr + bytes - 1) ^ addr) & PAGE_MASK))
+ goto no_hva;
+
+ ret = vcpu_mmio_gva_to_gpa(vcpu, addr, &gpa, exception, true);
+ if (ret != 0) {
+ if (ret < 0)
+ return X86EMUL_PROPAGATE_FAULT;
+ goto no_hva;
+ }
+
+ /* A (heavily) simplified version of kvm_gfn_to_hva_cache_init. */
+ gfn = gpa >> PAGE_SHIFT;
+ memslot = gfn_to_memslot(vcpu->kvm, gfn);
+ if (!memslot)
+ goto no_hva;
+
+ if (write) {
+ if (memslot_is_readonly(memslot))
+ goto no_hva;
+
+ *p_opaque = memslot->dirty_bitmap ? memslot : NULL;
+ }
+
+ hva = __gfn_to_hva_memslot(memslot, gfn);
+ if (kvm_is_error_hva(hva))
+ goto no_hva;
+
+ *p_hva = hva + offset_in_page(gpa);
+ return X86EMUL_CONTINUE;
+
+no_hva:
+ *p_hva = KVM_HVA_ERR_BAD;
+ return X86EMUL_CONTINUE;
+}
+
+static void emulator_memory_finish(struct x86_emulate_ctxt *ctxt,
+ void *opaque,
+ unsigned long hva)
+{
+ struct kvm_memory_slot *memslot;
+ gfn_t gfn;
+
+ if (!opaque)
+ return;
+
+ memslot = opaque;
+ gfn = hva_to_gfn_memslot(hva, memslot);
+ mark_page_dirty_in_slot(memslot, gfn);
+}
+
static int emulator_read_write_onepage(unsigned long addr, void *val,
unsigned int bytes,
struct x86_exception *exception,
@@ -4823,6 +4888,8 @@ static const struct x86_emulate_ops emulate_ops = {
.read_std = kvm_read_guest_virt_system,
.write_std = kvm_write_guest_virt_system,
.fetch = kvm_fetch_guest_virt,
+ .memory_prepare = emulator_memory_prepare,
+ .memory_finish = emulator_memory_finish,
.read_emulated = emulator_read_emulated,
.write_emulated = emulator_write_emulated,
.cmpxchg_emulated = emulator_cmpxchg_emulated,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5be9805b0aeb..91f303dd0fe5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -842,6 +842,11 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn)
return NULL;
}

+static inline bool memslot_is_readonly(struct kvm_memory_slot *slot)
+{
+ return slot->flags & KVM_MEM_READONLY;
+}
+
static inline struct kvm_memory_slot *
__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2f9bc20ae2a7..09b19afb2c11 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1028,11 +1028,6 @@ out:
return size;
}

-static bool memslot_is_readonly(struct kvm_memory_slot *slot)
-{
- return slot->flags & KVM_MEM_READONLY;
-}
-
static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
gfn_t *nr_pages, bool write)
{
--
1.8.3.1

2014-06-09 12:59:44

by Paolo Bonzini

Subject: [PATCH 12/25] KVM: emulate: extend memory access optimization to stores

Even on a store the optimization saves about 50 clock cycles,
mostly because the jump in write_memory_operand becomes much more
predictable.

Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 594cb560947c..eaf0853ffaf9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1589,7 +1589,7 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,

static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt,
struct operand *op,
- bool write)
+ bool read, bool write)
{
int rc;
unsigned long gva;
@@ -1605,6 +1605,10 @@ static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt,
if (rc != X86EMUL_CONTINUE)
return rc;

+ /* optimisation - avoid slow emulated read if Mov */
+ if (!read)
+ return X86EMUL_CONTINUE;
+
if (likely(!kvm_is_error_hva(op->hva))) {
rc = read_from_user(ctxt, op->hva, &op->val, size);
if (!write)
@@ -4699,14 +4703,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
}

if ((ctxt->src.type == OP_MEM) && !(ctxt->d & NoAccess)) {
- rc = prepare_memory_operand(ctxt, &ctxt->src, false);
+ rc = prepare_memory_operand(ctxt, &ctxt->src, true, false);
if (rc != X86EMUL_CONTINUE)
goto done;
ctxt->src.orig_val64 = ctxt->src.val64;
}

if (ctxt->src2.type == OP_MEM) {
- rc = prepare_memory_operand(ctxt, &ctxt->src2, false);
+ rc = prepare_memory_operand(ctxt, &ctxt->src2, true, false);
if (rc != X86EMUL_CONTINUE)
goto done;
}
@@ -4715,9 +4719,9 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
goto special_insn;


- if ((ctxt->dst.type == OP_MEM) && !(ctxt->d & Mov)) {
- /* optimisation - avoid slow emulated read if Mov */
+ if (ctxt->dst.type == OP_MEM) {
rc = prepare_memory_operand(ctxt, &ctxt->dst,
+ !(ctxt->d & Mov),
!(ctxt->d & NoWrite));
if (rc != X86EMUL_CONTINUE)
goto done;
--
1.8.3.1

2014-06-09 12:59:58

by Paolo Bonzini

Subject: [PATCH 17/25] KVM: emulate: rework seg_override

From: Bandan Das <[email protected]>

x86_decode_insn already sets a default for seg_override,
so remove it from the zeroed area. Also replace set/get functions
with direct access to the field.

Signed-off-by: Bandan Das <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 3 +--
arch/x86/kvm/emulate.c | 41 +++++++++++++++-----------------------
2 files changed, 17 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 8f03e7158800..268bbd4aae99 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -322,12 +322,10 @@ struct x86_emulate_ctxt {
struct operand dst;
int (*execute)(struct x86_emulate_ctxt *ctxt);
int (*check_perm)(struct x86_emulate_ctxt *ctxt);
- bool has_seg_override;
bool rip_relative;
u8 rex_prefix;
u8 lock_prefix;
u8 rep_prefix;
- u8 seg_override;
/* bitmaps of registers in _regs[] that can be read */
u32 regs_valid;
/* bitmaps of registers in _regs[] that have been written */
@@ -338,6 +336,7 @@ struct x86_emulate_ctxt {
u8 modrm_reg;
u8 modrm_rm;
u8 modrm_seg;
+ u8 seg_override;
u64 d;
unsigned long _eip;
struct operand memop;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index cef069e17e2e..05f7bb416fc2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -513,12 +513,6 @@ static u32 desc_limit_scaled(struct desc_struct *desc)
return desc->g ? (limit << 12) | 0xfff : limit;
}

-static void set_seg_override(struct x86_emulate_ctxt *ctxt, int seg)
-{
- ctxt->has_seg_override = true;
- ctxt->seg_override = seg;
-}
-
static unsigned long seg_base(struct x86_emulate_ctxt *ctxt, int seg)
{
if (ctxt->mode == X86EMUL_MODE_PROT64 && seg < VCPU_SREG_FS)
@@ -527,14 +521,6 @@ static unsigned long seg_base(struct x86_emulate_ctxt *ctxt, int seg)
return ctxt->ops->get_cached_segment_base(ctxt, seg);
}

-static unsigned seg_override(struct x86_emulate_ctxt *ctxt)
-{
- if (!ctxt->has_seg_override)
- return 0;
-
- return ctxt->seg_override;
-}
-
static int emulate_exception(struct x86_emulate_ctxt *ctxt, int vec,
u32 error, bool valid)
{
@@ -4243,7 +4229,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op,
op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
op->addr.mem.ea =
register_address(ctxt, reg_read(ctxt, VCPU_REGS_RSI));
- op->addr.mem.seg = seg_override(ctxt);
+ op->addr.mem.seg = ctxt->seg_override;
op->val = 0;
op->count = 1;
break;
@@ -4254,7 +4240,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op,
register_address(ctxt,
reg_read(ctxt, VCPU_REGS_RBX) +
(reg_read(ctxt, VCPU_REGS_RAX) & 0xff));
- op->addr.mem.seg = seg_override(ctxt);
+ op->addr.mem.seg = ctxt->seg_override;
op->val = 0;
break;
case OpImmFAddr:
@@ -4301,6 +4287,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
int mode = ctxt->mode;
int def_op_bytes, def_ad_bytes, goffset, simd_prefix;
bool op_prefix = false;
+ bool has_seg_override = false;
struct opcode opcode;

ctxt->memop.type = OP_NONE;
@@ -4354,11 +4341,13 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
case 0x2e: /* CS override */
case 0x36: /* SS override */
case 0x3e: /* DS override */
- set_seg_override(ctxt, (ctxt->b >> 3) & 3);
+ has_seg_override = true;
+ ctxt->seg_override = (ctxt->b >> 3) & 3;
break;
case 0x64: /* FS override */
case 0x65: /* GS override */
- set_seg_override(ctxt, ctxt->b & 7);
+ has_seg_override = true;
+ ctxt->seg_override = ctxt->b & 7;
break;
case 0x40 ... 0x4f: /* REX */
if (mode != X86EMUL_MODE_PROT64)
@@ -4496,17 +4485,19 @@ done_prefixes:
/* ModRM and SIB bytes. */
if (ctxt->d & ModRM) {
rc = decode_modrm(ctxt, &ctxt->memop);
- if (!ctxt->has_seg_override)
- set_seg_override(ctxt, ctxt->modrm_seg);
+ if (!has_seg_override) {
+ has_seg_override = true;
+ ctxt->seg_override = ctxt->modrm_seg;
+ }
} else if (ctxt->d & MemAbs)
rc = decode_abs(ctxt, &ctxt->memop);
if (rc != X86EMUL_CONTINUE)
goto done;

- if (!ctxt->has_seg_override)
- set_seg_override(ctxt, VCPU_SREG_DS);
+ if (!has_seg_override)
+ ctxt->seg_override = VCPU_SREG_DS;

- ctxt->memop.addr.mem.seg = seg_override(ctxt);
+ ctxt->memop.addr.mem.seg = ctxt->seg_override;

if (ctxt->memop.type == OP_MEM && ctxt->ad_bytes != 8)
ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea;
@@ -4608,8 +4599,8 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *))

void init_decode_cache(struct x86_emulate_ctxt *ctxt)
{
- memset(&ctxt->has_seg_override, 0,
- (void *)&ctxt->modrm - (void *)&ctxt->has_seg_override);
+ memset(&ctxt->rip_relative, 0,
+ (void *)&ctxt->modrm - (void *)&ctxt->rip_relative);

ctxt->io_read.pos = 0;
ctxt->io_read.end = 0;
--
1.8.3.1

2014-06-09 13:00:10

by Paolo Bonzini

Subject: [PATCH 23/25] KVM: x86: use kvm_read_guest_page for emulator accesses

Emulator accesses are always done a page at a time, either by the emulator
itself (for fetches) or because we need to query the MMU for address
translations. Speed up these accesses by using kvm_read_guest_page
and, in the case of fetches, by inlining kvm_read_guest_virt_helper and
dropping the loop around kvm_read_guest_page.

This final tweak saves 30-100 more clock cycles (4-10%), bringing the
count (as measured by kvm-unit-tests) down to 720-1100 clock cycles on
a Sandy Bridge Xeon host, compared to 2300-3200 before the whole series
and 925-1700 after the first two low-hanging fruit changes.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/x86.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 401bdef24d54..179b7c39c3fe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4081,7 +4081,8 @@ static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,

if (gpa == UNMAPPED_GVA)
return X86EMUL_PROPAGATE_FAULT;
- ret = kvm_read_guest(vcpu->kvm, gpa, data, toread);
+ ret = kvm_read_guest_page(vcpu->kvm, gpa >> PAGE_SHIFT, data,
+ offset, toread);
if (ret < 0) {
r = X86EMUL_IO_NEEDED;
goto out;
@@ -4102,10 +4103,24 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+ unsigned offset;
+ int ret;

- return kvm_read_guest_virt_helper(addr, val, bytes, vcpu,
- access | PFERR_FETCH_MASK,
- exception);
+ /* Inline kvm_read_guest_virt_helper for speed. */
+ gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access|PFERR_FETCH_MASK,
+ exception);
+ if (unlikely(gpa == UNMAPPED_GVA))
+ return X86EMUL_PROPAGATE_FAULT;
+
+ offset = addr & (PAGE_SIZE-1);
+ if (WARN_ON(offset + bytes > PAGE_SIZE))
+ bytes = (unsigned)PAGE_SIZE - offset;
+ ret = kvm_read_guest_page(vcpu->kvm, gpa >> PAGE_SHIFT, val,
+ offset, bytes);
+ if (unlikely(ret < 0))
+ return X86EMUL_IO_NEEDED;
+
+ return X86EMUL_CONTINUE;
}

int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
--
1.8.3.1

2014-06-09 13:00:05

by Paolo Bonzini

Subject: [PATCH 20/25] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes

do_insn_fetch_bytes will only be called once in a given insn_fetch and
insn_fetch_arr, because in fact it will only be called at most twice
for any instruction and the first call is explicit in x86_decode_insn.
This observation lets us hoist the call out of the memory copying loop.
It does not buy performance, because most fetches are one byte long
anyway, but it prepares for the next patch.

The overflow check is tricky, but correct. Because do_insn_fetch_bytes
has already been called once, we know that fc->end is at least 15, so
it is okay to subtract the number of bytes we want to read. For
example, if the cache covers 15 bytes starting at 0x1000 (so fc->end
is 0x100f), a two-byte fetch takes the slow path only when _eip is
greater than 0x100d, i.e. exactly when fewer than two cached bytes
remain.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index aee124a4cc6c..f81d7a5bc19b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -707,7 +707,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
* Prefetch the remaining bytes of the instruction without crossing page
* boundary if they are not in fetch_cache yet.
*/
-static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
+static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
{
struct fetch_cache *fc = &ctxt->fetch;
int rc;
@@ -719,7 +719,14 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
cur_size = fc->end - fc->start;
size = min(15UL - cur_size,
PAGE_SIZE - offset_in_page(fc->end));
- if (unlikely(size == 0))
+
+ /*
+ * One instruction can only straddle two pages,
+ * and one has been loaded at the beginning of
+ * x86_decode_insn. So, if not enough bytes
+ * still, we must have hit the 15-byte boundary.
+ */
+ if (unlikely(size < op_size))
return X86EMUL_UNHANDLEABLE;
rc = __linearize(ctxt, addr, size, false, true, &linear);
if (unlikely(rc != X86EMUL_CONTINUE))
@@ -735,17 +742,18 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
void *__dest, unsigned size)
{
- int rc;
struct fetch_cache *fc = &ctxt->fetch;
u8 *dest = __dest;
u8 *src = &fc->data[ctxt->_eip - fc->start];

+ /* We have to be careful about overflow! */
+ if (unlikely(ctxt->_eip > fc->end - size)) {
+ int rc = do_insn_fetch_bytes(ctxt, size);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ }
+
while (size--) {
- if (unlikely(ctxt->_eip == fc->end)) {
- rc = do_insn_fetch_bytes(ctxt);
- if (rc != X86EMUL_CONTINUE)
- return rc;
- }
*dest++ = *src++;
ctxt->_eip++;
continue;
@@ -4302,7 +4310,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
if (insn_len > 0)
memcpy(ctxt->fetch.data, insn, insn_len);
else {
- rc = do_insn_fetch_bytes(ctxt);
+ rc = do_insn_fetch_bytes(ctxt, 1);
if (rc != X86EMUL_CONTINUE)
return rc;
}
--
1.8.3.1

2014-06-09 13:00:01

by Paolo Bonzini

Subject: [PATCH 21/25] KVM: emulate: avoid per-byte copying in instruction fetches

We do not need a memory copying loop anymore in insn_fetch; we
can use a byte-aligned pointer to access instruction fields directly
from the fetch_cache. This eliminates 50-150 cycles (corresponding to
a 5-10% improvement in performance) from each instruction.
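
The effect in miniature, as a stand-alone illustration (made-up bytes,
not the actual KVM code): on a little-endian host such as x86, pulling
a multi-byte immediate out of the byte cache with one unaligned load
gives the same value as assembling it byte by byte.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* b8 78 56 34 12 = mov $0x12345678, %eax */
    static const uint8_t fetch_data[] = { 0xb8, 0x78, 0x56, 0x34, 0x12 };

    int main(void)
    {
            const uint8_t *p = fetch_data + 1;      /* points at the imm32 */
            uint32_t slow = 0, fast;
            int i;

            /* what the old per-byte copy loop effectively did */
            for (i = 0; i < 4; i++)
                    slow |= (uint32_t)p[i] << (8 * i);

            /* what insn_fetch now does through a __aligned(1) pointer;
               memcpy is the portable equivalent of that unaligned load */
            memcpy(&fast, p, sizeof(fast));

            printf("%#x %#x\n", slow, fast);        /* both 0x12345678 */
            return 0;
    }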

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 46 ++++++++++++++++++++++------------------------
1 file changed, 22 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index f81d7a5bc19b..6ec0be2d2461 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -707,7 +707,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
* Prefetch the remaining bytes of the instruction without crossing page
* boundary if they are not in fetch_cache yet.
*/
-static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
+static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
{
struct fetch_cache *fc = &ctxt->fetch;
int rc;
@@ -739,41 +739,39 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
return X86EMUL_CONTINUE;
}

-static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
- void *__dest, unsigned size)
+static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
+ unsigned size)
{
- struct fetch_cache *fc = &ctxt->fetch;
- u8 *dest = __dest;
- u8 *src = &fc->data[ctxt->_eip - fc->start];
-
/* We have to be careful about overflow! */
- if (unlikely(ctxt->_eip > fc->end - size)) {
- int rc = do_insn_fetch_bytes(ctxt, size);
- if (rc != X86EMUL_CONTINUE)
- return rc;
- }
-
- while (size--) {
- *dest++ = *src++;
- ctxt->_eip++;
- continue;
- }
- return X86EMUL_CONTINUE;
+ if (unlikely(ctxt->_eip > ctxt->fetch.end - size))
+ return __do_insn_fetch_bytes(ctxt, size);
+ else
+ return X86EMUL_CONTINUE;
}

/* Fetch next part of the instruction being emulated. */
#define insn_fetch(_type, _ctxt) \
-({ unsigned long _x; \
- rc = do_insn_fetch(_ctxt, &_x, sizeof(_type)); \
+({ _type _x; \
+ struct fetch_cache *_fc; \
+ \
+ rc = do_insn_fetch_bytes(_ctxt, sizeof(_type)); \
if (rc != X86EMUL_CONTINUE) \
goto done; \
- (_type)_x; \
+ _fc = &ctxt->fetch; \
+ _x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
+ ctxt->_eip += sizeof(_type); \
+ _x; \
})

#define insn_fetch_arr(_arr, _size, _ctxt) \
-({ rc = do_insn_fetch(_ctxt, _arr, (_size)); \
+({ \
+ struct fetch_cache *_fc; \
+ rc = do_insn_fetch_bytes(_ctxt, _size); \
if (rc != X86EMUL_CONTINUE) \
goto done; \
+ _fc = &ctxt->fetch; \
+ memcpy(_arr, &_fc->data[ctxt->_eip - _fc->start], _size); \
+ ctxt->_eip += (_size); \
})

/*
@@ -4310,7 +4308,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
if (insn_len > 0)
memcpy(ctxt->fetch.data, insn, insn_len);
else {
- rc = do_insn_fetch_bytes(ctxt, 1);
+ rc = __do_insn_fetch_bytes(ctxt, 1);
if (rc != X86EMUL_CONTINUE)
return rc;
}
--
1.8.3.1

2014-06-09 13:01:00

by Paolo Bonzini

Subject: [PATCH 25/25] KVM: emulate: fix harmless typo in MMX decoding

It was using the wrong member of the union.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 2e8f0e674616..b8dc78f4672d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1081,7 +1081,7 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
if (ctxt->d & Mmx) {
op->type = OP_MM;
op->bytes = 8;
- op->addr.xmm = ctxt->modrm_rm & 7;
+ op->addr.mm = ctxt->modrm_rm & 7;
return rc;
}
fetch_register_operand(op);
--
1.8.3.1

2014-06-09 12:59:54

by Paolo Bonzini

Subject: [PATCH 16/25] KVM: emulate: clean up initializations in init_decode_cache

From: Bandan Das <[email protected]>

A lot of initializations are unnecessary, as the fields get set to
appropriate values before they are actually used. Also optimize the
placement of fields in x86_emulate_ctxt so that the fields which are
cleared together are adjacent.

Signed-off-by: Bandan Das <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 20 +++++++++++---------
arch/x86/kvm/emulate.c | 7 ++-----
2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index b8456835eddb..8f03e7158800 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -315,30 +315,32 @@ struct x86_emulate_ctxt {
u8 opcode_len;
u8 b;
u8 intercept;
- u8 lock_prefix;
- u8 rep_prefix;
u8 op_bytes;
u8 ad_bytes;
- u8 rex_prefix;
struct operand src;
struct operand src2;
struct operand dst;
- bool has_seg_override;
- u8 seg_override;
- u64 d;
int (*execute)(struct x86_emulate_ctxt *ctxt);
int (*check_perm)(struct x86_emulate_ctxt *ctxt);
+ bool has_seg_override;
+ bool rip_relative;
+ u8 rex_prefix;
+ u8 lock_prefix;
+ u8 rep_prefix;
+ u8 seg_override;
+ /* bitmaps of registers in _regs[] that can be read */
+ u32 regs_valid;
+ /* bitmaps of registers in _regs[] that have been written */
+ u32 regs_dirty;
/* modrm */
u8 modrm;
u8 modrm_mod;
u8 modrm_reg;
u8 modrm_rm;
u8 modrm_seg;
- bool rip_relative;
+ u64 d;
unsigned long _eip;
struct operand memop;
- u32 regs_valid; /* bitmaps of registers in _regs[] that can be read */
- u32 regs_dirty; /* bitmaps of registers in _regs[] that have been written */
/* Fields above regs are cleared together. */
unsigned long _regs[NR_VCPU_REGS];
struct operand *memopp;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c92b1cfbb430..cef069e17e2e 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4608,14 +4608,11 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *))

void init_decode_cache(struct x86_emulate_ctxt *ctxt)
{
- memset(&ctxt->opcode_len, 0,
- (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);
+ memset(&ctxt->has_seg_override, 0,
+ (void *)&ctxt->modrm - (void *)&ctxt->has_seg_override);

- ctxt->fetch.start = 0;
- ctxt->fetch.end = 0;
ctxt->io_read.pos = 0;
ctxt->io_read.end = 0;
- ctxt->mem_read.pos = 0;
ctxt->mem_read.end = 0;
}

--
1.8.3.1

2014-06-09 13:01:23

by Paolo Bonzini

Subject: [PATCH 22/25] KVM: emulate: put pointers in the fetch_cache

This simplifies the code a bit, especially the overflow checks: with
pointers into the cache, the test is simply whether fewer than the
requested number of prefetched bytes remain (fetch.end - fetch.ptr).

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 4 ++--
arch/x86/kvm/emulate.c | 34 +++++++++++++++-------------------
arch/x86/kvm/trace.h | 6 +++---
3 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 692b0bfadf91..012fff3fea0d 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -269,8 +269,8 @@ struct operand {

struct fetch_cache {
u8 data[15];
- unsigned long start;
- unsigned long end;
+ u8 *ptr;
+ u8 *end;
};

struct read_cache {
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6ec0be2d2461..6b44f90a9bb0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -709,16 +709,15 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
*/
static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
{
- struct fetch_cache *fc = &ctxt->fetch;
int rc;
- int size, cur_size;
+ int size;
unsigned long linear;
-
+ int cur_size = ctxt->fetch.end - ctxt->fetch.data;
struct segmented_address addr = { .seg = VCPU_SREG_CS,
- .ea = fc->end };
- cur_size = fc->end - fc->start;
+ .ea = ctxt->eip + cur_size };
+
size = min(15UL - cur_size,
- PAGE_SIZE - offset_in_page(fc->end));
+ PAGE_SIZE - offset_in_page(addr.ea));

/*
* One instruction can only straddle two pages,
@@ -731,19 +730,18 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
rc = __linearize(ctxt, addr, size, false, true, &linear);
if (unlikely(rc != X86EMUL_CONTINUE))
return rc;
- rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
+ rc = ctxt->ops->fetch(ctxt, linear, ctxt->fetch.end,
size, &ctxt->exception);
if (unlikely(rc != X86EMUL_CONTINUE))
return rc;
- fc->end += size;
+ ctxt->fetch.end += size;
return X86EMUL_CONTINUE;
}

static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
unsigned size)
{
- /* We have to be careful about overflow! */
- if (unlikely(ctxt->_eip > ctxt->fetch.end - size))
+ if (unlikely(ctxt->fetch.end - ctxt->fetch.ptr < size))
return __do_insn_fetch_bytes(ctxt, size);
else
return X86EMUL_CONTINUE;
@@ -752,26 +750,24 @@ static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
/* Fetch next part of the instruction being emulated. */
#define insn_fetch(_type, _ctxt) \
({ _type _x; \
- struct fetch_cache *_fc; \
\
rc = do_insn_fetch_bytes(_ctxt, sizeof(_type)); \
if (rc != X86EMUL_CONTINUE) \
goto done; \
- _fc = &ctxt->fetch; \
- _x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
ctxt->_eip += sizeof(_type); \
+ _x = *(_type __aligned(1) *) ctxt->fetch.ptr; \
+ ctxt->fetch.ptr += sizeof(_type); \
_x; \
})

#define insn_fetch_arr(_arr, _size, _ctxt) \
({ \
- struct fetch_cache *_fc; \
rc = do_insn_fetch_bytes(_ctxt, _size); \
if (rc != X86EMUL_CONTINUE) \
goto done; \
- _fc = &ctxt->fetch; \
- memcpy(_arr, &_fc->data[ctxt->_eip - _fc->start], _size); \
ctxt->_eip += (_size); \
+ memcpy(_arr, ctxt->fetch.ptr, _size); \
+ ctxt->fetch.ptr += (_size); \
})

/*
@@ -4302,8 +4298,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
ctxt->memop.type = OP_NONE;
ctxt->memopp = NULL;
ctxt->_eip = ctxt->eip;
- ctxt->fetch.start = ctxt->_eip;
- ctxt->fetch.end = ctxt->fetch.start + insn_len;
+ ctxt->fetch.ptr = ctxt->fetch.data;
+ ctxt->fetch.end = ctxt->fetch.data + insn_len;
ctxt->opcode_len = 1;
if (insn_len > 0)
memcpy(ctxt->fetch.data, insn, insn_len);
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 33574c95220d..e850a7d332be 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -721,10 +721,10 @@ TRACE_EVENT(kvm_emulate_insn,
),

TP_fast_assign(
- __entry->rip = vcpu->arch.emulate_ctxt.fetch.start;
__entry->csbase = kvm_x86_ops->get_segment_base(vcpu, VCPU_SREG_CS);
- __entry->len = vcpu->arch.emulate_ctxt._eip
- - vcpu->arch.emulate_ctxt.fetch.start;
+ __entry->len = vcpu->arch.emulate_ctxt.fetch.ptr
+ - vcpu->arch.emulate_ctxt.fetch.data;
+ __entry->rip = vcpu->arch.emulate_ctxt._eip - __entry->len;
memcpy(__entry->insn,
vcpu->arch.emulate_ctxt.fetch.data,
15);
--
1.8.3.1

2014-06-09 13:01:17

by Paolo Bonzini

Subject: [PATCH 24/25] KVM: emulate: simplify BitOp handling

Memory is always the destination for BitOp instructions.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6b44f90a9bb0..2e8f0e674616 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4145,7 +4145,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op,
mem_common:
*op = ctxt->memop;
ctxt->memopp = op;
- if ((ctxt->d & BitOp) && op == &ctxt->dst)
+ if (ctxt->d & BitOp)
fetch_bit_operand(ctxt);
op->orig_val = op->val;
break;
--
1.8.3.1

2014-06-09 13:01:51

by Paolo Bonzini

Subject: [PATCH 19/25] KVM: emulate: speed up do_insn_fetch

Hoist the common case up from do_insn_fetch_byte to do_insn_fetch,
and prime the fetch_cache in x86_decode_insn. This helps the compiler
and the branch predictor a bit, but above all it lays the groundwork
for further changes in the next few patches.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 67 +++++++++++++++++++++++++++-----------------------
1 file changed, 36 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 32f38a3467a0..aee124a4cc6c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -704,51 +704,51 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
}

/*
- * Fetch the next byte of the instruction being emulated which is pointed to
- * by ctxt->_eip, then increment ctxt->_eip.
- *
- * Also prefetch the remaining bytes of the instruction without crossing page
+ * Prefetch the remaining bytes of the instruction without crossing page
* boundary if they are not in fetch_cache yet.
*/
-static int do_insn_fetch_byte(struct x86_emulate_ctxt *ctxt, u8 *dest)
+static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
{
struct fetch_cache *fc = &ctxt->fetch;
int rc;
int size, cur_size;
-
- if (ctxt->_eip == fc->end) {
- unsigned long linear;
- struct segmented_address addr = { .seg = VCPU_SREG_CS,
- .ea = ctxt->_eip };
- cur_size = fc->end - fc->start;
- size = min(15UL - cur_size,
- PAGE_SIZE - offset_in_page(ctxt->_eip));
- rc = __linearize(ctxt, addr, size, false, true, &linear);
- if (unlikely(rc != X86EMUL_CONTINUE))
- return rc;
- rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
- size, &ctxt->exception);
- if (unlikely(rc != X86EMUL_CONTINUE))
- return rc;
- fc->end += size;
- }
- *dest = fc->data[ctxt->_eip - fc->start];
- ctxt->_eip++;
+ unsigned long linear;
+
+ struct segmented_address addr = { .seg = VCPU_SREG_CS,
+ .ea = fc->end };
+ cur_size = fc->end - fc->start;
+ size = min(15UL - cur_size,
+ PAGE_SIZE - offset_in_page(fc->end));
+ if (unlikely(size == 0))
+ return X86EMUL_UNHANDLEABLE;
+ rc = __linearize(ctxt, addr, size, false, true, &linear);
+ if (unlikely(rc != X86EMUL_CONTINUE))
+ return rc;
+ rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
+ size, &ctxt->exception);
+ if (unlikely(rc != X86EMUL_CONTINUE))
+ return rc;
+ fc->end += size;
return X86EMUL_CONTINUE;
}

static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
- void *dest, unsigned size)
+ void *__dest, unsigned size)
{
int rc;
+ struct fetch_cache *fc = &ctxt->fetch;
+ u8 *dest = __dest;
+ u8 *src = &fc->data[ctxt->_eip - fc->start];

- /* x86 instructions are limited to 15 bytes. */
- if (unlikely(ctxt->_eip + size - ctxt->eip > 15))
- return X86EMUL_UNHANDLEABLE;
while (size--) {
- rc = do_insn_fetch_byte(ctxt, dest++);
- if (rc != X86EMUL_CONTINUE)
- return rc;
+ if (unlikely(ctxt->_eip == fc->end)) {
+ rc = do_insn_fetch_bytes(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ }
+ *dest++ = *src++;
+ ctxt->_eip++;
+ continue;
}
return X86EMUL_CONTINUE;
}
@@ -4301,6 +4301,11 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
ctxt->opcode_len = 1;
if (insn_len > 0)
memcpy(ctxt->fetch.data, insn, insn_len);
+ else {
+ rc = do_insn_fetch_bytes(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ }

switch (mode) {
case X86EMUL_MODE_REAL:
--
1.8.3.1

2014-06-09 12:59:48

by Paolo Bonzini

Subject: [PATCH 14/25] KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks

From: Bandan Das <[email protected]>

The same information can be gleaned from ctxt->d; this avoids having
to zero/NULL-initialize intercept and check_perm.

Signed-off-by: Bandan Das <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 34a04ba26c23..75e26bd4a21b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4673,7 +4673,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
fetch_possible_mmx_operand(ctxt, &ctxt->dst);
}

- if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
+ if (unlikely(ctxt->guest_mode) && (ctxt->d & Intercept)) {
rc = emulator_check_intercept(ctxt, ctxt->intercept,
X86_ICPT_PRE_EXCEPT);
if (rc != X86EMUL_CONTINUE)
@@ -4693,13 +4693,13 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
}

/* Do instruction specific permission checks */
- if (ctxt->check_perm) {
+ if (ctxt->d & CheckPerm) {
rc = ctxt->check_perm(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
}

- if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
+ if (unlikely(ctxt->guest_mode) && (ctxt->d & Intercept)) {
rc = emulator_check_intercept(ctxt, ctxt->intercept,
X86_ICPT_POST_EXCEPT);
if (rc != X86EMUL_CONTINUE)
@@ -4743,7 +4743,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)

special_insn:

- if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
+ if (unlikely(ctxt->guest_mode) && (ctxt->d & Intercept)) {
rc = emulator_check_intercept(ctxt, ctxt->intercept,
X86_ICPT_POST_MEMACCESS);
if (rc != X86EMUL_CONTINUE)
--
1.8.3.1

2014-06-09 13:02:24

by Paolo Bonzini

Subject: [PATCH 18/25] KVM: emulate: do not initialize memopp

From: Bandan Das <[email protected]>

rip_relative is only set if decode_modrm runs, and if you have ModRM
you will also have a memopp. We can then access memopp
unconditionally. Note that rip_relative cannot be hoisted up to
decode_modrm, or you break "mov $0, xyz(%rip)": the displacement is
relative to RIP at the end of the whole instruction, including any
immediate that is decoded after ModRM, so the adjustment can only be
applied once decoding is complete.

Also, move the typecast that truncates out-of-range values of mem.ea
into decode_modrm.

Together, all these optimizations save about 50 cycles on each
emulated instruction (4-6%).

Signed-off-by: Bandan Das <[email protected]>
[Fix immediate operands with rip-relative addressing. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 5 +++++
arch/x86/kvm/emulate.c | 8 ++++----
2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 268bbd4aae99..692b0bfadf91 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -322,6 +322,11 @@ struct x86_emulate_ctxt {
struct operand dst;
int (*execute)(struct x86_emulate_ctxt *ctxt);
int (*check_perm)(struct x86_emulate_ctxt *ctxt);
+ /*
+ * The following six fields are cleared together,
+ * the rest are initialized unconditionally in x86_decode_insn
+ * or elsewhere
+ */
bool rip_relative;
u8 rex_prefix;
u8 lock_prefix;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 05f7bb416fc2..32f38a3467a0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1176,6 +1176,9 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
}
}
op->addr.mem.ea = modrm_ea;
+ if (ctxt->ad_bytes != 8)
+ ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea;
+
done:
return rc;
}
@@ -4499,9 +4502,6 @@ done_prefixes:

ctxt->memop.addr.mem.seg = ctxt->seg_override;

- if (ctxt->memop.type == OP_MEM && ctxt->ad_bytes != 8)
- ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea;
-
/*
* Decode and fetch the source operand: register, memory
* or immediate.
@@ -4522,7 +4522,7 @@ done_prefixes:
rc = decode_operand(ctxt, &ctxt->dst, (ctxt->d >> DstShift) & OpMask);

done:
- if (ctxt->memopp && ctxt->memopp->type == OP_MEM && ctxt->rip_relative)
+ if (ctxt->rip_relative)
ctxt->memopp->addr.mem.ea += ctxt->_eip;

return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
--
1.8.3.1

2014-06-09 12:59:42

by Paolo Bonzini

Subject: [PATCH 13/25] KVM: emulate: move init_decode_cache to emulate.c

From: Bandan Das <[email protected]>

Core emulator functions all belong in emulate.c; x86.c should have no
knowledge of emulator internals.

Signed-off-by: Bandan Das <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 1 +
arch/x86/kvm/emulate.c | 13 +++++++++++++
arch/x86/kvm/x86.c | 13 -------------
3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index f596976a3ab2..b8456835eddb 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -436,6 +436,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
#define EMULATION_OK 0
#define EMULATION_RESTART 1
#define EMULATION_INTERCEPTED 2
+void init_decode_cache(struct x86_emulate_ctxt *ctxt);
int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
u16 tss_selector, int idt_index, int reason,
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index eaf0853ffaf9..34a04ba26c23 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4608,6 +4608,19 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *))
return X86EMUL_CONTINUE;
}

+void init_decode_cache(struct x86_emulate_ctxt *ctxt)
+{
+ memset(&ctxt->opcode_len, 0,
+ (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);
+
+ ctxt->fetch.start = 0;
+ ctxt->fetch.end = 0;
+ ctxt->io_read.pos = 0;
+ ctxt->io_read.end = 0;
+ ctxt->mem_read.pos = 0;
+ ctxt->mem_read.end = 0;
+}
+
int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
{
const struct x86_emulate_ops *ops = ctxt->ops;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02678c2f3721..401bdef24d54 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4952,19 +4952,6 @@ static void inject_emulated_exception(struct kvm_vcpu *vcpu)
kvm_queue_exception(vcpu, ctxt->exception.vector);
}

-static void init_decode_cache(struct x86_emulate_ctxt *ctxt)
-{
- memset(&ctxt->opcode_len, 0,
- (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);
-
- ctxt->fetch.start = 0;
- ctxt->fetch.end = 0;
- ctxt->io_read.pos = 0;
- ctxt->io_read.end = 0;
- ctxt->mem_read.pos = 0;
- ctxt->mem_read.end = 0;
-}
-
static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
{
struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
--
1.8.3.1

2014-06-09 13:02:56

by Paolo Bonzini

Subject: [PATCH 15/25] KVM: emulate: cleanup decode_modrm

From: Bandan Das <[email protected]>

Remove the if conditional; that will help us avoid an "else
initialize to 0". Also, rearrange the operators for slightly better
code.
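
A stand-alone sanity check of that rearrangement (illustration only,
not kernel code): a REX prefix has the form 0100WRXB, and R, X and B
each become bit 3 of modrm_reg, index_reg and modrm_rm; with no prefix
rex_prefix is 0 and the shifted expressions all evaluate to 0, which
is why the conditional can go away.

    #include <stdio.h>

    int main(void)
    {
            unsigned rex;

            /* rex == 0 means "no REX prefix"; 0x40-0x4f are the real ones */
            for (rex = 0; rex <= 0x4f; rex = rex ? rex + 1 : 0x40) {
                    unsigned old_reg  = rex ? (rex & 4) << 1 : 0;  /* REX.R */
                    unsigned old_idx  = rex ? (rex & 2) << 2 : 0;  /* REX.X */
                    unsigned old_base = rex ? (rex & 1) << 3 : 0;  /* REX.B */

                    unsigned new_reg  = (rex << 1) & 8;
                    unsigned new_idx  = (rex << 2) & 8;
                    unsigned new_base = (rex << 3) & 8;

                    if (old_reg != new_reg || old_idx != new_idx ||
                        old_base != new_base) {
                            printf("mismatch at rex=%#x\n", rex);
                            return 1;
                    }
            }
            printf("old and new REX decoding agree\n");
            return 0;
    }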

Signed-off-by: Bandan Das <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 75e26bd4a21b..c92b1cfbb430 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1065,19 +1065,17 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
struct operand *op)
{
u8 sib;
- int index_reg = 0, base_reg = 0, scale;
+ int index_reg, base_reg, scale;
int rc = X86EMUL_CONTINUE;
ulong modrm_ea = 0;

- if (ctxt->rex_prefix) {
- ctxt->modrm_reg = (ctxt->rex_prefix & 4) << 1; /* REX.R */
- index_reg = (ctxt->rex_prefix & 2) << 2; /* REX.X */
- ctxt->modrm_rm = base_reg = (ctxt->rex_prefix & 1) << 3; /* REG.B */
- }
+ ctxt->modrm_reg = ((ctxt->rex_prefix << 1) & 8); /* REX.R */
+ index_reg = (ctxt->rex_prefix << 2) & 8; /* REX.X */
+ base_reg = (ctxt->rex_prefix << 3) & 8; /* REX.B */

- ctxt->modrm_mod |= (ctxt->modrm & 0xc0) >> 6;
+ ctxt->modrm_mod = (ctxt->modrm & 0xc0) >> 6;
ctxt->modrm_reg |= (ctxt->modrm & 0x38) >> 3;
- ctxt->modrm_rm |= (ctxt->modrm & 0x07);
+ ctxt->modrm_rm = base_reg | (ctxt->modrm & 0x07);
ctxt->modrm_seg = VCPU_SREG_DS;

if (ctxt->modrm_mod == 3 || (ctxt->d & NoMod)) {
--
1.8.3.1

2014-06-09 13:03:27

by Paolo Bonzini

Subject: [PATCH 11/25] KVM: emulate: activate memory access optimization

memory_prepare lets us replace segmented_read/segmented_write with direct
calls to __copy_from_user/__copy_to_user. For RMW instructions, we also
avoid double walking of the page tables.

This saves about 50 cycles (5%) on arithmetic with a memory source
operand, and up to 300 cycles (20%) on arithmetic with a memory
destination operand.

Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 2 ++
arch/x86/kvm/emulate.c | 72 +++++++++++++++++++++++++++++++++++---
2 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 1aa2adf0bb1a..f596976a3ab2 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -263,6 +263,8 @@ struct operand {
u64 mm_val;
void *data;
};
+ unsigned long hva;
+ void *opaque;
};

struct fetch_cache {
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7e9dc2d6fd44..594cb560947c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1260,6 +1260,29 @@ read_cached:
return X86EMUL_CONTINUE;
}

+static int read_from_user(struct x86_emulate_ctxt *ctxt,
+ unsigned long hva, void *dest, unsigned size)
+{
+ int rc;
+ struct read_cache *mc = &ctxt->mem_read;
+
+ if (mc->pos < mc->end)
+ goto read_cached;
+
+ WARN_ON((mc->end + size) >= sizeof(mc->data));
+
+ rc = __copy_from_user(mc->data + mc->end, (void __user *)hva, size);
+ if (rc < 0)
+ return X86EMUL_UNHANDLEABLE;
+
+ mc->end += size;
+
+read_cached:
+ memcpy(dest, mc->data + mc->pos, size);
+ mc->pos += size;
+ return X86EMUL_CONTINUE;
+}
+
static int segmented_read(struct x86_emulate_ctxt *ctxt,
struct segmented_address addr,
void *data,
@@ -1565,9 +1588,36 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
}

static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt,
- struct operand *op)
+ struct operand *op,
+ bool write)
{
- return segmented_read(ctxt, op->addr.mem, &op->val, op->bytes);
+ int rc;
+ unsigned long gva;
+ unsigned int size = op->bytes;
+
+ rc = linearize(ctxt, op->addr.mem, size, write, &gva);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ rc = ctxt->ops->memory_prepare(ctxt, gva, size,
+ &ctxt->exception, true,
+ &op->opaque, &op->hva);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ if (likely(!kvm_is_error_hva(op->hva))) {
+ rc = read_from_user(ctxt, op->hva, &op->val, size);
+ if (!write)
+ ctxt->ops->memory_finish(ctxt, op->opaque, op->hva);
+
+ if (likely(rc == X86EMUL_CONTINUE))
+ return X86EMUL_CONTINUE;
+
+ /* Should not happen. */
+ op->hva = KVM_HVA_ERR_BAD;
+ }
+
+ return read_emulated(ctxt, gva, &op->val, size);
}

static int cmpxchg_memory_operand(struct x86_emulate_ctxt *ctxt,
@@ -1582,6 +1632,17 @@ static int cmpxchg_memory_operand(struct x86_emulate_ctxt *ctxt,
static int write_memory_operand(struct x86_emulate_ctxt *ctxt,
struct operand *op)
{
+ int rc;
+
+ if (likely(!kvm_is_error_hva(op->hva))) {
+ rc = __copy_to_user((void __user *)op->hva, &op->val,
+ op->bytes);
+ ctxt->ops->memory_finish(ctxt, op->opaque, op->hva);
+
+ if (likely(!rc))
+ return X86EMUL_CONTINUE;
+ }
+
return segmented_write(ctxt, op->addr.mem,
&op->val,
op->bytes);
@@ -4638,14 +4699,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
}

if ((ctxt->src.type == OP_MEM) && !(ctxt->d & NoAccess)) {
- rc = prepare_memory_operand(ctxt, &ctxt->src);
+ rc = prepare_memory_operand(ctxt, &ctxt->src, false);
if (rc != X86EMUL_CONTINUE)
goto done;
ctxt->src.orig_val64 = ctxt->src.val64;
}

if (ctxt->src2.type == OP_MEM) {
- rc = prepare_memory_operand(ctxt, &ctxt->src2);
+ rc = prepare_memory_operand(ctxt, &ctxt->src2, false);
if (rc != X86EMUL_CONTINUE)
goto done;
}
@@ -4656,7 +4717,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)

if ((ctxt->dst.type == OP_MEM) && !(ctxt->d & Mov)) {
/* optimisation - avoid slow emulated read if Mov */
- rc = prepare_memory_operand(ctxt, &ctxt->dst);
+ rc = prepare_memory_operand(ctxt, &ctxt->dst,
+ !(ctxt->d & NoWrite));
if (rc != X86EMUL_CONTINUE)
goto done;
}
--
1.8.3.1

2014-06-09 12:59:34

by Paolo Bonzini

Subject: [PATCH 05/25] KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())"

There are several checks for "peculiar" aspects of instructions in both
x86_decode_insn and x86_emulate_insn. Group them together, and guard
them with a single "if" that lets the processor quickly skip them all.
Make this more effective by adding two more flag bits that say whether the
.intercept and .check_perm fields are valid. We will reuse these
flags later to avoid initializing fields of the emulate_ctxt struct.

This shaves about 30 cycles off each emulated instruction, which is
approximately a 3% improvement.
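
The same pattern in miniature, with made-up flags (illustration only,
not kernel code): one cheap, well-predicted test on the union of all
the rare bits lets the processor skip every individual check in the
common case.

    #include <stdio.h>

    #define FLAG_A (1u << 0)    /* rare */
    #define FLAG_B (1u << 1)    /* rare */
    #define FLAG_C (1u << 2)    /* rare */

    static void handle(unsigned flags)
    {
            /* in the kernel the outer test is also annotated unlikely() */
            if (flags & (FLAG_A | FLAG_B | FLAG_C)) {
                    if (flags & FLAG_A)
                            puts("rare case A");
                    if (flags & FLAG_B)
                            puts("rare case B");
                    if (flags & FLAG_C)
                            puts("rare case C");
            }
    }

    int main(void)
    {
            handle(0);          /* common case: one branch, no other work */
            handle(FLAG_B);     /* rare case: the fine-grained checks run */
            return 0;
    }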

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 175 ++++++++++++++++++++++++++-----------------------
1 file changed, 94 insertions(+), 81 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 63ba8bd82a36..4c2ae824e89e 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -162,6 +162,8 @@
#define NoWrite ((u64)1 << 45) /* No writeback */
#define SrcWrite ((u64)1 << 46) /* Write back src operand */
#define NoMod ((u64)1 << 47) /* Mod field is ignored */
+#define Intercept ((u64)1 << 48) /* Has valid intercept field */
+#define CheckPerm ((u64)1 << 49) /* Has valid check_perm field */

#define DstXacc (DstAccLo | SrcAccHi | SrcWrite)

@@ -3539,9 +3541,9 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
}

#define D(_y) { .flags = (_y) }
-#define DI(_y, _i) { .flags = (_y), .intercept = x86_intercept_##_i }
-#define DIP(_y, _i, _p) { .flags = (_y), .intercept = x86_intercept_##_i, \
- .check_perm = (_p) }
+#define DI(_y, _i) { .flags = (_y)|Intercept, .intercept = x86_intercept_##_i }
+#define DIP(_y, _i, _p) { .flags = (_y)|Intercept|CheckPerm, \
+ .intercept = x86_intercept_##_i, .check_perm = (_p) }
#define N D(NotImpl)
#define EXT(_f, _e) { .flags = ((_f) | RMExt), .u.group = (_e) }
#define G(_f, _g) { .flags = ((_f) | Group | ModRM), .u.group = (_g) }
@@ -3550,10 +3552,10 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
#define I(_f, _e) { .flags = (_f), .u.execute = (_e) }
#define F(_f, _e) { .flags = (_f) | Fastop, .u.fastop = (_e) }
#define II(_f, _e, _i) \
- { .flags = (_f), .u.execute = (_e), .intercept = x86_intercept_##_i }
+ { .flags = (_f)|Intercept, .u.execute = (_e), .intercept = x86_intercept_##_i }
#define IIP(_f, _e, _i, _p) \
- { .flags = (_f), .u.execute = (_e), .intercept = x86_intercept_##_i, \
- .check_perm = (_p) }
+ { .flags = (_f)|Intercept|CheckPerm, .u.execute = (_e), \
+ .intercept = x86_intercept_##_i, .check_perm = (_p) }
#define GP(_f, _g) { .flags = ((_f) | Prefix), .u.gprefix = (_g) }

#define D2bv(_f) D((_f) | ByteOp), D(_f)
@@ -4386,29 +4388,37 @@ done_prefixes:
return EMULATION_FAILED;

ctxt->execute = opcode.u.execute;
- ctxt->check_perm = opcode.check_perm;
- ctxt->intercept = opcode.intercept;

- if (ctxt->d & NotImpl)
- return EMULATION_FAILED;
+ if (unlikely(ctxt->d &
+ (NotImpl|EmulateOnUD|Stack|Op3264|Sse|Mmx|Intercept|CheckPerm))) {
+ /*
+ * These are copied unconditionally here, and checked unconditionally
+ * in x86_emulate_insn.
+ */
+ ctxt->check_perm = opcode.check_perm;
+ ctxt->intercept = opcode.intercept;

- if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
- return EMULATION_FAILED;
+ if (ctxt->d & NotImpl)
+ return EMULATION_FAILED;

- if (mode == X86EMUL_MODE_PROT64 && (ctxt->d & Stack))
- ctxt->op_bytes = 8;
+ if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
+ return EMULATION_FAILED;

- if (ctxt->d & Op3264) {
- if (mode == X86EMUL_MODE_PROT64)
+ if (mode == X86EMUL_MODE_PROT64 && (ctxt->d & Stack))
ctxt->op_bytes = 8;
- else
- ctxt->op_bytes = 4;
- }

- if (ctxt->d & Sse)
- ctxt->op_bytes = 16;
- else if (ctxt->d & Mmx)
- ctxt->op_bytes = 8;
+ if (ctxt->d & Op3264) {
+ if (mode == X86EMUL_MODE_PROT64)
+ ctxt->op_bytes = 8;
+ else
+ ctxt->op_bytes = 4;
+ }
+
+ if (ctxt->d & Sse)
+ ctxt->op_bytes = 16;
+ else if (ctxt->d & Mmx)
+ ctxt->op_bytes = 8;
+ }

/* ModRM and SIB bytes. */
if (ctxt->d & ModRM) {
@@ -4542,75 +4552,78 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
goto done;
}

- if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) ||
- (ctxt->d & Undefined)) {
- rc = emulate_ud(ctxt);
- goto done;
- }
-
- if (((ctxt->d & (Sse|Mmx)) && ((ops->get_cr(ctxt, 0) & X86_CR0_EM)))
- || ((ctxt->d & Sse) && !(ops->get_cr(ctxt, 4) & X86_CR4_OSFXSR))) {
- rc = emulate_ud(ctxt);
- goto done;
- }
-
- if ((ctxt->d & (Sse|Mmx)) && (ops->get_cr(ctxt, 0) & X86_CR0_TS)) {
- rc = emulate_nm(ctxt);
- goto done;
- }
+ if (unlikely(ctxt->d &
+ (No64|Undefined|Sse|Mmx|Intercept|CheckPerm|Priv|Prot|String))) {
+ if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) ||
+ (ctxt->d & Undefined)) {
+ rc = emulate_ud(ctxt);
+ goto done;
+ }

- if (ctxt->d & Mmx) {
- rc = flush_pending_x87_faults(ctxt);
- if (rc != X86EMUL_CONTINUE)
+ if (((ctxt->d & (Sse|Mmx)) && ((ops->get_cr(ctxt, 0) & X86_CR0_EM)))
+ || ((ctxt->d & Sse) && !(ops->get_cr(ctxt, 4) & X86_CR4_OSFXSR))) {
+ rc = emulate_ud(ctxt);
goto done;
- /*
- * Now that we know the fpu is exception safe, we can fetch
- * operands from it.
- */
- fetch_possible_mmx_operand(ctxt, &ctxt->src);
- fetch_possible_mmx_operand(ctxt, &ctxt->src2);
- if (!(ctxt->d & Mov))
- fetch_possible_mmx_operand(ctxt, &ctxt->dst);
- }
+ }

- if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
- rc = emulator_check_intercept(ctxt, ctxt->intercept,
- X86_ICPT_PRE_EXCEPT);
- if (rc != X86EMUL_CONTINUE)
+ if ((ctxt->d & (Sse|Mmx)) && (ops->get_cr(ctxt, 0) & X86_CR0_TS)) {
+ rc = emulate_nm(ctxt);
goto done;
- }
+ }

- /* Privileged instruction can be executed only in CPL=0 */
- if ((ctxt->d & Priv) && ops->cpl(ctxt)) {
- rc = emulate_gp(ctxt, 0);
- goto done;
- }
+ if (ctxt->d & Mmx) {
+ rc = flush_pending_x87_faults(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ /*
+ * Now that we know the fpu is exception safe, we can fetch
+ * operands from it.
+ */
+ fetch_possible_mmx_operand(ctxt, &ctxt->src);
+ fetch_possible_mmx_operand(ctxt, &ctxt->src2);
+ if (!(ctxt->d & Mov))
+ fetch_possible_mmx_operand(ctxt, &ctxt->dst);
+ }

- /* Instruction can only be executed in protected mode */
- if ((ctxt->d & Prot) && ctxt->mode < X86EMUL_MODE_PROT16) {
- rc = emulate_ud(ctxt);
- goto done;
- }
+ if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
+ rc = emulator_check_intercept(ctxt, ctxt->intercept,
+ X86_ICPT_PRE_EXCEPT);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ }

- /* Do instruction specific permission checks */
- if (ctxt->check_perm) {
- rc = ctxt->check_perm(ctxt);
- if (rc != X86EMUL_CONTINUE)
+ /* Privileged instruction can be executed only in CPL=0 */
+ if ((ctxt->d & Priv) && ops->cpl(ctxt)) {
+ rc = emulate_gp(ctxt, 0);
goto done;
- }
+ }

- if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
- rc = emulator_check_intercept(ctxt, ctxt->intercept,
- X86_ICPT_POST_EXCEPT);
- if (rc != X86EMUL_CONTINUE)
+ /* Instruction can only be executed in protected mode */
+ if ((ctxt->d & Prot) && ctxt->mode < X86EMUL_MODE_PROT16) {
+ rc = emulate_ud(ctxt);
goto done;
- }
+ }

- if (ctxt->rep_prefix && (ctxt->d & String)) {
- /* All REP prefixes have the same first termination condition */
- if (address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) == 0) {
- ctxt->eip = ctxt->_eip;
- goto done;
+ /* Do instruction specific permission checks */
+ if (ctxt->check_perm) {
+ rc = ctxt->check_perm(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ }
+
+ if (unlikely(ctxt->guest_mode) && ctxt->intercept) {
+ rc = emulator_check_intercept(ctxt, ctxt->intercept,
+ X86_ICPT_POST_EXCEPT);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ }
+
+ if (ctxt->rep_prefix && (ctxt->d & String)) {
+ /* All REP prefixes have the same first termination condition */
+ if (address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) == 0) {
+ ctxt->eip = ctxt->_eip;
+ goto done;
+ }
}
}

--
1.8.3.1

2014-06-09 12:59:31

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 06/25] KVM: emulate: speed up emulated moves

We can just blindly move all 16 bytes of ctxt->src's value to ctxt->dst.
write_register_operand will take care of writing only the lower bytes.

Avoiding a call to memcpy (the compiler optimizes it out when the size
is constant) gains about 200 cycles on kvm-unit-tests for
register-to-register moves, and makes them about as fast as arithmetic
instructions.

We could perhaps get a larger speedup by moving all instructions _except_
moves out of x86_emulate_insn, removing opcode_len, and replacing the
switch statement with an inlined em_mov.
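
To illustrate why the fixed-size copy helps (a standalone sketch, not the
kernel code; the struct and function names are made up), a memcpy whose
length is a compile-time constant is typically expanded inline by the
compiler, for example into two 8-byte moves, while a variable length
usually becomes a real call:

  #include <string.h>

  struct operand_sketch {
          char valptr[16];
  };

  /* variable length: usually compiled as a call to memcpy() */
  void copy_var(struct operand_sketch *dst, const struct operand_sketch *src,
                unsigned int len)
  {
          memcpy(dst->valptr, src->valptr, len);
  }

  /* constant length: usually expanded inline, no call */
  void copy_fixed(struct operand_sketch *dst, const struct operand_sketch *src)
  {
          memcpy(dst->valptr, src->valptr, sizeof(src->valptr));
  }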

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 2 +-
arch/x86/kvm/emulate.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index ffa2671a7f2f..46725e8aa8cb 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -232,7 +232,7 @@ struct operand {
union {
unsigned long val;
u64 val64;
- char valptr[sizeof(unsigned long) + 2];
+ char valptr[sizeof(sse128_t)];
sse128_t vec_val;
u64 mm_val;
void *data;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 4c2ae824e89e..5f902d40740c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2983,7 +2983,7 @@ static int em_rdpmc(struct x86_emulate_ctxt *ctxt)

static int em_mov(struct x86_emulate_ctxt *ctxt)
{
- memcpy(ctxt->dst.valptr, ctxt->src.valptr, ctxt->op_bytes);
+ memcpy(ctxt->dst.valptr, ctxt->src.valptr, sizeof(ctxt->src.valptr));
return X86EMUL_CONTINUE;
}

--
1.8.3.1

2014-06-09 13:03:54

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 09/25] KVM: export mark_page_dirty_in_slot

Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 12 +++++-------
2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 970c68197c69..5be9805b0aeb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -583,6 +583,7 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);

void kvm_vcpu_block(struct kvm_vcpu *vcpu);
void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c86be0f983db..2f9bc20ae2a7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -99,8 +99,6 @@ static void update_memslots(struct kvm_memslots *slots,
struct kvm_memory_slot *new, u64 last_generation);

static void kvm_release_pfn_dirty(pfn_t pfn);
-static void mark_page_dirty_in_slot(struct kvm *kvm,
- struct kvm_memory_slot *memslot, gfn_t gfn);

__visible bool kvm_rebooting;
EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -1585,7 +1583,7 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
r = __copy_to_user((void __user *)ghc->hva, data, len);
if (r)
return -EFAULT;
- mark_page_dirty_in_slot(kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+ mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT);

return 0;
}
@@ -1643,9 +1641,8 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
}
EXPORT_SYMBOL_GPL(kvm_clear_guest);

-static void mark_page_dirty_in_slot(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
+ gfn_t gfn)
{
if (memslot && memslot->dirty_bitmap) {
unsigned long rel_gfn = gfn - memslot->base_gfn;
@@ -1653,13 +1650,14 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
set_bit_le(rel_gfn, memslot->dirty_bitmap);
}
}
+EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);

void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
{
struct kvm_memory_slot *memslot;

memslot = gfn_to_memslot(kvm, gfn);
- mark_page_dirty_in_slot(kvm, memslot, gfn);
+ mark_page_dirty_in_slot(memslot, gfn);
}
EXPORT_SYMBOL_GPL(mark_page_dirty);

--
1.8.3.1

2014-06-09 13:04:12

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 08/25] KVM: emulate: abstract handling of memory operands

Abstract the pre-execution processing and writeback of memory operands
into new functions. We will soon do some work before execution even for
the move destination, so call the function in that case too; but not for
the memory operand of lea, invlpg, etc.

Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 43 ++++++++++++++++++++++++++++---------------
1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a1daf52fae58..7e9dc2d6fd44 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1564,6 +1564,29 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
return __load_segment_descriptor(ctxt, selector, seg, cpl, false);
}

+static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt,
+ struct operand *op)
+{
+ return segmented_read(ctxt, op->addr.mem, &op->val, op->bytes);
+}
+
+static int cmpxchg_memory_operand(struct x86_emulate_ctxt *ctxt,
+ struct operand *op)
+{
+ return segmented_cmpxchg(ctxt, op->addr.mem,
+ &op->orig_val,
+ &op->val,
+ op->bytes);
+}
+
+static int write_memory_operand(struct x86_emulate_ctxt *ctxt,
+ struct operand *op)
+{
+ return segmented_write(ctxt, op->addr.mem,
+ &op->val,
+ op->bytes);
+}
+
static void write_register_operand(struct operand *op)
{
/* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
@@ -1591,16 +1614,9 @@ static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op)
break;
case OP_MEM:
if (ctxt->lock_prefix)
- return segmented_cmpxchg(ctxt,
- op->addr.mem,
- &op->orig_val,
- &op->val,
- op->bytes);
+ return cmpxchg_memory_operand(ctxt, op);
else
- return segmented_write(ctxt,
- op->addr.mem,
- &op->val,
- op->bytes);
+ return write_memory_operand(ctxt, op);
break;
case OP_MEM_STR:
return segmented_write(ctxt,
@@ -4622,16 +4638,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
}

if ((ctxt->src.type == OP_MEM) && !(ctxt->d & NoAccess)) {
- rc = segmented_read(ctxt, ctxt->src.addr.mem,
- ctxt->src.valptr, ctxt->src.bytes);
+ rc = prepare_memory_operand(ctxt, &ctxt->src);
if (rc != X86EMUL_CONTINUE)
goto done;
ctxt->src.orig_val64 = ctxt->src.val64;
}

if (ctxt->src2.type == OP_MEM) {
- rc = segmented_read(ctxt, ctxt->src2.addr.mem,
- &ctxt->src2.val, ctxt->src2.bytes);
+ rc = prepare_memory_operand(ctxt, &ctxt->src2);
if (rc != X86EMUL_CONTINUE)
goto done;
}
@@ -4642,8 +4656,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)

if ((ctxt->dst.type == OP_MEM) && !(ctxt->d & Mov)) {
/* optimisation - avoid slow emulated read if Mov */
- rc = segmented_read(ctxt, ctxt->dst.addr.mem,
- &ctxt->dst.val, ctxt->dst.bytes);
+ rc = prepare_memory_operand(ctxt, &ctxt->dst);
if (rc != X86EMUL_CONTINUE)
goto done;
}
--
1.8.3.1

2014-06-09 13:04:52

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 07/25] KVM: emulate: simplify writeback

The "if/return" checks are useless, because we return X86EMUL_CONTINUE
anyway if we do not return.
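
A toy model of the change (do_write() and the return values are
placeholders, with 0 standing in for X86EMUL_CONTINUE), showing that the
two forms behave the same:

  /* do_write() stands in for segmented_write()/segmented_cmpxchg() */
  static int do_write(void) { return 0; }

  static int writeback_before(void)
  {
          int rc = do_write();
          if (rc != 0)
                  return rc;
          return 0;       /* the common "return X86EMUL_CONTINUE" path */
  }

  static int writeback_after(void)
  {
          return do_write();      /* same behaviour, one return path */
  }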

Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 28 +++++++++++-----------------
1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 5f902d40740c..a1daf52fae58 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1585,34 +1585,28 @@ static void write_register_operand(struct operand *op)

static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op)
{
- int rc;
-
switch (op->type) {
case OP_REG:
write_register_operand(op);
break;
case OP_MEM:
if (ctxt->lock_prefix)
- rc = segmented_cmpxchg(ctxt,
+ return segmented_cmpxchg(ctxt,
+ op->addr.mem,
+ &op->orig_val,
+ &op->val,
+ op->bytes);
+ else
+ return segmented_write(ctxt,
op->addr.mem,
- &op->orig_val,
&op->val,
op->bytes);
- else
- rc = segmented_write(ctxt,
- op->addr.mem,
- &op->val,
- op->bytes);
- if (rc != X86EMUL_CONTINUE)
- return rc;
break;
case OP_MEM_STR:
- rc = segmented_write(ctxt,
- op->addr.mem,
- op->data,
- op->bytes * op->count);
- if (rc != X86EMUL_CONTINUE)
- return rc;
+ return segmented_write(ctxt,
+ op->addr.mem,
+ op->data,
+ op->bytes * op->count);
break;
case OP_XMM:
write_sse_reg(ctxt, &op->vec_val, op->addr.xmm);
--
1.8.3.1

2014-06-09 13:05:12

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 04/25] KVM: emulate: move around some checks

The only purpose of this patch is to make the next patch simpler
to review. No semantic change.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/emulate.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index bc670675223d..63ba8bd82a36 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4381,12 +4381,15 @@ done_prefixes:
ctxt->d |= opcode.flags;
}

+ /* Unrecognised? */
+ if (ctxt->d == 0)
+ return EMULATION_FAILED;
+
ctxt->execute = opcode.u.execute;
ctxt->check_perm = opcode.check_perm;
ctxt->intercept = opcode.intercept;

- /* Unrecognised? */
- if (ctxt->d == 0 || (ctxt->d & NotImpl))
+ if (ctxt->d & NotImpl)
return EMULATION_FAILED;

if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
@@ -4528,19 +4531,19 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)

ctxt->mem_read.pos = 0;

- if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) ||
- (ctxt->d & Undefined)) {
+ /* LOCK prefix is allowed only with some instructions */
+ if (ctxt->lock_prefix && (!(ctxt->d & Lock) || ctxt->dst.type != OP_MEM)) {
rc = emulate_ud(ctxt);
goto done;
}

- /* LOCK prefix is allowed only with some instructions */
- if (ctxt->lock_prefix && (!(ctxt->d & Lock) || ctxt->dst.type != OP_MEM)) {
+ if ((ctxt->d & SrcMask) == SrcMemFAddr && ctxt->src.type != OP_MEM) {
rc = emulate_ud(ctxt);
goto done;
}

- if ((ctxt->d & SrcMask) == SrcMemFAddr && ctxt->src.type != OP_MEM) {
+ if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) ||
+ (ctxt->d & Undefined)) {
rc = emulate_ud(ctxt);
goto done;
}
--
1.8.3.1

2014-06-09 13:05:34

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 03/25] KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation

Despite the provisions to emulate up to 130 consecutive instructions, in
practice KVM will emulate just one before exiting handle_invalid_guest_state,
because x86_emulate_instruction always sets KVM_REQ_EVENT.

However, we only need to do this if an interrupt could be injected,
which happens a) if an interrupt shadow bit (STI or MOV SS) has gone
away; b) if the interrupt flag has just been set (instructions other
than STI can set it without also setting an interrupt shadow).

This cuts another 700-900 cycles from the cost of emulating an
instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles
before the patch on kvm-unit-tests, 925-1700 afterwards).
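
Case b) reduces to a bit test on the old and new flag values; a minimal,
self-contained sketch of that check (the helper name is invented, the
expression mirrors the one in the patch):

  #include <stdbool.h>

  #define EFLAGS_IF (1UL << 9)    /* interrupt flag, bit 9 of EFLAGS */

  /* true if IF is set in the new flags but was clear in the old ones */
  static inline bool if_just_set(unsigned long old_rflags,
                                 unsigned long new_eflags)
  {
          return (new_eflags & ~old_rflags) & EFLAGS_IF;
  }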

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/x86.c | 36 ++++++++++++++++++++++++++----------
1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 70faa1089b75..e64b8b5cb6cd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -87,6 +87,7 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);

static void update_cr8_intercept(struct kvm_vcpu *vcpu);
static void process_nmi(struct kvm_vcpu *vcpu);
+static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);

struct kvm_x86_ops *kvm_x86_ops;
EXPORT_SYMBOL_GPL(kvm_x86_ops);
@@ -4865,8 +4866,11 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
*/
if (int_shadow & mask)
mask = 0;
- if (unlikely(int_shadow || mask))
+ if (unlikely(int_shadow || mask)) {
kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
+ if (!mask)
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+ }
}

static void inject_emulated_exception(struct kvm_vcpu *vcpu)
@@ -5092,20 +5096,18 @@ static int kvm_vcpu_check_hw_bp(unsigned long addr, u32 type, u32 dr7,
return dr6;
}

-static void kvm_vcpu_check_singlestep(struct kvm_vcpu *vcpu, int *r)
+static void kvm_vcpu_check_singlestep(struct kvm_vcpu *vcpu, unsigned long rflags, int *r)
{
struct kvm_run *kvm_run = vcpu->run;

/*
- * Use the "raw" value to see if TF was passed to the processor.
- * Note that the new value of the flags has not been saved yet.
+ * rflags is the old, "raw" value of the flags. The new value has
+ * not been saved yet.
*
* This is correct even for TF set by the guest, because "the
* processor will not generate this exception after the instruction
* that sets the TF flag".
*/
- unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
-
if (unlikely(rflags & X86_EFLAGS_TF)) {
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
kvm_run->debug.arch.dr6 = DR6_BS | DR6_FIXED_1;
@@ -5272,13 +5274,22 @@ restart:
r = EMULATE_DONE;

if (writeback) {
+ unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
toggle_interruptibility(vcpu, ctxt->interruptibility);
- kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
kvm_rip_write(vcpu, ctxt->eip);
if (r == EMULATE_DONE)
- kvm_vcpu_check_singlestep(vcpu, &r);
- kvm_set_rflags(vcpu, ctxt->eflags);
+ kvm_vcpu_check_singlestep(vcpu, rflags, &r);
+ __kvm_set_rflags(vcpu, ctxt->eflags);
+
+ /*
+ * For STI, interrupts are shadowed; so KVM_REQ_EVENT will
+ * do nothing, and it will be requested again as soon as
+ * the shadow expires. But we still need to check here,
+ * because POPF has no interrupt shadow.
+ */
+ if (unlikely((ctxt->eflags & ~rflags) & X86_EFLAGS_IF))
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
} else
vcpu->arch.emulate_regs_need_sync_to_vcpu = true;

@@ -7400,12 +7411,17 @@ unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_get_rflags);

-void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
{
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP &&
kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip))
rflags |= X86_EFLAGS_TF;
kvm_x86_ops->set_rflags(vcpu, rflags);
+}
+
+void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+ __kvm_set_rflags(vcpu, rflags);
kvm_make_request(KVM_REQ_EVENT, vcpu);
}
EXPORT_SYMBOL_GPL(kvm_set_rflags);
--
1.8.3.1

2014-06-09 13:06:01

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 01/25] KVM: vmx: speed up emulation of invalid guest state

About 25% of the time spent in emulation of invalid guest state
is wasted in checking whether emulation is required for the next
instruction. However, this almost never changes except when a
segment register (or TR or LDTR) changes, or when there is a mode
transition (i.e. CR0 changes).

In fact, vmx_set_segment and vmx_set_cr0 already modify
vmx->emulation_required (except that the former for some reason
uses |= instead of just an assignment). So there is no need to
call guest_state_valid in the emulation loop.

Emulation performance test results indicate 1650-2600 cycles
for common instructions, versus 2300-3200 before this patch on
a Sandy Bridge Xeon.
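
Schematically (pseudocode drawn from the hunks below, not a standalone
function; emulate_one() is only a placeholder for the body of the loop):

  /* before: revalidate the whole guest state on every emulated instruction */
  while (!guest_state_valid(vcpu) && count-- != 0)
          emulate_one(vcpu);

  /* after: vmx_set_segment()/vmx_set_cr0() keep the flag up to date,
   * so the loop only reads a cached boolean */
  while (vmx->emulation_required && count-- != 0)
          emulate_one(vcpu);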

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 801332edefc3..a2ae11d162fe 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3653,7 +3653,7 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu,
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));

out:
- vmx->emulation_required |= emulation_required(vcpu);
+ vmx->emulation_required = emulation_required(vcpu);
}

static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
@@ -5621,7 +5621,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
intr_window_requested = cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING;

- while (!guest_state_valid(vcpu) && count-- != 0) {
+ while (vmx->emulation_required && count-- != 0) {
if (intr_window_requested && vmx_interrupt_allowed(vcpu))
return handle_interrupt_window(&vmx->vcpu);

@@ -5655,7 +5655,6 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
schedule();
}

- vmx->emulation_required = emulation_required(vcpu);
out:
return ret;
}
--
1.8.3.1

2014-06-09 18:40:27

by Bandan Das

[permalink] [raw]
Subject: Re: [PATCH 12/25] KVM: emulate: extend memory access optimization to stores

Paolo Bonzini <[email protected]> writes:

> Even on a store the optimization saves about 50 clock cycles,
> mostly because the jump in write_memory_operand becomes much more
> predictable.
>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
Isn't the reviewed-by automatically implied? :)

> ---
> arch/x86/kvm/emulate.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 594cb560947c..eaf0853ffaf9 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -1589,7 +1589,7 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
>
> static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt,
> struct operand *op,
> - bool write)
> + bool read, bool write)
> {
> int rc;
> unsigned long gva;
> @@ -1605,6 +1605,10 @@ static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt,
> if (rc != X86EMUL_CONTINUE)
> return rc;
>
> + /* optimisation - avoid slow emulated read if Mov */
> + if (!read)
> + return X86EMUL_CONTINUE;
> +
> if (likely(!kvm_is_error_hva(op->hva))) {
> rc = read_from_user(ctxt, op->hva, &op->val, size);
> if (!write)
> @@ -4699,14 +4703,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
> }
>
> if ((ctxt->src.type == OP_MEM) && !(ctxt->d & NoAccess)) {
> - rc = prepare_memory_operand(ctxt, &ctxt->src, false);
> + rc = prepare_memory_operand(ctxt, &ctxt->src, true, false);
> if (rc != X86EMUL_CONTINUE)
> goto done;
> ctxt->src.orig_val64 = ctxt->src.val64;
> }
>
> if (ctxt->src2.type == OP_MEM) {
> - rc = prepare_memory_operand(ctxt, &ctxt->src2, false);
> + rc = prepare_memory_operand(ctxt, &ctxt->src2, true, false);
> if (rc != X86EMUL_CONTINUE)
> goto done;
> }
> @@ -4715,9 +4719,9 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
> goto special_insn;
>
>
> - if ((ctxt->dst.type == OP_MEM) && !(ctxt->d & Mov)) {
> - /* optimisation - avoid slow emulated read if Mov */
> + if (ctxt->dst.type == OP_MEM) {
> rc = prepare_memory_operand(ctxt, &ctxt->dst,
> + !(ctxt->d & Mov),
> !(ctxt->d & NoWrite));
> if (rc != X86EMUL_CONTINUE)
> goto done;

2014-06-19 11:37:59

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 12/25] KVM: emulate: extend memory access optimization to stores

On 09/06/2014 20:40, Bandan Das wrote:
> Paolo Bonzini <[email protected]> writes:
>
>> > Even on a store the optimization saves about 50 clock cycles,
>> > mostly because the jump in write_memory_operand becomes much more
>> > predictable.
>> >
>> > Reviewed-by: Paolo Bonzini <[email protected]>
>> > Signed-off-by: Paolo Bonzini <[email protected]>
> Isn't the reviewed-by automatically implied ? :)
>

This patch breaks x86/vmx.flat, but apart from this the series passes
Autotest and kvm-unit-tests. I'm going to leave the memory access
optimization out, and push the rest to kvm/queue.

Paolo