2014-04-03 22:28:16

by Bandan Das

[permalink] [raw]
Subject: [RFC PATCH 0/2] Emulator speedups - avoid initializations where possible

While initializing emulation context structure, kvm memsets to 0 a
number of fields some of which are redundant since they get set eventually.
This patch attempts at avoiding some of them.

Here are some before/after numbers on a Haswell host
Note : The before numbers already include Paolo's RFC posted here :
http://comments.gmane.org/gmane.linux.kernel/1676101

Before:
667 cycles/emulated jump instruction
839 cycles/emulated move instruction
892 cycles/emulated arithmetic instruction
995 cycles/emulated memory load instruction
1026 cycles/emulated memory store instruction
1022 cycles/emulated memory RMW instruction

After:
639 cycles/emulated jump instruction
786 cycles/emulated move instruction
802 cycles/emulated arithmetic instruction
936 cycles/emulated memory load instruction
970 cycles/emulated memory store instruction
1000 cycles/emulated memory RMW instruction

Bandan Das (2):
KVM: emulate: move init_decode_cache to emulate.c
KVM: emulate: clean up initializations in init_decode_cache

arch/x86/include/asm/kvm_emulate.h | 17 +++++++------
arch/x86/kvm/emulate.c | 49 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 13 ----------
3 files changed, 59 insertions(+), 20 deletions(-)

--
1.8.3.1


2014-04-03 22:28:30

by Bandan Das

[permalink] [raw]
Subject: [RFC PATCH 1/2] KVM: emulate: move init_decode_cache to emulate.c

Core emulator functions all belong in emulator.c,
x86 should have no knowledge of emulator internals

Signed-off-by: Bandan Das <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 1 +
arch/x86/kvm/emulate.c | 13 +++++++++++++
arch/x86/kvm/x86.c | 13 -------------
3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index d085f73..ad4cca8 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -436,6 +436,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
#define EMULATION_OK 0
#define EMULATION_RESTART 1
#define EMULATION_INTERCEPTED 2
+void init_decode_cache(struct x86_emulate_ctxt *ctxt);
int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
u16 tss_selector, int idt_index, int reason,
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8ca292c..8e2b866 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4550,6 +4550,19 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *))
return X86EMUL_CONTINUE;
}

+void init_decode_cache(struct x86_emulate_ctxt *ctxt)
+{
+ memset(&ctxt->opcode_len, 0,
+ (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);
+
+ ctxt->fetch.start = 0;
+ ctxt->fetch.end = 0;
+ ctxt->io_read.pos = 0;
+ ctxt->io_read.end = 0;
+ ctxt->mem_read.pos = 0;
+ ctxt->mem_read.end = 0;
+}
+
int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
{
const struct x86_emulate_ops *ops = ctxt->ops;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4fad05d..122410d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4951,19 +4951,6 @@ static void inject_emulated_exception(struct kvm_vcpu *vcpu)
kvm_queue_exception(vcpu, ctxt->exception.vector);
}

-static void init_decode_cache(struct x86_emulate_ctxt *ctxt)
-{
- memset(&ctxt->opcode_len, 0,
- (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);
-
- ctxt->fetch.start = 0;
- ctxt->fetch.end = 0;
- ctxt->io_read.pos = 0;
- ctxt->io_read.end = 0;
- ctxt->mem_read.pos = 0;
- ctxt->mem_read.end = 0;
-}
-
static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
{
struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
--
1.8.3.1

2014-04-03 22:28:58

by Bandan Das

[permalink] [raw]
Subject: [RFC PATCH 2/2] KVM: emulate: clean up initializations in init_decode_cache

A lot of initializations are unnecessary as they get set to
appropriate values before actually being used. Remove some
of them and rework some others if the conditions that set
them are not true

Signed-off-by: Bandan Das <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 16 +++++++------
arch/x86/kvm/emulate.c | 46 +++++++++++++++++++++++++++++++++-----
2 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index ad4cca8..ccb7911 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -315,30 +315,32 @@ struct x86_emulate_ctxt {
u8 opcode_len;
u8 b;
u8 intercept;
- u8 lock_prefix;
- u8 rep_prefix;
u8 op_bytes;
u8 ad_bytes;
u8 rex_prefix;
struct operand src;
struct operand src2;
struct operand dst;
+ int (*execute)(struct x86_emulate_ctxt *ctxt);
+ int (*check_perm)(struct x86_emulate_ctxt *ctxt);
+ u8 lock_prefix;
+ u8 rep_prefix;
bool has_seg_override;
u8 seg_override;
u64 d;
- int (*execute)(struct x86_emulate_ctxt *ctxt);
- int (*check_perm)(struct x86_emulate_ctxt *ctxt);
+ bool rip_relative;
+ /* bitmaps of registers in _regs[] that can be read */
+ u32 regs_valid;
+ /* bitmaps of registers in _regs[] that have been written */
+ u32 regs_dirty;
/* modrm */
u8 modrm;
u8 modrm_mod;
u8 modrm_reg;
u8 modrm_rm;
u8 modrm_seg;
- bool rip_relative;
unsigned long _eip;
struct operand memop;
- u32 regs_valid; /* bitmaps of registers in _regs[] that can be read */
- u32 regs_dirty; /* bitmaps of registers in _regs[] that have been written */
/* Fields above regs are cleared together. */
unsigned long _regs[NR_VCPU_REGS];
struct operand *memopp;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8e2b866..eac488b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1072,6 +1072,9 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
ctxt->modrm_reg = (ctxt->rex_prefix & 4) << 1; /* REX.R */
index_reg = (ctxt->rex_prefix & 2) << 2; /* REX.X */
ctxt->modrm_rm = base_reg = (ctxt->rex_prefix & 1) << 3; /* REG.B */
+ } else {
+ ctxt->modrm_reg = 0;
+ ctxt->modrm_rm = 0;
}

ctxt->modrm_mod |= (ctxt->modrm & 0xc0) >> 6;
@@ -4357,6 +4360,8 @@ done_prefixes:

if (ctxt->d & ModRM)
ctxt->modrm = insn_fetch(u8, ctxt);
+ else
+ ctxt->modrm = 0;

while (ctxt->d & GroupMask) {
switch (ctxt->d & GroupMask) {
@@ -4435,10 +4440,14 @@ done_prefixes:
ctxt->op_bytes = 16;
else if (ctxt->d & Mmx)
ctxt->op_bytes = 8;
+ } else {
+ ctxt->intercept = 0;
+ ctxt->check_perm = NULL;
}

/* ModRM and SIB bytes. */
if (ctxt->d & ModRM) {
+ ctxt->modrm_mod = 0;
rc = decode_modrm(ctxt, &ctxt->memop);
if (!ctxt->has_seg_override)
set_seg_override(ctxt, ctxt->modrm_seg);
@@ -4552,14 +4561,41 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *))

void init_decode_cache(struct x86_emulate_ctxt *ctxt)
{
- memset(&ctxt->opcode_len, 0,
- (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);

- ctxt->fetch.start = 0;
- ctxt->fetch.end = 0;
+ /*
+ * Variables that don't require initializing to 0
+ * opcode_len - set in x86_decode_insn
+ * b - set in x86_decode_insn
+ * intercept - conditionally set in x86_decode_insn, added
+ * else set to 0
+ * op_bytes - initialized in x86_decode_insn
+ * ad_bytes - initialized in x86_decode_insn
+ * rex_prefix - conditionally set in x86_decode_isn
+ * struct operands src,src2,dst - set by calling decode_operand
+ * in x86_decode_insn,
+ * default.type = OP_NONE
+ * (*execute) - set in x86_decode_insn
+ * (*check_perm) - conditionally set in x86_decode_insn, added
+ * else set to 0
+ * d - set in x86_decode_insn
+ * modrm - conditionally set in x86_decode_insn, added else set to 0
+ * modrm_mod - or'ed in decode_modrm which is conditionally called in
+ * in x86_decode_insn, added initialization to 0 before call
+ * modrm_reg - set in decode_modrm or else decode_register_operand
+ * modrm_rm - set in decode_modrm, added else set to 0
+ * modrm_seg - set in decode_modrm
+ * _eip - set in x86_decode_insn
+ * memop - .type set to OP_NONE in x86_decode_insn
+ * ctxt->fetch.start - set in x86_decode_insn
+ * ctxt->fetch.end
+ * ctxt->mem_read.pos - set in x86_emulate_insn
+ */
+
+ memset(&ctxt->lock_prefix, 0,
+ (void *)&ctxt->modrm - (void *)&ctxt->lock_prefix);
+
ctxt->io_read.pos = 0;
ctxt->io_read.end = 0;
- ctxt->mem_read.pos = 0;
ctxt->mem_read.end = 0;
}

--
1.8.3.1

2014-04-04 09:47:39

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] KVM: emulate: clean up initializations in init_decode_cache

Hi Bandan.

> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 8e2b866..eac488b 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -1072,6 +1072,9 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
> ctxt->modrm_reg = (ctxt->rex_prefix & 4) << 1; /* REX.R */
> index_reg = (ctxt->rex_prefix & 2) << 2; /* REX.X */
> ctxt->modrm_rm = base_reg = (ctxt->rex_prefix & 1) << 3; /* REG.B */
> + } else {
> + ctxt->modrm_reg = 0;
> + ctxt->modrm_rm = 0;
> }

You can drop the if altogether here (and also the initialization
in "int index_reg = 0, base_reg = 0, scale;").

Also, getting down to micro-micro-optimization, the following will give
slightly better code:

ctxt->modrm_reg = (ctxt->rex_prefix << 1) & 8; /* REX.R */
index_reg = (ctxt->rex_prefix << 2) & 8; /* REX.X */
ctxt->modrm_rm = base_reg = (ctxt->rex_prefix << 3) & 8; /* REG.B */

because x86 can do a three-operand shift (using the LEA instruction), but
not three-operand AND.

> ctxt->modrm_mod |= (ctxt->modrm & 0xc0) >> 6;
> @@ -4357,6 +4360,8 @@ done_prefixes:
>
> if (ctxt->d & ModRM)
> ctxt->modrm = insn_fetch(u8, ctxt);
> + else
> + ctxt->modrm = 0;

I think in the "else" case modrm won't be read at all, but it is indeed a bit
safer this way. You can just leave it in the "zeroed" part of ctxt. There is
padding in the struct, so you get the initialization for free.


> while (ctxt->d & GroupMask) {
> switch (ctxt->d & GroupMask) {
> @@ -4435,10 +4440,14 @@ done_prefixes:
> ctxt->op_bytes = 16;
> else if (ctxt->d & Mmx)
> ctxt->op_bytes = 8;
> + } else {
> + ctxt->intercept = 0;

You can add a preparatory patch that checks (ctxt->d & Intercept) in
x86_emulate_insn instead of ctxt->intercept and skip this initialization.

> + ctxt->check_perm = NULL;

Similarly you can check (ctxt->d & CheckPerm) and skip this initialization.

> }

Same here: the common case will zero them, might as well leave it to the memset.

>
> /* ModRM and SIB bytes. */
> if (ctxt->d & ModRM) {
> + ctxt->modrm_mod = 0;

What about instead changing

ctxt->modrm_mod |= (ctxt->modrm & 0xc0) >> 6;

from "|=" to "="?

> rc = decode_modrm(ctxt, &ctxt->memop);
> if (!ctxt->has_seg_override)
> set_seg_override(ctxt, ctxt->modrm_seg);
> @@ -4552,14 +4561,41 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *))
>
> void init_decode_cache(struct x86_emulate_ctxt *ctxt)
> {
> - memset(&ctxt->opcode_len, 0,
> - (void *)&ctxt->_regs - (void *)&ctxt->opcode_len);
>
> - ctxt->fetch.start = 0;
> - ctxt->fetch.end = 0;
> + /*
> + * Variables that don't require initializing to 0
> + * opcode_len - set in x86_decode_insn
> + * b - set in x86_decode_insn
> + * intercept - conditionally set in x86_decode_insn, added
> + * else set to 0
> + * op_bytes - initialized in x86_decode_insn
> + * ad_bytes - initialized in x86_decode_insn
> + * rex_prefix - conditionally set in x86_decode_isn

I think rex_prefix must be in the zeroed area. Again, with careful field
placement you get that for free.

> + * struct operands src,src2,dst - set by calling decode_operand
> + * in x86_decode_insn,
> + * default.type = OP_NONE
> + * (*execute) - set in x86_decode_insn
> + * (*check_perm) - conditionally set in x86_decode_insn, added
> + * else set to 0
> + * d - set in x86_decode_insn
> + * modrm - conditionally set in x86_decode_insn, added else set to 0
> + * modrm_mod - or'ed in decode_modrm which is conditionally called in
> + * in x86_decode_insn, added initialization to 0 before call
> + * modrm_reg - set in decode_modrm or else decode_register_operand
> + * modrm_rm - set in decode_modrm, added else set to 0
> + * modrm_seg - set in decode_modrm
> + * _eip - set in x86_decode_insn
> + * memop - .type set to OP_NONE in x86_decode_insn
> + * ctxt->fetch.start - set in x86_decode_insn
> + * ctxt->fetch.end
> + * ctxt->mem_read.pos - set in x86_emulate_insn

Apart from the above suggestions, I agree with this patch.

Also, memopp is initialized to NULL and can be moved to the zeroed area.
But perhaps we can remove memopp, see below.

You now have:

+ u8 lock_prefix;
+ u8 rep_prefix;
bool has_seg_override;
u8 seg_override;
u64 d;
+ bool rip_relative;
+ /* bitmaps of registers in _regs[] that can be read */
+ u32 regs_valid;
+ /* bitmaps of registers in _regs[] that have been written */
+ u32 regs_dirty;

Even if we add back a couple of fields, that's pretty good! It's 32
bytes (four stores). rip_relative introduces a lot of padding; move
it together with other u8 and bool fields, and that makes 24.
And ctxt->d is in the zeroed area by mistake, which makes it 16.

There is a little more that you can do in this field.

seg_override can be moved out of the zeroed area, and the seg_override()
calls can be changed to direct accesses of the field. Please double check
though. :) Actually, if the above is true, has_seg_override can also be
eliminated and moved to x86_decode_insn as a local variable.

We could also get rid completely of rip_relative and move it to a local
variable in decode_modrm. Currently it is in x86_decode_insn because
of this earlier check:

if (ctxt->memop.type == OP_MEM && ctxt->ad_bytes != 8)
ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea;

...

if (ctxt->memopp && ctxt->memopp->type == OP_MEM && ctxt->rip_relative)
ctxt->memopp->addr.mem.ea += ctxt->_eip;


but I think the earlier check can be moved to decode_modrm as well.
This is my reasoning:

- the first "if" can only trigger for modrm (decode_abs will not set
mem.ea to an out-of-range value if ctxt->ad_bytes != 8)

- ctxt->memopp != NULL means that *ctxt->memopp was copied from ctxt->memop
(so it must be either ModRM or MemAbs), but rip_relative is set to 1
only if the instruction is ModRM, never for MemAbs.

And once you do this, memopp is never read anymore and can be dropped.


This gives the following series:

- patch 1 from here

- replace ->intercept and ->check_perm checks with ctxt->d checks

- cleanups and code changes from this patch, except struct reorganization and
memset change

- struct reorganization and memset change; instead of the large comment in
init_decode_cache, please add comments in front of each field or group of
fields

- removing has_seg_override and moving seg_override out of the zeroed area

- removing memopp and rip_relative

(BTW, I found a bug in my second series while studying your patch. It
ignores the mem_read cache in prepare_memory_operand, but it shouldn't).

Paolo