2009-11-16 12:19:42

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 00/42] KVM updates for the 2.6.33 merge window (batch 1/2)

Highlights:
- improved kernel context switching speed
- better interoperation with other users of virtualization extensions
- improved irq scaling
- nested svm improvements and tracing
- improved cpufreq integration
- spin loop detection on newer hardware

Notes:
- kvm/ppc64 support will be merged through the powerpc tree
- depends on tip x86/entry branch (user return notifiers)

Alexander Graf (2):
KVM: Activate Virtualization On Demand
KVM: SVM: Notify nested hypervisor of lost event injections

Avi Kivity (6):
core, x86: Add user return notifiers
x86: Fix user return notifier build
KVM: Don't wrap schedule() with vcpu_put()/vcpu_load()
KVM: Don't pass kvm_run arguments
KVM: Return -ENOTTY on unrecognized ioctls
KVM: Move assigned device code to own file

Glauber Costa (1):
KVM: x86: include pvclock MSRs in msrs_to_save

Gleb Natapov (9):
KVM: Call pic_clear_isr() on pic reset to reuse logic there
KVM: Move irq sharing information to irqchip level
KVM: Change irq routing table to use gsi indexed array
KVM: Maintain back mapping from irqchip/pin to gsi
KVM: Move irq routing data structure to rcu locking
KVM: Move irq ack notifier list to arch independent code
KVM: Convert irq notifiers lists to RCU locking
KVM: Move IO APIC to its own lock
KVM: Drop kvm->irq_lock lock from irq injection path

Huang Weiyi (1):
KVM: remove duplicated #include

Jan Kiszka (2):
KVM: x86: Refactor guest debug IOCTL handling
KVM: x86: Rework guest single-step flag injection and filtering

Jiri Slaby (1):
KVM: fix lock imbalance in kvm_*_irq_source_id()

Joerg Roedel (7):
KVM: SVM: reorganize svm_interrupt_allowed
KVM: SVM: don't copy exit_int_info on nested vmrun
KVM: SVM: Remove remaining occurences of rdtscll
KVM: SVM: Move INTR vmexit out of atomic code
KVM: SVM: Add tracepoint for nested vmrun
KVM: SVM: Add tracepoint for nested #vmexit
KVM: SVM: Add tracepoint for injected #vmexit

Juan Quintela (1):
KVM: remove pre_task_link setting in save_state_to_tss16

Marcelo Tosatti (2):
KVM: SVM: remove needless mmap_sem acquision from nested_svm_map
KVM: x86: disable paravirt mmu reporting

Mohammed Gamal (5):
KVM: x86 emulator: Add 'push/pop sreg' instructions
KVM: x86 emulator: Introduce No64 decode option
KVM: x86 emulator: Add missing decoder flags for 'or' instructions
KVM: x86 emulator: Add pusha and popa instructions
KVM: VMX: Enhance invalid guest state emulation

Stephen Rothwell (1):
x86: Fix user return notifier put_cpu_var() invocation

Zachary Amsden (4):
KVM: Separate timer intialization into an indepedent function
KVM: Kill the confusing tsc_ref_khz and ref_freq variables
KVM: Fix printk name error in svm.c
KVM: Fix hotplug of CPUs


2009-11-16 12:20:00

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 01/42] core, x86: Add user return notifiers

Add a general per-cpu notifier that is called whenever the kernel is
about to return to userspace. The notifier uses a thread_info flag
and existing checks, so there is no impact on user return or context
switch fast paths.

This will be used initially to speed up KVM task switching by lazily
updating MSRs.

Signed-off-by: Avi Kivity <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/Kconfig | 10 +++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 7 +++-
arch/x86/kernel/process.c | 2 +
arch/x86/kernel/signal.c | 3 ++
include/linux/user-return-notifier.h | 42 +++++++++++++++++++++++++++++++
kernel/user-return-notifier.c | 46 ++++++++++++++++++++++++++++++++++
7 files changed, 109 insertions(+), 2 deletions(-)
create mode 100644 include/linux/user-return-notifier.h
create mode 100644 kernel/user-return-notifier.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 7f418bb..4e312ff 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -83,6 +83,13 @@ config KRETPROBES
def_bool y
depends on KPROBES && HAVE_KRETPROBES

+config USER_RETURN_NOTIFIER
+ bool
+ depends on HAVE_USER_RETURN_NOTIFIER
+ help
+ Provide a kernel-internal notification when a cpu is about to
+ switch to user mode.
+
config HAVE_IOREMAP_PROT
bool

@@ -126,4 +133,7 @@ config HAVE_DMA_API_DEBUG
config HAVE_DEFAULT_NO_SPIN_MUTEXES
bool

+config HAVE_USER_RETURN_NOTIFIER
+ bool
+
source "kernel/gcov/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8da9374..1df175d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -50,6 +50,7 @@ config X86
select HAVE_KERNEL_BZIP2
select HAVE_KERNEL_LZMA
select HAVE_ARCH_KMEMCHECK
+ select HAVE_USER_RETURN_NOTIFIER

config OUTPUT_FORMAT
string
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index d27d0a2..375c917 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
+#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
@@ -107,6 +108,7 @@ struct thread_info {
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
+#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
@@ -142,13 +144,14 @@ struct thread_info {

/* Only used for 64 bit */
#define _TIF_DO_NOTIFY_MASK \
- (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
+ (_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME | \
+ _TIF_USER_RETURN_NOTIFY)

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
(_TIF_IO_BITMAP|_TIF_DEBUGCTLMSR|_TIF_DS_AREA_MSR|_TIF_NOTSC)

-#define _TIF_WORK_CTXSW_PREV _TIF_WORK_CTXSW
+#define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY)
#define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW|_TIF_DEBUG)

#define PREEMPT_ACTIVE 0x10000000
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 5284cd2..e51b056 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -9,6 +9,7 @@
#include <linux/pm.h>
#include <linux/clockchips.h>
#include <linux/random.h>
+#include <linux/user-return-notifier.h>
#include <trace/events/power.h>
#include <asm/system.h>
#include <asm/apic.h>
@@ -224,6 +225,7 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
*/
memset(tss->io_bitmap, 0xff, prev->io_bitmap_max);
}
+ propagate_user_return_notify(prev_p, next_p);
}

int sys_fork(struct pt_regs *regs)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 6a44a76..c49f90f 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -19,6 +19,7 @@
#include <linux/stddef.h>
#include <linux/personality.h>
#include <linux/uaccess.h>
+#include <linux/user-return-notifier.h>

#include <asm/processor.h>
#include <asm/ucontext.h>
@@ -872,6 +873,8 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
if (current->replacement_session_keyring)
key_replace_session_keyring();
}
+ if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
+ fire_user_return_notifiers();

#ifdef CONFIG_X86_32
clear_thread_flag(TIF_IRET);
diff --git a/include/linux/user-return-notifier.h b/include/linux/user-return-notifier.h
new file mode 100644
index 0000000..b6ac056
--- /dev/null
+++ b/include/linux/user-return-notifier.h
@@ -0,0 +1,42 @@
+#ifndef _LINUX_USER_RETURN_NOTIFIER_H
+#define _LINUX_USER_RETURN_NOTIFIER_H
+
+#ifdef CONFIG_USER_RETURN_NOTIFIER
+
+#include <linux/list.h>
+#include <linux/sched.h>
+
+struct user_return_notifier {
+ void (*on_user_return)(struct user_return_notifier *urn);
+ struct hlist_node link;
+};
+
+
+void user_return_notifier_register(struct user_return_notifier *urn);
+void user_return_notifier_unregister(struct user_return_notifier *urn);
+
+static inline void propagate_user_return_notify(struct task_struct *prev,
+ struct task_struct *next)
+{
+ if (test_tsk_thread_flag(prev, TIF_USER_RETURN_NOTIFY)) {
+ clear_tsk_thread_flag(prev, TIF_USER_RETURN_NOTIFY);
+ set_tsk_thread_flag(next, TIF_USER_RETURN_NOTIFY);
+ }
+}
+
+void fire_user_return_notifiers(void);
+
+#else
+
+struct user_return_notifier {};
+
+static inline void propagate_user_return_notify(struct task_struct *prev,
+ struct task_struct *next)
+{
+}
+
+static inline void fire_user_return_notifiers(void) {}
+
+#endif
+
+#endif
diff --git a/kernel/user-return-notifier.c b/kernel/user-return-notifier.c
new file mode 100644
index 0000000..530ccb8
--- /dev/null
+++ b/kernel/user-return-notifier.c
@@ -0,0 +1,46 @@
+
+#include <linux/user-return-notifier.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/module.h>
+
+static DEFINE_PER_CPU(struct hlist_head, return_notifier_list);
+
+#define URN_LIST_HEAD per_cpu(return_notifier_list, raw_smp_processor_id())
+
+/*
+ * Request a notification when the current cpu returns to userspace. Must be
+ * called in atomic context. The notifier will also be called in atomic
+ * context.
+ */
+void user_return_notifier_register(struct user_return_notifier *urn)
+{
+ set_tsk_thread_flag(current, TIF_USER_RETURN_NOTIFY);
+ hlist_add_head(&urn->link, &URN_LIST_HEAD);
+}
+EXPORT_SYMBOL_GPL(user_return_notifier_register);
+
+/*
+ * Removes a registered user return notifier. Must be called from atomic
+ * context, and from the same cpu registration occured in.
+ */
+void user_return_notifier_unregister(struct user_return_notifier *urn)
+{
+ hlist_del(&urn->link);
+ if (hlist_empty(&URN_LIST_HEAD))
+ clear_tsk_thread_flag(current, TIF_USER_RETURN_NOTIFY);
+}
+EXPORT_SYMBOL_GPL(user_return_notifier_unregister);
+
+/* Calls registered user return notifiers */
+void fire_user_return_notifiers(void)
+{
+ struct user_return_notifier *urn;
+ struct hlist_node *tmp1, *tmp2;
+ struct hlist_head *head;
+
+ head = &get_cpu_var(return_notifier_list);
+ hlist_for_each_entry_safe(urn, tmp1, tmp2, head, link)
+ urn->on_user_return(urn);
+ put_cpu_var();
+}
--
1.6.5.2

2009-11-16 12:19:58

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 02/42] x86: Fix user return notifier build

When CONFIG_USER_RETURN_NOTIFIER is set, we need to link
kernel/user-return-notifier.o.

Signed-off-by: Avi Kivity <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/Makefile | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index b8d4cd8..0ae57a8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -95,6 +95,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
obj-$(CONFIG_SLOW_WORK) += slow-work.o
obj-$(CONFIG_PERF_EVENTS) += perf_event.o
+obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
--
1.6.5.2

2009-11-16 12:19:56

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 03/42] x86: Fix user return notifier put_cpu_var() invocation

From: Stephen Rothwell <[email protected]>

Today's linux-next build (x86_64 allmodconfig) failed like this:

kernel/user-return-notifier.c: In function
'fire_user_return_notifiers': kernel/user-return-notifier.c:45:
error: expected expression before ')' token

Introduced by commit 7c68af6e32c73992bad24107311f3433c89016e2
("core, x86: Add user return notifiers") from the tip and kvm trees
but revealed by commit e0fdb0e050eae331046385643618f12452aa7e73
("percpu: add __percpu for sparse") from the percpu tree.

Before that percpu tree commit, "put_cpu_var()" would compile
without error (even though it really needs a parameter).

Signed-off-by: Stephen Rothwell <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Christoph Lameter <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/user-return-notifier.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/user-return-notifier.c b/kernel/user-return-notifier.c
index 530ccb8..03e2d6f 100644
--- a/kernel/user-return-notifier.c
+++ b/kernel/user-return-notifier.c
@@ -42,5 +42,5 @@ void fire_user_return_notifiers(void)
head = &get_cpu_var(return_notifier_list);
hlist_for_each_entry_safe(urn, tmp1, tmp2, head, link)
urn->on_user_return(urn);
- put_cpu_var();
+ put_cpu_var(return_notifier_list);
}
--
1.6.5.2

2009-11-16 12:19:59

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 04/42] KVM: Don't wrap schedule() with vcpu_put()/vcpu_load()

Preemption notifiers will do that for us automatically.

Signed-off-by: Avi Kivity <[email protected]>
---
virt/kvm/kvm_main.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7495ce3..22b520b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1689,9 +1689,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
if (signal_pending(current))
break;

- vcpu_put(vcpu);
schedule();
- vcpu_load(vcpu);
}

finish_wait(&vcpu->wq, &wait);
--
1.6.5.2

2009-11-16 12:21:48

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 05/42] KVM: x86 emulator: Add 'push/pop sreg' instructions

From: Mohammed Gamal <[email protected]>

[avi: avoid buffer overflow]

Signed-off-by: Mohammed Gamal <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kvm/emulate.c | 107 +++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 101 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 1be5cd6..1cdfec5 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -92,19 +92,22 @@ static u32 opcode_table[256] = {
/* 0x00 - 0x07 */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- ByteOp | DstAcc | SrcImm, DstAcc | SrcImm, 0, 0,
+ ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
+ ImplicitOps | Stack, ImplicitOps | Stack,
/* 0x08 - 0x0F */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- 0, 0, 0, 0,
+ 0, 0, ImplicitOps | Stack, 0,
/* 0x10 - 0x17 */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- ByteOp | DstAcc | SrcImm, DstAcc | SrcImm, 0, 0,
+ ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
+ ImplicitOps | Stack, ImplicitOps | Stack,
/* 0x18 - 0x1F */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- ByteOp | DstAcc | SrcImm, DstAcc | SrcImm, 0, 0,
+ ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
+ ImplicitOps | Stack, ImplicitOps | Stack,
/* 0x20 - 0x27 */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
@@ -244,11 +247,13 @@ static u32 twobyte_table[256] = {
/* 0x90 - 0x9F */
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 0xA0 - 0xA7 */
- 0, 0, 0, DstMem | SrcReg | ModRM | BitOp,
+ ImplicitOps | Stack, ImplicitOps | Stack,
+ 0, DstMem | SrcReg | ModRM | BitOp,
DstMem | SrcReg | Src2ImmByte | ModRM,
DstMem | SrcReg | Src2CL | ModRM, 0, 0,
/* 0xA8 - 0xAF */
- 0, 0, 0, DstMem | SrcReg | ModRM | BitOp,
+ ImplicitOps | Stack, ImplicitOps | Stack,
+ 0, DstMem | SrcReg | ModRM | BitOp,
DstMem | SrcReg | Src2ImmByte | ModRM,
DstMem | SrcReg | Src2CL | ModRM,
ModRM, 0,
@@ -1186,6 +1191,32 @@ static int emulate_pop(struct x86_emulate_ctxt *ctxt,
return rc;
}

+static void emulate_push_sreg(struct x86_emulate_ctxt *ctxt, int seg)
+{
+ struct decode_cache *c = &ctxt->decode;
+ struct kvm_segment segment;
+
+ kvm_x86_ops->get_segment(ctxt->vcpu, &segment, seg);
+
+ c->src.val = segment.selector;
+ emulate_push(ctxt);
+}
+
+static int emulate_pop_sreg(struct x86_emulate_ctxt *ctxt,
+ struct x86_emulate_ops *ops, int seg)
+{
+ struct decode_cache *c = &ctxt->decode;
+ unsigned long selector;
+ int rc;
+
+ rc = emulate_pop(ctxt, ops, &selector, c->op_bytes);
+ if (rc != 0)
+ return rc;
+
+ rc = kvm_load_segment_descriptor(ctxt->vcpu, (u16)selector, 1, seg);
+ return rc;
+}
+
static inline int emulate_grp1a(struct x86_emulate_ctxt *ctxt,
struct x86_emulate_ops *ops)
{
@@ -1707,18 +1738,66 @@ special_insn:
add: /* add */
emulate_2op_SrcV("add", c->src, c->dst, ctxt->eflags);
break;
+ case 0x06: /* push es */
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ goto cannot_emulate;
+
+ emulate_push_sreg(ctxt, VCPU_SREG_ES);
+ break;
+ case 0x07: /* pop es */
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ goto cannot_emulate;
+
+ rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_ES);
+ if (rc != 0)
+ goto done;
+ break;
case 0x08 ... 0x0d:
or: /* or */
emulate_2op_SrcV("or", c->src, c->dst, ctxt->eflags);
break;
+ case 0x0e: /* push cs */
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ goto cannot_emulate;
+
+ emulate_push_sreg(ctxt, VCPU_SREG_CS);
+ break;
case 0x10 ... 0x15:
adc: /* adc */
emulate_2op_SrcV("adc", c->src, c->dst, ctxt->eflags);
break;
+ case 0x16: /* push ss */
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ goto cannot_emulate;
+
+ emulate_push_sreg(ctxt, VCPU_SREG_SS);
+ break;
+ case 0x17: /* pop ss */
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ goto cannot_emulate;
+
+ rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_SS);
+ if (rc != 0)
+ goto done;
+ break;
case 0x18 ... 0x1d:
sbb: /* sbb */
emulate_2op_SrcV("sbb", c->src, c->dst, ctxt->eflags);
break;
+ case 0x1e: /* push ds */
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ goto cannot_emulate;
+
+ emulate_push_sreg(ctxt, VCPU_SREG_DS);
+ break;
+ case 0x1f: /* pop ds */
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ goto cannot_emulate;
+
+ rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_DS);
+ if (rc != 0)
+ goto done;
+ break;
case 0x20 ... 0x25:
and: /* and */
emulate_2op_SrcV("and", c->src, c->dst, ctxt->eflags);
@@ -2297,6 +2376,14 @@ twobyte_insn:
jmp_rel(c, c->src.val);
c->dst.type = OP_NONE;
break;
+ case 0xa0: /* push fs */
+ emulate_push_sreg(ctxt, VCPU_SREG_FS);
+ break;
+ case 0xa1: /* pop fs */
+ rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_FS);
+ if (rc != 0)
+ goto done;
+ break;
case 0xa3:
bt: /* bt */
c->dst.type = OP_NONE;
@@ -2308,6 +2395,14 @@ twobyte_insn:
case 0xa5: /* shld cl, r, r/m */
emulate_2op_cl("shld", c->src2, c->src, c->dst, ctxt->eflags);
break;
+ case 0xa8: /* push gs */
+ emulate_push_sreg(ctxt, VCPU_SREG_GS);
+ break;
+ case 0xa9: /* pop gs */
+ rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_GS);
+ if (rc != 0)
+ goto done;
+ break;
case 0xab:
bts: /* bts */
/* only subword offset */
--
1.6.5.2

2009-11-16 12:20:56

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 06/42] KVM: x86 emulator: Introduce No64 decode option

From: Mohammed Gamal <[email protected]>

Introduces a new decode option "No64", which is used for instructions that are
invalid in long mode.

Signed-off-by: Mohammed Gamal <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kvm/emulate.c | 42 ++++++++++++++----------------------------
1 files changed, 14 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 1cdfec5..1f0ff4a 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -75,6 +75,8 @@
#define Group (1<<14) /* Bits 3:5 of modrm byte extend opcode */
#define GroupDual (1<<15) /* Alternate decoding of mod == 3 */
#define GroupMask 0xff /* Group number stored in bits 0:7 */
+/* Misc flags */
+#define No64 (1<<28)
/* Source 2 operand type */
#define Src2None (0<<29)
#define Src2CL (1<<29)
@@ -93,21 +95,21 @@ static u32 opcode_table[256] = {
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
- ImplicitOps | Stack, ImplicitOps | Stack,
+ ImplicitOps | Stack | No64, ImplicitOps | Stack | No64,
/* 0x08 - 0x0F */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- 0, 0, ImplicitOps | Stack, 0,
+ 0, 0, ImplicitOps | Stack | No64, 0,
/* 0x10 - 0x17 */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
- ImplicitOps | Stack, ImplicitOps | Stack,
+ ImplicitOps | Stack | No64, ImplicitOps | Stack | No64,
/* 0x18 - 0x1F */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
- ImplicitOps | Stack, ImplicitOps | Stack,
+ ImplicitOps | Stack | No64, ImplicitOps | Stack | No64,
/* 0x20 - 0x27 */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
@@ -161,7 +163,7 @@ static u32 opcode_table[256] = {
/* 0x90 - 0x97 */
DstReg, DstReg, DstReg, DstReg, DstReg, DstReg, DstReg, DstReg,
/* 0x98 - 0x9F */
- 0, 0, SrcImm | Src2Imm16, 0,
+ 0, 0, SrcImm | Src2Imm16 | No64, 0,
ImplicitOps | Stack, ImplicitOps | Stack, 0, 0,
/* 0xA0 - 0xA7 */
ByteOp | DstReg | SrcMem | Mov | MemAbs, DstReg | SrcMem | Mov | MemAbs,
@@ -188,7 +190,7 @@ static u32 opcode_table[256] = {
ByteOp | DstMem | SrcImm | ModRM | Mov, DstMem | SrcImm | ModRM | Mov,
/* 0xC8 - 0xCF */
0, 0, 0, ImplicitOps | Stack,
- ImplicitOps, SrcImmByte, ImplicitOps, ImplicitOps,
+ ImplicitOps, SrcImmByte, ImplicitOps | No64, ImplicitOps,
/* 0xD0 - 0xD7 */
ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM,
ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM,
@@ -201,7 +203,7 @@ static u32 opcode_table[256] = {
ByteOp | SrcImmUByte, SrcImmUByte,
/* 0xE8 - 0xEF */
SrcImm | Stack, SrcImm | ImplicitOps,
- SrcImmU | Src2Imm16, SrcImmByte | ImplicitOps,
+ SrcImmU | Src2Imm16 | No64, SrcImmByte | ImplicitOps,
SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps,
SrcNone | ByteOp | ImplicitOps, SrcNone | ImplicitOps,
/* 0xF0 - 0xF7 */
@@ -967,6 +969,11 @@ done_prefixes:
}
}

+ if (mode == X86EMUL_MODE_PROT64 && (c->d & No64)) {
+ kvm_report_emulation_failure(ctxt->vcpu, "invalid x86/64 instruction");;
+ return -1;
+ }
+
if (c->d & Group) {
group = c->d & GroupMask;
c->modrm = insn_fetch(u8, 1, c->eip);
@@ -1739,15 +1746,9 @@ special_insn:
emulate_2op_SrcV("add", c->src, c->dst, ctxt->eflags);
break;
case 0x06: /* push es */
- if (ctxt->mode == X86EMUL_MODE_PROT64)
- goto cannot_emulate;
-
emulate_push_sreg(ctxt, VCPU_SREG_ES);
break;
case 0x07: /* pop es */
- if (ctxt->mode == X86EMUL_MODE_PROT64)
- goto cannot_emulate;
-
rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_ES);
if (rc != 0)
goto done;
@@ -1757,9 +1758,6 @@ special_insn:
emulate_2op_SrcV("or", c->src, c->dst, ctxt->eflags);
break;
case 0x0e: /* push cs */
- if (ctxt->mode == X86EMUL_MODE_PROT64)
- goto cannot_emulate;
-
emulate_push_sreg(ctxt, VCPU_SREG_CS);
break;
case 0x10 ... 0x15:
@@ -1767,15 +1765,9 @@ special_insn:
emulate_2op_SrcV("adc", c->src, c->dst, ctxt->eflags);
break;
case 0x16: /* push ss */
- if (ctxt->mode == X86EMUL_MODE_PROT64)
- goto cannot_emulate;
-
emulate_push_sreg(ctxt, VCPU_SREG_SS);
break;
case 0x17: /* pop ss */
- if (ctxt->mode == X86EMUL_MODE_PROT64)
- goto cannot_emulate;
-
rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_SS);
if (rc != 0)
goto done;
@@ -1785,15 +1777,9 @@ special_insn:
emulate_2op_SrcV("sbb", c->src, c->dst, ctxt->eflags);
break;
case 0x1e: /* push ds */
- if (ctxt->mode == X86EMUL_MODE_PROT64)
- goto cannot_emulate;
-
emulate_push_sreg(ctxt, VCPU_SREG_DS);
break;
case 0x1f: /* pop ds */
- if (ctxt->mode == X86EMUL_MODE_PROT64)
- goto cannot_emulate;
-
rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_DS);
if (rc != 0)
goto done;
--
1.6.5.2

2009-11-16 12:20:42

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 07/42] KVM: Don't pass kvm_run arguments

They're just copies of vcpu->run, which is readily accessible.

Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 10 ++--
arch/x86/kvm/emulate.c | 6 +-
arch/x86/kvm/mmu.c | 2 +-
arch/x86/kvm/svm.c | 102 ++++++++++++++++++----------------
arch/x86/kvm/vmx.c | 113 ++++++++++++++++++--------------------
arch/x86/kvm/x86.c | 50 ++++++++---------
6 files changed, 141 insertions(+), 142 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d838922..0b113f2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -506,8 +506,8 @@ struct kvm_x86_ops {

void (*tlb_flush)(struct kvm_vcpu *vcpu);

- void (*run)(struct kvm_vcpu *vcpu, struct kvm_run *run);
- int (*handle_exit)(struct kvm_run *run, struct kvm_vcpu *vcpu);
+ void (*run)(struct kvm_vcpu *vcpu);
+ int (*handle_exit)(struct kvm_vcpu *vcpu);
void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
void (*set_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
@@ -568,7 +568,7 @@ enum emulation_result {
#define EMULTYPE_NO_DECODE (1 << 0)
#define EMULTYPE_TRAP_UD (1 << 1)
#define EMULTYPE_SKIP (1 << 2)
-int emulate_instruction(struct kvm_vcpu *vcpu, struct kvm_run *run,
+int emulate_instruction(struct kvm_vcpu *vcpu,
unsigned long cr2, u16 error_code, int emulation_type);
void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context);
void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address);
@@ -585,9 +585,9 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);

struct x86_emulate_ctxt;

-int kvm_emulate_pio(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
+int kvm_emulate_pio(struct kvm_vcpu *vcpu, int in,
int size, unsigned port);
-int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
+int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, int in,
int size, unsigned long count, int down,
gva_t address, int rep, unsigned port);
void kvm_emulate_cpuid(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 1f0ff4a..0644d3d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1826,7 +1826,7 @@ special_insn:
break;
case 0x6c: /* insb */
case 0x6d: /* insw/insd */
- if (kvm_emulate_pio_string(ctxt->vcpu, NULL,
+ if (kvm_emulate_pio_string(ctxt->vcpu,
1,
(c->d & ByteOp) ? 1 : c->op_bytes,
c->rep_prefix ?
@@ -1842,7 +1842,7 @@ special_insn:
return 0;
case 0x6e: /* outsb */
case 0x6f: /* outsw/outsd */
- if (kvm_emulate_pio_string(ctxt->vcpu, NULL,
+ if (kvm_emulate_pio_string(ctxt->vcpu,
0,
(c->d & ByteOp) ? 1 : c->op_bytes,
c->rep_prefix ?
@@ -2135,7 +2135,7 @@ special_insn:
case 0xef: /* out (e/r)ax,dx */
port = c->regs[VCPU_REGS_RDX];
io_dir_in = 0;
- do_io: if (kvm_emulate_pio(ctxt->vcpu, NULL, io_dir_in,
+ do_io: if (kvm_emulate_pio(ctxt->vcpu, io_dir_in,
(c->d & ByteOp) ? 1 : c->op_bytes,
port) != 0) {
c->eip = saved_eip;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 818b92a..a902479 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2789,7 +2789,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
if (r)
goto out;

- er = emulate_instruction(vcpu, vcpu->run, cr2, error_code, 0);
+ er = emulate_instruction(vcpu, cr2, error_code, 0);

switch (er) {
case EMULATE_DONE:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c17404a..92048a6 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -286,7 +286,7 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);

if (!svm->next_rip) {
- if (emulate_instruction(vcpu, vcpu->run, 0, 0, EMULTYPE_SKIP) !=
+ if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
EMULATE_DONE)
printk(KERN_DEBUG "%s: NOP\n", __func__);
return;
@@ -1180,7 +1180,7 @@ static void svm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long value,
}
}

-static int pf_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int pf_interception(struct vcpu_svm *svm)
{
u64 fault_address;
u32 error_code;
@@ -1194,8 +1194,10 @@ static int pf_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
}

-static int db_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int db_interception(struct vcpu_svm *svm)
{
+ struct kvm_run *kvm_run = svm->vcpu.run;
+
if (!(svm->vcpu.guest_debug &
(KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) &&
!svm->vcpu.arch.singlestep) {
@@ -1223,25 +1225,27 @@ static int db_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int bp_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int bp_interception(struct vcpu_svm *svm)
{
+ struct kvm_run *kvm_run = svm->vcpu.run;
+
kvm_run->exit_reason = KVM_EXIT_DEBUG;
kvm_run->debug.arch.pc = svm->vmcb->save.cs.base + svm->vmcb->save.rip;
kvm_run->debug.arch.exception = BP_VECTOR;
return 0;
}

-static int ud_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int ud_interception(struct vcpu_svm *svm)
{
int er;

- er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0, EMULTYPE_TRAP_UD);
+ er = emulate_instruction(&svm->vcpu, 0, 0, EMULTYPE_TRAP_UD);
if (er != EMULATE_DONE)
kvm_queue_exception(&svm->vcpu, UD_VECTOR);
return 1;
}

-static int nm_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int nm_interception(struct vcpu_svm *svm)
{
svm->vmcb->control.intercept_exceptions &= ~(1 << NM_VECTOR);
if (!(svm->vcpu.arch.cr0 & X86_CR0_TS))
@@ -1251,7 +1255,7 @@ static int nm_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int mc_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int mc_interception(struct vcpu_svm *svm)
{
/*
* On an #MC intercept the MCE handler is not called automatically in
@@ -1264,8 +1268,10 @@ static int mc_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int shutdown_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int shutdown_interception(struct vcpu_svm *svm)
{
+ struct kvm_run *kvm_run = svm->vcpu.run;
+
/*
* VMCB is undefined after a SHUTDOWN intercept
* so reinitialize it.
@@ -1277,7 +1283,7 @@ static int shutdown_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 0;
}

-static int io_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int io_interception(struct vcpu_svm *svm)
{
u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
int size, in, string;
@@ -1291,7 +1297,7 @@ static int io_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)

if (string) {
if (emulate_instruction(&svm->vcpu,
- kvm_run, 0, 0, 0) == EMULATE_DO_MMIO)
+ 0, 0, 0) == EMULATE_DO_MMIO)
return 0;
return 1;
}
@@ -1301,33 +1307,33 @@ static int io_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;

skip_emulated_instruction(&svm->vcpu);
- return kvm_emulate_pio(&svm->vcpu, kvm_run, in, size, port);
+ return kvm_emulate_pio(&svm->vcpu, in, size, port);
}

-static int nmi_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int nmi_interception(struct vcpu_svm *svm)
{
return 1;
}

-static int intr_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int intr_interception(struct vcpu_svm *svm)
{
++svm->vcpu.stat.irq_exits;
return 1;
}

-static int nop_on_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int nop_on_interception(struct vcpu_svm *svm)
{
return 1;
}

-static int halt_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int halt_interception(struct vcpu_svm *svm)
{
svm->next_rip = kvm_rip_read(&svm->vcpu) + 1;
skip_emulated_instruction(&svm->vcpu);
return kvm_emulate_halt(&svm->vcpu);
}

-static int vmmcall_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int vmmcall_interception(struct vcpu_svm *svm)
{
svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
skip_emulated_instruction(&svm->vcpu);
@@ -1837,7 +1843,7 @@ static void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb)
to_vmcb->save.sysenter_eip = from_vmcb->save.sysenter_eip;
}

-static int vmload_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int vmload_interception(struct vcpu_svm *svm)
{
struct vmcb *nested_vmcb;

@@ -1857,7 +1863,7 @@ static int vmload_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int vmsave_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int vmsave_interception(struct vcpu_svm *svm)
{
struct vmcb *nested_vmcb;

@@ -1877,7 +1883,7 @@ static int vmsave_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int vmrun_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int vmrun_interception(struct vcpu_svm *svm)
{
nsvm_printk("VMrun\n");

@@ -1907,7 +1913,7 @@ failed:
return 1;
}

-static int stgi_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int stgi_interception(struct vcpu_svm *svm)
{
if (nested_svm_check_permissions(svm))
return 1;
@@ -1920,7 +1926,7 @@ static int stgi_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int clgi_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int clgi_interception(struct vcpu_svm *svm)
{
if (nested_svm_check_permissions(svm))
return 1;
@@ -1937,7 +1943,7 @@ static int clgi_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int invlpga_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int invlpga_interception(struct vcpu_svm *svm)
{
struct kvm_vcpu *vcpu = &svm->vcpu;
nsvm_printk("INVLPGA\n");
@@ -1950,15 +1956,13 @@ static int invlpga_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int invalid_op_interception(struct vcpu_svm *svm,
- struct kvm_run *kvm_run)
+static int invalid_op_interception(struct vcpu_svm *svm)
{
kvm_queue_exception(&svm->vcpu, UD_VECTOR);
return 1;
}

-static int task_switch_interception(struct vcpu_svm *svm,
- struct kvm_run *kvm_run)
+static int task_switch_interception(struct vcpu_svm *svm)
{
u16 tss_selector;
int reason;
@@ -2008,14 +2012,14 @@ static int task_switch_interception(struct vcpu_svm *svm,
return kvm_task_switch(&svm->vcpu, tss_selector, reason);
}

-static int cpuid_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int cpuid_interception(struct vcpu_svm *svm)
{
svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
kvm_emulate_cpuid(&svm->vcpu);
return 1;
}

-static int iret_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int iret_interception(struct vcpu_svm *svm)
{
++svm->vcpu.stat.nmi_window_exits;
svm->vmcb->control.intercept &= ~(1UL << INTERCEPT_IRET);
@@ -2023,26 +2027,27 @@ static int iret_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int invlpg_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int invlpg_interception(struct vcpu_svm *svm)
{
- if (emulate_instruction(&svm->vcpu, kvm_run, 0, 0, 0) != EMULATE_DONE)
+ if (emulate_instruction(&svm->vcpu, 0, 0, 0) != EMULATE_DONE)
pr_unimpl(&svm->vcpu, "%s: failed\n", __func__);
return 1;
}

-static int emulate_on_interception(struct vcpu_svm *svm,
- struct kvm_run *kvm_run)
+static int emulate_on_interception(struct vcpu_svm *svm)
{
- if (emulate_instruction(&svm->vcpu, NULL, 0, 0, 0) != EMULATE_DONE)
+ if (emulate_instruction(&svm->vcpu, 0, 0, 0) != EMULATE_DONE)
pr_unimpl(&svm->vcpu, "%s: failed\n", __func__);
return 1;
}

-static int cr8_write_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int cr8_write_interception(struct vcpu_svm *svm)
{
+ struct kvm_run *kvm_run = svm->vcpu.run;
+
u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
/* instruction emulation calls kvm_set_cr8() */
- emulate_instruction(&svm->vcpu, NULL, 0, 0, 0);
+ emulate_instruction(&svm->vcpu, 0, 0, 0);
if (irqchip_in_kernel(svm->vcpu.kvm)) {
svm->vmcb->control.intercept_cr_write &= ~INTERCEPT_CR8_MASK;
return 1;
@@ -2128,7 +2133,7 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
return 0;
}

-static int rdmsr_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int rdmsr_interception(struct vcpu_svm *svm)
{
u32 ecx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
u64 data;
@@ -2221,7 +2226,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
return 0;
}

-static int wrmsr_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int wrmsr_interception(struct vcpu_svm *svm)
{
u32 ecx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
u64 data = (svm->vcpu.arch.regs[VCPU_REGS_RAX] & -1u)
@@ -2237,17 +2242,18 @@ static int wrmsr_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
return 1;
}

-static int msr_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+static int msr_interception(struct vcpu_svm *svm)
{
if (svm->vmcb->control.exit_info_1)
- return wrmsr_interception(svm, kvm_run);
+ return wrmsr_interception(svm);
else
- return rdmsr_interception(svm, kvm_run);
+ return rdmsr_interception(svm);
}

-static int interrupt_window_interception(struct vcpu_svm *svm,
- struct kvm_run *kvm_run)
+static int interrupt_window_interception(struct vcpu_svm *svm)
{
+ struct kvm_run *kvm_run = svm->vcpu.run;
+
svm_clear_vintr(svm);
svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
/*
@@ -2265,8 +2271,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm,
return 1;
}

-static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
- struct kvm_run *kvm_run) = {
+static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = {
[SVM_EXIT_READ_CR0] = emulate_on_interception,
[SVM_EXIT_READ_CR3] = emulate_on_interception,
[SVM_EXIT_READ_CR4] = emulate_on_interception,
@@ -2321,9 +2326,10 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
[SVM_EXIT_NPF] = pf_interception,
};

-static int handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
+static int handle_exit(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ struct kvm_run *kvm_run = vcpu->run;
u32 exit_code = svm->vmcb->control.exit_code;

trace_kvm_exit(exit_code, svm->vmcb->save.rip);
@@ -2383,7 +2389,7 @@ static int handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
return 0;
}

- return svm_exit_handlers[exit_code](svm, kvm_run);
+ return svm_exit_handlers[exit_code](svm);
}

static void reload_tss(struct kvm_vcpu *vcpu)
@@ -2588,7 +2594,7 @@ static void svm_complete_interrupts(struct vcpu_svm *svm)
#define R "e"
#endif

-static void svm_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static void svm_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
u16 fs_selector;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ed53b42..4635298 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2659,7 +2659,7 @@ static int handle_rmode_exception(struct kvm_vcpu *vcpu,
* Cause the #SS fault with 0 error code in VM86 mode.
*/
if (((vec == GP_VECTOR) || (vec == SS_VECTOR)) && err_code == 0)
- if (emulate_instruction(vcpu, NULL, 0, 0, 0) == EMULATE_DONE)
+ if (emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE)
return 1;
/*
* Forward all other exceptions that are valid in real mode.
@@ -2710,15 +2710,16 @@ static void kvm_machine_check(void)
#endif
}

-static int handle_machine_check(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_machine_check(struct kvm_vcpu *vcpu)
{
/* already handled by vcpu_run */
return 1;
}

-static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ struct kvm_run *kvm_run = vcpu->run;
u32 intr_info, ex_no, error_code;
unsigned long cr2, rip, dr6;
u32 vect_info;
@@ -2728,7 +2729,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
intr_info = vmcs_read32(VM_EXIT_INTR_INFO);

if (is_machine_check(intr_info))
- return handle_machine_check(vcpu, kvm_run);
+ return handle_machine_check(vcpu);

if ((vect_info & VECTORING_INFO_VALID_MASK) &&
!is_page_fault(intr_info))
@@ -2744,7 +2745,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}

if (is_invalid_opcode(intr_info)) {
- er = emulate_instruction(vcpu, kvm_run, 0, 0, EMULTYPE_TRAP_UD);
+ er = emulate_instruction(vcpu, 0, 0, EMULTYPE_TRAP_UD);
if (er != EMULATE_DONE)
kvm_queue_exception(vcpu, UD_VECTOR);
return 1;
@@ -2803,20 +2804,19 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 0;
}

-static int handle_external_interrupt(struct kvm_vcpu *vcpu,
- struct kvm_run *kvm_run)
+static int handle_external_interrupt(struct kvm_vcpu *vcpu)
{
++vcpu->stat.irq_exits;
return 1;
}

-static int handle_triple_fault(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_triple_fault(struct kvm_vcpu *vcpu)
{
- kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
+ vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
return 0;
}

-static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_io(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification;
int size, in, string;
@@ -2827,8 +2827,7 @@ static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
string = (exit_qualification & 16) != 0;

if (string) {
- if (emulate_instruction(vcpu,
- kvm_run, 0, 0, 0) == EMULATE_DO_MMIO)
+ if (emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO)
return 0;
return 1;
}
@@ -2838,7 +2837,7 @@ static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
port = exit_qualification >> 16;

skip_emulated_instruction(vcpu);
- return kvm_emulate_pio(vcpu, kvm_run, in, size, port);
+ return kvm_emulate_pio(vcpu, in, size, port);
}

static void
@@ -2852,7 +2851,7 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
hypercall[2] = 0xc1;
}

-static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_cr(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification, val;
int cr;
@@ -2887,7 +2886,7 @@ static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 1;
if (cr8_prev <= cr8)
return 1;
- kvm_run->exit_reason = KVM_EXIT_SET_TPR;
+ vcpu->run->exit_reason = KVM_EXIT_SET_TPR;
return 0;
}
};
@@ -2922,13 +2921,13 @@ static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
default:
break;
}
- kvm_run->exit_reason = 0;
+ vcpu->run->exit_reason = 0;
pr_unimpl(vcpu, "unhandled control register: op %d cr %d\n",
(int)(exit_qualification >> 4) & 3, cr);
return 0;
}

-static int handle_dr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_dr(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification;
unsigned long val;
@@ -2944,13 +2943,13 @@ static int handle_dr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
* guest debugging itself.
*/
if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
- kvm_run->debug.arch.dr6 = vcpu->arch.dr6;
- kvm_run->debug.arch.dr7 = dr;
- kvm_run->debug.arch.pc =
+ vcpu->run->debug.arch.dr6 = vcpu->arch.dr6;
+ vcpu->run->debug.arch.dr7 = dr;
+ vcpu->run->debug.arch.pc =
vmcs_readl(GUEST_CS_BASE) +
vmcs_readl(GUEST_RIP);
- kvm_run->debug.arch.exception = DB_VECTOR;
- kvm_run->exit_reason = KVM_EXIT_DEBUG;
+ vcpu->run->debug.arch.exception = DB_VECTOR;
+ vcpu->run->exit_reason = KVM_EXIT_DEBUG;
return 0;
} else {
vcpu->arch.dr7 &= ~DR7_GD;
@@ -3016,13 +3015,13 @@ static int handle_dr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 1;
}

-static int handle_cpuid(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_cpuid(struct kvm_vcpu *vcpu)
{
kvm_emulate_cpuid(vcpu);
return 1;
}

-static int handle_rdmsr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_rdmsr(struct kvm_vcpu *vcpu)
{
u32 ecx = vcpu->arch.regs[VCPU_REGS_RCX];
u64 data;
@@ -3041,7 +3040,7 @@ static int handle_rdmsr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 1;
}

-static int handle_wrmsr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_wrmsr(struct kvm_vcpu *vcpu)
{
u32 ecx = vcpu->arch.regs[VCPU_REGS_RCX];
u64 data = (vcpu->arch.regs[VCPU_REGS_RAX] & -1u)
@@ -3058,14 +3057,12 @@ static int handle_wrmsr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 1;
}

-static int handle_tpr_below_threshold(struct kvm_vcpu *vcpu,
- struct kvm_run *kvm_run)
+static int handle_tpr_below_threshold(struct kvm_vcpu *vcpu)
{
return 1;
}

-static int handle_interrupt_window(struct kvm_vcpu *vcpu,
- struct kvm_run *kvm_run)
+static int handle_interrupt_window(struct kvm_vcpu *vcpu)
{
u32 cpu_based_vm_exec_control;

@@ -3081,34 +3078,34 @@ static int handle_interrupt_window(struct kvm_vcpu *vcpu,
* possible
*/
if (!irqchip_in_kernel(vcpu->kvm) &&
- kvm_run->request_interrupt_window &&
+ vcpu->run->request_interrupt_window &&
!kvm_cpu_has_interrupt(vcpu)) {
- kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
+ vcpu->run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
return 0;
}
return 1;
}

-static int handle_halt(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_halt(struct kvm_vcpu *vcpu)
{
skip_emulated_instruction(vcpu);
return kvm_emulate_halt(vcpu);
}

-static int handle_vmcall(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_vmcall(struct kvm_vcpu *vcpu)
{
skip_emulated_instruction(vcpu);
kvm_emulate_hypercall(vcpu);
return 1;
}

-static int handle_vmx_insn(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_vmx_insn(struct kvm_vcpu *vcpu)
{
kvm_queue_exception(vcpu, UD_VECTOR);
return 1;
}

-static int handle_invlpg(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_invlpg(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);

@@ -3117,14 +3114,14 @@ static int handle_invlpg(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 1;
}

-static int handle_wbinvd(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_wbinvd(struct kvm_vcpu *vcpu)
{
skip_emulated_instruction(vcpu);
/* TODO: Add support for VT-d/pass-through device */
return 1;
}

-static int handle_apic_access(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_apic_access(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification;
enum emulation_result er;
@@ -3133,7 +3130,7 @@ static int handle_apic_access(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
offset = exit_qualification & 0xffful;

- er = emulate_instruction(vcpu, kvm_run, 0, 0, 0);
+ er = emulate_instruction(vcpu, 0, 0, 0);

if (er != EMULATE_DONE) {
printk(KERN_ERR
@@ -3144,7 +3141,7 @@ static int handle_apic_access(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 1;
}

-static int handle_task_switch(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_task_switch(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long exit_qualification;
@@ -3198,7 +3195,7 @@ static int handle_task_switch(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 1;
}

-static int handle_ept_violation(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_ept_violation(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification;
gpa_t gpa;
@@ -3219,8 +3216,8 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
vmcs_readl(GUEST_LINEAR_ADDRESS));
printk(KERN_ERR "EPT: Exit qualification is 0x%lx\n",
(long unsigned int)exit_qualification);
- kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
- kvm_run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
return 0;
}

@@ -3290,7 +3287,7 @@ static void ept_misconfig_inspect_spte(struct kvm_vcpu *vcpu, u64 spte,
}
}

-static int handle_ept_misconfig(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
{
u64 sptes[4];
int nr_sptes, i;
@@ -3306,13 +3303,13 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
for (i = PT64_ROOT_LEVEL; i > PT64_ROOT_LEVEL - nr_sptes; --i)
ept_misconfig_inspect_spte(vcpu, sptes[i-1], i);

- kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
- kvm_run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;

return 0;
}

-static int handle_nmi_window(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int handle_nmi_window(struct kvm_vcpu *vcpu)
{
u32 cpu_based_vm_exec_control;

@@ -3325,8 +3322,7 @@ static int handle_nmi_window(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 1;
}

-static void handle_invalid_guest_state(struct kvm_vcpu *vcpu,
- struct kvm_run *kvm_run)
+static void handle_invalid_guest_state(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
enum emulation_result err = EMULATE_DONE;
@@ -3335,7 +3331,7 @@ static void handle_invalid_guest_state(struct kvm_vcpu *vcpu,
preempt_enable();

while (!guest_state_valid(vcpu)) {
- err = emulate_instruction(vcpu, kvm_run, 0, 0, 0);
+ err = emulate_instruction(vcpu, 0, 0, 0);

if (err == EMULATE_DO_MMIO)
break;
@@ -3362,8 +3358,7 @@ static void handle_invalid_guest_state(struct kvm_vcpu *vcpu,
* may resume. Otherwise they set the kvm_run parameter to indicate what needs
* to be done to userspace and return 0.
*/
-static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu,
- struct kvm_run *kvm_run) = {
+static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_EXCEPTION_NMI] = handle_exception,
[EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt,
[EXIT_REASON_TRIPLE_FAULT] = handle_triple_fault,
@@ -3403,7 +3398,7 @@ static const int kvm_vmx_max_exit_handlers =
* The guest has exited. See if we can fix it or if we need userspace
* assistance.
*/
-static int vmx_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
+static int vmx_handle_exit(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 exit_reason = vmx->exit_reason;
@@ -3425,8 +3420,8 @@ static int vmx_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);

if (unlikely(vmx->fail)) {
- kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
- kvm_run->fail_entry.hardware_entry_failure_reason
+ vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+ vcpu->run->fail_entry.hardware_entry_failure_reason
= vmcs_read32(VM_INSTRUCTION_ERROR);
return 0;
}
@@ -3459,10 +3454,10 @@ static int vmx_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)

if (exit_reason < kvm_vmx_max_exit_handlers
&& kvm_vmx_exit_handlers[exit_reason])
- return kvm_vmx_exit_handlers[exit_reason](vcpu, kvm_run);
+ return kvm_vmx_exit_handlers[exit_reason](vcpu);
else {
- kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
- kvm_run->hw.hardware_exit_reason = exit_reason;
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ vcpu->run->hw.hardware_exit_reason = exit_reason;
}
return 0;
}
@@ -3600,7 +3595,7 @@ static void fixup_rmode_irq(struct vcpu_vmx *vmx)
#define Q "l"
#endif

-static void vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -3614,7 +3609,7 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)

/* Handle invalid guest state instead of entering VMX */
if (vmx->emulation_required && emulate_invalid_guest_state) {
- handle_invalid_guest_state(vcpu, kvm_run);
+ handle_invalid_guest_state(vcpu);
return;
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ae07d26..1687d12 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2757,13 +2757,13 @@ static void cache_all_regs(struct kvm_vcpu *vcpu)
}

int emulate_instruction(struct kvm_vcpu *vcpu,
- struct kvm_run *run,
unsigned long cr2,
u16 error_code,
int emulation_type)
{
int r, shadow_mask;
struct decode_cache *c;
+ struct kvm_run *run = vcpu->run;

kvm_clear_exception_queue(vcpu);
vcpu->arch.mmio_fault_cr2 = cr2;
@@ -2969,8 +2969,7 @@ static int pio_string_write(struct kvm_vcpu *vcpu)
return r;
}

-int kvm_emulate_pio(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
- int size, unsigned port)
+int kvm_emulate_pio(struct kvm_vcpu *vcpu, int in, int size, unsigned port)
{
unsigned long val;

@@ -2999,7 +2998,7 @@ int kvm_emulate_pio(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
}
EXPORT_SYMBOL_GPL(kvm_emulate_pio);

-int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
+int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, int in,
int size, unsigned long count, int down,
gva_t address, int rep, unsigned port)
{
@@ -3453,17 +3452,17 @@ EXPORT_SYMBOL_GPL(kvm_emulate_cpuid);
*
* No need to exit to userspace if we already have an interrupt queued.
*/
-static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
- struct kvm_run *kvm_run)
+static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu)
{
return (!irqchip_in_kernel(vcpu->kvm) && !kvm_cpu_has_interrupt(vcpu) &&
- kvm_run->request_interrupt_window &&
+ vcpu->run->request_interrupt_window &&
kvm_arch_interrupt_allowed(vcpu));
}

-static void post_kvm_run_save(struct kvm_vcpu *vcpu,
- struct kvm_run *kvm_run)
+static void post_kvm_run_save(struct kvm_vcpu *vcpu)
{
+ struct kvm_run *kvm_run = vcpu->run;
+
kvm_run->if_flag = (kvm_x86_ops->get_rflags(vcpu) & X86_EFLAGS_IF) != 0;
kvm_run->cr8 = kvm_get_cr8(vcpu);
kvm_run->apic_base = kvm_get_apic_base(vcpu);
@@ -3525,7 +3524,7 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu)
kvm_x86_ops->update_cr8_intercept(vcpu, tpr, max_irr);
}

-static void inject_pending_event(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static void inject_pending_event(struct kvm_vcpu *vcpu)
{
/* try to reinject previous events if any */
if (vcpu->arch.exception.pending) {
@@ -3561,11 +3560,11 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}
}

-static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
int r;
bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
- kvm_run->request_interrupt_window;
+ vcpu->run->request_interrupt_window;

if (vcpu->requests)
if (test_and_clear_bit(KVM_REQ_MMU_RELOAD, &vcpu->requests))
@@ -3586,12 +3585,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
kvm_x86_ops->tlb_flush(vcpu);
if (test_and_clear_bit(KVM_REQ_REPORT_TPR_ACCESS,
&vcpu->requests)) {
- kvm_run->exit_reason = KVM_EXIT_TPR_ACCESS;
+ vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
r = 0;
goto out;
}
if (test_and_clear_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests)) {
- kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
+ vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
r = 0;
goto out;
}
@@ -3615,7 +3614,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
goto out;
}

- inject_pending_event(vcpu, kvm_run);
+ inject_pending_event(vcpu);

/* enable NMI/IRQ window open exits if needed */
if (vcpu->arch.nmi_pending)
@@ -3641,7 +3640,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}

trace_kvm_entry(vcpu->vcpu_id);
- kvm_x86_ops->run(vcpu, kvm_run);
+ kvm_x86_ops->run(vcpu);

if (unlikely(vcpu->arch.switch_db_regs || test_thread_flag(TIF_DEBUG))) {
set_debugreg(current->thread.debugreg0, 0);
@@ -3682,13 +3681,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)

kvm_lapic_sync_from_vapic(vcpu);

- r = kvm_x86_ops->handle_exit(kvm_run, vcpu);
+ r = kvm_x86_ops->handle_exit(vcpu);
out:
return r;
}


-static int __vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static int __vcpu_run(struct kvm_vcpu *vcpu)
{
int r;

@@ -3708,7 +3707,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
r = 1;
while (r > 0) {
if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE)
- r = vcpu_enter_guest(vcpu, kvm_run);
+ r = vcpu_enter_guest(vcpu);
else {
up_read(&vcpu->kvm->slots_lock);
kvm_vcpu_block(vcpu);
@@ -3736,14 +3735,14 @@ static int __vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
if (kvm_cpu_has_pending_timer(vcpu))
kvm_inject_pending_timer_irqs(vcpu);

- if (dm_request_for_irq_injection(vcpu, kvm_run)) {
+ if (dm_request_for_irq_injection(vcpu)) {
r = -EINTR;
- kvm_run->exit_reason = KVM_EXIT_INTR;
+ vcpu->run->exit_reason = KVM_EXIT_INTR;
++vcpu->stat.request_irq_exits;
}
if (signal_pending(current)) {
r = -EINTR;
- kvm_run->exit_reason = KVM_EXIT_INTR;
+ vcpu->run->exit_reason = KVM_EXIT_INTR;
++vcpu->stat.signal_exits;
}
if (need_resched()) {
@@ -3754,7 +3753,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}

up_read(&vcpu->kvm->slots_lock);
- post_kvm_run_save(vcpu, kvm_run);
+ post_kvm_run_save(vcpu);

vapic_exit(vcpu);

@@ -3794,8 +3793,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
vcpu->mmio_needed = 0;

down_read(&vcpu->kvm->slots_lock);
- r = emulate_instruction(vcpu, kvm_run,
- vcpu->arch.mmio_fault_cr2, 0,
+ r = emulate_instruction(vcpu, vcpu->arch.mmio_fault_cr2, 0,
EMULTYPE_NO_DECODE);
up_read(&vcpu->kvm->slots_lock);
if (r == EMULATE_DO_MMIO) {
@@ -3811,7 +3809,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
kvm_register_write(vcpu, VCPU_REGS_RAX,
kvm_run->hypercall.ret);

- r = __vcpu_run(vcpu, kvm_run);
+ r = __vcpu_run(vcpu);

out:
if (vcpu->sigset_active)
--
1.6.5.2

2009-11-16 12:29:51

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 08/42] KVM: Call pic_clear_isr() on pic reset to reuse logic there

From: Gleb Natapov <[email protected]>

Also move call of ack notifiers after pic state change.

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kvm/i8259.c | 22 +++++++++-------------
1 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 01f1516..ccc941a 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -225,22 +225,11 @@ int kvm_pic_read_irq(struct kvm *kvm)

void kvm_pic_reset(struct kvm_kpic_state *s)
{
- int irq, irqbase, n;
+ int irq;
struct kvm *kvm = s->pics_state->irq_request_opaque;
struct kvm_vcpu *vcpu0 = kvm->bsp_vcpu;
+ u8 irr = s->irr, isr = s->imr;

- if (s == &s->pics_state->pics[0])
- irqbase = 0;
- else
- irqbase = 8;
-
- for (irq = 0; irq < PIC_NUM_PINS/2; irq++) {
- if (vcpu0 && kvm_apic_accept_pic_intr(vcpu0))
- if (s->irr & (1 << irq) || s->isr & (1 << irq)) {
- n = irq + irqbase;
- kvm_notify_acked_irq(kvm, SELECT_PIC(n), n);
- }
- }
s->last_irr = 0;
s->irr = 0;
s->imr = 0;
@@ -256,6 +245,13 @@ void kvm_pic_reset(struct kvm_kpic_state *s)
s->rotate_on_auto_eoi = 0;
s->special_fully_nested_mode = 0;
s->init4 = 0;
+
+ for (irq = 0; irq < PIC_NUM_PINS/2; irq++) {
+ if (vcpu0 && kvm_apic_accept_pic_intr(vcpu0))
+ if (irr & (1 << irq) || isr & (1 << irq)) {
+ pic_clear_isr(s, irq);
+ }
+ }
}

static void pic_ioport_write(void *opaque, u32 addr, u32 val)
--
1.6.5.2

2009-11-16 12:21:00

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 09/42] KVM: Move irq sharing information to irqchip level

From: Gleb Natapov <[email protected]>

This removes assumptions that max GSIs is smaller than number of pins.
Sharing is tracked on pin level not GSI level.

[avi: no PIC on ia64]

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/irq.h | 1 +
include/linux/kvm_host.h | 2 +-
virt/kvm/ioapic.h | 1 +
virt/kvm/irq_comm.c | 59 +++++++++++++++++++++++---------------
5 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0b113f2..35d3236 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -410,7 +410,6 @@ struct kvm_arch{
gpa_t ept_identity_map_addr;

unsigned long irq_sources_bitmap;
- unsigned long irq_states[KVM_IOAPIC_NUM_PINS];
u64 vm_init_tsc;
};

diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h
index 7d6058a..c025a23 100644
--- a/arch/x86/kvm/irq.h
+++ b/arch/x86/kvm/irq.h
@@ -71,6 +71,7 @@ struct kvm_pic {
int output; /* intr from master PIC */
struct kvm_io_device dev;
void (*ack_notifier)(void *opaque, int irq);
+ unsigned long irq_states[16];
};

struct kvm_pic *kvm_create_pic(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b7bbb5d..1c7f8c4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -120,7 +120,7 @@ struct kvm_kernel_irq_routing_entry {
u32 gsi;
u32 type;
int (*set)(struct kvm_kernel_irq_routing_entry *e,
- struct kvm *kvm, int level);
+ struct kvm *kvm, int irq_source_id, int level);
union {
struct {
unsigned irqchip;
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 7080b71..6e461ad 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -41,6 +41,7 @@ struct kvm_ioapic {
u32 irr;
u32 pad;
union kvm_ioapic_redirect_entry redirtbl[IOAPIC_NUM_PINS];
+ unsigned long irq_states[IOAPIC_NUM_PINS];
struct kvm_io_device dev;
struct kvm *kvm;
void (*ack_notifier)(void *opaque, int irq);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 001663f..9783f5c 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -31,20 +31,39 @@

#include "ioapic.h"

+static inline int kvm_irq_line_state(unsigned long *irq_state,
+ int irq_source_id, int level)
+{
+ /* Logical OR for level trig interrupt */
+ if (level)
+ set_bit(irq_source_id, irq_state);
+ else
+ clear_bit(irq_source_id, irq_state);
+
+ return !!(*irq_state);
+}
+
static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
- struct kvm *kvm, int level)
+ struct kvm *kvm, int irq_source_id, int level)
{
#ifdef CONFIG_X86
- return kvm_pic_set_irq(pic_irqchip(kvm), e->irqchip.pin, level);
+ struct kvm_pic *pic = pic_irqchip(kvm);
+ level = kvm_irq_line_state(&pic->irq_states[e->irqchip.pin],
+ irq_source_id, level);
+ return kvm_pic_set_irq(pic, e->irqchip.pin, level);
#else
return -1;
#endif
}

static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
- struct kvm *kvm, int level)
+ struct kvm *kvm, int irq_source_id, int level)
{
- return kvm_ioapic_set_irq(kvm->arch.vioapic, e->irqchip.pin, level);
+ struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+ level = kvm_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
+ irq_source_id, level);
+
+ return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
}

inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
@@ -96,10 +115,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
}

static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
- struct kvm *kvm, int level)
+ struct kvm *kvm, int irq_source_id, int level)
{
struct kvm_lapic_irq irq;

+ if (!level)
+ return -1;
+
trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);

irq.dest_id = (e->msi.address_lo &
@@ -125,34 +147,19 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
{
struct kvm_kernel_irq_routing_entry *e;
- unsigned long *irq_state, sig_level;
int ret = -1;

trace_kvm_set_irq(irq, level, irq_source_id);

WARN_ON(!mutex_is_locked(&kvm->irq_lock));

- if (irq < KVM_IOAPIC_NUM_PINS) {
- irq_state = (unsigned long *)&kvm->arch.irq_states[irq];
-
- /* Logical OR for level trig interrupt */
- if (level)
- set_bit(irq_source_id, irq_state);
- else
- clear_bit(irq_source_id, irq_state);
- sig_level = !!(*irq_state);
- } else if (!level)
- return ret;
- else /* Deal with MSI/MSI-X */
- sig_level = 1;
-
/* Not possible to detect if the guest uses the PIC or the
* IOAPIC. So set the bit in both. The guest will ignore
* writes to the unused one.
*/
list_for_each_entry(e, &kvm->irq_routing, link)
if (e->gsi == irq) {
- int r = e->set(e, kvm, sig_level);
+ int r = e->set(e, kvm, irq_source_id, level);
if (r < 0)
continue;

@@ -232,8 +239,14 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
printk(KERN_ERR "kvm: IRQ source ID out of range!\n");
return;
}
- for (i = 0; i < KVM_IOAPIC_NUM_PINS; i++)
- clear_bit(irq_source_id, &kvm->arch.irq_states[i]);
+ for (i = 0; i < KVM_IOAPIC_NUM_PINS; i++) {
+ clear_bit(irq_source_id, &kvm->arch.vioapic->irq_states[i]);
+ if (i >= 16)
+ continue;
+#ifdef CONFIG_X86
+ clear_bit(irq_source_id, &pic_irqchip(kvm)->irq_states[i]);
+#endif
+ }
clear_bit(irq_source_id, &kvm->arch.irq_sources_bitmap);
mutex_unlock(&kvm->irq_lock);
}
--
1.6.5.2

2009-11-16 12:21:38

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 10/42] KVM: Change irq routing table to use gsi indexed array

From: Gleb Natapov <[email protected]>

Use gsi indexed array instead of scanning all entries on each interrupt
injection.

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
include/linux/kvm_host.h | 21 +++++++++--
virt/kvm/irq_comm.c | 88 +++++++++++++++++++++++++++------------------
virt/kvm/kvm_main.c | 1 -
3 files changed, 71 insertions(+), 39 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c7f8c4..f403e66 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -128,7 +128,17 @@ struct kvm_kernel_irq_routing_entry {
} irqchip;
struct msi_msg msi;
};
- struct list_head link;
+ struct hlist_node link;
+};
+
+struct kvm_irq_routing_table {
+ struct kvm_kernel_irq_routing_entry *rt_entries;
+ u32 nr_rt_entries;
+ /*
+ * Array indexed by gsi. Each entry contains list of irq chips
+ * the gsi is connected to.
+ */
+ struct hlist_head map[0];
};

struct kvm {
@@ -166,7 +176,7 @@ struct kvm {

struct mutex irq_lock;
#ifdef CONFIG_HAVE_KVM_IRQCHIP
- struct list_head irq_routing; /* of kvm_kernel_irq_routing_entry */
+ struct kvm_irq_routing_table *irq_routing;
struct hlist_head mask_notifier_list;
#endif

@@ -390,7 +400,12 @@ void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
struct kvm_irq_mask_notifier *kimn);
void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask);

-int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level);
+#ifdef __KVM_HAVE_IOAPIC
+void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
+ union kvm_ioapic_redirect_entry *entry,
+ unsigned long *deliver_bitmask);
+#endif
+int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
void kvm_register_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9783f5c..81950f6 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -144,10 +144,12 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
* = 0 Interrupt was coalesced (previous irq is still pending)
* > 0 Number of CPUs interrupt was delivered to
*/
-int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
+int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
{
struct kvm_kernel_irq_routing_entry *e;
int ret = -1;
+ struct kvm_irq_routing_table *irq_rt;
+ struct hlist_node *n;

trace_kvm_set_irq(irq, level, irq_source_id);

@@ -157,8 +159,9 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
* IOAPIC. So set the bit in both. The guest will ignore
* writes to the unused one.
*/
- list_for_each_entry(e, &kvm->irq_routing, link)
- if (e->gsi == irq) {
+ irq_rt = kvm->irq_routing;
+ if (irq < irq_rt->nr_rt_entries)
+ hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
int r = e->set(e, kvm, irq_source_id, level);
if (r < 0)
continue;
@@ -170,20 +173,23 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)

void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
{
- struct kvm_kernel_irq_routing_entry *e;
struct kvm_irq_ack_notifier *kian;
struct hlist_node *n;
unsigned gsi = pin;
+ int i;

trace_kvm_ack_irq(irqchip, pin);

- list_for_each_entry(e, &kvm->irq_routing, link)
+ for (i = 0; i < kvm->irq_routing->nr_rt_entries; i++) {
+ struct kvm_kernel_irq_routing_entry *e;
+ e = &kvm->irq_routing->rt_entries[i];
if (e->type == KVM_IRQ_ROUTING_IRQCHIP &&
e->irqchip.irqchip == irqchip &&
e->irqchip.pin == pin) {
gsi = e->gsi;
break;
}
+ }

hlist_for_each_entry(kian, n, &kvm->arch.irq_ack_notifier_list, link)
if (kian->gsi == gsi)
@@ -280,26 +286,30 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask)
kimn->func(kimn, mask);
}

-static void __kvm_free_irq_routing(struct list_head *irq_routing)
-{
- struct kvm_kernel_irq_routing_entry *e, *n;
-
- list_for_each_entry_safe(e, n, irq_routing, link)
- kfree(e);
-}
-
void kvm_free_irq_routing(struct kvm *kvm)
{
mutex_lock(&kvm->irq_lock);
- __kvm_free_irq_routing(&kvm->irq_routing);
+ kfree(kvm->irq_routing);
mutex_unlock(&kvm->irq_lock);
}

-static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
+static int setup_routing_entry(struct kvm_irq_routing_table *rt,
+ struct kvm_kernel_irq_routing_entry *e,
const struct kvm_irq_routing_entry *ue)
{
int r = -EINVAL;
int delta;
+ struct kvm_kernel_irq_routing_entry *ei;
+ struct hlist_node *n;
+
+ /*
+ * Do not allow GSI to be mapped to the same irqchip more than once.
+ * Allow only one to one mapping between GSI and MSI.
+ */
+ hlist_for_each_entry(ei, n, &rt->map[ue->gsi], link)
+ if (ei->type == KVM_IRQ_ROUTING_MSI ||
+ ue->u.irqchip.irqchip == ei->irqchip.irqchip)
+ return r;

e->gsi = ue->gsi;
e->type = ue->type;
@@ -332,6 +342,8 @@ static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
default:
goto out;
}
+
+ hlist_add_head(&e->link, &rt->map[e->gsi]);
r = 0;
out:
return r;
@@ -343,43 +355,49 @@ int kvm_set_irq_routing(struct kvm *kvm,
unsigned nr,
unsigned flags)
{
- struct list_head irq_list = LIST_HEAD_INIT(irq_list);
- struct list_head tmp = LIST_HEAD_INIT(tmp);
- struct kvm_kernel_irq_routing_entry *e = NULL;
- unsigned i;
+ struct kvm_irq_routing_table *new, *old;
+ u32 i, nr_rt_entries = 0;
int r;

for (i = 0; i < nr; ++i) {
+ if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
+ return -EINVAL;
+ nr_rt_entries = max(nr_rt_entries, ue[i].gsi);
+ }
+
+ nr_rt_entries += 1;
+
+ new = kzalloc(sizeof(*new) + (nr_rt_entries * sizeof(struct hlist_head))
+ + (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
+ GFP_KERNEL);
+
+ if (!new)
+ return -ENOMEM;
+
+ new->rt_entries = (void *)&new->map[nr_rt_entries];
+
+ new->nr_rt_entries = nr_rt_entries;
+
+ for (i = 0; i < nr; ++i) {
r = -EINVAL;
- if (ue->gsi >= KVM_MAX_IRQ_ROUTES)
- goto out;
if (ue->flags)
goto out;
- r = -ENOMEM;
- e = kzalloc(sizeof(*e), GFP_KERNEL);
- if (!e)
- goto out;
- r = setup_routing_entry(e, ue);
+ r = setup_routing_entry(new, &new->rt_entries[i], ue);
if (r)
goto out;
++ue;
- list_add(&e->link, &irq_list);
- e = NULL;
}

mutex_lock(&kvm->irq_lock);
- list_splice(&kvm->irq_routing, &tmp);
- INIT_LIST_HEAD(&kvm->irq_routing);
- list_splice(&irq_list, &kvm->irq_routing);
- INIT_LIST_HEAD(&irq_list);
- list_splice(&tmp, &irq_list);
+ old = kvm->irq_routing;
+ kvm->irq_routing = new;
mutex_unlock(&kvm->irq_lock);

+ new = old;
r = 0;

out:
- kfree(e);
- __kvm_free_irq_routing(&irq_list);
+ kfree(new);
return r;
}

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 22b520b..3bee948 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -957,7 +957,6 @@ static struct kvm *kvm_create_vm(void)
if (IS_ERR(kvm))
goto out;
#ifdef CONFIG_HAVE_KVM_IRQCHIP
- INIT_LIST_HEAD(&kvm->irq_routing);
INIT_HLIST_HEAD(&kvm->mask_notifier_list);
#endif

--
1.6.5.2

2009-11-16 12:21:15

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 11/42] KVM: Maintain back mapping from irqchip/pin to gsi

From: Gleb Natapov <[email protected]>

Maintain back mapping from irqchip/pin to gsi to speedup
interrupt acknowledgment notifications.

[avi: build fix on non-x86/ia64]

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/ia64/include/asm/kvm.h | 1 +
arch/x86/include/asm/kvm.h | 1 +
include/linux/kvm_host.h | 9 +++++++++
virt/kvm/irq_comm.c | 31 ++++++++++++++-----------------
4 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/include/asm/kvm.h b/arch/ia64/include/asm/kvm.h
index 18a7e49..bc90c75 100644
--- a/arch/ia64/include/asm/kvm.h
+++ b/arch/ia64/include/asm/kvm.h
@@ -60,6 +60,7 @@ struct kvm_ioapic_state {
#define KVM_IRQCHIP_PIC_MASTER 0
#define KVM_IRQCHIP_PIC_SLAVE 1
#define KVM_IRQCHIP_IOAPIC 2
+#define KVM_NR_IRQCHIPS 3

#define KVM_CONTEXT_SIZE 8*1024

diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
index 4a5fe91..f02e87a 100644
--- a/arch/x86/include/asm/kvm.h
+++ b/arch/x86/include/asm/kvm.h
@@ -79,6 +79,7 @@ struct kvm_ioapic_state {
#define KVM_IRQCHIP_PIC_MASTER 0
#define KVM_IRQCHIP_PIC_SLAVE 1
#define KVM_IRQCHIP_IOAPIC 2
+#define KVM_NR_IRQCHIPS 3

/* for KVM_GET_REGS and KVM_SET_REGS */
struct kvm_regs {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f403e66..cc2d749 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -131,7 +131,10 @@ struct kvm_kernel_irq_routing_entry {
struct hlist_node link;
};

+#ifdef __KVM_HAVE_IOAPIC
+
struct kvm_irq_routing_table {
+ int chip[KVM_NR_IRQCHIPS][KVM_IOAPIC_NUM_PINS];
struct kvm_kernel_irq_routing_entry *rt_entries;
u32 nr_rt_entries;
/*
@@ -141,6 +144,12 @@ struct kvm_irq_routing_table {
struct hlist_head map[0];
};

+#else
+
+struct kvm_irq_routing_table {};
+
+#endif
+
struct kvm {
spinlock_t mmu_lock;
spinlock_t requests_lock;
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 81950f6..59cf8da 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -175,25 +175,16 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
{
struct kvm_irq_ack_notifier *kian;
struct hlist_node *n;
- unsigned gsi = pin;
- int i;
+ int gsi;

trace_kvm_ack_irq(irqchip, pin);

- for (i = 0; i < kvm->irq_routing->nr_rt_entries; i++) {
- struct kvm_kernel_irq_routing_entry *e;
- e = &kvm->irq_routing->rt_entries[i];
- if (e->type == KVM_IRQ_ROUTING_IRQCHIP &&
- e->irqchip.irqchip == irqchip &&
- e->irqchip.pin == pin) {
- gsi = e->gsi;
- break;
- }
- }
-
- hlist_for_each_entry(kian, n, &kvm->arch.irq_ack_notifier_list, link)
- if (kian->gsi == gsi)
- kian->irq_acked(kian);
+ gsi = kvm->irq_routing->chip[irqchip][pin];
+ if (gsi != -1)
+ hlist_for_each_entry(kian, n, &kvm->arch.irq_ack_notifier_list,
+ link)
+ if (kian->gsi == gsi)
+ kian->irq_acked(kian);
}

void kvm_register_irq_ack_notifier(struct kvm *kvm,
@@ -332,6 +323,9 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
}
e->irqchip.irqchip = ue->u.irqchip.irqchip;
e->irqchip.pin = ue->u.irqchip.pin + delta;
+ if (e->irqchip.pin >= KVM_IOAPIC_NUM_PINS)
+ goto out;
+ rt->chip[ue->u.irqchip.irqchip][e->irqchip.pin] = ue->gsi;
break;
case KVM_IRQ_ROUTING_MSI:
e->set = kvm_set_msi;
@@ -356,7 +350,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
unsigned flags)
{
struct kvm_irq_routing_table *new, *old;
- u32 i, nr_rt_entries = 0;
+ u32 i, j, nr_rt_entries = 0;
int r;

for (i = 0; i < nr; ++i) {
@@ -377,6 +371,9 @@ int kvm_set_irq_routing(struct kvm *kvm,
new->rt_entries = (void *)&new->map[nr_rt_entries];

new->nr_rt_entries = nr_rt_entries;
+ for (i = 0; i < 3; i++)
+ for (j = 0; j < KVM_IOAPIC_NUM_PINS; j++)
+ new->chip[i][j] = -1;

for (i = 0; i < nr; ++i) {
r = -EINVAL;
--
1.6.5.2

2009-11-16 12:29:13

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 12/42] KVM: Move irq routing data structure to rcu locking

From: Gleb Natapov <[email protected]>

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
virt/kvm/irq_comm.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 59cf8da..fb861dd 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -159,7 +159,8 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
* IOAPIC. So set the bit in both. The guest will ignore
* writes to the unused one.
*/
- irq_rt = kvm->irq_routing;
+ rcu_read_lock();
+ irq_rt = rcu_dereference(kvm->irq_routing);
if (irq < irq_rt->nr_rt_entries)
hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
int r = e->set(e, kvm, irq_source_id, level);
@@ -168,6 +169,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)

ret = r + ((ret < 0) ? 0 : ret);
}
+ rcu_read_unlock();
return ret;
}

@@ -179,7 +181,10 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)

trace_kvm_ack_irq(irqchip, pin);

- gsi = kvm->irq_routing->chip[irqchip][pin];
+ rcu_read_lock();
+ gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+ rcu_read_unlock();
+
if (gsi != -1)
hlist_for_each_entry(kian, n, &kvm->arch.irq_ack_notifier_list,
link)
@@ -279,9 +284,9 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask)

void kvm_free_irq_routing(struct kvm *kvm)
{
- mutex_lock(&kvm->irq_lock);
+ /* Called only during vm destruction. Nobody can use the pointer
+ at this stage */
kfree(kvm->irq_routing);
- mutex_unlock(&kvm->irq_lock);
}

static int setup_routing_entry(struct kvm_irq_routing_table *rt,
@@ -387,8 +392,9 @@ int kvm_set_irq_routing(struct kvm *kvm,

mutex_lock(&kvm->irq_lock);
old = kvm->irq_routing;
- kvm->irq_routing = new;
+ rcu_assign_pointer(kvm->irq_routing, new);
mutex_unlock(&kvm->irq_lock);
+ synchronize_rcu();

new = old;
r = 0;
--
1.6.5.2

2009-11-16 12:29:38

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 13/42] KVM: Move irq ack notifier list to arch independent code

From: Gleb Natapov <[email protected]>

Mask irq notifier list is already there.

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/ia64/include/asm/kvm_host.h | 1 -
arch/x86/include/asm/kvm_host.h | 1 -
include/linux/kvm_host.h | 1 +
virt/kvm/irq_comm.c | 5 ++---
virt/kvm/kvm_main.c | 1 +
5 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index d9b6325..a362e67 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -475,7 +475,6 @@ struct kvm_arch {
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
int iommu_flags;
- struct hlist_head irq_ack_notifier_list;

unsigned long irq_sources_bitmap;
unsigned long irq_states[KVM_IOAPIC_NUM_PINS];
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35d3236..a46e2dd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -397,7 +397,6 @@ struct kvm_arch{
struct kvm_pic *vpic;
struct kvm_ioapic *vioapic;
struct kvm_pit *vpit;
- struct hlist_head irq_ack_notifier_list;
int vapics_in_nmi_mode;

unsigned int tss_addr;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cc2d749..4aa5e1d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -187,6 +187,7 @@ struct kvm {
#ifdef CONFIG_HAVE_KVM_IRQCHIP
struct kvm_irq_routing_table *irq_routing;
struct hlist_head mask_notifier_list;
+ struct hlist_head irq_ack_notifier_list;
#endif

#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index fb861dd..f019725 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -186,8 +186,7 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
rcu_read_unlock();

if (gsi != -1)
- hlist_for_each_entry(kian, n, &kvm->arch.irq_ack_notifier_list,
- link)
+ hlist_for_each_entry(kian, n, &kvm->irq_ack_notifier_list, link)
if (kian->gsi == gsi)
kian->irq_acked(kian);
}
@@ -196,7 +195,7 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian)
{
mutex_lock(&kvm->irq_lock);
- hlist_add_head(&kian->link, &kvm->arch.irq_ack_notifier_list);
+ hlist_add_head(&kian->link, &kvm->irq_ack_notifier_list);
mutex_unlock(&kvm->irq_lock);
}

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3bee948..6eca153 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -958,6 +958,7 @@ static struct kvm *kvm_create_vm(void)
goto out;
#ifdef CONFIG_HAVE_KVM_IRQCHIP
INIT_HLIST_HEAD(&kvm->mask_notifier_list);
+ INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
#endif

#ifdef KVM_COALESCED_MMIO_PAGE_OFFSET
--
1.6.5.2

2009-11-16 12:29:14

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 14/42] KVM: Convert irq notifiers lists to RCU locking

From: Gleb Natapov <[email protected]>

Use RCU locking for mask/ack notifiers lists.

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
virt/kvm/irq_comm.c | 22 ++++++++++++----------
1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index f019725..6c94614 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -183,19 +183,19 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)

rcu_read_lock();
gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
- rcu_read_unlock();
-
if (gsi != -1)
- hlist_for_each_entry(kian, n, &kvm->irq_ack_notifier_list, link)
+ hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
+ link)
if (kian->gsi == gsi)
kian->irq_acked(kian);
+ rcu_read_unlock();
}

void kvm_register_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian)
{
mutex_lock(&kvm->irq_lock);
- hlist_add_head(&kian->link, &kvm->irq_ack_notifier_list);
+ hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
mutex_unlock(&kvm->irq_lock);
}

@@ -203,8 +203,9 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian)
{
mutex_lock(&kvm->irq_lock);
- hlist_del_init(&kian->link);
+ hlist_del_init_rcu(&kian->link);
mutex_unlock(&kvm->irq_lock);
+ synchronize_rcu();
}

int kvm_request_irq_source_id(struct kvm *kvm)
@@ -257,7 +258,7 @@ void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
{
mutex_lock(&kvm->irq_lock);
kimn->irq = irq;
- hlist_add_head(&kimn->link, &kvm->mask_notifier_list);
+ hlist_add_head_rcu(&kimn->link, &kvm->mask_notifier_list);
mutex_unlock(&kvm->irq_lock);
}

@@ -265,8 +266,9 @@ void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
struct kvm_irq_mask_notifier *kimn)
{
mutex_lock(&kvm->irq_lock);
- hlist_del(&kimn->link);
+ hlist_del_rcu(&kimn->link);
mutex_unlock(&kvm->irq_lock);
+ synchronize_rcu();
}

void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask)
@@ -274,11 +276,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask)
struct kvm_irq_mask_notifier *kimn;
struct hlist_node *n;

- WARN_ON(!mutex_is_locked(&kvm->irq_lock));
-
- hlist_for_each_entry(kimn, n, &kvm->mask_notifier_list, link)
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(kimn, n, &kvm->mask_notifier_list, link)
if (kimn->irq == irq)
kimn->func(kimn, mask);
+ rcu_read_unlock();
}

void kvm_free_irq_routing(struct kvm *kvm)
--
1.6.5.2

2009-11-16 12:20:17

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 15/42] KVM: Move IO APIC to its own lock

From: Gleb Natapov <[email protected]>

The allows removal of irq_lock from the injection path.

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/ia64/kvm/kvm-ia64.c | 7 +---
arch/x86/kvm/i8259.c | 22 +++++++++---
arch/x86/kvm/lapic.c | 5 +--
arch/x86/kvm/x86.c | 10 +----
virt/kvm/ioapic.c | 80 +++++++++++++++++++++++++++++++++++-----------
virt/kvm/ioapic.h | 4 ++
virt/kvm/irq_comm.c | 23 ++++++++-----
7 files changed, 100 insertions(+), 51 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 0ad09f0..4a98314 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -851,8 +851,7 @@ static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm,
r = 0;
switch (chip->chip_id) {
case KVM_IRQCHIP_IOAPIC:
- memcpy(&chip->chip.ioapic, ioapic_irqchip(kvm),
- sizeof(struct kvm_ioapic_state));
+ r = kvm_get_ioapic(kvm, &chip->chip.ioapic);
break;
default:
r = -EINVAL;
@@ -868,9 +867,7 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
r = 0;
switch (chip->chip_id) {
case KVM_IRQCHIP_IOAPIC:
- memcpy(ioapic_irqchip(kvm),
- &chip->chip.ioapic,
- sizeof(struct kvm_ioapic_state));
+ r = kvm_set_ioapic(kvm, &chip->chip.ioapic);
break;
default:
r = -EINVAL;
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index ccc941a..d057c0c 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -38,7 +38,15 @@ static void pic_clear_isr(struct kvm_kpic_state *s, int irq)
s->isr_ack |= (1 << irq);
if (s != &s->pics_state->pics[0])
irq += 8;
+ /*
+ * We are dropping lock while calling ack notifiers since ack
+ * notifier callbacks for assigned devices call into PIC recursively.
+ * Other interrupt may be delivered to PIC while lock is dropped but
+ * it should be safe since PIC state is already updated at this stage.
+ */
+ spin_unlock(&s->pics_state->lock);
kvm_notify_acked_irq(s->pics_state->kvm, SELECT_PIC(irq), irq);
+ spin_lock(&s->pics_state->lock);
}

void kvm_pic_clear_isr_ack(struct kvm *kvm)
@@ -176,16 +184,18 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
static inline void pic_intack(struct kvm_kpic_state *s, int irq)
{
s->isr |= 1 << irq;
- if (s->auto_eoi) {
- if (s->rotate_on_auto_eoi)
- s->priority_add = (irq + 1) & 7;
- pic_clear_isr(s, irq);
- }
/*
* We don't clear a level sensitive interrupt here
*/
if (!(s->elcr & (1 << irq)))
s->irr &= ~(1 << irq);
+
+ if (s->auto_eoi) {
+ if (s->rotate_on_auto_eoi)
+ s->priority_add = (irq + 1) & 7;
+ pic_clear_isr(s, irq);
+ }
+
}

int kvm_pic_read_irq(struct kvm *kvm)
@@ -294,9 +304,9 @@ static void pic_ioport_write(void *opaque, u32 addr, u32 val)
priority = get_priority(s, s->isr);
if (priority != 8) {
irq = (priority + s->priority_add) & 7;
- pic_clear_isr(s, irq);
if (cmd == 5)
s->priority_add = (irq + 1) & 7;
+ pic_clear_isr(s, irq);
pic_update_irq(s->pics_state);
}
break;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 23c2176..df8bcb0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -471,11 +471,8 @@ static void apic_set_eoi(struct kvm_lapic *apic)
trigger_mode = IOAPIC_LEVEL_TRIG;
else
trigger_mode = IOAPIC_EDGE_TRIG;
- if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI)) {
- mutex_lock(&apic->vcpu->kvm->irq_lock);
+ if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
- mutex_unlock(&apic->vcpu->kvm->irq_lock);
- }
}

static void apic_send_ipi(struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1687d12..fdf989f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2038,9 +2038,7 @@ static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
sizeof(struct kvm_pic_state));
break;
case KVM_IRQCHIP_IOAPIC:
- memcpy(&chip->chip.ioapic,
- ioapic_irqchip(kvm),
- sizeof(struct kvm_ioapic_state));
+ r = kvm_get_ioapic(kvm, &chip->chip.ioapic);
break;
default:
r = -EINVAL;
@@ -2070,11 +2068,7 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
spin_unlock(&pic_irqchip(kvm)->lock);
break;
case KVM_IRQCHIP_IOAPIC:
- mutex_lock(&kvm->irq_lock);
- memcpy(ioapic_irqchip(kvm),
- &chip->chip.ioapic,
- sizeof(struct kvm_ioapic_state));
- mutex_unlock(&kvm->irq_lock);
+ r = kvm_set_ioapic(kvm, &chip->chip.ioapic);
break;
default:
r = -EINVAL;
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 9fe140b..38a2d20 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
union kvm_ioapic_redirect_entry entry;
int ret = 1;

+ mutex_lock(&ioapic->lock);
if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
entry = ioapic->redirtbl[irq];
level ^= entry.fields.polarity;
@@ -198,34 +199,51 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
}
trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0);
}
+ mutex_unlock(&ioapic->lock);
+
return ret;
}

-static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int pin,
- int trigger_mode)
+static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,
+ int trigger_mode)
{
- union kvm_ioapic_redirect_entry *ent;
+ int i;
+
+ for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+ union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];

- ent = &ioapic->redirtbl[pin];
+ if (ent->fields.vector != vector)
+ continue;

- kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, pin);
+ /*
+ * We are dropping lock while calling ack notifiers because ack
+ * notifier callbacks for assigned devices call into IOAPIC
+ * recursively. Since remote_irr is cleared only after call
+ * to notifiers if the same vector will be delivered while lock
+ * is dropped it will be put into irr and will be delivered
+ * after ack notifier returns.
+ */
+ mutex_unlock(&ioapic->lock);
+ kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
+ mutex_lock(&ioapic->lock);
+
+ if (trigger_mode != IOAPIC_LEVEL_TRIG)
+ continue;

- if (trigger_mode == IOAPIC_LEVEL_TRIG) {
ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
ent->fields.remote_irr = 0;
- if (!ent->fields.mask && (ioapic->irr & (1 << pin)))
- ioapic_service(ioapic, pin);
+ if (!ent->fields.mask && (ioapic->irr & (1 << i)))
+ ioapic_service(ioapic, i);
}
}

void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int trigger_mode)
{
struct kvm_ioapic *ioapic = kvm->arch.vioapic;
- int i;

- for (i = 0; i < IOAPIC_NUM_PINS; i++)
- if (ioapic->redirtbl[i].fields.vector == vector)
- __kvm_ioapic_update_eoi(ioapic, i, trigger_mode);
+ mutex_lock(&ioapic->lock);
+ __kvm_ioapic_update_eoi(ioapic, vector, trigger_mode);
+ mutex_unlock(&ioapic->lock);
}

static inline struct kvm_ioapic *to_ioapic(struct kvm_io_device *dev)
@@ -250,8 +268,8 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
ioapic_debug("addr %lx\n", (unsigned long)addr);
ASSERT(!(addr & 0xf)); /* check alignment */

- mutex_lock(&ioapic->kvm->irq_lock);
addr &= 0xff;
+ mutex_lock(&ioapic->lock);
switch (addr) {
case IOAPIC_REG_SELECT:
result = ioapic->ioregsel;
@@ -265,6 +283,8 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
result = 0;
break;
}
+ mutex_unlock(&ioapic->lock);
+
switch (len) {
case 8:
*(u64 *) val = result;
@@ -277,7 +297,6 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
default:
printk(KERN_WARNING "ioapic: wrong length %d\n", len);
}
- mutex_unlock(&ioapic->kvm->irq_lock);
return 0;
}

@@ -293,15 +312,15 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
(void*)addr, len, val);
ASSERT(!(addr & 0xf)); /* check alignment */

- mutex_lock(&ioapic->kvm->irq_lock);
if (len == 4 || len == 8)
data = *(u32 *) val;
else {
printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
- goto unlock;
+ return 0;
}

addr &= 0xff;
+ mutex_lock(&ioapic->lock);
switch (addr) {
case IOAPIC_REG_SELECT:
ioapic->ioregsel = data;
@@ -312,15 +331,14 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
break;
#ifdef CONFIG_IA64
case IOAPIC_REG_EOI:
- kvm_ioapic_update_eoi(ioapic->kvm, data, IOAPIC_LEVEL_TRIG);
+ __kvm_ioapic_update_eoi(ioapic, data, IOAPIC_LEVEL_TRIG);
break;
#endif

default:
break;
}
-unlock:
- mutex_unlock(&ioapic->kvm->irq_lock);
+ mutex_unlock(&ioapic->lock);
return 0;
}

@@ -349,6 +367,7 @@ int kvm_ioapic_init(struct kvm *kvm)
ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL);
if (!ioapic)
return -ENOMEM;
+ mutex_init(&ioapic->lock);
kvm->arch.vioapic = ioapic;
kvm_ioapic_reset(ioapic);
kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
@@ -360,3 +379,26 @@ int kvm_ioapic_init(struct kvm *kvm)
return ret;
}

+int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state)
+{
+ struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
+ if (!ioapic)
+ return -EINVAL;
+
+ mutex_lock(&ioapic->lock);
+ memcpy(state, ioapic, sizeof(struct kvm_ioapic_state));
+ mutex_unlock(&ioapic->lock);
+ return 0;
+}
+
+int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state)
+{
+ struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
+ if (!ioapic)
+ return -EINVAL;
+
+ mutex_lock(&ioapic->lock);
+ memcpy(ioapic, state, sizeof(struct kvm_ioapic_state));
+ mutex_unlock(&ioapic->lock);
+ return 0;
+}
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 6e461ad..419c43b 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -45,6 +45,7 @@ struct kvm_ioapic {
struct kvm_io_device dev;
struct kvm *kvm;
void (*ack_notifier)(void *opaque, int irq);
+ struct mutex lock;
};

#ifdef DEBUG
@@ -74,4 +75,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level);
void kvm_ioapic_reset(struct kvm_ioapic *ioapic);
int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
struct kvm_lapic_irq *irq);
+int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
+int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
+
#endif
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 6c94614..fadf440 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -146,8 +146,8 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
*/
int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
{
- struct kvm_kernel_irq_routing_entry *e;
- int ret = -1;
+ struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
+ int ret = -1, i = 0;
struct kvm_irq_routing_table *irq_rt;
struct hlist_node *n;

@@ -162,14 +162,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
rcu_read_lock();
irq_rt = rcu_dereference(kvm->irq_routing);
if (irq < irq_rt->nr_rt_entries)
- hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
- int r = e->set(e, kvm, irq_source_id, level);
- if (r < 0)
- continue;
-
- ret = r + ((ret < 0) ? 0 : ret);
- }
+ hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
+ irq_set[i++] = *e;
rcu_read_unlock();
+
+ while(i--) {
+ int r;
+ r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
+ if (r < 0)
+ continue;
+
+ ret = r + ((ret < 0) ? 0 : ret);
+ }
+
return ret;
}

--
1.6.5.2

2009-11-16 12:20:25

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 16/42] KVM: Drop kvm->irq_lock lock from irq injection path

From: Gleb Natapov <[email protected]>

The only thing it protects now is interrupt injection into lapic and
this can work lockless. Even now with kvm->irq_lock in place access
to lapic is not entirely serialized since vcpu access doesn't take
kvm->irq_lock.

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/ia64/kvm/kvm-ia64.c | 2 --
arch/x86/kvm/i8254.c | 2 --
arch/x86/kvm/lapic.c | 2 --
arch/x86/kvm/x86.c | 2 --
virt/kvm/eventfd.c | 2 --
virt/kvm/irq_comm.c | 6 +-----
virt/kvm/kvm_main.c | 2 --
7 files changed, 1 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 4a98314..f534e0f 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -982,10 +982,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
goto out;
if (irqchip_in_kernel(kvm)) {
__s32 status;
- mutex_lock(&kvm->irq_lock);
status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID,
irq_event.irq, irq_event.level);
- mutex_unlock(&kvm->irq_lock);
if (ioctl == KVM_IRQ_LINE_STATUS) {
irq_event.status = status;
if (copy_to_user(argp, &irq_event,
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 144e7f6..fab7440 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -688,10 +688,8 @@ static void __inject_pit_timer_intr(struct kvm *kvm)
struct kvm_vcpu *vcpu;
int i;

- mutex_lock(&kvm->irq_lock);
kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
- mutex_unlock(&kvm->irq_lock);

/*
* Provides NMI watchdog support via Virtual Wire mode.
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index df8bcb0..8787637 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -501,9 +501,7 @@ static void apic_send_ipi(struct kvm_lapic *apic)
irq.trig_mode, irq.level, irq.dest_mode, irq.delivery_mode,
irq.vector);

- mutex_lock(&apic->vcpu->kvm->irq_lock);
kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq);
- mutex_unlock(&apic->vcpu->kvm->irq_lock);
}

static u32 apic_get_tmcct(struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fdf989f..5beb4c1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2286,10 +2286,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
goto out;
if (irqchip_in_kernel(kvm)) {
__s32 status;
- mutex_lock(&kvm->irq_lock);
status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID,
irq_event.irq, irq_event.level);
- mutex_unlock(&kvm->irq_lock);
if (ioctl == KVM_IRQ_LINE_STATUS) {
irq_event.status = status;
if (copy_to_user(argp, &irq_event,
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index bb4ebd8..30f70fd 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -61,10 +61,8 @@ irqfd_inject(struct work_struct *work)
struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
struct kvm *kvm = irqfd->kvm;

- mutex_lock(&kvm->irq_lock);
kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
- mutex_unlock(&kvm->irq_lock);
}

/*
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index fadf440..15a83b9 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -82,8 +82,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
int i, r = -1;
struct kvm_vcpu *vcpu, *lowest = NULL;

- WARN_ON(!mutex_is_locked(&kvm->irq_lock));
-
if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
kvm_is_dm_lowest_prio(irq))
printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
@@ -138,7 +136,7 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
return kvm_irq_delivery_to_apic(kvm, NULL, &irq);
}

-/* This should be called with the kvm->irq_lock mutex held
+/*
* Return value:
* < 0 Interrupt was ignored (masked or not delivered for other reasons)
* = 0 Interrupt was coalesced (previous irq is still pending)
@@ -153,8 +151,6 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)

trace_kvm_set_irq(irq, level, irq_source_id);

- WARN_ON(!mutex_is_locked(&kvm->irq_lock));
-
/* Not possible to detect if the guest uses the PIC or the
* IOAPIC. So set the bit in both. The guest will ignore
* writes to the unused one.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6eca153..c12c95b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -137,7 +137,6 @@ static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
interrupt_work);
kvm = assigned_dev->kvm;

- mutex_lock(&kvm->irq_lock);
spin_lock_irq(&assigned_dev->assigned_dev_lock);
if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
struct kvm_guest_msix_entry *guest_entries =
@@ -156,7 +155,6 @@ static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
assigned_dev->guest_irq, 1);

spin_unlock_irq(&assigned_dev->assigned_dev_lock);
- mutex_unlock(&assigned_dev->kvm->irq_lock);
}

static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
--
1.6.5.2

2009-11-16 12:28:12

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 17/42] KVM: Return -ENOTTY on unrecognized ioctls

Not the incorrect -EINVAL.

Signed-off-by: Avi Kivity <[email protected]>
---
arch/ia64/kvm/kvm-ia64.c | 2 +-
arch/powerpc/kvm/powerpc.c | 2 +-
arch/s390/kvm/kvm-s390.c | 2 +-
arch/x86/kvm/x86.c | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index f534e0f..f6471c8 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -941,7 +941,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
{
struct kvm *kvm = filp->private_data;
void __user *argp = (void __user *)arg;
- int r = -EINVAL;
+ int r = -ENOTTY;

switch (ioctl) {
case KVM_SET_MEMORY_REGION: {
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2a4551f..95af622 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -421,7 +421,7 @@ long kvm_arch_vm_ioctl(struct file *filp,

switch (ioctl) {
default:
- r = -EINVAL;
+ r = -ENOTTY;
}

return r;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 07ced89..00e2ce8 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -150,7 +150,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
break;
}
default:
- r = -EINVAL;
+ r = -ENOTTY;
}

return r;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5beb4c1..829e306 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2176,7 +2176,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
{
struct kvm *kvm = filp->private_data;
void __user *argp = (void __user *)arg;
- int r = -EINVAL;
+ int r = -ENOTTY;
/*
* This union makes it completely explicit to gcc-3.x
* that these two variables' stack usage should be
--
1.6.5.2

2009-11-16 12:20:11

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 18/42] KVM: Move assigned device code to own file

Signed-off-by: Avi Kivity <[email protected]>
---
arch/ia64/kvm/Makefile | 2 +-
arch/x86/kvm/Makefile | 3 +-
include/linux/kvm_host.h | 17 +
virt/kvm/assigned-dev.c | 818 ++++++++++++++++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 798 +--------------------------------------------
5 files changed, 840 insertions(+), 798 deletions(-)
create mode 100644 virt/kvm/assigned-dev.c

diff --git a/arch/ia64/kvm/Makefile b/arch/ia64/kvm/Makefile
index 0bb99b7..1089b3e 100644
--- a/arch/ia64/kvm/Makefile
+++ b/arch/ia64/kvm/Makefile
@@ -49,7 +49,7 @@ EXTRA_CFLAGS += -Ivirt/kvm -Iarch/ia64/kvm/
EXTRA_AFLAGS += -Ivirt/kvm -Iarch/ia64/kvm/

common-objs = $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
- coalesced_mmio.o irq_comm.o)
+ coalesced_mmio.o irq_comm.o assigned-dev.o)

ifeq ($(CONFIG_IOMMU_API),y)
common-objs += $(addprefix ../../../virt/kvm/, iommu.o)
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 0e7fe78..31a7035 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -6,7 +6,8 @@ CFLAGS_svm.o := -I.
CFLAGS_vmx.o := -I.

kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
- coalesced_mmio.o irq_comm.o eventfd.o)
+ coalesced_mmio.o irq_comm.o eventfd.o \
+ assigned-dev.o)
kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o)

kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4aa5e1d..c0a1cc3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -577,4 +577,21 @@ static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
return vcpu->kvm->bsp_vcpu_id == vcpu->vcpu_id;
}
#endif
+
+#ifdef __KVM_HAVE_DEVICE_ASSIGNMENT
+
+long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
+ unsigned long arg);
+
+#else
+
+static inline long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
+ unsigned long arg)
+{
+ return -ENOTTY;
+}
+
#endif
+
+#endif
+
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
new file mode 100644
index 0000000..fd9c097
--- /dev/null
+++ b/virt/kvm/assigned-dev.c
@@ -0,0 +1,818 @@
+/*
+ * Kernel-based Virtual Machine - device assignment support
+ *
+ * Copyright (C) 2006-9 Red Hat, Inc
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/errno.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+#include <linux/interrupt.h>
+#include "irq.h"
+
+static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head,
+ int assigned_dev_id)
+{
+ struct list_head *ptr;
+ struct kvm_assigned_dev_kernel *match;
+
+ list_for_each(ptr, head) {
+ match = list_entry(ptr, struct kvm_assigned_dev_kernel, list);
+ if (match->assigned_dev_id == assigned_dev_id)
+ return match;
+ }
+ return NULL;
+}
+
+static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
+ *assigned_dev, int irq)
+{
+ int i, index;
+ struct msix_entry *host_msix_entries;
+
+ host_msix_entries = assigned_dev->host_msix_entries;
+
+ index = -1;
+ for (i = 0; i < assigned_dev->entries_nr; i++)
+ if (irq == host_msix_entries[i].vector) {
+ index = i;
+ break;
+ }
+ if (index < 0) {
+ printk(KERN_WARNING "Fail to find correlated MSI-X entry!\n");
+ return 0;
+ }
+
+ return index;
+}
+
+static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
+{
+ struct kvm_assigned_dev_kernel *assigned_dev;
+ struct kvm *kvm;
+ int i;
+
+ assigned_dev = container_of(work, struct kvm_assigned_dev_kernel,
+ interrupt_work);
+ kvm = assigned_dev->kvm;
+
+ spin_lock_irq(&assigned_dev->assigned_dev_lock);
+ if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
+ struct kvm_guest_msix_entry *guest_entries =
+ assigned_dev->guest_msix_entries;
+ for (i = 0; i < assigned_dev->entries_nr; i++) {
+ if (!(guest_entries[i].flags &
+ KVM_ASSIGNED_MSIX_PENDING))
+ continue;
+ guest_entries[i].flags &= ~KVM_ASSIGNED_MSIX_PENDING;
+ kvm_set_irq(assigned_dev->kvm,
+ assigned_dev->irq_source_id,
+ guest_entries[i].vector, 1);
+ }
+ } else
+ kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
+ assigned_dev->guest_irq, 1);
+
+ spin_unlock_irq(&assigned_dev->assigned_dev_lock);
+}
+
+static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
+{
+ unsigned long flags;
+ struct kvm_assigned_dev_kernel *assigned_dev =
+ (struct kvm_assigned_dev_kernel *) dev_id;
+
+ spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
+ if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
+ int index = find_index_from_host_irq(assigned_dev, irq);
+ if (index < 0)
+ goto out;
+ assigned_dev->guest_msix_entries[index].flags |=
+ KVM_ASSIGNED_MSIX_PENDING;
+ }
+
+ schedule_work(&assigned_dev->interrupt_work);
+
+ if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_GUEST_INTX) {
+ disable_irq_nosync(irq);
+ assigned_dev->host_irq_disabled = true;
+ }
+
+out:
+ spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
+ return IRQ_HANDLED;
+}
+
+/* Ack the irq line for an assigned device */
+static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
+{
+ struct kvm_assigned_dev_kernel *dev;
+ unsigned long flags;
+
+ if (kian->gsi == -1)
+ return;
+
+ dev = container_of(kian, struct kvm_assigned_dev_kernel,
+ ack_notifier);
+
+ kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
+
+ /* The guest irq may be shared so this ack may be
+ * from another device.
+ */
+ spin_lock_irqsave(&dev->assigned_dev_lock, flags);
+ if (dev->host_irq_disabled) {
+ enable_irq(dev->host_irq);
+ dev->host_irq_disabled = false;
+ }
+ spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
+}
+
+static void deassign_guest_irq(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev)
+{
+ kvm_unregister_irq_ack_notifier(kvm, &assigned_dev->ack_notifier);
+ assigned_dev->ack_notifier.gsi = -1;
+
+ if (assigned_dev->irq_source_id != -1)
+ kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
+ assigned_dev->irq_source_id = -1;
+ assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_GUEST_MASK);
+}
+
+/* The function implicit hold kvm->lock mutex due to cancel_work_sync() */
+static void deassign_host_irq(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev)
+{
+ /*
+ * In kvm_free_device_irq, cancel_work_sync return true if:
+ * 1. work is scheduled, and then cancelled.
+ * 2. work callback is executed.
+ *
+ * The first one ensured that the irq is disabled and no more events
+ * would happen. But for the second one, the irq may be enabled (e.g.
+ * for MSI). So we disable irq here to prevent further events.
+ *
+ * Notice this maybe result in nested disable if the interrupt type is
+ * INTx, but it's OK for we are going to free it.
+ *
+ * If this function is a part of VM destroy, please ensure that till
+ * now, the kvm state is still legal for probably we also have to wait
+ * interrupt_work done.
+ */
+ if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
+ int i;
+ for (i = 0; i < assigned_dev->entries_nr; i++)
+ disable_irq_nosync(assigned_dev->
+ host_msix_entries[i].vector);
+
+ cancel_work_sync(&assigned_dev->interrupt_work);
+
+ for (i = 0; i < assigned_dev->entries_nr; i++)
+ free_irq(assigned_dev->host_msix_entries[i].vector,
+ (void *)assigned_dev);
+
+ assigned_dev->entries_nr = 0;
+ kfree(assigned_dev->host_msix_entries);
+ kfree(assigned_dev->guest_msix_entries);
+ pci_disable_msix(assigned_dev->dev);
+ } else {
+ /* Deal with MSI and INTx */
+ disable_irq_nosync(assigned_dev->host_irq);
+ cancel_work_sync(&assigned_dev->interrupt_work);
+
+ free_irq(assigned_dev->host_irq, (void *)assigned_dev);
+
+ if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI)
+ pci_disable_msi(assigned_dev->dev);
+ }
+
+ assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_HOST_MASK);
+}
+
+static int kvm_deassign_irq(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev,
+ unsigned long irq_requested_type)
+{
+ unsigned long guest_irq_type, host_irq_type;
+
+ if (!irqchip_in_kernel(kvm))
+ return -EINVAL;
+ /* no irq assignment to deassign */
+ if (!assigned_dev->irq_requested_type)
+ return -ENXIO;
+
+ host_irq_type = irq_requested_type & KVM_DEV_IRQ_HOST_MASK;
+ guest_irq_type = irq_requested_type & KVM_DEV_IRQ_GUEST_MASK;
+
+ if (host_irq_type)
+ deassign_host_irq(kvm, assigned_dev);
+ if (guest_irq_type)
+ deassign_guest_irq(kvm, assigned_dev);
+
+ return 0;
+}
+
+static void kvm_free_assigned_irq(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev)
+{
+ kvm_deassign_irq(kvm, assigned_dev, assigned_dev->irq_requested_type);
+}
+
+static void kvm_free_assigned_device(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel
+ *assigned_dev)
+{
+ kvm_free_assigned_irq(kvm, assigned_dev);
+
+ pci_reset_function(assigned_dev->dev);
+
+ pci_release_regions(assigned_dev->dev);
+ pci_disable_device(assigned_dev->dev);
+ pci_dev_put(assigned_dev->dev);
+
+ list_del(&assigned_dev->list);
+ kfree(assigned_dev);
+}
+
+void kvm_free_all_assigned_devices(struct kvm *kvm)
+{
+ struct list_head *ptr, *ptr2;
+ struct kvm_assigned_dev_kernel *assigned_dev;
+
+ list_for_each_safe(ptr, ptr2, &kvm->arch.assigned_dev_head) {
+ assigned_dev = list_entry(ptr,
+ struct kvm_assigned_dev_kernel,
+ list);
+
+ kvm_free_assigned_device(kvm, assigned_dev);
+ }
+}
+
+static int assigned_device_enable_host_intx(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *dev)
+{
+ dev->host_irq = dev->dev->irq;
+ /* Even though this is PCI, we don't want to use shared
+ * interrupts. Sharing host devices with guest-assigned devices
+ * on the same interrupt line is not a happy situation: there
+ * are going to be long delays in accepting, acking, etc.
+ */
+ if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
+ 0, "kvm_assigned_intx_device", (void *)dev))
+ return -EIO;
+ return 0;
+}
+
+#ifdef __KVM_HAVE_MSI
+static int assigned_device_enable_host_msi(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *dev)
+{
+ int r;
+
+ if (!dev->dev->msi_enabled) {
+ r = pci_enable_msi(dev->dev);
+ if (r)
+ return r;
+ }
+
+ dev->host_irq = dev->dev->irq;
+ if (request_irq(dev->host_irq, kvm_assigned_dev_intr, 0,
+ "kvm_assigned_msi_device", (void *)dev)) {
+ pci_disable_msi(dev->dev);
+ return -EIO;
+ }
+
+ return 0;
+}
+#endif
+
+#ifdef __KVM_HAVE_MSIX
+static int assigned_device_enable_host_msix(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *dev)
+{
+ int i, r = -EINVAL;
+
+ /* host_msix_entries and guest_msix_entries should have been
+ * initialized */
+ if (dev->entries_nr == 0)
+ return r;
+
+ r = pci_enable_msix(dev->dev, dev->host_msix_entries, dev->entries_nr);
+ if (r)
+ return r;
+
+ for (i = 0; i < dev->entries_nr; i++) {
+ r = request_irq(dev->host_msix_entries[i].vector,
+ kvm_assigned_dev_intr, 0,
+ "kvm_assigned_msix_device",
+ (void *)dev);
+ /* FIXME: free requested_irq's on failure */
+ if (r)
+ return r;
+ }
+
+ return 0;
+}
+
+#endif
+
+static int assigned_device_enable_guest_intx(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *dev,
+ struct kvm_assigned_irq *irq)
+{
+ dev->guest_irq = irq->guest_irq;
+ dev->ack_notifier.gsi = irq->guest_irq;
+ return 0;
+}
+
+#ifdef __KVM_HAVE_MSI
+static int assigned_device_enable_guest_msi(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *dev,
+ struct kvm_assigned_irq *irq)
+{
+ dev->guest_irq = irq->guest_irq;
+ dev->ack_notifier.gsi = -1;
+ dev->host_irq_disabled = false;
+ return 0;
+}
+#endif
+
+#ifdef __KVM_HAVE_MSIX
+static int assigned_device_enable_guest_msix(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *dev,
+ struct kvm_assigned_irq *irq)
+{
+ dev->guest_irq = irq->guest_irq;
+ dev->ack_notifier.gsi = -1;
+ dev->host_irq_disabled = false;
+ return 0;
+}
+#endif
+
+static int assign_host_irq(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *dev,
+ __u32 host_irq_type)
+{
+ int r = -EEXIST;
+
+ if (dev->irq_requested_type & KVM_DEV_IRQ_HOST_MASK)
+ return r;
+
+ switch (host_irq_type) {
+ case KVM_DEV_IRQ_HOST_INTX:
+ r = assigned_device_enable_host_intx(kvm, dev);
+ break;
+#ifdef __KVM_HAVE_MSI
+ case KVM_DEV_IRQ_HOST_MSI:
+ r = assigned_device_enable_host_msi(kvm, dev);
+ break;
+#endif
+#ifdef __KVM_HAVE_MSIX
+ case KVM_DEV_IRQ_HOST_MSIX:
+ r = assigned_device_enable_host_msix(kvm, dev);
+ break;
+#endif
+ default:
+ r = -EINVAL;
+ }
+
+ if (!r)
+ dev->irq_requested_type |= host_irq_type;
+
+ return r;
+}
+
+static int assign_guest_irq(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *dev,
+ struct kvm_assigned_irq *irq,
+ unsigned long guest_irq_type)
+{
+ int id;
+ int r = -EEXIST;
+
+ if (dev->irq_requested_type & KVM_DEV_IRQ_GUEST_MASK)
+ return r;
+
+ id = kvm_request_irq_source_id(kvm);
+ if (id < 0)
+ return id;
+
+ dev->irq_source_id = id;
+
+ switch (guest_irq_type) {
+ case KVM_DEV_IRQ_GUEST_INTX:
+ r = assigned_device_enable_guest_intx(kvm, dev, irq);
+ break;
+#ifdef __KVM_HAVE_MSI
+ case KVM_DEV_IRQ_GUEST_MSI:
+ r = assigned_device_enable_guest_msi(kvm, dev, irq);
+ break;
+#endif
+#ifdef __KVM_HAVE_MSIX
+ case KVM_DEV_IRQ_GUEST_MSIX:
+ r = assigned_device_enable_guest_msix(kvm, dev, irq);
+ break;
+#endif
+ default:
+ r = -EINVAL;
+ }
+
+ if (!r) {
+ dev->irq_requested_type |= guest_irq_type;
+ kvm_register_irq_ack_notifier(kvm, &dev->ack_notifier);
+ } else
+ kvm_free_irq_source_id(kvm, dev->irq_source_id);
+
+ return r;
+}
+
+/* TODO Deal with KVM_DEV_IRQ_ASSIGNED_MASK_MSIX */
+static int kvm_vm_ioctl_assign_irq(struct kvm *kvm,
+ struct kvm_assigned_irq *assigned_irq)
+{
+ int r = -EINVAL;
+ struct kvm_assigned_dev_kernel *match;
+ unsigned long host_irq_type, guest_irq_type;
+
+ if (!capable(CAP_SYS_RAWIO))
+ return -EPERM;
+
+ if (!irqchip_in_kernel(kvm))
+ return r;
+
+ mutex_lock(&kvm->lock);
+ r = -ENODEV;
+ match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ assigned_irq->assigned_dev_id);
+ if (!match)
+ goto out;
+
+ host_irq_type = (assigned_irq->flags & KVM_DEV_IRQ_HOST_MASK);
+ guest_irq_type = (assigned_irq->flags & KVM_DEV_IRQ_GUEST_MASK);
+
+ r = -EINVAL;
+ /* can only assign one type at a time */
+ if (hweight_long(host_irq_type) > 1)
+ goto out;
+ if (hweight_long(guest_irq_type) > 1)
+ goto out;
+ if (host_irq_type == 0 && guest_irq_type == 0)
+ goto out;
+
+ r = 0;
+ if (host_irq_type)
+ r = assign_host_irq(kvm, match, host_irq_type);
+ if (r)
+ goto out;
+
+ if (guest_irq_type)
+ r = assign_guest_irq(kvm, match, assigned_irq, guest_irq_type);
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
+static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
+ struct kvm_assigned_irq
+ *assigned_irq)
+{
+ int r = -ENODEV;
+ struct kvm_assigned_dev_kernel *match;
+
+ mutex_lock(&kvm->lock);
+
+ match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ assigned_irq->assigned_dev_id);
+ if (!match)
+ goto out;
+
+ r = kvm_deassign_irq(kvm, match, assigned_irq->flags);
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
+static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
+ struct kvm_assigned_pci_dev *assigned_dev)
+{
+ int r = 0;
+ struct kvm_assigned_dev_kernel *match;
+ struct pci_dev *dev;
+
+ down_read(&kvm->slots_lock);
+ mutex_lock(&kvm->lock);
+
+ match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ assigned_dev->assigned_dev_id);
+ if (match) {
+ /* device already assigned */
+ r = -EEXIST;
+ goto out;
+ }
+
+ match = kzalloc(sizeof(struct kvm_assigned_dev_kernel), GFP_KERNEL);
+ if (match == NULL) {
+ printk(KERN_INFO "%s: Couldn't allocate memory\n",
+ __func__);
+ r = -ENOMEM;
+ goto out;
+ }
+ dev = pci_get_bus_and_slot(assigned_dev->busnr,
+ assigned_dev->devfn);
+ if (!dev) {
+ printk(KERN_INFO "%s: host device not found\n", __func__);
+ r = -EINVAL;
+ goto out_free;
+ }
+ if (pci_enable_device(dev)) {
+ printk(KERN_INFO "%s: Could not enable PCI device\n", __func__);
+ r = -EBUSY;
+ goto out_put;
+ }
+ r = pci_request_regions(dev, "kvm_assigned_device");
+ if (r) {
+ printk(KERN_INFO "%s: Could not get access to device regions\n",
+ __func__);
+ goto out_disable;
+ }
+
+ pci_reset_function(dev);
+
+ match->assigned_dev_id = assigned_dev->assigned_dev_id;
+ match->host_busnr = assigned_dev->busnr;
+ match->host_devfn = assigned_dev->devfn;
+ match->flags = assigned_dev->flags;
+ match->dev = dev;
+ spin_lock_init(&match->assigned_dev_lock);
+ match->irq_source_id = -1;
+ match->kvm = kvm;
+ match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
+ INIT_WORK(&match->interrupt_work,
+ kvm_assigned_dev_interrupt_work_handler);
+
+ list_add(&match->list, &kvm->arch.assigned_dev_head);
+
+ if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
+ if (!kvm->arch.iommu_domain) {
+ r = kvm_iommu_map_guest(kvm);
+ if (r)
+ goto out_list_del;
+ }
+ r = kvm_assign_device(kvm, match);
+ if (r)
+ goto out_list_del;
+ }
+
+out:
+ mutex_unlock(&kvm->lock);
+ up_read(&kvm->slots_lock);
+ return r;
+out_list_del:
+ list_del(&match->list);
+ pci_release_regions(dev);
+out_disable:
+ pci_disable_device(dev);
+out_put:
+ pci_dev_put(dev);
+out_free:
+ kfree(match);
+ mutex_unlock(&kvm->lock);
+ up_read(&kvm->slots_lock);
+ return r;
+}
+
+static int kvm_vm_ioctl_deassign_device(struct kvm *kvm,
+ struct kvm_assigned_pci_dev *assigned_dev)
+{
+ int r = 0;
+ struct kvm_assigned_dev_kernel *match;
+
+ mutex_lock(&kvm->lock);
+
+ match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ assigned_dev->assigned_dev_id);
+ if (!match) {
+ printk(KERN_INFO "%s: device hasn't been assigned before, "
+ "so cannot be deassigned\n", __func__);
+ r = -EINVAL;
+ goto out;
+ }
+
+ if (match->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
+ kvm_deassign_device(kvm, match);
+
+ kvm_free_assigned_device(kvm, match);
+
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
+
+#ifdef __KVM_HAVE_MSIX
+static int kvm_vm_ioctl_set_msix_nr(struct kvm *kvm,
+ struct kvm_assigned_msix_nr *entry_nr)
+{
+ int r = 0;
+ struct kvm_assigned_dev_kernel *adev;
+
+ mutex_lock(&kvm->lock);
+
+ adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ entry_nr->assigned_dev_id);
+ if (!adev) {
+ r = -EINVAL;
+ goto msix_nr_out;
+ }
+
+ if (adev->entries_nr == 0) {
+ adev->entries_nr = entry_nr->entry_nr;
+ if (adev->entries_nr == 0 ||
+ adev->entries_nr >= KVM_MAX_MSIX_PER_DEV) {
+ r = -EINVAL;
+ goto msix_nr_out;
+ }
+
+ adev->host_msix_entries = kzalloc(sizeof(struct msix_entry) *
+ entry_nr->entry_nr,
+ GFP_KERNEL);
+ if (!adev->host_msix_entries) {
+ r = -ENOMEM;
+ goto msix_nr_out;
+ }
+ adev->guest_msix_entries = kzalloc(
+ sizeof(struct kvm_guest_msix_entry) *
+ entry_nr->entry_nr, GFP_KERNEL);
+ if (!adev->guest_msix_entries) {
+ kfree(adev->host_msix_entries);
+ r = -ENOMEM;
+ goto msix_nr_out;
+ }
+ } else /* Not allowed set MSI-X number twice */
+ r = -EINVAL;
+msix_nr_out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
+static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
+ struct kvm_assigned_msix_entry *entry)
+{
+ int r = 0, i;
+ struct kvm_assigned_dev_kernel *adev;
+
+ mutex_lock(&kvm->lock);
+
+ adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ entry->assigned_dev_id);
+
+ if (!adev) {
+ r = -EINVAL;
+ goto msix_entry_out;
+ }
+
+ for (i = 0; i < adev->entries_nr; i++)
+ if (adev->guest_msix_entries[i].vector == 0 ||
+ adev->guest_msix_entries[i].entry == entry->entry) {
+ adev->guest_msix_entries[i].entry = entry->entry;
+ adev->guest_msix_entries[i].vector = entry->gsi;
+ adev->host_msix_entries[i].entry = entry->entry;
+ break;
+ }
+ if (i == adev->entries_nr) {
+ r = -ENOSPC;
+ goto msix_entry_out;
+ }
+
+msix_entry_out:
+ mutex_unlock(&kvm->lock);
+
+ return r;
+}
+#endif
+
+long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
+ unsigned long arg)
+{
+ void __user *argp = (void __user *)arg;
+ int r = -ENOTTY;
+
+ switch (ioctl) {
+ case KVM_ASSIGN_PCI_DEVICE: {
+ struct kvm_assigned_pci_dev assigned_dev;
+
+ r = -EFAULT;
+ if (copy_from_user(&assigned_dev, argp, sizeof assigned_dev))
+ goto out;
+ r = kvm_vm_ioctl_assign_device(kvm, &assigned_dev);
+ if (r)
+ goto out;
+ break;
+ }
+ case KVM_ASSIGN_IRQ: {
+ r = -EOPNOTSUPP;
+ break;
+ }
+#ifdef KVM_CAP_ASSIGN_DEV_IRQ
+ case KVM_ASSIGN_DEV_IRQ: {
+ struct kvm_assigned_irq assigned_irq;
+
+ r = -EFAULT;
+ if (copy_from_user(&assigned_irq, argp, sizeof assigned_irq))
+ goto out;
+ r = kvm_vm_ioctl_assign_irq(kvm, &assigned_irq);
+ if (r)
+ goto out;
+ break;
+ }
+ case KVM_DEASSIGN_DEV_IRQ: {
+ struct kvm_assigned_irq assigned_irq;
+
+ r = -EFAULT;
+ if (copy_from_user(&assigned_irq, argp, sizeof assigned_irq))
+ goto out;
+ r = kvm_vm_ioctl_deassign_dev_irq(kvm, &assigned_irq);
+ if (r)
+ goto out;
+ break;
+ }
+#endif
+#ifdef KVM_CAP_DEVICE_DEASSIGNMENT
+ case KVM_DEASSIGN_PCI_DEVICE: {
+ struct kvm_assigned_pci_dev assigned_dev;
+
+ r = -EFAULT;
+ if (copy_from_user(&assigned_dev, argp, sizeof assigned_dev))
+ goto out;
+ r = kvm_vm_ioctl_deassign_device(kvm, &assigned_dev);
+ if (r)
+ goto out;
+ break;
+ }
+#endif
+#ifdef KVM_CAP_IRQ_ROUTING
+ case KVM_SET_GSI_ROUTING: {
+ struct kvm_irq_routing routing;
+ struct kvm_irq_routing __user *urouting;
+ struct kvm_irq_routing_entry *entries;
+
+ r = -EFAULT;
+ if (copy_from_user(&routing, argp, sizeof(routing)))
+ goto out;
+ r = -EINVAL;
+ if (routing.nr >= KVM_MAX_IRQ_ROUTES)
+ goto out;
+ if (routing.flags)
+ goto out;
+ r = -ENOMEM;
+ entries = vmalloc(routing.nr * sizeof(*entries));
+ if (!entries)
+ goto out;
+ r = -EFAULT;
+ urouting = argp;
+ if (copy_from_user(entries, urouting->entries,
+ routing.nr * sizeof(*entries)))
+ goto out_free_irq_routing;
+ r = kvm_set_irq_routing(kvm, entries, routing.nr,
+ routing.flags);
+ out_free_irq_routing:
+ vfree(entries);
+ break;
+ }
+#endif /* KVM_CAP_IRQ_ROUTING */
+#ifdef __KVM_HAVE_MSIX
+ case KVM_ASSIGN_SET_MSIX_NR: {
+ struct kvm_assigned_msix_nr entry_nr;
+ r = -EFAULT;
+ if (copy_from_user(&entry_nr, argp, sizeof entry_nr))
+ goto out;
+ r = kvm_vm_ioctl_set_msix_nr(kvm, &entry_nr);
+ if (r)
+ goto out;
+ break;
+ }
+ case KVM_ASSIGN_SET_MSIX_ENTRY: {
+ struct kvm_assigned_msix_entry entry;
+ r = -EFAULT;
+ if (copy_from_user(&entry, argp, sizeof entry))
+ goto out;
+ r = kvm_vm_ioctl_set_msix_entry(kvm, &entry);
+ if (r)
+ goto out;
+ break;
+ }
+#endif
+ }
+out:
+ return r;
+}
+
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c12c95b..38e4d2c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -53,12 +53,6 @@
#include "coalesced_mmio.h"
#endif

-#ifdef KVM_CAP_DEVICE_ASSIGNMENT
-#include <linux/pci.h>
-#include <linux/interrupt.h>
-#include "irq.h"
-#endif
-
#define CREATE_TRACE_POINTS
#include <trace/events/kvm.h>

@@ -90,608 +84,6 @@ static bool kvm_rebooting;

static bool largepages_enabled = true;

-#ifdef KVM_CAP_DEVICE_ASSIGNMENT
-static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head,
- int assigned_dev_id)
-{
- struct list_head *ptr;
- struct kvm_assigned_dev_kernel *match;
-
- list_for_each(ptr, head) {
- match = list_entry(ptr, struct kvm_assigned_dev_kernel, list);
- if (match->assigned_dev_id == assigned_dev_id)
- return match;
- }
- return NULL;
-}
-
-static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
- *assigned_dev, int irq)
-{
- int i, index;
- struct msix_entry *host_msix_entries;
-
- host_msix_entries = assigned_dev->host_msix_entries;
-
- index = -1;
- for (i = 0; i < assigned_dev->entries_nr; i++)
- if (irq == host_msix_entries[i].vector) {
- index = i;
- break;
- }
- if (index < 0) {
- printk(KERN_WARNING "Fail to find correlated MSI-X entry!\n");
- return 0;
- }
-
- return index;
-}
-
-static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
-{
- struct kvm_assigned_dev_kernel *assigned_dev;
- struct kvm *kvm;
- int i;
-
- assigned_dev = container_of(work, struct kvm_assigned_dev_kernel,
- interrupt_work);
- kvm = assigned_dev->kvm;
-
- spin_lock_irq(&assigned_dev->assigned_dev_lock);
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
- struct kvm_guest_msix_entry *guest_entries =
- assigned_dev->guest_msix_entries;
- for (i = 0; i < assigned_dev->entries_nr; i++) {
- if (!(guest_entries[i].flags &
- KVM_ASSIGNED_MSIX_PENDING))
- continue;
- guest_entries[i].flags &= ~KVM_ASSIGNED_MSIX_PENDING;
- kvm_set_irq(assigned_dev->kvm,
- assigned_dev->irq_source_id,
- guest_entries[i].vector, 1);
- }
- } else
- kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
- assigned_dev->guest_irq, 1);
-
- spin_unlock_irq(&assigned_dev->assigned_dev_lock);
-}
-
-static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
-{
- unsigned long flags;
- struct kvm_assigned_dev_kernel *assigned_dev =
- (struct kvm_assigned_dev_kernel *) dev_id;
-
- spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
- int index = find_index_from_host_irq(assigned_dev, irq);
- if (index < 0)
- goto out;
- assigned_dev->guest_msix_entries[index].flags |=
- KVM_ASSIGNED_MSIX_PENDING;
- }
-
- schedule_work(&assigned_dev->interrupt_work);
-
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_GUEST_INTX) {
- disable_irq_nosync(irq);
- assigned_dev->host_irq_disabled = true;
- }
-
-out:
- spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
- return IRQ_HANDLED;
-}
-
-/* Ack the irq line for an assigned device */
-static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
-{
- struct kvm_assigned_dev_kernel *dev;
- unsigned long flags;
-
- if (kian->gsi == -1)
- return;
-
- dev = container_of(kian, struct kvm_assigned_dev_kernel,
- ack_notifier);
-
- kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
-
- /* The guest irq may be shared so this ack may be
- * from another device.
- */
- spin_lock_irqsave(&dev->assigned_dev_lock, flags);
- if (dev->host_irq_disabled) {
- enable_irq(dev->host_irq);
- dev->host_irq_disabled = false;
- }
- spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
-}
-
-static void deassign_guest_irq(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *assigned_dev)
-{
- kvm_unregister_irq_ack_notifier(kvm, &assigned_dev->ack_notifier);
- assigned_dev->ack_notifier.gsi = -1;
-
- if (assigned_dev->irq_source_id != -1)
- kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
- assigned_dev->irq_source_id = -1;
- assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_GUEST_MASK);
-}
-
-/* The function implicit hold kvm->lock mutex due to cancel_work_sync() */
-static void deassign_host_irq(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *assigned_dev)
-{
- /*
- * In kvm_free_device_irq, cancel_work_sync return true if:
- * 1. work is scheduled, and then cancelled.
- * 2. work callback is executed.
- *
- * The first one ensured that the irq is disabled and no more events
- * would happen. But for the second one, the irq may be enabled (e.g.
- * for MSI). So we disable irq here to prevent further events.
- *
- * Notice this maybe result in nested disable if the interrupt type is
- * INTx, but it's OK for we are going to free it.
- *
- * If this function is a part of VM destroy, please ensure that till
- * now, the kvm state is still legal for probably we also have to wait
- * interrupt_work done.
- */
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
- int i;
- for (i = 0; i < assigned_dev->entries_nr; i++)
- disable_irq_nosync(assigned_dev->
- host_msix_entries[i].vector);
-
- cancel_work_sync(&assigned_dev->interrupt_work);
-
- for (i = 0; i < assigned_dev->entries_nr; i++)
- free_irq(assigned_dev->host_msix_entries[i].vector,
- (void *)assigned_dev);
-
- assigned_dev->entries_nr = 0;
- kfree(assigned_dev->host_msix_entries);
- kfree(assigned_dev->guest_msix_entries);
- pci_disable_msix(assigned_dev->dev);
- } else {
- /* Deal with MSI and INTx */
- disable_irq_nosync(assigned_dev->host_irq);
- cancel_work_sync(&assigned_dev->interrupt_work);
-
- free_irq(assigned_dev->host_irq, (void *)assigned_dev);
-
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI)
- pci_disable_msi(assigned_dev->dev);
- }
-
- assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_HOST_MASK);
-}
-
-static int kvm_deassign_irq(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *assigned_dev,
- unsigned long irq_requested_type)
-{
- unsigned long guest_irq_type, host_irq_type;
-
- if (!irqchip_in_kernel(kvm))
- return -EINVAL;
- /* no irq assignment to deassign */
- if (!assigned_dev->irq_requested_type)
- return -ENXIO;
-
- host_irq_type = irq_requested_type & KVM_DEV_IRQ_HOST_MASK;
- guest_irq_type = irq_requested_type & KVM_DEV_IRQ_GUEST_MASK;
-
- if (host_irq_type)
- deassign_host_irq(kvm, assigned_dev);
- if (guest_irq_type)
- deassign_guest_irq(kvm, assigned_dev);
-
- return 0;
-}
-
-static void kvm_free_assigned_irq(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *assigned_dev)
-{
- kvm_deassign_irq(kvm, assigned_dev, assigned_dev->irq_requested_type);
-}
-
-static void kvm_free_assigned_device(struct kvm *kvm,
- struct kvm_assigned_dev_kernel
- *assigned_dev)
-{
- kvm_free_assigned_irq(kvm, assigned_dev);
-
- pci_reset_function(assigned_dev->dev);
-
- pci_release_regions(assigned_dev->dev);
- pci_disable_device(assigned_dev->dev);
- pci_dev_put(assigned_dev->dev);
-
- list_del(&assigned_dev->list);
- kfree(assigned_dev);
-}
-
-void kvm_free_all_assigned_devices(struct kvm *kvm)
-{
- struct list_head *ptr, *ptr2;
- struct kvm_assigned_dev_kernel *assigned_dev;
-
- list_for_each_safe(ptr, ptr2, &kvm->arch.assigned_dev_head) {
- assigned_dev = list_entry(ptr,
- struct kvm_assigned_dev_kernel,
- list);
-
- kvm_free_assigned_device(kvm, assigned_dev);
- }
-}
-
-static int assigned_device_enable_host_intx(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *dev)
-{
- dev->host_irq = dev->dev->irq;
- /* Even though this is PCI, we don't want to use shared
- * interrupts. Sharing host devices with guest-assigned devices
- * on the same interrupt line is not a happy situation: there
- * are going to be long delays in accepting, acking, etc.
- */
- if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
- 0, "kvm_assigned_intx_device", (void *)dev))
- return -EIO;
- return 0;
-}
-
-#ifdef __KVM_HAVE_MSI
-static int assigned_device_enable_host_msi(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *dev)
-{
- int r;
-
- if (!dev->dev->msi_enabled) {
- r = pci_enable_msi(dev->dev);
- if (r)
- return r;
- }
-
- dev->host_irq = dev->dev->irq;
- if (request_irq(dev->host_irq, kvm_assigned_dev_intr, 0,
- "kvm_assigned_msi_device", (void *)dev)) {
- pci_disable_msi(dev->dev);
- return -EIO;
- }
-
- return 0;
-}
-#endif
-
-#ifdef __KVM_HAVE_MSIX
-static int assigned_device_enable_host_msix(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *dev)
-{
- int i, r = -EINVAL;
-
- /* host_msix_entries and guest_msix_entries should have been
- * initialized */
- if (dev->entries_nr == 0)
- return r;
-
- r = pci_enable_msix(dev->dev, dev->host_msix_entries, dev->entries_nr);
- if (r)
- return r;
-
- for (i = 0; i < dev->entries_nr; i++) {
- r = request_irq(dev->host_msix_entries[i].vector,
- kvm_assigned_dev_intr, 0,
- "kvm_assigned_msix_device",
- (void *)dev);
- /* FIXME: free requested_irq's on failure */
- if (r)
- return r;
- }
-
- return 0;
-}
-
-#endif
-
-static int assigned_device_enable_guest_intx(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *dev,
- struct kvm_assigned_irq *irq)
-{
- dev->guest_irq = irq->guest_irq;
- dev->ack_notifier.gsi = irq->guest_irq;
- return 0;
-}
-
-#ifdef __KVM_HAVE_MSI
-static int assigned_device_enable_guest_msi(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *dev,
- struct kvm_assigned_irq *irq)
-{
- dev->guest_irq = irq->guest_irq;
- dev->ack_notifier.gsi = -1;
- dev->host_irq_disabled = false;
- return 0;
-}
-#endif
-#ifdef __KVM_HAVE_MSIX
-static int assigned_device_enable_guest_msix(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *dev,
- struct kvm_assigned_irq *irq)
-{
- dev->guest_irq = irq->guest_irq;
- dev->ack_notifier.gsi = -1;
- dev->host_irq_disabled = false;
- return 0;
-}
-#endif
-
-static int assign_host_irq(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *dev,
- __u32 host_irq_type)
-{
- int r = -EEXIST;
-
- if (dev->irq_requested_type & KVM_DEV_IRQ_HOST_MASK)
- return r;
-
- switch (host_irq_type) {
- case KVM_DEV_IRQ_HOST_INTX:
- r = assigned_device_enable_host_intx(kvm, dev);
- break;
-#ifdef __KVM_HAVE_MSI
- case KVM_DEV_IRQ_HOST_MSI:
- r = assigned_device_enable_host_msi(kvm, dev);
- break;
-#endif
-#ifdef __KVM_HAVE_MSIX
- case KVM_DEV_IRQ_HOST_MSIX:
- r = assigned_device_enable_host_msix(kvm, dev);
- break;
-#endif
- default:
- r = -EINVAL;
- }
-
- if (!r)
- dev->irq_requested_type |= host_irq_type;
-
- return r;
-}
-
-static int assign_guest_irq(struct kvm *kvm,
- struct kvm_assigned_dev_kernel *dev,
- struct kvm_assigned_irq *irq,
- unsigned long guest_irq_type)
-{
- int id;
- int r = -EEXIST;
-
- if (dev->irq_requested_type & KVM_DEV_IRQ_GUEST_MASK)
- return r;
-
- id = kvm_request_irq_source_id(kvm);
- if (id < 0)
- return id;
-
- dev->irq_source_id = id;
-
- switch (guest_irq_type) {
- case KVM_DEV_IRQ_GUEST_INTX:
- r = assigned_device_enable_guest_intx(kvm, dev, irq);
- break;
-#ifdef __KVM_HAVE_MSI
- case KVM_DEV_IRQ_GUEST_MSI:
- r = assigned_device_enable_guest_msi(kvm, dev, irq);
- break;
-#endif
-#ifdef __KVM_HAVE_MSIX
- case KVM_DEV_IRQ_GUEST_MSIX:
- r = assigned_device_enable_guest_msix(kvm, dev, irq);
- break;
-#endif
- default:
- r = -EINVAL;
- }
-
- if (!r) {
- dev->irq_requested_type |= guest_irq_type;
- kvm_register_irq_ack_notifier(kvm, &dev->ack_notifier);
- } else
- kvm_free_irq_source_id(kvm, dev->irq_source_id);
-
- return r;
-}
-
-/* TODO Deal with KVM_DEV_IRQ_ASSIGNED_MASK_MSIX */
-static int kvm_vm_ioctl_assign_irq(struct kvm *kvm,
- struct kvm_assigned_irq *assigned_irq)
-{
- int r = -EINVAL;
- struct kvm_assigned_dev_kernel *match;
- unsigned long host_irq_type, guest_irq_type;
-
- if (!capable(CAP_SYS_RAWIO))
- return -EPERM;
-
- if (!irqchip_in_kernel(kvm))
- return r;
-
- mutex_lock(&kvm->lock);
- r = -ENODEV;
- match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
- assigned_irq->assigned_dev_id);
- if (!match)
- goto out;
-
- host_irq_type = (assigned_irq->flags & KVM_DEV_IRQ_HOST_MASK);
- guest_irq_type = (assigned_irq->flags & KVM_DEV_IRQ_GUEST_MASK);
-
- r = -EINVAL;
- /* can only assign one type at a time */
- if (hweight_long(host_irq_type) > 1)
- goto out;
- if (hweight_long(guest_irq_type) > 1)
- goto out;
- if (host_irq_type == 0 && guest_irq_type == 0)
- goto out;
-
- r = 0;
- if (host_irq_type)
- r = assign_host_irq(kvm, match, host_irq_type);
- if (r)
- goto out;
-
- if (guest_irq_type)
- r = assign_guest_irq(kvm, match, assigned_irq, guest_irq_type);
-out:
- mutex_unlock(&kvm->lock);
- return r;
-}
-
-static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
- struct kvm_assigned_irq
- *assigned_irq)
-{
- int r = -ENODEV;
- struct kvm_assigned_dev_kernel *match;
-
- mutex_lock(&kvm->lock);
-
- match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
- assigned_irq->assigned_dev_id);
- if (!match)
- goto out;
-
- r = kvm_deassign_irq(kvm, match, assigned_irq->flags);
-out:
- mutex_unlock(&kvm->lock);
- return r;
-}
-
-static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
- struct kvm_assigned_pci_dev *assigned_dev)
-{
- int r = 0;
- struct kvm_assigned_dev_kernel *match;
- struct pci_dev *dev;
-
- down_read(&kvm->slots_lock);
- mutex_lock(&kvm->lock);
-
- match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
- assigned_dev->assigned_dev_id);
- if (match) {
- /* device already assigned */
- r = -EEXIST;
- goto out;
- }
-
- match = kzalloc(sizeof(struct kvm_assigned_dev_kernel), GFP_KERNEL);
- if (match == NULL) {
- printk(KERN_INFO "%s: Couldn't allocate memory\n",
- __func__);
- r = -ENOMEM;
- goto out;
- }
- dev = pci_get_bus_and_slot(assigned_dev->busnr,
- assigned_dev->devfn);
- if (!dev) {
- printk(KERN_INFO "%s: host device not found\n", __func__);
- r = -EINVAL;
- goto out_free;
- }
- if (pci_enable_device(dev)) {
- printk(KERN_INFO "%s: Could not enable PCI device\n", __func__);
- r = -EBUSY;
- goto out_put;
- }
- r = pci_request_regions(dev, "kvm_assigned_device");
- if (r) {
- printk(KERN_INFO "%s: Could not get access to device regions\n",
- __func__);
- goto out_disable;
- }
-
- pci_reset_function(dev);
-
- match->assigned_dev_id = assigned_dev->assigned_dev_id;
- match->host_busnr = assigned_dev->busnr;
- match->host_devfn = assigned_dev->devfn;
- match->flags = assigned_dev->flags;
- match->dev = dev;
- spin_lock_init(&match->assigned_dev_lock);
- match->irq_source_id = -1;
- match->kvm = kvm;
- match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
- INIT_WORK(&match->interrupt_work,
- kvm_assigned_dev_interrupt_work_handler);
-
- list_add(&match->list, &kvm->arch.assigned_dev_head);
-
- if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
- if (!kvm->arch.iommu_domain) {
- r = kvm_iommu_map_guest(kvm);
- if (r)
- goto out_list_del;
- }
- r = kvm_assign_device(kvm, match);
- if (r)
- goto out_list_del;
- }
-
-out:
- mutex_unlock(&kvm->lock);
- up_read(&kvm->slots_lock);
- return r;
-out_list_del:
- list_del(&match->list);
- pci_release_regions(dev);
-out_disable:
- pci_disable_device(dev);
-out_put:
- pci_dev_put(dev);
-out_free:
- kfree(match);
- mutex_unlock(&kvm->lock);
- up_read(&kvm->slots_lock);
- return r;
-}
-#endif
-
-#ifdef KVM_CAP_DEVICE_DEASSIGNMENT
-static int kvm_vm_ioctl_deassign_device(struct kvm *kvm,
- struct kvm_assigned_pci_dev *assigned_dev)
-{
- int r = 0;
- struct kvm_assigned_dev_kernel *match;
-
- mutex_lock(&kvm->lock);
-
- match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
- assigned_dev->assigned_dev_id);
- if (!match) {
- printk(KERN_INFO "%s: device hasn't been assigned before, "
- "so cannot be deassigned\n", __func__);
- r = -EINVAL;
- goto out;
- }
-
- if (match->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
- kvm_deassign_device(kvm, match);
-
- kvm_free_assigned_device(kvm, match);
-
-out:
- mutex_unlock(&kvm->lock);
- return r;
-}
-#endif
-
inline int kvm_is_mmio_pfn(pfn_t pfn)
{
if (pfn_valid(pfn)) {
@@ -1824,88 +1216,6 @@ static int kvm_vcpu_ioctl_set_sigmask(struct kvm_vcpu *vcpu, sigset_t *sigset)
return 0;
}

-#ifdef __KVM_HAVE_MSIX
-static int kvm_vm_ioctl_set_msix_nr(struct kvm *kvm,
- struct kvm_assigned_msix_nr *entry_nr)
-{
- int r = 0;
- struct kvm_assigned_dev_kernel *adev;
-
- mutex_lock(&kvm->lock);
-
- adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
- entry_nr->assigned_dev_id);
- if (!adev) {
- r = -EINVAL;
- goto msix_nr_out;
- }
-
- if (adev->entries_nr == 0) {
- adev->entries_nr = entry_nr->entry_nr;
- if (adev->entries_nr == 0 ||
- adev->entries_nr >= KVM_MAX_MSIX_PER_DEV) {
- r = -EINVAL;
- goto msix_nr_out;
- }
-
- adev->host_msix_entries = kzalloc(sizeof(struct msix_entry) *
- entry_nr->entry_nr,
- GFP_KERNEL);
- if (!adev->host_msix_entries) {
- r = -ENOMEM;
- goto msix_nr_out;
- }
- adev->guest_msix_entries = kzalloc(
- sizeof(struct kvm_guest_msix_entry) *
- entry_nr->entry_nr, GFP_KERNEL);
- if (!adev->guest_msix_entries) {
- kfree(adev->host_msix_entries);
- r = -ENOMEM;
- goto msix_nr_out;
- }
- } else /* Not allowed set MSI-X number twice */
- r = -EINVAL;
-msix_nr_out:
- mutex_unlock(&kvm->lock);
- return r;
-}
-
-static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
- struct kvm_assigned_msix_entry *entry)
-{
- int r = 0, i;
- struct kvm_assigned_dev_kernel *adev;
-
- mutex_lock(&kvm->lock);
-
- adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
- entry->assigned_dev_id);
-
- if (!adev) {
- r = -EINVAL;
- goto msix_entry_out;
- }
-
- for (i = 0; i < adev->entries_nr; i++)
- if (adev->guest_msix_entries[i].vector == 0 ||
- adev->guest_msix_entries[i].entry == entry->entry) {
- adev->guest_msix_entries[i].entry = entry->entry;
- adev->guest_msix_entries[i].vector = entry->gsi;
- adev->host_msix_entries[i].entry = entry->entry;
- break;
- }
- if (i == adev->entries_nr) {
- r = -ENOSPC;
- goto msix_entry_out;
- }
-
-msix_entry_out:
- mutex_unlock(&kvm->lock);
-
- return r;
-}
-#endif
-
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -2164,112 +1474,6 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
#endif
-#ifdef KVM_CAP_DEVICE_ASSIGNMENT
- case KVM_ASSIGN_PCI_DEVICE: {
- struct kvm_assigned_pci_dev assigned_dev;
-
- r = -EFAULT;
- if (copy_from_user(&assigned_dev, argp, sizeof assigned_dev))
- goto out;
- r = kvm_vm_ioctl_assign_device(kvm, &assigned_dev);
- if (r)
- goto out;
- break;
- }
- case KVM_ASSIGN_IRQ: {
- r = -EOPNOTSUPP;
- break;
- }
-#ifdef KVM_CAP_ASSIGN_DEV_IRQ
- case KVM_ASSIGN_DEV_IRQ: {
- struct kvm_assigned_irq assigned_irq;
-
- r = -EFAULT;
- if (copy_from_user(&assigned_irq, argp, sizeof assigned_irq))
- goto out;
- r = kvm_vm_ioctl_assign_irq(kvm, &assigned_irq);
- if (r)
- goto out;
- break;
- }
- case KVM_DEASSIGN_DEV_IRQ: {
- struct kvm_assigned_irq assigned_irq;
-
- r = -EFAULT;
- if (copy_from_user(&assigned_irq, argp, sizeof assigned_irq))
- goto out;
- r = kvm_vm_ioctl_deassign_dev_irq(kvm, &assigned_irq);
- if (r)
- goto out;
- break;
- }
-#endif
-#endif
-#ifdef KVM_CAP_DEVICE_DEASSIGNMENT
- case KVM_DEASSIGN_PCI_DEVICE: {
- struct kvm_assigned_pci_dev assigned_dev;
-
- r = -EFAULT;
- if (copy_from_user(&assigned_dev, argp, sizeof assigned_dev))
- goto out;
- r = kvm_vm_ioctl_deassign_device(kvm, &assigned_dev);
- if (r)
- goto out;
- break;
- }
-#endif
-#ifdef KVM_CAP_IRQ_ROUTING
- case KVM_SET_GSI_ROUTING: {
- struct kvm_irq_routing routing;
- struct kvm_irq_routing __user *urouting;
- struct kvm_irq_routing_entry *entries;
-
- r = -EFAULT;
- if (copy_from_user(&routing, argp, sizeof(routing)))
- goto out;
- r = -EINVAL;
- if (routing.nr >= KVM_MAX_IRQ_ROUTES)
- goto out;
- if (routing.flags)
- goto out;
- r = -ENOMEM;
- entries = vmalloc(routing.nr * sizeof(*entries));
- if (!entries)
- goto out;
- r = -EFAULT;
- urouting = argp;
- if (copy_from_user(entries, urouting->entries,
- routing.nr * sizeof(*entries)))
- goto out_free_irq_routing;
- r = kvm_set_irq_routing(kvm, entries, routing.nr,
- routing.flags);
- out_free_irq_routing:
- vfree(entries);
- break;
- }
-#endif /* KVM_CAP_IRQ_ROUTING */
-#ifdef __KVM_HAVE_MSIX
- case KVM_ASSIGN_SET_MSIX_NR: {
- struct kvm_assigned_msix_nr entry_nr;
- r = -EFAULT;
- if (copy_from_user(&entry_nr, argp, sizeof entry_nr))
- goto out;
- r = kvm_vm_ioctl_set_msix_nr(kvm, &entry_nr);
- if (r)
- goto out;
- break;
- }
- case KVM_ASSIGN_SET_MSIX_ENTRY: {
- struct kvm_assigned_msix_entry entry;
- r = -EFAULT;
- if (copy_from_user(&entry, argp, sizeof entry))
- goto out;
- r = kvm_vm_ioctl_set_msix_entry(kvm, &entry);
- if (r)
- goto out;
- break;
- }
-#endif
case KVM_IRQFD: {
struct kvm_irqfd data;

@@ -2301,6 +1505,8 @@ static long kvm_vm_ioctl(struct file *filp,
#endif
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
+ if (r == -ENOTTY)
+ r = kvm_vm_ioctl_assigned_device(kvm, ioctl, arg);
}
out:
return r;
--
1.6.5.2

2009-11-16 12:20:15

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 19/42] KVM: x86 emulator: Add missing decoder flags for 'or' instructions

From: Mohammed Gamal <[email protected]>

Add missing decoder flags for or instructions (0xc-0xd).

Signed-off-by: Mohammed Gamal <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kvm/emulate.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0644d3d..db0820d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -99,7 +99,8 @@ static u32 opcode_table[256] = {
/* 0x08 - 0x0F */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- 0, 0, ImplicitOps | Stack | No64, 0,
+ ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
+ ImplicitOps | Stack | No64, 0,
/* 0x10 - 0x17 */
ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
--
1.6.5.2

2009-11-16 12:26:29

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 20/42] KVM: x86 emulator: Add pusha and popa instructions

From: Mohammed Gamal <[email protected]>

This adds pusha and popa instructions (opcodes 0x60-0x61), this enables booting
MINIX with invalid guest state emulation on.

[marcelo: remove unused variable]

Signed-off-by: Mohammed Gamal <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kvm/emulate.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 47 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index db0820d..d226dff 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -139,7 +139,8 @@ static u32 opcode_table[256] = {
DstReg | Stack, DstReg | Stack, DstReg | Stack, DstReg | Stack,
DstReg | Stack, DstReg | Stack, DstReg | Stack, DstReg | Stack,
/* 0x60 - 0x67 */
- 0, 0, 0, DstReg | SrcMem32 | ModRM | Mov /* movsxd (x86/64) */ ,
+ ImplicitOps | Stack | No64, ImplicitOps | Stack | No64,
+ 0, DstReg | SrcMem32 | ModRM | Mov /* movsxd (x86/64) */ ,
0, 0, 0, 0,
/* 0x68 - 0x6F */
SrcImm | Mov | Stack, 0, SrcImmByte | Mov | Stack, 0,
@@ -1225,6 +1226,43 @@ static int emulate_pop_sreg(struct x86_emulate_ctxt *ctxt,
return rc;
}

+static void emulate_pusha(struct x86_emulate_ctxt *ctxt)
+{
+ struct decode_cache *c = &ctxt->decode;
+ unsigned long old_esp = c->regs[VCPU_REGS_RSP];
+ int reg = VCPU_REGS_RAX;
+
+ while (reg <= VCPU_REGS_RDI) {
+ (reg == VCPU_REGS_RSP) ?
+ (c->src.val = old_esp) : (c->src.val = c->regs[reg]);
+
+ emulate_push(ctxt);
+ ++reg;
+ }
+}
+
+static int emulate_popa(struct x86_emulate_ctxt *ctxt,
+ struct x86_emulate_ops *ops)
+{
+ struct decode_cache *c = &ctxt->decode;
+ int rc = 0;
+ int reg = VCPU_REGS_RDI;
+
+ while (reg >= VCPU_REGS_RAX) {
+ if (reg == VCPU_REGS_RSP) {
+ register_address_increment(c, &c->regs[VCPU_REGS_RSP],
+ c->op_bytes);
+ --reg;
+ }
+
+ rc = emulate_pop(ctxt, ops, &c->regs[reg], c->op_bytes);
+ if (rc != 0)
+ break;
+ --reg;
+ }
+ return rc;
+}
+
static inline int emulate_grp1a(struct x86_emulate_ctxt *ctxt,
struct x86_emulate_ops *ops)
{
@@ -1816,6 +1854,14 @@ special_insn:
if (rc != 0)
goto done;
break;
+ case 0x60: /* pusha */
+ emulate_pusha(ctxt);
+ break;
+ case 0x61: /* popa */
+ rc = emulate_popa(ctxt, ops);
+ if (rc != 0)
+ goto done;
+ break;
case 0x63: /* movsxd */
if (ctxt->mode != X86EMUL_MODE_PROT64)
goto cannot_emulate;
--
1.6.5.2

2009-11-16 12:20:52

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 21/42] KVM: VMX: Enhance invalid guest state emulation

From: Mohammed Gamal <[email protected]>

- Change returned handle_invalid_guest_state() to return relevant exit codes
- Move triggering the emulation from vmx_vcpu_run() to vmx_handle_exit()
- Return to userspace instead of repeatedly trying to emulate instructions that have already failed

Signed-off-by: Mohammed Gamal <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/vmx.c | 44 ++++++++++++++++++++------------------------
1 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4635298..73cb5dd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -107,7 +107,6 @@ struct vcpu_vmx {
} rmode;
int vpid;
bool emulation_required;
- enum emulation_result invalid_state_emulation_result;

/* Support for vnmi-less CPUs */
int soft_vnmi_blocked;
@@ -3322,35 +3321,37 @@ static int handle_nmi_window(struct kvm_vcpu *vcpu)
return 1;
}

-static void handle_invalid_guest_state(struct kvm_vcpu *vcpu)
+static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
enum emulation_result err = EMULATE_DONE;
-
- local_irq_enable();
- preempt_enable();
+ int ret = 1;

while (!guest_state_valid(vcpu)) {
err = emulate_instruction(vcpu, 0, 0, 0);

- if (err == EMULATE_DO_MMIO)
- break;
+ if (err == EMULATE_DO_MMIO) {
+ ret = 0;
+ goto out;
+ }

if (err != EMULATE_DONE) {
kvm_report_emulation_failure(vcpu, "emulation failure");
- break;
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
+ ret = 0;
+ goto out;
}

if (signal_pending(current))
- break;
+ goto out;
if (need_resched())
schedule();
}

- preempt_disable();
- local_irq_disable();
-
- vmx->invalid_state_emulation_result = err;
+ vmx->emulation_required = 0;
+out:
+ return ret;
}

/*
@@ -3406,13 +3407,9 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)

trace_kvm_exit(exit_reason, kvm_rip_read(vcpu));

- /* If we need to emulate an MMIO from handle_invalid_guest_state
- * we just return 0 */
- if (vmx->emulation_required && emulate_invalid_guest_state) {
- if (guest_state_valid(vcpu))
- vmx->emulation_required = 0;
- return vmx->invalid_state_emulation_result != EMULATE_DO_MMIO;
- }
+ /* If guest state is invalid, start emulating */
+ if (vmx->emulation_required && emulate_invalid_guest_state)
+ return handle_invalid_guest_state(vcpu);

/* Access CR3 don't cause VMExit in paging mode, so we need
* to sync with guest real CR3. */
@@ -3607,11 +3604,10 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
vmx->entry_time = ktime_get();

- /* Handle invalid guest state instead of entering VMX */
- if (vmx->emulation_required && emulate_invalid_guest_state) {
- handle_invalid_guest_state(vcpu);
+ /* Don't enter VMX if guest state is invalid, let the exit handler
+ start emulation until we arrive back to a valid state */
+ if (vmx->emulation_required && emulate_invalid_guest_state)
return;
- }

if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
--
1.6.5.2

2009-11-16 12:27:48

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 22/42] KVM: SVM: remove needless mmap_sem acquision from nested_svm_map

From: Marcelo Tosatti <[email protected]>

nested_svm_map unnecessarily takes mmap_sem around gfn_to_page, since
gfn_to_page / get_user_pages are responsible for it.

Signed-off-by: Marcelo Tosatti <[email protected]>
Acked-by: Alexander Graf <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kvm/svm.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 92048a6..f54c4f9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1396,10 +1396,7 @@ static void *nested_svm_map(struct vcpu_svm *svm, u64 gpa, enum km_type idx)
{
struct page *page;

- down_read(&current->mm->mmap_sem);
page = gfn_to_page(svm->vcpu.kvm, gpa >> PAGE_SHIFT);
- up_read(&current->mm->mmap_sem);
-
if (is_error_page(page))
goto error;

--
1.6.5.2

2009-11-16 12:20:24

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 23/42] KVM: Activate Virtualization On Demand

From: Alexander Graf <[email protected]>

X86 CPUs need to have some magic happening to enable the virtualization
extensions on them. This magic can result in unpleasant results for
users, like blocking other VMMs from working (vmx) or using invalid TLB
entries (svm).

Currently KVM activates virtualization when the respective kernel module
is loaded. This blocks us from autoloading KVM modules without breaking
other VMMs.

To circumvent this problem at least a bit, this patch introduces on
demand activation of virtualization. This means, that instead
virtualization is enabled on creation of the first virtual machine
and disabled on destruction of the last one.

So using this, KVM can be easily autoloaded, while keeping other
hypervisors usable.

Signed-off-by: Alexander Graf <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/ia64/kvm/kvm-ia64.c | 8 ++-
arch/powerpc/kvm/powerpc.c | 3 +-
arch/s390/kvm/kvm-s390.c | 3 +-
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/svm.c | 13 ++++--
arch/x86/kvm/vmx.c | 11 +++-
arch/x86/kvm/x86.c | 4 +-
include/linux/kvm_host.h | 2 +-
virt/kvm/kvm_main.c | 90 +++++++++++++++++++++++++++++++++-----
9 files changed, 108 insertions(+), 28 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index f6471c8..5fdeec5 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -124,7 +124,7 @@ long ia64_pal_vp_create(u64 *vpd, u64 *host_iva, u64 *opt_handler)

static DEFINE_SPINLOCK(vp_lock);

-void kvm_arch_hardware_enable(void *garbage)
+int kvm_arch_hardware_enable(void *garbage)
{
long status;
long tmp_base;
@@ -137,7 +137,7 @@ void kvm_arch_hardware_enable(void *garbage)
slot = ia64_itr_entry(0x3, KVM_VMM_BASE, pte, KVM_VMM_SHIFT);
local_irq_restore(saved_psr);
if (slot < 0)
- return;
+ return -EINVAL;

spin_lock(&vp_lock);
status = ia64_pal_vp_init_env(kvm_vsa_base ?
@@ -145,7 +145,7 @@ void kvm_arch_hardware_enable(void *garbage)
__pa(kvm_vm_buffer), KVM_VM_BUFFER_BASE, &tmp_base);
if (status != 0) {
printk(KERN_WARNING"kvm: Failed to Enable VT Support!!!!\n");
- return ;
+ return -EINVAL;
}

if (!kvm_vsa_base) {
@@ -154,6 +154,8 @@ void kvm_arch_hardware_enable(void *garbage)
}
spin_unlock(&vp_lock);
ia64_ptr_entry(0x3, slot);
+
+ return 0;
}

void kvm_arch_hardware_disable(void *garbage)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95af622..5902bbc 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -78,8 +78,9 @@ int kvmppc_emulate_mmio(struct kvm_run *run, struct kvm_vcpu *vcpu)
return r;
}

-void kvm_arch_hardware_enable(void *garbage)
+int kvm_arch_hardware_enable(void *garbage)
{
+ return 0;
}

void kvm_arch_hardware_disable(void *garbage)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 00e2ce8..5445058 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -74,9 +74,10 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
static unsigned long long *facilities;

/* Section: not file related */
-void kvm_arch_hardware_enable(void *garbage)
+int kvm_arch_hardware_enable(void *garbage)
{
/* every s390 is virtualization enabled ;-) */
+ return 0;
}

void kvm_arch_hardware_disable(void *garbage)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a46e2dd..295c7c4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -459,7 +459,7 @@ struct descriptor_table {
struct kvm_x86_ops {
int (*cpu_has_kvm_support)(void); /* __init */
int (*disabled_by_bios)(void); /* __init */
- void (*hardware_enable)(void *dummy); /* __init */
+ int (*hardware_enable)(void *dummy);
void (*hardware_disable)(void *dummy);
void (*check_processor_compatibility)(void *rtn);
int (*hardware_setup)(void); /* __init */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f54c4f9..59fe4d5 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -316,7 +316,7 @@ static void svm_hardware_disable(void *garbage)
cpu_svm_disable();
}

-static void svm_hardware_enable(void *garbage)
+static int svm_hardware_enable(void *garbage)
{

struct svm_cpu_data *svm_data;
@@ -325,16 +325,20 @@ static void svm_hardware_enable(void *garbage)
struct desc_struct *gdt;
int me = raw_smp_processor_id();

+ rdmsrl(MSR_EFER, efer);
+ if (efer & EFER_SVME)
+ return -EBUSY;
+
if (!has_svm()) {
printk(KERN_ERR "svm_cpu_init: err EOPNOTSUPP on %d\n", me);
- return;
+ return -EINVAL;
}
svm_data = per_cpu(svm_data, me);

if (!svm_data) {
printk(KERN_ERR "svm_cpu_init: svm_data is NULL on %d\n",
me);
- return;
+ return -EINVAL;
}

svm_data->asid_generation = 1;
@@ -345,11 +349,12 @@ static void svm_hardware_enable(void *garbage)
gdt = (struct desc_struct *)gdt_descr.base;
svm_data->tss_desc = (struct kvm_ldttss_desc *)(gdt + GDT_ENTRY_TSS);

- rdmsrl(MSR_EFER, efer);
wrmsrl(MSR_EFER, efer | EFER_SVME);

wrmsrl(MSR_VM_HSAVE_PA,
page_to_pfn(svm_data->save_area) << PAGE_SHIFT);
+
+ return 0;
}

static void svm_cpu_uninit(int cpu)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 73cb5dd..a187570 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1138,12 +1138,15 @@ static __init int vmx_disabled_by_bios(void)
/* locked but not enabled */
}

-static void hardware_enable(void *garbage)
+static int hardware_enable(void *garbage)
{
int cpu = raw_smp_processor_id();
u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
u64 old;

+ if (read_cr4() & X86_CR4_VMXE)
+ return -EBUSY;
+
INIT_LIST_HEAD(&per_cpu(vcpus_on_cpu, cpu));
rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
if ((old & (FEATURE_CONTROL_LOCKED |
@@ -1158,6 +1161,10 @@ static void hardware_enable(void *garbage)
asm volatile (ASM_VMX_VMXON_RAX
: : "a"(&phys_addr), "m"(phys_addr)
: "memory", "cc");
+
+ ept_sync_global();
+
+ return 0;
}

static void vmclear_local_vcpus(void)
@@ -4040,8 +4047,6 @@ static int __init vmx_init(void)
if (bypass_guest_pf)
kvm_mmu_set_nonpresent_ptes(~0xffeull, 0ull);

- ept_sync_global();
-
return 0;

out3:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 829e306..3d83de8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4691,9 +4691,9 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
return kvm_x86_ops->vcpu_reset(vcpu);
}

-void kvm_arch_hardware_enable(void *garbage)
+int kvm_arch_hardware_enable(void *garbage)
{
- kvm_x86_ops->hardware_enable(garbage);
+ return kvm_x86_ops->hardware_enable(garbage);
}

void kvm_arch_hardware_disable(void *garbage)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c0a1cc3..b985a29 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -345,7 +345,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);

int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu);
-void kvm_arch_hardware_enable(void *garbage);
+int kvm_arch_hardware_enable(void *garbage);
void kvm_arch_hardware_disable(void *garbage);
int kvm_arch_hardware_setup(void);
void kvm_arch_hardware_unsetup(void);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 38e4d2c..70c8cbe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -69,6 +69,8 @@ DEFINE_SPINLOCK(kvm_lock);
LIST_HEAD(vm_list);

static cpumask_var_t cpus_hardware_enabled;
+static int kvm_usage_count = 0;
+static atomic_t hardware_enable_failed;

struct kmem_cache *kvm_vcpu_cache;
EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
@@ -79,6 +81,8 @@ struct dentry *kvm_debugfs_dir;

static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl,
unsigned long arg);
+static int hardware_enable_all(void);
+static void hardware_disable_all(void);

static bool kvm_rebooting;

@@ -339,6 +343,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {

static struct kvm *kvm_create_vm(void)
{
+ int r = 0;
struct kvm *kvm = kvm_arch_create_vm();
#ifdef KVM_COALESCED_MMIO_PAGE_OFFSET
struct page *page;
@@ -346,6 +351,11 @@ static struct kvm *kvm_create_vm(void)

if (IS_ERR(kvm))
goto out;
+
+ r = hardware_enable_all();
+ if (r)
+ goto out_err_nodisable;
+
#ifdef CONFIG_HAVE_KVM_IRQCHIP
INIT_HLIST_HEAD(&kvm->mask_notifier_list);
INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
@@ -354,8 +364,8 @@ static struct kvm *kvm_create_vm(void)
#ifdef KVM_COALESCED_MMIO_PAGE_OFFSET
page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page) {
- kfree(kvm);
- return ERR_PTR(-ENOMEM);
+ r = -ENOMEM;
+ goto out_err;
}
kvm->coalesced_mmio_ring =
(struct kvm_coalesced_mmio_ring *)page_address(page);
@@ -363,15 +373,13 @@ static struct kvm *kvm_create_vm(void)

#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
{
- int err;
kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
- err = mmu_notifier_register(&kvm->mmu_notifier, current->mm);
- if (err) {
+ r = mmu_notifier_register(&kvm->mmu_notifier, current->mm);
+ if (r) {
#ifdef KVM_COALESCED_MMIO_PAGE_OFFSET
put_page(page);
#endif
- kfree(kvm);
- return ERR_PTR(err);
+ goto out_err;
}
}
#endif
@@ -395,6 +403,12 @@ static struct kvm *kvm_create_vm(void)
#endif
out:
return kvm;
+
+out_err:
+ hardware_disable_all();
+out_err_nodisable:
+ kfree(kvm);
+ return ERR_PTR(r);
}

/*
@@ -453,6 +467,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_arch_flush_shadow(kvm);
#endif
kvm_arch_destroy_vm(kvm);
+ hardware_disable_all();
mmdrop(mm);
}

@@ -1644,11 +1659,21 @@ static struct miscdevice kvm_dev = {
static void hardware_enable(void *junk)
{
int cpu = raw_smp_processor_id();
+ int r;

if (cpumask_test_cpu(cpu, cpus_hardware_enabled))
return;
+
cpumask_set_cpu(cpu, cpus_hardware_enabled);
- kvm_arch_hardware_enable(NULL);
+
+ r = kvm_arch_hardware_enable(NULL);
+
+ if (r) {
+ cpumask_clear_cpu(cpu, cpus_hardware_enabled);
+ atomic_inc(&hardware_enable_failed);
+ printk(KERN_INFO "kvm: enabling virtualization on "
+ "CPU%d failed\n", cpu);
+ }
}

static void hardware_disable(void *junk)
@@ -1661,11 +1686,52 @@ static void hardware_disable(void *junk)
kvm_arch_hardware_disable(NULL);
}

+static void hardware_disable_all_nolock(void)
+{
+ BUG_ON(!kvm_usage_count);
+
+ kvm_usage_count--;
+ if (!kvm_usage_count)
+ on_each_cpu(hardware_disable, NULL, 1);
+}
+
+static void hardware_disable_all(void)
+{
+ spin_lock(&kvm_lock);
+ hardware_disable_all_nolock();
+ spin_unlock(&kvm_lock);
+}
+
+static int hardware_enable_all(void)
+{
+ int r = 0;
+
+ spin_lock(&kvm_lock);
+
+ kvm_usage_count++;
+ if (kvm_usage_count == 1) {
+ atomic_set(&hardware_enable_failed, 0);
+ on_each_cpu(hardware_enable, NULL, 1);
+
+ if (atomic_read(&hardware_enable_failed)) {
+ hardware_disable_all_nolock();
+ r = -EBUSY;
+ }
+ }
+
+ spin_unlock(&kvm_lock);
+
+ return r;
+}
+
static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
void *v)
{
int cpu = (long)v;

+ if (!kvm_usage_count)
+ return NOTIFY_OK;
+
val &= ~CPU_TASKS_FROZEN;
switch (val) {
case CPU_DYING:
@@ -1868,13 +1934,15 @@ static void kvm_exit_debug(void)

static int kvm_suspend(struct sys_device *dev, pm_message_t state)
{
- hardware_disable(NULL);
+ if (kvm_usage_count)
+ hardware_disable(NULL);
return 0;
}

static int kvm_resume(struct sys_device *dev)
{
- hardware_enable(NULL);
+ if (kvm_usage_count)
+ hardware_enable(NULL);
return 0;
}

@@ -1949,7 +2017,6 @@ int kvm_init(void *opaque, unsigned int vcpu_size,
goto out_free_1;
}

- on_each_cpu(hardware_enable, NULL, 1);
r = register_cpu_notifier(&kvm_cpu_notifier);
if (r)
goto out_free_2;
@@ -1999,7 +2066,6 @@ out_free_3:
unregister_reboot_notifier(&kvm_reboot_notifier);
unregister_cpu_notifier(&kvm_cpu_notifier);
out_free_2:
- on_each_cpu(hardware_disable, NULL, 1);
out_free_1:
kvm_arch_hardware_unsetup();
out_free_0a:
--
1.6.5.2

2009-11-16 12:20:20

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 24/42] KVM: remove duplicated #include

From: Huang Weiyi <[email protected]>

Remove duplicated #include('s) in
arch/x86/kvm/lapic.c

Signed-off-by: Huang Weiyi <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/lapic.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 8787637..cd60c0b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -32,7 +32,6 @@
#include <asm/current.h>
#include <asm/apicdef.h>
#include <asm/atomic.h>
-#include <asm/apicdef.h>
#include "kvm_cache_regs.h"
#include "irq.h"
#include "trace.h"
--
1.6.5.2

2009-11-16 12:25:04

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 25/42] KVM: SVM: reorganize svm_interrupt_allowed

From: Joerg Roedel <[email protected]>

This patch reorganizes the logic in svm_interrupt_allowed to
make it better to read. This is important because the logic
is a lot more complicated with Nested SVM.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 16 ++++++++++++----
1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 59fe4d5..3f3fe81 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2472,10 +2472,18 @@ static int svm_interrupt_allowed(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
struct vmcb *vmcb = svm->vmcb;
- return (vmcb->save.rflags & X86_EFLAGS_IF) &&
- !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
- gif_set(svm) &&
- !(is_nested(svm) && (svm->vcpu.arch.hflags & HF_VINTR_MASK));
+ int ret;
+
+ if (!gif_set(svm) ||
+ (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK))
+ return 0;
+
+ ret = !!(vmcb->save.rflags & X86_EFLAGS_IF);
+
+ if (is_nested(svm))
+ return ret && !(svm->vcpu.arch.hflags & HF_VINTR_MASK);
+
+ return ret;
}

static void enable_irq_window(struct kvm_vcpu *vcpu)
--
1.6.5.2

2009-11-16 12:23:48

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 26/42] KVM: SVM: don't copy exit_int_info on nested vmrun

From: Joerg Roedel <[email protected]>

The exit_int_info field is only written by the hardware and
never read. So it does not need to be copied on a vmrun
emulation.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 3f3fe81..41c996a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1797,8 +1797,6 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
svm->nested.intercept = nested_vmcb->control.intercept;

force_new_asid(&svm->vcpu);
- svm->vmcb->control.exit_int_info = nested_vmcb->control.exit_int_info;
- svm->vmcb->control.exit_int_info_err = nested_vmcb->control.exit_int_info_err;
svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
if (nested_vmcb->control.int_ctl & V_IRQ_MASK) {
nsvm_printk("nSVM Injecting Interrupt: 0x%x\n",
--
1.6.5.2

2009-11-16 12:23:29

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 27/42] KVM: SVM: Remove remaining occurences of rdtscll

From: Joerg Roedel <[email protected]>

This patch replaces them with native_read_tsc() which can
also be used in expressions and saves a variable on the
stack in this case.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 41c996a..9a4daca 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -763,14 +763,13 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
int i;

if (unlikely(cpu != vcpu->cpu)) {
- u64 tsc_this, delta;
+ u64 delta;

/*
* Make sure that the guest sees a monotonically
* increasing TSC.
*/
- rdtscll(tsc_this);
- delta = vcpu->arch.host_tsc - tsc_this;
+ delta = vcpu->arch.host_tsc - native_read_tsc();
svm->vmcb->control.tsc_offset += delta;
if (is_nested(svm))
svm->nested.hsave->control.tsc_offset += delta;
@@ -792,7 +791,7 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);

- rdtscll(vcpu->arch.host_tsc);
+ vcpu->arch.host_tsc = native_read_tsc();
}

static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
--
1.6.5.2

2009-11-16 12:27:31

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 28/42] KVM: fix lock imbalance in kvm_*_irq_source_id()

From: Jiri Slaby <[email protected]>

Stanse found 2 lock imbalances in kvm_request_irq_source_id and
kvm_free_irq_source_id. They omit to unlock kvm->irq_lock on fail paths.

Fix that by adding unlock labels at the end of the functions and jump
there from the fail paths.

Signed-off-by: Jiri Slaby <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
virt/kvm/irq_comm.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 15a83b9..00c68d2 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -220,11 +220,13 @@ int kvm_request_irq_source_id(struct kvm *kvm)

if (irq_source_id >= sizeof(kvm->arch.irq_sources_bitmap)) {
printk(KERN_WARNING "kvm: exhaust allocatable IRQ sources!\n");
- return -EFAULT;
+ irq_source_id = -EFAULT;
+ goto unlock;
}

ASSERT(irq_source_id != KVM_USERSPACE_IRQ_SOURCE_ID);
set_bit(irq_source_id, bitmap);
+unlock:
mutex_unlock(&kvm->irq_lock);

return irq_source_id;
@@ -240,7 +242,7 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
if (irq_source_id < 0 ||
irq_source_id >= sizeof(kvm->arch.irq_sources_bitmap)) {
printk(KERN_ERR "kvm: IRQ source ID out of range!\n");
- return;
+ goto unlock;
}
for (i = 0; i < KVM_IOAPIC_NUM_PINS; i++) {
clear_bit(irq_source_id, &kvm->arch.vioapic->irq_states[i]);
@@ -251,6 +253,7 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
#endif
}
clear_bit(irq_source_id, &kvm->arch.irq_sources_bitmap);
+unlock:
mutex_unlock(&kvm->irq_lock);
}

--
1.6.5.2

2009-11-16 12:20:06

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 29/42] KVM: Separate timer intialization into an indepedent function

From: Zachary Amsden <[email protected]>

Signed-off-by: Zachary Amsden <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/x86.c | 23 +++++++++++++++--------
1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3d83de8..6a31dfb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3118,9 +3118,22 @@ static struct notifier_block kvmclock_cpufreq_notifier_block = {
.notifier_call = kvmclock_cpufreq_notifier
};

+static void kvm_timer_init(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
+ if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+ tsc_khz_ref = tsc_khz;
+ cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
+ CPUFREQ_TRANSITION_NOTIFIER);
+ }
+}
+
int kvm_arch_init(void *opaque)
{
- int r, cpu;
+ int r;
struct kvm_x86_ops *ops = (struct kvm_x86_ops *)opaque;

if (kvm_x86_ops) {
@@ -3152,13 +3165,7 @@ int kvm_arch_init(void *opaque)
kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK,
PT_DIRTY_MASK, PT64_NX_MASK, 0);

- for_each_possible_cpu(cpu)
- per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
- if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
- tsc_khz_ref = tsc_khz;
- cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
- CPUFREQ_TRANSITION_NOTIFIER);
- }
+ kvm_timer_init();

return 0;

--
1.6.5.2

2009-11-16 12:24:33

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 30/42] KVM: Kill the confusing tsc_ref_khz and ref_freq variables

From: Zachary Amsden <[email protected]>

They are globals, not clearly protected by any ordering or locking, and
vulnerable to various startup races.

Instead, for variable TSC machines, register the cpufreq notifier and get
the TSC frequency directly from the cpufreq machinery. Not only is it
always right, it is also perfectly accurate, as no error prone measurement
is required.

On such machines, when a new CPU online is brought online, it isn't clear what
frequency it will start with, and it may not correspond to the reference, thus
in hardware_enable we clear the cpu_tsc_khz variable to zero and make sure
it is set before running on a VCPU.

Signed-off-by: Zachary Amsden <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/x86.c | 26 ++++++++++++++++----------
1 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6a31dfb..6f75856 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1326,6 +1326,8 @@ out:
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
kvm_x86_ops->vcpu_load(vcpu, cpu);
+ if (unlikely(per_cpu(cpu_tsc_khz, cpu) == 0))
+ per_cpu(cpu_tsc_khz, cpu) = cpufreq_quick_get(cpu);
kvm_request_guest_time_update(vcpu);
}

@@ -3063,9 +3065,6 @@ static void bounce_off(void *info)
/* nothing */
}

-static unsigned int ref_freq;
-static unsigned long tsc_khz_ref;
-
static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
void *data)
{
@@ -3074,14 +3073,11 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
struct kvm_vcpu *vcpu;
int i, send_ipi = 0;

- if (!ref_freq)
- ref_freq = freq->old;
-
if (val == CPUFREQ_PRECHANGE && freq->old > freq->new)
return 0;
if (val == CPUFREQ_POSTCHANGE && freq->old < freq->new)
return 0;
- per_cpu(cpu_tsc_khz, freq->cpu) = cpufreq_scale(tsc_khz_ref, ref_freq, freq->new);
+ per_cpu(cpu_tsc_khz, freq->cpu) = freq->new;

spin_lock(&kvm_lock);
list_for_each_entry(kvm, &vm_list, vm_list) {
@@ -3122,12 +3118,14 @@ static void kvm_timer_init(void)
{
int cpu;

- for_each_possible_cpu(cpu)
- per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
- tsc_khz_ref = tsc_khz;
cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
+ for_each_online_cpu(cpu)
+ per_cpu(cpu_tsc_khz, cpu) = cpufreq_get(cpu);
+ } else {
+ for_each_possible_cpu(cpu)
+ per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
}
}

@@ -4700,6 +4698,14 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)

int kvm_arch_hardware_enable(void *garbage)
{
+ /*
+ * Since this may be called from a hotplug notifcation,
+ * we can't get the CPU frequency directly.
+ */
+ if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+ int cpu = raw_smp_processor_id();
+ per_cpu(cpu_tsc_khz, cpu) = 0;
+ }
return kvm_x86_ops->hardware_enable(garbage);
}

--
1.6.5.2

2009-11-16 12:20:21

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 31/42] KVM: Fix printk name error in svm.c

From: Zachary Amsden <[email protected]>

Signed-off-by: Zachary Amsden <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 9a4daca..d1036ce 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -330,13 +330,14 @@ static int svm_hardware_enable(void *garbage)
return -EBUSY;

if (!has_svm()) {
- printk(KERN_ERR "svm_cpu_init: err EOPNOTSUPP on %d\n", me);
+ printk(KERN_ERR "svm_hardware_enable: err EOPNOTSUPP on %d\n",
+ me);
return -EINVAL;
}
svm_data = per_cpu(svm_data, me);

if (!svm_data) {
- printk(KERN_ERR "svm_cpu_init: svm_data is NULL on %d\n",
+ printk(KERN_ERR "svm_hardware_enable: svm_data is NULL on %d\n",
me);
return -EINVAL;
}
--
1.6.5.2

2009-11-16 12:28:46

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 32/42] KVM: Fix hotplug of CPUs

From: Zachary Amsden <[email protected]>

Both VMX and SVM require per-cpu memory allocation, which is done at module
init time, for only online cpus.

Backend was not allocating enough structure for all possible CPUs, so
new CPUs coming online could not be hardware enabled.

Signed-off-by: Zachary Amsden <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 4 ++--
arch/x86/kvm/vmx.c | 6 ++++--
2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d1036ce..02a4269 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -482,7 +482,7 @@ static __init int svm_hardware_setup(void)
kvm_enable_efer_bits(EFER_SVME);
}

- for_each_online_cpu(cpu) {
+ for_each_possible_cpu(cpu) {
r = svm_cpu_init(cpu);
if (r)
goto err;
@@ -516,7 +516,7 @@ static __exit void svm_hardware_unsetup(void)
{
int cpu;

- for_each_online_cpu(cpu)
+ for_each_possible_cpu(cpu)
svm_cpu_uninit(cpu);

__free_pages(pfn_to_page(iopm_base >> PAGE_SHIFT), IOPM_ALLOC_ORDER);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a187570..97f4265 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1350,15 +1350,17 @@ static void free_kvm_area(void)
{
int cpu;

- for_each_online_cpu(cpu)
+ for_each_possible_cpu(cpu) {
free_vmcs(per_cpu(vmxarea, cpu));
+ per_cpu(vmxarea, cpu) = NULL;
+ }
}

static __init int alloc_kvm_area(void)
{
int cpu;

- for_each_online_cpu(cpu) {
+ for_each_possible_cpu(cpu) {
struct vmcs *vmcs;

vmcs = alloc_vmcs_cpu(cpu);
--
1.6.5.2

2009-11-16 12:28:31

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 33/42] KVM: remove pre_task_link setting in save_state_to_tss16

From: Juan Quintela <[email protected]>

Now, also remove pre_task_link setting in save_state_to_tss16.

commit b237ac37a149e8b56436fabf093532483bff13b0
Author: Gleb Natapov <[email protected]>
Date: Mon Mar 30 16:03:24 2009 +0300

KVM: Fix task switch back link handling.

CC: Gleb Natapov <[email protected]>
Signed-off-by: Juan Quintela <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/x86.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6f75856..5f44d56 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4203,7 +4203,6 @@ static void save_state_to_tss16(struct kvm_vcpu *vcpu,
tss->ss = get_segment_selector(vcpu, VCPU_SREG_SS);
tss->ds = get_segment_selector(vcpu, VCPU_SREG_DS);
tss->ldt = get_segment_selector(vcpu, VCPU_SREG_LDTR);
- tss->prev_task_link = get_segment_selector(vcpu, VCPU_SREG_TR);
}

static int load_state_from_tss16(struct kvm_vcpu *vcpu,
--
1.6.5.2

2009-11-16 12:22:36

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 34/42] KVM: x86: Refactor guest debug IOCTL handling

From: Jan Kiszka <[email protected]>

Much of so far vendor-specific code for setting up guest debug can
actually be handled by the generic code. This also fixes a minor deficit
in the SVM part /wrt processing KVM_GUESTDBG_ENABLE.

Signed-off-by: Jan Kiszka <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++--
arch/x86/kvm/svm.c | 14 ++------------
arch/x86/kvm/vmx.c | 18 +-----------------
arch/x86/kvm/x86.c | 28 +++++++++++++++++++++-------
4 files changed, 26 insertions(+), 38 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 295c7c4..e7f8708 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -475,8 +475,8 @@ struct kvm_x86_ops {
void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
void (*vcpu_put)(struct kvm_vcpu *vcpu);

- int (*set_guest_debug)(struct kvm_vcpu *vcpu,
- struct kvm_guest_debug *dbg);
+ void (*set_guest_debug)(struct kvm_vcpu *vcpu,
+ struct kvm_guest_debug *dbg);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 02a4269..279a2ae 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1065,26 +1065,16 @@ static void update_db_intercept(struct kvm_vcpu *vcpu)
vcpu->guest_debug = 0;
}

-static int svm_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg)
+static void svm_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg)
{
- int old_debug = vcpu->guest_debug;
struct vcpu_svm *svm = to_svm(vcpu);

- vcpu->guest_debug = dbg->control;
-
- update_db_intercept(vcpu);
-
if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
svm->vmcb->save.dr7 = dbg->arch.debugreg[7];
else
svm->vmcb->save.dr7 = vcpu->arch.dr7;

- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
- svm->vmcb->save.rflags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
- else if (old_debug & KVM_GUESTDBG_SINGLESTEP)
- svm->vmcb->save.rflags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
-
- return 0;
+ update_db_intercept(vcpu);
}

static void load_host_msrs(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 97f4265..70020e5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1096,30 +1096,14 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
}
}

-static int set_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg)
+static void set_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg)
{
- int old_debug = vcpu->guest_debug;
- unsigned long flags;
-
- vcpu->guest_debug = dbg->control;
- if (!(vcpu->guest_debug & KVM_GUESTDBG_ENABLE))
- vcpu->guest_debug = 0;
-
if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
vmcs_writel(GUEST_DR7, dbg->arch.debugreg[7]);
else
vmcs_writel(GUEST_DR7, vcpu->arch.dr7);

- flags = vmcs_readl(GUEST_RFLAGS);
- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
- flags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
- else if (old_debug & KVM_GUESTDBG_SINGLESTEP)
- flags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
- vmcs_writel(GUEST_RFLAGS, flags);
-
update_exception_bitmap(vcpu);
-
- return 0;
}

static __init int cpu_has_kvm_support(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5f44d56..a06f88e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4472,12 +4472,19 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg)
{
- int i, r;
+ unsigned long rflags;
+ int old_debug;
+ int i;

vcpu_load(vcpu);

- if ((dbg->control & (KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_HW_BP)) ==
- (KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_HW_BP)) {
+ old_debug = vcpu->guest_debug;
+
+ vcpu->guest_debug = dbg->control;
+ if (!(vcpu->guest_debug & KVM_GUESTDBG_ENABLE))
+ vcpu->guest_debug = 0;
+
+ if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
for (i = 0; i < KVM_NR_DB_REGS; ++i)
vcpu->arch.eff_db[i] = dbg->arch.debugreg[i];
vcpu->arch.switch_db_regs =
@@ -4488,16 +4495,23 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
vcpu->arch.switch_db_regs = (vcpu->arch.dr7 & DR7_BP_EN_MASK);
}

- r = kvm_x86_ops->set_guest_debug(vcpu, dbg);
+ rflags = kvm_x86_ops->get_rflags(vcpu);
+ if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+ rflags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
+ else if (old_debug & KVM_GUESTDBG_SINGLESTEP)
+ rflags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
+ kvm_x86_ops->set_rflags(vcpu, rflags);

- if (dbg->control & KVM_GUESTDBG_INJECT_DB)
+ kvm_x86_ops->set_guest_debug(vcpu, dbg);
+
+ if (vcpu->guest_debug & KVM_GUESTDBG_INJECT_DB)
kvm_queue_exception(vcpu, DB_VECTOR);
- else if (dbg->control & KVM_GUESTDBG_INJECT_BP)
+ else if (vcpu->guest_debug & KVM_GUESTDBG_INJECT_BP)
kvm_queue_exception(vcpu, BP_VECTOR);

vcpu_put(vcpu);

- return r;
+ return 0;
}

/*
--
1.6.5.2

2009-11-16 12:20:33

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 35/42] KVM: x86: disable paravirt mmu reporting

From: Marcelo Tosatti <[email protected]>

Disable paravirt MMU capability reporting, so that new (or rebooted)
guests switch to native operation.

Paravirt MMU is a burden to maintain and does not bring significant
advantages compared to shadow anymore.

Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kvm/x86.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a06f88e..4693f91 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1238,8 +1238,8 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_NR_MEMSLOTS:
r = KVM_MEMORY_SLOTS;
break;
- case KVM_CAP_PV_MMU:
- r = !tdp_enabled;
+ case KVM_CAP_PV_MMU: /* obsolete */
+ r = 0;
break;
case KVM_CAP_IOMMU:
r = iommu_found();
--
1.6.5.2

2009-11-16 12:24:32

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 36/42] KVM: x86: Rework guest single-step flag injection and filtering

From: Jan Kiszka <[email protected]>

Push TF and RF injection and filtering on guest single-stepping into the
vender get/set_rflags callbacks. This makes the whole mechanism more
robust wrt user space IOCTL order and instruction emulations.

Signed-off-by: Jan Kiszka <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/kvm/x86.c | 77 +++++++++++++++++++++++----------------
2 files changed, 48 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e7f8708..179a919 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -614,6 +614,9 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);

+unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
+void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+
void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long cr2,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4693f91..385cd0a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -235,6 +235,25 @@ bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl)
}
EXPORT_SYMBOL_GPL(kvm_require_cpl);

+unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu)
+{
+ unsigned long rflags;
+
+ rflags = kvm_x86_ops->get_rflags(vcpu);
+ if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+ rflags &= ~(unsigned long)(X86_EFLAGS_TF | X86_EFLAGS_RF);
+ return rflags;
+}
+EXPORT_SYMBOL_GPL(kvm_get_rflags);
+
+void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+ if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+ rflags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
+ kvm_x86_ops->set_rflags(vcpu, rflags);
+}
+EXPORT_SYMBOL_GPL(kvm_set_rflags);
+
/*
* Load the pae pdptrs. Return true is they are all valid.
*/
@@ -2777,7 +2796,7 @@ int emulate_instruction(struct kvm_vcpu *vcpu,
kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l);

vcpu->arch.emulate_ctxt.vcpu = vcpu;
- vcpu->arch.emulate_ctxt.eflags = kvm_x86_ops->get_rflags(vcpu);
+ vcpu->arch.emulate_ctxt.eflags = kvm_get_rflags(vcpu);
vcpu->arch.emulate_ctxt.mode =
(vcpu->arch.emulate_ctxt.eflags & X86_EFLAGS_VM)
? X86EMUL_MODE_REAL : cs_l
@@ -2855,7 +2874,7 @@ int emulate_instruction(struct kvm_vcpu *vcpu,
return EMULATE_DO_MMIO;
}

- kvm_x86_ops->set_rflags(vcpu, vcpu->arch.emulate_ctxt.eflags);
+ kvm_set_rflags(vcpu, vcpu->arch.emulate_ctxt.eflags);

if (vcpu->mmio_is_write) {
vcpu->mmio_needed = 0;
@@ -3291,7 +3310,7 @@ void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw,
unsigned long *rflags)
{
kvm_lmsw(vcpu, msw);
- *rflags = kvm_x86_ops->get_rflags(vcpu);
+ *rflags = kvm_get_rflags(vcpu);
}

unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr)
@@ -3329,7 +3348,7 @@ void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long val,
switch (cr) {
case 0:
kvm_set_cr0(vcpu, mk_cr_64(vcpu->arch.cr0, val));
- *rflags = kvm_x86_ops->get_rflags(vcpu);
+ *rflags = kvm_get_rflags(vcpu);
break;
case 2:
vcpu->arch.cr2 = val;
@@ -3460,7 +3479,7 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu)
{
struct kvm_run *kvm_run = vcpu->run;

- kvm_run->if_flag = (kvm_x86_ops->get_rflags(vcpu) & X86_EFLAGS_IF) != 0;
+ kvm_run->if_flag = (kvm_get_rflags(vcpu) & X86_EFLAGS_IF) != 0;
kvm_run->cr8 = kvm_get_cr8(vcpu);
kvm_run->apic_base = kvm_get_apic_base(vcpu);
if (irqchip_in_kernel(vcpu->kvm))
@@ -3840,13 +3859,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
#endif

regs->rip = kvm_rip_read(vcpu);
- regs->rflags = kvm_x86_ops->get_rflags(vcpu);
-
- /*
- * Don't leak debug flags in case they were set for guest debugging
- */
- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
- regs->rflags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
+ regs->rflags = kvm_get_rflags(vcpu);

vcpu_put(vcpu);

@@ -3874,12 +3887,10 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
kvm_register_write(vcpu, VCPU_REGS_R13, regs->r13);
kvm_register_write(vcpu, VCPU_REGS_R14, regs->r14);
kvm_register_write(vcpu, VCPU_REGS_R15, regs->r15);
-
#endif

kvm_rip_write(vcpu, regs->rip);
- kvm_x86_ops->set_rflags(vcpu, regs->rflags);
-
+ kvm_set_rflags(vcpu, regs->rflags);

vcpu->arch.exception.pending = false;

@@ -4098,7 +4109,7 @@ static int is_vm86_segment(struct kvm_vcpu *vcpu, int seg)
{
return (seg != VCPU_SREG_LDTR) &&
(seg != VCPU_SREG_TR) &&
- (kvm_x86_ops->get_rflags(vcpu) & X86_EFLAGS_VM);
+ (kvm_get_rflags(vcpu) & X86_EFLAGS_VM);
}

int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector,
@@ -4126,7 +4137,7 @@ static void save_state_to_tss32(struct kvm_vcpu *vcpu,
{
tss->cr3 = vcpu->arch.cr3;
tss->eip = kvm_rip_read(vcpu);
- tss->eflags = kvm_x86_ops->get_rflags(vcpu);
+ tss->eflags = kvm_get_rflags(vcpu);
tss->eax = kvm_register_read(vcpu, VCPU_REGS_RAX);
tss->ecx = kvm_register_read(vcpu, VCPU_REGS_RCX);
tss->edx = kvm_register_read(vcpu, VCPU_REGS_RDX);
@@ -4150,7 +4161,7 @@ static int load_state_from_tss32(struct kvm_vcpu *vcpu,
kvm_set_cr3(vcpu, tss->cr3);

kvm_rip_write(vcpu, tss->eip);
- kvm_x86_ops->set_rflags(vcpu, tss->eflags | 2);
+ kvm_set_rflags(vcpu, tss->eflags | 2);

kvm_register_write(vcpu, VCPU_REGS_RAX, tss->eax);
kvm_register_write(vcpu, VCPU_REGS_RCX, tss->ecx);
@@ -4188,7 +4199,7 @@ static void save_state_to_tss16(struct kvm_vcpu *vcpu,
struct tss_segment_16 *tss)
{
tss->ip = kvm_rip_read(vcpu);
- tss->flag = kvm_x86_ops->get_rflags(vcpu);
+ tss->flag = kvm_get_rflags(vcpu);
tss->ax = kvm_register_read(vcpu, VCPU_REGS_RAX);
tss->cx = kvm_register_read(vcpu, VCPU_REGS_RCX);
tss->dx = kvm_register_read(vcpu, VCPU_REGS_RDX);
@@ -4209,7 +4220,7 @@ static int load_state_from_tss16(struct kvm_vcpu *vcpu,
struct tss_segment_16 *tss)
{
kvm_rip_write(vcpu, tss->ip);
- kvm_x86_ops->set_rflags(vcpu, tss->flag | 2);
+ kvm_set_rflags(vcpu, tss->flag | 2);
kvm_register_write(vcpu, VCPU_REGS_RAX, tss->ax);
kvm_register_write(vcpu, VCPU_REGS_RCX, tss->cx);
kvm_register_write(vcpu, VCPU_REGS_RDX, tss->dx);
@@ -4355,8 +4366,8 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int reason)
}

if (reason == TASK_SWITCH_IRET) {
- u32 eflags = kvm_x86_ops->get_rflags(vcpu);
- kvm_x86_ops->set_rflags(vcpu, eflags & ~X86_EFLAGS_NT);
+ u32 eflags = kvm_get_rflags(vcpu);
+ kvm_set_rflags(vcpu, eflags & ~X86_EFLAGS_NT);
}

/* set back link to prev task only if NT bit is set in eflags
@@ -4377,8 +4388,8 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int reason)
old_tss_base, &nseg_desc);

if (reason == TASK_SWITCH_CALL || reason == TASK_SWITCH_GATE) {
- u32 eflags = kvm_x86_ops->get_rflags(vcpu);
- kvm_x86_ops->set_rflags(vcpu, eflags | X86_EFLAGS_NT);
+ u32 eflags = kvm_get_rflags(vcpu);
+ kvm_set_rflags(vcpu, eflags | X86_EFLAGS_NT);
}

if (reason != TASK_SWITCH_IRET) {
@@ -4473,12 +4484,15 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg)
{
unsigned long rflags;
- int old_debug;
int i;

vcpu_load(vcpu);

- old_debug = vcpu->guest_debug;
+ /*
+ * Read rflags as long as potentially injected trace flags are still
+ * filtered out.
+ */
+ rflags = kvm_get_rflags(vcpu);

vcpu->guest_debug = dbg->control;
if (!(vcpu->guest_debug & KVM_GUESTDBG_ENABLE))
@@ -4495,12 +4509,11 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
vcpu->arch.switch_db_regs = (vcpu->arch.dr7 & DR7_BP_EN_MASK);
}

- rflags = kvm_x86_ops->get_rflags(vcpu);
- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
- rflags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
- else if (old_debug & KVM_GUESTDBG_SINGLESTEP)
- rflags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
- kvm_x86_ops->set_rflags(vcpu, rflags);
+ /*
+ * Trigger an rflags update that will inject or remove the trace
+ * flags.
+ */
+ kvm_set_rflags(vcpu, rflags);

kvm_x86_ops->set_guest_debug(vcpu, dbg);

--
1.6.5.2

2009-11-16 12:24:10

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 37/42] KVM: x86: include pvclock MSRs in msrs_to_save

From: Glauber Costa <[email protected]>

For a while now, we are issuing a rdmsr instruction to find out which
msrs in our save list are really supported by the underlying machine.
However, it fails to account for kvm-specific msrs, such as the pvclock
ones.

This patch moves then to the beginning of the list, and skip testing them.

Cc: [email protected]
Signed-off-by: Glauber Costa <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/x86.c | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 385cd0a..4de5bc0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -503,16 +503,19 @@ static inline u32 bit(int bitno)
* and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST.
*
* This list is modified at module load time to reflect the
- * capabilities of the host cpu.
+ * capabilities of the host cpu. This capabilities test skips MSRs that are
+ * kvm-specific. Those are put in the beginning of the list.
*/
+
+#define KVM_SAVE_MSRS_BEGIN 2
static u32 msrs_to_save[] = {
+ MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_K6_STAR,
#ifdef CONFIG_X86_64
MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
#endif
- MSR_IA32_TSC, MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
- MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
+ MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
};

static unsigned num_msrs_to_save;
@@ -2446,7 +2449,8 @@ static void kvm_init_msr_list(void)
u32 dummy[2];
unsigned i, j;

- for (i = j = 0; i < ARRAY_SIZE(msrs_to_save); i++) {
+ /* skip the first msrs in the list. KVM-specific */
+ for (i = j = KVM_SAVE_MSRS_BEGIN; i < ARRAY_SIZE(msrs_to_save); i++) {
if (rdmsr_safe(msrs_to_save[i], &dummy[0], &dummy[1]) < 0)
continue;
if (j < i)
--
1.6.5.2

2009-11-16 12:20:35

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 38/42] KVM: SVM: Notify nested hypervisor of lost event injections

From: Alexander Graf <[email protected]>

If event_inj is valid on a #vmexit the host CPU would write
the contents to exit_int_info, so the hypervisor knows that
the event wasn't injected.

We don't do this in nested SVM by now which is a bug and
fixed by this patch.

Signed-off-by: Alexander Graf <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 16 ++++++++++++++++
1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 279a2ae..e372854 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1615,6 +1615,22 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
nested_vmcb->control.exit_info_2 = vmcb->control.exit_info_2;
nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info;
nested_vmcb->control.exit_int_info_err = vmcb->control.exit_int_info_err;
+
+ /*
+ * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have
+ * to make sure that we do not lose injected events. So check event_inj
+ * here and copy it to exit_int_info if it is valid.
+ * Exit_int_info and event_inj can't be both valid because the case
+ * below only happens on a VMRUN instruction intercept which has
+ * no valid exit_int_info set.
+ */
+ if (vmcb->control.event_inj & SVM_EVTINJ_VALID) {
+ struct vmcb_control_area *nc = &nested_vmcb->control;
+
+ nc->exit_int_info = vmcb->control.event_inj;
+ nc->exit_int_info_err = vmcb->control.event_inj_err;
+ }
+
nested_vmcb->control.tlb_ctl = 0;
nested_vmcb->control.event_inj = 0;
nested_vmcb->control.event_inj_err = 0;
--
1.6.5.2

2009-11-16 12:27:11

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 39/42] KVM: SVM: Move INTR vmexit out of atomic code

From: Joerg Roedel <[email protected]>

The nested SVM code emulates a #vmexit caused by a request
to open the irq window right in the request function. This
is a bug because the request function runs with preemption
and interrupts disabled but the #vmexit emulation might
sleep. This can cause a schedule()-while-atomic bug and is
fixed with this patch.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 26 +++++++++++++++++++++++++-
1 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e372854..884bffc 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -85,6 +85,9 @@ struct nested_state {
/* gpa pointers to the real vectors */
u64 vmcb_msrpm;

+ /* A VMEXIT is required but not yet emulated */
+ bool exit_required;
+
/* cache for intercepts of the guest */
u16 intercept_cr_read;
u16 intercept_cr_write;
@@ -1379,7 +1382,14 @@ static inline int nested_svm_intr(struct vcpu_svm *svm)

svm->vmcb->control.exit_code = SVM_EXIT_INTR;

- if (nested_svm_exit_handled(svm)) {
+ if (svm->nested.intercept & 1ULL) {
+ /*
+ * The #vmexit can't be emulated here directly because this
+ * code path runs with irqs and preemtion disabled. A
+ * #vmexit emulation might sleep. Only signal request for
+ * the #vmexit here.
+ */
+ svm->nested.exit_required = true;
nsvm_printk("VMexit -> INTR\n");
return 1;
}
@@ -2340,6 +2350,13 @@ static int handle_exit(struct kvm_vcpu *vcpu)

trace_kvm_exit(exit_code, svm->vmcb->save.rip);

+ if (unlikely(svm->nested.exit_required)) {
+ nested_svm_vmexit(svm);
+ svm->nested.exit_required = false;
+
+ return 1;
+ }
+
if (is_nested(svm)) {
int vmexit;

@@ -2615,6 +2632,13 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
u16 gs_selector;
u16 ldt_selector;

+ /*
+ * A vmexit emulation is required before the vcpu can be executed
+ * again.
+ */
+ if (unlikely(svm->nested.exit_required))
+ return;
+
svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
--
1.6.5.2

2009-11-16 12:22:57

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 40/42] KVM: SVM: Add tracepoint for nested vmrun

From: Joerg Roedel <[email protected]>

This patch adds a dedicated kvm tracepoint for a nested
vmrun.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 6 ++++++
arch/x86/kvm/trace.h | 33 +++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 884bffc..907af3f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1726,6 +1726,12 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
/* nested_vmcb is our indicator if nested SVM is activated */
svm->nested.vmcb = svm->vmcb->save.rax;

+ trace_kvm_nested_vmrun(svm->vmcb->save.rip - 3, svm->nested.vmcb,
+ nested_vmcb->save.rip,
+ nested_vmcb->control.int_ctl,
+ nested_vmcb->control.event_inj,
+ nested_vmcb->control.nested_ctl);
+
/* Clear internal status */
kvm_clear_exception_queue(&svm->vcpu);
kvm_clear_interrupt_queue(&svm->vcpu);
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 0d480e7..b5798e1 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -349,6 +349,39 @@ TRACE_EVENT(kvm_apic_accept_irq,
__entry->coalesced ? " (coalesced)" : "")
);

+/*
+ * Tracepoint for nested VMRUN
+ */
+TRACE_EVENT(kvm_nested_vmrun,
+ TP_PROTO(__u64 rip, __u64 vmcb, __u64 nested_rip, __u32 int_ctl,
+ __u32 event_inj, bool npt),
+ TP_ARGS(rip, vmcb, nested_rip, int_ctl, event_inj, npt),
+
+ TP_STRUCT__entry(
+ __field( __u64, rip )
+ __field( __u64, vmcb )
+ __field( __u64, nested_rip )
+ __field( __u32, int_ctl )
+ __field( __u32, event_inj )
+ __field( bool, npt )
+ ),
+
+ TP_fast_assign(
+ __entry->rip = rip;
+ __entry->vmcb = vmcb;
+ __entry->nested_rip = nested_rip;
+ __entry->int_ctl = int_ctl;
+ __entry->event_inj = event_inj;
+ __entry->npt = npt;
+ ),
+
+ TP_printk("rip: 0x%016llx vmcb: 0x%016llx nrip: 0x%016llx int_ctl: 0x%08x "
+ "event_inj: 0x%08x npt: %s\n",
+ __entry->rip, __entry->vmcb, __entry->nested_rip,
+ __entry->int_ctl, __entry->event_inj,
+ __entry->npt ? "on" : "off")
+);
+
#endif /* _TRACE_KVM_H */

/* This part must be outside protection */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4de5bc0..3ab2f90 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4984,3 +4984,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
--
1.6.5.2

2009-11-16 12:20:47

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 41/42] KVM: SVM: Add tracepoint for nested #vmexit

From: Joerg Roedel <[email protected]>

This patch adds a tracepoint for every #vmexit we get from a
nested guest.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 6 ++++++
arch/x86/kvm/trace.h | 36 ++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
3 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 907af3f..edf6e8b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2366,6 +2366,12 @@ static int handle_exit(struct kvm_vcpu *vcpu)
if (is_nested(svm)) {
int vmexit;

+ trace_kvm_nested_vmexit(svm->vmcb->save.rip, exit_code,
+ svm->vmcb->control.exit_info_1,
+ svm->vmcb->control.exit_info_2,
+ svm->vmcb->control.exit_int_info,
+ svm->vmcb->control.exit_int_info_err);
+
nsvm_printk("nested handle_exit: 0x%x | 0x%lx | 0x%lx | 0x%lx\n",
exit_code, svm->vmcb->control.exit_info_1,
svm->vmcb->control.exit_info_2, svm->vmcb->save.rip);
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index b5798e1..a7eb629 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -382,6 +382,42 @@ TRACE_EVENT(kvm_nested_vmrun,
__entry->npt ? "on" : "off")
);

+/*
+ * Tracepoint for #VMEXIT while nested
+ */
+TRACE_EVENT(kvm_nested_vmexit,
+ TP_PROTO(__u64 rip, __u32 exit_code,
+ __u64 exit_info1, __u64 exit_info2,
+ __u32 exit_int_info, __u32 exit_int_info_err),
+ TP_ARGS(rip, exit_code, exit_info1, exit_info2,
+ exit_int_info, exit_int_info_err),
+
+ TP_STRUCT__entry(
+ __field( __u64, rip )
+ __field( __u32, exit_code )
+ __field( __u64, exit_info1 )
+ __field( __u64, exit_info2 )
+ __field( __u32, exit_int_info )
+ __field( __u32, exit_int_info_err )
+ ),
+
+ TP_fast_assign(
+ __entry->rip = rip;
+ __entry->exit_code = exit_code;
+ __entry->exit_info1 = exit_info1;
+ __entry->exit_info2 = exit_info2;
+ __entry->exit_int_info = exit_int_info;
+ __entry->exit_int_info_err = exit_int_info_err;
+ ),
+ TP_printk("rip: 0x%016llx reason: %s ext_inf1: 0x%016llx "
+ "ext_inf2: 0x%016llx ext_int: 0x%08x ext_int_err: 0x%08x\n",
+ __entry->rip,
+ ftrace_print_symbols_seq(p, __entry->exit_code,
+ kvm_x86_ops->exit_reasons_str),
+ __entry->exit_info1, __entry->exit_info2,
+ __entry->exit_int_info, __entry->exit_int_info_err)
+);
+
#endif /* _TRACE_KVM_H */

/* This part must be outside protection */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ab2f90..192d58e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4985,3 +4985,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
--
1.6.5.2

2009-11-16 12:22:37

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 42/42] KVM: SVM: Add tracepoint for injected #vmexit

From: Joerg Roedel <[email protected]>

This patch adds a tracepoint for a nested #vmexit that gets
re-injected to the guest.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/svm.c | 6 ++++++
arch/x86/kvm/trace.h | 33 +++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index edf6e8b..369eeb8 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1592,6 +1592,12 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
struct vmcb *hsave = svm->nested.hsave;
struct vmcb *vmcb = svm->vmcb;

+ trace_kvm_nested_vmexit_inject(vmcb->control.exit_code,
+ vmcb->control.exit_info_1,
+ vmcb->control.exit_info_2,
+ vmcb->control.exit_int_info,
+ vmcb->control.exit_int_info_err);
+
nested_vmcb = nested_svm_map(svm, svm->nested.vmcb, KM_USER0);
if (!nested_vmcb)
return 1;
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index a7eb629..4d6bb5e 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -418,6 +418,39 @@ TRACE_EVENT(kvm_nested_vmexit,
__entry->exit_int_info, __entry->exit_int_info_err)
);

+/*
+ * Tracepoint for #VMEXIT reinjected to the guest
+ */
+TRACE_EVENT(kvm_nested_vmexit_inject,
+ TP_PROTO(__u32 exit_code,
+ __u64 exit_info1, __u64 exit_info2,
+ __u32 exit_int_info, __u32 exit_int_info_err),
+ TP_ARGS(exit_code, exit_info1, exit_info2,
+ exit_int_info, exit_int_info_err),
+
+ TP_STRUCT__entry(
+ __field( __u32, exit_code )
+ __field( __u64, exit_info1 )
+ __field( __u64, exit_info2 )
+ __field( __u32, exit_int_info )
+ __field( __u32, exit_int_info_err )
+ ),
+
+ TP_fast_assign(
+ __entry->exit_code = exit_code;
+ __entry->exit_info1 = exit_info1;
+ __entry->exit_info2 = exit_info2;
+ __entry->exit_int_info = exit_int_info;
+ __entry->exit_int_info_err = exit_int_info_err;
+ ),
+
+ TP_printk("reason: %s ext_inf1: 0x%016llx "
+ "ext_inf2: 0x%016llx ext_int: 0x%08x ext_int_err: 0x%08x\n",
+ ftrace_print_symbols_seq(p, __entry->exit_code,
+ kvm_x86_ops->exit_reasons_str),
+ __entry->exit_info1, __entry->exit_info2,
+ __entry->exit_int_info, __entry->exit_int_info_err)
+);
#endif /* _TRACE_KVM_H */

/* This part must be outside protection */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 192d58e..a522d9b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4986,3 +4986,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit_inject);
--
1.6.5.2

2010-08-08 12:03:12

by Serge Belyshev

[permalink] [raw]
Subject: Re: [PATCH 23/42] KVM: Activate Virtualization On Demand

Hi! Since the above patch went in
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=10474ae8
I have exactly the same problem as Dieter.

>>> If that's the case I'd say you have a broken BIOS or bootloader.
>>
>> You are right, it seems to be always cpu0. I'll check my BIOS and flash
>> a newer version if available.
>
> Please do. If that doesn't help ping me again. I'll write up a quirk patch then.
>

My h/w is a bit different (cpu: 9850 B3, mb: GA-MA790FX-DQ6 bios F7b),
but I cannot use latest available BIOS (F7d) as it breaks my SAS controller.

So I guess I need the quirk patch.

2010-08-16 13:24:11

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 23/42] KVM: Activate Virtualization On Demand


On 08.08.2010, at 14:02, Serge Belyshev wrote:

> Hi! Since the above patch went in
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=10474ae8
> I have exactly the same problem as Dieter.
>
>>>> If that's the case I'd say you have a broken BIOS or bootloader.
>>>
>>> You are right, it seems to be always cpu0. I'll check my BIOS and flash
>>> a newer version if available.
>>
>> Please do. If that doesn't help ping me again. I'll write up a quirk patch then.
>>
>
> My h/w is a bit different (cpu: 9850 B3, mb: GA-MA790FX-DQ6 bios F7b),
> but I cannot use latest available BIOS (F7d) as it breaks my SAS controller.
>
> So I guess I need the quirk patch.

Hrm - try to use the following (probably whitespace broken and 100% untested) hacky patch:

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 56c9b6b..bde9ee3 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -429,9 +429,11 @@ static int svm_hardware_enable(void *garbage)
struct desc_struct *gdt;
int me = raw_smp_processor_id();

+#if 0
rdmsrl(MSR_EFER, efer);
if (efer & EFER_SVME)
return -EBUSY;
+#endif

if (!has_svm()) {
printk(KERN_ERR "svm_hardware_enable: err EOPNOTSUPP on %d\n",


Alex-

2010-08-16 13:49:44

by Serge Belyshev

[permalink] [raw]
Subject: Re: [PATCH 23/42] KVM: Activate Virtualization On Demand

Alexander Graf <[email protected]> writes:

> Hrm - try to use the following (probably whitespace broken and 100% untested) hacky patch:
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 56c9b6b..bde9ee3 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -429,9 +429,11 @@ static int svm_hardware_enable(void *garbage)
> struct desc_struct *gdt;
> int me = raw_smp_processor_id();
>
> +#if 0
> rdmsrl(MSR_EFER, efer);
> if (efer & EFER_SVME)
> return -EBUSY;
> +#endif
>

I tested a bit different hack keeping rdmsrl but removing the if() and
it works fine for me:

Index: linux/arch/x86/kvm/svm.c
===================================================================
--- linux.orig/arch/x86/kvm/svm.c
+++ linux/arch/x86/kvm/svm.c
@@ -429,8 +429,7 @@ static int svm_hardware_enable(void *gar
int me = raw_smp_processor_id();

rdmsrl(MSR_EFER, efer);
- if (efer & EFER_SVME)
- return -EBUSY;
+ printk(KERN_DEBUG "svm_hardware_enable: efer %x on %d\n", efer, me);

if (!has_svm()) {
printk(KERN_ERR "svm_hardware_enable: err EOPNOTSUPP on %d\n",

Here is an example of its output:

[231441.358111] svm_hardware_enable: efer 1d01 on 0
[231441.358117] svm_hardware_enable: efer d01 on 3
[231441.358124] svm_hardware_enable: efer d01 on 2
[231441.358135] svm_hardware_enable: efer d01 on 1
[231675.168967] svm_hardware_enable: efer d01 on 0
[231675.168972] svm_hardware_enable: efer d01 on 1
[231675.168983] svm_hardware_enable: efer d01 on 2
[231675.168995] svm_hardware_enable: efer d01 on 3


After reboot, the first kvm run always reports efer = 1d01 on the first cpu.
In all subsequent runs before reboot efer stays d01 on all cpus.

2010-08-16 14:13:12

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 23/42] KVM: Activate Virtualization On Demand


On 16.08.2010, at 15:49, Serge Belyshev wrote:

> Alexander Graf <[email protected]> writes:
>
>> Hrm - try to use the following (probably whitespace broken and 100% untested) hacky patch:
>>
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 56c9b6b..bde9ee3 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -429,9 +429,11 @@ static int svm_hardware_enable(void *garbage)
>> struct desc_struct *gdt;
>> int me = raw_smp_processor_id();
>>
>> +#if 0
>> rdmsrl(MSR_EFER, efer);
>> if (efer & EFER_SVME)
>> return -EBUSY;
>> +#endif
>>
>
> I tested a bit different hack keeping rdmsrl but removing the if() and
> it works fine for me:
>
> Index: linux/arch/x86/kvm/svm.c
> ===================================================================
> --- linux.orig/arch/x86/kvm/svm.c
> +++ linux/arch/x86/kvm/svm.c
> @@ -429,8 +429,7 @@ static int svm_hardware_enable(void *gar
> int me = raw_smp_processor_id();
>
> rdmsrl(MSR_EFER, efer);
> - if (efer & EFER_SVME)
> - return -EBUSY;
> + printk(KERN_DEBUG "svm_hardware_enable: efer %x on %d\n", efer, me);

Ouch, yes. Thanks for not taking my patch ;). The efer read is of course still mandatory.

>
> if (!has_svm()) {
> printk(KERN_ERR "svm_hardware_enable: err EOPNOTSUPP on %d\n",
>
> Here is an example of its output:
>
> [231441.358111] svm_hardware_enable: efer 1d01 on 0
> [231441.358117] svm_hardware_enable: efer d01 on 3
> [231441.358124] svm_hardware_enable: efer d01 on 2
> [231441.358135] svm_hardware_enable: efer d01 on 1
> [231675.168967] svm_hardware_enable: efer d01 on 0
> [231675.168972] svm_hardware_enable: efer d01 on 1
> [231675.168983] svm_hardware_enable: efer d01 on 2
> [231675.168995] svm_hardware_enable: efer d01 on 3
>
>
> After reboot, the first kvm run always reports efer = 1d01 on the first cpu.
> In all subsequent runs before reboot efer stays d01 on all cpus.

Happy broken BIOS fun. Oh well, I guess you can live with the patched version and normal end-users will just update their BIOSes.


Alex