2006-10-23 13:29:06

by Avi Kivity

Subject: [PATCH 0/7] KVM: Kernel-based Virtual Machine (v2)

Changes:

- fixed a lockup on i386 with host memory >= 4GB
- finer patch split to conform to vger limits
- minor fixes
- send through script to avoid mailer damage

TODO:

- drivers/ or arch/ ?
- filesystem + syscalls or ioctl() ?

I'll also reply to this mail with a sample userspace so people can see
how it works.

---

The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.

Using this driver, one can start multiple virtual machines on a host. Each
virtual machine is a process on the host; a virtual cpu is a thread in that
process. kill(1), nice(1), top(1) work as expected.

In effect, the driver adds a third execution mode to the existing two: we now
have kernel mode, user mode, and guest mode. Guest mode has its own address
space mapping guest physical memory (which is accessible to user mode by
mmap()ing /dev/kvm). Guest mode has no access to any I/O devices; any such
access is intercepted and directed to user mode for emulation.
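
To illustrate the flow described above, here is a minimal sketch of how a
userspace process might emulate one intercepted port access. The struct is a
cut-down, hypothetical stand-in for the io part of struct kvm_run from patch 1;
toy_serial, handle_io_exit and the SKETCH_ names are assumptions made for
illustration and are not part of the patchset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced mirror of the io fields of struct kvm_run. */
enum { SKETCH_IO_IN = 0, SKETCH_IO_OUT = 1 };

struct sketch_io_exit {
	uint8_t  direction;   /* SKETCH_IO_IN or SKETCH_IO_OUT */
	uint8_t  size;        /* access width in bytes */
	uint16_t port;
	uint32_t value;       /* data written by the guest, or returned to it */
};

/* Toy device model: remembers the bytes the guest wrote to one port. */
struct toy_serial {
	uint16_t base_port;
	char     buf[64];
	size_t   len;
};

/*
 * Emulate one intercepted port access. Returns 1 if the toy device
 * handled it, 0 if no device claims the port.
 */
static int handle_io_exit(struct toy_serial *dev, struct sketch_io_exit *io)
{
	if (io->port != dev->base_port)
		return 0;

	if (io->direction == SKETCH_IO_OUT) {
		/* Guest wrote to the port: record the low byte. */
		if (dev->len < sizeof(dev->buf))
			dev->buf[dev->len++] = (char)(io->value & 0xff);
	} else {
		/* Guest read from the port: this toy device has no data. */
		io->value = 0;
	}
	return 1;
}
```

A real device model would run this kind of dispatch each time the run ioctl
returns with an I/O exit, then re-enter the guest.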

The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both
pae and non-pae paging modes are supported.

SMP hosts and UP guests are supported. At the moment only Intel hardware is
supported, but AMD virtualization support is being worked on.

Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:

- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables

Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.

In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.

Caveats:

- The Windows install currently bluescreens due to a problem with the virtual
APIC. We are working on a fix. A temporary workaround is to use an existing
image or install through qemu.
- Windows 64-bit does not work. That's also true for qemu, so it's probably
a problem with the device model.

--
error compiling committee.c: too many arguments to function


2006-10-23 13:29:49

by Avi Kivity

Subject: [PATCH 1/13] KVM: userspace interface

Changes from v1:

- removed #define __user for userspace

Still TODO:

- inode per vcpu
- pseudo filesystem + syscalls?

--

This patch defines a bunch of ioctl()s on /dev/kvm. The ioctl()s allow adding
memory to a virtual machine, adding a virtual cpu to a virtual machine (at
most one at this time), transferring control to the virtual cpu, and querying
about guest pages changed by the virtual machine.
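
As a rough illustration of the last point: struct kvm_dirty_log below hands
userspace a bitmap with one bit per guest page. A minimal sketch of how a
caller might walk such a bitmap follows. The function names are hypothetical,
and the bit order (page n is bit n % 8 of byte n / 8) is an assumption that
matches a bitmap of unsigned longs on a little-endian x86 host.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count dirty pages in a bitmap holding one bit per guest page.
 * Assumption: page n is bit (n % 8) of byte (n / 8).
 */
static size_t count_dirty_pages(const uint8_t *bitmap, size_t npages)
{
	size_t dirty = 0;

	for (size_t n = 0; n < npages; ++n)
		if (bitmap[n / 8] & (1u << (n % 8)))
			++dirty;
	return dirty;
}

/* Return the index of the first dirty page, or npages if none is dirty. */
static size_t first_dirty_page(const uint8_t *bitmap, size_t npages)
{
	for (size_t n = 0; n < npages; ++n)
		if (bitmap[n / 8] & (1u << (n % 8)))
			return n;
	return npages;
}
```

A caller interested in live migration or display updates would re-copy only the
pages whose bits are set.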

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/include/linux/kvm.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/kvm.h
@@ -0,0 +1,198 @@
+#ifndef __LINUX_KVM_H
+#define __LINUX_KVM_H
+
+/*
+ * Userspace interface for /dev/kvm - kernel based virtual machine
+ *
+ * Note: this interface is considered experimental and may change without
+ * notice.
+ */
+
+#include <asm/types.h>
+#include <linux/ioctl.h>
+
+/* for KVM_CREATE_MEMORY_REGION */
+struct kvm_memory_region {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size; /* bytes */
+};
+
+/* for kvm_memory_region::flags */
+#define KVM_MEM_LOG_DIRTY_PAGES 1UL
+
+
+#define KVM_EXIT_TYPE_FAIL_ENTRY 1
+#define KVM_EXIT_TYPE_VM_EXIT 2
+
+enum kvm_exit_reason {
+ KVM_EXIT_UNKNOWN,
+ KVM_EXIT_EXCEPTION,
+ KVM_EXIT_IO,
+ KVM_EXIT_CPUID,
+ KVM_EXIT_DEBUG,
+ KVM_EXIT_HLT,
+ KVM_EXIT_MMIO,
+};
+
+/* for KVM_RUN */
+struct kvm_run {
+ /* in */
+ __u32 vcpu;
+ __u32 emulated; /* skip current instruction */
+ __u32 mmio_completed; /* mmio request completed */
+
+ /* out */
+ __u32 exit_type;
+ __u32 exit_reason;
+ __u32 instruction_length;
+ union {
+ /* KVM_EXIT_UNKNOWN */
+ struct {
+ __u32 hardware_exit_reason;
+ } hw;
+ /* KVM_EXIT_EXCEPTION */
+ struct {
+ __u32 exception;
+ __u32 error_code;
+ } ex;
+ /* KVM_EXIT_IO */
+ struct {
+#define KVM_EXIT_IO_IN 0
+#define KVM_EXIT_IO_OUT 1
+ __u8 direction;
+ __u8 size; /* bytes */
+ __u8 string;
+ __u8 string_down;
+ __u8 rep;
+ __u8 pad;
+ __u16 port;
+ __u64 count;
+ union {
+ __u64 address;
+ __u32 value;
+ };
+ } io;
+ struct {
+ } debug;
+ /* KVM_EXIT_MMIO */
+ struct {
+ __u64 phys_addr;
+ __u8 data[8];
+ __u32 len;
+ __u8 is_write;
+ } mmio;
+ };
+};
+
+/* for KVM_GET_REGS and KVM_SET_REGS */
+struct kvm_regs {
+ /* in */
+ __u32 vcpu;
+ __u32 padding;
+
+ /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
+ __u64 rax, rbx, rcx, rdx;
+ __u64 rsi, rdi, rsp, rbp;
+ __u64 r8, r9, r10, r11;
+ __u64 r12, r13, r14, r15;
+ __u64 rip, rflags;
+};
+
+struct kvm_segment {
+ __u64 base;
+ __u32 limit;
+ __u16 selector;
+ __u8 type;
+ __u8 present, dpl, db, s, l, g, avl;
+ __u8 unusable;
+ __u8 padding;
+};
+
+struct kvm_dtable {
+ __u64 base;
+ __u16 limit;
+ __u16 padding[3];
+};
+
+/* for KVM_GET_SREGS and KVM_SET_SREGS */
+struct kvm_sregs {
+ /* in */
+ __u32 vcpu;
+ __u32 padding;
+
+ /* out (KVM_GET_SREGS) / in (KVM_SET_SREGS) */
+ struct kvm_segment cs, ds, es, fs, gs, ss;
+ struct kvm_segment tr, ldt;
+ struct kvm_dtable gdt, idt;
+ __u64 cr0, cr2, cr3, cr4, cr8;
+ __u64 efer;
+ __u64 apic_base;
+
+ /* out (KVM_GET_SREGS) */
+ __u32 pending_int;
+ __u32 padding2;
+};
+
+/* for KVM_TRANSLATE */
+struct kvm_translation {
+ /* in */
+ __u64 linear_address;
+ __u32 vcpu;
+ __u32 padding;
+
+ /* out */
+ __u64 physical_address;
+ __u8 valid;
+ __u8 writeable;
+ __u8 usermode;
+};
+
+/* for KVM_INTERRUPT */
+struct kvm_interrupt {
+ /* in */
+ __u32 vcpu;
+ __u32 irq;
+};
+
+struct kvm_breakpoint {
+ __u32 enabled;
+ __u32 padding;
+ __u64 address;
+};
+
+/* for KVM_DEBUG_GUEST */
+struct kvm_debug_guest {
+ /* in */
+ __u32 vcpu;
+ __u32 enabled;
+ struct kvm_breakpoint breakpoints[4];
+ __u32 singlestep;
+};
+
+/* for KVM_GET_DIRTY_LOG */
+struct kvm_dirty_log {
+ __u32 slot;
+ __u32 padding;
+ union {
+ void __user *dirty_bitmap; /* one bit per page */
+ __u64 padding;
+ };
+};
+
+#define KVMIO 0xAE
+
+#define KVM_RUN _IOWR(KVMIO, 2, struct kvm_run)
+#define KVM_GET_REGS _IOWR(KVMIO, 3, struct kvm_regs)
+#define KVM_SET_REGS _IOW(KVMIO, 4, struct kvm_regs)
+#define KVM_GET_SREGS _IOWR(KVMIO, 5, struct kvm_sregs)
+#define KVM_SET_SREGS _IOW(KVMIO, 6, struct kvm_sregs)
+#define KVM_TRANSLATE _IOWR(KVMIO, 7, struct kvm_translation)
+#define KVM_INTERRUPT _IOW(KVMIO, 8, struct kvm_interrupt)
+#define KVM_DEBUG_GUEST _IOW(KVMIO, 9, struct kvm_debug_guest)
+#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 10, struct kvm_memory_region)
+#define KVM_CREATE_VCPU _IOW(KVMIO, 11, int /* vcpu_slot */)
+#define KVM_GET_DIRTY_LOG _IOW(KVMIO, 12, struct kvm_dirty_log)
+
+#endif

2006-10-23 13:30:00

by Avi Kivity

Subject: [PATCH 2/13] KVM: Intel virtual mode extensions definitions

Add some constants for the various bits defined by Intel's VT extensions.

Most of this file was lifted from the Xen hypervisor.

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/vmx.h
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/vmx.h
@@ -0,0 +1,287 @@
+#ifndef VMX_H
+#define VMX_H
+
+/*
+ * vmx.h: VMX Architecture related definitions
+ * Copyright (c) 2004, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * A few random additions are:
+ * Copyright (C) 2006 Qumranet
+ * Avi Kivity <[email protected]>
+ * Yaniv Kamay <[email protected]>
+ *
+ */
+
+#define CPU_BASED_VIRTUAL_INTR_PENDING 0x00000004
+#define CPU_BASED_USE_TSC_OFFSETING 0x00000008
+#define CPU_BASED_HLT_EXITING 0x00000080
+#define CPU_BASED_INVDPG_EXITING 0x00000200
+#define CPU_BASED_MWAIT_EXITING 0x00000400
+#define CPU_BASED_RDPMC_EXITING 0x00000800
+#define CPU_BASED_RDTSC_EXITING 0x00001000
+#define CPU_BASED_CR8_LOAD_EXITING 0x00080000
+#define CPU_BASED_CR8_STORE_EXITING 0x00100000
+#define CPU_BASED_TPR_SHADOW 0x00200000
+#define CPU_BASED_MOV_DR_EXITING 0x00800000
+#define CPU_BASED_UNCOND_IO_EXITING 0x01000000
+#define CPU_BASED_ACTIVATE_IO_BITMAP 0x02000000
+#define CPU_BASED_MSR_BITMAPS 0x10000000
+#define CPU_BASED_MONITOR_EXITING 0x20000000
+#define CPU_BASED_PAUSE_EXITING 0x40000000
+
+#define PIN_BASED_EXT_INTR_MASK 0x1
+#define PIN_BASED_NMI_EXITING 0x8
+
+#define VM_EXIT_ACK_INTR_ON_EXIT 0x00008000
+#define VM_EXIT_HOST_ADD_SPACE_SIZE 0x00000200
+
+
+/* VMCS Encodings */
+enum vmcs_field {
+ GUEST_ES_SELECTOR = 0x00000800,
+ GUEST_CS_SELECTOR = 0x00000802,
+ GUEST_SS_SELECTOR = 0x00000804,
+ GUEST_DS_SELECTOR = 0x00000806,
+ GUEST_FS_SELECTOR = 0x00000808,
+ GUEST_GS_SELECTOR = 0x0000080a,
+ GUEST_LDTR_SELECTOR = 0x0000080c,
+ GUEST_TR_SELECTOR = 0x0000080e,
+ HOST_ES_SELECTOR = 0x00000c00,
+ HOST_CS_SELECTOR = 0x00000c02,
+ HOST_SS_SELECTOR = 0x00000c04,
+ HOST_DS_SELECTOR = 0x00000c06,
+ HOST_FS_SELECTOR = 0x00000c08,
+ HOST_GS_SELECTOR = 0x00000c0a,
+ HOST_TR_SELECTOR = 0x00000c0c,
+ IO_BITMAP_A = 0x00002000,
+ IO_BITMAP_A_HIGH = 0x00002001,
+ IO_BITMAP_B = 0x00002002,
+ IO_BITMAP_B_HIGH = 0x00002003,
+ MSR_BITMAP = 0x00002004,
+ MSR_BITMAP_HIGH = 0x00002005,
+ VM_EXIT_MSR_STORE_ADDR = 0x00002006,
+ VM_EXIT_MSR_STORE_ADDR_HIGH = 0x00002007,
+ VM_EXIT_MSR_LOAD_ADDR = 0x00002008,
+ VM_EXIT_MSR_LOAD_ADDR_HIGH = 0x00002009,
+ VM_ENTRY_MSR_LOAD_ADDR = 0x0000200a,
+ VM_ENTRY_MSR_LOAD_ADDR_HIGH = 0x0000200b,
+ TSC_OFFSET = 0x00002010,
+ TSC_OFFSET_HIGH = 0x00002011,
+ VIRTUAL_APIC_PAGE_ADDR = 0x00002012,
+ VIRTUAL_APIC_PAGE_ADDR_HIGH = 0x00002013,
+ VMCS_LINK_POINTER = 0x00002800,
+ VMCS_LINK_POINTER_HIGH = 0x00002801,
+ GUEST_IA32_DEBUGCTL = 0x00002802,
+ GUEST_IA32_DEBUGCTL_HIGH = 0x00002803,
+ PIN_BASED_VM_EXEC_CONTROL = 0x00004000,
+ CPU_BASED_VM_EXEC_CONTROL = 0x00004002,
+ EXCEPTION_BITMAP = 0x00004004,
+ PAGE_FAULT_ERROR_CODE_MASK = 0x00004006,
+ PAGE_FAULT_ERROR_CODE_MATCH = 0x00004008,
+ CR3_TARGET_COUNT = 0x0000400a,
+ VM_EXIT_CONTROLS = 0x0000400c,
+ VM_EXIT_MSR_STORE_COUNT = 0x0000400e,
+ VM_EXIT_MSR_LOAD_COUNT = 0x00004010,
+ VM_ENTRY_CONTROLS = 0x00004012,
+ VM_ENTRY_MSR_LOAD_COUNT = 0x00004014,
+ VM_ENTRY_INTR_INFO_FIELD = 0x00004016,
+ VM_ENTRY_EXCEPTION_ERROR_CODE = 0x00004018,
+ VM_ENTRY_INSTRUCTION_LEN = 0x0000401a,
+ TPR_THRESHOLD = 0x0000401c,
+ SECONDARY_VM_EXEC_CONTROL = 0x0000401e,
+ VM_INSTRUCTION_ERROR = 0x00004400,
+ VM_EXIT_REASON = 0x00004402,
+ VM_EXIT_INTR_INFO = 0x00004404,
+ VM_EXIT_INTR_ERROR_CODE = 0x00004406,
+ IDT_VECTORING_INFO_FIELD = 0x00004408,
+ IDT_VECTORING_ERROR_CODE = 0x0000440a,
+ VM_EXIT_INSTRUCTION_LEN = 0x0000440c,
+ VMX_INSTRUCTION_INFO = 0x0000440e,
+ GUEST_ES_LIMIT = 0x00004800,
+ GUEST_CS_LIMIT = 0x00004802,
+ GUEST_SS_LIMIT = 0x00004804,
+ GUEST_DS_LIMIT = 0x00004806,
+ GUEST_FS_LIMIT = 0x00004808,
+ GUEST_GS_LIMIT = 0x0000480a,
+ GUEST_LDTR_LIMIT = 0x0000480c,
+ GUEST_TR_LIMIT = 0x0000480e,
+ GUEST_GDTR_LIMIT = 0x00004810,
+ GUEST_IDTR_LIMIT = 0x00004812,
+ GUEST_ES_AR_BYTES = 0x00004814,
+ GUEST_CS_AR_BYTES = 0x00004816,
+ GUEST_SS_AR_BYTES = 0x00004818,
+ GUEST_DS_AR_BYTES = 0x0000481a,
+ GUEST_FS_AR_BYTES = 0x0000481c,
+ GUEST_GS_AR_BYTES = 0x0000481e,
+ GUEST_LDTR_AR_BYTES = 0x00004820,
+ GUEST_TR_AR_BYTES = 0x00004822,
+ GUEST_INTERRUPTIBILITY_INFO = 0x00004824,
+ GUEST_ACTIVITY_STATE = 0X00004826,
+ GUEST_SYSENTER_CS = 0x0000482A,
+ HOST_IA32_SYSENTER_CS = 0x00004c00,
+ CR0_GUEST_HOST_MASK = 0x00006000,
+ CR4_GUEST_HOST_MASK = 0x00006002,
+ CR0_READ_SHADOW = 0x00006004,
+ CR4_READ_SHADOW = 0x00006006,
+ CR3_TARGET_VALUE0 = 0x00006008,
+ CR3_TARGET_VALUE1 = 0x0000600a,
+ CR3_TARGET_VALUE2 = 0x0000600c,
+ CR3_TARGET_VALUE3 = 0x0000600e,
+ EXIT_QUALIFICATION = 0x00006400,
+ GUEST_LINEAR_ADDRESS = 0x0000640a,
+ GUEST_CR0 = 0x00006800,
+ GUEST_CR3 = 0x00006802,
+ GUEST_CR4 = 0x00006804,
+ GUEST_ES_BASE = 0x00006806,
+ GUEST_CS_BASE = 0x00006808,
+ GUEST_SS_BASE = 0x0000680a,
+ GUEST_DS_BASE = 0x0000680c,
+ GUEST_FS_BASE = 0x0000680e,
+ GUEST_GS_BASE = 0x00006810,
+ GUEST_LDTR_BASE = 0x00006812,
+ GUEST_TR_BASE = 0x00006814,
+ GUEST_GDTR_BASE = 0x00006816,
+ GUEST_IDTR_BASE = 0x00006818,
+ GUEST_DR7 = 0x0000681a,
+ GUEST_RSP = 0x0000681c,
+ GUEST_RIP = 0x0000681e,
+ GUEST_RFLAGS = 0x00006820,
+ GUEST_PENDING_DBG_EXCEPTIONS = 0x00006822,
+ GUEST_SYSENTER_ESP = 0x00006824,
+ GUEST_SYSENTER_EIP = 0x00006826,
+ HOST_CR0 = 0x00006c00,
+ HOST_CR3 = 0x00006c02,
+ HOST_CR4 = 0x00006c04,
+ HOST_FS_BASE = 0x00006c06,
+ HOST_GS_BASE = 0x00006c08,
+ HOST_TR_BASE = 0x00006c0a,
+ HOST_GDTR_BASE = 0x00006c0c,
+ HOST_IDTR_BASE = 0x00006c0e,
+ HOST_IA32_SYSENTER_ESP = 0x00006c10,
+ HOST_IA32_SYSENTER_EIP = 0x00006c12,
+ HOST_RSP = 0x00006c14,
+ HOST_RIP = 0x00006c16,
+};
+
+#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000
+
+#define EXIT_REASON_EXCEPTION_NMI 0
+#define EXIT_REASON_EXTERNAL_INTERRUPT 1
+
+#define EXIT_REASON_PENDING_INTERRUPT 7
+
+#define EXIT_REASON_TASK_SWITCH 9
+#define EXIT_REASON_CPUID 10
+#define EXIT_REASON_HLT 12
+#define EXIT_REASON_INVLPG 14
+#define EXIT_REASON_RDPMC 15
+#define EXIT_REASON_RDTSC 16
+#define EXIT_REASON_VMCALL 18
+#define EXIT_REASON_VMCLEAR 19
+#define EXIT_REASON_VMLAUNCH 20
+#define EXIT_REASON_VMPTRLD 21
+#define EXIT_REASON_VMPTRST 22
+#define EXIT_REASON_VMREAD 23
+#define EXIT_REASON_VMRESUME 24
+#define EXIT_REASON_VMWRITE 25
+#define EXIT_REASON_VMOFF 26
+#define EXIT_REASON_VMON 27
+#define EXIT_REASON_CR_ACCESS 28
+#define EXIT_REASON_DR_ACCESS 29
+#define EXIT_REASON_IO_INSTRUCTION 30
+#define EXIT_REASON_MSR_READ 31
+#define EXIT_REASON_MSR_WRITE 32
+#define EXIT_REASON_MWAIT_INSTRUCTION 36
+
+/*
+ * Interruption-information format
+ */
+#define INTR_INFO_VECTOR_MASK 0xff /* 7:0 */
+#define INTR_INFO_INTR_TYPE_MASK 0x700 /* 10:8 */
+#define INTR_INFO_DELIEVER_CODE_MASK 0x800 /* 11 */
+#define INTR_INFO_VALID_MASK 0x80000000 /* 31 */
+
+#define VECTORING_INFO_VECTOR_MASK INTR_INFO_VECTOR_MASK
+#define VECTORING_INFO_TYPE_MASK INTR_INFO_INTR_TYPE_MASK
+#define VECTORING_INFO_DELIEVER_CODE_MASK INTR_INFO_DELIEVER_CODE_MASK
+#define VECTORING_INFO_VALID_MASK INTR_INFO_VALID_MASK
+
+#define INTR_TYPE_EXT_INTR (0 << 8) /* external interrupt */
+#define INTR_TYPE_EXCEPTION (3 << 8) /* processor exception */
+
+/*
+ * Exit Qualifications for MOV for Control Register Access
+ */
+#define CONTROL_REG_ACCESS_NUM 0x7 /* 2:0, number of control register */
+#define CONTROL_REG_ACCESS_TYPE 0x30 /* 5:4, access type */
+#define CONTROL_REG_ACCESS_REG 0xf00 /* 11:8, general purpose register */
+#define LMSW_SOURCE_DATA_SHIFT 16
+#define LMSW_SOURCE_DATA (0xFFFF << LMSW_SOURCE_DATA_SHIFT) /* 16:31 lmsw source */
+#define REG_EAX (0 << 8)
+#define REG_ECX (1 << 8)
+#define REG_EDX (2 << 8)
+#define REG_EBX (3 << 8)
+#define REG_ESP (4 << 8)
+#define REG_EBP (5 << 8)
+#define REG_ESI (6 << 8)
+#define REG_EDI (7 << 8)
+#define REG_R8 (8 << 8)
+#define REG_R9 (9 << 8)
+#define REG_R10 (10 << 8)
+#define REG_R11 (11 << 8)
+#define REG_R12 (12 << 8)
+#define REG_R13 (13 << 8)
+#define REG_R14 (14 << 8)
+#define REG_R15 (15 << 8)
+
+/*
+ * Exit Qualifications for MOV for Debug Register Access
+ */
+#define DEBUG_REG_ACCESS_NUM 0x7 /* 2:0, number of debug register */
+#define DEBUG_REG_ACCESS_TYPE 0x10 /* 4, direction of access */
+#define TYPE_MOV_TO_DR (0 << 4)
+#define TYPE_MOV_FROM_DR (1 << 4)
+#define DEBUG_REG_ACCESS_REG 0xf00 /* 11:8, general purpose register */
+
+
+/* segment AR */
+#define SEGMENT_AR_L_MASK (1 << 13)
+
+/* entry controls */
+#define VM_ENTRY_CONTROLS_IA32E_MASK (1 << 9)
+
+#define AR_TYPE_ACCESSES_MASK 1
+#define AR_TYPE_READABLE_MASK (1 << 1)
+#define AR_TYPE_WRITEABLE_MASK (1 << 2)
+#define AR_TYPE_CODE_MASK (1 << 3)
+#define AR_TYPE_MASK 0x0f
+#define AR_TYPE_BUSY_64_TSS 11
+#define AR_TYPE_BUSY_32_TSS 11
+#define AR_TYPE_BUSY_16_TSS 3
+#define AR_TYPE_LDT 2
+
+#define AR_UNUSABLE_MASK (1 << 16)
+#define AR_S_MASK (1 << 4)
+#define AR_P_MASK (1 << 7)
+#define AR_L_MASK (1 << 13)
+#define AR_DB_MASK (1 << 14)
+#define AR_G_MASK (1 << 15)
+#define AR_DPL_SHIFT 5
+#define AR_DPL(ar) (((ar) >> AR_DPL_SHIFT) & 3)
+
+#define AR_RESERVD_MASK 0xfffe0f00
+
+#endif

2006-10-23 13:30:46

by Avi Kivity

Subject: [PATCH 6/13] KVM: memory slot management

kvm defines guest memory in "slots", more or less corresponding to the DIMM
slots of a physical machine.

This allows us to:
- avoid the VGA hole at 640K
- add a pci framebuffer at runtime
- hotplug memory
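
As a sketch of the first point, guest RAM can be described as two slots that
leave the range between 640K and 1M unbacked. The overlap test mirrors the
range check in the patch below; the struct, the function names and the 4K page
size are assumptions made for illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

struct sketch_slot {
	uint64_t base_gfn;   /* first guest frame number */
	uint64_t npages;     /* length in pages */
};

/* Two ranges overlap unless one ends at or before the start of the other. */
static int slots_overlap(const struct sketch_slot *a,
			 const struct sketch_slot *b)
{
	return !(a->base_gfn + a->npages <= b->base_gfn ||
		 a->base_gfn >= b->base_gfn + b->npages);
}

/*
 * Cover guest physical addresses [0, top_bytes) with two slots, leaving
 * the legacy hole between 640 KiB and 1 MiB unbacked. Assumes top_bytes
 * is page aligned and larger than 1 MiB.
 */
static void layout_around_vga_hole(uint64_t top_bytes,
				   struct sketch_slot *low,
				   struct sketch_slot *high)
{
	const uint64_t hole_start = 640 * 1024;
	const uint64_t hole_end   = 1024 * 1024;

	low->base_gfn  = 0;
	low->npages    = hole_start >> SKETCH_PAGE_SHIFT;
	high->base_gfn = hole_end >> SKETCH_PAGE_SHIFT;
	high->npages   = (top_bytes - hole_end) >> SKETCH_PAGE_SHIFT;
}
```

A framebuffer added later would simply take another slot at a guest physical
address that passes the same overlap test.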

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/kvm_main.c
===================================================================
--- linux-2.6.orig/drivers/kvm/kvm_main.c
+++ linux-2.6/drivers/kvm/kvm_main.c
@@ -675,6 +675,211 @@ static void vcpu_put_rsp_rip(struct kvm_
vmcs_writel(GUEST_RIP, vcpu->rip);
}

+/*
+ * Allocate some memory and give it an address in the guest physical address
+ * space.
+ *
+ * Discontiguous memory is allowed, mostly for framebuffers.
+ */
+static int kvm_dev_ioctl_set_memory_region(struct kvm *kvm,
+ struct kvm_memory_region *mem)
+{
+ int r;
+ gfn_t base_gfn;
+ unsigned long npages;
+ unsigned long i;
+ struct kvm_memory_slot *memslot;
+ struct kvm_memory_slot old, new;
+ int memory_config_version;
+
+ r = -EINVAL;
+ /* General sanity checks */
+ if (mem->memory_size & (PAGE_SIZE - 1))
+ goto out;
+ if (mem->guest_phys_addr & (PAGE_SIZE - 1))
+ goto out;
+ if (mem->slot >= KVM_MEMORY_SLOTS)
+ goto out;
+ if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
+ goto out;
+
+ memslot = &kvm->memslots[mem->slot];
+ base_gfn = mem->guest_phys_addr >> PAGE_SHIFT;
+ npages = mem->memory_size >> PAGE_SHIFT;
+
+ if (!npages)
+ mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
+
+raced:
+ spin_lock(&kvm->lock);
+
+ memory_config_version = kvm->memory_config_version;
+ new = old = *memslot;
+
+ new.base_gfn = base_gfn;
+ new.npages = npages;
+ new.flags = mem->flags;
+
+ /* Disallow changing a memory slot's size. */
+ r = -EINVAL;
+ if (npages && old.npages && npages != old.npages)
+ goto out_unlock;
+
+ /* Check for overlaps */
+ r = -EEXIST;
+ for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
+ struct kvm_memory_slot *s = &kvm->memslots[i];
+
+ if (s == memslot)
+ continue;
+ if (!((base_gfn + npages <= s->base_gfn) ||
+ (base_gfn >= s->base_gfn + s->npages)))
+ goto out_unlock;
+ }
+ /*
+ * Do memory allocations outside lock. memory_config_version will
+ * detect any races.
+ */
+ spin_unlock(&kvm->lock);
+
+ /* Deallocate if slot is being removed */
+ if (!npages)
+ new.phys_mem = 0;
+
+ /* Free page dirty bitmap if unneeded */
+ if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES))
+ new.dirty_bitmap = 0;
+
+ r = -ENOMEM;
+
+ /* Allocate if a slot is being created */
+ if (npages && !new.phys_mem) {
+ new.phys_mem = vmalloc(npages * sizeof(struct page *));
+
+ if (!new.phys_mem)
+ goto out_free;
+
+ memset(new.phys_mem, 0, npages * sizeof(struct page *));
+ for (i = 0; i < npages; ++i) {
+ new.phys_mem[i] = alloc_page(GFP_HIGHUSER);
+ if (!new.phys_mem[i])
+ goto out_free;
+ }
+ }
+
+ /* Allocate page dirty bitmap if needed */
+ if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
+ unsigned dirty_bytes = ALIGN(npages, BITS_PER_LONG) / 8;
+
+ new.dirty_bitmap = vmalloc(dirty_bytes);
+ if (!new.dirty_bitmap)
+ goto out_free;
+ memset(new.dirty_bitmap, 0, dirty_bytes);
+ }
+
+ spin_lock(&kvm->lock);
+
+ if (memory_config_version != kvm->memory_config_version) {
+ spin_unlock(&kvm->lock);
+ kvm_free_physmem_slot(&new, &old);
+ goto raced;
+ }
+
+ r = -EAGAIN;
+ if (kvm->busy)
+ goto out_unlock;
+
+ if (mem->slot >= kvm->nmemslots)
+ kvm->nmemslots = mem->slot + 1;
+
+ *memslot = new;
+ ++kvm->memory_config_version;
+
+ spin_unlock(&kvm->lock);
+
+ for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+ struct kvm_vcpu *vcpu;
+
+ vcpu = vcpu_load(kvm, i);
+ if (!vcpu)
+ continue;
+ kvm_mmu_reset_context(vcpu);
+ vcpu_put(vcpu);
+ }
+
+ kvm_free_physmem_slot(&old, &new);
+ return 0;
+
+out_unlock:
+ spin_unlock(&kvm->lock);
+out_free:
+ kvm_free_physmem_slot(&new, &old);
+out:
+ return r;
+}
+
+/*
+ * Get (and clear) the dirty memory log for a memory slot.
+ */
+static int kvm_dev_ioctl_get_dirty_log(struct kvm *kvm,
+ struct kvm_dirty_log *log)
+{
+ struct kvm_memory_slot *memslot;
+ int r, i;
+ int n;
+ unsigned long any = 0;
+
+ spin_lock(&kvm->lock);
+
+ /*
+ * Prevent changes to guest memory configuration even while the lock
+ * is not taken.
+ */
+ ++kvm->busy;
+ spin_unlock(&kvm->lock);
+ r = -EINVAL;
+ if (log->slot >= KVM_MEMORY_SLOTS)
+ goto out;
+
+ memslot = &kvm->memslots[log->slot];
+ r = -ENOENT;
+ if (!memslot->dirty_bitmap)
+ goto out;
+
+ n = ALIGN(memslot->npages, 8) / 8;
+
+ for (i = 0; !any && i < n; ++i)
+ any = memslot->dirty_bitmap[i];
+
+ r = -EFAULT;
+ if (copy_to_user(log->dirty_bitmap, memslot->dirty_bitmap, n))
+ goto out;
+
+
+ if (any) {
+ spin_lock(&kvm->lock);
+ kvm_mmu_slot_remove_write_access(kvm, log->slot);
+ spin_unlock(&kvm->lock);
+ memset(memslot->dirty_bitmap, 0, n);
+ for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+ struct kvm_vcpu *vcpu = vcpu_load(kvm, i);
+
+ if (!vcpu)
+ continue;
+ flush_guest_tlb(vcpu);
+ vcpu_put(vcpu);
+ }
+ }
+
+ r = 0;
+
+out:
+ spin_lock(&kvm->lock);
+ --kvm->busy;
+ spin_unlock(&kvm->lock);
+ return r;
+}
+
struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
{
int i;
@@ -1086,6 +1291,28 @@ static long kvm_dev_ioctl(struct file *f
int r = -EINVAL;

switch (ioctl) {
+ case KVM_SET_MEMORY_REGION: {
+ struct kvm_memory_region kvm_mem;
+
+ r = -EFAULT;
+ if (copy_from_user(&kvm_mem, (void *)arg, sizeof kvm_mem))
+ goto out;
+ r = kvm_dev_ioctl_set_memory_region(kvm, &kvm_mem);
+ if (r)
+ goto out;
+ break;
+ }
+ case KVM_GET_DIRTY_LOG: {
+ struct kvm_dirty_log log;
+
+ r = -EFAULT;
+ if (copy_from_user(&log, (void *)arg, sizeof log))
+ goto out;
+ r = kvm_dev_ioctl_get_dirty_log(kvm, &log);
+ if (r)
+ goto out;
+ break;
+ }
default:
;
}

2006-10-23 13:30:15

by Avi Kivity

Subject: [PATCH 3/13] KVM: kvm data structures

Define data structures and some constants for a virtual machine.

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/kvm.h
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/kvm.h
@@ -0,0 +1,206 @@
+#ifndef __KVM_H
+#define __KVM_H
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+
+#define INVALID_PAGE (~(hpa_t)0)
+#define UNMAPPED_GVA (~(gpa_t)0)
+
+#define KVM_MAX_VCPUS 1
+#define KVM_MEMORY_SLOTS 4
+#define KVM_NUM_MMU_PAGES 256
+
+#define FX_IMAGE_SIZE 512
+#define FX_IMAGE_ALIGN 16
+#define FX_BUF_SIZE (2 * FX_IMAGE_SIZE + FX_IMAGE_ALIGN)
+
+/*
+ * Address types:
+ *
+ * gva - guest virtual address
+ * gpa - guest physical address
+ * gfn - guest frame number
+ * hva - host virtual address
+ * hpa - host physical address
+ * hfn - host frame number
+ */
+
+typedef unsigned long gva_t;
+typedef u64 gpa_t;
+typedef unsigned long gfn_t;
+
+typedef unsigned long hva_t;
+typedef u64 hpa_t;
+typedef unsigned long hfn_t;
+
+struct kvm_mmu_page {
+ struct list_head link;
+ hpa_t page_hpa;
+ unsigned long slot_bitmap; /* One bit set per slot which has memory
+ * in this shadow page.
+ */
+ int global; /* Set if all ptes in this page are global */
+ u64 *parent_pte;
+};
+
+struct vmcs {
+ u32 revision_id;
+ u32 abort;
+ char data[0];
+};
+
+struct vmx_msr_entry {
+ u32 index;
+ u32 reserved;
+ u64 data;
+};
+
+struct kvm_vcpu;
+
+/*
+ * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
+ * 32-bit). The kvm_mmu structure abstracts the details of the current mmu
+ * mode.
+ */
+struct kvm_mmu {
+ void (*new_cr3)(struct kvm_vcpu *vcpu);
+ int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
+ void (*inval_page)(struct kvm_vcpu *vcpu, gva_t gva);
+ void (*free)(struct kvm_vcpu *vcpu);
+ gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva);
+ hpa_t root_hpa;
+ int root_level;
+ int shadow_root_level;
+};
+
+struct kvm_guest_debug {
+ int enabled;
+ unsigned long bp[4];
+ int singlestep;
+};
+
+enum {
+ VCPU_REGS_RAX = 0,
+ VCPU_REGS_RCX = 1,
+ VCPU_REGS_RDX = 2,
+ VCPU_REGS_RBX = 3,
+ VCPU_REGS_RSP = 4,
+ VCPU_REGS_RBP = 5,
+ VCPU_REGS_RSI = 6,
+ VCPU_REGS_RDI = 7,
+#ifdef __x86_64__
+ VCPU_REGS_R8 = 8,
+ VCPU_REGS_R9 = 9,
+ VCPU_REGS_R10 = 10,
+ VCPU_REGS_R11 = 11,
+ VCPU_REGS_R12 = 12,
+ VCPU_REGS_R13 = 13,
+ VCPU_REGS_R14 = 14,
+ VCPU_REGS_R15 = 15,
+#endif
+ NR_VCPU_REGS
+};
+
+struct kvm_vcpu {
+ struct kvm *kvm;
+ struct vmcs *vmcs;
+ struct mutex mutex;
+ int cpu;
+ int launched;
+ unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
+#define NR_IRQ_WORDS (256 / BITS_PER_LONG)
+ unsigned long irq_pending[NR_IRQ_WORDS];
+ unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
+ unsigned long rip; /* needs vcpu_load_rsp_rip() */
+
+ unsigned long cr2;
+ unsigned long cr3;
+ unsigned long cr8;
+ u64 shadow_efer;
+ u64 apic_base;
+ struct vmx_msr_entry *guest_msrs;
+ struct vmx_msr_entry *host_msrs;
+
+ struct list_head free_pages;
+ struct kvm_mmu_page page_header_buf[KVM_NUM_MMU_PAGES];
+ struct kvm_mmu mmu;
+
+ struct kvm_guest_debug guest_debug;
+
+ char fx_buf[FX_BUF_SIZE];
+ char *host_fx_image;
+ char *guest_fx_image;
+
+ int mmio_needed;
+ int mmio_read_completed;
+ int mmio_is_write;
+ int mmio_size;
+ unsigned char mmio_data[8];
+ gpa_t mmio_phys_addr;
+
+ struct{
+ int active;
+ u8 save_iopl;
+ struct {
+ unsigned long base;
+ u32 limit;
+ u32 ar;
+ } tr;
+ } rmode;
+};
+
+struct kvm_memory_slot {
+ gfn_t base_gfn;
+ unsigned long npages;
+ unsigned long flags;
+ struct page **phys_mem;
+ unsigned long *dirty_bitmap;
+};
+
+struct kvm {
+ spinlock_t lock; /* protects everything except vcpus */
+ int nmemslots;
+ struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS];
+ struct list_head active_mmu_pages;
+ struct kvm_vcpu vcpus[KVM_MAX_VCPUS];
+ int memory_config_version;
+ int busy;
+};
+
+struct kvm_stat {
+ u32 pf_fixed;
+ u32 pf_guest;
+ u32 tlb_flush;
+ u32 invlpg;
+
+ u32 exits;
+ u32 io_exits;
+ u32 mmio_exits;
+ u32 signal_exits;
+ u32 irq_exits;
+};
+
+extern struct kvm_stat kvm_stat;
+
+#define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
+#define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
+
+void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
+int kvm_mmu_init(struct kvm_vcpu *vcpu);
+
+int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+
+hpa_t gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa);
+#define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
+#define HPA_ERR_MASK ((hpa_t)1 << HPA_MSB)
+static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
+hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva);
+
+extern hpa_t bad_page_address;
+
+#endif

2006-10-23 13:30:57

by Avi Kivity

Subject: [PATCH 4/13] KVM: random accessors and constants

Define some constants and accessors to be used later on.

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/kvm.h
===================================================================
--- linux-2.6.orig/drivers/kvm/kvm.h
+++ linux-2.6/drivers/kvm/kvm.h
@@ -7,6 +7,38 @@
#include <linux/spinlock.h>
#include <linux/mm.h>

+#include "vmx.h"
+
+#define CR0_PE_MASK (1ULL << 0)
+#define CR0_TS_MASK (1ULL << 3)
+#define CR0_NE_MASK (1ULL << 5)
+#define CR0_WP_MASK (1ULL << 16)
+#define CR0_NW_MASK (1ULL << 29)
+#define CR0_CD_MASK (1ULL << 30)
+#define CR0_PG_MASK (1ULL << 31)
+
+#define CR3_WPT_MASK (1ULL << 3)
+#define CR3_PCD_MASK (1ULL << 4)
+
+#define CR3_RESEVED_BITS 0x07ULL
+#define CR3_L_MODE_RESEVED_BITS (~((1ULL << 40) - 1) | 0x0fe7ULL)
+#define CR3_FLAGS_MASK ((1ULL << 5) - 1)
+
+#define CR4_VME_MASK (1ULL << 0)
+#define CR4_PSE_MASK (1ULL << 4)
+#define CR4_PAE_MASK (1ULL << 5)
+#define CR4_PGE_MASK (1ULL << 7)
+#define CR4_VMXE_MASK (1ULL << 13)
+
+#define KVM_GUEST_CR0_MASK \
+ (CR0_PG_MASK | CR0_PE_MASK | CR0_WP_MASK | CR0_NE_MASK)
+#define KVM_VM_CR0_ALWAYS_ON KVM_GUEST_CR0_MASK
+
+#define KVM_GUEST_CR4_MASK \
+ (CR4_PSE_MASK | CR4_PAE_MASK | CR4_PGE_MASK | CR4_VMXE_MASK | CR4_VME_MASK)
+#define KVM_PMODE_VM_CR4_ALWAYS_ON (CR4_VMXE_MASK | CR4_PAE_MASK)
+#define KVM_RMODE_VM_CR4_ALWAYS_ON (CR4_VMXE_MASK | CR4_PAE_MASK | CR4_VME_MASK)
+
#define INVALID_PAGE (~(hpa_t)0)
#define UNMAPPED_GVA (~(gpa_t)0)

@@ -18,6 +50,19 @@
#define FX_IMAGE_ALIGN 16
#define FX_BUF_SIZE (2 * FX_IMAGE_SIZE + FX_IMAGE_ALIGN)

+#define DE_VECTOR 0
+#define DF_VECTOR 8
+#define TS_VECTOR 10
+#define NP_VECTOR 11
+#define SS_VECTOR 12
+#define GP_VECTOR 13
+#define PF_VECTOR 14
+
+#define SELECTOR_TI_MASK (1 << 2)
+#define SELECTOR_RPL_MASK 0x03
+
+#define IOPL_SHIFT 12
+
/*
* Address types:
*
@@ -203,4 +248,125 @@ hpa_t gva_to_hpa(struct kvm_vcpu *vcpu,

extern hpa_t bad_page_address;

+static inline struct page *gfn_to_page(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ return slot->phys_mem[gfn - slot->base_gfn];
+}
+
+struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
+
+void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address);
+void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address);
+void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw,
+ unsigned long *rflags);
+
+unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr);
+void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value,
+ unsigned long *rflags);
+
+int kvm_read_guest(struct kvm_vcpu *vcpu,
+ gva_t addr,
+ unsigned long size,
+ void *dest);
+
+int kvm_write_guest(struct kvm_vcpu *vcpu,
+ gva_t addr,
+ unsigned long size,
+ void *data);
+
+void vmcs_writel(unsigned long field, unsigned long value);
+unsigned long vmcs_readl(unsigned long field);
+
+static inline u16 vmcs_read16(unsigned long field)
+{
+ return vmcs_readl(field);
+}
+
+static inline u32 vmcs_read32(unsigned long field)
+{
+ return vmcs_readl(field);
+}
+
+static inline u64 vmcs_read64(unsigned long field)
+{
+#ifdef __x86_64__
+ return vmcs_readl(field);
+#else
+ return vmcs_readl(field) | ((u64)vmcs_readl(field+1) << 32);
+#endif
+}
+
+static inline void vmcs_write32(unsigned long field, u32 value)
+{
+ vmcs_writel(field, value);
+}
+
+static inline int is_long_mode(void)
+{
+ return vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_CONTROLS_IA32E_MASK;
+}
+
+static inline unsigned long guest_cr4(void)
+{
+ return (vmcs_readl(CR4_READ_SHADOW) & KVM_GUEST_CR4_MASK) |
+ (vmcs_readl(GUEST_CR4) & ~KVM_GUEST_CR4_MASK);
+}
+
+static inline int is_pae(void)
+{
+ return guest_cr4() & CR4_PAE_MASK;
+}
+
+static inline int is_pse(void)
+{
+ return guest_cr4() & CR4_PSE_MASK;
+}
+
+static inline unsigned long guest_cr0(void)
+{
+ return (vmcs_readl(CR0_READ_SHADOW) & KVM_GUEST_CR0_MASK) |
+ (vmcs_readl(GUEST_CR0) & ~KVM_GUEST_CR0_MASK);
+}
+
+static inline unsigned guest_cpl(void)
+{
+ return vmcs_read16(GUEST_CS_SELECTOR) & SELECTOR_RPL_MASK;
+}
+
+static inline int is_paging(void)
+{
+ return guest_cr0() & CR0_PG_MASK;
+}
+
+static inline int is_page_fault(u32 intr_info)
+{
+ return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
+ INTR_INFO_VALID_MASK)) ==
+ (INTR_TYPE_EXCEPTION | PF_VECTOR | INTR_INFO_VALID_MASK);
+}
+
+static inline int is_external_interrupt(u32 intr_info)
+{
+ return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+ == (INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+}
+
+static inline void flush_guest_tlb(struct kvm_vcpu *vcpu)
+{
+ vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3));
+}
+
+static inline int memslot_id(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ return slot - kvm->memslots;
+}
+
+static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
+{
+ struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
+
+ return (struct kvm_mmu_page *)page->private;
+}
+
#endif

2006-10-23 13:31:36

by Avi Kivity

Subject: [PATCH 8/13] KVM: vcpu execution loop

This defines the KVM_RUN ioctl(), which enters guest mode, and a mechanism for
handling exits, either in-kernel or by userspace. Actual users of the
mechanism are in later patches.
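
As a rough illustration of the exit-handling mechanism described above, the
following userspace sketch mimics the dispatch convention: a sparse table of
handlers indexed by exit reason, where a handler returns 1 if the guest may
resume and 0 if userspace must step in. The names run_state, handle_resume,
handle_io, dispatch_exit and the EXIT_* values are hypothetical stand-ins
and are not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct kvm_run; not the real layout. */
struct run_state {
	unsigned exit_reason;          /* what userspace should do */
	unsigned hardware_exit_reason; /* raw reason, if unhandled */
};

enum { EXIT_UNKNOWN = 0, EXIT_IO = 1 }; /* illustrative values only */

typedef int (*exit_handler_t)(struct run_state *run);

/* Fully handled: return 1 so the caller re-enters the guest. */
static int handle_resume(struct run_state *run)
{
	(void)run;
	return 1;
}

/* Needs userspace: describe the exit and return 0. */
static int handle_io(struct run_state *run)
{
	run->exit_reason = EXIT_IO;
	return 0;
}

/* Sparse table indexed by exit reason; unlisted slots are NULL. */
static exit_handler_t handlers[] = {
	[1] = handle_resume,
	[3] = handle_io,
};

#define NR_HANDLERS (sizeof(handlers) / sizeof(handlers[0]))

/* Mirrors the bounds check plus NULL check before the indirect call. */
static int dispatch_exit(struct run_state *run, unsigned reason)
{
	if (reason < NR_HANDLERS && handlers[reason])
		return handlers[reason](run);

	run->exit_reason = EXIT_UNKNOWN;
	run->hardware_exit_reason = reason;
	return 0;
}
```

In this sketch an empty table behaves like the patch's initially empty
handler array: every exit falls through to the "unknown" path and is
reported to the caller with the raw reason attached.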

Also introduced are interrupt injection and the guest debugger.
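
The interrupt bookkeeping can be pictured as a two-level bitmap: one bit per
vector, plus a summary word with one bit per bitmap word so that "anything
pending?" is a single test. The sketch below is a plain userspace model of
that idea. The struct and function names are hypothetical, and the dequeue
order (lowest vector first) is an assumption made for the example rather
than something this patch specifies.

```c
#include <limits.h>

#define NR_IRQS       256
#define BITS_PER_WORD ((int)(sizeof(unsigned long) * CHAR_BIT))
#define NR_WORDS      ((NR_IRQS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical model of the per-vcpu pending-interrupt state. */
struct irq_state {
	unsigned long pending[NR_WORDS]; /* one bit per vector */
	unsigned long summary;           /* one bit per pending[] word */
};

/* Queue a vector; returns 0 on success, -1 if out of range. */
static int irq_queue(struct irq_state *s, int irq)
{
	if (irq < 0 || irq >= NR_IRQS)
		return -1;
	s->pending[irq / BITS_PER_WORD] |= 1UL << (irq % BITS_PER_WORD);
	s->summary |= 1UL << (irq / BITS_PER_WORD);
	return 0;
}

/* Remove and return the lowest pending vector, or -1 if none. */
static int irq_dequeue(struct irq_state *s)
{
	int word = 0, bit = 0;

	if (!s->summary)
		return -1;
	/* A set summary bit guarantees a non-zero pending word. */
	while (!(s->summary & (1UL << word)))
		++word;
	while (!(s->pending[word] & (1UL << bit)))
		++bit;

	s->pending[word] &= ~(1UL << bit);
	if (!s->pending[word])
		s->summary &= ~(1UL << word);
	return word * BITS_PER_WORD + bit;
}
```

The summary word is what makes the pre-entry check cheap: the run loop only
needs to look at one word to decide whether injection should be attempted.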

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/kvm_main.c
===================================================================
--- linux-2.6.orig/drivers/kvm/kvm_main.c
+++ linux-2.6/drivers/kvm/kvm_main.c
@@ -1266,6 +1266,25 @@ void mark_page_dirty(struct kvm *kvm, gf
}
}

+static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip;
+ u32 interruptibility;
+
+ rip = vmcs_readl(GUEST_RIP);
+ rip += vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+ vmcs_writel(GUEST_RIP, rip);
+
+ /*
+ * We emulated an instruction, so temporary interrupt blocking
+ * should be removed, if set.
+ */
+ interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+ if (interruptibility & 3)
+ vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
+ interruptibility & ~3);
+}
+
static int pdptrs_have_reserved_bits_set(struct kvm_vcpu *vcpu,
unsigned long cr3)
{
@@ -1537,6 +1524,42 @@ static void __set_efer(struct kvm_vcpu *
}
#endif

+/*
+ * The exit handlers return 1 if the exit was handled fully and guest execution
+ * may resume. Otherwise they set the kvm_run parameter to indicate to userspace
+ * what needs to be done and return 0.
+ */
+static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu,
+ struct kvm_run *kvm_run) = {
+};
+
+static const int kvm_vmx_max_exit_handlers =
+ sizeof(kvm_vmx_exit_handlers) / sizeof(*kvm_vmx_exit_handlers);
+
+/*
+ * The guest has exited. See if we can fix it or if we need userspace
+ * assistance.
+ */
+static int kvm_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
+{
+ u32 vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+ u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
+
+ if ( (vectoring_info & VECTORING_INFO_VALID_MASK) &&
+ exit_reason != EXIT_REASON_EXCEPTION_NMI )
+ printk("%s: unexpected, valid vectoring info and exit"
+ " reason is 0x%x\n", __FUNCTION__, exit_reason);
+ kvm_run->instruction_length = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+ if (exit_reason < kvm_vmx_max_exit_handlers
+ && kvm_vmx_exit_handlers[exit_reason])
+ return kvm_vmx_exit_handlers[exit_reason](vcpu, kvm_run);
+ else {
+ kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
+ kvm_run->hw.hardware_exit_reason = exit_reason;
+ }
+ return 0;
+}
+
static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq)
{
u16 ent[2];
@@ -1617,6 +1640,24 @@ static void kvm_try_inject_irq(struct kv
| CPU_BASED_VIRTUAL_INTR_PENDING);
}

+static void kvm_guest_debug_pre(struct kvm_vcpu *vcpu)
+{
+ struct kvm_guest_debug *dbg = &vcpu->guest_debug;
+
+ set_debugreg(dbg->bp[0], 0);
+ set_debugreg(dbg->bp[1], 1);
+ set_debugreg(dbg->bp[2], 2);
+ set_debugreg(dbg->bp[3], 3);
+
+ if (dbg->singlestep) {
+ unsigned long flags;
+
+ flags = vmcs_readl(GUEST_RFLAGS);
+ flags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
+ vmcs_writel(GUEST_RFLAGS, flags);
+ }
+}
+
static void load_msrs(struct vmx_msr_entry *e)
{
int i;
@@ -1631,6 +1672,239 @@ static void save_msrs(struct vmx_msr_ent
rdmsrl(e[msr_index].index, e[msr_index].data);
}

+static int kvm_dev_ioctl_run(struct kvm *kvm, struct kvm_run *kvm_run)
+{
+ struct kvm_vcpu *vcpu;
+ u8 fail;
+ u16 fs_sel, gs_sel, ldt_sel;
+ int fs_gs_ldt_reload_needed;
+
+ if (kvm_run->vcpu < 0 || kvm_run->vcpu >= KVM_MAX_VCPUS)
+ return -EINVAL;
+
+ vcpu = vcpu_load(kvm, kvm_run->vcpu);
+ if (!vcpu)
+ return -ENOENT;
+
+ if (kvm_run->emulated) {
+ skip_emulated_instruction(vcpu);
+ kvm_run->emulated = 0;
+ }
+
+ if (kvm_run->mmio_completed) {
+ memcpy(vcpu->mmio_data, kvm_run->mmio.data, 8);
+ vcpu->mmio_read_completed = 1;
+ }
+
+ vcpu->mmio_needed = 0;
+
+again:
+ /*
+ * Set host fs and gs selectors. Unfortunately, 22.2.3 does not
+ * allow segment selectors with cpl > 0 or ti == 1.
+ */
+ fs_sel = read_fs();
+ gs_sel = read_gs();
+ ldt_sel = read_ldt();
+ fs_gs_ldt_reload_needed = (fs_sel & 7) | (gs_sel & 7) | ldt_sel;
+ if (!fs_gs_ldt_reload_needed) {
+ vmcs_write16(HOST_FS_SELECTOR, fs_sel);
+ vmcs_write16(HOST_GS_SELECTOR, gs_sel);
+ } else {
+ vmcs_write16(HOST_FS_SELECTOR, 0);
+ vmcs_write16(HOST_GS_SELECTOR, 0);
+ }
+
+#ifdef __x86_64__
+ vmcs_writel(HOST_FS_BASE, read_msr(MSR_FS_BASE));
+ vmcs_writel(HOST_GS_BASE, read_msr(MSR_GS_BASE));
+#endif
+
+ if (vcpu->irq_summary &&
+ !(vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK))
+ kvm_try_inject_irq(vcpu);
+
+ if (vcpu->guest_debug.enabled)
+ kvm_guest_debug_pre(vcpu);
+
+ fx_save(vcpu->host_fx_image);
+ fx_restore(vcpu->guest_fx_image);
+
+ save_msrs(vcpu->host_msrs, 0);
+ load_msrs(vcpu->guest_msrs);
+
+ asm (
+ /* Store host registers */
+ "pushf \n\t"
+#ifdef __x86_64__
+ "push %%rax; push %%rbx; push %%rdx;"
+ "push %%rsi; push %%rdi; push %%rbp;"
+ "push %%r8; push %%r9; push %%r10; push %%r11;"
+ "push %%r12; push %%r13; push %%r14; push %%r15;"
+ "push %%rcx \n\t"
+ "vmwrite %%rsp, %2 \n\t"
+#else
+ "pusha; push %%ecx \n\t"
+ "vmwrite %%esp, %2 \n\t"
+#endif
+ /* Check if vmlaunch or vmresume is needed */
+ "cmp $0, %1 \n\t"
+ /* Load guest registers. Don't clobber flags. */
+#ifdef __x86_64__
+ "mov %c[cr2](%3), %%rax \n\t"
+ "mov %%rax, %%cr2 \n\t"
+ "mov %c[rax](%3), %%rax \n\t"
+ "mov %c[rbx](%3), %%rbx \n\t"
+ "mov %c[rdx](%3), %%rdx \n\t"
+ "mov %c[rsi](%3), %%rsi \n\t"
+ "mov %c[rdi](%3), %%rdi \n\t"
+ "mov %c[rbp](%3), %%rbp \n\t"
+ "mov %c[r8](%3), %%r8 \n\t"
+ "mov %c[r9](%3), %%r9 \n\t"
+ "mov %c[r10](%3), %%r10 \n\t"
+ "mov %c[r11](%3), %%r11 \n\t"
+ "mov %c[r12](%3), %%r12 \n\t"
+ "mov %c[r13](%3), %%r13 \n\t"
+ "mov %c[r14](%3), %%r14 \n\t"
+ "mov %c[r15](%3), %%r15 \n\t"
+ "mov %c[rcx](%3), %%rcx \n\t" /* kills %3 (rcx) */
+#else
+ "mov %c[cr2](%3), %%eax \n\t"
+ "mov %%eax, %%cr2 \n\t"
+ "mov %c[rax](%3), %%eax \n\t"
+ "mov %c[rbx](%3), %%ebx \n\t"
+ "mov %c[rdx](%3), %%edx \n\t"
+ "mov %c[rsi](%3), %%esi \n\t"
+ "mov %c[rdi](%3), %%edi \n\t"
+ "mov %c[rbp](%3), %%ebp \n\t"
+ "mov %c[rcx](%3), %%ecx \n\t" /* kills %3 (ecx) */
+#endif
+ /* Enter guest mode */
+ "jne launched \n\t"
+ "vmlaunch \n\t"
+ "jmp kvm_vmx_return \n\t"
+ "launched: vmresume \n\t"
+ ".globl kvm_vmx_return \n\t"
+ "kvm_vmx_return: "
+ /* Save guest registers, load host registers, keep flags */
+#ifdef __x86_64__
+ "xchg %3, 0(%%rsp) \n\t"
+ "mov %%rax, %c[rax](%3) \n\t"
+ "mov %%rbx, %c[rbx](%3) \n\t"
+ "pushq 0(%%rsp); popq %c[rcx](%3) \n\t"
+ "mov %%rdx, %c[rdx](%3) \n\t"
+ "mov %%rsi, %c[rsi](%3) \n\t"
+ "mov %%rdi, %c[rdi](%3) \n\t"
+ "mov %%rbp, %c[rbp](%3) \n\t"
+ "mov %%r8, %c[r8](%3) \n\t"
+ "mov %%r9, %c[r9](%3) \n\t"
+ "mov %%r10, %c[r10](%3) \n\t"
+ "mov %%r11, %c[r11](%3) \n\t"
+ "mov %%r12, %c[r12](%3) \n\t"
+ "mov %%r13, %c[r13](%3) \n\t"
+ "mov %%r14, %c[r14](%3) \n\t"
+ "mov %%r15, %c[r15](%3) \n\t"
+ "mov %%cr2, %%rax \n\t"
+ "mov %%rax, %c[cr2](%3) \n\t"
+ "mov 0(%%rsp), %3 \n\t"
+
+ "pop %%rcx; pop %%r15; pop %%r14; pop %%r13; pop %%r12;"
+ "pop %%r11; pop %%r10; pop %%r9; pop %%r8;"
+ "pop %%rbp; pop %%rdi; pop %%rsi;"
+ "pop %%rdx; pop %%rbx; pop %%rax \n\t"
+#else
+ "xchg %3, 0(%%esp) \n\t"
+ "mov %%eax, %c[rax](%3) \n\t"
+ "mov %%ebx, %c[rbx](%3) \n\t"
+ "pushl 0(%%esp); popl %c[rcx](%3) \n\t"
+ "mov %%edx, %c[rdx](%3) \n\t"
+ "mov %%esi, %c[rsi](%3) \n\t"
+ "mov %%edi, %c[rdi](%3) \n\t"
+ "mov %%ebp, %c[rbp](%3) \n\t"
+ "mov %%cr2, %%eax \n\t"
+ "mov %%eax, %c[cr2](%3) \n\t"
+ "mov 0(%%esp), %3 \n\t"
+
+ "pop %%ecx; popa \n\t"
+#endif
+ "setbe %0 \n\t"
+ "popf \n\t"
+ : "=g" (fail)
+ : "r"(vcpu->launched), "r"((unsigned long)HOST_RSP),
+ "c"(vcpu),
+ [rax]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RAX])),
+ [rbx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RBX])),
+ [rcx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RCX])),
+ [rdx]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RDX])),
+ [rsi]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RSI])),
+ [rdi]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RDI])),
+ [rbp]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RBP])),
+#ifdef __x86_64__
+ [r8 ]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R8 ])),
+ [r9 ]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R9 ])),
+ [r10]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R10])),
+ [r11]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R11])),
+ [r12]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R12])),
+ [r13]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R13])),
+ [r14]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R14])),
+ [r15]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_R15])),
+#endif
+ [cr2]"i"(offsetof(struct kvm_vcpu, cr2))
+ : "cc", "memory" );
+
+ ++kvm_stat.exits;
+
+ save_msrs(vcpu->guest_msrs, NUM_AUTO_MSRS);
+ load_msrs(vcpu->host_msrs);
+
+ fx_save(vcpu->guest_fx_image);
+ fx_restore(vcpu->host_fx_image);
+
+#ifndef __x86_64__
+ asm ( "mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS) );
+#endif
+
+ kvm_run->exit_type = 0;
+ if (fail) {
+ kvm_run->exit_type = KVM_EXIT_TYPE_FAIL_ENTRY;
+ kvm_run->exit_reason = vmcs_read32(VM_INSTRUCTION_ERROR);
+ } else {
+ if (fs_gs_ldt_reload_needed) {
+ load_ldt(ldt_sel);
+ load_fs(fs_sel);
+ /*
+ * If we have to reload gs, we must take care to
+ * preserve our gs base.
+ */
+ local_irq_disable();
+ load_gs(gs_sel);
+#ifdef __x86_64__
+ wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE));
+#endif
+ local_irq_enable();
+
+ reload_tss();
+ }
+ vcpu->launched = 1;
+ kvm_run->exit_type = KVM_EXIT_TYPE_VM_EXIT;
+ if (kvm_handle_exit(kvm_run, vcpu)) {
+ /* Give scheduler a chance to reschedule. */
+ vcpu_put(vcpu);
+ if (signal_pending(current)) {
+ ++kvm_stat.signal_exits;
+ return -EINTR;
+ }
+ cond_resched();
+ /* Cannot fail - no vcpu unplug yet. */
+ vcpu_load(kvm, vcpu_slot(vcpu));
+ goto again;
+ }
+ }
+
+ vcpu_put(vcpu);
+ return 0;
+}
+
static int kvm_dev_ioctl_get_regs(struct kvm *kvm, struct kvm_regs *regs)
{
struct kvm_vcpu *vcpu;
@@ -1879,6 +2153,80 @@ static int kvm_dev_ioctl_translate(struc
return 0;
}

+static int kvm_dev_ioctl_interrupt(struct kvm *kvm, struct kvm_interrupt *irq)
+{
+ struct kvm_vcpu *vcpu;
+
+ if (irq->vcpu < 0 || irq->vcpu >= KVM_MAX_VCPUS)
+ return -EINVAL;
+ if (irq->irq < 0 || irq->irq >= 256)
+ return -EINVAL;
+ vcpu = vcpu_load(kvm, irq->vcpu);
+ if (!vcpu)
+ return -ENOENT;
+
+ set_bit(irq->irq, vcpu->irq_pending);
+ set_bit(irq->irq / BITS_PER_LONG, &vcpu->irq_summary);
+
+ vcpu_put(vcpu);
+
+ return 0;
+}
+
+static int kvm_dev_ioctl_debug_guest(struct kvm *kvm,
+ struct kvm_debug_guest *dbg)
+{
+ struct kvm_vcpu *vcpu;
+ unsigned long dr7 = 0x400;
+ u32 exception_bitmap;
+ int old_singlestep;
+
+ if (dbg->vcpu < 0 || dbg->vcpu >= KVM_MAX_VCPUS)
+ return -EINVAL;
+ vcpu = vcpu_load(kvm, dbg->vcpu);
+ if (!vcpu)
+ return -ENOENT;
+
+ exception_bitmap = vmcs_read32(EXCEPTION_BITMAP);
+ old_singlestep = vcpu->guest_debug.singlestep;
+
+ vcpu->guest_debug.enabled = dbg->enabled;
+ if (vcpu->guest_debug.enabled) {
+ int i;
+
+ dr7 |= 0x200; /* exact */
+ for (i = 0; i < 4; ++i) {
+ if (!dbg->breakpoints[i].enabled)
+ continue;
+ vcpu->guest_debug.bp[i] = dbg->breakpoints[i].address;
+ dr7 |= 2 << (i*2); /* global enable */
+ dr7 |= 0 << (i*4+16); /* execution breakpoint */
+ }
+
+ exception_bitmap |= (1u << 1); /* Trap debug exceptions */
+
+ vcpu->guest_debug.singlestep = dbg->singlestep;
+ } else {
+ exception_bitmap &= ~(1u << 1); /* Ignore debug exceptions */
+ vcpu->guest_debug.singlestep = 0;
+ }
+
+ if (old_singlestep && !vcpu->guest_debug.singlestep) {
+ unsigned long flags;
+
+ flags = vmcs_readl(GUEST_RFLAGS);
+ flags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
+ vmcs_writel(GUEST_RFLAGS, flags);
+ }
+
+ vmcs_write32(EXCEPTION_BITMAP, exception_bitmap);
+ vmcs_writel(GUEST_DR7, dr7);
+
+ vcpu_put(vcpu);
+
+ return 0;
+}
+
static long kvm_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -1892,6 +2240,21 @@ static long kvm_dev_ioctl(struct file *f
goto out;
break;
}
+ case KVM_RUN: {
+ struct kvm_run kvm_run;
+
+ r = -EFAULT;
+ if (copy_from_user(&kvm_run, (void *)arg, sizeof kvm_run))
+ goto out;
+ r = kvm_dev_ioctl_run(kvm, &kvm_run);
+ if (r < 0)
+ goto out;
+ r = -EFAULT;
+ if (copy_to_user((void *)arg, &kvm_run, sizeof kvm_run))
+ goto out;
+ r = 0;
+ break;
+ }
case KVM_GET_REGS: {
struct kvm_regs kvm_regs;

@@ -1961,6 +2324,30 @@ static long kvm_dev_ioctl(struct file *f
r = 0;
break;
}
+ case KVM_INTERRUPT: {
+ struct kvm_interrupt irq;
+
+ r = -EFAULT;
+ if (copy_from_user(&irq, (void *)arg, sizeof irq))
+ goto out;
+ r = kvm_dev_ioctl_interrupt(kvm, &irq);
+ if (r)
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_DEBUG_GUEST: {
+ struct kvm_debug_guest dbg;
+
+ r = -EFAULT;
+ if (copy_from_user(&dbg, (void *)arg, sizeof dbg))
+ goto out;
+ r = kvm_dev_ioctl_debug_guest(kvm, &dbg);
+ if (r)
+ goto out;
+ r = 0;
+ break;
+ }
case KVM_SET_MEMORY_REGION: {
struct kvm_memory_region kvm_mem;

2006-10-23 13:30:57

by Avi Kivity

Subject: [PATCH 5/13] KVM: virtualization infrastructure

- ioctl()
- mmap()
- vcpu context management (vcpu_load/vcpu_put)
- some control register logic
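
To give a feel for the control register logic, here is a minimal standalone
sketch of the first three sanity checks a CR0 write goes through before it is
accepted: no reserved bits, NW only together with CD, and PG only together
with PE. The helper name cr0_is_valid is hypothetical. The bit positions are
the architectural ones, and the reserved mask is the same constant the patch
uses. The long-mode and PAE checks are deliberately left out.

```c
#include <stdint.h>

#define CR0_PE (1ULL << 0)   /* protection enable */
#define CR0_NW (1ULL << 29)  /* not write-through */
#define CR0_CD (1ULL << 30)  /* cache disable */
#define CR0_PG (1ULL << 31)  /* paging */

/* Same reserved-bit mask as in the patch. */
#define CR0_RESERVED 0xffffffff1ffaffc0ULL

/*
 * Returns 1 if the value passes the basic checks, 0 if the write
 * would have to raise #GP in the guest instead.
 */
static int cr0_is_valid(uint64_t cr0)
{
	if (cr0 & CR0_RESERVED)
		return 0;               /* reserved bits set */
	if ((cr0 & CR0_NW) && !(cr0 & CR0_CD))
		return 0;               /* NW == 1 with CD == 0 */
	if ((cr0 & CR0_PG) && !(cr0 & CR0_PE))
		return 0;               /* paging without protected mode */
	return 1;
}
```

Keeping the checks in a pure predicate like this is only a presentation
choice for the example. In the patch the same conditions are tested inline
and a failure injects #GP directly.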

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/kvm_main.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/kvm_main.c
@@ -0,0 +1,1260 @@
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * This module enables machines with Intel VT-x extensions to run virtual
+ * machines without emulation or binary translation.
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ *
+ * Authors:
+ * Avi Kivity <[email protected]>
+ * Yaniv Kamay <[email protected]>
+ *
+ */
+
+#include "kvm.h"
+
+#include <linux/kvm.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <asm/processor.h>
+#include <linux/percpu.h>
+#include <linux/gfp.h>
+#include <asm/msr.h>
+#include <linux/mm.h>
+#include <linux/miscdevice.h>
+#include <linux/vmalloc.h>
+#include <asm/uaccess.h>
+#include <linux/reboot.h>
+#include <asm/io.h>
+#include <linux/debugfs.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+
+#include "vmx.h"
+
+MODULE_AUTHOR("Qumranet");
+MODULE_LICENSE("GPL");
+
+static struct dentry *debugfs_dir;
+static struct dentry *debugfs_pf_fixed;
+static struct dentry *debugfs_pf_guest;
+static struct dentry *debugfs_tlb_flush;
+static struct dentry *debugfs_invlpg;
+static struct dentry *debugfs_exits;
+static struct dentry *debugfs_io_exits;
+static struct dentry *debugfs_mmio_exits;
+static struct dentry *debugfs_signal_exits;
+static struct dentry *debugfs_irq_exits;
+
+struct kvm_stat kvm_stat;
+
+#define KVM_LOG_BUF_SIZE PAGE_SIZE
+
+static const u32 vmx_msr_index[] = {
+ MSR_EFER, MSR_K6_STAR,
+#ifdef __x86_64__
+ MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR
+#endif
+};
+#define NR_VMX_MSR (sizeof(vmx_msr_index) / sizeof(*vmx_msr_index))
+
+
+#ifdef __x86_64__
+/*
+ * avoid save/load MSR_SYSCALL_MASK and MSR_LSTAR by std vt
+ * mechanism (cpu bug AA24)
+ */
+#define NUM_AUTO_MSRS (NR_VMX_MSR-2)
+#else
+#define NUM_AUTO_MSRS NR_VMX_MSR
+#endif
+
+#define TSS_IOPB_BASE_OFFSET 0x66
+#define TSS_BASE_SIZE 0x68
+#define TSS_IOPB_SIZE (65536 / 8)
+#define TSS_REDIRECTION_SIZE (256 / 8)
+#define RMODE_TSS_SIZE (TSS_BASE_SIZE + TSS_REDIRECTION_SIZE + TSS_IOPB_SIZE + 1)
+
+static int rmode_tss_base(struct kvm* kvm);
+static void set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+static void __set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+static void lmsw(struct kvm_vcpu *vcpu, unsigned long msw);
+static void set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
+static void set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+static void __set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+#ifdef __x86_64__
+static void __set_efer(struct kvm_vcpu *vcpu, u64 efer);
+#endif
+
+static struct vmx_msr_entry *find_msr_entry(struct kvm_vcpu *vcpu, u32 msr)
+{
+ int i;
+
+ for (i = 0; i < NR_VMX_MSR; ++i)
+ if (vmx_msr_index[i] == msr)
+ return &vcpu->guest_msrs[i];
+ return 0;
+}
+
+struct descriptor_table {
+ u16 limit;
+ unsigned long base;
+} __attribute__((packed));
+
+static void get_gdt(struct descriptor_table *table)
+{
+ asm ( "sgdt %0" : "=m"(*table) );
+}
+
+static void get_idt(struct descriptor_table *table)
+{
+ asm ( "sidt %0" : "=m"(*table) );
+}
+
+static u16 read_fs(void)
+{
+ u16 seg;
+ asm ( "mov %%fs, %0" : "=g"(seg) );
+ return seg;
+}
+
+static u16 read_gs(void)
+{
+ u16 seg;
+ asm ( "mov %%gs, %0" : "=g"(seg) );
+ return seg;
+}
+
+static u16 read_ldt(void)
+{
+ u16 ldt;
+ asm ( "sldt %0" : "=g"(ldt) );
+ return ldt;
+}
+
+static void load_fs(u16 sel)
+{
+ asm ( "mov %0, %%fs\n" : : "g"(sel) );
+}
+
+static void load_gs(u16 sel)
+{
+ asm ( "mov %0, %%gs\n" : : "g"(sel) );
+}
+
+#ifndef load_ldt
+static void load_ldt(u16 sel)
+{
+ asm ( "lldt %0" : : "g"(sel) );
+}
+#endif
+
+static void fx_save(void *image)
+{
+ asm ( "fxsave (%0)":: "r" (image));
+}
+
+static void fx_restore(void *image)
+{
+ asm ( "fxrstor (%0)":: "r" (image));
+}
+
+static void fpu_init(void)
+{
+ asm ( "finit" );
+}
+
+struct segment_descriptor {
+ u16 limit_low;
+ u16 base_low;
+ u8 base_mid;
+ u8 type : 4;
+ u8 system : 1;
+ u8 dpl : 2;
+ u8 present : 1;
+ u8 limit_high : 4;
+ u8 avl : 1;
+ u8 long_mode : 1;
+ u8 default_op : 1;
+ u8 granularity : 1;
+ u8 base_high;
+} __attribute__((packed));
+
+#ifdef __x86_64__
+/* LDT or TSS descriptor in the GDT. 16 bytes. */
+struct segment_descriptor_64 {
+ struct segment_descriptor s;
+ u32 base_higher;
+ u32 pad_zero;
+} __attribute__((packed));
+
+#endif
+
+static unsigned long segment_base(u16 selector)
+{
+ struct descriptor_table gdt;
+ struct segment_descriptor *d;
+ unsigned long table_base;
+ typedef unsigned long ul;
+ unsigned long v;
+
+ asm ( "sgdt %0" : "=m"(gdt) );
+ table_base = gdt.base;
+
+ if (selector & 4) { /* from ldt */
+ u16 ldt_selector;
+
+ asm ( "sldt %0" : "=g"(ldt_selector) );
+ table_base = segment_base(ldt_selector);
+ }
+ d = (struct segment_descriptor *)(table_base + (selector & ~7));
+ v = d->base_low | ((ul)d->base_mid << 16) | ((ul)d->base_high << 24);
+#ifdef __x86_64__
+ if (d->system == 0
+ && (d->type == 2 || d->type == 9 || d->type == 11))
+ v |= ((ul)((struct segment_descriptor_64 *)d)->base_higher) << 32;
+#endif
+ return v;
+}
+
+static unsigned long read_tr_base(void)
+{
+ u16 tr;
+ asm ( "str %0" : "=g"(tr) );
+ return segment_base(tr);
+}
+
+static void reload_tss(void)
+{
+#ifndef __x86_64__
+
+ /*
+ * VT restores TR but not its size. Useless.
+ */
+ struct descriptor_table gdt;
+ struct segment_descriptor *descs;
+
+ get_gdt(&gdt);
+ descs = (void *)gdt.base;
+ descs[GDT_ENTRY_TSS].type = 9; /* available TSS */
+ load_TR_desc();
+#endif
+}
+
+DEFINE_PER_CPU(struct vmcs *, vmxarea);
+DEFINE_PER_CPU(struct vmcs *, current_vmcs);
+
+static struct vmcs_descriptor {
+ int size;
+ int order;
+ u32 revision_id;
+} vmcs_descriptor;
+
+#define MSR_IA32_FEATURE_CONTROL 0x03a
+#define MSR_IA32_VMX_BASIC_MSR 0x480
+#define MSR_IA32_VMX_PINBASED_CTLS_MSR 0x481
+#define MSR_IA32_VMX_PROCBASED_CTLS_MSR 0x482
+#define MSR_IA32_VMX_EXIT_CTLS_MSR 0x483
+#define MSR_IA32_VMX_ENTRY_CTLS_MSR 0x484
+
+#ifdef __x86_64__
+static unsigned long read_msr(unsigned long msr)
+{
+ u64 value;
+
+ rdmsrl(msr, value);
+ return value;
+}
+#endif
+
+static inline struct page *_gfn_to_page(struct kvm *kvm, gfn_t gfn)
+{
+ struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
+ return (slot) ? slot->phys_mem[gfn - slot->base_gfn] : 0;
+}
+
+
+
+int kvm_read_guest(struct kvm_vcpu *vcpu,
+ gva_t addr,
+ unsigned long size,
+ void *dest)
+{
+ unsigned char *host_buf = dest;
+ unsigned long req_size = size;
+
+ while (size) {
+ hpa_t paddr;
+ unsigned now;
+ unsigned offset;
+ hva_t guest_buf;
+
+ paddr = gva_to_hpa(vcpu, addr);
+
+ if (is_error_hpa(paddr))
+ break;
+
+ guest_buf = (hva_t)kmap_atomic(
+ pfn_to_page(paddr >> PAGE_SHIFT),
+ KM_USER0);
+ offset = addr & ~PAGE_MASK;
+ guest_buf |= offset;
+ now = min(size, PAGE_SIZE - offset);
+ memcpy(host_buf, (void*)guest_buf, now);
+ host_buf += now;
+ addr += now;
+ size -= now;
+ kunmap_atomic((void *)(guest_buf & PAGE_MASK), KM_USER0);
+ }
+ return req_size - size;
+}
+
+int kvm_write_guest(struct kvm_vcpu *vcpu,
+ gva_t addr,
+ unsigned long size,
+ void *data)
+{
+ unsigned char *host_buf = data;
+ unsigned long req_size = size;
+
+ while (size) {
+ hpa_t paddr;
+ unsigned now;
+ unsigned offset;
+ hva_t guest_buf;
+
+ paddr = gva_to_hpa(vcpu, addr);
+
+ if (is_error_hpa(paddr))
+ break;
+
+ guest_buf = (hva_t)kmap_atomic(
+ pfn_to_page(paddr >> PAGE_SHIFT), KM_USER0);
+ offset = addr & ~PAGE_MASK;
+ guest_buf |= offset;
+ now = min(size, PAGE_SIZE - offset);
+ memcpy((void*)guest_buf, host_buf, now);
+ host_buf += now;
+ addr += now;
+ size -= now;
+ kunmap_atomic((void *)(guest_buf & PAGE_MASK), KM_USER0);
+ }
+ return req_size - size;
+}
+
+static __init void setup_vmcs_descriptor(void)
+{
+ u32 vmx_msr_low, vmx_msr_high;
+
+ rdmsr(MSR_IA32_VMX_BASIC_MSR, vmx_msr_low, vmx_msr_high);
+ vmcs_descriptor.size = vmx_msr_high & 0x1fff;
+ vmcs_descriptor.order = get_order(vmcs_descriptor.size);
+ vmcs_descriptor.revision_id = vmx_msr_low;
+}
+
+static void vmcs_clear(struct vmcs *vmcs)
+{
+ u64 phys_addr = __pa(vmcs);
+ u8 error;
+
+ asm volatile ( "vmclear %1; setna %0"
+ : "=m"(error) : "m"(phys_addr) : "cc", "memory" );
+ if (error)
+ printk(KERN_ERR "kvm: vmclear fail: %p/%llx\n",
+ vmcs, phys_addr);
+}
+
+static void __vcpu_clear(void *arg)
+{
+ struct kvm_vcpu *vcpu = arg;
+ int cpu = smp_processor_id();
+
+ if (vcpu->cpu == cpu)
+ vmcs_clear(vcpu->vmcs);
+ if (per_cpu(current_vmcs, cpu) == vcpu->vmcs)
+ per_cpu(current_vmcs, cpu) = 0;
+}
+
+static int vcpu_slot(struct kvm_vcpu *vcpu)
+{
+ return vcpu - vcpu->kvm->vcpus;
+}
+
+/*
+ * Switches to specified vcpu, until a matching vcpu_put(), but assumes
+ * vcpu mutex is already taken.
+ */
+static struct kvm_vcpu *__vcpu_load(struct kvm_vcpu *vcpu)
+{
+ u64 phys_addr = __pa(vcpu->vmcs);
+ int cpu;
+
+ cpu = get_cpu();
+
+ if (vcpu->cpu != cpu) {
+ smp_call_function(__vcpu_clear, vcpu, 0, 1);
+ vcpu->launched = 0;
+ }
+
+ if (per_cpu(current_vmcs, cpu) != vcpu->vmcs) {
+ u8 error;
+
+ per_cpu(current_vmcs, cpu) = vcpu->vmcs;
+ asm volatile ( "vmptrld %1; setna %0"
+ : "=m"(error) : "m"(phys_addr) : "cc" );
+ if (error)
+ printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
+ vcpu->vmcs, phys_addr);
+ }
+
+ if (vcpu->cpu != cpu) {
+ struct descriptor_table dt;
+ unsigned long sysenter_esp;
+
+ vcpu->cpu = cpu;
+ /*
+ * Linux uses per-cpu TSS and GDT, so set these when switching
+ * processors.
+ */
+ vmcs_writel(HOST_TR_BASE, read_tr_base()); /* 22.2.4 */
+ get_gdt(&dt);
+ vmcs_writel(HOST_GDTR_BASE, dt.base); /* 22.2.4 */
+
+ rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
+ vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
+ }
+ return vcpu;
+}
+
+/*
+ * Switches to specified vcpu, until a matching vcpu_put()
+ */
+static struct kvm_vcpu *vcpu_load(struct kvm *kvm, int vcpu_slot)
+{
+ struct kvm_vcpu *vcpu = &kvm->vcpus[vcpu_slot];
+
+ mutex_lock(&vcpu->mutex);
+ if (unlikely(!vcpu->vmcs)) {
+ mutex_unlock(&vcpu->mutex);
+ return 0;
+ }
+ return __vcpu_load(vcpu);
+}
+
+static void vcpu_put(struct kvm_vcpu *vcpu)
+{
+ put_cpu();
+ mutex_unlock(&vcpu->mutex);
+}
+
+
+static struct vmcs *alloc_vmcs_cpu(int cpu)
+{
+ int node = cpu_to_node(cpu);
+ struct page *pages;
+ struct vmcs *vmcs;
+
+ pages = alloc_pages_node(node, GFP_KERNEL, vmcs_descriptor.order);
+ if (!pages)
+ return 0;
+ vmcs = page_address(pages);
+ memset(vmcs, 0, vmcs_descriptor.size);
+ vmcs->revision_id = vmcs_descriptor.revision_id; /* vmcs revision id */
+ return vmcs;
+}
+
+static struct vmcs *alloc_vmcs(void)
+{
+ return alloc_vmcs_cpu(smp_processor_id());
+}
+
+static void free_vmcs(struct vmcs *vmcs)
+{
+ free_pages((unsigned long)vmcs, vmcs_descriptor.order);
+}
+
+static __init int cpu_has_kvm_support(void)
+{
+ unsigned long ecx = cpuid_ecx(1);
+ return test_bit(5, &ecx); /* CPUID.1:ECX.VMX[bit 5] -> VT */
+}
+
+static __exit void free_kvm_area(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ free_vmcs(per_cpu(vmxarea, cpu));
+}
+
+static __init int alloc_kvm_area(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct vmcs *vmcs;
+
+ vmcs = alloc_vmcs_cpu(cpu);
+ if (!vmcs) {
+ free_kvm_area();
+ return -ENOMEM;
+ }
+
+ per_cpu(vmxarea, cpu) = vmcs;
+ }
+ return 0;
+}
+
+static __init int vmx_disabled_by_bios(void)
+{
+ u64 msr;
+
+ rdmsrl(MSR_IA32_FEATURE_CONTROL, msr);
+ return (msr & 5) == 1; /* locked but not enabled */
+}
+
+#define CR4_VMXE 0x2000
+
+static __init void kvm_enable(void *garbage)
+{
+ int cpu = raw_smp_processor_id();
+ u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
+ u64 old;
+
+ rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
+ if ((old & 5) == 0)
+ /* enable and lock */
+ wrmsrl(MSR_IA32_FEATURE_CONTROL, old | 5);
+ write_cr4(read_cr4() | CR4_VMXE); /* FIXME: not cpu hotplug safe */
+ asm volatile ( "vmxon %0" : : "m"(phys_addr) : "memory", "cc" );
+}
+
+static void kvm_disable(void *garbage)
+{
+ asm volatile ( "vmxoff" : : : "cc" );
+}
+
+static int kvm_dev_open(struct inode *inode, struct file *filp)
+{
+ struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int i;
+
+ if (!kvm)
+ return -ENOMEM;
+
+ spin_lock_init(&kvm->lock);
+ INIT_LIST_HEAD(&kvm->active_mmu_pages);
+ for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+ struct kvm_vcpu *vcpu = &kvm->vcpus[i];
+
+ mutex_init(&vcpu->mutex);
+ vcpu->mmu.root_hpa = INVALID_PAGE;
+ INIT_LIST_HEAD(&vcpu->free_pages);
+ }
+ filp->private_data = kvm;
+ return 0;
+}
+
+/*
+ * Free any memory in @free but not in @dont.
+ */
+static void kvm_free_physmem_slot(struct kvm_memory_slot *free,
+ struct kvm_memory_slot *dont)
+{
+ int i;
+
+ if (!dont || free->phys_mem != dont->phys_mem)
+ if (free->phys_mem) {
+ for (i = 0; i < free->npages; ++i)
+ __free_page(free->phys_mem[i]);
+ vfree(free->phys_mem);
+ }
+
+ if (!dont || free->dirty_bitmap != dont->dirty_bitmap)
+ vfree(free->dirty_bitmap);
+
+ free->phys_mem = 0;
+ free->npages = 0;
+ free->dirty_bitmap = 0;
+}
+
+static void kvm_free_physmem(struct kvm *kvm)
+{
+ int i;
+
+ for (i = 0; i < kvm->nmemslots; ++i)
+ kvm_free_physmem_slot(&kvm->memslots[i], 0);
+}
+
+static void kvm_free_vmcs(struct kvm_vcpu *vcpu)
+{
+ if (vcpu->vmcs) {
+ on_each_cpu(__vcpu_clear, vcpu, 0, 1);
+ free_vmcs(vcpu->vmcs);
+ vcpu->vmcs = 0;
+ }
+}
+
+static void kvm_free_vcpu(struct kvm_vcpu *vcpu)
+{
+ kvm_free_vmcs(vcpu);
+ kvm_mmu_destroy(vcpu);
+}
+
+static void kvm_free_vcpus(struct kvm *kvm)
+{
+ unsigned int i;
+
+ for (i = 0; i < KVM_MAX_VCPUS; ++i)
+ kvm_free_vcpu(&kvm->vcpus[i]);
+}
+
+static int kvm_dev_release(struct inode *inode, struct file *filp)
+{
+ struct kvm *kvm = filp->private_data;
+
+ kvm_free_vcpus(kvm);
+ kvm_free_physmem(kvm);
+ kfree(kvm);
+ return 0;
+}
+
+unsigned long vmcs_readl(unsigned long field)
+{
+ unsigned long value;
+
+ asm volatile ( "vmread %1, %0" : "=g"(value) : "r"(field) : "cc" );
+ return value;
+}
+
+void vmcs_writel(unsigned long field, unsigned long value)
+{
+ u8 error;
+
+ asm volatile ( "vmwrite %1, %2; setna %0"
+ : "=g"(error) : "r"(value), "r"(field) : "cc" );
+ if (error)
+ printk(KERN_ERR "vmwrite error: reg %lx value %lx (err %d)\n",
+ field, value, vmcs_read32(VM_INSTRUCTION_ERROR));
+}
+
+static void vmcs_write16(unsigned long field, u16 value)
+{
+ vmcs_writel(field, value);
+}
+
+static void vmcs_write64(unsigned long field, u64 value)
+{
+#ifdef __x86_64__
+ vmcs_writel(field, value);
+#else
+ vmcs_writel(field, value);
+ asm volatile ( "" );
+ vmcs_writel(field+1, value >> 32);
+#endif
+}
+
+/*
+ * Sync the rsp and rip registers into the vcpu structure. This allows
+ * registers to be accessed by indexing vcpu->regs.
+ */
+static void vcpu_load_rsp_rip(struct kvm_vcpu *vcpu)
+{
+ vcpu->regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
+ vcpu->rip = vmcs_readl(GUEST_RIP);
+}
+
+/*
+ * Syncs rsp and rip back into the vmcs. Should be called after possible
+ * modification.
+ */
+static void vcpu_put_rsp_rip(struct kvm_vcpu *vcpu)
+{
+ vmcs_writel(GUEST_RSP, vcpu->regs[VCPU_REGS_RSP]);
+ vmcs_writel(GUEST_RIP, vcpu->rip);
+}
+
+struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
+{
+ int i;
+
+ for (i = 0; i < kvm->nmemslots; ++i) {
+ struct kvm_memory_slot *memslot = &kvm->memslots[i];
+
+ if (gfn >= memslot->base_gfn
+ && gfn < memslot->base_gfn + memslot->npages)
+ return memslot;
+ }
+ return 0;
+}
+
+void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+{
+ int i;
+ struct kvm_memory_slot *memslot = 0;
+ unsigned long rel_gfn;
+
+ for (i = 0; i < kvm->nmemslots; ++i) {
+ memslot = &kvm->memslots[i];
+
+ if (gfn >= memslot->base_gfn
+ && gfn < memslot->base_gfn + memslot->npages) {
+
+ if (!memslot || !memslot->dirty_bitmap)
+ return;
+
+ rel_gfn = gfn - memslot->base_gfn;
+
+ /* avoid RMW */
+ if (!test_bit(rel_gfn, memslot->dirty_bitmap))
+ set_bit(rel_gfn, memslot->dirty_bitmap);
+ return;
+ }
+ }
+}
+
+static int pdptrs_have_reserved_bits_set(struct kvm_vcpu *vcpu,
+ unsigned long cr3)
+{
+ gfn_t pdpt_gfn = cr3 >> PAGE_SHIFT;
+ unsigned offset = (cr3 & (PAGE_SIZE-1)) >> 5;
+ int i;
+ u64 pdpte;
+ u64 *pdpt;
+ struct kvm_memory_slot *memslot;
+
+ spin_lock(&vcpu->kvm->lock);
+ memslot = gfn_to_memslot(vcpu->kvm, pdpt_gfn);
+ /* FIXME: !memslot - emulate? 0xff? */
+ pdpt = kmap_atomic(gfn_to_page(memslot, pdpt_gfn), KM_USER0);
+
+ for (i = 0; i < 4; ++i) {
+ pdpte = pdpt[offset + i];
+ if ((pdpte & 1) && (pdpte & 0xfffffff0000001e6ull))
+ break;
+ }
+
+ kunmap_atomic(pdpt, KM_USER0);
+ spin_unlock(&vcpu->kvm->lock);
+
+ return i != 4;
+}
+
+#define CR0_RESEVED_BITS 0xffffffff1ffaffc0ULL
+
+static void set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ if (cr0 & CR0_RESEVED_BITS) {
+ printk("set_cr0: 0x%lx #GP, reserved bits (0x%lx)\n", cr0, guest_cr0());
+ inject_gp(vcpu);
+ return;
+ }
+
+ if ((cr0 & CR0_NW_MASK) && !(cr0 & CR0_CD_MASK)) {
+ printk("set_cr0: #GP, CD == 0 && NW == 1\n");
+ inject_gp(vcpu);
+ return;
+ }
+
+ if ((cr0 & CR0_PG_MASK) && !(cr0 & CR0_PE_MASK)) {
+ printk("set_cr0: #GP, set PG flag and a clear PE flag\n");
+ inject_gp(vcpu);
+ return;
+ }
+
+ if (is_paging()) {
+#ifdef __x86_64__
+ if (!(cr0 & CR0_PG_MASK)) {
+ vcpu->shadow_efer &= ~EFER_LMA;
+ vmcs_write32(VM_ENTRY_CONTROLS,
+ vmcs_read32(VM_ENTRY_CONTROLS) &
+ ~VM_ENTRY_CONTROLS_IA32E_MASK);
+ }
+#endif
+ } else if ((cr0 & CR0_PG_MASK)) {
+#ifdef __x86_64__
+ if ((vcpu->shadow_efer & EFER_LME)) {
+ u32 guest_cs_ar;
+ u32 guest_tr_ar;
+ if (!is_pae()) {
+ printk("set_cr0: #GP, start paging in "
+ "long mode while PAE is disabled\n");
+ inject_gp(vcpu);
+ return;
+ }
+ guest_cs_ar = vmcs_read32(GUEST_CS_AR_BYTES);
+ if (guest_cs_ar & SEGMENT_AR_L_MASK) {
+ printk("set_cr0: #GP, start paging in "
+ "long mode while CS.L == 1\n");
+ inject_gp(vcpu);
+ return;
+
+ }
+ guest_tr_ar = vmcs_read32(GUEST_TR_AR_BYTES);
+ if ((guest_tr_ar & AR_TYPE_MASK) != AR_TYPE_BUSY_64_TSS) {
+ printk("%s: tss fixup for long mode\n",
+ __FUNCTION__);
+ vmcs_write32(GUEST_TR_AR_BYTES,
+ (guest_tr_ar & ~AR_TYPE_MASK) |
+ AR_TYPE_BUSY_64_TSS);
+ }
+ vcpu->shadow_efer |= EFER_LMA;
+ find_msr_entry(vcpu, MSR_EFER)->data |=
+ EFER_LMA | EFER_LME;
+ vmcs_write32(VM_ENTRY_CONTROLS,
+ vmcs_read32(VM_ENTRY_CONTROLS) |
+ VM_ENTRY_CONTROLS_IA32E_MASK);
+
+ } else
+#endif
+ if (is_pae() &&
+ pdptrs_have_reserved_bits_set(vcpu, vcpu->cr3)) {
+ printk("set_cr0: #GP, pdptrs reserved bits\n");
+ inject_gp(vcpu);
+ return;
+ }
+
+ }
+
+ __set_cr0(vcpu, cr0);
+ kvm_mmu_reset_context(vcpu);
+ return;
+}
+
+static void lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
+{
+ unsigned long cr0 = guest_cr0();
+
+ if ((msw & CR0_PE_MASK) && !(cr0 & CR0_PE_MASK)) {
+ enter_pmode(vcpu);
+ vmcs_writel(CR0_READ_SHADOW, cr0 | CR0_PE_MASK);
+
+ } else
+ printk("lmsw: unexpected\n");
+
+ #define LMSW_GUEST_MASK 0x0eULL
+
+ vmcs_writel(GUEST_CR0, (vmcs_readl(GUEST_CR0) & ~LMSW_GUEST_MASK)
+ | (msw & LMSW_GUEST_MASK));
+}
+
+#define CR4_RESEVED_BITS (~((1ULL << 11) - 1))
+
+static void set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ if (cr4 & CR4_RESEVED_BITS) {
+ printk("set_cr4: #GP, reserved bits\n");
+ inject_gp(vcpu);
+ return;
+ }
+
+ if (is_long_mode()) {
+ if (!(cr4 & CR4_PAE_MASK)) {
+ printk("set_cr4: #GP, clearing PAE while in long mode\n");
+ inject_gp(vcpu);
+ return;
+ }
+ } else if (is_paging() && !is_pae() && (cr4 & CR4_PAE_MASK)
+ && pdptrs_have_reserved_bits_set(vcpu, vcpu->cr3)) {
+ printk("set_cr4: #GP, pdptrs reserved bits\n");
+ inject_gp(vcpu);
+ return;
+ }
+ if (cr4 & CR4_VMXE_MASK) {
+ printk("set_cr4: #GP, setting VMXE\n");
+ inject_gp(vcpu);
+ return;
+ }
+ __set_cr4(vcpu, cr4);
+ spin_lock(&vcpu->kvm->lock);
+ kvm_mmu_reset_context(vcpu);
+ spin_unlock(&vcpu->kvm->lock);
+}
+
+static void set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
+{
+ if (is_long_mode()) {
+ if (cr3 & CR3_L_MODE_RESEVED_BITS) {
+ printk("set_cr3: #GP, reserved bits\n");
+ inject_gp(vcpu);
+ return;
+ }
+ } else {
+ if (cr3 & CR3_RESEVED_BITS) {
+ printk("set_cr3: #GP, reserved bits\n");
+ inject_gp(vcpu);
+ return;
+ }
+ if (is_paging() && is_pae() &&
+ pdptrs_have_reserved_bits_set(vcpu, cr3)) {
+ printk("set_cr3: #GP, pdptrs reserved bits\n");
+ inject_gp(vcpu);
+ return;
+ }
+ }
+
+ vcpu->cr3 = cr3;
+ spin_lock(&vcpu->kvm->lock);
+ vcpu->mmu.new_cr3(vcpu);
+ spin_unlock(&vcpu->kvm->lock);
+}
+
+#define CR8_RESEVED_BITS (~0x0fULL)
+
+static void set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
+{
+ if (cr8 & CR8_RESEVED_BITS) {
+ printk("set_cr8: #GP, reserved bits 0x%lx\n", cr8);
+ inject_gp(vcpu);
+ return;
+ }
+ vcpu->cr8 = cr8;
+}
+
+
+static void __set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ if (vcpu->rmode.active && (cr0 & CR0_PE_MASK))
+ enter_pmode(vcpu);
+
+ if (!vcpu->rmode.active && !(cr0 & CR0_PE_MASK))
+ enter_rmode(vcpu);
+
+ vmcs_writel(CR0_READ_SHADOW, cr0);
+ vmcs_writel(GUEST_CR0, cr0 | KVM_VM_CR0_ALWAYS_ON);
+}
+
+static void __set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ vmcs_writel(CR4_READ_SHADOW, cr4);
+ vmcs_writel(GUEST_CR4, cr4 | (vcpu->rmode.active ?
+ KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON));
+}
+
+#ifdef __x86_64__
+#define EFER_RESERVED_BITS 0xfffffffffffff2fe
+
+static void set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+ struct vmx_msr_entry *msr;
+
+ if (efer & EFER_RESERVED_BITS) {
+ printk("set_efer: 0x%llx #GP, reserved bits\n", efer);
+ inject_gp(vcpu);
+ return;
+ }
+
+ if (is_paging() && (vcpu->shadow_efer & EFER_LME) != (efer & EFER_LME)) {
+ printk("set_efer: #GP, change LME while paging\n");
+ inject_gp(vcpu);
+ return;
+ }
+
+ efer &= ~EFER_LMA;
+ efer |= vcpu->shadow_efer & EFER_LMA;
+
+ vcpu->shadow_efer = efer;
+
+ msr = find_msr_entry(vcpu, MSR_EFER);
+
+ if (!(efer & EFER_LMA))
+ efer &= ~EFER_LME;
+ msr->data = efer;
+ skip_emulated_instruction(vcpu);
+}
+
+static void __set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+ struct vmx_msr_entry *msr = find_msr_entry(vcpu, MSR_EFER);
+
+ vcpu->shadow_efer = efer;
+ if (efer & EFER_LMA) {
+ vmcs_write32(VM_ENTRY_CONTROLS,
+ vmcs_read32(VM_ENTRY_CONTROLS) |
+ VM_ENTRY_CONTROLS_IA32E_MASK);
+ msr->data = efer;
+
+ } else {
+ vmcs_write32(VM_ENTRY_CONTROLS,
+ vmcs_read32(VM_ENTRY_CONTROLS) &
+ ~VM_ENTRY_CONTROLS_IA32E_MASK);
+
+ msr->data = efer & ~EFER_LME;
+ }
+}
+#endif
+
+static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq)
+{
+ u16 ent[2];
+ u16 cs;
+ u16 ip;
+ unsigned long flags;
+ unsigned long ss_base = vmcs_readl(GUEST_SS_BASE);
+ u16 sp = vmcs_readl(GUEST_RSP);
+ u32 ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+
+ if (sp > ss_limit || sp < 6) {
+ vcpu_printf(vcpu, "%s: #SS, rsp 0x%lx ss 0x%lx limit 0x%x\n",
+ __FUNCTION__,
+ vmcs_readl(GUEST_RSP),
+ vmcs_readl(GUEST_SS_BASE),
+ vmcs_read32(GUEST_SS_LIMIT));
+ return;
+ }
+
+ if (kvm_read_guest(vcpu, irq * sizeof(ent), sizeof(ent), &ent) !=
+ sizeof(ent)) {
+ vcpu_printf(vcpu, "%s: read guest err\n", __FUNCTION__);
+ return;
+ }
+
+ flags = vmcs_readl(GUEST_RFLAGS);
+ cs = vmcs_readl(GUEST_CS_BASE) >> 4;
+ ip = vmcs_readl(GUEST_RIP);
+
+
+ if (kvm_write_guest(vcpu, ss_base + sp - 2, 2, &flags) != 2 ||
+ kvm_write_guest(vcpu, ss_base + sp - 4, 2, &cs) != 2 ||
+ kvm_write_guest(vcpu, ss_base + sp - 6, 2, &ip) != 2) {
+ vcpu_printf(vcpu, "%s: write guest err\n", __FUNCTION__);
+ return;
+ }
+
+ vmcs_writel(GUEST_RFLAGS, flags &
+ ~( X86_EFLAGS_IF | X86_EFLAGS_AC | X86_EFLAGS_TF));
+ vmcs_write16(GUEST_CS_SELECTOR, ent[1]);
+ vmcs_writel(GUEST_CS_BASE, ent[1] << 4);
+ vmcs_writel(GUEST_RIP, ent[0]);
+ vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp - 6));
+}
+
+static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
+{
+ int word_index = __ffs(vcpu->irq_summary);
+ int bit_index = __ffs(vcpu->irq_pending[word_index]);
+ int irq = word_index * BITS_PER_LONG + bit_index;
+
+ clear_bit(bit_index, &vcpu->irq_pending[word_index]);
+ if (!vcpu->irq_pending[word_index])
+ clear_bit(word_index, &vcpu->irq_summary);
+
+ if (vcpu->rmode.active) {
+ inject_rmode_irq(vcpu, irq);
+ return;
+ }
+ vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+ irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+}
+
+static void kvm_try_inject_irq(struct kvm_vcpu *vcpu)
+{
+ if ((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF)
+ && (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0)
+ /*
+ * Interrupts enabled, and not blocked by sti or mov ss. Good.
+ */
+ kvm_do_inject_irq(vcpu);
+ else
+ /*
+ * Interrupts blocked. Wait for unblock.
+ */
+ vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
+ vmcs_read32(CPU_BASED_VM_EXEC_CONTROL)
+ | CPU_BASED_VIRTUAL_INTR_PENDING);
+}
+
+static void load_msrs(struct vmx_msr_entry *e)
+{
+ int i;
+
+ for (i = NUM_AUTO_MSRS; i < NR_VMX_MSR; ++i)
+ wrmsrl(e[i].index, e[i].data);
+}
+
+static void save_msrs(struct vmx_msr_entry *e, int msr_index)
+{
+ for (; msr_index < NR_VMX_MSR; ++msr_index)
+ rdmsrl(e[msr_index].index, e[msr_index].data);
+}
+
+static long kvm_dev_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ struct kvm *kvm = filp->private_data;
+ int r = -EINVAL;
+
+ switch (ioctl) {
+ default:
+ ;
+ }
+out:
+ return r;
+}
+
+static struct page *kvm_dev_nopage(struct vm_area_struct *vma,
+ unsigned long address,
+ int *type)
+{
+ struct kvm *kvm = vma->vm_file->private_data;
+ unsigned long pgoff;
+ struct kvm_memory_slot *slot;
+ struct page *page;
+
+ *type = VM_FAULT_MINOR;
+ pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ slot = gfn_to_memslot(kvm, pgoff);
+ if (!slot)
+ return NOPAGE_SIGBUS;
+ page = gfn_to_page(slot, pgoff);
+ if (!page)
+ return NOPAGE_SIGBUS;
+ get_page(page);
+ return page;
+}
+
+static struct vm_operations_struct kvm_dev_vm_ops = {
+ .nopage = kvm_dev_nopage,
+};
+
+static int kvm_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ vma->vm_ops = &kvm_dev_vm_ops;
+ return 0;
+}
+
+static struct file_operations kvm_chardev_ops = {
+ .owner = THIS_MODULE,
+ .open = kvm_dev_open,
+ .release = kvm_dev_release,
+ .unlocked_ioctl = kvm_dev_ioctl,
+ .compat_ioctl = kvm_dev_ioctl,
+ .mmap = kvm_dev_mmap,
+};
+
+static struct miscdevice kvm_dev = {
+ MISC_DYNAMIC_MINOR,
+ "kvm",
+ &kvm_chardev_ops,
+};
+
+static int kvm_reboot(struct notifier_block *notifier, unsigned long val,
+ void *v)
+{
+ if (val == SYS_RESTART) {
+ /*
+ * Some (well, at least mine) BIOSes hang on reboot if
+ * in vmx root mode.
+ */
+ printk(KERN_INFO "kvm: exiting vmx mode\n");
+ on_each_cpu(kvm_disable, 0, 0, 1);
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block kvm_reboot_notifier = {
+ .notifier_call = kvm_reboot,
+ .priority = 0,
+};
+
+static __init void kvm_init_debug(void)
+{
+ debugfs_dir = debugfs_create_dir("kvm", 0);
+ debugfs_pf_fixed = debugfs_create_u32("pf_fixed", 0444, debugfs_dir,
+ &kvm_stat.pf_fixed);
+ debugfs_pf_guest = debugfs_create_u32("pf_guest", 0444, debugfs_dir,
+ &kvm_stat.pf_guest);
+ debugfs_tlb_flush = debugfs_create_u32("tlb_flush", 0444, debugfs_dir,
+ &kvm_stat.tlb_flush);
+ debugfs_invlpg = debugfs_create_u32("invlpg", 0444, debugfs_dir,
+ &kvm_stat.invlpg);
+ debugfs_exits = debugfs_create_u32("exits", 0444, debugfs_dir,
+ &kvm_stat.exits);
+ debugfs_io_exits = debugfs_create_u32("io_exits", 0444, debugfs_dir,
+ &kvm_stat.io_exits);
+ debugfs_mmio_exits = debugfs_create_u32("mmio_exits", 0444,
+ debugfs_dir,
+ &kvm_stat.mmio_exits);
+ debugfs_signal_exits = debugfs_create_u32("signal_exits", 0444,
+ debugfs_dir,
+ &kvm_stat.signal_exits);
+ debugfs_irq_exits = debugfs_create_u32("irq_exits", 0444, debugfs_dir,
+ &kvm_stat.irq_exits);
+}
+
+static void kvm_exit_debug(void)
+{
+ debugfs_remove(debugfs_signal_exits);
+ debugfs_remove(debugfs_irq_exits);
+ debugfs_remove(debugfs_mmio_exits);
+ debugfs_remove(debugfs_io_exits);
+ debugfs_remove(debugfs_exits);
+ debugfs_remove(debugfs_pf_fixed);
+ debugfs_remove(debugfs_pf_guest);
+ debugfs_remove(debugfs_tlb_flush);
+ debugfs_remove(debugfs_invlpg);
+ debugfs_remove(debugfs_dir);
+}
+
+hpa_t bad_page_address;
+
+static __init int kvm_init(void)
+{
+ static struct page *bad_page;
+ int r = 0;
+
+ if (!cpu_has_kvm_support()) {
+ printk(KERN_ERR "kvm: no hardware support\n");
+ return -EOPNOTSUPP;
+ }
+ if (vmx_disabled_by_bios()) {
+ printk(KERN_ERR "kvm: disabled by bios\n");
+ return -EOPNOTSUPP;
+ }
+
+ kvm_init_debug();
+
+ setup_vmcs_descriptor();
+ r = alloc_kvm_area();
+ if (r)
+ goto out;
+ on_each_cpu(kvm_enable, 0, 0, 1);
+ register_reboot_notifier(&kvm_reboot_notifier);
+
+ r = misc_register(&kvm_dev);
+ if (r) {
+ printk(KERN_ERR "kvm: misc device register failed\n");
+ goto out_free;
+ }
+
+
+ if ((bad_page = alloc_page(GFP_KERNEL)) == NULL) {
+ r = -ENOMEM;
+ goto out_free;
+ }
+
+ bad_page_address = page_to_pfn(bad_page) << PAGE_SHIFT;
+ memset(__va(bad_page_address), 0, PAGE_SIZE);
+
+ return r;
+
+out_free:
+ free_kvm_area();
+out:
+ kvm_exit_debug();
+ return r;
+}
+
+static __exit void kvm_exit(void)
+{
+ kvm_exit_debug();
+ misc_deregister(&kvm_dev);
+ unregister_reboot_notifier(&kvm_reboot_notifier);
+ on_each_cpu(kvm_disable, 0, 0, 1);
+ free_kvm_area();
+ __free_page(pfn_to_page(bad_page_address >> PAGE_SHIFT));
+}
+
+module_init(kvm_init)
+module_exit(kvm_exit)

2006-10-23 13:31:36

by Avi Kivity

Subject: [PATCH 9/13] KVM: define exit handlers

This defines exit handlers for:

- exceptions (normally only page faults)
- external interrupts
- control register access
- invlpg
- I/O instructions
- hlt
- interrupt window management (exit when guest interrupts are enabled)
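
For readers new to the dispatch convention, here is a minimal userspace sketch of it, not the driver code itself. It assumes only what the patch states: a handler returns 1 when the exit was handled fully and the guest may resume, and otherwise fills in the run structure and returns 0. All demo_* names and the exit reason values are hypothetical stand-ins for kvm_vmx_exit_handlers and the EXIT_REASON_* constants.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical exit reasons; the real values come from the VMX headers. */
enum demo_exit_reason {
	DEMO_EXIT_EXCEPTION,
	DEMO_EXIT_IO,
	DEMO_EXIT_HLT,		/* deliberately left without a handler */
	DEMO_EXIT_UNKNOWN,
	DEMO_EXIT_MAX
};

/* Stand-in for struct kvm_run: tells userspace what it has to do. */
struct demo_run {
	int exit_reason;
};

/* Returns 1 if the guest may resume, 0 if userspace must step in. */
typedef int (*demo_handler)(struct demo_run *run);

static int demo_handle_exception(struct demo_run *run)
{
	(void)run;
	return 1;	/* e.g. a page fault fixed up in the kernel */
}

static int demo_handle_io(struct demo_run *run)
{
	run->exit_reason = DEMO_EXIT_IO;	/* userspace emulates the device */
	return 0;
}

/* Sparse table indexed by exit reason, as in the patch. */
static const demo_handler demo_handlers[DEMO_EXIT_MAX] = {
	[DEMO_EXIT_EXCEPTION]	= demo_handle_exception,
	[DEMO_EXIT_IO]		= demo_handle_io,
};

static int demo_dispatch(int reason, struct demo_run *run)
{
	assert(run != NULL);
	if (reason >= 0 && reason < DEMO_EXIT_MAX && demo_handlers[reason])
		return demo_handlers[reason](run);

	/* No handler registered: report the exit to userspace. */
	run->exit_reason = DEMO_EXIT_UNKNOWN;
	return 0;
}
```

A caller would loop on demo_dispatch() and re-enter the guest while it returns 1, returning to userspace with the filled-in run structure as soon as it returns 0.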

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/kvm_main.c
===================================================================
--- linux-2.6.orig/drivers/kvm/kvm_main.c
+++ linux-2.6/drivers/kvm/kvm_main.c
@@ -32,6 +32,7 @@
#include <linux/file.h>

#include "vmx.h"
+#include "x86_emulate.h"

MODULE_AUTHOR("Qumranet");
MODULE_LICENSE("GPL");
@@ -655,6 +656,93 @@ static void vmcs_write64(unsigned long f
#endif
}

+#ifdef __x86_64__
+#define HOST_IS_64 1
+#else
+#define HOST_IS_64 0
+#endif
+
+#define GUEST_IS_64 HOST_IS_64
+
+static void enter_pmode(struct kvm_vcpu *vcpu)
+{
+ unsigned long flags;
+
+ vcpu->rmode.active = 0;
+
+ vmcs_writel(GUEST_TR_BASE, vcpu->rmode.tr.base);
+ vmcs_write32(GUEST_TR_LIMIT, vcpu->rmode.tr.limit);
+ vmcs_write32(GUEST_TR_AR_BYTES, vcpu->rmode.tr.ar);
+
+ flags = vmcs_readl(GUEST_RFLAGS);
+ flags &= ~(IOPL_MASK | X86_EFLAGS_VM);
+ flags |= (vcpu->rmode.save_iopl << IOPL_SHIFT);
+ vmcs_writel(GUEST_RFLAGS, flags);
+
+ vmcs_writel(GUEST_CR4, (vmcs_readl(GUEST_CR4) & ~CR4_VME_MASK) |
+ (vmcs_readl(CR4_READ_SHADOW) & CR4_VME_MASK) );
+
+ vmcs_write32(EXCEPTION_BITMAP, 1 << PF_VECTOR);
+
+ #define FIX_PMODE_DATASEG(seg, save) { \
+ vmcs_write16(GUEST_##seg##_SELECTOR, 0); \
+ vmcs_writel(GUEST_##seg##_BASE, 0); \
+ vmcs_write32(GUEST_##seg##_LIMIT, 0xffff); \
+ vmcs_write32(GUEST_##seg##_AR_BYTES, 0x93); \
+ }
+
+ FIX_PMODE_DATASEG(SS, vcpu->rmode.ss);
+ FIX_PMODE_DATASEG(ES, vcpu->rmode.es);
+ FIX_PMODE_DATASEG(DS, vcpu->rmode.ds);
+ FIX_PMODE_DATASEG(GS, vcpu->rmode.gs);
+ FIX_PMODE_DATASEG(FS, vcpu->rmode.fs);
+
+ vmcs_write16(GUEST_CS_SELECTOR,
+ vmcs_read16(GUEST_CS_SELECTOR) & ~SELECTOR_RPL_MASK);
+ vmcs_write32(GUEST_CS_AR_BYTES, 0x9b);
+}
+
+static void enter_rmode(struct kvm_vcpu *vcpu)
+{
+ unsigned long flags;
+
+ vcpu->rmode.active = 1;
+
+ vcpu->rmode.tr.base = vmcs_readl(GUEST_TR_BASE);
+ vmcs_writel(GUEST_TR_BASE, rmode_tss_base(vcpu->kvm));
+
+ vcpu->rmode.tr.limit = vmcs_read32(GUEST_TR_LIMIT);
+ vmcs_write32(GUEST_TR_LIMIT, RMODE_TSS_SIZE - 1);
+
+ vcpu->rmode.tr.ar = vmcs_read32(GUEST_TR_AR_BYTES);
+ vmcs_write32(GUEST_TR_AR_BYTES, 0x008b);
+
+ flags = vmcs_readl(GUEST_RFLAGS);
+ vcpu->rmode.save_iopl = (flags & IOPL_MASK) >> IOPL_SHIFT;
+
+ flags |= IOPL_MASK | X86_EFLAGS_VM;
+
+ vmcs_writel(GUEST_RFLAGS, flags);
+ vmcs_writel(GUEST_CR4, vmcs_readl(GUEST_CR4) | CR4_VME_MASK);
+ vmcs_write32(EXCEPTION_BITMAP, ~0);
+
+ #define FIX_RMODE_SEG(seg, save) { \
+ vmcs_write16(GUEST_##seg##_SELECTOR, \
+ vmcs_readl(GUEST_##seg##_BASE) >> 4); \
+ vmcs_write32(GUEST_##seg##_LIMIT, 0xffff); \
+ vmcs_write32(GUEST_##seg##_AR_BYTES, 0xf3); \
+ }
+
+ vmcs_write32(GUEST_CS_AR_BYTES, 0xf3);
+ vmcs_write16(GUEST_CS_SELECTOR, vmcs_readl(GUEST_CS_BASE) >> 4);
+
+ FIX_RMODE_SEG(ES, vcpu->rmode.es);
+ FIX_RMODE_SEG(DS, vcpu->rmode.ds);
+ FIX_RMODE_SEG(SS, vcpu->rmode.ss);
+ FIX_RMODE_SEG(GS, vcpu->rmode.gs);
+ FIX_RMODE_SEG(FS, vcpu->rmode.fs);
+}
+
static int rmode_tss_base(struct kvm* kvm)
{
gfn_t base_gfn = kvm->memslots[0].base_gfn + kvm->memslots[0].npages - 3;
@@ -1285,6 +1373,464 @@ static void skip_emulated_instruction(st
interruptibility & ~3);
}

+static int emulator_read_std(unsigned long addr,
+ unsigned long *val,
+ unsigned int bytes,
+ struct x86_emulate_ctxt *ctxt)
+{
+ struct kvm_vcpu *vcpu = ctxt->vcpu;
+ void *data = val;
+
+ while (bytes) {
+ gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+ unsigned offset = addr & (PAGE_SIZE-1);
+ unsigned tocopy = min(bytes, (unsigned)PAGE_SIZE - offset);
+ unsigned long pfn;
+ struct kvm_memory_slot *memslot;
+ void *page;
+
+ if (gpa == UNMAPPED_GVA)
+ return X86EMUL_PROPAGATE_FAULT;
+ pfn = gpa >> PAGE_SHIFT;
+ memslot = gfn_to_memslot(vcpu->kvm, pfn);
+ if (!memslot)
+ return X86EMUL_UNHANDLEABLE;
+ page = kmap_atomic(gfn_to_page(memslot, pfn), KM_USER0);
+
+ memcpy(data, page + offset, tocopy);
+
+ kunmap_atomic(page, KM_USER0);
+
+ bytes -= tocopy;
+ data += tocopy;
+ addr += tocopy;
+ }
+
+ return X86EMUL_CONTINUE;
+}
+
+static int emulator_write_std(unsigned long addr,
+ unsigned long val,
+ unsigned int bytes,
+ struct x86_emulate_ctxt *ctxt)
+{
+ printk(KERN_ERR "emulator_write_std: addr %lx n %d\n",
+ addr, bytes);
+ return X86EMUL_UNHANDLEABLE;
+}
+
+static int emulator_read_emulated(unsigned long addr,
+ unsigned long *val,
+ unsigned int bytes,
+ struct x86_emulate_ctxt *ctxt)
+{
+ struct kvm_vcpu *vcpu = ctxt->vcpu;
+
+ if (vcpu->mmio_read_completed) {
+ memcpy(val, vcpu->mmio_data, bytes);
+ vcpu->mmio_read_completed = 0;
+ return X86EMUL_CONTINUE;
+ } else if (emulator_read_std(addr, val, bytes, ctxt)
+ == X86EMUL_CONTINUE)
+ return X86EMUL_CONTINUE;
+ else {
+ gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+ if (gpa == UNMAPPED_GVA)
+ return vcpu_printf(vcpu, "not present\n"), X86EMUL_PROPAGATE_FAULT;
+ vcpu->mmio_needed = 1;
+ vcpu->mmio_phys_addr = gpa;
+ vcpu->mmio_size = bytes;
+ vcpu->mmio_is_write = 0;
+
+ return X86EMUL_UNHANDLEABLE;
+ }
+}
+
+static int emulator_write_emulated(unsigned long addr,
+ unsigned long val,
+ unsigned int bytes,
+ struct x86_emulate_ctxt *ctxt)
+{
+ struct kvm_vcpu *vcpu = ctxt->vcpu;
+ gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+
+ if (gpa == UNMAPPED_GVA)
+ return X86EMUL_PROPAGATE_FAULT;
+
+ vcpu->mmio_needed = 1;
+ vcpu->mmio_phys_addr = gpa;
+ vcpu->mmio_size = bytes;
+ vcpu->mmio_is_write = 1;
+ memcpy(vcpu->mmio_data, &val, bytes);
+
+ return X86EMUL_CONTINUE;
+}
+
+static int emulator_cmpxchg_emulated(unsigned long addr,
+ unsigned long old,
+ unsigned long new,
+ unsigned int bytes,
+ struct x86_emulate_ctxt *ctxt)
+{
+ static int reported;
+
+ if (!reported) {
+ reported = 1;
+ printk(KERN_WARNING "kvm: emulating exchange as write\n");
+ }
+ return emulator_write_emulated(addr, new, bytes, ctxt);
+}
+
+static void report_emulation_failure(struct x86_emulate_ctxt *ctxt)
+{
+ static int reported;
+ u8 opcodes[4];
+ unsigned long rip = vmcs_readl(GUEST_RIP);
+ unsigned long rip_linear = rip + vmcs_readl(GUEST_CS_BASE);
+
+ if (reported)
+ return;
+
+ emulator_read_std(rip_linear, (void *)opcodes, 4, ctxt);
+
+ printk(KERN_ERR "emulation failed but !mmio_needed?"
+ " rip %lx %02x %02x %02x %02x\n",
+ rip, opcodes[0], opcodes[1], opcodes[2], opcodes[3]);
+ reported = 1;
+}
+
+struct x86_emulate_ops emulate_ops = {
+ .read_std = emulator_read_std,
+ .write_std = emulator_write_std,
+ .read_emulated = emulator_read_emulated,
+ .write_emulated = emulator_write_emulated,
+ .cmpxchg_emulated = emulator_cmpxchg_emulated,
+};
+
+enum emulation_result {
+ EMULATE_DONE, /* no further processing */
+ EMULATE_DO_MMIO, /* kvm_run filled with mmio request */
+ EMULATE_FAIL, /* can't emulate this instruction */
+};
+
+static int emulate_instruction(struct kvm_vcpu *vcpu,
+ struct kvm_run *run,
+ unsigned long cr2,
+ u16 error_code)
+{
+ struct x86_emulate_ctxt emulate_ctxt;
+ int r;
+ u32 cs_ar;
+
+ vcpu_load_rsp_rip(vcpu);
+
+ cs_ar = vmcs_read32(GUEST_CS_AR_BYTES);
+
+ emulate_ctxt.vcpu = vcpu;
+ emulate_ctxt.eflags = vmcs_readl(GUEST_RFLAGS);
+ emulate_ctxt.cr2 = cr2;
+ emulate_ctxt.mode = (emulate_ctxt.eflags & X86_EFLAGS_VM)
+ ? X86EMUL_MODE_REAL : (cs_ar & AR_L_MASK)
+ ? X86EMUL_MODE_PROT64 : (cs_ar & AR_DB_MASK)
+ ? X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16;
+
+ if (emulate_ctxt.mode == X86EMUL_MODE_PROT64) {
+ emulate_ctxt.cs_base = 0;
+ emulate_ctxt.ds_base = 0;
+ emulate_ctxt.es_base = 0;
+ emulate_ctxt.ss_base = 0;
+ emulate_ctxt.gs_base = 0;
+ emulate_ctxt.fs_base = 0;
+ } else {
+ emulate_ctxt.cs_base = vmcs_readl(GUEST_CS_BASE);
+ emulate_ctxt.ds_base = vmcs_readl(GUEST_DS_BASE);
+ emulate_ctxt.es_base = vmcs_readl(GUEST_ES_BASE);
+ emulate_ctxt.ss_base = vmcs_readl(GUEST_SS_BASE);
+ emulate_ctxt.gs_base = vmcs_readl(GUEST_GS_BASE);
+ emulate_ctxt.fs_base = vmcs_readl(GUEST_FS_BASE);
+ }
+
+ vcpu->mmio_is_write = 0;
+ r = x86_emulate_memop(&emulate_ctxt, &emulate_ops);
+
+ if ((r || vcpu->mmio_is_write) && run) {
+ run->mmio.phys_addr = vcpu->mmio_phys_addr;
+ memcpy(run->mmio.data, vcpu->mmio_data, 8);
+ run->mmio.len = vcpu->mmio_size;
+ run->mmio.is_write = vcpu->mmio_is_write;
+ }
+
+ if (r) {
+ if (!vcpu->mmio_needed) {
+ report_emulation_failure(&emulate_ctxt);
+ return EMULATE_FAIL;
+ }
+ return EMULATE_DO_MMIO;
+ }
+
+ vcpu_put_rsp_rip(vcpu);
+ vmcs_writel(GUEST_RFLAGS, emulate_ctxt.eflags);
+
+ if (vcpu->mmio_is_write)
+ return EMULATE_DO_MMIO;
+
+ return EMULATE_DONE;
+}
+
+static u64 mk_cr_64(u64 curr_cr, u32 new_val)
+{
+ return (curr_cr & ~((1ULL << 32) - 1)) | new_val;
+}
+
+void realmode_lgdt(struct kvm_vcpu *vcpu, u16 limit, unsigned long base)
+{
+ vmcs_writel(GUEST_GDTR_BASE, base);
+ vmcs_write32(GUEST_GDTR_LIMIT, limit);
+}
+
+void realmode_lidt(struct kvm_vcpu *vcpu, u16 limit, unsigned long base)
+{
+ vmcs_writel(GUEST_IDTR_BASE, base);
+ vmcs_write32(GUEST_IDTR_LIMIT, limit);
+}
+
+void realmode_lmsw(struct kvm_vcpu *vcpu, unsigned long msw,
+ unsigned long *rflags)
+{
+ lmsw(vcpu, msw);
+ *rflags = vmcs_readl(GUEST_RFLAGS);
+}
+
+unsigned long realmode_get_cr(struct kvm_vcpu *vcpu, int cr)
+{
+ switch (cr) {
+ case 0:
+ return guest_cr0();
+ case 2:
+ return vcpu->cr2;
+ case 3:
+ return vcpu->cr3;
+ case 4:
+ return guest_cr4();
+ default:
+ vcpu_printf(vcpu, "%s: unexpected cr %u\n", __FUNCTION__, cr);
+ return 0;
+ }
+}
+
+void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long val,
+ unsigned long *rflags)
+{
+ switch (cr) {
+ case 0:
+ set_cr0(vcpu, mk_cr_64(guest_cr0(), val));
+ *rflags = vmcs_readl(GUEST_RFLAGS);
+ break;
+ case 2:
+ vcpu->cr2 = val;
+ break;
+ case 3:
+ set_cr3(vcpu, val);
+ break;
+ case 4:
+ set_cr4(vcpu, mk_cr_64(guest_cr4(), val));
+ break;
+ default:
+ vcpu_printf(vcpu, "%s: unexpected cr %u\n", __FUNCTION__, cr);
+ }
+}
+
+static int handle_rmode_exception(struct kvm_vcpu *vcpu,
+ int vec, u32 err_code)
+{
+ if (!vcpu->rmode.active)
+ return 0;
+
+ if (vec == GP_VECTOR && err_code == 0)
+ if (emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE)
+ return 1;
+ return 0;
+}
+
+static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ u32 intr_info, error_code;
+ unsigned long cr2, rip;
+ u32 vect_info;
+ enum emulation_result er;
+
+ vect_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+ intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+
+ if ((vect_info & VECTORING_INFO_VALID_MASK) &&
+ !is_page_fault(intr_info)) {
+ printk("%s: unexpected, vectoring info 0x%x intr info 0x%x\n",
+ __FUNCTION__, vect_info, intr_info);
+ }
+
+ if (is_external_interrupt(vect_info)) {
+ int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
+ set_bit(irq, vcpu->irq_pending);
+ set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
+ }
+
+ if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == 0x200) { /* nmi */
+ asm ( "int $2" );
+ return 1;
+ }
+ error_code = 0;
+ rip = vmcs_readl(GUEST_RIP);
+ if (intr_info & INTR_INFO_DELIEVER_CODE_MASK)
+ error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+ if (is_page_fault(intr_info)) {
+ cr2 = vmcs_readl(EXIT_QUALIFICATION);
+
+ spin_lock(&vcpu->kvm->lock);
+ if (!vcpu->mmu.page_fault(vcpu, cr2, error_code)) {
+ spin_unlock(&vcpu->kvm->lock);
+ return 1;
+ }
+
+ er = emulate_instruction(vcpu, kvm_run, cr2, error_code);
+ spin_unlock(&vcpu->kvm->lock);
+
+ switch (er) {
+ case EMULATE_DONE:
+ return 1;
+ case EMULATE_DO_MMIO:
+ ++kvm_stat.mmio_exits;
+ kvm_run->exit_reason = KVM_EXIT_MMIO;
+ return 0;
+ case EMULATE_FAIL:
+ vcpu_printf(vcpu, "%s: emulate fail\n", __FUNCTION__);
+ break;
+ default:
+ BUG();
+ }
+ }
+
+ if (vcpu->rmode.active &&
+ handle_rmode_exception(vcpu, intr_info & INTR_INFO_VECTOR_MASK,
+ error_code))
+ return 1;
+
+ if ((intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK)) == (INTR_TYPE_EXCEPTION | 1)) {
+ kvm_run->exit_reason = KVM_EXIT_DEBUG;
+ return 0;
+ }
+ kvm_run->exit_reason = KVM_EXIT_EXCEPTION;
+ kvm_run->ex.exception = intr_info & INTR_INFO_VECTOR_MASK;
+ kvm_run->ex.error_code = error_code;
+ return 0;
+}
+
+static int handle_external_interrupt(struct kvm_vcpu *vcpu,
+ struct kvm_run *kvm_run)
+{
+ ++kvm_stat.irq_exits;
+ return 1;
+}
+
+
+static int get_io_count(struct kvm_vcpu *vcpu, u64 *count)
+{
+ u64 inst;
+ gva_t rip;
+ int countr_size;
+ int i, n;
+
+ if ((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_VM)) {
+ countr_size = 2;
+ } else {
+ u32 cs_ar = vmcs_read32(GUEST_CS_AR_BYTES);
+
+ countr_size = (cs_ar & AR_L_MASK) ? 8:
+ (cs_ar & AR_DB_MASK) ? 4: 2;
+ }
+
+ rip = vmcs_readl(GUEST_RIP);
+ if (countr_size != 8)
+ rip += vmcs_readl(GUEST_CS_BASE);
+
+ n = kvm_read_guest(vcpu, rip, sizeof(inst), &inst);
+
+ for (i = 0; i < n; i++) {
+ switch (((u8*)&inst)[i]) {
+ case 0xf0:
+ case 0xf2:
+ case 0xf3:
+ case 0x2e:
+ case 0x36:
+ case 0x3e:
+ case 0x26:
+ case 0x64:
+ case 0x65:
+ case 0x66:
+ break;
+ case 0x67:
+ countr_size = (countr_size == 2) ? 4: (countr_size >> 1);
+ default:
+ goto done;
+ }
+ }
+ return 0;
+done:
+ countr_size *= 8;
+ *count = vcpu->regs[VCPU_REGS_RCX] & (~0ULL >> (64 - countr_size));
+ return 1;
+}
+
+static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ u64 exit_qualification;
+
+ ++kvm_stat.io_exits;
+ exit_qualification = vmcs_read64(EXIT_QUALIFICATION);
+ kvm_run->exit_reason = KVM_EXIT_IO;
+ if (exit_qualification & 8)
+ kvm_run->io.direction = KVM_EXIT_IO_IN;
+ else
+ kvm_run->io.direction = KVM_EXIT_IO_OUT;
+ kvm_run->io.size = (exit_qualification & 7) + 1;
+ kvm_run->io.string = (exit_qualification & 16) != 0;
+ kvm_run->io.string_down
+ = (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_DF) != 0;
+ kvm_run->io.rep = (exit_qualification & 32) != 0;
+ kvm_run->io.port = exit_qualification >> 16;
+ if (kvm_run->io.string) {
+ if (!get_io_count(vcpu, &kvm_run->io.count))
+ return 1;
+ kvm_run->io.address = vmcs_readl(GUEST_LINEAR_ADDRESS);
+ } else
+ kvm_run->io.value = vcpu->regs[VCPU_REGS_RAX]; /* rax */
+ return 0;
+}
+
+
+static int handle_invlpg(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ u64 address = vmcs_read64(EXIT_QUALIFICATION);
+ int instruction_length = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+ spin_lock(&vcpu->kvm->lock);
+ vcpu->mmu.inval_page(vcpu, address);
+ spin_unlock(&vcpu->kvm->lock);
+ vmcs_writel(GUEST_RIP, vmcs_readl(GUEST_RIP) + instruction_length);
+ return 1;
+}
+
+
+static void inject_gp(struct kvm_vcpu *vcpu)
+{
+ printk("inject_general_protection: rip 0x%lx\n",
+ vmcs_readl(GUEST_RIP));
+ vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
+ vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+ GP_VECTOR |
+ INTR_TYPE_EXCEPTION |
+ INTR_INFO_DELIEVER_CODE_MASK |
+ INTR_INFO_VALID_MASK);
+}
+
static int pdptrs_have_reserved_bits_set(struct kvm_vcpu *vcpu,
unsigned long cr3)
{
@@ -1503,6 +2049,79 @@ static void __set_cr4(struct kvm_vcpu *v
KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON));
}

+static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ u64 exit_qualification;
+ int cr;
+ int reg;
+
+#ifdef KVM_DEBUG
+ if (guest_cpl() != 0) {
+ vcpu_printf(vcpu, "%s: not supervisor\n", __FUNCTION__);
+ inject_gp(vcpu);
+ return 1;
+ }
+#endif
+
+ exit_qualification = vmcs_read64(EXIT_QUALIFICATION);
+ cr = exit_qualification & 15;
+ reg = (exit_qualification >> 8) & 15;
+ switch ((exit_qualification >> 4) & 3) {
+ case 0: /* mov to cr */
+ switch (cr) {
+ case 0:
+ vcpu_load_rsp_rip(vcpu);
+ set_cr0(vcpu, vcpu->regs[reg]);
+ skip_emulated_instruction(vcpu);
+ return 1;
+ case 3:
+ vcpu_load_rsp_rip(vcpu);
+ set_cr3(vcpu, vcpu->regs[reg]);
+ skip_emulated_instruction(vcpu);
+ return 1;
+ case 4:
+ vcpu_load_rsp_rip(vcpu);
+ set_cr4(vcpu, vcpu->regs[reg]);
+ skip_emulated_instruction(vcpu);
+ return 1;
+ case 8:
+ vcpu_load_rsp_rip(vcpu);
+ set_cr8(vcpu, vcpu->regs[reg]);
+ skip_emulated_instruction(vcpu);
+ return 1;
+ };
+ break;
+ case 1: /* mov from cr */
+ switch (cr) {
+ case 3:
+ vcpu_load_rsp_rip(vcpu);
+ vcpu->regs[reg] = vcpu->cr3;
+ vcpu_put_rsp_rip(vcpu);
+ skip_emulated_instruction(vcpu);
+ return 1;
+ case 8:
+ printk("handle_cr: read CR8 cpu bug (AA15) !!!!!!!!!!!!!!!!!\n");
+ vcpu_load_rsp_rip(vcpu);
+ vcpu->regs[reg] = vcpu->cr8;
+ vcpu_put_rsp_rip(vcpu);
+ skip_emulated_instruction(vcpu);
+ return 1;
+ }
+ break;
+ case 3: /* lmsw */
+ lmsw(vcpu, (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f);
+
+ skip_emulated_instruction(vcpu);
+ return 1;
+ default:
+ break;
+ }
+ kvm_run->exit_reason = 0;
+ printk(KERN_ERR "kvm: unhandled control register: op %d cr %d\n",
+ (int)(exit_qualification >> 4) & 3, cr);
+ return 0;
+}
+
#ifdef __x86_64__
#define EFER_RESERVED_BITS 0xfffffffffffff2fe

@@ -1556,6 +2175,26 @@ static void __set_efer(struct kvm_vcpu *
}
#endif

+static int handle_interrupt_window(struct kvm_vcpu *vcpu,
+ struct kvm_run *kvm_run)
+{
+ /* Turn off interrupt window reporting. */
+ vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
+ vmcs_read32(CPU_BASED_VM_EXEC_CONTROL)
+ & ~CPU_BASED_VIRTUAL_INTR_PENDING);
+ return 1;
+}
+
+static int handle_halt(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ skip_emulated_instruction(vcpu);
+ if (vcpu->irq_summary && (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF))
+ return 1;
+
+ kvm_run->exit_reason = KVM_EXIT_HLT;
+ return 0;
+}
+
/*
* The exit handlers return 1 if the exit was handled fully and guest execution
* may resume. Otherwise they set the kvm_run parameter to indicate what needs
@@ -1563,6 +2202,13 @@ static void __set_efer(struct kvm_vcpu *
*/
static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu,
struct kvm_run *kvm_run) = {
+ [EXIT_REASON_EXCEPTION_NMI] = handle_exception,
+ [EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt,
+ [EXIT_REASON_IO_INSTRUCTION] = handle_io,
+ [EXIT_REASON_INVLPG] = handle_invlpg,
+ [EXIT_REASON_CR_ACCESS] = handle_cr,
+ [EXIT_REASON_PENDING_INTERRUPT] = handle_interrupt_window,
+ [EXIT_REASON_HLT] = handle_halt,
};

static const int kvm_vmx_max_exit_handlers =

2006-10-23 13:31:59

by Avi Kivity

Subject: [PATCH 11/13] KVM: mmu

Changes from v1:

- fixed a missing typecast which caused a lockup on i386 with >= 4GB RAM

--

This patch contains the shadow page table code.

This is a fairly naive implementation that uses the tlb management instructions
to keep the shadow page tables in sync with the guest page tables:

- invlpg: remove the shadow pte for the given virtual address
- tlb flush: remove all shadow ptes for non-global pages

The relative simplicity of the approach comes at a price: every guest address
space switch needs to rebuild the shadow page tables for the new address space.
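
The two synchronization rules above can be modelled in a few lines of userspace C. This is a toy sketch, assuming a flat array in place of the real multi-level shadow tables; the demo_* names are hypothetical and do not appear in mmu.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_PAGES 8

/* One shadow entry per guest virtual page (flat, for illustration only). */
struct demo_shadow_pte {
	bool present;
	bool global;
};

struct demo_shadow {
	struct demo_shadow_pte pte[DEMO_NR_PAGES];
};

/* Install a shadow entry, as a page fault handler would on demand. */
static void demo_map(struct demo_shadow *s, size_t vpn, bool global)
{
	assert(vpn < DEMO_NR_PAGES);
	s->pte[vpn].present = true;
	s->pte[vpn].global = global;
}

/* invlpg: remove the shadow pte for the given virtual page. */
static void demo_invlpg(struct demo_shadow *s, size_t vpn)
{
	assert(vpn < DEMO_NR_PAGES);
	s->pte[vpn].present = false;
}

/* tlb flush: remove all shadow ptes for non-global pages. */
static void demo_tlb_flush(struct demo_shadow *s)
{
	for (size_t i = 0; i < DEMO_NR_PAGES; i++)
		if (!s->pte[i].global)
			s->pte[i].present = false;
}

static size_t demo_count_present(const struct demo_shadow *s)
{
	size_t n = 0;

	for (size_t i = 0; i < DEMO_NR_PAGES; i++)
		n += s->pte[i].present;
	return n;
}
```

After a flush only the global entries survive, so every other page has to fault back in. That refill is the cost described above.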

Other noteworthy items:

- the dirty bit is emulated by mapping non-dirty, writable pages as read-only;
the first write sets the dirty bit and remaps the page as writable
- we support both 32-bit and 64-bit guest ptes
- the host ptes are always 64-bit, even on non-pae i386 hosts
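
The dirty bit emulation in the first item can be sketched as follows. This is a simplified userspace model, not the code in mmu.c. The bit positions match the PT_*_MASK definitions in the patch, but the demo_* helpers are hypothetical, and the shadow pte is assumed to be a plain copy of the guest pte apart from the writable bit.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit positions as PT_PRESENT_MASK, PT_WRITABLE_MASK, PT_DIRTY_MASK. */
#define DEMO_PT_PRESENT		(1ULL << 0)
#define DEMO_PT_WRITABLE	(1ULL << 1)
#define DEMO_PT_DIRTY		(1ULL << 6)

/*
 * Build a shadow pte from a guest pte. A writable page that is not yet
 * dirty is mapped read-only so that the first write traps.
 */
static uint64_t demo_make_shadow(uint64_t guest_pte)
{
	uint64_t shadow = guest_pte;

	if (!(guest_pte & DEMO_PT_DIRTY))
		shadow &= ~DEMO_PT_WRITABLE;
	return shadow;
}

/*
 * Handle a write fault. Returns 1 if the fault was caused only by the
 * dirty tracking and has been fixed up, 0 if it belongs to the guest.
 */
static int demo_write_fault(uint64_t *guest_pte, uint64_t *shadow_pte)
{
	assert(guest_pte && shadow_pte);

	if (!(*guest_pte & DEMO_PT_PRESENT) ||
	    !(*guest_pte & DEMO_PT_WRITABLE))
		return 0;

	*guest_pte |= DEMO_PT_DIRTY;			/* set the guest dirty bit */
	*shadow_pte = demo_make_shadow(*guest_pte);	/* now maps writable */
	return 1;
}
```

Only the first write to a clean page pays for a fault. Later writes go straight through the now-writable shadow pte.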

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/mmu.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/mmu.c
@@ -0,0 +1,719 @@
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * This module enables machines with Intel VT-x extensions to run virtual
+ * machines without emulation or binary translation.
+ *
+ * MMU support
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ *
+ * Authors:
+ * Yaniv Kamay <[email protected]>
+ * Avi Kivity <[email protected]>
+ *
+ */
+#include <linux/types.h>
+#include <linux/string.h>
+#include <asm/page.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+
+#include "vmx.h"
+#include "kvm.h"
+
+#define pgprintk(x...) do { } while (0)
+
+#define ASSERT(x) \
+ do { if (!(x)) \
+ printk("assertion failed %s:%d: %s\n", __FILE__, __LINE__, #x);\
+ } while (0)
+
+#define PT64_ENT_PER_PAGE 512
+#define PT32_ENT_PER_PAGE 1024
+
+#define PT_WRITABLE_SHIFT 1
+
+#define PT_PRESENT_MASK (1ULL << 0)
+#define PT_WRITABLE_MASK (1ULL << PT_WRITABLE_SHIFT)
+#define PT_USER_MASK (1ULL << 2)
+#define PT_PWT_MASK (1ULL << 3)
+#define PT_PCD_MASK (1ULL << 4)
+#define PT_ACCESSED_MASK (1ULL << 5)
+#define PT_DIRTY_MASK (1ULL << 6)
+#define PT_PAGE_SIZE_MASK (1ULL << 7)
+#define PT_PAT_MASK (1ULL << 7)
+#define PT_GLOBAL_MASK (1ULL << 8)
+#define PT64_NX_MASK (1ULL << 63)
+
+#define PT_PAT_SHIFT 7
+#define PT_DIR_PAT_SHIFT 12
+#define PT_DIR_PAT_MASK (1ULL << PT_DIR_PAT_SHIFT)
+
+#define PT32_DIR_PSE36_SIZE 4
+#define PT32_DIR_PSE36_SHIFT 13
+#define PT32_DIR_PSE36_MASK (((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
+
+
+#define PT32_PTE_COPY_MASK \
+ (PT_PRESENT_MASK | PT_PWT_MASK | PT_PCD_MASK | \
+ PT_ACCESSED_MASK | PT_DIRTY_MASK | PT_PAT_MASK | \
+ PT_GLOBAL_MASK )
+
+#define PT32_NON_PTE_COPY_MASK \
+ (PT_PRESENT_MASK | PT_PWT_MASK | PT_PCD_MASK | \
+ PT_ACCESSED_MASK | PT_DIRTY_MASK)
+
+
+#define PT64_PTE_COPY_MASK \
+ (PT64_NX_MASK | PT32_PTE_COPY_MASK)
+
+#define PT64_NON_PTE_COPY_MASK \
+ (PT64_NX_MASK | PT32_NON_PTE_COPY_MASK)
+
+
+
+#define PT_FIRST_AVAIL_BITS_SHIFT 9
+#define PT64_SECOND_AVAIL_BITS_SHIFT 52
+
+#define PT_SHADOW_PS_MARK (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define PT_SHADOW_IO_MARK (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+
+#define PT_SHADOW_WRITABLE_SHIFT (PT_FIRST_AVAIL_BITS_SHIFT + 1)
+#define PT_SHADOW_WRITABLE_MASK (1ULL << PT_SHADOW_WRITABLE_SHIFT)
+
+#define PT_SHADOW_USER_SHIFT (PT_SHADOW_WRITABLE_SHIFT + 1)
+#define PT_SHADOW_USER_MASK (1ULL << (PT_SHADOW_USER_SHIFT))
+
+#define PT_SHADOW_BITS_OFFSET (PT_SHADOW_WRITABLE_SHIFT - PT_WRITABLE_SHIFT)
+
+#define VALID_PAGE(x) ((x) != INVALID_PAGE)
+
+#define PT64_LEVEL_BITS 9
+
+#define PT64_LEVEL_SHIFT(level) \
+ ( PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS )
+
+#define PT64_LEVEL_MASK(level) \
+ (((1ULL << PT64_LEVEL_BITS) - 1) << PT64_LEVEL_SHIFT(level))
+
+#define PT64_INDEX(address, level)\
+ (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
+
+
+#define PT32_LEVEL_BITS 10
+
+#define PT32_LEVEL_SHIFT(level) \
+ ( PAGE_SHIFT + (level - 1) * PT32_LEVEL_BITS )
+
+#define PT32_LEVEL_MASK(level) \
+ (((1ULL << PT32_LEVEL_BITS) - 1) << PT32_LEVEL_SHIFT(level))
+
+#define PT32_INDEX(address, level)\
+ (((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
+
+
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & PAGE_MASK)
+#define PT64_DIR_BASE_ADDR_MASK \
+ (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + PT64_LEVEL_BITS)) - 1))
+
+#define PT32_BASE_ADDR_MASK PAGE_MASK
+#define PT32_DIR_BASE_ADDR_MASK \
+ (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
+
+
+#define PFERR_PRESENT_MASK (1U << 0)
+#define PFERR_WRITE_MASK (1U << 1)
+#define PFERR_USER_MASK (1U << 2)
+
+#define PT64_ROOT_LEVEL 4
+#define PT32_ROOT_LEVEL 2
+#define PT32E_ROOT_LEVEL 3
+
+#define PT_DIRECTORY_LEVEL 2
+#define PT_PAGE_TABLE_LEVEL 1
+
+static int is_write_protection(void)
+{
+ return guest_cr0() & CR0_WP_MASK;
+}
+
+static int is_cpuid_PSE36(void)
+{
+ return 1;
+}
+
+static int is_present_pte(unsigned long pte)
+{
+ return pte & PT_PRESENT_MASK;
+}
+
+static int is_writeble_pte(unsigned long pte)
+{
+ return pte & PT_WRITABLE_MASK;
+}
+
+static int is_io_pte(unsigned long pte)
+{
+ return pte & PT_SHADOW_IO_MARK;
+}
+
+static void kvm_mmu_free_page(struct kvm_vcpu *vcpu, hpa_t page_hpa)
+{
+ struct kvm_mmu_page *page_head = page_header(page_hpa);
+
+ list_del(&page_head->link);
+ page_head->page_hpa = page_hpa;
+ list_add(&page_head->link, &vcpu->free_pages);
+}
+
+static int is_empty_shadow_page(hpa_t page_hpa)
+{
+ u32 *pos;
+ u32 *end;
+ for (pos = __va(page_hpa), end = pos + PAGE_SIZE / sizeof(u32);
+ pos != end; pos++)
+ if (*pos != 0)
+ return 0;
+ return 1;
+}
+
+static hpa_t kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, u64 *parent_pte)
+{
+ struct kvm_mmu_page *page;
+
+ if (list_empty(&vcpu->free_pages))
+ return INVALID_PAGE;
+
+ page = list_entry(vcpu->free_pages.next, struct kvm_mmu_page, link);
+ list_del(&page->link);
+ list_add(&page->link, &vcpu->kvm->active_mmu_pages);
+ ASSERT(is_empty_shadow_page(page->page_hpa));
+ page->slot_bitmap = 0;
+ page->global = 1;
+ page->parent_pte = parent_pte;
+ return page->page_hpa;
+}
+
+static void page_header_update_slot(struct kvm *kvm, void *pte, gpa_t gpa)
+{
+ int slot = memslot_id(kvm, gfn_to_memslot(kvm, gpa >> PAGE_SHIFT));
+ struct kvm_mmu_page *page_head = page_header(__pa(pte));
+
+ __set_bit(slot, &page_head->slot_bitmap);
+}
+
+hpa_t safe_gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ hpa_t hpa = gpa_to_hpa(vcpu, gpa);
+
+ return is_error_hpa(hpa) ? bad_page_address | (gpa & ~PAGE_MASK): hpa;
+}
+
+hpa_t gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ struct kvm_memory_slot *slot;
+ struct page *page;
+
+ ASSERT((gpa & HPA_ERR_MASK) == 0);
+ slot = gfn_to_memslot(vcpu->kvm, gpa >> PAGE_SHIFT);
+ if (!slot)
+ return gpa | HPA_ERR_MASK;
+ page = gfn_to_page(slot, gpa >> PAGE_SHIFT);
+ return ((hpa_t)page_to_pfn(page) << PAGE_SHIFT)
+ | (gpa & (PAGE_SIZE-1));
+}
+
+hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva)
+{
+ gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, gva);
+
+ if (gpa == UNMAPPED_GVA)
+ return UNMAPPED_GVA;
+ return gpa_to_hpa(vcpu, gpa);
+}
+
+
+static void release_pt_page_64(struct kvm_vcpu *vcpu, hpa_t page_hpa,
+ int level)
+{
+ ASSERT(vcpu);
+ ASSERT(VALID_PAGE(page_hpa));
+ ASSERT(level <= PT64_ROOT_LEVEL && level > 0);
+
+ if (level == 1)
+ memset(__va(page_hpa), 0, PAGE_SIZE);
+ else {
+ u64 *pos;
+ u64 *end;
+
+ for (pos = __va(page_hpa), end = pos + PT64_ENT_PER_PAGE;
+ pos != end; pos++) {
+ u64 current_ent = *pos;
+
+ *pos = 0;
+ if (is_present_pte(current_ent))
+ release_pt_page_64(vcpu,
+ current_ent &
+ PT64_BASE_ADDR_MASK,
+ level - 1);
+ }
+ }
+ kvm_mmu_free_page(vcpu, page_hpa);
+}
+
+static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
+{
+}
+
+static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, hpa_t p)
+{
+ int level = PT32E_ROOT_LEVEL;
+ hpa_t table_addr = vcpu->mmu.root_hpa;
+
+ for (; ; level--) {
+ u32 index = PT64_INDEX(v, level);
+ u64 *table;
+
+ ASSERT(VALID_PAGE(table_addr));
+ table = __va(table_addr);
+
+ if (level == 1) {
+ mark_page_dirty(vcpu->kvm, v >> PAGE_SHIFT);
+ page_header_update_slot(vcpu->kvm, table, v);
+ table[index] = p | PT_PRESENT_MASK | PT_WRITABLE_MASK |
+ PT_USER_MASK;
+ return 0;
+ }
+
+ if (table[index] == 0) {
+ hpa_t new_table = kvm_mmu_alloc_page(vcpu,
+ &table[index]);
+
+ if (!VALID_PAGE(new_table)) {
+ pgprintk("nonpaging_map: ENOMEM\n");
+ return -ENOMEM;
+ }
+
+ if (level == PT32E_ROOT_LEVEL)
+ table[index] = new_table | PT_PRESENT_MASK;
+ else
+ table[index] = new_table | PT_PRESENT_MASK |
+ PT_WRITABLE_MASK | PT_USER_MASK;
+ }
+ table_addr = table[index] & PT64_BASE_ADDR_MASK;
+ }
+}
+
+static void nonpaging_flush(struct kvm_vcpu *vcpu)
+{
+ hpa_t root = vcpu->mmu.root_hpa;
+
+ ++kvm_stat.tlb_flush;
+ pgprintk("nonpaging_flush\n");
+ ASSERT(VALID_PAGE(root));
+ release_pt_page_64(vcpu, root, vcpu->mmu.shadow_root_level);
+ root = kvm_mmu_alloc_page(vcpu, 0);
+ ASSERT(VALID_PAGE(root));
+ vcpu->mmu.root_hpa = root;
+ if (is_paging())
+ root |= (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK));
+ vmcs_writel(GUEST_CR3, root);
+}
+
+static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr)
+{
+ return vaddr;
+}
+
+static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
+ u32 error_code)
+{
+ int ret;
+ gpa_t addr = gva;
+
+ ASSERT(vcpu);
+ ASSERT(VALID_PAGE(vcpu->mmu.root_hpa));
+
+ for (;;) {
+ hpa_t paddr;
+
+ paddr = gpa_to_hpa(vcpu , addr & PT64_BASE_ADDR_MASK);
+
+ if (is_error_hpa(paddr))
+ return 1;
+
+ ret = nonpaging_map(vcpu, addr & PAGE_MASK, paddr);
+ if (ret) {
+ nonpaging_flush(vcpu);
+ continue;
+ }
+ break;
+ }
+ return ret;
+}
+
+static void nonpaging_inval_page(struct kvm_vcpu *vcpu, gva_t addr)
+{
+}
+
+static void nonpaging_free(struct kvm_vcpu *vcpu)
+{
+ hpa_t root;
+
+ ASSERT(vcpu);
+ root = vcpu->mmu.root_hpa;
+ if (VALID_PAGE(root))
+ release_pt_page_64(vcpu, root, vcpu->mmu.shadow_root_level);
+ vcpu->mmu.root_hpa = INVALID_PAGE;
+}
+
+static int nonpaging_init_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *context = &vcpu->mmu;
+
+ context->new_cr3 = nonpaging_new_cr3;
+ context->page_fault = nonpaging_page_fault;
+ context->inval_page = nonpaging_inval_page;
+ context->gva_to_gpa = nonpaging_gva_to_gpa;
+ context->free = nonpaging_free;
+ context->root_level = PT32E_ROOT_LEVEL;
+ context->shadow_root_level = PT32E_ROOT_LEVEL;
+ context->root_hpa = kvm_mmu_alloc_page(vcpu, 0);
+ ASSERT(VALID_PAGE(context->root_hpa));
+ vmcs_writel(GUEST_CR3, context->root_hpa);
+ return 0;
+}
+
+
+static void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_page *page, *npage;
+
+ list_for_each_entry_safe(page, npage, &vcpu->kvm->active_mmu_pages,
+ link) {
+ if (page->global)
+ continue;
+
+ if (!page->parent_pte)
+ continue;
+
+ *page->parent_pte = 0;
+ release_pt_page_64(vcpu, page->page_hpa, 1);
+ }
+ ++kvm_stat.tlb_flush;
+}
+
+static void paging_new_cr3(struct kvm_vcpu *vcpu)
+{
+ kvm_mmu_flush_tlb(vcpu);
+}
+
+static void mark_pagetable_nonglobal(void *shadow_pte)
+{
+ page_header(__pa(shadow_pte))->global = 0;
+}
+
+static inline void set_pte_common(struct kvm_vcpu *vcpu,
+ u64 *shadow_pte,
+ gpa_t gaddr,
+ int dirty,
+ u64 access_bits)
+{
+ hpa_t paddr;
+
+ *shadow_pte |= access_bits << PT_SHADOW_BITS_OFFSET;
+ if (!dirty)
+ access_bits &= ~PT_WRITABLE_MASK;
+
+ if (access_bits & PT_WRITABLE_MASK)
+ mark_page_dirty(vcpu->kvm, gaddr >> PAGE_SHIFT);
+
+ *shadow_pte |= access_bits;
+
+ paddr = gpa_to_hpa(vcpu, gaddr & PT64_BASE_ADDR_MASK);
+
+ if (!(*shadow_pte & PT_GLOBAL_MASK))
+ mark_pagetable_nonglobal(shadow_pte);
+
+ if (is_error_hpa(paddr)) {
+ *shadow_pte |= gaddr;
+ *shadow_pte |= PT_SHADOW_IO_MARK;
+ *shadow_pte &= ~PT_PRESENT_MASK;
+ } else {
+ *shadow_pte |= paddr;
+ page_header_update_slot(vcpu->kvm, shadow_pte, gaddr);
+ }
+}
+
+static void inject_page_fault(struct kvm_vcpu *vcpu,
+ u64 addr,
+ u32 err_code)
+{
+ u32 vect_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+
+ pgprintk("inject_page_fault: 0x%llx err 0x%x\n", addr, err_code);
+
+ ++kvm_stat.pf_guest;
+
+ if (is_page_fault(vect_info)) {
+ printk("inject_page_fault: double fault 0x%llx @ 0x%lx\n",
+ addr, vmcs_readl(GUEST_RIP));
+ vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
+ vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+ DF_VECTOR |
+ INTR_TYPE_EXCEPTION |
+ INTR_INFO_DELIEVER_CODE_MASK |
+ INTR_INFO_VALID_MASK);
+ return;
+ }
+ vcpu->cr2 = addr;
+ vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, err_code);
+ vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+ PF_VECTOR |
+ INTR_TYPE_EXCEPTION |
+ INTR_INFO_DELIEVER_CODE_MASK |
+ INTR_INFO_VALID_MASK);
+
+}
+
+static inline int fix_read_pf(u64 *shadow_ent)
+{
+ if ((*shadow_ent & PT_SHADOW_USER_MASK) &&
+ !(*shadow_ent & PT_USER_MASK)) {
+ /*
+ * If supervisor write protect is disabled, we shadow kernel
+ * pages as user pages so we can trap the write access.
+ */
+ *shadow_ent |= PT_USER_MASK;
+ *shadow_ent &= ~PT_WRITABLE_MASK;
+
+ return 1;
+
+ }
+ return 0;
+}
+
+static int may_access(u64 pte, int write, int user)
+{
+
+ if (user && !(pte & PT_USER_MASK))
+ return 0;
+ if (write && !(pte & PT_WRITABLE_MASK))
+ return 0;
+ return 1;
+}
+
+/*
+ * Remove a shadow pte.
+ */
+static void paging_inval_page(struct kvm_vcpu *vcpu, gva_t addr)
+{
+ hpa_t page_addr = vcpu->mmu.root_hpa;
+ int level = vcpu->mmu.shadow_root_level;
+
+ ++kvm_stat.invlpg;
+
+ for (; ; level--) {
+ u32 index = PT64_INDEX(addr, level);
+ u64 *table = __va(page_addr);
+
+ if (level == PT_PAGE_TABLE_LEVEL ) {
+ table[index] = 0;
+ return;
+ }
+
+ if (!is_present_pte(table[index]))
+ return;
+
+ page_addr = table[index] & PT64_BASE_ADDR_MASK;
+
+ if (level == PT_DIRECTORY_LEVEL &&
+ (table[index] & PT_SHADOW_PS_MARK)) {
+ table[index] = 0;
+ release_pt_page_64(vcpu, page_addr, PT_PAGE_TABLE_LEVEL);
+
+ //flush tlb
+ vmcs_writel(GUEST_CR3, vcpu->mmu.root_hpa |
+ (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK)));
+ return;
+ }
+ }
+}
+
+static void paging_free(struct kvm_vcpu *vcpu)
+{
+ nonpaging_free(vcpu);
+}
+
+#define PTTYPE 64
+#include "paging_tmpl.h"
+#undef PTTYPE
+
+#define PTTYPE 32
+#include "paging_tmpl.h"
+#undef PTTYPE
+
+static int paging64_init_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *context = &vcpu->mmu;
+
+ ASSERT(is_pae());
+ context->new_cr3 = paging_new_cr3;
+ context->page_fault = paging64_page_fault;
+ context->inval_page = paging_inval_page;
+ context->gva_to_gpa = paging64_gva_to_gpa;
+ context->free = paging_free;
+ context->root_level = PT64_ROOT_LEVEL;
+ context->shadow_root_level = PT64_ROOT_LEVEL;
+ context->root_hpa = kvm_mmu_alloc_page(vcpu, 0);
+ ASSERT(VALID_PAGE(context->root_hpa));
+ vmcs_writel(GUEST_CR3, context->root_hpa |
+ (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK)));
+ return 0;
+}
+
+static int paging32_init_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *context = &vcpu->mmu;
+
+ context->new_cr3 = paging_new_cr3;
+ context->page_fault = paging32_page_fault;
+ context->inval_page = paging_inval_page;
+ context->gva_to_gpa = paging32_gva_to_gpa;
+ context->free = paging_free;
+ context->root_level = PT32_ROOT_LEVEL;
+ context->shadow_root_level = PT32E_ROOT_LEVEL;
+ context->root_hpa = kvm_mmu_alloc_page(vcpu, 0);
+ ASSERT(VALID_PAGE(context->root_hpa));
+ vmcs_writel(GUEST_CR3, context->root_hpa |
+ (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK)));
+ return 0;
+}
+
+static int paging32E_init_context(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ if ((ret = paging64_init_context(vcpu)))
+ return ret;
+
+ vcpu->mmu.root_level = PT32E_ROOT_LEVEL;
+ vcpu->mmu.shadow_root_level = PT32E_ROOT_LEVEL;
+ return 0;
+}
+
+static int init_kvm_mmu(struct kvm_vcpu *vcpu)
+{
+ ASSERT(vcpu);
+ ASSERT(!VALID_PAGE(vcpu->mmu.root_hpa));
+
+ if (!is_paging())
+ return nonpaging_init_context(vcpu);
+ else if (is_long_mode())
+ return paging64_init_context(vcpu);
+ else if (is_pae())
+ return paging32E_init_context(vcpu);
+ else
+ return paging32_init_context(vcpu);
+}
+
+static void destroy_kvm_mmu(struct kvm_vcpu *vcpu)
+{
+ ASSERT(vcpu);
+ if (VALID_PAGE(vcpu->mmu.root_hpa)) {
+ vcpu->mmu.free(vcpu);
+ vcpu->mmu.root_hpa = INVALID_PAGE;
+ }
+}
+
+int kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
+{
+ destroy_kvm_mmu(vcpu);
+ return init_kvm_mmu(vcpu);
+}
+
+static void free_mmu_pages(struct kvm_vcpu *vcpu)
+{
+ while (!list_empty(&vcpu->free_pages)) {
+ struct kvm_mmu_page *page;
+
+ page = list_entry(vcpu->free_pages.next,
+ struct kvm_mmu_page, link);
+ list_del(&page->link);
+ __free_page(pfn_to_page(page->page_hpa >> PAGE_SHIFT));
+ page->page_hpa = INVALID_PAGE;
+ }
+}
+
+static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
+{
+ int i;
+
+ ASSERT(vcpu);
+
+ for (i = 0; i < KVM_NUM_MMU_PAGES; i++) {
+ struct page *page;
+ struct kvm_mmu_page *page_header = &vcpu->page_header_buf[i];
+
+ INIT_LIST_HEAD(&page_header->link);
+ if ((page = alloc_page(GFP_KVM_MMU)) == NULL)
+ goto error_1;
+ page->private = (unsigned long)page_header;
+ page_header->page_hpa = (hpa_t)page_to_pfn(page) << PAGE_SHIFT;
+ memset(__va(page_header->page_hpa), 0, PAGE_SIZE);
+ list_add(&page_header->link, &vcpu->free_pages);
+ }
+ return 0;
+
+error_1:
+ free_mmu_pages(vcpu);
+ return -ENOMEM;
+}
+
+int kvm_mmu_init(struct kvm_vcpu *vcpu)
+{
+ int r;
+
+ ASSERT(vcpu);
+ ASSERT(!VALID_PAGE(vcpu->mmu.root_hpa));
+ ASSERT(list_empty(&vcpu->free_pages));
+
+ if ((r = alloc_mmu_pages(vcpu)))
+ return r;
+
+ if ((r = init_kvm_mmu(vcpu))) {
+ free_mmu_pages(vcpu);
+ return r;
+ }
+ return 0;
+}
+
+void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
+{
+ ASSERT(vcpu);
+
+ destroy_kvm_mmu(vcpu);
+ free_mmu_pages(vcpu);
+}
+
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+{
+ struct kvm_mmu_page *page;
+
+ list_for_each_entry(page, &kvm->active_mmu_pages, link) {
+ int i;
+ u64 *pt;
+
+ if (!test_bit(slot, &page->slot_bitmap))
+ continue;
+
+ pt = __va(page->page_hpa);
+ for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
+ /* avoid RMW */
+ if (pt[i] & PT_WRITABLE_MASK)
+ pt[i] &= ~PT_WRITABLE_MASK;
+
+ }
+}
Index: linux-2.6/drivers/kvm/paging_tmpl.h
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/paging_tmpl.h
@@ -0,0 +1,378 @@
+/*
+ * We need the mmu code to access both 32-bit and 64-bit guest ptes,
+ * so the code in this file is compiled twice, once per pte size.
+ */
+
+#if PTTYPE == 64
+ #define pt_element_t u64
+ #define guest_walker guest_walker64
+ #define FNAME(name) paging##64_##name
+ #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+ #define PT_DIR_BASE_ADDR_MASK PT64_DIR_BASE_ADDR_MASK
+ #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+ #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
+ #define PT_LEVEL_MASK(level) PT64_LEVEL_MASK(level)
+ #define PT_PTE_COPY_MASK PT64_PTE_COPY_MASK
+ #define PT_NON_PTE_COPY_MASK PT64_NON_PTE_COPY_MASK
+#elif PTTYPE == 32
+ #define pt_element_t u32
+ #define guest_walker guest_walker32
+ #define FNAME(name) paging##32_##name
+ #define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
+ #define PT_DIR_BASE_ADDR_MASK PT32_DIR_BASE_ADDR_MASK
+ #define PT_INDEX(addr, level) PT32_INDEX(addr, level)
+ #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
+ #define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level)
+ #define PT_PTE_COPY_MASK PT32_PTE_COPY_MASK
+ #define PT_NON_PTE_COPY_MASK PT32_NON_PTE_COPY_MASK
+#else
+ #error Invalid PTTYPE value
+#endif
+
+/*
+ * The guest_walker structure emulates the behavior of the hardware page
+ * table walker.
+ */
+struct guest_walker {
+ int level;
+ pt_element_t *table;
+ pt_element_t inherited_ar;
+};
+
+static void FNAME(init_walker)(struct guest_walker *walker,
+ struct kvm_vcpu *vcpu)
+{
+ hpa_t hpa;
+ struct kvm_memory_slot *slot;
+
+ walker->level = vcpu->mmu.root_level;
+ slot = gfn_to_memslot(vcpu->kvm,
+ (vcpu->cr3 & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+ hpa = safe_gpa_to_hpa(vcpu, vcpu->cr3 & PT64_BASE_ADDR_MASK);
+ walker->table = kmap_atomic(pfn_to_page(hpa >> PAGE_SHIFT), KM_USER0);
+
+ ASSERT((!is_long_mode() && is_pae()) ||
+ (vcpu->cr3 & ~(PAGE_MASK | CR3_FLAGS_MASK)) == 0);
+
+ walker->table = (pt_element_t *)( (unsigned long)walker->table |
+ (unsigned long)(vcpu->cr3 & ~(PAGE_MASK | CR3_FLAGS_MASK)) );
+ walker->inherited_ar = PT_USER_MASK | PT_WRITABLE_MASK;
+}
+
+static void FNAME(release_walker)(struct guest_walker *walker)
+{
+ kunmap_atomic(walker->table, KM_USER0);
+}
+
+static void FNAME(set_pte)(struct kvm_vcpu *vcpu, u64 guest_pte,
+ u64 *shadow_pte, u64 access_bits)
+{
+ ASSERT(*shadow_pte == 0);
+ access_bits &= guest_pte;
+ *shadow_pte = (guest_pte & PT_PTE_COPY_MASK);
+ set_pte_common(vcpu, shadow_pte, guest_pte & PT_BASE_ADDR_MASK,
+ guest_pte & PT_DIRTY_MASK, access_bits);
+}
+
+static void FNAME(set_pde)(struct kvm_vcpu *vcpu, u64 guest_pde,
+ u64 *shadow_pte, u64 access_bits,
+ int index)
+{
+ gpa_t gaddr;
+
+ ASSERT(*shadow_pte == 0);
+ access_bits &= guest_pde;
+ gaddr = (guest_pde & PT_DIR_BASE_ADDR_MASK) + PAGE_SIZE * index;
+ if (PTTYPE == 32 && is_cpuid_PSE36())
+ gaddr |= (guest_pde & PT32_DIR_PSE36_MASK) <<
+ (32 - PT32_DIR_PSE36_SHIFT);
+ *shadow_pte = (guest_pde & PT_NON_PTE_COPY_MASK) |
+ ((guest_pde & PT_DIR_PAT_MASK) >>
+ (PT_DIR_PAT_SHIFT - PT_PAT_SHIFT));
+ set_pte_common(vcpu, shadow_pte, gaddr,
+ guest_pde & PT_DIRTY_MASK, access_bits);
+}
+
+/*
+ * Fetch a guest pte from a specific level in the paging hierarchy.
+ */
+static pt_element_t *FNAME(fetch_guest)(struct kvm_vcpu *vcpu,
+ struct guest_walker *walker,
+ int level,
+ gva_t addr)
+{
+
+ ASSERT(level > 0 && level <= walker->level);
+
+ for (;;) {
+ int index = PT_INDEX(addr, walker->level);
+ hpa_t paddr;
+
+ ASSERT(((unsigned long)walker->table & PAGE_MASK) ==
+ ((unsigned long)&walker->table[index] & PAGE_MASK));
+ if (level == walker->level ||
+ !is_present_pte(walker->table[index]) ||
+ (walker->level == PT_DIRECTORY_LEVEL &&
+ (walker->table[index] & PT_PAGE_SIZE_MASK) &&
+ (PTTYPE == 64 || is_pse())))
+ return &walker->table[index];
+ if (walker->level != 3 || is_long_mode())
+ walker->inherited_ar &= walker->table[index];
+ paddr = safe_gpa_to_hpa(vcpu, walker->table[index] & PT_BASE_ADDR_MASK);
+ kunmap_atomic(walker->table, KM_USER0);
+ walker->table = kmap_atomic(pfn_to_page(paddr >> PAGE_SHIFT),
+ KM_USER0);
+ --walker->level;
+ }
+}
+
+/*
+ * Fetch a shadow pte for a specific level in the paging hierarchy.
+ */
+static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
+ struct guest_walker *walker)
+{
+ hpa_t shadow_addr;
+ int level;
+ u64 *prev_shadow_ent = NULL;
+
+ shadow_addr = vcpu->mmu.root_hpa;
+ level = vcpu->mmu.shadow_root_level;
+
+ for (; ; level--) {
+ u32 index = SHADOW_PT_INDEX(addr, level);
+ u64 *shadow_ent = ((u64 *)__va(shadow_addr)) + index;
+ pt_element_t *guest_ent;
+
+ if (is_present_pte(*shadow_ent) || is_io_pte(*shadow_ent)) {
+ if (level == PT_PAGE_TABLE_LEVEL)
+ return shadow_ent;
+ shadow_addr = *shadow_ent & PT64_BASE_ADDR_MASK;
+ prev_shadow_ent = shadow_ent;
+ continue;
+ }
+
+ if (PTTYPE == 32 && level > PT32_ROOT_LEVEL) {
+ ASSERT(level == PT32E_ROOT_LEVEL);
+ guest_ent = FNAME(fetch_guest)(vcpu, walker,
+ PT32_ROOT_LEVEL, addr);
+ } else
+ guest_ent = FNAME(fetch_guest)(vcpu, walker,
+ level, addr);
+
+ if (!is_present_pte(*guest_ent))
+ return NULL;
+
+ /* Don't set accessed bit on PAE PDPTRs */
+ if (vcpu->mmu.root_level != 3 || walker->level != 3)
+ *guest_ent |= PT_ACCESSED_MASK;
+
+ if (level == PT_PAGE_TABLE_LEVEL) {
+
+ if (walker->level == PT_DIRECTORY_LEVEL) {
+ if (prev_shadow_ent)
+ *prev_shadow_ent |= PT_SHADOW_PS_MARK;
+ FNAME(set_pde)(vcpu, *guest_ent, shadow_ent,
+ walker->inherited_ar,
+ PT_INDEX(addr, PT_PAGE_TABLE_LEVEL));
+ } else {
+ ASSERT(walker->level == PT_PAGE_TABLE_LEVEL);
+ FNAME(set_pte)(vcpu, *guest_ent, shadow_ent, walker->inherited_ar);
+ }
+ return shadow_ent;
+ }
+
+ shadow_addr = kvm_mmu_alloc_page(vcpu, shadow_ent);
+ if (!VALID_PAGE(shadow_addr))
+ return ERR_PTR(-ENOMEM);
+ if (!is_long_mode() && level == 3)
+ *shadow_ent = shadow_addr |
+ (*guest_ent & (PT_PRESENT_MASK | PT_PWT_MASK | PT_PCD_MASK));
+ else {
+ *shadow_ent = shadow_addr |
+ (*guest_ent & PT_NON_PTE_COPY_MASK);
+ *shadow_ent |= (PT_WRITABLE_MASK | PT_USER_MASK);
+ }
+ prev_shadow_ent = shadow_ent;
+ }
+}
+
+/*
+ * The guest faulted for write. We need to
+ *
+ * - check write permissions
+ * - update the guest pte dirty bit
+ * - update our own dirty page tracking structures
+ */
+static int FNAME(fix_write_pf)(struct kvm_vcpu *vcpu,
+ u64 *shadow_ent,
+ struct guest_walker *walker,
+ gva_t addr,
+ int user)
+{
+ pt_element_t *guest_ent;
+ int writable_shadow;
+ gfn_t gfn;
+
+ if (is_writeble_pte(*shadow_ent))
+ return 0;
+
+ writable_shadow = *shadow_ent & PT_SHADOW_WRITABLE_MASK;
+ if (user) {
+ /*
+ * User mode access. Fail if it's a kernel page or a read-only
+ * page.
+ */
+ if (!(*shadow_ent & PT_SHADOW_USER_MASK) || !writable_shadow)
+ return 0;
+ ASSERT(*shadow_ent & PT_USER_MASK);
+ } else
+ /*
+ * Kernel mode access. Fail if it's a read-only page and
+ * supervisor write protection is enabled.
+ */
+ if (!writable_shadow) {
+ if (is_write_protection())
+ return 0;
+ *shadow_ent &= ~PT_USER_MASK;
+ }
+
+ guest_ent = FNAME(fetch_guest)(vcpu, walker, PT_PAGE_TABLE_LEVEL, addr);
+
+ if (!is_present_pte(*guest_ent)) {
+ *shadow_ent = 0;
+ return 0;
+ }
+
+ gfn = (*guest_ent & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
+ mark_page_dirty(vcpu->kvm, gfn);
+ *shadow_ent |= PT_WRITABLE_MASK;
+ *guest_ent |= PT_DIRTY_MASK;
+
+ return 1;
+}
+
+/*
+ * Page fault handler. There are several causes for a page fault:
+ * - there is no shadow pte for the guest pte
+ * - write access through a shadow pte marked read only so that we can set
+ * the dirty bit
+ * - write access to a shadow pte marked read only so we can update the page
+ * dirty bitmap, when userspace requests it
+ * - mmio access; in this case we will never install a present shadow pte
+ * - normal guest page fault due to the guest pte marked not present, not
+ * writable, or not executable
+ *
+ * Returns: 1 if we need to emulate the instruction, 0 otherwise
+ */
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
+ u32 error_code)
+{
+ int write_fault = error_code & PFERR_WRITE_MASK;
+ int pte_present = error_code & PFERR_PRESENT_MASK;
+ int user_fault = error_code & PFERR_USER_MASK;
+ struct guest_walker walker;
+ u64 *shadow_pte;
+ int fixed;
+
+ /*
+ * Look up the shadow pte for the faulting address.
+ */
+ for (;;) {
+ FNAME(init_walker)(&walker, vcpu);
+ shadow_pte = FNAME(fetch)(vcpu, addr, &walker);
+ if (IS_ERR(shadow_pte)) { /* must be -ENOMEM */
+ nonpaging_flush(vcpu);
+ FNAME(release_walker)(&walker);
+ continue;
+ }
+ break;
+ }
+
+ /*
+ * The page is not mapped by the guest. Let the guest handle it.
+ */
+ if (!shadow_pte) {
+ inject_page_fault(vcpu, addr, error_code);
+ FNAME(release_walker)(&walker);
+ return 0;
+ }
+
+ /*
+ * Update the shadow pte.
+ */
+ if (write_fault)
+ fixed = FNAME(fix_write_pf)(vcpu, shadow_pte, &walker, addr,
+ user_fault);
+ else
+ fixed = fix_read_pf(shadow_pte);
+
+ FNAME(release_walker)(&walker);
+
+ /*
+ * mmio: emulate if accessible, otherwise it's a guest fault.
+ */
+ if (is_io_pte(*shadow_pte)) {
+ if (may_access(*shadow_pte, write_fault, user_fault))
+ return 1;
+ pgprintk("%s: io work, no access\n", __FUNCTION__);
+ inject_page_fault(vcpu, addr,
+ error_code | PFERR_PRESENT_MASK);
+ return 0;
+ }
+
+ /*
+ * Protection fault on a present pte that we could not fix: guest page fault.
+ */
+ if (pte_present && !fixed) {
+ inject_page_fault(vcpu, addr, error_code);
+ return 0;
+ }
+
+ ++kvm_stat.pf_fixed;
+
+ return 0;
+}
+
+static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr)
+{
+ struct guest_walker walker;
+ pt_element_t guest_pte;
+ gpa_t gpa;
+
+ FNAME(init_walker)(&walker, vcpu);
+ guest_pte = *FNAME(fetch_guest)(vcpu, &walker, PT_PAGE_TABLE_LEVEL,
+ vaddr);
+ FNAME(release_walker)(&walker);
+
+ if (!is_present_pte(guest_pte))
+ return UNMAPPED_GVA;
+
+ if (walker.level == PT_DIRECTORY_LEVEL) {
+ ASSERT((guest_pte & PT_PAGE_SIZE_MASK));
+ ASSERT(PTTYPE == 64 || is_pse());
+
+ gpa = (guest_pte & PT_DIR_BASE_ADDR_MASK) | (vaddr &
+ (PT_LEVEL_MASK(PT_PAGE_TABLE_LEVEL) | ~PAGE_MASK));
+
+ if (PTTYPE == 32 && is_cpuid_PSE36())
+ gpa |= (guest_pte & PT32_DIR_PSE36_MASK) <<
+ (32 - PT32_DIR_PSE36_SHIFT);
+ } else {
+ gpa = (guest_pte & PT_BASE_ADDR_MASK);
+ gpa |= (vaddr & ~PAGE_MASK);
+ }
+
+ return gpa;
+}
+
+#undef pt_element_t
+#undef guest_walker
+#undef FNAME
+#undef PT_BASE_ADDR_MASK
+#undef PT_INDEX
+#undef SHADOW_PT_INDEX
+#undef PT_LEVEL_MASK
+#undef PT_PTE_COPY_MASK
+#undef PT_NON_PTE_COPY_MASK
+#undef PT_DIR_BASE_ADDR_MASK
Index: linux-2.6/drivers/kvm/kvm.h
===================================================================
--- linux-2.6.orig/drivers/kvm/kvm.h
+++ linux-2.6/drivers/kvm/kvm.h
@@ -369,4 +369,19 @@ static inline struct kvm_mmu_page *page_
return (struct kvm_mmu_page *)page->private;
}

+#ifdef __x86_64__
+
+/*
+ * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64. Therefore
+ * we need to allocate shadow page tables in the first 4GB of memory, which
+ * happens to fit the DMA32 zone.
+ */
+#define GFP_KVM_MMU (GFP_KERNEL | __GFP_DMA32)
+
+#else
+
+#define GFP_KVM_MMU GFP_KERNEL
+
+#endif
+
#endif

2006-10-23 13:32:39

by Avi Kivity

Subject: [PATCH 12/13] KVM: x86 emulator

Add an x86 instruction emulator for kvm.

We need an x86 emulator for the following reasons:

- mmio instructions are intercepted as page faults, with no information about
the operation to be performed other than the virtual address
- real-mode is emulated using the old-fashioned vm86 mode, with no special
intercepts for the privileged instructions, so we need to emulate mov cr,
lgdt, and lidt
- we plan to cache shadow page tables in the future, so that a guest context
switch will not throw away all the mappings we worked so hard to build. But
caching page tables means write-protecting the guest page tables to keep
them in sync, so any writes to the guest page tables need to be emulated
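
To illustrate the first point: a page fault on an mmio address supplies only the faulting address, so the direction and size of the access have to be recovered from the instruction bytes. The sketch below decodes just the four `mov` opcodes 0x88..0x8b, assuming 32-bit operand size and no prefixes. `struct toy_mmio_op` and `toy_decode_mov()` are hypothetical names used for the example; the emulator in this patch is table-driven and covers far more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type; not part of the patch. */
struct toy_mmio_op {
	int is_write;   /* 1: register -> memory, 0: memory -> register */
	int bytes;      /* operand size in bytes */
	int reg;        /* register number from the ModRM reg field */
};

/*
 * Decode 'mov' opcodes 0x88..0x8b with a memory operand, assuming 32-bit
 * operand size and no prefixes. Returns 0 on success, -1 otherwise.
 */
static int toy_decode_mov(const uint8_t *insn, size_t len,
			  struct toy_mmio_op *op)
{
	uint8_t opcode, modrm;

	if (len < 2)
		return -1;
	opcode = insn[0];
	modrm = insn[1];
	if (opcode < 0x88 || opcode > 0x8b)
		return -1;              /* not a mov this sketch knows */
	if ((modrm >> 6) == 3)
		return -1;              /* register-register, no memory */

	op->is_write = !(opcode & 2);   /* 0x88/0x89 store, 0x8a/0x8b load */
	op->bytes = (opcode & 1) ? 4 : 1;
	op->reg = (modrm >> 3) & 7;
	return 0;
}
```

For example, the bytes `89 08` (`mov %ecx,(%eax)`) decode as a 4-byte write from register 1, while `8a 03` (`mov (%ebx),%al`) decodes as a 1-byte read into register 0.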

The emulator was lifted from the Xen hypervisor.

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/x86_emulate.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/x86_emulate.c
@@ -0,0 +1,1370 @@
+/******************************************************************************
+ * x86_emulate.c
+ *
+ * Generic x86 (32-bit and 64-bit) instruction decoder and emulator.
+ *
+ * Copyright (c) 2005 Keir Fraser
+ *
+ * Linux coding style, mod r/m decoder, segment base fixes, real-mode
+ * privileged instructions:
+ *
+ * Copyright (C) 2006 Qumranet
+ *
+ * Avi Kivity <[email protected]>
+ * Yaniv Kamay <[email protected]>
+ *
+ * From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4
+ */
+
+#ifndef __KERNEL__
+#include <stdio.h>
+#include <stdint.h>
+#include <public/xen.h>
+#define DPRINTF(_f, _a ...) printf( _f , ## _a )
+#else
+#include "kvm.h"
+#define DPRINTF(x...) do {} while (0)
+#endif
+#include "x86_emulate.h"
+
+/*
+ * Opcode effective-address decode tables.
+ * Note that we only emulate instructions that have at least one memory
+ * operand (excluding implicit stack references). We assume that stack
+ * references and instruction fetches will never occur in special memory
+ * areas that require emulation. So, for example, 'mov <imm>,<reg>' need
+ * not be handled.
+ */
+
+/* Operand sizes: 8-bit operands or specified/overridden size. */
+#define ByteOp (1<<0) /* 8-bit operands. */
+/* Destination operand type. */
+#define ImplicitOps (1<<1) /* Implicit in opcode. No generic decode. */
+#define DstReg (2<<1) /* Register operand. */
+#define DstMem (3<<1) /* Memory operand. */
+#define DstMask (3<<1)
+/* Source operand type. */
+#define SrcNone (0<<3) /* No source operand. */
+#define SrcImplicit (0<<3) /* Source operand is implicit in the opcode. */
+#define SrcReg (1<<3) /* Register operand. */
+#define SrcMem (2<<3) /* Memory operand. */
+#define SrcMem16 (3<<3) /* Memory operand (16-bit). */
+#define SrcMem32 (4<<3) /* Memory operand (32-bit). */
+#define SrcImm (5<<3) /* Immediate operand. */
+#define SrcImmByte (6<<3) /* 8-bit sign-extended immediate operand. */
+#define SrcMask (7<<3)
+/* Generic ModRM decode. */
+#define ModRM (1<<6)
+/* Destination is only written; never read. */
+#define Mov (1<<7)
+
+static u8 opcode_table[256] = {
+ /* 0x00 - 0x07 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x08 - 0x0F */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x10 - 0x17 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x18 - 0x1F */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x20 - 0x27 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x28 - 0x2F */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x30 - 0x37 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x38 - 0x3F */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x40 - 0x4F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x50 - 0x5F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x60 - 0x6F */
+ 0, 0, 0, DstReg | SrcMem32 | ModRM | Mov /* movsxd (x86/64) */ ,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x70 - 0x7F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x80 - 0x87 */
+ ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImm | ModRM,
+ ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImmByte | ModRM,
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ /* 0x88 - 0x8F */
+ ByteOp | DstMem | SrcReg | ModRM | Mov, DstMem | SrcReg | ModRM | Mov,
+ ByteOp | DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ 0, 0, 0, DstMem | SrcNone | ModRM | Mov,
+ /* 0x90 - 0x9F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xA0 - 0xA7 */
+ ByteOp | DstReg | SrcMem | Mov, DstReg | SrcMem | Mov,
+ ByteOp | DstMem | SrcReg | Mov, DstMem | SrcReg | Mov,
+ ByteOp | ImplicitOps | Mov, ImplicitOps | Mov,
+ ByteOp | ImplicitOps, ImplicitOps,
+ /* 0xA8 - 0xAF */
+ 0, 0, ByteOp | ImplicitOps | Mov, ImplicitOps | Mov,
+ ByteOp | ImplicitOps | Mov, ImplicitOps | Mov,
+ ByteOp | ImplicitOps, ImplicitOps,
+ /* 0xB0 - 0xBF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xC0 - 0xC7 */
+ ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImmByte | ModRM, 0, 0,
+ 0, 0, ByteOp | DstMem | SrcImm | ModRM | Mov,
+ DstMem | SrcImm | ModRM | Mov,
+ /* 0xC8 - 0xCF */
+ 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xD0 - 0xD7 */
+ ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM,
+ ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM,
+ 0, 0, 0, 0,
+ /* 0xD8 - 0xDF */
+ 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xE0 - 0xEF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xF0 - 0xF7 */
+ 0, 0, 0, 0,
+ 0, 0, ByteOp | DstMem | SrcNone | ModRM, DstMem | SrcNone | ModRM,
+ /* 0xF8 - 0xFF */
+ 0, 0, 0, 0,
+ 0, 0, ByteOp | DstMem | SrcNone | ModRM, DstMem | SrcNone | ModRM
+};
+
+static u8 twobyte_table[256] = {
+ /* 0x00 - 0x0F */
+ 0, SrcMem | ModRM | DstReg | Mov, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0,
+ /* 0x10 - 0x1F */
+ 0, 0, 0, 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x20 - 0x2F */
+ ImplicitOps, 0, ImplicitOps, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x30 - 0x3F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x40 - 0x47 */
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ /* 0x48 - 0x4F */
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ /* 0x50 - 0x5F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x60 - 0x6F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x70 - 0x7F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x80 - 0x8F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x90 - 0x9F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xA0 - 0xA7 */
+ 0, 0, 0, DstMem | SrcReg | ModRM, 0, 0, 0, 0,
+ /* 0xA8 - 0xAF */
+ 0, 0, 0, DstMem | SrcReg | ModRM, 0, 0, 0, 0,
+ /* 0xB0 - 0xB7 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, 0,
+ DstMem | SrcReg | ModRM,
+ 0, 0, ByteOp | DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem16 | ModRM | Mov,
+ /* 0xB8 - 0xBF */
+ 0, 0, DstMem | SrcImmByte | ModRM, DstMem | SrcReg | ModRM,
+ 0, 0, ByteOp | DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem16 | ModRM | Mov,
+ /* 0xC0 - 0xCF */
+ 0, 0, 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xD0 - 0xDF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xE0 - 0xEF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xF0 - 0xFF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+};
+
+/* Type, address-of, and value of an instruction's operand. */
+struct operand {
+ enum { OP_REG, OP_MEM, OP_IMM } type;
+ unsigned int bytes;
+ unsigned long val, orig_val, *ptr;
+};
+
+/* EFLAGS bit definitions. */
+#define EFLG_OF (1<<11)
+#define EFLG_DF (1<<10)
+#define EFLG_SF (1<<7)
+#define EFLG_ZF (1<<6)
+#define EFLG_AF (1<<4)
+#define EFLG_PF (1<<2)
+#define EFLG_CF (1<<0)
+
+/*
+ * Instruction emulation:
+ * Most instructions are emulated directly via a fragment of inline assembly
+ * code. This allows us to save/restore EFLAGS and thus very easily pick up
+ * any modified flags.
+ */
+
+#if defined(__x86_64__)
+#define _LO32 "k" /* force 32-bit operand */
+#define _STK "%%rsp" /* stack pointer */
+#elif defined(__i386__)
+#define _LO32 "" /* force 32-bit operand */
+#define _STK "%%esp" /* stack pointer */
+#endif
+
+/*
+ * These EFLAGS bits are restored from saved value during emulation, and
+ * any changes are written back to the saved value after emulation.
+ */
+#define EFLAGS_MASK (EFLG_OF|EFLG_SF|EFLG_ZF|EFLG_AF|EFLG_PF|EFLG_CF)
+
+/* Before executing instruction: restore necessary bits in EFLAGS. */
+#define _PRE_EFLAGS(_sav, _msk, _tmp) \
+ /* EFLAGS = (_sav & _msk) | (EFLAGS & ~_msk); */ \
+ "push %"_sav"; " \
+ "movl %"_msk",%"_LO32 _tmp"; " \
+ "andl %"_LO32 _tmp",("_STK"); " \
+ "pushf; " \
+ "notl %"_LO32 _tmp"; " \
+ "andl %"_LO32 _tmp",("_STK"); " \
+ "pop %"_tmp"; " \
+ "orl %"_LO32 _tmp",("_STK"); " \
+ "popf; " \
+ /* _sav &= ~msk; */ \
+ "movl %"_msk",%"_LO32 _tmp"; " \
+ "notl %"_LO32 _tmp"; " \
+ "andl %"_LO32 _tmp",%"_sav"; "
+
+/* After executing instruction: write-back necessary bits in EFLAGS. */
+#define _POST_EFLAGS(_sav, _msk, _tmp) \
+ /* _sav |= EFLAGS & _msk; */ \
+ "pushf; " \
+ "pop %"_tmp"; " \
+ "andl %"_msk",%"_LO32 _tmp"; " \
+ "orl %"_LO32 _tmp",%"_sav"; "
+
+/* Raw emulation: instruction has two explicit operands. */
+#define __emulate_2op_nobyte(_op,_src,_dst,_eflags,_wx,_wy,_lx,_ly,_qx,_qy) \
+ do { \
+ unsigned long _tmp; \
+ \
+ switch ((_dst).bytes) { \
+ case 2: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","4","2") \
+ _op"w %"_wx"3,%1; " \
+ _POST_EFLAGS("0","4","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : _wy ((_src).val), "i" (EFLAGS_MASK) ); \
+ break; \
+ case 4: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","4","2") \
+ _op"l %"_lx"3,%1; " \
+ _POST_EFLAGS("0","4","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : _ly ((_src).val), "i" (EFLAGS_MASK) ); \
+ break; \
+ case 8: \
+ __emulate_2op_8byte(_op, _src, _dst, \
+ _eflags, _qx, _qy); \
+ break; \
+ } \
+ } while (0)
+
+#define __emulate_2op(_op,_src,_dst,_eflags,_bx,_by,_wx,_wy,_lx,_ly,_qx,_qy) \
+ do { \
+ unsigned long _tmp; \
+ switch ( (_dst).bytes ) \
+ { \
+ case 1: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","4","2") \
+ _op"b %"_bx"3,%1; " \
+ _POST_EFLAGS("0","4","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : _by ((_src).val), "i" (EFLAGS_MASK) ); \
+ break; \
+ default: \
+ __emulate_2op_nobyte(_op, _src, _dst, _eflags, \
+ _wx, _wy, _lx, _ly, _qx, _qy); \
+ break; \
+ } \
+ } while (0)
+
+/* Source operand is byte-sized and may be restricted to just %cl. */
+#define emulate_2op_SrcB(_op, _src, _dst, _eflags) \
+ __emulate_2op(_op, _src, _dst, _eflags, \
+ "b", "c", "b", "c", "b", "c", "b", "c")
+
+/* Source operand is byte, word, long or quad sized. */
+#define emulate_2op_SrcV(_op, _src, _dst, _eflags) \
+ __emulate_2op(_op, _src, _dst, _eflags, \
+ "b", "q", "w", "r", _LO32, "r", "", "r")
+
+/* Source operand is word, long or quad sized. */
+#define emulate_2op_SrcV_nobyte(_op, _src, _dst, _eflags) \
+ __emulate_2op_nobyte(_op, _src, _dst, _eflags, \
+ "w", "r", _LO32, "r", "", "r")
+
+/* Instruction has only one explicit operand (no source operand). */
+#define emulate_1op(_op, _dst, _eflags) \
+ do { \
+ unsigned long _tmp; \
+ \
+ switch ( (_dst).bytes ) \
+ { \
+ case 1: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","3","2") \
+ _op"b %1; " \
+ _POST_EFLAGS("0","3","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : "i" (EFLAGS_MASK) ); \
+ break; \
+ case 2: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","3","2") \
+ _op"w %1; " \
+ _POST_EFLAGS("0","3","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : "i" (EFLAGS_MASK) ); \
+ break; \
+ case 4: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","3","2") \
+ _op"l %1; " \
+ _POST_EFLAGS("0","3","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : "i" (EFLAGS_MASK) ); \
+ break; \
+ case 8: \
+ __emulate_1op_8byte(_op, _dst, _eflags); \
+ break; \
+ } \
+ } while (0)
+
+/* Emulate an instruction with quadword operands (x86/64 only). */
+#if defined(__x86_64__)
+#define __emulate_2op_8byte(_op, _src, _dst, _eflags, _qx, _qy) \
+ do { \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","4","2") \
+ _op"q %"_qx"3,%1; " \
+ _POST_EFLAGS("0","4","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), "=&r" (_tmp) \
+ : _qy ((_src).val), "i" (EFLAGS_MASK) ); \
+ } while (0)
+
+#define __emulate_1op_8byte(_op, _dst, _eflags) \
+ do { \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","3","2") \
+ _op"q %1; " \
+ _POST_EFLAGS("0","3","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), "=&r" (_tmp) \
+ : "i" (EFLAGS_MASK) ); \
+ } while (0)
+
+#elif defined(__i386__)
+#define __emulate_2op_8byte(_op, _src, _dst, _eflags, _qx, _qy)
+#define __emulate_1op_8byte(_op, _dst, _eflags)
+#endif /* __i386__ */
+
+/* Fetch next part of the instruction being emulated. */
+#define insn_fetch(_type, _size, _eip) \
+({ unsigned long _x; \
+ rc = ops->read_std((unsigned long)(_eip) + ctxt->cs_base, &_x, \
+ (_size), ctxt); \
+ if ( rc != 0 ) \
+ goto done; \
+ (_eip) += (_size); \
+ (_type)_x; \
+})
+
+/* Access/update address held in a register, based on addressing mode. */
+#define register_address(base, reg) \
+ ((base) + ((ad_bytes == sizeof(unsigned long)) ? (reg) : \
+ ((reg) & ((1UL << (ad_bytes << 3)) - 1))))
+
+#define register_address_increment(reg, inc) \
+ do { \
+ /* signed type ensures sign extension to long */ \
+ int _inc = (inc); \
+ if ( ad_bytes == sizeof(unsigned long) ) \
+ (reg) += _inc; \
+ else \
+ (reg) = ((reg) & ~((1UL << (ad_bytes << 3)) - 1)) | \
+ (((reg) + _inc) & ((1UL << (ad_bytes << 3)) - 1)); \
+ } while (0)
+
+void *decode_register(u8 modrm_reg, unsigned long *regs,
+ int highbyte_regs)
+{
+ void *p;
+
+ p = &regs[modrm_reg];
+ if (highbyte_regs && modrm_reg >= 4 && modrm_reg < 8)
+ p = (unsigned char *)&regs[modrm_reg & 3] + 1;
+ return p;
+}
+
+static int read_descriptor(struct x86_emulate_ctxt *ctxt,
+ struct x86_emulate_ops *ops,
+ void *ptr,
+ u16 *size, unsigned long *address, int op_bytes)
+{
+ int rc;
+
+ if (op_bytes == 2)
+ op_bytes = 3;
+ *address = 0;
+ rc = ops->read_std((unsigned long)ptr, (unsigned long *)size, 2, ctxt);
+ if (rc)
+ return rc;
+ rc = ops->read_std((unsigned long)ptr + 2, address, op_bytes, ctxt);
+ return rc;
+}
+
+int
+x86_emulate_memop(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
+{
+ u8 b, d, sib, twobyte = 0, rex_prefix = 0;
+ u8 modrm, modrm_mod = 0, modrm_reg = 0, modrm_rm = 0;
+ unsigned long *override_base = NULL;
+ unsigned int op_bytes, ad_bytes, lock_prefix = 0, rep_prefix = 0, i;
+ int rc = 0;
+ struct operand src, dst;
+ unsigned long cr2 = ctxt->cr2;
+ int mode = ctxt->mode;
+ unsigned long modrm_ea;
+ int use_modrm_ea, index_reg = 0, base_reg = 0, scale, rip_relative = 0;
+
+ /* Shadow copy of register state. Committed on successful emulation. */
+ unsigned long _regs[NR_VCPU_REGS];
+ unsigned long _eip = ctxt->vcpu->rip, _eflags = ctxt->eflags;
+ unsigned long modrm_val = 0;
+
+ memcpy(_regs, ctxt->vcpu->regs, sizeof _regs);
+
+ switch (mode) {
+ case X86EMUL_MODE_REAL:
+ case X86EMUL_MODE_PROT16:
+ op_bytes = ad_bytes = 2;
+ break;
+ case X86EMUL_MODE_PROT32:
+ op_bytes = ad_bytes = 4;
+ break;
+#ifdef __x86_64__
+ case X86EMUL_MODE_PROT64:
+ op_bytes = 4;
+ ad_bytes = 8;
+ break;
+#endif
+ default:
+ return -1;
+ }
+
+ /* Legacy prefixes. */
+ for (i = 0; i < 8; i++) {
+ switch (b = insn_fetch(u8, 1, _eip)) {
+ case 0x66: /* operand-size override */
+ op_bytes ^= 6; /* switch between 2/4 bytes */
+ break;
+ case 0x67: /* address-size override */
+ if (mode == X86EMUL_MODE_PROT64)
+ ad_bytes ^= 12; /* switch between 4/8 bytes */
+ else
+ ad_bytes ^= 6; /* switch between 2/4 bytes */
+ break;
+ case 0x2e: /* CS override */
+ override_base = &ctxt->cs_base;
+ break;
+ case 0x3e: /* DS override */
+ override_base = &ctxt->ds_base;
+ break;
+ case 0x26: /* ES override */
+ override_base = &ctxt->es_base;
+ break;
+ case 0x64: /* FS override */
+ override_base = &ctxt->fs_base;
+ break;
+ case 0x65: /* GS override */
+ override_base = &ctxt->gs_base;
+ break;
+ case 0x36: /* SS override */
+ override_base = &ctxt->ss_base;
+ break;
+ case 0xf0: /* LOCK */
+ lock_prefix = 1;
+ break;
+ case 0xf3: /* REP/REPE/REPZ */
+ rep_prefix = 1;
+ break;
+ case 0xf2: /* REPNE/REPNZ */
+ break;
+ default:
+ goto done_prefixes;
+ }
+ }
+
+done_prefixes:
+
+ /* REX prefix. */
+ if ((mode == X86EMUL_MODE_PROT64) && ((b & 0xf0) == 0x40)) {
+ rex_prefix = b;
+ if (b & 8)
+ op_bytes = 8; /* REX.W */
+ modrm_reg = (b & 4) << 1; /* REX.R */
+ index_reg = (b & 2) << 2; /* REX.X */
+ modrm_rm = base_reg = (b & 1) << 3; /* REX.B */
+ b = insn_fetch(u8, 1, _eip);
+ }
+
+ /* Opcode byte(s). */
+ d = opcode_table[b];
+ if (d == 0) {
+ /* Two-byte opcode? */
+ if (b == 0x0f) {
+ twobyte = 1;
+ b = insn_fetch(u8, 1, _eip);
+ d = twobyte_table[b];
+ }
+
+ /* Unrecognised? */
+ if (d == 0)
+ goto cannot_emulate;
+ }
+
+ /* ModRM and SIB bytes. */
+ if (d & ModRM) {
+ modrm = insn_fetch(u8, 1, _eip);
+ modrm_mod |= (modrm & 0xc0) >> 6;
+ modrm_reg |= (modrm & 0x38) >> 3;
+ modrm_rm |= (modrm & 0x07);
+ modrm_ea = 0;
+ use_modrm_ea = 1;
+
+ if (modrm_mod == 3) {
+ modrm_val = *(unsigned long *)
+ decode_register(modrm_rm, _regs, d & ByteOp);
+ goto modrm_done;
+ }
+
+ if (ad_bytes == 2) {
+ unsigned bx = _regs[VCPU_REGS_RBX];
+ unsigned bp = _regs[VCPU_REGS_RBP];
+ unsigned si = _regs[VCPU_REGS_RSI];
+ unsigned di = _regs[VCPU_REGS_RDI];
+
+ /* 16-bit ModR/M decode. */
+ switch (modrm_mod) {
+ case 0:
+ if (modrm_rm == 6)
+ modrm_ea += insn_fetch(u16, 2, _eip);
+ break;
+ case 1:
+ modrm_ea += insn_fetch(s8, 1, _eip);
+ break;
+ case 2:
+ modrm_ea += insn_fetch(u16, 2, _eip);
+ break;
+ }
+ switch (modrm_rm) {
+ case 0:
+ modrm_ea += bx + si;
+ break;
+ case 1:
+ modrm_ea += bx + di;
+ break;
+ case 2:
+ modrm_ea += bp + si;
+ break;
+ case 3:
+ modrm_ea += bp + di;
+ break;
+ case 4:
+ modrm_ea += si;
+ break;
+ case 5:
+ modrm_ea += di;
+ break;
+ case 6:
+ if (modrm_mod != 0)
+ modrm_ea += bp;
+ break;
+ case 7:
+ modrm_ea += bx;
+ break;
+ }
+ if (modrm_rm == 2 || modrm_rm == 3 ||
+ (modrm_rm == 6 && modrm_mod != 0))
+ if (!override_base)
+ override_base = &ctxt->ss_base;
+ modrm_ea = (u16)modrm_ea;
+ } else {
+ /* 32/64-bit ModR/M decode. */
+ switch (modrm_rm) {
+ case 4:
+ case 12:
+ sib = insn_fetch(u8, 1, _eip);
+ index_reg |= (sib >> 3) & 7;
+ base_reg |= sib & 7;
+ scale = sib >> 6;
+
+ switch (base_reg) {
+ case 5:
+ if (modrm_mod != 0)
+ modrm_ea += _regs[base_reg];
+ else
+ modrm_ea += insn_fetch(s32, 4, _eip);
+ break;
+ default:
+ modrm_ea += _regs[base_reg];
+ }
+ switch (index_reg) {
+ case 4:
+ break;
+ default:
+ modrm_ea += _regs[index_reg] << scale;
+
+ }
+ break;
+ case 5:
+ if (modrm_mod != 0)
+ modrm_ea += _regs[modrm_rm];
+ else if (mode == X86EMUL_MODE_PROT64)
+ rip_relative = 1;
+ break;
+ default:
+ modrm_ea += _regs[modrm_rm];
+ break;
+ }
+ switch (modrm_mod) {
+ case 0:
+ if (modrm_rm == 5)
+ modrm_ea += insn_fetch(s32, 4, _eip);
+ break;
+ case 1:
+ modrm_ea += insn_fetch(s8, 1, _eip);
+ break;
+ case 2:
+ modrm_ea += insn_fetch(s32, 4, _eip);
+ break;
+ }
+ }
+ if (!override_base)
+ override_base = &ctxt->ds_base;
+ if (mode == X86EMUL_MODE_PROT64 &&
+ override_base != &ctxt->fs_base &&
+ override_base != &ctxt->gs_base)
+ override_base = NULL;
+
+ if (override_base)
+ modrm_ea += *override_base;
+
+ if (rip_relative) {
+ modrm_ea += _eip;
+ switch (d & SrcMask) {
+ case SrcImmByte:
+ modrm_ea += 1;
+ break;
+ case SrcImm:
+ if (d & ByteOp)
+ modrm_ea += 1;
+ else
+ if (op_bytes == 8)
+ modrm_ea += 4;
+ else
+ modrm_ea += op_bytes;
+ }
+ }
+ if (ad_bytes != 8)
+ modrm_ea = (u32)modrm_ea;
+ cr2 = modrm_ea;
+ modrm_done:
+ ;
+ }
+
+ /* Decode and fetch the destination operand: register or memory. */
+ switch (d & DstMask) {
+ case ImplicitOps:
+ /* Special instructions do their own operand decoding. */
+ goto special_insn;
+ case DstReg:
+ dst.type = OP_REG;
+ if ((d & ByteOp)
+ && !(twobyte && (b == 0xb6 || b == 0xb7))) {
+ dst.ptr = decode_register(modrm_reg, _regs,
+ (rex_prefix == 0));
+ dst.val = *(u8 *) dst.ptr;
+ dst.bytes = 1;
+ } else {
+ dst.ptr = decode_register(modrm_reg, _regs, 0);
+ switch ((dst.bytes = op_bytes)) {
+ case 2:
+ dst.val = *(u16 *)dst.ptr;
+ break;
+ case 4:
+ dst.val = *(u32 *)dst.ptr;
+ break;
+ case 8:
+ dst.val = *(u64 *)dst.ptr;
+ break;
+ }
+ }
+ break;
+ case DstMem:
+ dst.type = OP_MEM;
+ dst.ptr = (unsigned long *)cr2;
+ dst.bytes = (d & ByteOp) ? 1 : op_bytes;
+ if (!(d & Mov) && /* optimisation - avoid slow emulated read */
+ ((rc = ops->read_emulated((unsigned long)dst.ptr,
+ &dst.val, dst.bytes, ctxt)) != 0))
+ goto done;
+ break;
+ }
+ dst.orig_val = dst.val;
+
+ /*
+ * Decode and fetch the source operand: register, memory
+ * or immediate.
+ */
+ switch (d & SrcMask) {
+ case SrcNone:
+ break;
+ case SrcReg:
+ src.type = OP_REG;
+ if (d & ByteOp) {
+ src.ptr = decode_register(modrm_reg, _regs,
+ (rex_prefix == 0));
+ src.val = src.orig_val = *(u8 *) src.ptr;
+ src.bytes = 1;
+ } else {
+ src.ptr = decode_register(modrm_reg, _regs, 0);
+ switch ((src.bytes = op_bytes)) {
+ case 2:
+ src.val = src.orig_val = *(u16 *) src.ptr;
+ break;
+ case 4:
+ src.val = src.orig_val = *(u32 *) src.ptr;
+ break;
+ case 8:
+ src.val = src.orig_val = *(u64 *) src.ptr;
+ break;
+ }
+ }
+ break;
+ case SrcMem16:
+ src.bytes = 2;
+ goto srcmem_common;
+ case SrcMem32:
+ src.bytes = 4;
+ goto srcmem_common;
+ case SrcMem:
+ src.bytes = (d & ByteOp) ? 1 : op_bytes;
+ srcmem_common:
+ src.type = OP_MEM;
+ src.ptr = (unsigned long *)cr2;
+ if ((rc = ops->read_emulated((unsigned long)src.ptr,
+ &src.val, src.bytes, ctxt)) != 0)
+ goto done;
+ src.orig_val = src.val;
+ break;
+ case SrcImm:
+ src.type = OP_IMM;
+ src.ptr = (unsigned long *)_eip;
+ src.bytes = (d & ByteOp) ? 1 : op_bytes;
+ if (src.bytes == 8)
+ src.bytes = 4;
+ /* NB. Immediates are sign-extended as necessary. */
+ switch (src.bytes) {
+ case 1:
+ src.val = insn_fetch(s8, 1, _eip);
+ break;
+ case 2:
+ src.val = insn_fetch(s16, 2, _eip);
+ break;
+ case 4:
+ src.val = insn_fetch(s32, 4, _eip);
+ break;
+ }
+ break;
+ case SrcImmByte:
+ src.type = OP_IMM;
+ src.ptr = (unsigned long *)_eip;
+ src.bytes = 1;
+ src.val = insn_fetch(s8, 1, _eip);
+ break;
+ }
+
+ if (twobyte)
+ goto twobyte_insn;
+
+ switch (b) {
+ case 0x00 ... 0x05:
+ add: /* add */
+ emulate_2op_SrcV("add", src, dst, _eflags);
+ break;
+ case 0x08 ... 0x0d:
+ or: /* or */
+ emulate_2op_SrcV("or", src, dst, _eflags);
+ break;
+ case 0x10 ... 0x15:
+ adc: /* adc */
+ emulate_2op_SrcV("adc", src, dst, _eflags);
+ break;
+ case 0x18 ... 0x1d:
+ sbb: /* sbb */
+ emulate_2op_SrcV("sbb", src, dst, _eflags);
+ break;
+ case 0x20 ... 0x25:
+ and: /* and */
+ emulate_2op_SrcV("and", src, dst, _eflags);
+ break;
+ case 0x28 ... 0x2d:
+ sub: /* sub */
+ emulate_2op_SrcV("sub", src, dst, _eflags);
+ break;
+ case 0x30 ... 0x35:
+ xor: /* xor */
+ emulate_2op_SrcV("xor", src, dst, _eflags);
+ break;
+ case 0x38 ... 0x3d:
+ cmp: /* cmp */
+ emulate_2op_SrcV("cmp", src, dst, _eflags);
+ break;
+ case 0x63: /* movsxd */
+ if (mode != X86EMUL_MODE_PROT64)
+ goto cannot_emulate;
+ dst.val = (s32) src.val;
+ break;
+ case 0x80 ... 0x83: /* Grp1 */
+ switch (modrm_reg) {
+ case 0:
+ goto add;
+ case 1:
+ goto or;
+ case 2:
+ goto adc;
+ case 3:
+ goto sbb;
+ case 4:
+ goto and;
+ case 5:
+ goto sub;
+ case 6:
+ goto xor;
+ case 7:
+ goto cmp;
+ }
+ break;
+ case 0x84 ... 0x85:
+ test: /* test */
+ emulate_2op_SrcV("test", src, dst, _eflags);
+ break;
+ case 0x86 ... 0x87: /* xchg */
+ /* Write back the register source. */
+ switch (dst.bytes) {
+ case 1:
+ *(u8 *) src.ptr = (u8) dst.val;
+ break;
+ case 2:
+ *(u16 *) src.ptr = (u16) dst.val;
+ break;
+ case 4:
+ *src.ptr = (u32) dst.val;
+ break; /* 64b reg: zero-extend */
+ case 8:
+ *src.ptr = dst.val;
+ break;
+ }
+ /*
+ * Write back the memory destination with implicit LOCK
+ * prefix.
+ */
+ dst.val = src.val;
+ lock_prefix = 1;
+ break;
+ case 0xa0 ... 0xa1: /* mov */
+ dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX];
+ dst.val = src.val;
+ _eip += ad_bytes; /* skip src displacement */
+ break;
+ case 0xa2 ... 0xa3: /* mov */
+ dst.val = (unsigned long)_regs[VCPU_REGS_RAX];
+ _eip += ad_bytes; /* skip dst displacement */
+ break;
+ case 0x88 ... 0x8b: /* mov */
+ case 0xc6 ... 0xc7: /* mov (sole member of Grp11) */
+ dst.val = src.val;
+ break;
+ case 0x8f: /* pop (sole member of Grp1a) */
+ /* 64-bit mode: POP always pops a 64-bit operand. */
+ if (mode == X86EMUL_MODE_PROT64)
+ dst.bytes = 8;
+ if ((rc = ops->read_std(register_address(ctxt->ss_base,
+ _regs[VCPU_REGS_RSP]),
+ &dst.val, dst.bytes, ctxt)) != 0)
+ goto done;
+ register_address_increment(_regs[VCPU_REGS_RSP], dst.bytes);
+ break;
+ case 0xc0 ... 0xc1:
+ grp2: /* Grp2 */
+ switch (modrm_reg) {
+ case 0: /* rol */
+ emulate_2op_SrcB("rol", src, dst, _eflags);
+ break;
+ case 1: /* ror */
+ emulate_2op_SrcB("ror", src, dst, _eflags);
+ break;
+ case 2: /* rcl */
+ emulate_2op_SrcB("rcl", src, dst, _eflags);
+ break;
+ case 3: /* rcr */
+ emulate_2op_SrcB("rcr", src, dst, _eflags);
+ break;
+ case 4: /* sal/shl */
+ case 6: /* sal/shl */
+ emulate_2op_SrcB("sal", src, dst, _eflags);
+ break;
+ case 5: /* shr */
+ emulate_2op_SrcB("shr", src, dst, _eflags);
+ break;
+ case 7: /* sar */
+ emulate_2op_SrcB("sar", src, dst, _eflags);
+ break;
+ }
+ break;
+ case 0xd0 ... 0xd1: /* Grp2 */
+ src.val = 1;
+ goto grp2;
+ case 0xd2 ... 0xd3: /* Grp2 */
+ src.val = _regs[VCPU_REGS_RCX];
+ goto grp2;
+ case 0xf6 ... 0xf7: /* Grp3 */
+ switch (modrm_reg) {
+ case 0 ... 1: /* test */
+ /*
+ * Special case in Grp3: test has an immediate
+ * source operand.
+ */
+ src.type = OP_IMM;
+ src.ptr = (unsigned long *)_eip;
+ src.bytes = (d & ByteOp) ? 1 : op_bytes;
+ if (src.bytes == 8)
+ src.bytes = 4;
+ switch (src.bytes) {
+ case 1:
+ src.val = insn_fetch(s8, 1, _eip);
+ break;
+ case 2:
+ src.val = insn_fetch(s16, 2, _eip);
+ break;
+ case 4:
+ src.val = insn_fetch(s32, 4, _eip);
+ break;
+ }
+ goto test;
+ case 2: /* not */
+ dst.val = ~dst.val;
+ break;
+ case 3: /* neg */
+ emulate_1op("neg", dst, _eflags);
+ break;
+ default:
+ goto cannot_emulate;
+ }
+ break;
+ case 0xfe ... 0xff: /* Grp4/Grp5 */
+ switch (modrm_reg) {
+ case 0: /* inc */
+ emulate_1op("inc", dst, _eflags);
+ break;
+ case 1: /* dec */
+ emulate_1op("dec", dst, _eflags);
+ break;
+ case 6: /* push */
+ /* 64-bit mode: PUSH always pushes a 64-bit operand. */
+ if (mode == X86EMUL_MODE_PROT64) {
+ dst.bytes = 8;
+ if ((rc = ops->read_std((unsigned long)dst.ptr,
+ &dst.val, 8,
+ ctxt)) != 0)
+ goto done;
+ }
+ register_address_increment(_regs[VCPU_REGS_RSP],
+ -dst.bytes);
+ if ((rc = ops->write_std(
+ register_address(ctxt->ss_base,
+ _regs[VCPU_REGS_RSP]),
+ dst.val, dst.bytes, ctxt)) != 0)
+ goto done;
+ dst.val = dst.orig_val; /* skanky: disable writeback */
+ break;
+ default:
+ goto cannot_emulate;
+ }
+ break;
+ }
+
+writeback:
+ if ((d & Mov) || (dst.orig_val != dst.val)) {
+ switch (dst.type) {
+ case OP_REG:
+ /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
+ switch (dst.bytes) {
+ case 1:
+ *(u8 *)dst.ptr = (u8)dst.val;
+ break;
+ case 2:
+ *(u16 *)dst.ptr = (u16)dst.val;
+ break;
+ case 4:
+ *dst.ptr = (u32)dst.val;
+ break; /* 64b: zero-ext */
+ case 8:
+ *dst.ptr = dst.val;
+ break;
+ }
+ break;
+ case OP_MEM:
+ if (lock_prefix)
+ rc = ops->cmpxchg_emulated((unsigned long)dst.
+ ptr, dst.orig_val,
+ dst.val, dst.bytes,
+ ctxt);
+ else
+ rc = ops->write_emulated((unsigned long)dst.ptr,
+ dst.val, dst.bytes,
+ ctxt);
+ if (rc != 0)
+ goto done;
+ default:
+ break;
+ }
+ }
+
+ /* Commit shadow register state. */
+ memcpy(ctxt->vcpu->regs, _regs, sizeof _regs);
+ ctxt->eflags = _eflags;
+ ctxt->vcpu->rip = _eip;
+
+done:
+ return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0;
+
+special_insn:
+ if (twobyte)
+ goto twobyte_special_insn;
+ if (rep_prefix) {
+ if (_regs[VCPU_REGS_RCX] == 0) {
+ ctxt->vcpu->rip = _eip;
+ goto done;
+ }
+ _regs[VCPU_REGS_RCX]--;
+ _eip = ctxt->vcpu->rip;
+ }
+ switch (b) {
+ case 0xa4 ... 0xa5: /* movs */
+ dst.type = OP_MEM;
+ dst.bytes = (d & ByteOp) ? 1 : op_bytes;
+ dst.ptr = (unsigned long *)register_address(ctxt->es_base,
+ _regs[VCPU_REGS_RDI]);
+ if ((rc = ops->read_emulated(register_address(
+ override_base ? *override_base : ctxt->ds_base,
+ _regs[VCPU_REGS_RSI]), &dst.val, dst.bytes, ctxt)) != 0)
+ goto done;
+ register_address_increment(_regs[VCPU_REGS_RSI],
+ (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes);
+ register_address_increment(_regs[VCPU_REGS_RDI],
+ (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes);
+ break;
+ case 0xa6 ... 0xa7: /* cmps */
+ DPRINTF("Urk! I don't handle CMPS.\n");
+ goto cannot_emulate;
+ case 0xaa ... 0xab: /* stos */
+ dst.type = OP_MEM;
+ dst.bytes = (d & ByteOp) ? 1 : op_bytes;
+ dst.ptr = (unsigned long *)cr2;
+ dst.val = _regs[VCPU_REGS_RAX];
+ register_address_increment(_regs[VCPU_REGS_RDI],
+ (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes);
+ break;
+ case 0xac ... 0xad: /* lods */
+ dst.type = OP_REG;
+ dst.bytes = (d & ByteOp) ? 1 : op_bytes;
+ dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX];
+ if ((rc = ops->read_emulated(cr2, &dst.val, dst.bytes, ctxt)) != 0)
+ goto done;
+ register_address_increment(_regs[VCPU_REGS_RSI],
+ (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes);
+ break;
+ case 0xae ... 0xaf: /* scas */
+ DPRINTF("Urk! I don't handle SCAS.\n");
+ goto cannot_emulate;
+ }
+ goto writeback;
+
+twobyte_insn:
+ switch (b) {
+ case 0x01: /* lgdt, lidt, lmsw */
+ switch (modrm_reg) {
+ u16 size;
+ unsigned long address;
+
+ case 2: /* lgdt */
+ rc = read_descriptor(ctxt, ops, src.ptr,
+ &size, &address, op_bytes);
+ if (rc)
+ goto done;
+ realmode_lgdt(ctxt->vcpu, size, address);
+ break;
+ case 3: /* lidt */
+ rc = read_descriptor(ctxt, ops, src.ptr,
+ &size, &address, op_bytes);
+ if (rc)
+ goto done;
+ realmode_lidt(ctxt->vcpu, size, address);
+ break;
+ case 6: /* lmsw */
+ realmode_lmsw(ctxt->vcpu, (u16)modrm_val, &_eflags);
+ break;
+ default:
+ goto cannot_emulate;
+ }
+ break;
+ case 0x40 ... 0x4f: /* cmov */
+ dst.val = dst.orig_val = src.val;
+ d &= ~Mov; /* default to no move */
+ /*
+ * First, assume we're decoding an even cmov opcode
+ * (lsb == 0).
+ */
+ switch ((b & 15) >> 1) {
+ case 0: /* cmovo */
+ d |= (_eflags & EFLG_OF) ? Mov : 0;
+ break;
+ case 1: /* cmovb/cmovc/cmovnae */
+ d |= (_eflags & EFLG_CF) ? Mov : 0;
+ break;
+ case 2: /* cmovz/cmove */
+ d |= (_eflags & EFLG_ZF) ? Mov : 0;
+ break;
+ case 3: /* cmovbe/cmovna */
+ d |= (_eflags & (EFLG_CF | EFLG_ZF)) ? Mov : 0;
+ break;
+ case 4: /* cmovs */
+ d |= (_eflags & EFLG_SF) ? Mov : 0;
+ break;
+ case 5: /* cmovp/cmovpe */
+ d |= (_eflags & EFLG_PF) ? Mov : 0;
+ break;
+ case 7: /* cmovle/cmovng */
+ d |= (_eflags & EFLG_ZF) ? Mov : 0;
+ /* fall through */
+ case 6: /* cmovl/cmovnge */
+ d |= (!(_eflags & EFLG_SF) !=
+ !(_eflags & EFLG_OF)) ? Mov : 0;
+ break;
+ }
+ /* Odd cmov opcodes (lsb == 1) have inverted sense. */
+ d ^= (b & 1) ? Mov : 0;
+ break;
+ case 0xb0 ... 0xb1: /* cmpxchg */
+ /*
+ * Save real source value, then compare EAX against
+ * destination.
+ */
+ src.orig_val = src.val;
+ src.val = _regs[VCPU_REGS_RAX];
+ emulate_2op_SrcV("cmp", src, dst, _eflags);
+ /* Always write back. The question is: where to? */
+ d |= Mov;
+ if (_eflags & EFLG_ZF) {
+ /* Success: write back to memory. */
+ dst.val = src.orig_val;
+ } else {
+ /* Failure: write the value we saw to EAX. */
+ dst.type = OP_REG;
+ dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX];
+ }
+ break;
+ case 0xa3:
+ bt: /* bt */
+ src.val &= (dst.bytes << 3) - 1; /* only subword offset */
+ emulate_2op_SrcV_nobyte("bt", src, dst, _eflags);
+ break;
+ case 0xb3:
+ btr: /* btr */
+ src.val &= (dst.bytes << 3) - 1; /* only subword offset */
+ emulate_2op_SrcV_nobyte("btr", src, dst, _eflags);
+ break;
+ case 0xab:
+ bts: /* bts */
+ src.val &= (dst.bytes << 3) - 1; /* only subword offset */
+ emulate_2op_SrcV_nobyte("bts", src, dst, _eflags);
+ break;
+ case 0xb6 ... 0xb7: /* movzx */
+ dst.bytes = op_bytes;
+ dst.val = (d & ByteOp) ? (u8) src.val : (u16) src.val;
+ break;
+ case 0xbb:
+ btc: /* btc */
+ src.val &= (dst.bytes << 3) - 1; /* only subword offset */
+ emulate_2op_SrcV_nobyte("btc", src, dst, _eflags);
+ break;
+ case 0xba: /* Grp8 */
+ switch (modrm_reg & 3) {
+ case 0:
+ goto bt;
+ case 1:
+ goto bts;
+ case 2:
+ goto btr;
+ case 3:
+ goto btc;
+ }
+ break;
+ case 0xbe ... 0xbf: /* movsx */
+ dst.bytes = op_bytes;
+ dst.val = (d & ByteOp) ? (s8) src.val : (s16) src.val;
+ break;
+ }
+ goto writeback;
+
+twobyte_special_insn:
+ /* Disable writeback. */
+ dst.orig_val = dst.val;
+ switch (b) {
+ case 0x0d: /* GrpP (prefetch) */
+ case 0x18: /* Grp16 (prefetch/nop) */
+ break;
+ case 0x20: /* mov cr, reg */
+ b = insn_fetch(u8, 1, _eip);
+ if ((b & 0xc0) != 0xc0)
+ goto cannot_emulate;
+ _regs[b & 7] = realmode_get_cr(ctxt->vcpu, (b >> 3) & 7);
+ break;
+ case 0x22: /* mov reg, cr */
+ b = insn_fetch(u8, 1, _eip);
+ if ((b & 0xc0) != 0xc0)
+ goto cannot_emulate;
+ realmode_set_cr(ctxt->vcpu, (b >> 3) & 7, _regs[b & 7] & -1u,
+ &_eflags);
+ break;
+ case 0xc7: /* Grp9 (cmpxchg8b) */
+#if defined(__i386__)
+ {
+ unsigned long old_lo, old_hi;
+ if (((rc = ops->read_emulated(cr2 + 0, &old_lo, 4,
+ ctxt)) != 0)
+ || ((rc = ops->read_emulated(cr2 + 4, &old_hi, 4,
+ ctxt)) != 0))
+ goto done;
+ if ((old_lo != _regs[VCPU_REGS_RAX])
+ || (old_hi != _regs[VCPU_REGS_RDX])) {
+ _regs[VCPU_REGS_RAX] = old_lo;
+ _regs[VCPU_REGS_RDX] = old_hi;
+ _eflags &= ~EFLG_ZF;
+ } else if (ops->cmpxchg8b_emulated == NULL) {
+ rc = X86EMUL_UNHANDLEABLE;
+ goto done;
+ } else {
+ if ((rc = ops->cmpxchg8b_emulated(cr2, old_lo,
+ old_hi,
+ _regs[VCPU_REGS_RBX],
+ _regs[VCPU_REGS_RCX],
+ ctxt)) != 0)
+ goto done;
+ _eflags |= EFLG_ZF;
+ }
+ break;
+ }
+#elif defined(__x86_64__)
+ {
+ unsigned long old, new;
+ if ((rc = ops->read_emulated(cr2, &old, 8, ctxt)) != 0)
+ goto done;
+ if (((u32) (old >> 0) != (u32) _regs[VCPU_REGS_RAX]) ||
+ ((u32) (old >> 32) != (u32) _regs[VCPU_REGS_RDX])) {
+ _regs[VCPU_REGS_RAX] = (u32) (old >> 0);
+ _regs[VCPU_REGS_RDX] = (u32) (old >> 32);
+ _eflags &= ~EFLG_ZF;
+ } else {
+ new = (_regs[VCPU_REGS_RCX] << 32) | (u32) _regs[VCPU_REGS_RBX];
+ if ((rc = ops->cmpxchg_emulated(cr2, old,
+ new, 8, ctxt)) != 0)
+ goto done;
+ _eflags |= EFLG_ZF;
+ }
+ break;
+ }
+#endif
+ }
+ goto writeback;
+
+cannot_emulate:
+ DPRINTF("Cannot emulate %02x\n", b);
+ return -1;
+}
+
+#ifdef __XEN__
+
+#include <asm/mm.h>
+#include <asm/uaccess.h>
+
+int
+x86_emulate_read_std(unsigned long addr,
+ unsigned long *val,
+ unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+ unsigned int rc;
+
+ *val = 0;
+
+ if ((rc = copy_from_user((void *)val, (void *)addr, bytes)) != 0) {
+ propagate_page_fault(addr + bytes - rc, 0); /* read fault */
+ return X86EMUL_PROPAGATE_FAULT;
+ }
+
+ return X86EMUL_CONTINUE;
+}
+
+int
+x86_emulate_write_std(unsigned long addr,
+ unsigned long val,
+ unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+ unsigned int rc;
+
+ if ((rc = copy_to_user((void *)addr, (void *)&val, bytes)) != 0) {
+ propagate_page_fault(addr + bytes - rc, PGERR_write_access);
+ return X86EMUL_PROPAGATE_FAULT;
+ }
+
+ return X86EMUL_CONTINUE;
+}
+
+#endif
Index: linux-2.6/drivers/kvm/x86_emulate.h
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/x86_emulate.h
@@ -0,0 +1,185 @@
+/******************************************************************************
+ * x86_emulate.h
+ *
+ * Generic x86 (32-bit and 64-bit) instruction decoder and emulator.
+ *
+ * Copyright (c) 2005 Keir Fraser
+ *
+ * From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4
+ */
+
+#ifndef __X86_EMULATE_H__
+#define __X86_EMULATE_H__
+
+struct x86_emulate_ctxt;
+
+/*
+ * x86_emulate_ops:
+ *
+ * These operations represent the instruction emulator's interface to memory.
+ * There are two categories of operation: those that act on ordinary memory
+ * regions (*_std), and those that act on memory regions known to require
+ * special treatment or emulation (*_emulated).
+ *
+ * The emulator assumes that an instruction accesses only one 'emulated memory'
+ * location, that this location is the given linear faulting address (cr2), and
+ * that this is one of the instruction's data operands. Instruction fetches and
+ * stack operations are assumed never to access emulated memory. The emulator
+ * automatically deduces which operand of a string-move operation is accessing
+ * emulated memory, and assumes that the other operand accesses normal memory.
+ *
+ * NOTES:
+ * 1. The emulator isn't very smart about emulated vs. standard memory.
+ * 'Emulated memory' access addresses should be checked for sanity.
+ * 'Normal memory' accesses may fault, and the caller must arrange to
+ * detect and handle reentrancy into the emulator via recursive faults.
+ * Accesses may be unaligned and may cross page boundaries.
+ * 2. If the access fails (cannot emulate, or a standard access faults) then
+ * it is up to the memop to propagate the fault to the guest VM via
+ * some out-of-band mechanism, unknown to the emulator. The memop signals
+ * failure by returning X86EMUL_PROPAGATE_FAULT to the emulator, which will
+ * then immediately bail.
+ * 3. Valid access sizes are 1, 2, 4 and 8 bytes. On x86/32 systems only
+ * cmpxchg8b_emulated need support 8-byte accesses.
+ * 4. The emulator cannot handle 64-bit mode emulation on an x86/32 system.
+ */
+/* Access completed successfully: continue emulation as normal. */
+#define X86EMUL_CONTINUE 0
+/* Access is unhandleable: bail from emulation and return error to caller. */
+#define X86EMUL_UNHANDLEABLE 1
+/* Terminate emulation but return success to the caller. */
+#define X86EMUL_PROPAGATE_FAULT 2 /* propagate a generated fault to guest */
+#define X86EMUL_RETRY_INSTR 2 /* retry the instruction for some reason */
+#define X86EMUL_CMPXCHG_FAILED 2 /* cmpxchg did not see expected value */
+struct x86_emulate_ops {
+ /*
+ * read_std: Read bytes of standard (non-emulated/special) memory.
+ * Used for instruction fetch, stack operations, and others.
+ * @addr: [IN ] Linear address from which to read.
+ * @val: [OUT] Value read from memory, zero-extended to 'u_long'.
+ * @bytes: [IN ] Number of bytes to read from memory.
+ */
+ int (*read_std)(unsigned long addr,
+ unsigned long *val,
+ unsigned int bytes, struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * write_std: Write bytes of standard (non-emulated/special) memory.
+ * Used for stack operations, and others.
+ * @addr: [IN ] Linear address to which to write.
+ * @val: [IN ] Value to write to memory (low-order bytes used as
+ * required).
+ * @bytes: [IN ] Number of bytes to write to memory.
+ */
+ int (*write_std)(unsigned long addr,
+ unsigned long val,
+ unsigned int bytes, struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * read_emulated: Read bytes from emulated/special memory area.
+ * @addr: [IN ] Linear address from which to read.
+ * @val: [OUT] Value read from memory, zero-extended to 'u_long'.
+ * @bytes: [IN ] Number of bytes to read from memory.
+ */
+ int (*read_emulated) (unsigned long addr,
+ unsigned long *val,
+ unsigned int bytes,
+ struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * write_emulated: Write bytes to emulated/special memory area.
+ * @addr: [IN ] Linear address to which to write.
+ * @val: [IN ] Value to write to memory (low-order bytes used as
+ * required).
+ * @bytes: [IN ] Number of bytes to write to memory.
+ */
+ int (*write_emulated) (unsigned long addr,
+ unsigned long val,
+ unsigned int bytes,
+ struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * cmpxchg_emulated: Emulate an atomic (LOCKed) CMPXCHG operation on an
+ * emulated/special memory area.
+ * @addr: [IN ] Linear address to access.
+ * @old: [IN ] Value expected to be current at @addr.
+ * @new: [IN ] Value to write to @addr.
+ * @bytes: [IN ] Number of bytes to access using CMPXCHG.
+ */
+ int (*cmpxchg_emulated) (unsigned long addr,
+ unsigned long old,
+ unsigned long new,
+ unsigned int bytes,
+ struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * cmpxchg8b_emulated: Emulate an atomic (LOCKed) CMPXCHG8B operation on an
+ * emulated/special memory area.
+ * @addr: [IN ] Linear address to access.
+ * @old: [IN ] Value expected to be current at @addr.
+ * @new: [IN ] Value to write to @addr.
+ * NOTES:
+ * 1. This function is only ever called when emulating a real CMPXCHG8B.
+ * 2. This function is *never* called on x86/64 systems.
+ * 3. Not defining this function (i.e., specifying NULL) is equivalent
+ * to defining a function that always returns X86EMUL_UNHANDLEABLE.
+ */
+ int (*cmpxchg8b_emulated) (unsigned long addr,
+ unsigned long old_lo,
+ unsigned long old_hi,
+ unsigned long new_lo,
+ unsigned long new_hi,
+ struct x86_emulate_ctxt * ctxt);
+};
+
+struct cpu_user_regs;
+
+struct x86_emulate_ctxt {
+ /* Register state before/after emulation. */
+ struct kvm_vcpu *vcpu;
+
+ /* Linear faulting address (if emulating a page-faulting instruction). */
+ unsigned long eflags;
+ unsigned long cr2;
+
+ /* Emulated execution mode, represented by an X86EMUL_MODE value. */
+ int mode;
+
+ unsigned long cs_base;
+ unsigned long ds_base;
+ unsigned long es_base;
+ unsigned long ss_base;
+ unsigned long gs_base;
+ unsigned long fs_base;
+};
+
+/* Execution mode, passed to the emulator. */
+#define X86EMUL_MODE_REAL 0 /* Real mode. */
+#define X86EMUL_MODE_PROT16 2 /* 16-bit protected mode. */
+#define X86EMUL_MODE_PROT32 4 /* 32-bit protected mode. */
+#define X86EMUL_MODE_PROT64 8 /* 64-bit (long) mode. */
+
+/* Host execution mode. */
+#if defined(__i386__)
+#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT32
+#elif defined(__x86_64__)
+#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT64
+#endif
+
+/*
+ * x86_emulate_memop: Emulate an instruction that faulted attempting to
+ * read/write a 'special' memory area.
+ * Returns -1 on failure, 0 on success.
+ */
+int x86_emulate_memop(struct x86_emulate_ctxt *ctxt,
+ struct x86_emulate_ops *ops);
+
+/*
+ * Given the 'reg' portion of a ModRM byte, and a register block, return a
+ * pointer into the block that addresses the relevant register.
+ * @highbyte_regs specifies whether to decode AH,CH,DH,BH.
+ */
+void *decode_register(u8 modrm_reg, unsigned long *regs,
+ int highbyte_regs);
+
+#endif /* __X86_EMULATE_H__ */
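
The *_emulated contract documented in the header above can be illustrated with a minimal sketch. This is not code from the patch: the backing array demo_mmio and all demo_* names are assumptions made for illustration, the struct x86_emulate_ctxt argument of the real callbacks is left out, and only the two X86EMUL_* values are copied from the header. The sketch shows reads being zero-extended into an unsigned long and writes using only the low-order bytes of the value.

```c
#include <stddef.h>

/* Return codes copied from x86_emulate.h. */
#define X86EMUL_CONTINUE     0
#define X86EMUL_UNHANDLEABLE 1

/* Hypothetical 16-byte "special" region standing in for device memory. */
static unsigned char demo_mmio[16];

/* Accept only the sizes the header lists, and only inside the region. */
int demo_access_ok(unsigned long addr, unsigned int bytes)
{
	if (bytes != 1 && bytes != 2 && bytes != 4 && bytes != 8)
		return 0;
	if (bytes > sizeof(unsigned long))
		return 0;
	return addr < sizeof(demo_mmio) && bytes <= sizeof(demo_mmio) - addr;
}

/* Read 'bytes' bytes, little-endian, zero-extended into *val. */
int demo_read_emulated(unsigned long addr, unsigned long *val,
		       unsigned int bytes)
{
	unsigned int i;

	if (!demo_access_ok(addr, bytes))
		return X86EMUL_UNHANDLEABLE;

	*val = 0;
	for (i = 0; i < bytes; i++)
		*val |= (unsigned long)demo_mmio[addr + i] << (8 * i);
	return X86EMUL_CONTINUE;
}

/* Write the low-order 'bytes' bytes of val, little-endian. */
int demo_write_emulated(unsigned long addr, unsigned long val,
			unsigned int bytes)
{
	unsigned int i;

	if (!demo_access_ok(addr, bytes))
		return X86EMUL_UNHANDLEABLE;

	for (i = 0; i < bytes; i++)
		demo_mmio[addr + i] = (unsigned char)(val >> (8 * i));
	return X86EMUL_CONTINUE;
}
```

Building the value byte by byte keeps the sketch independent of host endianness while still modelling x86's little-endian layout.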

2006-10-23 13:33:19

by Avi Kivity

Subject: [PATCH 7/13] KVM: vcpu creation and maintenance

Create a vcpu and initialize it for real-mode bootstrap.

Also provide accessors to get/set vcpu registers.

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/kvm_main.c
===================================================================
--- linux-2.6.orig/drivers/kvm/kvm_main.c
+++ linux-2.6/drivers/kvm/kvm_main.c
@@ -655,6 +655,299 @@ static void vmcs_write64(unsigned long f
#endif
}

+static int rmode_tss_base(struct kvm* kvm)
+{
+ gfn_t base_gfn = kvm->memslots[0].base_gfn + kvm->memslots[0].npages - 3;
+ return base_gfn << PAGE_SHIFT;
+}
+
+static int init_rmode_tss(struct kvm* kvm)
+{
+ struct page *p1, *p2, *p3;
+ gfn_t fn = rmode_tss_base(kvm) >> PAGE_SHIFT;
+ char *page;
+
+ p1 = _gfn_to_page(kvm, fn++);
+ p2 = _gfn_to_page(kvm, fn++);
+ p3 = _gfn_to_page(kvm, fn);
+
+ if (!p1 || !p2 || !p3) {
+ kvm_printf(kvm,"%s: gfn_to_page failed\n", __FUNCTION__);
+ return 0;
+ }
+
+ page = kmap_atomic(p1, KM_USER0);
+ memset(page, 0, PAGE_SIZE);
+ *(u16*)(page + 0x66) = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
+ kunmap_atomic(page, KM_USER0);
+
+ page = kmap_atomic(p2, KM_USER0);
+ memset(page, 0, PAGE_SIZE);
+ kunmap_atomic(page, KM_USER0);
+
+ page = kmap_atomic(p3, KM_USER0);
+ memset(page, 0, PAGE_SIZE);
+ *(page + RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1) = ~0;
+ kunmap_atomic(page, KM_USER0);
+
+ return 1;
+}
+
+static u32 get_rdx_init_val(void)
+{
+ u32 val;
+
+ asm ("movl $1, %%eax \n\t"
+ "movl %%eax, %0 \n\t" : "=g"(val) : : "eax");
+ return val;
+
+}
+
+static void fx_init(struct kvm_vcpu *vcpu)
+{
+ struct __attribute__ ((__packed__)) fx_image_s {
+ u16 control; //fcw
+ u16 status; //fsw
+ u16 tag; // ftw
+ u16 opcode; //fop
+ u64 ip; // fpu ip
+ u64 operand;// fpu dp
+ u32 mxcsr;
+ u32 mxcsr_mask;
+
+ } *fx_image;
+
+ fx_save(vcpu->host_fx_image);
+ fpu_init();
+ fx_save(vcpu->guest_fx_image);
+ fx_restore(vcpu->host_fx_image);
+
+ fx_image = (struct fx_image_s *)vcpu->guest_fx_image;
+ fx_image->mxcsr = 0x1f80;
+ memset(vcpu->guest_fx_image + sizeof(struct fx_image_s),
+ 0, FX_IMAGE_SIZE - sizeof(struct fx_image_s));
+}
+
+static void vmcs_write32_fixedbits(u32 msr, u32 vmcs_field, u32 val)
+{
+ u32 msr_high, msr_low;
+
+ rdmsr(msr, msr_low, msr_high);
+
+ val &= msr_high;
+ val |= msr_low;
+ vmcs_write32(vmcs_field, val);
+}
+
+/*
+ * Sets up the vmcs for emulated real mode.
+ */
+static int kvm_vcpu_setup(struct kvm_vcpu *vcpu)
+{
+ extern asmlinkage void kvm_vmx_return(void);
+ u32 host_sysenter_cs;
+ u32 junk;
+ unsigned long a;
+ struct descriptor_table dt;
+ int i;
+ int ret;
+ u64 tsc;
+
+
+ if (!init_rmode_tss(vcpu->kvm)) {
+ ret = 0;
+ goto out;
+ }
+
+ memset(vcpu->regs, 0, sizeof(vcpu->regs));
+ vcpu->regs[VCPU_REGS_RDX] = get_rdx_init_val();
+ vcpu->cr8 = 0;
+ vcpu->apic_base = 0xfee00000 |
+ /*for vcpu 0*/ MSR_IA32_APICBASE_BSP |
+ MSR_IA32_APICBASE_ENABLE;
+
+ fx_init(vcpu);
+
+#define SEG_SETUP(seg) do { \
+ vmcs_write16(GUEST_##seg##_SELECTOR, 0); \
+ vmcs_writel(GUEST_##seg##_BASE, 0); \
+ vmcs_write32(GUEST_##seg##_LIMIT, 0xffff); \
+ vmcs_write32(GUEST_##seg##_AR_BYTES, 0x93); \
+ } while (0)
+
+ /*
+ * GUEST_CS_BASE should really be 0xffff0000, but VT vm86 mode
+ * insists on having GUEST_CS_BASE == GUEST_CS_SELECTOR << 4. Sigh.
+ */
+ vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
+ vmcs_writel(GUEST_CS_BASE, 0x000f0000);
+ vmcs_write32(GUEST_CS_LIMIT, 0xffff);
+ vmcs_write32(GUEST_CS_AR_BYTES, 0x9b);
+
+ SEG_SETUP(DS);
+ SEG_SETUP(ES);
+ SEG_SETUP(FS);
+ SEG_SETUP(GS);
+ SEG_SETUP(SS);
+
+ vmcs_write16(GUEST_TR_SELECTOR, 0);
+ vmcs_writel(GUEST_TR_BASE, 0);
+ vmcs_write32(GUEST_TR_LIMIT, 0xffff);
+ vmcs_write32(GUEST_TR_AR_BYTES, 0x008b);
+
+ vmcs_write16(GUEST_LDTR_SELECTOR, 0);
+ vmcs_writel(GUEST_LDTR_BASE, 0);
+ vmcs_write32(GUEST_LDTR_LIMIT, 0xffff);
+ vmcs_write32(GUEST_LDTR_AR_BYTES, 0x00082);
+
+ vmcs_write32(GUEST_SYSENTER_CS, 0);
+ vmcs_writel(GUEST_SYSENTER_ESP, 0);
+ vmcs_writel(GUEST_SYSENTER_EIP, 0);
+
+ vmcs_writel(GUEST_RFLAGS, 0x02);
+ vmcs_writel(GUEST_RIP, 0xfff0);
+ vmcs_writel(GUEST_RSP, 0);
+
+ vmcs_writel(GUEST_CR3, 0);
+
+ //todo: dr0 = dr1 = dr2 = dr3 = 0; dr6 = 0xffff0ff0
+ vmcs_writel(GUEST_DR7, 0x400);
+
+ vmcs_writel(GUEST_GDTR_BASE, 0);
+ vmcs_write32(GUEST_GDTR_LIMIT, 0xffff);
+
+ vmcs_writel(GUEST_IDTR_BASE, 0);
+ vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
+
+ vmcs_write32(GUEST_ACTIVITY_STATE, 0);
+ vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
+ vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
+
+ /* I/O */
+ vmcs_write64(IO_BITMAP_A, 0);
+ vmcs_write64(IO_BITMAP_B, 0);
+
+ rdtscll(tsc);
+ vmcs_write64(TSC_OFFSET, -tsc);
+
+ vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
+
+ /* Special registers */
+ vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
+
+ /* Control */
+ vmcs_write32_fixedbits(MSR_IA32_VMX_PINBASED_CTLS_MSR,
+ PIN_BASED_VM_EXEC_CONTROL,
+ PIN_BASED_EXT_INTR_MASK /* 20.6.1 */
+ | PIN_BASED_NMI_EXITING /* 20.6.1 */
+ );
+ vmcs_write32_fixedbits(MSR_IA32_VMX_PROCBASED_CTLS_MSR,
+ CPU_BASED_VM_EXEC_CONTROL,
+ CPU_BASED_HLT_EXITING /* 20.6.2 */
+ | CPU_BASED_CR8_LOAD_EXITING /* 20.6.2 */
+ | CPU_BASED_CR8_STORE_EXITING /* 20.6.2 */
+ | CPU_BASED_UNCOND_IO_EXITING /* 20.6.2 */
+ | CPU_BASED_INVDPG_EXITING
+ | CPU_BASED_MOV_DR_EXITING
+ | CPU_BASED_USE_TSC_OFFSETING /* 21.3 */
+ );
+
+ vmcs_write32(EXCEPTION_BITMAP, 1 << PF_VECTOR);
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
+ vmcs_write32(CR3_TARGET_COUNT, 0); /* 22.2.1 */
+
+ vmcs_writel(HOST_CR0, read_cr0()); /* 22.2.3 */
+ vmcs_writel(HOST_CR4, read_cr4()); /* 22.2.3, 22.2.5 */
+ vmcs_writel(HOST_CR3, read_cr3()); /* 22.2.3 FIXME: shadow tables */
+
+ vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS); /* 22.2.4 */
+ vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
+ vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS); /* 22.2.4 */
+ vmcs_write16(HOST_FS_SELECTOR, read_fs()); /* 22.2.4 */
+ vmcs_write16(HOST_GS_SELECTOR, read_gs()); /* 22.2.4 */
+ vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
+#ifdef __x86_64__
+ rdmsrl(MSR_FS_BASE, a);
+ vmcs_writel(HOST_FS_BASE, a); /* 22.2.4 */
+ rdmsrl(MSR_GS_BASE, a);
+ vmcs_writel(HOST_GS_BASE, a); /* 22.2.4 */
+#else
+ vmcs_writel(HOST_FS_BASE, 0); /* 22.2.4 */
+ vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
+#endif
+
+ vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8); /* 22.2.4 */
+
+ get_idt(&dt);
+ vmcs_writel(HOST_IDTR_BASE, dt.base); /* 22.2.4 */
+
+
+ vmcs_writel(HOST_RIP, (unsigned long)kvm_vmx_return); /* 22.2.5 */
+
+ rdmsr(MSR_IA32_SYSENTER_CS, host_sysenter_cs, junk);
+ vmcs_write32(HOST_IA32_SYSENTER_CS, host_sysenter_cs);
+ rdmsrl(MSR_IA32_SYSENTER_ESP, a);
+ vmcs_writel(HOST_IA32_SYSENTER_ESP, a); /* 22.2.3 */
+ rdmsrl(MSR_IA32_SYSENTER_EIP, a);
+ vmcs_writel(HOST_IA32_SYSENTER_EIP, a); /* 22.2.3 */
+
+ vmcs_write32_fixedbits(MSR_IA32_VMX_EXIT_CTLS_MSR, VM_EXIT_CONTROLS,
+ (HOST_IS_64 << 9)); /* 22.2.1, 20.7.1 */
+ vmcs_write32(VM_EXIT_MSR_STORE_COUNT, NUM_AUTO_MSRS); /* 22.2.2 */
+ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, NUM_AUTO_MSRS); /* 22.2.2 */
+ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, NUM_AUTO_MSRS); /* 22.2.2 */
+
+ ret = -ENOMEM;
+ vcpu->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!vcpu->guest_msrs)
+ goto out;
+ vcpu->host_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!vcpu->host_msrs)
+ goto out_free_guest_msrs;
+
+ for (i = 0; i < NR_VMX_MSR; ++i) {
+ u32 index = vmx_msr_index[i];
+ u64 data;
+
+ rdmsrl(index, data);
+ vcpu->host_msrs[i].index = index;
+ vcpu->host_msrs[i].reserved = 0;
+ vcpu->host_msrs[i].data = data;
+ vcpu->guest_msrs[i] = vcpu->host_msrs[i];
+ }
+
+ vmcs_writel(VM_ENTRY_MSR_LOAD_ADDR, virt_to_phys(vcpu->guest_msrs));
+ vmcs_writel(VM_EXIT_MSR_STORE_ADDR, virt_to_phys(vcpu->guest_msrs));
+ vmcs_writel(VM_EXIT_MSR_LOAD_ADDR, virt_to_phys(vcpu->host_msrs));
+
+ /* 22.2.1, 20.8.1 */
+ vmcs_write32_fixedbits(MSR_IA32_VMX_ENTRY_CTLS_MSR,
+ VM_ENTRY_CONTROLS, 0);
+ vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
+
+ vmcs_writel(VIRTUAL_APIC_PAGE_ADDR, 0);
+ vmcs_writel(TPR_THRESHOLD, 0);
+
+ vmcs_writel(CR0_GUEST_HOST_MASK, KVM_GUEST_CR0_MASK);
+ vmcs_writel(CR4_GUEST_HOST_MASK, KVM_GUEST_CR4_MASK);
+
+ __set_cr0(vcpu, 0x60000010); // enter rmode
+ __set_cr4(vcpu, 0);
+#ifdef __x86_64__
+ __set_efer(vcpu, 0);
+#endif
+
+ ret = kvm_mmu_init(vcpu);
+
+ return ret;
+
+out_free_guest_msrs:
+ kfree(vcpu->guest_msrs);
+out:
+ return ret;
+}
+
/*
* Sync the rsp and rip registers into the vcpu structure. This allows
* registers to be accessed by indexing vcpu->regs.
@@ -676,6 +969,60 @@ static void vcpu_put_rsp_rip(struct kvm_
}

/*
+ * Creates some virtual cpus. Good luck creating more than one.
+ */
+static int kvm_dev_ioctl_create_vcpu(struct kvm *kvm, int n)
+{
+ int r;
+ struct kvm_vcpu *vcpu;
+ struct vmcs *vmcs;
+
+ r = -EINVAL;
+ if (n < 0 || n >= KVM_MAX_VCPUS)
+ goto out;
+
+ vcpu = &kvm->vcpus[n];
+
+ mutex_lock(&vcpu->mutex);
+
+ if (vcpu->vmcs) {
+ mutex_unlock(&vcpu->mutex);
+ return -EEXIST;
+ }
+
+ vcpu->host_fx_image = (char*)ALIGN((hva_t)vcpu->fx_buf,
+ FX_IMAGE_ALIGN);
+ vcpu->guest_fx_image = vcpu->host_fx_image + FX_IMAGE_SIZE;
+
+ vcpu->cpu = -1; /* First load will set up TR */
+ vcpu->kvm = kvm;
+ vmcs = alloc_vmcs();
+ if (!vmcs) {
+ mutex_unlock(&vcpu->mutex);
+ goto out_free_vcpus;
+ }
+ vmcs_clear(vmcs);
+ vcpu->vmcs = vmcs;
+ vcpu->launched = 0;
+
+ __vcpu_load(vcpu);
+
+ r = kvm_vcpu_setup(vcpu);
+
+ vcpu_put(vcpu);
+
+ if (r < 0)
+ goto out_free_vcpus;
+
+ return 0;
+
+out_free_vcpus:
+ kvm_free_vcpu(vcpu);
+out:
+ return r;
+}
+
+/*
* Allocate some memory and give it an address in the guest physical address
* space.
*
@@ -1284,6 +1631,254 @@ static void save_msrs(struct vmx_msr_ent
rdmsrl(e[msr_index].index, e[msr_index].data);
}

+static int kvm_dev_ioctl_get_regs(struct kvm *kvm, struct kvm_regs *regs)
+{
+ struct kvm_vcpu *vcpu;
+
+ if (regs->vcpu < 0 || regs->vcpu >= KVM_MAX_VCPUS)
+ return -EINVAL;
+
+ vcpu = vcpu_load(kvm, regs->vcpu);
+ if (!vcpu)
+ return -ENOENT;
+
+ regs->rax = vcpu->regs[VCPU_REGS_RAX];
+ regs->rbx = vcpu->regs[VCPU_REGS_RBX];
+ regs->rcx = vcpu->regs[VCPU_REGS_RCX];
+ regs->rdx = vcpu->regs[VCPU_REGS_RDX];
+ regs->rsi = vcpu->regs[VCPU_REGS_RSI];
+ regs->rdi = vcpu->regs[VCPU_REGS_RDI];
+ regs->rsp = vmcs_readl(GUEST_RSP);
+ regs->rbp = vcpu->regs[VCPU_REGS_RBP];
+#ifdef __x86_64__
+ regs->r8 = vcpu->regs[VCPU_REGS_R8];
+ regs->r9 = vcpu->regs[VCPU_REGS_R9];
+ regs->r10 = vcpu->regs[VCPU_REGS_R10];
+ regs->r11 = vcpu->regs[VCPU_REGS_R11];
+ regs->r12 = vcpu->regs[VCPU_REGS_R12];
+ regs->r13 = vcpu->regs[VCPU_REGS_R13];
+ regs->r14 = vcpu->regs[VCPU_REGS_R14];
+ regs->r15 = vcpu->regs[VCPU_REGS_R15];
+#endif
+
+ regs->rip = vmcs_readl(GUEST_RIP);
+ regs->rflags = vmcs_readl(GUEST_RFLAGS);
+
+ /*
+ * Don't leak debug flags in case they were set for guest debugging
+ */
+ if (vcpu->guest_debug.enabled && vcpu->guest_debug.singlestep)
+ regs->rflags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
+
+ vcpu_put(vcpu);
+
+ return 0;
+}
+
+static int kvm_dev_ioctl_set_regs(struct kvm *kvm, struct kvm_regs *regs)
+{
+ struct kvm_vcpu *vcpu;
+
+ if (regs->vcpu < 0 || regs->vcpu >= KVM_MAX_VCPUS)
+ return -EINVAL;
+
+ vcpu = vcpu_load(kvm, regs->vcpu);
+ if (!vcpu)
+ return -ENOENT;
+
+ vcpu->regs[VCPU_REGS_RAX] = regs->rax;
+ vcpu->regs[VCPU_REGS_RBX] = regs->rbx;
+ vcpu->regs[VCPU_REGS_RCX] = regs->rcx;
+ vcpu->regs[VCPU_REGS_RDX] = regs->rdx;
+ vcpu->regs[VCPU_REGS_RSI] = regs->rsi;
+ vcpu->regs[VCPU_REGS_RDI] = regs->rdi;
+ vmcs_writel(GUEST_RSP, regs->rsp);
+ vcpu->regs[VCPU_REGS_RBP] = regs->rbp;
+#ifdef __x86_64__
+ vcpu->regs[VCPU_REGS_R8] = regs->r8;
+ vcpu->regs[VCPU_REGS_R9] = regs->r9;
+ vcpu->regs[VCPU_REGS_R10] = regs->r10;
+ vcpu->regs[VCPU_REGS_R11] = regs->r11;
+ vcpu->regs[VCPU_REGS_R12] = regs->r12;
+ vcpu->regs[VCPU_REGS_R13] = regs->r13;
+ vcpu->regs[VCPU_REGS_R14] = regs->r14;
+ vcpu->regs[VCPU_REGS_R15] = regs->r15;
+#endif
+
+ vmcs_writel(GUEST_RIP, regs->rip);
+ vmcs_writel(GUEST_RFLAGS, regs->rflags);
+
+ vcpu_put(vcpu);
+
+ return 0;
+}
+
+static int kvm_dev_ioctl_get_sregs(struct kvm *kvm, struct kvm_sregs *sregs)
+{
+ struct kvm_vcpu *vcpu;
+
+ if (sregs->vcpu < 0 || sregs->vcpu >= KVM_MAX_VCPUS)
+ return -EINVAL;
+ vcpu = vcpu_load(kvm, sregs->vcpu);
+ if (!vcpu)
+ return -ENOENT;
+
+#define get_segment(var, seg) \
+ do { \
+ u32 ar; \
+ \
+ sregs->var.base = vmcs_readl(GUEST_##seg##_BASE); \
+ sregs->var.limit = vmcs_read32(GUEST_##seg##_LIMIT); \
+ sregs->var.selector = vmcs_read16(GUEST_##seg##_SELECTOR); \
+ ar = vmcs_read32(GUEST_##seg##_AR_BYTES); \
+ if (ar & AR_UNUSABLE_MASK) ar = 0; \
+ sregs->var.type = ar & 15; \
+ sregs->var.s = (ar >> 4) & 1; \
+ sregs->var.dpl = (ar >> 5) & 3; \
+ sregs->var.present = (ar >> 7) & 1; \
+ sregs->var.avl = (ar >> 12) & 1; \
+ sregs->var.l = (ar >> 13) & 1; \
+ sregs->var.db = (ar >> 14) & 1; \
+ sregs->var.g = (ar >> 15) & 1; \
+ sregs->var.unusable = (ar >> 16) & 1; \
+ } while (0)
+
+ get_segment(cs, CS);
+ get_segment(ds, DS);
+ get_segment(es, ES);
+ get_segment(fs, FS);
+ get_segment(gs, GS);
+ get_segment(ss, SS);
+
+ get_segment(tr, TR);
+ get_segment(ldt, LDTR);
+#undef get_segment
+
+#define get_dtable(var, table) \
+ sregs->var.limit = vmcs_read32(GUEST_##table##_LIMIT), \
+ sregs->var.base = vmcs_readl(GUEST_##table##_BASE)
+
+ get_dtable(idt, IDTR);
+ get_dtable(gdt, GDTR);
+#undef get_dtable
+
+ sregs->cr0 = guest_cr0();
+ sregs->cr2 = vcpu->cr2;
+ sregs->cr3 = vcpu->cr3;
+ sregs->cr4 = guest_cr4();
+ sregs->cr8 = vcpu->cr8;
+ sregs->efer = vcpu->shadow_efer;
+ sregs->apic_base = vcpu->apic_base;
+
+ sregs->pending_int = vcpu->irq_summary != 0;
+
+ vcpu_put(vcpu);
+
+ return 0;
+}
+
+static int kvm_dev_ioctl_set_sregs(struct kvm *kvm, struct kvm_sregs *sregs)
+{
+ struct kvm_vcpu *vcpu;
+ int mmu_reset_needed = 0;
+
+ if (sregs->vcpu < 0 || sregs->vcpu >= KVM_MAX_VCPUS)
+ return -EINVAL;
+ vcpu = vcpu_load(kvm, sregs->vcpu);
+ if (!vcpu)
+ return -ENOENT;
+
+#define set_segment(var, seg) \
+ do { \
+ u32 ar; \
+ \
+ vmcs_writel(GUEST_##seg##_BASE, sregs->var.base); \
+ vmcs_write32(GUEST_##seg##_LIMIT, sregs->var.limit); \
+ vmcs_write16(GUEST_##seg##_SELECTOR, sregs->var.selector); \
+ if (sregs->var.unusable) { \
+ ar = (1 << 16); \
+ } else { \
+ ar = (sregs->var.type & 15); \
+ ar |= (sregs->var.s & 1) << 4; \
+ ar |= (sregs->var.dpl & 3) << 5; \
+ ar |= (sregs->var.present & 1) << 7; \
+ ar |= (sregs->var.avl & 1) << 12; \
+ ar |= (sregs->var.l & 1) << 13; \
+ ar |= (sregs->var.db & 1) << 14; \
+ ar |= (sregs->var.g & 1) << 15; \
+ } \
+ vmcs_write32(GUEST_##seg##_AR_BYTES, ar); \
+ } while (0)
+
+ set_segment(cs, CS);
+ set_segment(ds, DS);
+ set_segment(es, ES);
+ set_segment(fs, FS);
+ set_segment(gs, GS);
+ set_segment(ss, SS);
+
+ set_segment(tr, TR);
+
+ set_segment(ldt, LDTR);
+#undef set_segment
+
+#define set_dtable(var, table) \
+ vmcs_write32(GUEST_##table##_LIMIT, sregs->var.limit), \
+ vmcs_writel(GUEST_##table##_BASE, sregs->var.base)
+
+ set_dtable(idt, IDTR);
+ set_dtable(gdt, GDTR);
+#undef set_dtable
+
+ vcpu->cr2 = sregs->cr2;
+ mmu_reset_needed |= vcpu->cr3 != sregs->cr3;
+ vcpu->cr3 = sregs->cr3;
+
+ vcpu->cr8 = sregs->cr8;
+
+ mmu_reset_needed |= vcpu->shadow_efer != sregs->efer;
+#ifdef __x86_64__
+ __set_efer(vcpu, sregs->efer);
+#endif
+ vcpu->apic_base = sregs->apic_base;
+
+ mmu_reset_needed |= guest_cr0() != sregs->cr0;
+ __set_cr0(vcpu, sregs->cr0);
+
+ mmu_reset_needed |= guest_cr4() != sregs->cr4;
+ __set_cr4(vcpu, sregs->cr4);
+
+ if (mmu_reset_needed)
+ kvm_mmu_reset_context(vcpu);
+ vcpu_put(vcpu);
+
+ return 0;
+}
+
+/*
+ * Translate a guest virtual address to a guest physical address.
+ */
+static int kvm_dev_ioctl_translate(struct kvm *kvm, struct kvm_translation *tr)
+{
+ unsigned long vaddr = tr->linear_address;
+ struct kvm_vcpu *vcpu;
+ gpa_t gpa;
+
+ vcpu = vcpu_load(kvm, tr->vcpu);
+ if (!vcpu)
+ return -ENOENT;
+ spin_lock(&kvm->lock);
+ gpa = vcpu->mmu.gva_to_gpa(vcpu, vaddr);
+ tr->physical_address = gpa;
+ tr->valid = gpa != UNMAPPED_GVA;
+ tr->writeable = 1;
+ tr->usermode = 0;
+ spin_unlock(&kvm->lock);
+ vcpu_put(vcpu);
+
+ return 0;
+}
+
static long kvm_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -1291,6 +1886,81 @@ static long kvm_dev_ioctl(struct file *f
int r = -EINVAL;

switch (ioctl) {
+ case KVM_CREATE_VCPU: {
+ r = kvm_dev_ioctl_create_vcpu(kvm, arg);
+ if (r)
+ goto out;
+ break;
+ }
+ case KVM_GET_REGS: {
+ struct kvm_regs kvm_regs;
+
+ r = -EFAULT;
+ if (copy_from_user(&kvm_regs, (void *)arg, sizeof kvm_regs))
+ goto out;
+ r = kvm_dev_ioctl_get_regs(kvm, &kvm_regs);
+ if (r)
+ goto out;
+ r = -EFAULT;
+ if (copy_to_user((void *)arg, &kvm_regs, sizeof kvm_regs))
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_SET_REGS: {
+ struct kvm_regs kvm_regs;
+
+ r = -EFAULT;
+ if (copy_from_user(&kvm_regs, (void *)arg, sizeof kvm_regs))
+ goto out;
+ r = kvm_dev_ioctl_set_regs(kvm, &kvm_regs);
+ if (r)
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_GET_SREGS: {
+ struct kvm_sregs kvm_sregs;
+
+ r = -EFAULT;
+ if (copy_from_user(&kvm_sregs, (void *)arg, sizeof kvm_sregs))
+ goto out;
+ r = kvm_dev_ioctl_get_sregs(kvm, &kvm_sregs);
+ if (r)
+ goto out;
+ r = -EFAULT;
+ if (copy_to_user((void *)arg, &kvm_sregs, sizeof kvm_sregs))
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_SET_SREGS: {
+ struct kvm_sregs kvm_sregs;
+
+ r = -EFAULT;
+ if (copy_from_user(&kvm_sregs, (void *)arg, sizeof kvm_sregs))
+ goto out;
+ r = kvm_dev_ioctl_set_sregs(kvm, &kvm_sregs);
+ if (r)
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_TRANSLATE: {
+ struct kvm_translation tr;
+
+ r = -EFAULT;
+ if (copy_from_user(&tr, (void *)arg, sizeof tr))
+ goto out;
+ r = kvm_dev_ioctl_translate(kvm, &tr);
+ if (r)
+ goto out;
+ r = -EFAULT;
+ if (copy_to_user((void *)arg, &tr, sizeof tr))
+ goto out;
+ r = 0;
+ break;
+ }
case KVM_SET_MEMORY_REGION: {
struct kvm_memory_region kvm_mem;
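
The access-rights packing done by the set_segment and get_segment macros above can be checked in isolation. The following is a minimal sketch, assuming a stand-in struct demo_segment in place of the segment fields of struct kvm_sregs (the demo_* names are hypothetical, not part of the patch). It reproduces the bit layout used by the macros, so the 0x9b and 0x93 constants written in kvm_vcpu_setup can be decoded by hand.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-segment fields of struct kvm_sregs. */
struct demo_segment {
	uint8_t type, s, dpl, present, avl, l, db, g, unusable;
};

/* Pack the fields into a VMCS access-rights value, as set_segment does. */
uint32_t demo_pack_ar(const struct demo_segment *seg)
{
	uint32_t ar;

	if (seg->unusable)
		return 1u << 16;	/* only the "unusable" bit is set */

	ar  = seg->type & 15;
	ar |= (uint32_t)(seg->s & 1) << 4;
	ar |= (uint32_t)(seg->dpl & 3) << 5;
	ar |= (uint32_t)(seg->present & 1) << 7;
	ar |= (uint32_t)(seg->avl & 1) << 12;
	ar |= (uint32_t)(seg->l & 1) << 13;
	ar |= (uint32_t)(seg->db & 1) << 14;
	ar |= (uint32_t)(seg->g & 1) << 15;
	return ar;
}

/* Split an access-rights value back into its fields. */
struct demo_segment demo_unpack_ar(uint32_t ar)
{
	struct demo_segment seg;

	seg.type     = ar & 15;
	seg.s        = (ar >> 4) & 1;
	seg.dpl      = (ar >> 5) & 3;
	seg.present  = (ar >> 7) & 1;
	seg.avl      = (ar >> 12) & 1;
	seg.l        = (ar >> 13) & 1;
	seg.db       = (ar >> 14) & 1;
	seg.g        = (ar >> 15) & 1;
	seg.unusable = (ar >> 16) & 1;
	return seg;
}
```

With this layout, 0x9b decodes as a present code segment of type 0xb, and 0x93 as a present data segment of type 3, both with DPL 0.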

2006-10-23 13:33:11

by Avi Kivity

Subject: [PATCH 10/13] KVM: less common exit handlers

Add exit handlers for msrs, debug registers, and cpuid.

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/kvm_main.c
===================================================================
--- linux-2.6.orig/drivers/kvm/kvm_main.c
+++ linux-2.6/drivers/kvm/kvm_main.c
@@ -2122,6 +2122,113 @@ static int handle_cr(struct kvm_vcpu *vc
return 0;
}

+static int handle_dr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ u64 exit_qualification;
+ unsigned long val;
+ int dr, reg;
+
+ /*
+ * FIXME: this code assumes the host is debugging the guest.
+ * need to deal with guest debugging itself too.
+ */
+ exit_qualification = vmcs_read64(EXIT_QUALIFICATION);
+ dr = exit_qualification & 7;
+ reg = (exit_qualification >> 8) & 15;
+ vcpu_load_rsp_rip(vcpu);
+ if (exit_qualification & 16) {
+ /* mov from dr */
+ switch (dr) {
+ case 6:
+ val = 0xffff0ff0;
+ break;
+ case 7:
+ val = 0x400;
+ break;
+ default:
+ val = 0;
+ }
+ vcpu->regs[reg] = val;
+ } else {
+ /* mov to dr */
+ }
+ vcpu_put_rsp_rip(vcpu);
+ skip_emulated_instruction(vcpu);
+ return 1;
+}
+
+static int handle_cpuid(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ kvm_run->exit_reason = KVM_EXIT_CPUID;
+ return 0;
+}
+
+static int handle_rdmsr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ u32 ecx = vcpu->regs[VCPU_REGS_RCX];
+ struct vmx_msr_entry *msr = find_msr_entry(vcpu, ecx);
+ u64 data;
+
+#ifdef KVM_DEBUG
+ if (guest_cpl() != 0) {
+ vcpu_printf(vcpu, "%s: not supervisor\n", __FUNCTION__);
+ inject_gp(vcpu);
+ return 1;
+ }
+#endif
+
+ switch (ecx) {
+#ifdef __x86_64__
+ case MSR_FS_BASE:
+ data = vmcs_readl(GUEST_FS_BASE);
+ break;
+ case MSR_GS_BASE:
+ data = vmcs_readl(GUEST_GS_BASE);
+ break;
+#endif
+ case MSR_IA32_SYSENTER_CS:
+ data = vmcs_read32(GUEST_SYSENTER_CS);
+ break;
+ case MSR_IA32_SYSENTER_EIP:
+ data = vmcs_readl(GUEST_SYSENTER_EIP);
+ break;
+ case MSR_IA32_SYSENTER_ESP:
+ data = vmcs_readl(GUEST_SYSENTER_ESP);
+ break;
+ case MSR_IA32_MC0_CTL:
+ case MSR_IA32_MCG_STATUS:
+ case MSR_IA32_MCG_CAP:
+ case MSR_IA32_MC0_MISC:
+ case MSR_IA32_MC0_MISC+4:
+ case MSR_IA32_MC0_MISC+8:
+ case MSR_IA32_MC0_MISC+12:
+ case MSR_IA32_MC0_MISC+16:
+ case MSR_IA32_UCODE_REV:
+ /* MTRR registers */
+ case 0xfe:
+ case 0x200 ... 0x2ff:
+ data = 0;
+ break;
+ case MSR_IA32_APICBASE:
+ data = vcpu->apic_base;
+ break;
+ default:
+ if (msr) {
+ data = msr->data;
+ break;
+ }
+ printk(KERN_ERR "kvm: unhandled rdmsr: %x\n", ecx);
+ inject_gp(vcpu);
+ return 1;
+ }
+
+ /* FIXME: handling of bits 32:63 of rax, rdx */
+ vcpu->regs[VCPU_REGS_RAX] = data & -1u;
+ vcpu->regs[VCPU_REGS_RDX] = (data >> 32) & -1u;
+ skip_emulated_instruction(vcpu);
+ return 1;
+}
+
#ifdef __x86_64__
#define EFER_RESERVED_BITS 0xfffffffffffff2fe

@@ -2175,6 +2282,78 @@ static void __set_efer(struct kvm_vcpu *
}
#endif

+#define MSR_IA32_TIME_STAMP_COUNTER 0x10
+
+static int handle_wrmsr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ u32 ecx = vcpu->regs[VCPU_REGS_RCX];
+ struct vmx_msr_entry *msr;
+ u64 data = (vcpu->regs[VCPU_REGS_RAX] & -1u)
+ | ((u64)(vcpu->regs[VCPU_REGS_RDX] & -1u) << 32);
+
+#ifdef KVM_DEBUG
+ if (guest_cpl() != 0) {
+ vcpu_printf(vcpu, "%s: not supervisor\n", __FUNCTION__);
+ inject_gp(vcpu);
+ return 1;
+ }
+#endif
+
+ switch (ecx) {
+#ifdef __x86_64__
+ case MSR_FS_BASE:
+ vmcs_writel(GUEST_FS_BASE, data);
+ break;
+ case MSR_GS_BASE:
+ vmcs_writel(GUEST_GS_BASE, data);
+ break;
+#endif
+ case MSR_IA32_SYSENTER_CS:
+ vmcs_write32(GUEST_SYSENTER_CS, data);
+ break;
+ case MSR_IA32_SYSENTER_EIP:
+ vmcs_writel(GUEST_SYSENTER_EIP, data);
+ break;
+ case MSR_IA32_SYSENTER_ESP:
+ vmcs_writel(GUEST_SYSENTER_ESP, data);
+ break;
+#ifdef __x86_64__
+ case MSR_EFER:
+ set_efer(vcpu, data);
+ return 1;
+ case MSR_IA32_MC0_STATUS:
+ printk(KERN_WARNING "%s: MSR_IA32_MC0_STATUS 0x%llx, nop\n"
+ , __FUNCTION__, data);
+ break;
+#endif
+ case MSR_IA32_TIME_STAMP_COUNTER: {
+ u64 tsc;
+
+ rdtscll(tsc);
+ vmcs_write64(TSC_OFFSET, data - tsc);
+ break;
+ }
+ case MSR_IA32_UCODE_REV:
+ case MSR_IA32_UCODE_WRITE:
+ case 0x200 ... 0x2ff: /* MTRRs */
+ break;
+ case MSR_IA32_APICBASE:
+ vcpu->apic_base = data;
+ break;
+ default:
+ msr = find_msr_entry(vcpu, ecx);
+ if (msr) {
+ msr->data = data;
+ break;
+ }
+ printk(KERN_ERR "kvm: unhandled wrmsr: %x\n", ecx);
+ inject_gp(vcpu);
+ return 1;
+ }
+ skip_emulated_instruction(vcpu);
+ return 1;
+}
+
static int handle_interrupt_window(struct kvm_vcpu *vcpu,
struct kvm_run *kvm_run)
{
@@ -2207,6 +2386,10 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_IO_INSTRUCTION] = handle_io,
[EXIT_REASON_INVLPG] = handle_invlpg,
[EXIT_REASON_CR_ACCESS] = handle_cr,
+ [EXIT_REASON_DR_ACCESS] = handle_dr,
+ [EXIT_REASON_CPUID] = handle_cpuid,
+ [EXIT_REASON_MSR_READ] = handle_rdmsr,
+ [EXIT_REASON_MSR_WRITE] = handle_wrmsr,
[EXIT_REASON_PENDING_INTERRUPT] = handle_interrupt_window,
[EXIT_REASON_HLT] = handle_halt,
};
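
Two pieces of arithmetic in the rdmsr/wrmsr handlers above are easy to get wrong: the EDX:EAX split of a 64-bit MSR value, and the TSC offset written when the guest sets its time stamp counter. The sketch below isolates both under stated assumptions; the demo_* helpers are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

/* RDMSR hands the 64-bit value back as EDX:EAX (high:low). */
void demo_msr_split(uint64_t data, uint32_t *eax, uint32_t *edx)
{
	*eax = (uint32_t)data;
	*edx = (uint32_t)(data >> 32);
}

/* WRMSR takes the 64-bit value from EDX:EAX. */
uint64_t demo_msr_combine(uint32_t eax, uint32_t edx)
{
	return (uint64_t)eax | ((uint64_t)edx << 32);
}

/*
 * The guest reads host_tsc + offset, so the offset that makes the guest
 * see guest_tsc right now is guest_tsc - host_tsc, modulo 2^64.
 */
uint64_t demo_tsc_offset(uint64_t guest_tsc, uint64_t host_tsc)
{
	return guest_tsc - host_tsc;
}
```

The unsigned wraparound is intentional: a guest TSC that starts at zero yields an offset equal to the two's complement of the host TSC.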

2006-10-23 13:32:00

by Avi Kivity

Subject: [PATCH 13/13] KVM: plumbing

Add a config entry and a Makefile for KVM.

Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/Makefile
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/Makefile
@@ -0,0 +1,6 @@
+#
+# Makefile for Kernel-based Virtual Machine module
+#
+
+kvm-objs := kvm_main.o mmu.o x86_emulate.o
+obj-$(CONFIG_KVM) += kvm.o
Index: linux-2.6/drivers/kvm/Kconfig
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/Kconfig
@@ -0,0 +1,22 @@
+
+menu "Virtualization"
+#
+# KVM configuration
+#
+config KVM
+ tristate "Kernel-based Virtual Machine (KVM) support"
+ depends on X86 && EXPERIMENTAL
+ ---help---
+ Support hosting fully virtualized guest machines using hardware
+ virtualization extensions. You will need a fairly recent Intel
+ processor equipped with VT extensions.
+
+ This module provides access to the hardware capabilities through
+ a character device node named /dev/kvm.
+
+ To compile this as a module, choose M here: the module
+ will be called kvm.
+
+ If unsure, say N.
+
+endmenu
Index: linux-2.6/drivers/Kconfig
===================================================================
--- linux-2.6.orig/drivers/Kconfig
+++ linux-2.6/drivers/Kconfig
@@ -78,4 +78,6 @@ source "drivers/rtc/Kconfig"

source "drivers/dma/Kconfig"

+source "drivers/kvm/Kconfig"
+
endmenu
Index: linux-2.6/drivers/Makefile
===================================================================
--- linux-2.6.orig/drivers/Makefile
+++ linux-2.6/drivers/Makefile
@@ -77,3 +77,4 @@ obj-$(CONFIG_CRYPTO) += crypto/
obj-$(CONFIG_SUPERH) += sh/
obj-$(CONFIG_GENERIC_TIME) += clocksource/
obj-$(CONFIG_DMA_ENGINE) += dma/
+obj-$(CONFIG_KVM) += kvm/

2006-10-23 13:44:58

by Avi Kivity

Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine (v2)

Avi Kivity wrote:
> I'll also reply to this mail with a sample userspace so people can see
> how it works.

Attached is the source for a small library (libkvm.a) and a small test
harness (kvmctl) plus a few tests (test/*). The only test which will
work now is sieve.flat (I haven't used the tests since we got a real OS
booting, and had to scrub it up a bit). The tests are currently only
for x86_64.

The library is also used in our qemu patch (not included). I'll post
that later on.

Basically, the library adds function wrappers around the ioctl()s, and
converts exit reasons to callbacks, so the user need not code that large
switch.
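
As a rough illustration of that idea (a sketch only: the names below are hypothetical and are not taken from libkvm), the exit-reason switch can be written once in the library and forwarded to user-supplied callbacks:

```c
/* Hypothetical exit reasons; the real values come from the kvm headers. */
enum exit_reason { EXIT_IO, EXIT_MMIO, EXIT_HALT };

/* One callback per exit reason; each returns 0 to resume the guest. */
struct vm_callbacks {
	int (*io)(void *opaque, unsigned port, unsigned value);
	int (*mmio)(void *opaque, unsigned long addr, unsigned value);
	int (*halt)(void *opaque);
};

/* The "large switch" lives here once instead of in every user. */
static int dispatch_exit(const struct vm_callbacks *cb, void *opaque,
			 enum exit_reason reason,
			 unsigned long addr, unsigned value)
{
	switch (reason) {
	case EXIT_IO:
		return cb->io ? cb->io(opaque, (unsigned)addr, value) : -1;
	case EXIT_MMIO:
		return cb->mmio ? cb->mmio(opaque, addr, value) : -1;
	case EXIT_HALT:
		return cb->halt ? cb->halt(opaque) : -1;
	default:
		return -1;	/* unknown exit reason */
	}
}

/* Example callback: accumulate every value written to an I/O port. */
static int sum_io(void *opaque, unsigned port, unsigned value)
{
	(void)port;
	*(unsigned *)opaque += value;
	return 0;
}
```

A caller fills in only the callbacks it cares about; unhandled exit reasons come back as -1.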

To try it, compile it, modprobe kvm, and

./kvmctl test/bootstrap test/sieve.flat

"bootstrap" is a bit of code to take the machine to 32-bit nonpaged
mode. sieve.flat will go into long mode and tell you some uninteresting
facts about prime numbers.

--
error compiling committee.c: too many arguments to function


Attachments:
kvm-sample-userspace.tar.gz (8.09 kB)

2006-10-23 15:38:54

by Avi Kivity

Subject: Re: [PATCH 0/13] KVM: qemu patch

Attached is a not-very-pretty patch to qemu-0.8.2 that allows it to use kvm.

You will need:
- libkvm.a from the userspace posted previously
- qemu 0.8.2 plus this patch
- some twiddling with the configure_kvm() function in ./configure to
set paths
- run ./configure with --enable-kvm
- a machine with VT, enabled in the BIOS if possible, running a kernel
with the kvm patches applied and configured
- modprobe kvm
- access to /dev/kvm

Runtime is exactly the same as qemu. You will need the same BIOS
shipped with qemu.

Some notes:
- the display is optimized by tracking which framebuffer pages have
been dirtied since the last refresh and updating only the affected scanlines
- the display bits are derived from a similar Xen patch
- I've only tested this with SDL, not VNC
- keep your original qemu binary since this one can't run without kvm
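
The dirty-page display optimization in the notes above can be sketched roughly as follows. This is only an illustration, assuming a linear framebuffer, 4096-byte pages and a fixed number of bytes per scanline; the names are hypothetical and not taken from the qemu patch:

```c
#include <stddef.h>

#define FB_PAGE_SIZE 4096u	/* assumed x86 page size */

struct scanline_range {
	size_t first;	/* first scanline touched by the page */
	size_t last;	/* last scanline touched by the page (inclusive) */
};

/*
 * Map a dirty framebuffer page to the scanlines it overlaps, assuming a
 * linear framebuffer with `pitch` bytes per scanline and `height` lines.
 */
static struct scanline_range page_to_scanlines(size_t page, size_t pitch,
					       size_t height)
{
	struct scanline_range r;
	size_t start = page * FB_PAGE_SIZE;
	size_t end = start + FB_PAGE_SIZE - 1;

	r.first = start / pitch;
	r.last = end / pitch;
	if (r.last >= height)
		r.last = height - 1;	/* page extends past the visible area */
	return r;
}
```

If `first` ends up greater than `last`, the page lies entirely outside the visible area and nothing needs redrawing.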


--
error compiling committee.c: too many arguments to function


Attachments:
qemu-kvm.patch (34.46 kB)

2006-10-23 19:35:05

by Arnd Bergmann

Subject: Re: [PATCH 5/13] KVM: virtualization infrastructure

On Monday 23 October 2006 15:30, Avi Kivity wrote:
> - ioctl()
> - mmap()
> - vcpu context management (vcpu_load/vcpu_put)
> - some control register logic

Let me comment on coding style for now, I might come back with
contents when I understand more of the code.

> +static struct dentry *debugfs_dir;
> +static struct dentry *debugfs_pf_fixed;
> +static struct dentry *debugfs_pf_guest;
> +static struct dentry *debugfs_tlb_flush;
> +static struct dentry *debugfs_invlpg;
> +static struct dentry *debugfs_exits;
> +static struct dentry *debugfs_io_exits;
> +static struct dentry *debugfs_mmio_exits;
> +static struct dentry *debugfs_signal_exits;
> +static struct dentry *debugfs_irq_exits;

How about making these an array?
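
A minimal userspace sketch of the table-driven approach, assuming a hypothetical `struct stats` holding the counters (in the driver, the loop body would create or remove one debugfs file per entry instead of comparing names):

```c
#include <stddef.h>
#include <string.h>

/* Counters mirroring a few of the statistics named in the patch. */
struct stats {
	unsigned pf_fixed;
	unsigned pf_guest;
	unsigned tlb_flush;
	unsigned exits;
};

static struct stats stats;

/* One table entry per exported counter, replacing one variable each. */
struct stat_entry {
	const char *name;
	unsigned *value;
};

static struct stat_entry stat_entries[] = {
	{ "pf_fixed", &stats.pf_fixed },
	{ "pf_guest", &stats.pf_guest },
	{ "tlb_flush", &stats.tlb_flush },
	{ "exits", &stats.exits },
	{ NULL, NULL }	/* sentinel terminates the table */
};

/* Walk the table until the sentinel; return the matching counter. */
static unsigned *find_stat(const char *name)
{
	struct stat_entry *e;

	for (e = stat_entries; e->name; ++e)
		if (strcmp(e->name, name) == 0)
			return e->value;
	return NULL;
}
```

Adding a statistic then means adding one table row rather than another static variable plus matching setup and teardown code.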

> +static int rmode_tss_base(struct kvm* kvm);
> +static void set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
> +static void __set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
> +static void lmsw(struct kvm_vcpu *vcpu, unsigned long msw);
> +static void set_cr3(struct kvm_vcpu *vcpu, unsigned long cr0);
> +static void set_cr4(struct kvm_vcpu *vcpu, unsigned long cr0);
> +static void __set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
> +#ifdef __x86_64__
> +static void __set_efer(struct kvm_vcpu *vcpu, u64 efer);
> +#endif

In general, you should try to avoid forward declarations for
static functions. The expected reading order is that static
functions are called only from other functions below them
in the same file, or through callbacks.

> +struct descriptor_table {
> + u16 limit;
> + unsigned long base;
> +} __attribute__((packed));

Is this a hardware structure? If not, packing it only
makes accesses rather inefficient.

> +static void get_gdt(struct descriptor_table *table)
> +{
> + asm ( "sgdt %0" : "=m"(*table) );
> +}

Spacing:

asm ("sgdt %0" : "=m" (*table));

> +static void load_fs(u16 sel)
> +{
> + asm ( "mov %0, %%fs\n" : : "g"(sel) );
> +}
> +
> +static void load_gs(u16 sel)
> +{
> + asm ( "mov %0, %%gs\n" : : "g"(sel) );
> +}

Remove the '\n'.

> +struct segment_descriptor {
> + u16 limit_low;
> + u16 base_low;
> + u8 base_mid;
> + u8 type : 4;
> + u8 system : 1;
> + u8 dpl : 2;
> + u8 present : 1;
> + u8 limit_high : 4;
> + u8 avl : 1;
> + u8 long_mode : 1;
> + u8 default_op : 1;
> + u8 granularity : 1;
> + u8 base_high;
> +} __attribute__((packed));

Bitfields are generally frowned upon. It's better to define
constants for each of these and use a u64.

> +
> +#ifdef __x86_64__
> +// LDT or TSS descriptor in the GDT. 16 bytes.
> +struct segment_descriptor_64 {
> + struct segment_descriptor s;
> + u32 base_higher;
> + u32 pad_zero;
> +} __attribute__((packed));
> +#endif

No need for packing this.

> +
> +DEFINE_PER_CPU(struct vmcs *, vmxarea);
> +DEFINE_PER_CPU(struct vmcs *, current_vmcs);

If you make these

DEFINE_PER_CPU(struct vmcs, vmxarea);
DEFINE_PER_CPU(struct vmcs, current_vmcs);

you no longer need to handle allocation of the structures
yourself. Also, they should be 'static DEFINE_PER_CPU' if
possible.

> +static void set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> +{
> + if (cr0 & CR0_RESEVED_BITS) {
> + printk("set_cr0: 0x%lx #GP, reserved bits (0x%lx)\n", cr0, guest_cr0());
> + inject_gp(vcpu);
> + return;
> + }
> +
> + if ((cr0 & CR0_NW_MASK) && !(cr0 & CR0_CD_MASK)) {
> + printk("set_cr0: #GP, CD == 0 && NW == 1\n");
> + inject_gp(vcpu);
> + return;
> + }
> +
> + if ((cr0 & CR0_PG_MASK) && !(cr0 & CR0_PE_MASK)) {
> + printk("set_cr0: #GP, set PG flag and a clear PE flag\n");
> + inject_gp(vcpu);
> + return;
> + }
> +
> + if (is_paging()) {
> +#ifdef __x86_64__
> + if (!(cr0 & CR0_PG_MASK)) {
> + vcpu->shadow_efer &= ~EFER_LMA;
> + vmcs_write32(VM_ENTRY_CONTROLS,
> + vmcs_read32(VM_ENTRY_CONTROLS) &
> + ~VM_ENTRY_CONTROLS_IA32E_MASK);
> + }
> +#endif
> + } else if ((cr0 & CR0_PG_MASK)) {
> +#ifdef __x86_64__
> + if ((vcpu->shadow_efer & EFER_LME)) {
> + u32 guest_cs_ar;
> + u32 guest_tr_ar;
> + if (!is_pae()) {
> + printk("set_cr0: #GP, start paging in "
> + "long mode while PAE is disabled\n");
> + inject_gp(vcpu);
> + return;
> + }
> + guest_cs_ar = vmcs_read32(GUEST_CS_AR_BYTES);
> + if (guest_cs_ar & SEGMENT_AR_L_MASK) {
> + printk("set_cr0: #GP, start paging in "
> + "long mode while CS.L == 1\n");
> + inject_gp(vcpu);
> + return;
> +
> + }
> + guest_tr_ar = vmcs_read32(GUEST_TR_AR_BYTES);
> + if ((guest_tr_ar & AR_TYPE_MASK) != AR_TYPE_BUSY_64_TSS) {
> + printk("%s: tss fixup for long mode. \n",
> + __FUNCTION__);
> + vmcs_write32(GUEST_TR_AR_BYTES,
> + (guest_tr_ar & ~AR_TYPE_MASK) |
> + AR_TYPE_BUSY_64_TSS);
> + }
> + vcpu->shadow_efer |= EFER_LMA;
> + find_msr_entry(vcpu, MSR_EFER)->data |=
> + EFER_LMA | EFER_LME;
> + vmcs_write32(VM_ENTRY_CONTROLS,
> + vmcs_read32(VM_ENTRY_CONTROLS) |
> + VM_ENTRY_CONTROLS_IA32E_MASK);
> +
> + } else
> +#endif
> + if (is_pae() &&
> + pdptrs_have_reserved_bits_set(vcpu, vcpu->cr3)) {
> + printk("set_cr0: #GP, pdptrs reserved bits\n");
> + inject_gp(vcpu);
> + return;
> + }
> +
> + }
> +
> + __set_cr0(vcpu, cr0);
> + kvm_mmu_reset_context(vcpu);
> + return;
> +}

This function is a little too complex to read. Can you split it up
into smaller functions?
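
One possible shape for such a split, shown as a standalone sketch: the bit positions are the architectural CR0 bits, the mask names follow the patch, and the helper names are hypothetical. The reserved-bits check and the long-mode handling are left out here.

```c
#include <stdbool.h>

/* Architectural CR0 bit positions; mask names follow the patch. */
#define CR0_PE_MASK (1UL << 0)	/* protection enable */
#define CR0_NW_MASK (1UL << 29)	/* not write-through */
#define CR0_CD_MASK (1UL << 30)	/* cache disable */
#define CR0_PG_MASK (1UL << 31)	/* paging */

/* NW == 1 with CD == 0 is an invalid combination (#GP). */
static bool cr0_cache_bits_valid(unsigned long cr0)
{
	return !((cr0 & CR0_NW_MASK) && !(cr0 & CR0_CD_MASK));
}

/* Paging requires protected mode: PG == 1 with PE == 0 is invalid (#GP). */
static bool cr0_paging_bits_valid(unsigned long cr0)
{
	return !((cr0 & CR0_PG_MASK) && !(cr0 & CR0_PE_MASK));
}

/* The caller injects #GP once if any individual check fails. */
static bool cr0_value_valid(unsigned long cr0)
{
	return cr0_cache_bits_valid(cr0) && cr0_paging_bits_valid(cr0);
}
```

With the checks factored out like this, set_cr0() would be left with the mode-transition logic only.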

> + } else
> + printk("lmsw: unexpected\n");

Make sure that all printk have KERN_* level in them.

> +
> + #define LMSW_GUEST_MASK 0x0eULL

Don't indent macro definitions. Normally, these should go at the top of your
file.

> +static long kvm_dev_ioctl(struct file *filp,
> + unsigned int ioctl, unsigned long arg)
> +{
> + struct kvm *kvm = filp->private_data;
> + int r = -EINVAL;
> +
> + switch (ioctl) {
> + default:
> + ;
> + }
> +out:
> + return r;
> +}

Huh? Just leave out stuff like this. If the ioctl function is introduced
in a later patch, you can still add the whole function there.

Arnd <><

2006-10-23 20:16:27

by Avi Kivity

Subject: Re: [PATCH 8/13] KVM: vcpu execution loop

Arnd Bergmann wrote:
> On Monday 23 October 2006 15:30, Avi Kivity wrote:
>
>> + asm (
>> + /* Store host registers */
>> + "pushf \n\t"
>> +#ifdef __x86_64__
>> + "push %%rax; push %%rbx; push %%rdx;"
>> + "push %%rsi; push %%rdi; push %%rbp;"
>> + "push %%r8; push %%r9; push %%r10; push %%r11;"
>> + "push %%r12; push %%r13; push %%r14; push %%r15;"
>> + "push %%rcx \n\t"
>> + "vmwrite %%rsp, %2 \n\t"
>> +#else
>> + "pusha; push %%ecx \n\t"
>> + "vmwrite %%esp, %2 \n\t"
>> +#endif
>> + /* Check if vmlaunch of vmresume is needed */
>> + "cmp $0, %1 \n\t"
>> + /* Load guest registers. Don't clobber flags. */
>> +#ifdef __x86_64__
>> + "mov %c[cr2](%3), %%rax \n\t"
>> + "mov %%rax, %%cr2 \n\t"
>> + "mov %c[rax](%3), %%rax \n\t"
>> + "mov %c[rbx](%3), %%rbx \n\t"
>> + "mov %c[rdx](%3), %%rdx \n\t"
>> + "mov %c[rsi](%3), %%rsi \n\t"
>> + "mov %c[rdi](%3), %%rdi \n\t"
>> + "mov %c[rbp](%3), %%rbp \n\t"
>> ...
>>
>
> This looks like you should simply put it into a .S file.
>
>

Then I lose all the offsetof constants down the line. Sure, I could do
the asm-offsets dance but it seems to me like needless obfuscation.



--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-23 20:29:01

by Avi Kivity

Subject: Re: [PATCH 5/13] KVM: virtualization infrastructure

Arnd Bergmann wrote:
> On Monday 23 October 2006 15:30, Avi Kivity wrote:
>
>> - ioctl()
>> - mmap()
>> - vcpu context management (vcpu_load/vcpu_put)
>> - some control register logic
>>
>
> Let me comment on coding style for now, I might come back with
> contents when I understand more of the code.
>
>
>> +static struct dentry *debugfs_dir;
>> +static struct dentry *debugfs_pf_fixed;
>> +static struct dentry *debugfs_pf_guest;
>> +static struct dentry *debugfs_tlb_flush;
>> +static struct dentry *debugfs_invlpg;
>> +static struct dentry *debugfs_exits;
>> +static struct dentry *debugfs_io_exits;
>> +static struct dentry *debugfs_mmio_exits;
>> +static struct dentry *debugfs_signal_exits;
>> +static struct dentry *debugfs_irq_exits;
>>
>
> How about making these an array?
>

Okay.

>
>> +static int rmode_tss_base(struct kvm* kvm);
>> +static void set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
>> +static void __set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
>> +static void lmsw(struct kvm_vcpu *vcpu, unsigned long msw);
>> +static void set_cr3(struct kvm_vcpu *vcpu, unsigned long cr0);
>> +static void set_cr4(struct kvm_vcpu *vcpu, unsigned long cr0);
>> +static void __set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
>> +#ifdef __x86_64__
>> +static void __set_efer(struct kvm_vcpu *vcpu, u64 efer);
>> +#endif
>>
>
> In general, you should try to avoid forward declarations for
> static functions. The expected reading order is that static
> functions are called only from other functions below them
> in the same file, or through callbacks.
>
>

Okay.

>> +struct descriptor_table {
>> + u16 limit;
>> + unsigned long base;
>> +} __attribute__((packed));
>>
>
> Is this a hardware structure? If not, packing it only
> makes accesses rather inefficient.
>
>

It is a hardware structure.
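
To illustrate why the packing matters here (a sketch; the padded variant exists only for comparison and is not part of the patch): sgdt/sidt store a 16-bit limit immediately followed by the base, so the base must sit at byte offset 2.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout stored by sgdt/sidt: 16-bit limit immediately followed by base. */
struct descriptor_table {
	uint16_t limit;
	unsigned long base;
} __attribute__((packed));

/* Same members without packing; the compiler pads after `limit`. */
struct descriptor_table_padded {
	uint16_t limit;
	unsigned long base;
};

/* Packed: base sits at byte offset 2, as the hardware expects. */
_Static_assert(offsetof(struct descriptor_table, base) == 2,
	       "base must directly follow the 16-bit limit");
_Static_assert(sizeof(struct descriptor_table) ==
	       sizeof(uint16_t) + sizeof(unsigned long),
	       "no padding allowed in the hardware layout");
```

Without the attribute, the padded variant would place `base` at a naturally aligned offset, and the instruction would write into the padding.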

>> +static void get_gdt(struct descriptor_table *table)
>> +{
>> + asm ( "sgdt %0" : "=m"(*table) );
>> +}
>>
>
> Spacing:
>
> asm ("sgdt %0" : "=m" (*table));
>
>

Ouch. Will fix.

>> +static void load_fs(u16 sel)
>> +{
>> + asm ( "mov %0, %%fs\n" : : "g"(sel) );
>> +}
>> +
>> +static void load_gs(u16 sel)
>> +{
>> + asm ( "mov %0, %%gs\n" : : "g"(sel) );
>> +}
>>
>
> Remove the '\n'.
>
>

Okay.


>> +struct segment_descriptor {
>> + u16 limit_low;
>> + u16 base_low;
>> + u8 base_mid;
>> + u8 type : 4;
>> + u8 system : 1;
>> + u8 dpl : 2;
>> + u8 present : 1;
>> + u8 limit_high : 4;
>> + u8 avl : 1;
>> + u8 long_mode : 1;
>> + u8 default_op : 1;
>> + u8 granularity : 1;
>> + u8 base_high;
>> +} __attribute__((packed));
>>
>
> Bitfields are generally frowned upon. It's better to define
> constants for each of these and use a u64.
>
>

Any specific reasons? I find the code much more readable (and
lowercase) with bitfields.

>> +
>> +#ifdef __x86_64__
>> +// LDT or TSS descriptor in the GDT. 16 bytes.
>> +struct segment_descriptor_64 {
>> + struct segment_descriptor s;
>> + u32 base_higher;
>> + u32 pad_zero;
>> +} __attribute__((packed));
>> +#endif
>>
>
> No need for packing this.
>
>

Right. Will remove.

>> +
>> +DEFINE_PER_CPU(struct vmcs *, vmxarea);
>> +DEFINE_PER_CPU(struct vmcs *, current_vmcs);
>>
>
> If you make these
>
> DEFINE_PER_CPU(struct vmcs, vmxarea);
> DEFINE_PER_CPU(struct vmcs, current_vmcs);
>
> you no longer need to handle allocation of the structures
> yourself. Also, they should be 'static DEFINE_PER_CPU' if
> possible.
>
>

The structure's size is defined by the hardware (struct vmcs is just a
header). In addition, current_vmcs changes when another guest is
switched in (it is somewhat like the scheduler's current for the VT
hardware).

>> +static void set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>> +{
>> + if (cr0 & CR0_RESEVED_BITS) {
>> + printk("set_cr0: 0x%lx #GP, reserved bits (0x%lx)\n", cr0, guest_cr0());
>> + inject_gp(vcpu);
>> + return;
>> + }
>> +
>> + if ((cr0 & CR0_NW_MASK) && !(cr0 & CR0_CD_MASK)) {
>> + printk("set_cr0: #GP, CD == 0 && NW == 1\n");
>> + inject_gp(vcpu);
>> + return;
>> + }
>> +
>> + if ((cr0 & CR0_PG_MASK) && !(cr0 & CR0_PE_MASK)) {
>> + printk("set_cr0: #GP, set PG flag and a clear PE flag\n");
>> + inject_gp(vcpu);
>> + return;
>> + }
>> +
>> + if (is_paging()) {
>> +#ifdef __x86_64__
>> + if (!(cr0 & CR0_PG_MASK)) {
>> + vcpu->shadow_efer &= ~EFER_LMA;
>> + vmcs_write32(VM_ENTRY_CONTROLS,
>> + vmcs_read32(VM_ENTRY_CONTROLS) &
>> + ~VM_ENTRY_CONTROLS_IA32E_MASK);
>> + }
>> +#endif
>> + } else if ((cr0 & CR0_PG_MASK)) {
>> +#ifdef __x86_64__
>> + if ((vcpu->shadow_efer & EFER_LME)) {
>> + u32 guest_cs_ar;
>> + u32 guest_tr_ar;
>> + if (!is_pae()) {
>> + printk("set_cr0: #GP, start paging in "
>> + "long mode while PAE is disabled\n");
>> + inject_gp(vcpu);
>> + return;
>> + }
>> + guest_cs_ar = vmcs_read32(GUEST_CS_AR_BYTES);
>> + if (guest_cs_ar & SEGMENT_AR_L_MASK) {
>> + printk("set_cr0: #GP, start paging in "
>> + "long mode while CS.L == 1\n");
>> + inject_gp(vcpu);
>> + return;
>> +
>> + }
>> + guest_tr_ar = vmcs_read32(GUEST_TR_AR_BYTES);
>> + if ((guest_tr_ar & AR_TYPE_MASK) != AR_TYPE_BUSY_64_TSS) {
>> + printk("%s: tss fixup for long mode. \n",
>> + __FUNCTION__);
>> + vmcs_write32(GUEST_TR_AR_BYTES,
>> + (guest_tr_ar & ~AR_TYPE_MASK) |
>> + AR_TYPE_BUSY_64_TSS);
>> + }
>> + vcpu->shadow_efer |= EFER_LMA;
>> + find_msr_entry(vcpu, MSR_EFER)->data |=
>> + EFER_LMA | EFER_LME;
>> + vmcs_write32(VM_ENTRY_CONTROLS,
>> + vmcs_read32(VM_ENTRY_CONTROLS) |
>> + VM_ENTRY_CONTROLS_IA32E_MASK);
>> +
>> + } else
>> +#endif
>> + if (is_pae() &&
>> + pdptrs_have_reserved_bits_set(vcpu, vcpu->cr3)) {
>> + printk("set_cr0: #GP, pdptrs reserved bits\n");
>> + inject_gp(vcpu);
>> + return;
>> + }
>> +
>> + }
>> +
>> + __set_cr0(vcpu, cr0);
>> + kvm_mmu_reset_context(vcpu);
>> + return;
>> +}
>>
>
> This function is a little too complex to read. Can you split it up
> into smaller functions?
>
>

Okay.

>> + } else
>> + printk("lmsw: unexpected\n");
>>
>
> Make sure that all printk have KERN_* level in them.
>
>

Okay.

>> +
>> + #define LMSW_GUEST_MASK 0x0eULL
>>
>
> Don't indent macro definitions. Normally, these should go at the top of your
> file.
>


Okay.

>> +static long kvm_dev_ioctl(struct file *filp,
>> + unsigned int ioctl, unsigned long arg)
>> +{
>> + struct kvm *kvm = filp->private_data;
>> + int r = -EINVAL;
>> +
>> + switch (ioctl) {
>> + default:
>> + ;
>> + }
>> +out:
>> + return r;
>> +}
>>
>
> Huh? Just leave out stuff like this. If the ioctl function is introduced
> in a later patch, you can still add the whole function there.

Several different patches add content here, so I thought I wouldn't play
favorites.

It also makes reordering the patches a little less painful. Any tips on
that or is that a normal ramp up? I'm using quilt for now and syncing
to a conventional source control repository.


Thanks for the review! I'll go do my homework now.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-23 20:29:42

by Arnd Bergmann

Subject: Re: [PATCH 8/13] KVM: vcpu execution loop

On Monday 23 October 2006 22:16, Avi Kivity wrote:
> > This looks like you should simply put it into a .S file.
> >
> >  
>
> Then I lose all the offsetof constants down the line.  Sure, I could do
> the asm-offsets dance but it seems to me like needless obfuscation.

Ok, I see.

How about passing &vcpu->regs and &vcpu->cr2 to the functions instead of
kvm_vcpu?

Arnd <><

2006-10-23 20:35:25

by Arnd Bergmann

Subject: Re: [PATCH 5/13] KVM: virtualization infrastructure

On Monday 23 October 2006 22:28, Avi Kivity wrote:

> >> +struct segment_descriptor {
> >> +    u16 limit_low;
> >> +    u16 base_low;
> >> +    u8  base_mid;
> >> +    u8  type : 4;
> >> +    u8  system : 1;
> >> +    u8  dpl : 2;
> >> +    u8  present : 1;
> >> +    u8  limit_high : 4;
> >> +    u8  avl : 1;
> >> +    u8  long_mode : 1;
> >> +    u8  default_op : 1;
> >> +    u8  granularity : 1;
> >> +    u8  base_high;
> >> +} __attribute__((packed));
> >>    
> >
> > Bitfields are generally frowned upon. It's better to define
> > constants for each of these and use a u64.
>
> Any specific reasons?  I find the code much more readable (and
> lowercase) with bitfields.

The strongest reason against bitfields is that they are not
endian-clean. This doesn't apply to an architecture-specific
patch such as KVM, but it just feels wrong to read code
with bit fields in the kernel.
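
A sketch of the constants-plus-u64 style for the 8-byte segment descriptor. The bit positions follow the architectural layout; the macro and helper names are hypothetical, and the value is assumed to have been loaded as a little-endian 64-bit word:

```c
#include <stdint.h>

/* Bit positions within an 8-byte x86 segment descriptor held in a u64. */
#define DESC_TYPE_SHIFT	40
#define DESC_TYPE_MASK	0xfULL
#define DESC_S		(1ULL << 44)	/* 0 = system, 1 = code/data */
#define DESC_DPL_SHIFT	45
#define DESC_DPL_MASK	0x3ULL
#define DESC_P		(1ULL << 47)	/* present */
#define DESC_AVL	(1ULL << 52)
#define DESC_L		(1ULL << 53)	/* long mode */
#define DESC_D		(1ULL << 54)	/* default operation size */
#define DESC_G		(1ULL << 55)	/* granularity */

static inline unsigned desc_type(uint64_t d)
{
	return (unsigned)((d >> DESC_TYPE_SHIFT) & DESC_TYPE_MASK);
}

static inline unsigned desc_dpl(uint64_t d)
{
	return (unsigned)((d >> DESC_DPL_SHIFT) & DESC_DPL_MASK);
}

/* Base is split over bits 16..39 (low 24 bits) and 56..63 (high 8 bits). */
static inline uint32_t desc_base(uint64_t d)
{
	return (uint32_t)(((d >> 16) & 0xffffff) | ((d >> 56) << 24));
}

/* Limit is split over bits 0..15 (low 16 bits) and 48..51 (high 4 bits). */
static inline uint32_t desc_limit(uint64_t d)
{
	return (uint32_t)((d & 0xffff) | (((d >> 48) & 0xf) << 16));
}
```

For example, the common flat 32-bit code descriptor 0x00cf9a000000ffff decodes to base 0, limit 0xfffff, type 0xa, with P, D and G set.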

> The structure's size is defined by the hardware (struct vmcs is just a
> header). In addition, current_vmcs changes when another guest is
> switched in (it is somewhat like the scheduler's current for the VT
> hardware).

Ok, I see.

> >> +static long kvm_dev_ioctl(struct file *filp,
> >> +                      unsigned int ioctl, unsigned long arg)
> >> +{
> >> +    struct kvm *kvm = filp->private_data;
> >> +    int r = -EINVAL;
> >> +
> >> +    switch (ioctl) {
> >> +    default:
> >> +            ;
> >> +    }
> >> +out:
> >> +    return r;
> >> +}
> >>    
> >
> > Huh? Just leave out stuff like this. If the ioctl function is introduced
> > in a later patch, you can still add the whole function there. 
>
> Several different patches add content here, so I thought I wouldn't play
> favorite.
>
> It also makes reordering the patches a little less painful.  Any tips on
> that or is that a normal ramp up?  I'm using quilt for now and syncing
> to a conventional source control repository.

I saw later how you add specific calls to this function. I guess it's
already as good as it gets, so just leave it this way.

Arnd <><

2006-10-23 20:37:20

by Avi Kivity

Subject: Re: [PATCH 8/13] KVM: vcpu execution loop

Arnd Bergmann wrote:
> On Monday 23 October 2006 22:16, Avi Kivity wrote:
>
>>> This looks like you should simply put it into a .S file.
>>>
>>>
>>>
>> Then I lose all the offsetof constants down the line. Sure, I could do
>> the asm-offsets dance but it seems to me like needless obfuscation.
>>
>
> Ok, I see.
>
> How about passing &vcpu->regs and &vcpu->cr2 to the functions instead of
> kvm_vcpu?
>
>

I could do that, but I feel that's more brittle. I might need more (or
other) fields later on. It will also cost me more pushes on the stack
(no real performance or space impact, just C64-era frugality).


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-23 20:40:05

by Avi Kivity

Subject: Re: [PATCH 5/13] KVM: virtualization infrastructure

Arnd Bergmann wrote:
>>>> +struct segment_descriptor {
>>>> + u16 limit_low;
>>>> + u16 base_low;
>>>> + u8 base_mid;
>>>> + u8 type : 4;
>>>> + u8 system : 1;
>>>> + u8 dpl : 2;
>>>> + u8 present : 1;
>>>> + u8 limit_high : 4;
>>>> + u8 avl : 1;
>>>> + u8 long_mode : 1;
>>>> + u8 default_op : 1;
>>>> + u8 granularity : 1;
>>>> + u8 base_high;
>>>> +} __attribute__((packed));
>>>>
>>>>
>>> Bitfields are generally frowned upon. It's better to define
>>> constants for each of these and use a u64.
>>>
>> Any specific reasons? I find the code much more readable (and
>> lowercase) with bitfields.
>>
>
> The strongest reason against bitfields is that they are not
> endian-clean. This doesn't apply to an architecture-specific
> patch such as KVM, but it just feels wrong to read code
> with bit fields in the kernel.
>
>

Okay, will change. It's very localized anyway.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-23 21:02:57

by Antonio Vargas

Subject: Re: [PATCH 8/13] KVM: vcpu execution loop

On 10/23/06, Avi Kivity <[email protected]> wrote:
> Arnd Bergmann wrote:
> > On Monday 23 October 2006 22:16, Avi Kivity wrote:
> >
> >>> This looks like you should simply put it into a .S file.
> >>>
> >>>
> >>>
> >> Then I lose all the offsetof constants down the line. Sure, I could do
> >> the asm-offsets dance but it seems to me like needless obfuscation.
> >>
> >
> > Ok, I see.
> >
> > How about passing &vcpu->regs and &vcpu->cr2 to the functions instead of
> > kvm_vcpu?
> >
> >
>
> I could do that, but I feel that's more brittle. I might need more (or
> other) fields later on. It will also cost me more pushes on the stack
> (no real performance or space impact, just C64-era frugality).

Maybe that's the mindset needed to write these virtual cpu patches
without eating up all the cpu power with more abstractions than
needed ;)

--
Greetz, Antonio Vargas aka winden of network

http://network.amigascne.org/
[email protected]
[email protected]

Every day, every year
you have to work
you have to study
you have to scene.

2006-10-23 21:11:28

by Avi Kivity

Subject: Re: [PATCH 8/13] KVM: vcpu execution loop

Antonio Vargas wrote:
>>
>> I could do that, but I feel that's more brittle. I might need more (or
>> other) fields later on. It will also cost me more pushes on the stack
>> (no real performance or space impact, just C64-era frugality).
>
> Maybe that's the mindset needed to write these virtual cpu patches
> without eating up all the cpu power with more abstractions than
> needed ;)
>

Unfortunately not. Saving a cycle or two doesn't help when a vm exit
costs thousands of cycles, and worse, kills your tlb.

The key is eliminating unnecessary exits. I have plans for massively
optimizing the mmu virtualization, and the next AMD core will do that in
hardware (look for a "nested page tables" sticker before you buy).

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-23 22:10:50

by Antonio Vargas

Subject: Re: [PATCH 8/13] KVM: vcpu execution loop

On 10/23/06, Avi Kivity <[email protected]> wrote:
> Antonio Vargas wrote:
> >>
> >> I could do that, but I feel that's more brittle. I might need more (or
> >> other) fields later on. It will also cost me more pushes on the stack
> >> (no real performance or space impact, just C64-era frugality).
> >
> > > Maybe that's the mindset needed to write these virtual cpu patches
> > > without eating up all the cpu power with more abstractions than
> > > needed ;)
> >
>
> Unfortunately not. Saving a cycle or two doesn't help when a vm exit
> costs thousands of cycles, and worse, kills your tlb.
>
> The key is eliminating unnecessary exits. I have plans for massively
> optimizing the mmu virtualization, and the next AMD core will do that in
> hardware (look for a "nested page tables" sticker before you buy).

Yes, when I read the nested page tables description in the AMD docs, I was
surprised that Intel had nothing like that and would be much slower... AMD
has worked a lot on the mmu (like the cr3-keyed tlb on k8 systems,
which avoids always emptying the tlb on a switch).

> --
> Do not meddle in the internals of kernels, for they are subtle and quick to panic.
>
>


--
Greetz, Antonio Vargas aka winden of network

http://network.amigascne.org/
[email protected]
[email protected]

Every day, every year
you have to work
you have to study
you have to scene.

2006-10-23 22:18:53

by Arnd Bergmann

Subject: Re: [PATCH 8/13] KVM: vcpu execution loop

On Monday 23 October 2006 22:37, Avi Kivity wrote:
> Arnd Bergmann wrote:
> > On Monday 23 October 2006 22:16, Avi Kivity wrote:
> >>> This looks like you should simply put it into a .S file.
> >>
> >> Then I lose all the offsetof constants down the line. Sure, I could do
> >> the asm-offsets dance but it seems to me like needless obfuscation.
> >
> > Ok, I see.
> >
> > > How about passing &vcpu->regs and &vcpu->cr2 to the functions instead of
> > > kvm_vcpu?
>
> I could do that, but I feel that's more brittle. I might need more (or
> other) fields later on. It will also cost me more pushes on the stack
> (no real performance or space impact, just C64-era frugality).

Maybe you could save some more stack usage by doing something like this:

static inline void vmlaunch(struct kvm_vcpu *vcpu)
{
register unsigned long rax asm("rax");
register unsigned long rbx asm("rbx");
register unsigned long rcx asm("rcx");
register unsigned long rdx asm("rdx");
register unsigned long rsi asm("rsi");
register unsigned long rdi asm("rdi");
register unsigned long rbp asm("rbp");
register unsigned long r8 asm("r8");
register unsigned long r9 asm("r9");
register unsigned long r10 asm("r10");
register unsigned long r11 asm("r11");
register unsigned long r12 asm("r12");
register unsigned long r13 asm("r13");
register unsigned long r14 asm("r14");
register unsigned long r15 asm("r15");

asm ("mov %0, %%cr2" : : "r" (vcpu->cr2));

rax = vcpu->regs[VCPU_REGS_RAX];
rbx = vcpu->regs[VCPU_REGS_RBX];
rcx = vcpu->regs[VCPU_REGS_RCX];
rdx = vcpu->regs[VCPU_REGS_RDX];
rsi = vcpu->regs[VCPU_REGS_RSI];
rdi = vcpu->regs[VCPU_REGS_RDI];
rbp = vcpu->regs[VCPU_REGS_RBP];
r8 = vcpu->regs[VCPU_REGS_R8 ];
r9 = vcpu->regs[VCPU_REGS_R9 ];
r10 = vcpu->regs[VCPU_REGS_R10];
r11 = vcpu->regs[VCPU_REGS_R11];
r12 = vcpu->regs[VCPU_REGS_R12];
r13 = vcpu->regs[VCPU_REGS_R13];
r14 = vcpu->regs[VCPU_REGS_R14];
r15 = vcpu->regs[VCPU_REGS_R15];

asm ("vmlaunch\n\t" :
"+r" (rax),
"+r" (rbx),
"+r" (rcx),
"+r" (rdx),
"+r" (rsi),
"+r" (rdi),
"+r" (rbp),
"+r" (r8),
"+r" (r9),
"+r" (r10),
"+r" (r11),
"+r" (r12),
"+r" (r13),
"+r" (r14),
"+r" (r15)
);

vcpu->regs[VCPU_REGS_RAX] = rax;
vcpu->regs[VCPU_REGS_RBX] = rbx;
vcpu->regs[VCPU_REGS_RCX] = rcx;
vcpu->regs[VCPU_REGS_RDX] = rdx;
vcpu->regs[VCPU_REGS_RSI] = rsi;
vcpu->regs[VCPU_REGS_RDI] = rdi;
vcpu->regs[VCPU_REGS_RBP] = rbp;
vcpu->regs[VCPU_REGS_R8 ] = r8 ;
vcpu->regs[VCPU_REGS_R9 ] = r9 ;
vcpu->regs[VCPU_REGS_R10] = r10;
vcpu->regs[VCPU_REGS_R11] = r11;
vcpu->regs[VCPU_REGS_R12] = r12;
vcpu->regs[VCPU_REGS_R13] = r13;
vcpu->regs[VCPU_REGS_R14] = r14;
vcpu->regs[VCPU_REGS_R15] = r15;

asm ("mov %%cr2, %0" : "=r" (vcpu->cr2));
}

Unfortunately, I couldn't get this to do the right thing with the output
flags. Unless I missed something, your solution is really the best one you
can express in gcc.

Arnd <><

2006-10-24 01:05:15

by Anthony Liguori

Subject: Re: [PATCH 9/13] KVM: define exit handlers

Avi Kivity wrote:
> +static int handle_external_interrupt(struct kvm_vcpu *vcpu,
> + struct kvm_run *kvm_run)
> +{
> + ++kvm_stat.irq_exits;
> + return 1;
> +}
>

Don't you need to propagate the interrupt here? In Xen, we inject the
interrupt using the IDT. As a module, you don't have access to that.
However, you could use a software interrupt to reraise it.

I got your code running this afternoon (it's quite cool) but I noticed a
ton of "rtc: lost some interrupts at 1024Hz." messages which leads me to
believe.. you're dropping interrupts :-) Things seem to hang trying to
bring up eth0 in the guest.

BTW, have you set up a mailing list yet?

Regards,

Anthony Liguori

2006-10-24 07:23:33

by Avi Kivity

Subject: Re: [PATCH 9/13] KVM: define exit handlers

Anthony Liguori wrote:
> Avi Kivity wrote:
>> +static int handle_external_interrupt(struct kvm_vcpu *vcpu,
>> + struct kvm_run *kvm_run)
>> +{
>> + ++kvm_stat.irq_exits;
>> + return 1;
>> +}
>>
>
> Don't you need to propagate the interrupt here? In Xen, we inject the
> interrupt using the IDT. As a module, you don't have access to that.
> However, you could use a software interrupt to reraise it.

We don't set VM_EXIT_ACK_INTR_ON_EXIT on the VM exit controls, so when
an external interrupt is received, it isn't acked and remains in the
(real) apic. We do set the guest to exit on external interrupt, so the
guest exits and when it reaches the popf in kvm_dev_ioctl_run() the
interrupt is dispatched naturally using the host IDT.

[Xen can't do that since it must handle some of the interrupts itself]

>
> I got your code running this afternoon (it's quite cool) but I noticed
> a ton of "rtc: lost some interrupts at 1024Hz." messages which leads
> me to believe.. you're dropping interrupts :-)

That's in the guest, right? I get those too. Probably due to our
shadow mmu suckiness or a problem with the virtual apic. We are
addressing both.

> Things seem to hang trying to bring up eth0 in the guest.

Hmm. What guest are you using? Are you using dhcp? ipv6? qemu user net
or tap?


>
> BTW, have you set up a mailing list yet?

I have a project queued on sourceforge, should be up in a day or two.

Thanks for testing!

--
error compiling committee.c: too many arguments to function

2006-10-24 12:03:26

by Avi Kivity

Subject: Re: [PATCH 5/13] KVM: virtualization infrastructure

Arnd Bergmann wrote:
> On Monday 23 October 2006 22:28, Avi Kivity wrote:
>
>
>>>> +struct segment_descriptor {
>>>> + u16 limit_low;
>>>> + u16 base_low;
>>>> + u8 base_mid;
>>>> + u8 type : 4;
>>>> + u8 system : 1;
>>>> + u8 dpl : 2;
>>>> + u8 present : 1;
>>>> + u8 limit_high : 4;
>>>> + u8 avl : 1;
>>>> + u8 long_mode : 1;
>>>> + u8 default_op : 1;
>>>> + u8 granularity : 1;
>>>> + u8 base_high;
>>>> +} __attribute__((packed));
>>>>
>>>>
>>> Bitfields are generally frowned upon. It's better to define
>>> constants for each of these and use a u64.
>>>
>> Any specific reasons? I find the code much more readable (and
>> lowercase) with bitfields.
>>
>
> The strongest reason against bitfields is that they are not
> endian-clean. This doesn't apply to an architecture-specific
> patch such as KVM, but it just feels wrong to read code
> with bit fields in the kernel.
>
>

This structure is suspiciously similar to struct desc_struct in
asm-x86_64/desc.h.

However, I can't use it because asm-i386/desc.h does not have a similar
definition.


Andi, will you accept a patch to move it to asm-i386/desc_defs.h so it
can be used in both archs?

--
error compiling committee.c: too many arguments to function

2006-10-24 12:27:21

by Andi Kleen

Subject: Re: [PATCH 5/13] KVM: virtualization infrastructure


>
> Andi, will you accept a patch to move it to asm-i386/desc_defs.h so it
> can be used in both archs?

No. But an asm-x86_64/desc_defs.h would be ok, you can include that
then.

-Andi

2006-10-24 12:51:52

by Muli Ben-Yehuda

Subject: Re: [PATCH 1/13] KVM: userspace interface

On Mon, Oct 23, 2006 at 01:29:46PM -0000, Avi Kivity wrote:


> + struct {
> + } debug;

ISTR some versions of gcc had problems with empty structs.
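
For context, a small sketch of what is at stake (the structure names are hypothetical, not the patch's): ISO C does not allow empty structs, GCC accepts them as an extension with size zero, and C++ gives an empty struct a nonzero size, so a header shared with C++ userspace could see a different layout.

```c
/* Empty member as in the posted interface; a GNU C extension, size 0. */
struct run_with_empty {
	int exit_reason;
	struct {
	} debug;
};

/* Alternative: simply drop the member until it has contents. */
struct run_without_debug {
	int exit_reason;
};
```

Under GNU C both layouts have the same size, so dropping the member until it is needed costs nothing.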

Cheers,
Muli


2006-10-24 12:56:09

by Avi Kivity

Subject: Re: [PATCH 1/13] KVM: userspace interface

Muli Ben-Yehuda wrote:
> On Mon, Oct 23, 2006 at 01:29:46PM -0000, Avi Kivity wrote:
>
>
>
>> + struct {
>> + } debug;
>>
>
> ISTR some versions of gcc had problems with empty structs.
>

Any versions >= 3.2, which is the minimum required nowadays?

--
error compiling committee.c: too many arguments to function

2006-10-24 12:59:22

by Muli Ben-Yehuda

Subject: Re: [PATCH 1/13] KVM: userspace interface

On Tue, Oct 24, 2006 at 02:56:02PM +0200, Avi Kivity wrote:
> Muli Ben-Yehuda wrote:
> >On Mon, Oct 23, 2006 at 01:29:46PM -0000, Avi Kivity wrote:
> >
> >
> >
> >>+ struct {
> >>+ } debug;
> >>
> >
> >ISTR some versions of gcc had problems with empty structs.
> >
>
> Any versions >= 3.2, which is the minimum required nowadays?

Don't recall, sorry. But in any case I don't see a problem with
dropping it and re-adding it if debug arguments are needed later.

Cheers,
Muli
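
For context on the empty-struct question: ISO C does not allow a struct with
no members, but GCC accepts one as an extension and gives it size zero. The
sketch below shows that behaviour. It is not the structure from the patch;
struct exit_info, its fields, and the helper functions are hypothetical
stand-ins for a union that carries an empty "debug" member.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a union with an empty member. */
struct exit_info {
	int exit_type;
	union {
		struct {
			int direction;
			int port;
		} io;
		struct {
		} debug;	/* GNU C extension: no members, size 0 */
	} u;
};

/* Size of the empty member as seen by the compiler. */
static inline size_t debug_member_size(void)
{
	return sizeof(((struct exit_info *)0)->u.debug);
}

/* The empty member must not change the size of the enclosing union. */
static inline void exit_info_check_layout(void)
{
	assert(sizeof(((struct exit_info *)0)->u) ==
	       sizeof(((struct exit_info *)0)->u.io));
}
```

Because the member occupies no storage, dropping it now and re-adding it
later, as suggested above, would not change the layout of such a structure.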

2006-10-24 13:43:40

by Avi Kivity

Subject: [PATCH] x86: Extract segment descriptor definitions for use outside of x86_64

Code that wants to use struct desc_struct cannot do so on i386 because
desc.h contains other code that will only compile on x86_64.

So extract the structure definitions into asm-x86_64/desc_defs.h.

Signed-off-by: Avi Kivity <[email protected]>

include/asm-x86_64/desc.h | 53 +------------------------------
include/asm-x86_64/desc_defs.h | 69 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+), 52 deletions(-)

diff --git a/include/asm-x86_64/desc.h b/include/asm-x86_64/desc.h
index eb7723a..913d6ac 100644
--- a/include/asm-x86_64/desc.h
+++ b/include/asm-x86_64/desc.h
@@ -9,64 +9,13 @@ #ifndef __ASSEMBLY__

#include <linux/string.h>
#include <linux/smp.h>
+#include <asm/desc_defs.h>

#include <asm/segment.h>
#include <asm/mmu.h>

-// 8 byte segment descriptor
-struct desc_struct {
- u16 limit0;
- u16 base0;
- unsigned base1 : 8, type : 4, s : 1, dpl : 2, p : 1;
- unsigned limit : 4, avl : 1, l : 1, d : 1, g : 1, base2 : 8;
-} __attribute__((packed));
-
-struct n_desc_struct {
- unsigned int a,b;
-};
-
extern struct desc_struct cpu_gdt_table[GDT_ENTRIES];

-enum {
- GATE_INTERRUPT = 0xE,
- GATE_TRAP = 0xF,
- GATE_CALL = 0xC,
-};
-
-// 16byte gate
-struct gate_struct {
- u16 offset_low;
- u16 segment;
- unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
- u16 offset_middle;
- u32 offset_high;
- u32 zero1;
-} __attribute__((packed));
-
-#define PTR_LOW(x) ((unsigned long)(x) & 0xFFFF)
-#define PTR_MIDDLE(x) (((unsigned long)(x) >> 16) & 0xFFFF)
-#define PTR_HIGH(x) ((unsigned long)(x) >> 32)
-
-enum {
- DESC_TSS = 0x9,
- DESC_LDT = 0x2,
-};
-
-// LDT or TSS descriptor in the GDT. 16 bytes.
-struct ldttss_desc {
- u16 limit0;
- u16 base0;
- unsigned base1 : 8, type : 5, dpl : 2, p : 1;
- unsigned limit1 : 4, zero0 : 3, g : 1, base2 : 8;
- u32 base3;
- u32 zero1;
-} __attribute__((packed));
-
-struct desc_ptr {
- unsigned short size;
- unsigned long address;
-} __attribute__((packed)) ;
-
#define load_TR_desc() asm volatile("ltr %w0"::"r" (GDT_ENTRY_TSS*8))
#define load_LDT_desc() asm volatile("lldt %w0"::"r" (GDT_ENTRY_LDT*8))
#define clear_LDT() asm volatile("lldt %w0"::"r" (0))
diff --git a/include/asm-x86_64/desc_defs.h b/include/asm-x86_64/desc_defs.h
new file mode 100644
index 0000000..7408266
--- /dev/null
+++ b/include/asm-x86_64/desc_defs.h
@@ -0,0 +1,69 @@
+/* Written 2000 by Andi Kleen */
+#ifndef __ARCH_DESC_DEFS_H
+#define __ARCH_DESC_DEFS_H
+
+/*
+ * Segment descriptor structure definitions, usable from both x86_64 and i386
+ * archs.
+ */
+
+#ifndef __ASSEMBLY__
+
+#include <linux/types.h>
+
+// 8 byte segment descriptor
+struct desc_struct {
+ u16 limit0;
+ u16 base0;
+ unsigned base1 : 8, type : 4, s : 1, dpl : 2, p : 1;
+ unsigned limit : 4, avl : 1, l : 1, d : 1, g : 1, base2 : 8;
+} __attribute__((packed));
+
+struct n_desc_struct {
+ unsigned int a,b;
+};
+
+enum {
+ GATE_INTERRUPT = 0xE,
+ GATE_TRAP = 0xF,
+ GATE_CALL = 0xC,
+};
+
+// 16byte gate
+struct gate_struct {
+ u16 offset_low;
+ u16 segment;
+ unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
+ u16 offset_middle;
+ u32 offset_high;
+ u32 zero1;
+} __attribute__((packed));
+
+#define PTR_LOW(x) ((unsigned long)(x) & 0xFFFF)
+#define PTR_MIDDLE(x) (((unsigned long)(x) >> 16) & 0xFFFF)
+#define PTR_HIGH(x) ((unsigned long)(x) >> 32)
+
+enum {
+ DESC_TSS = 0x9,
+ DESC_LDT = 0x2,
+};
+
+// LDT or TSS descriptor in the GDT. 16 bytes.
+struct ldttss_desc {
+ u16 limit0;
+ u16 base0;
+ unsigned base1 : 8, type : 5, dpl : 2, p : 1;
+ unsigned limit1 : 4, zero0 : 3, g : 1, base2 : 8;
+ u32 base3;
+ u32 zero1;
+} __attribute__((packed));
+
+struct desc_ptr {
+ unsigned short size;
+ unsigned long address;
+} __attribute__((packed)) ;
+
+
+#endif /* !__ASSEMBLY__ */
+
+#endif



--
error compiling committee.c: too many arguments to function
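
As an illustration of how code outside desc.h might consume the extracted
definition, here is a small userspace sketch that reassembles the base and
limit scattered across struct desc_struct. The struct mirrors the one in the
patch, with uint16_t standing in for the kernel's u16. The helpers
desc_base(), desc_limit(), and desc_set_base() are hypothetical and are not
part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* 8 byte segment descriptor, as in the patch (u16 spelled uint16_t). */
struct desc_struct {
	uint16_t limit0;
	uint16_t base0;
	unsigned base1 : 8, type : 4, s : 1, dpl : 2, p : 1;
	unsigned limit : 4, avl : 1, l : 1, d : 1, g : 1, base2 : 8;
} __attribute__((packed));

_Static_assert(sizeof(struct desc_struct) == 8, "descriptor must be 8 bytes");

/* Reassemble the 32-bit base from its three pieces. */
static inline uint32_t desc_base(const struct desc_struct *d)
{
	return (uint32_t)d->base0 |
	       ((uint32_t)d->base1 << 16) |
	       ((uint32_t)d->base2 << 24);
}

/* Split a 32-bit base into the three descriptor fields. */
static inline void desc_set_base(struct desc_struct *d, uint32_t base)
{
	d->base0 = base & 0xffff;
	d->base1 = (base >> 16) & 0xff;
	d->base2 = (base >> 24) & 0xff;
}

/*
 * Reassemble the 20-bit limit. With the granularity bit set, the limit
 * counts 4 KiB pages, so scale it to a byte limit.
 */
static inline uint32_t desc_limit(const struct desc_struct *d)
{
	uint32_t limit = (uint32_t)d->limit0 | ((uint32_t)d->limit << 16);

	if (d->g)
		limit = (limit << 12) | 0xfff;
	return limit;
}
```

A flat 4 GiB segment, for example, has limit0 = 0xffff, limit = 0xf, and
g = 1, for which desc_limit() yields 0xffffffff.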

2006-10-24 14:11:04

by Andi Kleen

Subject: Re: [PATCH] x86: Extract segment descriptor definitions for use outside of x86_64

On Tuesday 24 October 2006 06:43, Avi Kivity wrote:
> Code that wants to use struct desc_struct cannot do so on i386 because
> desc.h contains other code that will only compile on x86_64.
>
> So extract the structure definitions into asm-x86_64/desc_defs.h.

Added, thanks.
-Andi
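
The PTR_LOW/PTR_MIDDLE/PTR_HIGH macros moved by the patch split a 64-bit
handler address across the three offset fields of struct gate_struct. The
sketch below shows that split and its inverse in userspace C. It makes two
deliberate changes from the header: uint64_t replaces unsigned long so the
shift by 32 is well defined on any host, and uint16_t/uint32_t replace
u16/u32. gate_set() and gate_offset() are hypothetical helpers, not kernel
functions.

```c
#include <assert.h>
#include <stdint.h>

enum {
	GATE_INTERRUPT = 0xE,
	GATE_TRAP = 0xF,
	GATE_CALL = 0xC,
};

/* 16 byte gate, as in the patch (kernel integer types spelled out). */
struct gate_struct {
	uint16_t offset_low;
	uint16_t segment;
	unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
	uint16_t offset_middle;
	uint32_t offset_high;
	uint32_t zero1;
} __attribute__((packed));

_Static_assert(sizeof(struct gate_struct) == 16, "gate must be 16 bytes");

/* Same splitting as the header, but on uint64_t instead of unsigned long. */
#define PTR_LOW(x)	((uint64_t)(x) & 0xFFFF)
#define PTR_MIDDLE(x)	(((uint64_t)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x)	((uint64_t)(x) >> 32)

/* Fill in a present gate of the given type pointing at handler. */
static inline void gate_set(struct gate_struct *g, uint64_t handler,
			    unsigned type, unsigned dpl, uint16_t segment)
{
	assert(dpl <= 3);
	g->offset_low = (uint16_t)PTR_LOW(handler);
	g->segment = segment;
	g->ist = 0;
	g->zero0 = 0;
	g->type = type;
	g->dpl = dpl;
	g->p = 1;
	g->offset_middle = (uint16_t)PTR_MIDDLE(handler);
	g->offset_high = (uint32_t)PTR_HIGH(handler);
	g->zero1 = 0;
}

/* Inverse of the split: rebuild the 64-bit handler address. */
static inline uint64_t gate_offset(const struct gate_struct *g)
{
	return (uint64_t)g->offset_low |
	       ((uint64_t)g->offset_middle << 16) |
	       ((uint64_t)g->offset_high << 32);
}
```

A caller would fill a gate with something like
gate_set(&g, addr, GATE_INTERRUPT, 0, seg), where addr and seg are whatever
handler address and code segment selector apply in that context.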