2006-10-19 13:45:55

by Avi Kivity

Subject: [PATCH 0/7] KVM: Kernel-based Virtual Machine

The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.

With this driver, one can run multiple virtual machines on a single host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), and top(1) work as expected.

In effect, the driver adds a third execution mode to the existing two: we
now have kernel mode, user mode, and guest mode. Guest mode has its own
address space mapping guest physical memory (which is accessible to user
mode by mmap()ing /dev/kvm). Guest mode has no access to any I/O devices;
any such access is intercepted and directed to user mode for emulation.

The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both
pae and non-pae paging modes are supported.

SMP hosts and UP guests are supported. At the moment only Intel hardware is
supported, but AMD virtualization support is being worked on.

Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:

- cache shadow page tables across page faults
- wait until AMD and Intel release processors with nested page tables

Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a
recent CPU one can hardly feel the virtualization. Linux/X is slower,
probably due to X being in a separate process.

In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.

Caveats:

- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to use
an existing image or install through qemu.
- Windows 64-bit does not work. That's also true for qemu, so it's probably
a problem with the device model.

--
error compiling committee.c: too many arguments to function


2006-10-19 13:47:44

by Avi Kivity

Subject: [PATCH 1/7] KVM: userspace interface

This patch defines a bunch of ioctl()s on /dev/kvm. The ioctl()s allow
adding memory to a virtual machine, adding a virtual cpu to a virtual
machine (at most one at this time), transferring control to the virtual
cpu, and querying which guest pages were changed by the virtual machine.

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/include/linux/kvm.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/kvm.h
@@ -0,0 +1,202 @@
+#ifndef __LINUX_KVM_H
+#define __LINUX_KVM_H
+
+/*
+ * Userspace interface for /dev/kvm - kernel based virtual machine
+ *
+ * Note: this interface is considered experimental and may change without
+ * notice.
+ */
+
+#include <asm/types.h>
+#include <linux/ioctl.h>
+
+#ifndef __user
+#define __user
+#endif
+
+/* for KVM_CREATE_MEMORY_REGION */
+struct kvm_memory_region {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size; /* bytes */
+};
+
+/* for kvm_memory_region::flags */
+#define KVM_MEM_LOG_DIRTY_PAGES 1UL
+
+
+#define KVM_EXIT_TYPE_FAIL_ENTRY 1
+#define KVM_EXIT_TYPE_VM_EXIT 2
+
+enum kvm_exit_reason {
+ KVM_EXIT_UNKNOWN,
+ KVM_EXIT_EXCEPTION,
+ KVM_EXIT_IO,
+ KVM_EXIT_CPUID,
+ KVM_EXIT_DEBUG,
+ KVM_EXIT_HLT,
+ KVM_EXIT_MMIO,
+};
+
+/* for KVM_RUN */
+struct kvm_run {
+ /* in */
+ __u32 vcpu;
+ __u32 emulated; /* skip current instruction */
+ __u32 mmio_completed; /* mmio request completed */
+
+ /* out */
+ __u32 exit_type;
+ __u32 exit_reason;
+ __u32 instruction_length;
+ union {
+ /* KVM_EXIT_UNKNOWN */
+ struct {
+ __u32 hardware_exit_reason;
+ } hw;
+ /* KVM_EXIT_EXCEPTION */
+ struct {
+ __u32 exception;
+ __u32 error_code;
+ } ex;
+ /* KVM_EXIT_IO */
+ struct {
+#define KVM_EXIT_IO_IN 0
+#define KVM_EXIT_IO_OUT 1
+ __u8 direction;
+ __u8 size; /* bytes */
+ __u8 string;
+ __u8 string_down;
+ __u8 rep;
+ __u8 pad;
+ __u16 port;
+ __u64 count;
+ union {
+ __u64 address;
+ __u32 value;
+ };
+ } io;
+ struct {
+ } debug;
+ /* KVM_EXIT_MMIO */
+ struct {
+ __u64 phys_addr;
+ __u8 data[8];
+ __u32 len;
+ __u8 is_write;
+ } mmio;
+ };
+};
+
+/* for KVM_GET_REGS and KVM_SET_REGS */
+struct kvm_regs {
+ /* in */
+ __u32 vcpu;
+ __u32 padding;
+
+ /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
+ __u64 rax, rbx, rcx, rdx;
+ __u64 rsi, rdi, rsp, rbp;
+ __u64 r8, r9, r10, r11;
+ __u64 r12, r13, r14, r15;
+ __u64 rip, rflags;
+};
+
+struct kvm_segment {
+ __u64 base;
+ __u32 limit;
+ __u16 selector;
+ __u8 type;
+ __u8 present, dpl, db, s, l, g, avl;
+ __u8 unusable;
+ __u8 padding;
+};
+
+struct kvm_dtable {
+ __u64 base;
+ __u16 limit;
+ __u16 padding[3];
+};
+
+/* for KVM_GET_SREGS and KVM_SET_SREGS */
+struct kvm_sregs {
+ /* in */
+ __u32 vcpu;
+ __u32 padding;
+
+ /* out (KVM_GET_SREGS) / in (KVM_SET_SREGS) */
+ struct kvm_segment cs, ds, es, fs, gs, ss;
+ struct kvm_segment tr, ldt;
+ struct kvm_dtable gdt, idt;
+ __u64 cr0, cr2, cr3, cr4, cr8;
+ __u64 efer;
+ __u64 apic_base;
+
+ /* out (KVM_GET_SREGS) */
+ __u32 pending_int;
+ __u32 padding2;
+};
+
+/* for KVM_TRANSLATE */
+struct kvm_translation {
+ /* in */
+ __u64 linear_address;
+ __u32 vcpu;
+ __u32 padding;
+
+ /* out */
+ __u64 physical_address;
+ __u8 valid;
+ __u8 writeable;
+ __u8 usermode;
+};
+
+/* for KVM_INTERRUPT */
+struct kvm_interrupt {
+ /* in */
+ __u32 vcpu;
+ __u32 irq;
+};
+
+struct kvm_breakpoint {
+ __u32 enabled;
+ __u32 padding;
+ __u64 address;
+};
+
+/* for KVM_DEBUG_GUEST */
+struct kvm_debug_guest {
+ /* in */
+ __u32 vcpu;
+ __u32 enabled;
+ struct kvm_breakpoint breakpoints[4];
+ __u32 singlestep;
+};
+
+/* for KVM_GET_DIRTY_LOG */
+struct kvm_dirty_log {
+ __u32 slot;
+ __u32 padding;
+ union {
+ void __user *dirty_bitmap; /* one bit per page */
+ __u64 padding;
+ };
+};
+
+#define KVMIO 0xAE
+
+#define KVM_RUN _IOWR(KVMIO, 2, struct kvm_run)
+#define KVM_GET_REGS _IOWR(KVMIO, 3, struct kvm_regs)
+#define KVM_SET_REGS _IOW(KVMIO, 4, struct kvm_regs)
+#define KVM_GET_SREGS _IOWR(KVMIO, 5, struct kvm_sregs)
+#define KVM_SET_SREGS _IOW(KVMIO, 6, struct kvm_sregs)
+#define KVM_TRANSLATE _IOWR(KVMIO, 7, struct kvm_translation)
+#define KVM_INTERRUPT _IOW(KVMIO, 8, struct kvm_interrupt)
+#define KVM_DEBUG_GUEST _IOW(KVMIO, 9, struct kvm_debug_guest)
+#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 10, struct kvm_memory_region)
+#define KVM_CREATE_VCPU _IOW(KVMIO, 11, int /* vcpu_slot */)
+#define KVM_GET_DIRTY_LOG _IOW(KVMIO, 12, struct kvm_dirty_log)
+
+#endif


2006-10-19 13:48:51

by Avi Kivity

Subject: [PATCH 2/7] KVM: Intel virtual mode extensions definitions

Add some constants for the various bits defined by Intel's VT extensions.

Most of this file was lifted from the Xen hypervisor.

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/vmx.h
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/vmx.h
@@ -0,0 +1,287 @@
+#ifndef VMX_H
+#define VMX_H
+
+/*
+ * vmx.h: VMX Architecture related definitions
+ * Copyright (c) 2004, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * A few random additions are:
+ * Copyright (C) 2006 Qumranet
+ * Avi Kivity <[email protected]>
+ * Yaniv Kamay <[email protected]>
+ *
+ */
+
+#define CPU_BASED_VIRTUAL_INTR_PENDING 0x00000004
+#define CPU_BASED_USE_TSC_OFFSETING 0x00000008
+#define CPU_BASED_HLT_EXITING 0x00000080
+#define CPU_BASED_INVDPG_EXITING 0x00000200
+#define CPU_BASED_MWAIT_EXITING 0x00000400
+#define CPU_BASED_RDPMC_EXITING 0x00000800
+#define CPU_BASED_RDTSC_EXITING 0x00001000
+#define CPU_BASED_CR8_LOAD_EXITING 0x00080000
+#define CPU_BASED_CR8_STORE_EXITING 0x00100000
+#define CPU_BASED_TPR_SHADOW 0x00200000
+#define CPU_BASED_MOV_DR_EXITING 0x00800000
+#define CPU_BASED_UNCOND_IO_EXITING 0x01000000
+#define CPU_BASED_ACTIVATE_IO_BITMAP 0x02000000
+#define CPU_BASED_MSR_BITMAPS 0x10000000
+#define CPU_BASED_MONITOR_EXITING 0x20000000
+#define CPU_BASED_PAUSE_EXITING 0x40000000
+
+#define PIN_BASED_EXT_INTR_MASK 0x1
+#define PIN_BASED_NMI_EXITING 0x8
+
+#define VM_EXIT_ACK_INTR_ON_EXIT 0x00008000
+#define VM_EXIT_HOST_ADD_SPACE_SIZE 0x00000200
+
+
+/* VMCS Encodings */
+enum vmcs_field {
+ GUEST_ES_SELECTOR = 0x00000800,
+ GUEST_CS_SELECTOR = 0x00000802,
+ GUEST_SS_SELECTOR = 0x00000804,
+ GUEST_DS_SELECTOR = 0x00000806,
+ GUEST_FS_SELECTOR = 0x00000808,
+ GUEST_GS_SELECTOR = 0x0000080a,
+ GUEST_LDTR_SELECTOR = 0x0000080c,
+ GUEST_TR_SELECTOR = 0x0000080e,
+ HOST_ES_SELECTOR = 0x00000c00,
+ HOST_CS_SELECTOR = 0x00000c02,
+ HOST_SS_SELECTOR = 0x00000c04,
+ HOST_DS_SELECTOR = 0x00000c06,
+ HOST_FS_SELECTOR = 0x00000c08,
+ HOST_GS_SELECTOR = 0x00000c0a,
+ HOST_TR_SELECTOR = 0x00000c0c,
+ IO_BITMAP_A = 0x00002000,
+ IO_BITMAP_A_HIGH = 0x00002001,
+ IO_BITMAP_B = 0x00002002,
+ IO_BITMAP_B_HIGH = 0x00002003,
+ MSR_BITMAP = 0x00002004,
+ MSR_BITMAP_HIGH = 0x00002005,
+ VM_EXIT_MSR_STORE_ADDR = 0x00002006,
+ VM_EXIT_MSR_STORE_ADDR_HIGH = 0x00002007,
+ VM_EXIT_MSR_LOAD_ADDR = 0x00002008,
+ VM_EXIT_MSR_LOAD_ADDR_HIGH = 0x00002009,
+ VM_ENTRY_MSR_LOAD_ADDR = 0x0000200a,
+ VM_ENTRY_MSR_LOAD_ADDR_HIGH = 0x0000200b,
+ TSC_OFFSET = 0x00002010,
+ TSC_OFFSET_HIGH = 0x00002011,
+ VIRTUAL_APIC_PAGE_ADDR = 0x00002012,
+ VIRTUAL_APIC_PAGE_ADDR_HIGH = 0x00002013,
+ VMCS_LINK_POINTER = 0x00002800,
+ VMCS_LINK_POINTER_HIGH = 0x00002801,
+ GUEST_IA32_DEBUGCTL = 0x00002802,
+ GUEST_IA32_DEBUGCTL_HIGH = 0x00002803,
+ PIN_BASED_VM_EXEC_CONTROL = 0x00004000,
+ CPU_BASED_VM_EXEC_CONTROL = 0x00004002,
+ EXCEPTION_BITMAP = 0x00004004,
+ PAGE_FAULT_ERROR_CODE_MASK = 0x00004006,
+ PAGE_FAULT_ERROR_CODE_MATCH = 0x00004008,
+ CR3_TARGET_COUNT = 0x0000400a,
+ VM_EXIT_CONTROLS = 0x0000400c,
+ VM_EXIT_MSR_STORE_COUNT = 0x0000400e,
+ VM_EXIT_MSR_LOAD_COUNT = 0x00004010,
+ VM_ENTRY_CONTROLS = 0x00004012,
+ VM_ENTRY_MSR_LOAD_COUNT = 0x00004014,
+ VM_ENTRY_INTR_INFO_FIELD = 0x00004016,
+ VM_ENTRY_EXCEPTION_ERROR_CODE = 0x00004018,
+ VM_ENTRY_INSTRUCTION_LEN = 0x0000401a,
+ TPR_THRESHOLD = 0x0000401c,
+ SECONDARY_VM_EXEC_CONTROL = 0x0000401e,
+ VM_INSTRUCTION_ERROR = 0x00004400,
+ VM_EXIT_REASON = 0x00004402,
+ VM_EXIT_INTR_INFO = 0x00004404,
+ VM_EXIT_INTR_ERROR_CODE = 0x00004406,
+ IDT_VECTORING_INFO_FIELD = 0x00004408,
+ IDT_VECTORING_ERROR_CODE = 0x0000440a,
+ VM_EXIT_INSTRUCTION_LEN = 0x0000440c,
+ VMX_INSTRUCTION_INFO = 0x0000440e,
+ GUEST_ES_LIMIT = 0x00004800,
+ GUEST_CS_LIMIT = 0x00004802,
+ GUEST_SS_LIMIT = 0x00004804,
+ GUEST_DS_LIMIT = 0x00004806,
+ GUEST_FS_LIMIT = 0x00004808,
+ GUEST_GS_LIMIT = 0x0000480a,
+ GUEST_LDTR_LIMIT = 0x0000480c,
+ GUEST_TR_LIMIT = 0x0000480e,
+ GUEST_GDTR_LIMIT = 0x00004810,
+ GUEST_IDTR_LIMIT = 0x00004812,
+ GUEST_ES_AR_BYTES = 0x00004814,
+ GUEST_CS_AR_BYTES = 0x00004816,
+ GUEST_SS_AR_BYTES = 0x00004818,
+ GUEST_DS_AR_BYTES = 0x0000481a,
+ GUEST_FS_AR_BYTES = 0x0000481c,
+ GUEST_GS_AR_BYTES = 0x0000481e,
+ GUEST_LDTR_AR_BYTES = 0x00004820,
+ GUEST_TR_AR_BYTES = 0x00004822,
+ GUEST_INTERRUPTIBILITY_INFO = 0x00004824,
+ GUEST_ACTIVITY_STATE = 0x00004826,
+ GUEST_SYSENTER_CS = 0x0000482A,
+ HOST_IA32_SYSENTER_CS = 0x00004c00,
+ CR0_GUEST_HOST_MASK = 0x00006000,
+ CR4_GUEST_HOST_MASK = 0x00006002,
+ CR0_READ_SHADOW = 0x00006004,
+ CR4_READ_SHADOW = 0x00006006,
+ CR3_TARGET_VALUE0 = 0x00006008,
+ CR3_TARGET_VALUE1 = 0x0000600a,
+ CR3_TARGET_VALUE2 = 0x0000600c,
+ CR3_TARGET_VALUE3 = 0x0000600e,
+ EXIT_QUALIFICATION = 0x00006400,
+ GUEST_LINEAR_ADDRESS = 0x0000640a,
+ GUEST_CR0 = 0x00006800,
+ GUEST_CR3 = 0x00006802,
+ GUEST_CR4 = 0x00006804,
+ GUEST_ES_BASE = 0x00006806,
+ GUEST_CS_BASE = 0x00006808,
+ GUEST_SS_BASE = 0x0000680a,
+ GUEST_DS_BASE = 0x0000680c,
+ GUEST_FS_BASE = 0x0000680e,
+ GUEST_GS_BASE = 0x00006810,
+ GUEST_LDTR_BASE = 0x00006812,
+ GUEST_TR_BASE = 0x00006814,
+ GUEST_GDTR_BASE = 0x00006816,
+ GUEST_IDTR_BASE = 0x00006818,
+ GUEST_DR7 = 0x0000681a,
+ GUEST_RSP = 0x0000681c,
+ GUEST_RIP = 0x0000681e,
+ GUEST_RFLAGS = 0x00006820,
+ GUEST_PENDING_DBG_EXCEPTIONS = 0x00006822,
+ GUEST_SYSENTER_ESP = 0x00006824,
+ GUEST_SYSENTER_EIP = 0x00006826,
+ HOST_CR0 = 0x00006c00,
+ HOST_CR3 = 0x00006c02,
+ HOST_CR4 = 0x00006c04,
+ HOST_FS_BASE = 0x00006c06,
+ HOST_GS_BASE = 0x00006c08,
+ HOST_TR_BASE = 0x00006c0a,
+ HOST_GDTR_BASE = 0x00006c0c,
+ HOST_IDTR_BASE = 0x00006c0e,
+ HOST_IA32_SYSENTER_ESP = 0x00006c10,
+ HOST_IA32_SYSENTER_EIP = 0x00006c12,
+ HOST_RSP = 0x00006c14,
+ HOST_RIP = 0x00006c16,
+};
+
+#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000
+
+#define EXIT_REASON_EXCEPTION_NMI 0
+#define EXIT_REASON_EXTERNAL_INTERRUPT 1
+
+#define EXIT_REASON_PENDING_INTERRUPT 7
+
+#define EXIT_REASON_TASK_SWITCH 9
+#define EXIT_REASON_CPUID 10
+#define EXIT_REASON_HLT 12
+#define EXIT_REASON_INVLPG 14
+#define EXIT_REASON_RDPMC 15
+#define EXIT_REASON_RDTSC 16
+#define EXIT_REASON_VMCALL 18
+#define EXIT_REASON_VMCLEAR 19
+#define EXIT_REASON_VMLAUNCH 20
+#define EXIT_REASON_VMPTRLD 21
+#define EXIT_REASON_VMPTRST 22
+#define EXIT_REASON_VMREAD 23
+#define EXIT_REASON_VMRESUME 24
+#define EXIT_REASON_VMWRITE 25
+#define EXIT_REASON_VMOFF 26
+#define EXIT_REASON_VMON 27
+#define EXIT_REASON_CR_ACCESS 28
+#define EXIT_REASON_DR_ACCESS 29
+#define EXIT_REASON_IO_INSTRUCTION 30
+#define EXIT_REASON_MSR_READ 31
+#define EXIT_REASON_MSR_WRITE 32
+#define EXIT_REASON_MWAIT_INSTRUCTION 36
+
+/*
+ * Interruption-information format
+ */
+#define INTR_INFO_VECTOR_MASK 0xff /* 7:0 */
+#define INTR_INFO_INTR_TYPE_MASK 0x700 /* 10:8 */
+#define INTR_INFO_DELIEVER_CODE_MASK 0x800 /* 11 */
+#define INTR_INFO_VALID_MASK 0x80000000 /* 31 */
+
+#define VECTORING_INFO_VECTOR_MASK INTR_INFO_VECTOR_MASK
+#define VECTORING_INFO_TYPE_MASK INTR_INFO_INTR_TYPE_MASK
+#define VECTORING_INFO_DELIEVER_CODE_MASK INTR_INFO_DELIEVER_CODE_MASK
+#define VECTORING_INFO_VALID_MASK INTR_INFO_VALID_MASK
+
+#define INTR_TYPE_EXT_INTR (0 << 8) /* external interrupt */
+#define INTR_TYPE_EXCEPTION (3 << 8) /* processor exception */
+
+/*
+ * Exit Qualifications for MOV for Control Register Access
+ */
+#define CONTROL_REG_ACCESS_NUM 0x7 /* 2:0, number of control register */
+#define CONTROL_REG_ACCESS_TYPE 0x30 /* 5:4, access type */
+#define CONTROL_REG_ACCESS_REG 0xf00 /* 11:8, general purpose register */
+#define LMSW_SOURCE_DATA_SHIFT 16
+#define LMSW_SOURCE_DATA (0xFFFF << LMSW_SOURCE_DATA_SHIFT) /* 31:16, lmsw source */
+#define REG_EAX (0 << 8)
+#define REG_ECX (1 << 8)
+#define REG_EDX (2 << 8)
+#define REG_EBX (3 << 8)
+#define REG_ESP (4 << 8)
+#define REG_EBP (5 << 8)
+#define REG_ESI (6 << 8)
+#define REG_EDI (7 << 8)
+#define REG_R8 (8 << 8)
+#define REG_R9 (9 << 8)
+#define REG_R10 (10 << 8)
+#define REG_R11 (11 << 8)
+#define REG_R12 (12 << 8)
+#define REG_R13 (13 << 8)
+#define REG_R14 (14 << 8)
+#define REG_R15 (15 << 8)
+
+/*
+ * Exit Qualifications for MOV for Debug Register Access
+ */
+#define DEBUG_REG_ACCESS_NUM 0x7 /* 2:0, number of debug register */
+#define DEBUG_REG_ACCESS_TYPE 0x10 /* 4, direction of access */
+#define TYPE_MOV_TO_DR (0 << 4)
+#define TYPE_MOV_FROM_DR (1 << 4)
+#define DEBUG_REG_ACCESS_REG 0xf00 /* 11:8, general purpose register */
+
+
+/* segment AR */
+#define SEGMENT_AR_L_MASK (1 << 13)
+
+/* entry controls */
+#define VM_ENTRY_CONTROLS_IA32E_MASK (1 << 9)
+
+#define AR_TYPE_ACCESSES_MASK 1
+#define AR_TYPE_READABLE_MASK (1 << 1)
+#define AR_TYPE_WRITEABLE_MASK (1 << 2)
+#define AR_TYPE_CODE_MASK (1 << 3)
+#define AR_TYPE_MASK 0x0f
+#define AR_TYPE_BUSY_64_TSS 11
+#define AR_TYPE_BUSY_32_TSS 11
+#define AR_TYPE_BUSY_16_TSS 3
+#define AR_TYPE_LDT 2
+
+#define AR_UNUSABLE_MASK (1 << 16)
+#define AR_S_MASK (1 << 4)
+#define AR_P_MASK (1 << 7)
+#define AR_L_MASK (1 << 13)
+#define AR_DB_MASK (1 << 14)
+#define AR_G_MASK (1 << 15)
+#define AR_DPL_SHIFT 5
+#define AR_DPL(ar) (((ar) >> AR_DPL_SHIFT) & 3)
+
+#define AR_RESERVD_MASK 0xfffe0f00
+
+#endif


2006-10-19 13:50:01

by Avi Kivity

Subject: [PATCH 3/7] KVM: kvm data structures

Define data structures and some constants for a virtual machine.

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/kvm.h
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/kvm.h
@@ -0,0 +1,206 @@
+#ifndef __KVM_H
+#define __KVM_H
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+
+#define INVALID_PAGE (~(hpa_t)0)
+#define UNMAPPED_GVA (~(gpa_t)0)
+
+#define KVM_MAX_VCPUS 1
+#define KVM_MEMORY_SLOTS 4
+#define KVM_NUM_MMU_PAGES 256
+
+#define FX_IMAGE_SIZE 512
+#define FX_IMAGE_ALIGN 16
+#define FX_BUF_SIZE (2 * FX_IMAGE_SIZE + FX_IMAGE_ALIGN)
+
+/*
+ * Address types:
+ *
+ * gva - guest virtual address
+ * gpa - guest physical address
+ * gfn - guest frame number
+ * hva - host virtual address
+ * hpa - host physical address
+ * hfn - host frame number
+ */
+
+typedef unsigned long gva_t;
+typedef u64 gpa_t;
+typedef unsigned long gfn_t;
+
+typedef unsigned long hva_t;
+typedef u64 hpa_t;
+typedef unsigned long hfn_t;
+
+struct kvm_mmu_page {
+ struct list_head link;
+ hpa_t page_hpa;
+ unsigned long slot_bitmap; /* One bit set per slot which has memory
+ * in this shadow page.
+ */
+ int global; /* Set if all ptes in this page are global */
+ u64 *parent_pte;
+};
+
+struct vmcs {
+ u32 revision_id;
+ u32 abort;
+ char data[0];
+};
+
+struct vmx_msr_entry {
+ u32 index;
+ u32 reserved;
+ u64 data;
+};
+
+struct kvm_vcpu;
+
+/*
+ * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
+ * 32-bit). The kvm_mmu structure abstracts the details of the current mmu
+ * mode.
+ */
+struct kvm_mmu {
+ void (*new_cr3)(struct kvm_vcpu *vcpu);
+ int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
+ void (*inval_page)(struct kvm_vcpu *vcpu, gva_t gva);
+ void (*free)(struct kvm_vcpu *vcpu);
+ gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva);
+ hpa_t root_hpa;
+ int root_level;
+ int shadow_root_level;
+};
+
+struct kvm_guest_debug {
+ int enabled;
+ unsigned long bp[4];
+ int singlestep;
+};
+
+enum {
+ VCPU_REGS_RAX = 0,
+ VCPU_REGS_RCX = 1,
+ VCPU_REGS_RDX = 2,
+ VCPU_REGS_RBX = 3,
+ VCPU_REGS_RSP = 4,
+ VCPU_REGS_RBP = 5,
+ VCPU_REGS_RSI = 6,
+ VCPU_REGS_RDI = 7,
+#ifdef __x86_64__
+ VCPU_REGS_R8 = 8,
+ VCPU_REGS_R9 = 9,
+ VCPU_REGS_R10 = 10,
+ VCPU_REGS_R11 = 11,
+ VCPU_REGS_R12 = 12,
+ VCPU_REGS_R13 = 13,
+ VCPU_REGS_R14 = 14,
+ VCPU_REGS_R15 = 15,
+#endif
+ NR_VCPU_REGS
+};
+
+struct kvm_vcpu {
+ struct kvm *kvm;
+ struct vmcs *vmcs;
+ struct mutex mutex;
+ int cpu;
+ int launched;
+ unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
+#define NR_IRQ_WORDS (256 / BITS_PER_LONG)
+ unsigned long irq_pending[NR_IRQ_WORDS];
+ unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
+ unsigned long rip; /* needs vcpu_load_rsp_rip() */
+
+ unsigned long cr2;
+ unsigned long cr3;
+ unsigned long cr8;
+ u64 shadow_efer;
+ u64 apic_base;
+ struct vmx_msr_entry *guest_msrs;
+ struct vmx_msr_entry *host_msrs;
+
+ struct list_head free_pages;
+ struct kvm_mmu_page page_header_buf[KVM_NUM_MMU_PAGES];
+ struct kvm_mmu mmu;
+
+ struct kvm_guest_debug guest_debug;
+
+ char fx_buf[FX_BUF_SIZE];
+ char *host_fx_image;
+ char *guest_fx_image;
+
+ int mmio_needed;
+ int mmio_read_completed;
+ int mmio_is_write;
+ int mmio_size;
+ unsigned char mmio_data[8];
+ gpa_t mmio_phys_addr;
+
+ struct {
+ int active;
+ u8 save_iopl;
+ struct {
+ unsigned long base;
+ u32 limit;
+ u32 ar;
+ } tr;
+ } rmode;
+};
+
+struct kvm_memory_slot {
+ gfn_t base_gfn;
+ unsigned long npages;
+ unsigned long flags;
+ struct page **phys_mem;
+ unsigned long *dirty_bitmap;
+};
+
+struct kvm {
+ spinlock_t lock; /* protects everything except vcpus */
+ int nmemslots;
+ struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS];
+ struct list_head active_mmu_pages;
+ struct kvm_vcpu vcpus[KVM_MAX_VCPUS];
+ int memory_config_version;
+ int busy;
+};
+
+struct kvm_stat {
+ u32 pf_fixed;
+ u32 pf_guest;
+ u32 tlb_flush;
+ u32 invlpg;
+
+ u32 exits;
+ u32 io_exits;
+ u32 mmio_exits;
+ u32 signal_exits;
+ u32 irq_exits;
+};
+
+extern struct kvm_stat kvm_stat;
+
+#define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
+#define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
+
+void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
+int kvm_mmu_init(struct kvm_vcpu *vcpu);
+
+int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+
+hpa_t gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa);
+#define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
+#define HPA_ERR_MASK ((hpa_t)1 << HPA_MSB)
+static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
+hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva);
+
+extern hpa_t bad_page_address;
+
+#endif


2006-10-19 13:54:11

by Avi Kivity

Subject: [PATCH 5/7] KVM: mmu virtualization

This patch contains the shadow page table code.

This is a fairly naive implementation that uses the tlb management
instructions to keep the shadow page tables in sync with the guest page
tables:

- invlpg: remove the shadow pte for the given virtual address
- tlb flush: remove all shadow ptes for non-global pages

The relative simplicity of the approach comes at a price: every guest
address space switch needs to rebuild the shadow page tables for the new
address space.

Other noteworthy items:

- the dirty bit is emulated by mapping non-dirty, writable pages as
read-only. The first write will set the dirty bit and remap the page as
writable
- we support both 32-bit and 64-bit guest ptes
- the host ptes are always 64-bit, even on non-pae i386 hosts

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/mmu.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/mmu.c
@@ -0,0 +1,718 @@
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * This module enables machines with Intel VT-x extensions to run virtual
+ * machines without emulation or binary translation.
+ *
+ * MMU support
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ *
+ * Authors:
+ * Yaniv Kamay <[email protected]>
+ * Avi Kivity <[email protected]>
+ *
+ */
+#include <linux/types.h>
+#include <linux/string.h>
+#include <asm/page.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+
+#include "vmx.h"
+#include "kvm.h"
+
+#define pgprintk(x...) do { } while (0)
+
+#define ASSERT(x) \
+ if (!(x)) { \
+ printk("assertion failed %s:%d: %s\n", __FILE__, __LINE__, #x);\
+ }
+
+#define PT64_ENT_PER_PAGE 512
+#define PT32_ENT_PER_PAGE 1024
+
+#define PT_WRITABLE_SHIFT 1
+
+#define PT_PRESENT_MASK (1ULL << 0)
+#define PT_WRITABLE_MASK (1ULL << PT_WRITABLE_SHIFT)
+#define PT_USER_MASK (1ULL << 2)
+#define PT_PWT_MASK (1ULL << 3)
+#define PT_PCD_MASK (1ULL << 4)
+#define PT_ACCESSED_MASK (1ULL << 5)
+#define PT_DIRTY_MASK (1ULL << 6)
+#define PT_PAGE_SIZE_MASK (1ULL << 7)
+#define PT_PAT_MASK (1ULL << 7)
+#define PT_GLOBAL_MASK (1ULL << 8)
+#define PT64_NX_MASK (1ULL << 63)
+
+#define PT_PAT_SHIFT 7
+#define PT_DIR_PAT_SHIFT 12
+#define PT_DIR_PAT_MASK (1ULL << PT_DIR_PAT_SHIFT)
+
+#define PT32_DIR_PSE36_SIZE 4
+#define PT32_DIR_PSE36_SHIFT 13
+#define PT32_DIR_PSE36_MASK (((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
+
+
+#define PT32_PTE_COPY_MASK \
+ (PT_PRESENT_MASK | PT_PWT_MASK | PT_PCD_MASK | \
+ PT_ACCESSED_MASK | PT_DIRTY_MASK | PT_PAT_MASK | \
+ PT_GLOBAL_MASK )
+
+#define PT32_NON_PTE_COPY_MASK \
+ (PT_PRESENT_MASK | PT_PWT_MASK | PT_PCD_MASK | \
+ PT_ACCESSED_MASK | PT_DIRTY_MASK)
+
+
+#define PT64_PTE_COPY_MASK \
+ (PT64_NX_MASK | PT32_PTE_COPY_MASK)
+
+#define PT64_NON_PTE_COPY_MASK \
+ (PT64_NX_MASK | PT32_NON_PTE_COPY_MASK)
+
+
+
+#define PT_FIRST_AVAIL_BITS_SHIFT 9
+#define PT64_SECOND_AVAIL_BITS_SHIFT 52
+
+#define PT_SHADOW_PS_MARK (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define PT_SHADOW_IO_MARK (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+
+#define PT_SHADOW_WRITABLE_SHIFT (PT_FIRST_AVAIL_BITS_SHIFT + 1)
+#define PT_SHADOW_WRITABLE_MASK (1ULL << PT_SHADOW_WRITABLE_SHIFT)
+
+#define PT_SHADOW_USER_SHIFT (PT_SHADOW_WRITABLE_SHIFT + 1)
+#define PT_SHADOW_USER_MASK (1ULL << (PT_SHADOW_USER_SHIFT))
+
+#define PT_SHADOW_BITS_OFFSET (PT_SHADOW_WRITABLE_SHIFT - PT_WRITABLE_SHIFT)
+
+#define VALID_PAGE(x) ((x) != INVALID_PAGE)
+
+#define PT64_LEVEL_BITS 9
+
+#define PT64_LEVEL_SHIFT(level) \
+ ( PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS )
+
+#define PT64_LEVEL_MASK(level) \
+ (((1ULL << PT64_LEVEL_BITS) - 1) << PT64_LEVEL_SHIFT(level))
+
+#define PT64_INDEX(address, level)\
+ (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
+
+
+#define PT32_LEVEL_BITS 10
+
+#define PT32_LEVEL_SHIFT(level) \
+ ( PAGE_SHIFT + (level - 1) * PT32_LEVEL_BITS )
+
+#define PT32_LEVEL_MASK(level) \
+ (((1ULL << PT32_LEVEL_BITS) - 1) << PT32_LEVEL_SHIFT(level))
+
+#define PT32_INDEX(address, level)\
+ (((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
+
+
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & PAGE_MASK)
+#define PT64_DIR_BASE_ADDR_MASK \
+ (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + PT64_LEVEL_BITS)) - 1))
+
+#define PT32_BASE_ADDR_MASK PAGE_MASK
+#define PT32_DIR_BASE_ADDR_MASK \
+ (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
+
+
+#define PFERR_PRESENT_MASK (1U << 0)
+#define PFERR_WRITE_MASK (1U << 1)
+#define PFERR_USER_MASK (1U << 2)
+
+#define PT64_ROOT_LEVEL 4
+#define PT32_ROOT_LEVEL 2
+#define PT32E_ROOT_LEVEL 3
+
+#define PT_DIRECTORY_LEVEL 2
+#define PT_PAGE_TABLE_LEVEL 1
+
+static int is_write_protection(void)
+{
+ return guest_cr0() & CR0_WP_MASK;
+}
+
+static int is_cpuid_PSE36(void)
+{
+ return 1;
+}
+
+static int is_present_pte(unsigned long pte)
+{
+ return pte & PT_PRESENT_MASK;
+}
+
+static int is_writeble_pte(unsigned long pte)
+{
+ return pte & PT_WRITABLE_MASK;
+}
+
+static int is_io_pte(unsigned long pte)
+{
+ return pte & PT_SHADOW_IO_MARK;
+}
+
+static void kvm_mmu_free_page(struct kvm_vcpu *vcpu, hpa_t page_hpa)
+{
+ struct kvm_mmu_page *page_head = page_header(page_hpa);
+
+ list_del(&page_head->link);
+ page_head->page_hpa = page_hpa;
+ list_add(&page_head->link, &vcpu->free_pages);
+}
+
+static int is_empty_shadow_page(hpa_t page_hpa)
+{
+ u32 *pos;
+ u32 *end;
+ for (pos = __va(page_hpa), end = pos + PAGE_SIZE / sizeof(u32);
+ pos != end; pos++)
+ if (*pos != 0)
+ return 0;
+ return 1;
+}
+
+static hpa_t kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, u64 *parent_pte)
+{
+ struct kvm_mmu_page *page;
+
+ if (list_empty(&vcpu->free_pages))
+ return INVALID_PAGE;
+
+ page = list_entry(vcpu->free_pages.next, struct kvm_mmu_page, link);
+ list_del(&page->link);
+ list_add(&page->link, &vcpu->kvm->active_mmu_pages);
+ ASSERT(is_empty_shadow_page(page->page_hpa));
+ page->slot_bitmap = 0;
+ page->global = 1;
+ page->parent_pte = parent_pte;
+ return page->page_hpa;
+}
+
+static void page_header_update_slot(struct kvm *kvm, void *pte, gpa_t gpa)
+{
+ int slot = memslot_id(kvm, gfn_to_memslot(kvm, gpa >> PAGE_SHIFT));
+ struct kvm_mmu_page *page_head = page_header(__pa(pte));
+
+ __set_bit(slot, &page_head->slot_bitmap);
+}
+
+hpa_t safe_gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ hpa_t hpa = gpa_to_hpa(vcpu, gpa);
+
+ return is_error_hpa(hpa) ? bad_page_address | (gpa & ~PAGE_MASK): hpa;
+}
+
+hpa_t gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ struct kvm_memory_slot *slot;
+ struct page *page;
+
+ ASSERT((gpa & HPA_ERR_MASK) == 0);
+ slot = gfn_to_memslot(vcpu->kvm, gpa >> PAGE_SHIFT);
+ if (!slot)
+ return gpa | HPA_ERR_MASK;
+ page = gfn_to_page(slot, gpa >> PAGE_SHIFT);
+ return (page_to_pfn(page) << PAGE_SHIFT) | (gpa & (PAGE_SIZE-1));
+}
+
+hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva)
+{
+ gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, gva);
+
+ if (gpa == UNMAPPED_GVA)
+ return UNMAPPED_GVA;
+ return gpa_to_hpa(vcpu, gpa);
+}
+
+
+static void release_pt_page_64(struct kvm_vcpu *vcpu, hpa_t page_hpa,
+ int level)
+{
+ ASSERT(vcpu);
+ ASSERT(VALID_PAGE(page_hpa));
+ ASSERT(level <= PT64_ROOT_LEVEL && level > 0);
+
+ if (level == 1)
+ memset(__va(page_hpa), 0, PAGE_SIZE);
+ else {
+ u64 *pos;
+ u64 *end;
+
+ for (pos = __va(page_hpa), end = pos + PT64_ENT_PER_PAGE;
+ pos != end; pos++) {
+ u64 current_ent = *pos;
+
+ *pos = 0;
+ if (is_present_pte(current_ent))
+ release_pt_page_64(vcpu,
+ current_ent &
+ PT64_BASE_ADDR_MASK,
+ level - 1);
+ }
+ }
+ kvm_mmu_free_page(vcpu, page_hpa);
+}
+
+static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
+{
+}
+
+static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, hpa_t p)
+{
+ int level = PT32E_ROOT_LEVEL;
+ hpa_t table_addr = vcpu->mmu.root_hpa;
+
+ for (; ; level--) {
+ u32 index = PT64_INDEX(v, level);
+ u64 *table;
+
+ ASSERT(VALID_PAGE(table_addr));
+ table = __va(table_addr);
+
+ if (level == 1) {
+ mark_page_dirty(vcpu->kvm, v >> PAGE_SHIFT);
+ page_header_update_slot(vcpu->kvm, table, v);
+ table[index] = p | PT_PRESENT_MASK | PT_WRITABLE_MASK |
+ PT_USER_MASK;
+ return 0;
+ }
+
+ if (table[index] == 0) {
+ hpa_t new_table = kvm_mmu_alloc_page(vcpu,
+ &table[index]);
+
+ if (!VALID_PAGE(new_table)) {
+ pgprintk("nonpaging_map: ENOMEM\n");
+ return -ENOMEM;
+ }
+
+ if (level == PT32E_ROOT_LEVEL)
+ table[index] = new_table | PT_PRESENT_MASK;
+ else
+ table[index] = new_table | PT_PRESENT_MASK |
+ PT_WRITABLE_MASK | PT_USER_MASK;
+ }
+ table_addr = table[index] & PT64_BASE_ADDR_MASK;
+ }
+}
+
+static void nonpaging_flush(struct kvm_vcpu *vcpu)
+{
+ hpa_t root = vcpu->mmu.root_hpa;
+
+ ++kvm_stat.tlb_flush;
+ pgprintk("nonpaging_flush\n");
+ ASSERT(VALID_PAGE(root));
+ release_pt_page_64(vcpu, root, vcpu->mmu.shadow_root_level);
+ root = kvm_mmu_alloc_page(vcpu, 0);
+ ASSERT(VALID_PAGE(root));
+ vcpu->mmu.root_hpa = root;
+ if (is_paging())
+ root |= (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK));
+ vmcs_writel(GUEST_CR3, root);
+}
+
+static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr)
+{
+ return vaddr;
+}
+
+static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
+ u32 error_code)
+{
+ int ret;
+ gpa_t addr = gva;
+
+ ASSERT(vcpu);
+ ASSERT(VALID_PAGE(vcpu->mmu.root_hpa));
+
+ for (;;) {
+ hpa_t paddr;
+
+ paddr = gpa_to_hpa(vcpu, addr & PT64_BASE_ADDR_MASK);
+
+ if (is_error_hpa(paddr))
+ return 1;
+
+ ret = nonpaging_map(vcpu, addr & PAGE_MASK, paddr);
+ if (ret) {
+ nonpaging_flush(vcpu);
+ continue;
+ }
+ break;
+ }
+ return ret;
+}
+
+static void nonpaging_inval_page(struct kvm_vcpu *vcpu, gva_t addr)
+{
+}
+
+static void nonpaging_free(struct kvm_vcpu *vcpu)
+{
+ hpa_t root;
+
+ ASSERT(vcpu);
+ root = vcpu->mmu.root_hpa;
+ if (VALID_PAGE(root))
+ release_pt_page_64(vcpu, root, vcpu->mmu.shadow_root_level);
+ vcpu->mmu.root_hpa = INVALID_PAGE;
+}
+
+static int nonpaging_init_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *context = &vcpu->mmu;
+
+ context->new_cr3 = nonpaging_new_cr3;
+ context->page_fault = nonpaging_page_fault;
+ context->inval_page = nonpaging_inval_page;
+ context->gva_to_gpa = nonpaging_gva_to_gpa;
+ context->free = nonpaging_free;
+ context->root_level = PT32E_ROOT_LEVEL;
+ context->shadow_root_level = PT32E_ROOT_LEVEL;
+ context->root_hpa = kvm_mmu_alloc_page(vcpu, 0);
+ ASSERT(VALID_PAGE(context->root_hpa));
+ vmcs_writel(GUEST_CR3, context->root_hpa);
+ return 0;
+}
+
+
+static void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_page *page, *npage;
+
+ list_for_each_entry_safe(page, npage, &vcpu->kvm->active_mmu_pages,
+ link) {
+ if (page->global)
+ continue;
+
+ if (!page->parent_pte)
+ continue;
+
+ *page->parent_pte = 0;
+ release_pt_page_64(vcpu, page->page_hpa, 1);
+ }
+ ++kvm_stat.tlb_flush;
+}
+
+static void paging_new_cr3(struct kvm_vcpu *vcpu)
+{
+ kvm_mmu_flush_tlb(vcpu);
+}
+
+static void mark_pagetable_nonglobal(void *shadow_pte)
+{
+ page_header(__pa(shadow_pte))->global = 0;
+}
+
+static inline void set_pte_common(struct kvm_vcpu *vcpu,
+ u64 *shadow_pte,
+ gpa_t gaddr,
+ int dirty,
+ u64 access_bits)
+{
+ hpa_t paddr;
+
+ *shadow_pte |= access_bits << PT_SHADOW_BITS_OFFSET;
+ if (!dirty)
+ access_bits &= ~PT_WRITABLE_MASK;
+
+ if (access_bits & PT_WRITABLE_MASK)
+ mark_page_dirty(vcpu->kvm, gaddr >> PAGE_SHIFT);
+
+ *shadow_pte |= access_bits;
+
+ paddr = gpa_to_hpa(vcpu, gaddr & PT64_BASE_ADDR_MASK);
+
+ if (!(*shadow_pte & PT_GLOBAL_MASK))
+ mark_pagetable_nonglobal(shadow_pte);
+
+ if (is_error_hpa(paddr)) {
+ *shadow_pte |= gaddr;
+ *shadow_pte |= PT_SHADOW_IO_MARK;
+ *shadow_pte &= ~PT_PRESENT_MASK;
+ } else {
+ *shadow_pte |= paddr;
+ page_header_update_slot(vcpu->kvm, shadow_pte, gaddr);
+ }
+}
+
+static void inject_page_fault(struct kvm_vcpu *vcpu,
+ u64 addr,
+ u32 err_code)
+{
+ u32 vect_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+
+ pgprintk("inject_page_fault: 0x%llx err 0x%x\n", addr, err_code);
+
+ ++kvm_stat.pf_guest;
+
+ if (is_page_fault(vect_info)) {
+ printk("inject_page_fault: double fault 0x%llx @ 0x%lx\n",
+ addr, vmcs_readl(GUEST_RIP));
+ vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
+ vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+ DF_VECTOR |
+ INTR_TYPE_EXCEPTION |
+ INTR_INFO_DELIEVER_CODE_MASK |
+ INTR_INFO_VALID_MASK);
+ return;
+ }
+ vcpu->cr2 = addr;
+ vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, err_code);
+ vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+ PF_VECTOR |
+ INTR_TYPE_EXCEPTION |
+ INTR_INFO_DELIEVER_CODE_MASK |
+ INTR_INFO_VALID_MASK);
+
+}
+
+static inline int fix_read_pf(u64 *shadow_ent)
+{
+ if ((*shadow_ent & PT_SHADOW_USER_MASK) &&
+ !(*shadow_ent & PT_USER_MASK)) {
+ /*
+ * If supervisor write protect is disabled, we shadow kernel
+ * pages as user pages so we can trap the write access.
+ */
+ *shadow_ent |= PT_USER_MASK;
+ *shadow_ent &= ~PT_WRITABLE_MASK;
+
+ return 1;
+
+ }
+ return 0;
+}
+
+static int may_access(u64 pte, int write, int user)
+{
+
+ if (user && !(pte & PT_USER_MASK))
+ return 0;
+ if (write && !(pte & PT_WRITABLE_MASK))
+ return 0;
+ return 1;
+}
+
+/*
+ * Remove a shadow pte.
+ */
+static void paging_inval_page(struct kvm_vcpu *vcpu, gva_t addr)
+{
+ hpa_t page_addr = vcpu->mmu.root_hpa;
+ int level = vcpu->mmu.shadow_root_level;
+
+ ++kvm_stat.invlpg;
+
+ for (; ; level--) {
+ u32 index = PT64_INDEX(addr, level);
+ u64 *table = __va(page_addr);
+
+ if (level == PT_PAGE_TABLE_LEVEL) {
+ table[index] = 0;
+ return;
+ }
+
+ if (!is_present_pte(table[index]))
+ return;
+
+ page_addr = table[index] & PT64_BASE_ADDR_MASK;
+
+ if (level == PT_DIRECTORY_LEVEL &&
+ (table[index] & PT_SHADOW_PS_MARK)) {
+ table[index] = 0;
+ release_pt_page_64(vcpu, page_addr, PT_PAGE_TABLE_LEVEL);
+
+ /* flush tlb */
+ vmcs_writel(GUEST_CR3, vcpu->mmu.root_hpa |
+ (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK)));
+ return;
+ }
+ }
+}
+
+static void paging_free(struct kvm_vcpu *vcpu)
+{
+ nonpaging_free(vcpu);
+}
+
+#define PTTYPE 64
+#include "paging_tmpl.h"
+#undef PTTYPE
+
+#define PTTYPE 32
+#include "paging_tmpl.h"
+#undef PTTYPE
+
+static int paging64_init_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *context = &vcpu->mmu;
+
+ ASSERT(is_pae());
+ context->new_cr3 = paging_new_cr3;
+ context->page_fault = paging64_page_fault;
+ context->inval_page = paging_inval_page;
+ context->gva_to_gpa = paging64_gva_to_gpa;
+ context->free = paging_free;
+ context->root_level = PT64_ROOT_LEVEL;
+ context->shadow_root_level = PT64_ROOT_LEVEL;
+ context->root_hpa = kvm_mmu_alloc_page(vcpu, 0);
+ ASSERT(VALID_PAGE(context->root_hpa));
+ vmcs_writel(GUEST_CR3, context->root_hpa |
+ (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK)));
+ return 0;
+}
+
+static int paging32_init_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *context = &vcpu->mmu;
+
+ context->new_cr3 = paging_new_cr3;
+ context->page_fault = paging32_page_fault;
+ context->inval_page = paging_inval_page;
+ context->gva_to_gpa = paging32_gva_to_gpa;
+ context->free = paging_free;
+ context->root_level = PT32_ROOT_LEVEL;
+ context->shadow_root_level = PT32E_ROOT_LEVEL;
+ context->root_hpa = kvm_mmu_alloc_page(vcpu, 0);
+ ASSERT(VALID_PAGE(context->root_hpa));
+ vmcs_writel(GUEST_CR3, context->root_hpa |
+ (vcpu->cr3 & (CR3_PCD_MASK | CR3_WPT_MASK)));
+ return 0;
+}
+
+static int paging32E_init_context(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ if ((ret = paging64_init_context(vcpu)))
+ return ret;
+
+ vcpu->mmu.root_level = PT32E_ROOT_LEVEL;
+ vcpu->mmu.shadow_root_level = PT32E_ROOT_LEVEL;
+ return 0;
+}
+
+static int init_kvm_mmu(struct kvm_vcpu *vcpu)
+{
+ ASSERT(vcpu);
+ ASSERT(!VALID_PAGE(vcpu->mmu.root_hpa));
+
+ if (!is_paging())
+ return nonpaging_init_context(vcpu);
+ else if (is_long_mode())
+ return paging64_init_context(vcpu);
+ else if (is_pae())
+ return paging32E_init_context(vcpu);
+ else
+ return paging32_init_context(vcpu);
+}
+
+static void destroy_kvm_mmu(struct kvm_vcpu *vcpu)
+{
+ ASSERT(vcpu);
+ if (VALID_PAGE(vcpu->mmu.root_hpa)) {
+ vcpu->mmu.free(vcpu);
+ vcpu->mmu.root_hpa = INVALID_PAGE;
+ }
+}
+
+int kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
+{
+ destroy_kvm_mmu(vcpu);
+ return init_kvm_mmu(vcpu);
+}
+
+static void free_mmu_pages(struct kvm_vcpu *vcpu)
+{
+ while (!list_empty(&vcpu->free_pages)) {
+ struct kvm_mmu_page *page;
+
+ page = list_entry(vcpu->free_pages.next,
+ struct kvm_mmu_page, link);
+ list_del(&page->link);
+ __free_page(pfn_to_page(page->page_hpa >> PAGE_SHIFT));
+ page->page_hpa = INVALID_PAGE;
+ }
+}
+
+static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
+{
+ int i;
+
+ ASSERT(vcpu);
+
+ for (i = 0; i < KVM_NUM_MMU_PAGES; i++) {
+ struct page *page;
+ struct kvm_mmu_page *page_header = &vcpu->page_header_buf[i];
+
+ INIT_LIST_HEAD(&page_header->link);
+ if ((page = alloc_page(GFP_KVM_MMU)) == NULL)
+ goto error_1;
+ page->private = (unsigned long)page_header;
+ page_header->page_hpa = page_to_pfn(page) << PAGE_SHIFT;
+ memset(__va(page_header->page_hpa), 0, PAGE_SIZE);
+ list_add(&page_header->link, &vcpu->free_pages);
+ }
+ return 0;
+
+error_1:
+ free_mmu_pages(vcpu);
+ return -ENOMEM;
+}
+
+int kvm_mmu_init(struct kvm_vcpu *vcpu)
+{
+ int r;
+
+ ASSERT(vcpu);
+ ASSERT(!VALID_PAGE(vcpu->mmu.root_hpa));
+ ASSERT(list_empty(&vcpu->free_pages));
+
+ if ((r = alloc_mmu_pages(vcpu)))
+ return r;
+
+ if ((r = init_kvm_mmu(vcpu))) {
+ free_mmu_pages(vcpu);
+ return r;
+ }
+ return 0;
+}
+
+void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
+{
+ ASSERT(vcpu);
+
+ destroy_kvm_mmu(vcpu);
+ free_mmu_pages(vcpu);
+}
+
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+{
+ struct kvm_mmu_page *page;
+
+ list_for_each_entry(page, &kvm->active_mmu_pages, link) {
+ int i;
+ u64 *pt;
+
+ if (!test_bit(slot, &page->slot_bitmap))
+ continue;
+
+ pt = __va(page->page_hpa);
+ for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
+ /* avoid RMW */
+ if (pt[i] & PT_WRITABLE_MASK)
+ pt[i] &= ~PT_WRITABLE_MASK;
+
+ }
+}
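The shadow-page-table walk in nonpaging_map() above (descend a level at a time, allocate missing intermediate tables on demand, install the leaf pte) can be sketched outside the kernel as a plain software page-table walk. This is an illustrative miniature only, not the driver code: names like map_page and alloc_table are invented here, and the 9-bits-per-level / 4KB-page layout is an assumption mirroring x86 long mode.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ENTRIES   512                 /* 9 index bits per level */
#define PRESENT   1ULL
#define ADDR_MASK (~0xfffULL)

/* Allocate one zeroed, page-aligned table; alignment keeps the low
 * 12 bits of the pointer free for flag bits like PRESENT. */
static uint64_t *alloc_table(void)
{
	uint64_t *t = aligned_alloc(4096, ENTRIES * sizeof(uint64_t));

	if (t)
		memset(t, 0, ENTRIES * sizeof(uint64_t));
	return t;
}

/* Index of @va at @level, assuming 4KB pages and 9 bits per level. */
static unsigned idx(uint64_t va, int level)
{
	return (va >> (12 + 9 * (level - 1))) & (ENTRIES - 1);
}

/* Walk from the root, allocating intermediate tables as needed, and
 * install a leaf mapping va -> pa at level 1.  Returns 0 on success,
 * -1 if an allocation fails (analogous to the -ENOMEM path above). */
static int map_page(uint64_t *root, uint64_t va, uint64_t pa)
{
	uint64_t *table = root;

	for (int level = 3; ; level--) {
		unsigned i = idx(va, level);

		if (level == 1) {
			table[i] = (pa & ADDR_MASK) | PRESENT;
			return 0;
		}
		if (!(table[i] & PRESENT)) {
			uint64_t *nt = alloc_table();

			if (!nt)
				return -1;
			table[i] = (uint64_t)(uintptr_t)nt | PRESENT;
		}
		table = (uint64_t *)(uintptr_t)(table[i] & ADDR_MASK);
	}
}
```

The kernel version additionally tracks dirty pages and memory slots at the leaf; the control flow (loop down the levels, allocate-on-miss, write the leaf entry) is the same shape.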
Index: linux-2.6/drivers/kvm/paging_tmpl.h
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/paging_tmpl.h
@@ -0,0 +1,378 @@
+/*
+ * We need the mmu code to access both 32-bit and 64-bit guest ptes,
+ * so the code in this file is compiled twice, once per pte size.
+ */
+
+#if PTTYPE == 64
+ #define pt_element_t u64
+ #define guest_walker guest_walker64
+ #define FNAME(name) paging##64_##name
+ #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+ #define PT_DIR_BASE_ADDR_MASK PT64_DIR_BASE_ADDR_MASK
+ #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+ #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
+ #define PT_LEVEL_MASK(level) PT64_LEVEL_MASK(level)
+ #define PT_PTE_COPY_MASK PT64_PTE_COPY_MASK
+ #define PT_NON_PTE_COPY_MASK PT64_NON_PTE_COPY_MASK
+#elif PTTYPE == 32
+ #define pt_element_t u32
+ #define guest_walker guest_walker32
+ #define FNAME(name) paging##32_##name
+ #define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
+ #define PT_DIR_BASE_ADDR_MASK PT32_DIR_BASE_ADDR_MASK
+ #define PT_INDEX(addr, level) PT32_INDEX(addr, level)
+ #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
+ #define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level)
+ #define PT_PTE_COPY_MASK PT32_PTE_COPY_MASK
+ #define PT_NON_PTE_COPY_MASK PT32_NON_PTE_COPY_MASK
+#else
+ error
+#endif
+
+/*
+ * The guest_walker structure emulates the behavior of the hardware page
+ * table walker.
+ */
+struct guest_walker {
+ int level;
+ pt_element_t *table;
+ pt_element_t inherited_ar;
+};
+
+static void FNAME(init_walker)(struct guest_walker *walker,
+ struct kvm_vcpu *vcpu)
+{
+ hpa_t hpa;
+ struct kvm_memory_slot *slot;
+
+ walker->level = vcpu->mmu.root_level;
+ slot = gfn_to_memslot(vcpu->kvm,
+ (vcpu->cr3 & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+ hpa = safe_gpa_to_hpa(vcpu, vcpu->cr3 & PT64_BASE_ADDR_MASK);
+ walker->table = kmap_atomic(pfn_to_page(hpa >> PAGE_SHIFT), KM_USER0);
+
+ ASSERT((!is_long_mode() && is_pae()) ||
+ (vcpu->cr3 & ~(PAGE_MASK | CR3_FLAGS_MASK)) == 0);
+
+ walker->table = (pt_element_t *)( (unsigned long)walker->table |
+ (unsigned long)(vcpu->cr3 & ~(PAGE_MASK | CR3_FLAGS_MASK)) );
+ walker->inherited_ar = PT_USER_MASK | PT_WRITABLE_MASK;
+}
+
+static void FNAME(release_walker)(struct guest_walker *walker)
+{
+ kunmap_atomic(walker->table, KM_USER0);
+}
+
+static void FNAME(set_pte)(struct kvm_vcpu *vcpu, u64 guest_pte,
+ u64 *shadow_pte, u64 access_bits)
+{
+ ASSERT(*shadow_pte == 0);
+ access_bits &= guest_pte;
+ *shadow_pte = (guest_pte & PT_PTE_COPY_MASK);
+ set_pte_common(vcpu, shadow_pte, guest_pte & PT_BASE_ADDR_MASK,
+ guest_pte & PT_DIRTY_MASK, access_bits);
+}
+
+static void FNAME(set_pde)(struct kvm_vcpu *vcpu, u64 guest_pde,
+ u64 *shadow_pte, u64 access_bits,
+ int index)
+{
+ gpa_t gaddr;
+
+ ASSERT(*shadow_pte == 0);
+ access_bits &= guest_pde;
+ gaddr = (guest_pde & PT_DIR_BASE_ADDR_MASK) + PAGE_SIZE * index;
+ if (PTTYPE == 32 && is_cpuid_PSE36())
+ gaddr |= (guest_pde & PT32_DIR_PSE36_MASK) <<
+ (32 - PT32_DIR_PSE36_SHIFT);
+ *shadow_pte = (guest_pde & PT_NON_PTE_COPY_MASK) |
+ ((guest_pde & PT_DIR_PAT_MASK) >>
+ (PT_DIR_PAT_SHIFT - PT_PAT_SHIFT));
+ set_pte_common(vcpu, shadow_pte, gaddr,
+ guest_pde & PT_DIRTY_MASK, access_bits);
+}
+
+/*
+ * Fetch a guest pte from a specific level in the paging hierarchy.
+ */
+static pt_element_t *FNAME(fetch_guest)(struct kvm_vcpu *vcpu,
+ struct guest_walker *walker,
+ int level,
+ gva_t addr)
+{
+
+ ASSERT(level > 0 && level <= walker->level);
+
+ for (;;) {
+ int index = PT_INDEX(addr, walker->level);
+ hpa_t paddr;
+
+ ASSERT(((unsigned long)walker->table & PAGE_MASK) ==
+ ((unsigned long)&walker->table[index] & PAGE_MASK));
+ if (level == walker->level ||
+ !is_present_pte(walker->table[index]) ||
+ (walker->level == PT_DIRECTORY_LEVEL &&
+ (walker->table[index] & PT_PAGE_SIZE_MASK) &&
+ (PTTYPE == 64 || is_pse())))
+ return &walker->table[index];
+ if (walker->level != 3 || is_long_mode())
+ walker->inherited_ar &= walker->table[index];
+ paddr = safe_gpa_to_hpa(vcpu, walker->table[index] &
+ PT_BASE_ADDR_MASK);
+ kunmap_atomic(walker->table, KM_USER0);
+ walker->table = kmap_atomic(pfn_to_page(paddr >> PAGE_SHIFT),
+ KM_USER0);
+ --walker->level;
+ }
+}
+
+/*
+ * Fetch a shadow pte for a specific level in the paging hierarchy.
+ */
+static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
+ struct guest_walker *walker)
+{
+ hpa_t shadow_addr;
+ int level;
+ u64 *prev_shadow_ent = NULL;
+
+ shadow_addr = vcpu->mmu.root_hpa;
+ level = vcpu->mmu.shadow_root_level;
+
+ for (; ; level--) {
+ u32 index = SHADOW_PT_INDEX(addr, level);
+ u64 *shadow_ent = ((u64 *)__va(shadow_addr)) + index;
+ pt_element_t *guest_ent;
+
+ if (is_present_pte(*shadow_ent) || is_io_pte(*shadow_ent)) {
+ if (level == PT_PAGE_TABLE_LEVEL)
+ return shadow_ent;
+ shadow_addr = *shadow_ent & PT64_BASE_ADDR_MASK;
+ prev_shadow_ent = shadow_ent;
+ continue;
+ }
+
+ if (PTTYPE == 32 && level > PT32_ROOT_LEVEL) {
+ ASSERT(level == PT32E_ROOT_LEVEL);
+ guest_ent = FNAME(fetch_guest)(vcpu, walker,
+ PT32_ROOT_LEVEL, addr);
+ } else
+ guest_ent = FNAME(fetch_guest)(vcpu, walker,
+ level, addr);
+
+ if (!is_present_pte(*guest_ent))
+ return NULL;
+
+ /* Don't set accessed bit on PAE PDPTRs */
+ if (vcpu->mmu.root_level != 3 || walker->level != 3)
+ *guest_ent |= PT_ACCESSED_MASK;
+
+ if (level == PT_PAGE_TABLE_LEVEL) {
+
+ if (walker->level == PT_DIRECTORY_LEVEL) {
+ if (prev_shadow_ent)
+ *prev_shadow_ent |= PT_SHADOW_PS_MARK;
+ FNAME(set_pde)(vcpu, *guest_ent, shadow_ent,
+ walker->inherited_ar,
+ PT_INDEX(addr, PT_PAGE_TABLE_LEVEL));
+ } else {
+ ASSERT(walker->level == PT_PAGE_TABLE_LEVEL);
+ FNAME(set_pte)(vcpu, *guest_ent, shadow_ent,
+ walker->inherited_ar);
+ }
+ return shadow_ent;
+ }
+
+ shadow_addr = kvm_mmu_alloc_page(vcpu, shadow_ent);
+ if (!VALID_PAGE(shadow_addr))
+ return ERR_PTR(-ENOMEM);
+ if (!is_long_mode() && level == 3)
+ *shadow_ent = shadow_addr |
+ (*guest_ent & (PT_PRESENT_MASK | PT_PWT_MASK |
+ PT_PCD_MASK));
+ else {
+ *shadow_ent = shadow_addr |
+ (*guest_ent & PT_NON_PTE_COPY_MASK);
+ *shadow_ent |= (PT_WRITABLE_MASK | PT_USER_MASK);
+ }
+ prev_shadow_ent = shadow_ent;
+ }
+}
+
+/*
+ * The guest faulted for write. We need to
+ *
+ * - check write permissions
+ * - update the guest pte dirty bit
+ * - update our own dirty page tracking structures
+ */
+static int FNAME(fix_write_pf)(struct kvm_vcpu *vcpu,
+ u64 *shadow_ent,
+ struct guest_walker *walker,
+ gva_t addr,
+ int user)
+{
+ pt_element_t *guest_ent;
+ int writable_shadow;
+ gfn_t gfn;
+
+ if (is_writeble_pte(*shadow_ent))
+ return 0;
+
+ writable_shadow = *shadow_ent & PT_SHADOW_WRITABLE_MASK;
+ if (user) {
+ /*
+ * User mode access. Fail if it's a kernel page or a read-only
+ * page.
+ */
+ if (!(*shadow_ent & PT_SHADOW_USER_MASK) || !writable_shadow)
+ return 0;
+ ASSERT(*shadow_ent & PT_USER_MASK);
+ } else
+ /*
+ * Kernel mode access. Fail if it's a read-only page and
+ * supervisor write protection is enabled.
+ */
+ if (!writable_shadow) {
+ if (is_write_protection())
+ return 0;
+ *shadow_ent &= ~PT_USER_MASK;
+ }
+
+ guest_ent = FNAME(fetch_guest)(vcpu, walker, PT_PAGE_TABLE_LEVEL,
+ addr);
+
+ if (!is_present_pte(*guest_ent)) {
+ *shadow_ent = 0;
+ return 0;
+ }
+
+ gfn = (*guest_ent & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
+ mark_page_dirty(vcpu->kvm, gfn);
+ *shadow_ent |= PT_WRITABLE_MASK;
+ *guest_ent |= PT_DIRTY_MASK;
+
+ return 1;
+}
+
+/*
+ * Page fault handler. There are several causes for a page fault:
+ * - there is no shadow pte for the guest pte
+ * - write access through a shadow pte marked read only so that we can set
+ *   the dirty bit
+ * - write access to a shadow pte marked read only so we can update the page
+ *   dirty bitmap, when userspace requests it
+ * - mmio access; in this case we will never install a present shadow pte
+ * - normal guest page fault due to the guest pte marked not present, not
+ * writable, or not executable
+ *
+ * Returns: 1 if we need to emulate the instruction, 0 otherwise
+ */
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
+ u32 error_code)
+{
+ int write_fault = error_code & PFERR_WRITE_MASK;
+ int pte_present = error_code & PFERR_PRESENT_MASK;
+ int user_fault = error_code & PFERR_USER_MASK;
+ struct guest_walker walker;
+ u64 *shadow_pte;
+ int fixed;
+
+ /*
+ * Look up the shadow pte for the faulting address.
+ */
+ for (;;) {
+ FNAME(init_walker)(&walker, vcpu);
+ shadow_pte = FNAME(fetch)(vcpu, addr, &walker);
+ if (IS_ERR(shadow_pte)) { /* must be -ENOMEM */
+ nonpaging_flush(vcpu);
+ FNAME(release_walker)(&walker);
+ continue;
+ }
+ break;
+ }
+
+ /*
+ * The page is not mapped by the guest. Let the guest handle it.
+ */
+ if (!shadow_pte) {
+ inject_page_fault(vcpu, addr, error_code);
+ FNAME(release_walker)(&walker);
+ return 0;
+ }
+
+ /*
+ * Update the shadow pte.
+ */
+ if (write_fault)
+ fixed = FNAME(fix_write_pf)(vcpu, shadow_pte, &walker, addr,
+ user_fault);
+ else
+ fixed = fix_read_pf(shadow_pte);
+
+ FNAME(release_walker)(&walker);
+
+ /*
+ * mmio: emulate if accessible, otherwise it's a guest fault.
+ */
+ if (is_io_pte(*shadow_pte)) {
+ if (may_access(*shadow_pte, write_fault, user_fault))
+ return 1;
+ pgprintk("%s: io work, no access\n", __FUNCTION__);
+ inject_page_fault(vcpu, addr,
+ error_code | PFERR_PRESENT_MASK);
+ return 0;
+ }
+
+ /*
+ * pte not present, guest page fault.
+ */
+ if (pte_present && !fixed) {
+ inject_page_fault(vcpu, addr, error_code);
+ return 0;
+ }
+
+ ++kvm_stat.pf_fixed;
+
+ return 0;
+}
+
+static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr)
+{
+ struct guest_walker walker;
+ pt_element_t guest_pte;
+ gpa_t gpa;
+
+ FNAME(init_walker)(&walker, vcpu);
+ guest_pte = *FNAME(fetch_guest)(vcpu, &walker, PT_PAGE_TABLE_LEVEL,
+ vaddr);
+ FNAME(release_walker)(&walker);
+
+ if (!is_present_pte(guest_pte))
+ return UNMAPPED_GVA;
+
+ if (walker.level == PT_DIRECTORY_LEVEL) {
+ ASSERT((guest_pte & PT_PAGE_SIZE_MASK));
+ ASSERT(PTTYPE == 64 || is_pse());
+
+ gpa = (guest_pte & PT_DIR_BASE_ADDR_MASK) | (vaddr &
+ (PT_LEVEL_MASK(PT_PAGE_TABLE_LEVEL) | ~PAGE_MASK));
+
+ if (PTTYPE == 32 && is_cpuid_PSE36())
+ gpa |= (guest_pte & PT32_DIR_PSE36_MASK) <<
+ (32 - PT32_DIR_PSE36_SHIFT);
+ } else {
+ gpa = (guest_pte & PT_BASE_ADDR_MASK);
+ gpa |= (vaddr & ~PAGE_MASK);
+ }
+
+ return gpa;
+}
+
+#undef pt_element_t
+#undef guest_walker
+#undef FNAME
+#undef PT_BASE_ADDR_MASK
+#undef PT_INDEX
+#undef SHADOW_PT_INDEX
+#undef PT_LEVEL_MASK
+#undef PT_PTE_COPY_MASK
+#undef PT_NON_PTE_COPY_MASK
+#undef PT_DIR_BASE_ADDR_MASK
Index: linux-2.6/drivers/kvm/kvm.h
===================================================================
--- linux-2.6.orig/drivers/kvm/kvm.h
+++ linux-2.6/drivers/kvm/kvm.h
@@ -369,4 +369,19 @@ static inline struct kvm_mmu_page *page_
return (struct kvm_mmu_page *)page->private;
}

+#ifdef __x86_64__
+
+/*
+ * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64. Therefore
+ * we need to allocate shadow page tables in the first 4GB of memory, which
+ * happens to fit the DMA32 zone.
+ */
+#define GFP_KVM_MMU (GFP_KERNEL | __GFP_DMA32)
+
+#else
+
+#define GFP_KVM_MMU GFP_KERNEL
+
+#endif
+
#endif
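The GFP_KVM_MMU comment above captures an invariant worth stating directly: the shadow root installed in place of a 32-bit guest's cr3 must itself lie below 4GB, which is why the allocations come from the DMA32 zone on x86_64. A hypothetical check (the function name is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* A shadow root that must fit a 32-bit guest cr3 image has to be
 * addressable with 32 bits, i.e. below 4GB. */
static int root_fits_32bit_cr3(uint64_t shadow_root_hpa)
{
	return shadow_root_hpa < (1ULL << 32);
}
```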


--
error compiling committee.c: too many arguments to function

2006-10-19 13:56:14

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 7/7] KVM: plumbing

Add a config entry and a Makefile for KVM.

Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/Makefile
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/Makefile
@@ -0,0 +1,6 @@
+#
+# Makefile for Kernel-based Virtual Machine module
+#
+
+kvm-objs := kvm_main.o mmu.o x86_emulate.o
+obj-$(CONFIG_KVM) += kvm.o
Index: linux-2.6/drivers/kvm/Kconfig
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/Kconfig
@@ -0,0 +1,22 @@
+
+menu "Virtualization"
+#
+# KVM configuration
+#
+config KVM
+ tristate "Kernel-based Virtual Machine (KVM) support"
+ depends on X86 && EXPERIMENTAL
+ ---help---
+ Support hosting fully virtualized guest machines using hardware
+ virtualization extensions. You will need a fairly recent Intel
+ processor equipped with VT extensions.
+
+ This module provides access to the hardware capabilities through
+ a character device node named /dev/kvm.
+
+ To compile this as a module, choose M here: the module
+ will be called kvm.
+
+ If unsure, say N.
+
+endmenu
Index: linux-2.6/drivers/Kconfig
===================================================================
--- linux-2.6.orig/drivers/Kconfig
+++ linux-2.6/drivers/Kconfig
@@ -78,4 +78,6 @@ source "drivers/rtc/Kconfig"

source "drivers/dma/Kconfig"

+source "drivers/kvm/Kconfig"
+
endmenu
Index: linux-2.6/drivers/Makefile
===================================================================
--- linux-2.6.orig/drivers/Makefile
+++ linux-2.6/drivers/Makefile
@@ -77,3 +77,4 @@ obj-$(CONFIG_CRYPTO) += crypto/
obj-$(CONFIG_SUPERH) += sh/
obj-$(CONFIG_GENERIC_TIME) += clocksource/
obj-$(CONFIG_DMA_ENGINE) += dma/
+obj-$(CONFIG_KVM) += kvm/

--
error compiling committee.c: too many arguments to function

2006-10-19 13:55:01

by Avi Kivity

[permalink] [raw]
Subject: [PATCH 6/7] KVM: x86 emulator

Add an x86 instruction emulator for kvm.

We need an x86 emulator for the following reasons:

- mmio instructions are intercepted as page faults, with no information about
  the operation to be performed other than the virtual address
- real mode is emulated using the old-fashioned vm86 mode, with no special
  intercepts for the privileged instructions, so we need to emulate mov cr,
  lgdt, and lidt
- we plan to cache shadow page tables in the future, so that a guest context
  switch will not throw away all the mappings we worked so hard to build. But
  caching page tables means write-protecting the guest page tables to keep
  them in sync, so any writes to the guest page tables need to be emulated
The emulator was lifted from the Xen hypervisor.
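The emulator's core idea, visible in the opcode_table/twobyte_table arrays in the patch below, is table-driven decode: each opcode byte maps to a bitmask describing its operands, so one generic ModRM path handles whole instruction families. A hedged miniature of that scheme (flag values and entries here are invented for illustration, not the patch's actual encodings):

```c
#include <assert.h>
#include <stdint.h>

/* Operand-descriptor bits, one per decode property. */
#define BYTE_OP (1u << 0)	/* 8-bit operands */
#define DST_MEM (1u << 1)	/* destination is memory */
#define SRC_REG (1u << 2)	/* source is a register */
#define MODRM   (1u << 3)	/* instruction carries a ModRM byte */

/* Sparse table: unhandled opcodes stay 0, meaning "cannot emulate". */
static const uint8_t opcode_flags[256] = {
	[0x00] = BYTE_OP | DST_MEM | SRC_REG | MODRM,	/* add r/m8, r8   */
	[0x01] = DST_MEM | SRC_REG | MODRM,		/* add r/m32, r32 */
	[0x88] = BYTE_OP | DST_MEM | SRC_REG | MODRM,	/* mov r/m8, r8   */
};

/* Decode is a single lookup; generic code then interprets the bits. */
static uint8_t decode(uint8_t opcode)
{
	return opcode_flags[opcode];
}
```

This is why the emulator only needs a handful of generic operand fetch/writeback routines plus per-opcode execution stubs, instead of one hand-written decoder per instruction.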

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

Index: linux-2.6/drivers/kvm/x86_emulate.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/x86_emulate.c
@@ -0,0 +1,1370 @@
+/******************************************************************************
+ * x86_emulate.c
+ *
+ * Generic x86 (32-bit and 64-bit) instruction decoder and emulator.
+ *
+ * Copyright (c) 2005 Keir Fraser
+ *
+ * Linux coding style, mod r/m decoder, segment base fixes, real-mode
+ * privileged instructions:
+ *
+ * Copyright (C) 2006 Qumranet
+ *
+ * Avi Kivity <[email protected]>
+ * Yaniv Kamay <[email protected]>
+ *
+ * From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4
+ */
+
+#ifndef __KERNEL__
+#include <stdio.h>
+#include <stdint.h>
+#include <public/xen.h>
+#define DPRINTF(_f, _a ...) printf( _f , ## _a )
+#else
+#include "kvm.h"
+#define DPRINTF(x...) do {} while (0)
+#endif
+#include "x86_emulate.h"
+
+/*
+ * Opcode effective-address decode tables.
+ * Note that we only emulate instructions that have at least one memory
+ * operand (excluding implicit stack references). We assume that stack
+ * references and instruction fetches will never occur in special memory
+ * areas that require emulation. So, for example, 'mov <imm>,<reg>' need
+ * not be handled.
+ */
+
+/* Operand sizes: 8-bit operands or specified/overridden size. */
+#define ByteOp (1<<0) /* 8-bit operands. */
+/* Destination operand type. */
+#define ImplicitOps (1<<1) /* Implicit in opcode. No generic decode. */
+#define DstReg (2<<1) /* Register operand. */
+#define DstMem (3<<1) /* Memory operand. */
+#define DstMask (3<<1)
+/* Source operand type. */
+#define SrcNone (0<<3) /* No source operand. */
+#define SrcImplicit (0<<3) /* Source operand is implicit in the opcode. */
+#define SrcReg (1<<3) /* Register operand. */
+#define SrcMem (2<<3) /* Memory operand. */
+#define SrcMem16 (3<<3) /* Memory operand (16-bit). */
+#define SrcMem32 (4<<3) /* Memory operand (32-bit). */
+#define SrcImm (5<<3) /* Immediate operand. */
+#define SrcImmByte (6<<3) /* 8-bit sign-extended immediate operand. */
+#define SrcMask (7<<3)
+/* Generic ModRM decode. */
+#define ModRM (1<<6)
+/* Destination is only written; never read. */
+#define Mov (1<<7)
+
+static u8 opcode_table[256] = {
+ /* 0x00 - 0x07 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x08 - 0x0F */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x10 - 0x17 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x18 - 0x1F */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x20 - 0x27 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x28 - 0x2F */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x30 - 0x37 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x38 - 0x3F */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
+ 0, 0, 0, 0,
+ /* 0x40 - 0x4F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x50 - 0x5F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x60 - 0x6F */
+ 0, 0, 0, DstReg | SrcMem32 | ModRM | Mov /* movsxd (x86/64) */ ,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x70 - 0x7F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x80 - 0x87 */
+ ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImm | ModRM,
+ ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImmByte | ModRM,
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
+ /* 0x88 - 0x8F */
+ ByteOp | DstMem | SrcReg | ModRM | Mov, DstMem | SrcReg | ModRM | Mov,
+ ByteOp | DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ 0, 0, 0, DstMem | SrcNone | ModRM | Mov,
+ /* 0x90 - 0x9F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xA0 - 0xA7 */
+ ByteOp | DstReg | SrcMem | Mov, DstReg | SrcMem | Mov,
+ ByteOp | DstMem | SrcReg | Mov, DstMem | SrcReg | Mov,
+ ByteOp | ImplicitOps | Mov, ImplicitOps | Mov,
+ ByteOp | ImplicitOps, ImplicitOps,
+ /* 0xA8 - 0xAF */
+ 0, 0, ByteOp | ImplicitOps | Mov, ImplicitOps | Mov,
+ ByteOp | ImplicitOps | Mov, ImplicitOps | Mov,
+ ByteOp | ImplicitOps, ImplicitOps,
+ /* 0xB0 - 0xBF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xC0 - 0xC7 */
+ ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImmByte | ModRM, 0, 0,
+ 0, 0, ByteOp | DstMem | SrcImm | ModRM | Mov,
+ DstMem | SrcImm | ModRM | Mov,
+ /* 0xC8 - 0xCF */
+ 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xD0 - 0xD7 */
+ ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM,
+ ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM,
+ 0, 0, 0, 0,
+ /* 0xD8 - 0xDF */
+ 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xE0 - 0xEF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xF0 - 0xF7 */
+ 0, 0, 0, 0,
+ 0, 0, ByteOp | DstMem | SrcNone | ModRM, DstMem | SrcNone | ModRM,
+ /* 0xF8 - 0xFF */
+ 0, 0, 0, 0,
+ 0, 0, ByteOp | DstMem | SrcNone | ModRM, DstMem | SrcNone | ModRM
+};
+
+static u8 twobyte_table[256] = {
+ /* 0x00 - 0x0F */
+ 0, SrcMem | ModRM | DstReg | Mov, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0,
+ /* 0x10 - 0x1F */
+ 0, 0, 0, 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x20 - 0x2F */
+ ImplicitOps, 0, ImplicitOps, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x30 - 0x3F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x40 - 0x47 */
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ /* 0x48 - 0x4F */
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
+ /* 0x50 - 0x5F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x60 - 0x6F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x70 - 0x7F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x80 - 0x8F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0x90 - 0x9F */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xA0 - 0xA7 */
+ 0, 0, 0, DstMem | SrcReg | ModRM, 0, 0, 0, 0,
+ /* 0xA8 - 0xAF */
+ 0, 0, 0, DstMem | SrcReg | ModRM, 0, 0, 0, 0,
+ /* 0xB0 - 0xB7 */
+ ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM, 0,
+ DstMem | SrcReg | ModRM,
+ 0, 0, ByteOp | DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem16 | ModRM | Mov,
+ /* 0xB8 - 0xBF */
+ 0, 0, DstMem | SrcImmByte | ModRM, DstMem | SrcReg | ModRM,
+ 0, 0, ByteOp | DstReg | SrcMem | ModRM | Mov,
+ DstReg | SrcMem16 | ModRM | Mov,
+ /* 0xC0 - 0xCF */
+ 0, 0, 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xD0 - 0xDF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xE0 - 0xEF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ /* 0xF0 - 0xFF */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+};
+
+/* Type, address-of, and value of an instruction's operand. */
+struct operand {
+ enum { OP_REG, OP_MEM, OP_IMM } type;
+ unsigned int bytes;
+ unsigned long val, orig_val, *ptr;
+};
+
+/* EFLAGS bit definitions. */
+#define EFLG_OF (1<<11)
+#define EFLG_DF (1<<10)
+#define EFLG_SF (1<<7)
+#define EFLG_ZF (1<<6)
+#define EFLG_AF (1<<4)
+#define EFLG_PF (1<<2)
+#define EFLG_CF (1<<0)
+
+/*
+ * Instruction emulation:
+ * Most instructions are emulated directly via a fragment of inline assembly
+ * code. This allows us to save/restore EFLAGS and thus very easily pick up
+ * any modified flags.
+ */
+
+#if defined(__x86_64__)
+#define _LO32 "k" /* force 32-bit operand */
+#define _STK "%%rsp" /* stack pointer */
+#elif defined(__i386__)
+#define _LO32 "" /* force 32-bit operand */
+#define _STK "%%esp" /* stack pointer */
+#endif
+
+/*
+ * These EFLAGS bits are restored from saved value during emulation, and
+ * any changes are written back to the saved value after emulation.
+ */
+#define EFLAGS_MASK (EFLG_OF|EFLG_SF|EFLG_ZF|EFLG_AF|EFLG_PF|EFLG_CF)
+
+/* Before executing instruction: restore necessary bits in EFLAGS. */
+#define _PRE_EFLAGS(_sav, _msk, _tmp) \
+ /* EFLAGS = (_sav & _msk) | (EFLAGS & ~_msk); */ \
+ "push %"_sav"; " \
+ "movl %"_msk",%"_LO32 _tmp"; " \
+ "andl %"_LO32 _tmp",("_STK"); " \
+ "pushf; " \
+ "notl %"_LO32 _tmp"; " \
+ "andl %"_LO32 _tmp",("_STK"); " \
+ "pop %"_tmp"; " \
+ "orl %"_LO32 _tmp",("_STK"); " \
+ "popf; " \
+ /* _sav &= ~msk; */ \
+ "movl %"_msk",%"_LO32 _tmp"; " \
+ "notl %"_LO32 _tmp"; " \
+ "andl %"_LO32 _tmp",%"_sav"; "
+
+/* After executing instruction: write-back necessary bits in EFLAGS. */
+#define _POST_EFLAGS(_sav, _msk, _tmp) \
+ /* _sav |= EFLAGS & _msk; */ \
+ "pushf; " \
+ "pop %"_tmp"; " \
+ "andl %"_msk",%"_LO32 _tmp"; " \
+ "orl %"_LO32 _tmp",%"_sav"; "
+
+/* Raw emulation: instruction has two explicit operands. */
+#define __emulate_2op_nobyte(_op,_src,_dst,_eflags,_wx,_wy,_lx,_ly,_qx,_qy) \
+ do { \
+ unsigned long _tmp; \
+ \
+ switch ((_dst).bytes) { \
+ case 2: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","4","2") \
+ _op"w %"_wx"3,%1; " \
+ _POST_EFLAGS("0","4","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : _wy ((_src).val), "i" (EFLAGS_MASK) ); \
+ break; \
+ case 4: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","4","2") \
+ _op"l %"_lx"3,%1; " \
+ _POST_EFLAGS("0","4","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : _ly ((_src).val), "i" (EFLAGS_MASK) ); \
+ break; \
+ case 8: \
+ __emulate_2op_8byte(_op, _src, _dst, \
+ _eflags, _qx, _qy); \
+ break; \
+ } \
+ } while (0)
+
+#define __emulate_2op(_op,_src,_dst,_eflags,_bx,_by,_wx,_wy,_lx,_ly,_qx,_qy) \
+ do { \
+ unsigned long _tmp; \
+ switch ( (_dst).bytes ) \
+ { \
+ case 1: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","4","2") \
+ _op"b %"_bx"3,%1; " \
+ _POST_EFLAGS("0","4","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : _by ((_src).val), "i" (EFLAGS_MASK) ); \
+ break; \
+ default: \
+ __emulate_2op_nobyte(_op, _src, _dst, _eflags, \
+ _wx, _wy, _lx, _ly, _qx, _qy); \
+ break; \
+ } \
+ } while (0)
+
+/* Source operand is byte-sized and may be restricted to just %cl. */
+#define emulate_2op_SrcB(_op, _src, _dst, _eflags) \
+ __emulate_2op(_op, _src, _dst, _eflags, \
+ "b", "c", "b", "c", "b", "c", "b", "c")
+
+/* Source operand is byte, word, long or quad sized. */
+#define emulate_2op_SrcV(_op, _src, _dst, _eflags) \
+ __emulate_2op(_op, _src, _dst, _eflags, \
+ "b", "q", "w", "r", _LO32, "r", "", "r")
+
+/* Source operand is word, long or quad sized. */
+#define emulate_2op_SrcV_nobyte(_op, _src, _dst, _eflags) \
+ __emulate_2op_nobyte(_op, _src, _dst, _eflags, \
+ "w", "r", _LO32, "r", "", "r")
+
+/* Instruction has only one explicit operand (no source operand). */
+#define emulate_1op(_op, _dst, _eflags) \
+ do { \
+ unsigned long _tmp; \
+ \
+ switch ( (_dst).bytes ) \
+ { \
+ case 1: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","3","2") \
+ _op"b %1; " \
+ _POST_EFLAGS("0","3","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : "i" (EFLAGS_MASK) ); \
+ break; \
+ case 2: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","3","2") \
+ _op"w %1; " \
+ _POST_EFLAGS("0","3","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : "i" (EFLAGS_MASK) ); \
+ break; \
+ case 4: \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","3","2") \
+ _op"l %1; " \
+ _POST_EFLAGS("0","3","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), \
+ "=&r" (_tmp) \
+ : "i" (EFLAGS_MASK) ); \
+ break; \
+ case 8: \
+ __emulate_1op_8byte(_op, _dst, _eflags); \
+ break; \
+ } \
+ } while (0)
+
+/* Emulate an instruction with quadword operands (x86/64 only). */
+#if defined(__x86_64__)
+#define __emulate_2op_8byte(_op, _src, _dst, _eflags, _qx, _qy) \
+ do { \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","4","2") \
+ _op"q %"_qx"3,%1; " \
+ _POST_EFLAGS("0","4","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), "=&r" (_tmp) \
+ : _qy ((_src).val), "i" (EFLAGS_MASK) ); \
+ } while (0)
+
+#define __emulate_1op_8byte(_op, _dst, _eflags) \
+ do { \
+ __asm__ __volatile__ ( \
+ _PRE_EFLAGS("0","3","2") \
+ _op"q %1; " \
+ _POST_EFLAGS("0","3","2") \
+ : "=m" (_eflags), "=m" ((_dst).val), "=&r" (_tmp) \
+ : "i" (EFLAGS_MASK) ); \
+ } while (0)
+
+#elif defined(__i386__)
+#define __emulate_2op_8byte(_op, _src, _dst, _eflags, _qx, _qy)
+#define __emulate_1op_8byte(_op, _dst, _eflags)
+#endif /* __i386__ */
+
+/* Fetch next part of the instruction being emulated. */
+#define insn_fetch(_type, _size, _eip) \
+({ unsigned long _x; \
+ rc = ops->read_std((unsigned long)(_eip) + ctxt->cs_base, &_x, \
+ (_size), ctxt); \
+ if ( rc != 0 ) \
+ goto done; \
+ (_eip) += (_size); \
+ (_type)_x; \
+})
+
+/* Access/update address held in a register, based on addressing mode. */
+#define register_address(base, reg) \
+ ((base) + ((ad_bytes == sizeof(unsigned long)) ? (reg) : \
+ ((reg) & ((1UL << (ad_bytes << 3)) - 1))))
+
+#define register_address_increment(reg, inc) \
+ do { \
+ /* signed type ensures sign extension to long */ \
+ int _inc = (inc); \
+ if ( ad_bytes == sizeof(unsigned long) ) \
+ (reg) += _inc; \
+ else \
+ (reg) = ((reg) & ~((1UL << (ad_bytes << 3)) - 1)) | \
+ (((reg) + _inc) & ((1UL << (ad_bytes << 3)) - 1)); \
+ } while (0)
+
+void *decode_register(u8 modrm_reg, unsigned long *regs,
+ int highbyte_regs)
+{
+ void *p;
+
+ p = &regs[modrm_reg];
+ if (highbyte_regs && modrm_reg >= 4 && modrm_reg < 8)
+ p = (unsigned char *)&regs[modrm_reg & 3] + 1;
+ return p;
+}
+
+static int read_descriptor(struct x86_emulate_ctxt *ctxt,
+ struct x86_emulate_ops *ops,
+ void *ptr,
+ u16 *size, unsigned long *address, int op_bytes)
+{
+ int rc;
+
+ if (op_bytes == 2)
+ op_bytes = 3;
+ *address = 0;
+ rc = ops->read_std((unsigned long)ptr, (unsigned long *)size, 2, ctxt);
+ if (rc)
+ return rc;
+ rc = ops->read_std((unsigned long)ptr + 2, address, op_bytes, ctxt);
+ return rc;
+}
+
+int
+x86_emulate_memop(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
+{
+ u8 b, d, sib, twobyte = 0, rex_prefix = 0;
+ u8 modrm, modrm_mod = 0, modrm_reg = 0, modrm_rm = 0;
+ unsigned long *override_base = NULL;
+ unsigned int op_bytes, ad_bytes, lock_prefix = 0, rep_prefix = 0, i;
+ int rc = 0;
+ struct operand src, dst;
+ unsigned long cr2 = ctxt->cr2;
+ int mode = ctxt->mode;
+ unsigned long modrm_ea;
+ int use_modrm_ea, index_reg = 0, base_reg = 0, scale, rip_relative = 0;
+
+ /* Shadow copy of register state. Committed on successful emulation. */
+ unsigned long _regs[NR_VCPU_REGS];
+ unsigned long _eip = ctxt->vcpu->rip, _eflags = ctxt->eflags;
+ unsigned long modrm_val = 0;
+
+ memcpy(_regs, ctxt->vcpu->regs, sizeof _regs);
+
+ switch (mode) {
+ case X86EMUL_MODE_REAL:
+ case X86EMUL_MODE_PROT16:
+ op_bytes = ad_bytes = 2;
+ break;
+ case X86EMUL_MODE_PROT32:
+ op_bytes = ad_bytes = 4;
+ break;
+#ifdef __x86_64__
+ case X86EMUL_MODE_PROT64:
+ op_bytes = 4;
+ ad_bytes = 8;
+ break;
+#endif
+ default:
+ return -1;
+ }
+
+ /* Legacy prefixes. */
+ for (i = 0; i < 8; i++) {
+ switch (b = insn_fetch(u8, 1, _eip)) {
+ case 0x66: /* operand-size override */
+ op_bytes ^= 6; /* switch between 2/4 bytes */
+ break;
+ case 0x67: /* address-size override */
+ if (mode == X86EMUL_MODE_PROT64)
+ ad_bytes ^= 12; /* switch between 4/8 bytes */
+ else
+ ad_bytes ^= 6; /* switch between 2/4 bytes */
+ break;
+ case 0x2e: /* CS override */
+ override_base = &ctxt->cs_base;
+ break;
+ case 0x3e: /* DS override */
+ override_base = &ctxt->ds_base;
+ break;
+ case 0x26: /* ES override */
+ override_base = &ctxt->es_base;
+ break;
+ case 0x64: /* FS override */
+ override_base = &ctxt->fs_base;
+ break;
+ case 0x65: /* GS override */
+ override_base = &ctxt->gs_base;
+ break;
+ case 0x36: /* SS override */
+ override_base = &ctxt->ss_base;
+ break;
+ case 0xf0: /* LOCK */
+ lock_prefix = 1;
+ break;
+ case 0xf3: /* REP/REPE/REPZ */
+ rep_prefix = 1;
+ break;
+ case 0xf2: /* REPNE/REPNZ */
+ break;
+ default:
+ goto done_prefixes;
+ }
+ }
+
+done_prefixes:
+
+ /* REX prefix. */
+ if ((mode == X86EMUL_MODE_PROT64) && ((b & 0xf0) == 0x40)) {
+ rex_prefix = b;
+ if (b & 8)
+ op_bytes = 8; /* REX.W */
+ modrm_reg = (b & 4) << 1; /* REX.R */
+ index_reg = (b & 2) << 2; /* REX.X */
+ modrm_rm = base_reg = (b & 1) << 3; /* REX.B */
+ b = insn_fetch(u8, 1, _eip);
+ }
+
+ /* Opcode byte(s). */
+ d = opcode_table[b];
+ if (d == 0) {
+ /* Two-byte opcode? */
+ if (b == 0x0f) {
+ twobyte = 1;
+ b = insn_fetch(u8, 1, _eip);
+ d = twobyte_table[b];
+ }
+
+ /* Unrecognised? */
+ if (d == 0)
+ goto cannot_emulate;
+ }
+
+ /* ModRM and SIB bytes. */
+ if (d & ModRM) {
+ modrm = insn_fetch(u8, 1, _eip);
+ modrm_mod |= (modrm & 0xc0) >> 6;
+ modrm_reg |= (modrm & 0x38) >> 3;
+ modrm_rm |= (modrm & 0x07);
+ modrm_ea = 0;
+ use_modrm_ea = 1;
+
+ if (modrm_mod == 3) {
+ modrm_val = *(unsigned long *)
+ decode_register(modrm_rm, _regs, d & ByteOp);
+ goto modrm_done;
+ }
+
+ if (ad_bytes == 2) {
+ unsigned bx = _regs[VCPU_REGS_RBX];
+ unsigned bp = _regs[VCPU_REGS_RBP];
+ unsigned si = _regs[VCPU_REGS_RSI];
+ unsigned di = _regs[VCPU_REGS_RDI];
+
+ /* 16-bit ModR/M decode. */
+ switch (modrm_mod) {
+ case 0:
+ if (modrm_rm == 6)
+ modrm_ea += insn_fetch(u16, 2, _eip);
+ break;
+ case 1:
+ modrm_ea += insn_fetch(s8, 1, _eip);
+ break;
+ case 2:
+ modrm_ea += insn_fetch(u16, 2, _eip);
+ break;
+ }
+ switch (modrm_rm) {
+ case 0:
+ modrm_ea += bx + si;
+ break;
+ case 1:
+ modrm_ea += bx + di;
+ break;
+ case 2:
+ modrm_ea += bp + si;
+ break;
+ case 3:
+ modrm_ea += bp + di;
+ break;
+ case 4:
+ modrm_ea += si;
+ break;
+ case 5:
+ modrm_ea += di;
+ break;
+ case 6:
+ if (modrm_mod != 0)
+ modrm_ea += bp;
+ break;
+ case 7:
+ modrm_ea += bx;
+ break;
+ }
+ if (modrm_rm == 2 || modrm_rm == 3 ||
+ (modrm_rm == 6 && modrm_mod != 0))
+ if (!override_base)
+ override_base = &ctxt->ss_base;
+ modrm_ea = (u16)modrm_ea;
+ } else {
+ /* 32/64-bit ModR/M decode. */
+ switch (modrm_rm) {
+ case 4:
+ case 12:
+ sib = insn_fetch(u8, 1, _eip);
+ index_reg |= (sib >> 3) & 7;
+ base_reg |= sib & 7;
+ scale = sib >> 6;
+
+ switch (base_reg) {
+ case 5:
+ if (modrm_mod != 0)
+ modrm_ea += _regs[base_reg];
+ else
+ modrm_ea += insn_fetch(s32, 4, _eip);
+ break;
+ default:
+ modrm_ea += _regs[base_reg];
+ }
+ switch (index_reg) {
+ case 4:
+ break;
+ default:
+ modrm_ea += _regs[index_reg] << scale;
+
+ }
+ break;
+ case 5:
+ if (modrm_mod != 0)
+ modrm_ea += _regs[modrm_rm];
+ else if (mode == X86EMUL_MODE_PROT64)
+ rip_relative = 1;
+ break;
+ default:
+ modrm_ea += _regs[modrm_rm];
+ break;
+ }
+ switch (modrm_mod) {
+ case 0:
+ if (modrm_rm == 5)
+ modrm_ea += insn_fetch(s32, 4, _eip);
+ break;
+ case 1:
+ modrm_ea += insn_fetch(s8, 1, _eip);
+ break;
+ case 2:
+ modrm_ea += insn_fetch(s32, 4, _eip);
+ break;
+ }
+ }
+ if (!override_base)
+ override_base = &ctxt->ds_base;
+ if (mode == X86EMUL_MODE_PROT64 &&
+ override_base != &ctxt->fs_base &&
+ override_base != &ctxt->gs_base)
+ override_base = 0;
+
+ if (override_base)
+ modrm_ea += *override_base;
+
+ if (rip_relative) {
+ modrm_ea += _eip;
+ switch (d & SrcMask) {
+ case SrcImmByte:
+ modrm_ea += 1;
+ break;
+ case SrcImm:
+ if (d & ByteOp)
+ modrm_ea += 1;
+ else
+ if (op_bytes == 8)
+ modrm_ea += 4;
+ else
+ modrm_ea += op_bytes;
+ }
+ }
+ if (ad_bytes != 8)
+ modrm_ea = (u32)modrm_ea;
+ cr2 = modrm_ea;
+ modrm_done:
+ ;
+ }
+
+ /* Decode and fetch the destination operand: register or memory. */
+ switch (d & DstMask) {
+ case ImplicitOps:
+ /* Special instructions do their own operand decoding. */
+ goto special_insn;
+ case DstReg:
+ dst.type = OP_REG;
+ if ((d & ByteOp)
+ && !(twobyte && (b == 0xb6 || b == 0xb7))) {
+ dst.ptr = decode_register(modrm_reg, _regs,
+ (rex_prefix == 0));
+ dst.val = *(u8 *) dst.ptr;
+ dst.bytes = 1;
+ } else {
+ dst.ptr = decode_register(modrm_reg, _regs, 0);
+ switch ((dst.bytes = op_bytes)) {
+ case 2:
+ dst.val = *(u16 *)dst.ptr;
+ break;
+ case 4:
+ dst.val = *(u32 *)dst.ptr;
+ break;
+ case 8:
+ dst.val = *(u64 *)dst.ptr;
+ break;
+ }
+ }
+ break;
+ case DstMem:
+ dst.type = OP_MEM;
+ dst.ptr = (unsigned long *)cr2;
+ dst.bytes = (d & ByteOp) ? 1 : op_bytes;
+ if (!(d & Mov) && /* optimisation - avoid slow emulated read */
+ ((rc = ops->read_emulated((unsigned long)dst.ptr,
+ &dst.val, dst.bytes, ctxt)) != 0))
+ goto done;
+ break;
+ }
+ dst.orig_val = dst.val;
+
+ /*
+ * Decode and fetch the source operand: register, memory
+ * or immediate.
+ */
+ switch (d & SrcMask) {
+ case SrcNone:
+ break;
+ case SrcReg:
+ src.type = OP_REG;
+ if (d & ByteOp) {
+ src.ptr = decode_register(modrm_reg, _regs,
+ (rex_prefix == 0));
+ src.val = src.orig_val = *(u8 *) src.ptr;
+ src.bytes = 1;
+ } else {
+ src.ptr = decode_register(modrm_reg, _regs, 0);
+ switch ((src.bytes = op_bytes)) {
+ case 2:
+ src.val = src.orig_val = *(u16 *) src.ptr;
+ break;
+ case 4:
+ src.val = src.orig_val = *(u32 *) src.ptr;
+ break;
+ case 8:
+ src.val = src.orig_val = *(u64 *) src.ptr;
+ break;
+ }
+ }
+ break;
+ case SrcMem16:
+ src.bytes = 2;
+ goto srcmem_common;
+ case SrcMem32:
+ src.bytes = 4;
+ goto srcmem_common;
+ case SrcMem:
+ src.bytes = (d & ByteOp) ? 1 : op_bytes;
+ srcmem_common:
+ src.type = OP_MEM;
+ src.ptr = (unsigned long *)cr2;
+ if ((rc = ops->read_emulated((unsigned long)src.ptr,
+ &src.val, src.bytes, ctxt)) != 0)
+ goto done;
+ src.orig_val = src.val;
+ break;
+ case SrcImm:
+ src.type = OP_IMM;
+ src.ptr = (unsigned long *)_eip;
+ src.bytes = (d & ByteOp) ? 1 : op_bytes;
+ if (src.bytes == 8)
+ src.bytes = 4;
+ /* NB. Immediates are sign-extended as necessary. */
+ switch (src.bytes) {
+ case 1:
+ src.val = insn_fetch(s8, 1, _eip);
+ break;
+ case 2:
+ src.val = insn_fetch(s16, 2, _eip);
+ break;
+ case 4:
+ src.val = insn_fetch(s32, 4, _eip);
+ break;
+ }
+ break;
+ case SrcImmByte:
+ src.type = OP_IMM;
+ src.ptr = (unsigned long *)_eip;
+ src.bytes = 1;
+ src.val = insn_fetch(s8, 1, _eip);
+ break;
+ }
+
+ if (twobyte)
+ goto twobyte_insn;
+
+ switch (b) {
+ case 0x00 ... 0x05:
+ add: /* add */
+ emulate_2op_SrcV("add", src, dst, _eflags);
+ break;
+ case 0x08 ... 0x0d:
+ or: /* or */
+ emulate_2op_SrcV("or", src, dst, _eflags);
+ break;
+ case 0x10 ... 0x15:
+ adc: /* adc */
+ emulate_2op_SrcV("adc", src, dst, _eflags);
+ break;
+ case 0x18 ... 0x1d:
+ sbb: /* sbb */
+ emulate_2op_SrcV("sbb", src, dst, _eflags);
+ break;
+ case 0x20 ... 0x25:
+ and: /* and */
+ emulate_2op_SrcV("and", src, dst, _eflags);
+ break;
+ case 0x28 ... 0x2d:
+ sub: /* sub */
+ emulate_2op_SrcV("sub", src, dst, _eflags);
+ break;
+ case 0x30 ... 0x35:
+ xor: /* xor */
+ emulate_2op_SrcV("xor", src, dst, _eflags);
+ break;
+ case 0x38 ... 0x3d:
+ cmp: /* cmp */
+ emulate_2op_SrcV("cmp", src, dst, _eflags);
+ break;
+ case 0x63: /* movsxd */
+ if (mode != X86EMUL_MODE_PROT64)
+ goto cannot_emulate;
+ dst.val = (s32) src.val;
+ break;
+ case 0x80 ... 0x83: /* Grp1 */
+ switch (modrm_reg) {
+ case 0:
+ goto add;
+ case 1:
+ goto or;
+ case 2:
+ goto adc;
+ case 3:
+ goto sbb;
+ case 4:
+ goto and;
+ case 5:
+ goto sub;
+ case 6:
+ goto xor;
+ case 7:
+ goto cmp;
+ }
+ break;
+ case 0x84 ... 0x85:
+ test: /* test */
+ emulate_2op_SrcV("test", src, dst, _eflags);
+ break;
+ case 0x86 ... 0x87: /* xchg */
+ /* Write back the register source. */
+ switch (dst.bytes) {
+ case 1:
+ *(u8 *) src.ptr = (u8) dst.val;
+ break;
+ case 2:
+ *(u16 *) src.ptr = (u16) dst.val;
+ break;
+ case 4:
+ *src.ptr = (u32) dst.val;
+ break; /* 64b reg: zero-extend */
+ case 8:
+ *src.ptr = dst.val;
+ break;
+ }
+ /*
+ * Write back the memory destination with implicit LOCK
+ * prefix.
+ */
+ dst.val = src.val;
+ lock_prefix = 1;
+ break;
+ case 0xa0 ... 0xa1: /* mov */
+ dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX];
+ dst.val = src.val;
+ _eip += ad_bytes; /* skip src displacement */
+ break;
+ case 0xa2 ... 0xa3: /* mov */
+ dst.val = (unsigned long)_regs[VCPU_REGS_RAX];
+ _eip += ad_bytes; /* skip dst displacement */
+ break;
+ case 0x88 ... 0x8b: /* mov */
+ case 0xc6 ... 0xc7: /* mov (sole member of Grp11) */
+ dst.val = src.val;
+ break;
+ case 0x8f: /* pop (sole member of Grp1a) */
+ /* 64-bit mode: POP always pops a 64-bit operand. */
+ if (mode == X86EMUL_MODE_PROT64)
+ dst.bytes = 8;
+ if ((rc = ops->read_std(register_address(ctxt->ss_base,
+ _regs[VCPU_REGS_RSP]),
+ &dst.val, dst.bytes, ctxt)) != 0)
+ goto done;
+ register_address_increment(_regs[VCPU_REGS_RSP], dst.bytes);
+ break;
+ case 0xc0 ... 0xc1:
+ grp2: /* Grp2 */
+ switch (modrm_reg) {
+ case 0: /* rol */
+ emulate_2op_SrcB("rol", src, dst, _eflags);
+ break;
+ case 1: /* ror */
+ emulate_2op_SrcB("ror", src, dst, _eflags);
+ break;
+ case 2: /* rcl */
+ emulate_2op_SrcB("rcl", src, dst, _eflags);
+ break;
+ case 3: /* rcr */
+ emulate_2op_SrcB("rcr", src, dst, _eflags);
+ break;
+ case 4: /* sal/shl */
+ case 6: /* sal/shl */
+ emulate_2op_SrcB("sal", src, dst, _eflags);
+ break;
+ case 5: /* shr */
+ emulate_2op_SrcB("shr", src, dst, _eflags);
+ break;
+ case 7: /* sar */
+ emulate_2op_SrcB("sar", src, dst, _eflags);
+ break;
+ }
+ break;
+ case 0xd0 ... 0xd1: /* Grp2 */
+ src.val = 1;
+ goto grp2;
+ case 0xd2 ... 0xd3: /* Grp2 */
+ src.val = _regs[VCPU_REGS_RCX];
+ goto grp2;
+ case 0xf6 ... 0xf7: /* Grp3 */
+ switch (modrm_reg) {
+ case 0 ... 1: /* test */
+ /*
+ * Special case in Grp3: test has an immediate
+ * source operand.
+ */
+ src.type = OP_IMM;
+ src.ptr = (unsigned long *)_eip;
+ src.bytes = (d & ByteOp) ? 1 : op_bytes;
+ if (src.bytes == 8)
+ src.bytes = 4;
+ switch (src.bytes) {
+ case 1:
+ src.val = insn_fetch(s8, 1, _eip);
+ break;
+ case 2:
+ src.val = insn_fetch(s16, 2, _eip);
+ break;
+ case 4:
+ src.val = insn_fetch(s32, 4, _eip);
+ break;
+ }
+ goto test;
+ case 2: /* not */
+ dst.val = ~dst.val;
+ break;
+ case 3: /* neg */
+ emulate_1op("neg", dst, _eflags);
+ break;
+ default:
+ goto cannot_emulate;
+ }
+ break;
+ case 0xfe ... 0xff: /* Grp4/Grp5 */
+ switch (modrm_reg) {
+ case 0: /* inc */
+ emulate_1op("inc", dst, _eflags);
+ break;
+ case 1: /* dec */
+ emulate_1op("dec", dst, _eflags);
+ break;
+ case 6: /* push */
+ /* 64-bit mode: PUSH always pushes a 64-bit operand. */
+ if (mode == X86EMUL_MODE_PROT64) {
+ dst.bytes = 8;
+ if ((rc = ops->read_std((unsigned long)dst.ptr,
+ &dst.val, 8,
+ ctxt)) != 0)
+ goto done;
+ }
+ register_address_increment(_regs[VCPU_REGS_RSP],
+ -dst.bytes);
+ if ((rc = ops->write_std(
+ register_address(ctxt->ss_base,
+ _regs[VCPU_REGS_RSP]),
+ dst.val, dst.bytes, ctxt)) != 0)
+ goto done;
+ dst.val = dst.orig_val; /* skanky: disable writeback */
+ break;
+ default:
+ goto cannot_emulate;
+ }
+ break;
+ }
+
+writeback:
+ if ((d & Mov) || (dst.orig_val != dst.val)) {
+ switch (dst.type) {
+ case OP_REG:
+ /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
+ switch (dst.bytes) {
+ case 1:
+ *(u8 *)dst.ptr = (u8)dst.val;
+ break;
+ case 2:
+ *(u16 *)dst.ptr = (u16)dst.val;
+ break;
+ case 4:
+ *dst.ptr = (u32)dst.val;
+ break; /* 64b: zero-ext */
+ case 8:
+ *dst.ptr = dst.val;
+ break;
+ }
+ break;
+ case OP_MEM:
+ if (lock_prefix)
+ rc = ops->cmpxchg_emulated((unsigned long)dst.
+ ptr, dst.orig_val,
+ dst.val, dst.bytes,
+ ctxt);
+ else
+ rc = ops->write_emulated((unsigned long)dst.ptr,
+ dst.val, dst.bytes,
+ ctxt);
+ if (rc != 0)
+ goto done;
+ default:
+ break;
+ }
+ }
+
+ /* Commit shadow register state. */
+ memcpy(ctxt->vcpu->regs, _regs, sizeof _regs);
+ ctxt->eflags = _eflags;
+ ctxt->vcpu->rip = _eip;
+
+done:
+ return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0;
+
+special_insn:
+ if (twobyte)
+ goto twobyte_special_insn;
+ if (rep_prefix) {
+ if (_regs[VCPU_REGS_RCX] == 0) {
+ ctxt->vcpu->rip = _eip;
+ goto done;
+ }
+ _regs[VCPU_REGS_RCX]--;
+ _eip = ctxt->vcpu->rip;
+ }
+ switch (b) {
+ case 0xa4 ... 0xa5: /* movs */
+ dst.type = OP_MEM;
+ dst.bytes = (d & ByteOp) ? 1 : op_bytes;
+ dst.ptr = (unsigned long *)register_address(ctxt->es_base,
+ _regs[VCPU_REGS_RDI]);
+ if ((rc = ops->read_emulated(register_address(
+ override_base ? *override_base : ctxt->ds_base,
+ _regs[VCPU_REGS_RSI]), &dst.val, dst.bytes, ctxt)) != 0)
+ goto done;
+ register_address_increment(_regs[VCPU_REGS_RSI],
+ (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes);
+ register_address_increment(_regs[VCPU_REGS_RDI],
+ (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes);
+ break;
+ case 0xa6 ... 0xa7: /* cmps */
+ DPRINTF("Urk! I don't handle CMPS.\n");
+ goto cannot_emulate;
+ case 0xaa ... 0xab: /* stos */
+ dst.type = OP_MEM;
+ dst.bytes = (d & ByteOp) ? 1 : op_bytes;
+ dst.ptr = (unsigned long *)cr2;
+ dst.val = _regs[VCPU_REGS_RAX];
+ register_address_increment(_regs[VCPU_REGS_RDI],
+ (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes);
+ break;
+ case 0xac ... 0xad: /* lods */
+ dst.type = OP_REG;
+ dst.bytes = (d & ByteOp) ? 1 : op_bytes;
+ dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX];
+ if ((rc = ops->read_emulated(cr2, &dst.val, dst.bytes, ctxt)) != 0)
+ goto done;
+ register_address_increment(_regs[VCPU_REGS_RSI],
+ (_eflags & EFLG_DF) ? -dst.bytes : dst.bytes);
+ break;
+ case 0xae ... 0xaf: /* scas */
+ DPRINTF("Urk! I don't handle SCAS.\n");
+ goto cannot_emulate;
+ }
+ goto writeback;
+
+twobyte_insn:
+ switch (b) {
+ case 0x01: /* lgdt, lidt, lmsw */
+ switch (modrm_reg) {
+ u16 size;
+ unsigned long address;
+
+ case 2: /* lgdt */
+ rc = read_descriptor(ctxt, ops, src.ptr,
+ &size, &address, op_bytes);
+ if (rc)
+ goto done;
+ realmode_lgdt(ctxt->vcpu, size, address);
+ break;
+ case 3: /* lidt */
+ rc = read_descriptor(ctxt, ops, src.ptr,
+ &size, &address, op_bytes);
+ if (rc)
+ goto done;
+ realmode_lidt(ctxt->vcpu, size, address);
+ break;
+ case 6: /* lmsw */
+ realmode_lmsw(ctxt->vcpu, (u16)modrm_val, &_eflags);
+ break;
+ default:
+ goto cannot_emulate;
+ }
+ break;
+ case 0x40 ... 0x4f: /* cmov */
+ dst.val = dst.orig_val = src.val;
+ d &= ~Mov; /* default to no move */
+ /*
+ * First, assume we're decoding an even cmov opcode
+ * (lsb == 0).
+ */
+ switch ((b & 15) >> 1) {
+ case 0: /* cmovo */
+ d |= (_eflags & EFLG_OF) ? Mov : 0;
+ break;
+ case 1: /* cmovb/cmovc/cmovnae */
+ d |= (_eflags & EFLG_CF) ? Mov : 0;
+ break;
+ case 2: /* cmovz/cmove */
+ d |= (_eflags & EFLG_ZF) ? Mov : 0;
+ break;
+ case 3: /* cmovbe/cmovna */
+ d |= (_eflags & (EFLG_CF | EFLG_ZF)) ? Mov : 0;
+ break;
+ case 4: /* cmovs */
+ d |= (_eflags & EFLG_SF) ? Mov : 0;
+ break;
+ case 5: /* cmovp/cmovpe */
+ d |= (_eflags & EFLG_PF) ? Mov : 0;
+ break;
+ case 7: /* cmovle/cmovng */
+ d |= (_eflags & EFLG_ZF) ? Mov : 0;
+ /* fall through */
+ case 6: /* cmovl/cmovnge */
+ d |= (!(_eflags & EFLG_SF) !=
+ !(_eflags & EFLG_OF)) ? Mov : 0;
+ break;
+ }
+ /* Odd cmov opcodes (lsb == 1) have inverted sense. */
+ d ^= (b & 1) ? Mov : 0;
+ break;
+ case 0xb0 ... 0xb1: /* cmpxchg */
+ /*
+ * Save real source value, then compare EAX against
+ * destination.
+ */
+ src.orig_val = src.val;
+ src.val = _regs[VCPU_REGS_RAX];
+ emulate_2op_SrcV("cmp", src, dst, _eflags);
+ /* Always write back. The question is: where to? */
+ d |= Mov;
+ if (_eflags & EFLG_ZF) {
+ /* Success: write back to memory. */
+ dst.val = src.orig_val;
+ } else {
+ /* Failure: write the value we saw to EAX. */
+ dst.type = OP_REG;
+ dst.ptr = (unsigned long *)&_regs[VCPU_REGS_RAX];
+ }
+ break;
+ case 0xa3:
+ bt: /* bt */
+ src.val &= (dst.bytes << 3) - 1; /* only subword offset */
+ emulate_2op_SrcV_nobyte("bt", src, dst, _eflags);
+ break;
+ case 0xb3:
+ btr: /* btr */
+ src.val &= (dst.bytes << 3) - 1; /* only subword offset */
+ emulate_2op_SrcV_nobyte("btr", src, dst, _eflags);
+ break;
+ case 0xab:
+ bts: /* bts */
+ src.val &= (dst.bytes << 3) - 1; /* only subword offset */
+ emulate_2op_SrcV_nobyte("bts", src, dst, _eflags);
+ break;
+ case 0xb6 ... 0xb7: /* movzx */
+ dst.bytes = op_bytes;
+ dst.val = (d & ByteOp) ? (u8) src.val : (u16) src.val;
+ break;
+ case 0xbb:
+ btc: /* btc */
+ src.val &= (dst.bytes << 3) - 1; /* only subword offset */
+ emulate_2op_SrcV_nobyte("btc", src, dst, _eflags);
+ break;
+ case 0xba: /* Grp8 */
+ switch (modrm_reg & 3) {
+ case 0:
+ goto bt;
+ case 1:
+ goto bts;
+ case 2:
+ goto btr;
+ case 3:
+ goto btc;
+ }
+ break;
+ case 0xbe ... 0xbf: /* movsx */
+ dst.bytes = op_bytes;
+ dst.val = (d & ByteOp) ? (s8) src.val : (s16) src.val;
+ break;
+ }
+ goto writeback;
+
+twobyte_special_insn:
+ /* Disable writeback. */
+ dst.orig_val = dst.val;
+ switch (b) {
+ case 0x0d: /* GrpP (prefetch) */
+ case 0x18: /* Grp16 (prefetch/nop) */
+ break;
+ case 0x20: /* mov cr, reg */
+ b = insn_fetch(u8, 1, _eip);
+ if ((b & 0xc0) != 0xc0)
+ goto cannot_emulate;
+ _regs[(b >> 3) & 7] = realmode_get_cr(ctxt->vcpu, b & 7);
+ break;
+ case 0x22: /* mov reg, cr */
+ b = insn_fetch(u8, 1, _eip);
+ if ((b & 0xc0) != 0xc0)
+ goto cannot_emulate;
+ realmode_set_cr(ctxt->vcpu, b & 7, _regs[(b >> 3) & 7] & -1u,
+ &_eflags);
+ break;
+ case 0xc7: /* Grp9 (cmpxchg8b) */
+#if defined(__i386__)
+ {
+ unsigned long old_lo, old_hi;
+ if (((rc = ops->read_emulated(cr2 + 0, &old_lo, 4,
+ ctxt)) != 0)
+ || ((rc = ops->read_emulated(cr2 + 4, &old_hi, 4,
+ ctxt)) != 0))
+ goto done;
+ if ((old_lo != _regs[VCPU_REGS_RAX])
+ || (old_hi != _regs[VCPU_REGS_RDX])) {
+ _regs[VCPU_REGS_RAX] = old_lo;
+ _regs[VCPU_REGS_RDX] = old_hi;
+ _eflags &= ~EFLG_ZF;
+ } else if (ops->cmpxchg8b_emulated == NULL) {
+ rc = X86EMUL_UNHANDLEABLE;
+ goto done;
+ } else {
+ if ((rc = ops->cmpxchg8b_emulated(cr2, old_lo,
+ old_hi,
+ _regs[VCPU_REGS_RBX],
+ _regs[VCPU_REGS_RCX],
+ ctxt)) != 0)
+ goto done;
+ _eflags |= EFLG_ZF;
+ }
+ break;
+ }
+#elif defined(__x86_64__)
+ {
+ unsigned long old, new;
+ if ((rc = ops->read_emulated(cr2, &old, 8, ctxt)) != 0)
+ goto done;
+ if (((u32) (old >> 0) != (u32) _regs[VCPU_REGS_RAX]) ||
+ ((u32) (old >> 32) != (u32) _regs[VCPU_REGS_RDX])) {
+ _regs[VCPU_REGS_RAX] = (u32) (old >> 0);
+ _regs[VCPU_REGS_RDX] = (u32) (old >> 32);
+ _eflags &= ~EFLG_ZF;
+ } else {
+ new = (_regs[VCPU_REGS_RCX] << 32) | (u32) _regs[VCPU_REGS_RBX];
+ if ((rc = ops->cmpxchg_emulated(cr2, old,
+ new, 8, ctxt)) != 0)
+ goto done;
+ _eflags |= EFLG_ZF;
+ }
+ break;
+ }
+#endif
+ }
+ goto writeback;
+
+cannot_emulate:
+ DPRINTF("Cannot emulate %02x\n", b);
+ return -1;
+}
+
+#ifdef __XEN__
+
+#include <asm/mm.h>
+#include <asm/uaccess.h>
+
+int
+x86_emulate_read_std(unsigned long addr,
+ unsigned long *val,
+ unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+ unsigned int rc;
+
+ *val = 0;
+
+ if ((rc = copy_from_user((void *)val, (void *)addr, bytes)) != 0) {
+ propagate_page_fault(addr + bytes - rc, 0); /* read fault */
+ return X86EMUL_PROPAGATE_FAULT;
+ }
+
+ return X86EMUL_CONTINUE;
+}
+
+int
+x86_emulate_write_std(unsigned long addr,
+ unsigned long val,
+ unsigned int bytes, struct x86_emulate_ctxt *ctxt)
+{
+ unsigned int rc;
+
+ if ((rc = copy_to_user((void *)addr, (void *)&val, bytes)) != 0) {
+ propagate_page_fault(addr + bytes - rc, PGERR_write_access);
+ return X86EMUL_PROPAGATE_FAULT;
+ }
+
+ return X86EMUL_CONTINUE;
+}
+
+#endif
Index: linux-2.6/drivers/kvm/x86_emulate.h
===================================================================
--- /dev/null
+++ linux-2.6/drivers/kvm/x86_emulate.h
@@ -0,0 +1,185 @@
+/******************************************************************************
+ * x86_emulate.h
+ *
+ * Generic x86 (32-bit and 64-bit) instruction decoder and emulator.
+ *
+ * Copyright (c) 2005 Keir Fraser
+ *
+ * From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4
+ */
+
+#ifndef __X86_EMULATE_H__
+#define __X86_EMULATE_H__
+
+struct x86_emulate_ctxt;
+
+/*
+ * x86_emulate_ops:
+ *
+ * These operations represent the instruction emulator's interface to memory.
+ * There are two categories of operation: those that act on ordinary memory
+ * regions (*_std), and those that act on memory regions known to require
+ * special treatment or emulation (*_emulated).
+ *
+ * The emulator assumes that an instruction accesses only one 'emulated memory'
+ * location, that this location is the given linear faulting address (cr2), and
+ * that this is one of the instruction's data operands. Instruction fetches and
+ * stack operations are assumed never to access emulated memory. The emulator
+ * automatically deduces which operand of a string-move operation is accessing
+ * emulated memory, and assumes that the other operand accesses normal memory.
+ *
+ * NOTES:
+ * 1. The emulator isn't very smart about emulated vs. standard memory.
+ * 'Emulated memory' access addresses should be checked for sanity.
+ * 'Normal memory' accesses may fault, and the caller must arrange to
+ * detect and handle reentrancy into the emulator via recursive faults.
+ * Accesses may be unaligned and may cross page boundaries.
+ * 2. If the access fails (cannot emulate, or a standard access faults) then
+ * it is up to the memop to propagate the fault to the guest VM via
+ * some out-of-band mechanism, unknown to the emulator. The memop signals
+ * failure by returning X86EMUL_PROPAGATE_FAULT to the emulator, which will
+ * then immediately bail.
+ * 3. Valid access sizes are 1, 2, 4 and 8 bytes. On x86/32 systems only
+ * cmpxchg8b_emulated need support 8-byte accesses.
+ * 4. The emulator cannot handle 64-bit mode emulation on an x86/32 system.
+ */
+/* Access completed successfully: continue emulation as normal. */
+#define X86EMUL_CONTINUE 0
+/* Access is unhandleable: bail from emulation and return error to caller. */
+#define X86EMUL_UNHANDLEABLE 1
+/* Terminate emulation but return success to the caller. */
+#define X86EMUL_PROPAGATE_FAULT 2 /* propagate a generated fault to guest */
+#define X86EMUL_RETRY_INSTR 2 /* retry the instruction for some reason */
+#define X86EMUL_CMPXCHG_FAILED 2 /* cmpxchg did not see expected value */
+struct x86_emulate_ops {
+ /*
+ * read_std: Read bytes of standard (non-emulated/special) memory.
+ * Used for instruction fetch, stack operations, and others.
+ * @addr: [IN ] Linear address from which to read.
+ * @val: [OUT] Value read from memory, zero-extended to 'u_long'.
+ * @bytes: [IN ] Number of bytes to read from memory.
+ */
+ int (*read_std)(unsigned long addr,
+ unsigned long *val,
+ unsigned int bytes, struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * write_std: Write bytes of standard (non-emulated/special) memory.
+ * Used for stack operations, and others.
+ * @addr: [IN ] Linear address to which to write.
+ * @val: [IN ] Value to write to memory (low-order bytes used as
+ * required).
+ * @bytes: [IN ] Number of bytes to write to memory.
+ */
+ int (*write_std)(unsigned long addr,
+ unsigned long val,
+ unsigned int bytes, struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * read_emulated: Read bytes from emulated/special memory area.
+ * @addr: [IN ] Linear address from which to read.
+ * @val: [OUT] Value read from memory, zero-extended to 'u_long'.
+ * @bytes: [IN ] Number of bytes to read from memory.
+ */
+ int (*read_emulated) (unsigned long addr,
+ unsigned long *val,
+ unsigned int bytes,
+ struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * write_emulated: Write bytes to emulated/special memory area.
+ * @addr: [IN ] Linear address to which to write.
+ * @val: [IN ] Value to write to memory (low-order bytes used as
+ * required).
+ * @bytes: [IN ] Number of bytes to write to memory.
+ */
+ int (*write_emulated) (unsigned long addr,
+ unsigned long val,
+ unsigned int bytes,
+ struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * cmpxchg_emulated: Emulate an atomic (LOCKed) CMPXCHG operation on an
+ * emulated/special memory area.
+ * @addr: [IN ] Linear address to access.
+ * @old: [IN ] Value expected to be current at @addr.
+ * @new: [IN ] Value to write to @addr.
+ * @bytes: [IN ] Number of bytes to access using CMPXCHG.
+ */
+ int (*cmpxchg_emulated) (unsigned long addr,
+ unsigned long old,
+ unsigned long new,
+ unsigned int bytes,
+ struct x86_emulate_ctxt * ctxt);
+
+ /*
+ * cmpxchg8b_emulated: Emulate an atomic (LOCKed) CMPXCHG8B operation on an
+ * emulated/special memory area.
+ * @addr: [IN ] Linear address to access.
+ * @old: [IN ] Value expected to be current at @addr.
+ * @new: [IN ] Value to write to @addr.
+ * NOTES:
+ * 1. This function is only ever called when emulating a real CMPXCHG8B.
+ * 2. This function is *never* called on x86/64 systems.
+ * 3. Not defining this function (i.e., specifying NULL) is equivalent
+ * to defining a function that always returns X86EMUL_UNHANDLEABLE.
+ */
+ int (*cmpxchg8b_emulated) (unsigned long addr,
+ unsigned long old_lo,
+ unsigned long old_hi,
+ unsigned long new_lo,
+ unsigned long new_hi,
+ struct x86_emulate_ctxt * ctxt);
+};
+
+struct cpu_user_regs;
+
+struct x86_emulate_ctxt {
+ /* Register state before/after emulation. */
+ struct kvm_vcpu *vcpu;
+
+ /* Linear faulting address (if emulating a page-faulting instruction). */
+ unsigned long eflags;
+ unsigned long cr2;
+
+ /* Emulated execution mode, represented by an X86EMUL_MODE value. */
+ int mode;
+
+ unsigned long cs_base;
+ unsigned long ds_base;
+ unsigned long es_base;
+ unsigned long ss_base;
+ unsigned long gs_base;
+ unsigned long fs_base;
+};
+
+/* Execution mode, passed to the emulator. */
+#define X86EMUL_MODE_REAL 0 /* Real mode. */
+#define X86EMUL_MODE_PROT16 2 /* 16-bit protected mode. */
+#define X86EMUL_MODE_PROT32 4 /* 32-bit protected mode. */
+#define X86EMUL_MODE_PROT64 8 /* 64-bit (long) mode. */
+
+/* Host execution mode. */
+#if defined(__i386__)
+#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT32
+#elif defined(__x86_64__)
+#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT64
+#endif
+
+/*
+ * x86_emulate_memop: Emulate an instruction that faulted attempting to
+ *                    read/write a 'special' memory area.
+ * Returns -1 on failure, 0 on success.
+ */
+int x86_emulate_memop(struct x86_emulate_ctxt *ctxt,
+                      struct x86_emulate_ops *ops);
+
+/*
+ * Given the 'reg' portion of a ModRM byte, and a register block, return a
+ * pointer into the block that addresses the relevant register.
+ * @highbyte_regs specifies whether to decode AH,CH,DH,BH.
+ */
+void *decode_register(u8 modrm_reg, unsigned long *regs,
+                      int highbyte_regs);
+
+#endif /* __X86_EMULATE_H__ */


--
error compiling committee.c: too many arguments to function

2006-10-19 13:58:22

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Avi Kivity wrote, but forgot to attach the diffstat:
> The following patchset adds a driver for Intel's hardware virtualization
> extensions to the x86 architecture. The driver adds a character device
> (/dev/kvm) that exposes the virtualization capabilities to userspace. Using
> this driver, a process can run a virtual machine (a "guest") in a fully
> virtualized PC containing its own virtual hard disks, network adapters, and
> display.
>
>
[...]

 drivers/Kconfig           |    2
 drivers/Makefile          |    1
 drivers/kvm/Kconfig       |   22
 drivers/kvm/Makefile      |    6
 drivers/kvm/kvm.h         |  387 +++++
 drivers/kvm/kvm_main.c    | 3405 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/kvm/mmu.c         |  718 +++++++++
 drivers/kvm/paging_tmpl.h |  378 +++++
 drivers/kvm/vmx.h         |  287 +++
 drivers/kvm/x86_emulate.c | 1370 ++++++++++++++++++
 drivers/kvm/x86_emulate.h |  185 ++
 include/linux/kvm.h       |  202 ++
 12 files changed, 6963 insertions(+)

--
error compiling committee.c: too many arguments to function

2006-10-19 14:30:53

by John Stoffel

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface


Avi> This patch defines a bunch of ioctl()s on /dev/kvm. The ioctl()s
Avi> allow adding memory to a virtual machine, adding a virtual cpu to
Avi> a virtual machine (at most one at this time), transferring
Avi> control to the virtual cpu, and querying about guest pages
Avi> changed by the virtual machine.

Yuck. ioctls are deprecated; you should be using sysfs instead for
stuff like this, or configfs.

John

2006-10-19 14:43:30

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

John Stoffel wrote:
> Avi> This patch defines a bunch of ioctl()s on /dev/kvm. The ioctl()s
> Avi> allow adding memory to a virtual machine, adding a virtual cpu to
> Avi> a virtual machine (at most one at this time), transferring
> Avi> control to the virtual cpu, and querying about guest pages
> Avi> changed by the virtual machine.
>
> Yuck. ioctls are deprecated; you should be using sysfs instead for
> stuff like this, or configfs.
>

I need to pass small amounts of data back and forth very efficiently.
sysfs and configfs are more for one-time admin stuff, not for continuous
device control.


--
error compiling committee.c: too many arguments to function

2006-10-19 14:47:49

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

On Thu, 2006-10-19 at 10:30 -0400, John Stoffel wrote:
> Avi> This patch defines a bunch of ioctl()s on /dev/kvm. The ioctl()s
> Avi> allow adding memory to a virtual machine, adding a virtual cpu to
> Avi> a virtual machine (at most one at this time), transferring
> Avi> control to the virtual cpu, and querying about guest pages
> Avi> changed by the virtual machine.
>
> Yuck. ioctls are deprecated; you should be using sysfs instead for
> stuff like this, or configfs.

Bzzt Wrong answer, please try again 8)

The kernel summit discussions were very much that ioctl has its place,
and that the sysfs extremists were wrong. sysfs has its place (views
ranging from that being /dev/null upwards) but sysfs is useless for many
kinds of interface including those with read/write or other
synchronization properties, those that trigger actions and those that
are tied to the file handle you are working with. An executing VM
interface via sysfs is a ludicrous concept.

Making sure the ioctl sizes are the same in 32/64bit and aligned the
same way is the more important issue.

Alan

2006-10-19 14:51:31

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Alan Cox wrote:
> On Thu, 2006-10-19 at 10:30 -0400, John Stoffel wrote:
>
>> Yuck. ioctls are deprecated; you should be using sysfs instead for
>> stuff like this, or configfs.
>>
>
>
> Making sure the ioctl sizes are the same in 32/64bit and aligned the
> same way is the more important issue.
>

Yes, pointers are padded and all other types are explicitly sized.
Alignment is always natural.

--
error compiling committee.c: too many arguments to function

2006-10-19 15:25:45

by John Stoffel

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface


Avi> Alan Cox wrote:
>> On Thu, 2006-10-19 at 10:30 -0400, John Stoffel wrote:
>>
>>> Yuck. ioctls are deprecated; you should be using sysfs instead for
>>> stuff like this, or configfs.
>>
>> Making sure the ioctl sizes are the same in 32/64bit and aligned the
>> same way is the more important issue.
>>

Avi> Yes, pointers are padded and all other types are explicitly sized.
Avi> Alignment is always natural.

My apologies, I should have kept my mouth shut. :] Looks interesting
for sure.

John

2006-10-19 16:05:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Avi Kivity <[email protected]> writes:

> The following patchset adds a driver for Intel's hardware virtualization
> extensions to the x86 architecture. The driver adds a character device
> (/dev/kvm) that exposes the virtualization capabilities to userspace. Using
> this driver, a process can run a virtual machine (a "guest") in a fully
> virtualized PC containing its own virtual hard disks, network adapters, and
> display.
>
> Using this driver, one can start multiple virtual machines on a host. Each
> virtual machine is a process on the host; a virtual cpu is a thread in that
> process. kill(1), nice(1), top(1) work as expected.

Where is the user space for this? Is it free?

I suppose you need a device model. Do you use qemu's?

-Andi

2006-10-19 16:09:46

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Andi Kleen wrote:
> Avi Kivity <[email protected]> writes:
>
>
>> The following patchset adds a driver for Intel's hardware virtualization
>> extensions to the x86 architecture. The driver adds a character device
>> (/dev/kvm) that exposes the virtualization capabilities to userspace. Using
>> this driver, a process can run a virtual machine (a "guest") in a fully
>> virtualized PC containing its own virtual hard disks, network adapters, and
>> display.
>>
>> Using this driver, one can start multiple virtual machines on a host. Each
>> virtual machine is a process on the host; a virtual cpu is a thread in that
>> process. kill(1), nice(1), top(1) work as expected.
>>
>
> Where is the user space for this? Is it free?
>

I have to go through the motions of creating a sourceforge project for
this and uploading it. And yes, it is free.

> I suppose you need a device model. Do you use qemu's?
>

Yes. I can't imagine anyone doing that work from scratch (Xen also uses
qemu).

--
error compiling committee.c: too many arguments to function

2006-10-19 17:31:58

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Thu, Oct 19, 2006 at 03:45:49PM +0200, Avi Kivity wrote:

> The following patchset adds a driver for Intel's hardware virtualization
> extensions to the x86 architecture. The driver adds a character device
> (/dev/kvm) that exposes the virtualization capabilities to userspace. Using
> this driver, a process can run a virtual machine (a "guest") in a fully
> virtualized PC containing its own virtual hard disks, network adapters, and
> display.

Hi,

Looks pretty interesting! Some comments:

- patch 4/7 hasn't made it to the list?
- it would be useful for reviewing this if you could post example code
making use of the /dev/kvm interfaces - they seem fairly complex.
- why do it this way rather than through a virtual machine monitor
such as Xen? what do you gain from having the virtual machines
encapsulated as Linux processes?

Cheers,
Muli


2006-10-19 18:00:17

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Muli Ben-Yehuda wrote:
> Hi,
>
> Looks pretty interesting! Some comments:
>
> - patch 4/7 hasn't made it to the list?
>

Probably too big. It's also the ugliest. I'll split it and resend (not
through thunderbird though... ate all my tabs!).


> - it would be useful for reviewing this if you could post example code
> making use of the /dev/kvm interfaces - they seem fairly complex.
>

Working code is fairly hairy, since it's emulating a PC. That'll be on
sourceforge once they approve my new project.

In general one does:

   open("/dev/kvm")
   ioctl(KVM_SET_MEMORY_REGION) for main memory
   ioctl(KVM_SET_MEMORY_REGION) for the framebuffer
   ioctl(KVM_CREATE_VCPU) for the obvious reason
   if (debugger)
           ioctl(KVM_DEBUG_GUEST) to singlestep or breakpoint the guest
   while (1) {
           ioctl(KVM_RUN)
           switch (exit reason) {
                   handle mmio, I/O, etc.; might call
                   ioctl(KVM_INTERRUPT) to queue an external interrupt
                   ioctl(KVM_{GET,SET}_{REGS,SREGS}) to query/modify registers
                   ioctl(KVM_GET_DIRTY_LOG) to see which guest memory pages
                           have changed
           }
   }

I have some simple test code, I'll clean it up and post it.

> - why do it this way rather than through a virtual machine monitor
> such as Xen? what do you gain from having the virtual machines
> encapsulated as Linux processes?
>

- architectural simplicity: instead of splitting memory management and
scheduling between Xen and domain 0, use just the Linux memory
management and scheduler
- use standard tools (top(1), kill(1)) and security model (permissions
on /dev/kvm)
- much smaller codebase (although paravirtualization is not included (yet))
- no changes to core code
- easy to upgrade an existing system
- easier for drive-by virtualization (modprobe kvm; do-your-stuff;
ctrl-C; rmmod kvm)
- longer term, better performance since there's no need to switch to
domain 0 for I/O (instead just switch to user mode of the VM's process)

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-19 18:10:53

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Thu, 19 Oct 2006 20:00:07 +0200 Avi Kivity wrote:

> Muli Ben-Yehuda wrote:
> > Hi,
> >
> > Looks pretty interesting! Some comments:
> >
> > - patch 4/7 hasn't made it to the list?
> >
>
> Probably too big. It's also the ugliest. I'll split it and resend (not
> through thunderbird though... ate all my tabs!).

This works for me (when I have to use it), without using attachments:
http://mbligh.org/linuxdocs/Email/Clients/Thunderbird

---
~Randy

2006-10-19 18:14:32

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Randy Dunlap wrote:
>> Probably too big. It's also the ugliest. I'll split it and resend (not
>> through thunderbird though... ate all my tabs!).
>>
>
> This works for me (when I have to use it), without using attachments:
> http://mbligh.org/linuxdocs/Email/Clients/Thunderbird
>

That may be fine for a single patch, but too much work for a patchset.
Since I'm using quilt, I'll try to use its mail feature (though it gave
me some nasty errors when I took a peek).

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-19 18:29:17

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Avi Kivity wrote:
> Randy Dunlap wrote:
>>> Probably too big. It's also the ugliest. I'll split it and resend
>>> (not through thunderbird though... ate all my tabs!).
>>>
>>
>> This works for me (when I have to use it), without using attachments:
>> http://mbligh.org/linuxdocs/Email/Clients/Thunderbird
>>
>
> That may be fine for a single patch, but too much work for a patchset.
> Since I'm using quilt, I'll try to use its mail feature (though it gave
> me some nasty errors when I took a peek).

Agreed. quilt or Paul Jackson's sendpatchset script.

--
~Randy

2006-10-19 18:46:34

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Sorry if I missed this, but can you provide a link to the QEMU changes?

It's hard to tell what's going on without seeing the userspace portions
of this.

My initial impression is that you've taken the Xen approach of trying to
use QEMU only for IO emulation. If this is the case, it won't work long
term. While you can use vm86 mode for 16 bit virtualization for most
cases, it cannot handle big real mode. You need the ability to transfer
down to QEMU and allow it to do emulation.

Ideally, instead of having as large of an x86 emulator in kernel space,
you would just drop down to QEMU to do emulation as needed (doing only a
single basic block and returning). This would let you have a much
reduced partial emulator in kernel space that only did the most common
(and performance critical) instructions.

Regards,

Anthony Liguori

Avi Kivity wrote:
> This patch defines a bunch of ioctl()s on /dev/kvm. The ioctl()s allow
> adding
> memory to a virtual machine, adding a virtual cpu to a virtual machine (at
> most one at this time), transferring control to the virtual cpu, and
> querying
> about guest pages changed by the virtual machine.
>
> Signed-off-by: Yaniv Kamay <[email protected]>
> Signed-off-by: Avi Kivity <[email protected]>
>
> Index: linux-2.6/include/linux/kvm.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6/include/linux/kvm.h
> @@ -0,0 +1,202 @@
> +#ifndef __LINUX_KVM_H
> +#define __LINUX_KVM_H
> +
> +/*
> + * Userspace interface for /dev/kvm - kernel based virtual machine
> + *
> + * Note: this interface is considered experimental and may change without
> + * notice.
> + */
> +
> +#include <asm/types.h>
> +#include <linux/ioctl.h>
> +
> +#ifndef __user
> +#define __user
> +#endif
> +
> +/* for KVM_SET_MEMORY_REGION */
> +struct kvm_memory_region {
> +        __u32 slot;
> +        __u32 flags;
> +        __u64 guest_phys_addr;
> +        __u64 memory_size; /* bytes */
> +};
> +
> +/* for kvm_memory_region::flags */
> +#define KVM_MEM_LOG_DIRTY_PAGES 1UL
> +
> +
> +#define KVM_EXIT_TYPE_FAIL_ENTRY 1
> +#define KVM_EXIT_TYPE_VM_EXIT    2
> +
> +enum kvm_exit_reason {
> +        KVM_EXIT_UNKNOWN,
> +        KVM_EXIT_EXCEPTION,
> +        KVM_EXIT_IO,
> +        KVM_EXIT_CPUID,
> +        KVM_EXIT_DEBUG,
> +        KVM_EXIT_HLT,
> +        KVM_EXIT_MMIO,
> +};
> +
> +/* for KVM_RUN */
> +struct kvm_run {
> +        /* in */
> +        __u32 vcpu;
> +        __u32 emulated;       /* skip current instruction */
> +        __u32 mmio_completed; /* mmio request completed */
> +
> +        /* out */
> +        __u32 exit_type;
> +        __u32 exit_reason;
> +        __u32 instruction_length;
> +        union {
> +                /* KVM_EXIT_UNKNOWN */
> +                struct {
> +                        __u32 hardware_exit_reason;
> +                } hw;
> +                /* KVM_EXIT_EXCEPTION */
> +                struct {
> +                        __u32 exception;
> +                        __u32 error_code;
> +                } ex;
> +                /* KVM_EXIT_IO */
> +                struct {
> +#define KVM_EXIT_IO_IN  0
> +#define KVM_EXIT_IO_OUT 1
> +                        __u8 direction;
> +                        __u8 size; /* bytes */
> +                        __u8 string;
> +                        __u8 string_down;
> +                        __u8 rep;
> +                        __u8 pad;
> +                        __u16 port;
> +                        __u64 count;
> +                        union {
> +                                __u64 address;
> +                                __u32 value;
> +                        };
> +                } io;
> +                struct {
> +                } debug;
> +                /* KVM_EXIT_MMIO */
> +                struct {
> +                        __u64 phys_addr;
> +                        __u8  data[8];
> +                        __u32 len;
> +                        __u8  is_write;
> +                } mmio;
> +        };
> +};
> +
> +/* for KVM_GET_REGS and KVM_SET_REGS */
> +struct kvm_regs {
> +        /* in */
> +        __u32 vcpu;
> +        __u32 padding;
> +
> +        /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
> +        __u64 rax, rbx, rcx, rdx;
> +        __u64 rsi, rdi, rsp, rbp;
> +        __u64 r8,  r9,  r10, r11;
> +        __u64 r12, r13, r14, r15;
> +        __u64 rip, rflags;
> +};
> +
> +struct kvm_segment {
> +        __u64 base;
> +        __u32 limit;
> +        __u16 selector;
> +        __u8  type;
> +        __u8  present, dpl, db, s, l, g, avl;
> +        __u8  unusable;
> +        __u8  padding;
> +};
> +
> +struct kvm_dtable {
> +        __u64 base;
> +        __u16 limit;
> +        __u16 padding[3];
> +};
> +
> +/* for KVM_GET_SREGS and KVM_SET_SREGS */
> +struct kvm_sregs {
> +        /* in */
> +        __u32 vcpu;
> +        __u32 padding;
> +
> +        /* out (KVM_GET_SREGS) / in (KVM_SET_SREGS) */
> +        struct kvm_segment cs, ds, es, fs, gs, ss;
> +        struct kvm_segment tr, ldt;
> +        struct kvm_dtable gdt, idt;
> +        __u64 cr0, cr2, cr3, cr4, cr8;
> +        __u64 efer;
> +        __u64 apic_base;
> +
> +        /* out (KVM_GET_SREGS) */
> +        __u32 pending_int;
> +        __u32 padding2;
> +};
> +
> +/* for KVM_TRANSLATE */
> +struct kvm_translation {
> +        /* in */
> +        __u64 linear_address;
> +        __u32 vcpu;
> +        __u32 padding;
> +
> +        /* out */
> +        __u64 physical_address;
> +        __u8  valid;
> +        __u8  writeable;
> +        __u8  usermode;
> +};
> +
> +/* for KVM_INTERRUPT */
> +struct kvm_interrupt {
> +        /* in */
> +        __u32 vcpu;
> +        __u32 irq;
> +};
> +
> +struct kvm_breakpoint {
> +        __u32 enabled;
> +        __u32 padding;
> +        __u64 address;
> +};
> +
> +/* for KVM_DEBUG_GUEST */
> +struct kvm_debug_guest {
> +        /* in */
> +        __u32 vcpu;
> +        __u32 enabled;
> +        struct kvm_breakpoint breakpoints[4];
> +        __u32 singlestep;
> +};
> +
> +/* for KVM_GET_DIRTY_LOG */
> +struct kvm_dirty_log {
> +        __u32 slot;
> +        __u32 padding;
> +        union {
> +                void __user *dirty_bitmap; /* one bit per page */
> +                __u64 padding;
> +        };
> +};
> +
> +#define KVMIO 0xAE
> +
> +#define KVM_RUN               _IOWR(KVMIO, 2, struct kvm_run)
> +#define KVM_GET_REGS          _IOWR(KVMIO, 3, struct kvm_regs)
> +#define KVM_SET_REGS          _IOW(KVMIO, 4, struct kvm_regs)
> +#define KVM_GET_SREGS         _IOWR(KVMIO, 5, struct kvm_sregs)
> +#define KVM_SET_SREGS         _IOW(KVMIO, 6, struct kvm_sregs)
> +#define KVM_TRANSLATE         _IOWR(KVMIO, 7, struct kvm_translation)
> +#define KVM_INTERRUPT         _IOW(KVMIO, 8, struct kvm_interrupt)
> +#define KVM_DEBUG_GUEST       _IOW(KVMIO, 9, struct kvm_debug_guest)
> +#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 10, struct kvm_memory_region)
> +#define KVM_CREATE_VCPU       _IOW(KVMIO, 11, int /* vcpu_slot */)
> +#define KVM_GET_DIRTY_LOG     _IOW(KVMIO, 12, struct kvm_dirty_log)
> +
> +#endif
>
>

2006-10-19 18:49:31

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Alan Cox wrote:
> On Thu, 2006-10-19 at 10:30 -0400, John Stoffel wrote:
>
>> Avi> This patch defines a bunch of ioctl()s on /dev/kvm. The ioctl()s
>> Avi> allow adding memory to a virtual machine, adding a virtual cpu to
>> Avi> a virtual machine (at most one at this time), transferring
>> Avi> control to the virtual cpu, and querying about guest pages
>> Avi> changed by the virtual machine.
>>
>> Yuck. ioctls are deprecated; you should be using sysfs instead for
>> stuff like this, or configfs.
>>
>
> Bzzt Wrong answer, please try again 8)
>
> The kernel summit discussions were very much that ioctl has its place,
> and that the sysfs extremists were wrong. sysfs has its place (views
> ranging from that being /dev/null upwards) but sysfs is useless for many
> kinds of interface including those with read/write or other
> synchronization properties, those that trigger actions and those that
> are tied to the file handle you are working with. An executing VM
> interface via sysfs is a ludicrous concept.
>
> Making sure the ioctl sizes are the same in 32/64bit and aligned the
> same way is the more important issue.
>

ioctls are probably wrong here though. Ideally, you would want to be
able to support an SMP guest. This means you need to have two virtual
processors executing in kernel space. If you use ioctls, it forces you
to have two separate threads in userspace. This would be hard for
something like QEMU which is currently single threaded (and not at all
thread safe).

If you used a read/write interface, you could poll for any number of
processors and handle IO emulation in a single userspace thread (which
seems closer to how hardware really works anyway).

Regards,

Anthony Liguori

> Alan
>

2006-10-19 18:55:38

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Muli Ben-Yehuda wrote:
> On Thu, Oct 19, 2006 at 03:45:49PM +0200, Avi Kivity wrote:
>
>
>> The following patchset adds a driver for Intel's hardware virtualization
>> extensions to the x86 architecture. The driver adds a character device
>> (/dev/kvm) that exposes the virtualization capabilities to userspace. Using
>> this driver, a process can run a virtual machine (a "guest") in a fully
>> virtualized PC containing its own virtual hard disks, network adapters, and
>> display.
>>
>
> Hi,
>
> Looks pretty interesting! Some comments:
>
> - patch 4/7 hasn't made it to the list?
> - it would be useful for reviewing this if you could post example code
> making use of the /dev/kvm interfaces - they seem fairly complex.
> - why do it this way rather than through a virtual machine monitor
> such as Xen? what do you gain from having the virtual machines
> encapsulated as Linux processes?
>

With VT (or even SVM) you gain nothing from having a microkernel-based
hypervisor. With paravirtualization, having a hypervisor that fits into
64MB of address space is critical for reducing the cost of hypercalls
(to avoid TLB flushes).

With both VT and SVM, address-space switching is mandatory. Since this
is already occurring, a microkernel (like Xen) has no performance
benefit over a macrokernel (like Linux).

Not to mention, many of the VT/SVM performance problems in Xen are
related to the amount of switching required to service I/O requests
(from the HVM domain, to the hypervisor, to the dom0 kernel, then to
dom0 userspace). KVM only switches from guest, to kernel, to userspace,
so I find it highly likely that this is the faster approach.

There are some reasons why you may still want a hypervisor (resource
isolation and scheduler guarantees) but there's nothing fundamental that
keeps one from adding those to Linux.

This is definitely good stuff. Too much of it is just taken from Xen,
though, and ought to be thought out a little more, but for what it's
worth, I think this is the right idea.

Regards,

Anthony Liguori

> Cheers,
> Muli
>
>

2006-10-19 19:02:13

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Avi Kivity wrote:
> Andi Kleen wrote:
>> Avi Kivity <[email protected]> writes:
>>
>>
>>> The following patchset adds a driver for Intel's hardware
>>> virtualization
>>> extensions to the x86 architecture. The driver adds a character device
>>> (/dev/kvm) that exposes the virtualization capabilities to
>>> userspace. Using
>>> this driver, a process can run a virtual machine (a "guest") in a fully
>>> virtualized PC containing its own virtual hard disks, network
>>> adapters, and
>>> display.
>>>
>>> Using this driver, one can start multiple virtual machines on a
>>> host. Each
>>> virtual machine is a process on the host; a virtual cpu is a thread
>>> in that
>>> process. kill(1), nice(1), top(1) work as expected.
>>>
>>
>> Where is the user space for this? Is it free?
>
> I have to go through the motions of creating a sourceforge project for
> this and uploading it. And yes, it is free.

Don't even bother creating a project. Just submit the patches back to
QEMU. There's been a lot of discussion about this functionality within
QEMU. There's no reason to fork QEMU yet again (Xen has given up and is
now maintaining a patch queue). In this case, there's no reason why you
would even need a patch queue.

Regards,

Anthony Liguori

>> I suppose you need a device model. Do you use qemu's?
>>
>
> Yes. I can't imagine anyone doing that work from scratch (Xen also
> uses qemu).
>

2006-10-19 19:04:11

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Anthony Liguori wrote:
> Sorry if I missed this, but can you provide a link to the QEMU changes?
>

I'll do that once I get my sourceforge page and post it here. Watch
this space.


> It's hard to tell what's going on without seeing the userspace
> portions of this.
>
> My initial impression is that you've taken the Xen approach of trying
> to use QEMU only for IO emulation. If this is the case, it won't work
> long term. While you can use vm86 mode for 16 bit virtualization for
> most cases, it cannot handle big real mode. You need the ability to
> transfer down to QEMU and allow it to do emulation.
>

We started using VT only for 64 bit, then added 32 bit, then 16-bit
protected, then vm86 and real mode. We'd transfer the x86 state on each
mode change, but it was (a) fragile (b) considered unclean.

You're right that "big real" mode is not supported, but so far that
hasn't been a problem. Do you know of an OS that needs big real mode?

> Ideally, instead of having as large of an x86 emulator in kernel
> space, you would just drop down to QEMU to do emulation as needed
> (doing only a single basic block and returning). This would let you
> have a much reduced partial emulator in kernel space that only did the
> most common (and performance critical) instructions.
>

Over time that emulator would grow as OSes and compilers evolve... and
we'd really like to keep basic things like the apic in the kernel (as
does Xen).


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-19 19:09:12

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Avi Kivity wrote:
> Anthony Liguori wrote:
>> Sorry if I missed this, but can you provide a link to the QEMU changes?
>>
>
> I'll do that once I get my sourceforge page and post it here. Watch
> this space.
>
>
>> It's hard to tell what's going on without seeing the userspace
>> portions of this.
>>
>> My initial impression is that you've taken the Xen approach of trying
>> to use QEMU only for IO emulation. If this is the case, it won't
>> work long term. While you can use vm86 mode for 16 bit
>> virtualization for most cases, it cannot handle big real mode. You
>> need the ability to transfer down to QEMU and allow it to do emulation.
>>
>
> We started using VT only for 64 bit, then added 32 bit, then 16-bit
> protected, then vm86 and real mode. We'd transfer the x86 state on
> each mode change, but it was (a) fragile (b) considered unclean.
>
> You're right that "big real" mode is not supported, but so far that
> hasn't been a problem. Do you know of an OS that needs big real mode?

AFAIK the SLES boot splash patches to grub use it. It's definitely a
requirement. Currently, there is an effort in Xen to use QEMU for
partial emulation. Hopefully, it will be there for the next release.

Allowing QEMU to do emulation also will help with IO performance.
Instead of having to take many trips to userspace for MMIO especially,
you can allow QEMU to execute a certain number of basic blocks and then
return. Minimizing trips between userspace and the kernel is going to
be critical performance wise.

>> Ideally, instead of having as large of an x86 emulator in kernel
>> space, you would just drop down to QEMU to do emulation as needed
>> (doing only a single basic block and returning). This would let you
>> have a much reduced partial emulator in kernel space that only did
>> the most common (and performance critical) instructions.
>>
>
> Over time that emulator would grow as OSes and compilers evolve... and
> we'd really like to keep basic things like the apic in the kernel (as
> does Xen).

I've been tossing around the idea of doing partial IO emulation in the
kernel. If you could sync the device states between userspace and
kernel, it should be possible. Given that you're already in the
kernel at VMEXIT time, if you could feed something right to the block
driver or network driver, you ought to be able to get pretty darn good
performance.

However, I do agree that it's better to start simple. I actually think
you could simplify further by using QEMU for more instruction emulation
and focusing only on the handful of instructions in the critical path
for the kernel.

Regards,

Anthony Liguori

2006-10-19 19:10:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Anthony Liguori wrote:
>
> ioctls are probably wrong here though. Ideally, you would want to be
> able to support an SMP guest. This means you need to have two virtual
> processors executing in kernel space. If you use ioctls, it forces
> you to have two separate threads in userspace. This would be hard for
> something like QEMU which is currently single threaded (and not at all
> thread safe).
>

Since we're using the Linux scheduler, we need a task per virtual cpu
anyway, so a thread per vcpu is not a problem.


> If you used a read/write interface, you could poll for any number of
> processors and handle IO emulation in a single userspace thread (which
> seems closer to how hardware really works anyway).
>

We can still do that by having the thread write an I/O request to a
hardware service thread, and read back the response. However, that will
not be too good for scheduling. For now the SMP plan is to slap a
single lock on the qemu device model, and later fine-grain the locking
on individual devices as necessary.

Qemu's transition to aio will probably help in reducing the amount of
work done under lock.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-19 19:15:04

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Anthony Liguori wrote:
>>
>> I have to go through the motions of creating a sourceforge project
>> for this and uploading it. And yes, it is free.
>
> Don't even bother creating a project. Just submit the patches back to
> QEMU. There's been a lot of discussion about this functionality
> within QEMU. There's no reason to fork QEMU yet again (Xen has given
> up and is now maintaining a patch queue). In this case, there's no
> reason why you would even need a patch queue.

We don't plan to fork qemu, too much good stuff is being added there.
However I want to submit the patch for inclusion only after (if?) the
driver is merged into the mainline kernel. The patch is also not very
pretty at the moment.

I'll post it as a patch and probably a binary rpm for the lazy.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-19 19:17:22

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> ioctls are probably wrong here though. Ideally, you would want to be
>> able to support an SMP guest. This means you need to have two
>> virtual processors executing in kernel space. If you use ioctls, it
>> forces you to have two separate threads in userspace. This would be
>> hard for something like QEMU which is currently single threaded (and
>> not at all thread safe).
>>
>
> Since we're using the Linux scheduler, we need a task per virtual cpu
> anyway, so a thread per vcpu is not a problem.
>

You miss my point I think. Using ioctls *requires* a thread per-vcpu in
userspace. This is unnecessary since you could simply provide a
char-device based read/write interface. You could then multiplex events
and poll.
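For illustration, here is a minimal sketch of the single-userspace-thread model being argued for here. The per-vcpu read/poll interface is only a proposal in this thread, so everything below is hypothetical: the exit record, the helper name, and the fd semantics (plain pipes can stand in for vcpu fds when trying it out).

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical record a vcpu fd would yield on read() when the guest exits. */
struct vcpu_exit {
    int vcpu;    /* which virtual cpu exited */
    int reason;  /* e.g. MMIO, PIO, HLT */
};

/*
 * One userspace thread waits on all vcpu fds at once and returns the
 * index of a vcpu with an exit pending, or -1 on timeout.  A blocking
 * ioctl(KVM_RUN) per vcpu cannot be multiplexed like this; a file
 * descriptor you can poll() can, which is the point being made.
 */
int wait_for_exit(const int *fds, int nfds, int timeout_ms)
{
    struct pollfd pfd[16];
    int i;

    for (i = 0; i < nfds && i < 16; i++) {
        pfd[i].fd = fds[i];
        pfd[i].events = POLLIN;
    }
    if (poll(pfd, nfds, timeout_ms) <= 0)
        return -1;
    for (i = 0; i < nfds; i++)
        if (pfd[i].revents & POLLIN)
            return i;
    return -1;
}
```

The same loop also absorbs timer and DMA-completion fds, which is the UP-case argument below.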

If for nothing else, you have to be able to run timers in userspace and
interrupt the kernel execution (to signal DMA completion for instance).
Even in the UP case, this gets ugly quickly.

read/write is really just a much cleaner interface for anything that has
blocking semantics.

Regards,

Anthony Liguori

>> If you used a read/write interface, you could poll for any number of
>> processors and handle IO emulation in a single userspace thread
>> (which seems closer to how hardware really works anyway).
>>
>
> We can still do that by having the thread write an I/O request to a
> hardware service thread, and read back the response. However that
> will not be too good for scheduling. For now the smp plan is to slap
> a single lock on the qemu device model, and later fine-grain the
> locking on individual devices as necessary.
>
> Qemu's transition to aio will probably help in reducing the amount of
> work done under lock.
>

2006-10-19 19:26:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Anthony Liguori wrote:
>>
>> We started using VT only for 64 bit, then added 32 bit, then 16-bit
>> protected, then vm86 and real mode. We'd transfer the x86 state on
>> each mode change, but it was (a) fragile (b) considered unclean.
>>
>> You're right that "big real" mode is not supported, but so far that
>> hasn't been a problem. Do you know of an OS that needs big real mode?
>
> AFAIK the SLES boot splash patches to grub use it. It's definitely a
> requirement. Currently, there is an effort in Xen to use QEMU for
> partial emulation. Hopefully, it will be there for the next release.
>

Ouch. Boot splash patches.

Well, we had real mode in qemu once, we can put it there again.

> Allowing QEMU to do emulation also will help with IO performance.
> Instead of having to take many trips to userspace for MMIO especially,
> you can allow QEMU to execute a certain number of basic blocks and
> then return. Minimizing trips between userspace and the kernel is
> going to be critical performance wise.
>

My plan was to allow userspace to register certain mmio addresses for
caching, so that if the guest code had a code sequence like

writel(dst_x_reg, x);
writel(dst_y_reg, y);
writel(width_reg, w);
writel(height_reg, h);
writel(blt_cmd_reg, fill);

then kvm would cache the first four in a mmap()able memory area and only
exit to userspace on the fifth. Userspace would then read the cached
registers from memory and emulate the command.
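The latching scheme described here reduces to a few lines of decision logic. The register indices, struct, and function below are invented for the blit example; in the proposal the cache would live in a page mmap()ed by userspace.

```c
#include <assert.h>

/* Invented register indices for the blit example. */
enum { DST_X, DST_Y, WIDTH, HEIGHT, BLT_CMD, NREGS };

struct mmio_cache {
    unsigned long regs[NREGS]; /* would live in a user-mmap()able page */
    int trigger;               /* register whose write forces an exit */
};

/*
 * Kernel-side handling of a guest MMIO write: latch the value and
 * report whether we must exit to userspace (only on the trigger
 * register -- the blit command here).
 */
int cached_mmio_write(struct mmio_cache *c, int reg, unsigned long val)
{
    c->regs[reg] = val;
    return reg == c->trigger;
}
```

Userspace then reads the four latched registers from the shared page when the fifth write exits, saving four round trips per blit.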

This saves userspace transitions but not guest/host switches. I'm
counting on Intel and AMD to reduce the cost of these, but it will
probably never be cheap.

Paravirtualized drivers are also an option; we may try to keep
compatibility with Xen's.


>
> I've been tossing around the idea of doing partial IO emulation in the
> kernel. If you could sync the device states between userspace and
kernel, it should be possible. Given that you're already in the
> kernel at VMEXIT time, if you could feed something right to the block
> driver or network driver, you ought to be able to get pretty darn good
> performance.
>

Do you mean putting the device model into the kernel?


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-19 19:29:02

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Avi Kivity wrote:
> Anthony Liguori wrote:
>>>
>>> I have to go through the motions of creating a sourceforge project
>>> for this and uploading it. And yes, it is free.
>>
>> Don't even bother creating a project. Just submit the patches back
>> to QEMU. There's been a lot of discussion about this functionality
>> within QEMU. There's no reason to fork QEMU yet again (Xen has given
>> up and is now maintaining a patch queue). In this case, there's no
>> reason why you would even need a patch queue.
>
> We don't plan to fork qemu, too much good stuff is being added there.
> However I want to submit the patch for inclusion only after (if?) the
> driver is merged into the mainline kernel. The patch is also not very
> pretty at the moment.

I would recommend the opposite approach. Something can go in QEMU more
quickly than in the kernel and you'll get people actually testing it.
If it gets in the kernel and there's no major userspace presence you
won't have nearly the same number of testers...

Regards,

Anthony Liguori

> I'll post it as a patch and probably a binary rpm for the lazy.
>
>

2006-10-19 19:31:39

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Avi Kivity wrote:
> Anthony Liguori wrote:
>>>
>>> We started using VT only for 64 bit, then added 32 bit, then 16-bit
>>> protected, then vm86 and real mode. We'd transfer the x86 state on
>>> each mode change, but it was (a) fragile (b) considered unclean.
>>>
>>> You're right that "big real" mode is not supported, but so far that
>>> hasn't been a problem. Do you know of an OS that needs big real mode?
>>
>> AFAIK the SLES boot splash patches to grub use it. It's definitely a
>> requirement. Currently, there is an effort in Xen to use QEMU for
>> partial emulation. Hopefully, it will be there for the next release.
>>
>
> Ouch. Boot splash patches.
>
> Well, we had real mode in qemu once, we can put it there again.
>
>> Allowing QEMU to do emulation also will help with IO performance.
>> Instead of having to take many trips to userspace for MMIO
>> especially, you can allow QEMU to execute a certain number of basic
>> blocks and then return. Minimizing trips between userspace and the
>> kernel is going to be critical performance wise.
>>
>
> My plan was to allow userspace to register certain mmio addresses for
> caching, so that if the guest code had a code sequence like
>
> writel(dst_x_reg, x);
> writel(dst_y_reg, y);
> writel(width_reg, w);
> writel(height_reg, h);
> writel(blt_cmd_reg, fill);
>
> then kvm would cache the first four in a mmap()able memory area and
> only exit to userspace on the fifth. Userspace would then read the
> cached registers from memory and emulate the command.

Letting QEMU do a certain amount of emulation after every transition
would solve the problem in a more elegant and generic way.

> This saves userspace transitions but not guest/host switches. I'm
> counting on Intel and AMD to reduce the cost of these, but it will
> probably never be cheap.
>
> Paravirtualized drivers are also an option; we may try to keep
> compatibility with Xen's.

Please no! With proper device emulation, paravirtualized drivers
shouldn't be necessary.

>
>>
>> I've been tossing around the idea of doing partial IO emulation in
>> the kernel. If you could sync the device states between userspace
>> and kernel, it should be possible. Given that you're already in
>> the kernel at VMEXIT time, if you could feed something right to the
>> block driver or network driver, you ought to be able to get pretty
>> darn good performance.
>>
>
> Do you mean putting the device model into the kernel?

Perhaps part of it. Still an idea at this point.

Regards,

Anthony Liguori

2006-10-19 20:10:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Avi Kivity <[email protected]> writes:
>
> You're right that "big real" mode is not supported, but so far that
> hasn't been a problem. Do you know of an OS that needs big real mode?

Some SUSE releases' iso boot loader, and FreeBSD.

-Andi

2006-10-19 20:15:12

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface



>+#ifndef __user
>+#define __user
>+#endif

SHRUG. You should include <linux/compiler.h> instead of doing that. (And
on top, it may happen that compiler.h is automatically slurped in like
config.h, someone else could answer that)


-`J'
--

2006-10-19 20:19:27

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH 2/7] KVM: Intel virtual mode extensions definitions


>+/*
>+ * vmx.h: VMX Architecture related definitions
>+ * Copyright (c) 2004, Intel Corporation.

Bitrot code? My calendar shows 2006 :)


>+/* VMCS Encordings */
^
encodings.


-`J'
--

2006-10-19 20:26:32

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH 5/7] KVM: mmu virtualization


>+static int is_write_protection(void)
>+{
>+ return guest_cr0() & CR0_WP_MASK;
>+}
>+
>+static int is_cpuid_PSE36(void)
>+{
>+ return 1;
>+}
>+
>+static int is_present_pte(unsigned long pte)
>+{
>+ return pte & PT_PRESENT_MASK;
>+}
>+
>+static int is_writeble_pte(unsigned long pte)
>+{
>+ return pte & PT_WRITABLE_MASK;
>+}
>+
>+static int is_io_pte(unsigned long pte)
>+{
>+ return pte & PT_SHADOW_IO_MARK;
>+}

Unless the above will grow in size in later patches, or have their
address taken, mark them static inline.

>+static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
>+{
>+}

Remove it unless needed shortly afterwards.

>+static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr)
>+{
>+ return vaddr;
>+}

Candidate for inline too.

>+static void nonpaging_inval_page(struct kvm_vcpu *vcpu, gva_t addr)
>+{
>+}

Removal candidate

>+
>+static void paging_new_cr3(struct kvm_vcpu *vcpu)
>+{
>+ kvm_mmu_flush_tlb(vcpu);
>+}
>+
>+static void mark_pagetable_nonglobal(void *shadow_pte)
>+{
>+ page_header(__pa(shadow_pte))->global = 0;
>+}
>
>+static void paging_free(struct kvm_vcpu *vcpu)
>+{
>+ nonpaging_free(vcpu);
>+}

>+int kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>+{
>+ destroy_kvm_mmu(vcpu);
>+ return init_kvm_mmu(vcpu);
>+}

Inline candidate.

>Index: linux-2.6/drivers/kvm/paging_tmpl.h
>===================================================================
>+ #define PT_DIR_BASE_ADDR_MASK PT32_DIR_BASE_ADDR_MASK
>+ #define PT_INDEX(addr, level) PT32_INDEX(addr, level)
>+ #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
>+ #define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level)
>+ #define PT_PTE_COPY_MASK PT32_PTE_COPY_MASK
>+ #define PT_NON_PTE_COPY_MASK PT32_NON_PTE_COPY_MASK
>+#else
>+ error
>+#endif

#error Free Ad Space Here
it is.

>+static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>+ struct guest_walker *walker)

I am not particularly thrilled about the FNAME() thing but I do not see a
better way ATM. Maybe someone else does.



-`J'
--

2006-10-19 20:34:49

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Ar Iau, 2006-10-19 am 13:49 -0500, ysgrifennodd Anthony Liguori:
> ioctls are probably wrong here though. Ideally, you would want to be
> able to support an SMP guest. This means you need to have two virtual
> processors executing in kernel space. If you use ioctls, it forces you

Not really, and in fact with qemu you'd want to halt the second virtual
CPU on a trap until emulation was over, if only to get I/O and other
instruction ordering right. That's not an argument that only that view
should be supported, of course.

> If you used a read/write interface, you could poll for any number of
> processors and handle IO emulation in a single userspace thread (which
> seems closer to how hardware really works anyway).

Agreed.

Alan

2006-10-19 21:51:57

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2/7] KVM: Intel virtual mode extensions definitions

Ar Iau, 2006-10-19 am 22:19 +0200, ysgrifennodd Jan Engelhardt:
> >+/*
> >+ * vmx.h: VMX Architecture related definitions
> >+ * Copyright (c) 2004, Intel Corporation.
>
> Bitrot code? My calendar shows 2006 :)

Yes but Intel wrote the file in question in 2004. So the (c) is correct.


2006-10-19 22:13:14

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Ar Iau, 2006-10-19 am 14:31 -0500, ysgrifennodd Anthony Liguori:
> > My plan was to allow userspace to register certain mmio addresses for
> > caching, so that if the guest code had a code sequence like

That's actually not ideal, having played with this for something else.
The best results I got with real-world hardware came from supporting
registration of sequences and address/mask pairs. You can also pre-load
answers to avoid trapping back out to user apps. Obviously emulating
virtualized hardware with proper guest OS drivers is far better still.

What I had in the end looked something like this:

Groups of addresses in a table. Each table has state bits. Each access
can be conditional on a mask of state bits being 1/0. Each access can
also either trap or not.

Within each access, the rule was matched by address and width, then by
masks.

Firstly, if the bits matching a transition mask changed to the
transition state bits, then we trapped:

	if ((new_value & transition_mask) == transition_bits)

so you can avoid trapping out on stuff that doesn't "fire" an event - eg
the head select on IDE.

Then the I/O value was merged with a mask of fixed bits (for read-only
bits handled without trapping into emulation), which occur a lot, and
stored in an array slot number given by the rule (with overlaps for
.b/.w allowed). Finally the state bits were updated by the rule, again
using a mask and bits.

Similar rules applied to reads, so that values that didn't need traps
could be handled directly. Repeating I/O had a special-case (that's a
"hack") rule type for saying eg "512 bytes", then trap.

A trap was also allowed to load back a prediction sequence. That allowed
the user space side to "guess" the usual behaviour of the driver stuff
being emulated, so if it got a given event it could feed a sequence of
address/size/value entries back [I never did make these conditional, to
be cleverer].

This means you can do stuff like IDE by trapping mostly on the final
'kick' of a command; if it's a read, then predict the 512-byte insw()
from the driver and the next 5 or 6 port read accesses for each I/O.

The state stuff is very compact as it's basically:

	while (rule) {
		if (bitcompare(table->state, rule->state)) {
			rule = rule->next;
			continue;
		}
		if (bitcompare(rule->transition, value) == 0)
			return TRAP;
		value &= rule->value[0];
		value |= rule->value[1];
		table->state &= rule->newstate[0];
		table->state |= rule->newstate[1];
		table->cache[rule->cache] = value;
		rule = rule->next;
	}

and for read on a given table just a case of:

	if (predictor == NULL || port != predictor->port ||
	    size != predictor->size) {
		TRAP();
	} else {
		value = table->cache[predictor->cache];
		predictor = predictor->next;
	}
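The loop above can be fleshed out into a compilable form. The struct layouts, field names, and mask conventions below are reconstructions guessed from the prose description, not the original code:

```c
#include <assert.h>
#include <stddef.h>

struct io_table {
    unsigned int state;       /* the table's state bits */
    unsigned long cache[8];   /* latched values, readable without traps */
};

struct io_rule {
    unsigned int state_mask, state_bits;   /* rule active only in this state */
    unsigned long trans_mask, trans_bits;  /* value transition that must trap */
    unsigned long and_val, or_val;         /* merge mask for fixed/read-only bits */
    unsigned int state_and, state_or;      /* state-bit update */
    int cache_slot;
    struct io_rule *next;
};

enum { HANDLED, TRAP };

/* Walk the rule list for a guest I/O write; absorb it or demand a trap. */
int io_write(struct io_table *t, struct io_rule *rule, unsigned long value)
{
    for (; rule; rule = rule->next) {
        if ((t->state & rule->state_mask) != rule->state_bits)
            continue;                     /* rule not active in this state */
        if ((value & rule->trans_mask) == rule->trans_bits)
            return TRAP;                  /* this write "fires" an event */
        value = (value & rule->and_val) | rule->or_val;
        t->state = (t->state & rule->state_and) | rule->state_or;
        t->cache[rule->cache_slot] = value;
    }
    return HANDLED;
}
```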

Alan

2006-10-19 23:26:46

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

On Thu, Oct 19, 2006 at 04:43:20PM +0200, Avi Kivity wrote:
> John Stoffel wrote:
> >Avi> This patch defines a bunch of ioctl()s on /dev/kvm. The ioctl()s
> >Avi> allow adding memory to a virtual machine, adding a virtual cpu to
> >Avi> a virtual machine (at most one at this time), transferring
> >Avi> control to the virtual cpu, and querying about guest pages
> >Avi> changed by the virtual machine.
> >
> >Yuck. ioclts are deprecated, you should be using /sysfs instead for
> >stuff like this, or configfs.
> >
>
> I need to pass small amounts of data back and forth very efficiently.
> sysfs and configfs are more for one-time admin stuff, not for continuous
> device control.

I agree, sysfs or configfs is not for what you are wanting to do here.

thanks,

greg k-h

2006-10-20 07:16:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Jan Engelhardt wrote:
>> +#ifndef __user
>> +#define __user
>> +#endif
>>
>
> SHRUG. You should include <linux/compiler.h> instead of doing that. (And
> on top, it may happen that compiler.h is automatically slurped in like
> config.h, someone else could answer that)
>

This is for userspace. If there's a better solution I'll happily
incorporate it.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-20 07:17:22

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/7] KVM: Intel virtual mode extensions definitions

Jan Engelhardt wrote:
>> +/* VMCS Encordings */
>>
> ^
> encodings.
>
>

Will fix, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-20 07:24:59

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 5/7] KVM: mmu virtualization

Jan Engelhardt wrote:
>> +static int is_write_protection(void)
>> +{
>> + return guest_cr0() & CR0_WP_MASK;
>> +}
>> +
>> +static int is_cpuid_PSE36(void)
>> +{
>> + return 1;
>> +}
>> +
>> +static int is_present_pte(unsigned long pte)
>> +{
>> + return pte & PT_PRESENT_MASK;
>> +}
>> +
>> +static int is_writeble_pte(unsigned long pte)
>> +{
>> + return pte & PT_WRITABLE_MASK;
>> +}
>> +
>> +static int is_io_pte(unsigned long pte)
>> +{
>> + return pte & PT_SHADOW_IO_MARK;
>> +}
>>
>
> Unless the above will grow in size in later patches, or have their
> address taken, mark them static inline.
>
>

gcc inlines them well enough based on size. That means I don't have to
keep adding/removing inline declarations.

>> +static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
>> +{
>> +}
>>
>
> Remove it unless needed shortly afterwards.
>
>

That's a (*callback)().

>> +static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr)
>> +{
>> + return vaddr;
>> +}
>>
>
> Candidate for inline too.
>
>

Ditto.


>> +static void nonpaging_inval_page(struct kvm_vcpu *vcpu, gva_t addr)
>> +{
>> +}
>>
>
> Removal candidate
>
>

Ditto.

>> +int kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>> +{
>> + destroy_kvm_mmu(vcpu);
>> + return init_kvm_mmu(vcpu);
>> +}
>>
>
> Inline candidate.
>
>

It's used in another file. gcc will probably inline its callees.

>> Index: linux-2.6/drivers/kvm/paging_tmpl.h
>> ===================================================================
>> + #define PT_DIR_BASE_ADDR_MASK PT32_DIR_BASE_ADDR_MASK
>> + #define PT_INDEX(addr, level) PT32_INDEX(addr, level)
>> + #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
>> + #define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level)
>> + #define PT_PTE_COPY_MASK PT32_PTE_COPY_MASK
>> + #define PT_NON_PTE_COPY_MASK PT32_NON_PTE_COPY_MASK
>> +#else
>> + error
>> +#endif
>>
>
> #error Free Ad Space Here
> it is.
>
>

Will fix. Thanks for the review.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-20 07:36:30

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>>
>>> ioctls are probably wrong here though. Ideally, you would want to
>>> be able to support an SMP guest. This means you need to have two
>>> virtual processors executing in kernel space. If you use ioctls, it
>>> forces you to have two separate threads in userspace. This would be
>>> hard for something like QEMU which is currently single threaded (and
>>> not at all thread safe).
>>>
>>
>> Since we're using the Linux scheduler, we need a task per virtual cpu
>> anyway, so a thread per vcpu is not a problem.
>>
>
> You miss my point I think. Using ioctls *requires* a thread per-vcpu
> in userspace. This is unnecessary since you could simply provide a
> char-device based read/write interface. You could then multiplex
> events and poll.
>

Yes, ioctl()s require userspace threads, but that's okay, because
they're free for us, since we need a kernel thread for each vcpu.

On the other hand, a single device model thread polling the vcpus is
guaranteed to be on the wrong physical cpu for half of the time
(assuming 2 cpus and 2 vcpus), requiring IPIs and suspending a vcpu in
order to run.

> If for nothing else, you have to be able to run timers in userspace
> and interrupt the kernel execution (to signal DMA completion for
> instance). Even in the UP case, this gets ugly quickly.
>

The timers aren't pretty (we use signals), yes. But avoiding the extra
thread is critical for performance IMO.

> read/write is really just a much cleaner interface for anything that
> has blocking semantics.
>

Ah, but scheduling a vcpu doesn't just block, it consumes the physical cpu.

All other uses of read() yield the cpu apart from setup and copying of
the data.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-20 07:37:51

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Anthony Liguori wrote:
>>
>> We don't plan to fork qemu, too much good stuff is being added
>> there. However I want to submit the patch for inclusion only after
>> (if?) the driver is merged into the mainline kernel. The patch is
>> also not very pretty at the moment.
>
> I would recommend the opposite approach. Something can go in QEMU
> more quickly than in the kernel and you'll get people actually testing
> it. If it gets in the kernel and there's no major userspace presence
> you won't have nearly the same amount of testers...

I'll try doing both at once :)

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-20 07:42:46

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Anthony Liguori wrote:
>>
>> writel(dst_x_reg, x);
>> writel(dst_y_reg, y);
>> writel(width_reg, w);
>> writel(height_reg, h);
>> writel(blt_cmd_reg, fill);
>>
>> then kvm would cache the first four in a mmap()able memory area and
>> only exit to userspace on the fifth. Userspace would then read the
>> cached registers from memory and emulate the command.
>
> Letting QEMU do a certain amount of emulation after every transition
> would the problem in a more elegant and generic way.
>

But what amount? A basic block, or several?

Emulation has its costs. You need to marshal the registers to and fro.
You need to reset qemu's cached translations. You need to throw away
shadow page tables and qemu's softmmu. You increase the time spent in
single threaded code.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-20 15:33:35

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Avi Kivity wrote:
> Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> Anthony Liguori wrote:
>>>>
>>>> ioctls are probably wrong here though. Ideally, you would want to
>>>> be able to support an SMP guest. This means you need to have two
>>>> virtual processors executing in kernel space. If you use ioctls,
>>>> it forces you to have two separate threads in userspace. This
>>>> would be hard for something like QEMU which is currently single
>>>> threaded (and not at all thread safe).
>>>>
>>>
>>> Since we're using the Linux scheduler, we need a task per virtual
>>> cpu anyway, so a thread per vcpu is not a problem.
>>>
>>
>> You miss my point I think. Using ioctls *requires* a thread per-vcpu
>> in userspace. This is unnecessary since you could simply provide a
>> char-device based read/write interface. You could then multiplex
>> events and poll.
>>
>
> Yes, ioctl()s require userspace threads, but that's okay, because
> they're free for us, since we need a kernel thread for each vcpu.
>
> On the other hand, a single device model thread polling the vcpus is
> guaranteed to be on the wrong physical cpu for half of the time
> (assuming 2 cpus and 2 vcpus), requiring IPIs and suspending a vcpu in
> order to run.

And your previously proposed solution of having one big lock would do
the same thing except require additional round trips to the kernel :-)

Moreover, you could get clever and use mmap() to expose a ring queue if
you're really concerned about SMP.

Really though, it comes down to one simple thing: blocking ioctl()s are
a real ugly interface.

>> If for nothing else, you have to be able to run timers in userspace
>> and interrupt the kernel execution (to signal DMA completion for
>> instance). Even in the UP case, this gets ugly quickly.
>>
>
> The timers aren't pretty (we use signals), yes. But avoiding the
> extra thread is critical for performance IMO.

We've had a lot of problems in QEMU with timers and kqemu. Forcing the
guest to return to userspace to allow periodic timers to run (which may
simply be the VGA refresh which the guest doesn't care about) is at best
a hack. Being able to poll an FD would make this so much nicer...

I've posted some patches on qemu-devel attempting to deal with these
issues (look for threads on optimizing char device performance). None
of them are very pretty.

Regards,

Anthony Liguori

>> read/write is really just a much cleaner interface for anything that
>> has blocking semantics.
>>
>
> Ah, but scheduling a vcpu doesn't just block, it consumes the physical
> cpu.
>
> All other uses of read() yield the cpu apart from setup and copying of
> the data.
>
>

2006-10-20 15:35:45

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Avi Kivity wrote:
> Anthony Liguori wrote:
>>>
>>> writel(dst_x_reg, x);
>>> writel(dst_y_reg, y);
>>> writel(width_reg, w);
>>> writel(height_reg, h);
>>> writel(blt_cmd_reg, fill);
>>>
>>> then kvm would cache the first four in a mmap()able memory area and
>>> only exit to userspace on the fifth. Userspace would then read the
>>> cached registers from memory and emulate the command.
>>
>> Letting QEMU do a certain amount of emulation after every transition
>> would solve the problem in a more elegant and generic way.
>>
>
> But what amount? A basic block, or several?
>
> Emulation has its costs. You need to marshal the registers to and
> fro. You need to reset qemu's cached translations. You need to throw
> away shadow page tables and qemu's softmmu. You increase the time
> spent in single threaded code.

Admittedly still a research topic. If you're interested in what we're
doing in Xen, check out:

http://xenbits.xensource.com/ext/xen-unstable-hvm.hg (sorry, xenbits is
down right now but hopefully it will be fixed quickly).

Regards,

Anthony Liguori

2006-10-21 13:37:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

On Thu, 2006-10-19 at 15:47 +0200, Avi Kivity wrote:
> This patch defines a bunch of ioctl()s on /dev/kvm. The ioctl()s allow
> adding
> memory to a virtual machine, adding a virtual cpu to a virtual machine (at
> most one at this time), transferring control to the virtual cpu, and
> querying
> about guest pages changed by the virtual machine.
>
> Signed-off-by: Yaniv Kamay <[email protected]>
> Signed-off-by: Avi Kivity <[email protected]>
>
> Index: linux-2.6/include/linux/kvm.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6/include/linux/kvm.h

[...]

> +
> +/* for KVM_GET_REGS and KVM_SET_REGS */
> +struct kvm_regs {
> + /* in */
> + __u32 vcpu;
> + __u32 padding;
> +
> + /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
> + __u64 rax, rbx, rcx, rdx;
> + __u64 rsi, rdi, rsp, rbp;
> + __u64 r8, r9, r10, r11;
> + __u64 r12, r13, r14, r15;
> + __u64 rip, rflags;
> +};
> +

I know this is for userspace too, but still. Shouldn't this be in
include/asm-x86_64 and not include/linux?

-- Steve


2006-10-21 13:48:48

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/7] KVM: Intel virtual mode extensions definitions

On Thu, 2006-10-19 at 15:48 +0200, Avi Kivity wrote:
> Add some constants for the various bits defined by Intel's VT extensions.
>
> Most of this file was lifted from the Xen hypervisor.
>
> Signed-off-by: Yaniv Kamay <[email protected]>
> Signed-off-by: Avi Kivity <[email protected]>
>
> Index: linux-2.6/drivers/kvm/vmx.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6/drivers/kvm/vmx.h
> @@ -0,0 +1,287 @@

This entire file is also very specific to an architecture. Couldn't it
be put somewhere in arch/x86_64 and not in drivers?

I know that this is all currently focused on Intel and AMD
virtualization platforms, but could you split out the x86_64-specific
stuff and make the rest more generic? Perhaps in the future this will
make it easier for other platforms to use this code as well.

It's hard to do a generic approach when developing it new, but if you
don't think about that now, it will be orders of magnitude harder to
make it generic when this is all done.

-- Steve



> +#define LMSW_SOURCE_DATA_SHIFT 16
> +#define LMSW_SOURCE_DATA (0xFFFF << LMSW_SOURCE_DATA_SHIFT) /* 16:31 lmsw source */
> +#define REG_EAX (0 << 8)
> +#define REG_ECX (1 << 8)
> +#define REG_EDX (2 << 8)
> +#define REG_EBX (3 << 8)
> +#define REG_ESP (4 << 8)
> +#define REG_EBP (5 << 8)
> +#define REG_ESI (6 << 8)
> +#define REG_EDI (7 << 8)
> +#define REG_R8 (8 << 8)
> +#define REG_R9 (9 << 8)
> +#define REG_R10 (10 << 8)
> +#define REG_R11 (11 << 8)
> +#define REG_R12 (12 << 8)
> +#define REG_R13 (13 << 8)
> +#define REG_R14 (14 << 8)
> +#define REG_R15 (15 << 8)
> +


2006-10-21 15:50:10

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

On Friday 20 October 2006 09:16, Avi Kivity wrote:
> Jan Engelhardt wrote:
> >> +#ifndef __user
> >> +#define __user
> >> +#endif
> >>
> >
> > SHRUG. You should include <linux/compiler.h> instead of doing that. (And
> > on top, it may happen that compiler.h is automatically slurped in like
> > config.h, someone else could answer that)
> >
>
> This is for userspace. If there's a better solution I'll happily
> incorporate it.

It should just work without this, when you do 'make headers_install'.
See the top of scripts/Makefile.headersinst.

Arnd <><

2006-10-21 16:16:38

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Thursday 19 October 2006 20:00, Avi Kivity wrote:
> Working code is fairly hairy, since it's emulating a PC. That'll be on
> sourceforge once they approve my new project.
>
> In general one does
>
>   open("/dev/kvm")
>   ioctl(KVM_SET_MEMORY_REGION) for main memory
>   ioctl(KVM_SET_MEMORY_REGION) for the framebuffer
>   ioctl(KVM_CREATE_VCPU) for the obvious reason
>   if (debugger)
>     ioctl(KVM_DEBUG_GUEST) to singlestep or breakpoint the guest
>   while (1) {
>     ioctl(KVM_RUN)
>     switch (exit reason) {
>       handle mmio, I/O etc. might call
>         ioctl(KVM_INTERRUPT) to queue an external interrupt
>         ioctl(KVM_{GET,SET}_{REGS,SREGS}) to query/modify registers
>         ioctl(KVM_GET_DIRTY_LOG) to see which guest memory pages
>         have changed
>     }
>   }
>
> I have some simple test code, I'll clean it up and post it.

This looks _a_lot_ like what we're doing for the SPUs in the cell processor,
except that we're using different calls into the kernel. Have you looked
into what we have implemented there? The code is in
arch/powerpc/platforms/cell/spufs. I think it would be a good abstraction
to use for you as well, maybe we could even move to a common infrastructure,
as I have heard from a few other projects that want to do similar things.

The main differences to your interface are:

- A file system is used instead of a character device
- Directories, not open file descriptors represent contexts
- Two new syscalls were introduced (spu_create/spu_run)
- instead of ioctls, files represent different bits of information,
you can read/write, poll or mmap them.

Your example above could translate to something like:

int kvm_fd = kvm_create("/kvm/my_vcpu");
int mem_fd = openat(kvm_fd, "mem", O_RDWR);
void *mem = mmap(mem_fd, ...);   // main memory
void *fbmem = mmap(mem_fd, ...); // frame buffer memory
int regs_fd = openat(kvm_fd, "regs", O_RDWR);
int irq_fd = openat(kvm_fd, "irq", O_WRONLY);

if (debugger) {
	int fd = openat(kvm_fd, "debug", O_WRONLY);
	write(fd, "1", 1);
	close(fd);
}
while (1) {
	int exit_reason = kvm_run(kvm_fd, &kvm_descriptor);
	switch (exit_reason) {
	handle mmio, I/O etc. might call
		write(irq_fd, &interrupt_packet, sizeof (interrupt_packet));
		pread(regs_fd, &rax, sizeof rax, KVM_REG_RAX);
	}
}

Arnd <><

2006-10-22 08:10:17

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Anthony Liguori wrote:
>>>
>>> You miss my point I think. Using ioctls *requires* a thread
>>> per-vcpu in userspace. This is unnecessary since you could simply
>>> provide a char-device based read/write interface. You could then
>>> multiplex events and poll.
>>>
>>
>> Yes, ioctl()s require userspace threads, but that's okay, because
>> they're free for us, since we need a kernel thread for each vcpu.
>>
>> On the other hand, a single device model thread polling the vcpus is
>> guaranteed to be on the wrong physical cpu for half of the time
>> (assuming 2 cpus and 2 vcpus), requiring IPIs and suspending a vcpu
>> in order to run.
>
> And your previously proposed solution of having one big lock would do
> the same thing except require additional round trips to the kernel :-)

No, with no contention locks stay in userspace. And if there is
contention, we fine-grain the locks.

>
> Moreover, you could get clever and use mmap() to expose a ring queue
> if you're really concerned about SMP.
>
> Really though, it comes down to one simple thing: blocking ioctl()s
> are a real ugly interface.
>

I don't think they can be termed "blocking".

Most (all?) blocking calls offload work to some other device, like a
disk or a network card, and sleep if that device has to do any
processing. They follow the same basic procedure:

- if data (or bufferspace) is available, read (or write) it
- otherwise, sleep

But in this case the "other device" is the processor, so that model
doesn't fit very well, as it *forces* a context switch.

Moreover, we need to both read and write, which a single ioctl() allows,
but read()/write() requires two system calls.

>>> If for nothing else, you have to be able to run timers in userspace
>>> and interrupt the kernel execution (to signal DMA completion for
>>> instance). Even in the UP case, this gets ugly quickly.
>>>
>>
>> The timers aren't pretty (we use signals), yes. But avoiding the
>> extra thread is critical for performance IMO.
>
> We've had a lot of problems in QEMU with timers and kqemu. Forcing
> the guest to return to userspace to allow periodic timers to run
> (which may simply be the VGA refresh which the guest doesn't care
> about) is at best a hack.

You can also have an additional thread for the periodic stuff.

> Being able to poll an FD would make this so much nicer...
>
> I've posted some patches on qemu-devel attempting to deal with these
> issues (look for threads on optimizing char device performance). None
> of them are very pretty.
>

Xen is different since you already have a context switch by going to
domain 0.

--
error compiling committee.c: too many arguments to function

2006-10-22 08:14:39

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Steven Rostedt wrote:
>> +
>> +/* for KVM_GET_REGS and KVM_SET_REGS */
>> +struct kvm_regs {
>> + /* in */
>> + __u32 vcpu;
>> + __u32 padding;
>> +
>> + /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
>> + __u64 rax, rbx, rcx, rdx;
>> + __u64 rsi, rdi, rsp, rbp;
>> + __u64 r8, r9, r10, r11;
>> + __u64 r12, r13, r14, r15;
>> + __u64 rip, rflags;
>> +};
>> +
>>
>
> I know this is for userspace too, but still. Shouldn't this be in
> include/asm-x86_64 and not include/linux.
>

Most of this file is arch-independent and could be used for other
virtualization-capable architectures. I could move this snippet to
asm-x86_64, but where would that leave i386?

(i386 needs access to 64-bit registers since we support 64-bit guests on
a 64-bit host with a 32-bit userspace)

--
error compiling committee.c: too many arguments to function

2006-10-22 08:17:25

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/7] KVM: Intel virtual mode extensions definitions

Steven Rostedt wrote:
> On Thu, 2006-10-19 at 15:48 +0200, Avi Kivity wrote:
>
>> Add some constants for the various bits defined by Intel's VT extensions.
>>
>> Most of this file was lifted from the Xen hypervisor.
>>
>> Signed-off-by: Yaniv Kamay <[email protected]>
>> Signed-off-by: Avi Kivity <[email protected]>
>>
>> Index: linux-2.6/drivers/kvm/vmx.h
>> ===================================================================
>> --- /dev/null
>> +++ linux-2.6/drivers/kvm/vmx.h
>> @@ -0,0 +1,287 @@
>>
>
> This entire file is also very specific to an architecture. Couldn't it
> be put somewhere in arch/x86_64 and not in drivers?
>
> I know that this is all currently focused on Intel and AMD
> virtualization platforms, but could you split out the x86_64 specific
> stuff and make the rest more generic. Perhaps in the future this will
> make it easier for other platforms to use this code as well.
>

We're already doing some splitting since Intel and AMD have incompatible
extensions for doing this. The result however will still be x86 (-64
and i386) specific.

> It's hard to do a generic approach when developing it new, but if you
> don't think about that now, it will be magnitudes larger in difficulty
> to make generic when this is all done.
>

I don't know enough about ppc and ia64 virtualization for that. Perhaps
someone would like to comment.

--
error compiling committee.c: too many arguments to function

2006-10-22 08:19:43

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 1/7] KVM: userspace interface

Arnd Bergmann wrote:
> On Friday 20 October 2006 09:16, Avi Kivity wrote:
>
>> Jan Engelhardt wrote:
>>
>>>> +#ifndef __user
>>>> +#define __user
>>>> +#endif
>>>>
>>>>
>>> SHRUG. You should include <linux/compiler.h> instead of doing that. (And
>>> on top, it may happen that compiler.h is automatically slurped in like
>>> config.h, someone else could answer that)
>>>
>>>
>> This is for userspace. If there's a better solution I'll happily
>> incorporate it.
>>
>
> It should just work without this, when you do 'make headers_install'.
> See the top of scripts/Makefile.headersinst.
>
> Arnd <><
>

I'll remove it. Thanks.

--
error compiling committee.c: too many arguments to function

2006-10-22 08:37:54

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Arnd Bergmann wrote:
> This looks _a_lot_ like what we're doing for the SPUs in the cell processor,
> except that we're using different calls into the kernel. Have you looked
> into what we have implemented there? The code is in
> arch/powerpc/platforms/cell/spufs. I think it would be a good abstraction
> to use for you as well, maybe we could even move to a common infrastructure,
> as I have heard from a few other projects that want to do similar things.
>
> The main differences to your interface are:
>
> - A file system is used instead of a character device
> - Directories, not open file descriptors represent contexts
> - Two new syscalls were introduced (spu_create/spu_run)
> - instead of ioctls, files represent different bits of information,
> you can read/write, poll or mmap them.
>
> Your example above could translate to something like:
>
> int kvm_fd = kvm_create("/kvm/my_vcpu")
> int mem_fd = openat(kvm_fd, "mem", O_RDWR);
> void *mem = mmap(mem_fd, ...); // main memory
> void *fbmem = mmap(mem_fd, ...); // frame buffer memory
> int regs_fd = openat(kvm_fd, "regs", O_RDWR);
> int irq_fd = openat(kvm_fd, "regs", O_WRONLY);
>
> if (debugger) {
> int fd = openat(fvm_fd, "debug", O_WRONLY);
> write(fd, "1", 1);
> close(fd);
> }
> while (1) {
> int exit_reason = kvm_run(kvm_fd, &kvm_descriptor);
> switch (exit reason) {
> handle mmio, I/O etc. might call
> write(irq_fd, &interrupt_packet, sizeof (interrupt_packet));
> pread(regs_fd, &rax, sizeof rax, KVM_REG_RAX);
> }
>

[cc'ing some others to solicit their opinion]


I like this. Since we plan to support multiple vcpus per vm, the fs
structure might look like:

/kvm/my_vm
|
+----memory # mkdir to create memory slot.
| | # how to set size and offset?
| |
| +---0 # guest physical memory slot
| |
| +-- dirty_bitmap # read to get and atomically reset
| # the changed pages log
|
|
+----cpu # mkdir/rmdir to create/remove vcpu
|
+----0
| |
| +--- irq # write to inject an irq
| |
| +--- regs # read/write to get/set registers
| |
| +--- debugger # write to set breakpoints/singlestep mode
|
+----1
[...]

It's certainly a lot more code though, and requires new syscalls. Since
this is a little esoteric, does it warrant new syscalls?

--
error compiling committee.c: too many arguments to function

2006-10-22 15:23:57

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Sunday 22 October 2006 10:37, Avi Kivity wrote:
> I like this. Since we plan to support multiple vcpus per vm, the fs
> structure might look like:
>
> /kvm/my_vm
>     |
>     +----memory          # mkdir to create memory slot.

Note that the way spufs does it, every directory is a reference-counted
object. Currently that includes single contexts and groups of
contexts that are supposed to be scheduled simultaneously.

The trick is that we use the special 'spu_create' syscall to
add a new object, while naming it, and return an open file
descriptor to it. When that file descriptor gets closed, the
object gets garbage-collected automatically.

This way you can simply kill a task, which also cleans up
all the special objects it allocated.

We ended up adding a lot more files than we initially planned,
but the interface is really handy, especially if you want to
create some procps-like tools for it.

>     |     |              #    how to set size and offset?
>     |     |
>     |     +---0          # guest physical memory slot
>     |         |
>     |         +-- dirty_bitmap  # read to get and atomically reset
>     |                           # the changed pages log

Have you thought about simply defining your guest to be a section
of the process's virtual address space? That way you could use
an anonymous mapping in the host as your guest address space, or
even use a file backed mapping in order to make the state persistent
over multiple runs. Or you could map the guest kernel into the
guest real address space with a private mapping and share the
text segment over multiple guests to save L2 and RAM.

>     |
>     |
>     +----cpu             # mkdir/rmdir to create/remove vcpu
>           |

I'd recommend not allowing mkdir or similar operations, although
it's not that far off. One option would be to let the user specify
the number of CPUs at kvm_create() time, another option might
be to allow kvm_create with a special flag or yet another syscall
to create the vcpu objects.

>           +----0
>           |     |
>           |     +--- irq     # write to inject an irq
>           |     |
>           |     +--- regs    # read/write to get/set registers
>           |     |
>           |     +--- debugger   # write to set breakpoints/singlestep mode
>           |
>           +----1
>                 [...]
>
> It's certainly a lot more code though, and requires new syscalls. Since
> this is a little esoteric does it warrant new syscalls?

We've gone through a number of iterations on the spufs design regarding this,
and in the end decided that the garbage-collecting property of spu_create
was superior to any other option, and adding the spu_run syscall was then
the logical step. BTW, one inspiration for spu_run came from sys_vm86,
which, as you are probably aware, is already doing a lot of what you do,
just not for protected mode guests.

Arnd <><

2006-10-22 16:18:38

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Arnd Bergmann wrote:
> On Sunday 22 October 2006 10:37, Avi Kivity wrote:
>
>> I like this. Since we plan to support multiple vcpus per vm, the fs
>> structure might look like:
>>
>> /kvm/my_vm
>> |
>> +----memory # mkdir to create memory slot.
>>
>
> Note that the way spufs does it, every directory is a reference-counted
> object. Currently that includes single contexts and groups of
> contexts that are supposed to be scheduled simultaneously.
>
> The trick is that we use the special 'spu_create' syscall to
> add a new object, while naming it, and return an open file
> descriptor to it. When that file descriptor gets closed, the
> object gets garbage-collected automatically.
>

Yes. Well, a single fd and ioctl()s do that as well.

>
> We ended up adding a lot more file than we initially planned,
> but the interface is really handy, especially if you want to
> create some procps-like tools for it.
>
>

I don't really see the need. The cell dsps are a shared resource, while
virtual machines are just another execution mode of an existing resource
- the main cpu, which has a sharing mechanism (the scheduler and
priorities).


>> | | # how to set size and offset?
>> | |
>> | +---0 # guest physical memory slot
>> | |
>> | +-- dirty_bitmap # read to get and atomically reset
>> | # the changed pages log
>>
>
> Have you thought about simply defining your guest to be a section
> of the processes virtual address space? That way you could use
> an anonymous mapping in the host as your guest address space, or
> even use a file backed mapping in order to make the state persistant
> over multiple runs. Or you could map the guest kernel into the
> guest real address space with a private mapping and share the
> text segment over multiple guests to save L2 and RAM.
>

I've thought of it but it can't work on i386 because guest physical
address space is larger than virtual address space on i386. So we
mmap("/dev/kvm") with file offsets corresponding to guest physical
addresses.

I still like that idea, since it allows using hugetlbfs and allowing
swapping. Perhaps we'll just accept the limitation that guests on i386
are limited.

>
>> |
>> |
>> +----cpu # mkdir/rmdir to create/remove vcpu
>> |
>>
>
> I'd recommend not allowing mkdir or similar operations, although
> it's not that far off. One option would be to let the user specify
> the number of CPUs at kvm_create() time, another option might
> be to allow kvm_create with a special flag or yet another syscall
> to create the vcpu objects.
>

Okay.

>
>> +----0
>> | |
>> | +--- irq # write to inject an irq
>> | |
>> | +--- regs # read/write to get/set registers
>> | |
>> | +--- debugger # write to set breakpoints/singlestep mode
>> |
>> +----1
>> [...]
>>
>> It's certainly a lot more code though, and requires new syscalls. Since
>> this is a little esoteric does it warrant new syscalls?
>>
>
> We've gone through a number of iterations on the spufs design regarding this,
> and in the end decided that the garbage-collecting property of spu_create
> was superior to any other option, and adding the spu_run syscall was then
> the logical step. BTW, one inspiration for spu_run came from sys_vm86, which
> as you are probably aware of is already doing a lot of what you do, just
> not for protected mode guests.
>

Yes, we're doing a sort of vmx86_64().

Thanks for the ideas, I'm certainly leaning towards a filesystem based
approach, and I'll also reconsider the mapping (mmap() vs. a virtual
address space subsection).

--
error compiling committee.c: too many arguments to function

2006-10-22 16:51:22

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Sunday 22 October 2006 18:18, Avi Kivity wrote:
> Arnd Bergmann wrote:

> > We ended up adding a lot more file than we initially planned,
> > but the interface is really handy, especially if you want to
> > create some procps-like tools for it.
>
> I don't really see the need. The cell dsps are a shared resource, while
> virtual machines are just another execution mode of an existing resource
> - the main cpu, which has a sharing mechanism (the scheduler and
> priorities).

I don't think it's that different. The Cell SPU scheduler is also
implemented in kernel space. Every application using an SPU program
has its own contexts in spufs and doesn't look at the others.

While we don't have it yet, we're thinking about adding a sputop
or something similar that shows the utilization of spus. You don't
> need that one, since you get exactly that with the regular top, but you
might want to have a tool that prints statistics about how often
your guests drop out of the virtualisation mode, or the number
of interrupts delivered to them.

> > Have you thought about simply defining your guest to be a section
> > of the processes virtual address space? That way you could use
> > an anonymous mapping in the host as your guest address space, or
> > even use a file backed mapping in order to make the state persistant
> > over multiple runs. Or you could map the guest kernel into the
> > guest real address space with a private mapping and share the
> > text segment over multiple guests to save L2 and RAM.
> >
>
> I've thought of it but it can't work on i386 because guest physical
> address space is larger than virtual address space on i386. So we
> mmap("/dev/kvm") with file offsets corresponding to guest physical
> addresses.
>
> I still like that idea, since it allows using hugetlbfs and allowing
> swapping. Perhaps we'll just accept the limitation that guests on i386
> are limited.

What is the point of 32 bit hosts anyway? Isn't this only available
on x86_64 type CPUs in the first place?

Arnd <><

2006-10-22 17:01:36

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Arnd Bergmann wrote:
> I don't think it's that different. The Cell SPU scheduler is also
> implemented in kernel space. Every application using an SPU program
> has its own contexts in spufs and doesn't look at the others.
>
>

Okay, I've misunderstood you before.


> While we don't have it yet, we're thinking about adding a sputop
> or something similar that shows the utilization of spus. You don't
> need that one, since get exactly that with the regular top, but you
> might want to have a tool that prints statistics about how often
> your guests drop out of the virtualisation mode, or the number
> of interrupts delivered to them.
>
>

We have a debugfs interface and a kvm_stat script which shows exactly
that (including a breakdown of the reasons for the exit).

>>> Have you thought about simply defining your guest to be a section
>>> of the processes virtual address space? That way you could use
>>> an anonymous mapping in the host as your guest address space, or
>>> even use a file backed mapping in order to make the state persistant
>>> over multiple runs. Or you could map the guest kernel into the
>>> guest real address space with a private mapping and share the
>>> text segment over multiple guests to save L2 and RAM.
>>>
>>>
>> I've thought of it but it can't work on i386 because guest physical
>> address space is larger than virtual address space on i386. So we
>> mmap("/dev/kvm") with file offsets corresponding to guest physical
>> addresses.
>>
>> I still like that idea, since it allows using hugetlbfs and allowing
>> swapping. Perhaps we'll just accept the limitation that guests on i386
>> are limited.
>>
>
> What is the point of 32 bit hosts anyway? Isn't this only available
> on x86_64 type CPUs in the first place?
>

No, 32-bit hosts are fully supported (except a 32-bit host can't run a
64-bit guest).

Admittedly, virtualization is a memory-intensive operation, so a 64-bit
host will usually be used.


--
error compiling committee.c: too many arguments to function

2006-10-22 17:06:39

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Sunday 22 October 2006 19:01, Avi Kivity wrote:
> > While we don't have it yet, we're thinking about adding a sputop
> > or something similar that shows the utilization of spus. You don't
> > need that one, since get exactly that with the regular top, but you
> > might want to have a tool that prints statistics about how often
> > your guests drop out of the virtualisation mode, or the number
> > of interrupts delivered to them.
> >
> >
>
> We have a debugfs interface and a kvm_stat script which shows exactly
> that (including a breakdown of the reasons for the exit).

Ok, good. But with your own file system, you wouldn't need debugfs
any more, and would have all information about a guest in one place.

Arnd <><

2006-10-22 17:40:06

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Avi Kivity wrote:
> Arnd Bergmann wrote:
>> This looks _a_lot_ like what we're doing for the SPUs in the cell
>> processor,
>> except that we're using different calls into the kernel. Have you looked
>> into what we have implemented there? The code is in
>> arch/powerpc/platforms/cell/spufs. I think it would be a good
>> abstraction
>> to use for you as well, maybe we could even move to a common
>> infrastructure,
>> as I have heard from a few other projects that want to do similar
>> things.
>>
>> The main differences to your interface are:
>>
>> - A file system is used instead of a character device
>> - Directories, not open file descriptors represent contexts
>> - Two new syscalls were introduced (spu_create/spu_run)
>> - instead of ioctls, files represent different bits of information,
>> you can read/write, poll or mmap them.
>>
>> Your example above could translate to something like:
>>
>> int kvm_fd = kvm_create("/kvm/my_vcpu")
>> int mem_fd = openat(kvm_fd, "mem", O_RDWR);
>> void *mem = mmap(mem_fd, ...); // main memory
>> void *fbmem = mmap(mem_fd, ...); // frame buffer memory
>> int regs_fd = openat(kvm_fd, "regs", O_RDWR);
>> int irq_fd = openat(kvm_fd, "regs", O_WRONLY);
>>
>> if (debugger) {
>> int fd = openat(fvm_fd, "debug", O_WRONLY);
>> write(fd, "1", 1);
>> close(fd);
>> }
>> while (1) {
>> int exit_reason = kvm_run(kvm_fd, &kvm_descriptor);
>> switch (exit reason) {
>> handle mmio, I/O etc. might call
>> write(irq_fd, &interrupt_packet, sizeof
>> (interrupt_packet));
>> pread(regs_fd, &rax, sizeof rax, KVM_REG_RAX);
>> }
>>
>
> [cc'ing some others to solicit their opinion]
>
> I like this. Since we plan to support multiple vcpus per vm, the fs
> structure might look like:

I like the idea of a filesystem. In particular, if you exposed the CPU
state as a mmap()'able file, you could read/write from userspace without
any syscall overhead.

There are some clever ways that you could get around needing that many
syscalls. For instance, you could have a "paused" file that you could
write a "1" into in order to run the guest (assuming that the memory/CPU
state is setup properly).

You could then have an "event" file that you could select() for read
on. When "event" became readable, you could read the exit reason, do
whatever is needed, and then write a "1" into "paused" again.

Perhaps an ioctl is better for pausing/unpausing but I do think it's
necessary to select() on something to wait for the next exit reason to
occur.

Regards,

Anthony Liguori

> /kvm/my_vm
> |
> +----memory # mkdir to create memory slot.
> | | # how to set size and offset?
> | |
> | +---0 # guest physical memory slot
> | |
> | +-- dirty_bitmap # read to get and atomically reset
> | # the changed pages log
> |
> |
> +----cpu # mkdir/rmdir to create/remove vcpu
> |
> +----0
> | |
> | +--- irq # write to inject an irq
> | |
> | +--- regs # read/write to get/set registers
> | |
> | +--- debugger # write to set breakpoints/singlestep mode
> |
> +----1
> [...]
>
> It's certainly a lot more code though, and requires new syscalls.
> Since this is a little esoteric does it warrant new syscalls?
>

2006-10-22 17:42:07

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Arnd Bergmann wrote:
> On Sunday 22 October 2006 19:01, Avi Kivity wrote:
>
>>> While we don't have it yet, we're thinking about adding a sputop
>>> or something similar that shows the utilization of spus. You don't
>>> need that one, since get exactly that with the regular top, but you
>>> might want to have a tool that prints statistics about how often
>>> your guests drop out of the virtualisation mode, or the number
>>> of interrupts delivered to them.
>>>
>>>
>>>
>> We have a debugfs interface and a kvm_stat script which shows exactly
>> that (including a breakdown of the reasons for the exit).
>>
>
> Ok, good. But with your own file system, you wouldn't need debugfs
> any more and have all information about a guest in one place.
>

One last thing: permissions. The /dev/kvm model allows permissions to
be controlled using standard unix access methods. How do you control
access to spufs?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-22 17:47:53

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Sunday 22 October 2006 19:41, Avi Kivity wrote:
> One last thing: permissions. The /dev/kvm model allows permissions to
> be controlled using standard unix access methods. How do you control
> access to spufs?

The mount point has permissions that you can set to allow read/write
access to users/groups. You can do that as a mount option or later
with chmod.

spu_create has an argument to specify the permissions for a new
object and follows the regular umask rules.

I also chose to allow users to set permissions on each file in order
to do cross-user IPC, but so far, nobody has used this to my knowledge.

Arnd <><

2006-10-22 17:54:10

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Sunday 22 October 2006 19:39, Anthony Liguori wrote:
> I like the idea of a filesystem. In particular, if you exposed the CPU
> state as a mmap()'able file, you could read/write from userspace without
> any syscall overhead.

Right. It's a little tricky though regarding what happens when you
write to the register mapping of a running guest, without stopping
it first.

> There are some clever ways that you could get around needing that many
> syscalls. For instance, you could have a "paused" file that you could
> write a "1" into in order to run the guest (assuming that the memory/CPU
> state is setup properly).

What for? Writing 1, then 0 to that file is two full syscalls.
Calling kvm_run and returning from it is just one.

You can also just send SIGSTOP/SIGCONT to the task to stop it.

> You could then have an "event" file that you could select() for read
> on. When "event" became readable, you could read the exit reason, do
> whatever is needed, and then write a "1" into "paused" again.

It's very handy to stay inside of a single process context for both
the hypervisor and the guest, and to simply block in a kvm_run syscall
for the time the guest executes.

This syscall can then simply return the exit reason as its return
value so you don't need another syscall to read it.

> Perhaps an ioctl is better for pausing/unpausing but I do think it's
> necessary to select() on something to wait for the next exit reason to
> occur.

I would not mix ioctls with a new file system. ioctl is fine on
a character device, but with a new file system, you should be able
to express everything as read/write, or one of the new syscalls.

Arnd <><

2006-10-22 17:56:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Sun, Oct 22, 2006 at 07:01:29PM +0200, Avi Kivity wrote:
> >What is the point of 32 bit hosts anyway? Isn't this only available
> >on x86_64 type CPUs in the first place?
> >
>
> No, 32-bit hosts are fully supported (except a 32-bit host can't run a
> 64-bit guest).

Again, what's the point? All cpus shipped by Intel and AMD that have
hardware virtualization extensions also support the 64bit mode. Given
that I don't see any point for supporting a 32bit host.

2006-10-22 18:00:24

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Christoph Hellwig wrote:
> On Sun, Oct 22, 2006 at 07:01:29PM +0200, Avi Kivity wrote:
>
>>> What is the point of 32 bit hosts anyway? Isn't this only available
>>> on x86_64 type CPUs in the first place?
>>>
>>>
>> No, 32-bit hosts are fully supported (except a 32-bit host can't run a
>> 64-bit guest).
>>
>
> Again, what's the point? All cpus shipped by Intel and AMD that have
> hardware virtualization extensions also support the 64bit mode. Given
> that I don't see any point for supporting a 32bit host.
>

Existing installations?

Dropping 32-bit host support would certainly kill a lot of #ifdefs and
reduce the amount of testing needed. It would also force me to upgrade
my home machine.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-22 18:36:10

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Sunday 22 October 2006 20:00, Avi Kivity wrote:
> Existing installations?
>
> Dropping 32-bit host support would certainly kill a lot of #ifdefs and
> reduce the amount of testing needed. It would also force me to upgrade
> my home machine.

Ok, but if you radically change the kernel<->user API, doesn't that mean
you have to upgrade in the same way? The 32 bit emulation mode in x86_64
is actually pretty complete, so it probably boils down to a kernel upgrade
for you, without having to touch any of the user space.

Arnd <><

2006-10-22 18:41:13

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Arnd Bergmann wrote:
> On Sunday 22 October 2006 20:00, Avi Kivity wrote:
>
>> Existing installations?
>>
>> Dropping 32-bit host support would certainly kill a lot of #ifdefs and
>> reduce the amount of testing needed. It would also force me to upgrade
>> my home machine.
>>
>
> Ok, but if you radically change the kernel<->user API, doesn't that mean
> you have to upgrade in the same way?

No, why? I'd just upgrade the userspace. Am I misunderstanding you?

> The 32 bit emulation mode in x86_64
> is actually pretty complete, so it probably boils down to a kernel upgrade
> for you, without having to touch any of the user space.
>

For me personally, I don't mind. I don't know about others.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-22 18:49:43

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Sunday 22 October 2006 20:41, Avi Kivity wrote:
> > Ok, but if you radically change the kernel<->user API, doesn't that mean
> > you have to upgrade in the same way?
>
> No, why? I'd just upgrade the userspace. Am I misunderstanding you?

If you change the kernel interface, you also have to change the kernel
itself, at least if you introduce new syscalls.

> > The 32 bit emulation mode in x86_64
> > is actually pretty complete, so it probably boils down to a kernel
> > upgrade for you, without having to touch any of the user space.
> >
>
> For me personally, I don't mind. I don't know about others.

I'd really love to see your code get into the mainline kernel,
but I'd consider 32 bit host support an unnecessary burden for
long-term maintenance. Maybe you could maintain the 32 bit version
out of tree as long as there is still interest? I would expect that
it becomes completely obsolete at the latest once it works out of
the box on x86_64 distros.

Arnd <><

2006-10-22 18:55:41

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Arnd Bergmann wrote:
> On Sunday 22 October 2006 20:41, Avi Kivity wrote:
>
>>> Ok, but if you radically change the kernel<->user API, doesn't that mean
>>> you have to upgrade in the same way?
>>>
>> No, why? I'd just upgrade the userspace. Am I misunderstanding you?
>>
>
> If you change the kernel interface, you also have to change the kernel
> itself, at least if you introduce new syscalls.
>
>

But I don't have to upgrade all my software to 64 bit [but 32-bit
emulation solves that].

Still, an upgrade to the next 32-bit kernel could be seen as less
threatening.


>>> The 32 bit emulation mode in x86_64
>>> is actually pretty complete, so it probably boils down to a kernel
>>> upgrade for you, without having to touch any of the user space.
>>>
>>>
>> For me personally, I don't mind. I don't know about others.
>>
>
> I'd really love to see your code get into the mainline kernel,
> but I'd consider 32 bit host support an unnecessary burden for
> long-term maintenance. Maybe you could maintain the 32 bit version
> out of tree as long as there is still interest? I would expect that
> the point where it works out of the box on x86_64 distros is, at
> the latest, when it becomes completely obsolete.
>

One of my motivations was to get testers who run 32-bit for historical
or flash plugin reasons.

If there is a consensus that it should be dropped, though, I'll drop
it. I certainly didn't have any fun getting it to run.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-22 19:55:53

by Alan

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Ar Sul, 2006-10-22 am 10:37 +0200, ysgrifennodd Avi Kivity:
> I like this. Since we plan to support multiple vcpus per vm, the fs
> structure might look like:

Three times the syscall overhead is bad for an emulation, and very bad
for an emulation of a CPU whose virtualisation is half baked.

> It's certainly a lot more code though, and requires new syscalls. Since
> this is a little esoteric does it warrant new syscalls?

I think not - ioctl exists to avoid adding a billion esoteric one-user
syscalls. The idea of a VFS/sysfs-type view of the running vm is great
for tools, however, so I wouldn't throw it out entirely or see it as ioctl
versus fs.

2006-10-22 19:57:50

by Alan

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Ar Sul, 2006-10-22 am 18:51 +0200, ysgrifennodd Arnd Bergmann:
> What is the point of 32 bit hosts anyway? Isn't this only available
> on x86_64 type CPUs in the first place?

There are a small number of VT-capable 32-bit-only processors.

2006-10-22 19:59:54

by Alan

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Ar Sul, 2006-10-22 am 18:56 +0100, ysgrifennodd Christoph Hellwig:
> Again, what's the point? All cpus shipped by Intel and AMD that have
> hardware virtualization extensions also support the 64bit mode. Given
> that, I don't see any point in supporting a 32bit host.

Really:

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
constant_tsc pni monitor vmx est tm2 xtpr
model name : Genuine Intel(R) CPU T2300 @ 1.66GHz

2006-10-22 20:45:49

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

> Again, what's the point? All cpus shipped by Intel and AMD that have
> hardware virtualization extensions also support the 64bit mode. Given
> that, I don't see any point in supporting a 32bit host.

Actually there are 32-bit only Intel CPUs with hardware virtualization --
in fact my laptop has one: "Core Duo processor Low Voltage L2400".

http://www.intel.com/products/processor_number/proc_info_table.pdf
shows quite a few models with virtualization but without EM64T.

- R.

2006-10-22 22:26:55

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Avi Kivity <[email protected]> writes:

> Dropping 32-bit host support would certainly kill a lot of #ifdefs and
> reduce the amount of testing needed.

Sounds like a good thing.

> It would also force me to upgrade my home machine.

Why? AFAIK there are no VT machines that don't support EM64T.

If you mean you have 32bit userland you can certainly use a 64bit kernel
with 32bit userland.

-Andi

2006-10-22 22:29:03

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Alan Cox <[email protected]> writes:

> Ar Sul, 2006-10-22 am 18:51 +0200, ysgrifennodd Arnd Bergmann:
> > What is the point of 32 bit hosts anyway? Isn't this only available
> > on x86_64 type CPUs in the first place?
>
> There are a small number of VT-capable 32-bit-only processors.

Ah, you're right. I forgot about the Yonahs. The number is probably
not even that small (when Intel ships something x86 they tend to
do it in millions).

-Andi

2006-10-23 00:27:49

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

> Ah, you're right. I forgot about the Yonahs. The number is probably
> not even that small (when Intel ships something x86 they tend to
> do it in millions).

Right, it's quite a mainstream CPU -- for example every current
Thinkpad has one I think. And lots of kernel hackers tend to care
about making things work on a Thinkpad (except for akpm and his
precious vaio :).

- R.

2006-10-23 00:29:38

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Christoph Hellwig wrote:
> On Sun, Oct 22, 2006 at 07:01:29PM +0200, Avi Kivity wrote:
>
>>> What is the point of 32 bit hosts anyway? Isn't this only available
>>> on x86_64 type CPUs in the first place?
>>>
>>>
>> No, 32-bit hosts are fully supported (except a 32-bit host can't run a
>> 64-bit guest).
>>
>
> Again, what's the point? All cpus shipped by Intel and AMD that have
> hardware virtualization extensions also support the 64bit mode. Given
> that, I don't see any point in supporting a 32bit host.
>

I believe that the Intel Core Duos only support 32-bit mode and are
VT-enabled.

Regards,

Anthony Liguori


2006-10-23 00:39:54

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

On Monday 23 October 2006 02:27, Roland Dreier wrote:
> > Ah you're right. I forgot about the Yonahs. The number is probably
> > not even that small (when Intel ships something x86 they tend to
> > do it in millions)
>
> Right, it's quite a mainstream CPU -- for example every current
> Thinkpad has one I think.

The question is whether they all enable VT in the BIOS though. A lot of
systems don't, and without BIOS support it doesn't work.

-Andi

2006-10-23 00:52:04

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

> The question is if they all enable VT in the BIOS though. A lot of
> systems don't and without BIOS support it doesn't work.

Seems to be there on my X60s -- /proc/cpuinfo has:

model name : Genuine Intel(R) CPU L2400 @ 1.66GHz

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc pni monitor vmx est tm2 xtpr

(I haven't tried it yet but I assume the vmx flag means it's enabled)

This system is (from dmidecode):

System Information
Manufacturer: LENOVO
Product Name: 1702AT3
Version: ThinkPad X60s

- R.

2006-10-23 07:42:24

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Alan Cox wrote:
> Ar Sul, 2006-10-22 am 10:37 +0200, ysgrifennodd Avi Kivity:
>
>> I like this. Since we plan to support multiple vcpus per vm, the fs
>> structure might look like:
>>
>
> Three times the syscall overhead is bad for an emulation, and very bad

Why? You would usually just call kvm_run(). get/set regs are not needed
normally.

> for an
> emulation of a CPU whose virtualisation is half baked.
>
>

Blood rare. The thing can't even virtualize the first instruction executed.

>> It's certainly a lot more code though, and requires new syscalls. Since
>> this is a little esoteric does it warrant new syscalls?
>>
>
> I think not - ioctl exists to avoid adding a billion esoteric one-user
> syscalls. The idea of a VFS/sysfs-type view of the running vm is great
> for tools, however, so I wouldn't throw it out entirely or see it as ioctl
> versus fs.
>

I still want a separate object per vcpu:


kvm_fd = open("/dev/kvm", O_RDWR);
for (i = 0; i < n; ++i)
        vcpu_fds[i] = ioctl(kvm_fd, KVM_CREATE_VCPU, i);

so the refcounting doesn't bounce cachelines too much. In effect it's a
mini filesystem.

--
error compiling committee.c: too many arguments to function

2006-10-23 22:29:09

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Andi Kleen wrote:
> Why? AFAIK there are no VT machines that don't support EM64T.
>
Core Duo has VT but no 64-bit.

J

2006-10-24 21:38:53

by Andy Isaacson

[permalink] [raw]
Subject: kvm_create() (was Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine)

I don't have much clue what the context of this is, but one chunk caught
my eye:

On Sat, Oct 21, 2006 at 06:16:27PM +0200, Arnd Bergmann wrote:
> Your example above could translate to something like:
>
> int kvm_fd = kvm_create("/kvm/my_vcpu")
> int mem_fd = openat(kvm_fd, "mem", O_RDWR);

Based just on this snippet, it seems to me that kvm_create() could be
simply:
open("/kvm/my_vcpu", O_CREAT | O_EXCL | O_DIRECTORY, 0777);

(Which currently seems to silently mask out O_DIRECTORY, but which, it
seems to me, should be a synonym for mkdir().)

-andy

2006-10-28 04:52:46

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 0/7] KVM: Kernel-based Virtual Machine

Hi!

> > >What is the point of 32 bit hosts anyway? Isn't this only available
> > >on x86_64 type CPUs in the first place?
> > >
> >
> > No, 32-bit hosts are fully supported (except a 32-bit host can't run a
> > 64-bit guest).
>
> Again, what's the point? All cpus shipped by Intel and AMD that have
> hardware virtualization extensions also support the 64bit mode. Given
> that, I don't see any point in supporting a 32bit host.

If you have 1GB of ram, having a 64-bit capable cpu does not mean you want
to run a 64-bit kernel. Pointers are twice as big, etc... And if your
shiny new cpu fails, you can't put the hdd back in a good old working
machine.

(IOW I see reasons. Not sure if they are big enough...)
Pavel
--
Thanks for all the (sleeping) penguins.