2012-05-16 07:52:00

by Zhang Yanfei

Subject: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

This patch set exports offsets of VMCS fields as note information for
kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve the
runtime state of a guest machine image, such as registers, from the host
machine's crash dump, where it is stored in VMCS format. The problem is
that the internal layout of the VMCS is not disclosed in Intel's
specification. So we solve this problem by reverse engineering, which is
implemented in this patch set. The VMCSINFO is exported via sysfs to
kexec-tools just like VMCOREINFO.

Here are two use cases for the two features that we want.

1) Create guest machine's crash dumpfile from host machine's crash dumpfile

In general, we want to use this feature for failure analysis of systems
whose processing depends on communication between host and guest
machines, so that we can look into the system from both machines'
viewpoints.

As a concrete situation, consider a heartbeat monitoring feature on the
guest machine's side, where we need to determine on which machine's side
the cause of a heartbeat stop lies. In our actual experiments, we
encountered such a situation and found that the cause of the bug was in
the host's process scheduler: the guest machine's vcpu stopped for a
long time, which then led to the heartbeat stop.

The module that judges a heartbeat stop is on the guest machine, so we
need to debug the guest machine's data. But if the cause lies on the
host machine's side, we need to look into the host machine's crash dump.

Without this feature, we first create the guest machine's dump and then
create the host machine's, but there is a short interval between the two
operations, and it is unlikely that the buggy situation persists across
it.

So we think the feature is useful for debugging the guest machine's and
the host machine's sides at the same time, and we expect it to make
failure analysis more efficient.

Of course, we believe this feature is generally useful in situations
where a guest machine doesn't work well because of something on the
host machine.

2) Get offsets of VMCS information on the CPU running on the host machine

If kdump doesn't work well, we cannot use the KVM API to get the
register values of the guest machine, and they are still left in its
VMCS region. In that case, we use a crash dump mechanism running outside
of the Linux kernel, such as sadump, a firmware-based crash dump. VMCS
information is then necessary.

TODO:
1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information
into vmcore.
2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process
core file. To do this, we will modify kernel core dumper, gdb gcore
and crash gcore.
3. Dump guest image from the qemu-process core file into a vmcore.

Changelog for v1 to v2:
1. The VMCSINFO now has a simple binary <field><encoded offset> format,
as below:
+-------------+--------------------------+
| Byte offset | Contents                 |
+-------------+--------------------------+
| 0           | VMCS revision identifier |
+-------------+--------------------------+
| 4           | <field><encoded offset>  |
+-------------+--------------------------+
| 16          | <field><encoded offset>  |
+-------------+--------------------------+
......

The first 32 bits of VMCSINFO contain the VMCS revision identifier.
The remainder of VMCSINFO is used for <field><encoded offset> sets.
Each set takes 12 bytes: the field occupies 4 bytes and its
corresponding encoded offset occupies 8 bytes.

Encoded offsets are raw values read by vmcs_read{16, 64, 32, l}, and
they are all zero-extended to 8 bytes so that each
<field><encoded offset> set has the same size.
We do not decode offsets here. The decoding work is deferred to
userspace tools for more flexible handling.

And here are two examples of the new VMCSINFO:
Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
VMCSINFO contains:
<0000000d> --> VMCS revision id = 0xd
<00004000><0000000001840180> --> OFFSET(PIN_BASED_VM_EXEC_CONTROL) = 0x01840180
<00004002><0000000001940190> --> OFFSET(CPU_BASED_VM_EXEC_CONTROL) = 0x01940190
<0000401e><000000000fe40fe0> --> OFFSET(SECONDARY_VM_EXEC_CONTROL) = 0x0fe40fe0
<0000400c><0000000001e401e0> --> OFFSET(VM_EXIT_CONTROLS) = 0x01e401e0
......

Processor: Intel(R) Xeon(R) CPU E7540 @ 2.00GHz (24 cores)
VMCSINFO contains:
<0000000e> --> VMCS revision id = 0xe
<00004000><0000000005540550> --> OFFSET(PIN_BASED_VM_EXEC_CONTROL) = 0x05540550
<00004002><0000000005440540> --> OFFSET(CPU_BASED_VM_EXEC_CONTROL) = 0x05440540
<0000401e><00000000054c0548> --> OFFSET(SECONDARY_VM_EXEC_CONTROL) = 0x054c0548
<0000400c><00000000057c0578> --> OFFSET(VM_EXIT_CONTROLS) = 0x057c0578
......

2. Add a new kernel module *vmcsinfo-intel* for filling VMCSINFO instead
of putting it in module kvm-intel. The new module is auto-loaded
when the vmx cpufeature is detected and it depends on module kvm-intel.
*Loading and unloading this module will have no side effect on the
running guests.*
3. The sysfs file vmcsinfo is split into 2 files:
/sys/kernel/vmcsinfo: shows physical address of VMCSINFO note information.
/sys/kernel/vmcsinfo_maxsize: shows max size of VMCSINFO.
4. A new Documentation/ABI entry is added for vmcsinfo and vmcsinfo_maxsize.
5. Do not update VMCSINFO note when the kernel is panicked.
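
As an illustration of the layout in changelog item 1, the following is
a minimal userspace sketch of how a tool might walk a VMCSINFO blob. It
is not part of this patch set: the names vmcsinfo_entry,
vmcsinfo_revision, vmcsinfo_entry_count and vmcsinfo_entry_at are
hypothetical, and the sketch assumes the reader has the same byte order
as the machine that produced the blob.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One <field><encoded offset> set: 4-byte field plus 8-byte raw value. */
struct vmcsinfo_entry {
	uint32_t field;
	uint64_t encoded_offset;
};

#define VMCSINFO_ID_BYTES    4u   /* VMCS revision identifier */
#define VMCSINFO_ENTRY_BYTES 12u  /* 4-byte field + 8-byte value */

/*
 * Hypothetical helper: read the revision identifier at byte offset 0.
 * Returns 0 on success, -1 if the blob is too short.
 */
static int vmcsinfo_revision(const unsigned char *buf, size_t len,
			     uint32_t *id)
{
	if (len < VMCSINFO_ID_BYTES)
		return -1;
	memcpy(id, buf, sizeof(*id));
	return 0;
}

/* Number of complete 12-byte sets that follow the revision identifier. */
static size_t vmcsinfo_entry_count(size_t len)
{
	if (len < VMCSINFO_ID_BYTES)
		return 0;
	return (len - VMCSINFO_ID_BYTES) / VMCSINFO_ENTRY_BYTES;
}

/*
 * Copy out the i-th set. memcpy is used because the sets start at byte
 * offset 4 and are therefore not naturally aligned for a 64-bit load.
 */
static int vmcsinfo_entry_at(const unsigned char *buf, size_t len,
			     size_t i, struct vmcsinfo_entry *out)
{
	size_t pos;

	if (i >= vmcsinfo_entry_count(len))
		return -1;
	pos = VMCSINFO_ID_BYTES + i * VMCSINFO_ENTRY_BYTES;
	memcpy(&out->field, buf + pos, sizeof(out->field));
	memcpy(&out->encoded_offset, buf + pos + sizeof(out->field),
	       sizeof(out->encoded_offset));
	return 0;
}
```

With the Core 2 Duo example above, the set at index 0 would carry
field 0x4000 and raw value 0x01840180.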

zhangyanfei (5):
x86: Add helper variables and functions to hold VMCSINFO
KVM: Export symbols for module vmcsinfo-intel
KVM-INTEL: Add new module vmcsinfo-intel to fill VMCSINFO
ksysfs: Export VMCSINFO via sysfs
Documentation: Add ABI entry for sysfs file vmcsinfo and
vmcsinfo_maxsize

Documentation/ABI/testing/sysfs-kernel-vmcsinfo | 16 +
arch/x86/include/asm/vmcsinfo.h | 34 ++
arch/x86/include/asm/vmx.h | 133 ++++++++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/vmcsinfo.c | 79 +++++
arch/x86/kvm/Kconfig | 11 +
arch/x86/kvm/Makefile | 3 +
arch/x86/kvm/vmcsinfo.c | 402 +++++++++++++++++++++++
arch/x86/kvm/vmx.c | 151 ++-------
include/linux/kvm_host.h | 3 +
kernel/ksysfs.c | 29 ++
virt/kvm/kvm_main.c | 8 +-
12 files changed, 740 insertions(+), 131 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-vmcsinfo
create mode 100644 arch/x86/include/asm/vmcsinfo.h
create mode 100644 arch/x86/kernel/vmcsinfo.c
create mode 100644 arch/x86/kvm/vmcsinfo.c


2012-05-16 07:54:13

by Zhang Yanfei

Subject: [PATCH v2 1/5] x86: Add helper variables and functions to hold VMCSINFO

This patch provides a set of variables to hold the VMCSINFO and also
some helper functions to help fill the VMCSINFO.

Signed-off-by: zhangyanfei <[email protected]>
---
arch/x86/include/asm/vmcsinfo.h | 34 +++++++++++++++++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/vmcsinfo.c | 79 +++++++++++++++++++++++++++++++++++++++
3 files changed, 115 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/vmcsinfo.h
create mode 100644 arch/x86/kernel/vmcsinfo.c

diff --git a/arch/x86/include/asm/vmcsinfo.h b/arch/x86/include/asm/vmcsinfo.h
new file mode 100644
index 0000000..1ca140b
--- /dev/null
+++ b/arch/x86/include/asm/vmcsinfo.h
@@ -0,0 +1,34 @@
+#ifndef _ASM_X86_VMCSINFO_H
+#define _ASM_X86_VMCSINFO_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+#include <linux/elf.h>
+
+/*
+ * Currently, 1 page is enough for vmcsinfo.
+ */
+#define VMCSINFO_BYTES (4096)
+#define VMCSINFO_NOTE_NAME "VMCSINFO"
+#define VMCSINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCSINFO_NOTE_NAME), 4)
+#define VMCSINFO_NOTE_HEAD_BYTES ALIGN(sizeof(struct elf_note), 4)
+#define VMCSINFO_NOTE_SIZE (VMCSINFO_NOTE_HEAD_BYTES*2 \
+ + VMCSINFO_BYTES \
+ + VMCSINFO_NOTE_NAME_BYTES)
+
+extern size_t vmcsinfo_size;
+extern size_t vmcsinfo_max_size;
+extern unsigned char vmcsinfo_data[VMCSINFO_BYTES];
+
+extern void update_vmcsinfo_note(void);
+extern void vmcsinfo_append_id(u32);
+extern void vmcsinfo_append_field(u32, u64);
+extern unsigned long paddr_vmcsinfo_note(void);
+
+#define VMCSINFO_REVISION_ID(id) \
+ vmcsinfo_append_id(id)
+#define VMCSINFO_FIELD(field, offset) \
+ vmcsinfo_append_field(field, offset)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _ASM_X86_VMCSINFO_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 532d2e0..63edf33 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -102,6 +102,8 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o
obj-$(CONFIG_OF) += devicetree.o

+obj-y += vmcsinfo.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/vmcsinfo.c b/arch/x86/kernel/vmcsinfo.c
new file mode 100644
index 0000000..8d0ab3f
--- /dev/null
+++ b/arch/x86/kernel/vmcsinfo.c
@@ -0,0 +1,79 @@
+/*
+ * Architecture specific (i386/x86_64) functions for storing vmcs
+ * field information.
+ *
+ * Created by: zhangyanfei ([email protected])
+ *
+ * Copyright (C) Fujitsu Corporation, 2012. All rights reserved.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <asm/vmcsinfo.h>
+#include <linux/module.h>
+#include <linux/elf.h>
+
+unsigned char vmcsinfo_data[VMCSINFO_BYTES];
+static u32 vmcsinfo_note[VMCSINFO_NOTE_SIZE/4];
+size_t vmcsinfo_max_size = sizeof(vmcsinfo_data);
+size_t vmcsinfo_size;
+EXPORT_SYMBOL_GPL(vmcsinfo_size);
+
+void update_vmcsinfo_note(void)
+{
+ u32 *buf = vmcsinfo_note;
+ struct elf_note note;
+
+ if (!vmcsinfo_size)
+ return;
+
+ note.n_namesz = strlen(VMCSINFO_NOTE_NAME) + 1;
+ note.n_descsz = vmcsinfo_size;
+ note.n_type = 0;
+ memcpy(buf, &note, sizeof(note));
+ buf += (sizeof(note) + 3)/4;
+ memcpy(buf, VMCSINFO_NOTE_NAME, note.n_namesz);
+ buf += (note.n_namesz + 3)/4;
+ memcpy(buf, vmcsinfo_data, note.n_descsz);
+ buf += (note.n_descsz + 3)/4;
+
+ note.n_namesz = 0;
+ note.n_descsz = 0;
+ note.n_type = 0;
+ memcpy(buf, &note, sizeof(note));
+}
+EXPORT_SYMBOL_GPL(update_vmcsinfo_note);
+
+void vmcsinfo_append_id(u32 id)
+{
+ size_t r;
+
+ r = sizeof(id);
+ if (r + vmcsinfo_size > vmcsinfo_max_size)
+ return;
+
+ memcpy(&vmcsinfo_data[vmcsinfo_size], &id, r);
+ vmcsinfo_size += r;
+}
+EXPORT_SYMBOL_GPL(vmcsinfo_append_id);
+
+void vmcsinfo_append_field(u32 field, u64 offset)
+{
+ size_t r;
+
+ r = sizeof(field) + sizeof(offset);
+ if (r + vmcsinfo_size > vmcsinfo_max_size)
+ return;
+
+ memcpy(&vmcsinfo_data[vmcsinfo_size], &field, sizeof(field));
+ vmcsinfo_size += sizeof(field);
+ memcpy(&vmcsinfo_data[vmcsinfo_size], &offset, sizeof(offset));
+ vmcsinfo_size += sizeof(offset);
+}
+EXPORT_SYMBOL_GPL(vmcsinfo_append_field);
+
+unsigned long paddr_vmcsinfo_note(void)
+{
+ return __pa((unsigned long)(char *)&vmcsinfo_note);
+}
--
1.7.1
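
For readers following the buffer arithmetic in update_vmcsinfo_note(),
here is a userspace sketch of the same layout: header, name and
descriptor, each padded to 4 bytes, then an all-zero terminating
header. struct note_hdr is a stand-in for the kernel's struct elf_note,
assumed here to be three 32-bit words; pad4 and build_note are
hypothetical names, not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct elf_note (assumption: three 32-bit words). */
struct note_hdr {
	uint32_t n_namesz;
	uint32_t n_descsz;
	uint32_t n_type;
};

/* Round a byte count up to a multiple of 4, like the (x + 3)/4 steps. */
static size_t pad4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/*
 * Hypothetical helper: write one note followed by an all-zero
 * terminating header into out. Returns the number of bytes used, or 0
 * if out is too small.
 */
static size_t build_note(unsigned char *out, size_t cap, const char *name,
			 const void *desc, size_t descsz)
{
	struct note_hdr hdr;
	size_t namesz = strlen(name) + 1;   /* name includes its NUL */
	size_t total = sizeof(hdr) + pad4(namesz) + pad4(descsz) +
		       sizeof(hdr);
	size_t pos = 0;

	if (total > cap)
		return 0;
	memset(out, 0, total);              /* zero padding and terminator */

	hdr.n_namesz = (uint32_t)namesz;
	hdr.n_descsz = (uint32_t)descsz;
	hdr.n_type = 0;
	memcpy(out + pos, &hdr, sizeof(hdr));
	pos += sizeof(hdr);
	memcpy(out + pos, name, namesz);
	pos += pad4(namesz);
	memcpy(out + pos, desc, descsz);
	pos += pad4(descsz);
	pos += sizeof(hdr);                 /* terminating empty header */
	return pos;
}
```

For the name "VMCSINFO" (9 bytes with its NUL, padded to 12) and a
4-byte descriptor, the result occupies 12 + 12 + 4 + 12 = 40 bytes under
that assumption. The same sum with the full 4096-byte payload gives
4132, which is what VMCSINFO_NOTE_SIZE evaluates to if struct elf_note
is 12 bytes.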

2012-05-16 07:55:30

by Zhang Yanfei

Subject: [PATCH v2 2/5] KVM: Export symbols for module vmcsinfo-intel

A new module named vmcsinfo-intel is used to fill VMCSINFO. This
module depends on the kvm-intel and kvm modules, so we export the
symbols of those two modules that vmcsinfo-intel needs.

Signed-off-by: zhangyanfei <[email protected]>
---
arch/x86/include/asm/vmx.h | 133 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx.c | 151 +++++++-------------------------------------
include/linux/kvm_host.h | 3 +
virt/kvm/kvm_main.c | 8 +-
4 files changed, 164 insertions(+), 131 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 31f180c..f5b7134 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -27,6 +27,8 @@

#include <linux/types.h>

+#include <asm/kvm_host.h>
+
/*
* Definitions of Primary Processor-Based VM-Execution Controls.
*/
@@ -481,4 +483,135 @@ enum vm_instruction_error_number {
VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
};

+#define __ex(x) __kvm_handle_fault_on_reboot(x)
+#define __ex_clear(x, reg) \
+ ____kvm_handle_fault_on_reboot(x, "xor " reg " , " reg)
+
+struct vmcs {
+ u32 revision_id;
+ u32 abort;
+ char data[0];
+};
+
+struct vmcs_config {
+ int size;
+ int order;
+ u32 revision_id;
+ u32 pin_based_exec_ctrl;
+ u32 cpu_based_exec_ctrl;
+ u32 cpu_based_2nd_exec_ctrl;
+ u32 vmexit_ctrl;
+ u32 vmentry_ctrl;
+};
+
+extern struct vmcs_config vmcs_config;
+
+DECLARE_PER_CPU(struct vmcs *, vmxarea);
+DECLARE_PER_CPU(struct vmcs *, current_vmcs);
+
+struct vmcs *alloc_vmcs(void);
+void kvm_cpu_vmxon(u64);
+void kvm_cpu_vmxoff(void);
+void vmcs_load(struct vmcs *);
+void vmcs_write_control_field(unsigned long, u32);
+void vmcs_clear(struct vmcs *);
+void free_vmcs(struct vmcs *);
+
+static __always_inline unsigned long vmcs_readl(unsigned long field)
+{
+ unsigned long value;
+
+ asm volatile (__ex_clear(ASM_VMX_VMREAD_RDX_RAX, "%0")
+ : "=a"(value) : "d"(field) : "cc");
+ return value;
+}
+
+static __always_inline u16 vmcs_read16(unsigned long field)
+{
+ return vmcs_readl(field);
+}
+
+static __always_inline u32 vmcs_read32(unsigned long field)
+{
+ return vmcs_readl(field);
+}
+
+static __always_inline u64 vmcs_read64(unsigned long field)
+{
+#ifdef CONFIG_X86_64
+ return vmcs_readl(field);
+#else
+ return vmcs_readl(field) | ((u64)vmcs_readl(field+1) << 32);
+#endif
+}
+
+static inline bool cpu_has_vmx_msr_bitmap(void)
+{
+ return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS;
+}
+
+static inline bool cpu_has_vmx_tpr_shadow(void)
+{
+ return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW;
+}
+
+static inline bool cpu_has_secondary_exec_ctrls(void)
+{
+ return vmcs_config.cpu_based_exec_ctrl &
+ CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+}
+
+static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
+{
+ return vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+}
+
+static inline bool cpu_has_vmx_flexpriority(void)
+{
+ return cpu_has_vmx_tpr_shadow() &&
+ cpu_has_vmx_virtualize_apic_accesses();
+}
+
+static inline bool cpu_has_vmx_ept(void)
+{
+ return vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_ENABLE_EPT;
+}
+
+static inline bool cpu_has_vmx_unrestricted_guest(void)
+{
+ return vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_UNRESTRICTED_GUEST;
+}
+
+static inline bool cpu_has_vmx_ple(void)
+{
+ return vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+}
+
+static inline bool cpu_has_vmx_vpid(void)
+{
+ return vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_ENABLE_VPID;
+}
+
+static inline bool cpu_has_vmx_rdtscp(void)
+{
+ return vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_RDTSCP;
+}
+
+static inline bool cpu_has_virtual_nmis(void)
+{
+ return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS;
+}
+
+static inline bool cpu_has_vmx_wbinvd_exit(void)
+{
+ return vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_WBINVD_EXITING;
+}
+
#endif
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ad85adf..3391c92 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -44,10 +44,6 @@

#include "trace.h"

-#define __ex(x) __kvm_handle_fault_on_reboot(x)
-#define __ex_clear(x, reg) \
- ____kvm_handle_fault_on_reboot(x, "xor " reg " , " reg)
-
MODULE_AUTHOR("Qumranet");
MODULE_LICENSE("GPL");

@@ -120,12 +116,6 @@ module_param(ple_window, int, S_IRUGO);
#define NR_AUTOLOAD_MSRS 8
#define VMCS02_POOL_SIZE 1

-struct vmcs {
- u32 revision_id;
- u32 abort;
- char data[0];
-};
-
/*
* Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
* remember whether it was VMLAUNCHed, and maintain a linked list of all VMCSs
@@ -601,13 +591,13 @@ static void nested_release_page_clean(struct page *page)
}

static u64 construct_eptp(unsigned long root_hpa);
-static void kvm_cpu_vmxon(u64 addr);
-static void kvm_cpu_vmxoff(void);
static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);

-static DEFINE_PER_CPU(struct vmcs *, vmxarea);
-static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
+DEFINE_PER_CPU(struct vmcs *, vmxarea);
+EXPORT_PER_CPU_SYMBOL_GPL(vmxarea);
+DEFINE_PER_CPU(struct vmcs *, current_vmcs);
+EXPORT_PER_CPU_SYMBOL_GPL(current_vmcs);
/*
* We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is needed
* when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded on it.
@@ -626,16 +616,8 @@ static bool cpu_has_load_perf_global_ctrl;
static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS);
static DEFINE_SPINLOCK(vmx_vpid_lock);

-static struct vmcs_config {
- int size;
- int order;
- u32 revision_id;
- u32 pin_based_exec_ctrl;
- u32 cpu_based_exec_ctrl;
- u32 cpu_based_2nd_exec_ctrl;
- u32 vmexit_ctrl;
- u32 vmentry_ctrl;
-} vmcs_config;
+struct vmcs_config vmcs_config;
+EXPORT_SYMBOL_GPL(vmcs_config);

static struct vmx_capability {
u32 ept;
@@ -716,39 +698,11 @@ static inline bool is_machine_check(u32 intr_info)
(INTR_TYPE_HARD_EXCEPTION | MC_VECTOR | INTR_INFO_VALID_MASK);
}

-static inline bool cpu_has_vmx_msr_bitmap(void)
-{
- return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS;
-}
-
-static inline bool cpu_has_vmx_tpr_shadow(void)
-{
- return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW;
-}
-
static inline bool vm_need_tpr_shadow(struct kvm *kvm)
{
return (cpu_has_vmx_tpr_shadow()) && (irqchip_in_kernel(kvm));
}

-static inline bool cpu_has_secondary_exec_ctrls(void)
-{
- return vmcs_config.cpu_based_exec_ctrl &
- CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
-}
-
-static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
-{
- return vmcs_config.cpu_based_2nd_exec_ctrl &
- SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
-}
-
-static inline bool cpu_has_vmx_flexpriority(void)
-{
- return cpu_has_vmx_tpr_shadow() &&
- cpu_has_vmx_virtualize_apic_accesses();
-}
-
static inline bool cpu_has_vmx_ept_execute_only(void)
{
return vmx_capability.ept & VMX_EPT_EXECUTE_ONLY_BIT;
@@ -804,52 +758,11 @@ static inline bool cpu_has_vmx_invvpid_global(void)
return vmx_capability.vpid & VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT;
}

-static inline bool cpu_has_vmx_ept(void)
-{
- return vmcs_config.cpu_based_2nd_exec_ctrl &
- SECONDARY_EXEC_ENABLE_EPT;
-}
-
-static inline bool cpu_has_vmx_unrestricted_guest(void)
-{
- return vmcs_config.cpu_based_2nd_exec_ctrl &
- SECONDARY_EXEC_UNRESTRICTED_GUEST;
-}
-
-static inline bool cpu_has_vmx_ple(void)
-{
- return vmcs_config.cpu_based_2nd_exec_ctrl &
- SECONDARY_EXEC_PAUSE_LOOP_EXITING;
-}
-
static inline bool vm_need_virtualize_apic_accesses(struct kvm *kvm)
{
return flexpriority_enabled && irqchip_in_kernel(kvm);
}

-static inline bool cpu_has_vmx_vpid(void)
-{
- return vmcs_config.cpu_based_2nd_exec_ctrl &
- SECONDARY_EXEC_ENABLE_VPID;
-}
-
-static inline bool cpu_has_vmx_rdtscp(void)
-{
- return vmcs_config.cpu_based_2nd_exec_ctrl &
- SECONDARY_EXEC_RDTSCP;
-}
-
-static inline bool cpu_has_virtual_nmis(void)
-{
- return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS;
-}
-
-static inline bool cpu_has_vmx_wbinvd_exit(void)
-{
- return vmcs_config.cpu_based_2nd_exec_ctrl &
- SECONDARY_EXEC_WBINVD_EXITING;
-}
-
static inline bool report_flexpriority(void)
{
return flexpriority_enabled;
@@ -930,7 +843,7 @@ static struct shared_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr)
return NULL;
}

-static void vmcs_clear(struct vmcs *vmcs)
+void vmcs_clear(struct vmcs *vmcs)
{
u64 phys_addr = __pa(vmcs);
u8 error;
@@ -942,6 +855,7 @@ static void vmcs_clear(struct vmcs *vmcs)
printk(KERN_ERR "kvm: vmclear fail: %p/%llx\n",
vmcs, phys_addr);
}
+EXPORT_SYMBOL_GPL(vmcs_clear);

static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
{
@@ -950,7 +864,7 @@ static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
loaded_vmcs->launched = 0;
}

-static void vmcs_load(struct vmcs *vmcs)
+void vmcs_load(struct vmcs *vmcs)
{
u64 phys_addr = __pa(vmcs);
u8 error;
@@ -962,6 +876,7 @@ static void vmcs_load(struct vmcs *vmcs)
printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n",
vmcs, phys_addr);
}
+EXPORT_SYMBOL_GPL(vmcs_load);

static void __loaded_vmcs_clear(void *arg)
{
@@ -1033,34 +948,6 @@ static inline void ept_sync_individual_addr(u64 eptp, gpa_t gpa)
}
}

-static __always_inline unsigned long vmcs_readl(unsigned long field)
-{
- unsigned long value;
-
- asm volatile (__ex_clear(ASM_VMX_VMREAD_RDX_RAX, "%0")
- : "=a"(value) : "d"(field) : "cc");
- return value;
-}
-
-static __always_inline u16 vmcs_read16(unsigned long field)
-{
- return vmcs_readl(field);
-}
-
-static __always_inline u32 vmcs_read32(unsigned long field)
-{
- return vmcs_readl(field);
-}
-
-static __always_inline u64 vmcs_read64(unsigned long field)
-{
-#ifdef CONFIG_X86_64
- return vmcs_readl(field);
-#else
- return vmcs_readl(field) | ((u64)vmcs_readl(field+1) << 32);
-#endif
-}
-
static noinline void vmwrite_error(unsigned long field, unsigned long value)
{
printk(KERN_ERR "vmwrite error: reg %lx value %lx (err %d)\n",
@@ -1097,6 +984,12 @@ static void vmcs_write64(unsigned long field, u64 value)
#endif
}

+void vmcs_write_control_field(unsigned long field, u32 value)
+{
+ vmcs_writel(field, value);
+}
+EXPORT_SYMBOL_GPL(vmcs_write_control_field);
+
static void vmcs_clear_bits(unsigned long field, u32 mask)
{
vmcs_writel(field, vmcs_readl(field) & ~mask);
@@ -2282,12 +2175,13 @@ static __init int vmx_disabled_by_bios(void)
return 0;
}

-static void kvm_cpu_vmxon(u64 addr)
+void kvm_cpu_vmxon(u64 addr)
{
asm volatile (ASM_VMX_VMXON_RAX
: : "a"(&addr), "m"(addr)
: "memory", "cc");
}
+EXPORT_SYMBOL_GPL(kvm_cpu_vmxon);

static int hardware_enable(void *garbage)
{
@@ -2336,10 +2230,11 @@ static void vmclear_local_loaded_vmcss(void)
/* Just like cpu_vmxoff(), but with the __kvm_handle_fault_on_reboot()
* tricks.
*/
-static void kvm_cpu_vmxoff(void)
+void kvm_cpu_vmxoff(void)
{
asm volatile (__ex(ASM_VMX_VMXOFF) : : : "cc");
}
+EXPORT_SYMBOL_GPL(kvm_cpu_vmxoff);

static void hardware_disable(void *garbage)
{
@@ -2549,15 +2444,17 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
return vmcs;
}

-static struct vmcs *alloc_vmcs(void)
+struct vmcs *alloc_vmcs(void)
{
return alloc_vmcs_cpu(raw_smp_processor_id());
}
+EXPORT_SYMBOL_GPL(alloc_vmcs);

-static void free_vmcs(struct vmcs *vmcs)
+void free_vmcs(struct vmcs *vmcs)
{
free_pages((unsigned long)vmcs, vmcs_config.order);
}
+EXPORT_SYMBOL_GPL(free_vmcs);

/*
* Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 72cbf08..d76e2b0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -80,6 +80,9 @@ enum kvm_bus {
KVM_NR_BUSES
};

+int hardware_enable_all(void);
+void hardware_disable_all(void);
+
int kvm_io_bus_write(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
int len, const void *val);
int kvm_io_bus_read(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, int len,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9739b53..3130e76 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -90,8 +90,6 @@ static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl,
static long kvm_vcpu_compat_ioctl(struct file *file, unsigned int ioctl,
unsigned long arg);
#endif
-static int hardware_enable_all(void);
-static void hardware_disable_all(void);

static void kvm_io_bus_destroy(struct kvm_io_bus *bus);

@@ -2286,14 +2284,15 @@ static void hardware_disable_all_nolock(void)
on_each_cpu(hardware_disable_nolock, NULL, 1);
}

-static void hardware_disable_all(void)
+void hardware_disable_all(void)
{
raw_spin_lock(&kvm_lock);
hardware_disable_all_nolock();
raw_spin_unlock(&kvm_lock);
}
+EXPORT_SYMBOL_GPL(hardware_disable_all);

-static int hardware_enable_all(void)
+int hardware_enable_all(void)
{
int r = 0;

@@ -2314,6 +2313,7 @@ static int hardware_enable_all(void)

return r;
}
+EXPORT_SYMBOL_GPL(hardware_enable_all);

static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
void *v)
--
1.7.1

2012-05-16 07:56:40

by Zhang Yanfei

Subject: [PATCH v2 3/5] KVM-INTEL: Add new module vmcsinfo-intel to fill VMCSINFO

This patch implements a new module named vmcsinfo-intel. The
module fills VMCSINFO with the VMCS revision identifier,
and encoded offsets of VMCS fields.

Note that offsets of the following fields are not filled into VMCSINFO:
1. fields defined in the Intel specification (Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume
3C) but not defined in *vmcs_field*.
2. fields that do not exist because their corresponding control bits
are not set.
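
To make "encoded offsets" concrete, the sketch below repeats the
arithmetic of the ENCODING_OFFSET macro from arch/x86/kvm/vmcsinfo.c in
this patch and adds one possible inverse for userspace tools.
decode_offset and index_is_well_formed are hypothetical helpers, not
part of the patch; only the encoding itself comes from the patch.

```c
#include <stdint.h>

#define OFFSET_HIGH_SHIFT 7
#define OFFSET_LOW_MASK   ((1u << OFFSET_HIGH_SHIFT) - 1)         /* 0x7f */
#define OFFSET_HIGH_MASK  (OFFSET_LOW_MASK << OFFSET_HIGH_SHIFT)  /* 0x3f80 */

/* Same arithmetic as the patch's ENCODING_OFFSET macro. */
static uint16_t encode_offset(uint16_t offset)
{
	return (uint16_t)(((offset & OFFSET_LOW_MASK) << 1) +
			  (((offset & OFFSET_HIGH_MASK) << 2) | 0x100));
}

/*
 * Hypothetical inverse: drop the marker bits (bit 0 and bit 8) and
 * rejoin the low and high 7-bit halves into a 14-bit offset.
 */
static uint16_t decode_offset(uint16_t index)
{
	uint16_t low = (index >> 1) & OFFSET_LOW_MASK;
	uint16_t high = (index >> 9) & OFFSET_LOW_MASK;

	return (uint16_t)((high << OFFSET_HIGH_SHIFT) | low);
}

/*
 * Hypothetical sanity check: bit 8 must be 1 and bit 0 must be 0.
 * A byte-swapped index fails this test, which is the purpose of the
 * two marker bits described in the patch's comment.
 */
static int index_is_well_formed(uint16_t index)
{
	return (index & 0x0101) == 0x0100;
}
```

Applied to the two 16-bit halves of the Core 2 Duo value 0x01840180
from the cover letter, decode_offset gives 0x40 for 0x0180 and 0x42 for
0x0184.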

Signed-off-by: zhangyanfei <[email protected]>
---
arch/x86/kvm/Kconfig | 11 ++
arch/x86/kvm/Makefile | 3 +
arch/x86/kvm/vmcsinfo.c | 402 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 416 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kvm/vmcsinfo.c

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 1a7fe86..87df9d4 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -62,6 +62,17 @@ config KVM_INTEL
To compile this as a module, choose M here: the module
will be called kvm-intel.

+config VMCSINFO_INTEL
+ tristate "Export VMCSINFO for Intel processors"
+ depends on KVM_INTEL
+ ---help---
+ Provides support for exporting VMCSINFO on Intel processors equipped
+ with the VT extensions. The VMCSINFO contains a VMCS revision
+ identifier and offsets of VMCS fields.
+
+ To compile this as a module, choose M here: the module
+ will be called vmcsinfo-intel.
+
config KVM_AMD
tristate "KVM for AMD processors support"
depends on KVM
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 4f579e8..12a1ef6 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -4,6 +4,7 @@ ccflags-y += -Ivirt/kvm -Iarch/x86/kvm
CFLAGS_x86.o := -I.
CFLAGS_svm.o := -I.
CFLAGS_vmx.o := -I.
+CFLAGS_vmcsinfo.o := -I.

kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
coalesced_mmio.o irq_comm.o eventfd.o \
@@ -15,7 +16,9 @@ kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
i8254.o timer.o cpuid.o pmu.o
kvm-intel-y += vmx.o
kvm-amd-y += svm.o
+vmcsinfo-intel-y += vmcsinfo.o

obj-$(CONFIG_KVM) += kvm.o
obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
obj-$(CONFIG_KVM_AMD) += kvm-amd.o
+obj-$(CONFIG_VMCSINFO_INTEL) += vmcsinfo-intel.o
diff --git a/arch/x86/kvm/vmcsinfo.c b/arch/x86/kvm/vmcsinfo.c
new file mode 100644
index 0000000..288c445
--- /dev/null
+++ b/arch/x86/kvm/vmcsinfo.c
@@ -0,0 +1,402 @@
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * This module enables machines with Intel VT-x extensions to export
+ * offsets of VMCS fields for guest debugging.
+ *
+ * Copyright (C) 2012 Fujitsu, Inc.
+ *
+ * Authors:
+ * Zhang Yanfei <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/mod_devicetable.h>
+#include <linux/kernel.h>
+#include <linux/smp.h>
+#include <linux/tboot.h>
+#include <linux/kvm_host.h>
+
+#include <asm/vmx.h>
+#include <asm/special_insns.h>
+#include <asm/processor-flags.h>
+#include <asm/msr.h>
+#include <asm/msr-index.h>
+#include <asm/vmcsinfo.h>
+
+MODULE_AUTHOR("Fujitsu");
+MODULE_LICENSE("GPL");
+
+static const struct x86_cpu_id vmcsinfo_cpu_id[] = {
+ X86_FEATURE_MATCH(X86_FEATURE_VMX),
+ {}
+};
+MODULE_DEVICE_TABLE(x86cpu, vmcsinfo_cpu_id);
+
+/*
+ * For calculating offsets of fields in VMCS data, we index every 16-bit
+ * field by this kind of format:
+ * | --------- 16 bits ---------- |
+ * +-------------+-+------------+-+
+ * | high 7 bits |1| low 7 bits |0|
+ * +-------------+-+------------+-+
+ * In high byte, the lowest bit must be 1; In low byte, the lowest bit
+ * must be 0. The two bits are set like this in case indexes in VMCS
+ * data are read as big endian mode.
+ * The remaining 14 bits of the index indicate the real offset of the
+ * field. Because the size of a VMCS region is at most 4 KBytes, so
+ * 14 bits are enough to index the whole VMCS region.
+ *
+ * ENCODING_OFFSET: encode the offset into the index of this kind.
+ */
+#define OFFSET_HIGH_SHIFT (7)
+#define OFFSET_LOW_MASK ((1 << OFFSET_HIGH_SHIFT) - 1) /* 0x7f */
+#define OFFSET_HIGH_MASK (OFFSET_LOW_MASK << OFFSET_HIGH_SHIFT) /* 0x3f80 */
+#define ENCODING_OFFSET(offset) \
+ ((((offset) & OFFSET_LOW_MASK) << 1) + \
+ ((((offset) & OFFSET_HIGH_MASK) << 2) | 0x100))
+
+/*
+ * We separate these five control fields from other fields
+ * because some fields only exist on processors that support
+ * the 1-setting of control bits in the five control fields.
+ */
+static inline void append_control_field(void)
+{
+#define CONTROL_FIELD_OFFSET(field) \
+ VMCSINFO_FIELD(field, vmcs_read32(field))
+
+ CONTROL_FIELD_OFFSET(PIN_BASED_VM_EXEC_CONTROL);
+ CONTROL_FIELD_OFFSET(CPU_BASED_VM_EXEC_CONTROL);
+ if (cpu_has_secondary_exec_ctrls()) {
+ CONTROL_FIELD_OFFSET(SECONDARY_VM_EXEC_CONTROL);
+ }
+ CONTROL_FIELD_OFFSET(VM_EXIT_CONTROLS);
+ CONTROL_FIELD_OFFSET(VM_ENTRY_CONTROLS);
+}
+
+static inline void append_field16(void)
+{
+#define FIELD_OFFSET16(field) \
+ VMCSINFO_FIELD(field, vmcs_read16(field))
+
+ FIELD_OFFSET16(GUEST_ES_SELECTOR);
+ FIELD_OFFSET16(GUEST_CS_SELECTOR);
+ FIELD_OFFSET16(GUEST_SS_SELECTOR);
+ FIELD_OFFSET16(GUEST_DS_SELECTOR);
+ FIELD_OFFSET16(GUEST_FS_SELECTOR);
+ FIELD_OFFSET16(GUEST_GS_SELECTOR);
+ FIELD_OFFSET16(GUEST_LDTR_SELECTOR);
+ FIELD_OFFSET16(GUEST_TR_SELECTOR);
+ FIELD_OFFSET16(HOST_ES_SELECTOR);
+ FIELD_OFFSET16(HOST_CS_SELECTOR);
+ FIELD_OFFSET16(HOST_SS_SELECTOR);
+ FIELD_OFFSET16(HOST_DS_SELECTOR);
+ FIELD_OFFSET16(HOST_FS_SELECTOR);
+ FIELD_OFFSET16(HOST_GS_SELECTOR);
+ FIELD_OFFSET16(HOST_TR_SELECTOR);
+}
+
+static inline void append_field64(void)
+{
+#define FIELD_OFFSET64(field) \
+ VMCSINFO_FIELD(field, vmcs_read64(field))
+
+ FIELD_OFFSET64(IO_BITMAP_A);
+ FIELD_OFFSET64(IO_BITMAP_A_HIGH);
+ FIELD_OFFSET64(IO_BITMAP_B);
+ FIELD_OFFSET64(IO_BITMAP_B_HIGH);
+ FIELD_OFFSET64(VM_EXIT_MSR_STORE_ADDR);
+ FIELD_OFFSET64(VM_EXIT_MSR_STORE_ADDR_HIGH);
+ FIELD_OFFSET64(VM_EXIT_MSR_LOAD_ADDR);
+ FIELD_OFFSET64(VM_EXIT_MSR_LOAD_ADDR_HIGH);
+ FIELD_OFFSET64(VM_ENTRY_MSR_LOAD_ADDR);
+ FIELD_OFFSET64(VM_ENTRY_MSR_LOAD_ADDR_HIGH);
+ FIELD_OFFSET64(TSC_OFFSET);
+ FIELD_OFFSET64(TSC_OFFSET_HIGH);
+ FIELD_OFFSET64(VMCS_LINK_POINTER);
+ FIELD_OFFSET64(VMCS_LINK_POINTER_HIGH);
+ FIELD_OFFSET64(GUEST_IA32_DEBUGCTL);
+ FIELD_OFFSET64(GUEST_IA32_DEBUGCTL_HIGH);
+
+ if (cpu_has_vmx_msr_bitmap()) {
+ FIELD_OFFSET64(MSR_BITMAP);
+ FIELD_OFFSET64(MSR_BITMAP_HIGH);
+ }
+
+ if (cpu_has_vmx_tpr_shadow()) {
+ FIELD_OFFSET64(VIRTUAL_APIC_PAGE_ADDR);
+ FIELD_OFFSET64(VIRTUAL_APIC_PAGE_ADDR_HIGH);
+ }
+
+ if (cpu_has_secondary_exec_ctrls()) {
+ if (vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
+ FIELD_OFFSET64(APIC_ACCESS_ADDR);
+ FIELD_OFFSET64(APIC_ACCESS_ADDR_HIGH);
+ }
+ if (cpu_has_vmx_ept()) {
+ FIELD_OFFSET64(EPT_POINTER);
+ FIELD_OFFSET64(EPT_POINTER_HIGH);
+ FIELD_OFFSET64(GUEST_PHYSICAL_ADDRESS);
+ FIELD_OFFSET64(GUEST_PHYSICAL_ADDRESS_HIGH);
+ FIELD_OFFSET64(GUEST_PDPTR0);
+ FIELD_OFFSET64(GUEST_PDPTR0_HIGH);
+ FIELD_OFFSET64(GUEST_PDPTR1);
+ FIELD_OFFSET64(GUEST_PDPTR1_HIGH);
+ FIELD_OFFSET64(GUEST_PDPTR2);
+ FIELD_OFFSET64(GUEST_PDPTR2_HIGH);
+ FIELD_OFFSET64(GUEST_PDPTR3);
+ FIELD_OFFSET64(GUEST_PDPTR3_HIGH);
+ }
+ }
+
+ if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PAT || \
+ vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
+ FIELD_OFFSET64(GUEST_IA32_PAT);
+ FIELD_OFFSET64(GUEST_IA32_PAT_HIGH);
+ }
+
+ if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_EFER || \
+ vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_EFER) {
+ FIELD_OFFSET64(GUEST_IA32_EFER);
+ FIELD_OFFSET64(GUEST_IA32_EFER_HIGH);
+ }
+
+ if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
+ FIELD_OFFSET64(GUEST_IA32_PERF_GLOBAL_CTRL);
+ FIELD_OFFSET64(GUEST_IA32_PERF_GLOBAL_CTRL_HIGH);
+ }
+
+ if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT) {
+ FIELD_OFFSET64(HOST_IA32_PAT);
+ FIELD_OFFSET64(HOST_IA32_PAT_HIGH);
+ }
+
+ if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_EFER) {
+ FIELD_OFFSET64(HOST_IA32_EFER);
+ FIELD_OFFSET64(HOST_IA32_EFER_HIGH);
+ }
+
+ if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) {
+ FIELD_OFFSET64(HOST_IA32_PERF_GLOBAL_CTRL);
+ FIELD_OFFSET64(HOST_IA32_PERF_GLOBAL_CTRL_HIGH);
+ }
+}
+
+static inline void append_field32(void)
+{
+#define FIELD_OFFSET32(field) \
+ VMCSINFO_FIELD(field, vmcs_read32(field))
+
+ FIELD_OFFSET32(EXCEPTION_BITMAP);
+ FIELD_OFFSET32(PAGE_FAULT_ERROR_CODE_MASK);
+ FIELD_OFFSET32(PAGE_FAULT_ERROR_CODE_MATCH);
+ FIELD_OFFSET32(CR3_TARGET_COUNT);
+ FIELD_OFFSET32(VM_EXIT_MSR_STORE_COUNT);
+ FIELD_OFFSET32(VM_EXIT_MSR_LOAD_COUNT);
+ FIELD_OFFSET32(VM_ENTRY_MSR_LOAD_COUNT);
+ FIELD_OFFSET32(VM_ENTRY_INTR_INFO_FIELD);
+ FIELD_OFFSET32(VM_ENTRY_EXCEPTION_ERROR_CODE);
+ FIELD_OFFSET32(VM_ENTRY_INSTRUCTION_LEN);
+ FIELD_OFFSET32(VM_INSTRUCTION_ERROR);
+ FIELD_OFFSET32(VM_EXIT_REASON);
+ FIELD_OFFSET32(VM_EXIT_INTR_INFO);
+ FIELD_OFFSET32(VM_EXIT_INTR_ERROR_CODE);
+ FIELD_OFFSET32(IDT_VECTORING_INFO_FIELD);
+ FIELD_OFFSET32(IDT_VECTORING_ERROR_CODE);
+ FIELD_OFFSET32(VM_EXIT_INSTRUCTION_LEN);
+ FIELD_OFFSET32(VMX_INSTRUCTION_INFO);
+ FIELD_OFFSET32(GUEST_ES_LIMIT);
+ FIELD_OFFSET32(GUEST_CS_LIMIT);
+ FIELD_OFFSET32(GUEST_SS_LIMIT);
+ FIELD_OFFSET32(GUEST_DS_LIMIT);
+ FIELD_OFFSET32(GUEST_FS_LIMIT);
+ FIELD_OFFSET32(GUEST_GS_LIMIT);
+ FIELD_OFFSET32(GUEST_LDTR_LIMIT);
+ FIELD_OFFSET32(GUEST_TR_LIMIT);
+ FIELD_OFFSET32(GUEST_GDTR_LIMIT);
+ FIELD_OFFSET32(GUEST_IDTR_LIMIT);
+ FIELD_OFFSET32(GUEST_ES_AR_BYTES);
+ FIELD_OFFSET32(GUEST_CS_AR_BYTES);
+ FIELD_OFFSET32(GUEST_SS_AR_BYTES);
+ FIELD_OFFSET32(GUEST_DS_AR_BYTES);
+ FIELD_OFFSET32(GUEST_FS_AR_BYTES);
+ FIELD_OFFSET32(GUEST_GS_AR_BYTES);
+ FIELD_OFFSET32(GUEST_LDTR_AR_BYTES);
+ FIELD_OFFSET32(GUEST_TR_AR_BYTES);
+ FIELD_OFFSET32(GUEST_INTERRUPTIBILITY_INFO);
+ FIELD_OFFSET32(GUEST_ACTIVITY_STATE);
+ FIELD_OFFSET32(GUEST_SYSENTER_CS);
+ FIELD_OFFSET32(HOST_IA32_SYSENTER_CS);
+
+ if (cpu_has_vmx_tpr_shadow()) {
+ FIELD_OFFSET32(TPR_THRESHOLD);
+ }
+ if (cpu_has_secondary_exec_ctrls()) {
+ if (cpu_has_vmx_ple()) {
+ FIELD_OFFSET32(PLE_GAP);
+ FIELD_OFFSET32(PLE_WINDOW);
+ }
+ }
+}
+
+static inline void append_field(void)
+{
+#define FIELD_OFFSET(field) \
+ VMCSINFO_FIELD(field, vmcs_readl(field))
+
+ FIELD_OFFSET(CR0_GUEST_HOST_MASK);
+ FIELD_OFFSET(CR4_GUEST_HOST_MASK);
+ FIELD_OFFSET(CR0_READ_SHADOW);
+ FIELD_OFFSET(CR4_READ_SHADOW);
+ FIELD_OFFSET(CR3_TARGET_VALUE0);
+ FIELD_OFFSET(CR3_TARGET_VALUE1);
+ FIELD_OFFSET(CR3_TARGET_VALUE2);
+ FIELD_OFFSET(CR3_TARGET_VALUE3);
+ FIELD_OFFSET(EXIT_QUALIFICATION);
+ FIELD_OFFSET(GUEST_LINEAR_ADDRESS);
+ FIELD_OFFSET(GUEST_CR0);
+ FIELD_OFFSET(GUEST_CR3);
+ FIELD_OFFSET(GUEST_CR4);
+ FIELD_OFFSET(GUEST_ES_BASE);
+ FIELD_OFFSET(GUEST_CS_BASE);
+ FIELD_OFFSET(GUEST_SS_BASE);
+ FIELD_OFFSET(GUEST_DS_BASE);
+ FIELD_OFFSET(GUEST_FS_BASE);
+ FIELD_OFFSET(GUEST_GS_BASE);
+ FIELD_OFFSET(GUEST_LDTR_BASE);
+ FIELD_OFFSET(GUEST_TR_BASE);
+ FIELD_OFFSET(GUEST_GDTR_BASE);
+ FIELD_OFFSET(GUEST_IDTR_BASE);
+ FIELD_OFFSET(GUEST_DR7);
+ FIELD_OFFSET(GUEST_RSP);
+ FIELD_OFFSET(GUEST_RIP);
+ FIELD_OFFSET(GUEST_RFLAGS);
+ FIELD_OFFSET(GUEST_PENDING_DBG_EXCEPTIONS);
+ FIELD_OFFSET(GUEST_SYSENTER_ESP);
+ FIELD_OFFSET(GUEST_SYSENTER_EIP);
+ FIELD_OFFSET(HOST_CR0);
+ FIELD_OFFSET(HOST_CR3);
+ FIELD_OFFSET(HOST_CR4);
+ FIELD_OFFSET(HOST_FS_BASE);
+ FIELD_OFFSET(HOST_GS_BASE);
+ FIELD_OFFSET(HOST_TR_BASE);
+ FIELD_OFFSET(HOST_GDTR_BASE);
+ FIELD_OFFSET(HOST_IDTR_BASE);
+ FIELD_OFFSET(HOST_IA32_SYSENTER_ESP);
+ FIELD_OFFSET(HOST_IA32_SYSENTER_EIP);
+ FIELD_OFFSET(HOST_RSP);
+ FIELD_OFFSET(HOST_RIP);
+}
+
+/*
+ * The format of VMCSINFO is given below:
+ * +-------------+--------------------------+
+ * | Byte offset | Contents |
+ * +-------------+--------------------------+
+ * | 0 | VMCS revision identifier |
+ * +-------------+--------------------------+
+ * | 4 | <field><encoded offset> |
+ * +-------------+--------------------------+
+ * | 16 | <field><encoded offset> |
+ * +-------------+--------------------------+
+ * ......
+ *
+ * The first 32 bits of VMCSINFO contain the VMCS revision
+ * identifier.
+ * The remainder of VMCSINFO is used for <field><encoded offset>
+ * sets. Each set takes 12 bytes: the field occupies 4 bytes
+ * and its corresponding encoded offset occupies 8 bytes.
+ *
+ * Encoded offsets are raw values read by vmcs_read{16, 64, 32, l}.
+ * They are all zero-extended to 8 bytes so that every
+ * <field><encoded offset> set has the same size.
+ * We do not decode offsets here. The decoding work is deferred
+ * to userspace tools.
+ *
+ * Note that offsets of the following fields are not recorded
+ * in VMCSINFO:
+ * 1. fields defined in the Intel specification (Intel® 64 and
+ * IA-32 Architectures Software Developer’s Manual, Volume
+ * 3C) but not defined in *vmcs_field*.
+ * 2. fields that do not exist because their corresponding
+ * control bits are not set.
+ */
+static int __init alloc_vmcsinfo_init(void)
+{
+/*
+ * The first 8 bytes of the VMCS region hold the
+ * VMCS revision identifier and the
+ * VMX-abort indicator.
+ */
+#define FIELD_START (8)
+
+ int r, offset;
+ struct vmcs *vmcs;
+ int cpu;
+
+ if (vmcsinfo_size)
+ return 0;
+
+ vmcs = alloc_vmcs();
+ if (!vmcs) {
+ return -ENOMEM;
+ }
+
+ r = hardware_enable_all();
+ if (r)
+ goto out_err;
+
+ /*
+ * Write encoded offsets into VMCS data for later vmcs_read.
+ */
+ for (offset = FIELD_START; offset < vmcs_config.size;
+ offset += sizeof(u16))
+ *(u16 *)((char *)vmcs + offset) = ENCODING_OFFSET(offset);
+
+ cpu = get_cpu();
+ vmcs_clear(vmcs);
+ per_cpu(current_vmcs, cpu) = vmcs;
+ vmcs_load(vmcs);
+
+ VMCSINFO_REVISION_ID(vmcs->revision_id);
+ append_control_field();
+
+ vmcs_write_control_field(PIN_BASED_VM_EXEC_CONTROL,
+ vmcs_config.pin_based_exec_ctrl);
+ vmcs_write_control_field(CPU_BASED_VM_EXEC_CONTROL,
+ vmcs_config.cpu_based_exec_ctrl);
+ if (cpu_has_secondary_exec_ctrls()) {
+ vmcs_write_control_field(SECONDARY_VM_EXEC_CONTROL,
+ vmcs_config.cpu_based_2nd_exec_ctrl);
+ }
+ vmcs_write_control_field(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+ vmcs_write_control_field(VM_ENTRY_CONTROLS, vmcs_config.vmentry_ctrl);
+
+ append_field16();
+ append_field64();
+ append_field32();
+ append_field();
+
+ update_vmcsinfo_note();
+
+ vmcs_clear(vmcs);
+ put_cpu();
+
+out_err:
+ free_vmcs(vmcs);
+ return r;
+}
+
+static void __exit alloc_vmcsinfo_exit(void)
+{
+ hardware_disable_all();
+}
+
+module_init(alloc_vmcsinfo_init);
+module_exit(alloc_vmcsinfo_exit);
--
1.7.1
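
For readers following the VMCSINFO layout described in the comment above alloc_vmcsinfo_init(), here is a minimal userspace sketch of how a tool might walk the buffer. It assumes the caller already has the raw VMCSINFO payload (without any ELF note header), knows its actual length, and reads it in host byte order. The names vmcsinfo_record and vmcsinfo_parse are hypothetical and do not appear in the patch set. Decoding the encoded offsets is left out, since it depends on ENCODING_OFFSET() defined elsewhere in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Layout taken from the comment in the patch: a 4-byte revision
 * identifier followed by 12-byte sets, each made of a 4-byte field
 * and an 8-byte encoded offset.
 */
#define VMCSINFO_HEADER_SIZE 4u
#define VMCSINFO_RECORD_SIZE 12u

/* Hypothetical in-memory form of one <field><encoded offset> set. */
struct vmcsinfo_record {
	uint32_t field;
	uint64_t encoded_offset;
};

/*
 * Copy up to max records out of buf. Returns the number of complete
 * records found, or -1 if buf is too short to hold the revision id.
 * Host byte order is an assumption of this sketch.
 */
static int vmcsinfo_parse(const unsigned char *buf, size_t len,
			  uint32_t *revision_id,
			  struct vmcsinfo_record *out, size_t max)
{
	size_t pos, n = 0;

	assert(buf != NULL && revision_id != NULL && out != NULL);

	if (len < VMCSINFO_HEADER_SIZE)
		return -1;
	memcpy(revision_id, buf, sizeof(*revision_id));

	for (pos = VMCSINFO_HEADER_SIZE;
	     pos + VMCSINFO_RECORD_SIZE <= len && n < max;
	     pos += VMCSINFO_RECORD_SIZE, n++) {
		memcpy(&out[n].field, buf + pos, 4);
		memcpy(&out[n].encoded_offset, buf + pos + 4, 8);
	}
	return (int)n;
}
```

Trailing bytes that do not form a complete 12-byte set are ignored, so a buffer padded up to the maximum size should be truncated to the filled length before parsing.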

2012-05-16 07:57:27

by Zhang Yanfei

Subject: [PATCH v2 4/5] ksysfs: Export VMCSINFO via sysfs

This patch creates two sysfs files that export the physical address
at which VMCSINFO is allocated and its maximum size, as below:
$ cat /sys/kernel/vmcsinfo
1cb88a0
$ cat /sys/kernel/vmcsinfo_maxsize
1000
/sys/kernel/vmcsinfo shows the physical address of VMCSINFO,
while /sys/kernel/vmcsinfo_maxsize shows the maximum size of VMCSINFO.
Both values are printed in hexadecimal.

Signed-off-by: zhangyanfei <[email protected]>
---
kernel/ksysfs.c | 29 +++++++++++++++++++++++++++++
1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 4e316e1..8a27ece 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -18,6 +18,8 @@
#include <linux/stat.h>
#include <linux/sched.h>
#include <linux/capability.h>
+#include <asm/vmcsinfo.h>
+#include <asm/virtext.h>

#define KERNEL_ATTR_RO(_name) \
static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
@@ -133,6 +135,29 @@ KERNEL_ATTR_RO(vmcoreinfo);

#endif /* CONFIG_KEXEC */

+#ifdef CONFIG_X86
+static ssize_t vmcsinfo_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ if (cpu_has_vmx())
+ return sprintf(buf, "%lx\n",
+ paddr_vmcsinfo_note());
+ return 0;
+}
+KERNEL_ATTR_RO(vmcsinfo);
+
+static ssize_t vmcsinfo_maxsize_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ if (cpu_has_vmx())
+ return sprintf(buf, "%x\n",
+ (unsigned int)vmcsinfo_max_size);
+ return 0;
+}
+KERNEL_ATTR_RO(vmcsinfo_maxsize);
+
+#endif /* CONFIG_X86 */
+
/* whether file capabilities are enabled */
static ssize_t fscaps_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
@@ -182,6 +207,10 @@ static struct attribute * kernel_attrs[] = {
&kexec_crash_size_attr.attr,
&vmcoreinfo_attr.attr,
#endif
+#ifdef CONFIG_X86
+ &vmcsinfo_attr.attr,
+ &vmcsinfo_maxsize_attr.attr,
+#endif
NULL
};

--
1.7.1
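
As an illustration of how a userspace tool could consume the two files, here is a small sketch that parses the text printed by vmcsinfo_show() and vmcsinfo_maxsize_show(). Both use hexadecimal formats ("%lx\n" and "%x\n"), and both print nothing when the CPU lacks VMX, so an empty string is treated as an error. The helper name parse_sysfs_hex is hypothetical; actually reading the files is left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a hexadecimal value as printed by the sysfs show functions.
 * Accepts an optional trailing newline. Returns 0 on success and
 * -1 on empty, malformed, or out-of-range input.
 */
static int parse_sysfs_hex(const char *text, unsigned long long *out)
{
	char *end;
	unsigned long long v;

	assert(text != NULL && out != NULL);

	errno = 0;
	v = strtoull(text, &end, 16);
	if (end == text || errno == ERANGE)
		return -1;	/* nothing parsed, or value too large */
	if (*end != '\n' && *end != '\0')
		return -1;	/* unexpected trailing characters */

	*out = v;
	return 0;
}
```

With the sample output from the commit message, "1cb88a0" parses to physical address 0x1cb88a0 and "1000" to a maximum size of 0x1000 (4096) bytes.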

2012-05-16 07:58:57

by Zhang Yanfei

Subject: [PATCH v2 5/5] Documentation: Add ABI entry for sysfs file vmcsinfo and vmcsinfo_maxsize

This patch adds a Documentation/ABI entry for the two new sysfs
files, vmcsinfo and vmcsinfo_maxsize.

Signed-off-by: zhangyanfei <[email protected]>
---
Documentation/ABI/testing/sysfs-kernel-vmcsinfo | 16 ++++++++++++++++
1 files changed, 16 insertions(+), 0 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-vmcsinfo

diff --git a/Documentation/ABI/testing/sysfs-kernel-vmcsinfo b/Documentation/ABI/testing/sysfs-kernel-vmcsinfo
new file mode 100644
index 0000000..adbf866
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-vmcsinfo
@@ -0,0 +1,16 @@
+What: /sys/kernel/vmcsinfo
+Date: April 2012
+KernelVersion: 3.4.0
+Contact: Zhang Yanfei <[email protected]>
+Description:
+ Shows the physical address of VMCSINFO. VMCSINFO contains
+ the VMCS revision identifier and encoded offsets of fields
+ in VMCS data on Intel processors equipped with the VT
+ extensions.
+
+What: /sys/kernel/vmcsinfo_maxsize
+Date: April 2012
+KernelVersion: 3.4.0
+Contact: Zhang Yanfei <[email protected]>
+Description:
+ Shows the maximum size of VMCSINFO.
--
1.7.1

2012-05-20 17:44:38

by Avi Kivity

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 05/16/2012 10:50 AM, zhangyanfei wrote:
> This patch set exports offsets of VMCS fields as note information for
> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
> runtime state of guest machine image, such as registers, in host
> machine's crash dump as VMCS format. The problem is that VMCS internal
> is hidden by Intel in its specification. So, we solve this problem
> by reverse engineering implemented in this patch set. The VMCSINFO
> is exported via sysfs to kexec-tools just like VMCOREINFO.
>
> Here are two use cases for two features that we want.
>
> 1) Create guest machine's crash dumpfile from host machine's crash dumpfile
>
> In general, we want to use this feature on failure analysis for the system
> where the processing depends on the communication between host and guest
> machines to look into the system from both machines's viewpoints.
>
> As a concrete situation, consider where there's heartbeat monitoring
> feature on the guest machine's side, where we need to determine in
> which machine side the cause of heartbeat stop lies. In our actual
> experiments, we encountered such situation and we found the cause of
> the bug was in host's process scheduler so guest machine's vcpu stopped
> for a long time and then led to heartbeat stop.
>
> The module that judges heartbeat stop is on guest machine, so we need
> to debug guest machine's data. But if the cause lies in host machine
> side, we need to look into host machine's crash dump.

Do you mean that a heartbeat failure in the guest led to a host panic?

My expectation is that a problem in the guest will cause the guest to
panic and perhaps produce a dump; the host will remain up.

> Without this feature, we first create guest machine's dump and then
> create host machine's, but there's only a short time between two
> processings, during which it's unlikely that buggy situation remains.
>
> So, we think the feature is useful to debug both guest machine's and
> host machine's sides at the same time, and expect we can make failure
> analysis efficiently.
>
> Of course, we believe this feature is commonly useful on the situation
> where guest machine doesn't work well due to something of host machine's.
>
> 2) Get offsets of VMCS information on the CPU running on the host machine
>
> If kdump doesn't work well, then it means we cannot use kvm API to get
> register values of guest machine and they are still left on its vmcs
> region. In the case, we use crash dump mechanism running outside of
> linux kernel, such as sadump, a firmware-based crash dump. Then VMCS
> information is then necessary.

Shouldn't sadump then expose the VMCS offsets? Perhaps bundling them
into its dump file?


--
error compiling committee.c: too many arguments to function

2012-05-21 02:33:44

by Zhang Yanfei

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 2012-05-21 01:43, Avi Kivity wrote:
> On 05/16/2012 10:50 AM, zhangyanfei wrote:
>> This patch set exports offsets of VMCS fields as note information for
>> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
>> runtime state of guest machine image, such as registers, in host
>> machine's crash dump as VMCS format. The problem is that VMCS internal
>> is hidden by Intel in its specification. So, we solve this problem
>> by reverse engineering implemented in this patch set. The VMCSINFO
>> is exported via sysfs to kexec-tools just like VMCOREINFO.
>>
>> Here are two use cases for two features that we want.
>>
>> 1) Create guest machine's crash dumpfile from host machine's crash dumpfile
>>
>> In general, we want to use this feature on failure analysis for the system
>> where the processing depends on the communication between host and guest
>> machines to look into the system from both machines's viewpoints.
>>
>> As a concrete situation, consider where there's heartbeat monitoring
>> feature on the guest machine's side, where we need to determine in
>> which machine side the cause of heartbeat stop lies. In our actual
>> experiments, we encountered such situation and we found the cause of
>> the bug was in host's process scheduler so guest machine's vcpu stopped
>> for a long time and then led to heartbeat stop.
>>
>> The module that judges heartbeat stop is on guest machine, so we need
>> to debug guest machine's data. But if the cause lies in host machine
>> side, we need to look into host machine's crash dump.
>
> Do you mean, that a heartbeat failure in the guest lead to host panic?
>
> My expectation is that a problem in the guest will cause the guest to
> panic and perhaps produce a dump; the host will remain up.
>

The point is that, before our investigation, we didn't know which side
caused this buggy situation. The heartbeat failure could have been caused
by a bug in either the host machine or the guest machine itself.

So we want to get both the host machine's crash dump and the guest machine's
crash dump *at the same time*. Then we can use userspace tools to extract
the guest machine's crash dump from the host machine's and analyse them
separately to find which side caused the problem.

>> Without this feature, we first create guest machine's dump and then
>> create host machine's, but there's only a short time between two
>> processings, during which it's unlikely that buggy situation remains.
>>
>> So, we think the feature is useful to debug both guest machine's and
>> host machine's sides at the same time, and expect we can make failure
>> analysis efficiently.
>>
>> Of course, we believe this feature is commonly useful on the situation
>> where guest machine doesn't work well due to something of host machine's.
>>
>> 2) Get offsets of VMCS information on the CPU running on the host machine
>>
>> If kdump doesn't work well, then it means we cannot use kvm API to get
>> register values of guest machine and they are still left on its vmcs
>> region. In the case, we use crash dump mechanism running outside of
>> linux kernel, such as sadump, a firmware-based crash dump. Then VMCS
>> information is then necessary.
>
> Shouldn't sadump then expose the VMCS offsets? Perhaps bundling them
> into its dump file?
>

A firmware-based crash dump is not concerned with the OS running on the
machine, so it does no OS-specific handling when the machine crashes.

Thanks
Zhang Yanfei


2012-05-21 08:34:36

by Avi Kivity

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 05/21/2012 05:32 AM, Yanfei Zhang wrote:
> On 2012-05-21 01:43, Avi Kivity wrote:
> > On 05/16/2012 10:50 AM, zhangyanfei wrote:
> >> This patch set exports offsets of VMCS fields as note information for
> >> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
> >> runtime state of guest machine image, such as registers, in host
> >> machine's crash dump as VMCS format. The problem is that VMCS internal
> >> is hidden by Intel in its specification. So, we solve this problem
> >> by reverse engineering implemented in this patch set. The VMCSINFO
> >> is exported via sysfs to kexec-tools just like VMCOREINFO.
> >>
> >> Here are two use cases for two features that we want.
> >>
> >> 1) Create guest machine's crash dumpfile from host machine's crash dumpfile
> >>
> >> In general, we want to use this feature on failure analysis for the system
> >> where the processing depends on the communication between host and guest
> >> machines to look into the system from both machines's viewpoints.
> >>
> >> As a concrete situation, consider where there's heartbeat monitoring
> >> feature on the guest machine's side, where we need to determine in
> >> which machine side the cause of heartbeat stop lies. In our actual
> >> experiments, we encountered such situation and we found the cause of
> >> the bug was in host's process scheduler so guest machine's vcpu stopped
> >> for a long time and then led to heartbeat stop.
> >>
> >> The module that judges heartbeat stop is on guest machine, so we need
> >> to debug guest machine's data. But if the cause lies in host machine
> >> side, we need to look into host machine's crash dump.
> >
> > Do you mean, that a heartbeat failure in the guest lead to host panic?
> >
> > My expectation is that a problem in the guest will cause the guest to
> > panic and perhaps produce a dump; the host will remain up.
> >
>
> The point is that before our investigation, we didn't know which side
> leads to this buggy situation. Maybe a bug in host machine or the guest
> machine itself causes a heartbeat failure.

How can a guest bug cause a host panic?

> So we want to get both host machine's crash dump and guest machine's
> crash dump *at the same time*. Then we could use userspace tools to
> get guest machine crash dump from host machine's and analyse them
> separately to find which side causes the problem.
>

If the guest caused the problem, there would be no panic; therefore
there was a host bug.

> >> Without this feature, we first create guest machine's dump and then
> >> create host machine's, but there's only a short time between two
> >> processings, during which it's unlikely that buggy situation remains.
> >>
> >> So, we think the feature is useful to debug both guest machine's and
> >> host machine's sides at the same time, and expect we can make failure
> >> analysis efficiently.
> >>
> >> Of course, we believe this feature is commonly useful on the situation
> >> where guest machine doesn't work well due to something of host machine's.
> >>
> >> 2) Get offsets of VMCS information on the CPU running on the host machine
> >>
> >> If kdump doesn't work well, then it means we cannot use kvm API to get
> >> register values of guest machine and they are still left on its vmcs
> >> region. In the case, we use crash dump mechanism running outside of
> >> linux kernel, such as sadump, a firmware-based crash dump. Then VMCS
> >> information is then necessary.
> >
> > Shouldn't sadump then expose the VMCS offsets? Perhaps bundling them
> > into its dump file?
> >
>
> Firmware-based crash dump doesn't concern the os running on the machine.
> So it will not do any os handling when machine crashes.

Seems to me the VMCS offsets are OS independent.

--
error compiling committee.c: too many arguments to function

2012-05-21 09:09:31

by Zhang Yanfei

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 2012-05-21 16:34, Avi Kivity wrote:
> On 05/21/2012 05:32 AM, Yanfei Zhang wrote:
>> On 2012-05-21 01:43, Avi Kivity wrote:
>>> On 05/16/2012 10:50 AM, zhangyanfei wrote:
>>>> This patch set exports offsets of VMCS fields as note information for
>>>> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
>>>> runtime state of guest machine image, such as registers, in host
>>>> machine's crash dump as VMCS format. The problem is that VMCS internal
>>>> is hidden by Intel in its specification. So, we solve this problem
>>>> by reverse engineering implemented in this patch set. The VMCSINFO
>>>> is exported via sysfs to kexec-tools just like VMCOREINFO.
>>>>
>>>> Here are two use cases for two features that we want.
>>>>
>>>> 1) Create guest machine's crash dumpfile from host machine's crash dumpfile
>>>>
>>>> In general, we want to use this feature on failure analysis for the system
>>>> where the processing depends on the communication between host and guest
>>>> machines to look into the system from both machines's viewpoints.
>>>>
>>>> As a concrete situation, consider where there's heartbeat monitoring
>>>> feature on the guest machine's side, where we need to determine in
>>>> which machine side the cause of heartbeat stop lies. In our actual
>>>> experiments, we encountered such situation and we found the cause of
>>>> the bug was in host's process scheduler so guest machine's vcpu stopped
>>>> for a long time and then led to heartbeat stop.
>>>>
>>>> The module that judges heartbeat stop is on guest machine, so we need
>>>> to debug guest machine's data. But if the cause lies in host machine
>>>> side, we need to look into host machine's crash dump.
>>>
>>> Do you mean, that a heartbeat failure in the guest lead to host panic?
>>>
>>> My expectation is that a problem in the guest will cause the guest to
>>> panic and perhaps produce a dump; the host will remain up.
>>>
>>
>> The point is that before our investigation, we didn't know which side
>> leads to this buggy situation. Maybe a bug in host machine or the guest
>> machine itself causes a heartbeat failure.
>
> How can a guest bug cause a host panic?
>
>> So we want to get both host machine's crash dump and guest machine's
>> crash dump *at the same time*. Then we could use userspace tools to
>> get guest machine crash dump from host machine's and analyse them
>> separately to find which side causes the problem.
>>
>
> If the guest caused the problem, there would be no panic; therefore
> there was a host bug.
>

Yes, a guest bug cannot cause a host panic. When the heartbeat stops in the
guest machine, we can trigger the host's dump mechanism. We do this because
we want to capture the state of both the host and the guest machine at the
moment the heartbeat stops in the guest. Then we can look for the cause of
the bug from both the host machine's and the guest machine's views.

>>>> Without this feature, we first create guest machine's dump and then
>>>> create host machine's, but there's only a short time between two
>>>> processings, during which it's unlikely that buggy situation remains.
>>>>
>>>> So, we think the feature is useful to debug both guest machine's and
>>>> host machine's sides at the same time, and expect we can make failure
>>>> analysis efficiently.
>>>>
>>>> Of course, we believe this feature is commonly useful on the situation
>>>> where guest machine doesn't work well due to something of host machine's.
>>>>
>>>> 2) Get offsets of VMCS information on the CPU running on the host machine
>>>>
>>>> If kdump doesn't work well, then it means we cannot use kvm API to get
>>>> register values of guest machine and they are still left on its vmcs
>>>> region. In the case, we use crash dump mechanism running outside of
>>>> linux kernel, such as sadump, a firmware-based crash dump. Then VMCS
>>>> information is then necessary.
>>>
>>> Shouldn't sadump then expose the VMCS offsets? Perhaps bundling them
>>> into its dump file?
>>>
>>
>> Firmware-based crash dump doesn't concern the os running on the machine.
>> So it will not do any os handling when machine crashes.
>
> Seems to me the VMCS offsets are OS independent.
>
Hmm, do you mean we could get the VMCS offsets in sadump itself?
I think that if we just export the VMCS offsets from the kernel, we can use
the existing dump tools with no change or only a very small one. That seems
a more general mechanism than making changes in all kinds of dump tools.

Thanks
Zhang Yanfei

2012-05-21 09:36:33

by Avi Kivity

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 05/21/2012 12:08 PM, Yanfei Zhang wrote:
> On 2012-05-21 16:34, Avi Kivity wrote:
> > On 05/21/2012 05:32 AM, Yanfei Zhang wrote:
> >> On 2012-05-21 01:43, Avi Kivity wrote:
> >>> On 05/16/2012 10:50 AM, zhangyanfei wrote:
> >>>> This patch set exports offsets of VMCS fields as note information for
> >>>> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
> >>>> runtime state of guest machine image, such as registers, in host
> >>>> machine's crash dump as VMCS format. The problem is that VMCS internal
> >>>> is hidden by Intel in its specification. So, we solve this problem
> >>>> by reverse engineering implemented in this patch set. The VMCSINFO
> >>>> is exported via sysfs to kexec-tools just like VMCOREINFO.
> >>>>
> >>>> Here are two use cases for two features that we want.
> >>>>
> >>>> 1) Create guest machine's crash dumpfile from host machine's crash dumpfile
> >>>>
> >>>> In general, we want to use this feature on failure analysis for the system
> >>>> where the processing depends on the communication between host and guest
> >>>> machines to look into the system from both machines's viewpoints.
> >>>>
> >>>> As a concrete situation, consider where there's heartbeat monitoring
> >>>> feature on the guest machine's side, where we need to determine in
> >>>> which machine side the cause of heartbeat stop lies. In our actual
> >>>> experiments, we encountered such situation and we found the cause of
> >>>> the bug was in host's process scheduler so guest machine's vcpu stopped
> >>>> for a long time and then led to heartbeat stop.
> >>>>
> >>>> The module that judges heartbeat stop is on guest machine, so we need
> >>>> to debug guest machine's data. But if the cause lies in host machine
> >>>> side, we need to look into host machine's crash dump.
> >>>
> >>> Do you mean, that a heartbeat failure in the guest lead to host panic?
> >>>
> >>> My expectation is that a problem in the guest will cause the guest to
> >>> panic and perhaps produce a dump; the host will remain up.
> >>>
> >>
> >> The point is that before our investigation, we didn't know which side
> >> leads to this buggy situation. Maybe a bug in host machine or the guest
> >> machine itself causes a heartbeat failure.
> >
> > How can a guest bug cause a host panic?
> >
> >> So we want to get both host machine's crash dump and guest machine's
> >> crash dump *at the same time*. Then we could use userspace tools to
> >> get guest machine crash dump from host machine's and analyse them
> >> separately to find which side causes the problem.
> >>
> >
> > If the guest caused the problem, there would be no panic; therefore
> > there was a host bug.
> >
>
> Yes, a guest bug cannot cause a host panic. When heartbeat stops in guest
> machine, we could trigger the host dump mechanism to work. This is because
> we want to get the status of both host and guest machine at the same time
> when heartbeat stops in guest machine. Then we can look for bug reasons
> from both host machine's and guest machine's views.

That sounds like a bad idea. Can you explain in what situation it makes
sense for a guest to stop the host (and all other guests running on it)
rather than just restarting the failed services (on the host or other
guests)?

> >>>> Without this feature, we first create guest machine's dump and then
> >>>> create host machine's, but there's only a short time between two
> >>>> processings, during which it's unlikely that buggy situation remains.
> >>>>
> >>>> So, we think the feature is useful to debug both guest machine's and
> >>>> host machine's sides at the same time, and expect we can make failure
> >>>> analysis efficiently.
> >>>>
> >>>> Of course, we believe this feature is commonly useful on the situation
> >>>> where guest machine doesn't work well due to something of host machine's.
> >>>>
> >>>> 2) Get offsets of VMCS information on the CPU running on the host machine
> >>>>
> >>>> If kdump doesn't work well, then it means we cannot use kvm API to get
> >>>> register values of guest machine and they are still left on its vmcs
> >>>> region. In the case, we use crash dump mechanism running outside of
> >>>> linux kernel, such as sadump, a firmware-based crash dump. Then VMCS
> >>>> information is then necessary.
> >>>
> >>> Shouldn't sadump then expose the VMCS offsets? Perhaps bundling them
> >>> into its dump file?
> >>>
> >>
> >> Firmware-based crash dump doesn't concern the os running on the machine.
> >> So it will not do any os handling when machine crashes.
> >
> > Seems to me the VMCS offsets are OS independent.
> >
> Hmm, you mean we could get VMCS offsets in sadump itself?
> But I think if we just export VMCS offsets in kernel, we could use the current
> existing dump tools with no or just very tiny change. I think this could be
> a more general mechanism than making changes in all kinds of dump tools.

The sadump tool generates a core file with the OS image, right? Can it
not attach the offsets to a note, just like you propose for kdump?

--
error compiling committee.c: too many arguments to function

2012-05-21 18:58:54

by Eric Northup

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On Wed, May 16, 2012 at 12:50 AM, zhangyanfei
<[email protected]> wrote:
>
> This patch set exports offsets of VMCS fields as note information for
> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
> runtime state of guest machine image, such as registers, in host
> machine's crash dump as VMCS format. The problem is that VMCS internal
> is hidden by Intel in its specification. So, we solve this problem
> by reverse engineering implemented in this patch set. The VMCSINFO
> is exported via sysfs to kexec-tools just like VMCOREINFO.

Perhaps I'm wrong, but this solution seems much, much more dynamic
than it needs to be.

The VMCS offsets aren't going to change between different boots on the
same CPU, unless perhaps the microcode has been updated.

So you can have the VMCS offset dumping be a manually-loaded module.
Build a database mapping from (CPUID, microcode revision) -> (VMCSINFO).
There's no need for anything beyond the (CPUID, microcode revision) to
be put in the kdump, since your offline processing of a kdump can then
look up the rest.

It means you don't have to interact with the vmx module at all, and
no extra modules or code have to be loaded on the millions of Linux
machines that won't need the functionality.
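
To make the database idea above concrete, here is a rough sketch of what the offline lookup could look like. Everything in it is an assumption for illustration: the struct and function names are hypothetical, the key is taken to be a 32-bit CPUID signature plus a 32-bit microcode revision, and the stored value is simply a path to a previously collected VMCSINFO blob.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical database row: (CPUID, microcode revision) -> VMCSINFO. */
struct vmcsinfo_db_entry {
	uint32_t cpuid_signature;	/* assumed key, e.g. family/model/stepping */
	uint32_t microcode_rev;		/* assumed key */
	const char *vmcsinfo_path;	/* where the collected VMCSINFO is stored */
};

/*
 * Linear search over the table. Returns the matching entry, or NULL
 * when this CPU and microcode combination has not been collected yet.
 */
static const struct vmcsinfo_db_entry *
vmcsinfo_db_lookup(const struct vmcsinfo_db_entry *db, size_t n,
		   uint32_t cpuid_signature, uint32_t microcode_rev)
{
	size_t i;

	assert(db != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (db[i].cpuid_signature == cpuid_signature &&
		    db[i].microcode_rev == microcode_rev)
			return &db[i];
	}
	return NULL;
}
```

Under this scheme the kdump only needs to carry the two key values; a NULL result from the lookup means the offsets for that machine still have to be collected with the manually loaded module.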

2012-05-22 04:00:17

by Zhang Yanfei

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 2012-05-22 02:58, Eric Northup wrote:
> On Wed, May 16, 2012 at 12:50 AM, zhangyanfei
> <[email protected]> wrote:
>>
>> This patch set exports offsets of VMCS fields as note information for
>> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
>> runtime state of guest machine image, such as registers, in host
>> machine's crash dump as VMCS format. The problem is that VMCS internal
>> is hidden by Intel in its specification. So, we solve this problem
>> by reverse engineering implemented in this patch set. The VMCSINFO
>> is exported via sysfs to kexec-tools just like VMCOREINFO.
>
> Perhaps I'm wrong, but this solution seems much, much more dynamic
> than it needs to be.
>
> The VMCS offsets aren't going to change between different boots on the
> same CPU, unless perhaps the microcode has been updated.
>
> So you can have the VMCS offset dumping be a manually-loaded module.
> Build a database mapping from (CPUID, microcode revision) -> (VMCSINFO).
> There's no need for anything beyond the (CPUID, microcode revision) to
> be put in the kdump, since your offline processing of a kdump can then
> look up the rest.
>
> It means you don't have to interact with the vmx module at all, and
> no extra modules or code have to be loaded on the millions of Linux
> machines that won't need the functionality.
>

We considered this approach, but there are two issues:
1) The VMX resource is unique per CPU, and it is risky to grab it forcibly in an
environment where the kvm module is in use, particularly in a customer's environment.
To do this safely, kvm support is needed.

2) It is very costly to obtain every CPU used in every customer environment in
order to collect vmcsinfo. After all, our customers run a wide variety of environments.

Our patch provides a module, so those who don't want this feature can simply
prevent it from being auto-loaded at system startup.

Thanks
Zhang Yanfei

2012-05-22 04:00:26

by Zhang Yanfei

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 2012-05-21 17:36, Avi Kivity wrote:
> On 05/21/2012 12:08 PM, Yanfei Zhang wrote:
>> On 2012-05-21 16:34, Avi Kivity wrote:
>>> On 05/21/2012 05:32 AM, Yanfei Zhang wrote:
>>>> On 2012-05-21 01:43, Avi Kivity wrote:
>>>>> On 05/16/2012 10:50 AM, zhangyanfei wrote:
>>>>>> This patch set exports offsets of VMCS fields as note information for
>>>>>> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
>>>>>> runtime state of guest machine image, such as registers, in host
>>>>>> machine's crash dump as VMCS format. The problem is that VMCS internal
>>>>>> is hidden by Intel in its specification. So, we solve this problem
>>>>>> by reverse engineering implemented in this patch set. The VMCSINFO
>>>>>> is exported via sysfs to kexec-tools just like VMCOREINFO.
>>>>>>
>>>>>> Here are two use cases for two features that we want.
>>>>>>
>>>>>> 1) Create guest machine's crash dumpfile from host machine's crash dumpfile
>>>>>>
>>>>>> In general, we want to use this feature on failure analysis for the system
>>>>>> where the processing depends on the communication between host and guest
>>>>>> machines to look into the system from both machines's viewpoints.
>>>>>>
>>>>>> As a concrete situation, consider where there's heartbeat monitoring
>>>>>> feature on the guest machine's side, where we need to determine in
>>>>>> which machine side the cause of heartbeat stop lies. In our actual
>>>>>> experiments, we encountered such situation and we found the cause of
>>>>>> the bug was in host's process scheduler so guest machine's vcpu stopped
>>>>>> for a long time and then led to heartbeat stop.
>>>>>>
>>>>>> The module that judges heartbeat stop is on guest machine, so we need
>>>>>> to debug guest machine's data. But if the cause lies in host machine
>>>>>> side, we need to look into host machine's crash dump.
>>>>>
>>>>> Do you mean, that a heartbeat failure in the guest lead to host panic?
>>>>>
>>>>> My expectation is that a problem in the guest will cause the guest to
>>>>> panic and perhaps produce a dump; the host will remain up.
>>>>>
>>>>
>>>> The point is that before our investigation, we didn't know which side
>>>> leads to this buggy situation. Maybe a bug in host machine or the guest
>>>> machine itself causes a heartbeat failure.
>>>
>>> How can a guest bug cause a host panic?
>>>
>>>> So we want to get both host machine's crash dump and guest machine's
>>>> crash dump *at the same time*. Then we could use userspace tools to
>>>> get guest machine crash dump from host machine's and analyse them
>>>> separately to find which side causes the problem.
>>>>
>>>
>>> If the guest caused the problem, there would be no panic; therefore
>>> there was a host bug.
>>>
>>
>> Yes, a guest bug cannot cause a host panic. When heartbeat stops in guest
>> machine, we could trigger the host dump mechanism to work. This is because
>> we want to get the status of both host and guest machine at the same time
>> when heartbeat stops in guest machine. Then we can look for bug reasons
>> from both host machine's and guest machine's views.
>
> That sounds like a bad idea. Can you explain in what situation it makes
> sense for a guest to stop the host (and all other guests running on it)
> rather than just restarting the failed services (on the host or other
> guests)?
>

We never do this in a customer's environment, which may be a host with many guests
running on it. We do this in a separate environment to reproduce the buggy
situation, or during the testing phase in a development environment, before the
system goes into production at the customer's site.

>>>>>> Without this feature, we first create guest machine's dump and then
>>>>>> create host machine's, but there's only a short time between two
>>>>>> processings, during which it's unlikely that buggy situation remains.
>>>>>>
>>>>>> So, we think the feature is useful to debug both guest machine's and
>>>>>> host machine's sides at the same time, and expect we can make failure
>>>>>> analysis efficiently.
>>>>>>
>>>>>> Of course, we believe this feature is commonly useful on the situation
>>>>>> where guest machine doesn't work well due to something of host machine's.
>>>>>>
>>>>>> 2) Get offsets of VMCS information on the CPU running on the host machine
>>>>>>
>>>>>> If kdump doesn't work well, then it means we cannot use kvm API to get
>>>>>> register values of guest machine and they are still left on its vmcs
>>>>>> region. In the case, we use crash dump mechanism running outside of
>>>>>> linux kernel, such as sadump, a firmware-based crash dump. Then VMCS
>>>>>> information is then necessary.
>>>>>
>>>>> Shouldn't sadump then expose the VMCS offsets? Perhaps bundling them
>>>>> into its dump file?
>>>>>
>>>>
>>>> Firmware-based crash dump doesn't concern the os running on the machine.
>>>> So it will not do any os handling when machine crashes.
>>>
>>> Seems to me the VMCS offsets are OS independent.
>>>
>> Hmm, you mean we could get VMCS offsets in sadump itself?
>> But I think if we just export VMCS offsets in kernel, we could use the current
>> existing dump tools with no or just very tiny change. I think this could be
>> a more general mechanism than making changes in all kinds of dump tools.
>
> The sadump tool generates a core file with the OS image, right? Can it
> not attach the offsets to a note, just like you propose for kdump?
>

Both are right.

2012-05-22 20:53:50

by Eric Northup

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On Mon, May 21, 2012 at 8:53 PM, Yanfei Zhang
<[email protected]> wrote:
> On 2012-05-22 02:58, Eric Northup wrote:
[...]
>> So you can have the VMCS offset dumping be a manually-loaded module.
>> Build a database mapping from (CPUID, microcode revision) -> (VMCSINFO).
>> There's no need for anything beyond the (CPUID, microcode revision) to
>> be put in the kdump, since your offline processing of a kdump can then
>> look up the rest.
[...]
>
> We have considered this way, but there are two issues:
> 1) vmx resource is unique for a single cpu, and it's risky to grab it forcibly
> on the environment where kvm module is used, in particular on customer's environment.
> To do this safely, kvm support is needed.

It's not risky: you just have to make sure that no one else is going
to use the VMCS on your CPU while you're running. You can disable
preemption and then save the old VMCS pointer from the CPU (see the
VMPTRST instruction). Load your temporary VMCS pointer, discover
the fields, then restore the original VMCS pointer. Then re-enable
preemption and you're done.

2012-05-28 05:28:00

by Zhang Yanfei

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

Hello Avi,

On 2012-05-22 11:40, Yanfei Zhang wrote:
> On 2012-05-21 17:36, Avi Kivity wrote:
>> On 05/21/2012 12:08 PM, Yanfei Zhang wrote:
>>> On 2012-05-21 16:34, Avi Kivity wrote:
>>>> On 05/21/2012 05:32 AM, Yanfei Zhang wrote:
>>>>> On 2012-05-21 01:43, Avi Kivity wrote:
>>>>>> On 05/16/2012 10:50 AM, zhangyanfei wrote:
>>>>>>> This patch set exports offsets of VMCS fields as note information for
>>>>>>> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
>>>>>>> runtime state of guest machine image, such as registers, in host
>>>>>>> machine's crash dump as VMCS format. The problem is that VMCS internal
>>>>>>> is hidden by Intel in its specification. So, we solve this problem
>>>>>>> by reverse engineering implemented in this patch set. The VMCSINFO
>>>>>>> is exported via sysfs to kexec-tools just like VMCOREINFO.
>>>>>>>
>>>>>>> Here are two use cases for two features that we want.
>>>>>>>
>>>>>>> 1) Create guest machine's crash dumpfile from host machine's crash dumpfile
>>>>>>>
>>>>>>> In general, we want to use this feature on failure analysis for the system
>>>>>>> where the processing depends on the communication between host and guest
>>>>>>> machines to look into the system from both machines's viewpoints.
>>>>>>>
>>>>>>> As a concrete situation, consider where there's heartbeat monitoring
>>>>>>> feature on the guest machine's side, where we need to determine in
>>>>>>> which machine side the cause of heartbeat stop lies. In our actual
>>>>>>> experiments, we encountered such situation and we found the cause of
>>>>>>> the bug was in host's process scheduler so guest machine's vcpu stopped
>>>>>>> for a long time and then led to heartbeat stop.
>>>>>>>
>>>>>>> The module that judges heartbeat stop is on guest machine, so we need
>>>>>>> to debug guest machine's data. But if the cause lies in host machine
>>>>>>> side, we need to look into host machine's crash dump.
>>>>>>
>>>>>> Do you mean, that a heartbeat failure in the guest lead to host panic?
>>>>>>
>>>>>> My expectation is that a problem in the guest will cause the guest to
>>>>>> panic and perhaps produce a dump; the host will remain up.
>>>>>>
>>>>>
>>>>> The point is that before our investigation, we didn't know which side
>>>>> leads to this buggy situation. Maybe a bug in host machine or the guest
>>>>> machine itself causes a heartbeat failure.
>>>>
>>>> How can a guest bug cause a host panic?
>>>>
>>>>> So we want to get both host machine's crash dump and guest machine's
>>>>> crash dump *at the same time*. Then we could use userspace tools to
>>>>> get guest machine crash dump from host machine's and analyse them
>>>>> separately to find which side causes the problem.
>>>>>
>>>>
>>>> If the guest caused the problem, there would be no panic; therefore
>>>> there was a host bug.
>>>>
>>>
>>> Yes, a guest bug cannot cause a host panic. When heartbeat stops in guest
>>> machine, we could trigger the host dump mechanism to work. This is because
>>> we want to get the status of both host and guest machine at the same time
>>> when heartbeat stops in guest machine. Then we can look for bug reasons
>>> from both host machine's and guest machine's views.
>>
>> That sounds like a bad idea. Can you explain in what situation it makes
>> sense for a guest to stop the host (and all other guests running on it)
>> rather than just restarting the failed services (on the host or other
>> guests)?
>>
>
> We never do this on customer's environment which maybe a host with many guests
> running on it. We do this on another environment to reproduce the buggy
> situation; or we do this in testing phase on development environment towards
> production one on the customer's site.
>
>>>>>>> Without this feature, we first create guest machine's dump and then
>>>>>>> create host machine's, but there's only a short time between two
>>>>>>> processings, during which it's unlikely that buggy situation remains.
>>>>>>>
>>>>>>> So, we think the feature is useful to debug both guest machine's and
>>>>>>> host machine's sides at the same time, and expect we can make failure
>>>>>>> analysis efficiently.
>>>>>>>
>>>>>>> Of course, we believe this feature is commonly useful on the situation
>>>>>>> where guest machine doesn't work well due to something of host machine's.
>>>>>>>
>>>>>>> 2) Get offsets of VMCS information on the CPU running on the host machine
>>>>>>>
>>>>>>> If kdump doesn't work well, then it means we cannot use kvm API to get
>>>>>>> register values of guest machine and they are still left on its vmcs
>>>>>>> region. In the case, we use crash dump mechanism running outside of
>>>>>>> linux kernel, such as sadump, a firmware-based crash dump. Then VMCS
>>>>>>> information is then necessary.
>>>>>>
>>>>>> Shouldn't sadump then expose the VMCS offsets? Perhaps bundling them
>>>>>> into its dump file?
>>>>>>
>>>>>
>>>>> Firmware-based crash dump doesn't concern the os running on the machine.
>>>>> So it will not do any os handling when machine crashes.
>>>>
>>>> Seems to me the VMCS offsets are OS independent.
>>>>
>>> Hmm, you mean we could get VMCS offsets in sadump itself?
>>> But I think if we just export VMCS offsets in kernel, we could use the current
>>> existing dump tools with no or just very tiny change. I think this could be
>>> a more general mechanism than making changes in all kinds of dump tools.
>>
>> The sadump tool generates a core file with the OS image, right? Can it
>> not attach the offsets to a note, just like you propose for kdump?
>>
>
> Both are right.


Do you have any comments about this patch set?

Thanks
Zhang Yanfei

2012-05-28 13:29:15

by Avi Kivity

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 05/28/2012 08:25 AM, Yanfei Zhang wrote:
>
> Do you have any comments about this patch set?

I still have a hard time understanding why it is needed. If the host
crashes, there is no reason to look at guest state; the host should
survive no matter what the guest does.


--
error compiling committee.c: too many arguments to function

2012-05-29 07:09:26

by Zhang Yanfei

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 2012-05-28 21:28, Avi Kivity wrote:
> On 05/28/2012 08:25 AM, Yanfei Zhang wrote:
>>
>> Do you have any comments about this patch set?
>
> I still have a hard time understanding why it is needed. If the host
> crashes, there is no reason to look at guest state; the host should
> survive no matter what the guest does.
>
>

OK. Let me summarize it.

1. Why is this patch needed? (Our requirement)

We once hit a buggy situation: a host scheduler bug caused a guest machine's
vcpu to stop for a long time, which led to a heartbeat stop (the host was still running).

We want an efficient way to analyze the bug when we hit a similar
situation, where a guest machine doesn't work well because of something on the host side.

Because we need to debug both the host and the guest sides to find
the cause, we want to get both the host machine's crash dump and the guest machine's
crash dump at the same time, while the buggy situation persists.

2. What will we do?

If this bug is found in a customer's environment, we have two ways to avoid
affecting other guest machines running on the same host. First, we could do the bug
analysis in another environment that reproduces the buggy situation. Second, we
could migrate the other guest machines to other hosts.

After the buggy situation is reproduced, we panic the host *manually*.
Then we can use userland tools to extract the guest machine's crash dump from the
host machine's, using the feature provided by this patch set. Finally, we can analyse
them separately to find which side causes the problem.

Thanks
Zhang Yanfei

2012-06-11 05:36:47

by Zhang Yanfei

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

Hello Avi,

On 2012-05-29 15:06, Yanfei Zhang wrote:
> On 2012-05-28 21:28, Avi Kivity wrote:
>> On 05/28/2012 08:25 AM, Yanfei Zhang wrote:
>>>
>>> Do you have any comments about this patch set?
>>
>> I still have a hard time understanding why it is needed. If the host
>> crashes, there is no reason to look at guest state; the host should
>> survive no matter what the guest does.
>>
>>
>
> OK. Let me summarize it.
>
> 1. Why is this patch needed? (Our requirement)
>
> We once came to a buggy situation: a host scheduler bug caused guest machine's
> vcpu stopped for a long time and then led to heartbeat stop (host is still running).
>
> we want to have an efficient way to make the bug analysis when we come to the similar
> situation where guest machine doesn't work well due to something of host machine's,
>
> Because we should debug both host machine's and guest machine's sides to look for
> the reasons, so we want to get both host machine's crash dump and guest machine's
> crash dump at the same time when the buggy situation remains.
>
> 2. What will we do?
>
> If this bug was found on customer's environment, we have two ways to avoid
> affecting other guest machines running on the same host. First, we could do bug
> analysis on another environment to reproduce the buggy situation; Second, we
> could migrate other guest machines to other hosts.
>
> After the buggy situation is reproduced, we panic the host *manually*.
> Then we could use userland tools to get guest machine's crash dump from host machine's
> with the feature provided by this patch set. Finally we could analyse them separately
> to find which side causes the problem.
>

Could you please tell me what you think of this patch?

And here is a new case from LinuxCon Japan:

Developers from Hitachi are now developing a new livedump mechanism for the
same reason as ours. They have *many times* run into situations where guest
machines crashed due to host failures, particularly during development.

So they are developing this mechanism to get a crash dump while preserving the buggy
situation between the host and guest machines. The difference between their approach and
ours is whether or not the feature is used on a _customer's running machine_.

Thanks
Zhang Yanfei


2012-06-14 13:16:09

by Avi Kivity

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

On 06/11/2012 08:35 AM, Yanfei Zhang wrote:
> Hello Avi,


Sorry about the delay...

>
> On 2012-05-29 15:06, Yanfei Zhang wrote:
>> On 2012-05-28 21:28, Avi Kivity wrote:
>>> On 05/28/2012 08:25 AM, Yanfei Zhang wrote:
>>>>
>>>> Do you have any comments about this patch set?
>>>
>>> I still have a hard time understanding why it is needed. If the host
>>> crashes, there is no reason to look at guest state; the host should
>>> survive no matter what the guest does.
>>>
>>>
>>
>> OK. Let me summarize it.
>>
>> 1. Why is this patch needed? (Our requirement)
>>
>> We once came to a buggy situation: a host scheduler bug caused guest machine's
>> vcpu stopped for a long time and then led to heartbeat stop (host is still running).
>>
>> we want to have an efficient way to make the bug analysis when we come to the similar
>> situation where guest machine doesn't work well due to something of host machine's,
>>
>> Because we should debug both host machine's and guest machine's sides to look for
>> the reasons, so we want to get both host machine's crash dump and guest machine's
>> crash dump at the same time when the buggy situation remains.

I would argue that there are two separate bugs here: (1) a host bug
which caused the scheduling delay, and (2) putting a heartbeat service on a
virtualized guest with no real-time guarantees.

But I understand your situation.

>>
>> 2. What will we do?
>>
>> If this bug was found on customer's environment, we have two ways to avoid
>> affecting other guest machines running on the same host. First, we could do bug
>> analysis on another environment to reproduce the buggy situation; Second, we
>> could migrate other guest machines to other hosts.

You could also use tracing (there's the latency tracer and the scheduler
tracepoints) to debug this on a live system.

>>
>> After the buggy situation is reproduced, we panic the host *manually*.
>> Then we could use userland tools to get guest machine's crash dump from host machine's
>> with the feature provided by this patch set. Finally we could analyse them separately
>> to find which side causes the problem.
>>
>
> Could you please tell me your attitude towards this patch?

I still dislike it conceptually. But let me do a technical review of
the latest version.

> And here is a new case from the LinuxCon Japan:
>
> Developers from Hitachi are now developing a new livedump mechanism for the
> same reason as ours. They have come to the situation *many times* that guest
> machines crashed due to host's failures, in particular, under development.

This has happened to me as well, possibly even more times :). I don't
use crash dumps for debugging but different people may use different
techniques.

> So they develop this mechanism to get crash dump while retaining the buggy
> situation between host and guest machine. The difference between theirs and
> ours is whether or not to use the feature on _customer's running machine_.


--
error compiling committee.c: too many arguments to function

2012-06-14 13:22:13

by Avi Kivity

Subject: Re: [PATCH v2 5/5] Documentation: Add ABI entry for sysfs file vmcsinfo and vmcsinfo_maxsize

On 05/16/2012 10:57 AM, zhangyanfei wrote:
> We create two new sysfs files, vmcsinfo and vmcsinfo_maxsize. And
> here we add a Documentation/ABI entry for them.
>
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-vmcsinfo b/Documentation/ABI/testing/sysfs-kernel-vmcsinfo
> new file mode 100644
> index 0000000..adbf866
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-vmcsinfo
> @@ -0,0 +1,16 @@
> +What: /sys/kernel/vmcsinfo
> +Date: April 2012
> +KernelVersion: 3.4.0
> +Contact: Zhang Yanfei <[email protected]>
> +Description
> + Shows physical address of VMCSINFO. VMCSINFO contains
> + the VMCS revision identifier and encoded offsets of fields
> + in VMCS data on Intel processors equipped with the VT
> + extensions.
> +
> +What: /sys/kernel/vmcsinfo_maxsize
> +Date: April 2012
> +KernelVersion: 3.4.0
> +Contact: Zhang Yanfei <[email protected]>
> +Description
> + Shows max size of VMCSINFO.
>

This describes the cpu, so maybe /sys/devices/cpu is a better place for
these files.

Would it make sense to expose the actual fields?

That is, have /sys/devices/cpu/vmcs/0806 contain the offset of
GUEST_DS_SELECTOR.

--
error compiling committee.c: too many arguments to function

2012-06-14 13:29:06

by Avi Kivity

Subject: Re: [PATCH v2 1/5] x86: Add helper variables and functions to hold VMCSINFO

On 05/16/2012 10:52 AM, zhangyanfei wrote:
> This patch provides a set of variables to hold the VMCSINFO and also
> some helper functions to help fill the VMCSINFO.

Need to document the format.


> +void vmcsinfo_append_id(u32 id)
> +{
> + size_t r;
> +
> + r = sizeof(id);
> + if (r + vmcsinfo_size > vmcsinfo_max_size)
> + return;
> +
> + memcpy(&vmcsinfo_data[vmcsinfo_size], &id, r);
> + vmcsinfo_size += r;
> +}
> +EXPORT_SYMBOL_GPL(vmcsinfo_append_id);
> +
> +void vmcsinfo_append_field(u32 field, u64 offset)

Why u64? It's guaranteed to fit within a page.

> +{
> + size_t r;
> +
> + r = sizeof(field) + sizeof(offset);
> + if (r + vmcsinfo_size > vmcsinfo_max_size)
> + return;
> +
> + memcpy(&vmcsinfo_data[vmcsinfo_size], &field, sizeof(field));
> + vmcsinfo_size += sizeof(field);
> + memcpy(&vmcsinfo_data[vmcsinfo_size], &offset, sizeof(offset));
> + vmcsinfo_size += sizeof(offset);

Instead of this vmcsinfo_data, how about a struct with fields for the
revision ID and field count, and an array for the fields? Should be a
lot simpler.
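
A struct-based layout along those lines might look like the sketch below. It is not taken from the patch set: `struct vmcsinfo`, `struct vmcsinfo_field`, `vmcsinfo_add_field` and the `VMCSINFO_MAX_FIELDS` limit are hypothetical, and the 16-bit offset relies on the point made above that an offset always fits within a page.

```c
#include <stdint.h>

#define VMCSINFO_MAX_FIELDS 256u   /* hypothetical upper bound */

/* One discovered field: its VMCS encoding and its byte offset in the region. */
struct vmcsinfo_field {
    uint32_t encoding;
    uint16_t offset;     /* an offset within one page fits in 16 bits */
    uint16_t reserved;   /* explicit padding keeps the entry at 8 bytes */
};

/* Revision ID, field count, then a fixed-size array of entries. */
struct vmcsinfo {
    uint32_t revision_id;
    uint32_t nfields;
    struct vmcsinfo_field fields[VMCSINFO_MAX_FIELDS];
};

/* Append one entry; returns 0 on success, -1 if the array is full. */
static int vmcsinfo_add_field(struct vmcsinfo *info, uint32_t encoding,
                              uint16_t offset)
{
    if (info->nfields >= VMCSINFO_MAX_FIELDS)
        return -1;
    info->fields[info->nfields].encoding = encoding;
    info->fields[info->nfields].offset = offset;
    info->fields[info->nfields].reserved = 0;
    info->nfields++;
    return 0;
}
```

Compared with appending raw bytes to a buffer, the bounds check becomes a simple count comparison and a reader can index entries directly.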

> +}
> +EXPORT_SYMBOL_GPL(vmcsinfo_append_field);
> +
> +unsigned long paddr_vmcsinfo_note(void)
> +{
> + return __pa((unsigned long)(char *)&vmcsinfo_note);
> +}
>


--
error compiling committee.c: too many arguments to function

2012-06-14 13:38:15

by Avi Kivity

Subject: Re: [PATCH v2 3/5] KVM-INTEL: Add new module vmcsinfo-intel to fill VMCSINFO

On 05/16/2012 10:55 AM, zhangyanfei wrote:
> This patch implements a new module named vmcsinfo-intel. The
> module fills VMCSINFO with the VMCS revision identifier,
> and encoded offsets of VMCS fields.
>
> Note, offsets of fields below will not be filled into VMCSINFO:
> 1. fields defined in Intel specification (Intel® 64 and
> IA-32 Architectures Software Developer’s Manual, Volume
> 3C) but not defined in *vmcs_field*.
> 2. fields don't exist because their corresponding control bits
> are not set.
>
> +
> +/*
> + * We separate these five control fields from other fields
> + * because some fields only exist on processors that support
> + * the 1-setting of control bits in the five control fields.
> + */

I thought this was checked only during VMENTRY. So perhaps you don't
need this special casing.

In fact you might be able to

    // pre-fill vmcs with patterns

    for (i = 0; i < 64k; ++i)
        if (vmcs_read_checking(i, &pattern)) {
            // decode pattern
        } else
            // field does not exist (VM Instruction error 12), ignore

with no knowledge of the control fields, or of any field name.


> +
> +/*
> + * The format of VMCSINFO is given below:
> + * +-------------+--------------------------+
> + * | Byte offset | Contents |
> + * +-------------+--------------------------+
> + * | 0 | VMCS revision identifier |
> + * +-------------+--------------------------+
> + * | 4 | <field><encoded offset> |
> + * +-------------+--------------------------+
> + * | 16 | <field><encoded offset> |
> + * +-------------+--------------------------+
> + * ......
> + *
> + * The first 32 bits of VMCSINFO contains the VMCS revision
> + * identifier.
> + * The remainder of VMCSINFO is used for <field><encoded offset>
> + * sets. Each set takes 12 bytes: field occupys 4 bytes
> + * and its corresponding encoded offset occupys 8 bytes.
> + *
> + * Encoded offsets are raw values read by vmcs_read{16, 64, 32, l},
> + * and they are all unsigned extended to 8 bytes for each
> + * <field><encoded offset> set has the same size.
> + * We do not decode offsets here. The decoding work is delayed
> + * in userspace tools.

It's better to do the decoding here, or no one will know how to do it.
Also have an nfields field.
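
For reference, a userspace reader for the layout documented in the quoted comment (a 4-byte revision identifier followed by 12-byte <field><encoded offset> sets) could look like the sketch below. It assumes the data is read on a machine with the same byte order as the one that wrote it; `vmcsinfo_parse` and `struct vmcsinfo_record` are hypothetical names and not part of the patch set.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One <field><encoded offset> set as described in the quoted comment. */
struct vmcsinfo_record {
    uint32_t field;           /* 4-byte field encoding */
    uint64_t encoded_offset;  /* 8-byte raw value, not yet decoded */
};

/*
 * Parse a VMCSINFO blob: revision identifier at byte 0, then 12-byte sets.
 * Returns the number of sets stored in out, or -1 if the blob is too short
 * to hold the revision identifier. A trailing partial set is ignored.
 */
static long vmcsinfo_parse(const uint8_t *buf, size_t len,
                           uint32_t *revision_id,
                           struct vmcsinfo_record *out, size_t max_out)
{
    size_t pos = 4, n = 0;

    if (len < 4)
        return -1;
    memcpy(revision_id, buf, 4);

    while (len - pos >= 12 && n < max_out) {
        memcpy(&out[n].field, buf + pos, 4);
        memcpy(&out[n].encoded_offset, buf + pos + 4, 8);
        pos += 12;
        n++;
    }
    return (long)n;
}
```

Without an nfields count, the reader has to infer the number of sets from the blob length, which is one reason the count suggested above would help.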
--
error compiling committee.c: too many arguments to function

2012-06-15 03:04:11

by Hatayama, Daisuke

Subject: Re: [PATCH v2 3/5] KVM-INTEL: Add new module vmcsinfo-intel to fill VMCSINFO

From: Avi Kivity <[email protected]>
Subject: Re: [PATCH v2 3/5] KVM-INTEL: Add new module vmcsinfo-intel to fill VMCSINFO
Date: Thu, 14 Jun 2012 16:37:58 +0300

> On 05/16/2012 10:55 AM, zhangyanfei wrote:
<cut>
>> +
>> +/*
>> + * The format of VMCSINFO is given below:
>> + * +-------------+--------------------------+
>> + * | Byte offset | Contents |
>> + * +-------------+--------------------------+
>> + * | 0 | VMCS revision identifier |
>> + * +-------------+--------------------------+
>> + * | 4 | <field><encoded offset> |
>> + * +-------------+--------------------------+
>> + * | 16 | <field><encoded offset> |
>> + * +-------------+--------------------------+
>> + * ......
>> + *
>> + * The first 32 bits of VMCSINFO contains the VMCS revision
>> + * identifier.
>> + * The remainder of VMCSINFO is used for <field><encoded offset>
>> + * sets. Each set takes 12 bytes: field occupys 4 bytes
>> + * and its corresponding encoded offset occupys 8 bytes.
>> + *
>> + * Encoded offsets are raw values read by vmcs_read{16, 64, 32, l},
>> + * and they are all unsigned extended to 8 bytes for each
>> + * <field><encoded offset> set has the same size.
>> + * We do not decode offsets here. The decoding work is delayed
>> + * in userspace tools.
>
> It's better to do the decoding here, or no one will know how to do it.
> Also have an nfields field.

We did so because the actual internal format is unknown; it could possibly
be encrypted, although that is unlikely. We thought processing such unknown
data should be done in userland debugging tools. But decoding
it here is no problem: the decoding consists of simple shift operations and
carries no risk of affecting system state. The only possible failure is
displaying broken values, which is no worse than displaying
encrypted values to users.

FYI, the assumptions behind the reverse engineering method are:

- For each field, the corresponding offset is unique; IOW, the offsets of
  any two different fields differ from each other.
- Field sizes remain 16 bits, 32 bits, or 64 bits even in the vmcs region.
- Each field is 8-byte aligned in the vmcs region.
- Some fields may be big endian in the vmcs region.
  - We found this fact during development. We give up if the data is
    modified more drastically.
- offset < vmcs region size
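
To make the idea concrete, here is a small self-contained model of pattern-based offset discovery. It is only a sketch and not the encoding used by the patch set, which this thread does not spell out: the region is an ordinary buffer, `simulated_read` stands in for a real VMREAD, fields are assumed to be stored little endian at even offsets, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define REGION_SIZE 4096u   /* assumed VMCS region size: one page */

/*
 * Fill the region so that the 16-bit little-endian word starting at byte
 * offset 2*i holds the value i.
 */
static void fill_pattern(uint8_t *region)
{
    for (uint32_t i = 0; i < REGION_SIZE / 2; i++) {
        region[2 * i]     = (uint8_t)(i & 0xff);
        region[2 * i + 1] = (uint8_t)(i >> 8);
    }
}

/* Stand-in for a VMREAD: return width bytes at offset, little endian. */
static uint64_t simulated_read(const uint8_t *region, size_t offset,
                               size_t width)
{
    uint64_t v = 0;

    for (size_t b = 0; b < width; b++)
        v |= (uint64_t)region[offset + b] << (8 * b);
    return v;
}

/* Recover the byte offset from the low 16 bits of the value read back. */
static size_t decode_offset(uint64_t value)
{
    return (size_t)(value & 0xffff) << 1;
}
```

In this model the decode step is a mask and a shift, and two fields can never decode to the same offset unless they really overlap. A field stored big endian would need its bytes swapped before decoding, which this sketch does not attempt.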

Thanks.
HATAYAMA, Daisuke

2012-06-18 07:25:57

by YOSHIDA Masanori

Subject: Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

Hi, Avi, Yanfei

I'm YOSHIDA Masanori from Hitachi, a developer of livedump.

>> And here is a new case from the LinuxCon Japan:
>>
>> Developers from Hitachi are now developing a new livedump mechanism for the
>> same reason as ours. They have come to the situation *many times* that guest
>> machines crashed due to host's failures, in particular, under development.
>
> This has happened to me as well, possibly even more times :). I don't
> use crash dumps for debugging but different people may use different
> techniques.

As Yanfei mentioned, I'm developing livedump for cases where
guests crash due to host failures.

Especially in very important systems, there is a strong demand to
identify the root cause of any failure, even a very rare one.
For this purpose, crash dumps must be obtained.
Therefore, we think the livedump technique must be applied to
virtualization systems in that kind of field.

>>> After the buggy situation is reproduced, we panic the host *manually*.
>>> Then we could use userland tools to get guest machine's crash dump from host machine's
>>> with the feature provided by this patch set. Finally we could analyse them separately
>>> to find which side causes the problem.
>>>
>>
>> Could you please tell me your attitude towards this patch?
>
> I still dislike it conceptually. But let me do a technical review of
> the latest version.

Actually, the current implementation of livedump is just the core part,
which dumps only the image of kernel space. I'd like to extend
it to obtain the guest image at the same time as well.
For this situation too, it seems necessary to export VMCSINFO.

Thanks,
YOSHIDA Masanori

Yokohama Research Laboratory,
Hitachi, Ltd.