LinuxLists.cc - [RFC, PATCH 5/24] i386 Vmi code patching

2006-03-13 18:04:39

Subject: [RFC, PATCH 5/24] i386 Vmi code patching

The VMI ROM detection and code patching mechanism is illustrated in
setup.c. The ROM is a binary block published by the hypervisor, and
there are certainly implications of this. ROMs have a history of
being proprietary, very differently licensed pieces of software,
mostly under non-free licenses. Before jumping to the
conclusion that this is a bad thing, let us consider more carefully
why hiding the interface layer to the hypervisor is actually a good
thing.

The fact that this code is in a ROM is a design choice we made based on
the ease of the delivery vehicle for our implementation, but there is
nothing fundamentally different between this and a hyper-support (or
vsyscall) page based approach. They are both merely mechanisms to
inject hypervisor code into the guest that allows transparent
virtualization of the native architecture, with design costs chosen by
the hypervisor. In many cases, the calls into this support layer need
not have a one to one mapping to the vendor defined hypercall interface;
in fact, it is beneficial if they do not. Many of the calls can be
optimized into local function calls which do not require an expensive
privilege transition into the hypervisor. Many of the calls are
obsolete when running under a VT/Pacifica based hypervisor. This is by
design, and in cases when more virtualization capabilities are available
in hardware, they can be taken advantage of instantly by simply doing
nothing - there is no need to patch the guest code for classes of
instructions which are efficiently and properly supported. But the fact
remains that for in-use hardware today, these encapsulations are
necessary, and in fact some will continue to allow for more efficient
optimizations even with hardware virtualization.

Currently, the system is designed to boot under a fully virtualized
(or native) environment, then switch into paravirtual mode. It does
this by detecting the presence of a ROM code page, invoking the
hypervisor initialization function, followed by switching the code
annotations to the versions which call the VMI ROM code. There are
some remnant stub implementations here from when the original approach
was to make native functional equivalents for each of the ROM calls.
This approach was unwieldy, and had a significant performance impact
on critical paths in the system, so the stubs are being deprecated in
favor of native inline code that allows for more native optimization
potential.
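
To make the detection step concrete, here is a minimal sketch of the
signature scan, run over an ordinary buffer instead of the real option
ROM window. The header layout (struct rom_header), the value of
SKETCH_VMI_SIGNATURE and the function name are assumptions made for
illustration only; the real VROMHeader comes from vmi.h, which is not
part of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROM_SIGNATURE        0xaa55u      /* standard option ROM signature */
#define SKETCH_VMI_SIGNATURE 0x12345678u  /* hypothetical; real value is in vmi.h */
#define SCAN_STEP            2048u        /* option ROMs start on 2 KB boundaries */

/* Hypothetical header layout for illustration; not the real VROMHeader. */
struct rom_header {
    uint16_t rom_signature;
    uint8_t  rom_length;       /* in 512-byte units */
    uint8_t  pad;
    uint32_t vrom_signature;
    uint8_t  api_major;
    uint8_t  api_minor;
};

/*
 * Scan `window` in 2 KB steps and return the offset of the first ROM
 * whose signatures match and whose API version is acceptable, or -1
 * if there is none.
 */
long sketch_find_vmi_rom(const unsigned char *window, size_t len,
                         uint8_t want_major, uint8_t min_minor)
{
    size_t base;

    for (base = 0; base + sizeof(struct rom_header) <= len; base += SCAN_STEP) {
        struct rom_header h;

        memcpy(&h, window + base, sizeof(h));
        if (h.rom_signature != ROM_SIGNATURE)
            continue;
        if (h.vrom_signature != SKETCH_VMI_SIGNATURE)
            continue;
        if (h.api_major != want_major || h.api_minor < min_minor)
            continue;
        return (long)base;
    }
    return -1;
}
```

A caller that gets -1 back would simply stay in native mode, which is
the fallback described above.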

Currently, the kernel does all the fancy code patching, but this
step will eventually be done inside of the initialization routine
of the hypervisor. Not only does this free the guest from this
tedious responsibility (note the tricky code which disassembles the
VMI call regions), but it allows the hypervisor to make further choices
about inlining the code, eliminating the call/return overhead altogether.
This overhead is non-trivial in the native case, which is why the
native code must be preferentially inlined for a kernel which gets
measurably near-identical native performance in microbenchmarks.
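
The core of the patching step is retargeting a near call. A sketch of
the arithmetic, written against plain byte buffers so it can be tried
in user space: the rel32 operand is relative to the end of the 5-byte
call instruction, which is where the "dest - eip - 5" expression in
patch_call_site() comes from. The function names here are hypothetical
and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_CALL_NEAR 0xe8
#define CALL_NEAR_LEN    5

/*
 * Rewrite the rel32 operand of the near call stored at `insn` so that,
 * when executed at address `insn_addr`, it transfers to `dest_addr`.
 * The displacement is stored little-endian, as x86 expects.
 */
void sketch_retarget_call(unsigned char *insn, uint32_t insn_addr,
                          uint32_t dest_addr)
{
    uint32_t rel = dest_addr - (insn_addr + CALL_NEAR_LEN);

    assert(insn[0] == OPCODE_CALL_NEAR);
    insn[1] = (unsigned char)(rel & 0xff);
    insn[2] = (unsigned char)((rel >> 8) & 0xff);
    insn[3] = (unsigned char)((rel >> 16) & 0xff);
    insn[4] = (unsigned char)((rel >> 24) & 0xff);
}

/* Decode where the near call at `insn`, located at `insn_addr`, lands. */
uint32_t sketch_call_target(const unsigned char *insn, uint32_t insn_addr)
{
    uint32_t rel = (uint32_t)insn[1] |
                   ((uint32_t)insn[2] << 8) |
                   ((uint32_t)insn[3] << 16) |
                   ((uint32_t)insn[4] << 24);

    return insn_addr + CALL_NEAR_LEN + rel;
}
```

Unsigned 32-bit wraparound makes backward calls work without special
handling, matching the behaviour of the CPU on i386.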

The fact that the ROM hides this interface code is deliberate, and is
intended to allow the hypervisor vendor flexibility to change the
underlying interface without the kernel needing to change. If the
raw hypercall interface were presented to the kernel, it could look
quite ugly and inflexible. It does not allow the system to evolve
over time. On the other hand, if the layer is hidden, it gives the
benefit of allowing alternative implementations to develop. Different
versions of the interface can be supported, perhaps tailored with
optimizations for different kernels, or adding statistics gathering
and debugging code into the layer. The value of the interface
lies not in its being visible, but in its being hidden.

The question of licensing of such ROM code is a completely separate
issue. We are not trying to hide some proprietary code by putting it
inside of a ROM to keep it hidden. In fact, you can disassemble the
ROM code and see it quite readily - and you know all of the entry points.
Whether we can distribute our ROM code under a GPL compatible license
is not something I know at this time. Just as you can't compile a
binary using Linux kernel headers and claim that your binary is not
subject to the GPL, our ROM code includes headers from other parts of
our system that are specifically not under the GPL. How this affects
the final license under which the ROM is distributed is not something
I think we know at this time. But there is no reason we can't rewrite
those headers and make the ROM a separate and freely distributable
entity. Even so, the usefulness of it is extremely limited; the
interface to the hypervisor is proprietary, and subject to change,
so the code merely serves as a sample implementation of one possible
way to interface the layer to a hypervisor.

Signed-off-by: Zachary Amsden <[email protected]>

Index: linux-2.6.16-rc5/arch/i386/Makefile
===================================================================
--- linux-2.6.16-rc5.orig/arch/i386/Makefile 2006-03-08 16:53:19.000000000 -0800
+++ linux-2.6.16-rc5/arch/i386/Makefile 2006-03-08 16:53:32.000000000 -0800
@@ -78,6 +78,10 @@ mflags-$(CONFIG_X86_ES7000) := -Iinclude
mcore-$(CONFIG_X86_ES7000) := mach-default
core-$(CONFIG_X86_ES7000) := arch/i386/mach-es7000/

+# VMI subarch support
+mflags-$(CONFIG_X86_VMI) := -Iinclude/asm-i386/mach-vmi
+mcore-$(CONFIG_X86_VMI) := mach-vmi
+
# default subarch .h files
mflags-y += -Iinclude/asm-i386/mach-default

@@ -95,6 +99,7 @@ drivers-$(CONFIG_OPROFILE) += arch/i386
drivers-$(CONFIG_PM) += arch/i386/power/

CFLAGS += $(mflags-y)
+CPPFLAGS += $(mflags-y)
AFLAGS += $(mflags-y)

boot := arch/i386/boot
Index: linux-2.6.16-rc5/arch/i386/mach-vmi/setup.c
===================================================================
--- linux-2.6.16-rc5.orig/arch/i386/mach-vmi/setup.c 2006-03-08 16:53:32.000000000 -0800
+++ linux-2.6.16-rc5/arch/i386/mach-vmi/setup.c 2006-03-08 16:55:59.000000000 -0800
@@ -0,0 +1,309 @@
+/*
+ * Machine specific setup for generic
+ *
+ * Copyright (C) 2005, VMware, Inc.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to [email protected]
+ *
+ */
+
+#include <linux/config.h>
+#include <linux/smp.h>
+#include <linux/init.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
+#include <linux/bootmem.h>
+#include <linux/mm.h>
+#include <asm/acpi.h>
+#include <asm/arch_hooks.h>
+#include <asm/processor.h>
+#include <asm/desc.h>
+#include <asm/io.h>
+#include <asm/highmem.h>
+#include <asm/pgtable.h>
+#include <vmi.h>
+
+extern char __VMI_END;
+extern char __VMI_START;
+extern char __VMI_SHARED;
+VROMHeader *vmi_rom = NULL;
+
+VMI_UINT8 hypervisor_found;
+
+/* Convenient macro for calling VMI functions indirectly in the ROM */
+typedef VMI_UINT32 __attribute__((regparm(1))) (VROMFUNC)(void);
+
+#define VROMFunc(table,func) \
+ (((VROMFUNC *)&(((VROMCallTable *)(table))->vromCall[(func)].f)) \
+ ())
+
+#define MNEM_PUSH_I 0x68
+#define MNEM_PUSH_IB 0x6a
+#define MNEM_PUSH_EAX 0x50
+#define MNEM_PUSH_ECX 0x51
+#define MNEM_PUSH_EDX 0x52
+#define MNEM_PUSH_EBX 0x53
+#define MNEM_PUSH_ESP 0x54
+#define MNEM_PUSH_EBP 0x55
+#define MNEM_PUSH_ESI 0x56
+#define MNEM_PUSH_EDI 0x57
+#define MNEM_OPSIZE 0x66
+#define MNEM_LEA 0x8d
+#define MNEM_NOP 0x90
+#define MNEM_CALL_NEAR 0xe8
+
+static inline void patch_call_site(struct vmi_annotation *a, unsigned char *eip)
+{
+ unsigned long call = a->vmi_call;
+ unsigned char *dest = (unsigned char *)(&((VROMCallTable *)vmi_rom)->vromCall[call]);
+ *(unsigned long *)(eip+1) = dest-eip-5;
+}
+
+static void fixup_translation(struct vmi_annotation *a)
+{
+ unsigned char *c, *start, *end;
+ int left;
+
+ memcpy(a->nativeEIP, a->translationEIP, a->translation_size);
+ start = a->nativeEIP;
+ end = a->nativeEIP + a->translation_size;
+
+ for (c = start; c < end;) {
+ switch(*c) {
+ case MNEM_CALL_NEAR:
+ patch_call_site(a, c);
+ c+=5;
+ break;
+
+ case MNEM_PUSH_I:
+ c+=5;
+ break;
+
+ case MNEM_PUSH_IB:
+ c+=2;
+ break;
+
+ case MNEM_PUSH_EAX:
+ case MNEM_PUSH_ECX:
+ case MNEM_PUSH_EDX:
+ case MNEM_PUSH_EBX:
+ case MNEM_PUSH_EBP:
+ case MNEM_PUSH_ESI:
+ case MNEM_PUSH_EDI:
+ c+=1;
+ break;
+
+ case MNEM_LEA:
+ BUG_ON(*(c+1) != 0x64); /* [--][--]+disp8, %esp */
+ BUG_ON(*(c+2) != 0x24); /* none + %esp */
+ c+=4;
+ break;
+
+ default:
+ /*
+ * Don't printk - it may acquire spinlocks with
+ * partially completed VMI translations, causing
+ * nuclear meltdown of the core.
+ */
+ BUG();
+ return;
+ }
+ }
+
+ /* If the native size exceeded the translation size, pad the rest with nops */
+ for (left = a->native_size - a->translation_size; left > 0; left -= 12) {
+ int cur = left - 1;
+ int i;
+ cur = cur > 11 ? 11 : cur;
+ for (i = 0; i < cur; i++)
+ *c++ = MNEM_OPSIZE;
+ *c++ = MNEM_NOP;
+ }
+
+ for (c = start; c < end; c+= 8)
+ asm volatile ("clflush %0" ::"m" (c));
+}
+
+static void scan_annotations(void *start, void *end)
+{
+ struct vmi_annotation *a;
+ unsigned long nop_size = 0, translation_size = 0, extra_native_bytes = 0;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ for (a = start; (void *)a < end; a++) {
+ BUG_ON(a->vmi_call >= NUM_VMI_CALLS);
+ translation_size += a->translation_size;
+ if (a->nop_size > 0)
+ nop_size += a->nop_size;
+ else
+ extra_native_bytes -= a->nop_size;
+ fixup_translation(a);
+ }
+ local_irq_restore(flags);
+ printk(KERN_WARNING "VMI %d annotations=%ld, translations=%ld, nops=%ld, extra native = %ld bytes\n",
+ a - (struct vmi_annotation *)start + 1, (unsigned long)(end-start),
+ translation_size, nop_size, extra_native_bytes);
+}
+
+static void scan_builtin_annotations(void)
+{
+ scan_annotations(__vmi_annotation, __vmi_annotation_end);
+}
+
+/**
+ * pre_intr_init_hook - initialisation prior to setting up interrupt vectors
+ *
+ * Description:
+ * Perform any necessary interrupt initialisation prior to setting up
+ * the "ordinary" interrupt call gates. For legacy reasons, the ISA
+ * interrupts should be initialised here if the machine emulates a PC
+ * in any way.
+ **/
+void __init pre_intr_init_hook(void)
+{
+ init_ISA_irqs();
+}
+
+/*
+ * IRQ2 is cascade interrupt to second interrupt controller
+ */
+static struct irqaction irq2 = { no_action, 0, CPU_MASK_NONE, "cascade", NULL, NULL};
+
+/**
+ * intr_init_hook - post gate setup interrupt initialisation
+ *
+ * Description:
+ * Fill in any interrupts that may have been left out by the general
+ * init_IRQ() routine. interrupts having to do with the machine rather
+ * than the devices on the I/O bus (like APIC interrupts in intel MP
+ * systems) are started here.
+ **/
+void __init intr_init_hook(void)
+{
+#ifdef CONFIG_X86_LOCAL_APIC
+ apic_intr_init();
+#endif
+
+ setup_irq(2, &irq2);
+}
+
+
+/*
+ * Probe for the VMI option ROM
+ */
+void __init probe_vmi_rom(void)
+{
+ unsigned long base;
+
+ hypervisor_found = 0;
+
+ /* VMI ROM is in option ROM area, check signature */
+ for (base = 0xC0000; base < 0xE0000; base += 2048) {
+ VROMHeader *romstart;
+ romstart = (VROMHeader *)isa_bus_to_virt(base);
+ if (romstart->romSignature != 0xaa55)
+ continue;
+ if (romstart->vRomSignature == VMI_SIGNATURE && !vmi_rom) {
+ printk(KERN_WARNING "Detected VMI ROM version %d.%d\n",
+ romstart->APIVersionMajor,
+ romstart->APIVersionMinor);
+ vmi_rom = romstart;
+ if (romstart->APIVersionMajor != VMI_API_REV_MAJOR ||
+ romstart->APIVersionMinor+1 < MIN_VMI_API_REV_MINOR+1)
+ continue;
+ if (romstart->romLength * 512 >
+ &__VMI_END - &__VMI_START)
+ panic("VMI OPROM size exceeds mappable space\n");
+ hypervisor_found = 1;
+ break;
+ }
+ }
+}
+
+
+/*
+ * Activate the VMI interfaces
+ */
+void __init vmi_init(void)
+{
+ int romsize;
+
+ /*
+ * Setup optional callback functions if we found the VMI ROM
+ */
+ if (hypervisor_found) {
+ romsize = vmi_rom->romLength * 512;
+ if (VROMFunc(vmi_rom, VMI_CALL_Init)) {
+ printk(KERN_WARNING "VMI ROM failed to initialize\n");
+ hypervisor_found = 0;
+ } else {
+ memcpy(&__VMI_START, (char *)vmi_rom, romsize);
+ scan_builtin_annotations();
+ }
+ }
+ if (!vmi_rom)
+ printk(KERN_WARNING "VMI ROM not found"
+ " - falling back to native mode\n");
+ else if (!hypervisor_found)
+ printk(KERN_WARNING "VMI ROM version mismatch "
+ "(kernel requires version >= %d.%d) "
+ " - falling back to native mode\n",
+ VMI_API_REV_MAJOR, MIN_VMI_API_REV_MINOR);
+}
+
+
+/**
+ * pre_setup_arch_hook - hook called prior to any setup_arch() execution
+ *
+ * Description:
+ * generally used to activate any machine specific identification
+ * routines that may be needed before setup_arch() runs.
+ * We probe for various component option ROMs here.
+ **/
+void __init pre_setup_arch_hook(void)
+{
+ probe_vmi_rom();
+ vmi_init();
+}
+
+/**
+ * trap_init_hook - initialise system specific traps
+ *
+ * Description:
+ * Called as the final act of trap_init().
+ **/
+void __init trap_init_hook(void)
+{
+}
+
+static struct irqaction irq0 = { timer_interrupt, SA_INTERRUPT, CPU_MASK_NONE, "timer", NULL, NULL};
+
+/**
+ * time_init_hook - do any specific initialisations for the system timer.
+ *
+ * Description:
+ * Must plug the system timer interrupt source at HZ into the IRQ listed
+ * in irq_vectors.h:TIMER_IRQ
+ **/
+void __init time_init_hook(void)
+{
+ setup_irq(0, &irq0);
+}
Index: linux-2.6.16-rc5/arch/i386/mach-vmi/stubs.c
===================================================================
--- linux-2.6.16-rc5.orig/arch/i386/mach-vmi/stubs.c 2006-03-08 16:53:32.000000000 -0800
+++ linux-2.6.16-rc5/arch/i386/mach-vmi/stubs.c 2006-03-08 16:53:32.000000000 -0800
@@ -0,0 +1,102 @@
+/*
+ * Stub implementation of hardware features for transparent
+ * virtualization.
+ *
+ * Copyright (C) 2005, VMware, Inc.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to [email protected]
+ *
+ */
+
+#include <linux/config.h>
+#include <asm/processor.h>
+#include <asm/desc.h>
+#include <asm/bitops.h>
+#include <vmi.h>
+#include <asm/system.h>
+#include <asm/apic.h>
+
+/* Init gets called on APs during SMP boot, so a stub is needed. */
+VMICALL void VMI_Init(VMI_UINT32 data)
+{
+
+}
+
+VMICALL VMI_UINT32 VMI_GetPxE(VMI_UINT32 *pte)
+{
+ return (*pte);
+}
+
+VMICALL void VMI_SetPxE(VMI_UINT32 *pte, VMI_UINT32 pteval)
+{
+ *pte = pteval;
+}
+
+VMICALL VMI_UINT32 VMI_SwapPxE(VMI_UINT32 *pte, VMI_UINT32 pteval)
+{
+ VMI_UINT32 val;
+
+ val = xchg(pte, pteval);
+ return val;
+}
+
+VMICALL int VMI_TestAndClearPxEBit(VMI_UINT32 *pte, int bit)
+{
+ VMI_UINT32 val;
+
+ val = test_and_clear_bit(bit, (volatile unsigned long *)pte);
+ return val;
+}
+
+VMICALL int VMI_TestAndSetPxEBit(VMI_UINT32 *pte, int bit)
+{
+ VMI_UINT32 val;
+
+ val = test_and_set_bit(bit, (volatile unsigned long *)pte);
+ return val;
+}
+
+VMICALL int VMI_TestAndClearPxELongBit(VMI_UINT64 *pte, int bit)
+{
+ return VMI_TestAndClearPxEBit((VMI_UINT32 *)pte, bit);
+}
+
+VMICALL int VMI_TestAndSetPxELongBit(VMI_UINT64 *pte, int bit)
+{
+ return VMI_TestAndSetPxEBit((VMI_UINT32 *)pte, bit);
+}
+
+VMICALL void VMI_AllocatePage(VMI_UINT32 ppn, int flags, VMI_UINT32 orig, int base,
+ int count)
+{
+}
+
+VMICALL void VMI_ReleasePage(VMI_UINT32 ppn, int flags)
+{
+}
+
+VMICALL int VMI_FlushDeferredCalls(VMI_UINT32 mode)
+{
+ return 0;
+}
+
+VMICALL void VMI_SetInitialAPState(VMI_UINT32 apState, VMI_UINT32 apicId)
+{
+}
Index: linux-2.6.16-rc5/arch/i386/mach-vmi/stubs-asm.S
===================================================================
--- linux-2.6.16-rc5.orig/arch/i386/mach-vmi/stubs-asm.S 2006-03-08 16:53:32.000000000 -0800
+++ linux-2.6.16-rc5/arch/i386/mach-vmi/stubs-asm.S 2006-03-08 16:53:32.000000000 -0800
@@ -0,0 +1,34 @@
+#include <linux/linkage.h>
+
+.section .text.VMI_SetPxELong,"ax"
+ENTRY(VMI_SetPxELong)
+ mov %edx, 4(%ecx)
+ /* sfence */
+ mov %eax, (%ecx)
+ ret
+
+.section .text.VMI_DeactivatePxELongAtomic,"ax"
+ENTRY(VMI_DeactivatePxELongAtomic)
+ push %ecx
+ mov %eax, %ecx
+ xor %eax, %eax
+ xchg %eax, (%ecx)
+ mov 4(%ecx), %edx
+ movl $0, 4(%ecx)
+ pop %ecx
+ ret
+
+.section .text.VMI_SwapPxELongAtomic,"ax"
+ENTRY(VMI_SwapPxELongAtomic)
+ push %edi
+ mov %ecx, %edi
+ push %ebx
+ mov %eax, %ebx
+ mov %edx, %ecx
+ mov (%edi), %eax
+ mov 4(%edi), %edx
+1: lock; cmpxchg8b (%edi)
+ jne 1b
+ pop %ebx
+ pop %edi
+ ret
Index: linux-2.6.16-rc5/arch/i386/mach-vmi/Makefile
===================================================================
--- linux-2.6.16-rc5.orig/arch/i386/mach-vmi/Makefile 2006-03-08 16:53:32.000000000 -0800
+++ linux-2.6.16-rc5/arch/i386/mach-vmi/Makefile 2006-03-08 16:55:46.000000000 -0800
@@ -0,0 +1,9 @@
+#
+# Makefile for the linux kernel.
+#
+
+EXTRA_CFLAGS += -I../kernel
+
+export CFLAGS_stubs.o += -ffunction-sections
+
+obj-y := setup.o stubs.o stubs-asm.o
Index: linux-2.6.16-rc5/include/asm-i386/bugs.h
===================================================================
--- linux-2.6.16-rc5.orig/include/asm-i386/bugs.h 2006-03-08 16:53:19.000000000 -0800
+++ linux-2.6.16-rc5/include/asm-i386/bugs.h 2006-03-08 16:53:32.000000000 -0800
@@ -175,6 +175,11 @@ static void __init check_config(void)
&& (boot_cpu_data.x86_mask < 6 || boot_cpu_data.x86_mask == 11))
panic("Kernel compiled for PMMX+, assumes a local APIC without the read-before-write bug!");
#endif
+
+#ifdef CONFIG_VMI_REQUIRE_HYPERVISOR
+ if (!vmi_hypervisor_found())
+ panic("No hypervisor found, aborting!\n");
+#endif
}

extern void alternative_instructions(void);
Index: linux-2.6.16-rc5/include/asm-i386/mach-vmi/setup_arch_pre.h
===================================================================
--- linux-2.6.16-rc5.orig/include/asm-i386/mach-vmi/setup_arch_pre.h 2006-03-08 16:53:32.000000000 -0800
+++ linux-2.6.16-rc5/include/asm-i386/mach-vmi/setup_arch_pre.h 2006-03-08 16:53:32.000000000 -0800
@@ -0,0 +1,6 @@
+/* Hook to call BIOS initialisation function */
+
+/* no action for generic */
+
+#define ARCH_SETUP
+extern void vmi_init(void);

2006-03-15 09:57:40

by Chris Wright

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

* Zachary Amsden ([email protected]) wrote:
> +static void fixup_translation(struct vmi_annotation *a)
> +{
> + unsigned char *c, *start, *end;
> + int left;
> +
> + memcpy(a->nativeEIP, a->translationEIP, a->translation_size);
> + start = a->nativeEIP;
> + end = a->nativeEIP + a->translation_size;
> +
> + for (c = start; c < end;) {
> + switch(*c) {
> + case MNEM_CALL_NEAR:
> + patch_call_site(a, c);
> + c+=5;
> + break;
> +
> + case MNEM_PUSH_I:
> + c+=5;
> + break;
> +
> + case MNEM_PUSH_IB:
> + c+=2;
> + break;
> +
> + case MNEM_PUSH_EAX:
> + case MNEM_PUSH_ECX:
> + case MNEM_PUSH_EDX:
> + case MNEM_PUSH_EBX:
> + case MNEM_PUSH_EBP:
> + case MNEM_PUSH_ESI:
> + case MNEM_PUSH_EDI:
> + c+=1;
> + break;
> +
> + case MNEM_LEA:
> + BUG_ON(*(c+1) != 0x64); /* [--][--]+disp8, %esp */
> + BUG_ON(*(c+2) != 0x24); /* none + %esp */
> + c+=4;
> + break;
> +
> + default:
> + /*
> + * Don't printk - it may acquire spinlocks with
> + * partially completed VMI translations, causing
> + * nuclear meltdown of the core.
> + */
> + BUG();
> + return;
> + }

Why these restrictions? How do you do int $0x82, for example?

thanks,
-chris

2006-03-15 16:02:06

by Zachary Amsden

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Chris Wright wrote:
> * Zachary Amsden ([email protected]) wrote:
>
>> +static void fixup_translation(struct vmi_annotation *a)
>> +{
>> + unsigned char *c, *start, *end;
>> + int left;
>> +
>> + memcpy(a->nativeEIP, a->translationEIP, a->translation_size);
>> + start = a->nativeEIP;
>> + end = a->nativeEIP + a->translation_size;
>> +
>> + for (c = start; c < end;) {
>> + switch(*c) {
>> + case MNEM_CALL_NEAR:
>>
>>
> Why these restrictions? How do you do int $0x82, for example?
>

We don't. This is the minimal set of instructions that are emitted
during the annotation. The instruction sequence is only sufficient to
call out to the hypervisor ROM.

What we want to do next is to allow the hypervisor itself to do this
code fixup - in effect, relinking in the local translations, which in
many cases would then convert to int $0x82 for inline Xen calls.
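
As an illustration of how small that set is, the walk in
fixup_translation() can be restated as a standalone length decoder.
This is a sketch with hypothetical function names, mirroring the
opcodes the patch accepts (note that push %esp, 0x54, is defined in
the patch but not accepted by its switch). Anything else, including
int $0x82 (bytes 0xcd 0x82), is rejected.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the length of the instruction at `p`, or 0 if it is not one
 * of the instructions emitted by the annotations (or is truncated).
 */
size_t sketch_insn_length(const unsigned char *p, size_t avail)
{
    if (avail == 0)
        return 0;

    switch (p[0]) {
    case 0xe8:                      /* call rel32 */
    case 0x68:                      /* push imm32 */
        return avail >= 5 ? 5 : 0;
    case 0x6a:                      /* push imm8 */
        return avail >= 2 ? 2 : 0;
    case 0x50: case 0x51: case 0x52: case 0x53:
    case 0x55: case 0x56: case 0x57:
        return 1;                   /* push %reg, excluding %esp */
    case 0x8d:                      /* only lea disp8(%esp), %esp */
        if (avail >= 4 && p[1] == 0x64 && p[2] == 0x24)
            return 4;
        return 0;
    default:
        return 0;
    }
}

/* Return 1 if the whole region decodes with the restricted set, else 0. */
int sketch_region_is_valid(const unsigned char *code, size_t len)
{
    size_t off = 0;

    while (off < len) {
        size_t n = sketch_insn_length(code + off, len - off);

        if (n == 0)
            return 0;
        off += n;
    }
    return 1;
}
```

Unlike the kernel code, this version returns an error instead of
calling BUG(), which is only reasonable outside the patching path.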

Zach

2006-03-15 22:05:59

by Anthony Liguori

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Zachary Amsden wrote:
> +void __init vmi_init(void)
> +{
> + int romsize;
> +
> + /*
> + * Setup optional callback functions if we found the VMI ROM
> + */
> + if (hypervisor_found) {
> + romsize = vmi_rom->romLength * 512;
> + if (VROMFunc(vmi_rom, VMI_CALL_Init)) {
> + printk(KERN_WARNING "VMI ROM failed to initialize\n");
> + hypervisor_found = 0;
> + } else {
> + memcpy(&__VMI_START, (char *)vmi_rom, romsize);
> + scan_builtin_annotations();
> + }
> + }
> + if (!vmi_rom)
> + printk(KERN_WARNING "VMI ROM not found"
> + " - falling back to native mode\n");
> + else if (!hypervisor_found)
> + printk(KERN_WARNING "VMI ROM version mismatch "
> + "(kernel requires version >= %d.%d) "
> + " - falling back to native mode\n",
> + VMI_API_REV_MAJOR, MIN_VMI_API_REV_MINOR);
> +}
>
Minor nitpick.

The error logic here is somewhat confusing. If a VMI_CALL_Init results
in a failure, you end up with:

VMI ROM failed to initialize
VMI ROM version mismatch (kernel requires version >= 13.0) - falling
back to native mode

The latter error is misleading as the version may actually match. The
nesting here probably could be simplified to.
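
For illustration only (this is not necessarily the restructuring that
was proposed), one way to keep the two failures apart is to classify
the outcome once and derive a single message from it. All names below
are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical outcome codes; each one maps to exactly one message. */
enum sketch_vmi_result {
    SKETCH_ROM_ABSENT,
    SKETCH_ROM_BAD_VERSION,
    SKETCH_ROM_INIT_FAILED,
    SKETCH_ROM_ACTIVE
};

/*
 * rom_present: a VMI ROM signature was found.
 * version_ok:  its API version is acceptable to this kernel.
 * init_status: return value of the ROM init call (0 means success);
 *              only meaningful when the first two are true.
 */
enum sketch_vmi_result sketch_classify(int rom_present, int version_ok,
                                       int init_status)
{
    if (!rom_present)
        return SKETCH_ROM_ABSENT;
    if (!version_ok)
        return SKETCH_ROM_BAD_VERSION;
    if (init_status != 0)
        return SKETCH_ROM_INIT_FAILED;
    return SKETCH_ROM_ACTIVE;
}

/* One message per outcome, so an init failure never reports a version mismatch. */
const char *sketch_message(enum sketch_vmi_result r)
{
    switch (r) {
    case SKETCH_ROM_ABSENT:
        return "VMI ROM not found - falling back to native mode";
    case SKETCH_ROM_BAD_VERSION:
        return "VMI ROM version mismatch - falling back to native mode";
    case SKETCH_ROM_INIT_FAILED:
        return "VMI ROM failed to initialize - falling back to native mode";
    default:
        return "VMI ROM active";
    }
}
```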

Regards,

Anthony Liguori

2006-03-15 23:00:36

by Pavel Machek

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Hi!

> Currently, the system is designed to boot under a fully virtualized
> (or native) environment, then switch into paravirtual mode. It does
> this by detecting the presence of a ROM code page, invoking the

This is not going to work with xen, is it?

> The question of licensing of such ROM code is a completely separate
> issue. We are not trying to hide some proprietary code by putting it
> inside of a ROM to keep it hidden. In fact, you can disassemble the
> ROM code and see it quite readily - and you know all of the entry
> points.

Could you disassemble one entry point for us and describe how it works?

> Whether we can distribute our ROM code under a GPL compatible license
> is not something I know at this time. Just as you can't compile a
> binary using Linux kernel headers and claim that your binary is not
> subject to the GPL, our ROM code includes headers from other parts of
> our system that are specifically not under the GPL. How this

VMware is probably sole copyright holder (no?), you are perfectly free
to relicence resulting ROM code under whatever licence you want.


2006-03-16 19:11:34

by Jan Engelhardt

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

>
>The question of licensing of such ROM code is a completely separate
>issue. We are not trying to hide some proprietary code by putting it
>inside of a ROM to keep it hidden. In fact, you can disassemble the
>ROM code and see it quite readily - and you know all of the entry points.
>Whether we can distribute our ROM code under a GPL compatible license
>is not something I know at this time. Just as you can't compile a
>binary using Linux kernel headers and claim that your binary is not
>subject to the GPL, our ROM code includes headers from other parts of
>our system that are specifically not under the GPL. How this affects
>the final license under which the ROM is distributed is not something
>I think we know at this time.

The code _you_ wrote can be put under any license(s) you want, so in
the worst case you do not need to rewrite your .c files.

Jan Engelhardt
--

2006-03-16 19:45:43

by Rik van Riel

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

On Thu, 16 Mar 2006, Jan Engelhardt wrote:

> The code _you_ wrote can be put under any license(s) you want, so in
> the worst case you do not need to rewrite your .c files.

The license might affect which other OSes could link in
the ROM though...

--
All Rights Reversed

2006-03-16 22:57:40

by Zachary Amsden

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Rik van Riel wrote:
> On Thu, 16 Mar 2006, Jan Engelhardt wrote:
>
>
>> The code _you_ wrote can be put under any license(s) you want, so in
>> the worst case you do not need to rewrite your .c files.
>>
>
> The license might affect which other OSes could link in
> the ROM though...
>

That's actually a hideously good point. I think that means, in fact,
that the ROM code specifically cannot be GPL'd - otherwise BSD might be
unable to use it. The LGPL might be more appropriate.

Zach

2006-03-17 00:54:48

by Zachary Amsden

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Pavel Machek wrote:
>> The question of licensing of such ROM code is a completely separate
>> issue. We are not trying to hide some proprietary code by putting it
>> inside of a ROM to keep it hidden. In fact, you can disassemble the
>> ROM code and see it quite readily - and you know all of the entry
>> points.
>>
>
> Could you disassemble one entry point for us and describe how it works?
>

000005e0 <stub_VMI_SwapPxE>:
5e0: e9 db 06 00 00 jmp cc0 <VMI_SwapPxE>
5e5: 0f 0b ud2a

(The SwapPxE call does not fit in the 32-byte region).

00000cc0 <VMI_SwapPxE>:
cc0: 83 ec 1c sub $0x1c,%esp
cc3: 89 c1 mov %eax,%ecx
cc5: 89 74 24 10 mov %esi,0x10(%esp,1)
cc9: 89 d6 mov %edx,%esi
ccb: 89 7c 24 14 mov %edi,0x14(%esp,1)
ccf: 89 6c 24 18 mov %ebp,0x18(%esp,1)
cd3: 89 54 24 0c mov %edx,0xc(%esp,1)

(Save registers and call values on stack)

cd7: 87 30 xchg %esi,(%eax)

(Atomically swap the PTE in the page)

cd9: 0f b6 44 24 0c movzbl 0xc(%esp,1),%eax

(Extract low byte of the PTE write just done)

cde: 31 d2 xor %edx,%edx
ce0: 09 f0 or %esi,%eax
ce2: a8 01 test $0x1,%al

(Were either the new or old PTE's present?)

ce4: 74 11 je cf7 <VMI_SwapPxE+0x37>

(No? jump forward to branch combination)

ce6: 0f b6 44 24 0c movzbl 0xc(%esp,1),%eax

(Extract low byte of the PTE write again)

ceb: 31 f0 xor %esi,%eax

(Compute changed bits in PTE)

ced: a8 c7 test $0xc7,%al

(Have U/S, R/W, P, PS, D bits changed?)

cef: b8 01 00 00 00 mov $0x1,%eax
cf4: 0f 45 d0 cmovne %eax,%edx
cf7: 84 d2 test %dl,%dl

(Combine above tests into one branch)

cf9: 74 32 je d2d <VMI_SwapPxE+0x6d>

(If nothing interesting changed, return)

cfb: b8 54 00 00 fc mov $0xfc000054,%eax

(Load start of linear address translation lookaside)

d00: 31 d2 xor %edx,%edx
d02: 89 cf mov %ecx,%edi
d04: 39 08 cmp %ecx,(%eax)

(Compare stored lookaside translation address with address of PTE write)

d06: 77 37 ja d3f <VMI_SwapPxE+0x7f>

(Greater? - next)

d08: 39 48 04 cmp %ecx,0x4(%eax)

(Compare stored lookaside translate address end with address of PTE write)

d0b: 76 32 jbe d3f <VMI_SwapPxE+0x7f>

(Below/Equal? - next)

d0d: 03 78 08 add 0x8(%eax),%edi

(Add translation offset to PTE address (perform VA->PA conversion))

d10: c7 04 24 00 00 00 00 movl $0x0,(%esp,1)
d17: 8b 54 24 0c mov 0xc(%esp,1),%edx
d1b: bd 89 00 00 00 mov $0x89,%ebp
d20: 89 6c 24 04 mov %ebp,0x4(%esp,1)
d24: 31 c9 xor %ecx,%ecx
d26: 89 f8 mov %edi,%eax
d28: e8 73 fd ff ff call aa0 <VMIEnqueue>

(Put arguments into order, and queue the PTE update)

d2d: 89 f0 mov %esi,%eax
d2f: 8b 74 24 10 mov 0x10(%esp,1),%esi
d33: 8b 7c 24 14 mov 0x14(%esp,1),%edi
d37: 8b 6c 24 18 mov 0x18(%esp,1),%ebp
d3b: 83 c4 1c add $0x1c,%esp
d3e: c3 ret

(Restore registers and return)

d3f: 42 inc %edx
d40: 83 c0 10 add $0x10,%eax

(Move to next translation index)

d43: 83 fa 03 cmp $0x3,%edx
d46: 7e bc jle d04 <VMI_SwapPxE+0x44>

(Retry)

d48: 89 cf mov %ecx,%edi
d4a: eb c4 jmp d10 <VMI_SwapPxE+0x50>

(Default case - VA == PA, direct mapping assumed)

And this version may be easier for some of you to read:

/*
* To convert virtual address to physical address, we must
* keep a cache of currently mapped page tables. This allows
* us to perform the VA->PA conversion with guest help, which
* is much faster than having to do a page table walk. For the
* Linux guest, when page tables are not in high memory, the
* first mapping slot is always an identity map of physical
* memory mapped at PAGE_OFFSET.
*/

static INLINE PA
VMITranslate(const LA la)
{
HyperAddressMap *m;
int i;
m = &vmiHypervisorShared->pageTranslation[0];
for (i = 0; i < VMI_LINEAR_MAP_SLOTS; i++, m++) {
if (LIKELY(la >= m->startVA && la < m->endVA)) {
return (m->physOffset + la);
}
}
/* Assume identity mapped */
return la;
}

/*
* Note we read and write guest memory directly. This
* is safe for us because we use shadow page tables. The
* Xen ROM would perform similar conditional tests here to
* figure out which hypercall it wants to make or decide
* to write the PTE directly and trigger shadow
*/

VMICALL(SwapPxE, VM_PTE *const ptep, VM_PTE const pteval)
{
VMI_ENTER;
VM_PTE pteold;

pteold = Atomic_ReadWrite((Atomic_uint32 *)ptep, pteval);
if (VMIIsPTEInteresting(pteval, pteold)) {
PA pa = VMITranslate((LA) ptep);
VMIEnqueue(pa, pteval, 0, 0, HC_SetPxEAsync);
}
VMI_RETURN(pteold);
}
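
The source of VMIIsPTEInteresting is not shown here, so the following
is only a reconstruction from the disassembly above (the test $0x1 and
test $0xc7 steps), with hypothetical names. It treats an update as
interesting when at least one of the old and new entries is present
and any of the P, R/W, U/S, D or PS bits differ.

```c
#include <assert.h>
#include <stdint.h>

/* i386 page table entry flag bits covered by the 0xc7 mask. */
#define PTE_P   0x01u   /* present */
#define PTE_RW  0x02u   /* writable */
#define PTE_US  0x04u   /* user/supervisor */
#define PTE_D   0x40u   /* dirty */
#define PTE_PS  0x80u   /* page size */
#define PTE_WATCHED (PTE_P | PTE_RW | PTE_US | PTE_D | PTE_PS)

/*
 * Return 1 if replacing `pteold` with `pteval` changes one of the
 * watched bits while at least one of the two entries is present.
 */
int sketch_pte_is_interesting(uint32_t pteval, uint32_t pteold)
{
    if (((pteval | pteold) & PTE_P) == 0)
        return 0;   /* neither entry was present */
    return ((pteval ^ pteold) & PTE_WATCHED) != 0;
}
```

Read this way, a write that only changes the frame address while
leaving the watched flag bits alone is not queued, since the
disassembled test only looks at the low byte.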

Herein lies one of our problems. Atomic_ReadWrite comes from a private
header file. We need to copy all of these defines and inline functions
into a distributable header file rather than a private one - separating
public and private headers can be sort of like trying to straighten the
vines of a giant thornbush. Every time you straighten one, another
whacks you in the head. We have many, many build clients that depend on
those same headers, and moving large amounts of things around is a
tricky and painstaking process. Not that it can't be done, just that it
is not a simple copy and paste job.

Zach

2006-03-17 10:09:34

by Joshua LeVasseur

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

On Mar 16, 2006, at 00:00 , Pavel Machek wrote:

>> The question of licensing of such ROM code is a completely separate
>> issue. We are not trying to hide some proprietary code by putting it
>> inside of a ROM to keep it hidden. In fact, you can disassemble the
>> ROM code and see it quite readily - and you know all of the entry
>> points.
>
> Could you disassemble one entry point for us and describe how it
> works?

Here is how I construct my ROM. My apologies if my email client
wraps any of the lines.

I have a set of boundary functions between assembler and C code,
identified by a suffix on the function names: _ext (stands for
"extended" in response to the evolution of my ROM). They accept a
parameter called burn_clobbers_frame_t:
struct burn_clobbers_frame_t {
    word_t burn_ret_address;  /* Return address to the ROM */
    word_t frame_pointer;     /* This parameter. */
    word_t edx;
    word_t ecx;
    word_t eax;
    word_t guest_ret_address;
    word_t params[0];         /* Anything the guest pushed on the stack */
};
I use this structure for the non-performance critical functions.
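
To make the layout concrete, here is a small C sketch that pins down the offsets the structure implies on a 32-bit stack. It is an illustration under assumptions: `word_t` is taken to be 32 bits, the GNU `params[0]` is written as a C99 flexible array member, and `frame_from_saved_edx` is a hypothetical helper that mirrors the `push %esp` / `subl $8, 0(%esp)` pair in the stub macro shown further down.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t word_t;   /* assumed: 32-bit machine word on i386 */

struct burn_clobbers_frame_t {
    word_t burn_ret_address;   /* pushed by the call into the C function */
    word_t frame_pointer;      /* the argument: address of this frame */
    word_t edx;                /* registers saved by the stub */
    word_t ecx;
    word_t eax;
    word_t guest_ret_address;  /* pushed by the guest's call into the ROM */
    word_t params[];           /* anything the guest pushed on the stack */
};

/* Compile-time checks of the offsets an assembler stub would rely on. */
_Static_assert(offsetof(struct burn_clobbers_frame_t, frame_pointer) == 4,
               "argument slot follows the return address");
_Static_assert(offsetof(struct burn_clobbers_frame_t, edx) == 8,
               "saved registers start 8 bytes into the frame");
_Static_assert(offsetof(struct burn_clobbers_frame_t, guest_ret_address) == 20,
               "guest return address follows the three saved registers");

/* Given the stack address of the saved edx, compute the frame address:
 * the same subtraction the stub performs with 'subl $8'. */
static inline uintptr_t frame_from_saved_edx(uintptr_t edx_slot)
{
    return edx_slot - offsetof(struct burn_clobbers_frame_t, edx);
}
```

Tying the constant 8 to `offsetof` is a design choice of this sketch; it makes the dependency between the assembler stub and the C structure visible at compile time.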

Here are some of the assembler macros for constructing the ROM:

vmi_stub_simple WRMSR, afterburn_cpu_write_msr_ext
vmi_stub_simple RDMSR, afterburn_cpu_read_msr_ext

vmi_stub_simple SetGDT, afterburn_cpu_write_gdt32_ext
vmi_stub_simple SetLDT, afterburn_cpu_lldt_ext
vmi_stub_simple SetIDT, afterburn_cpu_write_idt32_ext
vmi_stub_simple SetTR, afterburn_cpu_ltr_ext

.macro vmi_stub_simple name func
/* Invoke a front-end C function that expects a burn_clobbers_frame_t as
 * the parameter.
 */
    vmi_stub_begin \name
    __afterburn_save_clobbers
    push %esp                   /* Address of the saved registers. */
    subl $8, 0(%esp)            /* Adjust it to the start of the frame. */
    vmi_call_relocatable \func
    __afterburn_restore_clobbers 4
    ret
    vmi_stub_end
.endm

.macro vmi_stub_begin name
/* Define the prolog of a VMI stub. */
    vmi_area_begin \name
    burn_counter VMI_\name
    burn_counter_inc VMI_\name
.endm

.macro vmi_stub_end
/* Define the epilog of a VMI stub. */
    .previous
.endm

.macro vmi_area_begin name
/* Define the start of a VMI area. */
    .section .vmi_rom, "ax"
    .org vmi_rom_offset_\name
    .globl VMI_\name
    .type VMI_\name,@function
VMI_\name:
.endm

.macro vmi_call_relocatable func
9191: call \func
    .pushsection .vmi_patchups, "aw"
    .balign 4
    .long 9191b
    .popsection
.endm

.macro vmi_jmp_relocatable target
9191: jmp \target
    .pushsection .vmi_patchups, "aw"
    .balign 4
    .long 9191b
    .popsection
.endm
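
The relocatable macros only record where each call or jmp instruction lives; how the recorded addresses are consumed is not shown in this message. One plausible use, sketched below in C purely as an assumption, is to fix up the rel32 displacement when the stub is copied to a different address while its target stays where it was linked. The names `fix_rel32` and `apply_patchups`, and the parameters `link_base` and `delta`, are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Adjust one call/jmp rel32 whose site moved by 'delta' bytes while its
 * target stayed put. 'insn' points at the E8/E9 opcode byte. */
static void fix_rel32(uint8_t *insn, int32_t delta)
{
    int32_t disp;
    memcpy(&disp, insn + 1, sizeof disp);   /* rel32 follows the opcode */
    disp -= delta;                          /* site moved, target did not */
    memcpy(insn + 1, &disp, sizeof disp);
}

/* Apply every patchup. 'patchups' holds link-time addresses of the
 * instructions (the 9191 labels); 'image' is the relocated copy. */
void apply_patchups(uint8_t *image, uint32_t link_base, int32_t delta,
                    const uint32_t *patchups, size_t count)
{
    for (size_t i = 0; i < count; i++)
        fix_rel32(image + (patchups[i] - link_base), delta);
}
```

If the call site and its target move together, no fix-up is needed at all, which is presumably why only calls that leave the ROM area use the relocatable variants.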

extern "C" void
afterburn_cpu_write_gdt32_ext( burn_clobbers_frame_t *frame )
{
    get_cpu()->gdtr = *(dtr_t *)frame->eax;
}

The ROM entry points are only half the solution. The other half
involves Xen callbacks and traps. The system call trap is
constructed to directly activate Linux's system call trap.
Everything else jumps directly into the ROM for filtering reasons.
The page-fault handler stays in assembler (because page faults are a
performance issue; on a linux kernel build, they occur almost as
frequently as system calls). The remaining traps enter C code, and
look like this:

trap_wrapper id=8, use_error_code=1 /* Double fault exception */
trap_wrapper id=9, use_error_code=0 /* Coprocessor segment overrun */
trap_wrapper id=10, use_error_code=1 /* Invalid TSS exception */
trap_wrapper id=11, use_error_code=1 /* Segment not present */
trap_wrapper id=12, use_error_code=1 /* Stack fault exception */
trap_wrapper id=13, use_error_code=1 /* general protection fault */

.macro trap_wrapper id, use_error_code
entry trap_wrapper_\id
    .if \use_error_code
    subl $4, %esp /* Fault addr. */
    .else
    subl $8, %esp /* Error code, fault addr. */
    .endif
    pushl $(\id | (\use_error_code << 31)) /* Frame ID. */
    cpu_save_all
    movl %esp, %eax /* A pointer to the CPU save frame. */
    call trap
    jmp afterburn_exit
.endm

entry afterburn_exit
    cpu_restore_all 0, 12 /* Error code, fault addr, frame id. */
    iret

extern "C" void __attribute__(( regparm(1) ))
trap( xen_frame_t *frame )
{
    if( EXPECT_FALSE(frame->iret.ip >= CONFIG_WEDGE_VIRT) ) {
        con << "Unexpected fault in the ROM, ip " << (void *)frame->iret.ip
            << '\n';
        DEBUGGER_ENTER(frame);
        panic();
    }

    u8_t *opstream = (u8_t *)frame->iret.ip;

    if( cpu_t::get_segment_privilege(frame->iret.cs) == 3 )
    {
        // A user-level fault.

        ... check for TLS issues ...

    }

    // Update virtual CPU state, and deliver the trap to the guest kernel.
    xen_deliver_async_vector( frame->get_id(), frame,
                              frame->uses_error_code());
}
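
The trap wrapper packs the vector number and the error-code flag into a single frame ID word, which trap() later reads back through get_id() and uses_error_code(). Those accessors are not shown here, so the following C sketch is only a guess at their shape: the function names and the `FRAME_ERR_FLAG` constant are hypothetical, and only the bit layout (`id | (use_error_code << 31)`) comes from the macro.

```c
#include <stdint.h>

/* Bit 31 marks traps for which the CPU pushed an error code. */
#define FRAME_ERR_FLAG (UINT32_C(1) << 31)

/* Same packing as 'pushl $(id | (use_error_code << 31))' in the macro. */
static inline uint32_t frame_id_encode(uint32_t vector, int use_error_code)
{
    return vector | (use_error_code ? FRAME_ERR_FLAG : 0);
}

/* Recover the trap vector number. */
static inline uint32_t frame_get_id(uint32_t frame_id)
{
    return frame_id & ~FRAME_ERR_FLAG;
}

/* Recover the error-code flag. */
static inline int frame_uses_error_code(uint32_t frame_id)
{
    return (frame_id & FRAME_ERR_FLAG) != 0;
}
```

The unsigned constant is deliberate: shifting a signed 1 into bit 31 is not well defined in C, even though the assembler accepts the equivalent expression.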

2006-03-17 21:12:44

by Chris Wright

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

* Joshua LeVasseur ([email protected]) wrote:
> extern "C" void
> afterburn_cpu_write_gdt32_ext( burn_clobbers_frame_t *frame )
> {
> get_cpu()->gdtr = *(dtr_t *)frame->eax;
> }

What is this get_cpu()? Accessing data structure that's avail. in ROM
and shared with hypervisor...could you elaborate a bit here?

thanks,
-chris

2006-03-18 00:49:21

by Joshua LeVasseur

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

On Mar 17, 2006, at 22:11 , Chris Wright wrote:

> * Joshua LeVasseur ([email protected]) wrote:
>> extern "C" void
>> afterburn_cpu_write_gdt32_ext( burn_clobbers_frame_t *frame )
>> {
>> get_cpu()->gdtr = *(dtr_t *)frame->eax;
>> }
>
> What is this get_cpu()? Accessing data structure that's avail. in ROM
> and shared with hypervisor...could you elaborate a bit here?
>
> thanks,
> -chris

VMI is a very versatile interface due to the ROM; within the ROM you
can translate the instruction set architecture and device register
activity (as represented by the VMI interface) to a variety of
hypervisor interfaces. I use a virtual CPU to help perform the
translation. The performance of virtualization depends on the extent
to which you can minimize interaction with the hypervisor via
hypercalls. Many of the operations needn't be exposed to the
hypervisor, and only operate on the virtual CPU, and thus remain
completely within the ROM. The goal is to minimize interaction with
the hypervisor.

I don't share the virtual CPU with the hypervisor. There probably
are performance benefits for codesign between the hypervisor and ROM,
but I haven't had that luxury; I take the hypervisors as given and
none of them are fundamentally designed to use a ROM. On the other
hand, it makes sense to concentrate virtualization within the ROM,
rather than the hypervisor, for the same arguments you can make for
implementing functionality in an application rather than the kernel.

I've implemented ROMs for two (open source) hypervisors so far, and
try to share as much code between them as possible. The get_cpu() is
an abstraction to help hide the hypervisor specifics for locating the
virtual CPU (and it handles multiprocessor issues).
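
As a rough illustration of that abstraction (not the actual ROM source), a get_cpu() of this kind could be as small as the C sketch below. Everything in it is an assumption: the `vcpu_t` layout, the simplified unpacked `dtr_t`, the `MAX_VCPUS` limit, and the `backend_cpu_index` hook through which each hypervisor port would name the current processor.

```c
#include <stdint.h>

#define MAX_VCPUS 4   /* arbitrary limit for the sketch */

/* Simplified descriptor-table register image (the hardware form is a
 * packed 6-byte limit/base pair). */
typedef struct { uint16_t limit; uint32_t base; } dtr_t;

struct vcpu_t { dtr_t gdtr; };

static struct vcpu_t vcpus[MAX_VCPUS];

/* Hypothetical backend hook: each hypervisor port supplies its own way
 * of identifying the current processor. NULL means uniprocessor. */
static unsigned (*backend_cpu_index)(void);

struct vcpu_t *get_cpu(void)
{
    unsigned i = backend_cpu_index ? backend_cpu_index() : 0;
    return &vcpus[i % MAX_VCPUS];
}

/* Same shape as afterburn_cpu_write_gdt32_ext: the new value lands in
 * the virtual CPU only, with no hypercall. */
void write_gdt(const dtr_t *guest_dtr)
{
    get_cpu()->gdtr = *guest_dtr;
}
```

Only the hook differs between hypervisor ports in this sketch; the code that manipulates the virtual CPU stays shared.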

To help illustrate the role of the ROM, consider using Linux as a
hypervisor, i.e., Linux-on-Linux but with the guest kernel using the
VMI interface [1]. The ROM would translate the low-level operations
of the guest kernel into the system calls of the host Linux, and it
would be important to minimize the amount of interaction with the
host Linux. Consider interrupt delivery, which would probably be
mapped to POSIX signals. VMI offers VMI_EnableInterrupts(),
VMI_DisableInterrupts(), VMI_GetInterruptMask(), and
VMI_SetInterruptMask(). All of these operations are executed
frequently by Linux, and it would be critical to limit their side
effects to within the ROM; for performance reasons, they mustn't map
to POSIX signal mask/unmask operations. The solution is to update
only the EFLAGS in the virtual CPU when the guest kernel invokes
VMI_EnableInterrupts(), VMI_DisableInterrupts(), etc. Then the ROM must
always accept asynchronous POSIX signal delivery, and must only
forward asynchronous events to the guest kernel if interrupts are
enabled in the virtual CPU. If the virtual CPU's interrupts are
disabled, then the event is only recorded in the virtual PIC, and
delivered at the next VMI_EnableInterrupts() or VMI_SetInterruptMask().
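
The scheme just described can be sketched in a few lines of C. This is a minimal model under assumptions, not ROM code: the `vcpu` structure, the `delivered` field (which merely records what a real ROM would inject into the guest), and all function names are hypothetical, and the sketch ignores the atomicity a real signal handler would need.

```c
#include <stdint.h>

#define EFLAGS_IF (UINT32_C(1) << 9)   /* x86 interrupt-enable flag */

struct vcpu {
    uint32_t eflags;        /* virtual EFLAGS, private to the ROM */
    uint32_t pic_pending;   /* one bit per pending virtual IRQ */
    uint32_t delivered;     /* IRQs handed to the guest (illustration) */
};

/* Hand every pending IRQ to the guest; a real ROM would build a trap
 * frame and enter the guest's handler here. */
static void deliver_pending(struct vcpu *cpu)
{
    cpu->delivered |= cpu->pic_pending;
    cpu->pic_pending = 0;
}

/* VMI_DisableInterrupts analogue: no host call, just clear virtual IF. */
void vcpu_disable_interrupts(struct vcpu *cpu)
{
    cpu->eflags &= ~EFLAGS_IF;
}

/* VMI_EnableInterrupts analogue: set IF, then flush what was recorded. */
void vcpu_enable_interrupts(struct vcpu *cpu)
{
    cpu->eflags |= EFLAGS_IF;
    if (cpu->pic_pending)
        deliver_pending(cpu);
}

/* Called from the host's asynchronous notification, e.g. a signal. */
void vcpu_async_event(struct vcpu *cpu, unsigned irq)
{
    cpu->pic_pending |= UINT32_C(1) << irq;
    if (cpu->eflags & EFLAGS_IF)
        deliver_pending(cpu);
}
```

Neither the enable nor the disable path touches the host, so the frequent operations cost a few instructions, and the host signal mask never changes.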

[1] Linux-on-Linux would probably limp with the current VMI. A
couple changes would be necessary, such as permitting the Linux
kernel to run at ring 3, and offering put_user() and get_user()
hooks, since the guest applications and guest kernel must use
different host address spaces. Unfortunately, put_user() and
get_user() hooks are higher-level interfaces that don't fit well within
VMI. For other CPU architectures with only two privilege levels,
put_user() and get_user() hooks may be necessary too.

Joshua

2006-03-22 20:59:29

by Andi Kleen

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

On Monday 13 March 2006 19:02, Zachary Amsden wrote:
> The VMI ROM detection and code patching mechanism is illustrated in
> setup.c. The ROM is a binary block published by the hypervisor, and
> there are certainly implications of this. ROMs certainly have a
> history of being proprietary, very differently licensed pieces of
> software, and mostly under non-free licenses. Before jumping to the
> conclusion that this is a bad thing, let us consider more carefully
> why hiding the interface layer to the hypervisor is actually a good
> thing.

How about you fix all these issues you describe here first
and then submit it again?

The disassembly stuff indeed doesn't look like something
that belongs in the kernel.

-Andi

2006-03-22 21:40:33

by Chris Wright

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

* Andi Kleen ([email protected]) wrote:
> The disassembly stuff indeed doesn't look like something
> that belongs in the kernel.

Strongly agreed. The strict ABI requirements put forth here are not
in-line with Linux, IMO. I think source compatibility is the limit of
reasonable, and any ROM code be in-tree if something like this were to
be viable upstream.

thanks,
-chris

2006-03-22 22:18:41

by Zachary Amsden

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Chris Wright wrote:
> * Andi Kleen ([email protected]) wrote:
>
>> The disassembly stuff indeed doesn't look like something
>> that belongs in the kernel.
>>

Agreed. It should be done prior to kernel booting, invisible to the
kernel itself. I'm working on it, but there is still a lot to do.

>
> Strongly agreed. The strict ABI requirements put forth here are not
> in-line with Linux, IMO. I think source compatibility is the limit of
> reasonable, and any ROM code be in-tree if something like this were to
> be viable upstream.
>

Strongly disagree. Without an ABI, you don't have binary
compatibility. Without binary compatibility, you have no way to inline
any hypervisor code into the kernel. And this is key for performance.
The ROM code is being phased out.

Is it the strictness of the ABI that is the problem? I don't like
constraining the native register values very much either, but it was the
expedient thing to do. The ABI can be relaxed quite a bit, but it has
to be there.

The idea of in-tree ROM code doesn't make sense. The entire point of
this layer of code is that it is modular, and specific to the
hypervisor, not the kernel. Once you lift the shroud and combine the
two layers, you have lost all of the benefit that it was supposed to
provide.

Zach

2006-03-22 22:36:57

by Daniel Arai

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Zachary Amsden wrote:
> Chris Wright wrote:
>
>> Strongly agreed. The strict ABI requirements put forth here are not
>> in-line with Linux, IMO. I think source compatibility is the limit of
>> reasonable, and any ROM code be in-tree if something like this were to
>> be viable upstream.
>
> The idea of in-tree ROM code doesn't make sense. The entire point of
> this layer of code is that it is modular, and specific to the
> hypervisor, not the kernel. Once you lift the shroud and combine the
> two layers, you have lost all of the benefit that it was supposed to
> provide.

To elaborate a bit more, the "ROM" layer is "published" by the hypervisor. This
layer of abstraction will let you take a VMI-compiled kernel and run it
efficiently on any hypervisor that exports a VMI interface - even one that you
didn't know about (or didn't exist) when you compiled your kernel.

If the ROM part is compiled into the code, then you have to compile in support
for the specific hypervisor(s) you want to run on. It might be reasonable for
this code to be in a loadable kernel module, rather than a device ROM per se, but
you still want that kernel module to be provided by the hypervisor.

Suppose someone implements a ROM layer for UML, or QEMU, or even for Microsoft's
hypervisor. Having the ROM published by the hypervisor now lets you run your
kernel on that new hypervisor without recompiling. While this might not be much
of a benefit for an individual developer who downloads and compiles his own
kernel, this is a huge win for people who distribute binary kernels, or large IT
organizations that may have large heterogeneous virtual machine farms to maintain.

Going forward, having the ROM layer published by the hypervisor gives the
hypervisor more flexibility than having the code statically compiled into the
kernel. Consider when hardware virtualization becomes more prevalent. Perhaps
there are places where today hypercalls make sense, but with hardware
virtualization, you'd rather have the hardware just take care of it. CPUID is
the only example I can come up with at the moment, but there are certainly
others. VMI lets the hypervisor decide that it doesn't actually need to replace
the CPUID instruction with a hypercall. The important factor here is that only
the hypervisor, not the kernel, knows about these performance tradeoffs. Or
maybe in the next version of Xen, it's possible to use sysenter rather than an
interrupt instruction to do hypercalls. If the hypervisor publishes this code,
even older kernels can transparently take advantage of faster ways of doing
certain things.

Dan.

2006-03-22 22:51:26

by Chris Wright

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

* Zachary Amsden ([email protected]) wrote:
> Chris Wright wrote:
> >Strongly agreed. The strict ABI requirements put forth here are not
> >in-line with Linux, IMO. I think source compatibility is the limit of
> >reasonable, and any ROM code be in-tree if something like this were to
> >be viable upstream.
>
> Strongly disagree. Without an ABI, you don't have binary
> compatibility. Without binary compatibility, you have no way to inline
> any hypervisor code into the kernel. And this is key for performance.
> The ROM code is being phased out.

With source compatibility you get the ABI at compile time. This is how
Linux handles internal interfaces. This is about an internal interface
between kernel and a platform layer. The raw hypervisor interface is
hidden (translations done in the platform layer), and the hypervisor
needs to provide stable ABI.

> Is it the strictness of the ABI that is the problem? I don't like
> constraining the native register values very much either, but it was the
> expedient thing to do. The ABI can be relaxed quite a bit, but it has
> to be there.

It's the very notion of creating such a large internal binary compatible
interface.

> The idea of in-tree ROM code doesn't make sense. The entire point of
> this layer of code is that it is modular, and specific to the
> hypervisor, not the kernel. Once you lift the shroud and combine the
> two layers, you have lost all of the benefit that it was supposed to
> provide.

You could compile all platform layers you want to support with the kernel.

thanks,
-chris

2006-03-22 23:02:34

by Chris Wright

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

* Daniel Arai ([email protected]) wrote:
> To elaborate a bit more, the "ROM" layer is "published" by the hypervisor.
> This layer of abstraction will let you take a VMI-compiled kernel and run
> it efficiently on any hypervisor that exports a VMI interface - even one
> that you didn't know about (or didn't exist) when you compiled your kernel.
>
> If the ROM part is compiled into the code, then you have to compile in
> support for the specific hypervisor(s) you want to run on. It might be
> reasonable for this code to be in a loadable kernel module, rather than a
> device ROM per se, but you still want that kernel module to be provided by
> the hypervisor.

I don't agree. That module may know how to get interface info from a
vdso analog (just like a driver knows the hardware details of the device
it's interacting with, but the core kernel api is unaware), but placing
the binary compatibility on the kernel proper is wrong IMO.

thanks,
-chris

2006-03-22 23:36:50

by Zachary Amsden

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Chris Wright wrote:
> You could compile all platform layers you want to support with the kernel.
>

But the entire point is that you don't know what platform layers you
want to support. The platform layers can change. Xen has changed the
platform layer and re-optimized kernel / hypervisor transitions how many
times? The platform layer provides exactly the flexibility to do that,
so that a kernel you compile today against a generic platform can work
with the platform layer provided by Xen 4.0 tomorrow.

Compiling the platform layer with the kernel for "source compatibility"
is exactly what prevents you from doing this. And you get stuck having
the same stable and inflexible ABI to the hypervisor, rather than a
carefully architected ABI just before it. The most important design
decision in creating the VMI layer was disallowing data dependence
between the compiled kernel and the hypervisor ABI.

I have seen, and will continue to see, every single shared data block
layout change to meet the demands of new features. Eventually, you get
to a point where it is growing antlers and having new hooves grafted
onto it, yet still requires all of the original cruft you used to do.
It is either a maintenance nightmare, or a compatibility nightmare. If
you want compatibility, you really can't break that interface all the
time, and the real world demands of customers using virtualization
solutions really do want that compatibility. You simply can't certify a
complex platform if you have to recompile your kernel for every new
release of your chosen hypervisor. Bugs do get introduced this way, the
older kernels fall out of maintenance, and eventually you are forcing
them to upgrade to the latest kernels, which even worse, may have
changed the userspace interface, dropped legacy feature support, and
broken your key application that was the entire point of running in a VM
to begin with. People throw things in VMs and then expect those VMs to
keep running for years, and you really can't break that.

So instead, you impose a giant maintenance burden on the hypervisor,
forcing them to go to all efforts to avoid breaking this hypervisor
ABI. That leads directly to crufty, unstable, and poor performing code.

So the VMI layer is all about defining an ABI at a slightly higher
level. A level which has many benefits you simply can't get from source
compatibility. It is about the future, about preparing for the unknown,
about giving a powerful abstraction to the platform layer to do whatever
it chooses to do.

If Intel announces a new chip tomorrow, with a feature bit that allows
selective privileged instructions to operate in non-zero supervisor
CPLs, you're really going to regret the fact that you can't issue page
invalidations and TLB flushes directly in the kernel because you
unwisely decided to compile these in as direct int $0x81 hypercalls.
You can change the platform layer and let new versions pick that up, and
try to encourage people to move to newer kernels. And you have to make
this change for every single operating system you support, leading to
greater risk for introducing bugs in addition to any unwanted side
effects of a kernel upgrade. Even worse, you may find a bug that
_requires_ changing the platform layer. It might be a wide, gaping
security hole. We had a few in the course of development (kernel CS
entry value stored in shared area..). Now you have to break the
hypervisor ABI, all your customers think you suck, and they have to
upgrade all their systems. Perhaps they have been happily running the
2.4.26 kernel up to now. What do they do?

Why do you want to bind yourself to source compatibility, when it does
not bring you features at all, it only hurts you in terms of deployment
flexibility?

Zach

2006-03-22 23:41:34

by Volkmar Uhlig

Subject: RE: [RFC, PATCH 5/24] i386 Vmi code patching

> -----Original Message-----
> From: [email protected]
> Sent: Wednesday, March 22, 2006 5:34 PM
>
> > The idea of in-tree ROM code doesn't make sense. The entire point
> > of this layer of code is that it is modular, and specific to the
> > hypervisor, not the kernel. Once you lift the shroud and combine
> > the two layers, you have lost all of the benefit that it was
> > supposed to provide.
>
> To elaborate a bit more, the "ROM" layer is "published" by
> the hypervisor. This layer of abstraction will let you take
> a VMI-compiled kernel and run it efficiently on any
> hypervisor that exports a VMI interface - even one that you
> didn't know about (or didn't exist) when you compiled your kernel.
>
> [...]
>
> Going forward, having the ROM layer published by the
> hypervisor gives the hypervisor more flexibility than having
> the code statically compiled into the kernel. Consider when
> hardware virtualization becomes more prevalent. Perhaps
> there are places where today hypercalls make sense, but with
> hardware virtualization, you'd rather have the hardware just
> take care of it. CPUID is the only example I can come up
> with at the moment, but there are certainly others. VMI lets
> the hypervisor decide that it doesn't actually need to
> replace the CPUID instruction with a hypercall. The
> important factor here is that only the hypervisor, not the
> kernel, knows about these performance tradeoffs.

Very obvious other candidates are the shadowed system state registers
(cli, sti, CRx) provided by VT and the shadow page-table support as
defined by Pacifica. In particular since these features are dependent on
the specific processor revision a hard-coded binary interface doesn't do
any good. The ROM pretty much resembles Linux' system call interface as
provided today optimizing for the specific HW architecture.

- Volkmar

2006-03-23 00:31:12

by Anthony Liguori

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Chris Wright wrote:
> * Andi Kleen ([email protected]) wrote:
>
>> The disassembly stuff indeed doesn't look like something
>> that belongs in the kernel.
>>
>
> Strongly agreed. The strict ABI requirements put forth here are not
> in-line with Linux, IMO. I think source compatibility is the limit of
> reasonable, and any ROM code be in-tree if something like this were to
> be viable upstream.
>

Hi Chris,

Would you have less trouble if the "ROM" were actually more like a
module? Specifically, if it had a proper elf header and symbol table,
used symbols as entry points, and was a GPL interface (so that ROM's had
to be GPL)? Then it's just a kernel module that's hidden in the option
ROM space and has a C interface.

I know you end up losing the ability to do crazy inlining of the ROM
code but I think it becomes a much less hairy interface that way.

Regards,

Anthony Liguori

2006-03-23 00:40:09

by Chris Wright

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

* Anthony Liguori ([email protected]) wrote:
> Would you have less trouble if the "ROM" were actually more like a
> module? Specifically, if it had a proper elf header and symbol table,
> used symbols as entry points, and was a GPL interface (so that ROM's had
> to be GPL)? Then it's just a kernel module that's hidden in the option
> ROM space and has a C interface.

Yeah, point is the interface is normal C API, and has the similar free
form that normal kernel API's have.

thanks,
-chris

2006-03-23 00:42:36

by Chris Wright

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

* Zachary Amsden ([email protected]) wrote:
> Chris Wright wrote:
> >You could compile all platform layers you want to support with the kernel.
>
> But the entire point is that you don't know what platform layers you
> want to support. The platform layers can change. Xen has changed the
> platform layer and re-optimized kernel / hypervisor transitions how many
> times? The platform layer provides exactly the flexibility to do that,
> so that a kernel you compile today against a generic platform can work
> with the platform layer provided by Xen 4.0 tomorrow.

This only works if you have all possible dreamed of interface bits in
the ABI. In Linux, we often don't know what we'll need to support in the
future, but we don't write binary compatible interfaces just in case we
need to update. Preferring instead, API's that are justifiable right now.
This is the issue I have with the ABI proposal. It doesn't fit well
with Linux development.

thanks,
-chris

2006-03-23 00:46:19

by Zachary Amsden

Subject: Re: [Xen-devel] Re: [RFC, PATCH 5/24] i386 Vmi code patching

Anthony Liguori wrote:
> Chris Wright wrote:
>> * Andi Kleen ([email protected]) wrote:
>>
>>> The disassembly stuff indeed doesn't look like something
>>> that belongs in the kernel.
>>>
>>
>> Strongly agreed. The strict ABI requirements put forth here are not
>> in-line with Linux, IMO. I think source compatibility is the limit of
>> reasonable, and any ROM code be in-tree if something like this were to
>> be viable upstream.
>>
>
> Hi Chris,
>
> Would you have less trouble if the "ROM" were actually more like a
> module? Specifically, if it had a proper elf header and symbol table,
> used symbols as entry points, and was a GPL interface (so that ROM's
> had to be GPL)? Then it's just a kernel module that's hidden in the
> option ROM space and has a C interface.
>
> I know you end up losing the ability to do crazy inlining of the ROM
> code but I think it becomes a much less hairy interface that way.

Actually, I think you still can get the ability to do crazy inlining of
the ROM code. You have three exports from the ELF module:

vmi_init - enter paravirtual mode
vmi_annotate - apply inline transformations based on inlining
vmi_exit - exit paravirtual mode (required for module unloading).
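
For concreteness, the three exports could be modelled as a small table of function pointers, as in the C sketch below. None of the signatures are settled anywhere in this thread, so all of them are assumptions, as are the `vmi_exports` structure, the `vmi_attach` driver, and the mock ROM, whose "transformation" simply overwrites each site with NOP bytes.

```c
#include <stddef.h>

/* Hypothetical view of the three symbols exported by the ELF module. */
struct vmi_exports {
    int  (*vmi_init)(void);                               /* enter paravirtual mode */
    int  (*vmi_annotate)(unsigned char *site, size_t len); /* rewrite one site */
    void (*vmi_exit)(void);                               /* leave paravirtual mode */
};

/* Kernel-side driver: enter paravirtual mode, then let the module
 * rewrite each annotated site. Returns the number of sites rewritten,
 * or -1 if the module is missing or refuses to initialise. */
int vmi_attach(const struct vmi_exports *rom,
               unsigned char **sites, const size_t *lens, size_t n)
{
    int done = 0;
    if (!rom || !rom->vmi_init || rom->vmi_init() != 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (rom->vmi_annotate && rom->vmi_annotate(sites[i], lens[i]) == 0)
            done++;
    return done;
}

/* Mock module for illustration: fills every site with 0x90 (NOP). */
static int mock_init(void) { return 0; }
static int mock_annotate(unsigned char *site, size_t len)
{
    for (size_t i = 0; i < len; i++)
        site[i] = 0x90;
    return 0;
}
static void mock_exit(void) { }

const struct vmi_exports mock_rom = { mock_init, mock_annotate, mock_exit };
```

In this arrangement the kernel knows only the locations and lengths of its annotated sites; what goes into them is decided entirely by the module.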

But you can't require the ROM to be GPL'd. It has to be multi-licensed
for compatibility with other open source or, even proprietary operating
systems. If the ROM is licensed for use only under the GPL, then by
including it in your kernel and allowing it to patch your kernel code,
you leave your non-GPL kernel in a questionable license state. If the
ROM is licensed under an open license, with a clause allowing its
inclusion into GPL'd software, then I don't think you have a problem.
Course I could be wrong. This is sort of a unique situation, and
finding an identical comparison is tricky.

Zach

2006-03-23 00:54:04

by Anthony Liguori

Subject: Re: [Xen-devel] Re: [RFC, PATCH 5/24] i386 Vmi code patching

Zachary Amsden wrote:
>> Hi Chris,
>>
>> Would you have less trouble if the "ROM" were actually more like a
>> module? Specifically, if it had a proper elf header and symbol
>> table, used symbols as entry points, and was a GPL interface (so that
>> ROM's had to be GPL)? Then it's just a kernel module that's hidden
>> in the option ROM space and has a C interface.
>>
>> I know you end up losing the ability to do crazy inlining of the ROM
>> code but I think it becomes a much less hairy interface that way.
>
> Actually, I think you still can get the ability to do crazy inlining
> of the ROM code. You have three exports from the ELF module:
>
> vmi_init - enter paravirtual mode
> vmi_annotate - apply inline transformations based on inlining
> vmi_exit - exit paravirtual mode (required for module unloading).

Hrm, I was actually thinking that each of the VMI calls would be an
export (vmi_init, vmi_set_pxe, etc.). I know that you want the
hypervisor to drive the inlining but I think that's sufficiently hairy (not to
mention, there's not AFAIK performance data yet to justify it) that I
think it ought to be left for VMI 2.0.

> But you can't require the ROM to be GPL'd. It has to be
> multi-licensed for compatibility with other open source or, even
> proprietary operating systems. If the ROM is licensed for use only
> under the GPL, then by including it in your kernel and allowing it to
> patch your kernel code, you leave your non-GPL kernel in a
> questionable license state. If the ROM is licensed under an open
> license, with a clause allowing its inclusion into GPL'd software,
> then I don't think you have a problem. Course I could be wrong. This
> is sort of a unique situation, and finding an identical comparison is
> tricky.

Multi-licensing is fine as long as one is GPL :-)

Regards,

Anthony Liguori

> Zach

2006-03-23 00:59:15

by Zachary Amsden

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Chris Wright wrote:
> * Zachary Amsden ([email protected]) wrote:
>
>> Chris Wright wrote:
>>
>>> You could compile all platform layers you want to support with the kernel.
>>>
>> But the entire point is that you don't know what platform layers you
>> want to support. The platform layers can change. Xen has changed the
>> platform layer and re-optimized kernel / hypervisor transitions how many
>> times? The platform layer provides exactly the flexibility to do that,
>> so that a kernel you compile today against a generic platform can work
>> with the platform layer provided by Xen 4.0 tomorrow.
>>
>
> This only works if you have all possible dreamed of interface bits in
> the ABI. In Linux, we often don't know what we'll need to support in the
> future, but we don't write binary compatible interfaces just in case we
> need to update. Preferring instead, API's that are justifiable right now.
> This is the issue I have with the ABI proposal. It doesn't fit well
> with Linux development.
>

No, you don't need to dream up all the possible interface bits ahead of
time. With a la carte interfaces, you can take what you need now, and
add features later. You don't need an ABI for features. You need it
for compatibility. You will need to update the hypervisor ABI. And you
can't force people to upgrade their kernels.

And much of this is so low level, that a C API for it just doesn't make
sense. This code is completely hidden from Linux development to begin
with, tucked away in the low level sub-arch layer.

Zach

2006-03-23 01:01:27

by Zachary Amsden

Subject: Re: [Xen-devel] Re: [RFC, PATCH 5/24] i386 Vmi code patching

Anthony Liguori wrote:
>
> Hrm, I was actually thinking that each of the VMI calls would be an
> export (vmi_init, vmi_set_pxe, etc.). I know that you want the
> hypervisor to drive the inlining but I think that's sufficiently hairy (not
> to mention, there's not AFAIK performance data yet to justify it) that
> I think it ought to be left for VMI 2.0.

That seems quite ok to me. It is a little weird to have the VMI calls
be an export when some of them really can never be properly callable C
functions, and you have to overwrite the native code, so the linking
step is .. well this magic disassembly glue again. But it could be made
to work, and we have discussed it before.

> Multi-licensing is fine as long as one is GPL :-)

I agree. But it sort of defeats the point of the GPL if you can
optionally redistribute the code under the BSD license as well.

2006-03-23 01:06:33

by Chris Wright

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

* Zachary Amsden ([email protected]) wrote:
> No, you don't need to dream up all the possible interface bits ahead of
> time. With a la carte interfaces, you can take what you need now, and
> add features later. You don't need an ABI for features. You need it
> for compatibility. You will need to update the hypervisor ABI. And you
> can't force people to upgrade their kernels.

How do you support an interface that's not already a part of the ABI
w/out changing the kernel?

2006-03-23 04:04:40

by Zachary Amsden

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Chris Wright wrote:
> * Zachary Amsden ([email protected]) wrote:
>
>> No, you don't need to dream up all the possible interface bits ahead of
>> time. With a la carte interfaces, you can take what you need now, and
>> add features later. You don't need an ABI for features. You need it
>> for compatibility. You will need to update the hypervisor ABI. And you
>> can't force people to upgrade their kernels.
>>
>
> How do you support an interface that's not already a part of the ABI
> w/out changing the kernel?
>

You have to change the kernel for VMI interface upgrades - if you want
to use the upgrades. You don't need to change the kernel for hypervisor
ABI changes, nor does upgrading the interface require a kernel change.
Interface upgrades are pretty easy to compartmentalize - you add block
device support, you add a block device driver. Hypervisor ABI changes
are not so easy, because of the data dependencies and potential for
breaking compatibility. The massive security hole scenario is a good
example of why you would need to break compatibility, but any number of
things might make you want to change the hypervisor ABI.

The point of the VMI is to isolate the kernel from those changes,
allowing kernel development to proceed unhindered, and allowing
hypervisor innovation to thrive simultaneously.

Zach

2006-03-23 09:25:21

by Keir Fraser

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

On 23 Mar 2006, at 00:40, Chris Wright wrote:

>> Would you have less trouble if the "ROM" were actually more like a
>> module? Specifically, if it had a proper elf header and symbol table,
>> used symbols as entry points, and was a GPL interface (so that ROM's
>> had to be GPL)? Then it's just a kernel module that's hidden in the
>> option ROM space and has a C interface.
>
> Yeah, point is the interface is normal C API, and has the similar free
> form that normal kernel API's have.

i think this sounds very sane, and an OS-specific interface shim gets
around problems such as finding CPU-specific state -- we can get at
smp_processor_id() just the same as the rest of the kernel, for
example. We could extend the concept of the interface shim we already
have -- a set of OS-specific high performance shims, plus a fallback
OS-agnostic shim.

-- Keir

2006-03-23 11:43:22

by Joshua LeVasseur

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

On Mar 23, 2006, at 02:06 , Chris Wright wrote:

> * Zachary Amsden wrote:
>> No, you don't need to dream up all the possible interface bits ahead
>> of time. With a la carte interfaces, you can take what you need now,
>> and add features later. You don't need an ABI for features. You need
>> it for compatibility. You will need to update the hypervisor ABI. And
>> you can't force people to upgrade their kernels.
>
> How do you support an interface that's not already a part of the ABI
> w/out changing the kernel?

Since the base ABI primarily consists of the x86's privileged
instruction set (actually, the virtualization-sensitive instructions,
and padded with NOP instructions), any ROM can work from there, and
you don't have to worry about updating Linux to use a new ABI. If
you use a new ROM+ABI with an old kernel+ABI, they can fall back to
the base ABI. Note that this base ABI isn't arbitrary; it wasn't
pulled out of thin air; it is mostly the x86 system ISA.

If an updated hypervisor offers new features that didn't exist when a
particular Linux kernel was written and compiled, a new ROM has a
very good chance of activating those new features, even if only using
the base ABI of the older Linux kernel. The ROM is very versatile
because it maps the low-level instructions to high-level hypervisor
concepts. And it is very successful: I have built a Linux 2.6.9
binary and executed it on Xen 2.0.2, Xen 2.0.7, and Xen 3.0.1; I have
also built a Linux 2.6.12.6 binary and executed it on Xen 2.0.2, Xen
2.0.7, and Xen 3.0.1. This is significant because XenoLinux 2.6.9
shipped with Xen 2.0.2 and it doesn't work on Xen 3.0.1 due to many
interface updates; likewise XenoLinux 2.6.12.6 shipped with Xen 3.0.1
and it doesn't work on the older Xen 2 hypervisors; but the ROM hid
the interface updates from Xen 2 series to the Xen 3 series, and it
takes advantage of the new Xen 3 interfaces (it must since Xen 3
doesn't have a Xen 2 compatibility layer that I'm aware of).

The ROM's interface mapping solves two problems: it converts the x86
low-level instructions into high-performance hypervisor operations,
and it maps the low-level instructions into the hypervisor's evolving
interface. The ROM gives great independence for hypervisor
developers, or in other words, permits proliferation of hypervisors,
and freedom to experiment with interfaces (e.g., real time, or formal
verification).

Joshua

2006-03-23 18:51:01

by Zachary Amsden

Subject: Re: [Xen-devel] Re: [RFC, PATCH 5/24] i386 Vmi code patching

Keir Fraser wrote:
>>
>> Yeah, point is the interface is normal C API, and has the similar free
>> form that normal kernel API's have.
>
> i think this sounds very sane, and an OS-specific interface shim gets
> around problems such as finding CPU-specific state -- we can get at
> smp_processor_id() just the same as the rest of the kernel, for
> example. We could extend the concept of the interface shim we already
> have -- a set of OS-specific high performance shims, plus a fallback
> OS-agnostic shim.

Getting at smp_processor_id() is exactly the type of thing you _don't_
want to do. You really can't have callbacks into the guest in the
hypervisor platform layer. It really is not efficient, and you cause
yourself more trouble than it is worth.

And where exactly is smp_processor_id() exported to modules? It's not.
You've just locked your module into the current kernel's idea of how to
get at smp_processor_id(). It changes based on compilation options of
the kernel - for example, it is different with 4K stacks. It has
changed with a number of other options in the past.

The fact that XenoLinux needs smp_processor_id() at all is quite
ludicrous. To disable interrupts, which is used fairly commonly to
disable pre-emption as well, what does XenoLinux have to do?

It has to disable pre-emption to call smp_processor_id() so that it can
disable interrupts, then re-enable preemption so that it can disable
pre-emption.

That is truly convoluted, and is exactly why you should never get into
these types of situations to begin with.

Zach

2006-03-23 23:46:00

by Eli Collins

Subject: Re: [Xen-devel] Re: [RFC, PATCH 5/24] i386 Vmi code patching

Keir Fraser wrote:
> We could extend the concept of the interface shim we already have -- a
> set of OS-specific high performance shims, plus a fallback OS-agnostic
> shim.

Currently the lack of a shim is the key difference between the VMI and
Xen approaches. Forgive me for summarizing, but I'm not sure it's been
made clear. The VMI is the interface between the OS and a shim layer--it
is not a hypervisor interface. The kernel makes VMI calls to the shim
and the shim makes hypercalls, if needed, to the hypervisor.

VMI                     VMI native              Xen/Xen native

OS                      OS                      OS
-------------- VMI      -------------- VMI
Shim (ROM)
-------------- HV API                           -------------- HV API
Hypervisor              Native HW               Hypervisor

The VMI isolates the kernel from the hypervisor so that the kernel and
the hypervisor can evolve w/o hindering each other's development. The
Xen approach still tightly couples the hypervisor with the kernel.
Coupling the kernel and hypervisor together restricts their evolution
and constrains people who want to run different operating systems (or
different versions of the same OS) on the same hypervisor. As Josh pointed out,
you can run a single VMI Linux kernel on more versions of the Xen
hypervisor than you can using a single XenLinux kernel because the VMI
does not require a tight coupling.

Tight coupling also means you end up using a hypervisor when running a
kernel natively (e.g. "supervisor mode kernel" in the unstable Xen
repository). So for the native case you get a level of indirection (the
hypervisor) that costs you performance, and for the virtual case you do
not get a level of indirection (a shim) that buys you compatibility and
diversity. For VMI, it's the reverse: you get the level of indirection
in the virtual case and no indirection in the native case. The alternative
is to maintain separate kernels for the two cases, with all the associated costs.

There are many places where the VMI and Xen patches overlap; the key
difference is that the VMI makes a distinction between the kernel and
the hypervisor interfaces. As others have pointed out this distinction
buys you a lot in terms of compatibility, ease of maintenance, and the
ability to execute the same kernel in native and virtual environments
with high performance.

Which particular bits get in is less important than the decision of
whether or not the Linux community wants the kernel tightly coupled to
the hypervisor. Extending the hypercall page you already have to
decouple the hypervisor and kernel interfaces would be excellent.

Eli

2006-03-28 00:55:31

by Chuck Ebbert

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

On Wed, 22 Mar 2006 21:15:44 +0100, Andi Kleen wrote:

> On Monday 13 March 2006 19:02, Zachary Amsden wrote:
> > The VMI ROM detection and code patching mechanism is illustrated in
> > setup.c. The ROM is a binary block published by the hypervisor, and
> > there are certainly implications of this. ROMs certainly have a
> > history of being proprietary, very differently licensed pieces of
> > software, and mostly under non-free licenses. Before jumping to the
> > conclusion that this is a bad thing, let us consider more carefully
> > why hiding the interface layer to the hypervisor is actually a good
> > thing.
>
> How about you fix all these issues you describe here first
> and then submit it again?
>
> The disassembly stuff indeed doesn't look like something
> that belongs in the kernel.

I think they put the disassembler in there as a joke. ;)

It's not necessary for fixing up the call site, anyway. Something like
this should work, assuming there is always a call in every vmi
translation.

 /* Now, measure and emit the vmi translation sequence */
 #define vmi_translation_start \
     .pushsection .vmi.translation,"ax"; \
     781:;
 #define vmi_translation_finish \
-    782:; \
+    783:; \
     .popsection;
 #define vmi_translation_begin 781b
-#define vmi_translation_end 782b
+#define vmi_call_location 782b
+#define vmi_translation_end 783b
 #define vmi_translation_len (vmi_translation_end - vmi_translation_begin)
+#define vmi_call_offset (vmi_call_location - vmi_translation_begin)

 #define vmi_call(name) \
-    call .+5+name
+    782: call .+5+name

 #define vmi_annotate(name) \
     .pushsection .vmi.annotation,"a"; \
     .align 4; \
     .long name; \
     .long vmi_padded_begin; \
     .long vmi_translation_begin; \
     .byte vmi_padded_len; \
     .byte vmi_translation_len; \
     .byte vmi_pad_total; \
-    .byte 0; \
+    .byte vmi_call_offset; \
     .popsection;

 struct vmi_annotation {
     unsigned long vmi_call;
     unsigned char *nativeEIP;
     unsigned char *translationEIP;
     unsigned char native_size;
     unsigned char translation_size;
     char nop_size;
-    unsigned char pad;
+    unsigned char call_offset;
 };

 static void fixup_translation(struct vmi_annotation *a)
 {
     unsigned char *c, *start, *end;
     int left;

     memcpy(a->nativeEIP, a->translationEIP, a->translation_size);
+    patch_call_site(a, a->nativeEIP + a->call_offset);
-    start = a->nativeEIP;
-    end = a->nativeEIP + a->translation_size;
-
-    for (c = start; c < end;) {
-        switch(*c) {
-        case MNEM_CALL_NEAR:
-            patch_call_site(a, c);
-            c+=5;
-            break;
-
-        case MNEM_PUSH_I:
-            c+=5;
-            break;
-
-        case MNEM_PUSH_IB:
-            c+=2;
-            break;
-
-        case MNEM_PUSH_EAX:
-        case MNEM_PUSH_ECX:
-        case MNEM_PUSH_EDX:
-        case MNEM_PUSH_EBX:
-        case MNEM_PUSH_EBP:
-        case MNEM_PUSH_ESI:
-        case MNEM_PUSH_EDI:
-            c+=1;
-            break;
-
-        case MNEM_LEA:
-            BUG_ON(*(c+1) != 0x64); /* [--][--]+disp8, %esp */
-            BUG_ON(*(c+2) != 0x24); /* none + %esp */
-            c+=4;
-            break;
-
-        default:
-            /*
-             * Don't printk - it may acquire spinlocks with
-             * partially completed VMI translations, causing
-             * nuclear meltdown of the core.
-             */
-            BUG();
-            return;
-        }
-    }

--
Chuck
"Penguins don't come from next door, they come from the Antarctic!"

2006-03-28 01:50:56

by Zachary Amsden

Subject: Re: [RFC, PATCH 5/24] i386 Vmi code patching

Chuck Ebbert wrote:
> On Wed, 22 Mar 2006 21:15:44 +0100, Andi Kleen wrote:
>
>
>> On Monday 13 March 2006 19:02, Zachary Amsden wrote:
>>
>>> The VMI ROM detection and code patching mechanism is illustrated in
>>> setup.c. The ROM is a binary block published by the hypervisor, and
>>> there are certainly implications of this. ROMs certainly have a
>>> history of being proprietary, very differently licensed pieces of
>>> software, and mostly under non-free licenses. Before jumping to the
>>> conclusion that this is a bad thing, let us consider more carefully
>>> why hiding the interface layer to the hypervisor is actually a good
>>> thing.
>>>
>> How about you fix all these issues you describe here first
>> and then submit it again?
>>
>> The disassembly stuff indeed doesn't look like something
>> that belongs in the kernel.
>>
>
> I think they put the disassembler in there as a joke. ;)
>
> It's not necessary for fixing up the call site, anyway. Something like
> this should work, assuming there is always a call in every vmi
> translation.
>

Very good observation. The code you illustrate is roughly what the
patch mechanism used to do. The disassembly serves one purpose, which
is incomplete in our current implementation. It provides live register
information and constant propagation that can be used by an inliner in
the VMI layer to inline hypervisor calls. This information gets encoded
naturally in the push sequence before the call instruction.

But considering it is currently incomplete, and most of the benefit can
be had by special casing just 4 VMI calls to allow selective inlining,
it does seem like a lot of complexity for little payoff. I really don't
like special casing, but in this scenario, it does seem to make sense.
And of the 4 VMI calls, only one (SetInterruptMask) takes an input, and
it takes that input in a fixed register.

Based on suggestions from Anthony Liguori and others, we are going to
drop this extra complexity - we can get to hypervisor inlining later.
Right now having the simplest and cleanest interface is more important.
Actually, adding the required padding for inlining is quite easy:

/* Leading ';' keeps the padding separate from the preceding instruction. */
#define FILL(n) "; .fill " #n "-1,1,0x66; nop"

static inline void local_irq_restore(const unsigned long flags)
{
        vmi_wrap_call(
                SetInterruptMask, "pushl %0; popfl" FILL(6),
                VMI_NO_OUTPUT,
                1, VMI_IREG1 (flags),
                XCONC("cc", "memory"));
}

Now you have 8 bytes to overwrite, which is sufficient for byte
arithmetic operations to a memory address, plus a segment override.
This seems like a much simpler solution than run-time disassembly.

> - /*
> - * Don't printk - it may acquire spinlocks with
> - * partially completed VMI translations, causing
> - * nuclear meltdown of the core.
> - */
> - BUG();
> - return;
> - }
> - }
>

This part was a joke ;)