2006-03-14 00:00:18

by Zachary Amsden

Subject: Re: [RFC, PATCH 1/24] i386 Vmi documentation

Chris Wright wrote:

Hi Chris, thank you for your comments. I've tried to answer as much as
I can - hopefully I've caught all of your questions.

>> + guest operating systems. In the future, we envision that additional
>> + higher level abstractions will be added as an adjunct to the
>> + low-level API. These higher level abstractions will target large
>> + bulk operations such as creation, and destruction of address spaces,
>> + context switches, thread creation and control.
>>
>
> This is an area where in the past VMI hasn't been well-suited to support
> Xen. It's the higher level abstractions which make the performance
> story of paravirt compelling. I haven't made it through the whole
> patchset yet, but the bits you mention above as work to be done are
> certainly important to good performance.
>

For example, multicalls, which we support, and batched page table
operations, which we support, and vendor designed virtual devices, which
we support. What is unclear to me is why you need to keep pushing
higher up the stack to get more performance. If you could have any
higher level hypercall you wanted, what would it be? Most people say -
fork() / exec(). But why? You've just radically changed the way the
guest must operate its MMU, and you've radically constrained the way
page tables and memory management structures must be laid out by
putting a ton of commonality in their infrastructure that is shared by
the hypervisor and the kernel. You've likely vastly complicated the
design of a virtualized kernel that still runs on native hardware. But
what can you truly gain that you cannot gain from a simpler, less
complicated interface that just says -

Ok, I'm about to update a whole bunch of page tables.
Ok, I'm done and I might want to use them now. Please make sure the
hardware TLB will be in sync.

Pushing up the stack with a higher level API is a serious consideration,
but only if you can show serious results from it. I'm not convinced
that you can actually home in on anything /that isn't already a
performance problem on native kernels/. Consider, for example, that we
don't actually support remote TLB shootdown IPIs via VMI calls. Why is
this a performance problem? Well, very likely, those IPI shootdowns are
going to be synchronous. And if you don't co-schedule the CPUs in your
virtual machine, you might just have issued synchronous IPIs to VCPUs
that aren't even running. A serious performance problem.

Is it? Or is it really just another case where the _native_ kernel can
be even more clever, and avoid doing those IPI shootdowns in the
first place? I've watched IPI shootdown in Linux get drastically better
in the 2.6 series of kernels, and see (anecdotally) maybe 4
or 5 of them in the course of a kernel compile. There is no longer a
giant performance boon to be gained here.

Similarly, you can almost argue the same thing with spinlocks - if you
really are seeing performance issues because of the wakeup of a
descheduled remote VCPU, maybe you really need to think about moving that
lock off a hot path or using a better, lock-free synchronization method.

I'm not arguing against these features - in fact, I think they can be
done in a way that doesn't intrude too much inside of the kernel. After
all, locks and IPIs tend to be part of the lower layer architecture
anyways. And they definitely do win back some of the background noise
introduced by virtualization. But if you decide to make the interface
more complicated, you really need to have an accurate measure of exactly
what you can gain by it to justify that complexity.

Personally, I'm all for making lock primitives and shootdowns an
_optional_ extension to the interface. As with many other relatively
straightforward and non-intrusive changes. I know some of you will
disagree with me, but I think a lot of what is being referred to as
"higher level" paravirtualization is really an attempt to solve
pre-existing problems in the performance of the underlying system.

There are advanced and useful things you can do with higher level
paravirtualization, but I am not convinced at all that incredible
performance gain is one of them.

> We do not want an interface which slows down the pace. We work with
> source and drop cruft as quickly as possible (referring to internal
> changes, not user-visible ABI changes here). Making changes that
> require a new guest for some significant performance gain is perfectly
> reasonable. What we want to avoid is making changes that require a
> new guest to simply boot. This is akin to rev'ing hardware w/out any
> backwards compatibility. This goal doesn't require VMI and ROMs, but
> I agree it requires clear interface definitions.
>

This is why we provide the minor / major interface numbers. Bump the
minor number, you get a new feature. Bump the required minor version in
the guest when it relies on that feature. Bump the major number when
you break compatibility. More on this below.
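
In code, the check a guest might make looks something like this (a minimal
sketch; the macro names and values are made up for illustration):

    #define GUEST_REQUIRED_MAJOR   3    /* assumed: major version this guest was built against */
    #define MINOR_FOR_NEW_FEATURE  2    /* assumed: minor version that introduced the feature */

    static int vmi_feature_usable(unsigned int rom_major, unsigned int rom_minor)
    {
            if (rom_major != GUEST_REQUIRED_MAJOR)
                    return 0;       /* major bump: compatibility was broken */
            return rom_minor >= MINOR_FOR_NEW_FEATURE;  /* minor bump: feature is present */
    }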

>
>> + VMI_DeliverInterrupts (For future debate)
>> +
>> + Enable and deliver any pending interrupts. This would remove
>> + the implicit delivery semantic from the SetInterruptMask and
>> + EnableInterrupts calls.
>>
>
> How do you keep forwards and backwards compat here? A guest that's
> coded to the simple implicit version would never get interrupts
> delivered on a newer ROM?
>

This isn't part of the interface. If it were to be included, you could
do one of two things: bump the minor version and add enable / restore
interrupt calls without the delivery semantic, or bump the major version
and drop the delivery semantic from the originals.

I agree this is pretty clumsy. Expect to see more discussion about
using annotations to expand the interface without breaking binary
compatibility, as well as providing more advanced feature control. I
wanted to integrate more advanced feature control / probing into this
version of the VMI, but there are so many possible ways to do it that it
would be much nicer to get feedback from the community on what is the
best interface.

>
>> + CPU CONTROL CALLS
>> +
>> + These calls encapsulate the set of privileged instructions used to
>> + manipulate the CPU control state. These instructions are all properly
>> + virtualizable using trap and emulate, but for performance reasons, a
>> + direct call may be more efficient. With hardware virtualization
>> + capabilities, many of these calls can be left as IDENT translations, that
>> + is, inline implementations of the native instructions, which are not
>> + rewritten by the hypervisor. Some of these calls are performance critical
>> + during context switch paths, and some are not, but they are all included
>> + for completeness, with the exceptions of the obsoleted LMSW and SMSW
>> + instructions.
>>
>
> Included just for completeness can be the beginning of API bloat.
>

The design impact of this bloat is zero - if you don't want to implement
virtual methods for, say, debug register access, then you don't need to
do anything. You trap and emulate by default. If, on the other hand,
you do want to hook them, you are welcome to. The hypervisor is free to
choose the design costs that are appropriate for its usage scenarios,
as is the kernel - it's not in the spec, but it is certainly open for
debate whether certain classes of instructions such as these need to be
converted to VMI calls at all. We did implement all of these in Linux
for performance and symmetry.
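
To make "trap and emulate by default" concrete, here is roughly what an
unhooked accessor looks like on the guest side (illustrative only, not
taken from the spec):

    static inline unsigned long native_get_dr7(void)
    {
            unsigned long val;

            /* Unhooked call site: just the native instruction.  In a
             * deprivileged guest this faults and the hypervisor emulates it. */
            asm volatile("mov %%db7, %0" : "=r" (val));
            return val;
    }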

>
> clts, setcr0, readcr0 are interrelated for typical use. is it expected
> the hypervisor uses a consistent register (either native or shadowed)
> here, or is it meant to be undefined?
>

CLTS allows the elimination of an extra GetCR0 call, and they all
operate on the same (shadowed) register.
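
As a sketch of why the dedicated call matters (the wrapper names below
are hypothetical, not the spec):

    #define X86_CR0_TS 0x00000008

    extern void vmi_clts(void);
    extern unsigned long vmi_get_cr0(void);
    extern void vmi_set_cr0(unsigned long cr0);

    static void fpu_restore_begin_with_clts(void)
    {
            vmi_clts();                     /* one call: clear TS in the shadowed CR0 */
    }

    static void fpu_restore_begin_without_clts(void)
    {
            unsigned long cr0 = vmi_get_cr0();      /* extra call just to read CR0 */

            vmi_set_cr0(cr0 & ~X86_CR0_TS);
    }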

> Many of these will look the same on x86-64, but the API is not
> 64-bit clean so has to be duplicated.
>

Yes, register pressure forces the PAE API to be slightly different from
the long mode API. But long mode has different register calling
conventions anyway, so it is not a big deal. The important thing is,
once the MMU mess is sorted out, the same interface can be used from C
code for both platforms, and the details about which lock primitives are
used can be hidden. The cost of which lock primitives to use differs
between 32-bit and 64-bit platforms, across vendors, and with the style of
the hypervisor implementation (direct / writable / shadowed page tables).
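
One plausible reading of the lock primitive point, sketched below (my
interpretation, not from the spec): updating a live 64-bit PAE PTE from
32-bit code wants a locked cmpxchg8b-style primitive, while long mode
can use a plain store.

    #include <stdint.h>

    typedef uint64_t pte_t;

    /* 32-bit PAE: the PTE is wider than a native store, so an in-place
     * update of a live entry uses a locked compare-and-swap primitive. */
    static inline void set_pte_pae(volatile pte_t *ptep, pte_t val)
    {
            pte_t old = *ptep;

            while (!__sync_bool_compare_and_swap(ptep, old, val))
                    old = *ptep;
    }

    /* Long mode: a single 64-bit store is already atomic. */
    static inline void set_pte_64(volatile pte_t *ptep, pte_t val)
    {
            *ptep = val;
    }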

>
>
>> + 85) VMI_SetDeferredMode
>>
>
> Is this the batching, multi-call analog?
>

Yes. This interface needs to be documented in a much better fashion.
But the idea is that VMI calls are mapped into Xen multicalls by
allowing deferred completion of certain classes of operations. That
same mode of deferred operation is used to batch PTE updates in our
implementation (although Xen uses writable page tables now, this used to
provide the same support facility in Xen as well). To complement this,
there is an explicit flush - and it turns out this maps very nicely,
getting rid of a lot of the XenoLinux changes around mmu_context.h.
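
A rough sketch of the pattern from the guest's point of view (the
wrapper names are hypothetical, not the published interface):

    extern void vmi_set_deferred_mode(int on);
    extern void vmi_flush_deferred_calls(void);
    extern void vmi_set_pte(unsigned long *ptep, unsigned long pteval);

    static void remap_region(unsigned long *ptep, const unsigned long *newvals, int n)
    {
            int i;

            vmi_set_deferred_mode(1);       /* subsequent MMU calls may be queued */
            for (i = 0; i < n; i++)
                    vmi_set_pte(&ptep[i], newvals[i]);
            vmi_set_deferred_mode(0);
            vmi_flush_deferred_calls();     /* apply the batch; the hardware TLB view is now consistent */
    }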

>> +
>> + VMI_CYCLES 64 bit unsigned integer
>> + VMI_NANOSECS 64 bit unsigned integer
>>
>
> All caps typedefs are not very popular w.r.t. CodingStyle.
>

We know this. This is not a Linux interface. This is the API
documentation, meant to be considerably different in style. Where this
ugliness has crept into our Linux patches, I have been steadily removing
it and making them look nicer. But the vast difference in the style of
the doc is to avoid namespace collision.

>
>> + #define VMICALL __attribute__((regparm(3)))
>>
>
> I understand it's for ABI documentation, but in Linux it's FASTCALL.
>

Actually, FASTCALL is regparm(2), I think.
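
For reference, what the quoted attribute means in practice (the function
name below is made up):

    #define VMICALL __attribute__((regparm(3)))

    /* With regparm(3) on i386, gcc passes the first three integer
     * arguments in %eax, %edx and %ecx instead of on the stack. */
    VMICALL void vmi_example_op(unsigned int arg0, unsigned int arg1, unsigned int arg2);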

Cheers,

Zach


2006-03-14 21:23:11

by Chris Wright

Subject: Re: [RFC, PATCH 1/24] i386 Vmi documentation

* Zachary Amsden ([email protected]) wrote:
> Pushing up the stack with a higher level API is a serious consideration,
> but only if you can show serious results from it. I'm not convinced
> that you can actually home in on anything /that isn't already a
> performance problem on native kernels/. Consider, for example, that we
> don't actually support remote TLB shootdown IPIs via VMI calls. Why is
> this a performance problem? Well, very likely, those IPI shootdowns are
> going to be synchronous. And if you don't co-schedule the CPUs in your
> virtual machine, you might just have issued synchronous IPIs to VCPUs
> that aren't even running. A serious performance problem.
>
> Is it? Or is it really just another case where the _native_ kernel can
> be even more clever, and avoid doing those IPI shootdowns in the
> first place? I've watched IPI shootdown in Linux get drastically better
> in the 2.6 series of kernels, and see (anecdotally) maybe 4
> or 5 of them in the course of a kernel compile. There is no longer a
> giant performance boon to be gained here.
>
> Similarly, you can almost argue the same thing with spinlocks - if you
> really are seeing performance issues because of the wakeup of a
> descheduled remote VCPU, maybe you really need to think about moving that
> lock off a hot path or using a better, lock-free synchronization method.
>
> I'm not arguing against these features - in fact, I think they can be
> done in a way that doesn't intrude too much inside of the kernel. After
> all, locks and IPIs tend to be part of the lower layer architecture
> anyways. And they definitely do win back some of the background noise
> introduced by virtualization. But if you decide to make the interface
> more complicated, you really need to have an accurate measure of exactly
> what you can gain by it to justify that complexity.

Yes, I completely agree. Without specific performance numbers it's just
hand waving. To make it more concrete, I'll work on a compare/contrast
of the interfaces so we have specifics to discuss.

> >Included just for completeness can be the beginning of API bloat.
>
> The design impact of this bloat is zero - if you don't want to implement
> virtual methods for, say, debug register access, then you don't need to
> do anything. You trap and emulate by default. If, on the other hand,
> you do want to hook them, you are welcome to. The hypervisor is free to
> choose the design costs that are appropriate for its usage scenarios,
> as is the kernel - it's not in the spec, but it is certainly open for
> debate whether certain classes of instructions such as these need to be
> converted to VMI calls at all. We did implement all of these in Linux
> for performance and symmetry.

Yup. Just noting that an API without clear users is the type of thing
that is regularly rejected from Linux.

> >Many of these will look the same on x86-64, but the API is not
> >64-bit clean so has to be duplicated.
>
> Yes, register pressure forces the PAE API to be slightly different from
> the long mode API. But long mode has different register calling
> conventions anyway, so it is not a big deal. The important thing is,
> once the MMU mess is sorted out, the same interface can be used from C
> code for both platforms, and the details about which lock primitives are
> used can be hidden. The cost of which lock primitives to use differs
> between 32-bit and 64-bit platforms, across vendors, and with the style of
> the hypervisor implementation (direct / writable / shadowed page tables).

My mistake, it makes perfect sense from an ABI point of view.

> >Is this the batching, multi-call analog?
>
> Yes. This interface needs to be documented in a much better fashion.
> But the idea is that VMI calls are mapped into Xen multicalls by
> allowing deferred completion of certain classes of operations. That
> same mode of deferred operation is used to batch PTE updates in our
> implementation (although Xen uses writable page tables now, this used to
> provide the same support facility in Xen as well). To complement this,
> there is an explicit flush - and it turns out this maps very nicely,
> getting rid of a lot of the XenoLinux changes around mmu_context.h.

Are these valid differences? Or did I misunderstand the batching
mechanism?

1) can't use stack-based args, so each data structure has to be allocated
somewhere, which could conceivably fail unless it's some fixed buffer.

2) complicates the rom implementation slightly: each deferrable part of the
API needs a switch (am I deferred or not?) to either build the batch or
make the direct hypercall (see the sketch below).

3) flushing in SMP: have to be careful to manage simultaneous defers and
flushes from potentially multiple cpus in the guest.

Doesn't seem these are showstoppers, just differences worth noting.
There aren't as many multicalls left in Xen these days anyway.
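
To illustrate (2), here is a rough sketch of what the ROM-side switch
might look like - made-up names, with a fixed per-VCPU buffer so that
queueing itself cannot fail, which also touches on (1):

    struct pte_update {
            unsigned long long ptr;     /* machine address of the PTE */
            unsigned long long val;     /* new PTE value */
    };

    struct vmi_vcpu_state {
            int               deferred; /* toggled by the SetDeferredMode entry point */
            int               n_queued;
            struct pte_update queue[64];
    };

    /* stands in for the direct hypercall path */
    extern void hypercall_mmu_update(const struct pte_update *u, int count);

    static void rom_set_pte(struct vmi_vcpu_state *cpu,
                            unsigned long long ptep_ma, unsigned long long val)
    {
            if (cpu->deferred && cpu->n_queued < 64) {
                    cpu->queue[cpu->n_queued].ptr = ptep_ma;
                    cpu->queue[cpu->n_queued].val = val;
                    cpu->n_queued++;
            } else {
                    struct pte_update u = { .ptr = ptep_ma, .val = val };
                    hypercall_mmu_update(&u, 1);    /* immediate path */
            }
    }

    /* The flush entry point drains the queue with one batched hypercall. */
    static void rom_flush_deferred(struct vmi_vcpu_state *cpu)
    {
            if (cpu->n_queued) {
                    hypercall_mmu_update(cpu->queue, cpu->n_queued);
                    cpu->n_queued = 0;
            }
    }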

thanks,
-chris

2006-03-16 01:11:49

by Chris Wright

Subject: Re: [RFC, PATCH 1/24] i386 Vmi documentation

* Chris Wright ([email protected]) wrote:
> Yes, I completely agree. Without specific performance numbers it's just
> hand waving. To make it more concrete, I'll work on a compare/contrast
> of the interfaces so we have specifics to discuss.


Here's a comparison of the APIs. In some cases there are trivial 1-to-1
mappings, and in other cases there's really no mapping. The mapping is
(loosely) annotated below each interface as [ VMI_foo(*) ]. The trailing
asterisk notes that the API maps at a high level, but the details may
make the mapping difficult (details such as VA vs. MFN, for example).
Thanks to Christian for doing the bulk of this comparison.

PROCESSOR STATE CALLS

- shared_info->vcpu_info[]->evtchn_upcall_mask

Enable/Disable interrupts and query whether interrupts are enabled or
disabled.

[ VMI_DisableInterrupts, VMI_EnableInterrupts, VMI_GetInterruptMask,
VMI_SetInterruptMask ]


- shared_info->vcpu_info[]->evtchn_upcall_pending

Query if an interrupt is pending

[ ]

- force_evtchn_callback = HYPERVISOR_xen_version(0, NULL)

Deliver pending interrupts.

[ VMI_DeliverInterrupts ]

(EVENT CHANNEL, virtual interrupts)
- HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound, ...)

Allocate a port in domain <dom> and mark as accepting interdomain
bindings from domain <remote_dom>. A fresh port is allocated in <dom>
and returned as <port>.

[ ]

- HYPERVISOR_event_channel_op(EVTCHNOP_bind_interdomain, ...)

Construct an interdomain event channel between the calling domain and
<remote_dom>. <remote_dom,remote_port> must identify a port that is
unbound and marked as accepting bindings from the calling domain. A fresh
port is allocated in the calling domain and returned as <local_port>.

[ ]

- HYPERVISOR_event_channel_op(EVTCHNOP_bind_virq, ...)

Bind a local event channel to VIRQ <irq> on specified vcpu.

[ ]

- HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, ...)

Bind a local event channel to PIRQ <irq>.

[ PIC programming* ]

- HYPERVISOR_event_channel_op(EVTCHNOP_bind_ipi, ...)

Bind a local event channel to receive events.

[ ]

- HYPERVISOR_event_channel_op(EVTCHNOP_close, ...)

Close a local event channel <port>. If the channel is interdomain then
the remote end is placed in the unbound state (EVTCHNSTAT_unbound),
awaiting a new connection.

[ ]

- HYPERVISOR_event_channel_op(EVTCHNOP_send, ...)

Send an event to the remote end of the channel whose local endpoint is <port>.

[ ]

- HYPERVISOR_event_channel_op(EVTCHNOP_status, ...)

Get the current status of the communication channel which has an endpoint
at <dom, port>.

[ ]

- HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, ...)

Specify which vcpu a channel should notify when an event is pending.

[ ]

- HYPERVISOR_event_channel_op(EVTCHNOP_unmask, ...)

Unmask the specified local event-channel port and deliver a notification
to the appropriate VCPU if an event is pending.

[ ]

- HYPERVISOR_sched_op(SCHEDOP_yield, ...)

Voluntarily yield the CPU.

[ VMI_Pause ]


- HYPERVISOR_sched_op(SCHEDOP_block, ...)

Block execution of this VCPU until an event is received for processing.
If called with event upcalls masked, this operation will atomically
reenable event delivery and check for pending events before blocking the
VCPU. This avoids a "wakeup waiting" race.

Periodic timer interrupts are not delivered while the guest is blocked,
except for explicit timer events set up with HYPERVISOR_set_timer_op.

[ VMI_Halt ]


- HYPERVISOR_sched_op(SCHEDOP_shutdown, ...)

Halt execution of this domain (all VCPUs) and notify the system controller.

[ VMI_Shutdown, VMI_Reboot ]

- HYPERVISOR_sched_op(SCHEDOP_shutdown, SHUTDOWN_suspend, ...)

Clean up, save suspend info, kill

[ ]

- HYPERVISOR_sched_op_new(SCHEDOP_poll, ...)

Poll a set of event-channel ports. Return when one or more are pending. An
optional timeout may be specified.

[ ]


- HYPERVISOR_vcpu_op(VCPUOP_initialise, ...)

Initialise a VCPU. Each VCPU can be initialised only once. A
newly-initialised VCPU will not run until it is brought up by VCPUOP_up.

[ VMI_SetInitialAPState ]


- HYPERVISOR_vcpu_op(VCPUOP_up, ...)

Bring up a VCPU. This makes the VCPU runnable. This operation will fail
if the VCPU has not been initialised (VCPUOP_initialise).

[ ]


- HYPERVISOR_vcpu_op(VCPUOP_down, ...)

Bring down a VCPU (i.e., make it non-runnable).
There are a few caveats that callers should observe:
1. This operation may return, and VCPU_is_up may return false, before the
VCPU stops running (i.e., the command is asynchronous). It is a good
idea to ensure that the VCPU has entered a non-critical loop before
bringing it down. Alternatively, this operation is guaranteed
synchronous if invoked by the VCPU itself.
2. After a VCPU is initialised, there is currently no way to drop all its
references to domain memory. Even a VCPU that is down still holds
memory references via its pagetable base pointer and GDT. It is good
practise to move a VCPU onto an 'idle' or default page table, LDT and
GDT before bringing it down.

[ ]

- HYPERVISOR_vcpu_op(VCPUOP_is_up, ...)

Returns 1 if the given VCPU is up.

[ ]

- HYPERVISOR_vcpu_op(VCPUOP_get_runstate_info, ...)

Return information about the state and running time of a VCPU.

[ ]

- HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, ...)

Register a shared memory area from which the guest may obtain its own
runstate information without needing to execute a hypercall.
Notes:
1. The registered address may be virtual or physical, depending on the
platform. The virtual address should be registered on x86 systems.
2. Only one shared area may be registered per VCPU. The shared area is
updated by the hypervisor each time the VCPU is scheduled. Thus
runstate.state will always be RUNSTATE_running and
runstate.state_entry_time will indicate the system time at which the
VCPU was last scheduled to run.

[ ]


DESCRIPTOR RELATED CALLS

- HYPERVISOR_set_gdt(unsigned long *frame_list, int entries)

Load the global descriptor table.

For non-shadow-translate mode guests, the frame_list is a list of
machine pages which contain the gdt.

[ VMI_SetGDT* ]


- HYPERVISOR_set_trap_table(struct trap_info *table)

Load the interrupt descriptor table.

The trap table is in a format which allows easier access from C code.
It's easier to build and easier to use in software trap despatch code.
It can easily be converted into a hardware interrupt descriptor table.

[ VMI_SetIDT, VMI_WriteIDTEntry ]


- HYPERVISOR_mmuext_op(MMUEXT_SET_LDT, ...)

Load local descriptor table.
linear_addr: Linear address of LDT base (NB. must be page-aligned).
nr_ents: Number of entries in LDT.

[ VMI_SetLDT* ]

- HYPERVISOR_update_descriptor(u64 pa, u64 desc)

Write a descriptor to a GDT or LDT.

For non-shadow-translate mode guests, the address is a machine address.

[ VMI_WriteGDTEntry*, VMI_WriteLDTEntry* ]



CPU CONTROL CALLS

- HYPERVISOR_mmuext_op(MMUEXT_NEW_BASEPTR, ...)

Write cr3 register.

[ VMI_SetCR3* ]

- shared_info->vcpu_info[]->arch->cr2

Read cr2 register.

[ VMI_GetCR2 ]

- HYPERVISOR_fpu_taskswitch(0)

Clear the taskswitch flag in control register 0.

[ VMI_CLTS ]

- HYPERVISOR_fpu_taskswitch(1)

Set the taskswitch flag in control register 0.

[ VMI_SetCR0* ]

- HYPERVISOR_set_debugreg(int reg, unsigned long value)

Write debug register.

[ VMI_SetDR ]

- HYPERVISOR_get_debugreg(int reg)

Read debug register.

[ VMI_GetDR ]



PROCESSOR INFORMATION CALLS



STACK / PRIVILEGE TRANSITION CALLS

- HYPERVISOR_stack_switch(unsigned long ss, unsigned long esp)

Set the ring1 stack pointer/segment to use when switching to ring1
from ring3.

[ VMI_UpdateKernelStack ]


- HYPERVISOR_iret

[ VMI_IRET ]


I/O CALLS

- HYPERVISOR_physdev_op(PHYSDEVOP_SET_IOPL, ...)

Set the IOPL mask.

[ VMI_SetIOPLMask ]

- HYPERVISOR_mmuext_op(MMUEXT_FLUSH_CACHE)

No additional arguments. Writes back and flushes cache contents.
(Can just trap and emulate here).

[ VMI_WBINVD ]

- HYPERVISOR_physdev_op(PHYSDEVOP_IRQ_UNMASK_NOTIFY, ...)

Advertise unmask of physical interrupt to hypervisor.

[ ]

- HYPERVISOR_physdev_op(PHYSDEVOP_IRQ_STATUS_QUERY,...)

Query if physical interrupt needs unmask notify.

[ ]

- HYPERVISOR_physdev_op(PHYSDEVOP_SET_IOBITMAP, ...)

Set IO bitmap for guest.

[ ]

- HYPERVISOR_physdev_op(PHYSDEVOP_APIC_READ, ...)

Read IO-APIC register.

[ ]

- HYPERVISOR_physdev_op(PHYSDEVOP_APIC_WRITE, ...)

Write IO-APIC register.

[ ]

- HYPERVISOR_physdev_op(PHYSDEVOP_ASSIGN_VECTOR, ...)

Assign vector to interrupt.

[ ]


APIC CALLS



TIMER CALLS

- HYPERVISOR_set_timer_op(...)

Set a timeout at which to trigger a timer interrupt even if the guest is blocked.

MMU CALLS

- HYPERVISOR_mmuext_op(MMUEXT_(UN)PIN_*_TABLE, ...)

mfn: Machine frame number to be (un)pinned as a p.t. page.

[ RegisterPageType* ]

- HYPERVISOR_mmuext_op(MMUEXT_TLB_FLUSH_LOCAL)

No additional arguments. Flushes local TLB.

[ VMI_FlushTLB ]

- HYPERVISOR_mmuext_op(MMUEXT_INVLPG_LOCAL)

linear_addr: Linear address to be flushed from the local TLB.

[ VMI_InvalPage ]

- HYPERVISOR_mmuext_op(MMUEXT_TLB_FLUSH_MULTI)

vcpumask: Pointer to bitmap of VCPUs to be flushed.

- HYPERVISOR_mmuext_op(MMUEXT_INVLPG_MULTI)

linear_addr: Linear address to be flushed.
vcpumask: Pointer to bitmap of VCPUs to be flushed.

- HYPERVISOR_mmuext_op(MMUEXT_TLB_FLUSH_ALL)

No additional arguments. Flushes all VCPUs' TLBs.

- HYPERVISOR_mmuext_op(MMUEXT_INVLPG_ALL)

linear_addr: Linear address to be flushed from all VCPUs' TLBs.

- HYPERVISOR_update_va_mapping(...)

Update pagetable entry mapping a given virtual address.
Avoids having to map the pagetable page in the hypervisor by using
a linear pagetable mapping. Also flush the TLB if requested.

[ ]

- HYPERVISOR_mmu_update(MMU_NORMAL_PT_UPDATE, ...)

Update an entry in a page table.

[ VMI_SetPte* ]

- HYPERVISOR_mmu_update(MMU_MACHPHYS_UPDATE, ...)

Update machine -> phys table entry.

[ no machine -> phys in VMI ]

MEMORY

- HYPERVISOR_memory_op(XENMEM_increase_reservation, ...)

Increase number of frames

[ ]

- HYPERVISOR_memory_op(XENMEM_decrease_reservation, ...)

Drop frames from reservation

[ ]

- HYPERVISOR_memory_op(XENMEM_populate_physmap, ...)

[ ]

- HYPERVISOR_memory_op(XENMEM_maximum_ram_page, ...)

Get maximum MFN of mapped RAM in domain

[ ]

- HYPERVISOR_memory_op(XENMEM_current_reservation, ...)

Get current memory reservation (in pages) of domain

[ ]

- HYPERVISOR_memory_op(XENMEM_maximum_reservation, ...)

Get maximum memory reservation (in pages) of domain

[ ]

MISC

- HYPERVISOR_console_io()

read/write to console (privileged)

- HYPERVISOR_xen_version(XENVER_version, NULL)

Return major:minor (16:16).

- HYPERVISOR_xen_version(XENVER_extraversion)

Return extra version (-unstable, .subminor)

- HYPERVISOR_xen_version(XENVER_compile_info)

Return hypervisor compile information.

- HYPERVISOR_xen_version(XENVER_capabilities)

Return list of supported guest interfaces.

- HYPERVISOR_xen_version(XENVER_platform_parameters)

Return information about the platform.

- HYPERVISOR_xen_version(XENVER_get_features)

Return feature maps.


- HYPERVISOR_set_callbacks

Set entry points for upcalls to the guest from the hypervisor.
Used for event delivery and fatal condition notification.


- HYPERVISOR_vm_assist(VMASST_TYPE_4gb_segments)

Enable emulation of wrap around segments.

- HYPERVISOR_vm_assist(VMASST_TYPE_4gb_segments_notify)

Enable notification on wrap around segment event.

- HYPERVISOR_vm_assist(VMASST_TYPE_writable_pagetables)

Enable writable pagetables.


- HYPERVISOR_nmi_op(XENNMI_register_callback)

Register NMI callback for this (calling) VCPU. Currently this only makes
sense for domain 0, vcpu 0. All other callers will be returned EINVAL.

- HYPERVISOR_nmi_op(XENNMI_unregister_callback)

Deregister NMI callback for this (calling) VCPU.


- HYPERVISOR_multicall

Execute batch of hypercalls.

[VMI_SetDeferredMode*, VMI_FlushDeferredCalls*]

There are some more management specific operations for dom0 and security
that are arguably beyond the scope of this comparison.

2006-03-16 03:40:25

by Eli Collins

Subject: Re: [RFC, PATCH 1/24] i386 Vmi documentation

Chris Wright wrote:
> * Chris Wright ([email protected]) wrote:

<snip>

> - HYPERVISOR_event_channel_op(EVTCHNOP_send, ...)
>
> Send an event to the remote end of the channel whose local endpoint is <port>.
>
> [ ]

VMI_APICWrite is used to send IPIs. In general all the event channel
calls (modulo referencing other guests) are not needed when using a
virtual APIC. Using calls rather than a struct shared between the
hypervisor and the guest is a cleaner interface (no messy changes to
entry.S) and easier to maintain and version. This is true of
shared_info_t in general, not just the event channel.
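
As a sketch of what that looks like from the guest side (the wrapper
name and signature are assumed; the register offsets and bit layout are
the standard local APIC ICR):

    #define APIC_ICR        0x300                   /* ICR low: vector and delivery mode */
    #define APIC_ICR2       0x310                   /* ICR high: destination field */
    #define APIC_DM_FIXED   0x00000
    #define SET_APIC_DEST_FIELD(x)  ((x) << 24)

    extern void vmi_apic_write(unsigned long reg, unsigned int value);  /* assumed wrapper */

    static void send_fixed_ipi(unsigned int apicid, unsigned int vector)
    {
            vmi_apic_write(APIC_ICR2, SET_APIC_DEST_FIELD(apicid));
            vmi_apic_write(APIC_ICR, APIC_DM_FIXED | vector);
    }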

>
> - HYPERVISOR_vcpu_op(VCPUOP_get_runstate_info, ...)
>
> Return information about the state and running time of a VCPU.
>
> [ ]

See the VMI timer interface. Note that the runstate interface above was
added recently, after Dan Hecht pointed out the need for properly
paravirtualizing time (reporting stolen time correctly); the Xen 3.0.0/1
interfaces do not include runstate info.

http://lists.xensource.com/archives/html/xen-devel/2006-02/msg00836.html

It's too bad that Xen's vcpu_time_info_t presents the guest with the
variables used to calculate time rather than time itself; requiring that
the guest calculate time complicates the Linux patches and constrains
future changes to time calculation in the hypervisor.
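
To make that concrete, here is roughly the extrapolation a guest has to
do from vcpu_time_info (simplified sketch: the version/seqlock protocol
and the overflow-safe multiply are omitted; rdtsc() is assumed to return
the raw TSC):

    #include <stdint.h>

    struct vcpu_time_info {
            uint32_t version;               /* even when stable; guest must retry around updates */
            uint32_t pad0;
            uint64_t tsc_timestamp;         /* TSC when this structure was last updated */
            uint64_t system_time;           /* ns since boot at that point */
            uint32_t tsc_to_system_mul;     /* TSC -> ns scale factor, 32.32 fixed point */
            int8_t   tsc_shift;
    };

    extern uint64_t rdtsc(void);            /* assumed raw TSC read */

    static uint64_t xen_system_time_ns(const struct vcpu_time_info *t)
    {
            uint64_t delta = rdtsc() - t->tsc_timestamp;

            if (t->tsc_shift >= 0)
                    delta <<= t->tsc_shift;
            else
                    delta >>= -t->tsc_shift;

            /* Real code needs a wider multiply here; truncated for the sketch. */
            return t->system_time + ((delta * t->tsc_to_system_mul) >> 32);
    }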

> - HYPERVISOR_set_trap_table(struct trap_info *table)
>
> Load the interrupt descriptor table.
>
> The trap table is in a format which allows easier access from C code.
> It's easier to build and easier to use in software trap despatch code.
> It can easily be converted into a hardware interrupt descriptor table.
>
> [ VMI_SetIDT, VMI_WriteIDTEntry ]

Passing in trap_info structs (like much of the Xen interface) requires
copying to/from the guest when it's not necessary. To handle VT/Pacifica
Xen needs to understand the hardware table format anyway, so it's
simpler to just use the hardware format.
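
The conversion the quoted text mentions is indeed mechanical - a sketch
for a 32-bit interrupt gate (layouts per the published Xen and Intel
formats; the helper itself is illustrative):

    #include <stdint.h>

    struct trap_info {
            uint8_t  vector;        /* exception/interrupt vector */
            uint8_t  flags;         /* low bits: descriptor privilege level */
            uint16_t cs;            /* code segment selector */
            uint32_t address;       /* handler entry point */
    };

    static uint64_t make_intr_gate(const struct trap_info *ti)
    {
            uint32_t dpl  = ti->flags & 3;
            uint32_t low  = ((uint32_t)ti->cs << 16) | (ti->address & 0xffff);
            uint32_t high = (ti->address & 0xffff0000) |
                            0x8000 |        /* present */
                            (dpl << 13) |   /* privilege level */
                            0x0e00;         /* 32-bit interrupt gate */

            return ((uint64_t)high << 32) | low;
    }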

> - HYPERVISOR_set_timer_op(...)
>
> Set timeout when to trigger timer interrupt even if guest is blocked.

See VMI_SetAlarm and VMI_CancelAlarm.

> - HYPERVISOR_memory_op(XENMEM_increase_reservation, ...)
>
> Increase number of frames
>
> [ ]
>
> - HYPERVISOR_memory_op(XENMEM_decrease_reservation, ...)
>
> Drop frames from reservation
>
> [ ]

Ballooning for VMI guests is currently handled by a driver which uses a
special port in the virtual IO space.

The Xen increase reservation interface would be nicer if it took the
pfns that the guest gave up as an argument (better for this logic to be
in the balloon driver than the hypervisor). Relying on the hypervisor's
allocator to get contiguous pages is also gross. From what I can tell
extent_order is always 0 in XenLinux; an interface that just took a list
of pages would be simpler.

> - HYPERVISOR_xen_version(XENVER_compile_info)
>
> Return hypervisor compile information.

This kind of information seems gratuitous.

> - HYPERVISOR_set_callbacks
>
> Set entry points for upcalls to the guest from the hypervisor.
> Used for event delivery and fatal condition notification.

In the VMI "events" are just interrupts, delivered via the virtual IDT.

> - HYPERVISOR_nmi_op(XENNMI_register_callback)
>
> Register NMI callback for this (calling) VCPU. Currently this only makes
> sense for domain 0, vcpu 0. All other callers will be returned EINVAL.

Like the event callback, this could be integrated into the virtual IDT.