On Monday 13 March 2006 18:59, Zachary Amsden wrote:
> + The general mechanism for providing customized features and
> + capabilities is to provide notification of these feature through
> + the CPUID call,
How should that work since CPUID cannot be intercepted by
a Hypervisor (without VMX/SVM)?
> + Watchdog NMIs are of limited use if the OS is
> + already correct and running on stable hardware;
So how would your Hypervisor detect a kernel hung with interrupts
off then?
> + profiling NMIs are
> + similarly of less use, since this task is accomplished with more accuracy
> + in the VMM itself
And how does oprofile know about this?
> ; and NMIs for machine check errors should be handled
> + outside of the VM.
Right now yes, but if we ever implement intelligent memory ECC error handling,
it's questionable whether the hypervisor can do a better job. It has far less
information about how memory is used than the kernel.
> + The net result of these choices is that most of the calls are very
> + easy to make from C-code, and calls that are likely to be required in
> + low level trap handling code are easy to call from assembler. Most
> + of these calls are also very easily implemented by the hypervisor
> + vendor in C code, and only the performance critical calls from
> + assembler paths require custom assembly implementations.
> +
> + CORE INTERFACE CALLS
Did I miss it or do you never describe how to find these entry points?
> + VMI_EnableInterrupts
> +
> + VMICALL void VMI_EnableInterrupts(void);
> +
> + Enable maskable interrupts on the processor. Note that the
> + current implementation always will deliver any pending interrupts
> + on a call which enables interrupts, for compatibility with kernel
> + code which expects this behavior. Whether this should be required
> + is open for debate.
A subtle trap is also that it will do so on the next instruction, not the
one following the next like a real x86. At some point there was code in Linux
that depended on this.
> + VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
> +
> + Read from a model specific register. This functions identically to the
> + hardware RDMSR instruction. Note that a hypervisor may not implement
> + the full set of MSRs supported by native hardware, since many of them
> + are not useful in the context of a virtual machine.
So what happens when the kernel tries to access an unimplemented MSR?
Also, we have occasionally had workarounds in the past that required
MSR writes with magic "passwords". How would these be handled?
> +
> + VMI_CPUID
> +
> + /* Not expressible as a C function */
> +
> + The CPUID instruction provides processor feature identification in a
> + vendor specific manner. The instruction itself is non-virtualizable
> + without hardware support, requiring a hypervisor assisted CPUID call
> + that emulates the effect of the native instruction, while masking any
> + unsupported CPU feature bits.
Doesn't seem to be very useful because everybody can just call CPUID directly.
> + The RDTSC instruction provides a cycles counter which may be made
> + visible to userspace. For better or worse, many applications have made
> + use of this feature to implement userspace timers, database indices, or
> + for micro-benchmarking of performance. This instruction is extremely
> + problematic for virtualization, because even though it is selectively
> + virtualizable using trap and emulate, it is much more expensive to
> + virtualize it in this fashion. On the other hand, if this instruction
> + is allowed to execute without trapping, the cycle counter provided
> + could be wrong in any number of circumstances due to hardware drift,
> + migration, suspend/resume, CPU hotplug, and other unforeseen
> + consequences of running inside of a virtual machine. There is no
> + standard specification for how this instruction operates when issued
> + from userspace programs, but the VMI call here provides a proper
> + interface for the kernel to read this cycle counter.
Yes, but it will be wrong in a native kernel too so why do you want
to be better than native?
Seems useless to me.
> + VMI_RDPMC
> +
> + VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
> +
> + Similar to RDTSC, this call provides the functionality of reading
> + processor performance counters. It also is selectively visible to
> + userspace, and maintaining accurate data for the performance counters
> + is an extremely difficult task due to the side effects introduced by
> + the hypervisor.
Similar.
Overall feeling is you have far too many calls. You seem to try to implement
a full x86 replacement, but that makes it big and likely to be buggy. And
it's likely impossible to implement in any Hypervisor short of a full emulator
like yours.
I would try a diet and only implement facilities that are actually likely
to be used by a modern OS.
There was one other point I wanted to make but I forgot it now @)
-Andi
* Andi Kleen ([email protected]) wrote:
> On Monday 13 March 2006 18:59, Zachary Amsden wrote:
>
> > + The general mechanism for providing customized features and
> > + capabilities is to provide notification of these feature through
> > + the CPUID call,
>
> How should that work since CPUID cannot be intercepted by
> a Hypervisor (without VMX/SVM)?
Yeah, it requires guest kernel cooperation/modification.
> > + The net result of these choices is that most of the calls are very
> > + easy to make from C-code, and calls that are likely to be required in
> > + low level trap handling code are easy to call from assembler. Most
> > + of these calls are also very easily implemented by the hypervisor
> > + vendor in C code, and only the performance critical calls from
> > + assembler paths require custom assembly implementations.
> > +
> > + CORE INTERFACE CALLS
>
> Did I miss it or do you never describe how to find these entry points?
It's the ROM interface. For native they are emitted directly inline.
For non-native, they are emitted as call stubs, which call to the ROM.
I don't recall if it's in this doc, but the inline patch has all the
gory details.
thanks,
-chris
On Wednesday 22 March 2006 22:34, Chris Wright wrote:
> * Andi Kleen ([email protected]) wrote:
> > On Monday 13 March 2006 18:59, Zachary Amsden wrote:
> >
> > > + The general mechanism for providing customized features and
> > > + capabilities is to provide notification of these feature through
> > > + the CPUID call,
> >
> > How should that work since CPUID cannot be intercepted by
> > a Hypervisor (without VMX/SVM)?
>
> Yeah, it requires guest kernel cooperation/modification.
Even then it's useless for many flags because any user program can (and will)
call CPUID directly.
> > > + The net result of these choices is that most of the calls are very
> > > + easy to make from C-code, and calls that are likely to be required in
> > > + low level trap handling code are easy to call from assembler. Most
> > > + of these calls are also very easily implemented by the hypervisor
> > > + vendor in C code, and only the performance critical calls from
> > > + assembler paths require custom assembly implementations.
> > > +
> > > + CORE INTERFACE CALLS
> >
> > Did I miss it or do you never describe how to find these entry points?
>
> It's the ROM interface. For native they are emitted directly inline.
> For non-native, they are emitted as call stubs, which call to the ROM.
> I don't recall if it's in this doc, but the inline patch has all the
> gory details.
Sure, but the point was: if they write this long fancy document, why stop
short of documenting the last 5%?
-Andi
* Andi Kleen ([email protected]) wrote:
> Even then it's useless for many flags because any user program can (and will)
> call CPUID directly.
Yes, it doesn't handle userspace at all. It's useful only to get a coherent
view of flags in the kernel. Right now, for example, Xen goes in and
basically masks off flags retroactively, which is not that nice either.
thanks,
-chris
Andi Kleen wrote:
> On Monday 13 March 2006 18:59, Zachary Amsden wrote:
>
>
>> + The general mechanism for providing customized features and
>> + capabilities is to provide notification of these feature through
>> + the CPUID call,
>>
>
> How should that work since CPUID cannot be intercepted by
> a Hypervisor (without VMX/SVM)?
>
It can be intercepted with a VMI call. I actually think overloading
this for VM features as well, although convenient, might turn out to be
unwieldy.
>> + Watchdog NMIs are of limited use if the OS is
>> + already correct and running on stable hardware;
>>
>
> So how would your Hypervisor detect a kernel hung with interrupts
> off then?
>
The hypervisor can detect it fine - we never disable hardware interrupts
or NMIs except for very small windows in the fault handlers. I'm
arguing that philosophically, using NMIs to detect a software hang means
you have broken software. NMIs for detecting hardware induced hangs are
common and reasonable things to do, but on virtual hardware, that
shouldn't happen either.
>
>>> profiling NMIs are
>>>
>> + similarly of less use, since this task is accomplished with more accuracy
>> + in the VMM itself
>>
>
> And how does oprofile know about this?
>
It doesn't. But consider that oprofile is a time based NMI sampler.
That is less accurate in a VM, where you have virtual time, somewhat
skewed spacing between NMI deliveries, and less than accurate performance
counter information. You can get a lot better results for benchmarks
using the VMM to sample the guest instead.
>> ; and NMIs for machine check errors should be handled
>> + outside of the VM.
>>
>
> Right now yes, but if we ever implement intelligent memory ECC error handling it's questionable
> the hypervisor can do a better job. It has far less information about how memory
> is used than the kernel.
>
Right. I think I may have been too proactive in my defense of disabling
NMIs. I agree now, it is a bug, and it really should be supported. But
it was a convenient shortcut to getting things working - otherwise you
have to have the NMI avoidance logic in entry.S, which is not properly
virtualizable (checks raw segments without masking RPL). But seeing as
I already fixed that, I think we actually could re-enable NMIs now.
Though the usefulness of common cases may be compromised, having the VM
do machine check handling on its own data pages (so it can figure out
which processes to kill / recover) is an extremely useful case.
>> + CORE INTERFACE CALLS
>>
>
> Did I miss it or do you never describe how to find these entry points?
>
It should be described in the ROM probing section in more detail. Our
documentation is getting better with time ;)
>
>> + VMI_EnableInterrupts
>> +
>> + VMICALL void VMI_EnableInterrupts(void);
>> +
>> + Enable maskable interrupts on the processor. Note that the
>> + current implementation always will deliver any pending interrupts
>> + on a call which enables interrupts, for compatibility with kernel
>> + code which expects this behavior. Whether this should be required
>> + is open for debate.
>>
>
> A subtle trap is also that it will do so on the next instruction, not the
> one following the next like a real x86. At some point there was code in Linux
> that depended on this.
>
There still is. This is why you have the "sti; sysexit" pair, and why
safe_halt() is "sti; hlt". You really don't want interrupts in those
windows. The architectural oddity forced us to make these calls into
the VMI interface. A third one, used by some operating systems, is
"sti; nop; cli" - i.e. deliver pending interrupts and disable again. In
most other cases, it doesn't matter.
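
To make the window concrete, here is a toy model of the native behaviour in
plain C. It is only a sketch: struct toy_cpu and the toy_* helpers are names
made up for this illustration, nothing here is VMI code, and the only thing
it models is that an STI executed with interrupts off holds off delivery
until one more instruction has run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy CPU state (hypothetical; only what is needed to show the STI window). */
struct toy_cpu {
    bool if_flag;       /* models EFLAGS.IF */
    bool sti_shadow;    /* set by STI when IF was clear before it */
    bool irq_pending;   /* an interrupt is waiting to be delivered */
    unsigned delivered; /* number of interrupts delivered so far */
};

/* STI: enable interrupts; the instruction that follows is shielded. */
void toy_sti(struct toy_cpu *c)
{
    assert(c != NULL);
    if (!c->if_flag)
        c->sti_shadow = true;
    c->if_flag = true;
}

/* CLI: disable interrupts immediately. */
void toy_cli(struct toy_cpu *c)
{
    assert(c != NULL);
    c->if_flag = false;
}

/*
 * Called between two instructions. Right after STI the shadow swallows
 * this delivery opportunity, so "sti; hlt" and "sti; sysexit" cannot be
 * split by an interrupt.
 */
void toy_instruction_boundary(struct toy_cpu *c)
{
    assert(c != NULL);
    if (c->sti_shadow) {
        c->sti_shadow = false;
        return;
    }
    if (c->if_flag && c->irq_pending) {
        c->irq_pending = false;
        c->delivered++;
    }
}
```

In this model "sti; nop; cli" delivers one pending interrupt (at the boundary
after the nop), while "sti; cli" delivers none. A call that delivers pending
interrupts as soon as it enables them would behave differently in exactly
this window.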
>
>> + VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
>> +
>> + Read from a model specific register. This functions identically to the
>> + hardware RDMSR instruction. Note that a hypervisor may not implement
>> + the full set of MSRs supported by native hardware, since many of them
>> + are not useful in the context of a virtual machine.
>>
>
> So what happens when the kernel tries to access an unimplemented MSR?
>
> Also we have had occasionally workarounds in the past that required
> MSR writes with magic "passwords". How would these be handled?
>
I actually already implemented your suggestion on making MSR reads and
writes use trap and emulate - so all of these issues go away. Whether
forcing trap and emulate is a good idea for a minimal open source
hypervisor is another debate.
> +
>
>> + VMI_CPUID
>> +
>> + /* Not expressible as a C function */
>> +
>> + The CPUID instruction provides processor feature identification in a
>> + vendor specific manner. The instruction itself is non-virtualizable
>> + without hardware support, requiring a hypervisor assisted CPUID call
>> + that emulates the effect of the native instruction, while masking any
>> + unsupported CPU feature bits.
>>
>
> Doesn't seem to be very useful because everybody can just call CPUID directly.
>
Which is why the kernel _must_ use the CPUID VMI call. We're a little
bit broken in this respect today, since the boot code in head.S does
CPUID probing before the VMI init call. It works for us because we use
binary translation of the kernel up to this point. In the end, this
will disappear, and the CPUID probing will be done in the alternative
entry point known as the "start of day" state, where the kernel is
already pre-virtualized.
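
As a rough illustration of what "emulate CPUID while masking unsupported
feature bits" could look like, here is a sketch in portable C. Everything in
it is an assumption made for the example: native_cpuid_stub() stands in for
the real instruction, vmi_cpuid_sketch() is not the actual VMI entry point,
and the choice of hiding MTRR and MCA is arbitrary. Only the bit positions in
leaf 1 EDX are the architectural ones.

```c
#include <assert.h>
#include <stdint.h>

/* Register values as the CPUID instruction would return them. */
struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/* Architectural feature bits in CPUID leaf 1, EDX. */
#define CPUID1_EDX_TSC  (1u << 4)
#define CPUID1_EDX_MSR  (1u << 5)
#define CPUID1_EDX_MTRR (1u << 12)
#define CPUID1_EDX_MCA  (1u << 14)

/* Assumption for this sketch: the hypervisor chooses to hide these. */
#define HIDDEN_LEAF1_EDX (CPUID1_EDX_MTRR | CPUID1_EDX_MCA)

/*
 * Hypothetical stand-in for the native instruction, so the sketch runs
 * anywhere. The returned values are made up.
 */
struct cpuid_regs native_cpuid_stub(uint32_t leaf)
{
    struct cpuid_regs r = { 0, 0, 0, 0 };
    if (leaf == 1)
        r.edx = CPUID1_EDX_TSC | CPUID1_EDX_MSR |
                CPUID1_EDX_MTRR | CPUID1_EDX_MCA;
    return r;
}

/* Kernel-facing call: native result with the hidden bits cleared. */
struct cpuid_regs vmi_cpuid_sketch(uint32_t leaf)
{
    struct cpuid_regs r = native_cpuid_stub(leaf);
    if (leaf == 1) {
        r.edx &= ~HIDDEN_LEAF1_EDX;
        assert((r.edx & HIDDEN_LEAF1_EDX) == 0);
    }
    return r;
}
```

The masking only helps callers that go through the wrapper, which is why the
kernel has to use the call consistently; a direct CPUID from userspace still
sees the unmasked bits.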
> Yes, but it will be wrong in a native kernel too so why do you want
> to be better than native?
>
> Seems useless to me.
>
Agree. TSC is broken in so many ways, that it really should not be used
for anything other than unreliable cycle counting.
>
>> + VMI_RDPMC
>> +
>> + VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
>> +
>> + Similar to RDTSC, this call provides the functionality of reading
>> + processor performance counters. It also is selectively visible to
>> + userspace, and maintaining accurate data for the performance counters
>> + is an extremely difficult task due to the side effects introduced by
>> + the hypervisor.
>>
>
> Similar.
>
> Overall feeling is you have far too many calls. You seem to try to implement
> a full x86 replacement, but that makes it big and likely to be buggy. And
> it's likely impossible to implement in any Hypervisor short of a full emulator
> like yours.
>
> I would try a diet and only implement facilities that are actually likely
> to be used by modern OS.
>
The interface can't really go on too much of a diet - some kernel
somewhere, maybe not Linux, under some hypervisor, maybe not VMware or
Xen, may want to use these features. What the interface can be is an a
la carte menu. By allowing specific instructions to fall back to trap
and emulate, mainstream OSes don't need to be bothered with changing to
match some rich interface. Other OSes may have vastly different
requirements, and might want to make use of these features heavily, if
they are available. And hypervisors don't need to implement anything
special for these either. Our RDPMC implementation in the ROM is quite
simple:
/*
* VMI_RDPMC - Binary RDPMC equivalent
* Must clobber no registers (other than %eax, %edx return)
*/
VMI_ENTRY(RDPMC)
rdpmc
vmireturn
VMI_CALL_END
Taken to the extreme, where the patch processing is done before the
kernel runs, in the hypervisor itself, using the annotation table
provided by the guest kernel, it is even easier. If you see an
annotation for a feature you don't care to implement, you don't do
anything at all - you leave the native instructions as they are. In
this case, neither the kernel nor the hypervisor has any extra code at
all to deal with cases they don't care about. But the rich interface is
still there, and if someone wants to bathe in butter, who are we to
judge. There certainly are uses for it. For example, WRMSR is not on
critical paths in i386 Linux today. That does not mean we should remove
it from the interface. When a new processor core comes along, and all
of a sudden, you really need that interface back, you want it ready for
use. And this case really did happen - FSBASE and GSBASE MSR writes
moved onto the critical path in x86_64.
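
A minimal sketch of that "a la carte" patching pass, in C. The annotation
layout, the call identifiers and apply_annotations() are all invented for
this example and are not the real VMI annotation format; the point is only
that a call the hypervisor has no replacement for is skipped, so the native
instruction bytes stay exactly as the kernel was built.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical call identifiers (not the real VMI numbering). */
enum vmi_call_id { CALL_RDPMC, CALL_WRMSR, CALL_CLI, CALL_COUNT };

/* Hypothetical annotation: where a native sequence sits in the image. */
struct annotation {
    enum vmi_call_id call;
    size_t offset;  /* byte offset of the native instruction(s) */
    size_t length;  /* bytes reserved at that site for patching */
};

/* Replacement supplied by the hypervisor; bytes == NULL means "keep native". */
struct replacement {
    const uint8_t *bytes;
    size_t length;
};

/* Walk the table, patch what the hypervisor implements, skip the rest. */
size_t apply_annotations(uint8_t *image, size_t image_len,
                         const struct annotation *table, size_t n,
                         const struct replacement repl[CALL_COUNT])
{
    size_t patched = 0;

    assert(image != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct annotation *a = &table[i];

        if ((unsigned)a->call >= CALL_COUNT)
            continue;                     /* unknown call: leave native */
        const struct replacement *r = &repl[a->call];
        if (r->bytes == NULL)
            continue;                     /* not implemented: leave native */
        if (r->length > a->length)
            continue;                     /* replacement does not fit */
        if (a->offset > image_len || a->length > image_len - a->offset)
            continue;                     /* site lies outside the image */

        memcpy(image + a->offset, r->bytes, r->length);
        /* Pad the rest of the site with one-byte NOPs (0x90). */
        memset(image + a->offset + r->length, 0x90, a->length - r->length);
        patched++;
    }
    return patched;
}
```

With a replacement registered only for CALL_CLI, an annotated RDPMC site
(bytes 0x0F 0x33) is left untouched, which matches the "you don't do
anything at all" case described above.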
I think I carried the diet analogy a little far.
> There was one other point I wanted to make but I forgot it now @)
>
Thanks again for your feedback,
Zach
> There was one other point I wanted to make but I forgot it now @)
Ah yes, the point was that most of the implementations of the hypercalls
likely need fast access to some per-CPU state. How would you plan
to implement that? Should it be covered in the specification?
-Andi
On Wednesday 22 March 2006 23:04, Zachary Amsden wrote:
>
> It doesn't. But consider that oprofile is a time based NMI sampler.
That's one of its modes, mostly used for people with broken APICs.
But the primary mode of operation is an event based sampler using
performance counter events.
> There still is. This is why you have the "sti; sysexit" pair, and why
> safe_halt() is "sti; hlt". You really don't want interrupts in those
> windows. The architectural oddity forced us to make these calls into
> the VMI interface. A third one, used by some operating systems, is
> "sti; nop; cli" - i.e. deliver pending interrupts and disable again. In
> most other cases, it doesn't matter.
Sounds like something that should be discussed in the spec.
> > Seems useless to me.
> >
>
> Agree. TSC is broken in so many ways, that it really should not be used
> for anything other than unreliable cycle counting.
It can be used with an aggressive white list and if you know what you're
doing. The x86-64 kernel follows this approach, which allows using
it at least on some common classes of systems (AMD single core, Intel
non-NUMA P4).
Actually for cycle counting it is useless because on newer Intel CPUs
it always runs at the highest P state no matter which P state you're in.
My evil plan to deal with that was to export the cycle count running in PMC0
for the NMI watchdog to ring 3 so people could just use RDPMC 0 instead.
There was some opposition to this idea unfortunately.
But the hypervisor should keep its fingers out of all that as far as possible.
>
> >
> >> + VMI_RDPMC
> >> +
> >> + VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
> >> +
> >> + Similar to RDTSC, this call provides the functionality of reading
> >> + processor performance counters. It also is selectively visible to
> >> + userspace, and maintaining accurate data for the performance counters
> >> + is an extremely difficult task due to the side effects introduced by
> >> + the hypervisor.
> >>
> >
> > Similar.
> >
> > Overall feeling is you have far too many calls. You seem to try to implement
> > a full x86 replacement, but that makes it big and likely to be buggy. And
> > it's likely impossible to implement in any Hypervisor short of a full emulator
> > like yours.
> >
> > I would try a diet and only implement facilities that are actually likely
> > to be used by modern OS.
> >
>
> The interface can't really go on too much of a diet - some kernel
> somewhere, maybe not Linux, under some hypervisor, maybe not VMware or
> Xen, may want to use these features.
This might sound arrogant, but I would expect that nearly all modern
kernels don't use much more of the x86 subset than Linux is using
(the biggest exception I can think of would be interrupt priorities).
>
> Taken to the extreme, where the patch processing is done before the
> kernel runs, in the hypervisor itself, using the annotation table
> provided by the guest kernel, it is even easier. If you see an
> annotation for a feature you don't care to implement, you don't do
> anything at all - you leave the native instructions as they are. In
> this case, neither the kernel nor the hypervisor has any extra code at
> all to deal with cases they don't care about. But the rich interface is
> still there, and if someone wants to bathe in butter, who are we to
> judge.
So basically you're trying to implement VT/Pacifica in software
with all these traps?
I'm not sure that's the right approach.
My feeling would be that for an efficient paravirtualized interface a better
approach would be to try to optimize the kernels a bit more
for the emulated case.
Longer term there will be more optimizations (like better interaction
of VM maybe, or para drivers that work faster). But the base interface
is already so big that adding even more stuff might make it explode
at some point.
> There certainly are uses for it. For example, WRMSR is not on
> critical paths in i386 Linux today.
Actually I got a feature request today that would require optionally
doing a wrmsr in the context switch :/
-Andi
Andi Kleen wrote:
>>There was one other point I wanted to make but I forgot it now @)
>
>
> Ah yes the point was that since most of the implementations of the hypercalls
> likely need fast access to some per CPU state. How would you plan
> to implement that? Should it be covered in the specification?
I can explain how it works, but it's deliberately not part of the specification.
The whole point of the ROM layer is that it abstracts away the actual hypercall
mechanism for the guest, and the hypervisor can implement whatever is
appropriate for it. This layer allows a VMI guest to run on VMware's
hypervisor, as well as on top of Xen.
We reserve the top 64MB of linear address space for the hypervisor.
Part of this reserved space contains data structures that are shared by the VMI
ROM layer and the hypervisor. Simple VMI interface calls like "read CR 2" are
implemented by reading or writing data from this shared data structure, and
don't require a privilege level change. Things like page table updates go into
a queue in the shared area, so they can easily be batched and processed with
only one actual call into the hypervisor.
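
To illustrate the idea only (the real layout is deliberately undocumented,
so every name below is made up for this example), a shared area with a
cached CR2 value and a small page table update queue might be sketched in C
like this. hypercall_flush() is a stand-in for the one real transition into
the hypervisor and just counts how often it would happen.

```c
#include <assert.h>
#include <stdint.h>

#define PT_QUEUE_CAP 4   /* hypothetical queue depth */

/* One queued page table update (hypothetical layout). */
struct pt_update {
    uint64_t pte_addr;
    uint64_t value;
};

/* Hypothetical area shared between the ROM layer and the hypervisor. */
struct shared_area {
    uint32_t cr2;     /* kept current by the hypervisor */
    uint32_t count;   /* number of valid queue entries */
    struct pt_update queue[PT_QUEUE_CAP];
};

/* Counts simulated privilege level changes. */
unsigned hypercalls_made;

/* Stand-in for the single call that hands the whole batch over. */
void hypercall_flush(struct shared_area *s)
{
    /* A real hypervisor would validate and apply queue[0..count) here. */
    hypercalls_made++;
    s->count = 0;
}

/* "Read CR2" is just a load from the shared area: no hypercall needed. */
uint32_t vmi_read_cr2_sketch(const struct shared_area *s)
{
    return s->cr2;
}

/* Queue an update; only a full queue forces a call into the hypervisor. */
void vmi_queue_pt_update_sketch(struct shared_area *s,
                                uint64_t pte_addr, uint64_t value)
{
    if (s->count >= PT_QUEUE_CAP)
        hypercall_flush(s);
    assert(s->count < PT_QUEUE_CAP);
    s->queue[s->count].pte_addr = pte_addr;
    s->queue[s->count].value = value;
    s->count++;
}
```

With a depth of 4, five queued updates cost one flush instead of five
separate calls; the guest would also flush explicitly wherever the updates
have to be visible.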
Because the guest can manipulate this data page directly, the hypervisor has to
treat any information in it as untrusted. This is similar to how the kernel has
to treat syscall arguments. Guest user code can't touch the shared area, so it
doesn't introduce any new kernel security holes. The guest kernel could
deliberately mess up the shared area contents, but guest kernel code could
corrupt any arbitrary (virtual) machine state anyway.
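
On the hypervisor side, "untrusted" roughly means: read each guest-writable
field once, clamp it, and check a private copy before acting on it. A sketch
under the same caveat as before (struct guest_shared, struct guest_limits and
hv_drain_queue() are hypothetical names for this illustration, not our
implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 4   /* hypothetical queue depth */

struct update {
    uint64_t pte_addr;
    uint64_t value;
};

/* Guest-writable shared area: nothing in here can be trusted. */
struct guest_shared {
    volatile uint32_t count;
    struct update queue[QUEUE_CAP];
};

/* Hypothetical range in which this guest may place page table entries. */
struct guest_limits {
    uint64_t pt_base;   /* inclusive */
    uint64_t pt_end;    /* exclusive */
};

/*
 * Copy the acceptable updates into hypervisor-private storage and return
 * how many were accepted. The guest-controlled count is read exactly once
 * and clamped, and each entry is copied before it is checked, so the guest
 * cannot change it between the check and the use.
 */
size_t hv_drain_queue(struct guest_shared *s, const struct guest_limits *lim,
                      struct update *out, size_t out_cap)
{
    uint32_t n = s->count;   /* single read of the guest value */
    size_t accepted = 0;

    if (n > QUEUE_CAP)
        n = QUEUE_CAP;       /* never index past the real queue */

    for (uint32_t i = 0; i < n && accepted < out_cap; i++) {
        struct update u = s->queue[i];   /* private copy first */

        if (u.pte_addr < lim->pt_base || u.pte_addr >= lim->pt_end)
            continue;                    /* outside the guest's range */
        out[accepted++] = u;
    }
    assert(accepted <= QUEUE_CAP);
    s->count = 0;
    return accepted;
}
```

This is the same discipline the kernel applies to syscall arguments: a bogus
count or address costs the guest its own update, not the hypervisor's
integrity.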
Because this level of interface is hidden from the guest, we can (and do) make
changes to it without changing VMI itself, or needing to recompile the guest.
We deliberately do not document it. A guest that adheres to the VMI interface
can move to new versions of the ROM/hypervisor interface (that implement the
same VMI interface) without changes.
Dan.
Andi Kleen wrote:
>> There was one other point I wanted to make but I forgot it now @)
>>
>
> Ah yes the point was that since most of the implementations of the hypercalls
> likely need fast access to some per CPU state. How would you plan
> to implement that? Should it be covered in the specification?
>
Probably. We don't have that issue currently, as we have a private
mapping of CPU state for each VCPU at a fixed address. Seeing as that
is not so feasible under Xen, I would say we need to put something in
the spec.
The way Xen deals with this is rather gruesome today. It needs
callbacks into the kernel to disable preemption so that it can
atomically compute the address of the VCPU area, just so that it can
disable interrupts on the VCPU. These contortions make backbending look
easy.
I propose an entirely different approach - use segmentation. This needs
to be in the spec, as we now need to add VMI hook points for saving and
restoring user segments. But in the end it wins, even if you can't
support per-cpu mappings using paging, you can do it with segmentation.
You'll likely get even better performance. And you don't have to worry
about these unclean callbacks into the guest kernel that really make the
interface between Xen and XenoLinux completely enmeshed. And you can
disable interrupts in one instruction:
movb $0, %gs:hypervisor_intFlags
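
For readers who prefer C to segment overrides, the same idea can be modelled
like this. It is a sketch only: current_vcpu stands in for the per-CPU
segment base that %gs would provide, and the int_pending field and the two
helper functions are assumptions added for the illustration, not part of any
proposed interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-VCPU area; int_flags models hypervisor_intFlags. */
struct vcpu_area {
    uint8_t int_flags;    /* 1 = virtual interrupts enabled */
    uint8_t int_pending;  /* assumed: set when an interrupt arrived while masked */
};

/*
 * Stand-in for the segment base: each CPU would reach its own area via
 * %gs, so no address computation and no preemption games are needed.
 */
struct vcpu_area *current_vcpu;

/* Equivalent of the single store: movb $0, %gs:hypervisor_intFlags */
void vcpu_disable_interrupts(void)
{
    assert(current_vcpu != NULL);
    current_vcpu->int_flags = 0;
}

/*
 * Enabling is a store plus a check. Returns true if a deferred interrupt
 * now has to be delivered (which may need a real call into the hypervisor).
 */
bool vcpu_enable_interrupts(void)
{
    assert(current_vcpu != NULL);
    current_vcpu->int_flags = 1;
    if (current_vcpu->int_pending) {
        current_vcpu->int_pending = 0;
        return true;
    }
    return false;
}
```

The disable path touches only the running VCPU's own area, which is what
makes it safe without first pinning the task to a CPU.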
Zach
On Wednesday 22 March 2006 23:45, Zachary Amsden wrote:
> I propose an entirely different approach - use segmentation.
That would require a lot of changes to save/restore the segmentation
register at kernel entry/exit since there is no swapgs on i386.
It will likely be slower there too, and will even slow down the
VMI-kernel-no-hypervisor case.
Still might be the best option.
How did that rumoured Xenolinux-over-VMI implementation solve that problem?
-Andi
Andi Kleen wrote:
> On Wednesday 22 March 2006 23:45, Zachary Amsden wrote:
>
>
>> I propose an entirely different approach - use segmentation.
>>
>
> That would require a lot of changes to save/restore the segmentation
> register at kernel entry/exit since there is no swapgs on i386.
> And will be likely slower there too and also even slow down the
> VMI-kernel-no-hypervisor.
>
There are no changes required to the kernel entry / exit paths. With
save/restore segment support in the VMI, reserving one segment for the
hypervisor data area is easy.
I take it back. There is one required change:
kernel_entry:
hypervisor_entry_hook
sti
.... kernel code
This hypervisor_entry_hook can be a nop on native hardware, and the
following for Xen:
push %gs
mov CPU_HYPER_SEL, %gs
pop %gs:SAVED_USER_GS
You already have the IRET / SYSEXIT hooks to restore it on the way
back. And now you have a segment reserved that allows you to deal with
16-bit stack segments during the IRET.
> Still might be the best option.
>
> How did that rumoured Xenolinux-over-VMI implementation solve that problem?
>
!CONFIG_SMP -- as I believe I saw in the latest Xen patches sent out as
well?
Andi Kleen wrote:
> Even then it's useless for many flags because any user program can (and will)
> call CPUID directly.
Turns out not to matter, since userspace can only make use of
capabilities that are already available to userspace. If the feature
bits for system features are visible to it, it doesn't really matter.
Yes, this could be broken in some cases. But it turns out to be safe.
Even sysenter support, which userspace does care about, is done via
setting the vsyscall page up in the kernel, rather than userspace CPUID
detection.
> Sure the point was if they write this long fancy document why stop
> at documenting the last 5%?
>
Because the last 5% is what is changing to meet Xen's needs. Why
document something that you know you are going to break in a week? I
chose to document the stable interfaces first.
On Thursday 23 March 2006 00:54, Zachary Amsden wrote:
> Andi Kleen wrote:
> > On Wednesday 22 March 2006 23:45, Zachary Amsden wrote:
> >
> >
> >> I propose an entirely different approach - use segmentation.
> >>
> >
> > That would require a lot of changes to save/restore the segmentation
> > register at kernel entry/exit since there is no swapgs on i386.
> > And will be likely slower there too and also even slow down the
> > VMI-kernel-no-hypervisor.
> >
>
> There are no changes required to the kernel entry / exit paths. With
> save/restore segment support in the VMI, reserving one segment for the
> hypervisor data area is easy.
Ok that might work yes.
> > Still might be the best option.
> >
> > How did that rumoured Xenolinux-over-VMI implementation solve that problem?
> >
>
> !CONFIG_SMP -- as I believe I saw in the latest Xen patches sent out as
> well?
Ah, cheating. This means the rumoured benchmark numbers are dubious too I guess.
-Andi