3) Architectural Differences from Native Hardware.
For the sake of performance, some requirements are imposed on kernel
fault handlers which are not present on real hardware. Most modern
operating systems should have no trouble meeting these requirements.
Failure to meet these requirements may prevent the kernel from
working properly.
1) The hardware flags on entry to a fault handler may not match
the EFLAGS image on the fault handler stack. The stack image
is correct, and will have the correct state of the interrupt
and arithmetic flags.
2) The stack used for kernel traps must be flat - that is, zero base,
segment limit determined by the hypervisor.
3) On entry to any fault handler, the stack must have sufficient space
to hold 32 bytes of data, or the guest may be terminated.
4) When calling VMI functions, the kernel must be running on a
flat 32-bit stack and code segment.
5) Most VMI functions require flat data and extra (DS and ES)
segments as well; notable exceptions are IRET and SYSEXIT.
XXXPara - may need to add STI and CLI to this list.
6) Interrupts must always be enabled when running code in userspace.
7) IOPL semantics for userspace are changed; although userspace may be
granted port access, it can not affect the interrupt flag.
8) The EIPs at which faults may occur in VMI calls may not match the
original native instruction EIP; this is a bug in the system
today, as many guests do rely on lazy fault handling.
9) On entry to V8086 mode, MSR_SYSENTER_CS is cleared to zero.
10) Todo - we would like to support these features, but they are not
fully tested and / or implemented:
    Userspace 16-bit stack support
    Proper handling of faulting IRETs
4) ROM Implementation
Modularization
Originally, we envisioned modularizing the ROM API into several
subsections, but the close coupling between the initial layers
and the requirement to support native PCI bus devices has so far
made ROM components for network or block devices unnecessary.
VMI - the virtual machine interface. This is the core CPU, I/O
and MMU virtualization layer. I/O is currently limited
to port access to emulated devices.
Detection
The presence of hypervisor ROMs can be recognized by scanning the
upper region of the first megabyte of physical memory. Multiple
ROMs may be provided to support older API versions for legacy guest
OS support. ROM detection is done in the traditional manner, by
scanning the memory region from C8000h - DFFFFh in 2 kilobyte
increments. The romSignature bytes must be '0x55, 0xAA', and the
checksum of the region indicated by the romLength field must be zero.
The checksum is a simple 8-bit addition of all bytes in the ROM
region.
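As a rough illustration of the detection procedure described above, a
guest could scan a mapped copy of the first megabyte along these lines.
The names find_rom and rom_checksum, the convention that mem[i] holds
the byte at physical address i, and the "return 0 when nothing is found"
convention are assumptions of this sketch, not part of the API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROM_SCAN_START 0xC8000u  /* first address scanned */
#define ROM_SCAN_END   0xE0000u  /* one past DFFFFh */
#define ROM_SCAN_STEP  0x800u    /* 2 kilobyte increments */
#define ROM_CHUNK      512u      /* romLength counts 512 byte chunks */

/* Simple 8-bit addition of len bytes; a valid ROM sums to zero. */
static uint8_t rom_checksum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/*
 * Return the physical address of the first valid ROM at or after
 * 'from', or 0 if none is found. mem[i] is assumed (hypothetically)
 * to hold the byte at physical address i.
 */
static uint32_t find_rom(const uint8_t *mem, size_t mem_len, uint32_t from)
{
    assert(from % ROM_SCAN_STEP == 0);
    for (uint32_t addr = from < ROM_SCAN_START ? ROM_SCAN_START : from;
         addr < ROM_SCAN_END; addr += ROM_SCAN_STEP) {
        if ((size_t)addr + 3 > mem_len)
            break;
        /* romSignature bytes must be 0x55, 0xAA */
        if (mem[addr] != 0x55 || mem[addr + 1] != 0xAA)
            continue;
        /* romLength is the third byte of the header */
        size_t len = (size_t)mem[addr + 2] * ROM_CHUNK;
        if (len == 0 || addr + len > mem_len)
            continue;
        if (rom_checksum(mem + addr, len) == 0)
            return addr;
    }
    return 0;
}
```

To look for further ROMs (for example, one carrying an older API
version), a caller would presumably call find_rom again starting at the
next 2 kilobyte boundary past the ROM just found.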
Data layout
typedef struct HyperRomHeader {
    uint16_t        romSignature;
    int8_t          romLength;
    unsigned char   romEntry[4];
    uint8_t         romPad0;
    uint32_t        hyperSignature;
    uint8_t         APIVersionMinor;
    uint8_t         APIVersionMajor;
    uint8_t         reserved0;
    uint8_t         reserved1;
    uint32_t        reserved2;
    uint32_t        reserved3;
    uint16_t        pciHeaderOffset;
    uint16_t        pnpHeaderOffset;
    uint32_t        romPad3;
    char            reserved[32];
    char            elfHeader[64];
} HyperRomHeader;
The first set of fields is defined by the BIOS:
romSignature - fixed 0xAA55, BIOS ROM signature
romLength - the length of the ROM, in 512 byte chunks.
Determines the area to be checksummed.
romEntry - 16-bit initialization code stub used by BIOS.
romPad0 - reserved
The next set of fields is defined by this API:
hyperSignature - a 4 byte signature providing recognition of the
device class represented by this ROM. Each
device class defines its own unique signature.
APIVersionMinor - the revision level of this device class' API.
This indicates incremental changes to the API.
APIVersionMajor - the major version. Used to indicate large
revisions or additions to the API which break
compatibility with the previous version.
reserved0,1,2,3 - for future expansion
The next set of fields is defined by the PCI / PnP BIOS spec:
pciHeaderOffset - relative offset to the PCI device header from
the start of this ROM.
pnpHeaderOffset - relative offset to the PnP boot header from the
start of this ROM.
romPad3 - reserved by PCI spec.
Finally, there is space for future header fields, and an area
reserved for an ELF header to point to symbol information.
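The layout above can be sanity-checked at compile time, and the version
fields suggest a simple compatibility test. The sketch below repeats the
structure as given; the function rom_api_compatible and the rule "same
major, minor at least what the guest wants" are assumptions drawn from
the field descriptions, not something the text spells out. The offset
checks assume a common ABI with natural alignment, where the structure
needs no padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HyperRomHeader {
    uint16_t        romSignature;
    int8_t          romLength;
    unsigned char   romEntry[4];
    uint8_t         romPad0;
    uint32_t        hyperSignature;
    uint8_t         APIVersionMinor;
    uint8_t         APIVersionMajor;
    uint8_t         reserved0;
    uint8_t         reserved1;
    uint32_t        reserved2;
    uint32_t        reserved3;
    uint16_t        pciHeaderOffset;
    uint16_t        pnpHeaderOffset;
    uint32_t        romPad3;
    char            reserved[32];
    char            elfHeader[64];
} HyperRomHeader;

/* Every field is naturally aligned, so no packing attribute is needed. */
_Static_assert(offsetof(HyperRomHeader, hyperSignature) == 8, "layout");
_Static_assert(offsetof(HyperRomHeader, pciHeaderOffset) == 0x18, "layout");
_Static_assert(offsetof(HyperRomHeader, pnpHeaderOffset) == 0x1A, "layout");
_Static_assert(offsetof(HyperRomHeader, elfHeader) == 64, "layout");
_Static_assert(sizeof(HyperRomHeader) == 128, "layout");

/*
 * Assumed compatibility rule (hypothetical): the device class signature
 * and major version must match exactly, and the ROM's minor version
 * must be at least the one the guest was written against.
 */
static int rom_api_compatible(const HyperRomHeader *h, uint32_t want_sig,
                              uint8_t want_major, uint8_t want_minor)
{
    assert(h != NULL);
    return h->romSignature == 0xAA55 &&
           h->hyperSignature == want_sig &&
           h->APIVersionMajor == want_major &&
           h->APIVersionMinor >= want_minor;
}
```

The signature value passed as want_sig would be whatever the device
class defines; none is given here, so any value used with this sketch
is a placeholder.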
Hi!
> 6) Interrupts must always be enabled when running code in userspace.
I'd say this breaks userspace.
This code used to work when ran as root:
void
main(void)
{
        int i;
        iopl(3);
        while (1) {
                asm volatile("cli");
                //      for (i=0; i<20000000; i++)
                for (i=0; i<1000000000; i++)
                        asm volatile("");
                asm volatile("sti");
                sleep(1);
        }
}
...and was actually useful.
> 7) IOPL semantics for userspace are changed; although userspace may be
> granted port access, it can not affect the interrupt flag.
I'm not sure how will X like this.
Pavel
--
57: MD5CryptoServiceProvider MD5 = new MD5CryptoServiceProvider();
Pavel Machek wrote:
> Hi!
>
>
>> 6) Interrupts must always be enabled when running code in userspace.
>>
>
> I'd say this breaks userspace.
>
I agree. My claim is that this is not an issue in a virtual machine.
What possible reason can you have to disable interrupts in userspace?
Well, several. For one, the X server wants to disable interrupts
temporarily during probing of dot clocks to get accurate timings, and
also to avoid the kernel interrupting during a sensitive VGA register
access. Several other userspace programs, including CMOS time sync
utilities do this as well. I contend this is broken, even on native
hardware, for two reasons.
1) The sensitive VGA register access argument is bogus. There is
already a kernel interface that is used by X11 to take control of video
which lets the kernel know explicitly not to touch the VGA registers.
The oddity is due to the fact that there are many write only registers,
and thus, you can't track state of these without explicit handoff. The
same interface can be used to avoid these sensitive accesses.
2) Timing dot clocks by disabling interrupts is still broken and subject
to random variance. Chipsets which support system management modes can
cause the processor to enter SMM mode at any time, even when interrupts
are disabled and NMIs are masked. This is deliberately hidden from the
running code, but it does cause time to elapse, which is visible via the
TSC and all hardware time counters. Therefore, you can never get an
accurate timing in one iteration, and using multiple iterations allows
you to effectively deal with the same issues you would have if you left
interrupts enabled.
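To make the "multiple iterations" point concrete, here is one way it
could look in userspace with interrupts left enabled. Taking the minimum
over several runs is an assumption of this sketch (a median or a trimmed
mean would serve as well); the names now_ns, min_of and min_duration_ns
are made up for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Smallest of n samples; outliers inflated by SMM or IRQs are ignored. */
static uint64_t min_of(const uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    uint64_t best = samples[0];
    for (size_t i = 1; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Time fn() 'iterations' times (at most 64) and return the best run. */
static uint64_t min_duration_ns(void (*fn)(void), size_t iterations)
{
    uint64_t samples[64];
    assert(fn != NULL && iterations > 0 && iterations <= 64);
    for (size_t i = 0; i < iterations; i++) {
        uint64_t t0 = now_ns();
        fn();
        samples[i] = now_ns() - t0;
    }
    return min_of(samples, iterations);
}
```

A run that was interrupted, whether by an SMI or by an ordinary IRQ,
only ever makes a sample larger, so the smallest sample is the one least
affected in either case.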
> This code used to work when ran as root:
>
> void
> main(void)
> {
>         int i;
>         iopl(3);
>         while (1) {
>                 asm volatile("cli");
>                 //      for (i=0; i<20000000; i++)
>                 for (i=0; i<1000000000; i++)
>                         asm volatile("");
>                 asm volatile("sti");
>                 sleep(1);
>         }
> }
>
> ...and was actually useful.
>
The code you show above can be made to work in a virtual machine, and
you can allow userspace to disable interrupts and still have a perfectly
fine solution -- if you restrict the enabling and disabling of
interrupts in userspace to the cli and sti instructions. But it does
not work if you start using nested interrupt control, using pushf and popf.
The virtual machine monitor must always leave hardware interrupts
enabled, since it must service them without allowing the guest VM to
interfere. As such, the actual state of the hardware interrupt flag is
visible to userspace programs. CLI and STI get away with this, because
they are privileged instructions, and as such, they trap when IOPL is
not present. But PUSHF and POPF do not. A POPF instruction which
changes the interrupt flag behaves differently, depending on the IOPL
state. When IOPL is not present, and the POPF would change the state of
the interrupt flag - nothing happens. The interrupt flag is not
changed, but most importantly, it is not a privileged instruction, so it
does not trap.
Therefore, this instruction is non-virtualizable. You can not run it
directly in a virtual machine - you must simulate it. To simulate it
requires either straightforward interpretation, hardware virtualization,
or binary translation. Therein lies the crux of the problem. While you
can allow userspace to enable and disable interrupts using CLI and STI,
you have no way to simulate its use of the POPF instruction unless you
use one of these technologies. This is why we disallow all toggling of
the interrupt flag from userspace, since one of the design goals of
paravirtualization is not to change userspace code.
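The asymmetry between CLI and POPF can be shown with a small software
model. This is a deliberately simplified sketch of protected-mode
behavior: it tracks CPL and IOPL as separate fields rather than as
EFLAGS bits, and it ignores V8086 mode, the VME/PVI extensions, and
every flag other than IF. The struct and function names are invented
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EFLAGS_IF 0x200u   /* interrupt enable flag, bit 9 */

enum outcome { EXECUTED, FAULT_GP };

struct cpu {
    uint32_t eflags;
    unsigned cpl;    /* current privilege level, 0..3 */
    unsigned iopl;   /* I/O privilege level, 0..3 */
};

/* CLI is IOPL-sensitive: without IOPL it raises #GP, so a VMM sees it. */
static enum outcome do_cli(struct cpu *c)
{
    assert(c != NULL);
    if (c->cpl > c->iopl)
        return FAULT_GP;
    c->eflags &= ~EFLAGS_IF;
    return EXECUTED;
}

/*
 * POPF never faults here. Without IOPL, the IF bit in the popped image
 * is silently dropped and the current IF is kept, so a VMM gets no
 * chance to observe or emulate the attempted change.
 */
static enum outcome do_popf(struct cpu *c, uint32_t image)
{
    assert(c != NULL);
    uint32_t keep = (c->cpl > c->iopl) ? EFLAGS_IF : 0;
    c->eflags = (image & ~keep) | (c->eflags & keep);
    return EXECUTED;
}
```

With cpl = 3 and iopl = 0, do_cli reports a fault that a monitor could
intercept, while do_popf with IF clear in the image completes and leaves
IF set, which is the case that cannot be handled by trapping alone.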
Combined with the above argument that enabling / disabling is really not
useful for userspace in a virtual machine, we have found that if you
just completely disallow IOPL'ed userspace to enable and disable
interrupts, _but_ never issue faults to it if it tries, everything just
works. The alternative allows you to get in a state where you can end
up in a non-virtualizable userspace scenario, which is highly undesirable.
>
>> 7) IOPL semantics for userspace are changed; although userspace may be
>> granted port access, it can not affect the interrupt flag.
>>
See above for the impact on X. X11 runs perfectly fine in our
paravirtual VMM.
Nit: Dropping cc'd persons is probably not a good thing. Some of the
people here don't subscribe to LKML in full, and would still like to be
copied on these messages. No offense meant or taken.
Zach
On Iau, 2006-03-16 at 00:37 +0100, Pavel Machek wrote:
> This code used to work when ran as root:
Unless it page faulted, or was on SMP, or ....
> I'm not sure how will X like this.
X has not used this ability for many years.
Alan
Alan Cox wrote:
> On Iau, 2006-03-16 at 00:37 +0100, Pavel Machek wrote:
>
>> This code used to work when ran as root:
>>
>
> Unless it page faulted, or was on SMP, or ....
>
Actually, quite interestingly, I believe you can take page faults in
this scenario - you might end up getting rescheduled and lose the effect
of disabling interrupts, but I think the kernel lives on just fine - as
long as it doesn't BUG_ON about this. On SMP, clearly you can't
disable IRQs on all processors with it. But I really think the point
is to try to eliminate IRQs on a single processor during some critical
timing sensitive region. One thing you definitely can't do safely is
make sysenter based syscalls off the vsyscall page - you will notice
that you always come back with interrupts enabled.
I just really don't think that is a good idea to do in userspace, when
writing a kernel module to accomplish this safely is actually really
quite easy. I would argue that the various CMOS timer update utilities
in userspace that do this same thing, really should be moved into the
kernel as fast as possible - they could race against other CPUs in
kernel mode that are doing the same thing, and there is no locking
discipline here whatsoever.
>
>> I'm not sure how will X like this.
>>
>
> X has not used this ability for many years.
>
Good to know. I thought some piece of xinit still used it to do
dot-clock probing - but I could be wrong. We really don't care about
getting accurate information here, since the dot-clocks don't actually
exist in a VM. We simulate virtual SVGA hardware instead of passing
through any installed card.
Zach
On Iau, 2006-03-16 at 07:29 -0800, Zachary Amsden wrote:
> quite easy. I would argue that the various CMOS timer update utilities
> in userspace that do this same thing, really should be moved into the
> kernel as fast as possible - they could race against other CPUs in
They were, something like 8-10 years ago. If your distributor is
shipping code that is doing cli in user space please assist in their
re-education. Several ship code which can fall back if the nvram or rtc
driver is missing but that's compat code.
> Good to know. I thought some piece of xinit still used it to do
> dot-clock probing - but I could be wrong.
It does, but it doesn't disable interrupts. You don't need to.