Hi,
Please find below the proposal for the generic use of the CPUID space
allotted for hypervisors. Apart from this CPUID space, it is also worth
noting that Intel & AMD reserve the MSRs from 0x40000000 - 0x400000FF
for software use. Though the proposal doesn't talk about MSRs right now,
we should be aware of these reservations, as we may want to extend the
way we use CPUID to MSR usage as well.
While we are at it, we also think we should form a group which has at
least one person representing each of the hypervisors interested in
generalizing the hypervisor CPUID space for Linux guest OS. This group
will be informed whenever a new CPUID leaf from the generic space is to
be used. This would help avoid duplicate definitions of the same CPUID
semantics by two different hypervisors. I think most of the people
involved are subscribed to LKML or the virtualization lists, and we
should use these lists as a platform to decide on things.
Thanks,
Alok
---
Hypervisor CPUID Interface Proposal
-----------------------------------
Intel & AMD have reserved cpuid levels 0x40000000 - 0x400000FF for
software use. Hypervisors can use these levels to provide an interface
to pass information from the hypervisor to the guest running inside a
virtual machine.
This proposal defines a standard framework for the way in which the
Linux and hypervisor communities incrementally define this CPUID space.
(This proposal may be adopted by other guest OSes. However, that is not
a requirement because a hypervisor can expose a different CPUID
interface depending on the guest OS type that is specified by the VM
configuration.)
Hypervisor Present Bit:
Bit 31 of ECX of CPUID leaf 0x1.
This bit has been reserved by Intel & AMD for use by
hypervisors, and indicates the presence of a hypervisor.
Virtual CPUs (hypervisors) set this bit to 1 and physical CPUs
(all existing and future CPUs) set this bit to zero. Guest software
can probe this bit to detect whether it is running inside a virtual
machine.
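For illustration, a minimal user-space sketch of this probe (the cpuid()
helper below is local to the example, not an existing API):

    #include <stdint.h>
    #include <stdio.h>

    static void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                      uint32_t *c, uint32_t *d)
    {
        __asm__ volatile("cpuid"
                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                         : "a"(leaf), "c"(0));
    }

    /* CPUID.1:ECX bit 31 -- 1 on a virtual CPU, 0 on a physical CPU. */
    static int running_on_hypervisor(void)
    {
        uint32_t a, b, c, d;

        cpuid(0x1, &a, &b, &c, &d);
        return (c >> 31) & 1;
    }

    int main(void)
    {
        printf("hypervisor present: %d\n", running_on_hypervisor());
        return 0;
    }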
Hypervisor CPUID Information Leaf:
Leaf 0x40000000.
This leaf returns the CPUID leaf range supported by the
hypervisor and the hypervisor vendor signature.
# EAX: The maximum input value for CPUID supported by the hypervisor.
# EBX, ECX, EDX: Hypervisor vendor ID signature.
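A minimal sketch of reading this leaf, assuming the cpuid() helper from
the sketch above (the vendor signature is the 12 bytes of EBX, ECX and
EDX, in that order):

    #include <string.h>

    static void print_hypervisor_id(void)
    {
        uint32_t max_leaf, sig[3];
        char name[13];

        cpuid(0x40000000, &max_leaf, &sig[0], &sig[1], &sig[2]);
        memcpy(name, sig, 12);
        name[12] = '\0';

        printf("max hypervisor leaf: 0x%08x\n", max_leaf);
        printf("vendor signature:    \"%s\"\n", name);
    }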
Hypervisor Specific Leaves:
Leaf range 0x40000001 - 0x4000000F.
These CPUID leaves are reserved as hypervisor-specific leaves.
The semantics of these 15 leaves depend on the signature read
from the "Hypervisor CPUID Information Leaf".
Generic Leaves:
Leaf range 0x40000010 - 0x400000FF.
The semantics of these leaves are consistent across all
hypervisors. This allows the guest kernel to probe and
interpret these leaves without checking for a hypervisor
signature.
A hypervisor can indicate that a leaf or a leaf's field is
unsupported by returning zero when that leaf or field is probed.
To avoid the situation where multiple hypervisors attempt to define the
semantics for the same leaf during development, we can partition
the generic leaf space to allow each hypervisor to define a part
of the generic space.
For instance:
VMware could define 0x4000001X
Xen could define 0x4000002X
KVM could define 0x4000003X
and so on...
Note that hypervisors can implement any leaves that have been
defined in the generic leaf space whenever common features can
be found. For example, VMware hypervisors can implement leaves
that have been defined in the KVM area 0x4000003X, and vice
versa.
The kernel can detect the support for a generic field inside
leaf 0x400000XY using the following algorithm:
1. Get EAX from Leaf 0x40000000, the Hypervisor CPUID Information Leaf.
EAX returns the maximum input value for the hypervisor CPUID
space.
If EAX < 0x400000XY, then the field is not available.
2. Else, extract the field from the target Leaf 0x400000XY
by doing cpuid(0x400000XY).
If (field == 0), this feature is unsupported/unimplemented
by the hypervisor. The kernel should handle this case
gracefully so that a hypervisor is never required to
support or implement any particular generic leaf.
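A sketch of this two-step probe as a helper function, again assuming the
cpuid() wrapper from the earlier sketch (the function name is made up for
the example):

    /* Return the requested register (0=EAX .. 3=EDX) of a generic leaf,
     * or 0 if the hypervisor does not report the leaf at all or leaves
     * the field zero (unsupported/unimplemented). */
    static uint32_t hv_generic_field(uint32_t leaf, int reg)
    {
        uint32_t r[4];

        cpuid(0x40000000, &r[0], &r[1], &r[2], &r[3]);
        if (r[0] < leaf)                 /* step 1: leaf not available */
            return 0;

        cpuid(leaf, &r[0], &r[1], &r[2], &r[3]);
        return r[reg];                   /* step 2: zero == unsupported */
    }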
--------------------------------------------------------------------------------
Definition of the Generic CPUID space.
Leaf 0x40000010, Timing Information.
VMware has defined the first generic leaf to provide timing
information. This leaf returns the current TSC frequency and
current Bus frequency in kHz.
# EAX: (Virtual) TSC frequency in kHz.
# EBX: (Virtual) Bus (local APIC timer) frequency in kHz.
# ECX, EDX: RESERVED (Per above, reserved fields are set to zero).
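Using the hypothetical hv_generic_field() helper from the sketch above,
reading this leaf could look like:

    static void report_virtual_timing(void)
    {
        uint32_t tsc_khz = hv_generic_field(0x40000010, 0);  /* EAX */
        uint32_t bus_khz = hv_generic_field(0x40000010, 1);  /* EBX */

        if (tsc_khz)
            printf("virtual TSC frequency:        %u kHz\n", tsc_khz);
        if (bus_khz)
            printf("virtual APIC timer frequency: %u kHz\n", bus_khz);
    }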
--------------------------------------------------------------------------------
Written By,
Alok N Kataria <[email protected]>
Dan Hecht <[email protected]>
Inputs from,
Jun Nakajima <[email protected]>
Alok Kataria wrote:
>
> (This proposal may be adopted by other guest OSes. However, that is not
> a requirement because a hypervisor can expose a different CPUID
> interface depending on the guest OS type that is specified by the VM
> configuration.)
>
Excuse me, but that is blatantly idiotic. Expecting the user to have to
configure a VM to match the target OS is *exactly* as stupid as
expecting the user to reconfigure the BIOS. It's totally the wrong
thing to do.
-hpa
On Wed, 2008-10-01 at 10:21 -0700, H. Peter Anvin wrote:
> Alok Kataria wrote:
> >
> > (This proposal may be adopted by other guest OSes. However, that is not
> > a requirement because a hypervisor can expose a different CPUID
> > interface depending on the guest OS type that is specified by the VM
> > configuration.)
> >
>
> Excuse me, but that is blatantly idiotic. Expecting the user having to
> configure a VM to match the target OS is *exactly* as stupid as
> expecting the user to reconfigure the BIOS. It's totally the wrong
> thing to do.
Hi Peter,
It's not a user who has to do anything special here.
There are *intelligent* VM developers out there who can export a
different CPUID interface depending on the guest OS type. And this is
what most of the hypervisors do (not necessarily for CPUID, but for
other things right now).
Alok.
>
> -hpa
Alok Kataria wrote:
>
> Hi Peter,
>
> Its not a user who has to do anything special here.
> There are *intelligent* VM developers out there who can export a
> different CPUid interface depending on the guest OS type. And this is
> what most of the hypervisors do (not necessarily for CPUID, but for
> other things right now).
>
It doesn't matter, really; it's still the wrong thing to do, for the
same reason it's the wrong thing in -- for example -- ACPI, which has
similar "cleverness".
If we want to have a "Linux standard CPUID interface" suite we should
just put them on a different set of numbers and let a hypervisor export
all the interfaces.
-hpa
Alok Kataria wrote:
>
> Hypervisor CPUID Interface Proposal
> -----------------------------------
>
> Intel & AMD have reserved cpuid levels 0x40000000 - 0x400000FF for
> software use. Hypervisors can use these levels to provide an interface
> to pass information from the hypervisor to the guest running inside a
> virtual machine.
>
> This proposal defines a standard framework for the way in which the
> Linux and hypervisor communities incrementally define this CPUID space.
>
I also observe that your proposal provides no means of positive
identification, i.e. that a hypervisor actually conforms to your proposal.
-hpa
Alok Kataria wrote:
> Hi,
>
> Please find below the proposal for the generic use of cpuid space
> allotted for hypervisors. Apart from this cpuid space another thing
> worth noting would be that, Intel & AMD reserve the MSRs from 0x40000000
> - 0x400000FF for software use. Though the proposal doesn't talk about
> MSR's right now, we should be aware of these reservations as we may want
> to extend the way we use CPUID to MSR usage as well.
>
> While we are at it, we also think we should form a group which has at
> least one person representing each of the hypervisors interested in
> generalizing the hypervisor CPUID space for Linux guest OS. This group
> will be informed whenever a new CPUID leaf from the generic space is to
> be used. This would help avoid any duplicate definitions for a CPUID
> semantic by two different hypervisors. I think most of the people are
> subscribed to LKML or the virtualization lists and we should use these
> lists as a platform to decide on things.
>
> Thanks,
> Alok
>
> ---
>
> Hypervisor CPUID Interface Proposal
> -----------------------------------
>
> Intel & AMD have reserved cpuid levels 0x40000000 - 0x400000FF for
> software use. Hypervisors can use these levels to provide an interface
> to pass information from the hypervisor to the guest running inside a
> virtual machine.
>
> This proposal defines a standard framework for the way in which the
> Linux and hypervisor communities incrementally define this CPUID space.
>
> (This proposal may be adopted by other guest OSes. However, that is not
> a requirement because a hypervisor can expose a different CPUID
> interface depending on the guest OS type that is specified by the VM
> configuration.)
>
> Hypervisor Present Bit:
> Bit 31 of ECX of CPUID leaf 0x1.
>
> This bit has been reserved by Intel & AMD for use by
> hypervisors, and indicates the presence of a hypervisor.
>
> Virtual CPU's (hypervisors) set this bit to 1 and physical CPU's
> (all existing and future cpu's) set this bit to zero. This bit
> can be probed by the guest software to detect whether they are
> running inside a virtual machine.
>
> Hypervisor CPUID Information Leaf:
> Leaf 0x40000000.
>
> This leaf returns the CPUID leaf range supported by the
> hypervisor and the hypervisor vendor signature.
>
> # EAX: The maximum input value for CPUID supported by the hypervisor.
> # EBX, ECX, EDX: Hypervisor vendor ID signature.
>
> Hypervisor Specific Leaves:
> Leaf range 0x40000001 - 0x4000000F.
>
> These cpuid leaves are reserved as hypervisor specific leaves.
> The semantics of these 15 leaves depend on the signature read
> from the "Hypervisor Information Leaf".
>
> Generic Leaves:
> Leaf range 0x40000010 - 0x400000FF.
>
> The semantics of these leaves are consistent across all
> hypervisors. This allows the guest kernel to probe and
> interpret these leaves without checking for a hypervisor
> signature.
>
> A hypervisor can indicate that a leaf or a leaf's field is
> unsupported by returning zero when that leaf or field is probed.
>
> To avoid the situation where multiple hypervisors attempt to define the
> semantics for the same leaf during development, we can partition
> the generic leaf space to allow each hypervisor to define a part
> of the generic space.
>
> For instance:
> VMware could define 0x4000001X
> Xen could define 0x4000002X
> KVM could define 0x4000003X
> and so on...
>
No, we're not getting anywhere. This is an outright broken idea. The
space is too small to be able to chop up in this way, and the number of
vendors too large to be able to do it without having a central oversight.
The only way this can work is by having explicit positive identification
of each group of leaves with a signature. If there's a recognizable
signature, then you can inspect the rest of the group; if not, then you
can't. That way, you can avoid any leaf usage which doesn't conform to
this model, and you can also simultaneously support multiple hypervisor
ABIs. It also accommodates existing hypervisor use of this leaf space,
even if they currently use a fixed location within it.
A concrete counter-proposal:
The space 0x40000000-0x400000ff is reserved for hypervisor usage.
This region is divided into 16 16-leaf blocks. Each block has the
structure:
0x400000x0:
eax: max used leaf within the leaf block (max 0x400000xf)
e[bcd]x: leaf block signature. This may be a hypervisor-specific
signature, or a generic signature, depending on the contents of the block
A guest may search for any supported Hypervisor ABIs by inspecting each
leaf at 0x400000x0 for a known signature, and then may choose its mode
of operation accordingly. It must ignore any unknown signatures, and
not touch any of the leaves within an unknown leaf block.
Hypervisor vendors who want to add a hypervisor-specific leaf block must
choose a signature which is recognizably related to their or their
hypervisor's name.
Signatures starting with "Generic" are reserved for generic leaf blocks.
A guest may scan leaf blocks to enumerate what hypervisor ABIs/hypercall
interfaces are available to it. It may mix and match any information
from leaves it understands. However, once it starts using a specific
hypervisor ABI by making hypercalls or doing other operations with
side-effects, it must commit to using that ABI exclusively (a specific
hypervisor ABI may include the generic ABI by reference, however).
Correspondingly, a hypervisor must treat any cpuid accesses as
side-effect free.
Definition of specific blocks:
Generic hypervisor leaf block:
0x400000x0 signature is "GenericVMMIF" (or something)
0x400000x1 tsc leaf as you've described
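For illustration, a rough sketch of the block scan described above; the
signature strings and block layout are only what this counter-proposal
suggests, not an existing ABI:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                      uint32_t *c, uint32_t *d)
    {
        __asm__ volatile("cpuid"
                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                         : "a"(leaf), "c"(0));
    }

    static void scan_leaf_blocks(void)
    {
        uint32_t base;

        for (base = 0x40000000; base <= 0x400000f0; base += 0x10) {
            uint32_t max_leaf, sig[3];
            char name[13];

            cpuid(base, &max_leaf, &sig[0], &sig[1], &sig[2]);
            memcpy(name, sig, 12);
            name[12] = '\0';

            if (!strncmp(name, "Generic", 7))
                printf("generic block at 0x%08x (max leaf 0x%08x)\n",
                       base, max_leaf);
            else if (name[0])
                printf("vendor block \"%s\" at 0x%08x\n", name, base);
            /* Blocks with unknown signatures must be left alone. */
        }
    }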
J
Alok Kataria wrote:
> Its not a user who has to do anything special here.
> There are *intelligent* VM developers out there who can export a
> different CPUid interface depending on the guest OS type. And this is
> what most of the hypervisors do (not necessarily for CPUID, but for
> other things right now).
>
No, that's always a terrible idea. Sure, it's necessary to deal with
some backward-compatibility issues, but we shouldn't even consider a new
interface which assumes this kind of thing. We want properly enumerable
interfaces.
J
Jeremy Fitzhardinge wrote:
>
> No, we're not getting anywhere. This is an outright broken idea. The
> space is too small to be able to chop up in this way, and the number of
> vendors too large to be able to do it without having a central oversight.
>
I suspect we can get a larger number space if we ask Intel & AMD. In
fact, I think we should request that the entire 0x40xxxxxx numberspace
is assigned to virtualization *anyway*.
-hpa
H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>>
>> No, we're not getting anywhere. This is an outright broken idea.
>> The space is too small to be able to chop up in this way, and the
>> number of vendors too large to be able to do it without having a
>> central oversight.
>>
>
> I suspect we can get a larger number space if we ask Intel & AMD. In
> fact, I think we should request that the entire 0x40xxxxxx numberspace
> is assigned to virtualization *anyway*.
Yes, that would be good. In that case I'd revise my proposal to make
each leaf block 256 leaves instead of 16. But it still needs to be a
proper enumeration with signatures, rather than assigning fixed points
in that space to specific interfaces.
J
Jeremy Fitzhardinge wrote:
>>
>> I suspect we can get a larger number space if we ask Intel & AMD. In
>> fact, I think we should request that the entire 0x40xxxxxx numberspace
>> is assigned to virtualization *anyway*.
>
> Yes, that would be good. In that case I'd revise my proposal to back
> each leaf block 256 leaves instead of 16. But it still needs to be a
> proper enumeration with signatures, rather than assigning fixed points
> in that space to specific interfaces.
>
With a sufficiently large block, we could use fixed points, e.g. by
having each vendor create interfaces in the 0x40SSSSXX range, where SSSS
is the PCI ID they use for PCI devices.
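As a sketch of that layout (the PCI vendor IDs below are real PCI-SIG
assignments; the mapping of vendor ID to leaf block is only this
suggestion):

    #include <stdint.h>

    /* 0x40SSSSXX: the 16-bit PCI vendor ID selects a 256-leaf block. */
    #define HV_VENDOR_BASE(pci_vendor_id) \
        (0x40000000u | ((uint32_t)(pci_vendor_id) << 8))

    /* e.g. HV_VENDOR_BASE(0x15ad) == 0x4015ad00  (VMware)
     *      HV_VENDOR_BASE(0x1af4) == 0x401af400  (Red Hat / virtio)  */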
Note that I said "create interfaces". The important thing here is who
specified the interface -- for "what hypervisor is this" just use
0x40000000 and disambiguate based on that.
-hpa
H. Peter Anvin wrote:
> With a sufficiently large block, we could use fixed points, e.g. by
> having each vendor create interfaces in the 0x40SSSSXX range, where
> SSSS is the PCI ID they use for PCI devices.
Sure, you could do that, but you'd still want to have a signature in
0x40SSSS00 to positively identify the chunk. And what if you wanted
more than 256 leaves?
> Note that I said "create interfaces". It's important that all about
> this is who specified the interface -- for "what hypervisor is this"
> just use 0x40000000 and disambiguate based on that.
"What hypervisor is this?" isn't a very interesting question; if you're
even asking it then it suggests that something has gone wrong. It's much
more useful to ask "what interfaces does this hypervisor support?", and
enumerating a smallish range of well-known leaves looking for signatures
is the simplest way to do that. (We could use signatures derived from
the PCI vendor IDs which would help with managing that namespace.)
J
Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> With a sufficiently large block, we could use fixed points, e.g. by
>> having each vendor create interfaces in the 0x40SSSSXX range, where
>> SSSS is the PCI ID they use for PCI devices.
>
> Sure, you could do that, but you'd still want to have a signature in
> 0x40SSSS00 to positively identify the chunk. And what if you wanted
> more than 256 leaves?
What you'd want, at least, is a standard CPUID identification and range
leaf at the top. 256 leaves is a *lot*, though; I'm not saying one
couldn't run out, but it'd be hard. Keep in mind that for large objects
there are "counting" CPUID levels, as much as I personally dislike them,
and one could easily argue that if you're doing something that would
require anywhere near 256 leaves you probably are storing bulk data that
belongs elsewhere.
Of course, if we had some kind of central authority assigning 8-bit IDs
that would be even better, especially since there are tools in the field
which already scan on 64K boundaries. I don't know, though, how likely
it is that we'll have to deal with 256 hypervisors.
>> Note that I said "create interfaces". It's important that all about
>> this is who specified the interface -- for "what hypervisor is this"
>> just use 0x40000000 and disambiguate based on that.
>
> "What hypervisor is this?" isn't a very interesting question; if you're
> even asking it then it suggests that something has gone wrong. Its much
> more useful to ask "what interfaces does this hypervisor support?", and
> enumerating a smallish range of well-known leaves looking for signatures
> is the simplest way to do that. (We could use signatures derived from
> the PCI vendor IDs which would help with managing that namespace.)
>
I agree completely, of course (except that "what hypervisor is this"
still has limited usage, especially when it comes to dealing with bug
workarounds. Similar to the way we use CPU vendor IDs and stepping
numbers for physical CPUs.)
-hpa
H. Peter Anvin wrote:
> What you'd want, at least, is a standard CPUID identification and
> range leaf at the top. 256 leaves is a *lot*, though; I'm not saying
> one couldn't run out, but it'd be hard. Keep in mind that for large
> objects there are "counting" CPUID levels, as much as I personally
> dislike them, and one could easily argue that if you're doing
> something that would require anywhere near 256 leaves you probably are
> storing bulk data that belongs elsewhere.
I agree, but it just makes the proposal a bit more brittle.
> Of course, if we had some kind of central authority assigning 8-bit
> IDs that would be even better, especially since there are tools in the
> field which already scan on 64K boundaries. I don't know, though, how
> likely it is that we'll have to deal with 256 hypervisors.
I'm assuming that the likelihood of getting all possible vendors -
current and future - to agree to a scheme like this is pretty small. We
need to come up with something that will work well when there are
non-cooperative parties to deal with.
> I agree completely, of course (except that "what hypervisor is this"
> still has limited usage, especially when it comes to dealing with bug
> workarounds. Similar to the way we use CPU vendor IDs and stepping
> numbers for physical CPUs.)
I guess. It's certainly useful to be able to identify the hypervisor for
bug reporting and just general status information. But making
functional changes on that basis should be a last resort.
J
Jeremy Fitzhardinge wrote:
> Alok Kataria wrote:
>
> No, we're not getting anywhere. This is an outright broken idea. The
> space is too small to be able to chop up in this way, and the number of
> vendors too large to be able to do it without having a central oversight.
>
> The only way this can work is by having explicit positive identification
> of each group of leaves with a signature. If there's a recognizable
> signature, then you can inspect the rest of the group; if not, then you
> can't. That way, you can avoid any leaf usage which doesn't conform to
> this model, and you can also simultaneously support multiple hypervisor
> ABIs. It also accommodates existing hypervisor use of this leaf space,
> even if they currently use a fixed location within it.
>
> A concrete counter-proposal:
Mmm, cpuid bikeshedding :-)
> The space 0x40000000-0x400000ff is reserved for hypervisor usage.
>
> This region is divided into 16 16-leaf blocks. Each block has the
> structure:
>
> 0x400000x0:
> eax: max used leaf within the leaf block (max 0x400000xf)
Why even bother with this? It doesn't seem necessary in your proposal.
Regards,
Anthony Liguori
Anthony Liguori wrote:
> Mmm, cpuid bikeshedding :-)
My shade of blue is better.
>> The space 0x40000000-0x400000ff is reserved for hypervisor usage.
>>
>> This region is divided into 16 16-leaf blocks. Each block has the
>> structure:
>>
>> 0x400000x0:
>> eax: max used leaf within the leaf block (max 0x400000xf)
>
> Why even bother with this? It doesn't seem necessary in your proposal.
It allows someone to incrementally add things to their block in a fairly
orderly way. But more importantly, it's the prevailing idiom, and the
existing and proposed cpuid schemes already do this, so they'd fit in as-is.
J
* Jeremy Fitzhardinge ([email protected]) wrote:
> "What hypervisor is this?" isn't a very interesting question; if you're
> even asking it then it suggests that something has gone wrong.
It's essentially already happening. Everyone wants to be a better
hyperv than hyperv ;-)
On Wed, 2008-10-01 at 11:04 -0700, Jeremy Fitzhardinge wrote:
> No, we're not getting anywhere. This is an outright broken idea. The
> space is too small to be able to chop up in this way, and the number of
> vendors too large to be able to do it without having a central oversight.
>
> The only way this can work is by having explicit positive identification
> of each group of leaves with a signature. If there's a recognizable
> signature, then you can inspect the rest of the group; if not, then you
> can't. That way, you can avoid any leaf usage which doesn't conform to
> this model, and you can also simultaneously support multiple hypervisor
> ABIs. It also accommodates existing hypervisor use of this leaf space,
> even if they currently use a fixed location within it.
>
> A concrete counter-proposal:
>
> The space 0x40000000-0x400000ff is reserved for hypervisor usage.
>
> This region is divided into 16 16-leaf blocks. Each block has the
> structure:
>
> 0x400000x0:
> eax: max used leaf within the leaf block (max 0x400000xf)
> e[bcd]x: leaf block signature. This may be a hypervisor-specific
> signature, or a generic signature, depending on the contents of the block
>
> A guest may search for any supported Hypervisor ABIs by inspecting each
> leaf at 0x400000x0 for a known signature, and then may choose its mode
> of operation accordingly. It must ignore any unknown signatures, and
> not touch any of the leaves within an unknown leaf block.
> Hypervisor vendors who want to add a hypervisor-specific leaf block must
> choose a signature which is recognizably related to their or their
> hypervisor's name.
>
> Signatures starting with "Generic" are reserved for generic leaf blocks.
>
> A guest may scan leaf blocks to enumerate what hypervisor ABIs/hypercall
> interfaces are available to it. It may mix and match any information
> from leaves it understands. However, once it starts using a specific
> hypervisor ABI by making hypercalls or doing other operations with
> side-effects, it must commit to using that ABI exclusively (a specific
> hypervisor ABI may include the generic ABI by reference, however).
>
> Correspondingly, a hypervisor must treat any cpuid accesses as
> side-effect free.
>
> Definition of specific blocks:
>
> Generic hypervisor leaf block:
> 0x400000x0 signature is "GenericVMMIF" (or something)
> 0x400000x1 tsc leaf as you've described
>
I see the following issues with this proposal:
1. Kernel complexity: Just thinking about the complexity that this will
put in the kernel to handle these multiple ABI signatures and scanning
all of these leaf blocks is difficult to digest.
2. Divergence in the interface provided by the hypervisors:
The reason we brought up a flat hierarchy is because we think we should
be moving towards an approach where the guest code doesn't diverge too
much when running under different hypervisors. That is, the guest
essentially does the same thing whether it's running on, say, Xen or VMware.
This design, IMO, will take us a step backward to what we have already
seen with paravirt ops. Each hypervisor (mostly) defines its own CPUID
block, and the guest correspondingly needs code to handle each of these
CPUID blocks, with the blocks mostly being mutually exclusive.
3. Is there a need to do all this over-engineering:
Aren't we over-engineering a simple interface here? The point is,
there are right now 256 CPUID leaves; do we realistically think we are
ever going to exhaust them all? We are really surprised that people
think this space might be too small. It would be interesting to know
what uses you might want to put CPUID to.
Thanks,
Alok
> J
Jeremy Fitzhardinge wrote:
> Anthony Liguori wrote:
>> Mmm, cpuid bikeshedding :-)
>
> My shade of blue is better.
>
>>> The space 0x40000000-0x400000ff is reserved for hypervisor usage.
>>>
>>> This region is divided into 16 16-leaf blocks. Each block has the
>>> structure:
>>>
>>> 0x400000x0:
>>> eax: max used leaf within the leaf block (max 0x400000xf)
>> Why even bother with this? It doesn't seem necessary in your proposal.
>
> It allows someone to incrementally add things to their block in a fairly
> orderly way. But more importantly, its the prevailing idiom, and the
> existing and proposed cpuid schemes already do this, so they'd fit in as-is.
We just leave eax as zero. It wouldn't be that upsetting to change this
as it would only keep new guests from working on older KVMs.
However, I see little incentive to change anything unless there's
something compelling that we would get in return. Since we're only
talking about Linux guests, it's just as easy for us to add things to
our paravirt_ops implementation as it would be to add things using this
new model.
If this was something that other guests were all agreeing to support
(even if it was just the BSDs and OpenSolaris), then there may be value
to it. Right now, I see no real value in changing the status quo.
Regards,
Anthony Liguori
> J
On Wed, 2008-10-01 at 11:06 -0700, Jeremy Fitzhardinge wrote:
> Alok Kataria wrote:
> > Its not a user who has to do anything special here.
> > There are *intelligent* VM developers out there who can export a
> > different CPUid interface depending on the guest OS type. And this is
> > what most of the hypervisors do (not necessarily for CPUID, but for
> > other things right now).
> >
>
> No, that's always a terrible idea. Sure, it's necessary to deal with
> some backward-compatibility issues, but we shouldn't even consider a new
> interface which assumes this kind of thing. We want properly enumerable
> interfaces.
The reason we still have to do this is because Microsoft has already
defined a CPUID format which is way different than what you or I are
proposing (with the current case of 256 leaves being available). And I
doubt they would change the way they deal with it on their OS.
Any proposal that we go with, we will have to export a different CPUID
interface from the hypervisor for the two OSes in question.
So I think this is something that we will have to do anyway, and it is
not worth bringing up in the discussion.
--
Alok
> J
Alok Kataria wrote:
> On Wed, 2008-10-01 at 11:04 -0700, Jeremy Fitzhardinge wrote:
>
> 2. Divergence in the interface provided by the hypervisors :
> The reason we brought up a flat hierarchy is because we think we should
> be moving towards a approach where the guest code doesn't diverge too
> much when running under different hypervisors. That is the guest
> essentially does the same thing if its running on say Xen or VMware.
>
> This design IMO, will take us a step backward to what we already have
> seen with para virt ops. Each hypervisor (mostly) defines its own cpuid
> block, the guest correspondingly needs to have code to handle each of
> these cpuid blocks, with these blocks will mostly being exclusive.
>
What's wrong with what we have in paravirt_ops? Just agreeing on CPUID
doesn't help very much. You still need a mechanism for doing hypercalls
to implement anything meaningful. We aren't going to agree on a
hypercall mechanism. KVM uses direct hypercall instructions, Xen uses a
hypercall page, VMware uses VMI, Hyper-V uses MSR writes. We all have
already defined the hypercall namespace in a certain way.
We've already gone down the road of trying to make standard paravirtual
interfaces (via virtio). No one was sufficiently interested in
collaborating. I don't see why other paravirtualizations are going to
be much different.
Regards,
Anthony Liguori
Alok Kataria wrote:
> 1. Kernel complexity : Just thinking about the complexity that this will
> put in the kernel to handle these multiple ABI signatures and scanning
> all of these leaf block's is difficult to digest.
>
The scanning for the signatures is trivial; it's not a significant
amount of code. Actually implementing them is a different matter, but
that's the same regardless of where they are placed or how they're
discovered. After discovery it's the same either way: there's a leaf
base with offsets from it.
> 2. Divergence in the interface provided by the hypervisors :
> The reason we brought up a flat hierarchy is because we think we should
> be moving towards a approach where the guest code doesn't diverge too
> much when running under different hypervisors. That is the guest
> essentially does the same thing if its running on say Xen or VMware.
>
I guess, but the bulk of the uses of this stuff are going to be
hypervisor-specific. You're hard-pressed to come up with any other
generic uses beyond tsc. In general, if a hypervisor is going to put
something in a special cpuid leaf, it's because there's no other good way
to represent it. Generic things are generally going to appear as an
emulated piece of the virtualized platform, in ACPI, DMI, a
hardware-defined cpuid leaf, etc...
> 3. Is their a need to do all this over engineering :
> Aren't we over engineering a simple interface over here. The point is,
> there are right now 256 cpuid leafs do we realistically think we are
> ever going to exhaust all these leafs. We are really surprised to know
> that people may think this space is small enough. It would be
> interesting to know what all use you might want to put cpuid for.
>
Look, if you want to propose a way to use that cpuid space in a
reasonably flexible way that allows it to be used as the need arises,
then we can talk about it. But I think your proposal is a poor way to
achieve those ends.
If you want blessing for something that you've already implemented and
shipped, well, you don't need anyone's blessing for that.
J
* Anthony Liguori ([email protected]) wrote:
> We've already gone down the road of trying to make standard paravirtual
> interfaces (via virtio). No one was sufficiently interested in
> collaborating. I don't see why other paravirtualizations are going to
> be much different.
The point is to be able to support those interfaces. Presently a Linux guest
will test and find out which HV it's running on, and adapt. Another
guest will fail to enlighten itself, and perf will suffer...yadda, yadda.
thanks,
-chris
On Wed, 2008-10-01 at 14:08 -0700, Anthony Liguori wrote:
> Alok Kataria wrote:
> > On Wed, 2008-10-01 at 11:04 -0700, Jeremy Fitzhardinge wrote:
> >
> > 2. Divergence in the interface provided by the hypervisors :
> > The reason we brought up a flat hierarchy is because we think we should
> > be moving towards a approach where the guest code doesn't diverge too
> > much when running under different hypervisors. That is the guest
> > essentially does the same thing if its running on say Xen or VMware.
> >
> > This design IMO, will take us a step backward to what we already have
> > seen with para virt ops. Each hypervisor (mostly) defines its own cpuid
> > block, the guest correspondingly needs to have code to handle each of
> > these cpuid blocks, with these blocks will mostly being exclusive.
> >
>
> What's wrong with what we have in paravirt_ops?
Your explanation below answers the question you raised: the problem
being that we need to have support for each of these different hypercall
mechanisms in the kernel.
I understand that this was the correct thing to do at that moment.
But do we want to go the same way again for CPUID when we can make it
generic (flat enough) for anybody to use it in the same manner and
expose a generic interface to the kernel?
> Just agreeing on CPUID
> doesn't help very much.
Yeah, nobody is removing any of the paravirt ops support.
> You still need a mechanism for doing hypercalls
> to implement anything meaningful. We aren't going to agree on a
> hypercall mechanism. KVM uses direct hypercall instructions, Xen uses a
> hypercall page, VMware uses VMI, Hyper-V uses MSR writes. We all have
> already defined the hypercall namespace in a certain way.
Thanks,
Alok
Alok Kataria wrote:
> Your explanation below answers the question you raised, the problem
> being we need to have support for each of these different hypercall
> mechanisms in the kernel.
> I understand that this was the correct thing to do at that moment.
> But do we want to go the same way again for CPUID when we can make it
> generic (flat enough) for anybody to use it in the same manner and
> expose a generic interface to the kernel.
>
But what sort of information can be stored in cpuid that's actually
useful? Right now we just use it in KVM for feature bits. Most of the
stuff that's interesting is stored in shared memory because a guest can
read that without taking a vmexit or via a hypercall.
We can all agree upon a common mechanism for doing something but if no
one is using that mechanism to do anything significant, what purpose
does it serve?
Regards,
Anthony Liguori
Chris Wright wrote:
> * Anthony Liguori ([email protected]) wrote:
>
>> We've already gone down the road of trying to make standard paravirtual
>> interfaces (via virtio). No one was sufficiently interested in
>> collaborating. I don't see why other paravirtualizations are going to
>> be much different.
>>
>
> The point is to be able to support those interfaces. Presently a Linux guest
> will test and find out which HV it's running on, and adapt. Another
> guest will fail to enlighten itself, and perf will suffer...yadda, yadda.
>
Agreeing on CPUID does not get us close at all to having shared
interfaces for paravirtualization. As I said in another note, there are
more fundamental things that we differ on (like hypercall mechanism)
that's going to make that challenging.
We already are sharing code, when appropriate (see the Xen/KVM PV clock
interface).
Regards,
Anthony Liguori
> thanks,
> -chris
>
Jeremy Fitzhardinge wrote:
> Alok Kataria wrote:
>
> I guess, but the bulk of the uses of this stuff are going to be
> hypervisor-specific. You're hard-pressed to come up with any other
> generic uses beyond tsc.
And arguably, storing TSC frequency in CPUID is a terrible interface
because the TSC frequency can change any time a guest is entered. It
really should be a shared memory area so that a guest doesn't have to
vmexit to read it (like it is with the Xen/KVM paravirt clock).
Regards,
Anthony Liguori
> In general, if a hypervisor is going to put something in a special
> cpuid leaf, its because there's no other good way to represent it.
> Generic things are generally going to appear as an emulated piece of
> the virtualized platform, in ACPI, DMI, a hardware-defined cpuid leaf,
> etc...
* Anthony Liguori ([email protected]) wrote:
> And arguably, storing TSC frequency in CPUID is a terrible interface
> because the TSC frequency can change any time a guest is entered. It
True for older hardware; newer hardware should fix this. I guess the
point is, these are numbers that are easy to measure incorrectly in a
guest. That doesn't justify the whole thing, though.
Chris Wright wrote:
> * Jeremy Fitzhardinge ([email protected]) wrote:
>> "What hypervisor is this?" isn't a very interesting question; if you're
>> even asking it then it suggests that something has gone wrong.
>
> It's essentially already happening. Everyone wants to be a better
> hyperv than hyperv ;-)
That's a hy-perv? ;)
-hpa
Alok Kataria wrote:
>> No, that's always a terrible idea. Sure, it's necessary to deal with
>> some backward-compatibility issues, but we shouldn't even consider a new
>> interface which assumes this kind of thing. We want properly enumerable
>> interfaces.
>
> The reason we still have to do this is because, Microsoft has already
> defined a CPUID format which is way different than what you or I are
> proposing ( with the current case of 256 leafs being available). And I
> doubt they would change the way they deal with it on their OS.
> Any proposal that we go with, we will have to export different CPUID
> interface from the hypervisor for the 2 OS in question.
>
> So i think this is something that we anyways will have to do and not
> worth binging about in the discussion.
No, that's a good hint that what "you and I" are proposing is utterly
broken and exactly underscores what I have been stressing about
noncompliant hypervisors.
All I have seen out of Microsoft only covers CPUID levels 0x40000000 as
a vendor identification leaf and 0x40000001 as a "hypervisor
identification leaf", but you might have access to other information.
This further underscores my belief that using 0x400000xx for anything
"standards-based" at all is utterly futile, and that this space should
be treated as vendor identification and the rest as vendor-specific.
Any hope of creating a standard that's actually usable needs to be
outside this space, e.g. in the 0x40SSSSxx space I proposed earlier.
-hpa
On Wed, 2008-10-01 at 14:34 -0700, Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
> > Alok Kataria wrote:
> >
> > I guess, but the bulk of the uses of this stuff are going to be
> > hypervisor-specific. You're hard-pressed to come up with any other
> > generic uses beyond tsc.
>
> And arguably, storing TSC frequency in CPUID is a terrible interface
> because the TSC frequency can change any time a guest is entered. It
> really should be a shared memory area so that a guest doesn't have to
> vmexit to read it (like it is with the Xen/KVM paravirt clock).
It's not terrible, it's actually brilliant. The TSC is part of the
processor architecture; the processor should have a way to tell us what
speed it is.
Having a TSC with no interface to determine the frequency is a terrible
design flaw. This is what caused the problem in the first place.
And now we're trying to fiddle around with software wizardry to do what
should be done in hardware in the first place. Once again, para-virtualization
is basically useless. We can't agree on a solution without
over-designing some complex system with interface signatures and
multi-vendor cooperation and nonsense. Solve the non-virtualized
problem and the virtualized problem goes away.
Jun, you work at Intel. Can you ask for a new architecturally defined
MSR that returns the TSC frequency? Not a virtualization specific MSR.
A real MSR that would exist on physical processors. The TSC started as
an MSR anyway. There should be another MSR that tells the frequency.
If it's hard to do in hardware, it can be a write-once MSR that gets
initialized by the BIOS. It's really a very simple solution to a very
common problem. Other MSRs are dedicated to bus speed and so on, this
seems remarkably similar.
Once the physical problem is solved, the virtualized problem doesn't
even exist. We simply add support for the newly defined MSR and voila.
Other chipmakers probably agree it's a good idea and go along with it
too, and in the meantime, reading a non-existent MSR is a fairly
harmlessly handled #GP.
I realize it's the wrong thing for us now, but long term, it's the only
architecturally 'correct' approach. You can even extend it to have
visible TSC frequency changes clocked via performance counter events
(and then get interrupts on those events if you so wish), solving the
dynamic problem too.
Paravirtualization is a symptom of an architectural problem. We should
always be trying to fix the architecture first.
Zach
Zachary Amsden wrote:
> On Wed, 2008-10-01 at 14:34 -0700, Anthony Liguori wrote:
>
>> Jeremy Fitzhardinge wrote:
>>
>>> Alok Kataria wrote:
>>>
>>> I guess, but the bulk of the uses of this stuff are going to be
>>> hypervisor-specific. You're hard-pressed to come up with any other
>>> generic uses beyond tsc.
>>>
>> And arguably, storing TSC frequency in CPUID is a terrible interface
>> because the TSC frequency can change any time a guest is entered. It
>> really should be a shared memory area so that a guest doesn't have to
>> vmexit to read it (like it is with the Xen/KVM paravirt clock).
>>
>
> It's not terrible, it's actually brilliant.
But of course! Okay, not really :-)
> TSC is part of the
> processor architecture, the processor should a way to tell us what speed
> it is.
>
It does. 1 tick == 1 tick. The processor doesn't have a concept of
wall clock time so wall clock units don't make much sense. If it did,
I'd say, screw the TSC, just give me a ns granular time stamp and let's
all forget that the TSC even exists.
> And now we're trying to fiddle around with software wizardry what should
> be done in hardware in the first place. Once again, para-virtualization
> is basically useless. We can't agree on a solution without
> over-designing some complex system with interface signatures and
> multi-vendor cooperation and nonsense. Solve the non-virtualized
> problem and the virtualized problem goes away.
>
> Jun, you work at Intel. Can you ask for a new architecturally defined
> MSR that returns the TSC frequency? Not a virtualization specific MSR.
> A real MSR that would exist on physical processors. The TSC started as
> an MSR anyway. There should be another MSR that tells the frequency.
> If it's hard to do in hardware, it can be a write-once MSR that gets
> initialized by the BIOS.
rdtscp sort of gives you this. But still, just give me my rdnsc and
I'll be happy.
> I realize it's the wrong thing for us now, but long term, it's the only
> architecturally 'correct' approach. You can even extend it to have
> visible TSC frequency changes clocked via performance counter events
> (and then get interrupts on those events if you so wish), solving the
> dynamic problem too.
>
So a solution is needed that works for now. Anything that requires a
vmexit is bad because the TSC frequency can change quite often. Even if
you ignore the troubles with frequency scaling on older processors and
VCPU migration across NUMA nodes, there will be a very visible change in
TSC frequency after a live migration.
So there are two possible solutions. Have a shared memory area that the
guest can consult that has the latest TSC frequency (this is what KVM
and Xen do) or have some sort of interrupt mechanism that notifies the
guest when the TSC frequency changes after which, software can do
something that vmexits to get the TSC frequency.
The proposed solution doesn't include a TSC frequency change
notification mechanism.
This is part of the problem with this sort of approach to
standardization. It's hard to come up with the best interface at
first. You have to try a couple ways, and then everyone can eventually
standardize on the best one if one ever emerges.
Regards,
Anthony Liguori
> Paravirtualization is a symptom of an architectural problem. We should
> always be trying to fix the architecture first.
>
> Zach
>
>
Zachary Amsden wrote:
>
> Jun, you work at Intel. Can you ask for a new architecturally defined
> MSR that returns the TSC frequency? Not a virtualization specific MSR.
> A real MSR that would exist on physical processors. The TSC started as
> an MSR anyway. There should be another MSR that tells the frequency.
> If it's hard to do in hardware, it can be a write-once MSR that gets
> initialized by the BIOS. It's really a very simple solution to a very
> common problem. Other MSRs are dedicated to bus speed and so on, this
> seems remarkably similar.
>
Ah, if it was only that simple. Transmeta actually did this, but it's
not as useful as you think.
There are at least three crystals in modern PCs: one at 32.768 kHz (for
the RTC), one at 14.31818 MHz (PIT, PMTMR and HPET), and one at a higher
frequency (often 200 MHz.)
All the main data distribution clocks in the system are derived from the
third, which is subject to spread-spectrum modulation due to RFI
concerns. Therefore, relying on the *nominal* frequency of this clock
is vastly incorrect; often by as much as 2%. Spread-spectrum modulation
is supposed to vary around zero enough that the spreading averages out,
but the only way to know what the center frequency actually is is to
average. Furthermore, this high-frequency clock is generally not
calibrated anywhere near as well as the 14 MHz clock; in good designs
the 14 MHz is actually a TCXO (temperature compensated crystal
oscillator), which is accurate to something like ±2 ppm.
-hpa
On 10/1/2008 3:46:45 PM, H. Peter Anvin wrote:
> Alok Kataria wrote:
> > > No, that's always a terrible idea. Sure, it's necessary to deal
> > > with some backward-compatibility issues, but we shouldn't even
> > > consider a new interface which assumes this kind of thing. We
> > > want properly enumerable interfaces.
> >
> > The reason we still have to do this is because, Microsoft has
> > already defined a CPUID format which is way different than what you
> > or I are proposing ( with the current case of 256 leafs being
> > available). And I doubt they would change the way they deal with it on their OS.
> > Any proposal that we go with, we will have to export different CPUID
> > interface from the hypervisor for the 2 OS in question.
> >
> > So i think this is something that we anyways will have to do and not
> > worth binging about in the discussion.
>
> No, that's a good hint that what "you and I" are proposing is utterly
> broken and exactly underscores what I have been stressing about
> noncompliant hypervisors.
>
> All I have seen out of Microsoft only covers CPUID levels 0x40000000
> as an vendor identification leaf and 0x40000001 as a "hypervisor
> identification leaf", but you might have access to other information.
No, it says "Leaf 0x40000001 as hypervisor vendor-neutral interface identification, which determines the semantics of leaves from 0x40000002 through 0x400000FF." Leaf 0x40000000 returns the vendor identifier signature (i.e. hypervisor identification) and the hypervisor CPUID leaf range, as in the proposal.
>
> This further underscores my belief that using 0x400000xx for anything
> "standards-based" at all is utterly futile, and that this space should
> be treated as vendor identification and the rest as vendor-specific.
> Any hope of creating a standard that's actually usable needs to be
> outside this space, e.g. in the 0x40SSSSxx space I proposed earlier.
>
Actually, I'm not sure I'm following your logic. Are you saying that using 0x400000xx for anything "standards-based" is utterly futile because Microsoft said "the range is hypervisor vendor-neutral"? Or were you not sure what they meant there? If we are not clear, we can ask them.
> -hpa
Jun Nakajima | Intel Open Source Technology Center
H. Peter Anvin wrote:
>
> Ah, if it was only that simple. Transmeta actually did this, but it's
> not as useful as you think.
>
For what it's worth, Transmeta's implementation used CPUID leaf
0x80860001.ECX to give the TSC frequency rounded to the nearest MHz.
The caveat of spread-spectrum modulation applies.
-hpa
On Wed, 2008-10-01 at 17:39 -0700, H. Peter Anvin wrote:
> third, which is subject to spread-spectrum modulation due to RFI
> concerns. Therefore, relying on the *nominal* frequency of this clock
I'm not suggesting using the nominal value. I'm suggesting the
measurement be done in the one and only place where there is perfect
control of the system, the processor boot-strapping in the BIOS.
Only the platform designers themselves know the speed of the oscillator
which is modulating the clock and so only they should be calibrating the
speed of the TSC.
If this modulation really does alter the frequency by +/- 2% (seems high
to me, but hey, I don't design motherboards), using an LFO, then
basically all the calibration done in Linux is broken and has been for
some time. You can't calibrate only once, or risk being off by 2%, you
can't calibrate repeatedly and take the fastest estimate, or you are off
by 2%, and you can't calibrate repeatedly and take the average without
risking SMI noise affecting the lowest clock speed measurement,
contributing unknown error.
Hmm. Re-reading your e-mail, I see you are saying the nominal frequency
may be off by 2% (and I easily believe that), not necessarily that the
frequency modulation may be 2% (which I still think is high). Does
anyone know what the actual bounds on spread spectrum modulation are or
how fast the clock is modulated?
Zach
Zachary Amsden wrote:
>
> I'm not suggesting using the nominal value. I'm suggesting the
> measurement be done in the one and only place where there is perfect
> control of the system, the processor boot-strapping in the BIOS.
>
> Only the platform designers themselves know the speed of the oscillator
> which is modulating the clock and so only they should be calibrating the
> speed of the TSC.
>
No. *No one*, including the manufacturers, knows the speed of the
oscillator which is modulating the clock. What you have to do is
average over a timespan which is long enough that the SSM averages out
(a relatively small fraction of a second.)
As for trusting the BIOS on this, that's a total joke. Firmware vendors
can't get the most basic details right.
> If this modulation really does alter the frequency by +/- 2% (seems high
> to me, but hey, I don't design motherboards), using an LFO, then
> basically all the calibration done in Linux is broken and has been for
> some time. You can't calibrate only once, or risk being off by 2%, you
> can't calibrate repeatedly and take the fastest estimate, or you are off
> by 2%, and you can't calibrate repeatedly and take the average without
> risking SMI noise affecting the lowest clock speed measurement,
> contributing unknown error.
You have to calibrate over a sample interval long enough that the SSM
averages out.
> Hmm. Re-reading your e-mail, I see you are saying the nominal frequency
> may be off by 2% (and I easily believe that), not necessarily that the
> frequency modulation may be 2% (which I still think is high). Does
> anyone know what the actual bounds on spread spectrum modulation are or
> how fast the clock is modulated?
No, I'm saying the frequency modulation may be up to 2%. Typically it
is something like [-2%,+0%].
-hpa
Nakajima, Jun wrote:
>>
>> All I have seen out of Microsoft only covers CPUID levels 0x40000000
>> as an vendor identification leaf and 0x40000001 as a "hypervisor
>> identification leaf", but you might have access to other information.
>
> No, it says "Leaf 0x40000001 as hypervisor vendor-neutral interface identification, which determines the semantics of leaves from 0x40000002 through 0x400000FF." The Leaf 0x40000000 returns vendor identifier signature (i.e. hypervisor identification) and the hypervisor CPUID leaf range, as in the proposal.
>
In other words, 0x40000002+ is vendor-specific space, based on the
hypervisor specified in 0x40000001 (in theory); in practice on both
0x40000000:0x40000001, since M$ seems to use clever identifiers like
"Hypervisor 1".
>> This further underscores my belief that using 0x400000xx for anything
>> "standards-based" at all is utterly futile, and that this space should
>> be treated as vendor identification and the rest as vendor-specific.
>> Any hope of creating a standard that's actually usable needs to be
>> outside this space, e.g. in the 0x40SSSSxx space I proposed earlier.
>
> Actually I'm not sure I'm following your logic. Are you saying using that 0x400000xx for anything "standards-based" is utterly futile because Microsoft said "the range is hypervisor vendor-neutral"? Or you were not sure what they meant there. If we are not clear, we can ask them.
>
What I'm saying is that Microsoft is effectively squatting on the
0x400000xx space with their definition. As written, it's not even clear
that it will remain consistent between *their own* hypervisors, even
less anyone else's.
-hpa
Chris Wright wrote:
> * Anthony Liguori ([email protected]) wrote:
>
>> And arguably, storing TSC frequency in CPUID is a terrible interface
>> because the TSC frequency can change any time a guest is entered. It
>>
>
> True for older hardware, newer hardware should fix this. I guess the
> point is, the are numbers that are easy to measure incorrectly in guest.
> Doesn't justify the whole thing..
>
It's not fixed for newer hardware. Larger systems still have multiple
tsc frequencies.
--
error compiling committee.c: too many arguments to function
On 10/1/2008 6:24:26 PM, H. Peter Anvin wrote:
> Nakajima, Jun wrote:
> > >
> > > All I have seen out of Microsoft only covers CPUID levels
> > > 0x40000000 as an vendor identification leaf and 0x40000001 as a
> > > "hypervisor identification leaf", but you might have access to other information.
> >
> > No, it says "Leaf 0x40000001 as hypervisor vendor-neutral interface
> > identification, which determines the semantics of leaves from
> > 0x40000002 through 0x400000FF." The Leaf 0x40000000 returns vendor
> > identifier signature (i.e. hypervisor identification) and the
> > hypervisor CPUID leaf range, as in the proposal.
> >
>
Resuming the thread :-)
> In other words, 0x40000002+ is vendor-specific space, based on the
> hypervisor specified in 0x40000001 (in theory); in practice both
> 0x40000000:0x40000001 since M$ seem to use clever identifiers as
> "Hypervisor 1".
What it means is that their hypervisor returns the interface signature (i.e. "Hv#1"), and that defines the interface. If we use "Lv_1", for example, we can define the interface 0x40000002 through 0x400000FF for Linux. Since leaves 0x40000000 and 0x40000001 are separate, we can decouple the hypervisor vendor from the interface it supports. This also allows a hypervisor to support multiple interfaces.
And whether a guest wants to use the interface without checking the vendor ID is a different thing. For Linux, we don't want to hardcode the vendor IDs in the upstream code, at least for such a generic interface.
So I think we need to modify the proposal:
Hypervisor interface identification Leaf:
Leaf 0x40000001.
This leaf returns the interface signature that the hypervisor implements.
# EAX: "Lv_1" (or something)
# EBX, ECX, EDX: Reserved.
Lv_1 interface Leaves:
Leaf range 0x40000002 - 0x400000FF.
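A sketch of the resulting check; "Lv_1" is the hypothetical interface
name used above, "Hv#1" is the interface signature Hyper-V reports, and
the cpuid() wrapper is as in the earlier sketches:

    #include <string.h>

    /* sig is a 4-character interface name, e.g. "Lv_1" or "Hv#1". */
    static int has_interface(const char sig[4])
    {
        uint32_t a, b, c, d;

        cpuid(0x40000000, &a, &b, &c, &d);
        if (a < 0x40000001)
            return 0;

        cpuid(0x40000001, &a, &b, &c, &d);
        return !memcmp(&a, sig, 4);      /* EAX holds the signature */
    }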
In fact, both Xen and KVM are using the leaf 0x40000001 for different purposes today (Xen: Xen version number, KVM: KVM para-virtualization features). But I don't think this would break their existing binaries mainly because they would need to expose the interface explicitly now.
>
> > > This further underscores my belief that using 0x400000xx for
> > > anything "standards-based" at all is utterly futile, and that this
> > > space should be treated as vendor identification and the rest as
> > > vendor-specific. Any hope of creating a standard that's actually
> > > usable needs to be outside this space, e.g. in the 0x40SSSSxx
> > > space I proposed earlier.
> >
> > Actually I'm not sure I'm following your logic. Are you saying using
> > that 0x400000xx for anything "standards-based" is utterly futile
> > because Microsoft said "the range is hypervisor vendor-neutral"? Or
> > you were not sure what they meant there. If we are not clear, we can
> > ask them.
> >
>
> What I'm saying is that Microsoft is effectively squatting on the
> 0x400000xx space with their definition. As written, it's not even
> clear that it will remain consistent between *their own* hypervisors,
> even less anyone else's.
I hope the above clarified your concern. You can find a more detailed public spec via a web search; let me know if you want a specific URL.
>
> -hpa
>
Jun Nakajima | Intel Open Source Technology Center
Nakajima, Jun wrote:
> What it means is that their hypervisor returns the interface signature (i.e. "Hv#1"), and that defines the interface. If we use "Lv_1", for example, we can define the interface 0x40000002 through 0x400000FF for Linux. Since leaves 0x40000000 and 0x40000001 are separate, we can decouple the hypervisor vendor from the interface it supports.
Right so far.
> This also allows a hypervisor to support multiple interfaces.
Wrong.
This isn't a two-way interface. It's a one-way interface, and it
*SHOULD BE*; exposing different information depending on what is running
is a hack that is utterly torturous at best.
>
> In fact, both Xen and KVM are using the leaf 0x40000001 for different purposes today (Xen: Xen version number, KVM: KVM para-virtualization features). But I don't think this would break their existing binaries mainly because they would need to expose the interface explicitly now.
>
>>>> This further underscores my belief that using 0x400000xx for
>>>> anything "standards-based" at all is utterly futile, and that this
>>>> space should be treated as vendor identification and the rest as
>>>> vendor-specific. Any hope of creating a standard that's actually
>>>> usable needs to be outside this space, e.g. in the 0x40SSSSxx
>>>> space I proposed earlier.
>>> Actually I'm not sure I'm following your logic. Are you saying using
>>> that 0x400000xx for anything "standards-based" is utterly futile
>>> because Microsoft said "the range is hypervisor vendor-neutral"? Or
>>> you were not sure what they meant there. If we are not clear, we can
>>> ask them.
>>>
>> What I'm saying is that Microsoft is effectively squatting on the
>> 0x400000xx space with their definition. As written, it's not even
>> clear that it will remain consistent between *their own* hypervisors,
>> even less anyone else's.
>
> I hope the above clarified your concern. You can google-search a more detailed public spec. Let me know if you want to know a specific URL.
>
No, it hasn't "clarified my concern" in any way. It's exactly
*underscoring* it. In other words, I consider 0x400000xx unusable for
anything that is standards-based. The interfaces everyone is currently
using aren't designed to export multiple interfaces; they're designed to
tell the guest which *one* interface is exported. That is fine, we just
need to go elsewhere.
-hpa
On 10/3/2008 4:30:29 PM, H. Peter Anvin wrote:
> Nakajima, Jun wrote:
> > What it means is that their hypervisor returns the interface signature (i.e.
> > "Hv#1"), and that defines the interface. If we use "Lv_1", for
> > example, we can define the interface 0x40000002 through 0x400000FF for Linux.
> > Since leaf 0x40000000 and 0x40000001 are separate, we can decouple
> > the hypervisor vendor from the interface it supports.
>
> Right so far.
>
> > This also allows a hypervisor to support multiple interfaces.
>
> Wrong.
>
> This isn't a two-way interface. It's a one-way interface, and it
> *SHOULD BE*; exposing different information depending on what is
> running is a hack that is utterly torturous at best.
What I mean is that a hypervisor (with a single vendor id) can support multiple interfaces, exposing to each guest the single interface that guest expects at runtime.
>
> >
> > In fact, both Xen and KVM are using the leaf 0x40000001 for
> > different purposes today (Xen: Xen version number, KVM: KVM
> > para-virtualization features). But I don't think this would break
> > their existing binaries mainly because they would need to expose the interface explicitly now.
> >
> > > > > This further underscores my belief that using 0x400000xx for
> > > > > anything "standards-based" at all is utterly futile, and that
> > > > > this space should be treated as vendor identification and the
> > > > > rest as vendor-specific. Any hope of creating a standard
> > > > > that's actually usable needs to be outside this space, e.g. in
> > > > > the 0x40SSSSxx space I proposed earlier.
> > > > Actually I'm not sure I'm following your logic. Are you saying
> > > > using that 0x400000xx for anything "standards-based" is utterly
> > > > futile because Microsoft said "the range is hypervisor
> > > > vendor-neutral"? Or you were not sure what they meant there. If
> > > > we are not clear, we can ask them.
> > > >
> > > What I'm saying is that Microsoft is effectively squatting on the
> > > 0x400000xx space with their definition. As written, it's not even
> > > clear that it will remain consistent between *their own*
> > > hypervisors, even less anyone else's.
> >
> > I hope the above clarified your concern. You can google-search a
> > more detailed public spec. Let me know if you want to know a specific URL.
> >
>
> No, it hasn't "clarified my concern" in any way. It's exactly
> *underscoring* it. In other words, I consider 0x400000xx unusable for
> anything that is standards-based. The interfaces everyone is
> currently using aren't designed to export multiple interfaces; they're
> designed to tell the guest which *one* interface is exported. That is
> fine, we just need to go elsewhere.
>
> -hpa
What's the significance of supporting multiple interfaces to the same guest simultaneously, i.e. _runtime_? We don't want the guests to run on such a literal Frankenstein machine. And practically, such testing/debugging would be good only for Halloween :-).
The interface space can be distinct, but the contents are defined and implemented independently, thus you might find overlaps, inconsistency, etc. among the interfaces. And why is runtime "multiple interfaces" required for a standards-based interface?
Jun Nakajima | Intel Open Source Technology Center
Nakajima, Jun wrote:
>
> What I mean is that a hypervisor (with a single vendor id) can support multiple interfaces, exposing to each guest the single interface that guest expects at runtime.
>
Yes, and for the reasons outlined in a previous post in this thread,
this is an incredibly bad idea. We already hate the guts of the ACPI
people for this reason.
>
> What's the significance of supporting multiple interfaces to the same guest simultaneously, i.e. _runtime_? We don't want the guests to run on such a literal Frankenstein machine. And practically, such testing/debugging would be good only for Halloween :-).
>
By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU,
since at the very least they export Intel-derived and AMD-derived
interfaces. This is, in other words, a ridiculous claim.
> The interface space can be distinct, but the contents are defined and implemented independently, thus you might find overlaps, inconsistency, etc. among the interfaces. And why is runtime "multiple interfaces" required for a standards-based interface?
That is the whole point -- without a central coordinating authority,
you're going to have to accommodate many definition sources. Otherwise,
you're just back to where we started -- each hypervisor exports an
interface and that's just that.
If there are multiple interface specifications, they should be exported
simultaneously in non-conflicting numberspaces, and the *GUEST* gets to
choose what to believe. We already do this for *all kinds* of
information, including CPUID. It's the right thing to do.
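As a rough guest-side sketch of that idea, assuming candidate interface blocks are laid out at 0x100 intervals (the stride and the helper name are illustrative, not part of any existing spec):

#include <stdint.h>
#include <string.h>

static inline void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                         uint32_t *c, uint32_t *d)
{
        asm volatile("cpuid"
                     : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                     : "0" (leaf));
}

/* Walk candidate signature bases and return the base leaf at which a
 * given 12-byte interface signature is exported, or 0 if it is absent.
 * Several interface blocks can coexist, each in its own numberspace;
 * the guest simply believes whichever ones it understands. */
static uint32_t find_interface(const char sig[12])
{
        uint32_t base, eax, found[3];

        for (base = 0x40000000; base < 0x40010000; base += 0x100) {
                cpuid(base, &eax, &found[0], &found[1], &found[2]);
                if (memcmp(found, sig, 12) == 0)
                        return base;
        }
        return 0;
}

With something like this, find_interface("KVMKVMKVM\0\0\0") and find_interface("XenVMMXenVMM") could both succeed on a hypervisor exporting both blocks, and the guest decides which one to believe.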
-hpa
Nakajima, Jun wrote:
> What's the significance of supporting multiple interfaces to the same guest simultaneously, i.e. _runtime_? We don't want the guests to run on such a literal Frankenstein machine. And practically, such testing/debugging would be good only for Halloween :-).
>
>
If you can only expose one interface, you need to have the user choose.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 10/3/2008 5:35:39 PM, H. Peter Anvin wrote:
> Nakajima, Jun wrote:
> >
> > What's the significance of supporting multiple interfaces to the
> > same guest simultaneously, i.e. _runtime_? We don't want the guests
> > to run on such a literal Frankenstein machine. And practically,
> > such testing/debugging would be good only for Halloween :-).
> >
>
> By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU,
> since at very least they export Intel-derived and AMD-derived interfaces.
> This is in other words, a ridiculous claim.
The big difference here is that you could create a VM at runtime (by combining the existing interfaces) that did not exist before (or was not tested before). For example, a hypervisor could show hyper-v, osx-v (if any), linux-v, etc., and a guest could create a VM with hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such combinations/variations can grow exponentially.
Or are you suggesting that multiple interfaces be _available_ to guests at runtime but the guest chooses one of them?
> -hpa
>
Jun Nakajima | Intel Open Source Technology Center
Nakajima, Jun wrote:
> On 10/3/2008 5:35:39 PM, H. Peter Anvin wrote:
>> Nakajima, Jun wrote:
>>> What's the significance of supporting multiple interfaces to the
>>> same guest simultaneously, i.e. _runtime_? We don't want the guests
>>> to run on such a literal Frankenstein machine. And practically,
>>> such testing/debugging would be good only for Halloween :-).
>>>
>> By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU,
>> since at very least they export Intel-derived and AMD-derived interfaces.
>> This is in other words, a ridiculous claim.
>
> The big difference here is that you could create a VM at runtime (by combining the existing interfaces) that did not exist before (or was not tested before). For example, a hypervisor could show hyper-v, osx-v (if any), linux-v, etc., and a guest could create a VM with hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such combinations/variations can grow exponentially.
>
> Or are you suggesting that multiple interfaces be _available_ to guests at runtime but the guest chooses one of them?
>
The guest chooses what it wants to use. We already do this: for
example, we use CPUID leaf 0x80000006 in preference to CPUID leaf 2,
simply because it is a better interface.
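Concretely, that preference looks something like the following sketch (assuming the usual cache-size encoding of leaf 0x80000006; parsing of the legacy leaf 2 descriptors is omitted):

#include <stdint.h>

static inline void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                         uint32_t *c, uint32_t *d)
{
        asm volatile("cpuid"
                     : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                     : "0" (leaf));
}

/* Prefer the richer interface when the CPU exports it; fall back to the
 * legacy leaf otherwise. */
static unsigned int l2_cache_size_kb(void)
{
        uint32_t eax, ebx, ecx, edx;

        cpuid(0x80000000, &eax, &ebx, &ecx, &edx);      /* max extended leaf */
        if (eax >= 0x80000006) {
                cpuid(0x80000006, &eax, &ebx, &ecx, &edx);
                return ecx >> 16;               /* ECX[31:16]: L2 size in KB */
        }
        /* Older CPUs: decode the descriptor bytes of leaf 2 instead
         * (parsing omitted here). */
        return 0;
}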
And you're absolutely right that the guest may end up picking and
choosing different parts of the interfaces. That's how it is supposed
to work.
-hpa
Nakajima, Jun wrote:
> On 10/3/2008 5:35:39 PM, H. Peter Anvin wrote:
>
>> Nakajima, Jun wrote:
>>
>>> What's the significance of supporting multiple interfaces to the
>>> same guest simultaneously, i.e. _runtime_? We don't want the guests
>>> to run on such a literal Frankenstein machine. And practically,
>>> such testing/debugging would be good only for Halloween :-).
>>>
>>>
>> By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU,
>> since at very least they export Intel-derived and AMD-derived interfaces.
>> This is in other words, a ridiculous claim.
>>
>
> The big difference here is that you could create a VM at runtime (by combining the existing interfaces) that did not exist before (or was not tested before). For example, a hypervisor could show hyper-v, osx-v (if any), linux-v, etc., and a guest could create a VM with hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such combinations/variations can grow exponentially.
>
That would be crazy.
> Or are you suggesting that multiple interfaces be _available_ to guests at runtime but the guest chooses one of them?
>
Right, that's what I've been suggesting. I think hypervisors should
be able to offer multiple ABIs to guests, but a guest has to commit to
using one exclusively (i.e., once they start to use one, the others
turn themselves off, kill the domain, etc.).
J
H. Peter Anvin wrote:
> And you're absolutely right that the guest may end up picking and
> choosing different parts of the interfaces. That's how it is supposed
> to work.
No, that would be a horrible, horrible mistake. There's no sane way to
implement that; it would mean that the hypervisor would have to have
some kind of state model that incorporates all the ABIs in a consistent
way. Any guest using multiple ABIs would effectively end up being
dependent on a particular hypervisor via a frankensteinian interface
that no other hypervisor would implement in the same way, even if they
claim to implement the same set of interfaces.
If the hypervisor just needs to deal with one at a time then it can have
relatively simple ABI<->internal state translation.
However, if you have the notion of hypervisor-agnostic or common
interfaces, then you can include those as part of the rest of the ABI
and make it sane (so Xen+common, hyperv+common, etc).
J
Jeremy Fitzhardinge wrote:
>>
>> The big difference here is that you could create a VM at runtime (by
>> combining the existing interfaces) that did not exist before (or was
>> not tested before). For example, a hypervisor could show hyper-v,
>> osx-v (if any), linux-v, etc., and a guest could create a VM with
>> hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such
>> combinations/variations can grow exponentially.
>
> That would be crazy.
>
Not necessarily, although the example above is extreme. Redundant
interfaces are the norm in an evolving platform.
>> Or are you suggesting that multiple interfaces be _available_ to
>> guests at runtime but the guest chooses one of them?
>
> Right, that's what I've been suggesting. I think hypervisors should
> be able to offer multiple ABIs to guests, but a guest has to commit to
> using one exclusively (ie, once they start to use one then the others
> turn themselves off, kill the domain, etc).
Not inherently. Of course, there may be interfaces which are inherently
or by policy mutually exclusive, but a hypervisor should only export the
interfaces it wants a guest to be able to use.
This is particularly so with CPUID, which is a *data export* interface;
it doesn't perform any action.
-hpa
H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>>>
>>> The big difference here is that you could create a VM at runtime (by
>>> combining the existing interfaces) that did not exist before (or was
>>> not tested before). For example, a hypervisor could show hyper-v,
>>> osx-v (if any), linux-v, etc., and a guest could create a VM with
>>> hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such
>>> combinations/variations can grow exponentially.
>>
>> That would be crazy.
>>
>
> Not necessarily, although the example above is extreme. Redundant
> interfaces are the norm in an evolving platform.
Sure. A common feature across all hypervisor-specific ABIs may get
subsumed into a generic interface which is equivalent to all the
others. That's fine. But nobody should expect to be able to mix
Hyper-V's lazy TLB interface with KVM's PV MMU updates and expect to get
a working result.
>>> Or are you suggesting that multiple interfaces be _available_ to
>>> guests at runtime but the guest chooses one of them?
>>
>> Right, that's what I've been suggesting. I think hypervisors
>> should be able to offer multiple ABIs to guests, but a guest has to
>> commit to using one exclusively (ie, once they start to use one then
>> the others turn themselves off, kill the domain, etc).
>
> Not inherently. Of course, there may be interfaces which are
> inherently or by policy mutually exclusive, but a hypervisor should
> only export the interfaces it wants a guest to be able to use.
It should export any interface that it implements fully, but those
interfaces may have contradictory or inconsistent semantics which
prevent them from being used concurrently.
> This is particularly so with CPUID, which is a *data export*
> interface, it doesn't perform any action.
Well, sure. There are two distinct issues:
1. Using cpuid to get information about the kernel's environment. If
the environment is sane, then cpuid is a read-only, side-effect
free way of getting information, and any information gathered is
fair game.
2. One of the pieces of information you can get with cpuid is a
discovery of what paravirtual hypercall interfaces the environment
supports, which the guest can compare against its list of
interfaces that it supports. If there's some amount of
intersection, it can decide to use one of those interfaces.
I'm saying that *in general* a guest should expect to be able to use one
and only one of those interfaces. There will be explicitly defined
exceptions to that - such as using generic ABIs in addition to
hypervisor specific ABIs - but a guest can't expect to be able to mix
and match.
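A rough sketch of that selection step follows; the signature strings, the preference order, and the init hooks are hypothetical, and only the intersect-and-commit logic is the point:

#include <stdint.h>
#include <string.h>

static inline void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                         uint32_t *c, uint32_t *d)
{
        asm volatile("cpuid"
                     : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                     : "0" (leaf));
}

/* Hypothetical per-ABI setup hooks; a real guest would wire exactly one
 * of these into its pv-ops (or equivalent). */
static int init_generic_pv(void) { return 0; }
static int init_kvm_pv(void)     { return 0; }
static int init_xen_pv(void)     { return 0; }

struct pv_interface {
        char signature[13];     /* EBX:ECX:EDX of the base leaf, NUL-terminated */
        int (*init)(void);
};

/* The guest's own preference order; the first entry is hypothetical. */
static const struct pv_interface guest_supported[] = {
        { "Lv_1Generic",  init_generic_pv },
        { "KVMKVMKVM",    init_kvm_pv },
        { "XenVMMXenVMM", init_xen_pv },
};

/* Read the signature the hypervisor advertises, intersect it with what
 * the guest implements, and commit to the first match.  With several
 * interface blocks exported at once, the same matching would simply be
 * run over each advertised signature in turn. */
static int choose_pv_interface(void)
{
        uint32_t eax, sig[3];
        char found[13];
        unsigned int i;

        cpuid(0x40000000, &eax, &sig[0], &sig[1], &sig[2]);
        memcpy(found, sig, 12);
        found[12] = '\0';

        for (i = 0; i < sizeof(guest_supported) / sizeof(guest_supported[0]); i++)
                if (strcmp(found, guest_supported[i].signature) == 0)
                        return guest_supported[i].init();

        return -1;      /* no common interface: run as a plain guest */
}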
A tricky issue with selecting an ABI is if two hypervisors end up using
exactly the same mechanism for implementing hypercalls (or whatever), so
that there needs to be some explicit way for the guest to nominate which
interface it's actually using...
J
Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> And you're absolutely right that the guest may end up picking and
>> choosing different parts of the interfaces. That's how it is supposed
>> to work.
>
> No, that would be a horrible, horrible mistake. There's no sane way to
> implement that; it would mean that the hypervisor would have to have
> some kind of state model that incorporates all the ABIs in a consistent
> way. Any guest using multiple ABIs would effectively end up being
> dependent on a particular hypervisor via a frankensteinian interface
> that no other hypervisor would implement in the same way, even if they
> claim to implement the same set of interfaces.
>
> If the hypervisor just needs to deal with one at a time then it can have
> relatively simple ABI<->internal state translation.
>
> However, if you have the notion of hypervisor-agnostic or common
> interfaces, then you can include those as part of the rest of the ABI
> and make it sane (so Xen+common, hyperv+common, etc).
>
It depends on what classes of interfaces you're talking about. I think
you and Jun have a rather narrow definition of "ABI" in this context.
This is functionally equivalent to hardware interfaces (after all, that
is what the hypervisor ABI *is* as far as the kernel is concerned) -- no
one expects, say, a SATA controller that can run in legacy IDE mode to
also take AHCI commands at the same time, but the kernel *does* expect
that, on a chipset which exports LAPIC, HPET, PMTMR and TSC clock
sources, it can use all four at the same time. In the latter case the
interfaces are inherently independent and refer to different chunks of
hardware which just happen to be related in that they all deal with
timing. In the former case, we're dealing with *one* piece of hardware
which can operate in one of two modes.
For hypervisors, you will end up with cases where you have both types --
for example, KVM will happily use VMware's video interface, but that
doesn't mean KVM wants to use VMware's interfaces for storage. This is
exactly how it should be: the extent to which this kind of mix and match
is possible is a matter of the definition of the individual interfaces
themselves, not of the overall architecture.
-hpa