2009-03-01 23:27:43

by Jeremy Fitzhardinge

Subject: Re: [PATCH] xen: core dom0 support

Nick Piggin wrote:
> On Saturday 28 February 2009 17:52:24 Jeremy Fitzhardinge wrote:
>
>> Andrew Morton wrote:
>>
>
>
>>> I hate to be the one to say it, but we should sit down and work out
>>> whether it is justifiable to merge any of this into Linux. I think
>>> it's still the case that the Xen technology is the "old" way and that
>>> the world is moving off in the "new" direction, KVM?
>>>
>> I don't think that's a particularly useful way to look at it. They're
>> different approaches to the problem, and have different tradeoffs.
>>
>> The more important question is: are there real users for this stuff?
>> Does not merging it cause more net disadvantage than merging it?
>> Despite all the noise made about kvm in kernel circles, Xen has a large
>> and growing installed base. At the moment its all running on massive
>> out-of-tree patches, which doesn't make anyone happy. It's best that it
>> be in the mainline kernel. You know, like we argue for everything else.
>>
>
> OTOH, there are good reasons not to duplicate functionality, and many
> many times throughout the kernel history competing solutions have been
> rejected even though the same arguments could be made about them.
>
> There have also been many times duplicate functionality has been merged,
> although that does often start with the intention of eliminating
> duplicate implementations and ends with pain. So I think Andrew's
> question is pretty important.
>

Those would be pertinent questions if I were suddenly popping up and
saying "hey, let's add Xen support to the kernel!" But Xen support has
been in the kernel for well over a year now, and is widely used, enabled
in distros, etc. The patches I'm proposing here are not a whole new
thing, they're part of the last 10% to fill out the kernel's support to
make it actually useful.

> The user issue aside -- that is a valid point -- you don't really touch
> on the technical issues. What tradeoffs, and where Xen does better
> than KVM would be interesting to know, can Xen tools and users ever be
> migrated to KVM or vice versa (I know very little about this myself, so
> I'm just an interested observer).
>

OK, fair point, it's probably time for another Xen architecture refresher
post.

There are two big architectural differences between Xen and KVM:

Firstly, Xen has a separate hypervisor whose primary role is to context
switch between the guest domains (virtual machines). The hypervisor is
relatively small and single purpose. It doesn't, for example, contain
any device drivers or even much knowledge of things like pci buses and
their structure. The domains themselves are more or less peers; some
are more privileged than others, but from Xen's perspective they are
more or less equivalent. The first domain, dom0, is special because it's
started by Xen itself, and has some inherent initial privileges; its
main job is to start other domains, and it also typically provides
virtualized/multiplexed device services to other domains via a
frontend/backend split driver structure.

KVM, on the other hand, builds all the hypervisor stuff into the kernel
itself, so you end up with a kernel which does all the normal kernel
stuff, and can run virtual machines by making them look like slightly
strange processes.

Because Xen is dedicated to just running virtual machines, its internal
architecture can be more heavily oriented towards that task, which
affects everything from how its scheduler works to its use and multiplexing of
physical memory. For example, Xen manages to use new hardware
virtualization features pretty quickly, partly because it doesn't need
to trade-off against normal kernel functions. The clear distinction
between the privileged hypervisor and the rest of the domains makes the
security people happy as well. Also, because Xen is small and fairly
self-contained, there's quite a few hardware vendors shipping it burned
into the firmware so that it really is the first thing to boot (many of the
instant-on features that laptops have are based on Xen). Both HP and
Dell, at least, are selling servers with Xen pre-installed in the firmware.


The second big difference is the use of paravirtualization. Xen can
securely virtualize a machine without needing any particular hardware
support. Xen works well on any post-P6 or any ia64 machine, without
needing any virtualization hardware support. When Xen runs a kernel in
paravirtualized mode, it runs the kernel in an unprivileged processor
state. This allows the hypervisor to vet all the guest kernel's
privileged operations, which are carried out either via hypercalls
or by memory shared between each guest and Xen.

By contrast, KVM relies on at least VT/SVM (and whatever the ia64 equiv
is called) being available in the CPUs, and needs the most modern of
hardware to get the best performance.

One important area of paravirtualization is that Xen guests directly
use the processor's pagetables; there is no shadow pagetable or use of
hardware pagetable nesting. This means that a tlb miss is just a tlb
miss, and happens at full processor performance. This is possible
because 1) pagetables are always read-only to the guest, and 2) the
guest is responsible for looking up in a table to map guest-local pfns
into machine-wide mfns before installing them in a pte. Xen will check
that any new mapping or pagetable satisfies all the rules, by checking
that the writable reference count is 0, and that the domain owns (or has
been allowed access to) any mfn it tries to install in a pagetable.
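
To make the pfn/mfn split concrete, here is a minimal, self-contained C
sketch of the bookkeeping described above: the guest translates its local
pfn through a p2m table before building a pte, and the write itself is
handed to the hypervisor for validation. The names (p2m, pfn_to_mfn,
make_pte, hypervisor_mmu_update) and the validation stub are illustrative
stand-ins, not the actual Linux/Xen symbols.

/* Conceptual sketch of PV pagetable updates: guest-local pfns must be
 * translated to machine-wide mfns before being written into a pte, and
 * the write itself goes through the hypervisor for validation.
 * All names here are illustrative, not the real Linux/Xen symbols. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT   12
#define PTE_PRESENT  0x001ULL
#define PTE_RW       0x002ULL

/* Guest-maintained pfn -> mfn table ("p2m"); filled in by the toy setup. */
#define NR_GUEST_PAGES 16
static uint64_t p2m[NR_GUEST_PAGES];

static uint64_t pfn_to_mfn(uint64_t pfn)
{
    return p2m[pfn];
}

/* Build a pte from a guest pfn: translate to an mfn first. */
static uint64_t make_pte(uint64_t pfn, uint64_t flags)
{
    return (pfn_to_mfn(pfn) << PAGE_SHIFT) | flags;
}

/* Stand-in for the hypercall that asks Xen to install the pte after it
 * checks the rules (the mfn is owned by this domain, no writable mappings
 * remain of a page that is itself a pagetable, etc.). */
static int hypervisor_mmu_update(uint64_t pte_machine_addr, uint64_t new_pte)
{
    printf("mmu_update: write %#llx at machine addr %#llx\n",
           (unsigned long long)new_pte,
           (unsigned long long)pte_machine_addr);
    return 0; /* pretend Xen validated and applied it */
}

int main(void)
{
    /* Pretend the domain owns machine frames 0x1000..0x100f. */
    for (uint64_t pfn = 0; pfn < NR_GUEST_PAGES; pfn++)
        p2m[pfn] = 0x1000 + pfn;

    uint64_t pte = make_pte(5, PTE_PRESENT | PTE_RW);
    return hypervisor_mmu_update(/* pte slot's machine address */ 0x2000, pte);
}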

The other interesting part of paravirtualization is the abstraction of
interrupts into event channels. Each domain has a bit-array of 1024
bits which correspond to 1024 possible event channels. An event channel
can have one of several sources, such as a timer virtual interrupt, an
inter-domain event, an inter-vcpu IPI, or mapped from a hardware
interrupt. We end up mapping the event channels back to irqs and they
are delivered as normal interrupts as far as the rest of the kernel is
concerned.
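
As a rough illustration of that demultiplexing, here is a toy C version of
scanning the 1024-bit pending array and handing each set bit to whatever
irq it was bound to. The data layout and names are simplified stand-ins,
not the kernel's actual event-channel code.

/* Conceptual sketch of event-channel delivery: a shared 1024-bit pending
 * bitmap is scanned and each set bit is dispatched to the irq it was
 * bound to.  This illustrates the idea only; it is not the real evtchn
 * code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_EVENT_CHANNELS 1024
#define BITS_PER_WORD     64
#define NR_WORDS          (NR_EVENT_CHANNELS / BITS_PER_WORD)

static uint64_t pending[NR_WORDS];           /* shared with the hypervisor */
static int evtchn_to_irq[NR_EVENT_CHANNELS]; /* set up when a channel is bound */

static void handle_irq(int irq)
{
    printf("delivering irq %d\n", irq);
}

static void evtchn_do_upcall(void)
{
    for (int w = 0; w < NR_WORDS; w++) {
        while (pending[w]) {
            int bit  = __builtin_ctzll(pending[w]);
            int port = w * BITS_PER_WORD + bit;

            pending[w] &= ~(1ULL << bit);   /* ack the event */
            handle_irq(evtchn_to_irq[port]);
        }
    }
}

int main(void)
{
    memset(evtchn_to_irq, -1, sizeof(evtchn_to_irq));
    evtchn_to_irq[3] = 16;   /* pretend port 3 was bound to irq 16 */
    pending[0] |= 1ULL << 3; /* pretend the hypervisor marked it pending */
    evtchn_do_upcall();
    return 0;
}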

The net result is that a paravirtualized Xen guest runs very close to
full speed. Workloads which modify live pagetables a lot take a bit of
a performance hit (since the pte updates have to trap to the hypervisor
for validation), but in general this is not a huge deal. Hardware
support for nested pagetables is only just beginning to get close to
getting performance parity, but with different tradeoffs (pagetable
updates are cheap, but tlb misses are much more expensive, and hits
consume more tlb entries).

Xen can also make full use of whatever hardware virtualization features
are available when running an "hvm" domain. This is typically how you'd
run Windows or other unmodified operating systems.

All of this is stuff that's necessary to support any PV Xen domain, and
has been in the kernel for a long time now.


The additions I'm proposing now are those needed for a Xen domain to
control the physical hardware, in order to provide virtual device
support for other less-privileged domains. These changes affect a few
areas:

* interrupts: mapping a device interrupt into an event channel for
delivery to the domain with the device driver for that interrupt
* mappings: allowing direct hardware mapping of device memory into a
domain
* dma: making sure that hardware gets programmed with machine memory
addresses, not virtual ones, and that pages are machine-contiguous
when expected

Interrupts require a few hooks into the x86 APIC code, but the end
result is that hardware interrupts are delivered via event channels, but
then they're mapped back to irqs and delivered normally (they even end
up with the same irq number as they'd usually have).

Device mappings are fairly easy to arrange. I'm using a software pte
bit, _PAGE_IOMAP, to indicate that a mapping is a device mapping. This
bit is set by things like ioremap() and remap_pfn_range, and the Xen mmu
code just uses the pfn in the pte as-is, rather than doing the normal
pfn->mfn translation.
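
Conceptually, the decision made on each pte write looks something like the
following toy C sketch; the bit position and helper names are made up for
illustration, only the _PAGE_IOMAP idea itself is the real one.

/* Conceptual sketch of the _PAGE_IOMAP idea: a spare software bit in the
 * pte says "this pfn is already a machine frame number, don't translate".
 * The bit position and helpers here are illustrative, not the real ones. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT      12
#define PTE_PFN_MASK    (~0xfffULL)
#define _PAGE_IOMAP     (1ULL << 10)   /* one of the software-available bits */

/* Toy guest-pfn -> machine-frame translation. */
static uint64_t pfn_to_mfn(uint64_t pfn)
{
    return pfn + 0x100000; /* pretend guest RAM starts at this machine frame */
}

/* What the Xen mmu code conceptually does when a pte is written. */
static uint64_t xen_pte_to_machine(uint64_t pte)
{
    uint64_t pfn   = (pte & PTE_PFN_MASK) >> PAGE_SHIFT;
    uint64_t flags = pte & ~PTE_PFN_MASK;
    uint64_t frame = (pte & _PAGE_IOMAP) ? pfn             /* device mapping: as-is */
                                         : pfn_to_mfn(pfn); /* RAM: translate */
    return (frame << PAGE_SHIFT) | flags;
}

int main(void)
{
    uint64_t ram_pte = (0x42ULL << PAGE_SHIFT) | 0x3;
    uint64_t io_pte  = (0xfee00ULL << PAGE_SHIFT) | 0x3 | _PAGE_IOMAP;

    printf("ram pte -> %#llx\n", (unsigned long long)xen_pte_to_machine(ram_pte));
    printf("io  pte -> %#llx\n", (unsigned long long)xen_pte_to_machine(io_pte));
    return 0;
}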

DMA is handled via the normal DMA API, with some hooks to swiotlb to
make sure that the memory underlying its pools is really DMA-ready (ie,
is contiguous and low enough in machine memory).
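
The underlying constraint can be shown with a small C sketch: a buffer is
only safe to hand to a device as a single range if its guest pages happen
to map to consecutive machine frames; otherwise it has to be bounced
through a pool known to be machine-contiguous, which is what the swiotlb
hooks arrange. The helpers below are illustrative, not the kernel's DMA
API.

/* Conceptual sketch of the dom0 DMA concern: the device must be given
 * machine addresses, and a multi-page buffer is only directly DMA-able if
 * its guest pages map to consecutive machine frames.  Otherwise it gets
 * bounced through a machine-contiguous pool (the swiotlb idea).
 * Names are illustrative, not the kernel's DMA API. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/* Toy p2m mapping, deliberately non-contiguous to show the problem. */
static uint64_t pfn_to_mfn(uint64_t pfn)
{
    return (pfn % 2) ? pfn + 0x1000 : pfn + 0x8000;
}

/* Can [pfn, pfn + nr_pages) be handed to a device as one machine range? */
static int machine_contiguous(uint64_t pfn, unsigned nr_pages)
{
    for (unsigned i = 1; i < nr_pages; i++)
        if (pfn_to_mfn(pfn + i) != pfn_to_mfn(pfn) + i)
            return 0;
    return 1;
}

int main(void)
{
    uint64_t pfn = 0x40;
    unsigned nr  = 4;

    if (machine_contiguous(pfn, nr))
        printf("program device with machine addr %#llx\n",
               (unsigned long long)(pfn_to_mfn(pfn) << PAGE_SHIFT));
    else
        printf("not machine-contiguous: bounce through a swiotlb-style pool\n");
    return 0;
}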

The changes I'm proposing may look a bit strange from a purely x86
perspective, but they fit in relatively well because they're not all
that different from what other architectures require, and so the
kernel-wide infrastructure is mostly already in place.


I hope that helps clarify what I'm trying to do here, and why Xen and
KVM do have distinct roles to play.

J


2009-03-02 06:38:18

by Nick Piggin

Subject: Re: [PATCH] xen: core dom0 support

On Monday 02 March 2009 10:27:29 Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > On Saturday 28 February 2009 17:52:24 Jeremy Fitzhardinge wrote:
> >> Andrew Morton wrote:
> >>> I hate to be the one to say it, but we should sit down and work out
> >>> whether it is justifiable to merge any of this into Linux. I think
> >>> it's still the case that the Xen technology is the "old" way and that
> >>> the world is moving off in the "new" direction, KVM?
> >>
> >> I don't think that's a particularly useful way to look at it. They're
> >> different approaches to the problem, and have different tradeoffs.
> >>
> >> The more important question is: are there real users for this stuff?
> >> Does not merging it cause more net disadvantage than merging it?
> >> Despite all the noise made about kvm in kernel circles, Xen has a large
> >> and growing installed base. At the moment its all running on massive
> >> out-of-tree patches, which doesn't make anyone happy. It's best that it
> >> be in the mainline kernel. You know, like we argue for everything else.
> >
> > OTOH, there are good reasons not to duplicate functionality, and many
> > many times throughout the kernel history competing solutions have been
> > rejected even though the same arguments could be made about them.
> >
> > There have also been many times duplicate functionality has been merged,
> > although that does often start with the intention of eliminating
> > duplicate implementations and ends with pain. So I think Andrew's
> > question is pretty important.
>
> Those would be pertinent questions if I were suddenly popping up and
> saying "hey, let's add Xen support to the kernel!" But Xen support has
> been in the kernel for well over a year now, and is widely used, enabled
> in distros, etc. The patches I'm proposing here are not a whole new
> thing, they're part of the last 10% to fill out the kernel's support to
> make it actually useful.

As a guest, I guess it has been agreed that guest support for all
different hypervisors is "a good thing". dom0 is more like a piece
of the hypervisor itself, right?


> > The user issue aside -- that is a valid point -- you don't really touch
> > on the technical issues. What tradeoffs, and where Xen does better
> > than KVM would be interesting to know, can Xen tools and users ever be
> > migrated to KVM or vice versa (I know very little about this myself, so
> > I'm just an interested observer).
>
> OK, fair point, it's probably time for another Xen architecture refresher
> post.

Thanks.


> There are two big architectural differences between Xen and KVM:
>
> Firstly, Xen has a separate hypervisor whose primary role is to context
> switch between the guest domains (virtual machines). The hypervisor is
> relatively small and single purpose. It doesn't, for example, contain
> any device drivers or even much knowledge of things like pci buses and
> their structure. The domains themselves are more or less peers; some
> are more privileged than others, but from Xen's perspective they are
> more or less equivalent. The first domain, dom0, is special because it's
> started by Xen itself, and has some inherent initial privileges; its
> main job is to start other domains, and it also typically provides
> virtualized/multiplexed device services to other domains via a
> frontend/backend split driver structure.
>
> KVM, on the other hand, builds all the hypervisor stuff into the kernel
> itself, so you end up with a kernel which does all the normal kernel
> stuff, and can run virtual machines by making them look like slightly
> strange processes.
>
> Because Xen is dedicated to just running virtual machines, its internal
> architecture can be more heavily oriented towards that task, which
> affects everything from how its scheduler works to its use and multiplexing of
> physical memory. For example, Xen manages to use new hardware
> virtualization features pretty quickly, partly because it doesn't need
> to trade-off against normal kernel functions. The clear distinction
> between the privileged hypervisor and the rest of the domains makes the
> security people happy as well. Also, because Xen is small and fairly
> self-contained, there's quite a few hardware vendors shipping it burned
> into the firmware so that it really is the first thing to boot (many of the
> instant-on features that laptops have are based on Xen). Both HP and
> Dell, at least, are selling servers with Xen pre-installed in the firmware.

That would kind of seem like Xen has a better design to me, OTOH if it
needs this dom0 for most device drivers and things, then how much
difference is it really? Is KVM really disadvantaged by being a part of
the kernel?


> The second big difference is the use of paravirtualization. Xen can
> securely virtualize a machine without needing any particular hardware
> support. Xen works well on any post-P6 or any ia64 machine, without
> needing any virtualization hardware support. When Xen runs a kernel in
> paravirtualized mode, it runs the kernel in an unprivileged processor
> state. This allows the hypervisor to vet all the guest kernel's
> privileged operations, which are carried out either via hypercalls
> or by memory shared between each guest and Xen.
>
> By contrast, KVM relies on at least VT/SVM (and whatever the ia64 equiv
> is called) being available in the CPUs, and needs the most modern of
> hardware to get the best performance.
>
> One important area of paravirtualization is that Xen guests directly
> use the processor's pagetables; there is no shadow pagetable or use of
> hardware pagetable nesting. This means that a tlb miss is just a tlb
> miss, and happens at full processor performance. This is possible
> because 1) pagetables are always read-only to the guest, and 2) the
> guest is responsible for looking up in a table to map guest-local pfns
> into machine-wide mfns before installing them in a pte. Xen will check
> that any new mapping or pagetable satisfies all the rules, by checking
> that the writable reference count is 0, and that the domain owns (or has
> been allowed access to) any mfn it tries to install in a pagetable.

Xen's memory virtualization is pretty neat, I'll give it that. Is it
faster than KVM on a modern CPU? Would it be possible I wonder to make
a MMU virtualization layer for CPUs without support, using Xen's page
table protection methods, and have KVM use that? Or does that amount
to putting a significant amount of Xen hypervisor into the kernel..?


> The other interesting part of paravirtualization is the abstraction of
> interrupts into event channels. Each domain has a bit-array of 1024
> bits which correspond to 1024 possible event channels. An event channel
> can have one of several sources, such as a timer virtual interrupt, an
> inter-domain event, an inter-vcpu IPI, or mapped from a hardware
> interrupt. We end up mapping the event channels back to irqs and they
> are delivered as normal interrupts as far as the rest of the kernel is
> concerned.
>
> The net result is that a paravirtualized Xen guest runs very close to
> full speed. Workloads which modify live pagetables a lot take a bit of
> a performance hit (since the pte updates have to trap to the hypervisor
> for validation), but in general this is not a huge deal. Hardware
> support for nested pagetables is only just beginning to get close to
> getting performance parity, but with different tradeoffs (pagetable
> updates are cheap, but tlb misses are much more expensive, and hits
> consume more tlb entries).
>
> Xen can also make full use of whatever hardware virtualization features
> are available when running an "hvm" domain. This is typically how you'd
> run Windows or other unmodified operating systems.
>
> All of this is stuff that's necessary to support any PV Xen domain, and
> has been in the kernel for a long time now.
>
>
> The additions I'm proposing now are those needed for a Xen domain to
> control the physical hardware, in order to provide virtual device
> support for other less-privileged domains. These changes affect a few
> areas:
>
> * interrupts: mapping a device interrupt into an event channel for
> delivery to the domain with the device driver for that interrupt
> * mappings: allowing direct hardware mapping of device memory into a
> domain
> * dma: making sure that hardware gets programmed with machine memory
> addresses, not virtual ones, and that pages are machine-contiguous
> when expected
>
> Interrupts require a few hooks into the x86 APIC code, but the end
> result is that hardware interrupts are delivered via event channels, but
> then they're mapped back to irqs and delivered normally (they even end
> up with the same irq number as they'd usually have).
>
> Device mappings are fairly easy to arrange. I'm using a software pte
> bit, _PAGE_IOMAP, to indicate that a mapping is a device mapping. This
> bit is set by things like ioremap() and remap_pfn_range, and the Xen mmu
> code just uses the pfn in the pte as-is, rather than doing the normal
> pfn->mfn translation.
>
> DMA is handled via the normal DMA API, with some hooks to swiotlb to
> make sure that the memory underlying its pools is really DMA-ready (ie,
> is contiguous and low enough in machine memory).
>
> The changes I'm proposing may look a bit strange from a purely x86
> perspective, but they fit in relatively well because they're not all
> that different from what other architectures require, and so the
> kernel-wide infrastructure is mostly already in place.
>
>
> I hope that helps clarify what I'm trying to do here, and why Xen and
> KVM do have distinct roles to play.

Thanks, it's very informative to me and hopefully helps others with
the discussion (I don't pretend to be able to judge whether your dom0
patches should be merged or not! :)). I'll continue to read with
interest.

Thanks,
Nick

2009-03-02 08:05:27

by Jeremy Fitzhardinge

Subject: Re: [PATCH] xen: core dom0 support

Nick Piggin wrote:
>> Those would be pertinent questions if I were suddenly popping up and
>> saying "hey, let's add Xen support to the kernel!" But Xen support has
>> been in the kernel for well over a year now, and is widely used, enabled
>> in distros, etc. The patches I'm proposing here are not a whole new
>> thing, they're part of the last 10% to fill out the kernel's support to
>> make it actually useful.
>>
>
> As a guest, I guess it has been agreed that guest support for all
> different hypervisors is "a good thing". dom0 is more like a piece
> of the hypervisor itself, right?
>

Hm, I wouldn't put it like that. dom0 is no more part of the hypervisor
than the hypervisor is part of dom0. The hypervisor provides one set of
services (domain isolation and multiplexing). Domains with direct
hardware access and drivers provide arbitration for virtualized device
access. They provide orthogonal sets of functionality which are both
required to get a working system.

Also, the machinery needed to allow a kernel to operate as dom0 is more
than that: it allows direct access to hardware in general. An otherwise
unprivileged domU can be given access to a specific PCI device via
PCI-passthrough so that it can drive it directly. This is often used
for direct access to 3D hardware, or high-performance networking (esp
with multi-context hardware that's designed for virtualization use).

>> Because Xen is dedicated to just running virtual machines, its internal
>> architecture can be more heavily oriented towards that task, which
>> affects everything from how its scheduler works to its use and multiplexing of
>> physical memory. For example, Xen manages to use new hardware
>> virtualization features pretty quickly, partly because it doesn't need
>> to trade-off against normal kernel functions. The clear distinction
>> between the privileged hypervisor and the rest of the domains makes the
>> security people happy as well. Also, because Xen is small and fairly
>> self-contained, there's quite a few hardware vendors shipping it burned
>> into the firmware so that it really is the first thing to boot (many of the
>> instant-on features that laptops have are based on Xen). Both HP and
>> Dell, at least, are selling servers with Xen pre-installed in the firmware.
>>
>
> That would kind of seem like Xen has a better design to me, OTOH if it
> needs this dom0 for most device drivers and things, then how much
> difference is it really? Is KVM really disadvantaged by being a part of
> the kernel?
>

Well, you can lump everything together in dom0 if you want, and that is
a common way to run a Xen system. But there's no reason you can't
disaggregate drivers into their own domains, each with the
responsibility for a particular device or set of devices (or indeed, any
other service you want provided). Xen can use hardware features like
VT-d to really enforce the partitioning so that the domains can't
program their hardware to touch anything except what they're allowed to
touch, so nothing is trusted beyond its actual area of responsibility.
It also means that killing off and restarting a driver domain is a
fairly lightweight and straightforward operation because the state is
isolated and self-contained; guests using a device have to be able to
deal with a disconnect/reconnect anyway (for migration), so it doesn't
affect them much. Part of the reason there's a lot of academic interest
in Xen is because it has the architectural flexibility to try out lots
of different configurations.

I wouldn't say that KVM is necessarily disadvantaged by its design; it's
just a particular set of tradeoffs made up-front. It loses Xen's
flexibility, but the result is very familiar to Linux people. A guest
domain just looks like a qemu process that happens to run in a strange
processor mode a lot of the time. The qemu process provides virtual
device access to its domain, and accesses the normal device drivers like
any other usermode process would. The domains are isolated from each
other as much as processes normally are, but they're all floating around
in the same kernel; whether that provides enough isolation for whatever
technical, billing, security, compliance/regulatory or other
requirements you have is up to the user to judge.

>> One important area of paravirtualization is that Xen guests directly
>> use the processor's pagetables; there is no shadow pagetable or use of
>> hardware pagetable nesting. This means that a tlb miss is just a tlb
>> miss, and happens at full processor performance. This is possible
>> because 1) pagetables are always read-only to the guest, and 2) the
>> guest is responsible for looking up in a table to map guest-local pfns
>> into machine-wide mfns before installing them in a pte. Xen will check
>> that any new mapping or pagetable satisfies all the rules, by checking
>> that the writable reference count is 0, and that the domain owns (or has
>> been allowed access to) any mfn it tries to install in a pagetable.
>>
>
> Xen's memory virtualization is pretty neat, I'll give it that. Is it
> faster than KVM on a modern CPU?

It really depends on the workload. There's three cases to consider:
software shadow pagetables, hardware nested pagetables, and Xen direct
pagetables. Even now, Xen's (highly optimised) shadow pagetable code
generally out-performs modern nested pagetables, at least when running
Windows (for which that code was most heavily tuned). Shadow pagetables
and nested pagetables will generally outperform direct pagetables when
the workload does lots of pagetable updates compared to accesses. (I
don't know what the current state of kvm's shadow pagetable performance
is, but it seems OK.)

But if you're mostly accessing the pagetable, direct pagetables still
win. On a tlb miss, direct pagetables take 4 memory accesses, whereas a nested
pagetable tlb miss needs 24 memory accesses; and a nested tlb hit means
that you have 24 tlb entries being tied up to service the hit, vs 4.
(Though the chip vendors are fairly secretive about exactly how they
structure their tlbs to deal with nested lookups, so I may be off
here.) (It also depends on whether you arrange to put the guest, host
or both memory into large pages; doing so helps a lot.)
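
For what it's worth, the 4-vs-24 figure falls out of simple arithmetic for
4-level guest and host pagetables, as this little C sketch spells out; the
real cost also depends on the page-walk caching the vendors are cagey
about.

/* Back-of-the-envelope derivation of the 4-vs-24 numbers above, assuming
 * 4-level guest and host pagetables. */
#include <stdio.h>

int main(void)
{
    int levels = 4;

    /* Native, or Xen direct pagetables: one memory access per level. */
    int direct = levels;

    /* Nested: each of the guest's 4 pagetable reads is at a guest-physical
     * address, so it costs a full host walk (4 reads) plus the read itself,
     * and the final guest-physical data address needs one more host walk. */
    int nested = levels * (levels + 1) + levels;

    printf("direct walk: %d accesses, nested walk: %d accesses\n",
           direct, nested);
    return 0;
}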

> Would it be possible I wonder to make
> a MMU virtualization layer for CPUs without support, using Xen's page
> table protection methods, and have KVM use that? Or does that amount
> to putting a significant amount of Xen hypervisor into the kernel..?
>

At one point Avi was considering doing it, but I don't think he ever
made any real effort in that direction. KVM is pretty wedded to having
hardware support anyway, so there's not much point in removing it in
this one area.

The Xen technique gets its performance from collapsing a level of
indirection, but that has a cost in terms of flexibility; the hypervisor
can't do as much mucking around behind the guest's back (for example,
the guest sees real hardware memory addresses in the form of mfns, so
Xen can't move pages around, at least not without some form of explicit
synchronisation).

J

2009-03-02 08:20:23

by Nick Piggin

Subject: Re: [PATCH] xen: core dom0 support

On Monday 02 March 2009 19:05:10 Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > That would kind of seem like Xen has a better design to me, OTOH if it
> > needs this dom0 for most device drivers and things, then how much
> > difference is it really? Is KVM really disadvantaged by being a part of
> > the kernel?
>
> Well, you can lump everything together in dom0 if you want, and that is
> a common way to run a Xen system. But there's no reason you can't
> disaggregate drivers into their own domains, each with the
> responsibility for a particular device or set of devices (or indeed, any
> other service you want provided). Xen can use hardware features like
> VT-d to really enforce the partitioning so that the domains can't
> program their hardware to touch anything except what they're allowed to
> touch, so nothing is trusted beyond its actual area of responsibility.
> It also means that killing off and restarting a driver domain is a
> fairly lightweight and straightforward operation because the state is
> isolated and self-contained; guests using a device have to be able to
> deal with a disconnect/reconnect anyway (for migration), so it doesn't
> affect them much. Part of the reason there's a lot of academic interest
> in Xen is because it has the architectural flexibility to try out lots
> of different configurations.
>
> I wouldn't say that KVM is necessarily disadvantaged by its design; it's
> just a particular set of tradeoffs made up-front. It loses Xen's
> flexibility, but the result is very familiar to Linux people. A guest
> domain just looks like a qemu process that happens to run in a strange
> processor mode a lot of the time. The qemu process provides virtual
> device access to its domain, and accesses the normal device drivers like
> any other usermode process would. The domains are isolated from each
> other as much as processes normally are, but they're all floating around
> in the same kernel; whether that provides enough isolation for whatever
> technical, billing, security, compliance/regulatory or other
> requirements you have is up to the user to judge.

Well what is the advantage of KVM? Just that it is integrated into
the kernel? Can we look at the argument the other way around and
ask why Xen can't replace KVM? (is it possible to make use of HW
memory virtualization in Xen?) The hypervisor is GPL, right?


> > Would it be possible I wonder to make
> > a MMU virtualization layer for CPUs without support, using Xen's page
> > table protection methods, and have KVM use that? Or does that amount
> > to putting a significant amount of Xen hypervisor into the kernel..?
>
> At one point Avi was considering doing it, but I don't think he ever
> made any real effort in that direction. KVM is pretty wedded to having
> hardware support anyway, so there's not much point in removing it in
> this one area.

Not removing it, but making it available as an alternative form of
"hardware supported" MMU virtualization. As you say if direct protected
page tables often are faster than existing HW solutions anyway, then it
could be a win for KVM even on newer CPUs.


> The Xen technique gets its performance from collapsing a level of
> indirection, but that has a cost in terms of flexibility; the hypervisor
> can't do as much mucking around behind the guest's back (for example,
> the guest sees real hardware memory addresses in the form of mfns, so
> Xen can't move pages around, at least not without some form of explicit
> synchronisation).

Any problem can be solved by adding another level of indirection... :)

2009-03-02 09:05:40

by Jeremy Fitzhardinge

Subject: Re: [PATCH] xen: core dom0 support

Nick Piggin wrote:
>> I wouldn't say that KVM is necessarily disadvantaged by its design; it's
>> just a particular set of tradeoffs made up-front. It loses Xen's
>> flexibility, but the result is very familiar to Linux people. A guest
>> domain just looks like a qemu process that happens to run in a strange
>> processor mode a lot of the time. The qemu process provides virtual
>> device access to its domain, and accesses the normal device drivers like
>> any other usermode process would. The domains are isolated from each
>> other as much as processes normally are, but they're all floating around
>> in the same kernel; whether that provides enough isolation for whatever
>> technical, billing, security, compliance/regulatory or other
>> requirements you have is up to the user to judge.
>>
>
> Well what is the advantage of KVM? Just that it is integrated into
> the kernel? Can we look at the argument the other way around and
> ask why Xen can't replace KVM?

Xen was around before KVM was even a twinkle, so KVM is redundant from
that perspective; they're certainly broadly equivalent in
functionality. But Xen has had a fairly fraught history with respect to
being merged into the kernel, and being merged gets your feet into a lot
of doors. The upshot is that using Xen has generally required some
preparation - like installing special kernels - before you can use it,
and so tends to get used for servers which are specifically intended to
be virtualized. KVM runs like an accelerated qemu, so it's easy to just
fire up an instance of windows in the middle of a normal Linux desktop
session, with no special preparation.

But Xen is getting better at being on laptops and desktops, and doing
all the things people expect there (power management, suspend/resume,
etc). And people are definitely interested in using KVM in server
environments, so the lines are not very clear any more.

(Of course, we're completely forgetting VMI in all this, but VMware seem
to have as well. And we're all waiting for Rusty to make his World
Domination move.)

> (is it possible to make use of HW
> memory virtualization in Xen?)

Yes, Xen will use all available hardware features when running hvm
domains (== fully virtualized == Windows).

> The hypervisor is GPL, right?
>

Yep.

>>> Would it be possible I wonder to make
>>> a MMU virtualization layer for CPUs without support, using Xen's page
>>> table protection methods, and have KVM use that? Or does that amount
>>> to putting a significant amount of Xen hypervisor into the kernel..?
>>>
>> At one point Avi was considering doing it, but I don't think he ever
>> made any real effort in that direction. KVM is pretty wedded to having
>> hardware support anyway, so there's not much point in removing it in
>> this one area.
>>
>
> Not removing it, but making it available as an alternative form of
> "hardware supported" MMU virtualization. As you say if direct protected
>> page tables often are faster than existing HW solutions anyway, then it
> could be a win for KVM even on newer CPUs.
>

Well, yes. I'm sure it will make someone a nice little project. It
should be fairly easy to try out - all the hooks are in place, so it's
just a matter of implementing the kvm bits. But it probably wouldn't be
a comfortable fit with the rest of Linux; all the memory mapped via
direct pagetables would be solidly pinned down, completely unswappable,
giving the VM subsystem much less flexibility about allocating
resources. I guess it would be no worse than a multi-hundred
megabyte/gigabyte process mlocking itself down, but I don't know if
anyone actually does that.

J

2009-03-04 17:32:15

by Anthony Liguori

Subject: Re: [PATCH] xen: core dom0 support

Nick Piggin wrote:
> On Monday 02 March 2009 10:27:29 Jeremy Fitzhardinge wrote:

>> One important area of paravirtualization is that Xen guests directly
>> use the processor's pagetables; there is no shadow pagetable or use of
>> hardware pagetable nesting. This means that a tlb miss is just a tlb
>> miss, and happens at full processor performance. This is possible
>> because 1) pagetables are always read-only to the guest, and 2) the
>> guest is responsible for looking up in a table to map guest-local pfns
>> into machine-wide mfns before installing them in a pte. Xen will check
>> that any new mapping or pagetable satisfies all the rules, by checking
>> that the writable reference count is 0, and that the domain owns (or has
>> been allowed access to) any mfn it tries to install in a pagetable.
>
> Xen's memory virtualization is pretty neat, I'll give it that. Is it
> faster than KVM on a modern CPU?

There is nothing architecturally that prevents KVM from making use of
Direct Paging. KVM doesn't use Direct Paging because we don't expect it
to be worth it. Modern CPUs (Barcelona and Nehalem class) include
hardware support for MMU virtualization (via NPT and EPT respectively).

I think that for the most part (especially with large page backed
guests), there's wide agreement that even within the context of Xen,
NPT/EPT often beats PV performance. TLB miss overhead increases due to
additional memory accesses but this is largely mitigated by large pages
(see Ben Serebin's SOSP paper from a couple years ago).

> Would it be possible I wonder to make
> a MMU virtualization layer for CPUs without support, using Xen's page
> table protection methods, and have KVM use that? Or does that amount
> to putting a significant amount of Xen hypervisor into the kernel..?

There are various benchmarks out there (check KVM Forum and Xen Summit
presentations) showing NPT/EPT beating Direct Paging, but FWIW direct
paging could be implemented in KVM.

A really unfortunate aspect of direct paging is that it requires the
guest to know the host physical addresses. This requires the guest to
cooperate when doing any fancy memory tricks (live migration,
save/restore, swapping, page sharing, etc.). This introduces guest code
paths to ensure that things like live migration works which is extremely
undesirable.

FWIW, I'm not advocating not taking the Xen dom0 patches. Just pointing
out that direct paging is orthogonal to the architectural differences
between Xen and KVM.

Regards,

Anthony Liguori

2009-03-04 17:34:32

by Anthony Liguori

Subject: Re: [PATCH] xen: core dom0 support

Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:

> It really depends on the workload. There's three cases to consider:
> software shadow pagetables, hardware nested pagetables, and Xen direct
> pagetables. Even now, Xen's (highly optimised) shadow pagetable code
> generally out-performs modern nested pagetables, at least when running
> Windows (for which that code was most heavily tuned).

Can you point to benchmarks? I have a hard time believing this.

How can shadow paging beat nested paging assuming the presence of large
pages?

Regards,

Anthony Liguori

2009-03-04 17:38:55

by Jeremy Fitzhardinge

Subject: Re: [PATCH] xen: core dom0 support

Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
>> Nick Piggin wrote:
>
>> It really depends on the workload. There's three cases to consider:
>> software shadow pagetables, hardware nested pagetables, and Xen
>> direct pagetables. Even now, Xen's (highly optimised) shadow
>> pagetable code generally out-performs modern nested pagetables, at
>> least when running Windows (for which that code was most heavily tuned).
>
> Can you point to benchmarks? I have a hard time believing this.

Erm, not that I know of off-hand. I don't really have any interest in
Windows performance, so I'm reduced to repeating (highly reliable) Xen
Summit corridor chat.

> How can shadow paging beat nested paging assuming the presence of
> large pages?

I think large pages do turn the tables, and it's close to parity with
shadow with 4k pages on recent cpus. But see above for reliability on
that info.

J

2009-03-04 19:04:04

by Anthony Liguori

Subject: Re: [PATCH] xen: core dom0 support

Jeremy Fitzhardinge wrote:

> OK, fair point, it's probably time for another Xen architecture refresher
> post.
>
> There are two big architectural differences between Xen and KVM:
>
> Firstly, Xen has a separate hypervisor whose primary role is to context
> switch between the guest domains (virtual machines). The hypervisor is
> relatively small and single purpose. It doesn't, for example, contain
> any device drivers or even much knowledge of things like pci buses and
> their structure. The domains themselves are more or less peers; some
> are more privileged than others, but from Xen's perspective they are
> more or less equivalent. The first domain, dom0, is special because it's
> started by Xen itself, and has some inherent initial privileges; its
> main job is to start other domains, and it also typically provides
> virtualized/multiplexed device services to other domains via a
> frontend/backend split driver structure.
>
> KVM, on the other hand, builds all the hypervisor stuff into the kernel
> itself, so you end up with a kernel which does all the normal kernel
> stuff, and can run virtual machines by making them look like slightly
> strange processes.
>
> Because Xen is dedicated to just running virtual machines, its internal
> architecture can be more heavily oriented towards that task, which
> affects everything from how its scheduler works to its use and multiplexing of
> physical memory. For example, Xen manages to use new hardware
> virtualization features pretty quickly, partly because it doesn't need
> to trade-off against normal kernel functions. The clear distinction
> between the privileged hypervisor and the rest of the domains makes the
> security people happy as well. Also, because Xen is small and fairly
> self-contained, there's quite a few hardware vendors shipping it burned
> into the firmware so that it really is the first thing to boot (many of the
> instant-on features that laptops have are based on Xen). Both HP and
> Dell, at least, are selling servers with Xen pre-installed in the firmware.

I think this is a bit misleading. I think you can understand the true
differences between Xen and KVM by s/hypervisor/Operating System/.
Fundamentally, a hypervisor is just an operating system that provides a
hardware-like interface to its processes.

Today, the Xen operating system does not have that many features so it
requires a special process (domain-0) to drive hardware. It uses Linux
for this and it happens that the Linux domain-0 has full access to all
system resources so there is absolutely no isolation between Xen and
domain-0. The domain-0 guest is like a Linux userspace process with
access to an old-style /dev/mem.

You can argue that in theory, one could build a small, decoupled
domain-0, but you could also do this, in theory, with Linux and KVM. It
is not necessary to have all of your device drivers in your Linux
kernel. You could build an initramfs that passed all PCI devices
through (via VT-d) to a single guest, and then provided an interface to
allow that guest to create more guests. This is essentially what dom0
support is.

The real difference between KVM and Xen is that Xen is a separate
Operating System dedicated to virtualization. In many ways, it's a fork
of Linux since it uses quite a lot of Linux code.

The argument for Xen as a separate OS is no different than the argument
for a dedicated Real Time Operating System, a dedicated OS for embedded
systems, or a dedicated OS for a very large system.

Having the distros ship Xen was a really odd thing from a Linux
perspective. It's as if Red Hat started shipping VXworks with a Linux
emulation layer as Real Time Linux.

The arguments for dedicated OSes are well-known. You can do a better
scheduler for embedded/real-time/large systems. You can do a better
memory allocator for embedded/real-time/large systems. These are the
arguments that are made for Xen.

In theory, Xen, the hypervisor, could be merged with upstream Linux but
there are certainly no parties interested in that currently.

My point is not to rail on Xen, but to point out that there isn't really
a choice to be made here from a Linux perspective. It's like saying do
we really need FreeBSD and Linux, maybe those FreeBSD guys should just
merge with Linux. It's not going to happen.

KVM turns Linux into a hypervisor by adding virtualization support. Xen
is a separate hypervisor.

So the real discussion shouldn't be whether KVM and Xen should converge,
because that really doesn't make sense. It's whether it makes sense for upstream
Linux to support being a domain-0 guest under the Xen hypervisor.

Regards,

Anthony Liguori

>
> The second big difference is the use of paravirtualization. Xen can
> securely virtualize a machine without needing any particular hardware
> support. Xen works well on any post-P6 or any ia64 machine, without
> needing any virtualization hardware support. When Xen runs a kernel in
> paravirtualized mode, it runs the kernel in an unprivileged processor
> state. This allows the hypervisor to vet all the guest kernel's
> privileged operations, which are carried out either via hypercalls
> or by memory shared between each guest and Xen.
>
> By contrast, KVM relies on at least VT/SVM (and whatever the ia64 equiv
> is called) being available in the CPUs, and needs the most modern of
> hardware to get the best performance.
>
> One important area of paravirtualization is that Xen guests directly
> use the processor's pagetables; there is no shadow pagetable or use of
> hardware pagetable nesting. This means that a tlb miss is just a tlb
> miss, and happens at full processor performance. This is possible
> because 1) pagetables are always read-only to the guest, and 2) the
> guest is responsible for looking up in a table to map guest-local pfns
> into machine-wide mfns before installing them in a pte. Xen will check
> that any new mapping or pagetable satisfies all the rules, by checking
> that the writable reference count is 0, and that the domain owns (or has
> been allowed access to) any mfn it tries to install in a pagetable.
>
> The other interesting part of paravirtualization is the abstraction of
> interrupts into event channels. Each domain has a bit-array of 1024
> bits which correspond to 1024 possible event channels. An event channel
> can have one of several sources, such as a timer virtual interrupt, an
> inter-domain event, an inter-vcpu IPI, or mapped from a hardware
> interrupt. We end up mapping the event channels back to irqs and they
> are delivered as normal interrupts as far as the rest of the kernel is
> concerned.
>
> The net result is that a paravirtualized Xen guest runs very close to
> full speed. Workloads which modify live pagetables a lot take a bit of
> a performance hit (since the pte updates have to trap to the hypervisor
> for validation), but in general this is not a huge deal. Hardware
> support for nested pagetables is only just beginning to get close to
> getting performance parity, but with different tradeoffs (pagetable
> updates are cheap, but tlb misses are much more expensive, and hits
> consume more tlb entries).
>
> Xen can also make full use of whatever hardware virtualization features
> are available when running an "hvm" domain. This is typically how you'd
> run Windows or other unmodified operating systems.
>
> All of this is stuff that's necessary to support any PV Xen domain, and
> has been in the kernel for a long time now.
>
>
> The additions I'm proposing now are those needed for a Xen domain to
> control the physical hardware, in order to provide virtual device
> support for other less-privileged domains. These changes affect a few
> areas:
>
> * interrupts: mapping a device interrupt into an event channel for
> delivery to the domain with the device driver for that interrupt
> * mappings: allowing direct hardware mapping of device memory into a
> domain
> * dma: making sure that hardware gets programmed with machine memory
> addresses, not virtual ones, and that pages are machine-contiguous
> when expected
>
> Interrupts require a few hooks into the x86 APIC code, but the end
> result is that hardware interrupts are delivered via event channels, but
> then they're mapped back to irqs and delivered normally (they even end
> up with the same irq number as they'd usually have).
>
> Device mappings are fairly easy to arrange. I'm using a software pte
> bit, _PAGE_IOMAP, to indicate that a mapping is a device mapping. This
> bit is set by things like ioremap() and remap_pfn_range, and the Xen mmu
> code just uses the pfn in the pte as-is, rather than doing the normal
> pfn->mfn translation.
>
> DMA is handled via the normal DMA API, with some hooks to swiotlb to
> make sure that the memory underlying its pools is really DMA-ready (ie,
> is contiguous and low enough in machine memory).
>
> The changes I'm proposing may look a bit strange from a purely x86
> perspective, but they fit in relatively well because they're not all
> that different from what other architectures require, and so the
> kernel-wide infrastructure is mostly already in place.
>
>
> I hope that helps clarify what I'm trying to do here, and why Xen and
> KVM do have distinct roles to play.
>
> J

2009-03-04 19:20:57

by H. Peter Anvin

Subject: Re: [PATCH] xen: core dom0 support

Anthony Liguori wrote:
>
> I think this is a bit misleading. I think you can understand the true
> differences between Xen and KVM by s/hypervisor/Operating System/.
> Fundamentally, a hypervisor is just an operating system that provides a
> hardware-like interface to its processes.
>
[...]

>
> The real difference between KVM and Xen is that Xen is a separate
> Operating System dedicated to virtualization. In many ways, it's a fork
> of Linux since it uses quite a lot of Linux code.
>
> The argument for Xen as a separate OS is no different than the argument
> for a dedicated Real Time Operating System, a dedicated OS for embedded
> systems, or a dedicated OS for a very large system.
>

In particular, Xen is a microkernel-type operating system. The dom0
model is a classic single-server, in the style of Mach. A lot of the
"Xen could use a distributed dom0" arguments were also done with Mach
("the real goal is a multi-server") but such a system never materialized
(Hurd was supposed to be one.) Building multiservers is *hard*, and
building multiservers which don't suck is even harder.

-hpa

2009-03-04 19:34:19

by Anthony Liguori

Subject: Re: [PATCH] xen: core dom0 support

H. Peter Anvin wrote:
> In particular, Xen is a microkernel-type operating system. The dom0
> model is a classic single-server, in the style of Mach. A lot of the
> "Xen could use a distributed dom0" arguments were also done with Mach
> ("the real goal is a multi-server") but such a system never
> materialized (Hurd was supposed to be one.) Building multiservers is
> *hard*, and building multiservers which don't suck is even harder.

A lot of the core Xen concepts (domains, event channels, etc.) were
present in the Nemesis[1] exo-kernel project.

Two other interesting papers on the subject are "Are virtual machine monitors
microkernels done right?"[2] from the Xen folks and a rebuttal from the
l4ka group[3].

[1] http://www.cl.cam.ac.uk/research/srg/netos/old-projects/nemesis/
[2] http://portal.acm.org/citation.cfm?id=1251124
[3] http://l4ka.org/publications/paper.php?docid=2189

Regards,

Anthony Liguori
> -hpa

2009-03-05 11:00:38

by George Dunlap

Subject: Re: [Xen-devel] Re: [PATCH] xen: core dom0 support

On Wed, Mar 4, 2009 at 5:34 PM, Anthony Liguori <[email protected]> wrote:
> Can you point to benchmarks? I have a hard time believing this.
>
> How can shadow paging beat nested paging assuming the presence of large
> pages?

If these benchmarks would help this discussion, we can certainly run
some. As of last Fall, even with superpage support, certain workloads
perform significantly less well with HAP (hardware-assisted paging)
than with shadow pagetables. Examples are specjbb, which does almost
no pagetable updates, but totally thrashes the TLB. SysMark also
performed much better with shadow pagetables than HAP. And of course,
64-bit is worse than 32-bit. (It's actually a bit annoying from a
default-policy perspective, since about half of our workloads perform
better with HAP (up to 30% better) and half of them perform worse (up
to 30% worse)).

Our comparison would, of course, be comparing Xen+HAP to Xen+Shadow,
which isn't necessarily comparable to KVM+HAP.

Having HAP work well would be great for us as well as KVM. But
there's still the argument about hardware support: Xen can run
paravirtualized VMs on hardware with no HVM support, and can run fully
virtualized domains very well on hardware that has HVM support but not
HAP support.

-George Dunlap

2009-03-05 14:37:27

by Anthony Liguori

Subject: Re: [Xen-devel] Re: [PATCH] xen: core dom0 support

George Dunlap wrote:
> On Wed, Mar 4, 2009 at 5:34 PM, Anthony Liguori <[email protected]> wrote:
>
>> Can you point to benchmarks? I have a hard time believing this.
>>
>> How can shadow paging beat nested paging assuming the presence of large
>> pages?
>>
>
> If these benchmarks would help this discussion, we can certainly run
> some. As of last Fall, even with superpage support, certain workloads
> perform significantly less well with HAP (hardware-assisted paging)
> than with shadow pagetables. Examples are specjbb, which does almost
> no pagetable updates, but totally thrashes the TLB.

I suspected specjbb was the benchmark. specjbb is an anomaly, as
it's really the only benchmark where even a naive shadow paging
implementation performs very close to native.

specjbb also turns into a pathological case with HAP. In my
measurements, HAP with 4k pages was close to 70% of native for specjbb.
Once you enable large pages though, you get pretty close to native.
IIRC, around 95%. I suspect that over time as the caching algorithms
improve, this will approach 100% of native.

Then again, there are workloads like kernbench that are pathological for
shadow paging in a much more dramatic way. At least on shadow2, I was
seeing around 60% of native with kernbench. With direct paging, it goes
to about 85% of native. With NPT and large pages, it's almost 100% of
native.

> SysMark also
> performed much better with shadow pagetables than HAP. And of course,
> 64-bit is worse than 32-bit. (It's actually a bit annoying from a
> default-policy perspective, since about half of our workloads perform
> better with HAP (up to 30% better) and half of them perform worse (up
> to 30% worse)).
>
> Our comparison would, of course, be comparing Xen+HAP to Xen+Shadow,
> which isn't necessarily comparable to KVM+HAP.
>
> Having HAP work well would be great for us as well as KVM. But
> there's still the argument about hardware support: Xen can run
> paravirtualized VMs on hardware with no HVM support, and can run fully
> virtualized domains very well on hardware that has HVM support but not
> HAP support.
>

Xen is definitely not going away and as such, supporting it in Linux
seems like a good idea to me. I'm just refuting claims that the Xen
architecture has intrinsic advantages wrt MMU virtualization. It's
simply not the case :-)

Regards,

Anthony Liguori

> -George Dunlap
>