2007-01-05 21:55:38

by Ingo Molnar

Subject: [announce] [patch] KVM paravirtualization for Linux


i'm pleased to announce the first release of paravirtualized KVM (Linux
under Linux), which includes support for the hardware cr3-cache feature
of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)

the patch is against 2.6.20-rc3 + KVM trunk and can be found at:

http://redhat.com/~mingo/kvm-paravirt-patches/

Some aspects of the code are still a bit ad-hoc and incomplete, but the
code is stable enough in my testing and i'd like to have some feedback.

Firstly, here are some numbers:

2-task context-switch performance (in microseconds, lower is better):

native: 1.11
----------------------------------
Qemu: 61.18
KVM upstream: 53.01
KVM trunk: 6.36
KVM trunk+paravirt/cr3: 1.60

i.e. 2-task context-switch performance is faster by a factor of 4, and
is now quite close to native speed!
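(a side note for the reader: the mail does not name the benchmark used.
Below is a minimal sketch - in the spirit of lmbench's lat_ctx, and
purely illustrative rather than taken from the patch - of the kind of
pipe-based ping-pong test that measures 2-task context-switch cost:
two processes bounce a byte across a pair of pipes, so every round trip
costs two context switches.)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

#define ROUNDS 100000

int main(void)
{
	int p1[2], p2[2], i;
	char c = 0;
	struct timeval t0, t1;

	if (pipe(p1) || pipe(p2))
		exit(1);

	if (fork() == 0) {
		/* child: echo every byte straight back */
		for (i = 0; i < ROUNDS; i++) {
			read(p1[0], &c, 1);
			write(p2[1], &c, 1);
		}
		exit(0);
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < ROUNDS; i++) {
		write(p1[1], &c, 1);	/* wake the child ... */
		read(p2[0], &c, 1);	/* ... and wait for its reply */
	}
	gettimeofday(&t1, NULL);

	/* each round trip is two context switches */
	printf("%.2f usecs/switch\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e6 +
		(t1.tv_usec - t0.tv_usec)) / (2.0 * ROUNDS));
	return 0;
}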

"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):

native: 0.25
----------------------------------
Qemu: 7.8
KVM upstream: 2.8
KVM trunk: 0.55
KVM paravirt/cr3: 0.36

about 1.5x as fast (0.55 vs 0.36 seconds).

"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):

native: 0.9
----------------------------------
Qemu: 35.2
KVM upstream: 9.4
KVM trunk: 2.8
KVM paravirt/cr3: 2.2

still a 30% improvement - which isn't too bad considering that 200 tasks
are context-switching in this workload and the cr3 cache in current CPUs
is only 4 entries.

the patchset does the following:

- it provides an ad-hoc paravirtualization hypercall API between a Linux
guest and a Linux host. (this will be replaced with a proper
hypercall later on.)

- using the hypercall API it utilizes the "cr3 target cache" feature in
Intel VMX CPUs, and extends KVM to make use of that cache. This
feature allows the avoidance of expensive VM exits into hypervisor
context. (The guest needs to be 'aware' and the cache has to be
shared between the guest and the hypervisor. So fully emulated OSs
won't benefit from this feature.)

- a few simpler paravirtualization changes are done for Linux guests: IO
port delays do not cause a VM exit anymore, the i8259A IRQ controller
code got simplified (this will be replaced with a proper, hypercall
based and host-maintained IRQ controller implementation) and TLB
flushes are more efficient, because no cr3 reads happen which would
otherwise cause a VM exit. These changes have a visible effect
already: they reduce qemu's CPU usage when a guest idles in HLT, by
about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has
HZ=1000)

Paravirtualization is triggered via the kvm_paravirt=1 boot option (for
now, this too is ad-hoc) - if that is passed then the KVM guest will
probe for paravirtualization availability on the hypervisor side - and
will use it if found. (If the guest does not find KVM-paravirt support
on the hypervisor side then it will continue as a fully emulated guest.)
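(pieced together from the patch excerpts quoted later in this thread,
here is a sketch of that probe flow. Only the para_state setup and the
register-to-cr3 hypercall are from the patch; the function itself and
its name are illustrative:)

static int kvm_probe_paravirt(void)
{
	struct kvm_vcpu_para_state *para_state = &get_cpu_var(para_state);
	u32 ret;

	para_state->guest_version = KVM_PARA_API_VERSION;
	para_state->host_version = -1;	/* the host fills this in */
	para_state->size = sizeof(*para_state);

	/* register-to-cr3 hypercall, with the magic value in the
	   output register and the descriptor address in %eax: */
	asm volatile ("movl %0, %%cr3"
		      : "=&r" (ret)
		      : "a" (para_state), "0" (KVM_API_MAGIC));

	put_cpu_var(para_state);

	/* no KVM-paravirt host: continue as a fully emulated guest */
	return para_state->host_version != -1;
}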

Issues: i only tested this on 32-bit VMX. (64-bit should work with not
too many changes, the paravirt.c bits can be carried over to 64-bit
almost as-is. But i didn't want to spread the code too wide.)

Comments, suggestions are welcome!

Ingo


2007-01-05 22:15:22

by Zachary Amsden

Subject: Re: [announce] [patch] KVM paravirtualization for Linux

Ingo Molnar wrote:
> i'm pleased to announce the first release of paravirtualized KVM (Linux
> under Linux), which includes support for the hardware cr3-cache feature
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
> http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the
> code is stable enough in my testing and i'd like to have some feedback.
>

Your code looks generally good. I have some comments.

You can't do this, even though you want to:

-EXPORT_SYMBOL(paravirt_ops);
+EXPORT_SYMBOL_GPL(paravirt_ops);


The problem is that it effectively makes all modules GPL-only: any
module that uses any kind of locking pulls in the basic definitions for
enabling and disabling interrupts, and thus the paravirt_ops symbol - so
basically all modules.

What you really want is more like EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);

But I'm not sure that is technically feasible yet.

The kvm code should probably go in kvm.c instead of paravirt.c.


Index: linux/drivers/serial/8250.c
===================================================================
--- linux.orig/drivers/serial/8250.c
+++ linux/drivers/serial/8250.c
@@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(

l = l->next;

- if (l == i->head && pass_counter++ > PASS_LIMIT) {
+ if (!kvm_paravirt



Is this a bug that might happen under other virtualizations as well, not
just kvm? Perhaps it deserves a disable feature instead of a kvm
specific check.

Which also gets rid of the need for this unusually placed extern:

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1911,6 +1911,11 @@ static inline void set_task_cpu(struct t

#endif /* CONFIG_SMP */

+/*
+ * Is paravirtualization active?
+ */
+extern int kvm_paravirt;
+



2007-01-05 22:33:45

by Ingo Molnar

Subject: Re: [announce] [patch] KVM paravirtualization for Linux


* Zachary Amsden <[email protected]> wrote:

> What you really want is more like
> EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);

yep. Not a big issue - what is important is to put the paravirt ops into
the read-only section so that it's somewhat harder for rootkits to
modify. (Also, it needs to be made clear that this is fundamental,
lowlevel system functionality written by people under the GPLv2, so that
if you utilize it beyond its original purpose, using its internals, you
likely create a work derived from the kernel. Something as simple as
irq disabling probably doesn't qualify - we have exported that to
modules for a long time - but lots of other details do. So the
existence of paravirt_ops isn't a free-for-all.)

> But I'm not sure that is technically feasible yet.
>
> The kvm code should probably go in kvm.c instead of paravirt.c.

no. This is fundamental architecture boot code, not module code. kvm.c
should eventually go into kernel/ and arch/*/kernel, not the other way
around.

> Index: linux/drivers/serial/8250.c
> ===================================================================
> --- linux.orig/drivers/serial/8250.c
> +++ linux/drivers/serial/8250.c
> @@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(
>
> l = l->next;
>
> - if (l == i->head && pass_counter++ > PASS_LIMIT) {
> + if (!kvm_paravirt
>
> Is this a bug that might happen under other virtualizations as well,
> not just kvm? Perhaps it deserves a disable feature instead of a kvm
> specific check.

yes - this limit is easily triggered via the KVM/Qemu virtual serial
drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's just
a flag.

Ingo

2007-01-05 22:50:16

by Zachary Amsden

Subject: Re: [announce] [patch] KVM paravirtualization for Linux

Ingo Molnar wrote:
> * Zachary Amsden <[email protected]> wrote:
>
>
>> What you really want is more like
>> EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);
>>
>
> yep. Not a big issue - what is important is to put the paravirt ops into
> the read-only section so that it's somewhat harder for rootkits to
> modify. (Also, it needs to be made clear that this is fundamental,
> lowlevel system functionality written by people under the GPLv2, so that
> if you utilize it beyond its original purpose, using its internals, you
> likely create a work derived from the kernel. Something as simple as
> irq disabling probably doesn't qualify - we have exported that to
> modules for a long time - but lots of other details do. So the
> existence of paravirt_ops isn't a free-for-all.)
>

I agree completely. It would be nice to have a way to make certain
kernel structures available, but non-mutable to non-GPL modules.

>> But I'm not sure that is technically feasible yet.
>>
>> The kvm code should probably go in kvm.c instead of paravirt.c.
>>
>
> no. This is fundamental architecture boot code, not module code. kvm.c
> should eventually go into kernel/ and arch/*/kernel, not the other way
> around.
>

What I meant was kvm.c in arch/i386/kernel - as symmetric to the other
paravirt-ops modules, which live in arch/i386/kernel/vmi.c / lhype.c,
etc. Either that, or we should move them to be symmetric, but I don't
think paravirt.c is the proper place for kvm specific code.


>
>> Index: linux/drivers/serial/8250.c
>> ===================================================================
>> --- linux.orig/drivers/serial/8250.c
>> +++ linux/drivers/serial/8250.c
>> @@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(
>>
>> l = l->next;
>>
>> - if (l == i->head && pass_counter++ > PASS_LIMIT) {
>> + if (!kvm_paravirt
>>
>> Is this a bug that might happen under other virtualizations as well,
>> not just kvm? Perhaps it deserves a disable feature instead of a kvm
>> specific check.
>>
>
> yes - this limit is easily triggered via the KVM/Qemu virtual serial
> drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's just
> a flag.
>

Can't you just test paravirt_enabled() in that case?


Zach

2007-01-05 23:02:36

by Anthony Liguori

Subject: Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

This is pretty cool. I've read the VT spec a number of times and never
understood why they included the CR3 caching :-) I suspect that you may
even be faster than Xen for context switches because of the hardware
assistance here. Any chance you can run your benchmarks against Xen?

You may already know this, but there isn't a CR3 cache in SVM AFAIK, so a
fair bit of the code will probably have to be enabled conditionally.
Otherwise, I don't think SVM support would be that hard. The only other
odd bit I noticed was:

> + magic_val = KVM_API_MAGIC;
> + para_state->guest_version = KVM_PARA_API_VERSION;
> + para_state->host_version = -1;
> + para_state->size = sizeof(*para_state);
> +
> + asm volatile ("movl %0, %%cr3"
> + : "=&r" (ret)
> + : "a" (para_state),
> + "0" (magic_val)
> + );

If I read this correctly, you're using a CR3 write as a hypercall (and
relying on registers being set/returned that aren't part of the actual
CR3 move)? Any reason for not just using vmcall? It seems a bit less
awkward. Even a PIO operation would be a bit cleaner IMHO. PIO exits
tend to be fast in VT/SVM so that's an added benefit.
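(for reference, a sketch of the vmcall variant - illustrative only.
Note that vmcall is VMX-specific and SVM's equivalent is vmmcall, which
is one argument for a vendor-neutral mechanism; older assemblers may
need the raw opcode bytes, .byte 0x0f, 0x01, 0xc1:)

static inline long kvm_hypercall0(unsigned long nr)
{
	long ret;

	/* hypercall number in %eax, return value in %eax */
	asm volatile ("vmcall"
		      : "=a" (ret)
		      : "a" (nr)
		      : "memory");
	return ret;
}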

Regards,

Anthony Liguori

Ingo Molnar wrote:
> [snip]

2007-01-05 23:32:05

by Ingo Molnar

Subject: Re: [announce] [patch] KVM paravirtualization for Linux


* Zachary Amsden <[email protected]> wrote:

> >>What you really want is more like
> >>EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);
> >>
> >
> >yep. Not a big issue - what is important is to put the paravirt ops
> >into the read-only section so that it's somewhat harder for rootkits
> >to modify. (Also, it needs to be made clear that this is fundamental,
> >lowlevel system functionality written by people under the GPLv2, so
> >that if you utilize it beyond its original purpose, using its
> >internals, you likely create a work derived from the kernel.
> >Something as simple as irq disabling probably doesn't qualify - we
> >have exported that to modules for a long time - but lots of other
> >details do. So the existence of paravirt_ops isn't a free-for-all.)
>
> I agree completely. It would be nice to have a way to make certain
> kernel structures available, but non-mutable to non-GPL modules.

the thing is, we are not 'making these available to non-GPL modules'.
The _GPL postfix does not turn the other normal exports from grey into
white. The _GPL postfix turns exports into almost-black for non-GPL
modules. The rest is still grey.

in this case, i'd only make local_irq_*() available to modules (of any
type), not the other paravirt ops. Exporting the whole of paravirt_ops
was a mistake, and this has to be fixed before 2.6.20. I'll cook up a
patch.
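(a sketch of that direction - illustrative only, not the actual fix
that went into 2.6.20; the wrapper names are made up, and it assumes
paravirt_ops has the save_fl/restore_fl members it has in current
trees:)

unsigned long paravirt_save_flags(void)
{
	return paravirt_ops.save_fl();
}
EXPORT_SYMBOL(paravirt_save_flags);

void paravirt_restore_flags(unsigned long flags)
{
	paravirt_ops.restore_fl(flags);
}
EXPORT_SYMBOL(paravirt_restore_flags);

/* modules get these thin irq-flags wrappers (plain EXPORT_SYMBOL,
   i.e. available to modules of any type), while paravirt_ops itself
   stays unexported. */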

> >yes - this limit is easily triggered via the KVM/Qemu virtual serial
> >drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's
> >just a flag.
>
> Can't you just test paravirt_enabled() in that case?

yep - i've changed this in my tree.
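(for reference, a sketch of what that change presumably looks like -
illustrative, since the diff quoted earlier in the thread is truncated:
the interrupt-storm bail-out is skipped when running paravirtualized,
where the slow virtual UART can trigger it spuriously:)

	if (l == i->head && !paravirt_enabled() &&
	    pass_counter++ > PASS_LIMIT) {
		/* If we hit this, we're dead. */
		printk(KERN_ERR "serial8250: too much work for irq%d\n",
		       irq);
		break;
	}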

Ingo

2007-01-07 00:31:58

by Pavel Machek

Subject: Re: [announce] [patch] KVM paravirtualization for Linux

Hi!

> i'm pleased to announce the first release of paravirtualized KVM (Linux
> under Linux), which includes support for the hardware cr3-cache feature
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
> http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the
> code is stable enough in my testing and i'd like to have some feedback.
>
> Firstly, here are some numbers:
>
> 2-task context-switch performance (in microseconds, lower is better):
>
> native: 1.11
> ----------------------------------
> Qemu: 61.18
> KVM upstream: 53.01
> KVM trunk: 6.36
> KVM trunk+paravirt/cr3: 1.60

Does this make Xen obsolete? I mean... we have xen patches in suse
kernels, should we keep updating them, or just drop them in favour of
KVM?
Pavel
--
Thanks for all the (sleeping) penguins.

2007-01-07 12:20:28

by Avi Kivity

Subject: Re: [announce] [patch] KVM paravirtualization for Linux

Ingo Molnar wrote:
> i'm pleased to announce the first release of paravirtualized KVM (Linux
> under Linux), which includes support for the hardware cr3-cache feature
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
> http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the
> code is stable enough in my testing and i'd like to have some feedback.
>
> Firstly, here are some numbers:
>
> 2-task context-switch performance (in microseconds, lower is better):
>
> native: 1.11
> ----------------------------------
> Qemu: 61.18
> KVM upstream: 53.01
> KVM trunk: 6.36
> KVM trunk+paravirt/cr3: 1.60
>
> i.e. 2-task context-switch performance is faster by a factor of 4, and
> is now quite close to native speed!
>

Very impressive! The gain probably comes not only from avoiding the
vmentry/vmexit, but also from avoiding the flushing of the global page
tlb entries.

> "hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
>
> native: 0.25
> ----------------------------------
> Qemu: 7.8
> KVM upstream: 2.8
> KVM trunk: 0.55
> KVM paravirt/cr3: 0.36
>
> about 1.5x as fast (0.55 vs 0.36 seconds).
>
> "hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
>
> native: 0.9
> ----------------------------------
> Qemu: 35.2
> KVM upstream: 9.4
> KVM trunk: 2.8
> KVM paravirt/cr3: 2.2
>
> still a 30% improvement - which isn't too bad considering that 200 tasks
> are context-switching in this workload and the cr3 cache in current CPUs
> is only 4 entries.
>

This is a little too good to be true. Were both runs with the same
KVM_NUM_MMU_PAGES?

I'm also concerned that at this point in time the cr3 optimizations will
only show an improvement in microbenchmarks. In real life workloads a
context switch is usually preceded by an I/O, and with the current sorry
state of kvm I/O the context switch time would be dominated by the I/O time.

> [snip]

Some comments below on the code.

> +
> +/*
> + * Special, register-to-cr3 instruction based hypercall API
> + * variant to the KVM host. This utilizes the cr3 filter capability
> + * of the hardware - if this works out then no VM exit happens,
> + * if a VM exit happens then KVM will get the virtual address too.
> + */
> +static void kvm_write_cr3(unsigned long guest_cr3)
> +{
> + struct kvm_vcpu_para_state *para_state = &get_cpu_var(para_state);
> + struct kvm_cr3_cache *cache = &para_state->cr3_cache;
> + int idx;
> +
> + /*
> + * Check the cache (maintained by the host) for a matching
> + * guest_cr3 => host_cr3 mapping. Use it if found:
> + */
> + for (idx = 0; idx < cache->max_idx; idx++) {
> + if (cache->entry[idx].guest_cr3 == guest_cr3) {
> + /*
> + * Cache-hit: we load the cached host-CR3 value.
> + * This never causes any VM exit. (if it does then the
> + * hypervisor could do nothing with this instruction
> + * and the guest OS would be aborted)
> + */
> + asm volatile("movl %0, %%cr3"
> + : : "r" (cache->entry[idx].host_cr3));
> + goto out;
> + }
> + }
> +
> + /*
> + * Cache-miss. Load the guest-cr3 value into cr3, which will
> + * cause a VM exit to the hypervisor, which then loads the
> + * host cr3 value and updates the cr3_cache.
> + */
> + asm volatile("movl %0, %%cr3" : : "r" (guest_cr3));
> +out:
> + put_cpu_var(para_state);
> +}

Well, you did say it was ad-hoc. For reference, this is how I see the
hypercall API:

- A virtual pci device exports a page through the pci rom interface.
The page contains the hypercall code appropriate for the current cpu.
This allows migration to work across different cpu vendors. (A sketch
of this calling convention follows after the list.)
- In case the pci rom is discovered too late in the boot process, the
address (gpa) can also be exported via a kvm-specific msr.
- Guest/host communication is by guest physical address, as the
virtual->physical translation is much cheaper on the guest (__pa() vs. a
page table walk).
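(a sketch of that calling convention - all names here are hypothetical:
the host fills the page with vmcall/vmmcall code as appropriate for the
cpu, the guest maps it from the virtual device's rom and calls through
it, so no vendor-specific instruction is hardcoded in the guest:)

static void *hypercall_page;	/* mapped from the virtual pci rom */

static inline long kvm_hypercall1(unsigned long nr, unsigned long gpa)
{
	long ret;

	/*
	 * Call into the host-provided code. Note that the argument is
	 * a guest physical address, per the last point above.
	 */
	asm volatile ("call *%2"
		      : "=a" (ret)
		      : "a" (nr), "r" (hypercall_page), "b" (gpa)
		      : "memory");
	return ret;
}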


>
> +
> +/*
> + * Simplified i8259A controller handling:
> + */
> +static void mask_and_ack_kvm(unsigned int irq)
> +{
> + unsigned int irqmask = 1 << irq;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&i8259A_lock, flags);
> + cached_irq_mask |= irqmask;
> +
> + if (irq & 8) {
> + outb(cached_slave_mask, PIC_SLAVE_IMR);
> + outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> + outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' to master-IRQ2 */
> + } else {
> + outb(cached_master_mask, PIC_MASTER_IMR);
> + /* 'Specific EOI' to master: */
> + outb(0x60+irq, PIC_MASTER_CMD);
> + }
> + spin_unlock_irqrestore(&i8259A_lock, flags);
> +}

Any reason this can't be applied to mainline? There's probably no
downside for native hardware, and it would benefit all virtualization
solutions equally.

> Index: linux/drivers/kvm/kvm.h
> ===================================================================
> --- linux.orig/drivers/kvm/kvm.h
> +++ linux/drivers/kvm/kvm.h
> @@ -165,7 +165,7 @@ struct kvm_mmu {
> int root_level;
> int shadow_root_level;
>
> - u64 *pae_root;
> + u64 *pae_root[KVM_CR3_CACHE_SIZE];

hmm. wouldn't it be simpler to have pae_root always point at the
current root?

>
> Index: linux/drivers/kvm/mmu.c
> ===================================================================
> --- linux.orig/drivers/kvm/mmu.c
> +++ linux/drivers/kvm/mmu.c
> @@ -779,7 +779,7 @@ static int nonpaging_map(struct kvm_vcpu
>
> static void mmu_free_roots(struct kvm_vcpu *vcpu)
> {
> - int i;
> + int i, j;
> struct kvm_mmu_page *page;
>
> #ifdef CONFIG_X86_64
> @@ -793,21 +793,40 @@ static void mmu_free_roots(struct kvm_vc
> return;
> }
> #endif
> - for (i = 0; i < 4; ++i) {
> - hpa_t root = vcpu->mmu.pae_root[i];
> + /*
> + * Skip to the next cr3 filter entry and free it (if it's occupied):
> + */
> + vcpu->cr3_cache_idx++;
> + if (unlikely(vcpu->cr3_cache_idx >= vcpu->cr3_cache_limit))
> + vcpu->cr3_cache_idx = 0;
>
> - ASSERT(VALID_PAGE(root));
> - root &= PT64_BASE_ADDR_MASK;
> - page = page_header(root);
> - --page->root_count;
> - vcpu->mmu.pae_root[i] = INVALID_PAGE;
> + j = vcpu->cr3_cache_idx;
> + /*
> + * Clear the guest-visible entry:
> + */
> + if (vcpu->para_state) {
> + vcpu->para_state->cr3_cache.entry[j].guest_cr3 = 0;
> + vcpu->para_state->cr3_cache.entry[j].host_cr3 = 0;
> + }
> + ASSERT(vcpu->mmu.pae_root[j]);
> + if (VALID_PAGE(vcpu->mmu.pae_root[j][0])) {
> + vcpu->guest_cr3_gpa[j] = INVALID_PAGE;
> + for (i = 0; i < 4; ++i) {
> + hpa_t root = vcpu->mmu.pae_root[j][i];
> +
> + ASSERT(VALID_PAGE(root));
> + root &= PT64_BASE_ADDR_MASK;
> + page = page_header(root);
> + --page->root_count;
> + vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> + }
> }
> vcpu->mmu.root_hpa = INVALID_PAGE;
> }

You keep the page directories pinned here. This can be a problem if a
guest frees a page directory, and then starts using it as a regular
page. kvm sometimes chooses not to emulate a write to a guest page
table, but instead to zap it, which is impossible when the page is
freed. You need to either unpin the page when that happens, or add a
hypercall to let kvm know when a page directory is freed.

There is also a minor problem that changes to the pgd aren't caught by
kvm. It doesn't hurt much as this is PV and we can relax the guest/host
contract a little.


>
> static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
> {
> struct page *page;
> - int i;
> + int i, j;
>
> ASSERT(vcpu);
>
> @@ -1227,17 +1261,22 @@ static int alloc_mmu_pages(struct kvm_vc
> ++vcpu->kvm->n_free_mmu_pages;
> }
>
> - /*
> - * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
> - * Therefore we need to allocate shadow page tables in the first
> - * 4GB of memory, which happens to fit the DMA32 zone.
> - */
> - page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> - if (!page)
> - goto error_1;
> - vcpu->mmu.pae_root = page_address(page);
> - for (i = 0; i < 4; ++i)
> - vcpu->mmu.pae_root[i] = INVALID_PAGE;
> + for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> + /*
> + * When emulating 32-bit mode, cr3 is only 32 bits even on
> + * x86_64. Therefore we need to allocate shadow page tables
> + * in the first 4GB of memory, which happens to fit the DMA32
> + * zone:
> + */
> + page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> + if (!page)
> + goto error_1;
> +
> + ASSERT(!vcpu->mmu.pae_root[j]);
> + vcpu->mmu.pae_root[j] = page_address(page);
> + for (i = 0; i < 4; ++i)
> + vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> + }

Since a pae root is just 4 u64 entries (32 bytes), all
KVM_CR3_CACHE_SIZE roots together take only 128 bytes and could share a
single page. Not that it matters much.

> #define ENABLE_INTERRUPTS_SYSEXIT \
> Index: linux/include/linux/kvm.h
> ===================================================================
> --- linux.orig/include/linux/kvm.h
> +++ linux/include/linux/kvm.h
> @@ -238,4 +238,44 @@ struct kvm_dirty_log {
>
> #define KVM_DUMP_VCPU _IOW(KVMIO, 250, int /* vcpu_slot */)
>
> +
> +#define KVM_CR3_CACHE_SIZE 4
> +
> +struct kvm_cr3_cache_entry {
> + u64 guest_cr3;
> + u64 host_cr3;
> +};
> +
> +struct kvm_cr3_cache {
> + struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
> + u32 max_idx;
> +};
> +
> +/*
> + * Per-VCPU descriptor area shared between guest and host. Writable to
> + * both guest and host. Registered with the host by the guest when
> + * a guest acknowledges paravirtual mode.
> + */
> +struct kvm_vcpu_para_state {
> + /*
> + * API version information for compatibility. If there's any support
> + * mismatch (too old host trying to execute too new guest) then
> + * the host will deny entry into paravirtual mode. Any other
> + * combination (new host + old guest and new host + new guest)
> + * is supposed to work - new host versions will support all old
> + * guest API versions.
> + */
> + u32 guest_version;
> + u32 host_version;
> + u32 size;
> + u32 __pad_00;
> +
> + struct kvm_cr3_cache cr3_cache;
> +
> +} __attribute__ ((aligned(PAGE_SIZE)));
> +
> +#define KVM_PARA_API_VERSION 1
> +
> +#define KVM_API_MAGIC 0x87654321
> +

<linux/kvm.h> is the vmm userspace interface. The guest/host interface
should probably go somewhere else.

--
error compiling committee.c: too many arguments to function

2007-01-07 17:47:32

by Ingo Molnar

Subject: Re: [announce] [patch] KVM paravirtualization for Linux


* Avi Kivity <[email protected]> wrote:

> >2-task context-switch performance (in microseconds, lower is better):
> >
> > native: 1.11
> > ----------------------------------
> > Qemu: 61.18
> > KVM upstream: 53.01
> > KVM trunk: 6.36
> > KVM trunk+paravirt/cr3: 1.60
> >
> >i.e. 2-task context-switch performance is faster by a factor of 4, and
> >is now quite close to native speed!
> >
>
> Very impressive! The gain probably comes not only from avoiding the
> vmentry/vmexit, but also from avoiding the flushing of the global page
> tlb entries.

90% of the win comes from the avoidance of the VM exit. To quantify
this more precisely i added an artificial __flush_tlb_global() call
right after switch_to(), just to see how much impact an extra global
flush has on the native kernel. Context-switch cost went from 1.11
usecs to 1.65 usecs. Then i added a __flush_tlb() as well, which took
the cost to 1.75 - which means the global-flush component is at around
0.5 usecs.

> >"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
> >
> > native: 0.25
> > ----------------------------------
> > Qemu: 7.8
> > KVM upstream: 2.8
> > KVM trunk: 0.55
> > KVM paravirt/cr3: 0.36
> >
> >about 1.5x as fast (0.55 vs 0.36 seconds).
> >
> >"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
> >
> > native: 0.9
> > ----------------------------------
> > Qemu: 35.2
> > KVM upstream: 9.4
> > KVM trunk: 2.8
> > KVM paravirt/cr3: 2.2
> >
> >still a 30% improvement - which isn't too bad considering that 200 tasks
> >are context-switching in this workload and the cr3 cache in current CPUs
> >is only 4 entries.
> >
>
> This is a little too good to be true. Were both runs with the same
> KVM_NUM_MMU_PAGES?

yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk'
run should have been labeled as: 'cr3 tree with paravirt turned off'.
That's not completely 'trunk' but close to it, and all other changes
(like elimination of unnecessary TLB flushes) are fairly applied to
both.

i also did a run with far fewer MMU cache pages (256), and hackbench 1
stayed the same, while the hackbench 5 numbers started fluctuating badly
(i think that workload is thrashing the MMU cache).

> I'm also concerned that at this point in time the cr3 optimizations
> will only show an improvement in microbenchmarks. In real life
> workloads a context switch is usually preceded by an I/O, and with the
> current sorry state of kvm I/O the context switch time would be
> dominated by the I/O time.

oh, i agree completely - but in my opinion accelerating virtual I/O is
really easy. Accelerating the context-switch path (and keeping basic
syscall overhead down, as KVM does) is /hard/. So i wanted to see
whether KVM runs well in all the hard cases, before looking at the
low-hanging performance fruit in the I/O area =B-)

also note that there's lots of internal reasons why application
workloads can be heavily context-switching - it's not just I/O that
generates them. (pipes, critical sections / futexes, etc.) So having
near-native performance for context-switches is very important.

> >+ if (irq & 8) {
> >+ outb(cached_slave_mask, PIC_SLAVE_IMR);
> >+ outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> >+ outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' to master-IRQ2 */
> >+ } else {
> >+ outb(cached_master_mask, PIC_MASTER_IMR);
> >+ /* 'Specific EOI' to master: */
> >+ outb(0x60+irq, PIC_MASTER_CMD);
> >+ }
> >+ spin_unlock_irqrestore(&i8259A_lock, flags);
> >+}
>
> Any reason this can't be applied to mainline? There's probably no
> downside to native, and it would benefit all virtualization solutions
> equally.

this is legacy stuff ...

> >- u64 *pae_root;
> >+ u64 *pae_root[KVM_CR3_CACHE_SIZE];
>
> hmm. wouldn't it be simpler to have pae_root always point at the
> current root?

does that guarantee that it's available? I wanted to 'pin' the root
itself this way, to make sure that if a guest switches to it via the
cache, that it's truly available and a valid root. cr3 addresses are
non-virtual so this is the only mechanism available to guarantee that
the host-side memory truly contains a root pagetable.

> >+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+ }
> > }
> > vcpu->mmu.root_hpa = INVALID_PAGE;
> > }
>
> You keep the page directories pinned here. [...]

yes.

> [...] This can be a problem if a guest frees a page directory, and
> then starts using it as a regular page. kvm sometimes chooses not to
> emulate a write to a guest page table, but instead to zap it, which is
> impossible when the page is freed. You need to either unpin the page
> when that happens, or add a hypercall to let kvm know when a page
> directory is freed.

the cache is zapped upon pagefaults anyway, so unpinning ought to be
possible. Which one would you prefer?

> >- for (i = 0; i < 4; ++i)
> >- vcpu->mmu.pae_root[i] = INVALID_PAGE;
> >+ for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> >+ /*
> >+ * When emulating 32-bit mode, cr3 is only 32 bits even on
> >+ * x86_64. Therefore we need to allocate shadow page tables
> >+ * in the first 4GB of memory, which happens to fit the DMA32
> >+ * zone:
> >+ */
> >+ page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> >+ if (!page)
> >+ goto error_1;
> >+
> >+ ASSERT(!vcpu->mmu.pae_root[j]);
> >+ vcpu->mmu.pae_root[j] = page_address(page);
> >+ for (i = 0; i < 4; ++i)
> >+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+ }
>
> Since a pae root uses just 32 bytes, you can store all cache entries
> in a single page. Not that it matters much.

yeah - i wanted to extend the current code in a safe way, before
optimizing it.

> >+#define KVM_API_MAGIC 0x87654321
> >+
>
> <linux/kvm.h> is the vmm userspace interface. The guest/host
> interface should probably go somewhere else.

yeah. kvm_para.h?

Ingo

2007-01-07 17:50:43

by Hollis Blanchard

Subject: Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

On Sun, 2007-01-07 at 14:20 +0200, Avi Kivity wrote:
>
>
> Well, you did say it was ad-hoc. For reference, this is how I see the
> hypercall API:
[snip]
> - Guest/host communication is by guest physical address, as the
> virtual->physical translation is much cheaper on the guest (__pa() vs.
> a page table walk).

Strongly agreed. One of the major problems we had with the PowerPC Xen
port was that Xen passes virtual addresses (even userspace virtual
addresses!) to the hypervisor. Performing an MMU search on PowerPC is far
more convoluted than x86's table walk and is not feasible in software.

I'm anxious to avoid the same mistake wherever possible.

Of course, even with physical addresses, data structures that straddle
page boundaries prevent the hypervisor from mapping contiguous physical
pages to discontiguous machine pages (or whatever terminology you want
to use).

IBM's Power Hypervisor passes hypercall data in registers most of the
time. In the places it communicates via memory, I believe the parameter
is actually a physical frame number, and the size is explicitly limited
to one page.
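(a sketch of that convention - the names and layout are hypothetical:
the guest passes a physical frame number in a register, and the shared
buffer is capped at one page so it can never straddle a page boundary:)

struct hcall_buf {
	u32 len;			/* bytes used, <= sizeof(data) */
	u8 data[PAGE_SIZE - sizeof(u32)];
};	/* allocated page-aligned, so it never crosses a page */

static long hcall_send(struct hcall_buf *buf)
{
	unsigned long pfn = __pa(buf) >> PAGE_SHIFT;

	/* hypercall1() and HCALL_SEND are made-up names: the point is
	   that the argument is a pfn, not a pointer. */
	return hypercall1(HCALL_SEND, pfn);
}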

-Hollis

2007-01-07 18:30:00

by Christoph Hellwig

Subject: Re: [announce] [patch] KVM paravirtualization for Linux

On Sat, Jan 06, 2007 at 01:08:18PM +0000, Pavel Machek wrote:
> Does this make Xen obsolete? I mean... we have xen patches in suse
> kernels, should we keep updating them, or just drop them in favour of
> KVM?

After all the Novell marketing hype you'll probably have to keep Xen ;-)
Apart from that, I suspect a paravirt kvm or lhype might be the better
hypervisor choice in the long term.

2007-01-08 08:22:48

by Avi Kivity

Subject: Re: [announce] [patch] KVM paravirtualization for Linux

Ingo Molnar wrote:



>> This is a little too good to be true. Were both runs with the same
>> KVM_NUM_MMU_PAGES?
>>
>
> yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk'
> run should have been labeled as: 'cr3 tree with paravirt turned off'.
> That's not completely 'trunk' but close to it, and all other changes
> (like elimination of unnecessary TLB flushes) are fairly applied to
> both.
>

Ok. I guess there's a switch/switch back pattern in there.

> i also did a run with far fewer MMU cache pages (256), and hackbench 1
> stayed the same, while the hackbench 5 numbers started fluctuating badly
> (i think that workload is thrashing the MMU cache).
>

Yes, 256 is too low.

>
>>> - u64 *pae_root;
>>> + u64 *pae_root[KVM_CR3_CACHE_SIZE];
>>>
>> hmm. wouldn't it be simpler to have pae_root always point at the
>> current root?
>>
>
> does that guarantee that it's available? I wanted to 'pin' the root
> itself this way, to make sure that if a guest switches to it via the
> cache, that it's truly available and a valid root. cr3 addresses are
> non-virtual so this is the only mechanism available to guarantee that
> the host-side memory truly contains a root pagetable.
>
>

I meant

u64 *pae_root_cache;
u64 *pae_root; /* == pae_root_cache + 4*cache_index */

so that the rest of the code need not worry about the cache.

>>> + vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
>>> + }
>>> }
>>> vcpu->mmu.root_hpa = INVALID_PAGE;
>>> }
>>>
>> You keep the page directories pinned here. [...]
>>
>
> yes.
>
>
>> [...] This can be a problem if a guest frees a page directory, and
>> then starts using it as a regular page. kvm sometimes chooses not to
>> emulate a write to a guest page table, but instead to zap it, which is
>> impossible when the page is freed. You need to either unpin the page
>> when that happens, or add a hypercall to let kvm know when a page
>> directory is freed.
>>
>
> the cache is zapped upon pagefaults anyway, so unpinning ought to be
> possible. Which one would you prefer?
>

It's zapped by the equivalent of mmu_free_roots(), right? That's
effectively unpinning it (by zeroing ->root_count).

However, kvm takes pagefaults even for silly things like setting (in
hardware) or clearing (in software) the dirty bit.

>
>
>>> +#define KVM_API_MAGIC 0x87654321
>>> +
>>>
>> <linux/kvm.h> is the vmm userspace interface. The guest/host
>> interface should probably go somewhere else.
>>
>
> yeah. kvm_para.h?
>
>

Sounds good.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2007-01-08 08:42:44

by Ingo Molnar

Subject: Re: [announce] [patch] KVM paravirtualization for Linux


* Avi Kivity <[email protected]> wrote:

> > the cache is zapped upon pagefaults anyway, so unpinning ought to be
> > possible. Which one would you prefer?
>
> It's zapped by the equivalent of mmu_free_roots(), right? That's
> effectively unpinning it (by zeroing ->root_count).

no, right now only the guest-visible cache is zapped - the roots are
zapped by natural rotation. I guess they should be zapped in
kvm_cr3_cache_clear() - but i wanted to keep that function invariant
with respect to the other MMU state, to make it easier to call it from
whatever mmu codepath.

> However, kvm takes pagefaults even for silly things like setting (in
> hardware) or clearing (in software) the dirty bit.

yeah. I think it also does some TLB flushes that are not needed. For
example in rmap_write_protect() we do this:

rmap_remove(vcpu, spte);
kvm_arch_ops->tlb_flush(vcpu);

but AFAICS rmap_write_protect() is only ever called if we write a new
cr3 - hence a TLB flush will happen anyway, because we do a
vmcs_writel(GUEST_CR3, new_cr3). Am i missing something? I didn't want to
remove it as part of the cr3 patches (to keep things simpler), but that
flush looks quite unnecessary to me. The patch below seems to work in
light testing.

Ingo

Index: linux/drivers/kvm/mmu.c
===================================================================
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -404,7 +404,11 @@ static void rmap_write_protect(struct kv
BUG_ON(!(*spte & PT_WRITABLE_MASK));
rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
rmap_remove(vcpu, spte);
- kvm_arch_ops->tlb_flush(vcpu);
+ /*
+ * While we removed a mapping there's no need to explicitly
+ * flush the TLB here, because this codepath only triggers
+ * if we write a new cr3 - which will flush the TLB anyway.
+ */
*spte &= ~(u64)PT_WRITABLE_MASK;
}
}

2007-01-08 09:08:35

by Avi Kivity

Subject: Re: [announce] [patch] KVM paravirtualization for Linux

Ingo Molnar wrote:
> * Avi Kivity <[email protected]> wrote:
>
>
>>> the cache is zapped upon pagefaults anyway, so unpinning ought to be
>>> possible. Which one would you prefer?
>>>
>> It's zapped by the equivalent of mmu_free_roots(), right? That's
>> effectively unpinning it (by zeroing ->root_count).
>>
>
> no, right now only the guest-visible cache is zapped - the roots are
> zapped by natural rotation. I guess they should be zapped in
> kvm_cr3_cache_clear() - but i wanted to keep that function an invariant
> to the other MMU state, to make it easier to call it from whatever mmu
> codepath.
>
>

Ok. Some mechanism is then needed to unpin the pages in case they are
recycled by the guest.

>> However, kvm takes pagefaults even for silly things like setting (in
>> hardware) or clearing (in software) the dirty bit.
>>
>
> yeah. I think it also does some TLB flushes that are not needed. For
> example in rmap_write_protect() we do this:
>
> rmap_remove(vcpu, spte);
> kvm_arch_ops->tlb_flush(vcpu);
>
> but AFAICS rmap_write_protect() is only ever called if we write a new
> cr3 - hence a TLB flush will happen anyway, because we do a
> vmcs_writel(GUEST_CR3, new_cr3). Am i missing something?

No, rmap_write_protect() is called whenever we shadow a new guest page
table (the mechanism used to maintain coherency is write faults on page
tables).

> I didn't want to
> remove it as part of the cr3 patches (to keep things simpler), but that
> flush looks quite unnecessary to me. The patch below seems to work in
> light testing.
>
> Ingo
>
> Index: linux/drivers/kvm/mmu.c
> ===================================================================
> --- linux.orig/drivers/kvm/mmu.c
> +++ linux/drivers/kvm/mmu.c
> @@ -404,7 +404,11 @@ static void rmap_write_protect(struct kv
> BUG_ON(!(*spte & PT_WRITABLE_MASK));
> rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
> rmap_remove(vcpu, spte);
> - kvm_arch_ops->tlb_flush(vcpu);
> + /*
> + * While we removed a mapping there's no need to explicitly
> + * flush the TLB here, because this codepath only triggers
> + * if we write a new cr3 - which will flush the TLB anyway.
> + */
> *spte &= ~(u64)PT_WRITABLE_MASK;
> }
> }
>

This will kill svm.

I don't think the tlb flushes are really necessary on today's Intel
machines, as they probably flush the tlb on each vmx switch. But AMD
keeps the TLB, and the flushes are critical there.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2007-01-08 09:21:29

by Ingo Molnar

Subject: Re: [announce] [patch] KVM paravirtualization for Linux


* Avi Kivity <[email protected]> wrote:

> >but AFAICS rmap_write_protect() is only ever called if we write a new
> >cr3 - hence a TLB flush will happen anyway, because we do a
> >vmcs_writel(GUEST_CR3, new_cr3). Am i missing something?
>
> No, rmap_write_protect() is called whenever we shadow a new guest page
> table (the mechanism used to maintain coherency is write faults on
> page tables).

hm. Where? I only see rmap_write_protect() called from
kvm_mmu_get_page(), which is only called from nonpaging_map() [but it
doesn't call the rmap function in that case, due to metaphysical], and
mmu_alloc_roots(). mmu_alloc_roots() is only called from context init
(init-only thing) or from paging_new_cr3().

ahh ... i missed paging_tmpl.h.

how about the patch below then?

furthermore, shouldn't we pass in the flush area size from the pagefault
handler? In most cases it would be 4096 bytes, which would mean an
invlpg is enough, not a full cr3 flush. Although ... i don't know how to
invlpg a guest-side TLB. On VMX it probably doesn't matter at all. On
SVM, is there a way (or a need) to flush a single TLB entry of another
context ID?

Ingo

Index: linux/drivers/kvm/mmu.c
===================================================================
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -378,7 +378,7 @@ static void rmap_remove(struct kvm_vcpu
}
}

-static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn, int flush_tlb)
{
struct kvm *kvm = vcpu->kvm;
struct page *page;
@@ -404,7 +404,13 @@ static void rmap_write_protect(struct kv
BUG_ON(!(*spte & PT_WRITABLE_MASK));
rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
rmap_remove(vcpu, spte);
- kvm_arch_ops->tlb_flush(vcpu);
+ /*
+ * While we removed a mapping there's no need to explicitly
+ * flush the TLB here if we write a new cr3. It is needed
+ * from the pagefault path though.
+ */
+ if (flush_tlb)
+ kvm_arch_ops->tlb_flush(vcpu);
*spte &= ~(u64)PT_WRITABLE_MASK;
}
}
@@ -561,7 +567,8 @@ static struct kvm_mmu_page *kvm_mmu_get_
gva_t gaddr,
unsigned level,
int metaphysical,
- u64 *parent_pte)
+ u64 *parent_pte,
+ int flush_tlb)
{
union kvm_mmu_page_role role;
unsigned index;
@@ -597,7 +604,7 @@ static struct kvm_mmu_page *kvm_mmu_get_
page->role = role;
hlist_add_head(&page->hash_link, bucket);
if (!metaphysical)
- rmap_write_protect(vcpu, gfn);
+ rmap_write_protect(vcpu, gfn, flush_tlb);
return page;
}

@@ -764,7 +771,7 @@ static int nonpaging_map(struct kvm_vcpu
>> PAGE_SHIFT;
new_table = kvm_mmu_get_page(vcpu, pseudo_gfn,
v, level - 1,
- 1, &table[index]);
+ 1, &table[index], 0);
if (!new_table) {
pgprintk("nonpaging_map: ENOMEM\n");
return -ENOMEM;
@@ -838,7 +845,7 @@ static void mmu_alloc_roots(struct kvm_v

ASSERT(!VALID_PAGE(root));
page = kvm_mmu_get_page(vcpu, root_gfn, 0,
- PT64_ROOT_LEVEL, 0, NULL);
+ PT64_ROOT_LEVEL, 0, NULL, 0);
root = page->page_hpa;
++page->root_count;
vcpu->mmu.root_hpa = root;
@@ -857,7 +864,7 @@ static void mmu_alloc_roots(struct kvm_v
root_gfn = 0;
page = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
PT32_ROOT_LEVEL, !is_paging(vcpu),
- NULL);
+ NULL, 0);
root = page->page_hpa;
++page->root_count;
vcpu->mmu.pae_root[j][i] = root | PT_PRESENT_MASK;
Index: linux/drivers/kvm/paging_tmpl.h
===================================================================
--- linux.orig/drivers/kvm/paging_tmpl.h
+++ linux/drivers/kvm/paging_tmpl.h
@@ -245,7 +245,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
table_gfn = walker->table_gfn[level - 2];
}
shadow_page = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1,
- metaphysical, shadow_ent);
+ metaphysical, shadow_ent, 1);
shadow_addr = shadow_page->page_hpa;
shadow_pte = shadow_addr | PT_PRESENT_MASK | PT_ACCESSED_MASK
| PT_WRITABLE_MASK | PT_USER_MASK;

2007-01-08 09:31:21

by Avi Kivity

Subject: Re: [announce] [patch] KVM paravirtualization for Linux

Ingo Molnar wrote:
> * Avi Kivity <[email protected]> wrote:
>
>
>>> but AFAICS rmap_write_protect() is only ever called if we write a new
>>> cr3 - hence a TLB flush will happen anyway, because we do a
>>> vmcs_writel(GUEST_CR3, new_cr3). Am i missing something?
>>>
>> No, rmap_write_protect() is called whenever we shadow a new guest page
>> table (the mechanism used to maintain coherency is write faults on
>> page tables).
>>
>
> hm. Where? I only see rmap_write_protect() called from
> kvm_mmu_get_page(), which is only called from nonpaging_map() [but it
> doesn't call the rmap function in that case, due to metaphysical], and
> mmu_alloc_roots(). mmu_alloc_roots() is only called from context init
> (init-only thing) or from paging_new_cr3().
>
> ahh ... i missed paging_tmpl.h.
>
> how about the patch below then?
>

Looks like a lot of complexity for very little gain. I'm not sure what
the vmwrite cost is, but it can't be that high compared to a vmexit.

I think there are two better options:

1. Find out if tlb_flush() can be implemented as a no-op on intel --
I'm fairly sure it can.
2. If not, implement tlb_flush() as a counter increment like on amd
(sketched below). Then, successive tlb flushes and context switches are
folded into one.
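(a sketch of option 2 - illustrative, with made-up field and function
names: tlb_flush() merely records the request, and at most one real
flush happens at the next vmentry, folding successive requests into
one:)

static void tlb_flush(struct kvm_vcpu *vcpu)
{
	vcpu->tlb_flush_pending = 1;	/* just record the request */
}

static void pre_vmentry(struct kvm_vcpu *vcpu)
{
	if (vcpu->tlb_flush_pending) {
		vcpu->tlb_flush_pending = 0;
		/* one real flush covers all folded requests -
		   on amd this would assign a fresh ASID: */
		flush_guest_tlb(vcpu);
	}
}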


> furthermore, shouldn't we pass in the flush area size from the pagefault
> handler? In most cases it would be 4096 bytes, which would mean an
> invlpg is enough, not a full cr3 flush. Although ... i don't know how to
> invlpg a guest-side TLB. On VMX it probably doesn't matter at all. On
> SVM, is there a way (or a need) to flush a single TLB entry of another
> context ID?
>

Yes (and yes), invlpga.

A complication is that a single shadow pte can be used to map multiple
guest virtual addresses (by mapping the page table using multiple pdes),
so the kvm_mmu_page object has to keep track of the page table gva, and
whether it is multiply mapped or not. I plan to do that later.

> Ingo
>
> [snip]


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2007-01-08 09:47:42

by Ingo Molnar

Subject: Re: [announce] [patch] KVM paravirtualization for Linux


* Avi Kivity <[email protected]> wrote:

> Looks like a lot of complexity for very little gain. I'm not sure
> what the vmwrite cost is, cut it can't be that high compared to
> vmexit.

while i disagree with characterising one extra parameter passed down
plus one extra branch as 'a lot of complexity', i agree that making the
flush a NOP on VMX is even better. (if it's possible without breaking
future hardware) I guess you are right that the only true cost here is
the vmwrite cost, and that there's no impact on CR3 flushing (because it
happens unconditionally). Forget about this patch.

Ingo

2007-01-08 18:21:36

by Christoph Lameter

Subject: Re: [announce] [patch] KVM paravirtualization for Linux

On Sat, 6 Jan 2007, Pavel Machek wrote:

> Does this make Xen obsolete? I mean... we have xen patches in suse
> kernels, should we keep updating them, or just drop them in favour of
> KVM?
> Pavel

Xen duplicates basic OS components like the scheduler etc. As a result
it's difficult to maintain and not well integrated with Linux.
KVM looks like a better approach.