As Paolo pointed out, this is the time to update the documentation, so this
patchset does it to make mmu.txt match the current mmu implementation
Xiao Guangrong (7):
KVM: MMU: update the documentation for reverse mapping of parent_pte
KVM: MMU: document clear_spte_count
KVM: MMU: document write_flooding_count
KVM: MMU: document mmio page fault
KVM: MMU: document fast page fault in
KVM: MMU: document fast invalidate all pages
KVM: MMU: document fast invalidate all mmio sptes
Documentation/virtual/kvm/mmu.txt | 86 +++++++++++++++++++++++++++++++++++----
arch/x86/include/asm/kvm_host.h | 15 +++++++
arch/x86/kvm/mmu.c | 7 ++--
3 files changed, 97 insertions(+), 11 deletions(-)
--
1.8.1.4
Document it to Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
---
Documentation/virtual/kvm/mmu.txt | 23 +++++++++++++++++++++++
arch/x86/include/asm/kvm_host.h | 5 +++++
2 files changed, 28 insertions(+)
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index b5ce7dd..f5c4de9 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -210,6 +210,10 @@ Shadow pages contain the following information:
A bitmap indicating which sptes in spt point (directly or indirectly) at
pages that may be unsynchronized. Used to quickly locate all unsychronized
pages reachable from a given page.
+ mmu_valid_gen:
+ It is the generation number of the page which cooperates with
+ kvm->arch.mmu_valid_gen to fast invalidate all pages.
+ (see "Fast invalidate all pages" below.)
clear_spte_count:
It is only used on 32bit host which helps us to detect whether updating the
64bit spte is complete so that we can avoid reading the truncated value out
@@ -373,6 +377,25 @@ causes its write_count to be incremented, thus preventing instantiation of
a large spte. The frames at the end of an unaligned memory slot have
artificially inflated ->write_counts so they can never be instantiated.
+Fast invalidate all pages
+===========
+For the large memory and large vcpus guests, zapping all pages is a challenge
+since they have large number of pages need to be zapped, walking and zapping
+these pages are really slow and it should hold mmu-lock which stops the memory
+access on all vcpus.
+
+To make it be more scalable, kvm maintains a global mmu valid
+generation-number which is stored in kvm->arch.mmu_valid_gen and every shadow
+page stores the current global generation-number into sp->mmu_valid_gen when
+it is created.
+
+When KVM need zap all shadow pages sptes, it just simply increases the global
+generation-number then reload root shadow pages on all vcpus. Vcpu will create
+a new shadow page table according to current kvm's generation-number. It
+ensures the old pages are not used any more. The invalid-gen pages
+(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped by using lock-break
+technique.
+
Further reading
===============
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5eb5382..c4f90f6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -222,6 +222,11 @@ struct kvm_mmu_page {
int root_count; /* Currently serving as active root */
unsigned int unsync_children;
unsigned long parent_ptes; /* Reverse mapping for parent_pte */
+
+ /*
+ * the generation number of the page which cooperates with
+ * kvm->arch.mmu_valid_gen to fast invalidate all pages.
+ */
unsigned long mmu_valid_gen;
DECLARE_BITMAP(unsync_child_bitmap, 512);
--
1.8.1.4
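As a small illustration of the mechanism added above, this is roughly what the
generation check amounts to (a simplified sketch, not the exact kernel code;
the helper names here are made up):

        /* Sketch: a shadow page is "obsolete" once the global generation
         * number has moved past the value recorded when the page was created. */
        static bool sp_is_obsolete(struct kvm *kvm, struct kvm_mmu_page *sp)
        {
                return sp->mmu_valid_gen != kvm->arch.mmu_valid_gen;
        }

        /* Sketch: invalidating all pages is then just a generation bump plus a
         * request for every vcpu to reload its root; obsolete pages are zapped
         * later, dropping the mmu-lock periodically. */
        static void invalidate_all_pages(struct kvm *kvm)
        {
                spin_lock(&kvm->mmu_lock);
                kvm->arch.mmu_valid_gen++;
                kvm_reload_remote_mmus(kvm);    /* force vcpus to reload roots */
                spin_unlock(&kvm->mmu_lock);
        }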
Document it to Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
---
Documentation/virtual/kvm/mmu.txt | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index f5c4de9..9b7cfb3 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -396,6 +396,31 @@ ensures the old pages are not used any more. The invalid-gen pages
(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped by using lock-break
technique.
+Fast invalidate all mmio sptes
+===========
+As mentioned in "Reaction to events" above, kvm will cache the mmio information
+to the last sptes so that we should zap all mmio sptes when the guest mmio info
+is changed. This will happen when a new memslot is added or the existing
+memslot is moved.
+
+Zapping mmio spte is also a scalability issue for the large memory and large
+vcpus guests since it needs to hold hot mmu-lock and walk all shadow pages to
+find all the mmio spte out.
+
+We fix this issue by using the similar way of "Fast invalidate all pages".
+The global mmio valid generation-number is stored in kvm->memslots.generation
+and every mmio spte stores the current global generation-number into his
+available bits when it is created
+
+The global mmio valid generation-number is increased whenever the gust memory
+info is changed. When guests do mmio access, kvm intercepts a MMIO #PF then it
+walks the shadow page table and get the mmio spte. If the generation-number on
+the spte does not equal the global generation-number, it will go to the normal
+#PF handler to update the mmio spte.
+
+Since 19 bits are used to store generation-number on mmio spte, we zap all pages
+when the number is round.
+
Further reading
===============
--
1.8.1.4
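To illustrate the check described above: an mmio spte stores the generation it
was created with in its spare bits, and a mismatch with the current memslot
generation sends the fault to the normal #PF path. A rough sketch (the helper
that extracts the stored bits is hypothetical):

        #define MMIO_SPTE_GEN_BITS      19      /* per the text above */

        static bool mmio_spte_is_stale(struct kvm *kvm, u64 spte)
        {
                unsigned int mask = (1u << MMIO_SPTE_GEN_BITS) - 1;
                unsigned int spte_gen = get_mmio_spte_generation(spte); /* hypothetical */
                unsigned int cur_gen = kvm_memslots(kvm)->generation & mask;

                return (spte_gen & mask) != cur_gen;
        }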
Document fast page fault to Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
---
Documentation/virtual/kvm/mmu.txt | 3 +++
1 file changed, 3 insertions(+)
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 4fb442b..b5ce7dd 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -275,6 +275,9 @@ Handling a page fault is performed as follows:
information is stored and cache the information to vcpu->arch.mmio_gva,
vcpu->arch.access and vcpu->arch.mmio_gfn then call the emulator to emulate
the instruction who will get the benefit from the cached mmio info
+ - If both P bit and R/W bit of error code are set, there has a chance to
+ fast fix the page fault, please see the description of "Fast page fault" in
+ Documentation/virtual/kvm/locking.txt
- if needed, walk the guest page tables to determine the guest translation
(gva->gpa or ngpa->gpa)
- if permissions are insufficient, reflect the fault back to the guest
--
1.8.1.4
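For reference, the P/(R/W) test mentioned above boils down to a check on the
x86 #PF error code, roughly like this (a sketch of the precondition only; the
fast fix itself is described in locking.txt):

        #define PFERR_PRESENT_MASK      (1U << 0)       /* P: spte was present */
        #define PFERR_WRITE_MASK        (1U << 1)       /* W: fault caused by a write */
        #define PFERR_RSVD_MASK         (1U << 3)       /* RSVD: reserved bits set (MMIO) */

        /* Sketch: only write faults on present sptes are candidates for being
         * fixed without taking the mmu-lock. */
        static bool fault_may_be_fast(u32 error_code)
        {
                if (error_code & PFERR_RSVD_MASK)
                        return false;

                return (error_code & (PFERR_PRESENT_MASK | PFERR_WRITE_MASK)) ==
                       (PFERR_PRESENT_MASK | PFERR_WRITE_MASK);
        }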
Update the document to match the current reverse mapping of
parent_pte
Signed-off-by: Xiao Guangrong <[email protected]>
---
Documentation/virtual/kvm/mmu.txt | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 43fcb76..869abcc 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -191,12 +191,12 @@ Shadow pages contain the following information:
A counter keeping track of how many hardware registers (guest cr3 or
pdptrs) are now pointing at the page. While this counter is nonzero, the
page cannot be destroyed. See role.invalid.
- multimapped:
- Whether there exist multiple sptes pointing at this page.
- parent_pte/parent_ptes:
- If multimapped is zero, parent_pte points at the single spte that points at
- this page's spt. Otherwise, parent_ptes points at a data structure
- with a list of parent_ptes.
+ parent_ptes:
+ The reverse mapping for the pte/ptes pointing at this page's spt. If
+ parent_ptes bit 0 is zero, only one spte points at this pages and
+ parent_ptes points at this single spte, otherwise, there exists multiple
+ sptes pointing at this page and (parent_ptes & ~0x1) points at a data
+ structure with a list of parent_ptes.
unsync:
If true, then the translations in this page may not match the guest's
translation. This is equivalent to the state of the tlb when a pte is
--
1.8.1.4
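The bit-0 encoding above is a tagged pointer; decoding it looks roughly like
this (names are illustrative, not necessarily the exact KVM helpers):

        struct pte_list_desc;                   /* holds several parent sptes */

        /* Sketch: returns the single parent spte, or NULL if there are none
         * or if parent_ptes (bit 0 set) points at a list descriptor instead. */
        static u64 *single_parent_spte(unsigned long parent_ptes)
        {
                if (!parent_ptes || (parent_ptes & 1))
                        return NULL;
                return (u64 *)parent_ptes;
        }

        /* Sketch: returns the descriptor list when there are multiple parents. */
        static struct pte_list_desc *parent_spte_list(unsigned long parent_ptes)
        {
                if (!(parent_ptes & 1))
                        return NULL;
                return (struct pte_list_desc *)(parent_ptes & ~1ul);
        }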
Document it to Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
---
Documentation/virtual/kvm/mmu.txt | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 5a6b2e2..4fb442b 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -270,14 +270,21 @@ This is the most complicated event. The cause of a page fault can be:
Handling a page fault is performed as follows:
+ - if the RSV bit of the error code is set, the page fault is caused by guest
+ accessing MMIO, walk shadow page table to get the last spte where the mmio
+ information is stored and cache the information to vcpu->arch.mmio_gva,
+ vcpu->arch.access and vcpu->arch.mmio_gfn then call the emulator to emulate
+ the instruction who will get the benefit from the cached mmio info
- if needed, walk the guest page tables to determine the guest translation
(gva->gpa or ngpa->gpa)
- if permissions are insufficient, reflect the fault back to the guest
- determine the host page
- - if this is an mmio request, there is no host page; call the emulator
- to emulate the instruction instead
+ - if this is an mmio request, there is no host page; cache the info to
+ vcpu->arch.mmio_gva, vcpu->arch.access and vcpu->arch.mmio_gfn
- walk the shadow page table to find the spte for the translation,
instantiating missing intermediate page tables as necessary
+ - If this is an mmio request, cache the mmio info to the spte and set some
+ reserved bits on the spte
- try to unsynchronize the page
- if successful, we can let the guest continue and modify the gpte
- emulate the instruction
--
1.8.1.4
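The last two steps above amount to building a not-present spte whose reserved
bits are set and whose remaining bits encode the gfn and access; something
along these lines (a sketch only, see kvm_mmu_set_mmio_spte_mask and its
callers for the real layout):

        /* Sketch: mmio_mask has reserved bits set and the present bit clear,
         * so a guest access through this spte faults with the RSV error code
         * and KVM can recover gfn/access from the spte itself. */
        static u64 make_mmio_spte(u64 mmio_mask, gfn_t gfn, unsigned access)
        {
                return mmio_mask | access | ((u64)gfn << PAGE_SHIFT);
        }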
Document write_flooding_count to Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
---
Documentation/virtual/kvm/mmu.txt | 8 ++++++++
arch/x86/include/asm/kvm_host.h | 5 +++++
2 files changed, 13 insertions(+)
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index ce6df51..5a6b2e2 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -214,6 +214,14 @@ Shadow pages contain the following information:
It is only used on 32bit host which helps us to detect whether updating the
64bit spte is complete so that we can avoid reading the truncated value out
of mmu-lock.
+ write_flooding_count:
+ In order to sync the page table between guest and host, the page sometimes
+ needs to be write-protected (see "Synchronized and unsynchronized pages"
+ below), any write to the page can cause write emulation. If the emulation
+ on the page is too frequent we'd better unmap the page to avoid the
+ future emulation. write_flooding_count aims at this optimization which is
+ increased when the page needs to be write emulated and cleared when the
+ page is actually used.
Reverse map
===========
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1dac2c1..5eb5382 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -234,6 +234,11 @@ struct kvm_mmu_page {
int clear_spte_count;
#endif
+ /*
+ * Increased when the page needs to be write emulated, cleared
+ * when the page is actually used as page table to avoid frequent
+ * emulation on the page.
+ */
int write_flooding_count;
};
--
1.8.1.4
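In code, the idea is just a counter bumped on every emulated write to the page
and reset whenever the page is actually used for translation; a sketch (the
threshold is illustrative):

        /* Sketch: called for each emulated write hitting this shadow page;
         * returns true when the page should be unmapped to stop the flood. */
        static bool write_flood_detected(struct kvm_mmu_page *sp)
        {
                return ++sp->write_flooding_count >= 3; /* illustrative threshold */
        }

        /* Sketch: called when the shadow walk actually passes through the page. */
        static void write_flood_reset(struct kvm_mmu_page *sp)
        {
                sp->write_flooding_count = 0;
        }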
Document it to Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
---
Documentation/virtual/kvm/mmu.txt | 4 ++++
arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/kvm/mmu.c | 7 ++++---
3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 869abcc..ce6df51 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -210,6 +210,10 @@ Shadow pages contain the following information:
A bitmap indicating which sptes in spt point (directly or indirectly) at
pages that may be unsynchronized. Used to quickly locate all unsychronized
pages reachable from a given page.
+ clear_spte_count:
+ It is only used on 32bit host which helps us to detect whether updating the
+ 64bit spte is complete so that we can avoid reading the truncated value out
+ of mmu-lock.
Reverse map
===========
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 966f265..1dac2c1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -226,6 +226,11 @@ struct kvm_mmu_page {
DECLARE_BITMAP(unsync_child_bitmap, 512);
#ifdef CONFIG_X86_32
+ /*
+ * Count after the page's spte has been cleared to avoid
+ * the truncated value is read out of mmu-lock.
+ * please see the comments in __get_spte_lockless().
+ */
int clear_spte_count;
#endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c87b19d..77d516c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -464,9 +464,10 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
/*
* The idea using the light way get the spte on x86_32 guest is from
* gup_get_pte(arch/x86/mm/gup.c).
- * The difference is we can not catch the spte tlb flush if we leave
- * guest mode, so we emulate it by increase clear_spte_count when spte
- * is cleared.
+ * The difference is we can not immediately catch the spte tlb since
+ * kvm may collapse tlb flush some times. Please see kvm_set_pte_rmapp.
+ *
+ * We emulate it by increase clear_spte_count when spte is cleared.
*/
static u64 __get_spte_lockless(u64 *sptep)
{
--
1.8.1.4
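The mechanism can be sketched as follows: read the two 32-bit halves with the
count sampled before and re-checked after, and retry if anything moved. This
is a simplified version in the spirit of __get_spte_lockless(), not a verbatim
copy of it:

        union split_spte {
                struct {
                        u32 spte_low;
                        u32 spte_high;
                };
                u64 spte;
        };

        /* Sketch: read a 64-bit spte in two halves without mmu-lock, retrying
         * if a clear happened in between (detected via clear_spte_count). */
        static u64 get_spte_lockless_sketch(struct kvm_mmu_page *sp, u64 *sptep)
        {
                union split_spte *orig = (union split_spte *)sptep;
                union split_spte spte;
                int count;

        retry:
                count = sp->clear_spte_count;
                smp_rmb();

                spte.spte_low = orig->spte_low;
                smp_rmb();

                spte.spte_high = orig->spte_high;
                smp_rmb();

                if (spte.spte_low != orig->spte_low ||
                    count != sp->clear_spte_count)
                        goto retry;

                return spte.spte;
        }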
Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
> Update the document to match the current reverse mapping of
> parent_pte
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> Documentation/virtual/kvm/mmu.txt | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index 43fcb76..869abcc 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -191,12 +191,12 @@ Shadow pages contain the following information:
> A counter keeping track of how many hardware registers (guest cr3 or
> pdptrs) are now pointing at the page. While this counter is nonzero, the
> page cannot be destroyed. See role.invalid.
> - multimapped:
> - Whether there exist multiple sptes pointing at this page.
> - parent_pte/parent_ptes:
> - If multimapped is zero, parent_pte points at the single spte that points at
> - this page's spt. Otherwise, parent_ptes points at a data structure
> - with a list of parent_ptes.
> + parent_ptes:
> + The reverse mapping for the pte/ptes pointing at this page's spt. If
> + parent_ptes bit 0 is zero, only one spte points at this pages and
> + parent_ptes points at this single spte, otherwise, there exists multiple
> + sptes pointing at this page and (parent_ptes & ~0x1) points at a data
> + structure with a list of parent_ptes.
Applying with this small change:
The reverse mapping for the pte/ptes pointing at this page's spt. If
bit 0 is zero, only one spte points at this page and parent_ptes
points at this single spte. If bit 0 is set, there exist multiple
sptes pointing at this page and (parent_ptes & ~0x1) points at a data
structure with a list of parent_ptes.
Paolo
> unsync:
> If true, then the translations in this page may not match the guest's
> translation. This is equivalent to the state of the tlb when a pte is
>
Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
> Document it to Documentation/virtual/kvm/mmu.txt
While reviewing the docs, I looked at the code.
Why can't this happen?
CPU 1: __get_spte_lockless CPU 2: __update_clear_spte_slow
------------------------------------------------------------------------------
write low
read count
read low
read high
write high
check low and count
update count
The check passes, but CPU 1 read a "torn" SPTE.
It seems like this is the same reason why seqlocks do two version
updates, one before and one after, and make the reader check "version &
~1". But maybe I'm wrong.
Paolo
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> Documentation/virtual/kvm/mmu.txt | 4 ++++
> arch/x86/include/asm/kvm_host.h | 5 +++++
> arch/x86/kvm/mmu.c | 7 ++++---
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index 869abcc..ce6df51 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -210,6 +210,10 @@ Shadow pages contain the following information:
> A bitmap indicating which sptes in spt point (directly or indirectly) at
> pages that may be unsynchronized. Used to quickly locate all unsychronized
> pages reachable from a given page.
> + clear_spte_count:
> + It is only used on 32bit host which helps us to detect whether updating the
> + 64bit spte is complete so that we can avoid reading the truncated value out
> + of mmu-lock.
>
> Reverse map
> ===========
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 966f265..1dac2c1 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -226,6 +226,11 @@ struct kvm_mmu_page {
> DECLARE_BITMAP(unsync_child_bitmap, 512);
>
> #ifdef CONFIG_X86_32
> + /*
> + * Count after the page's spte has been cleared to avoid
> + * the truncated value is read out of mmu-lock.
> + * please see the comments in __get_spte_lockless().
> + */
> int clear_spte_count;
> #endif
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index c87b19d..77d516c 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -464,9 +464,10 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
> /*
> * The idea using the light way get the spte on x86_32 guest is from
> * gup_get_pte(arch/x86/mm/gup.c).
> - * The difference is we can not catch the spte tlb flush if we leave
> - * guest mode, so we emulate it by increase clear_spte_count when spte
> - * is cleared.
> + * The difference is we can not immediately catch the spte tlb since
> + * kvm may collapse tlb flush some times. Please see kvm_set_pte_rmapp.
> + *
> + * We emulate it by increase clear_spte_count when spte is cleared.
> */
> static u64 __get_spte_lockless(u64 *sptep)
> {
>
Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
> Document it to Documentation/virtual/kvm/mmu.txt
Edits inline, please ack.
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> Documentation/virtual/kvm/mmu.txt | 4 ++++
> arch/x86/include/asm/kvm_host.h | 5 +++++
> arch/x86/kvm/mmu.c | 7 ++++---
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index 869abcc..ce6df51 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -210,6 +210,10 @@ Shadow pages contain the following information:
> A bitmap indicating which sptes in spt point (directly or indirectly) at
> pages that may be unsynchronized. Used to quickly locate all unsychronized
> pages reachable from a given page.
> + clear_spte_count:
> + It is only used on 32bit host which helps us to detect whether updating the
> + 64bit spte is complete so that we can avoid reading the truncated value out
> + of mmu-lock.
+ Only present on 32-bit hosts, where a 64-bit spte cannot be written
+ atomically. The reader uses this while running out of the MMU lock
+ to detect in-progress updates and retry them until the writer has
+ finished the write.
> Reverse map
> ===========
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 966f265..1dac2c1 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -226,6 +226,11 @@ struct kvm_mmu_page {
> DECLARE_BITMAP(unsync_child_bitmap, 512);
>
> #ifdef CONFIG_X86_32
> + /*
> + * Count after the page's spte has been cleared to avoid
> + * the truncated value is read out of mmu-lock.
> + * please see the comments in __get_spte_lockless().
+ * Used out of the mmu-lock to avoid reading spte values while an
+ * update is in progress; see the comments in __get_spte_lockless().
> + */
> int clear_spte_count;
> #endif
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index c87b19d..77d516c 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -464,9 +464,10 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
> /*
> * The idea using the light way get the spte on x86_32 guest is from
> * gup_get_pte(arch/x86/mm/gup.c).
> - * The difference is we can not catch the spte tlb flush if we leave
> - * guest mode, so we emulate it by increase clear_spte_count when spte
> - * is cleared.
> + * The difference is we can not immediately catch the spte tlb since
> + * kvm may collapse tlb flush some times. Please see kvm_set_pte_rmapp.
> + *
> + * We emulate it by increase clear_spte_count when spte is cleared.
+ * An spte tlb flush may be pending, because kvm_set_pte_rmapp
+ * coalesces them and we are running out of the MMU lock. Therefore
+ * we need to protect against in-progress updates of the spte, which
+ * is done using clear_spte_count.
> */
> static u64 __get_spte_lockless(u64 *sptep)
> {
>
On 06/19/2013 07:32 PM, Paolo Bonzini wrote:
> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>> Document it to Documentation/virtual/kvm/mmu.txt
>
> While reviewing the docs, I looked at the code.
>
> Why can't this happen?
>
> CPU 1: __get_spte_lockless CPU 2: __update_clear_spte_slow
> ------------------------------------------------------------------------------
> write low
> read count
> read low
> read high
> write high
> check low and count
> update count
>
> The check passes, but CPU 1 read a "torn" SPTE.
In this case, CPU 1 will read the "new low bits" and the "old high bits", right?
The P bit in the low bits is cleared when __update_clear_spte_slow runs, i.e. the spte
is not present, so the whole value is ignored.
Il 19/06/2013 13:53, Xiao Guangrong ha scritto:
> On 06/19/2013 07:32 PM, Paolo Bonzini wrote:
>> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>>> Document it to Documentation/virtual/kvm/mmu.txt
>>
>> While reviewing the docs, I looked at the code.
>>
>> Why can't this happen?
>>
>> CPU 1: __get_spte_lockless CPU 2: __update_clear_spte_slow
>> ------------------------------------------------------------------------------
>> write low
>> read count
>> read low
>> read high
>> write high
>> check low and count
>> update count
>>
>> The check passes, but CPU 1 read a "torn" SPTE.
>
> In this case, CPU 1 will read the "new low bits" and the "old high bits", right?
> The P bit in the low bits is cleared when __update_clear_spte_slow runs, i.e. the spte
> is not present, so the whole value is ignored.
Indeed that's what the comment says, too. But then why do we need the
count at all? The spte that is read is exactly the same before and
after the count is updated.
Paolo
Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
> Document write_flooding_count to Documentation/virtual/kvm/mmu.txt
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> Documentation/virtual/kvm/mmu.txt | 8 ++++++++
> arch/x86/include/asm/kvm_host.h | 5 +++++
> 2 files changed, 13 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index ce6df51..5a6b2e2 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -214,6 +214,14 @@ Shadow pages contain the following information:
> It is only used on 32bit host which helps us to detect whether updating the
> 64bit spte is complete so that we can avoid reading the truncated value out
> of mmu-lock.
> + write_flooding_count:
> + In order to sync the page table between guest and host, the page sometimes
> + needs to be write-protected (see "Synchronized and unsynchronized pages"
> + below), any write to the page can cause write emulation. If the emulation
> + on the page is too frequent we'd better unmap the page to avoid the
> + future emulation. write_flooding_count aims at this optimization which is
> + increased when the page needs to be write emulated and cleared when the
> + page is actually used.
A guest may write to a page table many times, causing a lot of
emulations if the page needs to be write-protected (see "Synchronized
and unsynchronized pages" below). Leaf pages can be unsynchronized
so that they do not trigger frequent emulation, but this is not
possible for non-leafs. This field counts the number of emulations
since the last time the page table was actually used; if emulation
is triggered too frequently on this page, KVM will unmap the page
to avoid emulation in the future.
> Reverse map
> ===========
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1dac2c1..5eb5382 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -234,6 +234,11 @@ struct kvm_mmu_page {
> int clear_spte_count;
> #endif
>
> + /*
> + * Increased when the page needs to be write emulated, cleared
> + * when the page is actually used as page table to avoid frequent
> + * emulation on the page.
> + */
/* Number of writes since the last time traversal visited this page. */
wdyt, better or worse?
Paolo
> int write_flooding_count;
> };
>
>
Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
> Document it to Documentation/virtual/kvm/mmu.txt
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> Documentation/virtual/kvm/mmu.txt | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index 5a6b2e2..4fb442b 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -270,14 +270,21 @@ This is the most complicated event. The cause of a page fault can be:
>
> Handling a page fault is performed as follows:
>
> + - if the RSV bit of the error code is set, the page fault is caused by guest
> + accessing MMIO, walk shadow page table to get the last spte where the mmio
> + information is stored and cache the information to vcpu->arch.mmio_gva,
> + vcpu->arch.access and vcpu->arch.mmio_gfn then call the emulator to emulate
> + the instruction who will get the benefit from the cached mmio info
+ - if the RSV bit of the error code is set, the page fault is caused by guest
+ accessing MMIO and cached MMIO information is available.
+ - walk shadow page table
+ - cache the information to vcpu->arch.mmio_gva, vcpu->arch.access and
+ vcpu->arch.mmio_gfn, and call the emulator
> - if needed, walk the guest page tables to determine the guest translation
> (gva->gpa or ngpa->gpa)
> - if permissions are insufficient, reflect the fault back to the guest
> - determine the host page
> - - if this is an mmio request, there is no host page; call the emulator
> - to emulate the instruction instead
> + - if this is an mmio request, there is no host page; cache the info to
> + vcpu->arch.mmio_gva, vcpu->arch.access and vcpu->arch.mmio_gfn
> - walk the shadow page table to find the spte for the translation,
> instantiating missing intermediate page tables as necessary
> + - If this is an mmio request, cache the mmio info to the spte and set some
> + reserved bits on the spte
Added "(see callers of kvm_mmu_set_mmio_spte_mask)". Not really related, but
just came to my mind: perhaps we can have a section on A/D bits too.
Paolo
> - try to unsynchronize the page
> - if successful, we can let the guest continue and modify the gpte
> - emulate the instruction
>
Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
> Document fast page fault to Documentation/virtual/kvm/mmu.txt
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> Documentation/virtual/kvm/mmu.txt | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index 4fb442b..b5ce7dd 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -275,6 +275,9 @@ Handling a page fault is performed as follows:
> information is stored and cache the information to vcpu->arch.mmio_gva,
> vcpu->arch.access and vcpu->arch.mmio_gfn then call the emulator to emulate
> the instruction who will get the benefit from the cached mmio info
> + - If both P bit and R/W bit of error code are set, there has a chance to
> + fast fix the page fault, please see the description of "Fast page fault" in
> + Documentation/virtual/kvm/locking.txt
- If both P bit and R/W bit of error code are set, this could possibly
be handled as a "fast page fault" (fixed without taking the MMU lock). See
the description in Documentation/virtual/kvm/locking.txt.
Paolo
> - if needed, walk the guest page tables to determine the guest translation
> (gva->gpa or ngpa->gpa)
> - if permissions are insufficient, reflect the fault back to the guest
>
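For completeness, the "fixed without taking the MMU lock" part is a cmpxchg on
the spte, which only succeeds if the spte did not change under us; a sketch
(the details and the full argument live in locking.txt):

        /* Sketch: restore the W bit out of mmu-lock; the cmpxchg fails if any
         * other bit of the spte changed in the meantime. */
        static bool fast_fix_writable(u64 *sptep, u64 old_spte)
        {
                return cmpxchg64(sptep, old_spte, old_spte | PT_WRITABLE_MASK) ==
                       old_spte;
        }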
On 06/19/2013 07:55 PM, Paolo Bonzini wrote:
> Il 19/06/2013 13:53, Xiao Guangrong ha scritto:
>> On 06/19/2013 07:32 PM, Paolo Bonzini wrote:
>>> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>>>> Document it to Documentation/virtual/kvm/mmu.txt
>>>
>>> While reviewing the docs, I looked at the code.
>>>
>>> Why can't this happen?
>>>
>>> CPU 1: __get_spte_lockless CPU 2: __update_clear_spte_slow
>>> ------------------------------------------------------------------------------
>>> write low
>>> read count
>>> read low
>>> read high
>>> write high
>>> check low and count
>>> update count
>>>
>>> The check passes, but CPU 1 read a "torn" SPTE.
>>
>> In this case, CPU 1 will read the "new low bits" and the "old high bits", right?
>> The P bit in the low bits is cleared when __update_clear_spte_slow runs, i.e. the spte
>> is not present, so the whole value is ignored.
>
> Indeed that's what the comment says, too. But then why do we need the
> count at all? The spte that is read is exactly the same before and
> after the count is updated.
It is needed to detect the spte being cleared and marked present again, so that the
lockless side does not miss a present-to-present change; otherwise, we can get this:
Say spte = 0xa11110001 (high 32bits = 0xa, low 32bit = 0x11110001)
CPU 1: __get_spte_lockless CPU 2: __update_clear_spte_slow
----------------------------------------------------------------------
read low: low= 0x11110001
clear the spte, then spte = 0x0ull
read high: high = 0x0
set spte to 0xb11110001 (high 32bits = 0xb,
low 32bit = 0x11110001)
read low: 0x11110001 and see
it is not changed.
In this case, CPU 1 sees that the low bits have not changed, so it tries to access the memory at:
0x11110000.
BTW, we are using the TLB to protect the lockless walking; the count can be dropped after
improving kvm_set_pte_rmapp, which is the only place that changes a spte from present to
present without a TLB flush.
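Put differently, the writer bumps the count on the present->non-present step, so
the reader's final check catches the "same low word again" case above. A sketch
of that side, using the same split_spte layout as in the earlier reader sketch:

        /* Sketch: clearing a present spte on a 32-bit host. */
        static void clear_spte_slow_sketch(struct kvm_mmu_page *sp, u64 *sptep)
        {
                union split_spte *orig = (union split_spte *)sptep;

                /* Drop the present bit first: a lockless reader that sees
                 * mixed halves also sees a non-present spte and ignores it. */
                orig->spte_low = 0;
                smp_wmb();

                orig->spte_high = 0;
                smp_wmb();

                /* This is the transition the reader keys on: a later
                 * non-present->present update of the same low word cannot be
                 * mistaken for "nothing changed", because the count moved. */
                sp->clear_spte_count++;
        }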
Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
> Document it to Documentation/virtual/kvm/mmu.txt
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> Documentation/virtual/kvm/mmu.txt | 23 +++++++++++++++++++++++
> arch/x86/include/asm/kvm_host.h | 5 +++++
> 2 files changed, 28 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index b5ce7dd..f5c4de9 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -210,6 +210,10 @@ Shadow pages contain the following information:
> A bitmap indicating which sptes in spt point (directly or indirectly) at
> pages that may be unsynchronized. Used to quickly locate all unsychronized
> pages reachable from a given page.
> + mmu_valid_gen:
> + It is the generation number of the page which cooperates with
> + kvm->arch.mmu_valid_gen to fast invalidate all pages.
> + (see "Fast invalidate all pages" below.)
+ mmu_valid_gen:
+ Generation number of the page. It is compared with kvm->arch.mmu_valid_gen
+ during hash table lookup, and used to skip invalidated shadow pages (see
+ "Zapping all pages" below.)
> clear_spte_count:
> It is only used on 32bit host which helps us to detect whether updating the
> 64bit spte is complete so that we can avoid reading the truncated value out
> @@ -373,6 +377,25 @@ causes its write_count to be incremented, thus preventing instantiation of
> a large spte. The frames at the end of an unaligned memory slot have
> artificially inflated ->write_counts so they can never be instantiated.
>
> +Fast invalidate all pages
> +===========
> +For the large memory and large vcpus guests, zapping all pages is a challenge
> +since they have large number of pages need to be zapped, walking and zapping
> +these pages are really slow and it should hold mmu-lock which stops the memory
> +access on all vcpus.
> +
> +To make it be more scalable, kvm maintains a global mmu valid
> +generation-number which is stored in kvm->arch.mmu_valid_gen and every shadow
> +page stores the current global generation-number into sp->mmu_valid_gen when
> +it is created.
> +
> +When KVM need zap all shadow pages sptes, it just simply increases the global
> +generation-number then reload root shadow pages on all vcpus. Vcpu will create
> +a new shadow page table according to current kvm's generation-number. It
> +ensures the old pages are not used any more. The invalid-gen pages
> +(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped by using lock-break
> +technique.
> +
+Zapping all pages (page generation count)
+=========================================
+
+For the large memory guests, walking and zapping all pages is really slow
+(because there are a lot of pages), and also blocks memory accesses of
+all VCPUs because it needs to hold the MMU lock.
+
+To make it be more scalable, kvm maintains a global generation number
+which is stored in kvm->arch.mmu_valid_gen. Every shadow page stores
+the current global generation-number into sp->mmu_valid_gen when it
+is created. Pages with a mismatching generation number are "obsolete".
+
+When KVM needs to zap all shadow page sptes, it simply increases the global
+generation-number and then reloads the root shadow pages on all vcpus. As the VCPUs
+create new shadow page tables, the old pages are not used because of the
+mismatching generation number.
+
+KVM then walks through all pages and zaps obsolete pages. While the zap
+operation needs to take the MMU lock, the lock can be released periodically
+so that the VCPUs can make progress.
+
> Further reading
> ===============
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5eb5382..c4f90f6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -222,6 +222,11 @@ struct kvm_mmu_page {
> int root_count; /* Currently serving as active root */
> unsigned int unsync_children;
> unsigned long parent_ptes; /* Reverse mapping for parent_pte */
> +
> + /*
> + * the generation number of the page which cooperates with
> + * kvm->arch.mmu_valid_gen to fast invalidate all pages.
> + */
+ /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
Paolo
> unsigned long mmu_valid_gen;
> DECLARE_BITMAP(unsync_child_bitmap, 512);
>
>
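A sketch of the lock-break walk mentioned at the end (hypothetical loop; the
real code walks kvm->arch.active_mmu_pages and is more careful about restarting
the walk after the lock has been dropped):

        /* Sketch: zap obsolete pages while letting VCPUs grab the MMU lock. */
        static void zap_obsolete_pages_sketch(struct kvm *kvm)
        {
                struct kvm_mmu_page *sp, *tmp;
                LIST_HEAD(invalid_list);

                spin_lock(&kvm->mmu_lock);
                list_for_each_entry_safe(sp, tmp, &kvm->arch.active_mmu_pages, link) {
                        if (sp->mmu_valid_gen == kvm->arch.mmu_valid_gen)
                                continue;       /* still valid, skip */

                        kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);

                        /* Lock-break: give waiting VCPUs a chance to run. */
                        if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
                                kvm_mmu_commit_zap_page(kvm, &invalid_list);
                                cond_resched_lock(&kvm->mmu_lock);
                                /* a real implementation must restart the walk
                                 * here, since the list may have changed */
                        }
                }
                kvm_mmu_commit_zap_page(kvm, &invalid_list);
                spin_unlock(&kvm->mmu_lock);
        }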
Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
> Document it to Documentation/virtual/kvm/mmu.txt
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> Documentation/virtual/kvm/mmu.txt | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index f5c4de9..9b7cfb3 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -396,6 +396,31 @@ ensures the old pages are not used any more. The invalid-gen pages
> (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped by using lock-break
> technique.
>
> +Fast invalidate all mmio sptes
> +===========
> +As mentioned in "Reaction to events" above, kvm will cache the mmio information
> +to the last sptes so that we should zap all mmio sptes when the guest mmio info
> +is changed. This will happen when a new memslot is added or the existing
> +memslot is moved.
> +
> +Zapping mmio spte is also a scalability issue for the large memory and large
> +vcpus guests since it needs to hold hot mmu-lock and walk all shadow pages to
> +find all the mmio spte out.
> +
> +We fix this issue by using the similar way of "Fast invalidate all pages".
> +The global mmio valid generation-number is stored in kvm->memslots.generation
> +and every mmio spte stores the current global generation-number into his
> +available bits when it is created
> +
> +The global mmio valid generation-number is increased whenever the gust memory
> +info is changed. When guests do mmio access, kvm intercepts a MMIO #PF then it
> +walks the shadow page table and get the mmio spte. If the generation-number on
> +the spte does not equal the global generation-number, it will go to the normal
> +#PF handler to update the mmio spte.
> +
> +Since 19 bits are used to store generation-number on mmio spte, we zap all pages
> +when the number is round.
As mentioned in "Reaction to events" above, kvm will cache MMIO
information in leaf sptes. When a new memslot is added or an existing
memslot is changed, this information may become stale and needs to be
invalidated. This also needs to hold the MMU lock while walking all
shadow pages, and is made more scalable with a similar technique.
MMIO sptes have a few spare bits, which are used to store a
generation number. The global generation number is stored in
kvm_memslots(kvm)->generation, and increased whenever guest memory info
changes. This generation number is distinct from the one described in
the previous section.
When KVM finds an MMIO spte, it checks the generation number of the spte.
If the generation number of the spte does not equal the global generation
number, it will ignore the cached MMIO information and handle the page
fault through the slow path.
Since only 19 bits are used to store generation-number on mmio spte, all
pages are zapped when there is an overflow.
I'm also adding this in the page fault section:
- check for valid generation number in the spte (see "Fast invalidation of
MMIO sptes" below)
> Further reading
> ===============
>
>
On 06/19/2013 07:40 PM, Paolo Bonzini wrote:
> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>> Document it to Documentation/virtual/kvm/mmu.txt
>
> Edits inline, please ack.
Good to me.
Thank you very much for bearing my poor English.
Il 19/06/2013 14:25, Xiao Guangrong ha scritto:
> On 06/19/2013 07:55 PM, Paolo Bonzini wrote:
>> Il 19/06/2013 13:53, Xiao Guangrong ha scritto:
>>> On 06/19/2013 07:32 PM, Paolo Bonzini wrote:
>>>> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>>>>> Document it to Documentation/virtual/kvm/mmu.txt
>>>>
>>>> While reviewing the docs, I looked at the code.
>>>>
>>>> Why can't this happen?
>>>>
>>>> CPU 1: __get_spte_lockless CPU 2: __update_clear_spte_slow
>>>> ------------------------------------------------------------------------------
>>>> write low
>>>> read count
>>>> read low
>>>> read high
>>>> write high
>>>> check low and count
>>>> update count
>>>>
>>>> The check passes, but CPU 1 read a "torn" SPTE.
>>>
>>> In this case, CPU 1 will read the "new low bits" and the "old high bits", right?
>>> The P bit in the low bits is cleared when __update_clear_spte_slow runs, i.e. the spte
>>> is not present, so the whole value is ignored.
>>
>> Indeed that's what the comment says, too. But then why do we need the
>> count at all? The spte that is read is exactly the same before and
>> after the count is updated.
>
> It is needed to detect the spte being cleared and marked present again, so that the
> lockless side does not miss a present-to-present change; otherwise, we can get this:
>
> Say spte = 0xa11110001 (high 32bits = 0xa, low 32bit = 0x11110001)
>
> CPU 1: __get_spte_lockless CPU 2: __update_clear_spte_slow
> ----------------------------------------------------------------------
> read low: low= 0x11110001
> clear the spte, then spte = 0x0ull
> read high: high = 0x0
> set spte to 0xb11110001 (high 32bits = 0xb,
> low 32bit = 0x11110001)
>
> read low: 0x11110001 and see
> it is not changed.
>
> In this case, CPU 1 sees that the low bits have not changed, so it tries to access the memory at:
> 0x11110000.
Got it. What about this in the comment to __get_spte_lockless:
* The idea using the light way get the spte on x86_32 guest is from
* gup_get_pte(arch/x86/mm/gup.c).
*
* An spte tlb flush may be pending, because kvm_set_pte_rmapp
* coalesces them and we are running out of the MMU lock. Therefore
* we need to protect against in-progress updates of the spte.
*
* A race on changing present->non-present may get the old value for
* the high part of the spte. This is okay because the high part of
* the spte is ignored for non-present spte.
*
* However, we must detect a present->present change and reread the
* spte in case the change is in progress. Because all such changes
* are done in two steps (present->non-present and non-present->present),
* it is enough to count the number of present->non-present updates,
* which is done using clear_spte_count.
Paolo
> BTW, we are using the TLB to protect the lockless walking; the count can be dropped after
> improving kvm_set_pte_rmapp, which is the only place that changes a spte from present to
> present without a TLB flush.
>
On 06/19/2013 07:58 PM, Paolo Bonzini wrote:
> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>> Document write_flooding_count to Documentation/virtual/kvm/mmu.txt
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> Documentation/virtual/kvm/mmu.txt | 8 ++++++++
>> arch/x86/include/asm/kvm_host.h | 5 +++++
>> 2 files changed, 13 insertions(+)
>>
>> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
>> index ce6df51..5a6b2e2 100644
>> --- a/Documentation/virtual/kvm/mmu.txt
>> +++ b/Documentation/virtual/kvm/mmu.txt
>> @@ -214,6 +214,14 @@ Shadow pages contain the following information:
>> It is only used on 32bit host which helps us to detect whether updating the
>> 64bit spte is complete so that we can avoid reading the truncated value out
>> of mmu-lock.
>> + write_flooding_count:
>> + In order to sync the page table between guest and host, the page sometimes
>> + needs to be write-protected (see "Synchronized and unsynchronized pages"
>> + below), any write to the page can cause write emulation. If the emulation
>> + on the page is too frequent we'd better unmap the page to avoid the
>> + future emulation. write_flooding_count aims at this optimization which is
>> + increased when the page needs to be write emulated and cleared when the
>> + page is actually used.
>
> A guest may write to a page table many times, causing a lot of
> emulations if the page needs to be write-protected (see "Synchronized
> and unsynchronized pages" below). Leaf pages can be unsynchronized
> so that they do not trigger frequent emulation, but this is not
> possible for non-leafs. This field counts the number of emulations
> since the last time the page table was actually used; if emulation
> is triggered too frequently on this page, KVM will unmap the page
> to avoid emulation in the future.
very nice.
>
>> Reverse map
>> ===========
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 1dac2c1..5eb5382 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -234,6 +234,11 @@ struct kvm_mmu_page {
>> int clear_spte_count;
>> #endif
>>
>> + /*
>> + * Increased when the page needs to be write emulated, cleared
>> + * when the page is actually used as page table to avoid frequent
>> + * emulation on the page.
>> + */
>
> /* Number of writes since the last time traversal visited this page. */
>
> wdyt, better or worse?
I think it is better. :)
Thanks for your refining.
On 06/19/2013 08:10 PM, Paolo Bonzini wrote:
> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>> Document it to Documentation/virtual/kvm/mmu.txt
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> Documentation/virtual/kvm/mmu.txt | 11 +++++++++--
>> 1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
>> index 5a6b2e2..4fb442b 100644
>> --- a/Documentation/virtual/kvm/mmu.txt
>> +++ b/Documentation/virtual/kvm/mmu.txt
>> @@ -270,14 +270,21 @@ This is the most complicated event. The cause of a page fault can be:
>>
>> Handling a page fault is performed as follows:
>>
>> + - if the RSV bit of the error code is set, the page fault is caused by guest
>> + accessing MMIO, walk shadow page table to get the last spte where the mmio
>> + information is stored and cache the information to vcpu->arch.mmio_gva,
>> + vcpu->arch.access and vcpu->arch.mmio_gfn then call the emulator to emulate
>> + the instruction who will get the benefit from the cached mmio info
>
> + - if the RSV bit of the error code is set, the page fault is caused by guest
> + accessing MMIO and cached MMIO information is available.
> + - walk shadow page table
> + - cache the information to vcpu->arch.mmio_gva, vcpu->arch.access and
> + vcpu->arch.mmio_gfn, and call the emulator
Nice.
>
>> - if needed, walk the guest page tables to determine the guest translation
>> (gva->gpa or ngpa->gpa)
>> - if permissions are insufficient, reflect the fault back to the guest
>> - determine the host page
>> - - if this is an mmio request, there is no host page; call the emulator
>> - to emulate the instruction instead
>> + - if this is an mmio request, there is no host page; cache the info to
>> + vcpu->arch.mmio_gva, vcpu->arch.access and vcpu->arch.mmio_gfn
>> - walk the shadow page table to find the spte for the translation,
>> instantiating missing intermediate page tables as necessary
>> + - If this is an mmio request, cache the mmio info to the spte and set some
>> + reserved bits on the spte
>
> Added "(see callers of kvm_mmu_set_mmio_spte_mask)". Not really related, but
> just came to my mind: perhaps we can have a section on A/D bits too.
It is useful, I think. Will do it.
On 06/19/2013 08:13 PM, Paolo Bonzini wrote:
> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>> Document fast page fault to Documentation/virtual/kvm/mmu.txt
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> Documentation/virtual/kvm/mmu.txt | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
>> index 4fb442b..b5ce7dd 100644
>> --- a/Documentation/virtual/kvm/mmu.txt
>> +++ b/Documentation/virtual/kvm/mmu.txt
>> @@ -275,6 +275,9 @@ Handling a page fault is performed as follows:
>> information is stored and cache the information to vcpu->arch.mmio_gva,
>> vcpu->arch.access and vcpu->arch.mmio_gfn then call the emulator to emulate
>> the instruction who will get the benefit from the cached mmio info
>> + - If both P bit and R/W bit of error code are set, there has a chance to
>> + fast fix the page fault, please see the description of "Fast page fault" in
>> + Documentation/virtual/kvm/locking.txt
Nice.
On 06/19/2013 08:25 PM, Paolo Bonzini wrote:
> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>> Document it to Documentation/virtual/kvm/mmu.txt
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> Documentation/virtual/kvm/mmu.txt | 23 +++++++++++++++++++++++
>> arch/x86/include/asm/kvm_host.h | 5 +++++
>> 2 files changed, 28 insertions(+)
>>
>> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
>> index b5ce7dd..f5c4de9 100644
>> --- a/Documentation/virtual/kvm/mmu.txt
>> +++ b/Documentation/virtual/kvm/mmu.txt
>> @@ -210,6 +210,10 @@ Shadow pages contain the following information:
>> A bitmap indicating which sptes in spt point (directly or indirectly) at
>> pages that may be unsynchronized. Used to quickly locate all unsychronized
>> pages reachable from a given page.
>> + mmu_valid_gen:
>> + It is the generation number of the page which cooperates with
>> + kvm->arch.mmu_valid_gen to fast invalidate all pages.
>> + (see "Fast invalidate all pages" below.)
>
> + mmu_valid_gen:
> + Generation number of the page. It is compared with kvm->arch.mmu_valid_gen
> + during hash table lookup, and used to skip invalidated shadow pages (see
> + "Zapping all pages" below.)
>
>> clear_spte_count:
>> It is only used on 32bit host which helps us to detect whether updating the
>> 64bit spte is complete so that we can avoid reading the truncated value out
>> @@ -373,6 +377,25 @@ causes its write_count to be incremented, thus preventing instantiation of
>> a large spte. The frames at the end of an unaligned memory slot have
>> artificially inflated ->write_counts so they can never be instantiated.
>>
>> +Fast invalidate all pages
>> +===========
>> +For the large memory and large vcpus guests, zapping all pages is a challenge
>> +since they have large number of pages need to be zapped, walking and zapping
>> +these pages are really slow and it should hold mmu-lock which stops the memory
>> +access on all vcpus.
>> +
>> +To make it be more scalable, kvm maintains a global mmu valid
>> +generation-number which is stored in kvm->arch.mmu_valid_gen and every shadow
>> +page stores the current global generation-number into sp->mmu_valid_gen when
>> +it is created.
>> +
>> +When KVM need zap all shadow pages sptes, it just simply increases the global
>> +generation-number then reload root shadow pages on all vcpus. Vcpu will create
>> +a new shadow page table according to current kvm's generation-number. It
>> +ensures the old pages are not used any more. The invalid-gen pages
>> +(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped by using lock-break
>> +technique.
>> +
>
> +Zapping all pages (page generation count)
> +=========================================
> +
> +For the large memory guests, walking and zapping all pages is really slow
> +(because there are a lot of pages), and also blocks memory accesses of
> +all VCPUs because it needs to hold the MMU lock.
> +
> +To make it be more scalable, kvm maintains a global generation number
> +which is stored in kvm->arch.mmu_valid_gen. Every shadow page stores
> +the current global generation-number into sp->mmu_valid_gen when it
> +is created. Pages with a mismatching generation number are "obsolete".
> +
> +When KVM needs to zap all shadow page sptes, it simply increases the global
> +generation-number and then reloads the root shadow pages on all vcpus. As the VCPUs
> +create new shadow page tables, the old pages are not used because of the
> +mismatching generation number.
> +
> +KVM then walks through all pages and zaps obsolete pages. While the zap
> +operation needs to take the MMU lock, the lock can be released periodically
> +so that the VCPUs can make progress.
> +
>
>> Further reading
>> ===============
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 5eb5382..c4f90f6 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -222,6 +222,11 @@ struct kvm_mmu_page {
>> int root_count; /* Currently serving as active root */
>> unsigned int unsync_children;
>> unsigned long parent_ptes; /* Reverse mapping for parent_pte */
>> +
>> + /*
>> + * the generation number of the page which cooperates with
>> + * kvm->arch.mmu_valid_gen to fast invalidate all pages.
>> + */
>
> + /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
>
All the changes are fine to me.
I have learned a lot from your sentences, thanks! ;)
On 06/19/2013 08:35 PM, Paolo Bonzini wrote:
> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>> Document it to Documentation/virtual/kvm/mmu.txt
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> Documentation/virtual/kvm/mmu.txt | 25 +++++++++++++++++++++++++
>> 1 file changed, 25 insertions(+)
>>
>> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
>> index f5c4de9..9b7cfb3 100644
>> --- a/Documentation/virtual/kvm/mmu.txt
>> +++ b/Documentation/virtual/kvm/mmu.txt
>> @@ -396,6 +396,31 @@ ensures the old pages are not used any more. The invalid-gen pages
>> (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped by using lock-break
>> technique.
>>
>> +Fast invalidate all mmio sptes
>> +===========
>> +As mentioned in "Reaction to events" above, kvm will cache the mmio information
>> +to the last sptes so that we should zap all mmio sptes when the guest mmio info
>> +is changed. This will happen when a new memslot is added or the existing
>> +memslot is moved.
>> +
>> +Zapping mmio spte is also a scalability issue for the large memory and large
>> +vcpus guests since it needs to hold hot mmu-lock and walk all shadow pages to
>> +find all the mmio spte out.
>> +
>> +We fix this issue by using the similar way of "Fast invalidate all pages".
>> +The global mmio valid generation-number is stored in kvm->memslots.generation
>> +and every mmio spte stores the current global generation-number into his
>> +available bits when it is created
>> +
>> +The global mmio valid generation-number is increased whenever the gust memory
>> +info is changed. When guests do mmio access, kvm intercepts a MMIO #PF then it
>> +walks the shadow page table and get the mmio spte. If the generation-number on
>> +the spte does not equal the global generation-number, it will go to the normal
>> +#PF handler to update the mmio spte.
>> +
>> +Since 19 bits are used to store generation-number on mmio spte, we zap all pages
>> +when the number is round.
>
> As mentioned in "Reaction to events" above, kvm will cache MMIO
> information in leaf sptes. When a new memslot is added or an existing
> memslot is changed, this information may become stale and needs to be
> invalidated. This also needs to hold the MMU lock while walking all
> shadow pages, and is made more scalable with a similar technique.
>
> MMIO sptes have a few spare bits, which are used to store a
> generation number. The global generation number is stored in
> kvm_memslots(kvm)->generation, and increased whenever guest memory info
> changes. This generation number is distinct from the one described in
> the previous section.
>
> When KVM finds an MMIO spte, it checks the generation number of the spte.
> If the generation number of the spte does not equal the global generation
> number, it will ignore the cached MMIO information and handle the page
> fault through the slow path.
>
> Since only 19 bits are used to store generation-number on mmio spte, all
> pages are zapped when there is an overflow.
Really nice.
>
>
> I'm also adding this in the page fault section:
>
> - check for valid generation number in the spte (see "Fast invalidation of
> MMIO sptes" below)
Oh, I forgot it in this section, thanks for your fix.
On 06/19/2013 08:41 PM, Paolo Bonzini wrote:
> Il 19/06/2013 14:25, Xiao Guangrong ha scritto:
>> On 06/19/2013 07:55 PM, Paolo Bonzini wrote:
>>> Il 19/06/2013 13:53, Xiao Guangrong ha scritto:
>>>> On 06/19/2013 07:32 PM, Paolo Bonzini wrote:
>>>>> Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
>>>>>> Document it to Documentation/virtual/kvm/mmu.txt
>>>>>
>>>>> While reviewing the docs, I looked at the code.
>>>>>
>>>>> Why can't this happen?
>>>>>
>>>>> CPU 1: __get_spte_lockless CPU 2: __update_clear_spte_slow
>>>>> ------------------------------------------------------------------------------
>>>>> write low
>>>>> read count
>>>>> read low
>>>>> read high
>>>>> write high
>>>>> check low and count
>>>>> update count
>>>>>
>>>>> The check passes, but CPU 1 read a "torn" SPTE.
>>>>
>>>> In this case, CPU 1 will read the "new low bits" and the "old high bits", right?
>>>> the P bit in the low bits is cleared when do __update_clear_spte_slow, i.e, it is
>>>> not present, so the whole value is ignored.
>>>
>>> Indeed that's what the comment says, too. But then why do we need the
>>> count at all? The spte that is read is exactly the same before and
>>> after the count is updated.
>>
>> It is needed to detect the spte being cleared and marked present again, so that the
>> lockless side does not miss a present-to-present change; otherwise, we can get this:
>>
>> Say spte = 0xa11110001 (high 32bits = 0xa, low 32bit = 0x11110001)
>>
>> CPU 1: __get_spte_lockless CPU 2: __update_clear_spte_slow
>> ----------------------------------------------------------------------
>> read low: low= 0x11110001
>> clear the spte, then spte = 0x0ull
>> read high: high = 0x0
>> set spte to 0xb11110001 (high 32bits = 0xb,
>> low 32bit = 0x11110001)
>>
>> read low: 0x11110001 and see
>> it is not changed.
>>
>> In this case, CPU 1 sees that the low bits have not changed, so it tries to access the memory at:
>> 0x11110000.
>
> Got it. What about this in the comment to __get_spte_lockless:
>
> * The idea using the light way get the spte on x86_32 guest is from
> * gup_get_pte(arch/x86/mm/gup.c).
> *
> * An spte tlb flush may be pending, because kvm_set_pte_rmapp
> * coalesces them and we are running out of the MMU lock. Therefore
> * we need to protect against in-progress updates of the spte.
> *
> * A race on changing present->non-present may get the old value for
> * the high part of the spte. This is okay because the high part of
> * the spte is ignored for non-present spte.
> *
> * However, we must detect a present->present change and reread the
> * spte in case the change is in progress. Because all such changes
> * are done in two steps (present->non-present and non-present->present),
> * it is enough to count the number of present->non-present updates,
> * which is done using clear_spte_count.
It is fantastic :)
Il 19/06/2013 11:09, Xiao Guangrong ha scritto:
> As Paolo pointed out, this is the time to update the documentation, so this
> patchset does it to make mmu.txt match the current mmu implementation
>
> Xiao Guangrong (7):
> KVM: MMU: update the documentation for reverse mapping of parent_pte
> KVM: MMU: document clear_spte_count
> KVM: MMU: document write_flooding_count
> KVM: MMU: document mmio page fault
> KVM: MMU: document fast page fault in
> KVM: MMU: document fast invalidate all pages
> KVM: MMU: document fast invalidate all mmio sptes
>
> Documentation/virtual/kvm/mmu.txt | 86 +++++++++++++++++++++++++++++++++++----
> arch/x86/include/asm/kvm_host.h | 15 +++++++
> arch/x86/kvm/mmu.c | 7 ++--
> 3 files changed, 97 insertions(+), 11 deletions(-)
>
Applied all to queue since they depend on the fast MMIO-spte zap
patches. Will graduate to next as soon as I'm finished testing those
patches.
Paolo
On 06/19/2013 04:09:25 AM, Xiao Guangrong wrote:
> Document it to Documentation/virtual/kvm/mmu.txt
Why break a change to a single documentation file into 7 pieces.
Are we going to bisect the documentation?
Rob-
Il 20/06/2013 07:21, Rob Landley ha scritto:
> On 06/19/2013 04:09:25 AM, Xiao Guangrong wrote:
>> Document it to Documentation/virtual/kvm/mmu.txt
>
> Why break a change to a single documentation file into 7 pieces.
>
> Are we going to bisect the documentation?
It is explaining 7 different optimizations, and each patch could have
been submitted together with the corresponding optimization. So it
makes sense---and it certainly helped review.
Paolo