2013-07-30 13:04:08

by Xiao Guangrong

Subject: [RFC PATCH 00/12] KVM: MMU: locklessly write-protect

Background
==========
Currently, when we mark a memslot as dirty-logged or get the dirty pages, we
need to write-protect a large amount of guest memory. This is heavy work, and
it is done while holding mmu-lock, which is also required by vcpus to fix
their page table faults and by the mmu-notifier when a host page is being
changed. On guests that use many cpus and lots of memory, this becomes a
scalability issue.

This patchset introduces a way to locklessly write-protect guest memory.

Idea
==========
These are the challenges we met and the ideas used to resolve them.

1) How to locklessly walk the rmap?
Our first idea for preventing a "desc" from being freed while we are walking
the rmap was plain RCU. But when a vcpu runs in shadow page mode or nested
mmu mode, it updates the rmap very frequently.

So we use SLAB_DESTROY_BY_RCU to manage the "desc" instead, which allows the
object to be reused more quickly. We also store a "nulls" value in the last
"desc" (desc->more), which lets us detect whether the "desc" has been moved
to another rmap, in which case we re-walk the rmap. I learned this idea from
the nulls-list.

Another issue is that when a spte is deleted from a "desc", another spte in
the last "desc" is moved into that position to replace the deleted one. If
the deleted one has already been visited but the replacement has not, the
replacement is missed by the lockless walk.
To fix this, we no longer move the spte backward; instead we move the entry
forward: when a spte is deleted, we move an entry from the first desc into
that position.
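
Here is a condensed sketch of the lockless walker this leads to (boiled down
from pte_list_walk_lockless() in patches 08 and 09; the _sketch name is mine,
and the memory barriers of the real code are omitted here):

/*
 * The low bit of *pte_list tells whether it stores a single spte or a
 * chain of descs; the nulls value stored in the last desc->more tells
 * us whether the desc was recycled onto a different rmap while we were
 * walking it, in which case we simply start over.
 */
static void pte_list_walk_lockless_sketch(unsigned long *pte_list,
                                          void (*fn)(u64 *spte))
{
        struct pte_list_desc *desc;
        unsigned long head;
        int i;

restart:
        head = ACCESS_ONCE(*pte_list);
        if (!head)
                return;

        if (!(head & 1))
                return fn((u64 *)head);      /* a single spte, stored inline */

        desc = (struct pte_list_desc *)(head & ~1ul);
        while (!desc_is_a_nulls(desc)) {
                for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
                        fn(desc->sptes[i]);

                desc = ACCESS_ONCE(desc->more);
                if (unlikely(!desc))         /* the desc is being initialized */
                        goto restart;
        }

        /* the desc was moved to another rmap, walk this rmap again */
        if (unlikely(desc_get_nulls_value(desc) != pte_list))
                goto restart;
}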

2) How to locklessly access shadow page table?
It is easy if the handler is in the vcpu context, in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
disable interrupt to stop shadow page be freed. But we are on the ioctl context
and the paths we are optimizing for have heavy workload, disabling interrupt is
not good for the system performance.

We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page) and
use call_rcu() to free shadow pages while that indicator is set. Setting and
clearing the indicator is protected by the slots-lock, so it need not be
atomic and does not hurt performance or scalability.
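
For reference, this is essentially what the helpers added in patch 10 look
like (condensed here; the matching change makes kvm_mmu_commit_zap_page()
hand the invalid shadow pages to call_rcu() while the indicator is set):

static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
{
        rcu_read_lock();

        kvm->arch.rcu_free_shadow_page = true;
        /* Set the indicator before accessing shadow pages. */
        smp_mb();
}

static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
{
        /* Make sure the shadow page accesses have finished. */
        smp_mb();
        kvm->arch.rcu_free_shadow_page = false;

        rcu_read_unlock();
}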

3) How to locklessly write-protect guest memory?
Currently, there are two behaviors when we write-protect guest memory: one is
clearing the Writable bit on the spte, and the other is dropping the spte
when it points to a large page. The former is easy, we only need to
atomically clear a bit, but the latter is hard since we need to remove the
spte from the rmap. So we unify the two behaviors and only make the spte
readonly. Making a large spte readonly instead of non-present is also good
for reducing jitter.

We also need to pay attention to the ordering of making the spte writable,
adding the spte into the rmap, and setting the corresponding bit in the dirty
bitmap. Since kvm_vm_ioctl_get_dirty_log() write-protects sptes based on the
dirty bitmap, we must ensure the writable spte can be found in the rmap
before the dirty bitmap update is visible. Otherwise, we could clear the
dirty bitmap yet fail to write-protect the page.
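
Roughly, the two sides pair up as follows (condensed from the set_spte() and
kvm_vm_ioctl_get_dirty_log() hunks in patches 04, 05 and 11; only the lines
relevant to the ordering are shown):

        /* Producer side, set_spte(): make the writable spte reachable
         * via the rmap before the dirty bit becomes visible.
         */
        rmap_add(vcpu, sptep, gfn);
        smp_wmb();
        if (pte_access & ACC_WRITE_MASK)
                mark_page_dirty(vcpu->kvm, gfn);

        /* Consumer side, kvm_vm_ioctl_get_dirty_log(): xchg is a full
         * barrier, so the dirty bitmap is cleared before the rmap is
         * walked and write-protected.
         */
        mask = xchg(&dirty_bitmap[i], 0);
        kvm_mmu_write_protect_pt_masked_lockless(kvm, memslot, offset, mask);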

Performance result
====================
Host: CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz x 12
Mem: 36G

The benchmarks I used, which will be attached in a follow-up mail:
a) kernbench
b) migrate-perf
   emulates guest migration
c) mmtest
   repeatedly writes memory and measures the time taken; used to generate
   memory accesses in the guest that is being migrated
d) Qemu monitor commands to drive guest live migration
   the script can be found in migrate-perf.


1) First, we use kernbench to benchmark performance in the
non-write-protection case, to detect any possible regression:

EPT enabled: Base: 84.05 After the patch: 83.53
EPT disabled: Base: 142.57 After the patch: 141.70

No regression, and the slight improvement may come from lazily dropping
large sptes.

2) Benchmark the performance of getting dirty pages
(./migrate-perf -c 12 -m 3000 -t 20)

Base: Run 20 times, Avg time:24813809 ns.
After the patch: Run 20 times, Avg time:8371577 ns.

That is an improvement of about 196%.

3) Results of live migration:
3.1) Fewer vcpus, less memory and fewer dirty pages generated
(
Guest config: MEM_SIZE=7G VCPU_NUM=6
The workload in migrated guest:
ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 3000 -c 30 -t 60 > result &"
)

                    Live Migration time (ms)    Benchmark (ns)
EPT      Baseline   21638                       266601028
         After      21110   (+2.5%)             264966696   (+0.6%)
Shadow   Baseline   22542                       271969284
         After      21641   (+4.1%)             270485511   (+0.5%)

3.2) More vcpus, more memory and fewer dirty pages generated
(
Guest config: MEM_SIZE=25G VCPU_NUM=12
The workload in migrated guest:
ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 30 > result &"
)

                    Live Migration time (ms)    Benchmark (ns)
EPT      Baseline   72773                       1278228350
         After      70516   (+3.2%)             1266581587  (+0.9%)
Shadow   Baseline   74198                       1323180090
         After      64948   (+14.2%)            1299283302  (+1.8%)

3.3) Fewer vcpus, more memory and a huge number of dirty pages generated
(
Guest config: MEM_SIZE=25G VCPU_NUM=6
The workload in migrated guest:
ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 200 > result &"
)

                    Live Migration time (ms)    Benchmark (ns)
EPT      Baseline   267473                      1224657502
         After      267374  (+0.03%)            1221520513  (+0.6%)
Shadow   Baseline   369999                      1712004428
         After      335737  (+10.2%)            1556065063  (+10.2%)

In case 3.3), EPT sees only a small benefit. The reason is that only the
first guest write to a page needs to take mmu-lock, to switch the spte from
non-present to present; the other writes spend their time on the
write-protection faults, which are fixed by the fast page fault path and do
not need mmu-lock.

Xiao Guangrong (12):
KVM: MMU: remove unused parameter
KVM: MMU: properly check last spte in fast_page_fault()
KVM: MMU: lazily drop large spte
KVM: MMU: log dirty page after marking spte writable
KVM: MMU: add spte into rmap before logging dirty page
KVM: MMU: flush tlb if the spte can be locklessly modified
KVM: MMU: redesign the algorithm of pte_list
KVM: MMU: introduce nulls desc
KVM: MMU: introduce pte-list lockless walker
KVM: MMU: allow locklessly access shadow page table out of vcpu thread
KVM: MMU: locklessly write-protect the page
KVM: MMU: clean up spte_write_protect

arch/x86/include/asm/kvm_host.h | 10 +-
arch/x86/kvm/mmu.c | 442 ++++++++++++++++++++++++++++------------
arch/x86/kvm/mmu.h | 28 +++
arch/x86/kvm/x86.c | 19 +-
4 files changed, 356 insertions(+), 143 deletions(-)

--
1.8.1.4


2013-07-30 13:02:28

by Xiao Guangrong

Subject: [PATCH 03/12] KVM: MMU: lazily drop large spte

Currently, kvm zaps the large spte if write-protection is needed, so a later
read can fault on that spte. Instead, we can make the large spte readonly
rather than non-present, so that the page fault caused by the read access is
avoided.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.

[
This also fixes the issue reported in commit 6b73a9606 by stopping the fast
page fault path from marking a large spte writable
]

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 36 +++++++++++++++++-------------------
1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cf163ca..35d4b50 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1181,8 +1181,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)

/*
* Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
*
* Note: write protection is difference between drity logging and spte
* protection:
@@ -1191,10 +1190,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
* - for spte protection, the spte can be writable only after unsync-ing
* shadow page.
*
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
*/
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
{
u64 spte = *sptep;

@@ -1204,17 +1202,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)

rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

- if (__drop_large_spte(kvm, sptep)) {
- *flush |= true;
- return true;
- }
-
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;

- *flush |= mmu_spte_update(sptep, spte);
- return false;
+ return mmu_spte_update(sptep, spte);
}

static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1226,11 +1218,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,

for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
- if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
- sptep = rmap_get_first(*rmapp, &iter);
- continue;
- }

+ flush |= spte_write_protect(kvm, sptep, pt_protect);
sptep = rmap_get_next(&iter);
}

@@ -2701,6 +2690,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
break;
}

+ drop_large_spte(vcpu, iterator.sptep);
+
if (!is_shadow_present_pte(*iterator.sptep)) {
u64 base_addr = iterator.addr;

@@ -2855,7 +2846,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* - false: let the real page fault path to fix it.
*/
static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
- u32 error_code)
+ u32 error_code, bool force_pt_level)
{
struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
@@ -2884,6 +2875,13 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
goto exit;

/*
+ * Can not map the large spte to writable if the page is dirty
+ * logged.
+ */
+ if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
+ goto exit;
+
+ /*
* Check if it is a spurious fault caused by TLB lazily flushed.
*
* Need not check the access of upper level table entries since
@@ -2944,7 +2942,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
} else
level = PT_PAGE_TABLE_LEVEL;

- if (fast_page_fault(vcpu, v, level, error_code))
+ if (fast_page_fault(vcpu, v, level, error_code, force_pt_level))
return 0;

mmu_seq = vcpu->kvm->mmu_notifier_seq;
@@ -3422,7 +3420,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
} else
level = PT_PAGE_TABLE_LEVEL;

- if (fast_page_fault(vcpu, gpa, level, error_code))
+ if (fast_page_fault(vcpu, gpa, level, error_code, force_pt_level))
return 0;

mmu_seq = vcpu->kvm->mmu_notifier_seq;
--
1.8.1.4

2013-07-30 13:02:27

by Xiao Guangrong

Subject: [PATCH 02/12] KVM: MMU: properly check last spte in fast_page_fault()

Use sp->role.level instead of @level, since @level is not taken from the
page table hierarchy.

There is no issue in the current code, since the fast page fault only fixes
faults caused by dirty logging, which means no large spte points to the
dirty-logged page.

The patch makes the code more readable and avoids potential issues in
further development.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a731c0b..cf163ca 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2830,9 +2830,9 @@ static bool page_fault_can_be_fast(u32 error_code)
}

static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ u64 *sptep, u64 spte)
{
- struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;

WARN_ON(!sp->role.direct);
@@ -2858,6 +2858,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
u32 error_code)
{
struct kvm_shadow_walk_iterator iterator;
+ struct kvm_mmu_page *sp;
bool ret = false;
u64 spte = 0ull;

@@ -2878,7 +2879,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
goto exit;
}

- if (!is_last_spte(spte, level))
+ sp = page_header(__pa(iterator.sptep));
+ if (!is_last_spte(spte, sp->role.level))
goto exit;

/*
@@ -2904,7 +2906,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
* the gfn is not stable for indirect shadow page.
* See Documentation/virtual/kvm/locking.txt to get more detail.
*/
- ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
+ ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
exit:
trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
spte, ret);
--
1.8.1.4

2013-07-30 13:02:22

by Xiao Guangrong

Subject: [PATCH 01/12] KVM: MMU: remove unused parameter

@vcpu in page_fault_can_be_fast() is not used so remove it

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3a9493a..a731c0b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2808,7 +2808,7 @@ exit:
return ret;
}

-static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, u32 error_code)
+static bool page_fault_can_be_fast(u32 error_code)
{
/*
* Do not fix the mmio spte with invalid generation number which
@@ -2861,7 +2861,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
bool ret = false;
u64 spte = 0ull;

- if (!page_fault_can_be_fast(vcpu, error_code))
+ if (!page_fault_can_be_fast(error_code))
return false;

walk_shadow_page_lockless_begin(vcpu);
--
1.8.1.4

2013-07-30 13:03:34

by Xiao Guangrong

Subject: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

Make sure we can see the writable spte before the dirty bitmap is visible.

We do this because kvm_vm_ioctl_get_dirty_log() write-protects sptes based
on the dirty bitmap; we should ensure the writable spte can be found in the
rmap before the dirty bitmap is visible. Otherwise, we could clear the dirty
bitmap but fail to write-protect the page.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 35d4b50..0fe56ad 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}

- if (pte_access & ACC_WRITE_MASK)
- mark_page_dirty(vcpu->kvm, gfn);
-
set_pte:
if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);
+
+ if (pte_access & ACC_WRITE_MASK)
+ mark_page_dirty(vcpu->kvm, gfn);
done:
return ret;
}
--
1.8.1.4

2013-07-30 13:03:32

by Xiao Guangrong

Subject: [PATCH 05/12] KVM: MMU: add spte into rmap before logging dirty page

kvm_vm_ioctl_get_dirty_log() write-protects sptes based on the dirty bitmap,
so we should ensure the writable spte can be found in the rmap before the
dirty bitmap is visible. Otherwise, we could clear the dirty bitmap but fail
to write-protect the page.

This needs a memory barrier to prevent reordering; the barrier will be added
in a later patch.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 25 ++++++++++---------------
1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0fe56ad..58283bf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2425,6 +2425,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
{
u64 spte;
int ret = 0;
+ bool remap = is_rmap_spte(*sptep);

if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access))
return 0;
@@ -2490,6 +2491,14 @@ set_pte:
if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);

+ if (!remap) {
+ if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
+ rmap_recycle(vcpu, sptep, gfn);
+
+ if (level > PT_PAGE_TABLE_LEVEL)
+ ++vcpu->kvm->stat.lpages;
+ }
+
if (pte_access & ACC_WRITE_MASK)
mark_page_dirty(vcpu->kvm, gfn);
done:
@@ -2501,9 +2510,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
int level, gfn_t gfn, pfn_t pfn, bool speculative,
bool host_writable)
{
- int was_rmapped = 0;
- int rmap_count;
-
pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
*sptep, write_fault, gfn);

@@ -2525,8 +2531,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte_to_pfn(*sptep), pfn);
drop_spte(vcpu->kvm, sptep);
kvm_flush_remote_tlbs(vcpu->kvm);
- } else
- was_rmapped = 1;
+ }
}

if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
@@ -2544,16 +2549,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
is_large_pte(*sptep)? "2MB" : "4kB",
*sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
*sptep, sptep);
- if (!was_rmapped && is_large_pte(*sptep))
- ++vcpu->kvm->stat.lpages;
-
- if (is_shadow_present_pte(*sptep)) {
- if (!was_rmapped) {
- rmap_count = rmap_add(vcpu, sptep, gfn);
- if (rmap_count > RMAP_RECYCLE_THRESHOLD)
- rmap_recycle(vcpu, sptep, gfn);
- }
- }

kvm_release_pfn_clean(pfn);
}
--
1.8.1.4

2013-07-30 13:07:35

by Xiao Guangrong

Subject: [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified

Relax the tlb flush condition, since we will write-protect the spte outside
of mmu-lock. Note that lockless write-protection only marks a writable spte
readonly, and the spte can be made writable again only if both
SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE are set (which is what
spte_is_locklessly_modifiable() tests).

This patch is used to avoid this kind of race:

   VCPU 0                               VCPU 1
   lockless write protection:
     set spte.w = 0
                                        lock mmu-lock

                                        write-protect the spte to sync the
                                        shadow page, see spte.w = 0, so do
                                        not flush the tlb

                                        unlock mmu-lock

             !!! At this point, the shadow page can still be
                 writable due to the stale tlb entry
   Flush all TLB

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 58283bf..5a40564 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -600,7 +600,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
* we always atomicly update it, see the comments in
* spte_has_volatile_bits().
*/
- if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+ if (spte_is_locklessly_modifiable(old_spte) &&
+ !is_writable_pte(new_spte))
ret = true;

if (!shadow_accessed_mask)
--
1.8.1.4

2013-07-30 13:07:57

by Xiao Guangrong

Subject: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

The basic idea is from the nulls list, which uses a nulls value to indicate
whether the desc has been moved to a different pte-list.

Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 36caf6a..f8fc0cc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1010,6 +1010,14 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
desc_mark_nulls(pte_list, desc);
+
+ /*
+ * Ensure the old spte has been updated into the desc, so
+ * that the other side cannot get the desc from pte_list
+ * but miss the old spte.
+ */
+ smp_wmb();
+
*pte_list = (unsigned long)desc | 1;
return 1;
}
@@ -1131,6 +1139,47 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
WARN_ON(desc_get_nulls_value(desc) != pte_list);
}

+/* The caller should hold rcu lock. */
+typedef void (*pte_list_walk_lockless_fn) (u64 *spte, int level);
+static void pte_list_walk_lockless(unsigned long *pte_list,
+ pte_list_walk_lockless_fn fn, int level)
+{
+ struct pte_list_desc *desc;
+ unsigned long pte_list_value;
+ int i;
+
+restart:
+ pte_list_value = ACCESS_ONCE(*pte_list);
+ if (!pte_list_value)
+ return;
+
+ if (!(pte_list_value & 1))
+ return fn((u64 *)pte_list_value, level);
+
+ /*
+ * fetch pte_list before read sptes in the desc, see the comments
+ * in pte_list_add().
+ *
+ * There is the data dependence since the desc is got from pte_list.
+ */
+ smp_read_barrier_depends();
+
+ desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
+ while (!desc_is_a_nulls(desc)) {
+ for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+ fn(desc->sptes[i], level);
+
+ desc = ACCESS_ONCE(desc->more);
+
+ /* It is being initialized. */
+ if (unlikely(!desc))
+ goto restart;
+ }
+
+ if (unlikely(desc_get_nulls_value(desc) != pte_list))
+ goto restart;
+}
+
static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
{
@@ -4557,7 +4606,7 @@ int kvm_mmu_module_init(void)
{
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
- 0, 0, NULL);
+ 0, SLAB_DESTROY_BY_RCU, NULL);
if (!pte_list_desc_cache)
goto nomem;

--
1.8.1.4

2013-07-30 13:07:51

by Xiao Guangrong

Subject: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

Change the algorithm to:
1) always add the new desc to the head of the list (pointed to by
parent_ptes/rmap); that is good for implementing rcu-nulls-list-like
lockless rmap walking

2) always move an entry from the first desc into the position we want to
free when removing a spte from the parent_ptes/rmap (forward-move).
This is good for implementing the lockless rmap walk because, in the
current code, when a spte is deleted from a "desc", another spte in the
last "desc" is moved into that position to replace the deleted one. If
the deleted one has already been visited but the replacement has not,
the replacement is missed by the lockless walk.
To fix this, we do not move the spte backward; instead we move the entry
forward: when a spte is deleted, we move an entry from the first desc
into that position

Both of these also reduce cache misses

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 182 ++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 125 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5a40564..3013bb1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -918,6 +918,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
return level - 1;
}

+static int __find_first_free(struct pte_list_desc *desc)
+{
+ int i;
+
+ for (i = 0; i < PTE_LIST_EXT; i++)
+ if (!desc->sptes[i])
+ break;
+ return i;
+}
+
+static int find_first_free(struct pte_list_desc *desc)
+{
+ int free = __find_first_free(desc);
+
+ WARN_ON(free >= PTE_LIST_EXT);
+ return free;
+}
+
+static int find_last_used(struct pte_list_desc *desc)
+{
+ int used = __find_first_free(desc) - 1;
+
+ WARN_ON(used < 0 || used >= PTE_LIST_EXT);
+ return used;
+}
+
+/*
+ * TODO: we can encode the desc number into the rmap/parent_ptes
+ * since at least 10 physical/virtual address bits are reserved
+ * on x86. It is worthwhile if it shows that the desc walking is
+ * a performance issue.
+ */
+static int count_spte_number(struct pte_list_desc *desc)
+{
+ int first_free, desc_num;
+
+ first_free = __find_first_free(desc);
+
+ for (desc_num = 0; desc->more; desc = desc->more)
+ desc_num++;
+
+ return first_free + desc_num * PTE_LIST_EXT;
+}
+
/*
* Pte mapping structures:
*
@@ -934,92 +978,116 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
unsigned long *pte_list)
{
struct pte_list_desc *desc;
- int i, count = 0;
+ int free_pos;

if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
*pte_list = (unsigned long)spte;
- } else if (!(*pte_list & 1)) {
+ return 0;
+ }
+
+ if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
*pte_list = (unsigned long)desc | 1;
- ++count;
- } else {
- rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
- desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
- desc = desc->more;
- count += PTE_LIST_EXT;
- }
- if (desc->sptes[PTE_LIST_EXT-1]) {
- desc->more = mmu_alloc_pte_list_desc(vcpu);
- desc = desc->more;
- }
- for (i = 0; desc->sptes[i]; ++i)
- ++count;
- desc->sptes[i] = spte;
+ return 1;
+ }
+
+ rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
+ desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+
+ /* No empty position in the desc. */
+ if (desc->sptes[PTE_LIST_EXT - 1]) {
+ struct pte_list_desc *new_desc;
+ new_desc = mmu_alloc_pte_list_desc(vcpu);
+ new_desc->more = desc;
+ desc = new_desc;
+ *pte_list = (unsigned long)desc | 1;
}
- return count;
+
+ free_pos = find_first_free(desc);
+ desc->sptes[free_pos] = spte;
+ return count_spte_number(desc);
}

static void
-pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
- int i, struct pte_list_desc *prev_desc)
+pte_list_desc_remove_entry(unsigned long *pte_list,
+ struct pte_list_desc *desc, int i)
{
- int j;
+ struct pte_list_desc *first_desc;
+ int last_used;
+
+ first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+ last_used = find_last_used(first_desc);

- for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
- ;
- desc->sptes[i] = desc->sptes[j];
- desc->sptes[j] = NULL;
- if (j != 0)
+ /*
+ * Move the entry from the first desc to this position we want
+ * to remove.
+ */
+ desc->sptes[i] = first_desc->sptes[last_used];
+ first_desc->sptes[last_used] = NULL;
+
+ /* No valid entry in this desc, we can free this desc now. */
+ if (!first_desc->sptes[0]) {
+ struct pte_list_desc *next_desc = first_desc->more;
+
+ /*
+ * Only one entry existing but still use a desc to store it?
+ */
+ WARN_ON(!next_desc);
+
+ mmu_free_pte_list_desc(first_desc);
+ first_desc = next_desc;
+ *pte_list = (unsigned long)first_desc | 1ul;
return;
- if (!prev_desc && !desc->more)
- *pte_list = (unsigned long)desc->sptes[0];
- else
- if (prev_desc)
- prev_desc->more = desc->more;
- else
- *pte_list = (unsigned long)desc->more | 1;
- mmu_free_pte_list_desc(desc);
+ }
+
+ WARN_ON(!first_desc->sptes[0]);
+
+ /*
+ * Only one entry in this desc, move the entry to the head
+ * then the desc can be freed.
+ */
+ if (!first_desc->sptes[1] && !first_desc->more) {
+ *pte_list = (unsigned long)first_desc->sptes[0];
+ mmu_free_pte_list_desc(first_desc);
+ }
}

static void pte_list_remove(u64 *spte, unsigned long *pte_list)
{
struct pte_list_desc *desc;
- struct pte_list_desc *prev_desc;
int i;

if (!*pte_list) {
- printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
- BUG();
- } else if (!(*pte_list & 1)) {
+ WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
+ return;
+ }
+
+ if (!(*pte_list & 1)) {
rmap_printk("pte_list_remove: %p 1->0\n", spte);
if ((u64 *)*pte_list != spte) {
- printk(KERN_ERR "pte_list_remove: %p 1->BUG\n", spte);
- BUG();
+ WARN(1, KERN_ERR "pte_list_remove: %p 1->BUG\n", spte);
}
*pte_list = 0;
- } else {
- rmap_printk("pte_list_remove: %p many->many\n", spte);
- desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- prev_desc = NULL;
- while (desc) {
- for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
- if (desc->sptes[i] == spte) {
- pte_list_desc_remove_entry(pte_list,
- desc, i,
- prev_desc);
- return;
- }
- prev_desc = desc;
- desc = desc->more;
- }
- pr_err("pte_list_remove: %p many->many\n", spte);
- BUG();
+ return;
}
+
+ rmap_printk("pte_list_remove: %p many->many\n", spte);
+ desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+ while (desc) {
+ for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+ if (desc->sptes[i] == spte) {
+ pte_list_desc_remove_entry(pte_list,
+ desc, i);
+ return;
+ }
+ desc = desc->more;
+ }
+
+ WARN(1, "pte_list_remove: %p many->many\n", spte);
}

typedef void (*pte_list_walk_fn) (u64 *spte);
--
1.8.1.4

2013-07-30 13:08:13

by Xiao Guangrong

Subject: [PATCH 12/12] KVM: MMU: clean up spte_write_protect

Now the only user of spte_write_protect() is rmap_write_protect(), which
always calls it with pt_protect = true, so drop that parameter along with
the unused @kvm

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a50eea8..8073c1f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1319,8 +1319,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
}

/*
- * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte write-protection is caused by protecting shadow page table.
+ * Write-protect on the specified @sptep.
*
* Note: write protection is difference between drity logging and spte
* protection:
@@ -1331,25 +1330,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
*
* Return true if tlb need be flushed.
*/
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(u64 *sptep)
{
u64 spte = *sptep;

if (!is_writable_pte(spte) &&
- !(pt_protect && spte_is_locklessly_modifiable(spte)))
+ !spte_is_locklessly_modifiable(spte))
return false;

rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

- if (pt_protect)
- spte &= ~SPTE_MMU_WRITEABLE;
- spte = spte & ~PT_WRITABLE_MASK;
+ spte &= ~SPTE_MMU_WRITEABLE;
+ spte &= ~PT_WRITABLE_MASK;

return mmu_spte_update(sptep, spte);
}

-static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
- bool pt_protect)
+static bool __rmap_write_protect(unsigned long *rmapp)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1358,7 +1355,7 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));

- flush |= spte_write_protect(kvm, sptep, pt_protect);
+ flush |= spte_write_protect(sptep);
sptep = rmap_get_next(&iter);
}

@@ -1426,7 +1423,7 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
for (i = PT_PAGE_TABLE_LEVEL;
i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
rmapp = __gfn_to_rmap(gfn, i, slot);
- write_protected |= __rmap_write_protect(kvm, rmapp, true);
+ write_protected |= __rmap_write_protect(rmapp);
}

return write_protected;
--
1.8.1.4

2013-07-30 13:08:15

by Xiao Guangrong

Subject: [PATCH 11/12] KVM: MMU: locklessly write-protect the page

Currently, when we mark a memslot as dirty-logged or get the dirty pages, we
need to write-protect a large amount of guest memory. This is heavy work, and
it is done while holding mmu-lock, which is also required by vcpus to fix
their page table faults and by the mmu-notifier when a host page is being
changed. On guests that use many cpus and lots of memory, this becomes a
scalability issue.

This patch introduces a way to locklessly write-protect guest memory.

Now that the lockless rmap walk, lockless shadow page table access and
lockless spte write-protection are ready, it is time to implement page
write-protection out of mmu-lock.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ---
arch/x86/kvm/mmu.c | 62 ++++++++++++++++++++++++++++-------------
arch/x86/kvm/mmu.h | 6 ++++
arch/x86/kvm/x86.c | 19 +++++++++----
4 files changed, 62 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dc842b6..3ef5645 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -780,10 +780,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);

int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- gfn_t gfn_offset, unsigned long mask);
void kvm_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7f3391f..a50eea8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1365,8 +1365,30 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
return flush;
}

-/**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+static void __rmap_write_protect_lockless(u64 *sptep, int level)
+{
+ u64 spte;
+
+retry:
+ spte = mmu_spte_get_lockless(sptep);
+ if (unlikely(!is_last_spte(spte, level) || !is_writable_pte(spte)))
+ return;
+
+ if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
+ return;
+
+ goto retry;
+}
+
+static void rmap_write_protect_lockless(unsigned long *rmapp, int level)
+{
+ pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless, level);
+}
+
+/*
+ * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
+ * pages out of mmu-lock.
+ *
* @kvm: kvm instance
* @slot: slot to protect
* @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1375,16 +1397,17 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
* Used when we do not need to care about huge page mappings: e.g. during dirty
* logging we do not have any such mappings.
*/
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- gfn_t gfn_offset, unsigned long mask)
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
{
unsigned long *rmapp;

while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PT_PAGE_TABLE_LEVEL, slot);
- __rmap_write_protect(kvm, rmapp, false);
+ rmap_write_protect_lockless(rmapp, PT_PAGE_TABLE_LEVEL);

/* clear the first set bit */
mask &= mask - 1;
@@ -2661,6 +2684,15 @@ set_pte:
++vcpu->kvm->stat.lpages;
}

+ /*
+ * We should put the sptep into rmap before dirty log
+ * otherwise the lockless spte write-protect path will
+ * clear the dirty bit map but fail to find the spte.
+ *
+ * See the comments in kvm_vm_ioctl_get_dirty_log().
+ */
+ smp_wmb();
+
if (pte_access & ACC_WRITE_MASK)
mark_page_dirty(vcpu->kvm, gfn);
done:
@@ -4422,7 +4454,7 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu)
return init_kvm_mmu(vcpu);
}

-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
{
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
@@ -4431,8 +4463,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;

- spin_lock(&kvm->mmu_lock);
-
+ kvm_mmu_rcu_free_page_begin(kvm);
for (i = PT_PAGE_TABLE_LEVEL;
i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
unsigned long *rmapp;
@@ -4441,19 +4472,12 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);

- for (index = 0; index <= last_index; ++index, ++rmapp) {
- if (*rmapp)
- __rmap_write_protect(kvm, rmapp, false);
-
- if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
- kvm_flush_remote_tlbs(kvm);
- cond_resched_lock(&kvm->mmu_lock);
- }
- }
+ for (index = 0; index <= last_index; ++index, ++rmapp)
+ rmap_write_protect_lockless(rmapp, i);
}
+ kvm_mmu_rcu_free_page_end(kvm);

kvm_flush_remote_tlbs(kvm);
- spin_unlock(&kvm->mmu_lock);
}

#define BATCH_ZAP_PAGES 10
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 85405f1..2a66c57 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -137,4 +137,10 @@ static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)

rcu_read_unlock();
}
+
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot);
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask);
#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2caeb9..4983eb3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3531,8 +3531,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
memset(dirty_bitmap_buffer, 0, n);

- spin_lock(&kvm->mmu_lock);
-
+ kvm_mmu_rcu_free_page_begin(kvm);
for (i = 0; i < n / sizeof(long); i++) {
unsigned long mask;
gfn_t offset;
@@ -3542,17 +3541,25 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)

is_dirty = true;

+ /*
+ * xchg acts as a full barrier that ensures
+ * clearing dirty bitmap before read rmap.
+ *
+ * See the comments in set_spte().
+ */
mask = xchg(&dirty_bitmap[i], 0);
+
dirty_bitmap_buffer[i] = mask;

offset = i * BITS_PER_LONG;
- kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
+ kvm_mmu_write_protect_pt_masked_lockless(kvm, memslot,
+ offset, mask);
}
+ kvm_mmu_rcu_free_page_end(kvm);
+
if (is_dirty)
kvm_flush_remote_tlbs(kvm);

- spin_unlock(&kvm->mmu_lock);
-
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
@@ -7088,7 +7095,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
* not be created until the end of the logging.
*/
if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
- kvm_mmu_slot_remove_write_access(kvm, mem->slot);
+ kvm_mmu_slot_remove_write_access_lockless(kvm, mem->slot);
}

void kvm_arch_flush_shadow_all(struct kvm *kvm)
--
1.8.1.4

2013-07-30 13:09:59

by Xiao Guangrong

Subject: [PATCH 08/12] KVM: MMU: introduce nulls desc

This is like a nulls list: we use the pte-list pointer as the nulls value,
which helps us detect whether the "desc" has been moved to another rmap, in
which case we re-walk the rmap.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 35 ++++++++++++++++++++++++++++-------
1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3013bb1..36caf6a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -918,6 +918,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
return level - 1;
}

+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
+{
+ unsigned long marker;
+
+ marker = (unsigned long)pte_list | 1UL;
+ desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+ return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+ return (unsigned long *)((unsigned long)desc & ~1);
+}
+
static int __find_first_free(struct pte_list_desc *desc)
{
int i;
@@ -956,7 +974,7 @@ static int count_spte_number(struct pte_list_desc *desc)

first_free = __find_first_free(desc);

- for (desc_num = 0; desc->more; desc = desc->more)
+ for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
desc_num++;

return first_free + desc_num * PTE_LIST_EXT;
@@ -991,6 +1009,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
+ desc_mark_nulls(pte_list, desc);
*pte_list = (unsigned long)desc | 1;
return 1;
}
@@ -1036,7 +1055,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
/*
* Only one entry existing but still use a desc to store it?
*/
- WARN_ON(!next_desc);
+ WARN_ON(desc_is_a_nulls(next_desc));

mmu_free_pte_list_desc(first_desc);
first_desc = next_desc;
@@ -1050,7 +1069,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
* Only one entry in this desc, move the entry to the head
* then the desc can be freed.
*/
- if (!first_desc->sptes[1] && !first_desc->more) {
+ if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
mmu_free_pte_list_desc(first_desc);
}
@@ -1077,7 +1096,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)

rmap_printk("pte_list_remove: %p many->many\n", spte);
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc) {
+ while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
if (desc->sptes[i] == spte) {
pte_list_desc_remove_entry(pte_list,
@@ -1103,11 +1122,13 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
return fn((u64 *)*pte_list);

desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc) {
+ while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
fn(desc->sptes[i]);
desc = desc->more;
}
+
+ WARN_ON(desc_get_nulls_value(desc) != pte_list);
}

static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1200,7 +1221,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
*/
static u64 *rmap_get_next(struct rmap_iterator *iter)
{
- if (iter->desc) {
+ if (iter->desc && !desc_is_a_nulls(iter->desc)) {
if (iter->pos < PTE_LIST_EXT - 1) {
u64 *sptep;

@@ -1212,7 +1233,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)

iter->desc = iter->desc->more;

- if (iter->desc) {
+ if (!desc_is_a_nulls(iter->desc)) {
iter->pos = 0;
/* desc->sptes[0] cannot be NULL */
return iter->desc->sptes[iter->pos];
--
1.8.1.4

2013-07-30 13:10:36

by Xiao Guangrong

Subject: [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

It is easy if the handler runs in vcpu context; in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to stop shadow pages from being freed. But here we are in
ioctl context, and the paths we are optimizing have a heavy workload, so
disabling interrupts is not good for system performance.

We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page) and
use call_rcu() to free shadow pages while that indicator is set. Setting and
clearing the indicator is protected by the slots-lock, so it need not be
atomic and does not hurt performance or scalability.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 +++++-
arch/x86/kvm/mmu.c | 23 +++++++++++++++++++++++
arch/x86/kvm/mmu.h | 22 ++++++++++++++++++++++
3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 531f47c..dc842b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -226,7 +226,10 @@ struct kvm_mmu_page {
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
unsigned long mmu_valid_gen;

- DECLARE_BITMAP(unsync_child_bitmap, 512);
+ union {
+ DECLARE_BITMAP(unsync_child_bitmap, 512);
+ struct rcu_head rcu;
+ };

#ifdef CONFIG_X86_32
/*
@@ -545,6 +548,7 @@ struct kvm_arch {
*/
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+ bool rcu_free_shadow_page;

struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f8fc0cc..7f3391f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2322,6 +2322,22 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
return ret;
}

+static void free_pages_rcu(struct rcu_head *head)
+{
+ struct kvm_mmu_page *next, *sp;
+
+ sp = container_of(head, struct kvm_mmu_page, rcu);
+ while (sp) {
+ if (!list_empty(&sp->link))
+ next = list_first_entry(&sp->link,
+ struct kvm_mmu_page, link);
+ else
+ next = NULL;
+ kvm_mmu_free_page(sp);
+ sp = next;
+ }
+}
+
static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
{
@@ -2342,6 +2358,13 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
*/
kvm_flush_remote_tlbs(kvm);

+ if (kvm->arch.rcu_free_shadow_page) {
+ sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
+ list_del_init(invalid_list);
+ call_rcu(&sp->rcu, free_pages_rcu);
+ return;
+ }
+
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
kvm_mmu_free_page(sp);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 5b59c57..85405f1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -115,4 +115,26 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
}

void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
+
+/*
+ * The caller should ensure that these two functions should be
+ * serially called.
+ */
+static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
+{
+ rcu_read_lock();
+
+ kvm->arch.rcu_free_shadow_page = true;
+ /* Set the indicator before access shadow page. */
+ smp_mb();
+}
+
+static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
+{
+ /* Make sure that access shadow page has finished. */
+ smp_mb();
+ kvm->arch.rcu_free_shadow_page = false;
+
+ rcu_read_unlock();
+}
#endif
--
1.8.1.4

2013-07-30 13:12:09

by Xiao Guangrong

Subject: Re: [RFC PATCH 00/12] KVM: MMU: locklessly write-protect

On 07/30/2013 09:01 PM, Xiao Guangrong wrote:
> Background
> ==========
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we need to
> hold mmu-lock which is also required by vcpu to fix its page table fault and
> mmu-notifier when host page is being changed. In the extreme cpu / memory used
> guest, it becomes a scalability issue.
>
> This patchset introduces a way to locklessly write-protect guest memory.
>
> Idea
> ==========
> There are the challenges we meet and the ideas to resolve them.
>
> 1) How to locklessly walk rmap?
> The first idea we got to prevent "desc" being freed when we are walking the
> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
> it updates the rmap really frequently.
>
> So we uses SLAB_DESTROY_BY_RCU to manage "desc" instead, it allows the object
> to be reused more quickly. We also store a "nulls" in the last "desc"
> (desc->more) which can help us to detect whether the "desc" is moved to anther
> rmap then we can re-walk the rmap if that happened. I learned this idea from
> nulls-list.
>
> Another issue is, when a spte is deleted from the "desc", another spte in the
> last "desc" will be moved to this position to replace the deleted one. If the
> deleted one has been accessed and we do not access the replaced one, the
> replaced one is missed when we do lockless walk.
> To fix this case, we do not backward move the spte, instead, we forward move
> the entry: when a spte is deleted, we move the entry in the first desc to that
> position.
>
> 2) How to locklessly access shadow page table?
> It is easy if the handler is in the vcpu context, in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
> disable interrupt to stop shadow page be freed. But we are on the ioctl context
> and the paths we are optimizing for have heavy workload, disabling interrupt is
> not good for the system performance.
>
> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
> indicator are protected by slot-lock, so it need not be atomic and does not
> hurt the performance and the scalability.
>
> 3) How to locklessly write-protect guest memory?
> Currently, there are two behaviors when we write-protect guest memory, one is
> clearing the Writable bit on spte and the another one is dropping spte when it
> points to large page. The former is easy we only need to atomicly clear a bit
> but the latter is hard since we need to remove the spte from rmap. so we unify
> these two behaviors that only make the spte readonly. Making large spte
> readonly instead of nonpresent is also good for reducing jitter.
>
> And we need to pay more attention on the order of making spte writable, adding
> spte into rmap and setting the corresponding bit on dirty bitmap since
> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty bitmap,
> we should ensure the writable spte can be found in rmap before the dirty bitmap
> is visible. Otherwise, we cleared the dirty bitmap and failed to write-protect
> the page.
>
> Performance result
> ====================
> Host: CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz x 12
> Mem: 36G
>
> The benchmark i used and will be attached:

The benchmarks are attached to this mail.

Thanks!


Attachments:
migrate-perf.tar.bz2 (80.20 kB)
mmtest.tar.bz2 (16.15 kB)

2013-07-30 13:26:48

by Paolo Bonzini

Subject: Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

Il 30/07/2013 15:02, Xiao Guangrong ha scritto:
> Make sure we can see the writable spte before the dirt bitmap is visible
>
> We do this is for kvm_vm_ioctl_get_dirty_log() write-protects the spte based
> on the dirty bitmap, we should ensure the writable spte can be found in rmap
> before the dirty bitmap is visible. Otherwise, we cleared the dirty bitmap and
> failed to write-protect the page
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 35d4b50..0fe56ad 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> }
> }
>
> - if (pte_access & ACC_WRITE_MASK)
> - mark_page_dirty(vcpu->kvm, gfn);
> -
> set_pte:
> if (mmu_spte_update(sptep, spte))
> kvm_flush_remote_tlbs(vcpu->kvm);
> +
> + if (pte_access & ACC_WRITE_MASK)
> + mark_page_dirty(vcpu->kvm, gfn);
> done:
> return ret;
> }
>

What about this comment above:

/*
* Optimization: for pte sync, if spte was writable the hash
* lookup is unnecessary (and expensive). Write protection
* is responsibility of mmu_get_page / kvm_sync_page.
* Same reasoning can be applied to dirty page accounting.
*/
if (!can_unsync && is_writable_pte(*sptep))
goto set_pte;

if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {


?

Should it be changed to

if (!can_unsync && is_writable_pte(*sptep))
pte_access &= ~ACC_WRITE_MASK; /* do not mark dirty */

else if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {

?

Thanks,

Paolo

2013-07-30 13:28:01

by Paolo Bonzini

Subject: Re: [PATCH 05/12] KVM: MMU: add spte into rmap before logging dirty page

Il 30/07/2013 15:02, Xiao Guangrong ha scritto:
> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty
> bitmap, we should ensure the writable spte can be found in rmap before the
> dirty bitmap is visible. Otherwise, we cleared the dirty bitmap and failed
> to write-protect the page
>
> It need the memory barrier to prevent out-of-order that will be added in the
> later patch

Do you mean that the later patch will also introduce a memory barrier?

Paolo

> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 25 ++++++++++---------------
> 1 file changed, 10 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 0fe56ad..58283bf 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2425,6 +2425,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> {
> u64 spte;
> int ret = 0;
> + bool remap = is_rmap_spte(*sptep);
>
> if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access))
> return 0;
> @@ -2490,6 +2491,14 @@ set_pte:
> if (mmu_spte_update(sptep, spte))
> kvm_flush_remote_tlbs(vcpu->kvm);
>
> + if (!remap) {
> + if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
> + rmap_recycle(vcpu, sptep, gfn);
> +
> + if (level > PT_PAGE_TABLE_LEVEL)
> + ++vcpu->kvm->stat.lpages;
> + }
> +
> if (pte_access & ACC_WRITE_MASK)
> mark_page_dirty(vcpu->kvm, gfn);
> done:
> @@ -2501,9 +2510,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> int level, gfn_t gfn, pfn_t pfn, bool speculative,
> bool host_writable)
> {
> - int was_rmapped = 0;
> - int rmap_count;
> -
> pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
> *sptep, write_fault, gfn);
>
> @@ -2525,8 +2531,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> spte_to_pfn(*sptep), pfn);
> drop_spte(vcpu->kvm, sptep);
> kvm_flush_remote_tlbs(vcpu->kvm);
> - } else
> - was_rmapped = 1;
> + }
> }
>
> if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
> @@ -2544,16 +2549,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> is_large_pte(*sptep)? "2MB" : "4kB",
> *sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
> *sptep, sptep);
> - if (!was_rmapped && is_large_pte(*sptep))
> - ++vcpu->kvm->stat.lpages;
> -
> - if (is_shadow_present_pte(*sptep)) {
> - if (!was_rmapped) {
> - rmap_count = rmap_add(vcpu, sptep, gfn);
> - if (rmap_count > RMAP_RECYCLE_THRESHOLD)
> - rmap_recycle(vcpu, sptep, gfn);
> - }
> - }
>
> kvm_release_pfn_clean(pfn);
> }
>

2013-07-31 07:27:33

by Xiao Guangrong

Subject: Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

On 07/30/2013 09:26 PM, Paolo Bonzini wrote:
> Il 30/07/2013 15:02, Xiao Guangrong ha scritto:
>> Make sure we can see the writable spte before the dirt bitmap is visible
>>
>> We do this is for kvm_vm_ioctl_get_dirty_log() write-protects the spte based
>> on the dirty bitmap, we should ensure the writable spte can be found in rmap
>> before the dirty bitmap is visible. Otherwise, we cleared the dirty bitmap and
>> failed to write-protect the page
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 35d4b50..0fe56ad 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>> }
>> }
>>
>> - if (pte_access & ACC_WRITE_MASK)
>> - mark_page_dirty(vcpu->kvm, gfn);
>> -
>> set_pte:
>> if (mmu_spte_update(sptep, spte))
>> kvm_flush_remote_tlbs(vcpu->kvm);
>> +
>> + if (pte_access & ACC_WRITE_MASK)
>> + mark_page_dirty(vcpu->kvm, gfn);
>> done:
>> return ret;
>> }
>>
>
> What about this comment above:
>
> /*
> * Optimization: for pte sync, if spte was writable the hash
> * lookup is unnecessary (and expensive). Write protection
> * is responsibility of mmu_get_page / kvm_sync_page.

This comment means that no sync shadow page has been created if the spte is
still writable, because adding a sync page requires write-protecting all
sptes that point to this page. So we can keep the spte writable.

I think it is better to check the SPTE_MMU_WRITEABLE bit instead of
PT_WRITABLE_MASK, since the latter bit can be cleared by dirty logging; that
can be a separate patch, I think.

> * Same reasoning can be applied to dirty page accounting.

This comment means that if the spte is writable, the corresponding bit in
the dirty bitmap should already have been set.

Thanks for your reminder; I think this comment should be dropped. Now we need
to call mark_page_dirty() whenever the spte is updated to writable. Otherwise
this will happen:

VCPU 0 VCPU 1
Clear dirty bit on the bitmap
Read the spte, it is writable
write the spte
update the spte, keep it as writable
and do not call mark_page_dirty().
Flush tlb

Then vcpu 1 can continue to write the page but fail to set the bit on the bitmap.

> */
> if (!can_unsync && is_writable_pte(*sptep))
> goto set_pte;
>
> if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
>
>
> ?
>
> Should it be changed to
>
> if (!can_unsync && is_writable_pte(*sptep))
> pte_access &= ~ACC_WRITE_MASK; /* do not mark dirty */

Yes, this can avoid the issue above.

But there is only a small window between syncing the spte and locklessly
write-protecting it (since the sptep is already writable), so i think we'd better
keep the spte writable to speed up the normal case. :)

2013-07-31 07:34:06

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 05/12] KVM: MMU: add spte into rmap before logging dirty page

On 07/30/2013 09:27 PM, Paolo Bonzini wrote:
> Il 30/07/2013 15:02, Xiao Guangrong ha scritto:
>> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty
>> bitmap, we should ensure the writable spte can be found in rmap before the
>> dirty bitmap is visible. Otherwise, we cleared the dirty bitmap and failed
>> to write-protect the page
>>
>> It need the memory barrier to prevent out-of-order that will be added in the
>> later patch
>
> Do you mean that the later patch will also introduce a memory barrier?

No. Sorry for the confusion. I mean we miss the memory barrier in this patch
and will fix it in the later patch where we introduce the lockless
write-protection.

The memory barrier is added in
[PATCH 11/12] KVM: MMU: locklessly write-protect the page:

+	/*
+	 * We should put the sptep into rmap before dirty log
+	 * otherwise the lockless spte write-protect path will
+	 * clear the dirty bit map but fail to find the spte.
+	 *
+	 * See the comments in kvm_vm_ioctl_get_dirty_log().
+	 */
+	smp_wmb();
+
 	if (pte_access & ACC_WRITE_MASK)

and the barrier on the other side is:
+	/*
+	 * xchg acts as a full barrier that ensures
+	 * clearing dirty bitmap before read rmap.
+	 *
+	 * See the comments in set_spte().
+	 */
 	mask = xchg(&dirty_bitmap[i], 0);

2013-08-02 14:55:59

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
> Currently, kvm zaps the large spte if write-protected is needed, the later
> read can fault on that spte. Actually, we can make the large spte readonly
> instead of making them un-present, the page fault caused by read access can
> be avoided
>
> The idea is from Avi:
> | As I mentioned before, write-protecting a large spte is a good idea,
> | since it moves some work from protect-time to fault-time, so it reduces
> | jitter. This removes the need for the return value.
>
> [
> It has fixed the issue reported in 6b73a9606 by stopping fast page fault
> marking the large spte to writable
> ]

Xiao,

Can you please write a comment explaining why are the problems
with shadow vs large read-only sptes (can't recall anymore),
and then why it is now safe to do it.

Comments below.

> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 36 +++++++++++++++++-------------------
> 1 file changed, 17 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cf163ca..35d4b50 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1181,8 +1181,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>
> /*
> * Write-protect on the specified @sptep, @pt_protect indicates whether
> - * spte writ-protection is caused by protecting shadow page table.
> - * @flush indicates whether tlb need be flushed.
> + * spte write-protection is caused by protecting shadow page table.
> *
> * Note: write protection is difference between drity logging and spte
> * protection:
> @@ -1191,10 +1190,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> * - for spte protection, the spte can be writable only after unsync-ing
> * shadow page.
> *
> - * Return true if the spte is dropped.
> + * Return true if tlb need be flushed.
> */
> -static bool
> -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
> {
> u64 spte = *sptep;
>
> @@ -1204,17 +1202,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
>
> rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
>
> - if (__drop_large_spte(kvm, sptep)) {
> - *flush |= true;
> - return true;
> - }
> -
> if (pt_protect)
> spte &= ~SPTE_MMU_WRITEABLE;
> spte = spte & ~PT_WRITABLE_MASK;
>
> - *flush |= mmu_spte_update(sptep, spte);
> - return false;
> + return mmu_spte_update(sptep, spte);
> }
>
> static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> @@ -1226,11 +1218,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
>
> for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
> BUG_ON(!(*sptep & PT_PRESENT_MASK));
> - if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
> - sptep = rmap_get_first(*rmapp, &iter);
> - continue;
> - }
>
> + flush |= spte_write_protect(kvm, sptep, pt_protect);
> sptep = rmap_get_next(&iter);
> }
>
> @@ -2701,6 +2690,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
> break;
> }
>
> + drop_large_spte(vcpu, iterator.sptep);
> +
> if (!is_shadow_present_pte(*iterator.sptep)) {
> u64 base_addr = iterator.addr;
>
> @@ -2855,7 +2846,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> * - false: let the real page fault path to fix it.
> */
> static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> - u32 error_code)
> + u32 error_code, bool force_pt_level)
> {
> struct kvm_shadow_walk_iterator iterator;
> struct kvm_mmu_page *sp;
> @@ -2884,6 +2875,13 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> goto exit;
>
> /*
> + * Can not map the large spte to writable if the page is dirty
> + * logged.
> + */
> + if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
> + goto exit;
> +

It is not safe to derive slot->dirty_bitmap like this:
since dirty log is enabled via RCU update, "is dirty bitmap enabled"
info could be stale by the time you check it here via the parameter,
so you can instantiate a large spte (because force_pt_level == false),
while you should not.

2013-08-02 15:42:28

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 03/12] KVM: MMU: lazily drop large spte


On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti <[email protected]> wrote:

> On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
>> Currently, kvm zaps the large spte if write-protected is needed, the later
>> read can fault on that spte. Actually, we can make the large spte readonly
>> instead of making them un-present, the page fault caused by read access can
>> be avoided
>>
>> The idea is from Avi:
>> | As I mentioned before, write-protecting a large spte is a good idea,
>> | since it moves some work from protect-time to fault-time, so it reduces
>> | jitter. This removes the need for the return value.
>>
>> [
>> It has fixed the issue reported in 6b73a9606 by stopping fast page fault
>> marking the large spte to writable
>> ]
>
> Xiao,
>
> Can you please write a comment explaining why are the problems
> with shadow vs large read-only sptes (can't recall anymore),
> and then why it is now safe to do it.

Hi Marcelo,

Thanks for your review. Yes. The bug reported in 6b73a9606 is: in this patch we mark
the large spte as readonly when the pages are dirty logged, and the readonly spte can
be set to writable by fast page fault; but that path failed to check dirty logging, so
it would set the large spte to writable but only set the first page in the dirty
bitmap.

For example:

1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
is writable.

2): KVM dirty log 0 ~ 2M, then set SPTE to readonly

3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.

Then 4K ~ 2M memory is not dirty logged.

In this version, we make fast page fault not mark a large spte writable if its pages
are dirty logged. But it is still not safe, as you pointed out.

>>
>>
>> /*
>> + * Can not map the large spte to writable if the page is dirty
>> + * logged.
>> + */
>> + if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
>> + goto exit;
>> +
>
> It is not safe to derive slot->dirty_bitmap like this:
> since dirty log is enabled via RCU update, "is dirty bitmap enabled"
> info could be stale by the time you check it here via the parameter,
> so you can instantiate a large spte (because force_pt_level == false),
> while you should not.

Good catch! This is true even if we enable dirty log under the protection
of mmu lock.

How about letting the fast page fault path only fix the small spte, that is, changing
the code to:

    if (sp->role.level > PT_PAGE_TABLE_LEVEL)
        goto exit;
?
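
Relative to the hunk quoted above, the proposal would amount to something like this
(a sketch only, not a tested patch; the comment wording is an assumption):

    -	if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
    +	/*
    +	 * The fast path never fixes a large spte: whether dirty logging is
    +	 * enabled (or about to be enabled) cannot be derived reliably at
    +	 * this point, so fall back to the slow path for anything above 4K.
    +	 */
    +	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
     		goto exit;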

2013-08-02 22:53:16

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

On Fri, Aug 02, 2013 at 11:42:19PM +0800, Xiao Guangrong wrote:
>
> On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti <[email protected]> wrote:
>
> > On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
> >> Currently, kvm zaps the large spte if write-protected is needed, the later
> >> read can fault on that spte. Actually, we can make the large spte readonly
> >> instead of making them un-present, the page fault caused by read access can
> >> be avoided
> >>
> >> The idea is from Avi:
> >> | As I mentioned before, write-protecting a large spte is a good idea,
> >> | since it moves some work from protect-time to fault-time, so it reduces
> >> | jitter. This removes the need for the return value.
> >>
> >> [
> >> It has fixed the issue reported in 6b73a9606 by stopping fast page fault
> >> marking the large spte to writable
> >> ]
> >
> > Xiao,
> >
> > Can you please write a comment explaining why are the problems
> > with shadow vs large read-only sptes (can't recall anymore),
> > and then why it is now safe to do it.
>
> Hi Marcelo,
>
> Thanks for your review. Yes. The bug reported in 6b73a9606 is, in this patch,
> we mark the large spte as readonly when the pages are dirt logged and the
> readonly spte can be set to writable by fast page fault, but on that path, it failed
> to check dirty logging, so it will set the large spte to writable but only set the first
> page to the dirty bitmap.
>
> For example:
>
> 1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
> is writable.
>
> 2): KVM dirty log 0 ~ 2M, then set SPTE to readonly
>
> 3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.
>
> Then 4K ~ 2M memory is not dirty logged.

Ok, can you write a self-contained summary of read-only large sptes (when they
are created, when destroyed, from which point they can't be created, etc), and
the interaction with shadow write protection and creation of writeable sptes?
It's easy to get lost.

> In this version, we let fast page fault do not mark large spte to writable if
> its page are dirty logged. But it is still not safe as you pointed out.
>
> >>
> >>
> >> /*
> >> + * Can not map the large spte to writable if the page is dirty
> >> + * logged.
> >> + */
> >> + if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
> >> + goto exit;
> >> +
> >
> > It is not safe to derive slot->dirty_bitmap like this:
> > since dirty log is enabled via RCU update, "is dirty bitmap enabled"
> > info could be stale by the time you check it here via the parameter,
> > so you can instantiate a large spte (because force_pt_level == false),
> > while you should not.
>
> Good catch! This is true even if we enable dirty log under the protection
> of mmu lock.
>
> How about let the fault page fault only fix the small spte, that is changing
> the code to:
> if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> goto exit;
> ?

Sure.

2013-08-02 22:56:09

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 03/12] KVM: MMU: lazily drop large spte


On Aug 3, 2013, at 4:27 AM, Marcelo Tosatti <[email protected]> wrote:

> On Fri, Aug 02, 2013 at 11:42:19PM +0800, Xiao Guangrong wrote:
>>
>> On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti <[email protected]> wrote:
>>
>>> On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
>>>> Currently, kvm zaps the large spte if write-protected is needed, the later
>>>> read can fault on that spte. Actually, we can make the large spte readonly
>>>> instead of making them un-present, the page fault caused by read access can
>>>> be avoided
>>>>
>>>> The idea is from Avi:
>>>> | As I mentioned before, write-protecting a large spte is a good idea,
>>>> | since it moves some work from protect-time to fault-time, so it reduces
>>>> | jitter. This removes the need for the return value.
>>>>
>>>> [
>>>> It has fixed the issue reported in 6b73a9606 by stopping fast page fault
>>>> marking the large spte to writable
>>>> ]
>>>
>>> Xiao,
>>>
>>> Can you please write a comment explaining why are the problems
>>> with shadow vs large read-only sptes (can't recall anymore),
>>> and then why it is now safe to do it.
>>
>> Hi Marcelo,
>>
>> Thanks for your review. Yes. The bug reported in 6b73a9606 is, in this patch,
>> we mark the large spte as readonly when the pages are dirt logged and the
>> readonly spte can be set to writable by fast page fault, but on that path, it failed
>> to check dirty logging, so it will set the large spte to writable but only set the first
>> page to the dirty bitmap.
>>
>> For example:
>>
>> 1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
>> is writable.
>>
>> 2): KVM dirty log 0 ~ 2M, then set SPTE to readonly
>>
>> 3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.
>>
>> Then 4K ~ 2M memory is not dirty logged.
>
> Ok can you write a self contained summary of read-only large sptes (when
> they are created, when destroyed, from which point they can't be created,
> etc), and the interaction with shadow write protection and creation of
> writeable sptes?
> Its easy to get lost.

Okay, will do.

2013-08-03 05:09:50

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect

On Tue, 30 Jul 2013 21:01:58 +0800
Xiao Guangrong <[email protected]> wrote:

> Background
> ==========
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we need to
> hold mmu-lock which is also required by vcpu to fix its page table fault and
> mmu-notifier when host page is being changed. In the extreme cpu / memory used
> guest, it becomes a scalability issue.
>
> This patchset introduces a way to locklessly write-protect guest memory.

Nice improvements!

If I read the patch set correctly, this work contains the following changes:

Cleanups:
Patch 1 and patch 12.

Lazy large page dropping for dirty logging:
Patch 2-3.
Patch 2 is preparatory to patch 3.

This does not look like an RFC if you address Marcelo's comment.
Any reason to include this in an RFC patch set?

Making remote TLBs flushable outside of mmu_lock for dirty logging:
Patch 6.

This is nice. I'm locally using a similar patch for my work, but yours
is much cleaner and better. I hope this will get merged soon.

New Pte-list handling:
Patch 7-9.

Still reading the details.

RCU-based lockless write protection.
Patch 10-11.

If I understand RCU correctly, the current implementation has a problem:
read-side critical sections can become too long.

See the following LWN's article:
"Sleepable RCU"
https://lwn.net/Articles/202847/

Especially, kvm_mmu_slot_remove_write_access() can take hundreds of
milliseconds, or even a few seconds for guests using shadow paging.
Is it possible to break the read-side critical section after protecting
some pages? -- I guess so.

Anyway, I want to see the following non-RFC quality patches get merged first:
- Lazy large page dropping for dirty logging:
- Making remote TLBs flushable outside of mmu_lock for dirty logging

As you are doing in patch 11, the latter can eliminate the TLB flushes before
cond_resched_lock(). So this alone is an optimization, and since my work is
based on this TLB flush-less lock breaking, I would appreciate it if you make this
change first in your clean way.

The remaining patches, pte-list refactoring and lock-less ones, also look
interesting, but I need to read more to understand them.

Thanks for the nice work!
Takuya


>
> Idea
> ==========
> There are the challenges we meet and the ideas to resolve them.
>
> 1) How to locklessly walk rmap?
> The first idea we got to prevent "desc" being freed when we are walking the
> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
> it updates the rmap really frequently.
>
> So we uses SLAB_DESTROY_BY_RCU to manage "desc" instead, it allows the object
> to be reused more quickly. We also store a "nulls" in the last "desc"
> (desc->more) which can help us to detect whether the "desc" is moved to anther
> rmap then we can re-walk the rmap if that happened. I learned this idea from
> nulls-list.
>
> Another issue is, when a spte is deleted from the "desc", another spte in the
> last "desc" will be moved to this position to replace the deleted one. If the
> deleted one has been accessed and we do not access the replaced one, the
> replaced one is missed when we do lockless walk.
> To fix this case, we do not backward move the spte, instead, we forward move
> the entry: when a spte is deleted, we move the entry in the first desc to that
> position.
>
> 2) How to locklessly access shadow page table?
> It is easy if the handler is in the vcpu context, in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
> disable interrupt to stop shadow page be freed. But we are on the ioctl context
> and the paths we are optimizing for have heavy workload, disabling interrupt is
> not good for the system performance.
>
> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
> indicator are protected by slot-lock, so it need not be atomic and does not
> hurt the performance and the scalability.
>
> 3) How to locklessly write-protect guest memory?
> Currently, there are two behaviors when we write-protect guest memory, one is
> clearing the Writable bit on spte and the another one is dropping spte when it
> points to large page. The former is easy we only need to atomicly clear a bit
> but the latter is hard since we need to remove the spte from rmap. so we unify
> these two behaviors that only make the spte readonly. Making large spte
> readonly instead of nonpresent is also good for reducing jitter.
>
> And we need to pay more attention on the order of making spte writable, adding
> spte into rmap and setting the corresponding bit on dirty bitmap since
> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty bitmap,
> we should ensure the writable spte can be found in rmap before the dirty bitmap
> is visible. Otherwise, we cleared the dirty bitmap and failed to write-protect
> the page.
>
> Performance result
> ====================
> Host: CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz x 12
> Mem: 36G
>
> The benchmark i used and will be attached:
> a) kernbench
> b) migrate-perf
> it emulates guest migration
> c) mmtest
> it repeatedly writes the memory and measures the time and is used to
> generate memory access in the guest which is being migrated
> d) Qemu monitor command to implement guest live migration
> the script can be found in migrate-perf.
>
>
> 1) First, we use kernbench to benchmark the performance with non-write-protection
> case to detect the possible regression:
>
> EPT enabled: Base: 84.05 After the patch: 83.53
> EPT disabled: Base: 142.57 After the patch: 141.70
>
> No regression and the optimization may come from lazily drop large spte.
>
> 2) Benchmark the performance of get dirty page
> (./migrate-perf -c 12 -m 3000 -t 20)
>
> Base: Run 20 times, Avg time:24813809 ns.
> After the patch: Run 20 times, Avg time:8371577 ns.
>
> It improves +196%
>
> 3) There is the result of Live Migration:
> 3.1) Less vcpus, less memory and less dirty page generated
> (
> Guest config: MEM_SIZE=7G VCPU_NUM=6
> The workload in migrated guest:
> ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 3000 -c 30 -t 60 > result &"
> )
>
> Live Migration time (ms) Benchmark (ns)
> ----------------------------------------+-------------+---------+
> EPT | Baseline | 21638 | 266601028 |
> + -------------------------------+-------------+---------+
> | After | 21110 +2.5% | 264966696 +0.6% |
> ----------------------------------------+-------------+---------+
> Shadow | Baseline | 22542 | 271969284 | |
> +----------+---------------------+-------------+---------+
> | After | 21641 +4.1% | 270485511 +0.5% |
> -------+----------+---------------------------------------------+
>
> 3.2) More vcpus, more memory and less dirty page generated
> (
> Guest config: MEM_SIZE=25G VCPU_NUM=12
> The workload in migrated guest:
> ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 30 > result &"
> )
>
> Live Migration time (ms) Benchmark (ns)
> ----------------------------------------+-------------+---------+
> EPT | Baseline | 72773 | 1278228350 |
> + -------------------------------+-------------+---------+
> | After | 70516 +3.2% | 1266581587 +0.9% |
> ----------------------------------------+-------------+---------+
> Shadow | Baseline | 74198 | 1323180090 | |
> +----------+---------------------+-------------+---------+
> | After | 64948 +14.2% | 1299283302 +1.8% |
> -------+----------+---------------------------------------------+
>
> 3.3) Less vcpus, more memory and huge dirty page generated
> (
> Guest config: MEM_SIZE=25G VCPU_NUM=6
> The workload in migrated guest:
> ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 200 > result &"
> )
>
> Live Migration time (ms) Benchmark (ns)
> ----------------------------------------+-------------+---------+
> EPT | Baseline | 267473 | 1224657502 |
> + -------------------------------+-------------+---------+
> | After | 267374 +0.03% | 1221520513 +0.6% |
> ----------------------------------------+-------------+---------+
> Shadow | Baseline | 369999 | 1712004428 | |
> +----------+---------------------+-------------+---------+
> | After | 335737 +10.2% | 1556065063 +10.2% |
> -------+----------+---------------------------------------------+
>
> For the case of 3.3), EPT gets small benefit, the reason is only the first
> time guest writes memory need take mmu-lock to mark spte from nonpresent to
> present. Other writes cost lots of time to trigger the page fault due to
> write-protection which are fixed by fast page fault which need not take
> mmu-lock.
>
> Xiao Guangrong (12):
> KVM: MMU: remove unused parameter
> KVM: MMU: properly check last spte in fast_page_fault()
> KVM: MMU: lazily drop large spte
> KVM: MMU: log dirty page after marking spte writable
> KVM: MMU: add spte into rmap before logging dirty page
> KVM: MMU: flush tlb if the spte can be locklessly modified
> KVM: MMU: redesign the algorithm of pte_list
> KVM: MMU: introduce nulls desc
> KVM: MMU: introduce pte-list lockless walker
> KVM: MMU: allow locklessly access shadow page table out of vcpu thread
> KVM: MMU: locklessly write-protect the page
> KVM: MMU: clean up spte_write_protect
>
> arch/x86/include/asm/kvm_host.h | 10 +-
> arch/x86/kvm/mmu.c | 442 ++++++++++++++++++++++++++++------------
> arch/x86/kvm/mmu.h | 28 +++
> arch/x86/kvm/x86.c | 19 +-
> 4 files changed, 356 insertions(+), 143 deletions(-)
>
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


--
Takuya Yoshikawa <[email protected]>

2013-08-04 14:15:35

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect


On Aug 3, 2013, at 1:09 PM, Takuya Yoshikawa <[email protected]> wrote:

> On Tue, 30 Jul 2013 21:01:58 +0800
> Xiao Guangrong <[email protected]> wrote:
>
>> Background
>> ==========
>> Currently, when mark memslot dirty logged or get dirty page, we need to
>> write-protect large guest memory, it is the heavy work, especially, we need to
>> hold mmu-lock which is also required by vcpu to fix its page table fault and
>> mmu-notifier when host page is being changed. In the extreme cpu / memory used
>> guest, it becomes a scalability issue.
>>
>> This patchset introduces a way to locklessly write-protect guest memory.
>
> Nice improvements!

Thank you!

>
> If I read the patch set correctly, this work contains the following changes:
>
> Cleanups:
> Patch 1 and patch 12.
>
> Lazy large page dropping for dirty logging:
> Patch 2-3.
> Patch 2 is preparatory to patch 3.
>
> This does not look like an RFC if you address Marcelo's comment.
> Any reason to include this in an RFC patch set?

Right, these two patches are not really RFC since you guys have reviewed the
idea.

The reason i put these into this patchset is that they are also preparatory work for
implementing lockless write-protection, since after that we do not need to remove a
spte from the rmap any more (we only need to write-protect it).

>
> Making remote TLBs flushable outside of mmu_lock for dirty logging:
> Patch 6.
>
> This is nice. I'm locally using a similar patch for my work, but yours
> is much cleaner and better. I hope this will get merged soon.

Thanks!

>
> New Pte-list handling:
> Patch 7-9.
>
> Still reading the details.
>
> RCU-based lockless write protection.
> Patch 10-11.
>
> If I understand RCU correctly, the current implementation has a problem:
> read-side critical sections can become too long.
>
> See the following LWN's article:
> "Sleepable RCU"
> https://lwn.net/Articles/202847/
>
> Especially, kvm_mmu_slot_remove_write_access() can take hundreds of
> milliseconds, or even a few seconds for guests using shadow paging.
> Is it possible to break the read-side critical section after protecting
> some pages? -- I guess so.

Yes. We can use that break technique in the code if it is needed, like this:

    if (need_resched()) {
        kvm_use_rcu_free_page_end();
        kvm_use_rcu_free_page_begin();
    }
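
For illustration, a sketch of how such a break could sit inside the write-protection
walk; the loop shape, the cond_resched() call and the iteration helpers are
assumptions made for the example, only kvm_use_rcu_free_page_begin()/end() are names
from this thread:

    kvm_use_rcu_free_page_begin();

    for_each_gfn_to_protect(slot, gfn) {          /* hypothetical iterator */
        write_protect_gfn_locklessly(kvm, gfn);   /* hypothetical helper   */

        if (need_resched()) {
            /* drop out of the read-side section before rescheduling */
            kvm_use_rcu_free_page_end();
            cond_resched();
            kvm_use_rcu_free_page_begin();
        }
    }

    kvm_use_rcu_free_page_end();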

>
> Anyway, I want to see the following non-RFC quality patches get merged first:
> - Lazy large page dropping for dirty logging:
> - Making remote TLBs flushable outside of mmu_lock for dirty logging
>
> As you are doing in patch 11, the latter can eliminate the TLB flushes before
> cond_resched_lock(). So this alone is an optimization, and since my work is
> based on this TLB flush-less lock breaking, I would appriciate if you make this
> change first in your clean way.

Okay, i will move these patches to the front so that the maintainers can merge
them easily.

>
> The remaining patches, pte-list refactoring and lock-less ones, also look
> interesting, but I need to read more to understand them.
>
> Thanks for the nice work!

Thanks for your review and the comments! :)

2013-08-06 14:03:39

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect

Hi Gleb, Paolo, Marcelo, Takuya,

Any comments or further comments? :)

On 07/30/2013 09:01 PM, Xiao Guangrong wrote:
> Background
> ==========
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we need to
> hold mmu-lock which is also required by vcpu to fix its page table fault and
> mmu-notifier when host page is being changed. In the extreme cpu / memory used
> guest, it becomes a scalability issue.
>
> This patchset introduces a way to locklessly write-protect guest memory.
>
> Idea
> ==========
> There are the challenges we meet and the ideas to resolve them.
>
> 1) How to locklessly walk rmap?
> The first idea we got to prevent "desc" being freed when we are walking the
> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
> it updates the rmap really frequently.
>
> So we uses SLAB_DESTROY_BY_RCU to manage "desc" instead, it allows the object
> to be reused more quickly. We also store a "nulls" in the last "desc"
> (desc->more) which can help us to detect whether the "desc" is moved to anther
> rmap then we can re-walk the rmap if that happened. I learned this idea from
> nulls-list.
>
> Another issue is, when a spte is deleted from the "desc", another spte in the
> last "desc" will be moved to this position to replace the deleted one. If the
> deleted one has been accessed and we do not access the replaced one, the
> replaced one is missed when we do lockless walk.
> To fix this case, we do not backward move the spte, instead, we forward move
> the entry: when a spte is deleted, we move the entry in the first desc to that
> position.
>
> 2) How to locklessly access shadow page table?
> It is easy if the handler is in the vcpu context, in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
> disable interrupt to stop shadow page be freed. But we are on the ioctl context
> and the paths we are optimizing for have heavy workload, disabling interrupt is
> not good for the system performance.
>
> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
> indicator are protected by slot-lock, so it need not be atomic and does not
> hurt the performance and the scalability.
>
> 3) How to locklessly write-protect guest memory?
> Currently, there are two behaviors when we write-protect guest memory, one is
> clearing the Writable bit on spte and the another one is dropping spte when it
> points to large page. The former is easy we only need to atomicly clear a bit
> but the latter is hard since we need to remove the spte from rmap. so we unify
> these two behaviors that only make the spte readonly. Making large spte
> readonly instead of nonpresent is also good for reducing jitter.
>
> And we need to pay more attention on the order of making spte writable, adding
> spte into rmap and setting the corresponding bit on dirty bitmap since
> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty bitmap,
> we should ensure the writable spte can be found in rmap before the dirty bitmap
> is visible. Otherwise, we cleared the dirty bitmap and failed to write-protect
> the page.
>
> Performance result
> ====================
> Host: CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz x 12
> Mem: 36G
>
> The benchmark i used and will be attached:
> a) kernbench
> b) migrate-perf
> it emulates guest migration
> c) mmtest
> it repeatedly writes the memory and measures the time and is used to
> generate memory access in the guest which is being migrated
> d) Qemu monitor command to implement guest live migration
> the script can be found in migrate-perf.
>
>
> 1) First, we use kernbench to benchmark the performance with non-write-protection
> case to detect the possible regression:
>
> EPT enabled: Base: 84.05 After the patch: 83.53
> EPT disabled: Base: 142.57 After the patch: 141.70
>
> No regression and the optimization may come from lazily drop large spte.
>
> 2) Benchmark the performance of get dirty page
> (./migrate-perf -c 12 -m 3000 -t 20)
>
> Base: Run 20 times, Avg time:24813809 ns.
> After the patch: Run 20 times, Avg time:8371577 ns.
>
> It improves +196%
>
> 3) There is the result of Live Migration:
> 3.1) Less vcpus, less memory and less dirty page generated
> (
> Guest config: MEM_SIZE=7G VCPU_NUM=6
> The workload in migrated guest:
> ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 3000 -c 30 -t 60 > result &"
> )
>
> Live Migration time (ms) Benchmark (ns)
> ----------------------------------------+-------------+---------+
> EPT | Baseline | 21638 | 266601028 |
> + -------------------------------+-------------+---------+
> | After | 21110 +2.5% | 264966696 +0.6% |
> ----------------------------------------+-------------+---------+
> Shadow | Baseline | 22542 | 271969284 | |
> +----------+---------------------+-------------+---------+
> | After | 21641 +4.1% | 270485511 +0.5% |
> -------+----------+---------------------------------------------+
>
> 3.2) More vcpus, more memory and less dirty page generated
> (
> Guest config: MEM_SIZE=25G VCPU_NUM=12
> The workload in migrated guest:
> ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 30 > result &"
> )
>
> Live Migration time (ms) Benchmark (ns)
> ----------------------------------------+-------------+---------+
> EPT | Baseline | 72773 | 1278228350 |
> + -------------------------------+-------------+---------+
> | After | 70516 +3.2% | 1266581587 +0.9% |
> ----------------------------------------+-------------+---------+
> Shadow | Baseline | 74198 | 1323180090 | |
> +----------+---------------------+-------------+---------+
> | After | 64948 +14.2% | 1299283302 +1.8% |
> -------+----------+---------------------------------------------+
>
> 3.3) Less vcpus, more memory and huge dirty page generated
> (
> Guest config: MEM_SIZE=25G VCPU_NUM=6
> The workload in migrated guest:
> ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 200 > result &"
> )
>
> Live Migration time (ms) Benchmark (ns)
> ----------------------------------------+-------------+---------+
> EPT | Baseline | 267473 | 1224657502 |
> + -------------------------------+-------------+---------+
> | After | 267374 +0.03% | 1221520513 +0.6% |
> ----------------------------------------+-------------+---------+
> Shadow | Baseline | 369999 | 1712004428 | |
> +----------+---------------------+-------------+---------+
> | After | 335737 +10.2% | 1556065063 +10.2% |
> -------+----------+---------------------------------------------+
>
> For the case of 3.3), EPT gets small benefit, the reason is only the first
> time guest writes memory need take mmu-lock to mark spte from nonpresent to
> present. Other writes cost lots of time to trigger the page fault due to
> write-protection which are fixed by fast page fault which need not take
> mmu-lock.
>
> Xiao Guangrong (12):
> KVM: MMU: remove unused parameter
> KVM: MMU: properly check last spte in fast_page_fault()
> KVM: MMU: lazily drop large spte
> KVM: MMU: log dirty page after marking spte writable
> KVM: MMU: add spte into rmap before logging dirty page
> KVM: MMU: flush tlb if the spte can be locklessly modified
> KVM: MMU: redesign the algorithm of pte_list
> KVM: MMU: introduce nulls desc
> KVM: MMU: introduce pte-list lockless walker
> KVM: MMU: allow locklessly access shadow page table out of vcpu thread
> KVM: MMU: locklessly write-protect the page
> KVM: MMU: clean up spte_write_protect
>
> arch/x86/include/asm/kvm_host.h | 10 +-
> arch/x86/kvm/mmu.c | 442 ++++++++++++++++++++++++++++------------
> arch/x86/kvm/mmu.h | 28 +++
> arch/x86/kvm/x86.c | 19 +-
> 4 files changed, 356 insertions(+), 143 deletions(-)
>

2013-08-07 01:49:18

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
> Make sure we can see the writable spte before the dirt bitmap is visible
>
> We do this is for kvm_vm_ioctl_get_dirty_log() write-protects the spte based
> on the dirty bitmap, we should ensure the writable spte can be found in rmap
> before the dirty bitmap is visible. Otherwise, we cleared the dirty bitmap and
> failed to write-protect the page
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)

Can you explain why this is safe, with regard to the rule
at edde99ce05290e50 ?

"The rule is that all pages are either dirty in the current bitmap,
or write-protected, which is violated here."

Overall, please document what is the required order of operations for
both set_spte and get_dirty_log and why this order is safe (right on top
of the code).

> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 35d4b50..0fe56ad 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> }
> }
>
> - if (pte_access & ACC_WRITE_MASK)
> - mark_page_dirty(vcpu->kvm, gfn);
> -
> set_pte:
> if (mmu_spte_update(sptep, spte))
> kvm_flush_remote_tlbs(vcpu->kvm);

Here, there is a modified guest page without dirty log bit set (think
another vcpu writing to the page via this spte).

> +
> + if (pte_access & ACC_WRITE_MASK)
> + mark_page_dirty(vcpu->kvm, gfn);
> done:
> return ret;
> }
> --
> 1.8.1.4

2013-08-07 04:06:59

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
>> Make sure we can see the writable spte before the dirt bitmap is visible
>>
>> We do this is for kvm_vm_ioctl_get_dirty_log() write-protects the spte based
>> on the dirty bitmap, we should ensure the writable spte can be found in rmap
>> before the dirty bitmap is visible. Otherwise, we cleared the dirty bitmap and
>> failed to write-protect the page
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> Can you explain why this is safe, with regard to the rule
> at edde99ce05290e50 ?

BTW, this log fixed this case:

VCPU 0                                    KVM migration control

                                          write-protects all pages
#PF happens, then the page
becomes writable; the dirty
bit is set on the bitmap

                                          swap the bitmap, the current bitmap is empty

write the page (no dirty log)

                                          stop the guest and push
                                          the remaining dirty pages
Stopped
                                          see the current bitmap is empty, which means
                                          no page is dirty.
>
> "The rule is that all pages are either dirty in the current bitmap,
> or write-protected, which is violated here."

Actually, this rule is not completely true; there is a third case: the window between
writing the guest page and setting the dirty bitmap is valid. In that window, the page
can be freely written but is not dirty logged.

This case is based on the fact that at the final step of live migration,
kvm should stop the guest and push the remaining dirty pages to the
destination.

There are some examples in the current code:

example 1, in fast_pf_fix_direct_spte():

    if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
        /* The window in here... */
        mark_page_dirty(vcpu->kvm, gfn);

example 2, in kvm_write_guest_page():

    r = __copy_to_user((void __user *)addr + offset, data, len);
    if (r)
        return -EFAULT;
    /*
     * The window is here, the page is dirty but not logged in
     * the bitmap.
     */
    mark_page_dirty(kvm, gfn);
    return 0;

>
> Overall, please document what is the required order of operations for
> both set_spte and get_dirty_log and why this order is safe (right on top
> of the code).

Okay.

The order we require here is: we should 1) set the spte to writable __before__ setting
the dirty bitmap and 2) add the spte into the rmap __before__ setting the dirty
bitmap.

Point 1) is the same as in fast_pf_fix_direct_spte(), which i explained above.
Points 1) and 2) ensure that we can find the spte in the rmap and see that it is
writable when we do the lockless write-protection; otherwise these cases will happen:

VCPU 0                                    kvm ioctl doing get-dirty-pages

mark_page_dirty(gfn), which
sets the gfn in the dirty bitmap
                                          mask = xchg(dirty_bitmap, 0)

                                          walk all gfns which are set in "mask" and
                                          locklessly write-protect the gfn,
                                          then walk the rmap, see no spte on that rmap

add the spte into the rmap

!!!!!! Then the page can be freely written but is not recorded in the dirty bitmap.

Or:

VCPU 0                                    kvm ioctl doing get-dirty-pages

mark_page_dirty(gfn), which
sets the gfn in the dirty bitmap

add the spte into the rmap
                                          mask = xchg(dirty_bitmap, 0)

                                          walk all gfns which are set in "mask" and
                                          locklessly write-protect the gfn,
                                          then walk the rmap, see the spte is in the
                                          rmap but it is readonly or nonpresent

mark the spte writable

!!!!!! Then the page can be freely written but is not recorded in the dirty bitmap.

Hopefully i have clarified it; if you have any questions, please let me know.
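
To put the two points together, a compressed sketch of both sides (schematic only:
the real calls are spread across mmu_set_spte()/set_spte() and
kvm_vm_ioctl_get_dirty_log(), and the exact placement below is an assumption):

    /* vcpu side: make the spte findable and writable before the dirty bit */
    rmap_add(vcpu, sptep, gfn);         /* 2) spte reachable via the rmap   */
    mmu_spte_update(sptep, spte);       /* 1) spte becomes writable         */
    smp_wmb();                          /* order the above against the bit  */
    mark_page_dirty(vcpu->kvm, gfn);    /* only now publish the dirty bit   */

    /* get-dirty-log side: consume the bit, then look at the rmap; xchg() is
     * a full barrier, so the bitmap is cleared before the rmap is read */
    mask = xchg(&dirty_bitmap[i], 0);
    /* ... for each gfn set in mask, write-protect the sptes found via its rmap */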

>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 35d4b50..0fe56ad 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>> }
>> }
>>
>> - if (pte_access & ACC_WRITE_MASK)
>> - mark_page_dirty(vcpu->kvm, gfn);
>> -
>> set_pte:
>> if (mmu_spte_update(sptep, spte))
>> kvm_flush_remote_tlbs(vcpu->kvm);
>
> Here, there is a modified guest page without dirty log bit set (think
> another vcpu writing to the page via this spte).

This is okay since we call mark_page_dirty(vcpu->kvm, gfn) after that; this is the
same case as fast_pf_fix_direct_spte(), and i have explained why it is safe above.


2013-08-07 13:09:36

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

On Tue, 30 Jul 2013 21:02:08 +0800
Xiao Guangrong <[email protected]> wrote:

> @@ -2342,6 +2358,13 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> */
> kvm_flush_remote_tlbs(kvm);
>
> + if (kvm->arch.rcu_free_shadow_page) {
> + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> + list_del_init(invalid_list);
> + call_rcu(&sp->rcu, free_pages_rcu);
> + return;
> + }
> +
> list_for_each_entry_safe(sp, nsp, invalid_list, link) {
> WARN_ON(!sp->role.invalid || sp->root_count);
> kvm_mmu_free_page(sp);

Shouldn't we avoid calling call_rcu() when we are holding mmu_lock?

Takuya

2013-08-07 13:20:09

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

On 08/07/2013 09:09 PM, Takuya Yoshikawa wrote:
> On Tue, 30 Jul 2013 21:02:08 +0800
> Xiao Guangrong <[email protected]> wrote:
>
>> @@ -2342,6 +2358,13 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>> */
>> kvm_flush_remote_tlbs(kvm);
>>
>> + if (kvm->arch.rcu_free_shadow_page) {
>> + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>> + list_del_init(invalid_list);
>> + call_rcu(&sp->rcu, free_pages_rcu);
>> + return;
>> + }
>> +
>> list_for_each_entry_safe(sp, nsp, invalid_list, link) {
>> WARN_ON(!sp->role.invalid || sp->root_count);
>> kvm_mmu_free_page(sp);
>
> Shouldn't we avoid calling call_rcu() when we are holding mmu_lock?

Using call_rcu() to free pages is a rare case that happens only when shadow pages are
zapped while a lockless write-protection walk is in progress, so i think we do not
need to care about this case too much.

2013-08-08 15:07:28

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
> > On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
> >> Make sure we can see the writable spte before the dirt bitmap is visible
> >>
> >> We do this is for kvm_vm_ioctl_get_dirty_log() write-protects the spte based
> >> on the dirty bitmap, we should ensure the writable spte can be found in rmap
> >> before the dirty bitmap is visible. Otherwise, we cleared the dirty bitmap and
> >> failed to write-protect the page
> >>
> >> Signed-off-by: Xiao Guangrong <[email protected]>
> >> ---
> >> arch/x86/kvm/mmu.c | 6 +++---
> >> 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > Can you explain why this is safe, with regard to the rule
> > at edde99ce05290e50 ?
>
> BTW, this log fixed this case:
>
> VCPU 0 KVM migration control
>
> write-protects all pages
> #Pf happen then the page
> become writable, set dirty
> bit on the bitmap
>
> swap the bitmap, current bitmap is empty
>
> write the page (no dirty log)
>
> stop the guest and push
> the remaining dirty pages
> Stopped
> See current bitmap is empty that means
> no page is dirty.
> >
> > "The rule is that all pages are either dirty in the current bitmap,
> > or write-protected, which is violated here."
>
> Actually, this rule is not complete true, there's the 3th case:
> the window between write guest page and set dirty bitmap is valid.
> In that window, page is write-free and not dirty logged.
>
> This case is based on the fact that at the final step of live migration,
> kvm should stop the guest and push the remaining dirty pages to the
> destination.
>
> They're some examples in the current code:
> example 1, in fast_pf_fix_direct_spte():
> if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
> /* The window in here... */
> mark_page_dirty(vcpu->kvm, gfn);
>
> example 2, in kvm_write_guest_page():
> r = __copy_to_user((void __user *)addr + offset, data, len);
> if (r)
> return -EFAULT;
> /*
> * The window is here, the page is dirty but not logged in
> * The bitmap.
> */
> mark_page_dirty(kvm, gfn);
> return 0;
>
> >
> > Overall, please document what is the required order of operations for
> > both set_spte and get_dirty_log and why this order is safe (right on top
> > of the code).
>
> Okay.
>
> The order we required here is, we should 1) set spte to writable __before__
> set the dirty bitmap and 2) add the spte into rmap __before__ set the dirty
> bitmap.
>
> The point 1) is the same as fast_pf_fix_direct_spte(), which i explained above.
> The point 1) and 2) can ensure we can find the spte on rmap and see the spte is
> writable when we do lockless write-protection, otherwise these cases will happen
>
> VCPU 0 kvm ioctl doing get-dirty-pages
>
> mark_page_dirty(gfn) which
> set the gfn on the dirty maps
> mask = xchg(dirty_bitmap, 0)
>
> walk all gfns which set on "mask" and
> locklessly write-protect the gfn,
> then walk rmap, see no spte on that rmap
>
>
> add the spte into rmap
>
> !!!!!! Then the page can be freely wrote but not recorded in the dirty bitmap.
>
> Or:
>
> VCPU 0 kvm ioctl doing get-dirty-pages
>
> mark_page_dirty(gfn) which
> set the gfn on the dirty maps
>
> add spte into rmap
> mask = xchg(dirty_bitmap, 0)
>
> walk all gfns which set on "mask" and
> locklessly write-protect the gfn,
> then walk rmap, see spte is on the ramp
> but it is readonly or nonpresent.
>
> Mark spte writable
>
> !!!!!! Then the page can be freely wrote but not recorded in the dirty bitmap.
>
> Hopefully, i have clarified it, if you have any questions, please let me know.

Yes, partially. Please on top of the explanation above, have something along
the lines of

"The flush IPI assumes that a thread switch happens in this order"
comment at arch/x86/mm/tlb.c

"With get_user_pages_fast, we walk down the pagetables without taking"
comment at arch/x86/mm/gup.c


What about the relation between read-only spte and respective TLB flush?
That is:

vcpu0                                     vcpu1

lockless write protect
mark spte read-only
                                          either some mmu-lockless path or not
                                          write protect page:
                                          find read-only page
                                          assumes tlb is flushed

kvm_flush_remote_tlbs


In general, i think write protection is a good candidate for lockless operation.
However, i would also be concerned about the larger problem of mmu-lock
contention and how to solve it. Did you ever consider splitting it?

> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 35d4b50..0fe56ad 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> >> }
> >> }
> >>
> >> - if (pte_access & ACC_WRITE_MASK)
> >> - mark_page_dirty(vcpu->kvm, gfn);
> >> -
> >> set_pte:
> >> if (mmu_spte_update(sptep, spte))
> >> kvm_flush_remote_tlbs(vcpu->kvm);
> >
> > Here, there is a modified guest page without dirty log bit set (think
> > another vcpu writing to the page via this spte).
>
> This is okay since we call mark_page_dirty(vcpu->kvm, gfn) after that, this
> is the same case as fast_pf_fix_direct_spte() and i have explained why it is
> safe above.

Right.

2013-08-08 16:26:20

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

[ Posting again after adjusting the format, since the mailing list rejected my previous one. ]

On Aug 8, 2013, at 11:06 PM, Marcelo Tosatti <[email protected]> wrote:

> On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
>> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
>>> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
>>>> Make sure we can see the writable spte before the dirt bitmap is visible
>>>>
>>>> We do this is for kvm_vm_ioctl_get_dirty_log() write-protects the spte based
>>>> on the dirty bitmap, we should ensure the writable spte can be found in rmap
>>>> before the dirty bitmap is visible. Otherwise, we cleared the dirty bitmap and
>>>> failed to write-protect the page
>>>>
>>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>>> ---
>>>> arch/x86/kvm/mmu.c | 6 +++---
>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> Can you explain why this is safe, with regard to the rule
>>> at edde99ce05290e50 ?
>>
>> BTW, this log fixed this case:
>>
>> VCPU 0 KVM migration control
>>
>> write-protects all pages
>> #Pf happen then the page
>> become writable, set dirty
>> bit on the bitmap
>>
>> swap the bitmap, current bitmap is empty
>>
>> write the page (no dirty log)
>>
>> stop the guest and push
>> the remaining dirty pages
>> Stopped
>> See current bitmap is empty that means
>> no page is dirty.
>>>
>>> "The rule is that all pages are either dirty in the current bitmap,
>>> or write-protected, which is violated here."
>>
>> Actually, this rule is not complete true, there's the 3th case:
>> the window between write guest page and set dirty bitmap is valid.
>> In that window, page is write-free and not dirty logged.
>>
>> This case is based on the fact that at the final step of live migration,
>> kvm should stop the guest and push the remaining dirty pages to the
>> destination.
>>
>> They're some examples in the current code:
>> example 1, in fast_pf_fix_direct_spte():
>> if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>> /* The window in here... */
>> mark_page_dirty(vcpu->kvm, gfn);
>>
>> example 2, in kvm_write_guest_page():
>> r = __copy_to_user((void __user *)addr + offset, data, len);
>> if (r)
>> return -EFAULT;
>> /*
>> * The window is here, the page is dirty but not logged in
>> * The bitmap.
>> */
>> mark_page_dirty(kvm, gfn);
>> return 0;
>>
>>>
>>> Overall, please document what is the required order of operations for
>>> both set_spte and get_dirty_log and why this order is safe (right on top
>>> of the code).
>>
>> Okay.
>>
>> The order we required here is, we should 1) set spte to writable __before__
>> set the dirty bitmap and 2) add the spte into rmap __before__ set the dirty
>> bitmap.
>>
>> The point 1) is the same as fast_pf_fix_direct_spte(), which i explained above.
>> The point 1) and 2) can ensure we can find the spte on rmap and see the spte is
>> writable when we do lockless write-protection, otherwise these cases will happen
>>
>> VCPU 0 kvm ioctl doing get-dirty-pages
>>
>> mark_page_dirty(gfn) which
>> set the gfn on the dirty maps
>> mask = xchg(dirty_bitmap, 0)
>>
>> walk all gfns which set on "mask" and
>> locklessly write-protect the gfn,
>> then walk rmap, see no spte on that rmap
>>
>>
>> add the spte into rmap
>>
>> !!!!!! Then the page can be freely wrote but not recorded in the dirty bitmap.
>>
>> Or:
>>
>> VCPU 0 kvm ioctl doing get-dirty-pages
>>
>> mark_page_dirty(gfn) which
>> set the gfn on the dirty maps
>>
>> add spte into rmap
>> mask = xchg(dirty_bitmap, 0)
>>
>> walk all gfns which set on "mask" and
>> locklessly write-protect the gfn,
>> then walk rmap, see spte is on the ramp
>> but it is readonly or nonpresent.
>>
>> Mark spte writable
>>
>> !!!!!! Then the page can be freely wrote but not recorded in the dirty bitmap.
>>
>> Hopefully, i have clarified it, if you have any questions, please let me know.
>
> Yes, partially. Please on top of the explanation above, have something along
> the lines of
>
> "The flush IPI assumes that a thread switch happens in this order"
> comment at arch/x86/mm/tlb.c
>
> "With get_user_pages_fast, we walk down the pagetables without taking"
> comment at arch/x86/mm/gup.c

Marcelo, thanks for your suggestion, i will improve both the changelog and
the comments in the next version.

>
>
> What about the relation between read-only spte and respective TLB flush?
> That is:
>
> vcpu0 vcpu1
>
> lockless write protect
> mark spte read-only
> either some mmu-lockless path or not
> write protect page:
> find read-only page
> assumes tlb is flushed

The tlb flush on the mmu-lockless paths does not depend on the spte; there are
two lockless paths: one is kvm_mmu_slot_remove_write_access(), which
unconditionally flushes the tlb, and the other is kvm_vm_ioctl_get_dirty_log(),
which flushes the tlb based on the dirty bitmap (flush the tlb if there is any
dirty page).

Under the protection of mmu-lock, since we flush the tlb whenever a spte is zapped,
we only need to care about the case of updating a spte, which is fixed in
"[PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified"; that
patch changes the flush condition to

    -	if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
    +	if (spte_is_locklessly_modifiable(old_spte) &&
    +	    !is_writable_pte(new_spte))

That means that whenever a spte which can potentially be locklessly modified becomes
readonly, we do the tlb flush.
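
For reference, the helper named in that condition is defined in mmu.c roughly as
follows (approximate form, shown here only as a reminder of what it tests):

    static bool spte_is_locklessly_modifiable(u64 spte)
    {
        return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
                (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
    }

i.e. a spte counts as locklessly modifiable when both the host-writable and
mmu-writable software bits are set, which is exactly the kind of spte the fast
page fault path may turn writable again without taking mmu-lock.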

>
> kvm_flush_remote_tlbs
>
>
> In general, i think write protection is a good candidate for lockless operation.
> However, i would also be concerned about the larger problem of mmu-lock
> contention and how to solve it. Did you ever consider splitting it?

Yes, i did, and actually i am still working on that. :)

2013-08-08 17:38:49

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect

Il 06/08/2013 15:16, Xiao Guangrong ha scritto:
> Hi Gleb, Paolo, Marcelo, Takuya,
>
> Any comments or further comments? :)

It's not the easiest patch to review. I've looked at it (beyond the
small comments I have already posted), but it will take some time to
digest it...

By the way, both I and Gleb will be on vacation next week. I will read
email, but I will not be able to apply patches or do pull requests.

Paolo

> On 07/30/2013 09:01 PM, Xiao Guangrong wrote:
>> Background
>> ==========
>> Currently, when mark memslot dirty logged or get dirty page, we need to
>> write-protect large guest memory, it is the heavy work, especially, we need to
>> hold mmu-lock which is also required by vcpu to fix its page table fault and
>> mmu-notifier when host page is being changed. In the extreme cpu / memory used
>> guest, it becomes a scalability issue.
>>
>> This patchset introduces a way to locklessly write-protect guest memory.
>>
>> Idea
>> ==========
>> There are the challenges we meet and the ideas to resolve them.
>>
>> 1) How to locklessly walk rmap?
>> The first idea we got to prevent "desc" being freed when we are walking the
>> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
>> it updates the rmap really frequently.
>>
>> So we uses SLAB_DESTROY_BY_RCU to manage "desc" instead, it allows the object
>> to be reused more quickly. We also store a "nulls" in the last "desc"
>> (desc->more) which can help us to detect whether the "desc" is moved to anther
>> rmap then we can re-walk the rmap if that happened. I learned this idea from
>> nulls-list.
>>
>> Another issue is, when a spte is deleted from the "desc", another spte in the
>> last "desc" will be moved to this position to replace the deleted one. If the
>> deleted one has been accessed and we do not access the replaced one, the
>> replaced one is missed when we do lockless walk.
>> To fix this case, we do not backward move the spte, instead, we forward move
>> the entry: when a spte is deleted, we move the entry in the first desc to that
>> position.
>>
>> 2) How to locklessly access shadow page table?
>> It is easy if the handler is in the vcpu context, in that case we can use
>> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
>> disable interrupt to stop shadow page be freed. But we are on the ioctl context
>> and the paths we are optimizing for have heavy workload, disabling interrupt is
>> not good for the system performance.
>>
>> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
>> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
>> indicator are protected by slot-lock, so it need not be atomic and does not
>> hurt the performance and the scalability.
>>
>> 3) How to locklessly write-protect guest memory?
>> Currently, there are two behaviors when we write-protect guest memory, one is
>> clearing the Writable bit on spte and the another one is dropping spte when it
>> points to large page. The former is easy we only need to atomicly clear a bit
>> but the latter is hard since we need to remove the spte from rmap. so we unify
>> these two behaviors that only make the spte readonly. Making large spte
>> readonly instead of nonpresent is also good for reducing jitter.
>>
>> And we need to pay more attention on the order of making spte writable, adding
>> spte into rmap and setting the corresponding bit on dirty bitmap since
>> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty bitmap,
>> we should ensure the writable spte can be found in rmap before the dirty bitmap
>> is visible. Otherwise, we cleared the dirty bitmap and failed to write-protect
>> the page.
>>
>> Performance result
>> ====================
>> Host: CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz x 12
>> Mem: 36G
>>
>> The benchmark i used and will be attached:
>> a) kernbench
>> b) migrate-perf
>> it emulates guest migration
>> c) mmtest
>> it repeatedly writes the memory and measures the time and is used to
>> generate memory access in the guest which is being migrated
>> d) Qemu monitor command to implement guest live migration
>> the script can be found in migrate-perf.
>>
>>
>> 1) First, we use kernbench to benchmark the performance with non-write-protection
>> case to detect the possible regression:
>>
>> EPT enabled: Base: 84.05 After the patch: 83.53
>> EPT disabled: Base: 142.57 After the patch: 141.70
>>
>> No regression and the optimization may come from lazily drop large spte.
>>
>> 2) Benchmark the performance of get dirty page
>> (./migrate-perf -c 12 -m 3000 -t 20)
>>
>> Base: Run 20 times, Avg time:24813809 ns.
>> After the patch: Run 20 times, Avg time:8371577 ns.
>>
>> It improves +196%
>>
>> 3) There is the result of Live Migration:
>> 3.1) Less vcpus, less memory and less dirty page generated
>> (
>> Guest config: MEM_SIZE=7G VCPU_NUM=6
>> The workload in migrated guest:
>> ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 3000 -c 30 -t 60 > result &"
>> )
>>
>>         |          | Live Migration time (ms) | Benchmark (ns)
>> --------+----------+--------------------------+---------------------
>>  EPT    | Baseline | 21638                    | 266601028
>>         | After    | 21110  (+2.5%)           | 264966696  (+0.6%)
>> --------+----------+--------------------------+---------------------
>>  Shadow | Baseline | 22542                    | 271969284
>>         | After    | 21641  (+4.1%)           | 270485511  (+0.5%)
>>
>> 3.2) More vcpus, more memory and less dirty page generated
>> (
>> Guest config: MEM_SIZE=25G VCPU_NUM=12
>> The workload in migrated guest:
>> ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 30 > result &"
>> )
>>
>>         |          | Live Migration time (ms) | Benchmark (ns)
>> --------+----------+--------------------------+---------------------
>>  EPT    | Baseline | 72773                    | 1278228350
>>         | After    | 70516  (+3.2%)           | 1266581587  (+0.9%)
>> --------+----------+--------------------------+---------------------
>>  Shadow | Baseline | 74198                    | 1323180090
>>         | After    | 64948  (+14.2%)          | 1299283302  (+1.8%)
>>
>> 3.3) Less vcpus, more memory and huge dirty page generated
>> (
>> Guest config: MEM_SIZE=25G VCPU_NUM=6
>> The workload in migrated guest:
>> ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 200 > result &"
>> )
>>
>>         |          | Live Migration time (ms) | Benchmark (ns)
>> --------+----------+--------------------------+---------------------
>>  EPT    | Baseline | 267473                   | 1224657502
>>         | After    | 267374  (+0.03%)         | 1221520513  (+0.6%)
>> --------+----------+--------------------------+---------------------
>>  Shadow | Baseline | 369999                   | 1712004428
>>         | After    | 335737  (+10.2%)         | 1556065063  (+10.2%)
>>
>> For the case of 3.3), EPT gets small benefit, the reason is only the first
>> time guest writes memory need take mmu-lock to mark spte from nonpresent to
>> present. Other writes cost lots of time to trigger the page fault due to
>> write-protection which are fixed by fast page fault which need not take
>> mmu-lock.
>>
>> Xiao Guangrong (12):
>> KVM: MMU: remove unused parameter
>> KVM: MMU: properly check last spte in fast_page_fault()
>> KVM: MMU: lazily drop large spte
>> KVM: MMU: log dirty page after marking spte writable
>> KVM: MMU: add spte into rmap before logging dirty page
>> KVM: MMU: flush tlb if the spte can be locklessly modified
>> KVM: MMU: redesign the algorithm of pte_list
>> KVM: MMU: introduce nulls desc
>> KVM: MMU: introduce pte-list lockless walker
>> KVM: MMU: allow locklessly access shadow page table out of vcpu thread
>> KVM: MMU: locklessly write-protect the page
>> KVM: MMU: clean up spte_write_protect
>>
>> arch/x86/include/asm/kvm_host.h | 10 +-
>> arch/x86/kvm/mmu.c | 442 ++++++++++++++++++++++++++++------------
>> arch/x86/kvm/mmu.h | 28 +++
>> arch/x86/kvm/x86.c | 19 +-
>> 4 files changed, 356 insertions(+), 143 deletions(-)
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2013-08-09 04:51:25

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect

On 08/09/2013 01:38 AM, Paolo Bonzini wrote:
> On 06/08/2013 15:16, Xiao Guangrong wrote:
>> Hi Gleb, Paolo, Marcelo, Takuya,
>>
>> Any comments or further comments? :)
>
> It's not the easiest patch to review. I've looked at it (beyond the
> small comments I have already posted), but it will take some time to
> digest it...

Thanks for your time, Paolo!

>
> By the way, both I and Gleb will be on vacation next week. I will read
> email, but I will not be able to apply patches or do pull requests.

Enjoy your vacation. :)

2013-08-28 07:23:50

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified

On Tue, Jul 30, 2013 at 09:02:04PM +0800, Xiao Guangrong wrote:
> Relax the tlb flush condition since we will write-protect the spte out of mmu
> lock. Note lockless write-protection only marks the writable spte to readonly
> and the spte can be writable only if both SPTE_HOST_WRITEABLE and
> SPTE_MMU_WRITEABLE are set (that are tested by spte_is_locklessly_modifiable)
>
> This patch is used to avoid this kind of race:
>
> VCPU 0 VCPU 1
> lockless wirte protection:
> set spte.w = 0
> lock mmu-lock
>
> write protection the spte to sync shadow page,
> see spte.w = 0, then without flush tlb
>
> unlock mmu-lock
>
> !!! At this point, the shadow page can still be
> writable due to the corrupt tlb entry
> Flush all TLB
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 58283bf..5a40564 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -600,7 +600,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> * we always atomicly update it, see the comments in
> * spte_has_volatile_bits().
> */
> - if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
> + if (spte_is_locklessly_modifiable(old_spte) &&
> + !is_writable_pte(new_spte))
> ret = true;
This will needlessly flush tlbs when dirty logging is not in use (the common
case) and the old spte is non-writable. Can you estimate how serious the
performance hit is?

>
> if (!shadow_accessed_mask)
> --
> 1.8.1.4

--
Gleb.

2013-08-28 07:50:42

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified

On 08/28/2013 03:23 PM, Gleb Natapov wrote:
> On Tue, Jul 30, 2013 at 09:02:04PM +0800, Xiao Guangrong wrote:
>> Relax the tlb flush condition since we will write-protect the spte out of mmu
>> lock. Note lockless write-protection only marks the writable spte to readonly
>> and the spte can be writable only if both SPTE_HOST_WRITEABLE and
>> SPTE_MMU_WRITEABLE are set (that are tested by spte_is_locklessly_modifiable)
>>
>> This patch is used to avoid this kind of race:
>>
>> VCPU 0 VCPU 1
>> lockless wirte protection:
>> set spte.w = 0
>> lock mmu-lock
>>
>> write protection the spte to sync shadow page,
>> see spte.w = 0, then without flush tlb
>>
>> unlock mmu-lock
>>
>> !!! At this point, the shadow page can still be
>> writable due to the corrupt tlb entry
>> Flush all TLB
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 58283bf..5a40564 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -600,7 +600,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>> * we always atomicly update it, see the comments in
>> * spte_has_volatile_bits().
>> */
>> - if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
>> + if (spte_is_locklessly_modifiable(old_spte) &&
>> + !is_writable_pte(new_spte))
>> ret = true;
> This will needlessly flush tlbs when dirty logging is not in use (the common
> case) and the old spte is non-writable. Can you estimate how serious the
> performance hit is?

If the spte is not write-protected by dirty logging, it is always writable
when SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE are set. In other words,
spte_is_locklessly_modifiable(old_spte) is the same as
is_writable_pte(old_spte) in the common case.

There are two cases that cause an unnecessary TLB flush:

1) The guest read-faults on a spte write-protected by dirty logging and uses
a readonly host pfn to fix it.
This is really rare since a read access on the readonly spte can not trigger
#PF.

2) The guest requires write-protection caused by syncing shadow pages.
This is only needed if EPT is disabled, and in most cases the guest has many
sptes that need to be write-protected, so the unnecessary TLB flush is rare
too.

2013-08-28 08:13:13

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

On Tue, Jul 30, 2013 at 09:02:05PM +0800, Xiao Guangrong wrote:
> Change the algorithm to:
> 1) always add new desc to the first desc (pointed by parent_ptes/rmap)
> that is good to implement rcu-nulls-list-like lockless rmap
> walking
>
> 2) always move the entry in first desc to the the position we want
> to remove when remove a spte in the parent_ptes/rmap (backward-move).
> It is good for us to implement lockless rmap walk since in the current
> code, when a spte is deleted from the "desc", another spte in the last
> "desc" will be moved to this position to replace the deleted one. If the
> deleted one has been accessed and we do not access the replaced one, the
> replaced one is missed when we do lockless walk.
> To fix this case, we do not backward move the spte, instead, we forward
> move the entry: when a spte is deleted, we move the entry in the first
> desc to that position
>
> Both of these also can reduce cache miss
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 182 ++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 125 insertions(+), 57 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 5a40564..3013bb1 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -918,6 +918,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
> return level - 1;
> }
>
> +static int __find_first_free(struct pte_list_desc *desc)
> +{
> + int i;
> +
> + for (i = 0; i < PTE_LIST_EXT; i++)
> + if (!desc->sptes[i])
> + break;
> + return i;
> +}
> +
> +static int find_first_free(struct pte_list_desc *desc)
> +{
> + int free = __find_first_free(desc);
> +
> + WARN_ON(free >= PTE_LIST_EXT);
> + return free;
> +}
> +
> +static int find_last_used(struct pte_list_desc *desc)
> +{
> + int used = __find_first_free(desc) - 1;
> +
> + WARN_ON(used < 0 || used >= PTE_LIST_EXT);
> + return used;
> +}
> +
> +/*
> + * TODO: we can encode the desc number into the rmap/parent_ptes
> + * since at least 10 physical/virtual address bits are reserved
> + * on x86. It is worthwhile if it shows that the desc walking is
> + * a performance issue.
> + */
> +static int count_spte_number(struct pte_list_desc *desc)
> +{
> + int first_free, desc_num;
> +
> + first_free = __find_first_free(desc);
> +
> + for (desc_num = 0; desc->more; desc = desc->more)
> + desc_num++;
> +
> + return first_free + desc_num * PTE_LIST_EXT;
> +}
> +
> /*
> * Pte mapping structures:
> *
> @@ -934,92 +978,116 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
> unsigned long *pte_list)
> {
> struct pte_list_desc *desc;
> - int i, count = 0;
> + int free_pos;
>
> if (!*pte_list) {
> rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
> *pte_list = (unsigned long)spte;
> - } else if (!(*pte_list & 1)) {
> + return 0;
> + }
> +
> + if (!(*pte_list & 1)) {
> rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
> desc = mmu_alloc_pte_list_desc(vcpu);
> desc->sptes[0] = (u64 *)*pte_list;
> desc->sptes[1] = spte;
> *pte_list = (unsigned long)desc | 1;
> - ++count;
> - } else {
> - rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
> - desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
> - desc = desc->more;
> - count += PTE_LIST_EXT;
> - }
> - if (desc->sptes[PTE_LIST_EXT-1]) {
> - desc->more = mmu_alloc_pte_list_desc(vcpu);
> - desc = desc->more;
> - }
> - for (i = 0; desc->sptes[i]; ++i)
> - ++count;
> - desc->sptes[i] = spte;
> + return 1;
> + }
> +
> + rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
> + desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> +
> + /* No empty position in the desc. */
> + if (desc->sptes[PTE_LIST_EXT - 1]) {
> + struct pte_list_desc *new_desc;
> + new_desc = mmu_alloc_pte_list_desc(vcpu);
> + new_desc->more = desc;
> + desc = new_desc;
> + *pte_list = (unsigned long)desc | 1;
> }
> - return count;
> +
> + free_pos = find_first_free(desc);
> + desc->sptes[free_pos] = spte;
> + return count_spte_number(desc);
Should it be count_spte_number(desc) - 1? The function should return
the number of pte entries before the spte was added.

> }
>
> static void
> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
> - int i, struct pte_list_desc *prev_desc)
> +pte_list_desc_remove_entry(unsigned long *pte_list,
> + struct pte_list_desc *desc, int i)
> {
> - int j;
> + struct pte_list_desc *first_desc;
> + int last_used;
> +
> + first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> + last_used = find_last_used(first_desc);
>
> - for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
> - ;
> - desc->sptes[i] = desc->sptes[j];
> - desc->sptes[j] = NULL;
> - if (j != 0)
> + /*
> + * Move the entry from the first desc to this position we want
> + * to remove.
> + */
> + desc->sptes[i] = first_desc->sptes[last_used];
> + first_desc->sptes[last_used] = NULL;
> +
What if desc == first_desc and i < last_used? You still move the spte
backwards, so the lockless walk may have already examined the entry at i and
will miss the spte that was moved there from the last_used position, no?

> + /* No valid entry in this desc, we can free this desc now. */
> + if (!first_desc->sptes[0]) {
> + struct pte_list_desc *next_desc = first_desc->more;
> +
> + /*
> + * Only one entry existing but still use a desc to store it?
> + */
> + WARN_ON(!next_desc);
> +
> + mmu_free_pte_list_desc(first_desc);
> + first_desc = next_desc;
> + *pte_list = (unsigned long)first_desc | 1ul;
> return;
> - if (!prev_desc && !desc->more)
> - *pte_list = (unsigned long)desc->sptes[0];
> - else
> - if (prev_desc)
> - prev_desc->more = desc->more;
> - else
> - *pte_list = (unsigned long)desc->more | 1;
> - mmu_free_pte_list_desc(desc);
> + }
> +
> + WARN_ON(!first_desc->sptes[0]);
> +
> + /*
> + * Only one entry in this desc, move the entry to the head
> + * then the desc can be freed.
> + */
> + if (!first_desc->sptes[1] && !first_desc->more) {
> + *pte_list = (unsigned long)first_desc->sptes[0];
> + mmu_free_pte_list_desc(first_desc);
> + }
> }
>
> static void pte_list_remove(u64 *spte, unsigned long *pte_list)
> {
> struct pte_list_desc *desc;
> - struct pte_list_desc *prev_desc;
> int i;
>
> if (!*pte_list) {
> - printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
> - BUG();
> - } else if (!(*pte_list & 1)) {
> + WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
Why change BUG() to WARN() here and below?

> + return;
> + }
> +
> + if (!(*pte_list & 1)) {
> rmap_printk("pte_list_remove: %p 1->0\n", spte);
> if ((u64 *)*pte_list != spte) {
> - printk(KERN_ERR "pte_list_remove: %p 1->BUG\n", spte);
> - BUG();
> + WARN(1, KERN_ERR "pte_list_remove: %p 1->BUG\n", spte);
> }
Remove the {} since only one statement is left in the if(). Or better yet,
why not:
WARN((u64 *)*pte_list != spte, ....)?
But again, why not BUG()?

> *pte_list = 0;
> - } else {
> - rmap_printk("pte_list_remove: %p many->many\n", spte);
> - desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - prev_desc = NULL;
> - while (desc) {
> - for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> - if (desc->sptes[i] == spte) {
> - pte_list_desc_remove_entry(pte_list,
> - desc, i,
> - prev_desc);
> - return;
> - }
> - prev_desc = desc;
> - desc = desc->more;
> - }
> - pr_err("pte_list_remove: %p many->many\n", spte);
> - BUG();
> + return;
> }
> +
> + rmap_printk("pte_list_remove: %p many->many\n", spte);
> + desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> + while (desc) {
> + for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> + if (desc->sptes[i] == spte) {
> + pte_list_desc_remove_entry(pte_list,
> + desc, i);
> + return;
> + }
> + desc = desc->more;
> + }
> +
> + WARN(1, "pte_list_remove: %p many->many\n", spte);
> }
>
> typedef void (*pte_list_walk_fn) (u64 *spte);
> --
> 1.8.1.4

--
Gleb.

2013-08-28 08:37:44

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

On 08/28/2013 04:12 PM, Gleb Natapov wrote:

>> +
>> + rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
>> + desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>> +
>> + /* No empty position in the desc. */
>> + if (desc->sptes[PTE_LIST_EXT - 1]) {
>> + struct pte_list_desc *new_desc;
>> + new_desc = mmu_alloc_pte_list_desc(vcpu);
>> + new_desc->more = desc;
>> + desc = new_desc;
>> + *pte_list = (unsigned long)desc | 1;
>> }
>> - return count;
>> +
>> + free_pos = find_first_free(desc);
>> + desc->sptes[free_pos] = spte;
>> + return count_spte_number(desc);
> Should it be count_spte_number(desc) - 1? The function should returns
> the number of pte entries before the spte was added.

Yes. We have handled it in count_spte_number(); we count the number like this:

        return first_free + desc_num * PTE_LIST_EXT;

The first_free is indexed from 0.

Maybe it is clearer to let count_spte_number() return the real number.

>
>> }
>>
>> static void
>> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
>> - int i, struct pte_list_desc *prev_desc)
>> +pte_list_desc_remove_entry(unsigned long *pte_list,
>> + struct pte_list_desc *desc, int i)
>> {
>> - int j;
>> + struct pte_list_desc *first_desc;
>> + int last_used;
>> +
>> + first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>> + last_used = find_last_used(first_desc);
>>
>> - for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
>> - ;
>> - desc->sptes[i] = desc->sptes[j];
>> - desc->sptes[j] = NULL;
>> - if (j != 0)
>> + /*
>> + * Move the entry from the first desc to this position we want
>> + * to remove.
>> + */
>> + desc->sptes[i] = first_desc->sptes[last_used];
>> + first_desc->sptes[last_used] = NULL;
>> +
> What if desc == first_desc and i < last_used. You still move spte
> backwards so lockless walk may have already examined entry at i and
> will miss spte that was moved there from last_used position, no?

Right. I noticed it too and fixed it in the v2 which is being tested.
I fixed it by doing a bottom-up walk of the desc, like this:

pte_list_walk_lockless():

        desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
        while (!desc_is_a_nulls(desc)) {
                /*
                 * We should do bottom-up walk since we always use the
                 * bottom entry to replace the deleted entry if only
                 * one desc is used in the rmap when a spte is removed.
                 * Otherwise the moved entry will be missed.
                 */
                for (i = PTE_LIST_EXT - 1; i >= 0; i--)
                        fn(desc->sptes[i]);

                desc = ACCESS_ONCE(desc->more);

                /* It is being initialized. */
                if (unlikely(!desc))
                        goto restart;
        }

How about this?

>
>> + /* No valid entry in this desc, we can free this desc now. */
>> + if (!first_desc->sptes[0]) {
>> + struct pte_list_desc *next_desc = first_desc->more;
>> +
>> + /*
>> + * Only one entry existing but still use a desc to store it?
>> + */
>> + WARN_ON(!next_desc);
>> +
>> + mmu_free_pte_list_desc(first_desc);
>> + first_desc = next_desc;
>> + *pte_list = (unsigned long)first_desc | 1ul;
>> return;
>> - if (!prev_desc && !desc->more)
>> - *pte_list = (unsigned long)desc->sptes[0];
>> - else
>> - if (prev_desc)
>> - prev_desc->more = desc->more;
>> - else
>> - *pte_list = (unsigned long)desc->more | 1;
>> - mmu_free_pte_list_desc(desc);
>> + }
>> +
>> + WARN_ON(!first_desc->sptes[0]);
>> +
>> + /*
>> + * Only one entry in this desc, move the entry to the head
>> + * then the desc can be freed.
>> + */
>> + if (!first_desc->sptes[1] && !first_desc->more) {
>> + *pte_list = (unsigned long)first_desc->sptes[0];
>> + mmu_free_pte_list_desc(first_desc);
>> + }
>> }
>>
>> static void pte_list_remove(u64 *spte, unsigned long *pte_list)
>> {
>> struct pte_list_desc *desc;
>> - struct pte_list_desc *prev_desc;
>> int i;
>>
>> if (!*pte_list) {
>> - printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
>> - BUG();
>> - } else if (!(*pte_list & 1)) {
>> + WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
> Why change BUG() to WARN() here and below?

WARN(1, "xxx") can replace two lines in the origin code. And personally,
i prefer WARN() to BUG() since sometimes BUG() can stop my box and i need to
get the full log by using kdump.

If you object it, i will change it back in the next version. :)

>
>> + return;
>> + }
>> +
>> + if (!(*pte_list & 1)) {
>> rmap_printk("pte_list_remove: %p 1->0\n", spte);
>> if ((u64 *)*pte_list != spte) {
>> - printk(KERN_ERR "pte_list_remove: %p 1->BUG\n", spte);
>> - BUG();
>> + WARN(1, KERN_ERR "pte_list_remove: %p 1->BUG\n", spte);
>> }
> Remove {} since only one statement left in the if(). Or better yet why
> not:
> WARN ((u64 *)*pte_list != spte, ....)?

Yes, it is better.

> But again why not BUG()?

The explanation is above. :)

2013-08-28 08:40:20

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 08/12] KVM: MMU: introduce nulls desc

On Tue, Jul 30, 2013 at 09:02:06PM +0800, Xiao Guangrong wrote:
> It likes nulls list and we use the pte-list as the nulls which can help us to
> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
> if that happened
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 35 ++++++++++++++++++++++++++++-------
> 1 file changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 3013bb1..36caf6a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -918,6 +918,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
> return level - 1;
> }
>
> +static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
> +{
> + unsigned long marker;
> +
> + marker = (unsigned long)pte_list | 1UL;
> + desc->more = (struct pte_list_desc *)marker;
> +}
> +
> +static bool desc_is_a_nulls(struct pte_list_desc *desc)
> +{
> + return (unsigned long)desc & 1;
> +}
> +
> +static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
> +{
> + return (unsigned long *)((unsigned long)desc & ~1);
> +}
> +
> static int __find_first_free(struct pte_list_desc *desc)
> {
> int i;
> @@ -956,7 +974,7 @@ static int count_spte_number(struct pte_list_desc *desc)
>
> first_free = __find_first_free(desc);
>
> - for (desc_num = 0; desc->more; desc = desc->more)
> + for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
> desc_num++;
>
> return first_free + desc_num * PTE_LIST_EXT;
> @@ -991,6 +1009,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
> desc = mmu_alloc_pte_list_desc(vcpu);
> desc->sptes[0] = (u64 *)*pte_list;
> desc->sptes[1] = spte;
> + desc_mark_nulls(pte_list, desc);
> *pte_list = (unsigned long)desc | 1;
> return 1;
> }
> @@ -1036,7 +1055,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
> /*
> * Only one entry existing but still use a desc to store it?
> */
> - WARN_ON(!next_desc);
> + WARN_ON(desc_is_a_nulls(next_desc));
>
> mmu_free_pte_list_desc(first_desc);
> first_desc = next_desc;
> @@ -1050,7 +1069,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
> * Only one entry in this desc, move the entry to the head
> * then the desc can be freed.
> */
> - if (!first_desc->sptes[1] && !first_desc->more) {
> + if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
> *pte_list = (unsigned long)first_desc->sptes[0];
> mmu_free_pte_list_desc(first_desc);
> }
> @@ -1077,7 +1096,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
>
> rmap_printk("pte_list_remove: %p many->many\n", spte);
> desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc) {
> + while (!desc_is_a_nulls(desc)) {
> for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> if (desc->sptes[i] == spte) {
> pte_list_desc_remove_entry(pte_list,
> @@ -1103,11 +1122,13 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
> return fn((u64 *)*pte_list);
>
> desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc) {
> + while (!desc_is_a_nulls(desc)) {
> for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> fn(desc->sptes[i]);
> desc = desc->more;
> }
> +
> + WARN_ON(desc_get_nulls_value(desc) != pte_list);
> }
>
> static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
> @@ -1200,7 +1221,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
> */
> static u64 *rmap_get_next(struct rmap_iterator *iter)
> {
> - if (iter->desc) {
> + if (iter->desc && !desc_is_a_nulls(iter->desc)) {
> if (iter->pos < PTE_LIST_EXT - 1) {
> u64 *sptep;
>
> @@ -1212,7 +1233,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
>
> iter->desc = iter->desc->more;
>
I'd rather do:
iter->desc = desc_is_a_nulls(iter->desc) ? NULL : iter->desc;
here and drop two desc_is_a_nulls() checks in this function.

> - if (iter->desc) {
> + if (!desc_is_a_nulls(iter->desc)) {
> iter->pos = 0;
> /* desc->sptes[0] cannot be NULL */
> return iter->desc->sptes[iter->pos];
> --
> 1.8.1.4

--
Gleb.

2013-08-28 08:54:21

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 08/12] KVM: MMU: introduce nulls desc

On 08/28/2013 04:40 PM, Gleb Natapov wrote:

>> static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
>> @@ -1200,7 +1221,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
>> */
>> static u64 *rmap_get_next(struct rmap_iterator *iter)
>> {
>> - if (iter->desc) {
>> + if (iter->desc && !desc_is_a_nulls(iter->desc)) {
>> if (iter->pos < PTE_LIST_EXT - 1) {
>> u64 *sptep;
>>
>> @@ -1212,7 +1233,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
>>
>> iter->desc = iter->desc->more;
>>
> I'd rather do:
> iter->desc = desc_is_a_nulls(iter->desc) ? NULL : iter->desc;
> here and drop two desc_is_a_nulls() checks in this function.

Nice, will do it in the next version. Thanks!
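
For illustration, rmap_get_next() with that suggestion folded in might look
roughly like this (reconstructed around the hunk quoted above, not the posted
v2):

        static u64 *rmap_get_next(struct rmap_iterator *iter)
        {
                if (iter->desc) {
                        if (iter->pos < PTE_LIST_EXT - 1) {
                                u64 *sptep;

                                ++iter->pos;
                                sptep = iter->desc->sptes[iter->pos];
                                if (sptep)
                                        return sptep;
                        }

                        iter->desc = iter->desc->more;
                        /* Map the nulls marker back to NULL just once, here. */
                        iter->desc = desc_is_a_nulls(iter->desc) ?
                                     NULL : iter->desc;

                        if (iter->desc) {
                                iter->pos = 0;
                                /* desc->sptes[0] cannot be NULL */
                                return iter->desc->sptes[iter->pos];
                        }
                }

                return NULL;
        }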

2013-08-28 08:58:49

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

On Wed, Aug 28, 2013 at 04:37:32PM +0800, Xiao Guangrong wrote:
> On 08/28/2013 04:12 PM, Gleb Natapov wrote:
>
> >> +
> >> + rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
> >> + desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> >> +
> >> + /* No empty position in the desc. */
> >> + if (desc->sptes[PTE_LIST_EXT - 1]) {
> >> + struct pte_list_desc *new_desc;
> >> + new_desc = mmu_alloc_pte_list_desc(vcpu);
> >> + new_desc->more = desc;
> >> + desc = new_desc;
> >> + *pte_list = (unsigned long)desc | 1;
> >> }
> >> - return count;
> >> +
> >> + free_pos = find_first_free(desc);
> >> + desc->sptes[free_pos] = spte;
> >> + return count_spte_number(desc);
> > Should it be count_spte_number(desc) - 1? The function should returns
> > the number of pte entries before the spte was added.
>
> Yes. We have handled it count_spte_number(), we count the number like this:
>
> return first_free + desc_num * PTE_LIST_EXT;
>
> The first_free is indexed from 0.
>
Suppose that when pte_list_add() is called there is one full desc, so the
number that should be returned is PTE_LIST_EXT, correct? But since one more
desc will be added and its desc->sptes[0] will be set before
count_spte_number() is called, first_free in count_spte_number() will be 1
and PTE_LIST_EXT + 1 will be returned.
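
A worked example of that off-by-one, assuming PTE_LIST_EXT == 3 purely for
illustration:

        /*
         * Before pte_list_add(): one full desc -> 3 sptes present, so the
         * value callers expect is 3 (the number of entries before the new
         * spte).
         *
         * pte_list_add() then allocates a new head desc and stores the new
         * spte in its sptes[0], so count_spte_number() sees
         *      first_free = 1, desc_num = 1  ->  1 + 1 * 3 = 4
         * and PTE_LIST_EXT + 1 is returned instead of PTE_LIST_EXT.
         *
         * Returning count_spte_number(desc) - 1, or documenting that the
         * function now counts the just-added spte as well, resolves it.
         */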

> Maybe it is clearer to let count_spte_number() return the real number.
>
> >
> >> }
> >>
> >> static void
> >> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
> >> - int i, struct pte_list_desc *prev_desc)
> >> +pte_list_desc_remove_entry(unsigned long *pte_list,
> >> + struct pte_list_desc *desc, int i)
> >> {
> >> - int j;
> >> + struct pte_list_desc *first_desc;
> >> + int last_used;
> >> +
> >> + first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> >> + last_used = find_last_used(first_desc);
> >>
> >> - for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
> >> - ;
> >> - desc->sptes[i] = desc->sptes[j];
> >> - desc->sptes[j] = NULL;
> >> - if (j != 0)
> >> + /*
> >> + * Move the entry from the first desc to this position we want
> >> + * to remove.
> >> + */
> >> + desc->sptes[i] = first_desc->sptes[last_used];
> >> + first_desc->sptes[last_used] = NULL;
> >> +
> > What if desc == first_desc and i < last_used. You still move spte
> > backwards so lockless walk may have already examined entry at i and
> > will miss spte that was moved there from last_used position, no?
>
> Right. I noticed it too and fixed in the v2 which is being tested.
> I fixed it by bottom-up walk desc, like this:
>
> pte_list_walk_lockless():
>
> desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
> while (!desc_is_a_nulls(desc)) {
> /*
> * We should do bottom-up walk since we always use the
> * bottom entry to replace the deleted entry if only
> * one desc is used in the rmap when a spte is removed.
> * Otherwise the moved entry will be missed.
> */
I would call it a top-down walk since we are walking from big indices to
smaller ones.

> for (i = PTE_LIST_EXT - 1; i >= 0; i--)
> fn(desc->sptes[i]);
>
> desc = ACCESS_ONCE(desc->more);
>
> /* It is being initialized. */
> if (unlikely(!desc))
> goto restart;
> }
>
> How about this?
>
Tricky, very very tricky :)

> >
> >> + /* No valid entry in this desc, we can free this desc now. */
> >> + if (!first_desc->sptes[0]) {
> >> + struct pte_list_desc *next_desc = first_desc->more;
> >> +
> >> + /*
> >> + * Only one entry existing but still use a desc to store it?
> >> + */
> >> + WARN_ON(!next_desc);
> >> +
> >> + mmu_free_pte_list_desc(first_desc);
> >> + first_desc = next_desc;
> >> + *pte_list = (unsigned long)first_desc | 1ul;
> >> return;
> >> - if (!prev_desc && !desc->more)
> >> - *pte_list = (unsigned long)desc->sptes[0];
> >> - else
> >> - if (prev_desc)
> >> - prev_desc->more = desc->more;
> >> - else
> >> - *pte_list = (unsigned long)desc->more | 1;
> >> - mmu_free_pte_list_desc(desc);
> >> + }
> >> +
> >> + WARN_ON(!first_desc->sptes[0]);
> >> +
> >> + /*
> >> + * Only one entry in this desc, move the entry to the head
> >> + * then the desc can be freed.
> >> + */
> >> + if (!first_desc->sptes[1] && !first_desc->more) {
> >> + *pte_list = (unsigned long)first_desc->sptes[0];
> >> + mmu_free_pte_list_desc(first_desc);
> >> + }
> >> }
> >>
> >> static void pte_list_remove(u64 *spte, unsigned long *pte_list)
> >> {
> >> struct pte_list_desc *desc;
> >> - struct pte_list_desc *prev_desc;
> >> int i;
> >>
> >> if (!*pte_list) {
> >> - printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
> >> - BUG();
> >> - } else if (!(*pte_list & 1)) {
> >> + WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
> > Why change BUG() to WARN() here and below?
>
> WARN(1, "xxx") can replace two lines in the origin code. And personally,
> i prefer WARN() to BUG() since sometimes BUG() can stop my box and i need to
> get the full log by using kdump.
>
> If you object it, i will change it back in the next version. :)
>
For debugging WARN() is doubtlessly better, but outside of development you
do not want to allow the kernel to run after serious MMU corruption is
detected. It may be exploitable further, we do not know, so the safe choice
is to stop the kernel.

--
Gleb.

2013-08-28 09:19:34

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

On 08/28/2013 04:58 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 04:37:32PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 04:12 PM, Gleb Natapov wrote:
>>
>>>> +
>>>> + rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
>>>> + desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>>>> +
>>>> + /* No empty position in the desc. */
>>>> + if (desc->sptes[PTE_LIST_EXT - 1]) {
>>>> + struct pte_list_desc *new_desc;
>>>> + new_desc = mmu_alloc_pte_list_desc(vcpu);
>>>> + new_desc->more = desc;
>>>> + desc = new_desc;
>>>> + *pte_list = (unsigned long)desc | 1;
>>>> }
>>>> - return count;
>>>> +
>>>> + free_pos = find_first_free(desc);
>>>> + desc->sptes[free_pos] = spte;
>>>> + return count_spte_number(desc);
>>> Should it be count_spte_number(desc) - 1? The function should returns
>>> the number of pte entries before the spte was added.
>>
>> Yes. We have handled it count_spte_number(), we count the number like this:
>>
>> return first_free + desc_num * PTE_LIST_EXT;
>>
>> The first_free is indexed from 0.
>>
> Suppose when pte_list_add() is called there is one full desc, so the
> number that should be returned is PTE_LIST_EXT, correct? But since
> before calling count_spte_number() one more desc will be added and
> desc->sptes[0] will be set in it the first_free in count_spte_number
> will be 1 and PTE_LIST_EXT + 1 will be returned.

Oh, yes, you are right. Will fix it in the next version, thanks for
pointing it out.

>
>> Maybe it is clearer to let count_spte_number() return the real number.
>>
>>>
>>>> }
>>>>
>>>> static void
>>>> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
>>>> - int i, struct pte_list_desc *prev_desc)
>>>> +pte_list_desc_remove_entry(unsigned long *pte_list,
>>>> + struct pte_list_desc *desc, int i)
>>>> {
>>>> - int j;
>>>> + struct pte_list_desc *first_desc;
>>>> + int last_used;
>>>> +
>>>> + first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>>>> + last_used = find_last_used(first_desc);
>>>>
>>>> - for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
>>>> - ;
>>>> - desc->sptes[i] = desc->sptes[j];
>>>> - desc->sptes[j] = NULL;
>>>> - if (j != 0)
>>>> + /*
>>>> + * Move the entry from the first desc to this position we want
>>>> + * to remove.
>>>> + */
>>>> + desc->sptes[i] = first_desc->sptes[last_used];
>>>> + first_desc->sptes[last_used] = NULL;
>>>> +
>>> What if desc == first_desc and i < last_used. You still move spte
>>> backwards so lockless walk may have already examined entry at i and
>>> will miss spte that was moved there from last_used position, no?
>>
>> Right. I noticed it too and fixed in the v2 which is being tested.
>> I fixed it by bottom-up walk desc, like this:
>>
>> pte_list_walk_lockless():
>>
>> desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>> while (!desc_is_a_nulls(desc)) {
>> /*
>> * We should do bottom-up walk since we always use the
>> * bottom entry to replace the deleted entry if only
>> * one desc is used in the rmap when a spte is removed.
>> * Otherwise the moved entry will be missed.
>> */
> I would call it top-down walk since we are walking from big indices to
> smaller once.

Okay, will fix the comments.

>
>> for (i = PTE_LIST_EXT - 1; i >= 0; i--)
>> fn(desc->sptes[i]);
>>
>> desc = ACCESS_ONCE(desc->more);
>>
>> /* It is being initialized. */
>> if (unlikely(!desc))
>> goto restart;
>> }
>>
>> How about this?
>>
> Tricky, very very tricky :)
>
>>>
>>>> + /* No valid entry in this desc, we can free this desc now. */
>>>> + if (!first_desc->sptes[0]) {
>>>> + struct pte_list_desc *next_desc = first_desc->more;
>>>> +
>>>> + /*
>>>> + * Only one entry existing but still use a desc to store it?
>>>> + */
>>>> + WARN_ON(!next_desc);
>>>> +
>>>> + mmu_free_pte_list_desc(first_desc);
>>>> + first_desc = next_desc;
>>>> + *pte_list = (unsigned long)first_desc | 1ul;
>>>> return;
>>>> - if (!prev_desc && !desc->more)
>>>> - *pte_list = (unsigned long)desc->sptes[0];
>>>> - else
>>>> - if (prev_desc)
>>>> - prev_desc->more = desc->more;
>>>> - else
>>>> - *pte_list = (unsigned long)desc->more | 1;
>>>> - mmu_free_pte_list_desc(desc);
>>>> + }
>>>> +
>>>> + WARN_ON(!first_desc->sptes[0]);
>>>> +
>>>> + /*
>>>> + * Only one entry in this desc, move the entry to the head
>>>> + * then the desc can be freed.
>>>> + */
>>>> + if (!first_desc->sptes[1] && !first_desc->more) {
>>>> + *pte_list = (unsigned long)first_desc->sptes[0];
>>>> + mmu_free_pte_list_desc(first_desc);
>>>> + }
>>>> }
>>>>
>>>> static void pte_list_remove(u64 *spte, unsigned long *pte_list)
>>>> {
>>>> struct pte_list_desc *desc;
>>>> - struct pte_list_desc *prev_desc;
>>>> int i;
>>>>
>>>> if (!*pte_list) {
>>>> - printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
>>>> - BUG();
>>>> - } else if (!(*pte_list & 1)) {
>>>> + WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
>>> Why change BUG() to WARN() here and below?
>>
>> WARN(1, "xxx") can replace two lines in the origin code. And personally,
>> i prefer WARN() to BUG() since sometimes BUG() can stop my box and i need to
>> get the full log by using kdump.
>>
>> If you object it, i will change it back in the next version. :)
>>
> For debugging WARN() is doubtlessly better, but outside of development
> you do not want to allow kernel to run after serious MMU corruption is
> detected. It may be exploitable further, we do not know, so the safe
> choice is to stop the kernel.

Okay, will keep BUG() in the next version.


2013-08-28 09:20:18

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On Tue, Jul 30, 2013 at 09:02:07PM +0800, Xiao Guangrong wrote:
> The basic idea is from nulls list which uses a nulls to indicate
> whether the desc is moved to different pte-list
>
> Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 36caf6a..f8fc0cc 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1010,6 +1010,14 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
> desc->sptes[0] = (u64 *)*pte_list;
> desc->sptes[1] = spte;
> desc_mark_nulls(pte_list, desc);
> +
> + /*
> + * Esure the old spte has been updated into desc, so
> + * that the another side can not get the desc from pte_list
> + * but miss the old spte.
> + */
> + smp_wmb();
> +
> *pte_list = (unsigned long)desc | 1;
> return 1;
> }
> @@ -1131,6 +1139,47 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
> WARN_ON(desc_get_nulls_value(desc) != pte_list);
> }
>
> +/* The caller should hold rcu lock. */
> +typedef void (*pte_list_walk_lockless_fn) (u64 *spte, int level);
> +static void pte_list_walk_lockless(unsigned long *pte_list,
> + pte_list_walk_lockless_fn fn, int level)
> +{
> + struct pte_list_desc *desc;
> + unsigned long pte_list_value;
> + int i;
> +
> +restart:
> + pte_list_value = ACCESS_ONCE(*pte_list);
> + if (!pte_list_value)
> + return;
> +
> + if (!(pte_list_value & 1))
> + return fn((u64 *)pte_list_value, level);
> +
> + /*
> + * fetch pte_list before read sptes in the desc, see the comments
> + * in pte_list_add().
> + *
> + * There is the data dependence since the desc is got from pte_list.
> + */
> + smp_read_barrier_depends();
> +
> + desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
> + while (!desc_is_a_nulls(desc)) {
> + for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> + fn(desc->sptes[i], level);
> +
> + desc = ACCESS_ONCE(desc->more);
> +
> + /* It is being initialized. */
> + if (unlikely(!desc))
> + goto restart;
> + }
> +
> + if (unlikely(desc_get_nulls_value(desc) != pte_list))
> + goto restart;
So is it really enough to guarantee safety and correctness? What if the desc
is moved to another rmap while we are walking it, so fn() is called on
incorrect sptes? Or what if the desc is moved to another rmap, but then it
is moved back to the initial rmap (but in another place in the desc list), so
the check here will not catch that we need to restart walking?


> +}
> +
> static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
> struct kvm_memory_slot *slot)
> {
> @@ -4557,7 +4606,7 @@ int kvm_mmu_module_init(void)
> {
> pte_list_desc_cache = kmem_cache_create("pte_list_desc",
> sizeof(struct pte_list_desc),
> - 0, 0, NULL);
> + 0, SLAB_DESTROY_BY_RCU, NULL);
> if (!pte_list_desc_cache)
> goto nomem;
>
> --
> 1.8.1.4

--
Gleb.

2013-08-28 09:34:02

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/28/2013 05:20 PM, Gleb Natapov wrote:
> On Tue, Jul 30, 2013 at 09:02:07PM +0800, Xiao Guangrong wrote:
>> The basic idea is from nulls list which uses a nulls to indicate
>> whether the desc is moved to different pte-list
>>
>> Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 36caf6a..f8fc0cc 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1010,6 +1010,14 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>> desc->sptes[0] = (u64 *)*pte_list;
>> desc->sptes[1] = spte;
>> desc_mark_nulls(pte_list, desc);
>> +
>> + /*
>> + * Esure the old spte has been updated into desc, so
>> + * that the another side can not get the desc from pte_list
>> + * but miss the old spte.
>> + */
>> + smp_wmb();
>> +
>> *pte_list = (unsigned long)desc | 1;
>> return 1;
>> }
>> @@ -1131,6 +1139,47 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
>> WARN_ON(desc_get_nulls_value(desc) != pte_list);
>> }
>>
>> +/* The caller should hold rcu lock. */
>> +typedef void (*pte_list_walk_lockless_fn) (u64 *spte, int level);
>> +static void pte_list_walk_lockless(unsigned long *pte_list,
>> + pte_list_walk_lockless_fn fn, int level)
>> +{
>> + struct pte_list_desc *desc;
>> + unsigned long pte_list_value;
>> + int i;
>> +
>> +restart:
>> + pte_list_value = ACCESS_ONCE(*pte_list);
>> + if (!pte_list_value)
>> + return;
>> +
>> + if (!(pte_list_value & 1))
>> + return fn((u64 *)pte_list_value, level);
>> +
>> + /*
>> + * fetch pte_list before read sptes in the desc, see the comments
>> + * in pte_list_add().
>> + *
>> + * There is the data dependence since the desc is got from pte_list.
>> + */
>> + smp_read_barrier_depends();
>> +
>> + desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>> + while (!desc_is_a_nulls(desc)) {
>> + for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
>> + fn(desc->sptes[i], level);
>> +
>> + desc = ACCESS_ONCE(desc->more);
>> +
>> + /* It is being initialized. */
>> + if (unlikely(!desc))
>> + goto restart;
>> + }
>> +
>> + if (unlikely(desc_get_nulls_value(desc) != pte_list))
>> + goto restart;
> So is it really enough to guaranty safety and correctness? What if desc
> is moved to another rmap while we walking it so fn() is called on
> incorrect sptes?

Then fn() will do needless write-protection. It is unnecessary work but it
is acceptable for the rare case.

There is a bug that we can not detect the mapping level from the rmap since
the desc can be moved as you said; it can cause us to do write-protection on
a middle-level spte. It can be fixed by getting the mapping level from
sp->role.level, since the sp can not be reused while the rcu read lock is
held.
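
A sketch of that fix (illustrative only; page_header() is the existing mmu.c
helper for finding the shadow page that contains a spte, and "spte" is the
pointer passed to the lockless walker's callback):

        struct kvm_mmu_page *sp = page_header(__pa(spte));
        int level = sp->role.level;

        /* write-protect according to sp->role.level, not the rmap slot */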

> Or what if desc is moved to another rmap, but then it
> is moved back to initial rmap (but another place in the desc list) so
> the check here will not catch that we need to restart walking?

It is okay. We always add the new desc to the head, so we will walk
all the entries in this case.

Right?

2013-08-28 09:47:36

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
> > Or what if desc is moved to another rmap, but then it
> > is moved back to initial rmap (but another place in the desc list) so
> > the check here will not catch that we need to restart walking?
>
> It is okay. We always add the new desc to the head, then we will walk
> all the entires under this case.
>
Which raises another question: What if a desc is added at the front of the
list, behind the point where the lockless walker currently is?

> Right?
Not sure. While the lockless walker works on a desc, the rmap can be
completely destroyed and recreated again. It can happen in any order.

--
Gleb.

2013-08-28 10:13:57

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>> Or what if desc is moved to another rmap, but then it
>>> is moved back to initial rmap (but another place in the desc list) so
>>> the check here will not catch that we need to restart walking?
>>
>> It is okay. We always add the new desc to the head, then we will walk
>> all the entires under this case.
>>
> Which races another question: What if desc is added in front of the list
> behind the point where lockless walker currently is?

That is the case where a new spte is being added into the rmap. We need not
care about the new sptes since they will set the dirty-bitmap and then they
can be write-protected next time.

>
>> Right?
> Not sure. While lockless walker works on a desc rmap can be completely
> destroyed and recreated again. It can be any order.

I think the thing is very similar to include/linux/rculist_nulls.h

2013-08-28 10:49:44

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
> >>> Or what if desc is moved to another rmap, but then it
> >>> is moved back to initial rmap (but another place in the desc list) so
> >>> the check here will not catch that we need to restart walking?
> >>
> >> It is okay. We always add the new desc to the head, then we will walk
> >> all the entires under this case.
> >>
> > Which races another question: What if desc is added in front of the list
> > behind the point where lockless walker currently is?
>
> That case is new spte is being added into the rmap. We need not to care the
> new sptes since it will set the dirty-bitmap then they can be write-protected
> next time.
>
OK.

> >
> >> Right?
> > Not sure. While lockless walker works on a desc rmap can be completely
> > destroyed and recreated again. It can be any order.
>
> I think the thing is very similar as include/linux/rculist_nulls.h
include/linux/rculist_nulls.h is for implementing hash tables, so they
may not care about add/del/lookup races, for instance, but maybe we do
(you are saying above that we do not), so similarity does not prove
correctness for our case. BTW I do not see
rcu_assign_pointer()/rcu_dereference() in your patches, which hints at
incorrect usage of RCU. I think any access to slab pointers will need to
use those.

--
Gleb.

2013-08-28 12:15:47

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/28/2013 06:49 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>>>> Or what if desc is moved to another rmap, but then it
>>>>> is moved back to initial rmap (but another place in the desc list) so
>>>>> the check here will not catch that we need to restart walking?
>>>>
>>>> It is okay. We always add the new desc to the head, then we will walk
>>>> all the entires under this case.
>>>>
>>> Which races another question: What if desc is added in front of the list
>>> behind the point where lockless walker currently is?
>>
>> That case is new spte is being added into the rmap. We need not to care the
>> new sptes since it will set the dirty-bitmap then they can be write-protected
>> next time.
>>
> OK.
>
>>>
>>>> Right?
>>> Not sure. While lockless walker works on a desc rmap can be completely
>>> destroyed and recreated again. It can be any order.
>>
>> I think the thing is very similar as include/linux/rculist_nulls.h
> include/linux/rculist_nulls.h is for implementing hash tables, so they
> may not care about add/del/lookup race for instance, but may be we are
> (you are saying above that we are not), so similarity does not prove
> correctness for our case.

We do not care about "add" and "del" either when looking up the rmap. The
"add" case is okay for the reason I have explained above. The "del" case is
also okay: the spte becomes nonpresent and all tlbs are flushed immediately.

I always use a stupid way to check the correctness, that is, enumerating
all the cases we may meet. In this patch, we may meet these cases:

1) kvm deletes a desc before the one we are currently on
   Those descs have already been checked, so we do not need to care about it.

2) kvm deletes a desc after the one we are currently on
   Since we always add/del the head desc, we can be sure the current desc
   has been deleted too, so we will meet case 3).

3) kvm deletes the desc that we are currently on
   3.a): the desc stays in the slab cache (it is not reused yet).
         All spte entries are empty, so fn() will skip the nonpresent sptes,
         and desc->more is
         3.a.1) still pointing to the next desc, so we continue the walk, or
         3.a.2) the "nulls", which means we have reached the last one and
                finish the walk.

   3.b): the desc is allocated from the slab cache and is being initialized.
         We will see "desc->more == NULL" and restart the walk. It's okay.

   3.c): the desc is added to a rmap or pte_list again.
         3.c.1): the desc is added to the current rmap again.
                 The new desc always acts as the head desc, so we will walk
                 all entries; some entries are checked twice but no entry
                 can be missed. It is okay.

         3.c.2): the desc is added to another rmap or pte_list.
                 Since kvm_set_memory_region() and get_dirty are serialized
                 by the slots-lock, the "nulls" can not be reused during the
                 lookup. Then we will meet a different "nulls" at the end of
                 the walk, which causes a rewalk.

I know checking the algorithm like this is really silly; do you have another
idea?

> BTW I do not see
> rcu_assign_pointer()/rcu_dereference() in your patches which hints on

IIUC, we can not directly use rcu_assign_pointer(); that is something like
p = v, assigning a pointer to a pointer. But in our case, we need:

        *pte_list = (unsigned long)desc | 1;

So I add the smp_wmb() by myself:

        /*
         * Ensure the old spte has been updated into desc, so
         * that the other side can not get the desc from pte_list
         * but miss the old spte.
         */
        smp_wmb();

        *pte_list = (unsigned long)desc | 1;

But I missed it when inserting an empty desc; in that case, we need the
barrier too, since we should make desc->more visible before assigning it to
pte_list to avoid the lookup side seeing the invalid "nulls".
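
A sketch of where that second barrier would go, applied to the pte_list_add()
hunk quoted earlier in this thread (illustrative placement, not the posted v2):

        /* No empty position in the desc. */
        if (desc->sptes[PTE_LIST_EXT - 1]) {
                struct pte_list_desc *new_desc;

                new_desc = mmu_alloc_pte_list_desc(vcpu);
                new_desc->more = desc;

                /*
                 * Make new_desc->more visible before publishing the new
                 * head, so the lockless walker never sees an uninitialized
                 * ->more.
                 */
                smp_wmb();

                desc = new_desc;
                *pte_list = (unsigned long)desc | 1;
        }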

I also use my own code instead of rcu_dereference():

pte_list_walk_lockless():
        pte_list_value = ACCESS_ONCE(*pte_list);
        if (!pte_list_value)
                return;

        if (!(pte_list_value & 1))
                return fn((u64 *)pte_list_value);

        /*
         * fetch pte_list before read sptes in the desc, see the comments
         * in pte_list_add().
         *
         * There is the data dependence since the desc is got from pte_list.
         */
        smp_read_barrier_depends();

That part can be replaced by rcu_dereference().
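
For example, something roughly like this (a sketch; the cast is needed
because the rmap value is an unsigned long that encodes either a bare spte
pointer or a tagged desc pointer, which is exactly the wrinkle discussed
above):

        pte_list_value = (unsigned long)rcu_dereference(
                                *(struct pte_list_desc __rcu **)pte_list);
        if (!pte_list_value)
                return;

        if (!(pte_list_value & 1))
                return fn((u64 *)pte_list_value);

        desc = (struct pte_list_desc *)(pte_list_value & ~1ul);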

> incorrect usage of RCU. I think any access to slab pointers will need to
> use those.

Using it for desc removal is not necessary, I think, since we do not mind
seeing the old info. (hlist_nulls_del_rcu() does not use rcu_dereference()
either.)


2013-08-28 13:36:46

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
> >> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> >>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
> >>>>> Or what if desc is moved to another rmap, but then it
> >>>>> is moved back to initial rmap (but another place in the desc list) so
> >>>>> the check here will not catch that we need to restart walking?
> >>>>
> >>>> It is okay. We always add the new desc to the head, then we will walk
> >>>> all the entires under this case.
> >>>>
> >>> Which races another question: What if desc is added in front of the list
> >>> behind the point where lockless walker currently is?
> >>
> >> That case is new spte is being added into the rmap. We need not to care the
> >> new sptes since it will set the dirty-bitmap then they can be write-protected
> >> next time.
> >>
> > OK.
> >
> >>>
> >>>> Right?
> >>> Not sure. While lockless walker works on a desc rmap can be completely
> >>> destroyed and recreated again. It can be any order.
> >>
> >> I think the thing is very similar as include/linux/rculist_nulls.h
> > include/linux/rculist_nulls.h is for implementing hash tables, so they
> > may not care about add/del/lookup race for instance, but may be we are
> > (you are saying above that we are not), so similarity does not prove
> > correctness for our case.
>
> We do not care the "add" and "del" too when lookup the rmap. Under the "add"
> case, it is okay, the reason i have explained above. Under the "del" case,
> the spte becomes unpresent and flush all tlbs immediately, so it is also okay.
>
> I always use a stupid way to check the correctness, that is enumerating
> all cases we may meet, in this patch, we may meet these cases:
>
> 1) kvm deletes the desc before we are current on
> that descs have been checked, do not need to care it.
>
> 2) kvm deletes the desc after we are currently on
> Since we always add/del the head desc, we can sure the current desc has been
> deleted, then we will meet case 3).
>
> 3) kvm deletes the desc that we are currently on
> 3.a): the desc stays in slab cache (do not be reused).
> all spte entires are empty, then the fn() will skip the nonprsent spte,
> and desc->more is
> 3.a.1) still pointing to next-desc, then we will continue the lookup
> 3.a.2) or it is the "nulls list", that means we reach the last one,
> then finish the walk.
>
> 3.b): the desc is alloc-ed from slab cache and it's being initialized.
> we will see "desc->more == NULL" then restart the walking. It's okay.
>
> 3.c): the desc is added to rmap or pte_list again.
> 3.c.1): the desc is added to the current rmap again.
> the new desc always acts as the head desc, then we will walk
> all entries, some entries are double checked and not entry
> can be missed. It is okay.
>
> 3.c.2): the desc is added to another rmap or pte_list
> since kvm_set_memory_region() and get_dirty are serial by slots-lock.
> so the "nulls" can not be reused during lookup. Then we we will
> meet the different "nulls" at the end of walking that will cause
> rewalk.
>
> I know check the algorithm like this is really silly, do you have other idea?
>
Not silly, but it is easy to miss cases. For instance 3.c.3 can be:
the desc is added to another rmap, then we move to another desc on the
wrong rmap, and this other desc is also deleted and reinserted into the
original rmap. Seems like the justification from 3.c.1 applies to that too,
though.

> > BTW I do not see
> > rcu_assign_pointer()/rcu_dereference() in your patches which hints on
>
> IIUC, We can not directly use rcu_assign_pointer(), that is something like:
> p = v to assign a pointer to a pointer. But in our case, we need:
> *pte_list = (unsigned long)desc | 1;
From Documentation/RCU/whatisRCU.txt:

The updater uses this function to assign a new value to an RCU-protected pointer.

This is what we do, no? (Assuming slot->arch.rmap[] is what RCU protects here.)
The fact that the value is not a proper pointer should not matter.

>
> So i add the smp_wmb() by myself:
> /*
> * Esure the old spte has been updated into desc, so
> * that the another side can not get the desc from pte_list
> * but miss the old spte.
> */
> smp_wmb();
>
> *pte_list = (unsigned long)desc | 1;
>
> But i missed it when inserting a empty desc, in that case, we need the barrier
> too since we should make desc->more visible before assign it to pte_list to
> avoid the lookup side seeing the invalid "nulls".
>
> I also use own code instead of rcu_dereference():
> pte_list_walk_lockless():
> pte_list_value = ACCESS_ONCE(*pte_list);
> if (!pte_list_value)
> return;
>
> if (!(pte_list_value & 1))
> return fn((u64 *)pte_list_value);
>
> /*
> * fetch pte_list before read sptes in the desc, see the comments
> * in pte_list_add().
> *
> * There is the data dependence since the desc is got from pte_list.
> */
> smp_read_barrier_depends();
>
> That part can be replaced by rcu_dereference().
>
Yes please; also see commit c87a124a5d5e8cf8e21c4363c3372bcaf53ea190 for the
kind of scary bugs we can get here.

> > incorrect usage of RCU. I think any access to slab pointers will need to
> > use those.
>
> Remove desc is not necessary i think since we do not mind to see the old
> info. (hlist_nulls_del_rcu() does not use rcu_dereference() too)
>
May be a bug. I also noticed that rculist_nulls uses rcu_dereference()
to access ->next, but it does not use rcu_assign_pointer() to assign it.

--
Gleb.

2013-08-29 06:51:03

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/28/2013 09:36 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
>>>> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
>>>>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>>>>>> Or what if desc is moved to another rmap, but then it
>>>>>>> is moved back to initial rmap (but another place in the desc list) so
>>>>>>> the check here will not catch that we need to restart walking?
>>>>>>
>>>>>> It is okay. We always add the new desc to the head, then we will walk
>>>>>> all the entires under this case.
>>>>>>
>>>>> Which races another question: What if desc is added in front of the list
>>>>> behind the point where lockless walker currently is?
>>>>
>>>> That case is new spte is being added into the rmap. We need not to care the
>>>> new sptes since it will set the dirty-bitmap then they can be write-protected
>>>> next time.
>>>>
>>> OK.
>>>
>>>>>
>>>>>> Right?
>>>>> Not sure. While lockless walker works on a desc rmap can be completely
>>>>> destroyed and recreated again. It can be any order.
>>>>
>>>> I think the thing is very similar as include/linux/rculist_nulls.h
>>> include/linux/rculist_nulls.h is for implementing hash tables, so they
>>> may not care about add/del/lookup race for instance, but may be we are
>>> (you are saying above that we are not), so similarity does not prove
>>> correctness for our case.
>>
>> We do not care the "add" and "del" too when lookup the rmap. Under the "add"
>> case, it is okay, the reason i have explained above. Under the "del" case,
>> the spte becomes unpresent and flush all tlbs immediately, so it is also okay.
>>
>> I always use a stupid way to check the correctness, that is enumerating
>> all cases we may meet, in this patch, we may meet these cases:
>>
>> 1) kvm deletes the desc before we are current on
>> that descs have been checked, do not need to care it.
>>
>> 2) kvm deletes the desc after we are currently on
>> Since we always add/del the head desc, we can sure the current desc has been
>> deleted, then we will meet case 3).
>>
>> 3) kvm deletes the desc that we are currently on
>> 3.a): the desc stays in slab cache (do not be reused).
>> all spte entires are empty, then the fn() will skip the nonprsent spte,
>> and desc->more is
>> 3.a.1) still pointing to next-desc, then we will continue the lookup
>> 3.a.2) or it is the "nulls list", that means we reach the last one,
>> then finish the walk.
>>
>> 3.b): the desc is alloc-ed from slab cache and it's being initialized.
>> we will see "desc->more == NULL" then restart the walking. It's okay.
>>
>> 3.c): the desc is added to rmap or pte_list again.
>> 3.c.1): the desc is added to the current rmap again.
>> the new desc always acts as the head desc, then we will walk
>> all entries, some entries are double checked and not entry
>> can be missed. It is okay.
>>
>> 3.c.2): the desc is added to another rmap or pte_list
>> since kvm_set_memory_region() and get_dirty are serial by slots-lock.
>> so the "nulls" can not be reused during lookup. Then we we will
>> meet the different "nulls" at the end of walking that will cause
>> rewalk.
>>
>> I know check the algorithm like this is really silly, do you have other idea?
>>
> Not silly, but easy to miss cases. For instance 3.c.3 can be:
> the desc is added to another rmap then we move to another desc on the
> wrong rmap, this other desc is also deleted and reinserted into
> original rmap. Seams like justification from 3.c.1 applies to that to
> though.
>
>>> BTW I do not see
>>> rcu_assign_pointer()/rcu_dereference() in your patches which hints on
>>
>> IIUC, We can not directly use rcu_assign_pointer(), that is something like:
>> p = v to assign a pointer to a pointer. But in our case, we need:
>> *pte_list = (unsigned long)desc | 1;
>>From Documentation/RCU/whatisRCU.txt:
>
> The updater uses this function to assign a new value to an RCU-protected pointer.
>
> This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects here)
> The fact that the value is not correct pointer should not matter.
>

Okay. Will change that code to:

+
+#define rcu_assign_head_desc(pte_list_p, value) \
+ rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), (unsigned long *)(value))
+
/*
* Pte mapping structures:
*
@@ -1006,14 +1010,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc->sptes[1] = spte;
desc_mark_nulls(pte_list, desc);

- /*
- * Esure the old spte has been updated into desc, so
- * that the another side can not get the desc from pte_list
- * but miss the old spte.
- */
- smp_wmb();
-
- *pte_list = (unsigned long)desc | 1;
+ rcu_assign_head_desc(pte_list, (unsigned long)desc | 1);
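
As a rough illustration (editorial, not part of the posted change), the matching
lockless read of this tagged head could then use rcu_dereference() directly; the
walker name below is an assumption, and it is expected to run inside the
rcu_read_lock() taken by kvm_mmu_rcu_free_page_begin():

/*
 * Sketch only: fetch the rmap head with rcu_dereference(); the low bit
 * distinguishes "single spte" from "pointer to the first desc".
 */
static void pte_list_walk_lockless_head(unsigned long *pte_list,
					void (*fn)(u64 *sptep))
{
	unsigned long value;

	value = (unsigned long)rcu_dereference(
			*(unsigned long __rcu **)pte_list);
	if (!value)
		return;			/* rmap is empty */

	if (!(value & 1)) {
		fn((u64 *)value);	/* single spte, no desc */
		return;
	}

	/* (value & ~1ul) is the first desc; walk it with the nulls check. */
}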

>>
>> So i add the smp_wmb() by myself:
>> /*
>> * Esure the old spte has been updated into desc, so
>> * that the another side can not get the desc from pte_list
>> * but miss the old spte.
>> */
>> smp_wmb();
>>
>> *pte_list = (unsigned long)desc | 1;
>>
>> But i missed it when inserting a empty desc, in that case, we need the barrier
>> too since we should make desc->more visible before assign it to pte_list to
>> avoid the lookup side seeing the invalid "nulls".
>>
>> I also use own code instead of rcu_dereference():
>> pte_list_walk_lockless():
>> pte_list_value = ACCESS_ONCE(*pte_list);
>> if (!pte_list_value)
>> return;
>>
>> if (!(pte_list_value & 1))
>> return fn((u64 *)pte_list_value);
>>
>> /*
>> * fetch pte_list before read sptes in the desc, see the comments
>> * in pte_list_add().
>> *
>> * There is the data dependence since the desc is got from pte_list.
>> */
>> smp_read_barrier_depends();
>>
>> That part can be replaced by rcu_dereference().
>>
> Yes please, also see commit c87a124a5d5e8cf8e21c4363c3372bcaf53ea190 for
> kind of scary bugs we can get here.

Right, it is likely trigger-able in our case, will fix it.

>
>>> incorrect usage of RCU. I think any access to slab pointers will need to
>>> use those.
>>
>> Remove desc is not necessary i think since we do not mind to see the old
>> info. (hlist_nulls_del_rcu() does not use rcu_dereference() too)
>>
> May be a bug. I also noticed that rculist_nulls uses rcu_dereference()

But list_del_rcu() does not use rcu_assign_pointer() either.

> to access ->next, but it does not use rcu_assign_pointer() pointer to
> assign it.

You mean rcu_dereference() is used in hlist_nulls_for_each_entry_rcu()? I think
that is because we should validate the prefetched data before entry->next is
accessed; it is paired with the barrier in rcu_assign_pointer() when a new entry
is added to the list. rcu_assign_pointer() makes the other fields of the entry
visible before linking the entry into the list. Otherwise, the lookup could
access that entry but see invalid fields.

After more thinking, I still think rcu_assign_pointer() is not needed when an
entry is removed. The remove API does not care about the order between unlinking
the entry and the changes to its fields; that is the caller's responsibility:
- In the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc. to
ensure all lookups have exited, so any later change to that entry is invisible
to the lookups.

- In the case of rculist_nulls, it seems a refcounter is used to guarantee the
order (see the example in Documentation/RCU/rculist_nulls.txt).

- In our case, we allow the lookup to see the deleted desc even while it sits in
the slab cache, is being initialized, or is being re-added (see the sketch below).
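
To make the asymmetry above concrete, here is a minimal editorial sketch (not
code from the series; the helper names are illustrative), assuming the rmap head
is the RCU-protected value and descs come from a SLAB_DESTROY_BY_RCU cache:

/* Publish: order the desc's initialisation before it becomes reachable,
 * hence rcu_assign_pointer() (which supplies the write barrier). */
static void publish_head_desc(unsigned long *pte_list,
			      struct pte_list_desc *desc)
{
	rcu_assign_pointer(*(unsigned long __rcu **)pte_list,
			   (unsigned long *)((unsigned long)desc | 1));
}

/* Retract: a plain store is enough.  A walker that already fetched the
 * old value still sees a struct pte_list_desc (possibly stale or being
 * reused), which the nulls check and the dirty-bitmap ordering tolerate. */
static void retract_head_desc(unsigned long *pte_list, unsigned long new_head)
{
	*pte_list = new_head;
}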

Your thought?

2013-08-29 07:17:14

by Gleb Natapov

[permalink] [raw]
Subject: Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect

On Sat, Aug 03, 2013 at 02:09:43PM +0900, Takuya Yoshikawa wrote:
> On Tue, 30 Jul 2013 21:01:58 +0800
> Xiao Guangrong <[email protected]> wrote:
>
> > Background
> > ==========
> > Currently, when mark memslot dirty logged or get dirty page, we need to
> > write-protect large guest memory, it is the heavy work, especially, we need to
> > hold mmu-lock which is also required by vcpu to fix its page table fault and
> > mmu-notifier when host page is being changed. In the extreme cpu / memory used
> > guest, it becomes a scalability issue.
> >
> > This patchset introduces a way to locklessly write-protect guest memory.
>
> Nice improvements!
>
> If I read the patch set correctly, this work contains the following changes:
>
> Cleanups:
> Patch 1 and patch 12.
>
Yes, I do not see a reason not to apply 1 straight away. 12 depends on
other patches, though.

> Lazy large page dropping for dirty logging:
> Patch 2-3.
> Patch 2 is preparatory to patch 3.
>
> This does not look like an RFC if you address Marcelo's comment.
> Any reason to include this in an RFC patch set?
Agree, you can post them separately for faster inclusion.

>
> Making remote TLBs flushable outside of mmu_lock for dirty logging:
> Patch 6.
>
> This is nice. I'm locally using a similar patch for my work, but yours
> is much cleaner and better. I hope this will get merged soon.
>
But without other patches this patch itself doesn't do much, no?

> New Pte-list handling:
> Patch 7-9.
>
> Still reading the details.
>
> RCU-based lockless write protection.
> Patch 10-11.
>
> If I understand RCU correctly, the current implementation has a problem:
> read-side critical sections can become too long.
>
> See the following LWN's article:
> "Sleepable RCU"
> https://lwn.net/Articles/202847/
>
> Especially, kvm_mmu_slot_remove_write_access() can take hundreds of
> milliseconds, or even a few seconds for guests using shadow paging.
> Is it possible to break the read-side critical section after protecting
> some pages? -- I guess so.
>
> Anyway, I want to see the following non-RFC quality patches get merged first:
> - Lazy large page dropping for dirty logging:
> - Making remote TLBs flushable outside of mmu_lock for dirty logging
>
> As you are doing in patch 11, the latter can eliminate the TLB flushes before
> cond_resched_lock(). So this alone is an optimization, and since my work is
> based on this TLB flush-less lock breaking, I would appriciate if you make this
> change first in your clean way.
>
> The remaining patches, pte-list refactoring and lock-less ones, also look
> interesting, but I need to read more to understand them.
>
> Thanks for the nice work!
Indeed. FWIW I completed the review and am waiting for a new version.

--
Gleb.

2013-08-29 07:24:06

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 01/12] KVM: MMU: remove unused parameter

On Tue, Jul 30, 2013 at 09:01:59PM +0800, Xiao Guangrong wrote:
> @vcpu in page_fault_can_be_fast() is not used so remove it
>
> Signed-off-by: Xiao Guangrong <[email protected]>
Applied this one. Thanks.

> ---
> arch/x86/kvm/mmu.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 3a9493a..a731c0b 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2808,7 +2808,7 @@ exit:
> return ret;
> }
>
> -static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, u32 error_code)
> +static bool page_fault_can_be_fast(u32 error_code)
> {
> /*
> * Do not fix the mmio spte with invalid generation number which
> @@ -2861,7 +2861,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> bool ret = false;
> u64 spte = 0ull;
>
> - if (!page_fault_can_be_fast(vcpu, error_code))
> + if (!page_fault_can_be_fast(error_code))
> return false;
>
> walk_shadow_page_lockless_begin(vcpu);
> --
> 1.8.1.4

--
Gleb.

2013-08-29 09:08:40

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
> >>> BTW I do not see
> >>> rcu_assign_pointer()/rcu_dereference() in your patches which hints on
> >>
> >> IIUC, We can not directly use rcu_assign_pointer(), that is something like:
> >> p = v to assign a pointer to a pointer. But in our case, we need:
> >> *pte_list = (unsigned long)desc | 1;
> >>From Documentation/RCU/whatisRCU.txt:
> >
> > The updater uses this function to assign a new value to an RCU-protected pointer.
> >
> > This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects here)
> > The fact that the value is not correct pointer should not matter.
> >
>
> Okay. Will change that code to:
>
> +
> +#define rcu_assign_head_desc(pte_list_p, value) \
> + rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), (unsigned long *)(value))
> +
> /*
> * Pte mapping structures:
> *
> @@ -1006,14 +1010,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
> desc->sptes[1] = spte;
> desc_mark_nulls(pte_list, desc);
>
> - /*
> - * Esure the old spte has been updated into desc, so
> - * that the another side can not get the desc from pte_list
> - * but miss the old spte.
> - */
> - smp_wmb();
> -
> - *pte_list = (unsigned long)desc | 1;
> + rcu_assign_head_desc(pte_list, (unsigned long)desc | 1);
>
> >>
> >> So i add the smp_wmb() by myself:
> >> /*
> >> * Esure the old spte has been updated into desc, so
> >> * that the another side can not get the desc from pte_list
> >> * but miss the old spte.
> >> */
> >> smp_wmb();
> >>
> >> *pte_list = (unsigned long)desc | 1;
> >>
> >> But i missed it when inserting a empty desc, in that case, we need the barrier
> >> too since we should make desc->more visible before assign it to pte_list to
> >> avoid the lookup side seeing the invalid "nulls".
> >>
> >> I also use own code instead of rcu_dereference():
> >> pte_list_walk_lockless():
> >> pte_list_value = ACCESS_ONCE(*pte_list);
> >> if (!pte_list_value)
> >> return;
> >>
> >> if (!(pte_list_value & 1))
> >> return fn((u64 *)pte_list_value);
> >>
> >> /*
> >> * fetch pte_list before read sptes in the desc, see the comments
> >> * in pte_list_add().
> >> *
> >> * There is the data dependence since the desc is got from pte_list.
> >> */
> >> smp_read_barrier_depends();
> >>
> >> That part can be replaced by rcu_dereference().
> >>
> > Yes please, also see commit c87a124a5d5e8cf8e21c4363c3372bcaf53ea190 for
> > kind of scary bugs we can get here.
>
> Right, it is likely trigger-able in our case, will fix it.
>
> >
> >>> incorrect usage of RCU. I think any access to slab pointers will need to
> >>> use those.
> >>
> >> Remove desc is not necessary i think since we do not mind to see the old
> >> info. (hlist_nulls_del_rcu() does not use rcu_dereference() too)
> >>
> > May be a bug. I also noticed that rculist_nulls uses rcu_dereference()
>
> But list_del_rcu() does not use rcu_assign_pointer() too.
>
This is also suspicious.

> > to access ->next, but it does not use rcu_assign_pointer() pointer to
> > assign it.
>
> You mean rcu_dereference() is used in hlist_nulls_for_each_entry_rcu()? I think
> it's because we should validate the prefetched data before entry->next is
> accessed, it is paired with the barrier in rcu_assign_pointer() when add a
> new entry into the list. rcu_assign_pointer() make other fields in the entry
> be visible before linking entry to the list. Otherwise, the lookup can access
> that entry but get the invalid fields.
>
> After more thinking, I still think rcu_assign_pointer() is unneeded when a entry
> is removed. The remove-API does not care the order between unlink the entry and
> the changes to its fields. It is the caller's responsibility:
> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc to
> enforce all lookups exit and the later change on that entry is invisible to the
> lookups.
>
> - In the case of rculist_nulls, it seems refcounter is used to guarantee the order
> (see the example from Documentation/RCU/rculist_nulls.txt).
>
> - In our case, we allow the lookup to see the deleted desc even if it is in slab cache
> or its is initialized or it is re-added.
>
> Your thought?
>

As Documentation/RCU/whatisRCU.txt says:

As with rcu_assign_pointer(), an important function of
rcu_dereference() is to document which pointers are protected by
RCU, in particular, flagging a pointer that is subject to changing
at any time, including immediately after the rcu_dereference().
And, again like rcu_assign_pointer(), rcu_dereference() is
typically used indirectly, via the _rcu list-manipulation
primitives, such as list_for_each_entry_rcu().

The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
important. The code is complicated, so self documentation will not hurt.
I want to see what is actually protected by rcu here. Freeing shadow
pages with call_rcu() further complicates matters: does it mean that
shadow pages are also protected by rcu?

--
Gleb.

2013-08-29 09:10:27

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

On Tue, Jul 30, 2013 at 09:02:08PM +0800, Xiao Guangrong wrote:
> It is easy if the handler is in the vcpu context, in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
> disable interrupt to stop shadow page be freed. But we are on the ioctl context
> and the paths we are optimizing for have heavy workload, disabling interrupt is
> not good for the system performance
>
> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
> indicator are protected by slot-lock, so it need not be atomic and does not
> hurt the performance and the scalability
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 6 +++++-
> arch/x86/kvm/mmu.c | 23 +++++++++++++++++++++++
> arch/x86/kvm/mmu.h | 22 ++++++++++++++++++++++
> 3 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 531f47c..dc842b6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -226,7 +226,10 @@ struct kvm_mmu_page {
> /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
> unsigned long mmu_valid_gen;
>
> - DECLARE_BITMAP(unsync_child_bitmap, 512);
> + union {
> + DECLARE_BITMAP(unsync_child_bitmap, 512);
> + struct rcu_head rcu;
> + };
>
> #ifdef CONFIG_X86_32
> /*
> @@ -545,6 +548,7 @@ struct kvm_arch {
> */
> struct list_head active_mmu_pages;
> struct list_head zapped_obsolete_pages;
> + bool rcu_free_shadow_page;
>
> struct list_head assigned_dev_head;
> struct iommu_domain *iommu_domain;
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f8fc0cc..7f3391f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2322,6 +2322,22 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> return ret;
> }
>
> +static void free_pages_rcu(struct rcu_head *head)
> +{
> + struct kvm_mmu_page *next, *sp;
> +
> + sp = container_of(head, struct kvm_mmu_page, rcu);
> + while (sp) {
> + if (!list_empty(&sp->link))
> + next = list_first_entry(&sp->link,
> + struct kvm_mmu_page, link);
> + else
> + next = NULL;
> + kvm_mmu_free_page(sp);
So here we are calling kvm_mmu_free_page() without holding mmu lock, why
is it safe?

> + sp = next;
> + }
> +}
> +
> static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> struct list_head *invalid_list)
> {
> @@ -2342,6 +2358,13 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> */
> kvm_flush_remote_tlbs(kvm);
>
> + if (kvm->arch.rcu_free_shadow_page) {
> + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> + list_del_init(invalid_list);
> + call_rcu(&sp->rcu, free_pages_rcu);
> + return;
> + }
> +
> list_for_each_entry_safe(sp, nsp, invalid_list, link) {
> WARN_ON(!sp->role.invalid || sp->root_count);
> kvm_mmu_free_page(sp);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 5b59c57..85405f1 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -115,4 +115,26 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
> }
>
> void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
> +
> +/*
> + * The caller should ensure that these two functions should be
> + * serially called.
> + */
> +static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
> +{
> + rcu_read_lock();
> +
> + kvm->arch.rcu_free_shadow_page = true;
> + /* Set the indicator before access shadow page. */
> + smp_mb();
> +}
> +
> +static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
> +{
> + /* Make sure that access shadow page has finished. */
> + smp_mb();
> + kvm->arch.rcu_free_shadow_page = false;
> +
> + rcu_read_unlock();
> +}
> #endif
> --
> 1.8.1.4

--
Gleb.

2013-08-29 09:25:24

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

On 08/29/2013 05:10 PM, Gleb Natapov wrote:
> On Tue, Jul 30, 2013 at 09:02:08PM +0800, Xiao Guangrong wrote:
>> It is easy if the handler is in the vcpu context, in that case we can use
>> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
>> disable interrupt to stop shadow page be freed. But we are on the ioctl context
>> and the paths we are optimizing for have heavy workload, disabling interrupt is
>> not good for the system performance
>>
>> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
>> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
>> indicator are protected by slot-lock, so it need not be atomic and does not
>> hurt the performance and the scalability
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 6 +++++-
>> arch/x86/kvm/mmu.c | 23 +++++++++++++++++++++++
>> arch/x86/kvm/mmu.h | 22 ++++++++++++++++++++++
>> 3 files changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 531f47c..dc842b6 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -226,7 +226,10 @@ struct kvm_mmu_page {
>> /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
>> unsigned long mmu_valid_gen;
>>
>> - DECLARE_BITMAP(unsync_child_bitmap, 512);
>> + union {
>> + DECLARE_BITMAP(unsync_child_bitmap, 512);
>> + struct rcu_head rcu;
>> + };
>>
>> #ifdef CONFIG_X86_32
>> /*
>> @@ -545,6 +548,7 @@ struct kvm_arch {
>> */
>> struct list_head active_mmu_pages;
>> struct list_head zapped_obsolete_pages;
>> + bool rcu_free_shadow_page;
>>
>> struct list_head assigned_dev_head;
>> struct iommu_domain *iommu_domain;
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index f8fc0cc..7f3391f 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2322,6 +2322,22 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> return ret;
>> }
>>
>> +static void free_pages_rcu(struct rcu_head *head)
>> +{
>> + struct kvm_mmu_page *next, *sp;
>> +
>> + sp = container_of(head, struct kvm_mmu_page, rcu);
>> + while (sp) {
>> + if (!list_empty(&sp->link))
>> + next = list_first_entry(&sp->link,
>> + struct kvm_mmu_page, link);
>> + else
>> + next = NULL;
>> + kvm_mmu_free_page(sp);
> So here we are calling kvm_mmu_free_page() without holding mmu lock, why
> is it safe?

Oops. :(

I should move "hlist_del(&sp->hash_link);" from this function to
kvm_mmu_prepare_zap_page(); after that, kvm_mmu_free_page() will not
touch any global resource anymore.
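
A rough editorial sketch of that rearrangement (simplified, not the actual
follow-up patch; mmu_page_header_cache is the existing slab cache in mmu.c):

/* runs inside kvm_mmu_prepare_zap_page(), still under mmu_lock */
static void zap_unlink_sketch(struct kvm *kvm, struct kvm_mmu_page *sp,
			      struct list_head *invalid_list)
{
	hlist_del(&sp->hash_link);		/* moved here from the free path */
	list_move(&sp->link, invalid_list);
}

/* what is left for kvm_mmu_free_page()/free_pages_rcu(): memory only,
 * so it is safe to run without mmu_lock, e.g. from an RCU callback */
static void free_page_sketch(struct kvm_mmu_page *sp)
{
	free_page((unsigned long)sp->spt);
	kmem_cache_free(mmu_page_header_cache, sp);
}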

2013-08-29 09:31:47

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
> After more thinking, I still think rcu_assign_pointer() is unneeded when a entry
> is removed. The remove-API does not care the order between unlink the entry and
> the changes to its fields. It is the caller's responsibility:
> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc to
> enforce all lookups exit and the later change on that entry is invisible to the
> lookups.
>
> - In the case of rculist_nulls, it seems refcounter is used to guarantee the order
> (see the example from Documentation/RCU/rculist_nulls.txt).
>
> - In our case, we allow the lookup to see the deleted desc even if it is in slab cache
> or its is initialized or it is re-added.
>
BTW, is it a good idea? We can access a deleted desc while it is being allocated
and initialized to zero by kmem_cache_zalloc(); are we sure we cannot
see a partially initialized desc->sptes[] entry? On a related note, what about
32-bit systems? They do not have atomic access to desc->sptes[].

--
Gleb.

2013-08-29 09:31:56

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/29/2013 05:08 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
>>>>> BTW I do not see
>>>>> rcu_assign_pointer()/rcu_dereference() in your patches which hints on
>>>>
>>>> IIUC, We can not directly use rcu_assign_pointer(), that is something like:
>>>> p = v to assign a pointer to a pointer. But in our case, we need:
>>>> *pte_list = (unsigned long)desc | 1;
>>> >From Documentation/RCU/whatisRCU.txt:
>>>
>>> The updater uses this function to assign a new value to an RCU-protected pointer.
>>>
>>> This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects here)
>>> The fact that the value is not correct pointer should not matter.
>>>
>>
>> Okay. Will change that code to:
>>
>> +
>> +#define rcu_assign_head_desc(pte_list_p, value) \
>> + rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), (unsigned long *)(value))
>> +
>> /*
>> * Pte mapping structures:
>> *
>> @@ -1006,14 +1010,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>> desc->sptes[1] = spte;
>> desc_mark_nulls(pte_list, desc);
>>
>> - /*
>> - * Esure the old spte has been updated into desc, so
>> - * that the another side can not get the desc from pte_list
>> - * but miss the old spte.
>> - */
>> - smp_wmb();
>> -
>> - *pte_list = (unsigned long)desc | 1;
>> + rcu_assign_head_desc(pte_list, (unsigned long)desc | 1);
>>
>>>>
>>>> So i add the smp_wmb() by myself:
>>>> /*
>>>> * Esure the old spte has been updated into desc, so
>>>> * that the another side can not get the desc from pte_list
>>>> * but miss the old spte.
>>>> */
>>>> smp_wmb();
>>>>
>>>> *pte_list = (unsigned long)desc | 1;
>>>>
>>>> But i missed it when inserting a empty desc, in that case, we need the barrier
>>>> too since we should make desc->more visible before assign it to pte_list to
>>>> avoid the lookup side seeing the invalid "nulls".
>>>>
>>>> I also use own code instead of rcu_dereference():
>>>> pte_list_walk_lockless():
>>>> pte_list_value = ACCESS_ONCE(*pte_list);
>>>> if (!pte_list_value)
>>>> return;
>>>>
>>>> if (!(pte_list_value & 1))
>>>> return fn((u64 *)pte_list_value);
>>>>
>>>> /*
>>>> * fetch pte_list before read sptes in the desc, see the comments
>>>> * in pte_list_add().
>>>> *
>>>> * There is the data dependence since the desc is got from pte_list.
>>>> */
>>>> smp_read_barrier_depends();
>>>>
>>>> That part can be replaced by rcu_dereference().
>>>>
>>> Yes please, also see commit c87a124a5d5e8cf8e21c4363c3372bcaf53ea190 for
>>> kind of scary bugs we can get here.
>>
>> Right, it is likely trigger-able in our case, will fix it.
>>
>>>
>>>>> incorrect usage of RCU. I think any access to slab pointers will need to
>>>>> use those.
>>>>
>>>> Remove desc is not necessary i think since we do not mind to see the old
>>>> info. (hlist_nulls_del_rcu() does not use rcu_dereference() too)
>>>>
>>> May be a bug. I also noticed that rculist_nulls uses rcu_dereference()
>>
>> But list_del_rcu() does not use rcu_assign_pointer() too.
>>
> This also suspicious.
>
>>> to access ->next, but it does not use rcu_assign_pointer() pointer to
>>> assign it.
>>
>> You mean rcu_dereference() is used in hlist_nulls_for_each_entry_rcu()? I think
>> it's because we should validate the prefetched data before entry->next is
>> accessed, it is paired with the barrier in rcu_assign_pointer() when add a
>> new entry into the list. rcu_assign_pointer() make other fields in the entry
>> be visible before linking entry to the list. Otherwise, the lookup can access
>> that entry but get the invalid fields.
>>
>> After more thinking, I still think rcu_assign_pointer() is unneeded when a entry
>> is removed. The remove-API does not care the order between unlink the entry and
>> the changes to its fields. It is the caller's responsibility:
>> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc to
>> enforce all lookups exit and the later change on that entry is invisible to the
>> lookups.
>>
>> - In the case of rculist_nulls, it seems refcounter is used to guarantee the order
>> (see the example from Documentation/RCU/rculist_nulls.txt).
>>
>> - In our case, we allow the lookup to see the deleted desc even if it is in slab cache
>> or its is initialized or it is re-added.
>>
>> Your thought?
>>
>
> As Documentation/RCU/whatisRCU.txt says:
>
> As with rcu_assign_pointer(), an important function of
> rcu_dereference() is to document which pointers are protected by
> RCU, in particular, flagging a pointer that is subject to changing
> at any time, including immediately after the rcu_dereference().
> And, again like rcu_assign_pointer(), rcu_dereference() is
> typically used indirectly, via the _rcu list-manipulation
> primitives, such as list_for_each_entry_rcu().
>
> The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
> important. The code is complicated, so self documentation will not hurt.
> I want to see what is actually protected by rcu here. Freeing shadow
> pages with call_rcu() further complicates matters: does it mean that
> shadow pages are also protected by rcu?

Yes, it stops the shadow page from being freed while we do write-protection on
it.
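
For context, a minimal editorial sketch of the intended usage (the walker below
is a placeholder, not a function from the series): the begin/end helpers from
patch 10 bracket the lockless walk, so a shadow page zapped in the meantime is
freed through call_rcu() and its page table stays valid until the read side
finishes.

static void write_protect_slot_lockless_sketch(struct kvm *kvm,
					       struct kvm_memory_slot *slot)
{
	kvm_mmu_rcu_free_page_begin(kvm);
	/* walk slot->arch.rmap[] and clear the W bit on each present spte */
	kvm_mmu_rcu_free_page_end(kvm);
}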

2013-08-29 09:51:29

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On Thu, Aug 29, 2013 at 05:31:42PM +0800, Xiao Guangrong wrote:
> > As Documentation/RCU/whatisRCU.txt says:
> >
> > As with rcu_assign_pointer(), an important function of
> > rcu_dereference() is to document which pointers are protected by
> > RCU, in particular, flagging a pointer that is subject to changing
> > at any time, including immediately after the rcu_dereference().
> > And, again like rcu_assign_pointer(), rcu_dereference() is
> > typically used indirectly, via the _rcu list-manipulation
> > primitives, such as list_for_each_entry_rcu().
> >
> > The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
> > important. The code is complicated, so self documentation will not hurt.
> > I want to see what is actually protected by rcu here. Freeing shadow
> > pages with call_rcu() further complicates matters: does it mean that
> > shadow pages are also protected by rcu?
>
> Yes, it stops shadow page to be freed when we do write-protection on
> it.
>
Yeah, I got the trick; what I am saying is that we have a data structure
here protected by RCU, but we do not use the RCU functions to access it...
BTW, why not allocate sp->spt from a SLAB_DESTROY_BY_RCU cache too? We may
occasionally write-protect a random spt if the page is deleted
and reused for another spt, though. For a last-level spt it should not be a
problem, and for non-last-level we have the is_last_spte() check in
__rmap_write_protect_lockless(). Can it work?

--
Gleb.

2013-08-29 11:26:51

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/29/2013 05:51 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 05:31:42PM +0800, Xiao Guangrong wrote:
>>> As Documentation/RCU/whatisRCU.txt says:
>>>
>>> As with rcu_assign_pointer(), an important function of
>>> rcu_dereference() is to document which pointers are protected by
>>> RCU, in particular, flagging a pointer that is subject to changing
>>> at any time, including immediately after the rcu_dereference().
>>> And, again like rcu_assign_pointer(), rcu_dereference() is
>>> typically used indirectly, via the _rcu list-manipulation
>>> primitives, such as list_for_each_entry_rcu().
>>>
>>> The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
>>> important. The code is complicated, so self documentation will not hurt.
>>> I want to see what is actually protected by rcu here. Freeing shadow
>>> pages with call_rcu() further complicates matters: does it mean that
>>> shadow pages are also protected by rcu?
>>
>> Yes, it stops shadow page to be freed when we do write-protection on
>> it.
>>
> Yeah, I got the trick, what I am saying that we have a data structure
> here protected by RCU, but we do not use RCU functions to access it...

Yes, they are not used when inserting a spte into the rmap and when getting the
rmap from the entry... but do we need to use these functions to guarantee the order?

The worst case is that we fetch the spte from the desc but the spte has not been
updated yet; we can happily skip this spte since it will set the
dirty-bitmap later. This is guaranteed by the barrier between mmu_spte_update()
and mark_page_dirty(); the code is:

set_spte():

	if (mmu_spte_update(sptep, spte))
		kvm_flush_remote_tlbs(vcpu->kvm);

	if (!remap) {
		if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
			rmap_recycle(vcpu, sptep, gfn);

		if (level > PT_PAGE_TABLE_LEVEL)
			++vcpu->kvm->stat.lpages;
	}

	smp_wmb();

	if (pte_access & ACC_WRITE_MASK)
		mark_page_dirty(vcpu->kvm, gfn);

So, I guess if we can guarantee the order by ourselves, we do not need
to call the RCU functions explicitly...

But the memory barriers in the RCU functions are really light on x86 (a store
can not be reordered with another store), so I do not mind using them explicitly
if you think that way is safer. :)

> BTW why not allocate sp->spt from SLAB_DESTROY_BY_RCU cache too? We may
> switch write protection on a random spt occasionally if page is deleted
> and reused for another spt though. For last level spt it should not be a
> problem and for non last level we have is_last_spte() check in
> __rmap_write_protect_lockless(). Can it work?

Yes, I also considered this way. It can work if we handle is_last_spte()
properly. Since sp->spt can be reused, we can not get the mapping
level from the sp. We need to encode the mapping level into the spte so that
cmpxchg can tell whether the page table has been moved to another mapping
level. Could you allow me to make this optimization separately, after this
patchset is merged?



2013-08-29 11:33:58

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/29/2013 05:31 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
>> After more thinking, I still think rcu_assign_pointer() is unneeded when a entry
>> is removed. The remove-API does not care the order between unlink the entry and
>> the changes to its fields. It is the caller's responsibility:
>> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc to
>> enforce all lookups exit and the later change on that entry is invisible to the
>> lookups.
>>
>> - In the case of rculist_nulls, it seems refcounter is used to guarantee the order
>> (see the example from Documentation/RCU/rculist_nulls.txt).
>>
>> - In our case, we allow the lookup to see the deleted desc even if it is in slab cache
>> or its is initialized or it is re-added.
>>
> BTW is it a good idea? We can access deleted desc while it is allocated
> and initialized to zero by kmem_cache_zalloc(), are we sure we cannot
> see partially initialized desc->sptes[] entry? On related note what about
> 32 bit systems, they do not have atomic access to desc->sptes[].

Good eyes. There is a bug here.

It seems we do not have a good way to fix this. How about disabling this
optimization on 32-bit hosts? A small change:

static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
{
+#ifdef CONFIG_X86_64
	rcu_read_lock();

	kvm->arch.rcu_free_shadow_page = true;
	/* Set the indicator before access shadow page. */
	smp_mb();
+#else
+	spin_lock(&kvm->mmu_lock);
+#endif
}

static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
{
+#ifdef CONFIG_X86_64
	/* Make sure that access shadow page has finished. */
	smp_mb();
	kvm->arch.rcu_free_shadow_page = false;

	rcu_read_unlock();
+#else
+	spin_unlock(&kvm->mmu_lock);
+#endif
}

2013-08-29 12:02:42

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/29/2013 07:33 PM, Xiao Guangrong wrote:
> On 08/29/2013 05:31 PM, Gleb Natapov wrote:
>> On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
>>> After more thinking, I still think rcu_assign_pointer() is unneeded when a entry
>>> is removed. The remove-API does not care the order between unlink the entry and
>>> the changes to its fields. It is the caller's responsibility:
>>> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc to
>>> enforce all lookups exit and the later change on that entry is invisible to the
>>> lookups.
>>>
>>> - In the case of rculist_nulls, it seems refcounter is used to guarantee the order
>>> (see the example from Documentation/RCU/rculist_nulls.txt).
>>>
>>> - In our case, we allow the lookup to see the deleted desc even if it is in slab cache
>>> or its is initialized or it is re-added.
>>>
>> BTW is it a good idea? We can access deleted desc while it is allocated
>> and initialized to zero by kmem_cache_zalloc(), are we sure we cannot
>> see partially initialized desc->sptes[] entry? On related note what about
>> 32 bit systems, they do not have atomic access to desc->sptes[].

Ah... wait. desc is an array of pointers:

struct pte_list_desc {
	u64 *sptes[PTE_LIST_EXT];
	struct pte_list_desc *more;
};

Assigning a pointer is always atomic, but we should carefully initialize it,
as you said. I will introduce a constructor for the desc slab cache which
initializes the struct like this:

	for (i = 0; i < PTE_LIST_EXT; i++)
		desc->sptes[i] = NULL;

It is okay.
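
A minimal editorial sketch of that constructor together with the cache setup it
assumes (illustrative, not the posted code): with SLAB_DESTROY_BY_RCU the
allocator never poisons the object and the constructor runs only when the
backing slab page is first populated, so a lockless walker racing with
free/re-alloc only ever sees NULL or a valid spte pointer in desc->sptes[].

static struct kmem_cache *pte_list_desc_cache;

static void pte_list_desc_ctor(void *p)
{
	struct pte_list_desc *desc = p;
	int i;

	for (i = 0; i < PTE_LIST_EXT; i++)
		desc->sptes[i] = NULL;
}

static int pte_list_desc_cache_init(void)
{
	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
						sizeof(struct pte_list_desc),
						0, SLAB_DESTROY_BY_RCU,
						pte_list_desc_ctor);
	return pte_list_desc_cache ? 0 : -ENOMEM;
}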

>
> Good eyes. This is a bug here.
>
> It seems we do not have a good to fix this. How disable this optimization on
> 32 bit host, small changes:
>
> static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
> {
> +#ifdef CONFIG_X86_64
> rcu_read_lock();
>
> kvm->arch.rcu_free_shadow_page = true;
> /* Set the indicator before access shadow page. */
> smp_mb();
> +#else
> + spin_lock(kvm->mmu_lock);
> +#endif
> }
>
> static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
> {
> +#ifdef CONFIG_X86_64
> /* Make sure that access shadow page has finished. */
> smp_mb();
> kvm->arch.rcu_free_shadow_page = false;
>
> rcu_read_unlock();
> +#else
> + spin_unlock(kvm->mmu_lock);
> +#endif
> }
>
>

2013-08-30 11:38:12

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On Thu, Aug 29, 2013 at 07:26:40PM +0800, Xiao Guangrong wrote:
> On 08/29/2013 05:51 PM, Gleb Natapov wrote:
> > On Thu, Aug 29, 2013 at 05:31:42PM +0800, Xiao Guangrong wrote:
> >>> As Documentation/RCU/whatisRCU.txt says:
> >>>
> >>> As with rcu_assign_pointer(), an important function of
> >>> rcu_dereference() is to document which pointers are protected by
> >>> RCU, in particular, flagging a pointer that is subject to changing
> >>> at any time, including immediately after the rcu_dereference().
> >>> And, again like rcu_assign_pointer(), rcu_dereference() is
> >>> typically used indirectly, via the _rcu list-manipulation
> >>> primitives, such as list_for_each_entry_rcu().
> >>>
> >>> The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
> >>> important. The code is complicated, so self documentation will not hurt.
> >>> I want to see what is actually protected by rcu here. Freeing shadow
> >>> pages with call_rcu() further complicates matters: does it mean that
> >>> shadow pages are also protected by rcu?
> >>
> >> Yes, it stops shadow page to be freed when we do write-protection on
> >> it.
> >>
> > Yeah, I got the trick, what I am saying that we have a data structure
> > here protected by RCU, but we do not use RCU functions to access it...
>
> Yes, they are not used when insert a spte into rmap and get the rmap from
> the entry... but do we need to use these functions to guarantee the order?
>
> The worst case is, we fetch the spte from the desc but the spte is not
> updated yet, we can happily skip this spte since it will set the
> dirty-bitmap later, this is guaranteed by the barrier between mmu_spte_update()
> and mark_page_dirty(), the code is:
>
> set_spte():
>
> if (mmu_spte_update(sptep, spte))
> kvm_flush_remote_tlbs(vcpu->kvm);
>
> if (!remap) {
> if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
> rmap_recycle(vcpu, sptep, gfn);
>
> if (level > PT_PAGE_TABLE_LEVEL)
> ++vcpu->kvm->stat.lpages;
> }
>
> smp_wmb();
>
> if (pte_access & ACC_WRITE_MASK)
> mark_page_dirty(vcpu->kvm, gfn);
>
> So, i guess if we can guaranteed the order by ourself, we do not need
> to call the rcu functions explicitly...
>
> But, the memory barres in the rcu functions are really light on x86 (store
> can not be reordered with store), so i do not mind to explicitly use them
> if you think this way is more safe. :)
>
I think the self-documentation aspect of using the rcu functions is also
important.

> > BTW why not allocate sp->spt from SLAB_DESTROY_BY_RCU cache too? We may
> > switch write protection on a random spt occasionally if page is deleted
> > and reused for another spt though. For last level spt it should not be a
> > problem and for non last level we have is_last_spte() check in
> > __rmap_write_protect_lockless(). Can it work?
>
> Yes, i also considered this way. It can work if we handle is_last_spte()
> properly. Since the sp->spte can be reused, we can not get the mapping
> level from sp. We need to encode the mapping level into spte so that
> cmpxhg can understand if the page table has been moved to another mapping
> level.
Isn't one bit that says that spte is the last one enough? IIRC we
have one more ignored bit to spare in spte.

> Could you allow me to make this optimization separately after this
> patchset be merged?
>
If you think it will complicate the initial version I am fine with
postponing it for later.

--
Gleb.

2013-08-30 11:44:32

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On Thu, Aug 29, 2013 at 08:02:30PM +0800, Xiao Guangrong wrote:
> On 08/29/2013 07:33 PM, Xiao Guangrong wrote:
> > On 08/29/2013 05:31 PM, Gleb Natapov wrote:
> >> On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
> >>> After more thinking, I still think rcu_assign_pointer() is unneeded when a entry
> >>> is removed. The remove-API does not care the order between unlink the entry and
> >>> the changes to its fields. It is the caller's responsibility:
> >>> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc to
> >>> enforce all lookups exit and the later change on that entry is invisible to the
> >>> lookups.
> >>>
> >>> - In the case of rculist_nulls, it seems refcounter is used to guarantee the order
> >>> (see the example from Documentation/RCU/rculist_nulls.txt).
> >>>
> >>> - In our case, we allow the lookup to see the deleted desc even if it is in slab cache
> >>> or its is initialized or it is re-added.
> >>>
> >> BTW is it a good idea? We can access deleted desc while it is allocated
> >> and initialized to zero by kmem_cache_zalloc(), are we sure we cannot
> >> see partially initialized desc->sptes[] entry? On related note what about
> >> 32 bit systems, they do not have atomic access to desc->sptes[].
>
> Ah... wait. desc is a array of pointers:
>
> struct pte_list_desc {
> u64 *sptes[PTE_LIST_EXT];
> struct pte_list_desc *more;
> };
>
Yep, I misread it as u64 values and wondered why we use u64 to store
pointers.

> assigning a pointer is aways aotomic, but we should carefully initialize it
> as you said. I will introduce a constructor for desc slab cache which initialize
> the struct like this:
>
> for (i = 0; i < PTE_LIST_EXT; i++)
> desc->sptes[i] = NULL;
>
> It is okay.
>
I hope slab does not write anything into the allocated memory internally if a
constructor is present. BTW, do you know what happens when SLAB debug is enabled
and SLAB_DESTROY_BY_RCU is set? Is the poison value written into a freed
object (freed to slab, but not yet to the page allocator)?

--
Gleb.

2013-09-02 07:02:19

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/30/2013 07:38 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 07:26:40PM +0800, Xiao Guangrong wrote:
>> On 08/29/2013 05:51 PM, Gleb Natapov wrote:
>>> On Thu, Aug 29, 2013 at 05:31:42PM +0800, Xiao Guangrong wrote:
>>>>> As Documentation/RCU/whatisRCU.txt says:
>>>>>
>>>>> As with rcu_assign_pointer(), an important function of
>>>>> rcu_dereference() is to document which pointers are protected by
>>>>> RCU, in particular, flagging a pointer that is subject to changing
>>>>> at any time, including immediately after the rcu_dereference().
>>>>> And, again like rcu_assign_pointer(), rcu_dereference() is
>>>>> typically used indirectly, via the _rcu list-manipulation
>>>>> primitives, such as list_for_each_entry_rcu().
>>>>>
>>>>> The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
>>>>> important. The code is complicated, so self documentation will not hurt.
>>>>> I want to see what is actually protected by rcu here. Freeing shadow
>>>>> pages with call_rcu() further complicates matters: does it mean that
>>>>> shadow pages are also protected by rcu?
>>>>
>>>> Yes, it stops shadow page to be freed when we do write-protection on
>>>> it.
>>>>
>>> Yeah, I got the trick, what I am saying that we have a data structure
>>> here protected by RCU, but we do not use RCU functions to access it...
>>
>> Yes, they are not used when insert a spte into rmap and get the rmap from
>> the entry... but do we need to use these functions to guarantee the order?
>>
>> The worst case is, we fetch the spte from the desc but the spte is not
>> updated yet, we can happily skip this spte since it will set the
>> dirty-bitmap later, this is guaranteed by the barrier between mmu_spte_update()
>> and mark_page_dirty(), the code is:
>>
>> set_spte():
>>
>> if (mmu_spte_update(sptep, spte))
>> kvm_flush_remote_tlbs(vcpu->kvm);
>>
>> if (!remap) {
>> if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
>> rmap_recycle(vcpu, sptep, gfn);
>>
>> if (level > PT_PAGE_TABLE_LEVEL)
>> ++vcpu->kvm->stat.lpages;
>> }
>>
>> smp_wmb();
>>
>> if (pte_access & ACC_WRITE_MASK)
>> mark_page_dirty(vcpu->kvm, gfn);
>>
>> So, i guess if we can guaranteed the order by ourself, we do not need
>> to call the rcu functions explicitly...
>>
>> But, the memory barres in the rcu functions are really light on x86 (store
>> can not be reordered with store), so i do not mind to explicitly use them
>> if you think this way is more safe. :)
>>
> I think the self documentation aspect of using rcu function is also
> important.

Okay. I will use these RCU functions and measure them to see whether they
cause a performance issue.

>
>>> BTW why not allocate sp->spt from SLAB_DESTROY_BY_RCU cache too? We may
>>> switch write protection on a random spt occasionally if page is deleted
>>> and reused for another spt though. For last level spt it should not be a
>>> problem and for non last level we have is_last_spte() check in
>>> __rmap_write_protect_lockless(). Can it work?
>>
>> Yes, i also considered this way. It can work if we handle is_last_spte()
>> properly. Since the sp->spte can be reused, we can not get the mapping
>> level from sp. We need to encode the mapping level into spte so that
>> cmpxhg can understand if the page table has been moved to another mapping
>> level.
> Isn't one bit that says that spte is the last one enough? IIRC we
> have one more ignored bit to spare in spte.

Right. But I also want to use this approach in fast_page_fault(), where the
mapping level is needed.

>
>> Could you allow me to make this optimization separately after this
>> patchset be merged?
>>
> If you think it will complicate the initial version I am fine with
> postponing it for later.

Thank you, Gleb!

2013-09-02 08:50:34

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

On 08/30/2013 07:44 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 08:02:30PM +0800, Xiao Guangrong wrote:
>> On 08/29/2013 07:33 PM, Xiao Guangrong wrote:
>>> On 08/29/2013 05:31 PM, Gleb Natapov wrote:
>>>> On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
>>>>> After more thinking, I still think rcu_assign_pointer() is unneeded when a entry
>>>>> is removed. The remove-API does not care the order between unlink the entry and
>>>>> the changes to its fields. It is the caller's responsibility:
>>>>> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc to
>>>>> enforce all lookups exit and the later change on that entry is invisible to the
>>>>> lookups.
>>>>>
>>>>> - In the case of rculist_nulls, it seems refcounter is used to guarantee the order
>>>>> (see the example from Documentation/RCU/rculist_nulls.txt).
>>>>>
>>>>> - In our case, we allow the lookup to see the deleted desc even if it is in slab cache
>>>>> or its is initialized or it is re-added.
>>>>>
>>>> BTW is it a good idea? We can access deleted desc while it is allocated
>>>> and initialized to zero by kmem_cache_zalloc(), are we sure we cannot
>>>> see partially initialized desc->sptes[] entry? On related note what about
>>>> 32 bit systems, they do not have atomic access to desc->sptes[].
>>
>> Ah... wait. desc is a array of pointers:
>>
>> struct pte_list_desc {
>> u64 *sptes[PTE_LIST_EXT];
>> struct pte_list_desc *more;
>> };
>>
> Yep, I misread it to be u64 bits and wondered why do we use u64 to store
> pointers.
>
>> assigning a pointer is aways aotomic, but we should carefully initialize it
>> as you said. I will introduce a constructor for desc slab cache which initialize
>> the struct like this:
>>
>> for (i = 0; i < PTE_LIST_EXT; i++)
>> desc->sptes[i] = NULL;
>>
>> It is okay.
>>
> I hope slab does not write anything into allocated memory internally if
> constructor is present.

If only a constructor is present (no SLAB_DESTROY_BY_RCU), it will temporarily
write the "poison" value into the memory and then call the constructor to
initialize it again; e.g., in slab.c:

static void *cache_alloc_debugcheck_after(struct kmem_cache *cachep,
				gfp_t flags, void *objp, unsigned long caller)
{
	if (cachep->flags & SLAB_POISON) {
		......
		poison_obj(cachep, objp, POISON_INUSE);
	}
	......
	if (cachep->ctor && cachep->flags & SLAB_POISON)
		cachep->ctor(objp);
}

But SLAB_DESTROY_BY_RCU forces the allocator not to touch the memory,
which is the case for us.

> BTW do you know what happens when SLAB debug is enabled
> and SLAB_DESTROY_BY_RCU is set?

When SLAB debug is enabled, these 3 flags may be set:
#define SLAB_DEBUG_FREE 0x00000100UL /* DEBUG: Perform (expensive) checks on free */
#define SLAB_RED_ZONE 0x00000400UL /* DEBUG: Red zone objs in a cache */
#define SLAB_POISON 0x00000800UL /* DEBUG: Poison objects */

Only SLAB_POISON can write something into the memory, and ...

> Does poison value is written into freed
> object (freed to slab, but not yet to page allocator)?

SLAB_POISON is not applied if SLAB_DESTROY_BY_RCU is used.
- In slab, there is this check in __kmem_cache_create():
	if (flags & SLAB_DESTROY_BY_RCU)
		BUG_ON(flags & SLAB_POISON);

- In slub, the code is in calculate_sizes():
	/*
	 * Determine if we can poison the object itself. If the user of
	 * the slab may touch the object after free or before allocation
	 * then we should never poison the object itself.
	 */
	if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
	    !s->ctor)
		s->flags |= __OBJECT_POISON;
	else
		s->flags &= ~__OBJECT_POISON;

- In slob, it seems SLAB debugging is not supported at all.