2012-11-05 09:59:36

by Xiao Guangrong

Subject: [PATCH] KVM: MMU: lazily drop large spte

Do not drop a large spte until it can be replaced by small pages, so that
the guest can happily read memory through it.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 34 +++++++++-------------------------
1 files changed, 9 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b875a9e..1d8869c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1105,7 +1105,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)

/*
* Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
+ * spte write-protection is caused by protecting shadow page table.
* @flush indicates whether tlb need be flushed.
*
* Note: write protection is difference between drity logging and spte
@@ -1114,31 +1114,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
* its dirty bitmap is properly set.
* - for spte protection, the spte can be writable only after unsync-ing
* shadow page.
- *
- * Return true if the spte is dropped.
*/
-static bool
+static void
spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
{
u64 spte = *sptep;

if (!is_writable_pte(spte) &&
!(pt_protect && spte_is_locklessly_modifiable(spte)))
- return false;
+ return;

rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

- if (__drop_large_spte(kvm, sptep)) {
- *flush |= true;
- return true;
- }
-
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;

*flush |= mmu_spte_update(sptep, spte);
- return false;
}

static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1150,11 +1142,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,

for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
- if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
- sptep = rmap_get_first(*rmapp, &iter);
- continue;
- }

+ spte_write_protect(kvm, sptep, &flush, pt_protect);
sptep = rmap_get_next(&iter);
}

@@ -2381,14 +2370,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if ((pte_access & ACC_WRITE_MASK)
|| (!vcpu->arch.mmu.direct_map && write_fault
&& !is_write_protection(vcpu) && !user_fault)) {
-
- if (level > PT_PAGE_TABLE_LEVEL &&
- has_wrprotected_page(vcpu->kvm, gfn, level)) {
- ret = 1;
- drop_spte(vcpu->kvm, sptep);
- goto done;
- }
-
spte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;

if (!vcpu->arch.mmu.direct_map
@@ -2413,7 +2394,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if (!can_unsync && is_writable_pte(*sptep))
goto set_pte;

- if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
+ if ((level > PT_PAGE_TABLE_LEVEL &&
+ has_wrprotected_page(vcpu->kvm, gfn, level)) ||
+ mmu_need_write_protect(vcpu, gfn, can_unsync)) {
pgprintk("%s: found shadow page for %llx, marking ro\n",
__func__, gfn);
ret = 1;
@@ -2428,7 +2411,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
set_pte:
if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);
-done:
return ret;
}

@@ -2635,6 +2617,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
break;
}

+ drop_large_spte(vcpu, iterator.sptep);
+
if (!is_shadow_present_pte(*iterator.sptep)) {
u64 base_addr = iterator.addr;

--
1.7.7.6


2012-11-12 23:10:44

by Marcelo Tosatti

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
> Do not drop large spte until it can be insteaded by small pages so that
> the guest can happliy read memory through it
>
> The idea is from Avi:
> | As I mentioned before, write-protecting a large spte is a good idea,
> | since it moves some work from protect-time to fault-time, so it reduces
> | jitter. This removes the need for the return value.
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 34 +++++++++-------------------------
> 1 files changed, 9 insertions(+), 25 deletions(-)

It's likely that other 4k pages are mapped read-write in the 2MB range
covered by a read-only 2MB map. Therefore it's not entirely useful to
map it read-only.

Can you measure an improvement with this change?

2012-11-13 08:26:26

by Xiao Guangrong

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

Hi Marcelo,

On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
> On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
>> [...]
>
> Its likely that other 4k pages are mapped read-write in the 2mb range
> covered by a read-only 2mb map. Therefore its not entirely useful to
> map read-only.
>

Even a read access needs a page fault to install a pte. After this
change, that page fault can be avoided.

> Can you measure an improvement with this change?

I have a test case to measure the read time, which is attached.
It maps 4k pages at first (dirty-logged), then switches to large sptes
(dirty logging stopped), and finally measures the read access time after
write-protecting the sptes.

Before: 23314111 ns After: 11404197 ns


Attachments:
testcase.tar.bz2 (8.25 kB)

2012-11-13 15:33:56

by Takuya Yoshikawa

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

Ccing live migration developers who should be interested in this work,

On Mon, 12 Nov 2012 21:10:32 -0200
Marcelo Tosatti <[email protected]> wrote:

> On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
> > [...]
>
> Its likely that other 4k pages are mapped read-write in the 2mb range
> covered by a read-only 2mb map. Therefore its not entirely useful to
> map read-only.
>
> Can you measure an improvement with this change?

What we discussed at KVM Forum last week was about the jitter we could
measure right after starting live migration: both Isaku and Chegu reported
such jitter.

So if this patch reduces such jitter for some real workloads, by lazily
dropping largepage mappings and saving read faults until that point, that
would be very nice!

But sadly, what they measured included interactions with the outside of the
guest, and they guessed the main cause was the big QEMU lock problem.
The orders of magnitude are so different that an improvement from a
kernel-side effort may not be easy to see.

FWIW: I am now changing the initial write protection done by
kvm_mmu_slot_remove_write_access() to be rmap-based, as I proposed at KVM
Forum. ftrace said the change improved 1ms to 250-350us for a 10GB guest.
My code still drops largepage mappings, so the initial write protection time
itself may not be such a big issue here, I think.

Again, if we can eliminate read faults to such an extent that guests can see
measurable improvement, that should be very nice!

Any thoughts?

Thanks,
Takuya

2012-11-14 14:44:31

by Marcelo Tosatti

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
> Hi Marcelo,
>
> On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
> > On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
> >> [...]
> >
> > Its likely that other 4k pages are mapped read-write in the 2mb range
> > covered by a read-only 2mb map. Therefore its not entirely useful to
> > map read-only.
> >
>
> It needs a page fault to install a pte even if it is the read access.
> After the change, the page fault can be avoided.
>
> > Can you measure an improvement with this change?
>
> I have a test case to measure the read time which has been attached.
> It maps 4k pages at first (dirt-loggged), then switch to large sptes
> (stop dirt-logging), at the last, measure the read access time after write
> protect sptes.
>
> Before: 23314111 ns After: 11404197 ns

Ok, i'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
that is:

- large page must be destroyed when write protecting due to
shadowed page.
- with shadow, it does not make sense to write protect
large sptes as mentioned earlier.

So i wonder why is this part from your patch

- if (level > PT_PAGE_TABLE_LEVEL &&
- has_wrprotected_page(vcpu->kvm, gfn, level)) {
- ret = 1;
- drop_spte(vcpu->kvm, sptep);
- goto done;
- }

necessary (assuming EPT is in use).

2012-11-14 14:44:56

by Marcelo Tosatti

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On Wed, Nov 14, 2012 at 12:33:50AM +0900, Takuya Yoshikawa wrote:
> Ccing live migration developers who should be interested in this work,
>
> On Mon, 12 Nov 2012 21:10:32 -0200
> Marcelo Tosatti <[email protected]> wrote:
>
> > On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
> > > [...]
> >
> > Its likely that other 4k pages are mapped read-write in the 2mb range
> > covered by a read-only 2mb map. Therefore its not entirely useful to
> > map read-only.
> >
> > Can you measure an improvement with this change?
>
> What we discussed at KVM Forum last week was about the jitter we could
> measure right after starting live migration: both Isaku and Chegu reported
> such jitter.
>
> So if this patch reduces such jitter for some real workloads, by lazily
> dropping largepage mappings and saving read faults until that point, that
> would be very nice!
>
> But sadly, what they measured included interactions with the outside of the
> guest, and the main cause was due to the big QEMU lock problem, they guessed.
> The order is so different that an improvement by a kernel side effort may not
> be seen easily.
>
> FWIW: I am now changing the initial write protection by
> kvm_mmu_slot_remove_write_access() to rmap based as I proposed at KVM Forum.
> ftrace said that 1ms was improved to 250-350us by the change for 10GB guest.
> My code still drops largepage mappings, so the initial write protection time
> itself may not be a such big issue here, I think.
>
> Again, if we can eliminate read faults to such an extent that guests can see
> measurable improvement, that should be very nice!
>
> Any thoughts?
>
> Thanks,
> Takuya

OK, makes sense. I'm worried about shadow / oos interactions
with large read-only mappings (trying to remember what was the
case exactly, it might be non-existant now).

2012-11-14 23:17:27

by Xiao Guangrong

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On 11/14/2012 10:37 PM, Marcelo Tosatti wrote:
> On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
>> Hi Marcelo,
>>
>> On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
>>> On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
>>>> [...]
>>>
>>> Its likely that other 4k pages are mapped read-write in the 2mb range
>>> covered by a read-only 2mb map. Therefore its not entirely useful to
>>> map read-only.
>>>
>>
>> It needs a page fault to install a pte even if it is the read access.
>> After the change, the page fault can be avoided.
>>
>>> Can you measure an improvement with this change?
>>
>> I have a test case to measure the read time which has been attached.
>> It maps 4k pages at first (dirt-loggged), then switch to large sptes
>> (stop dirt-logging), at the last, measure the read access time after write
>> protect sptes.
>>
>> Before: 23314111 ns After: 11404197 ns
>
> Ok, i'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
> that is:
>
> - large page must be destroyed when write protecting due to
> shadowed page.
> - with shadow, it does not make sense to write protect
> large sptes as mentioned earlier.
>

This case is removed now. The code when e49146dce8c3dc6f44 was applied was:
|
| pt = sp->spt;
| for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
| /* avoid RMW */
| if (is_writable_pte(pt[i]))
| update_spte(&pt[i], pt[i] & ~PT_WRITABLE_MASK);
| }

The real problem in this code is that it would write-protect a spte even if
it was not a last spte, which caused middle-level shadow page tables to be
write-protected. So e49146dce8c3dc6f44 added this code:
| if (sp->role.level != PT_PAGE_TABLE_LEVEL)
| continue;
|
which fixed that problem.

Now, the current code is:
| for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
| if (!is_shadow_present_pte(pt[i]) ||
| !is_last_spte(pt[i], sp->role.level))
| continue;
|
| spte_write_protect(kvm, &pt[i], &flush, false);
| }
It only write-protects last sptes, so it allows large sptes to exist.
(A large spte can still be broken by drop_large_spte() on the page-fault path.)

> So i wonder why is this part from your patch
>
> - if (level > PT_PAGE_TABLE_LEVEL &&
> - has_wrprotected_page(vcpu->kvm, gfn, level)) {
> - ret = 1;
> - drop_spte(vcpu->kvm, sptep);
> - goto done;
> - }
>
> necessary (assuming EPT is in use).

This is safe; we changed that code to:

- if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
+ if ((level > PT_PAGE_TABLE_LEVEL &&
+ has_wrprotected_page(vcpu->kvm, gfn, level)) ||
+ mmu_need_write_protect(vcpu, gfn, can_unsync)) {
pgprintk("%s: found shadow page for %llx, marking ro\n",
__func__, gfn);
ret = 1;

The spte becomes read-only, which ensures the shadowed gfn cannot be changed.

Btw, the original code already allowed creating a read-only spte in this case
if !(pte_access & ACC_WRITE_MASK).

2012-11-14 23:34:07

by Xiao Guangrong

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On 11/14/2012 10:44 PM, Marcelo Tosatti wrote:
> On Wed, Nov 14, 2012 at 12:33:50AM +0900, Takuya Yoshikawa wrote:
>> [...]
>
> OK, makes sense. I'm worried about shadow / oos interactions
> with large read-only mappings (trying to remember what was the
> case exactly, it might be non-existant now).

Marcelo, I guess commit 38187c830cab84daecb41169948467f1f19317e3 is what you
mentioned, but I do not see how it "Simplifies out of sync shadow". :(

2012-11-16 03:02:32

by Marcelo Tosatti

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On Thu, Nov 15, 2012 at 07:17:15AM +0800, Xiao Guangrong wrote:
> On 11/14/2012 10:37 PM, Marcelo Tosatti wrote:
> > On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
> >> [...]
> >
> > Ok, i'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
> > that is:
> >
> > - large page must be destroyed when write protecting due to
> > shadowed page.
> > - with shadow, it does not make sense to write protect
> > large sptes as mentioned earlier.
> >
>
> This case is removed now, the code when e49146dce8c3dc6f44 was applied is:
> |
> | pt = sp->spt;
> | for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
> | /* avoid RMW */
> | if (is_writable_pte(pt[i]))
> | update_spte(&pt[i], pt[i] & ~PT_WRITABLE_MASK);
> | }
>
> The real problem in this code is it would write-protect the spte even if
> it is not a last spte that caused the middle-level shadow page table was
> write-protected. So e49146dce8c3dc6f44 added this code:
> | if (sp->role.level != PT_PAGE_TABLE_LEVEL)
> | continue;
> |
> was good to fix this problem.
>
> Now, the current code is:
> | for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
> | if (!is_shadow_present_pte(pt[i]) ||
> | !is_last_spte(pt[i], sp->role.level))
> | continue;
> |
> | spte_write_protect(kvm, &pt[i], &flush, false);
> | }
> It only write-protect the last spte. So, it allows large spte existent.
> (the large spte can be broken by drop_large_spte() on the page-fault path.)
>
> > So i wonder why is this part from your patch
> >
> > - if (level > PT_PAGE_TABLE_LEVEL &&
> > - has_wrprotected_page(vcpu->kvm, gfn, level)) {
> > - ret = 1;
> > - drop_spte(vcpu->kvm, sptep);
> > - goto done;
> > - }
> >
> > necessary (assuming EPT is in use).
>
> This is safe, we change these code to:
>
> - if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
> + if ((level > PT_PAGE_TABLE_LEVEL &&
> + has_wrprotected_page(vcpu->kvm, gfn, level)) ||
> + mmu_need_write_protect(vcpu, gfn, can_unsync)) {
> pgprintk("%s: found shadow page for %llx, marking ro\n",
> __func__, gfn);
> ret = 1;
>
> The spte become read-only which can ensure the shadow gfn can not be changed.
>
> Btw, the origin code allows to create readonly spte under this case if !(pte_access & WRITEABBLE)

Regarding shadow: it should be fine as long as the fault path always deletes
large mappings when shadowed pages are present in the region.

Ah, unshadowing from reexecute_instruction does not handle
large pages. I suppose that is what the "simplification" refers to.

2012-11-16 03:39:22

by Xiao Guangrong

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On 11/16/2012 11:02 AM, Marcelo Tosatti wrote:
> On Thu, Nov 15, 2012 at 07:17:15AM +0800, Xiao Guangrong wrote:
>> [...]
>
> Regarding shadow: it should be fine as long as fault path always deletes
> large mappings, when shadowed pages are present in the region.

It is also safe for hard mmu; in this patch I added this code:

@@ -2635,6 +2617,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 break;
 }

+ drop_large_spte(vcpu, iterator.sptep);
+

It can delete large mappings like soft mmu does.

Is there anything I missed?

>
> Ah, unshadowing from reexecute_instruction does not handle
> large pages. I suppose that is what "simplification" refers
> to.

reexecute_instruction does not directly handle the last spte; it just
removes all shadow pages and lets the CPU retry the instruction. The
page can become writable when a #PF is encountered again, so a large spte
is fine in this case.

(Out of this thread: I notice reexecute_instruction allows retrying the
instruction only if tdp_enabled == 0, but nested NPT also has pages
write-protected by shadow pages. Maybe we need to improve this
restriction.)

2012-11-16 03:57:02

by Marcelo Tosatti

Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On Fri, Nov 16, 2012 at 11:39:12AM +0800, Xiao Guangrong wrote:
> On 11/16/2012 11:02 AM, Marcelo Tosatti wrote:
> > [...]
> >
> > Regarding shadow: it should be fine as long as fault path always deletes
> > large mappings, when shadowed pages are present in the region.
>
> For hard mmu is also safe, in this patch i added these code:
>
> @@ -2635,6 +2617,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
> break;
> }
>
> + drop_large_spte(vcpu, iterator.sptep);
> +
>
> It can delete large mappings like soft mmu does.
>
> Anything i missed?
>
> >
> > Ah, unshadowing from reexecute_instruction does not handle
> > large pages. I suppose that is what "simplification" refers
> > to.
>
> reexecute_instruction does not directly handle the last spte; it just
> removes all shadow pages, then lets the cpu retry the instruction. The
> page can become writable when it encounters #PF again, so a large spte
> is fine in this case.

While searching for a given "gpa", you don't find the large gfn which is
mapping it, right? (that is, searching for gfn 4 fails to find the large
read-only "gfn 0"). Unshadowing gfn 4 will keep the large read-only
mapping present.

1. large read-write spte to gfn 0
2. shadow gfn 4
3. write-protect large spte pointing to gfn 0
4. write to gfn 4
5. instruction emulation fails
5. unshadow gfn 4
6. refault, do not drop large spte because no pages shadowed
7. goto 4

> (Out of this thread: I notice reexecute_instruction allows retrying the
> instruction only if tdp_enabled == 0, but on nested npt, it also has
> pages write-protected by shadow pages. Maybe we need to improve this
> restriction.
> )

2012-11-16 04:46:26

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On 11/16/2012 11:56 AM, Marcelo Tosatti wrote:
>
> While searching for a given "gpa", you don't find large gfn which is
> mapping it, right? (that is, searching for gfn 4 fails to find large
> read-only "gfn 0"). Unshadowing gfn 4 will keep large read-only mapping
> present.
>
> 1. large read-write spte to gfn 0
> 2. shadow gfn 4
> 3. write-protect large spte pointing to gfn 0
> 4. write to gfn 4
> 5. instruction emulation fails
> 5. unshadow gfn 4
> 6. refault, do not drop large spte because no pages shadowed

Hmm, it is not true. :)

The large spte can become writable since 'no pages shadowed' (that means
has_wrprotected_page() can return 0 for this case). No?


2012-11-16 09:58:03

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On Fri, Nov 16, 2012 at 12:46:16PM +0800, Xiao Guangrong wrote:
> On 11/16/2012 11:56 AM, Marcelo Tosatti wrote:
> > On Fri, Nov 16, 2012 at 11:39:12AM +0800, Xiao Guangrong wrote:
> >> On 11/16/2012 11:02 AM, Marcelo Tosatti wrote:
> >>> On Thu, Nov 15, 2012 at 07:17:15AM +0800, Xiao Guangrong wrote:
> >>>> On 11/14/2012 10:37 PM, Marcelo Tosatti wrote:
> >>>>> On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
> >>>>>> Hi Marcelo,
> >>>>>>
> >>>>>> On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
> >>>>>>> On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
> >>>>>>>> Do not drop large spte until it can be insteaded by small pages so that
> >>>>>>>> the guest can happliy read memory through it
> >>>>>>>>
> >>>>>>>> The idea is from Avi:
> >>>>>>>> | As I mentioned before, write-protecting a large spte is a good idea,
> >>>>>>>> | since it moves some work from protect-time to fault-time, so it reduces
> >>>>>>>> | jitter. This removes the need for the return value.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Xiao Guangrong <[email protected]>
> >>>>>>>> ---
> >>>>>>>> arch/x86/kvm/mmu.c | 34 +++++++++-------------------------
> >>>>>>>> 1 files changed, 9 insertions(+), 25 deletions(-)
> >>>>>>>
> >>>>>>> Its likely that other 4k pages are mapped read-write in the 2mb range
> >>>>>>> covered by a read-only 2mb map. Therefore its not entirely useful to
> >>>>>>> map read-only.
> >>>>>>>
> >>>>>>
> >>>>>> It needs a page fault to install a pte even if it is the read access.
> >>>>>> After the change, the page fault can be avoided.
> >>>>>>
> >>>>>>> Can you measure an improvement with this change?
> >>>>>>
> >>>>>> I have a test case to measure the read time which has been attached.
> >>>>>> It maps 4k pages at first (dirt-loggged), then switch to large sptes
> >>>>>> (stop dirt-logging), at the last, measure the read access time after write
> >>>>>> protect sptes.
> >>>>>>
> >>>>>> Before: 23314111 ns After: 11404197 ns
> >>>>>
> >>>>> Ok, i'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
> >>>>> that is:
> >>>>>
> >>>>> - large page must be destroyed when write protecting due to
> >>>>> shadowed page.
> >>>>> - with shadow, it does not make sense to write protect
> >>>>> large sptes as mentioned earlier.
> >>>>>
> >>>>
> >>>> This case is removed now, the code when e49146dce8c3dc6f44 was applied is:
> >>>> |
> >>>> | pt = sp->spt;
> >>>> | for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
> >>>> | /* avoid RMW */
> >>>> | if (is_writable_pte(pt[i]))
> >>>> | update_spte(&pt[i], pt[i] & ~PT_WRITABLE_MASK);
> >>>> | }
> >>>>
> >>>> The real problem in this code is it would write-protect the spte even if
> >>>> it is not a last spte that caused the middle-level shadow page table was
> >>>> write-protected. So e49146dce8c3dc6f44 added this code:
> >>>> | if (sp->role.level != PT_PAGE_TABLE_LEVEL)
> >>>> | continue;
> >>>> |
> >>>> was good to fix this problem.
> >>>>
> >>>> Now, the current code is:
> >>>> | for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
> >>>> | if (!is_shadow_present_pte(pt[i]) ||
> >>>> | !is_last_spte(pt[i], sp->role.level))
> >>>> | continue;
> >>>> |
> >>>> | spte_write_protect(kvm, &pt[i], &flush, false);
> >>>> | }
> >>>> It only write-protect the last spte. So, it allows large spte existent.
> >>>> (the large spte can be broken by drop_large_spte() on the page-fault path.)
> >>>>
> >>>>> So i wonder why is this part from your patch
> >>>>>
> >>>>> - if (level > PT_PAGE_TABLE_LEVEL &&
> >>>>> - has_wrprotected_page(vcpu->kvm, gfn, level)) {
> >>>>> - ret = 1;
> >>>>> - drop_spte(vcpu->kvm, sptep);
> >>>>> - goto done;
> >>>>> - }
> >>>>>
> >>>>> necessary (assuming EPT is in use).
> >>>>
> >>>> This is safe, we change these code to:
> >>>>
> >>>> - if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
> >>>> + if ((level > PT_PAGE_TABLE_LEVEL &&
> >>>> + has_wrprotected_page(vcpu->kvm, gfn, level)) ||
> >>>> + mmu_need_write_protect(vcpu, gfn, can_unsync)) {
> >>>> pgprintk("%s: found shadow page for %llx, marking ro\n",
> >>>> __func__, gfn);
> >>>> ret = 1;
> >>>>
> >>>> The spte become read-only which can ensure the shadow gfn can not be changed.
> >>>>
> >>>> Btw, the origin code allows to create readonly spte under this case if !(pte_access & WRITEABBLE)
> >>>
> >>> Regarding shadow: it should be fine as long as fault path always deletes
> >>> large mappings, when shadowed pages are present in the region.
> >>
> >> For hard mmu is also safe, in this patch i added these code:
> >>
> >> @@ -2635,6 +2617,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
> >> break;
> >> }
> >>
> >> + drop_large_spte(vcpu, iterator.sptep);
> >> +
> >>
> >> It can delete large mappings like soft mmu does.
> >>
> >> Anything i missed?
> >>
> >>>
> >>> Ah, unshadowing from reexecute_instruction does not handle
> >>> large pages. I suppose that is what "simplification" refers
> >>> to.
> >>
> >> reexecute_instruction did not directly handle last spte, it just
> >> removes all shadow pages, then let cpu retry the instruction, the
> >> page can become writable when encounter #PF again, large spte is fine
> >> under this case.
> >
> > While searching for a given "gpa", you don't find large gfn which is
> > mapping it, right? (that is, searching for gfn 4 fails to find large
> > read-only "gfn 0"). Unshadowing gfn 4 will keep large read-only mapping
> > present.
> >
> > 1. large read-write spte to gfn 0
> > 2. shadow gfn 4
> > 3. write-protect large spte pointing to gfn 0
> > 4. write to gfn 4
> > 5. instruction emulation fails
> > 5. unshadow gfn 4
> > 6. refault, do not drop large spte because no pages shadowed
7. refault, then goto 2 (as part of write to gfn 4)
>
> Hmm, it is not true. :)
>
> The large spte can become writable since 'no pages shadowed' (that means
> has_wrprotected_page() can return 0 for this case). No?

What if gfn 4 is a pagetable part of the pagedirectory chain used to
map gfn 4? See corrected step 7 above.

2012-11-17 14:06:30

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On 11/16/2012 05:57 PM, Marcelo Tosatti wrote:
> On Fri, Nov 16, 2012 at 12:46:16PM +0800, Xiao Guangrong wrote:
>> On 11/16/2012 11:56 AM, Marcelo Tosatti wrote:
>>> While searching for a given "gpa", you don't find large gfn which is
>>> mapping it, right? (that is, searching for gfn 4 fails to find large
>>> read-only "gfn 0"). Unshadowing gfn 4 will keep large read-only mapping
>>> present.
>>>
>>> 1. large read-write spte to gfn 0
>>> 2. shadow gfn 4
>>> 3. write-protect large spte pointing to gfn 0
>>> 4. write to gfn 4
>>> 5. instruction emulation fails
>>> 5. unshadow gfn 4
>>> 6. refault, do not drop large spte because no pages shadowed
> 7. refault, then goto 2 (as part of write to gfn 4)
>>
>> Hmm, it is not true. :)
>>
>> The large spte can become writable since 'no pages shadowed' (that means
>> has_wrprotected_page() can return 0 for this case). No?
>
> What if gfn 4 is a pagetable part of the pagedirectory chain used to
> map gfn 4? See corrected step 7 above.

Ah, this is a real bug, and unfortunately, it exists in current
code. I will make a separate patchset to fix it. Thank you, Marcelo!

2012-11-18 05:05:47

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On Sat, Nov 17, 2012 at 10:06:18PM +0800, Xiao Guangrong wrote:
> On 11/16/2012 05:57 PM, Marcelo Tosatti wrote:
> > On Fri, Nov 16, 2012 at 12:46:16PM +0800, Xiao Guangrong wrote:
> >> On 11/16/2012 11:56 AM, Marcelo Tosatti wrote:
> >>> While searching for a given "gpa", you don't find large gfn which is
> >>> mapping it, right? (that is, searching for gfn 4 fails to find large
> >>> read-only "gfn 0"). Unshadowing gfn 4 will keep large read-only mapping
> >>> present.
> >>>
> >>> 1. large read-write spte to gfn 0
> >>> 2. shadow gfn 4
> >>> 3. write-protect large spte pointing to gfn 0
> >>> 4. write to gfn 4
> >>> 5. instruction emulation fails
> >>> 5. unshadow gfn 4
> >>> 6. refault, do not drop large spte because no pages shadowed
> > 7. refault, then goto 2 (as part of write to gfn 4)
> >>
> >> Hmm, it is not true. :)
> >>
> >> The large spte can become writable since 'no pages shadowed' (that means
> >> has_wrprotected_page() can return 0 for this case). No?
> >
> > What if gfn 4 is a pagetable part of the pagedirectory chain used to
> > map gfn 4? See corrected step 7 above.
>
> Ah, this is a real bug, and unfortunately, it exists in current
> code. I will make a separate patchset to fix it. Thank you, Marcelo!

Is it? Hum..

Anyway, it would be great if you can write a testcase (should be similar
in size to rmap_chain).

2012-11-28 05:27:50

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On 11/18/2012 11:00 AM, Marcelo Tosatti wrote:
>>> What if gfn 4 is a pagetable part of the pagedirectory chain used to
>>> map gfn 4? See corrected step 7 above.
>>
>> Ah, this is a real bug, and unfortunately, it exists in current
>> code. I will make a separate patchset to fix it. Thank you, Marcelo!
>
> Is it? Hum..
>
> Anyway, it would be great if you can write a testcase (should be similar
> in size to rmap_chain).

Marcelo, is this patch acceptable?

2012-11-28 11:51:17

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: lazily drop large spte

On Wed, Nov 28, 2012 at 01:27:38PM +0800, Xiao Guangrong wrote:
> On 11/18/2012 11:00 AM, Marcelo Tosatti wrote:
> >>> What if gfn 4 is a pagetable part of the pagedirectory chain used to
> >>> map gfn 4? See corrected step 7 above.
> >>
> >> Ah, this is a real bug, and unfortunately, it exists in current
> >> code. I will make a separate patchset to fix it. Thank you, Marcelo!
> >
> > Is it? Hum..
> >
> > Anyway, it would be great if you can write a testcase (should be similar
> > in size to rmap_chain).
>
> Marcelo, is this patch acceptable?

Yes. Can we get the reexecute_instruction fix first? (It should then be
able to handle the case where a large read-only spte is created.)

I'll merge the testcase later today.