2022-06-06 05:10:32

by Lai Jiangshan

Subject: [PATCH 05/12] KVM: X86/MMU: Clear unsync bit directly in __mmu_unsync_walk()

From: Lai Jiangshan <[email protected]>

mmu_unsync_walk() and __mmu_unsync_walk() require the caller to clear
the unsync bit for the shadow pages in the resulting pvec by syncing
them or zapping them.

All callers do so.

Otherwise mmu_unsync_walk() and __mmu_unsync_walk() can't work because
they always walk from the beginning.

It would be possible to make mmu_unsync_walk() and __mmu_unsync_walk()
list the unsync shadow pages in the resulting pvec without requiring them
to be synced or zapped later. It would require changing mmu_unsync_walk()
and __mmu_unsync_walk() to resume the walk from the last visited position,
derived from the pvec returned by the previous call of mmu_unsync_walk().

That would complicate the walk, and no caller needs the new behavior.

It is better to keep the original behavior.

Since the shadow pages in the resulting pvec will be synced or zapped,
clear_unsync_child_bit() for their parents will be called later anyway.

Call clear_unsync_child_bit() earlier, directly in __mmu_unsync_walk(),
to make the code more efficient: the memory of the shadow pages is still
hot in the CPU cache, and there is no need to visit the shadow pages
again later.

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f35fd5c59c38..2446ede0b7b9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1794,19 +1794,23 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 				return -ENOSPC;
 
 			ret = __mmu_unsync_walk(child, pvec);
-			if (!ret) {
-				clear_unsync_child_bit(sp, i);
-				continue;
-			} else if (ret > 0) {
-				nr_unsync_leaf += ret;
-			} else
+			if (ret < 0)
 				return ret;
-		} else if (child->unsync) {
+			nr_unsync_leaf += ret;
+		}
+
+		/*
+		 * Clear unsync bit for @child directly if @child is fully
+		 * walked and all the unsync shadow pages descended from
+		 * @child (including itself) are added into @pvec, the caller
+		 * must sync or zap all the unsync shadow pages in @pvec.
+		 */
+		clear_unsync_child_bit(sp, i);
+		if (child->unsync) {
 			nr_unsync_leaf++;
 			if (mmu_pages_add(pvec, child, i))
 				return -ENOSPC;
-		} else
-			clear_unsync_child_bit(sp, i);
+		}
 	}
 
 	return nr_unsync_leaf;
--
2.19.1.6.gb485710b
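
For reference, the clear_unsync_child_bit() referred to throughout this thread
is a small helper in arch/x86/kvm/mmu/mmu.c that drops one entry from the
parent's unsync-child tracking. Roughly (paraphrased; the exact body may differ
slightly between kernel versions):

static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
{
	/* One fewer child of @sp still needs to be synced or zapped. */
	--sp->unsync_children;
	WARN_ON((int)sp->unsync_children < 0);
	__clear_bit(idx, sp->unsync_child_bitmap);
}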


2022-07-19 20:08:43

by Sean Christopherson

Subject: Re: [PATCH 05/12] KVM: X86/MMU: Clear unsync bit directly in __mmu_unsync_walk()

On Sun, Jun 05, 2022, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> mmu_unsync_walk() and __mmu_unsync_walk() requires the caller to clear
> unsync for the shadow pages in the resulted pvec by synching them or
> zapping them.
>
> All callers does so.
>
> Otherwise mmu_unsync_walk() and __mmu_unsync_walk() can't work because
> they always walk from the beginning.
>
> It is possible to make mmu_unsync_walk() and __mmu_unsync_walk() lists
> unsync shadow pages in the resulted pvec without needing synching them
> or zapping them later. It would require to change mmu_unsync_walk()
> and __mmu_unsync_walk() and make it walk from the last visited position
> derived from the resulted pvec of the previous call of mmu_unsync_walk().
>
> It would complicate the walk and no callers require the possible new
> behavior.
>
> It is better to keep the original behavior.
>
> Since the shadow pages in the resulted pvec will be synced or zapped,
> and clear_unsync_child_bit() for parents will be called anyway later.
>
> Call clear_unsync_child_bit() earlier and directly in __mmu_unsync_walk()
> to make the code more efficient (the memory of the shadow pages is hot
> in the CPU cache, and no need to visit the shadow pages again later).

The changelog and shortlog do a poor job of capturing what this patch actually
does. This is a prime example of why I prefer that changelogs first document
what the patch is doing, and only then dive into background details and alternatives.

This changelog has 6-7 paragraphs talking about current KVM behaviors and
alternatives before talking about the patch itself, and then doesn't actually
describe the net effect of the change.

The use of "directly" in the shortlog is also confusing because __mmu_unsync_walk()
already invokes clear_unsync_child_bit(), e.g. this patch only affects
__mmu_unsync_walk(). IIUC, the change is that __mmu_unsync_walk() will clear
the unsync info when adding to @pvec instead of having to redo the walk after
zapping/synching the page.

KVM: x86/mmu: Clear unsync child _before_ final zap/sync

Clear the unsync child information for a shadow page when adding it to
the array of to-be-zapped/synced shadow pages, i.e. _before_ the actual
zap/sync that effects the "no longer unsync" state change. Callers of
mmu_unsync_walk() and __mmu_unsync_walk() are required to zap/sync all
entries in @pvec before dropping mmu_lock, i.e. once a shadow page is
added to the set of pages to zap/sync, success is guaranteed.

Clearing the unsync info when adding to the array yields more efficient
code as KVM will no longer need to rewalk the shadow pages to "discover"
that the child pages are no longer unsync, and as a bonus, the metadata
for the shadow page will be hot in the CPU cache.

Note, this obviously doesn't work if success isn't guaranteed, but
mmu_unsync_walk() and __mmu_unsync_walk() would require significant
changes to allow restarting a walk after failure to zap/sync. I.e.
this is but one of many details that would need to change.

> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 22 +++++++++++++---------
> 1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f35fd5c59c38..2446ede0b7b9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1794,19 +1794,23 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
> return -ENOSPC;
>
> ret = __mmu_unsync_walk(child, pvec);
> - if (!ret) {
> - clear_unsync_child_bit(sp, i);
> - continue;
> - } else if (ret > 0) {
> - nr_unsync_leaf += ret;
> - } else
> + if (ret < 0)
> return ret;
> - } else if (child->unsync) {
> + nr_unsync_leaf += ret;
> + }
> +
> + /*
> + * Clear unsync bit for @child directly if @child is fully
> + * walked and all the unsync shadow pages descended from
> + * @child (including itself) are added into @pvec, the caller
> + * must sync or zap all the unsync shadow pages in @pvec.
> + */
> + clear_unsync_child_bit(sp, i);
> + if (child->unsync) {
> nr_unsync_leaf++;
> if (mmu_pages_add(pvec, child, i))

This ordering is wrong, no? If the child itself is unsync and can't be added to
@pvec, i.e. fails here, then clearing its bit in unsync_child_bitmap is wrong.

I also dislike that this patch obfuscates that a shadow page can't be unsync
itself _and_ have unsync children (because only PG_LEVEL_4K can be unsync). In
other words, keep the

if (child->unsync_children) {

} else if (child->unsync) {

}

And at that point, we can streamline this further:

	int i, ret, nr_unsync_leaf = 0;

	for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
		struct kvm_mmu_page *child;
		u64 ent = sp->spt[i];

		if (is_shadow_present_pte(ent) && !is_large_pte(ent)) {
			child = to_shadow_page(ent & PT64_BASE_ADDR_MASK);
			if (child->unsync_children) {
				ret = __mmu_unsync_walk_and_clear(child, pvec);
				if (ret < 0)
					return ret;
				nr_unsync_leaf += ret;
			} else if (child->unsync) {
				if (mmu_pages_add(pvec, child))
					return -ENOSPC;
				nr_unsync_leaf++;
			}
		}

		/*
		 * Clear the unsync info, the child is either already sync
		 * (bitmap is stale) or is guaranteed to be zapped/synced by
		 * the caller before mmu_lock is released. Note, the caller is
		 * required to zap/sync all entries in @pvec even if an error
		 * is returned!
		 */
		clear_unsync_child_bit(sp, i);
	}

	return nr_unsync_leaf;

2022-07-21 09:50:48

by Lai Jiangshan

Subject: Re: [PATCH 05/12] KVM: X86/MMU: Clear unsync bit directly in __mmu_unsync_walk()

On Wed, Jul 20, 2022 at 3:52 AM Sean Christopherson <[email protected]> wrote:

> > ---
> > arch/x86/kvm/mmu/mmu.c | 22 +++++++++++++---------
> > 1 file changed, 13 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f35fd5c59c38..2446ede0b7b9 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1794,19 +1794,23 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
> > return -ENOSPC;
> >
> > ret = __mmu_unsync_walk(child, pvec);
> > - if (!ret) {
> > - clear_unsync_child_bit(sp, i);
> > - continue;
> > - } else if (ret > 0) {
> > - nr_unsync_leaf += ret;
> > - } else
> > + if (ret < 0)
> > return ret;
> > - } else if (child->unsync) {
> > + nr_unsync_leaf += ret;
> > + }
> > +
> > + /*
> > + * Clear unsync bit for @child directly if @child is fully
> > + * walked and all the unsync shadow pages descended from
> > + * @child (including itself) are added into @pvec, the caller
> > + * must sync or zap all the unsync shadow pages in @pvec.
> > + */
> > + clear_unsync_child_bit(sp, i);
> > + if (child->unsync) {
> > nr_unsync_leaf++;
> > if (mmu_pages_add(pvec, child, i))
>
> This ordering is wrong, no? If the child itself is unsync and can't be added to
> @pvec, i.e. fails here, then clearing its bit in unsync_child_bitmap is wrong.

mmu_pages_add() always successfully adds the page to @pvec; the caller
only needs to guarantee there is enough room for it to do so.

When it returns true, it means @pvec has become full, so adding more pages
would fail.
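
As background, mmu_pages_add() and the page vector it fills look roughly like
the following (a paraphrased sketch of the mmu.c code of this era, not a
verbatim copy):

#define KVM_PAGE_ARRAY_NR 16

struct kvm_mmu_pages {
	struct mmu_page_and_offset {
		struct kvm_mmu_page *sp;
		unsigned int idx;
	} page[KVM_PAGE_ARRAY_NR];
	unsigned int nr;
};

static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
			 int idx)
{
	int i;

	/* An unsync page may already be in the vector; don't add it twice. */
	if (sp->unsync)
		for (i = 0; i < pvec->nr; i++)
			if (pvec->page[i].sp == sp)
				return 0;

	/* The page is always recorded ... */
	pvec->page[pvec->nr].sp = sp;
	pvec->page[pvec->nr].idx = idx;
	pvec->nr++;

	/*
	 * ... and the return value only reports "the vector is now full",
	 * i.e. a subsequent add is the one that would not fit.
	 */
	return (pvec->nr == KVM_PAGE_ARRAY_NR);
}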

>
> I also dislike that that this patch obfuscates that a shadow page can't be unsync
> itself _and_ have unsync children (because only PG_LEVEL_4K can be unsync). In
> other words, keep the
>
> if (child->unsync_children) {
>
> } else if (child->unsync) {
>
> }
>

The code was not streamlined like this only because I needed to add
a comment on clear_unsync_child_bit().

Duplicated calls to clear_unsync_child_bit() would require duplicated
comments. I will use "See above" for the second one instead.

2022-07-21 16:34:32

by Sean Christopherson

Subject: Re: [PATCH 05/12] KVM: X86/MMU: Clear unsync bit directly in __mmu_unsync_walk()

On Thu, Jul 21, 2022, Lai Jiangshan wrote:
> On Wed, Jul 20, 2022 at 3:52 AM Sean Christopherson <[email protected]> wrote:
>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 22 +++++++++++++---------
> > > 1 file changed, 13 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index f35fd5c59c38..2446ede0b7b9 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1794,19 +1794,23 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
> > > return -ENOSPC;
> > >
> > > ret = __mmu_unsync_walk(child, pvec);
> > > - if (!ret) {
> > > - clear_unsync_child_bit(sp, i);
> > > - continue;
> > > - } else if (ret > 0) {
> > > - nr_unsync_leaf += ret;
> > > - } else
> > > + if (ret < 0)
> > > return ret;
> > > - } else if (child->unsync) {
> > > + nr_unsync_leaf += ret;
> > > + }
> > > +
> > > + /*
> > > + * Clear unsync bit for @child directly if @child is fully
> > > + * walked and all the unsync shadow pages descended from
> > > + * @child (including itself) are added into @pvec, the caller
> > > + * must sync or zap all the unsync shadow pages in @pvec.
> > > + */
> > > + clear_unsync_child_bit(sp, i);
> > > + if (child->unsync) {
> > > nr_unsync_leaf++;
> > > if (mmu_pages_add(pvec, child, i))
> >
> > This ordering is wrong, no? If the child itself is unsync and can't be added to
> > @pvec, i.e. fails here, then clearing its bit in unsync_child_bitmap is wrong.
>
> mmu_pages_add() can always successfully add the page to @pvec and
> the caller needs to guarantee there is enough room to do so.
>
> When it returns true, it means it will fail if you keep adding pages.

Oof, that's downright evil. As prep work, can you fold in the attached patches
earlier in this series? Then this patch can yield:

	for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
		struct kvm_mmu_page *child;
		u64 ent = sp->spt[i];

		if (!is_shadow_present_pte(ent) || is_large_pte(ent))
			goto clear_unsync_child;

		child = to_shadow_page(ent & SPTE_BASE_ADDR_MASK);
		if (!child->unsync && !child->unsync_children)
			goto clear_unsync_child;

		if (mmu_is_page_vec_full(pvec))
			return -ENOSPC;

		mmu_pages_add(pvec, child, i);

		if (child->unsync_children) {
			ret = __mmu_unsync_walk(child, pvec);
			if (!ret)
				goto clear_unsync_child;
			else if (ret > 0)
				nr_unsync_leaf += ret;
			else
				return ret;
		} else {
			nr_unsync_leaf++;
		}

clear_unsync_child:
		/*
		 * Clear the unsync info, the child is either already sync
		 * (bitmap is stale) or is guaranteed to be zapped/synced by
		 * the caller before mmu_lock is released. Note, the caller is
		 * required to zap/sync all entries in @pvec even if an error
		 * is returned!
		 */
		clear_unsync_child_bit(sp, i);
	}
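
The attached prep patches are not inlined in the archive; the snippet above
assumes a split along these lines, with the fullness check hoisted out of
mmu_pages_add(). The helper bodies below are a hypothetical sketch inferred
from the calls above, not the actual attachments:

/* Hypothetical sketch of the prep-patch split, not the attached patches. */
static bool mmu_is_page_vec_full(struct kvm_mmu_pages *pvec)
{
	return pvec->nr == KVM_PAGE_ARRAY_NR;
}

static void mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
			  int idx)
{
	int i;

	/* An unsync page may already be in the vector; don't add it twice. */
	if (sp->unsync)
		for (i = 0; i < pvec->nr; i++)
			if (pvec->page[i].sp == sp)
				return;

	/* The caller has already checked for space, so this cannot overflow. */
	pvec->page[pvec->nr].sp = sp;
	pvec->page[pvec->nr].idx = idx;
	pvec->nr++;
}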


Attachments:
0001-KVM-x86-mmu-Separate-page-vec-is-full-from-adding-a-.patch (2.72 kB)
0002-KVM-x86-mmu-Check-for-full-page-vector-_before_-addi.patch (1.84 kB)