Set the min_level for the TDP iterator at the root level when zapping all
SPTEs so that the _iterator_ only processes top-level SPTEs. Zapping a
non-leaf SPTE will recursively zap all its children, thus there is no
need for the iterator to attempt to step down. This avoids rereading all
the top-level SPTEs after they are zapped by causing try_step_down() to
short-circuit.
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6566f70a31c1..aec069c18d82 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -751,6 +751,16 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
{
bool zap_all = (end == ZAP_ALL_END);
struct tdp_iter iter;
+ int min_level;
+
+ /*
+ * No need to step down in the iterator when zapping all SPTEs, zapping
+ * the top-level non-leaf SPTEs will recurse on all their children.
+ */
+ if (zap_all)
+ min_level = root->role.level;
+ else
+ min_level = PG_LEVEL_4K;
/*
* Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
@@ -763,7 +773,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- tdp_root_for_each_pte(iter, root, start, end) {
+ for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
+ min_level, start, end) {
retry:
if (can_yield &&
tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
--
2.33.0.rc1.237.g0d66db33f3-goog
On Wed, Aug 11, 2021 at 10:07 PM Sean Christopherson <[email protected]> wrote:
>
> Set the min_level for the TDP iterator at the root level when zapping all
> SPTEs so that the _iterator_ only processes top-level SPTEs. Zapping a
> non-leaf SPTE will recursively zap all its children, thus there is no
> need for the iterator to attempt to step down. This avoids rereading all
> the top-level SPTEs after they are zapped by causing try_step_down() to
> short-circuit.
>
> Cc: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
This change looks functionally correct, but I'm not sure it's worth
adding more code special cased on zap-all for what seems like a small
performance improvement in a context which shouldn't be particularly
performance sensitive.
Change is a correct optimization though and it's not much extra code,
so I'm happy to give a:
Reviewed-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6566f70a31c1..aec069c18d82 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -751,6 +751,16 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> {
> bool zap_all = (end == ZAP_ALL_END);
> struct tdp_iter iter;
> + int min_level;
> +
> + /*
> + * No need to step down in the iterator when zapping all SPTEs, zapping
> + * the top-level non-leaf SPTEs will recurse on all their children.
> + */
> + if (zap_all)
> + min_level = root->role.level;
> + else
> + min_level = PG_LEVEL_4K;
>
> /*
> * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> @@ -763,7 +773,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>
> rcu_read_lock();
>
> - tdp_root_for_each_pte(iter, root, start, end) {
> + for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> + min_level, start, end) {
> retry:
> if (can_yield &&
> tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
> --
> 2.33.0.rc1.237.g0d66db33f3-goog
>
On 12/08/21 19:07, Sean Christopherson wrote:
> Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
> saying as much. I'll do that for v2 and let Paolo decide.
I think it makes sense to have it. You can even use the ternary operator
/*
* When zapping everything, all entries at the top level
* ultimately go away, and the levels below go down with them.
* So do not bother iterating all the way down to the leaves.
*/
int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
Paolo
On Thu, Aug 12, 2021, Paolo Bonzini wrote:
> On 12/08/21 19:07, Sean Christopherson wrote:
> > Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
> > saying as much. I'll do that for v2 and let Paolo decide.
>
> I think it makes sense to have it. You can even use the ternary operator
Hah, yeah, I almost used a ternary op. Honestly don't know why I didn't, guess
my brain flipped a coin.
>
> /*
> * When zapping everything, all entries at the top level
> * ultimately go away, and the levels below go down with them.
> * So do not bother iterating all the way down to the leaves.
The subtle part is that try_step_down() won't actually iterate down because it
explicitly rereads and rechecks the SPTE.
if (iter->level == iter->min_level)
return false;
/*
* Reread the SPTE before stepping down to avoid traversing into page
* tables that are no longer linked from this entry.
*/
iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); \
---> this is the code that is avoided
child_pt = spte_to_child_pt(iter->old_spte, iter->level); /
if (!child_pt)
return false;
My comment wasn't all that accurate either. Maybe this?
/*
* No need to try to step down in the iterator when zapping all SPTEs,
* zapping the top-level non-leaf SPTEs will recurse on their children.
*/
int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
> */
> int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
>
> Paolo
>
On 12/08/21 19:33, Sean Christopherson wrote:
> On Thu, Aug 12, 2021, Paolo Bonzini wrote:
>> On 12/08/21 19:07, Sean Christopherson wrote:
>>> Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
>>> saying as much. I'll do that for v2 and let Paolo decide.
>>
>> I think it makes sense to have it. You can even use the ternary operator
>
> Hah, yeah, I almost used a ternary op. Honestly don't know why I didn't, guess
> my brain flipped a coin.
>
>>
>> /*
>> * When zapping everything, all entries at the top level
>> * ultimately go away, and the levels below go down with them.
>> * So do not bother iterating all the way down to the leaves.
>
> The subtle part is that try_step_down() won't actually iterate down because it
> explicitly rereads and rechecks the SPTE.
>
> if (iter->level == iter->min_level)
> return false;
>
> /*
> * Reread the SPTE before stepping down to avoid traversing into page
> * tables that are no longer linked from this entry.
> */
> iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); \
> ---> this is the code that is avoided
> child_pt = spte_to_child_pt(iter->old_spte, iter->level); /
> if (!child_pt)
> return false;
Ah, right - so I agree with Ben that it's not too important.
Paolo
On Thu, Aug 12, 2021, Paolo Bonzini wrote:
> On 12/08/21 19:33, Sean Christopherson wrote:
> > On Thu, Aug 12, 2021, Paolo Bonzini wrote:
> > > On 12/08/21 19:07, Sean Christopherson wrote:
> > > > Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
> > > > saying as much. I'll do that for v2 and let Paolo decide.
> > >
> > > I think it makes sense to have it. You can even use the ternary operator
> >
> > Hah, yeah, I almost used a ternary op. Honestly don't know why I didn't, guess
> > my brain flipped a coin.
> >
> > >
> > > /*
> > > * When zapping everything, all entries at the top level
> > > * ultimately go away, and the levels below go down with them.
> > > * So do not bother iterating all the way down to the leaves.
> >
> > The subtle part is that try_step_down() won't actually iterate down because it
> > explicitly rereads and rechecks the SPTE.
> >
> > if (iter->level == iter->min_level)
> > return false;
> >
> > /*
> > * Reread the SPTE before stepping down to avoid traversing into page
> > * tables that are no longer linked from this entry.
> > */
> > iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); \
> > ---> this is the code that is avoided
> > child_pt = spte_to_child_pt(iter->old_spte, iter->level); /
> > if (!child_pt)
> > return false;
>
> Ah, right - so I agree with Ben that it's not too important.
Ya. There is a measurable performance improvement, but it's really only
meaningful when there aren't many SPTEs to zap, otherwise the cost of zapping
completely dominates the time.
The one thing that makes me want to include the optimization is that it will kick
in if userspace is constantly modifying memslots, e.g. for option ROMs, in which
case many of the memslot-induced zaps will run with relatively few SPTEs. The
thread doing the zapping isn't a vCPU thread, but it still holds mmu_lock for
read and thus can be a noisy neighbor of sorts.
On Thu, Aug 12, 2021, Ben Gardon wrote:
> On Wed, Aug 11, 2021 at 10:07 PM Sean Christopherson <[email protected]> wrote:
> >
> > Set the min_level for the TDP iterator at the root level when zapping all
> > SPTEs so that the _iterator_ only processes top-level SPTEs. Zapping a
> > non-leaf SPTE will recursively zap all its children, thus there is no
> > need for the iterator to attempt to step down. This avoids rereading all
> > the top-level SPTEs after they are zapped by causing try_step_down() to
> > short-circuit.
> >
> > Cc: Ben Gardon <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
>
> This change looks functionally correct, but I'm not sure it's worth
> adding more code special cased on zap-all for what seems like a small
> performance improvement in a context which shouldn't be particularly
> performance sensitive.
Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
saying as much. I'll do that for v2 and let Paolo decide.
Thanks!
> Change is a correct optimization though and it's not much extra code,
> so I'm happy to give a:
> Reviewed-by: Ben Gardon <[email protected]>
On 12/08/21 19:46, Sean Christopherson wrote:
>>> if (iter->level == iter->min_level)
>>> return false;
>>>
>>> /*
>>> * Reread the SPTE before stepping down to avoid traversing into page
>>> * tables that are no longer linked from this entry.
>>> */
>>> iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); \
>>> ---> this is the code that is avoided
>>> child_pt = spte_to_child_pt(iter->old_spte, iter->level); /
>>> if (!child_pt)
>>> return false;
>> Ah, right - so I agree with Ben that it's not too important.
> Ya. There is a measurable performance improvement, but it's really only
> meaningful when there aren't many SPTEs to zap, otherwise the cost of zapping
> completely dominates the time.
I don't understand. When try_step_down is called by tdp_iter_next, all
it does is really just the READ_ONCE, because spte_to_child_pt will see
a non-present PTE and return immediately. Why do two, presumably cache
hot, reads cause a measurable performance improvement?
Paolo
On Fri, Aug 13, 2021, Paolo Bonzini wrote:
> On 12/08/21 19:46, Sean Christopherson wrote:
> > > > if (iter->level == iter->min_level)
> > > > return false;
> > > >
> > > > /*
> > > > * Reread the SPTE before stepping down to avoid traversing into page
> > > > * tables that are no longer linked from this entry.
> > > > */
> > > > iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); \
> > > > ---> this is the code that is avoided
> > > > child_pt = spte_to_child_pt(iter->old_spte, iter->level); /
> > > > if (!child_pt)
> > > > return false;
> > > Ah, right - so I agree with Ben that it's not too important.
> > Ya. There is a measurable performance improvement, but it's really only
> > meaningful when there aren't many SPTEs to zap, otherwise the cost of zapping
> > completely dominates the time.
>
> I don't understand. When try_step_down is called by tdp_iter_next, all it
> does is really just the READ_ONCE, because spte_to_child_pt will see a
> non-present PTE and return immediately. Why do two, presumably cache hot,
> reads cause a measurable performance improvement?
It's entirely possible my measurements were bad and/or noisy. Ah, and my kernel
was running with CONFIG_PROVE_RCU=y, which makes the rcu_dereference() quite a bit
more expensive.
On 13/08/21 18:13, Sean Christopherson wrote:
> On Fri, Aug 13, 2021, Paolo Bonzini wrote:
>> On 12/08/21 19:46, Sean Christopherson wrote:
>>>>> if (iter->level == iter->min_level)
>>>>> return false;
>>>>>
>>>>> /*
>>>>> * Reread the SPTE before stepping down to avoid traversing into page
>>>>> * tables that are no longer linked from this entry.
>>>>> */
>>>>> iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); \
>>>>> ---> this is the code that is avoided
>>>>> child_pt = spte_to_child_pt(iter->old_spte, iter->level); /
>>>>> if (!child_pt)
>>>>> return false;
>>>> Ah, right - so I agree with Ben that it's not too important.
>>> Ya. There is a measurable performance improvement, but it's really only
>>> meaningful when there aren't many SPTEs to zap, otherwise the cost of zapping
>>> completely dominates the time.
>>
>> I don't understand. When try_step_down is called by tdp_iter_next, all it
>> does is really just the READ_ONCE, because spte_to_child_pt will see a
>> non-present PTE and return immediately. Why do two, presumably cache hot,
>> reads cause a measurable performance improvement?
>
> It's entirely possible my measurements were bad and/or noisy. Ah, and my kernel
> was running with CONFIG_PROVE_RCU=y, which makes the rcu_dereference() quite a bit
> more expensive.
It's one line of code and it makes sense, so I can certainly include the
patch. I was just a bit confused.
Paolo