The current implementation of the stage-2 unmap walker traverses
the given range and, as a part of break-before-make, performs
TLB invalidations with a DSB for every PTE. Performing this
sequence repeatedly across a large range can become a performance
bottleneck.
Hence, if the system supports FEAT_TLBIRANGE, defer the TLB
invalidations until the entire walk is finished, and then
use range-based instructions to invalidate the TLBs in one go.
Condition this upon S2FWB in order to avoid walking the page-table
again to perform the CMOs after issuing the TLBI.
Rename stage2_put_pte() to stage2_unmap_put_pte() as the function
now serves the stage-2 unmap walker specifically, rather than
acting generic.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 35 ++++++++++++++++++++++++++++++-----
1 file changed, 30 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b8f0dbd12f773..5832ee3418fb0 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -771,16 +771,34 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
smp_store_release(ctx->ptep, new);
}
-static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu,
- struct kvm_pgtable_mm_ops *mm_ops)
+static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
{
+ /*
+ * If FEAT_TLBIRANGE is implemented, defer the individial PTE
+ * TLB invalidations until the entire walk is finished, and
+ * then use the range-based TLBI instructions to do the
+ * invalidations. Condition this upon S2FWB in order to avoid
+ * a page-table walk again to perform the CMOs after TLBI.
+ */
+ return system_supports_tlb_range() && stage2_has_fwb(pgt);
+}
+
+static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
+ struct kvm_s2_mmu *mmu,
+ struct kvm_pgtable_mm_ops *mm_ops)
+{
+ struct kvm_pgtable *pgt = ctx->arg;
+
/*
* Clear the existing PTE, and perform break-before-make with
* TLB maintenance if it was valid.
*/
if (kvm_pte_valid(ctx->old)) {
kvm_clear_pte(ctx->ptep);
- kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
+
+ if (!stage2_unmap_defer_tlb_flush(pgt))
+ kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
+ ctx->addr, ctx->level);
}
mm_ops->put_page(ctx->ptep);
@@ -1015,7 +1033,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
* block entry and rely on the remaining portions being faulted
* back lazily.
*/
- stage2_put_pte(ctx, mmu, mm_ops);
+ stage2_unmap_put_pte(ctx, mmu, mm_ops);
if (need_flush && mm_ops->dcache_clean_inval_poc)
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
@@ -1029,13 +1047,20 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
+ int ret;
struct kvm_pgtable_walker walker = {
.cb = stage2_unmap_walker,
.arg = pgt,
.flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};
- return kvm_pgtable_walk(pgt, addr, size, &walker);
+ ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+ if (stage2_unmap_defer_tlb_flush(pgt))
+ /* Perform the deferred TLB invalidations */
+ kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu,
+ addr, addr + size);
+
+ return ret;
}
struct stage2_attr_data {
--
2.40.1.698.g37aff9b760-goog
On Fri, May 19, 2023 at 12:52:31AM +0000, Raghavendra Rao Ananta wrote:
> The current implementation of the stage-2 unmap walker traverses
> the given range and, as a part of break-before-make, performs
> TLB invalidations with a DSB for every PTE. A multitude of this
> combination could cause a performance bottleneck.
>
> Hence, if the system supports FEAT_TLBIRANGE, defer the TLB
> invalidations until the entire walk is finished, and then
> use range-based instructions to invalidate the TLBs in one go.
> Condition this upon S2FWB in order to avoid walking the page-table
> again to perform the CMOs after issuing the TLBI.
nit: Rather than discussing a theoretical CMO walker, I think this is
more readable if you mention the existing behavior of the walker.
Condition deferred TLB invalidation on the system supporting FWB, as
the optimization is entirely pointless when the unmap walker needs to
perform CMOs.
> Rename stage2_put_pte() to stage2_unmap_put_pte() as the function
> now serves the stage-2 unmap walker specifically, rather than
> acting generic.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 35 ++++++++++++++++++++++++++++++-----
> 1 file changed, 30 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index b8f0dbd12f773..5832ee3418fb0 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -771,16 +771,34 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
> smp_store_release(ctx->ptep, new);
> }
>
> -static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu,
> - struct kvm_pgtable_mm_ops *mm_ops)
> +static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
> {
> + /*
> + * If FEAT_TLBIRANGE is implemented, defer the individial PTE
typo: individual
Also, 'PTE' isn't significant here.
> + * TLB invalidations until the entire walk is finished, and
> + * then use the range-based TLBI instructions to do the
> + * invalidations. Condition this upon S2FWB in order to avoid
> + * a page-table walk again to perform the CMOs after TLBI.
> + */
Apply the wording suggestion from the changelog here as well.
> + return system_supports_tlb_range() && stage2_has_fwb(pgt);
> +}
> +
> +static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
> + struct kvm_s2_mmu *mmu,
> + struct kvm_pgtable_mm_ops *mm_ops)
> +{
> + struct kvm_pgtable *pgt = ctx->arg;
> +
> /*
> * Clear the existing PTE, and perform break-before-make with
> * TLB maintenance if it was valid.
> */
> if (kvm_pte_valid(ctx->old)) {
> kvm_clear_pte(ctx->ptep);
> - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> +
> + if (!stage2_unmap_defer_tlb_flush(pgt))
> + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
> + ctx->addr, ctx->level);
> }
>
> mm_ops->put_page(ctx->ptep);
> @@ -1015,7 +1033,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> * block entry and rely on the remaining portions being faulted
> * back lazily.
> */
> - stage2_put_pte(ctx, mmu, mm_ops);
> + stage2_unmap_put_pte(ctx, mmu, mm_ops);
>
> if (need_flush && mm_ops->dcache_clean_inval_poc)
> mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
> @@ -1029,13 +1047,20 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>
> int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
> {
> + int ret;
> struct kvm_pgtable_walker walker = {
> .cb = stage2_unmap_walker,
> .arg = pgt,
> .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
> };
>
> - return kvm_pgtable_walk(pgt, addr, size, &walker);
> + ret = kvm_pgtable_walk(pgt, addr, size, &walker);
> + if (stage2_unmap_defer_tlb_flush(pgt))
> + /* Perform the deferred TLB invalidations */
> + kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu,
> + addr, addr + size);
> +
> + return ret;
> }
>
> struct stage2_attr_data {
> --
> 2.40.1.698.g37aff9b760-goog
>
--
Thanks,
Oliver
On Fri, 19 May 2023 01:52:31 +0100,
Raghavendra Rao Ananta <[email protected]> wrote:
>
> The current implementation of the stage-2 unmap walker traverses
> the given range and, as a part of break-before-make, performs
> TLB invalidations with a DSB for every PTE. A multitude of this
> combination could cause a performance bottleneck.
>
> Hence, if the system supports FEAT_TLBIRANGE, defer the TLB
> invalidations until the entire walk is finished, and then
> use range-based instructions to invalidate the TLBs in one go.
> Condition this upon S2FWB in order to avoid walking the page-table
> again to perform the CMOs after issuing the TLBI.
But that's the real bottleneck. TLBIs are cheap compared to CMOs, even
on remarkably bad implementations. What is your plan to fix this?
>
> Rename stage2_put_pte() to stage2_unmap_put_pte() as the function
> now serves the stage-2 unmap walker specifically, rather than
> acting generic.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 35 ++++++++++++++++++++++++++++++-----
> 1 file changed, 30 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index b8f0dbd12f773..5832ee3418fb0 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -771,16 +771,34 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
> smp_store_release(ctx->ptep, new);
> }
>
> -static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu,
> - struct kvm_pgtable_mm_ops *mm_ops)
> +static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
> {
> + /*
> + * If FEAT_TLBIRANGE is implemented, defer the individial PTE
> + * TLB invalidations until the entire walk is finished, and
> + * then use the range-based TLBI instructions to do the
> + * invalidations. Condition this upon S2FWB in order to avoid
> + * a page-table walk again to perform the CMOs after TLBI.
> + */
> + return system_supports_tlb_range() && stage2_has_fwb(pgt);
> +}
> +
> +static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
> + struct kvm_s2_mmu *mmu,
> + struct kvm_pgtable_mm_ops *mm_ops)
> +{
> + struct kvm_pgtable *pgt = ctx->arg;
> +
> /*
> * Clear the existing PTE, and perform break-before-make with
> * TLB maintenance if it was valid.
> */
> if (kvm_pte_valid(ctx->old)) {
> kvm_clear_pte(ctx->ptep);
> - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> +
> + if (!stage2_unmap_defer_tlb_flush(pgt))
> + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
> + ctx->addr, ctx->level);
This really doesn't match the comment anymore.
Overall, I'm very concerned that we lose the consistency property that
the current code has: once called, the TLBs and the page tables are
synchronised.
Yes, this patch looks correct. But it is also really fragile.
> }
>
> mm_ops->put_page(ctx->ptep);
> @@ -1015,7 +1033,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> * block entry and rely on the remaining portions being faulted
> * back lazily.
> */
> - stage2_put_pte(ctx, mmu, mm_ops);
> + stage2_unmap_put_pte(ctx, mmu, mm_ops);
>
> if (need_flush && mm_ops->dcache_clean_inval_poc)
> mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
> @@ -1029,13 +1047,20 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>
> int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
> {
> + int ret;
> struct kvm_pgtable_walker walker = {
> .cb = stage2_unmap_walker,
> .arg = pgt,
> .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
> };
>
> - return kvm_pgtable_walk(pgt, addr, size, &walker);
> + ret = kvm_pgtable_walk(pgt, addr, size, &walker);
> + if (stage2_unmap_defer_tlb_flush(pgt))
> + /* Perform the deferred TLB invalidations */
> + kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu,
> + addr, addr + size);
This "kvm_call_hyp(__kvm_tlb_flush_vmid_range,...)" could do with a
wrapper from the point where you introduce it.
> +
> + return ret;
> }
>
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Mon, May 29, 2023 at 7:18 AM Marc Zyngier <[email protected]> wrote:
>
> On Fri, 19 May 2023 01:52:31 +0100,
> Raghavendra Rao Ananta <[email protected]> wrote:
> >
> > The current implementation of the stage-2 unmap walker traverses
> > the given range and, as a part of break-before-make, performs
> > TLB invalidations with a DSB for every PTE. A multitude of this
> > combination could cause a performance bottleneck.
> >
> > Hence, if the system supports FEAT_TLBIRANGE, defer the TLB
> > invalidations until the entire walk is finished, and then
> > use range-based instructions to invalidate the TLBs in one go.
> > Condition this upon S2FWB in order to avoid walking the page-table
> > again to perform the CMOs after issuing the TLBI.
>
> But that's the real bottleneck. TLBIs are cheap compared to CMOs, even
> on remarkably bad implementations. What is your plan to fix this?
>
Correct me if I'm wrong, but my understanding was that the repeated
issuance of TLBI + DSB was the bottleneck, and this patch tries to
avoid that by issuing only one TLBI + DSB at the end.
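Schematically (a rough sketch based on the hunks in this patch, with
refcounting and barrier details elided), the unmap walk goes from

	for each valid PTE in [addr, addr + size):
		kvm_clear_pte(ctx->ptep);
		/* per-PTE maintenance: a TLBI + DSB on every iteration */
		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);

to

	for each valid PTE in [addr, addr + size):
		kvm_clear_pte(ctx->ptep);
	/* one deferred, range-based invalidation: a single TLBI + DSB */
	kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu, addr, addr + size);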
> >
> > Rename stage2_put_pte() to stage2_unmap_put_pte() as the function
> > now serves the stage-2 unmap walker specifically, rather than
> > acting generic.
> >
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/pgtable.c | 35 ++++++++++++++++++++++++++++++-----
> > 1 file changed, 30 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index b8f0dbd12f773..5832ee3418fb0 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -771,16 +771,34 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
> > smp_store_release(ctx->ptep, new);
> > }
> >
> > -static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu,
> > - struct kvm_pgtable_mm_ops *mm_ops)
> > +static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
> > {
> > + /*
> > + * If FEAT_TLBIRANGE is implemented, defer the individial PTE
> > + * TLB invalidations until the entire walk is finished, and
> > + * then use the range-based TLBI instructions to do the
> > + * invalidations. Condition this upon S2FWB in order to avoid
> > + * a page-table walk again to perform the CMOs after TLBI.
> > + */
> > + return system_supports_tlb_range() && stage2_has_fwb(pgt);
> > +}
> > +
> > +static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
> > + struct kvm_s2_mmu *mmu,
> > + struct kvm_pgtable_mm_ops *mm_ops)
> > +{
> > + struct kvm_pgtable *pgt = ctx->arg;
> > +
> > /*
> > * Clear the existing PTE, and perform break-before-make with
> > * TLB maintenance if it was valid.
> > */
> > if (kvm_pte_valid(ctx->old)) {
> > kvm_clear_pte(ctx->ptep);
> > - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> > +
> > + if (!stage2_unmap_defer_tlb_flush(pgt))
> > + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
> > + ctx->addr, ctx->level);
>
> This really doesn't match the comment anymore.
>
Right, I can re-write this in the next spin.
> Overall, I'm very concerned that we lose the consistency property that
> the current code has: once called, the TLBs and the page tables are
> synchronised.
>
> Yes, this patch looks correct. But it is also really fragile.
>
Yeah, we were a little skeptical about this too. Till v2, we had a
different implementation in which we had an independent fast unmap
path that disconnects the PTE hierarchy if the unmap range was exactly
KVM_PGTABLE_MIN_BLOCK_LEVEL [1]. But this had some problems, and we
pivoted to the current implementation.
> > }
> >
> > mm_ops->put_page(ctx->ptep);
> > @@ -1015,7 +1033,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > * block entry and rely on the remaining portions being faulted
> > * back lazily.
> > */
> > - stage2_put_pte(ctx, mmu, mm_ops);
> > + stage2_unmap_put_pte(ctx, mmu, mm_ops);
> >
> > if (need_flush && mm_ops->dcache_clean_inval_poc)
> > mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
> > @@ -1029,13 +1047,20 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> >
> > int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
> > {
> > + int ret;
> > struct kvm_pgtable_walker walker = {
> > .cb = stage2_unmap_walker,
> > .arg = pgt,
> > .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
> > };
> >
> > - return kvm_pgtable_walk(pgt, addr, size, &walker);
> > + ret = kvm_pgtable_walk(pgt, addr, size, &walker);
> > + if (stage2_unmap_defer_tlb_flush(pgt))
> > + /* Perform the deferred TLB invalidations */
> > + kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu,
> > + addr, addr + size);
>
> This "kvm_call_hyp(__kvm_tlb_flush_vmid_range,...)" could do with a
> wrapper from the point where you introduce it.
>
Sorry, I didn't get this comment. Do you mind elaborating on it?
Thank you.
Raghavendra
[1]: https://lore.kernel.org/all/[email protected]/
> > +
> > + return ret;
> > }
> >
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
On Tue, 30 May 2023 22:35:57 +0100,
Raghavendra Rao Ananta <[email protected]> wrote:
>
> On Mon, May 29, 2023 at 7:18 AM Marc Zyngier <[email protected]> wrote:
> >
> > On Fri, 19 May 2023 01:52:31 +0100,
> > Raghavendra Rao Ananta <[email protected]> wrote:
> > >
> > > The current implementation of the stage-2 unmap walker traverses
> > > the given range and, as a part of break-before-make, performs
> > > TLB invalidations with a DSB for every PTE. A multitude of this
> > > combination could cause a performance bottleneck.
> > >
> > > Hence, if the system supports FEAT_TLBIRANGE, defer the TLB
> > > invalidations until the entire walk is finished, and then
> > > use range-based instructions to invalidate the TLBs in one go.
> > > Condition this upon S2FWB in order to avoid walking the page-table
> > > again to perform the CMOs after issuing the TLBI.
> >
> > But that's the real bottleneck. TLBIs are cheap compared to CMOs, even
> > on remarkably bad implementations. What is your plan to fix this?
> >
> Correct me if I'm wrong, but my understanding was that a multiple
> issuance of TLBI + DSB was the bottleneck, and this patch tries to
> avoid this by issuing only one TLBI + DSB at the end.
At least on some of the machines I have access to, CMOs are far more
expensive than TLBIs, and they are the ones causing slowdowns. Your
system shows a different behaviour, and that's fine, but you can't
draw a general conclusion from it.
> > >
> > > Rename stage2_put_pte() to stage2_unmap_put_pte() as the function
> > > now serves the stage-2 unmap walker specifically, rather than
> > > acting generic.
> > >
> > > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > > ---
> > > arch/arm64/kvm/hyp/pgtable.c | 35 ++++++++++++++++++++++++++++++-----
> > > 1 file changed, 30 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index b8f0dbd12f773..5832ee3418fb0 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -771,16 +771,34 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
> > > smp_store_release(ctx->ptep, new);
> > > }
> > >
> > > -static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu,
> > > - struct kvm_pgtable_mm_ops *mm_ops)
> > > +static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
> > > {
> > > + /*
> > > + * If FEAT_TLBIRANGE is implemented, defer the individial PTE
> > > + * TLB invalidations until the entire walk is finished, and
> > > + * then use the range-based TLBI instructions to do the
> > > + * invalidations. Condition this upon S2FWB in order to avoid
> > > + * a page-table walk again to perform the CMOs after TLBI.
> > > + */
> > > + return system_supports_tlb_range() && stage2_has_fwb(pgt);
> > > +}
> > > +
> > > +static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
> > > + struct kvm_s2_mmu *mmu,
> > > + struct kvm_pgtable_mm_ops *mm_ops)
> > > +{
> > > + struct kvm_pgtable *pgt = ctx->arg;
> > > +
> > > /*
> > > * Clear the existing PTE, and perform break-before-make with
> > > * TLB maintenance if it was valid.
> > > */
> > > if (kvm_pte_valid(ctx->old)) {
> > > kvm_clear_pte(ctx->ptep);
> > > - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> > > +
> > > + if (!stage2_unmap_defer_tlb_flush(pgt))
> > > + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
> > > + ctx->addr, ctx->level);
> >
> > This really doesn't match the comment anymore.
> >
> Right, I can re-write this in the next spin.
>
> > Overall, I'm very concerned that we lose the consistency property that
> > the current code has: once called, the TLBs and the page tables are
> > synchronised.
> >
> > Yes, this patch looks correct. But it is also really fragile.
> >
> Yeah, we were a little skeptical about this too. Till v2, we had a
> different implementation in which we had an independent fast unmap
> path that disconnects the PTE hierarchy if the unmap range was exactly
> KVM_PGTABLE_MIN_BLOCK_LEVEL [1]. But this had some problems, and we
> pivoted to the current implementation.
Can we at least have some sort of runtime assertion that, at the point
we release the write lock, the TLBs have been invalidated? Even if
that's tied to some debug config.
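For illustration only, such a check could be shaped like the sketch
below; the pending flag, the helper and the Kconfig symbol are all
hypothetical, nothing of the sort exists in the tree today:

#ifdef CONFIG_KVM_ARM_DEBUG_TLBI
/*
 * Hypothetical: pgt->pending_range_tlbi would be set by the unmap
 * walker while a deferred range invalidation is outstanding, and
 * cleared once __kvm_tlb_flush_vmid_range has been issued.
 */
static void stage2_assert_tlbs_synced(struct kvm_pgtable *pgt)
{
	WARN_ON_ONCE(pgt->pending_range_tlbi);
}
#endif

The caller would run it just before dropping the MMU write lock.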
>
> > > }
> > >
> > > mm_ops->put_page(ctx->ptep);
> > > @@ -1015,7 +1033,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > * block entry and rely on the remaining portions being faulted
> > > * back lazily.
> > > */
> > > - stage2_put_pte(ctx, mmu, mm_ops);
> > > + stage2_unmap_put_pte(ctx, mmu, mm_ops);
> > >
> > > if (need_flush && mm_ops->dcache_clean_inval_poc)
> > > mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
> > > @@ -1029,13 +1047,20 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > >
> > > int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
> > > {
> > > + int ret;
> > > struct kvm_pgtable_walker walker = {
> > > .cb = stage2_unmap_walker,
> > > .arg = pgt,
> > > .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
> > > };
> > >
> > > - return kvm_pgtable_walk(pgt, addr, size, &walker);
> > > + ret = kvm_pgtable_walk(pgt, addr, size, &walker);
> > > + if (stage2_unmap_defer_tlb_flush(pgt))
> > > + /* Perform the deferred TLB invalidations */
> > > + kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu,
> > > + addr, addr + size);
> >
> > This "kvm_call_hyp(__kvm_tlb_flush_vmid_range,...)" could do with a
> > wrapper from the point where you introduce it.
> >
> Sorry, I didn't get this comment. Do you mind elaborating on it?
All I'm saying is that you should have a wrapper like:
void kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
			      phys_addr_t base, size_t size)
{
	kvm_call_hyp(__kvm_tlb_flush_vmid_range,
		     mmu, base, base + size);
}
and use it throughout the code.
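For instance (a sketch only, reusing the hunk from this patch), the
deferred invalidation in kvm_pgtable_stage2_unmap() would then read:

	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
	if (stage2_unmap_defer_tlb_flush(pgt))
		/* Perform the deferred TLB invalidations */
		kvm_tlb_flush_vmid_range(pgt->mmu, addr, size);

	return ret;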
Thanks,
M.
--
Without deviation from the norm, progress is not possible.