2023-06-20 12:11:52

by Alistair Popple

Subject: [RFC PATCH 0/2] Invalidate secondary IOMMU TLB on permission upgrade

==========
Background
==========

The arm64 architecture specifies that permission bits may be cached in
the TLB, and therefore the TLB must be invalidated when permissions
are upgraded. For the CPU this currently occurs in the
architecture-specific ptep_set_access_flags() routine.

Secondary TLBs, such as those implemented by the SMMU IOMMU, match the
CPU architecture specification and may also cache permission bits, so
they require the same TLB invalidations. This may be achieved in one
of two ways.

Some SMMU implementations support broadcast TLB maintenance
(BTM). This snoops CPU TLB invalidates and invalidates any secondary
TLB at the same time as the CPU. However, implementations are not
required to support BTM.

Implementations without BTM rely on mmu notifier callbacks to send
explicit TLB invalidation commands to the SMMU TLB. Therefore either
generic kernel code or architecture-specific code needs to call the
mmu notifier on permission upgrade.

Currently that doesn't happen, so devices will fault indefinitely when
writing to a PTE that was previously read-only, because nothing
invalidates the SMMU TLB.

To fix that, this series first renames the .invalidate_range()
callback to .invalidate_secondary_tlbs(), as suggested by Jason and
Sean, to make it clear this callback is only used for secondary
TLBs. That was made possible thanks to Sean's series [1] removing
KVM's incorrect usage. That series is currently in linux-next.

From here there are several possible solutions, and I would like
feedback on the preferred approach.

=========
Solutions
=========

1. Add a call to mmu_notifier_invalidate_secondary_tlbs() to the arm64
version of ptep_set_access_flags().

This is what this RFC series does, as it is the simplest
solution. Arguably, though, this call should be made by generic kernel
code to catch other platforms that need it.

However, only arm64, IA64 and SPARC flush the TLB in
ptep_set_access_flags(), and AFAIK only arm64 has an IOMMU that uses
shared page tables, so it is the only platform affected by this.

2. Add a call to mmu_notifier_invalidate_secondary_tlbs() to generic
kernel code.

The problem with this approach is that generic kernel code has no way
of knowing whether the invalidation can be skipped for a given
IOMMU. That leads to over-invalidation, and a consequent performance
loss, on the majority of platforms that don't need it.

3. Implement a new set of notifier operations (e.g. tlb_notifier_ops)
specifically for secondary TLBs, with a range of operations that can
be called by generic kernel code for every PTE modification.

See [2] for a prototype implementation of this idea.

This solves the problems of (1) and (2) because an IOMMU would only
implement the operations it needs. It also keeps the layering clean:
theoretically there is no reason a secondary TLB has to follow the
main CPU architecture specification, so it is free to implement its
own operations (although I know of no hardware that does this).

However, it adds complexity to deal with a problem that only exists in
some implementations of a particular feature on one architecture. For
that reason I think (1) is the best path forward due to its
simplicity, but I would appreciate any feedback here.

============
Other Issues
============

It is unclear whether mmu_notifier_invalidate_secondary_tlbs() should
be called from mmu_notifier_range_end(). Currently it is, as an
analysis of existing code shows most code doesn't explicitly
invalidate secondary TLBs and instead relies on the invalidation
happening as part of the end() call.

The disadvantage of changing code to explicitly invalidate secondary
TLBs is that it generally can't take advantage of IOMMU-specific
range-based TLB invalidation commands, because explicit invalidations
happen one page at a time under the PTL.

To solve that we could add secondary TLB invalidation calls to the TLB
batching code, but that adds complexity, so I'm not sure it's worth
it. Again, I would appreciate feedback.

[1] - https://lore.kernel.org/all/[email protected]/
[2] - https://lore.kernel.org/all/[email protected]/

Alistair Popple (2):
mm_notifiers: Rename invalidate_range notifier
arm64: Notify on pte permission upgrades

arch/arm64/mm/fault.c | 7 +-
arch/arm64/mm/hugetlbpage.c | 9 +++-
drivers/iommu/amd/iommu_v2.c | 10 +--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 13 ++--
drivers/iommu/intel/svm.c | 8 +--
drivers/misc/ocxl/link.c | 8 +--
include/asm-generic/tlb.h | 2 +-
include/linux/mmu_notifier.h | 55 +++++++++---------
mm/huge_memory.c | 4 +-
mm/hugetlb.c | 10 +--
mm/mmu_notifier.c | 52 ++++++++++-------
mm/rmap.c | 42 +++++++-------
12 files changed, 125 insertions(+), 95 deletions(-)

base-commit: b16049b21162bb649cdd8519642a35972b7910fe
--
git-series 0.9.1


2023-06-21 15:45:51

by Jason Gunthorpe

Subject: Re: [RFC PATCH 0/2] Invalidate secondary IOMMU TLB on permission upgrade

On Tue, Jun 20, 2023 at 09:18:24PM +1000, Alistair Popple wrote:

> 1. Add a call to mmu_notifier_invalidate_secondary_tlbs() to the arm64
> version of ptep_set_access_flags().

I prefer we modify the whole thing to only call
mmu_notifier_arch_invalidate_secondary_tlbs() (note the "arch" in the
name) directly beside the arch's existing TLB invalidation, and we
only need to define this for the x86 and ARM arches that have drivers
using a secondary TLB - i.e. it is very much not a generic,
general-purpose mmu notifier and has no meaning outside arch code.

> This is what this RFC series does as it is the simplest
> solution. Arguably this call should be made by generic kernel code
> though to catch other platforms that need it.

But that is the whole point: the generic kernel cannot and does not
know the rules for TLB invalidation that the platform model requires.

> It is unclear if mmu_notifier_invalidate_secondary_tlbs() should be
> called from mmu_notifier_range_end(). Currently it is, as an analysis
> of existing code shows most code doesn't explicitly invalidate
> secondary TLBs and relies on it being called as part of the end()
> call.

If you do the above we don't need to answer this question. Calling it
unconditionally at the arch's TLBI points is always correct.

> To solve that we could add secondary TLB invalidation calls to the TLB
> batching code, but that adds complexity so I'm not sure it's worth it
> but would appreciate feedback.

It sounds like the right direction to me. Batching is going to be
important; we don't need two different versions of the batching
logic. We already call the notifier from the batch infrastructure
anyhow.

This still fits with the above, as batching goes down to the arch's
flush_tlb_range()/etc., which can carry the batch to the notifier. We
can decide whether to leave the notifier call in the core code and let
the arch elide it, or push it to the arch so the call is more
consistently placed.

E.g. we have arch_tlbbatch_flush() on x86 too, which looks kind of
suspicious; it is already really funky to figure out where the
notifier is actually called from. Calling from arch code, closer to
the actual TLB flush, would be a lot easier to reason about.

Jason