The main change is to move secondary TLB invalidation mmu notifier
callbacks into the architecture specific TLB flushing functions. This
makes secondary TLB invalidation mostly match CPU invalidation while
still allowing efficient range based invalidations based on the
existing TLB batching code.
==========
Background
==========
The arm64 architecture specifies TLB permission bits may be cached and
therefore the TLB must be invalidated during permission upgrades. For
the CPU this currently occurs in the architecture specific
ptep_set_access_flags() routine.
Secondary TLBs such as implemented by the SMMU IOMMU match the CPU
architecture specification and may also cache permission bits and
require the same TLB invalidations. This may be achieved in one of two
ways.
Some SMMU implementations implement broadcast TLB maintenance
(BTM). This snoops CPU TLB invalidates and will invalidate any
secondary TLB at the same time as the CPU. However implementations are
not required to implement BTM.
Implementations without BTM rely on mmu notifier callbacks to send
explicit TLB invalidation commands to invalidate SMMU TLB. Therefore
either generic kernel code or architecture specific code needs to call
the mmu notifier on permission upgrade.
Currently that doesn't happen so devices will fault indefinitely when
writing to a PTE that was previously read-only as nothing invalidates
the SMMU TLB.
========
Solution
========
To fix this the series first renames the .invalidate_range() callback
to .arch_invalidate_secondary_tlbs() as suggested by Jason and Sean to
make it clear this callback is only used for secondary TLBs. That was
made possible thanks to Sean's series [1] to remove KVM's incorrect
usage.
Based on feedback from Jason [2] the proposed solution to the bug is
to move the calls to mmu_notifier_arch_invalidate_secondary_tlbs()
closer to the architecture specific TLB invalidation code. This
ensures the secondary TLB won't miss invalidations, including the
existing invalidation in the ARM64 code to deal with permission
upgrade.
Currently only ARM64, PowerPC and x86 have IOMMU with secondary TLBs
requiring SW invalidation so the notifier is only called for those
architectures. It is also not called for invalidation of kernel
mappings as no secondary IOMMU implementations can access those and
hence it is not required.
[1] - https://lore.kernel.org/all/[email protected]/
[2] - https://lore.kernel.org/linux-mm/[email protected]/
Alistair Popple (4):
mm_notifiers: Rename invalidate_range notifier
arm64/smmu: Use TLBI ASID when invalidating entire range
mmu_notifiers: Call arch_invalidate_secondary_tlbs() when invalidating TLBs
mmu_notifiers: Don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
arch/arm64/include/asm/tlbflush.h | 5 +-
arch/powerpc/include/asm/book3s/64/tlbflush.h | 1 +-
arch/powerpc/mm/book3s64/radix_hugetlbpage.c | 1 +-
arch/powerpc/mm/book3s64/radix_tlb.c | 6 +-
arch/x86/mm/tlb.c | 3 +-
drivers/iommu/amd/iommu_v2.c | 10 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 29 +++--
drivers/iommu/intel/svm.c | 8 +-
drivers/misc/ocxl/link.c | 8 +-
include/asm-generic/tlb.h | 1 +-
include/linux/mmu_notifier.h | 104 ++++-------------
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 29 +----
mm/hugetlb.c | 8 +-
mm/memory.c | 8 +-
mm/migrate_device.c | 9 +-
mm/mmu_notifier.c | 47 +++-----
mm/rmap.c | 40 +-------
18 files changed, 109 insertions(+), 210 deletions(-)
base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
--
git-series 0.9.1
> From: Anshuman Khandual <[email protected]>
> Sent: Wednesday, July 19, 2023 11:04 AM
>
> On 7/18/23 13:26, Alistair Popple wrote:
> > The main change is to move secondary TLB invalidation mmu notifier
> > callbacks into the architecture specific TLB flushing functions. This
> > makes secondary TLB invalidation mostly match CPU invalidation while
> > still allowing efficient range based invalidations based on the
> > existing TLB batching code.
> >
> > ==========
> > Background
> > ==========
> >
> > The arm64 architecture specifies TLB permission bits may be cached and
> > therefore the TLB must be invalidated during permission upgrades. For
> > the CPU this currently occurs in the architecture specific
> > ptep_set_access_flags() routine.
> >
> > Secondary TLBs such as implemented by the SMMU IOMMU match the CPU
> > architecture specification and may also cache permission bits and
> > require the same TLB invalidations. This may be achieved in one of two
> > ways.
> >
> > Some SMMU implementations implement broadcast TLB maintenance
> > (BTM). This snoops CPU TLB invalidates and will invalidate any
> > secondary TLB at the same time as the CPU. However implementations are
> > not required to implement BTM.
>
> So, the implementations with BTM do not even need a MMU notifier callback
> for secondary TLB invalidation purpose ? Perhaps mmu_notifier_register()
> could also be skipped for such cases i.e with ARM_SMMU_FEAT_BTM
> enabled ?
>
Out of curiosity. How does BTM work with device tlb? Can SMMU translate
a TLB broadcast request (based on ASID) into a set of PCI ATS invalidation
requests (based on PCI requestor ID and PASID) in hardware?
If software intervention is required then it might be the reason why mmu
notifier cannot be skipped. With BTM enabled it just means the notifier
callback can skip iotlb invalidation...
On 7/18/23 13:26, Alistair Popple wrote:
> The main change is to move secondary TLB invalidation mmu notifier
> callbacks into the architecture specific TLB flushing functions. This
> makes secondary TLB invalidation mostly match CPU invalidation while
> still allowing efficient range based invalidations based on the
> existing TLB batching code.
>
> ==========
> Background
> ==========
>
> The arm64 architecture specifies TLB permission bits may be cached and
> therefore the TLB must be invalidated during permission upgrades. For
> the CPU this currently occurs in the architecture specific
> ptep_set_access_flags() routine.
>
> Secondary TLBs such as implemented by the SMMU IOMMU match the CPU
> architecture specification and may also cache permission bits and
> require the same TLB invalidations. This may be achieved in one of two
> ways.
>
> Some SMMU implementations implement broadcast TLB maintenance
> (BTM). This snoops CPU TLB invalidates and will invalidate any
> secondary TLB at the same time as the CPU. However implementations are
> not required to implement BTM.
So, the implementations with BTM do not even need a MMU notifier callback
for secondary TLB invalidation purpose ? Perhaps mmu_notifier_register()
could also be skipped for such cases i.e with ARM_SMMU_FEAT_BTM enabled ?
BTW, dont see ARM_SMMU_FEAT_BTM being added as a feature any where during
the probe i.e arm_smmu_device_hw_probe().
>
> Implementations without BTM rely on mmu notifier callbacks to send
> explicit TLB invalidation commands to invalidate SMMU TLB. Therefore
> either generic kernel code or architecture specific code needs to call
> the mmu notifier on permission upgrade.
>
> Currently that doesn't happen so devices will fault indefinitely when
> writing to a PTE that was previously read-only as nothing invalidates
> the SMMU TLB.
Why does not the current SMMU MMU notifier intercept all invalidation from
generic MM code and do the required secondary TLB invalidation ? Is there
a timing issue involved here ? Secondary TLB invalidation does happen but
after the damage has been done ? Could you please point us to a real world
bug report taking such indefinite faults as mentioned above ?
>
> ========
> Solution
> ========
>
> To fix this the series first renames the .invalidate_range() callback
> to .arch_invalidate_secondary_tlbs() as suggested by Jason and Sean to
> make it clear this callback is only used for secondary TLBs. That was
> made possible thanks to Sean's series [1] to remove KVM's incorrect
> usage.
>
> Based on feedback from Jason [2] the proposed solution to the bug is
> to move the calls to mmu_notifier_arch_invalidate_secondary_tlbs()
> closer to the architecture specific TLB invalidation code. This
> ensures the secondary TLB won't miss invalidations, including the
> existing invalidation in the ARM64 code to deal with permission
> upgrade.
ptep_set_access_flags() is the only problematic place where this issue
is being reported ? If yes, why dont fix that instead of moving these
into platform specific callbacks ? OR there are other problematic areas
I might be missing.
>
> Currently only ARM64, PowerPC and x86 have IOMMU with secondary TLBs
> requiring SW invalidation so the notifier is only called for those
> architectures. It is also not called for invalidation of kernel
> mappings as no secondary IOMMU implementations can access those and
> hence it is not required.
>
> [1] - https://lore.kernel.org/all/[email protected]/
> [2] - https://lore.kernel.org/linux-mm/[email protected]/
>
> Alistair Popple (4):
> mm_notifiers: Rename invalidate_range notifier
> arm64/smmu: Use TLBI ASID when invalidating entire range
> mmu_notifiers: Call arch_invalidate_secondary_tlbs() when invalidating TLBs
> mmu_notifiers: Don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
>
> arch/arm64/include/asm/tlbflush.h | 5 +-
> arch/powerpc/include/asm/book3s/64/tlbflush.h | 1 +-
> arch/powerpc/mm/book3s64/radix_hugetlbpage.c | 1 +-
> arch/powerpc/mm/book3s64/radix_tlb.c | 6 +-
> arch/x86/mm/tlb.c | 3 +-
> drivers/iommu/amd/iommu_v2.c | 10 +-
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 29 +++--
> drivers/iommu/intel/svm.c | 8 +-
> drivers/misc/ocxl/link.c | 8 +-
> include/asm-generic/tlb.h | 1 +-
> include/linux/mmu_notifier.h | 104 ++++-------------
> kernel/events/uprobes.c | 2 +-
> mm/huge_memory.c | 29 +----
> mm/hugetlb.c | 8 +-
> mm/memory.c | 8 +-
> mm/migrate_device.c | 9 +-
> mm/mmu_notifier.c | 47 +++-----
> mm/rmap.c | 40 +-------
> 18 files changed, 109 insertions(+), 210 deletions(-)
>
> base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
"Tian, Kevin" <[email protected]> writes:
>> From: Anshuman Khandual <[email protected]>
>> Sent: Wednesday, July 19, 2023 11:04 AM
>>
>> On 7/18/23 13:26, Alistair Popple wrote:
>> > The main change is to move secondary TLB invalidation mmu notifier
>> > callbacks into the architecture specific TLB flushing functions. This
>> > makes secondary TLB invalidation mostly match CPU invalidation while
>> > still allowing efficient range based invalidations based on the
>> > existing TLB batching code.
>> >
>> > ==========
>> > Background
>> > ==========
>> >
>> > The arm64 architecture specifies TLB permission bits may be cached and
>> > therefore the TLB must be invalidated during permission upgrades. For
>> > the CPU this currently occurs in the architecture specific
>> > ptep_set_access_flags() routine.
>> >
>> > Secondary TLBs such as implemented by the SMMU IOMMU match the CPU
>> > architecture specification and may also cache permission bits and
>> > require the same TLB invalidations. This may be achieved in one of two
>> > ways.
>> >
>> > Some SMMU implementations implement broadcast TLB maintenance
>> > (BTM). This snoops CPU TLB invalidates and will invalidate any
>> > secondary TLB at the same time as the CPU. However implementations are
>> > not required to implement BTM.
>>
>> So, the implementations with BTM do not even need a MMU notifier callback
>> for secondary TLB invalidation purpose ? Perhaps mmu_notifier_register()
>> could also be skipped for such cases i.e with ARM_SMMU_FEAT_BTM
>> enabled ?
>>
A notifier callback is still required to send the PCIe ATC request to
devices. As I understand it BTM means just that SMMU TLB maintenance
isn't required. In other words SMMU with BTM will snoop CPU TLB
invalidates to maintain the SMMU TLB but still won't generate ATC
requests based on snooping.
> Out of curiosity. How does BTM work with device tlb? Can SMMU translate
> a TLB broadcast request (based on ASID) into a set of PCI ATS invalidation
> requests (based on PCI requestor ID and PASID) in hardware?
See above but I don't think so.
> If software intervention is required then it might be the reason why mmu
> notifier cannot be skipped. With BTM enabled it just means the notifier
> callback can skip iotlb invalidation...
Right. If you look at the implementation for
arm_smmu_mm_arch_invalidate_secondary_tlbs() you can see
arm_smmu_tlb_inv_range_asid() is only called if BTM is not supported to
invalidate SMMU TLB vs. arm_smmu_atc_inv_domain() which is always called
to send the invalidations down to the devices.
>> Based on feedback from Jason [2] the proposed solution to the bug is
>> to move the calls to mmu_notifier_arch_invalidate_secondary_tlbs()
>> closer to the architecture specific TLB invalidation code. This
>> ensures the secondary TLB won't miss invalidations, including the
>> existing invalidation in the ARM64 code to deal with permission
>> upgrade.
>
> ptep_set_access_flags() is the only problematic place where this issue
> is being reported ? If yes, why dont fix that instead of moving these
> into platform specific callbacks ? OR there are other problematic areas
> I might be missing.
See the previous feedback, and in particular this thread -
https://lore.kernel.org/all/5d8e1f752051173d2d1b5c3e14b54eb3506ed3ef.1684892404.git-series.apopple@nvidia.com/.
TLDR - I don't think there are any other problematic areas, but it's
hard to reason about when TLB notifiers should be called when it all
happens out of band and it's easy to miss. For example this bug would
not have been possible had they been called from the TLB flushing code.
Ideally I think most kernel code should call some generic TLB flushing
function that could call this. However at the moment no intermediate
functions exist - kernel calls the architecture specific implementations
directly. Adding a layer of indirection seems like it would be a lot of
churn with possible performance implications as well.