2024-04-02 21:22:58

by Andrew Halaney

[permalink] [raw]
Subject: sa8775p-ride: What's a normal SMMU TLB sync time?

Hey,

Sorry for the wide email, but I figured someone recently contributing
to / maintaining the Qualcomm SMMU driver may have some proper insights
into this.

Recently I remembered that performance on some Qualcomm platforms
takes a major hit when you use iommu.strict=1/CONFIG_IOMMU_DEFAULT_DMA_STRICT.

On the sa8775p-ride, I see most TLB sync calls to be about 150 us long,
with some spiking to 500 us, etc:

[root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd start -p function_graph -g qcom_smmu_tlb_sync --max-graph-depth 1
plugin 'function_graph'
[root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd show
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) ! 144.062 us | qcom_smmu_tlb_sync();

On my sc8280xp-lenovo-thinkpad-x13s (only other Qualcomm platform I can compare
with) I see around 2-15 us with spikes up to 20-30 us. That's thanks to this
patch[0], which I guess improved the platform from 1-2 ms to the ~10 us number.

It's not entirely clear to me how a DPU specific programming affects system
wide SMMU performance, but I'm curious if this is the only way to achieve this?
sa8775p doesn't have the DPU described even right now, so that's a bummer
as there's no way to make a similar immediate optimization, but I'm still struggling
to understand what that patch really did to improve things so maybe I'm missing
something.

I'm honestly not even sure what a "typical" range for TLB sync time would be,
but on sa8775p-ride its bad enough that some IRQs like UFS can cause RCU stalls
(pretty easy to reproduce with fio basic-verify.fio for example on the platform).
It also makes running with iommu.strict=1 impractical as performance for UFS,
ethernet, etc drops 75-80%.

Does anyone have any bright ideas on how to improve this, or if I'm even in
the right for assuming that time is suspiciously long?

Thanks,
Andrew

[0] https://lore.kernel.org/linux-arm-msm/CAF6AEGs9PLiCZdJ-g42-bE6f9yMR6cMyKRdWOY5m799vF9o4SQ@mail.gmail.com/



2024-04-05 03:26:41

by Bjorn Andersson

[permalink] [raw]
Subject: Re: sa8775p-ride: What's a normal SMMU TLB sync time?

On Tue, Apr 02, 2024 at 04:22:31PM -0500, Andrew Halaney wrote:
> Hey,
>
> Sorry for the wide email, but I figured someone recently contributing
> to / maintaining the Qualcomm SMMU driver may have some proper insights
> into this.
>
> Recently I remembered that performance on some Qualcomm platforms
> takes a major hit when you use iommu.strict=1/CONFIG_IOMMU_DEFAULT_DMA_STRICT.
>
> On the sa8775p-ride, I see most TLB sync calls to be about 150 us long,
> with some spiking to 500 us, etc:
>
> [root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd start -p function_graph -g qcom_smmu_tlb_sync --max-graph-depth 1
> plugin 'function_graph'
> [root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd show
> # tracer: function_graph
> #
> # CPU DURATION FUNCTION CALLS
> # | | | | | | |
> 0) ! 144.062 us | qcom_smmu_tlb_sync();
>
> On my sc8280xp-lenovo-thinkpad-x13s (only other Qualcomm platform I can compare
> with) I see around 2-15 us with spikes up to 20-30 us. That's thanks to this
> patch[0], which I guess improved the platform from 1-2 ms to the ~10 us number.
>
> It's not entirely clear to me how a DPU specific programming affects system
> wide SMMU performance, but I'm curious if this is the only way to achieve this?
> sa8775p doesn't have the DPU described even right now, so that's a bummer
> as there's no way to make a similar immediate optimization, but I'm still struggling
> to understand what that patch really did to improve things so maybe I'm missing
> something.
>

The cause was that the TLB sync is synchronized with the display updates,
but without appropriate safe_lut_tlb values the display side wouldn't
play nice.

Regards,
Bjorn

> I'm honestly not even sure what a "typical" range for TLB sync time would be,
> but on sa8775p-ride its bad enough that some IRQs like UFS can cause RCU stalls
> (pretty easy to reproduce with fio basic-verify.fio for example on the platform).
> It also makes running with iommu.strict=1 impractical as performance for UFS,
> ethernet, etc drops 75-80%.
>
> Does anyone have any bright ideas on how to improve this, or if I'm even in
> the right for assuming that time is suspiciously long?
>
> Thanks,
> Andrew
>
> [0] https://lore.kernel.org/linux-arm-msm/CAF6AEGs9PLiCZdJ-g42-bE6f9yMR6cMyKRdWOY5m799vF9o4SQ@mail.gmail.com/
>

2024-04-05 14:04:57

by Andrew Halaney

[permalink] [raw]
Subject: Re: sa8775p-ride: What's a normal SMMU TLB sync time?

On Thu, Apr 04, 2024 at 08:25:04PM -0700, Bjorn Andersson wrote:
> On Tue, Apr 02, 2024 at 04:22:31PM -0500, Andrew Halaney wrote:
> > Hey,
> >
> > Sorry for the wide email, but I figured someone recently contributing
> > to / maintaining the Qualcomm SMMU driver may have some proper insights
> > into this.
> >
> > Recently I remembered that performance on some Qualcomm platforms
> > takes a major hit when you use iommu.strict=1/CONFIG_IOMMU_DEFAULT_DMA_STRICT.
> >
> > On the sa8775p-ride, I see most TLB sync calls to be about 150 us long,
> > with some spiking to 500 us, etc:
> >
> > [root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd start -p function_graph -g qcom_smmu_tlb_sync --max-graph-depth 1
> > plugin 'function_graph'
> > [root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd show
> > # tracer: function_graph
> > #
> > # CPU DURATION FUNCTION CALLS
> > # | | | | | | |
> > 0) ! 144.062 us | qcom_smmu_tlb_sync();
> >
> > On my sc8280xp-lenovo-thinkpad-x13s (only other Qualcomm platform I can compare
> > with) I see around 2-15 us with spikes up to 20-30 us. That's thanks to this
> > patch[0], which I guess improved the platform from 1-2 ms to the ~10 us number.
> >
> > It's not entirely clear to me how a DPU specific programming affects system
> > wide SMMU performance, but I'm curious if this is the only way to achieve this?
> > sa8775p doesn't have the DPU described even right now, so that's a bummer
> > as there's no way to make a similar immediate optimization, but I'm still struggling
> > to understand what that patch really did to improve things so maybe I'm missing
> > something.
> >
>
> The cause was that the TLB sync is synchronized with the display updates,
> but without appropriate safe_lut_tlb values the display side wouldn't
> play nice.

In my case we don't have display being driven at all. I'm not sure if
that changes the situation, or just complicates it. i.e. I'm unsure if
that means we're not hitting the display situation at all but something
else entirely (assuming this time is longer than ideal), or if the
safe_lut_tlb values still effect things despite Linux knowing nothing
about the display, which as far as I know is not configured by anyone
at the moment.

Any thoughts on that?

>
> Regards,
> Bjorn
>
> > I'm honestly not even sure what a "typical" range for TLB sync time would be,
> > but on sa8775p-ride its bad enough that some IRQs like UFS can cause RCU stalls
> > (pretty easy to reproduce with fio basic-verify.fio for example on the platform).
> > It also makes running with iommu.strict=1 impractical as performance for UFS,
> > ethernet, etc drops 75-80%.
> >
> > Does anyone have any bright ideas on how to improve this, or if I'm even in
> > the right for assuming that time is suspiciously long?
> >
> > Thanks,
> > Andrew
> >
> > [0] https://lore.kernel.org/linux-arm-msm/CAF6AEGs9PLiCZdJ-g42-bE6f9yMR6cMyKRdWOY5m799vF9o4SQ@mail.gmail.com/
> >
>