2024-04-25 15:12:17

by Tanmay Jagdale

[permalink] [raw]
Subject: [PATCH V3 0/2] iommu/arm-smmu-v3: Add support for ECMDQ register mode

Resending the patches by Zhen Lei <[email protected]> that add
support for SMMU ECMDQ feature.

Tested this feature on a Marvell SoC by implementing a smmu-test driver.
This test driver spawns a thread per CPU and each thread keeps sending
map, table-walk and unmap requests for a fixed duration.

Using this test driver, we compared ECMDQ vs SMMU with software batching
support and saw ~5% improvement with ECMDQ. Performance numbers are
mentioned below:

Total Requests Average Requests Difference
Per CPU wrt ECMDQ
-----------------------------------------------------------------
ECMDQ 239286381 2991079
CMDQ Batch Size 1 228232187 2852902 -4.62%
CMDQ Batch Size 32 233465784 2918322 -2.43%
CMDQ Batch Size 64 231679588 2895994 -3.18%
CMDQ Batch Size 128 233189030 2914862 -2.55%
CMDQ Batch Size 256 230965773 2887072 -3.48%


v2 --> v3:
1. Rebased on linux 6.9-rc5

v1 --> v2:
1. Drop patch "iommu/arm-smmu-v3: Add arm_smmu_ecmdq_issue_cmdlist() for
non-shared ECMDQ" in v1
2. Drop patch "iommu/arm-smmu-v3: Add support for less than one ECMDQ
per core" in v1
3. Replace rwlock with IPI to support lockless protection against the
write operation to bit
'ERRACK' during error handling and the read operation to bit 'ERRACK'
during command insertion.
4. Standardize variable names.
- struct arm_smmu_ecmdq *__percpu *ecmdq;
+ struct arm_smmu_ecmdq *__percpu *ecmdqs;

5. Add member 'iobase' to struct arm_smmu_device to record the start
physical
address of the SMMU, to replace translation operation
(vmalloc_to_pfn(smmu->base) << PAGE_SHIFT)
+ phys_addr_t iobase;
- smmu_dma_base = (vmalloc_to_pfn(smmu->base) << PAGE_SHIFT);

6. Cancel below union. Whether ECMDQ is enabled is determined only based
on 'ecmdq_enabled'.
- union {
- u32 nr_ecmdq;
- u32 ecmdq_enabled;
- };
+ u32 nr_ecmdq;
+ bool ecmdq_enabled;

7. Eliminate some sparse check warnings. For example.
- struct arm_smmu_ecmdq *ecmdq;
+ struct arm_smmu_ecmdq __percpu *ecmdq;


Zhen Lei (2):
iommu/arm-smmu-v3: Add support for ECMDQ register mode
iommu/arm-smmu-v3: Ensure that a set of associated commands are
inserted in the same ECMDQ

drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 261 +++++++++++++++++++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 33 +++
2 files changed, 286 insertions(+), 8 deletions(-)

--
2.34.1



2024-04-28 02:08:54

by Leizhen (ThunderTown)

[permalink] [raw]
Subject: Re: [PATCH V3 0/2] iommu/arm-smmu-v3: Add support for ECMDQ register mode



On 2024/4/25 22:41, Tanmay Jagdale wrote:
> Resending the patches by Zhen Lei <[email protected]> that add
> support for SMMU ECMDQ feature.
>
> Tested this feature on a Marvell SoC by implementing a smmu-test driver.
> This test driver spawns a thread per CPU and each thread keeps sending
> map, table-walk and unmap requests for a fixed duration.
>
> Using this test driver, we compared ECMDQ vs SMMU with software batching
> support and saw ~5% improvement with ECMDQ. Performance numbers are

This data is similar to the performance simulated by our EMU.

> mentioned below:
>
> Total Requests Average Requests Difference
> Per CPU wrt ECMDQ
> -----------------------------------------------------------------
> ECMDQ 239286381 2991079
> CMDQ Batch Size 1 228232187 2852902 -4.62%
> CMDQ Batch Size 32 233465784 2918322 -2.43%
> CMDQ Batch Size 64 231679588 2895994 -3.18%
> CMDQ Batch Size 128 233189030 2914862 -2.55%
> CMDQ Batch Size 256 230965773 2887072 -3.48%
>
>
> v2 --> v3:
> 1. Rebased on linux 6.9-rc5
>
> v1 --> v2:

Thanks.

> 1. Drop patch "iommu/arm-smmu-v3: Add arm_smmu_ecmdq_issue_cmdlist() for
> non-shared ECMDQ" in v1
> 2. Drop patch "iommu/arm-smmu-v3: Add support for less than one ECMDQ
> per core" in v1
> 3. Replace rwlock with IPI to support lockless protection against the
> write operation to bit
> 'ERRACK' during error handling and the read operation to bit 'ERRACK'
> during command insertion.
> 4. Standardize variable names.
> - struct arm_smmu_ecmdq *__percpu *ecmdq;
> + struct arm_smmu_ecmdq *__percpu *ecmdqs;
>
> 5. Add member 'iobase' to struct arm_smmu_device to record the start
> physical
> address of the SMMU, to replace translation operation
> (vmalloc_to_pfn(smmu->base) << PAGE_SHIFT)
> + phys_addr_t iobase;
> - smmu_dma_base = (vmalloc_to_pfn(smmu->base) << PAGE_SHIFT);
>
> 6. Cancel below union. Whether ECMDQ is enabled is determined only based
> on 'ecmdq_enabled'.
> - union {
> - u32 nr_ecmdq;
> - u32 ecmdq_enabled;
> - };
> + u32 nr_ecmdq;
> + bool ecmdq_enabled;
>
> 7. Eliminate some sparse check warnings. For example.
> - struct arm_smmu_ecmdq *ecmdq;
> + struct arm_smmu_ecmdq __percpu *ecmdq;
>
>
> Zhen Lei (2):
> iommu/arm-smmu-v3: Add support for ECMDQ register mode
> iommu/arm-smmu-v3: Ensure that a set of associated commands are
> inserted in the same ECMDQ
>
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 261 +++++++++++++++++++-
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 33 +++
> 2 files changed, 286 insertions(+), 8 deletions(-)
>

--
Regards,
Zhen Lei


2024-04-30 15:12:20

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH V3 0/2] iommu/arm-smmu-v3: Add support for ECMDQ register mode

On Thu, Apr 25, 2024 at 07:41:50AM -0700, Tanmay Jagdale wrote:
> Resending the patches by Zhen Lei <[email protected]> that add
> support for SMMU ECMDQ feature.
>
> Tested this feature on a Marvell SoC by implementing a smmu-test driver.
> This test driver spawns a thread per CPU and each thread keeps sending
> map, table-walk and unmap requests for a fixed duration.
>
> Using this test driver, we compared ECMDQ vs SMMU with software batching
> support and saw ~5% improvement with ECMDQ. Performance numbers are
> mentioned below:
>
> Total Requests Average Requests Difference
> Per CPU wrt ECMDQ
> -----------------------------------------------------------------
> ECMDQ 239286381 2991079
> CMDQ Batch Size 1 228232187 2852902 -4.62%
> CMDQ Batch Size 32 233465784 2918322 -2.43%
> CMDQ Batch Size 64 231679588 2895994 -3.18%
> CMDQ Batch Size 128 233189030 2914862 -2.55%
> CMDQ Batch Size 256 230965773 2887072 -3.48%

These are pretty small improvements in a targetted micro-benchmark. Do
you have any real-world numbers showing that this is worthwhile? For
example, running something like netperf.

Will

2024-05-02 16:27:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 0/2] iommu/arm-smmu-v3: Add support for ECMDQ register mode

On Thu, Apr 25, 2024 at 07:41:50AM -0700, Tanmay Jagdale wrote:
> Resending the patches by Zhen Lei <[email protected]> that add
> support for SMMU ECMDQ feature.
>
> Tested this feature on a Marvell SoC by implementing a smmu-test driver.
> This test driver spawns a thread per CPU and each thread keeps sending
> map, table-walk and unmap requests for a fixed duration.

So this is not just measuring invalidation performance but basically
the DMA API performance to do map/unmap operations? What is "batch
size" ?

Does this HW support the range invalidation? How many invalidation
commands does earch test cycle generate?

> Total Requests Average Requests Difference
> Per CPU wrt ECMDQ
> -----------------------------------------------------------------
> ECMDQ 239286381 2991079
> CMDQ Batch Size 1 228232187 2852902 -4.62%
> CMDQ Batch Size 32 233465784 2918322 -2.43%
> CMDQ Batch Size 64 231679588 2895994 -3.18%
> CMDQ Batch Size 128 233189030 2914862 -2.55%
> CMDQ Batch Size 256 230965773 2887072 -3.48%

If this is really 5% for a typical DMA API map/unmap cycle then that
seems interesting to me.

If it is 5% for just the invalidation command then it is harder to
say.

I'd suggest to present your results in terms of latency to do a dma
API map/unmap cycle, and to show how the results scale as you add more
threads. Does even 2 threads start to show a 4-5% gain?

Also, I'm wondering how ATS would interact, I think we have a
head-of-line blocking issue with ATS.. Allowing ATS to progress
concurrently with unrelated parallel invalidations may also be
interesting, esepcially for multi-SVA workloads.

Jason

2024-05-16 14:26:30

by Tanmay Jagdale

[permalink] [raw]
Subject: Re: [PATCH V3 0/2] iommu/arm-smmu-v3: Add support for ECMDQ register mode

Hi Will,

On Tue, Apr 30, 2024 at 04:09:50PM +0100, Will Deacon wrote:
> On Thu, Apr 25, 2024 at 07:41:50AM -0700, Tanmay Jagdale wrote:
> > Resending the patches by Zhen Lei <[email protected]> that add
> > support for SMMU ECMDQ feature.
> >
> > Tested this feature on a Marvell SoC by implementing a smmu-test driver.
> > This test driver spawns a thread per CPU and each thread keeps sending
> > map, table-walk and unmap requests for a fixed duration.
> >
> > Using this test driver, we compared ECMDQ vs SMMU with software batching
> > support and saw ~5% improvement with ECMDQ. Performance numbers are
> > mentioned below:
> >
> > Total Requests Average Requests Difference
> > Per CPU wrt ECMDQ
> > -----------------------------------------------------------------
> > ECMDQ 239286381 2991079
> > CMDQ Batch Size 1 228232187 2852902 -4.62%
> > CMDQ Batch Size 32 233465784 2918322 -2.43%
> > CMDQ Batch Size 64 231679588 2895994 -3.18%
> > CMDQ Batch Size 128 233189030 2914862 -2.55%
> > CMDQ Batch Size 256 230965773 2887072 -3.48%
>
> These are pretty small improvements in a targetted micro-benchmark. Do
> you have any real-world numbers showing that this is worthwhile? For
> example, running something like netperf.
We are running benchmarks on the latest kernel with and without ECMDQ.
We will share the performance numbers and observations here.

With Regards,
Tanmay
>
> Will
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel