2023-08-29 21:40:46

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

On 2023-08-29 21:15, Nicolin Chen wrote:
> On Tue, Aug 29, 2023 at 04:37:00PM +0100, Robin Murphy wrote:
>
>> On 2023-08-22 17:42, Nicolin Chen wrote:
>>> On Tue, Aug 22, 2023 at 10:19:21AM +0100, Robin Murphy wrote:
>>>
>>>>> out_free_data:
>>>>> @@ -1071,6 +1073,7 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
>>>>> ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
>>>>> if (cfg->coherent_walk)
>>>>> cfg->arm_mali_lpae_cfg.transtab |= ARM_MALI_LPAE_TTBR_SHARE_OUTER;
>>>>> + cfg->nents_per_pgtable = 1 << data->bits_per_level;
>>>>
>>>> The result of this highly complex and expensive calculation is clearly
>>>> redundant with the existing bits_per_level field, so why do we need to
>>>> waste space storing when the driver could simply use bits_per_level?
>>>
>>> bits_per_level is in the private struct arm_lpae_io_pgtable, while
>>> drivers can only access struct io_pgtable_cfg. Are you suggesting
>>> to move bits_per_level out of the private struct arm_lpae_io_pgtable
>>> to the public struct io_pgtable_cfg?
>>>
>>> Or am I missing another bits_per_level?
>>
>> Bleh, apologies, I always confuse myself trying to remember the fiddly
>> design of io-pgtable data. However, I think this then ends up proving
>> the opposite point - the number of pages per table only happens to be a
>> fixed constant for certain formats like LPAE, but does not necessarily
>> generalise. For instance for a single v7s config it would be 1024 or 256
>> or 16 depending on what has actually been unmapped.
>>
>> The mechanism as proposed implicitly assumes LPAE format, so I still
>> think we're better off making that assumption explicit. And at that
>> point arm-smmu-v3 can then freely admit it already knows the number is
>> simply 1/8th of the domain page size.
>
> Hmm, I am not getting that "1/8th" part, would you mind elaborating?

If we know the format is LPAE, then we already know that nearly all
pagetable levels are one full page, and the PTEs are 64 bits long. No
magic data conduit required.

> Also, what we need is actually an arbitrary number for max_tlbi_ops.
> And I think it could be irrelevant to the page size, i.e. either a
> 4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
> because what eventually impacts the latency is the number of loops
> of building/issuing commands.

Although there is perhaps a counter-argument for selective invalidation,
that if you're using 64K pages to improve TLB hit rates vs. 4K, then you
have more to lose from overinvalidation, so maybe a single threshold
isn't so appropriate for both.

Yes, ultimately it all comes down to picking an arbitrary number, but
the challenge is that we want to pick a *good* one, and ideally have
some reasoning behind it. As Will suggested, copying what the mm layer
does gives us an easy line of reasoning, even if it's just in the form
of passing the buck. And that's actually quite attractive, since
otherwise we'd then have to get into the question of what really is the
latency of building and issuing commands, since that clearly depends on
how fast the CPU is, and how fast the SMMU is, and how busy the SMMU is,
and how large the command queue is, and how many other CPUs are also
contending for the command queue... and very quickly it becomes hard to
believe that any simple constant can be good for all possible systems.

> So, combining your narrative above that nents_per_pgtable isn't so
> general as we have in the tlbflush for MMU,

FWIW I meant it doesn't generalise well enough to be a common io-pgtable
interface; I have no issue with it forming the basis of an
SMMUv3-specific heuristic when it *is* a relevant concept to all the
pagetable formats SMMUv3 can possibly support.

Thanks,
Robin.


2024-01-20 20:00:18

by Nicolin Chen

Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

Hi Robin/Will,

On Tue, Aug 29, 2023 at 02:25:10PM -0700, Robin Murphy wrote:
> > Also, what we need is actually an arbitrary number for max_tlbi_ops.
> > And I think it could be irrelevant to the page size, i.e. either a
> > 4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
> > because what eventually impacts the latency is the number of loops
> > of building/issuing commands.
>
> Although there is perhaps a counter-argument for selective invalidation,
> that if you're using 64K pages to improve TLB hit rates vs. 4K, then you
> have more to lose from overinvalidation, so maybe a single threshold
> isn't so appropriate for both.
>
> Yes, ultimately it all comes down to picking an arbitrary number, but
> the challenge is that we want to pick a *good* one, and ideally have
> some reasoning behind it. As Will suggested, copying what the mm layer
> does gives us an easy line of reasoning, even if it's just in the form
> of passing the buck. And that's actually quite attractive, since
> otherwise we'd then have to get into the question of what really is the
> latency of building and issuing commands, since that clearly depends on
> how fast the CPU is, and how fast the SMMU is, and how busy the SMMU is,
> and how large the command queue is, and how many other CPUs are also
> contending for the command queue... and very quickly it becomes hard to
> believe that any simple constant can be good for all possible systems.

So, here we have another request to optimize this number further.
Though the arbitrary number merged from the MMU side does fix the
soft lockup, the iommu_unmap latency with a 500MB buffer is still
not satisfying on our test chip, since the threshold for
max_tlbi_ops at a 64K pgsize is now 512MB (8192 * 64KB).

As Robin remarked, this could really be a case-by-case situation.
So, I wonder if we can revisit the idea of a configurable
threshold, with a default value matching the current MMU-side
setup.

If this is acceptable, what would be the preferable way to expose
the configuration: a per-SMMU or a per-IOMMU-group sysfs node? I
am open to any other option too.

Also, this would be added to the arm_smmu_inv_range_too_big() in
Jason's patch here:
https://lore.kernel.org/linux-iommu/[email protected]/

Thanks
Nicolin

2024-01-22 13:02:57

by Jason Gunthorpe

Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

On Sat, Jan 20, 2024 at 11:59:45AM -0800, Nicolin Chen wrote:
> Hi Robin/Will,
>
> On Tue, Aug 29, 2023 at 02:25:10PM -0700, Robin Murphy wrote:
> > > Also, what we need is actually an arbitrary number for max_tlbi_ops.
> > > And I think it could be irrelevant to the page size, i.e. either a
> > > 4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
> > > because what eventually impacts the latency is the number of loops
> > > of building/issuing commands.
> >
> > Although there is perhaps a counter-argument for selective invalidation,
> > that if you're using 64K pages to improve TLB hit rates vs. 4K, then you
> > have more to lose from overinvalidation, so maybe a single threshold
> > isn't so appropriate for both.
> >
> > Yes, ultimately it all comes down to picking an arbitrary number, but
> > the challenge is that we want to pick a *good* one, and ideally have
> > some reasoning behind it. As Will suggested, copying what the mm layer
> > does gives us an easy line of reasoning, even if it's just in the form
> > of passing the buck. And that's actually quite attractive, since
> > otherwise we'd then have to get into the question of what really is the
> > latency of building and issuing commands, since that clearly depends on
> > how fast the CPU is, and how fast the SMMU is, and how busy the SMMU is,
> > and how large the command queue is, and how many other CPUs are also
> > contending for the command queue... and very quickly it becomes hard to
> > believe that any simple constant can be good for all possible systems.
>
> So, here we have another request to optimize this number further,
> though the merged arbitrary number copied from MMU side could fix
> the soft lockup. The iommu_unmap delay with a 500MB buffer is not
> quite satisfying on our testing chip, since the threshold now for
> max_tlbi_ops is at 512MB for 64K pgsize (8192 * 64KB).
>
> As Robin remarked, this could be really a case-by-case situation.
> So, I wonder if we can rethink of adding a configurable threshold
> that has a default value at its current setup matching MMU side.
>
> If this is acceptable, what can be the preferable way of having a
> configuration: a per-SMMU or a per-IOMMU-group sysfs node? I am
> open for any other option too.

Maybe it should be more dynamic and you get xx ms to push
invalidations otherwise it gives up and does invalidate all?

The busier the system the broader the invalidation?

Or do we need to measure at boot time invalidation performance and set
a threshold that way?

Also, it seems to me that SVA use cases and, say, DMA API cases are
somewhat different where we may be willing to wait longer for DMA API.

Jason

2024-01-22 18:10:58

by Nicolin Chen

Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

Hi Jason,

Thank you for the ideas!

On Mon, Jan 22, 2024 at 09:01:52AM -0400, Jason Gunthorpe wrote:
> On Sat, Jan 20, 2024 at 11:59:45AM -0800, Nicolin Chen wrote:
> > Hi Robin/Will,
> >
> > On Tue, Aug 29, 2023 at 02:25:10PM -0700, Robin Murphy wrote:
> > > > Also, what we need is actually an arbitrary number for max_tlbi_ops.
> > > > And I think it could be irrelevant to the page size, i.e. either a
> > > > 4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
> > > > because what eventually impacts the latency is the number of loops
> > > > of building/issuing commands.
> > >
> > > Although there is perhaps a counter-argument for selective invalidation,
> > > that if you're using 64K pages to improve TLB hit rates vs. 4K, then you
> > > have more to lose from overinvalidation, so maybe a single threshold
> > > isn't so appropriate for both.
> > >
> > > Yes, ultimately it all comes down to picking an arbitrary number, but
> > > the challenge is that we want to pick a *good* one, and ideally have
> > > some reasoning behind it. As Will suggested, copying what the mm layer
> > > does gives us an easy line of reasoning, even if it's just in the form
> > > of passing the buck. And that's actually quite attractive, since
> > > otherwise we'd then have to get into the question of what really is the
> > > latency of building and issuing commands, since that clearly depends on
> > > how fast the CPU is, and how fast the SMMU is, and how busy the SMMU is,
> > > and how large the command queue is, and how many other CPUs are also
> > > contending for the command queue... and very quickly it becomes hard to
> > > believe that any simple constant can be good for all possible systems.
> >
> > So, here we have another request to optimize this number further,
> > though the merged arbitrary number copied from MMU side could fix
> > the soft lockup. The iommu_unmap delay with a 500MB buffer is not
> > quite satisfying on our testing chip, since the threshold now for
> > max_tlbi_ops is at 512MB for 64K pgsize (8192 * 64KB).
> >
> > As Robin remarked, this could be really a case-by-case situation.
> > So, I wonder if we can rethink of adding a configurable threshold
> > that has a default value at its current setup matching MMU side.
> >
> > If this is acceptable, what can be the preferable way of having a
> > configuration: a per-SMMU or a per-IOMMU-group sysfs node? I am
> > open for any other option too.
>
> Maybe it should be more dynamic and you get xx ms to push
> invalidations otherwise it gives up and does invalidate all?
>
> The busier the system the broader the invalidation?

Yea, I think this could be a good one.

> Or do we need to measure at boot time invalidation performance and set
> a threshold that way?

I see. We can run an invalidation at the default max_tlbi_ops to
measure its delay in msec or usec, and then use that as the
"xx ms" threshold in the first idea.

> Also, it seems to me that SVA use cases and, say, DMA API cases are
> somewhat different where we may be willing to wait longer for DMA API.

Hmm, the lockup that my patch fixed was for an SVA case that
doesn't seem to involve DMA API:
https://lore.kernel.org/linux-iommu/[email protected]/

And the other lockup fix for a non-SVA case from Zhang doesn't
seem to involve DMA API either:
https://lore.kernel.org/linux-iommu/[email protected]/

Maybe we can treat the DMA API a bit differently. But I am not
sure about the justification for letting it wait longer. Mind
elaborating?

Thanks
Nicolin

2024-01-22 18:26:46

by Jason Gunthorpe

Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

On Mon, Jan 22, 2024 at 09:24:08AM -0800, Nicolin Chen wrote:
> > Or do we need to measure at boot time invalidation performance and set
> > a threshold that way?
>
> I see. We can run an invalidation at default max_tlbi_ops to
> get its delay in msec or usec, and then set as the threshold
> "xx ms" in the idea one.
>
> > Also, it seems to me that SVA use cases and, say, DMA API cases are
> > somewhat different where we may be willing to wait longer for DMA API.
>
> Hmm, the lockup that my patch fixed was for an SVA case that
> doesn't seem to involve DMA API:
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> And the other lockup fix for a non-SVA case from Zhang doesn't
> seem to involve DMA API either:
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> Maybe we can treat DMA API a bit different. But I am not sure
> about the justification of leaving it to wait longer. Mind
> elaborating?

Well, there are two issues.. The first is the soft lockup, which
should just be reliably prevented. The timer, for instance, is a
reasonable stab at making that universally safe.

Then there is the issue of raw invalidation performance: SVA in
particular is linked to the mm, and the longer invalidation takes,
the slower the apps will be. We don't have any idea where future
DMA might hit the cache, so it is hard to know whether invalidating
everything is the wrong thing..

The DMA API is often lazy and its active DMA is a bit more
predictable, so perhaps there is more merit in spending extra time
to narrow the invalidation.

The other case was vfio unmap for VM tear down, which ideally would
use whole ASID invalidation.

If your issue is the soft lockup, not performance, then that
should be strongly prevented. Broadly speaking, if SVA is pushing
too high an invalidation workload then we need to aggressively
trim it, and do so dynamically. Certainly we should not have a
tunable that has to be set right to avoid the soft lockup.

A tunable to improve performance, perhaps, but not to achieve basic
correctness.

Maybe it is really just a simple thing - compute how many invalidation
commands are needed, if they don't all fit in the current queue space,
then do an invalidate all instead?

Jason

2024-01-24 00:13:47

by Nicolin Chen

Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

On Mon, Jan 22, 2024 at 01:57:00PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 22, 2024 at 09:24:08AM -0800, Nicolin Chen wrote:
> > > Or do we need to measure at boot time invalidation performance and set
> > > a threshold that way?
> >
> > I see. We can run an invalidation at default max_tlbi_ops to
> > get its delay in msec or usec, and then set as the threshold
> > "xx ms" in the idea one.
> >
> > > Also, it seems to me that SVA use cases and, say, DMA API cases are
> > > somewhat different where we may be willing to wait longer for DMA API.
> >
> > Hmm, the lockup that my patch fixed was for an SVA case that
> > doesn't seem to involve DMA API:
> > https://lore.kernel.org/linux-iommu/[email protected]/
> >
> > And the other lockup fix for a non-SVA case from Zhang doesn't
> > seem to involve DMA API either:
> > https://lore.kernel.org/linux-iommu/[email protected]/
> >
> > Maybe we can treat DMA API a bit different. But I am not sure
> > about the justification of leaving it to wait longer. Mind
> > elaborating?
>
> Well, there are two issues.. The first is the soft lockup, that should
> just be reliably prevented. The timer, for instance, is a reasonable
> stab at making that universally safe.
>
> Then there is the issue of just raw invalidation performance, where
> SVA particularly is linked to the mm and the longer invalidation takes
> the slower the apps will be. We don't have any idea where future DMA
> might hit the cache, so it is hard to know if all invalidation is not
> the right thing..
>
> DMA api is often lazy and the active DMA is a bit more predictable, so
> perhaps there is more merit in spending more time to narrow the
> invalidation.
>
> The other case was vfio unmap for VM tear down, which ideally would
> use whole ASID invalidation.

I see! Then we would need a flag passed in from __iommu_dma_unmap
or so: if the caller is in dma-iommu.c, do the longer per-page
invalidation.

> If your issue is softlockup, not performance, then that should be

We have both issues.

> prevented strongly. Broadly speaking if SVA is pushing too high an
> invalidation workload then we need to agressively trim it, and do so
> dynamically. Certainly we should not have a tunable that has to be set
> right to avoid soft lockup.
>
> A tunable to improve performance, perhaps, but not to achieve basic
> correctness.

So, should we make it an optional tunable, only for those who care
about performance? Though I think having a tunable would address
both issues.

> Maybe it is really just a simple thing - compute how many invalidation
> commands are needed, if they don't all fit in the current queue space,
> then do an invalidate all instead?

The queue can actually have a large space, but one large
invalidation is divided into batches that have to execute
back-to-back. And the batch size is 64 commands in the 64-bit
case, which might be too small as a cap.

Thanks
Nicolin

2024-01-25 13:55:51

by Jason Gunthorpe

Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

On Tue, Jan 23, 2024 at 04:11:09PM -0800, Nicolin Chen wrote:
> > prevented strongly. Broadly speaking if SVA is pushing too high an
> > invalidation workload then we need to agressively trim it, and do so
> > dynamically. Certainly we should not have a tunable that has to be set
> > right to avoid soft lockup.
> >
> > A tunable to improve performance, perhaps, but not to achieve basic
> > correctness.
>
> So, should we make an optional tunable only for those who care
> about performance? Though I think having a tunable would just
> fix both issues.

When the soft lockup issue is solved you can consider if a tunable is
still interesting..

> > Maybe it is really just a simple thing - compute how many invalidation
> > commands are needed, if they don't all fit in the current queue space,
> > then do an invalidate all instead?
>
> The queue could actually have a large space. But one large-size
> invalidation would be divided into batches that have to execute
> back-to-back. And the batch size is 64 commands in 64-bit case,
> which might be too small as a cap.

Yes, some notable code reorganizing would be needed to implement
something like this.

Broadly I'd sketch sort of:

- Figure out how fast the HW can execute a lot of commands
- The above should drive some XX maximum number of commands, maybe we
need to measure at boot, IDK
- Strongly time bound SVA invalidation:
* No more than XX commands, if more needed then push invalidate
all
* All commands must fit in the available queue space, if more
needed then push invalidate all
- The total queue depth must not be larger than YY based on the
retire rate so that even a full queue will complete invalidation
below the target time.

A tunable indicating what the SVA time bound target should be might be
appropriate..

Jason

2024-01-25 17:23:56

by Nicolin Chen

Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

On Thu, Jan 25, 2024 at 09:55:37AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 23, 2024 at 04:11:09PM -0800, Nicolin Chen wrote:
> > > prevented strongly. Broadly speaking if SVA is pushing too high an
> > > invalidation workload then we need to agressively trim it, and do so
> > > dynamically. Certainly we should not have a tunable that has to be set
> > > right to avoid soft lockup.
> > >
> > > A tunable to improve performance, perhaps, but not to achieve basic
> > > correctness.
> >
> > So, should we make an optional tunable only for those who care
> > about performance? Though I think having a tunable would just
> > fix both issues.
>
> When the soft lockup issue is solved you can consider if a tunable is
> still interesting..

Yea, it would be on top of the soft lockup fix. I assume we are
still going with your arm_smmu_inv_range_too_big change, though I
wonder whether we should apply it before your rework series to
make it a bug fix..

> > > Maybe it is really just a simple thing - compute how many invalidation
> > > commands are needed, if they don't all fit in the current queue space,
> > > then do an invalidate all instead?
> >
> > The queue could actually have a large space. But one large-size
> > invalidation would be divided into batches that have to execute
> > back-to-back. And the batch size is 64 commands in 64-bit case,
> > which might be too small as a cap.
>
> Yes, some notable code reorganizing would be needed to implement
> something like this
>
> Broadly I'd sketch sort of:
>
> - Figure out how fast the HW can execute a lot of commands
> - The above should drive some XX maximum number of commands, maybe we
> need to measure at boot, IDK
> - Strongly time bound SVA invalidation:
> * No more than XX commands, if more needed then push invalidate
> all
> * All commands must fit in the available queue space, if more
> needed then push invalidate all
> - The total queue depth must not be larger than YY based on the
> retire rate so that even a full queue will complete invalidation
> below the target time.
>
> A tunable indicating what the SVA time bound target should be might be
> appropriate..

Thanks for listing it out. I will draft something based on that.
Should we confine it to SVA, or to non-DMA callers in general?

Nicolin

2024-01-25 17:54:16

by Jason Gunthorpe

Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

On Thu, Jan 25, 2024 at 09:23:00AM -0800, Nicolin Chen wrote:
> > When the soft lockup issue is solved you can consider if a tunable is
> > still interesting..
>
> Yea, it would be on top of the soft lockup fix. I assume we are
> still going with your change: arm_smmu_inv_range_too_big, though
> I wonder if we should apply before your rework series to make it
> a bug fix..

It depends what change you settle on..

> > > > Maybe it is really just a simple thing - compute how many invalidation
> > > > commands are needed, if they don't all fit in the current queue space,
> > > > then do an invalidate all instead?
> > >
> > > The queue could actually have a large space. But one large-size
> > > invalidation would be divided into batches that have to execute
> > > back-to-back. And the batch size is 64 commands in 64-bit case,
> > > which might be too small as a cap.
> >
> > Yes, some notable code reorganizing would be needed to implement
> > something like this
> >
> > Broadly I'd sketch sort of:
> >
> > - Figure out how fast the HW can execute a lot of commands
> > - The above should drive some XX maximum number of commands, maybe we
> > need to measure at boot, IDK
> > - Strongly time bound SVA invalidation:
> > * No more than XX commands, if more needed then push invalidate
> > all
> > * All commands must fit in the available queue space, if more
> > needed then push invalidate all
> > - The total queue depth must not be larger than YY based on the
> > retire rate so that even a full queue will complete invalidation
> > below the target time.
> >
> > A tunable indicating what the SVA time bound target should be might be
> > appropriate..
>
> Thanks for listing it out. I will draft something with that, and
> should we just confine it to SVA or non DMA callers in general?

Also, how much of this SVA issue is multithreaded? Will multiple
command queues improve anything?

Jason

2024-01-25 19:56:24

by Nicolin Chen

Subject: Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg

On Thu, Jan 25, 2024 at 01:47:28PM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 25, 2024 at 09:23:00AM -0800, Nicolin Chen wrote:
> > > When the soft lockup issue is solved you can consider if a tunable is
> > > still interesting..
> >
> > Yea, it would be on top of the soft lockup fix. I assume we are
> > still going with your change: arm_smmu_inv_range_too_big, though
> > I wonder if we should apply before your rework series to make it
> > a bug fix..
>
> It depends what change you settle on..

I mean your arm_smmu_inv_range_too_big patch. Should it be a bug
fix CCing the stable tree? My previous SVA fix was, by the way.

> > > > > Maybe it is really just a simple thing - compute how many invalidation
> > > > > commands are needed, if they don't all fit in the current queue space,
> > > > > then do an invalidate all instead?
> > > >
> > > > The queue could actually have a large space. But one large-size
> > > > invalidation would be divided into batches that have to execute
> > > > back-to-back. And the batch size is 64 commands in 64-bit case,
> > > > which might be too small as a cap.
> > >
> > > Yes, some notable code reorganizing would be needed to implement
> > > something like this
> > >
> > > Broadly I'd sketch sort of:
> > >
> > > - Figure out how fast the HW can execute a lot of commands
> > > - The above should drive some XX maximum number of commands, maybe we
> > > need to measure at boot, IDK
> > > - Strongly time bound SVA invalidation:
> > > * No more than XX commands, if more needed then push invalidate
> > > all
> > > * All commands must fit in the available queue space, if more
> > > needed then push invalidate all
> > > - The total queue depth must not be larger than YY based on the
> > > retire rate so that even a full queue will complete invalidation
> > > below the target time.
> > >
> > > A tunable indicating what the SVA time bound target should be might be
> > > appropriate..
> >
> > Thanks for listing it out. I will draft something with that, and
> > should we just confine it to SVA or non DMA callers in general?
>
> Also, how much of this SVA issue is multithreaded? Will multiple
> command queues improve anything?

From our measurements, the bottleneck is mostly the SMMU consuming
the commands with a single CMDQ HW, so multithreading is unlikely
to help. And VCMDQ only provides a multi-queue interface/wrapper
for VM isolation.

Thanks
Nic