2021-06-08 01:58:39

by Nadav Amit

Subject: [PATCH v3 5/6] iommu/amd: Tailored gather logic for AMD

From: Nadav Amit <[email protected]>

AMD's IOMMU can efficiently flush any range (i.e., in a single flush).
This is in contrast, for instance, to Intel IOMMUs, which have a limit on
the number of pages that can be flushed in a single flush. In addition,
AMD's IOMMU does not care about the page size, so changes of the page
size do not need to trigger a TLB flush.

So in most cases, a TLB flush due to disjoint ranges or page-size changes
is not needed for AMD. Yet, vIOMMUs require the hypervisor to
synchronize the virtualized IOMMU's PTEs with the physical ones. This
process induces overhead, so it is better not to cause unnecessary
flushes, i.e., flushes of PTEs that were not modified.

Implement amd_iommu_iotlb_gather_add_page() and use it instead
of the generic iommu_iotlb_gather_add_page(). Ignore page-size changes
and disjoint regions unless the "non-present cache" feature is reported
by the IOMMU capabilities, as this is an indication that we are running
on a physical IOMMU. A similar indication is used by VT-d (see "caching
mode"). The new logic retains the same flushing behavior that we had
before the introduction of page-selective IOTLB flushes for AMD.

In virtualized environments, check whether the newly flushed region and
the gathered one are disjoint, and flush if they are. Also check whether
the new region would cause an IOTLB invalidation of a large region that
would include unmodified PTEs. The latter check is done according to the
"order" of the IOTLB flush.
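
To make the "order" check concrete, the following is a minimal
user-space sketch (not part of the patch): fls64_demo() stands in for
the kernel's fls64() via a GCC/Clang builtin, and the addresses are
made-up examples assuming 4KiB pages.

#include <stdio.h>

static int fls64_demo(unsigned long long x)
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

int main(void)
{
	/* Currently gathered range: a single page at 0x10000. */
	unsigned long long start = 0x10000, end = 0x10fff;
	/* Adding a page at 0x12000 extends the range to 0x12fff. */
	unsigned long long new_end = 0x12fff;

	int msb_diff = fls64_demo(end ^ start);		/* 12: a 4KiB flush */
	int new_msb_diff = fls64_demo(new_end ^ start);	/* 14: a 16KiB flush */

	/*
	 * Growing the flush block by more than one order would also
	 * invalidate unmodified PTEs, so flush the gathered range first.
	 */
	printf("flush first: %s\n",
	       new_msb_diff > msb_diff + 1 ? "yes" : "no");
	return 0;
}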

Cc: Joerg Roedel <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Jiajun Cao <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Lu Baolu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
drivers/iommu/amd/iommu.c | 44 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3e40f6610b6a..128f2e889ced 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2053,6 +2053,48 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
return ret;
}

+static void amd_iommu_iotlb_gather_add_page(struct iommu_domain *domain,
+ struct iommu_iotlb_gather *gather,
+ unsigned long iova, size_t size)
+{
+ /*
+ * AMD's IOMMU can flush as many pages as necessary in a single flush.
+ * Unless we run in a virtual machine, which can be inferred according
+ * to whether "non-present cache" is on, it is probably best to prefer
+ * (potentially) too extensive TLB flushing (i.e., more misses) over
+ * multiple TLB flushes (i.e., more flushes). For virtual machines the
+ * hypervisor needs to synchronize the host IOMMU PTEs with those of
+ * the guest, and the trade-off is different: unnecessary TLB flushes
+ * should be avoided.
+ */
+ if (amd_iommu_np_cache && gather->end != 0) {
+ unsigned long start = iova, end = start + size - 1;
+
+ if (iommu_iotlb_gather_is_disjoint(gather, iova, size)) {
+ /*
+ * If the new page is disjoint from the current range,
+ * flush.
+ */
+ iommu_iotlb_sync(domain, gather);
+ } else {
+ /*
+ * If the order of TLB flushes increases by more than
+ * 1, it means that we would have to flush PTEs that
+ * were not modified. In this case, flush.
+ */
+ unsigned long new_start = min(gather->start, start);
+ unsigned long new_end = max(gather->end, end);
+ int msb_diff = fls64(gather->end ^ gather->start);
+ int new_msb_diff = fls64(new_end ^ new_start);
+
+ if (new_msb_diff > msb_diff + 1)
+ iommu_iotlb_sync(domain, gather);
+ }
+ }
+
+ iommu_iotlb_gather_add_range(gather, iova, size);
+}
+
static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
size_t page_size,
struct iommu_iotlb_gather *gather)
@@ -2067,7 +2109,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,

r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;

- iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+ amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);

return r;
}
--
2.25.1


2021-06-15 12:56:56

by Robin Murphy

Subject: Re: [PATCH v3 5/6] iommu/amd: Tailored gather logic for AMD

On 2021-06-07 19:25, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> AMD's IOMMU can efficiently flush any range (i.e., in a single flush).
> This is in contrast, for instance, to Intel IOMMUs, which have a limit on
> the number of pages that can be flushed in a single flush. In addition,
> AMD's IOMMU does not care about the page size, so changes of the page
> size do not need to trigger a TLB flush.
>
> So in most cases, a TLB flush due to disjoint ranges or page-size changes
> is not needed for AMD. Yet, vIOMMUs require the hypervisor to
> synchronize the virtualized IOMMU's PTEs with the physical ones. This
> process induces overhead, so it is better not to cause unnecessary
> flushes, i.e., flushes of PTEs that were not modified.
>
> Implement amd_iommu_iotlb_gather_add_page() and use it instead
> of the generic iommu_iotlb_gather_add_page(). Ignore page-size changes
> and disjoint regions unless the "non-present cache" feature is reported
> by the IOMMU capabilities, as this is an indication that we are running
> on a physical IOMMU. A similar indication is used by VT-d (see "caching
> mode"). The new logic retains the same flushing behavior that we had
> before the introduction of page-selective IOTLB flushes for AMD.
>
> In virtualized environments, check whether the newly flushed region and
> the gathered one are disjoint, and flush if they are. Also check whether
> the new region would cause an IOTLB invalidation of a large region that
> would include unmodified PTEs. The latter check is done according to the
> "order" of the IOTLB flush.

If it helps,

Reviewed-by: Robin Murphy <[email protected]>

I wonder if it might be more effective to defer the alignment-based
splitting part to amd_iommu_iotlb_sync() itself, but that could be
investigated as another follow-up.

Robin.

> Cc: Joerg Roedel <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Jiajun Cao <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: Lu Baolu <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> drivers/iommu/amd/iommu.c | 44 ++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 3e40f6610b6a..128f2e889ced 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2053,6 +2053,48 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
> return ret;
> }
>
> +static void amd_iommu_iotlb_gather_add_page(struct iommu_domain *domain,
> + struct iommu_iotlb_gather *gather,
> + unsigned long iova, size_t size)
> +{
> + /*
> + * AMD's IOMMU can flush as many pages as necessary in a single flush.
> + * Unless we run in a virtual machine, which can be inferred according
> + * to whether "non-present cache" is on, it is probably best to prefer
> + * (potentially) too extensive TLB flushing (i.e., more misses) over
> + * multiple TLB flushes (i.e., more flushes). For virtual machines the
> + * hypervisor needs to synchronize the host IOMMU PTEs with those of
> + * the guest, and the trade-off is different: unnecessary TLB flushes
> + * should be avoided.
> + */
> + if (amd_iommu_np_cache && gather->end != 0) {
> + unsigned long start = iova, end = start + size - 1;
> +
> + if (iommu_iotlb_gather_is_disjoint(gather, iova, size)) {
> + /*
> + * If the new page is disjoint from the current range,
> + * flush.
> + */
> + iommu_iotlb_sync(domain, gather);
> + } else {
> + /*
> + * If the order of TLB flushes increases by more than
> + * 1, it means that we would have to flush PTEs that
> + * were not modified. In this case, flush.
> + */
> + unsigned long new_start = min(gather->start, start);
> + unsigned long new_end = max(gather->end, end);
> + int msb_diff = fls64(gather->end ^ gather->start);
> + int new_msb_diff = fls64(new_end ^ new_start);
> +
> + if (new_msb_diff > msb_diff + 1)
> + iommu_iotlb_sync(domain, gather);
> + }
> + }
> +
> + iommu_iotlb_gather_add_range(gather, iova, size);
> +}
> +
> static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
> size_t page_size,
> struct iommu_iotlb_gather *gather)
> @@ -2067,7 +2109,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
>
> r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
>
> - iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
> + amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
>
> return r;
> }
>

2021-06-15 18:15:40

by Nadav Amit

Subject: Re: [PATCH v3 5/6] iommu/amd: Tailored gather logic for AMD



> On Jun 15, 2021, at 5:55 AM, Robin Murphy <[email protected]> wrote:
>
> On 2021-06-07 19:25, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>> AMD's IOMMU can efficiently flush any range (i.e., in a single flush).
>> This is in contrast, for instance, to Intel IOMMUs, which have a limit on
>> the number of pages that can be flushed in a single flush. In addition,
>> AMD's IOMMU does not care about the page size, so changes of the page
>> size do not need to trigger a TLB flush.
>> So in most cases, a TLB flush due to disjoint ranges or page-size changes
>> is not needed for AMD. Yet, vIOMMUs require the hypervisor to
>> synchronize the virtualized IOMMU's PTEs with the physical ones. This
>> process induces overhead, so it is better not to cause unnecessary
>> flushes, i.e., flushes of PTEs that were not modified.
>> Implement amd_iommu_iotlb_gather_add_page() and use it instead
>> of the generic iommu_iotlb_gather_add_page(). Ignore page-size changes
>> and disjoint regions unless the "non-present cache" feature is reported
>> by the IOMMU capabilities, as this is an indication that we are running
>> on a physical IOMMU. A similar indication is used by VT-d (see "caching
>> mode"). The new logic retains the same flushing behavior that we had
>> before the introduction of page-selective IOTLB flushes for AMD.
>> In virtualized environments, check whether the newly flushed region and
>> the gathered one are disjoint, and flush if they are. Also check whether
>> the new region would cause an IOTLB invalidation of a large region that
>> would include unmodified PTEs. The latter check is done according to the
>> "order" of the IOTLB flush.
>
> If it helps,
>
> Reviewed-by: Robin Murphy <[email protected]>

Thanks!


> I wonder if it might be more effective to defer the alignment-based splitting part to amd_iommu_iotlb_sync() itself, but that could be investigated as another follow-up.

Note that the alignment-based splitting is only used for virtualized AMD IOMMUs, so it has no impact for most users.

Right now, the performance is kind of bad on VMs since AMD’s IOMMU driver does a full IOTLB flush whenever it unmaps more than a single page. So, although your idea makes sense, I do not know exactly how to implement it right now, and regardless it is likely to provide a much smaller performance improvement than avoiding full IOTLB flushes would.

Having said that, if I figure out a way to implement it, I would give it a try (although I am admittedly afraid of a complicated logic that might cause subtle, mostly undetectable bugs).

2021-06-15 19:23:06

by Robin Murphy

Subject: Re: [PATCH v3 5/6] iommu/amd: Tailored gather logic for AMD

On 2021-06-15 19:14, Nadav Amit wrote:
>
>
>> On Jun 15, 2021, at 5:55 AM, Robin Murphy <[email protected]> wrote:
>>
>> On 2021-06-07 19:25, Nadav Amit wrote:
>>> From: Nadav Amit <[email protected]>
>>> AMD's IOMMU can efficiently flush any range (i.e., in a single flush).
>>> This is in contrast, for instance, to Intel IOMMUs, which have a limit on
>>> the number of pages that can be flushed in a single flush. In addition,
>>> AMD's IOMMU does not care about the page size, so changes of the page
>>> size do not need to trigger a TLB flush.
>>> So in most cases, a TLB flush due to disjoint ranges or page-size changes
>>> is not needed for AMD. Yet, vIOMMUs require the hypervisor to
>>> synchronize the virtualized IOMMU's PTEs with the physical ones. This
>>> process induces overhead, so it is better not to cause unnecessary
>>> flushes, i.e., flushes of PTEs that were not modified.
>>> Implement amd_iommu_iotlb_gather_add_page() and use it instead
>>> of the generic iommu_iotlb_gather_add_page(). Ignore page-size changes
>>> and disjoint regions unless the "non-present cache" feature is reported
>>> by the IOMMU capabilities, as this is an indication that we are running
>>> on a physical IOMMU. A similar indication is used by VT-d (see "caching
>>> mode"). The new logic retains the same flushing behavior that we had
>>> before the introduction of page-selective IOTLB flushes for AMD.
>>> In virtualized environments, check whether the newly flushed region and
>>> the gathered one are disjoint, and flush if they are. Also check whether
>>> the new region would cause an IOTLB invalidation of a large region that
>>> would include unmodified PTEs. The latter check is done according to the
>>> "order" of the IOTLB flush.
>>
>> If it helps,
>>
>> Reviewed-by: Robin Murphy <[email protected]>
>
> Thanks!
>
>
>> I wonder if it might be more effective to defer the alignment-based splitting part to amd_iommu_iotlb_sync() itself, but that could be investigated as another follow-up.
>
> Note that the alignment-based splitting is only used for virtualized AMD IOMMUs, so it has no impact for most users.
>
> Right now, the performance is kind of bad on VMs since AMD’s IOMMU driver does a full IOTLB flush whenever it unmaps more than a single page. So, although your idea makes sense, I do not know exactly how to implement it right now, and regardless it is likely to provide a much smaller performance improvement than avoiding full IOTLB flushes would.
>
> Having said that, if I figure out a way to implement it, I would give it a try (although I am admittedly afraid of a complicated logic that might cause subtle, mostly undetectable bugs).

I was mainly thinking that when you observe a change in "order" and sync
to avoid over-invalidating adjacent pages, those pages may still be part
of the current unmap and you've just not seen them added yet. Hence
simply gathering contiguous pages regardless of alignment, then breaking
the total range down into appropriately-aligned commands in the sync
once you know you've seen everything, seems like it might allow issuing
fewer commands overall. But I haven't quite grasped the alignment rules
either, so possibly this is moot anyway.

Robin.

2021-06-15 19:48:20

by Nadav Amit

Subject: Re: [PATCH v3 5/6] iommu/amd: Tailored gather logic for AMD



> On Jun 15, 2021, at 12:20 PM, Robin Murphy <[email protected]> wrote:
>
> On 2021-06-15 19:14, Nadav Amit wrote:
>>> On Jun 15, 2021, at 5:55 AM, Robin Murphy <[email protected]> wrote:
>>>
>>> On 2021-06-07 19:25, Nadav Amit wrote:
>>>> From: Nadav Amit <[email protected]>
>>>> AMD's IOMMU can efficiently flush any range (i.e., in a single flush).
>>>> This is in contrast, for instance, to Intel IOMMUs, which have a limit on
>>>> the number of pages that can be flushed in a single flush. In addition,
>>>> AMD's IOMMU does not care about the page size, so changes of the page
>>>> size do not need to trigger a TLB flush.
>>>> So in most cases, a TLB flush due to disjoint ranges or page-size changes
>>>> is not needed for AMD. Yet, vIOMMUs require the hypervisor to
>>>> synchronize the virtualized IOMMU's PTEs with the physical ones. This
>>>> process induces overhead, so it is better not to cause unnecessary
>>>> flushes, i.e., flushes of PTEs that were not modified.
>>>> Implement amd_iommu_iotlb_gather_add_page() and use it instead
>>>> of the generic iommu_iotlb_gather_add_page(). Ignore page-size changes
>>>> and disjoint regions unless the "non-present cache" feature is reported
>>>> by the IOMMU capabilities, as this is an indication that we are running
>>>> on a physical IOMMU. A similar indication is used by VT-d (see "caching
>>>> mode"). The new logic retains the same flushing behavior that we had
>>>> before the introduction of page-selective IOTLB flushes for AMD.
>>>> In virtualized environments, check whether the newly flushed region and
>>>> the gathered one are disjoint, and flush if they are. Also check whether
>>>> the new region would cause an IOTLB invalidation of a large region that
>>>> would include unmodified PTEs. The latter check is done according to the
>>>> "order" of the IOTLB flush.
>>>
>>> If it helps,
>>>
>>> Reviewed-by: Robin Murphy <[email protected]>
>> Thanks!
>>> I wonder if it might be more effective to defer the alignment-based splitting part to amd_iommu_iotlb_sync() itself, but that could be investigated as another follow-up.
>> Note that the alignment-based splitting is only used for virtualized AMD IOMMUs, so it has no impact for most users.
>> Right now, the performance is kind of bad on VMs since AMD’s IOMMU driver does a full IOTLB flush whenever it unmaps more than a single page. So, although your idea makes sense, I do not know exactly how to implement it right now, and regardless it is likely to provide a much smaller performance improvement than avoiding full IOTLB flushes would.
>> Having said that, if I figure out a way to implement it, I would give it a try (although I am admittedly afraid of a complicated logic that might cause subtle, mostly undetectable bugs).
>
> I was mainly thinking that when you observe a change in "order" and sync to avoid over-invalidating adjacent pages, those pages may still be part of the current unmap and you've just not seen them added yet. Hence simply gathering contiguous pages regardless of alignment, then breaking the total range down into appropriately-aligned commands in the sync once you know you've seen everything, seems like it might allow issuing fewer commands overall. But I haven't quite grasped the alignment rules either, so possibly this is moot anyway.

Thanks for explaining. I think that what you propose makes sense. We might already flush more than needed in certain cases (e.g., patterns in which pages are added before and after the gathered range). I doubt these cases actually happen, and the tradeoff between being precise in what you flush (more flushes) and not causing the hypervisor to check unchanged mappings (synchronization cost) is less obvious.

I will see if I can change __domain_flush_pages() to your liking, and see whether it can be part of this series.
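
For reference, here is a minimal user-space sketch of the kind of
alignment-based splitting being discussed. flush_range_split() and
issue_invalidate() are hypothetical helpers, not the driver's
__domain_flush_pages(); start/end are assumed to be page-aligned with
end exclusive, and the builtin requires GCC/Clang.

#include <stdio.h>

static void issue_invalidate(unsigned long long base, unsigned int order)
{
	/* One invalidation command covering [base, base + 2^order). */
	printf("invalidate 0x%llx, 2^%u bytes\n", base, order);
}

static void flush_range_split(unsigned long long start, unsigned long long end)
{
	while (start < end) {
		/* Largest block allowed by the alignment of start... */
		unsigned int order = start ? __builtin_ctzll(start) : 63;

		/* ...capped so that the block does not run past end. */
		while ((1ULL << order) > end - start)
			order--;

		issue_invalidate(start, order);
		start += 1ULL << order;
	}
}

int main(void)
{
	/* Pages 0x10000-0x12fff: one 8KiB command plus one 4KiB command. */
	flush_range_split(0x10000, 0x13000);
	return 0;
}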


Attachments:
signature.asc (849.00 B)
Message signed with OpenPGP