From: David Stevens <[email protected]>
This patch series adds support for per-domain dynamic pools of iommu
bounce buffers to the dma-iommu API. This allows iommu mappings to be
reused while still maintaining strict iommu protection.
This bounce buffer support is used to add a new config option that, when
enabled, causes all non-direct streaming mappings below a configurable
size to go through the bounce buffers. This serves as an optimization on
systems where manipulating iommu mappings is very expensive. For
example, virtio-iommu operations in a guest on a Linux host require a
vmexit, involvement of the VMM, and a VFIO syscall. For relatively small
DMA operations, memcpy can be significantly faster.
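As a rough illustration of the policy described above, the decision of
whether a mapping goes through a bounce buffer can be modelled in
userspace C. This is only a sketch, not code from the series:
SKETCH_GRANULE, struct bounce_policy, must_bounce() and should_bounce()
are hypothetical names, and the 4 KiB granule is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the series. */
#define SKETCH_GRANULE 4096u /* assumed IOMMU granule size */

struct bounce_policy {
	bool optimization_enabled;  /* models CONFIG_IOMMU_BOUNCE_BUFFERS */
	size_t always_bounce_limit; /* models the configurable size limit */
};

/* Untrusted devices must bounce anything that is not granule-aligned. */
static bool must_bounce(bool untrusted, size_t offset, size_t size)
{
	return untrusted && ((offset | size) & (SKETCH_GRANULE - 1)) != 0;
}

/* Small mappings bounce as an optimization; others only when required. */
static bool should_bounce(const struct bounce_policy *p, bool untrusted,
			  size_t offset, size_t size)
{
	assert(p != NULL);
	if (p->optimization_enabled && size <= p->always_bounce_limit)
		return true;
	return must_bounce(untrusted, offset, size);
}
```

With the optimization enabled and a 4 KiB limit, a 512-byte mapping for
a trusted device would be bounced, while an aligned 8 KiB mapping would
still be mapped directly.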
As a performance comparison, on a device with an i5-10210U, I ran fio
with a VFIO passthrough NVMe drive and virtio-iommu with '--direct=1
--rw=read --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k,
and 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x,
respectively. Time spent in iommu_dma_unmap_(page|sg) per GB processed
decreased by 97%, 94%, 90%, and 87%, respectively. Time spent in
iommu_dma_map_(page|sg) decreased by >99%, as bounce buffers don't
require syncing here in the read case.
Running with multiple jobs doesn't serve as a useful performance
comparison because virtio-iommu and vfio_iommu_type1 both have big
locks that significantly limit multithreaded DMA performance.
These pooled bounce buffers are also used for subgranule mappings with
untrusted devices, replacing the single use bounce buffers used
currently. The biggest difference here is that the new implementation
maps a whole sglist using a single bounce buffer. The new implementation
does not support using bounce buffers for only some segments of the
sglist, so it may require more copying. However, the current
implementation requires per-segment iommu map/unmap operations for all
untrusted sglist mappings (fully aligned sglists included). On an
i5-10210U laptop with the internal NVMe drive made to appear untrusted,
fio --direct=1 --rw=read --ioengine=libaio --iodepth=64 --bs=64k showed
a statistically significant decrease in CPU load from 2.28% to 2.17%
with the new iommu bounce buffer optimization enabled.
Each domain's buffer pool is split into multiple power-of-2 size
classes. Each class allocates metadata for a fixed number of buffer
slots. A large iova range is allocated, and each slot is assigned an
iova from the range. This allows the iova to be easily mapped back to
the slot, and allows the critical section of most pool operations to be
constant time. The one exception is finding a cached buffer to reuse.
Cached buffers are only separated according to R/W permissions - the use
of other permissions such as IOMMU_PRIV may require a linear search
through the cache. However, these other permissions are rare and likely
exhibit high locality, so this should not be a bottleneck in practice.
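To make the iova-to-slot mapping concrete, here is a small userspace
sketch of one possible layout. It is an assumption for illustration, not
the layout used by io-buffer-manager.c: the class count, the smallest
class size, the slot count and the back-to-back placement of classes in
the iova range are all made up, as are the function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All constants below are assumptions chosen for illustration. */
#define MIN_CLASS_SHIFT 12u   /* smallest class holds 4 KiB buffers */
#define NUM_CLASSES 4u        /* classes of 4K, 8K, 16K and 32K */
#define SLOTS_PER_CLASS 1024u /* fixed slot count per class */

struct slot_ref {
	unsigned int class_idx;
	unsigned int slot_idx;
};

/* Smallest power-of-2 class that fits @size, or -1 if none does. */
static int size_to_class(size_t size)
{
	for (unsigned int c = 0; c < NUM_CLASSES; c++)
		if (size <= ((size_t)1 << (MIN_CLASS_SHIFT + c)))
			return (int)c;
	return -1;
}

/* Start of class @c: classes are laid out back to back in one range. */
static uint64_t class_base(uint64_t range_base, unsigned int c)
{
	uint64_t slots_bytes = (uint64_t)SLOTS_PER_CLASS << MIN_CLASS_SHIFT;

	return range_base + slots_bytes * (((uint64_t)1 << c) - 1);
}

static uint64_t slot_to_iova(uint64_t range_base, struct slot_ref ref)
{
	assert(ref.class_idx < NUM_CLASSES && ref.slot_idx < SLOTS_PER_CLASS);
	return class_base(range_base, ref.class_idx) +
	       ((uint64_t)ref.slot_idx << (MIN_CLASS_SHIFT + ref.class_idx));
}

/* Map an iova back to its slot with arithmetic only, no search tree. */
static bool iova_to_slot(uint64_t range_base, uint64_t iova,
			 struct slot_ref *ref)
{
	if (iova < range_base)
		return false;
	for (unsigned int c = 0; c < NUM_CLASSES; c++) {
		if (iova < class_base(range_base, c + 1)) {
			uint64_t off = iova - class_base(range_base, c);

			ref->class_idx = c;
			ref->slot_idx = (unsigned int)(off >> (MIN_CLASS_SHIFT + c));
			return true;
		}
	}
	return false; /* outside the pool: not a pooled bounce buffer */
}
```

Because the number of classes is a small constant, the lookup cost does
not depend on how many buffers are in flight, which is the property the
paragraph above relies on.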
Since untrusted devices may require bounce buffers, each domain has a
fallback rbtree to manage single use buffers. This may be necessary if a
very large number of DMA operations are simultaneously in-flight, or for
very large individual DMA operations.
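The interaction between the pool and the fallback path can be sketched
as follows. This is a simplified userspace model under stated
assumptions: struct manager_sketch and alloc_buffer() are hypothetical,
and a counter stands in for the rbtree of single-use buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum buf_source { BUF_NONE, BUF_POOL, BUF_FALLBACK };

/* Hypothetical, heavily simplified manager state. */
struct manager_sketch {
	unsigned int free_slots;    /* pooled slots still available */
	size_t max_slot_size;       /* largest pooled size class */
	unsigned int fallback_live; /* single-use buffers in flight */
};

static enum buf_source alloc_buffer(struct manager_sketch *m, size_t size,
				    bool require_bounce)
{
	assert(m != NULL);
	if (size <= m->max_slot_size && m->free_slots > 0) {
		m->free_slots--;
		return BUF_POOL;
	}
	/* Pool exhausted or request too big: only bounce if we have to. */
	if (!require_bounce)
		return BUF_NONE; /* caller maps the original memory directly */
	m->fallback_live++;      /* the series tracks these in an rbtree */
	return BUF_FALLBACK;
}
```

In this model, a trusted device that finds the pool empty simply falls
back to a regular iommu mapping, whereas an untrusted device with an
unaligned buffer always gets a bounce buffer, pooled or not.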
This patch set does not use swiotlb. There are two primary ways in which
swiotlb isn't compatible with per-domain buffer pools. First, swiotlb
allocates buffers to be compatible with a single device, whereas
per-domain buffer pools don't handle that during buffer allocation as a
single buffer may end up being used by multiple devices. Second, swiotlb
allocation establishes the original to bounce buffer mapping, which
again doesn't work if buffers can be reused. Effectively the only code
that can be shared between the two use cases is allocating slots from
the swiotlb's memory. However, given that we're going to be allocating
memory for use with an iommu, allocating memory from a block of memory
explicitly set aside to deal with a lack of iommu seems kind of
contradictory. At best there might be a small performance improvement if
swiotlb allocation is faster than regular page allocation, but buffer
allocation isn't on the hot path anyway.
Not using the swiotlb has the benefit that memory doesn't have to be
preallocated. Instead, bounce buffers consume memory only for in-flight
dma transactions (ignoring temporarily cached buffers), which is the
smallest amount possible. This makes it easier to use bounce buffers as
an optimization on systems with large numbers of devices or in
situations where devices are unknown, since it is not necessary to try
to tune how much memory needs to be set aside to achieve good
performance without costing too much memory.
Finally, this series adds a new DMA_ATTR_PERSISTENT_STREAMING flag. This
is meant to address devices which create long lived streaming mappings
but manage CPU cache coherency without using the dma_sync_* APIs.
Currently, these devices don't function properly with swiotlb=force. The
new flag is used to bypass bounce buffers so such devices will function
when the new bounce buffer optimization is enabled. The flag is added to
the i915 driver, which creates such mappings. It can also be added to
various dma-buf implementations as an optimization, although that is not
done here.
v1 -> v2:
- Replace existing untrusted bounce buffers with new bounce
buffer pools. This includes significant rework to account for
untrusted bounce buffers being required instead of an
optimization.
- Add flag for persistent streaming mappings.
David Stevens (9):
Revert "iommu: Allow the dma-iommu api to use bounce buffers"
dma-iommu: expose a few helper functions to module
dma-iommu: bounce buffers for untrusted devices
dma-iommu: remove extra buffer search on unmap
dma-iommu: clear only necessary bytes
dma-iommu: add bounce buffer pools
dma-iommu: support iommu bounce buffer optimization
dma-mapping: add persistent streaming mapping flag
drm/i915: use DMA_ATTR_PERSISTENT_STREAMING flag
drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 4 +-
drivers/gpu/drm/i915/i915_gem_gtt.c | 3 +-
drivers/iommu/Kconfig | 11 +
drivers/iommu/Makefile | 2 +-
drivers/iommu/dma-iommu.c | 268 ++++-----
drivers/iommu/io-bounce-buffers.c | 533 +++++++++++++++++
drivers/iommu/io-bounce-buffers.h | 49 ++
drivers/iommu/io-buffer-manager.c | 633 +++++++++++++++++++++
drivers/iommu/io-buffer-manager.h | 94 +++
include/linux/dma-iommu.h | 12 +
include/linux/dma-mapping.h | 11 +
11 files changed, 1460 insertions(+), 160 deletions(-)
create mode 100644 drivers/iommu/io-bounce-buffers.c
create mode 100644 drivers/iommu/io-bounce-buffers.h
create mode 100644 drivers/iommu/io-buffer-manager.c
create mode 100644 drivers/iommu/io-buffer-manager.h
--
2.32.0.605.g8dce9f2422-goog
From: David Stevens <[email protected]>
Use the new DMA_ATTR_PERSISTENT_STREAMING for long lived dma mappings
which directly handle CPU cache coherency instead of using dma_sync_*.
Signed-off-by: David Stevens <[email protected]>
---
drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 4 +++-
drivers/gpu/drm/i915/i915_gem_gtt.c | 3 ++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
index 616c3a2f1baf..df982cfb4f34 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
@@ -48,7 +48,9 @@ static struct sg_table *i915_gem_map_dma_buf(struct dma_buf_attachment *attachme
src = sg_next(src);
}
- ret = dma_map_sgtable(attachment->dev, st, dir, DMA_ATTR_SKIP_CPU_SYNC);
+ ret = dma_map_sgtable(attachment->dev, st, dir,
+ DMA_ATTR_SKIP_CPU_SYNC |
+ DMA_ATTR_PERSISTENT_STREAMING);
if (ret)
goto err_free_sg;
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 36489be4896b..f27a849631f7 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -33,7 +33,8 @@ int i915_gem_gtt_prepare_pages(struct drm_i915_gem_object *obj,
PCI_DMA_BIDIRECTIONAL,
DMA_ATTR_SKIP_CPU_SYNC |
DMA_ATTR_NO_KERNEL_MAPPING |
- DMA_ATTR_NO_WARN))
+ DMA_ATTR_NO_WARN |
+ DMA_ATTR_PERSISTENT_STREAMING))
return 0;
/*
--
2.32.0.605.g8dce9f2422-goog
From: David Stevens <[email protected]>
Add config that uses IOMMU bounce buffer pools to avoid IOMMU
interactions as much as possible for relatively small streaming DMA
operations. This can lead to significant performance improvements on
systems where IOMMU map/unmap operations are very slow, such as when
running virtualized.
Signed-off-by: David Stevens <[email protected]>
---
drivers/iommu/Kconfig | 11 +++++
drivers/iommu/dma-iommu.c | 5 ++-
drivers/iommu/io-bounce-buffers.c | 70 +++++++++++++++++++++----------
drivers/iommu/io-buffer-manager.c | 17 +++++---
drivers/iommu/io-buffer-manager.h | 8 ++--
include/linux/dma-iommu.h | 2 +
6 files changed, 81 insertions(+), 32 deletions(-)
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 07b7c25cbed8..e573b5c276dc 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -422,4 +422,15 @@ config SPRD_IOMMU
Say Y here if you want to use the multimedia devices listed above.
+config IOMMU_BOUNCE_BUFFERS
+ bool "Use IOMMU bounce buffers"
+ depends on IOMMU_DMA
+ default n
+ help
+ Use bounce buffers for small, streaming DMA operations. This may
+ have performance benefits on systems where establishing IOMMU mappings
+ is particularly expensive, such as when running as a guest.
+
+ If unsure, say N here.
+
endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 42f85b7a90f0..965bc0a2f140 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -324,7 +324,7 @@ static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad)
domain->ops->flush_iotlb_all(domain);
}
-static bool dev_is_untrusted(struct device *dev)
+bool dev_is_untrusted(struct device *dev)
{
return dev_is_pci(dev) && to_pci_dev(dev)->untrusted;
}
@@ -402,7 +402,8 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
ret = iova_reserve_iommu_regions(dev, domain);
- if (ret == 0 && dev_is_untrusted(dev)) {
+ if (ret == 0 && (dev_is_untrusted(dev) ||
+ IS_ENABLED(CONFIG_IOMMU_BOUNCE_BUFFERS))) {
cookie->bounce_buffers =
io_bounce_buffers_init(dev, domain, iovad);
if (IS_ERR(cookie->bounce_buffers))
diff --git a/drivers/iommu/io-bounce-buffers.c b/drivers/iommu/io-bounce-buffers.c
index 8af8e1546d5f..af8c2a51eeed 100644
--- a/drivers/iommu/io-bounce-buffers.c
+++ b/drivers/iommu/io-bounce-buffers.c
@@ -20,10 +20,20 @@
static unsigned int buffer_pool_size = 1024;
module_param(buffer_pool_size, uint, 0);
+#ifdef CONFIG_IOMMU_BOUNCE_BUFFERS
+// All buffers at most this size will always use bounce buffers if there
+// are slots of the appropriate size available.
+static unsigned int always_bounce_limit = PAGE_SIZE;
+module_param(always_bounce_limit, uint, 0644);
+#else
+static const unsigned int always_bounce_limit;
+#endif
+
struct io_bounce_buffers {
struct iommu_domain *domain;
struct iova_domain *iovad;
unsigned int nid;
+ bool untrusted;
struct io_buffer_manager manager;
};
@@ -56,6 +66,7 @@ struct io_bounce_buffers *io_bounce_buffers_init(struct device *dev,
buffers->domain = domain;
buffers->iovad = iovad;
buffers->nid = dev_to_node(dev);
+ buffers->untrusted = dev_is_untrusted(dev);
return buffers;
}
@@ -201,7 +212,8 @@ bool io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
void *orig_buffer;
int prot;
- if (!io_buffer_manager_find_buffer(&buffers->manager, dma_handle, &info,
+ if (!io_buffer_manager_find_buffer(&buffers->manager, dma_handle,
+ buffers->untrusted, &info,
&orig_buffer, &prot))
return false;
@@ -237,9 +249,9 @@ bool io_bounce_buffers_sync_sg(struct io_bounce_buffers *buffers,
void *orig_buffer;
int prot;
- if (!io_buffer_manager_find_buffer(&buffers->manager,
- sg_dma_address(sgl), &info,
- &orig_buffer, &prot))
+ if (!io_buffer_manager_find_buffer(
+ &buffers->manager, sg_dma_address(sgl), buffers->untrusted,
+ &info, &orig_buffer, &prot))
return false;
// In the non bounce buffer case, iommu_dma_map_sg syncs before setting
@@ -291,7 +303,7 @@ bool io_bounce_buffers_unmap_page(struct io_bounce_buffers *buffers,
return io_buffer_manager_release_buffer(
&buffers->manager, buffers->domain, handle, true,
- io_bounce_buffers_unmap_page_sync, &args);
+ buffers->untrusted, io_bounce_buffers_unmap_page_sync, &args);
}
static void io_bounce_buffers_unmap_sg_sync(struct io_bounce_buffer_info *info,
@@ -318,7 +330,7 @@ bool io_bounce_buffers_unmap_sg(struct io_bounce_buffers *buffers,
return io_buffer_manager_release_buffer(
&buffers->manager, buffers->domain, sg_dma_address(sgl), true,
- io_bounce_buffers_unmap_sg_sync, &args);
+ buffers->untrusted, io_bounce_buffers_unmap_sg_sync, &args);
}
static void io_bounce_buffers_clear_padding(struct io_bounce_buffer_info *info,
@@ -370,7 +382,8 @@ static bool io_bounce_buffers_map_buffer(struct io_bounce_buffers *buffers,
struct sg_table sgt;
size_t mapped;
- if (offset || offset + orig_size < info->size || skiped_sync) {
+ if (buffers->untrusted &&
+ (offset || offset + orig_size < info->size || skiped_sync)) {
// Ensure that nothing is leaked to untrusted devices when
// mapping the buffer by clearing any part of the bounce buffer
// that wasn't already cleared by syncing.
@@ -396,6 +409,15 @@ static bool io_bounce_buffers_map_buffer(struct io_bounce_buffers *buffers,
return mapped >= info->size;
}
+static bool use_bounce_buffer(bool force_bounce, size_t size)
+{
+ if (IS_ENABLED(CONFIG_IOMMU_BOUNCE_BUFFERS) &&
+ size <= always_bounce_limit)
+ return true;
+
+ return force_bounce;
+}
+
bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
struct device *dev, struct page *page,
unsigned long offset, size_t size, int prot,
@@ -404,16 +426,17 @@ bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
{
bool new_buffer, skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
struct io_bounce_buffer_info info;
- bool force_bounce = iova_offset(buffers->iovad, offset | size);
+ bool force_bounce = buffers->untrusted &&
+ iova_offset(buffers->iovad, offset | size);
- if (!force_bounce)
+ if (!use_bounce_buffer(force_bounce, size))
return false;
*handle = DMA_MAPPING_ERROR;
if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, page,
- offset + size, prot, buffers->nid,
- &info, &new_buffer))
- return true;
+ offset + size, prot, force_bounce,
+ buffers->nid, &info, &new_buffer))
+ return force_bounce;
if (!skip_cpu_sync)
io_bounce_buffers_do_sync(buffers, info.bounce_buffer, offset,
@@ -424,8 +447,9 @@ bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
offset, size)) {
io_buffer_manager_release_buffer(&buffers->manager,
buffers->domain, info.iova,
- false, NULL, NULL);
- return true;
+ false, force_bounce, NULL,
+ NULL);
+ return force_bounce;
}
*handle = info.iova + offset;
@@ -447,18 +471,19 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
for_each_sg(sgl, iter, nents, i) {
size += iter->length;
- force_bounce |= iova_offset(buffers->iovad,
- iter->offset | iter->length);
+ if (buffers->untrusted)
+ force_bounce |= iova_offset(
+ buffers->iovad, iter->offset | iter->length);
}
- if (!force_bounce)
+ if (!use_bounce_buffer(force_bounce, size))
return false;
*out_nents = 0;
if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, sgl, size,
- prot, buffers->nid, &info,
- &new_buffer))
- return true;
+ prot, force_bounce, buffers->nid,
+ &info, &new_buffer))
+ return force_bounce;
if (!skip_cpu_sync)
__io_bounce_buffers_sync_sg(buffers, sgl, nents,
@@ -470,8 +495,9 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
0, size)) {
io_buffer_manager_release_buffer(&buffers->manager,
buffers->domain, info.iova,
- false, NULL, NULL);
- return true;
+ false, force_bounce, NULL,
+ NULL);
+ return force_bounce;
}
i = 0;
diff --git a/drivers/iommu/io-buffer-manager.c b/drivers/iommu/io-buffer-manager.c
index 1c69df08603c..0f7f003b53bb 100644
--- a/drivers/iommu/io-buffer-manager.c
+++ b/drivers/iommu/io-buffer-manager.c
@@ -324,7 +324,8 @@ static bool io_buffer_manager_alloc_slot(struct io_buffer_manager *manager,
bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
struct device *dev, void *orig_buffer,
- size_t size, int prot, unsigned int nid,
+ size_t size, int prot, bool require_bounce,
+ unsigned int nid,
struct io_bounce_buffer_info *info,
bool *new_buffer)
{
@@ -336,6 +337,9 @@ bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
nid, info, new_buffer))
return true;
+ if (!require_bounce)
+ return false;
+
node = kzalloc(sizeof(*node), GFP_ATOMIC);
if (!node)
return false;
@@ -401,7 +405,7 @@ static bool __io_buffer_manager_find_slot(struct io_buffer_manager *manager,
}
bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
- dma_addr_t handle,
+ dma_addr_t handle, bool may_use_fallback,
struct io_bounce_buffer_info *info,
void **orig_buffer, int *prot)
{
@@ -415,7 +419,8 @@ bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
*orig_buffer = slot->orig_buffer;
*prot = slot->prot;
return true;
- }
+ } else if (!may_use_fallback)
+ return false;
spin_lock_irqsave(&manager->fallback_lock, flags);
node = find_fallback_node(&manager->fallback_buffers, handle);
@@ -433,7 +438,8 @@ bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
struct iommu_domain *domain,
dma_addr_t handle, bool inited,
- prerelease_cb cb, void *ctx)
+ bool may_use_fallback, prerelease_cb cb,
+ void *ctx)
{
struct io_buffer_slot *slot, **cache;
struct io_buffer_pool *pool;
@@ -472,7 +478,8 @@ bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
spin_unlock_irqrestore(&pool->lock, flags);
return true;
- }
+ } else if (!may_use_fallback)
+ return false;
spin_lock_irqsave(&manager->fallback_lock, flags);
node = find_fallback_node(&manager->fallback_buffers, handle);
diff --git a/drivers/iommu/io-buffer-manager.h b/drivers/iommu/io-buffer-manager.h
index 2aa3b9afcb3d..3d32f9366536 100644
--- a/drivers/iommu/io-buffer-manager.h
+++ b/drivers/iommu/io-buffer-manager.h
@@ -57,12 +57,13 @@ struct io_bounce_buffer_info {
bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
struct device *dev, void *orig_buffer,
- size_t size, int prot, unsigned int nid,
+ size_t size, int prot, bool require_bounce,
+ unsigned int nid,
struct io_bounce_buffer_info *info,
bool *new_buffer);
bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
- dma_addr_t handle,
+ dma_addr_t handle, bool may_use_fallback,
struct io_bounce_buffer_info *info,
void **orig_buffer, int *prot);
@@ -72,7 +73,8 @@ typedef void (*prerelease_cb)(struct io_bounce_buffer_info *info, int prot,
bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
struct iommu_domain *domain,
dma_addr_t handle, bool inited,
- prerelease_cb cb, void *ctx);
+ bool may_use_fallback, prerelease_cb cb,
+ void *ctx);
int io_buffer_manager_init(struct io_buffer_manager *manager,
struct device *dev, struct iova_domain *iovad,
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 944fd491d94f..70bed650d5d1 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -52,6 +52,8 @@ void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
u64 __iommu_dma_limit(struct iommu_domain *domain,
struct device *dev, u64 mask);
+bool dev_is_untrusted(struct device *dev);
+
#else /* CONFIG_IOMMU_DMA */
struct iommu_domain;
--
2.32.0.605.g8dce9f2422-goog
On Fri, 2021-08-06 at 19:34 +0900, David Stevens wrote:
Hi David,
I'm currently looking into converting s390 from our custom IOMMU based
DMA API implementation to using dma-iommu.c. We're always using an
IOMMU for PCI devices even when doing pass-through to guests (under
both the KVM and z/VM hypervisors). In this case I/O TLB flushes, which
we use to do the shadowing of the guest I/O translations, are
relatively expensive. I'm thus very interested in your work. I've tried
rebasing it on v5.18 and got it to compile but didn't get DMA to work,
though it seems to partially work, as I don't get probe failures unlike
with a completely broken DMA API. Since I might very well have screwed
up the rebase, and my DMA API conversion is experimental too, I was
wondering if you're still working on this and might have a current
version I could experiment with?
Thanks,
Niklas
On Tue, May 24, 2022 at 9:27 PM Niklas Schnelle <[email protected]> wrote:
> Hi David,
>
> I'm currently looking into converting s390 from our custom IOMMU based
> DMA API implementation to using dma-iommu.c. We're always using an
> IOMMU for PCI devices even when doing pass-through to guests (under
> both the KVM and z/VM hypervisors). In this case I/O TLB flushes, which
> we use to do the shadowing of the guest I/O translations, are
> relatively expensive. I'm thus very interested in your work. I've tried
> rebasing it on v5.18 and got it to compile but didn't get DMA to work
> though it seems to partially work as I don't get probe failures unlike
> with a completely broken DMA API. Since I might have very well screwed
> up the rebase and my DMA API conversion is experimental too I was
> wondering if you're still working on this and might have a current
> version I could experiment with?

Unfortunately I don't have anything more recent to share. I've come
across some performance issues caused by pathological usage patterns
in internal usage, but I haven't seen any correctness issues. I'm
hoping that I'll be able to address the performance issues and send a
rebased series within the next month or so.

It's definitely possible that this series has some bugs. I've tested
it on a range of chromebooks and their various hardware and drivers,
but that's still all relatively normal x86_64/arm64. If your hardware
is more particular about its DMA, this series might be missing
something.

-David

On Fri, 2022-05-27 at 10:25 +0900, David Stevens wrote:
> Unfortunately I don't have anything more recent to share. I've come
> across some performance issues caused by pathological usage patterns
> in internal usage, but I haven't seen any correctness issues. I'm
> hoping that I'll be able to address the performance issues and send a
> rebased series within the next month or so.
>
> It's definitely possible that this series has some bugs. I've tested
> it on a range of chromebooks and their various hardware and drivers,
> but that's still all relatively normal x86_64/arm64. If your hardware
> is more particular about its DMA, this series might be missing
> something.
>
> -David

Hi David,

Thanks for the answer. The only unusual thing about our DMA is that we
only do 64-bit DMA and IOVAs are always >2^32. I don't think I
triggered a bug in your code though; rather, I think I made some mistake
in the rebase onto 5.18, as some of the APIs changed a bit. I'm out next
week but may try it again and possibly just test on x86_64 if it
doesn't work on s390. If you have anything new I'd be interested to
hear of course. Also, could you say anything more about the pathological
usage patterns?

Thanks,
Niklas

On Fri, Jun 3, 2022 at 11:53 PM Niklas Schnelle <[email protected]> wrote:
> Hi David,
>
> Thanks for the answer. The only unusual thing about our DMA is that we
> only do 64 bit DMA and IOVAs are always >2^32. I don't think I
> triggered a bug in your code though, rather I think I made some mistake
> in the rebase onto 5.18 as some of the APIs changed a bit. I'm out next
> week but may try it again and possibly just test on x86_64 if it
> doesn't work on s390. If you have anything new I'd be interested to
> hear of course. Also could you say anything more about the pathological
> usage patterns?

The problem with this implementation is that if you fall outside the
max number/size for the bounce buffer pools, then DMA performance can
fall off a cliff. Although those max parameters are tunable, it's
always possible to construct a workload that falls outside of the
expected bounds. I think this can be addressed by adding a fallback
pooling structure. It won't necessarily be as performant as the
primary buffer pools, but I think it should work reasonably well for
usage patterns that fall outside the primary buffer pools.

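As a rough illustration of the cliff described above (this is not code
from the series), here is a minimal user-space sketch of power-of-2 size
classes with a fixed number of slots each. The constants MIN_SHIFT,
NUM_CLASSES and SLOTS_PER_CLASS, the struct pool layout, and both helper
names are made up for the example, and it assumes a full class does not
borrow from a larger one. Any request that is too large, or that arrives
when its class is full, misses the pool and would have to take the slower
single-use path.

```c
#include <assert.h>
#include <stddef.h>

/* All three constants are assumptions chosen for illustration only. */
#define MIN_SHIFT	12	/* smallest class holds up to 4 KiB */
#define NUM_CLASSES	4	/* classes: 4K, 8K, 16K, 32K */
#define SLOTS_PER_CLASS	2	/* deliberately tiny */

/* Hypothetical pool state: number of slots in use per size class. */
struct pool {
	unsigned int in_use[NUM_CLASSES];
};

/* Return the smallest class that fits size, or -1 if none does. */
int size_class(size_t size)
{
	for (int i = 0; i < NUM_CLASSES; i++)
		if (size <= ((size_t)1 << (MIN_SHIFT + i)))
			return i;
	return -1;
}

/*
 * Return 1 if the request got a pooled slot, 0 if it falls outside the
 * pool (too large, or class exhausted) and needs the slow path.
 */
int pool_try_alloc(struct pool *p, size_t size)
{
	int c;

	assert(p != NULL);
	c = size_class(size);
	if (c < 0 || p->in_use[c] >= SLOTS_PER_CLASS)
		return 0;
	p->in_use[c]++;
	return 1;
}
```

With these made-up limits, a third concurrent 4 KiB request or any single
request above 32 KiB misses the pool, which is the kind of workload a
fallback pooling structure would have to absorb.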
-David
On Fri, 2022-05-27 at 10:25 +0900, David Stevens wrote:
> Unfortunately I don't have anything more recent to share. I've come
> across some performance issues caused by pathological usage patterns
> in internal usage, but I haven't seen any correctness issues. I'm
> hoping that I'll be able to address the performance issues and send a
> rebased series within the next month or so.
>
> It's definitely possible that this series has some bugs. I've tested
> it on a range of chromebooks and their various hardware and drivers,
> but that's still all relatively normal x86_64/arm64. If your hardware
> is more particular about its DMA, this series might be missing
> something.
>
> -David

Hi David,

I finally got around to trying this again. This time I managed to get
it working and figure out what was going wrong. The problem was with
the call to iommu_dma_alloc_iova() in io_buffer_manager_init(). As this
call happens during IOMMU initialization, dma_get_mask(dev) is used
before the driver calls dma_set_mask(_and_coherent)(), so it still
returns the default mask of DMA_BIT_MASK(32) instead of what the device
really supports. This breaks s390 because our IOMMU currently only
supports apertures starting at an IOVA >= 2^32. For testing I worked
around this by just passing DMA_BIT_MASK(64) instead, but of course
that's not a proper fix. With that in place your patches work on top of
my still experimental conversion to use dma-iommu.c on s390.
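
To make the mask problem concrete, here is a small user-space sketch.
DMA_BIT_MASK() is written the way the kernel defines it in
<linux/dma-mapping.h>; iova_range_usable() is a hypothetical helper, not
a kernel function, and the aperture start of exactly 2^32 in the usage
note is an assumed example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the kernel's DMA_BIT_MASK() macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Hypothetical helper: returns true if at least one IOVA lies both
 * inside the aperture [start, end] and at or below dma_limit.
 */
bool iova_range_usable(uint64_t start, uint64_t end, uint64_t dma_limit)
{
	assert(start <= end);
	return start <= dma_limit;
}
```

With an aperture starting at 1ULL << 32, a limit of DMA_BIT_MASK(32)
leaves no usable IOVA at all, while DMA_BIT_MASK(64) does, which matches
the workaround described above.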

I can also already confirm that this gives a similar CPU load
(especially steal time) reduction on our z/VM hypervisor, which does I/O
translation table shadowing much like your virtio-iommu test. It also
helps the performance of my DMA API rework, which sadly still lags
behind our current s390 DMA API implementation. I suspect that is
because the lazy unmapping used by dma-iommu.c tries to do the
unmapping via a timer in the background, while our current approach does
them all at once when wrapping around the IOVA space. The latter, I
suspect, works better when I/O table shadowing in the hypervisor is
serialized. So to summarize, for s390 something like your series would
be of significant interest.

Best regards,
Niklas