2018-08-30 17:22:23

by Ezequiel Garcia

[permalink] [raw]
Subject: [RFC 0/3] Introduce usb_{alloc,free}_noncoherent API

Following the discussion on PWC [1] and UVC [2] drivers, where
use non-consistent mappings for the URB transfer buffers was
shown to improve transfer speed significantly, here's a proposal
for a non-coherent USB helpers.

With this pachset, it's possible to get 360x288 raw analog video
using stk1160 and a AM335x Beaglebone Black board. This isn't
possible in mainline, for the same reasons Matwey has explained [1].

First patch is a hack, obviously incomplete, to add support for
non-consistent mappings on ARM.

The second patch introduces the usb_{alloc,free}_noncoherent API,
while the third patch is an example on stk1160.

I'm sending this patchset as RFC, just to get the ball rolling.

[1] https://lkml.org/lkml/2018/8/21/663
[2] https://lkml.org/lkml/2018/6/27/188

Ezequiel Garcia (3):
HACK: ARM: dma-mapping: Get writeback memory for non-consistent
mappings
USB: core: Add non-coherent buffer allocation helpers
stk1160: Use non-coherent buffers for USB transfers

arch/arm/include/asm/pgtable.h | 3 ++
arch/arm/mm/dma-mapping.c | 9 ++--
drivers/media/usb/stk1160/stk1160-video.c | 22 +++------
drivers/usb/core/buffer.c | 29 +++++++-----
drivers/usb/core/hcd.c | 5 +-
drivers/usb/core/usb.c | 56 ++++++++++++++++++++++-
include/linux/usb.h | 5 ++
include/linux/usb/hcd.h | 4 +-
8 files changed, 97 insertions(+), 36 deletions(-)

--
2.18.0



2018-08-30 17:22:23

by Ezequiel Garcia

[permalink] [raw]
Subject: [RFC 1/3] HACK: ARM: dma-mapping: Get writeback memory for non-consistent mappings

This is obviously a hack.

Signed-off-by: Ezequiel Garcia <[email protected]>
---
arch/arm/include/asm/pgtable.h | 3 +++
arch/arm/mm/dma-mapping.c | 9 ++++++---
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index a757401129f9..37ddd0d73434 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -122,6 +122,9 @@ extern pgprot_t pgprot_s2_device;
#define pgprot_writecombine(prot) \
__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_BUFFERABLE)

+#define pgprot_writeback(prot) \
+ __pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_WRITEBACK)
+
#define pgprot_stronglyordered(prot) \
__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_UNCACHED)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 66566472c153..11cca7bbb0a8 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -633,9 +633,12 @@ static void __free_from_contiguous(struct device *dev, struct page *page,

static inline pgprot_t __get_dma_pgprot(unsigned long attrs, pgprot_t prot)
{
- prot = (attrs & DMA_ATTR_WRITE_COMBINE) ?
- pgprot_writecombine(prot) :
- pgprot_dmacoherent(prot);
+ if (attrs & DMA_ATTR_WRITE_COMBINE)
+ prot = pgprot_writecombine(prot);
+ else if (attrs & DMA_ATTR_NON_CONSISTENT)
+ prot = pgprot_writeback(prot);
+ else
+ prot = pgprot_dmacoherent(prot);
return prot;
}

--
2.18.0


2018-08-30 17:22:34

by Ezequiel Garcia

[permalink] [raw]
Subject: [RFC 3/3] stk1160: Use non-coherent buffers for USB transfers

Platforms without hardware coherency can benefit a lot
from using non-coherent buffers. Moreover, platforms
with hardware coherency aren't impacted by this change.

For instance, on AM335x, while it's still not possible
to capture full resolution frames, this patch enables
half-resolution frame streams to work.

Signed-off-by: Ezequiel Garcia <[email protected]>
---
drivers/media/usb/stk1160/stk1160-video.c | 22 ++++++----------------
1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/drivers/media/usb/stk1160/stk1160-video.c b/drivers/media/usb/stk1160/stk1160-video.c
index 2811f612820f..aeb4264d1998 100644
--- a/drivers/media/usb/stk1160/stk1160-video.c
+++ b/drivers/media/usb/stk1160/stk1160-video.c
@@ -240,6 +240,9 @@ static void stk1160_process_isoc(struct stk1160 *dev, struct urb *urb)
return;
}

+ dma_sync_single_for_cpu(&urb->dev->dev, urb->transfer_dma,
+ urb->transfer_buffer_length, DMA_FROM_DEVICE);
+
for (i = 0; i < urb->number_of_packets; i++) {
status = urb->iso_frame_desc[i].status;
if (status < 0) {
@@ -379,16 +382,11 @@ void stk1160_free_isoc(struct stk1160 *dev)
urb = dev->isoc_ctl.urb[i];
if (urb) {

- if (dev->isoc_ctl.transfer_buffer[i]) {
-#ifndef CONFIG_DMA_NONCOHERENT
- usb_free_coherent(dev->udev,
+ if (dev->isoc_ctl.transfer_buffer[i])
+ usb_free_noncoherent(dev->udev,
urb->transfer_buffer_length,
dev->isoc_ctl.transfer_buffer[i],
urb->transfer_dma);
-#else
- kfree(dev->isoc_ctl.transfer_buffer[i]);
-#endif
- }
usb_free_urb(urb);
dev->isoc_ctl.urb[i] = NULL;
}
@@ -461,12 +459,8 @@ int stk1160_alloc_isoc(struct stk1160 *dev)
goto free_i_bufs;
dev->isoc_ctl.urb[i] = urb;

-#ifndef CONFIG_DMA_NONCOHERENT
- dev->isoc_ctl.transfer_buffer[i] = usb_alloc_coherent(dev->udev,
+ dev->isoc_ctl.transfer_buffer[i] = usb_alloc_noncoherent(dev->udev,
sb_size, GFP_KERNEL, &urb->transfer_dma);
-#else
- dev->isoc_ctl.transfer_buffer[i] = kmalloc(sb_size, GFP_KERNEL);
-#endif
if (!dev->isoc_ctl.transfer_buffer[i]) {
stk1160_err("cannot alloc %d bytes for tx[%d] buffer\n",
sb_size, i);
@@ -490,11 +484,7 @@ int stk1160_alloc_isoc(struct stk1160 *dev)
urb->interval = 1;
urb->start_frame = 0;
urb->number_of_packets = max_packets;
-#ifndef CONFIG_DMA_NONCOHERENT
urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP;
-#else
- urb->transfer_flags = URB_ISO_ASAP;
-#endif

k = 0;
for (j = 0; j < max_packets; j++) {
--
2.18.0


2018-08-30 17:24:01

by Ezequiel Garcia

[permalink] [raw]
Subject: [RFC 2/3] USB: core: Add non-coherent buffer allocation helpers

As noted recently by Matwey V. Kornilov, using coherent
buffers on platforms _without_ hardware coherency results in
some devices being completely unusable, due to transfers
being too slow.

Moreover, using non-coherent buffers on platforms _with_ hardware
coherency, do not show a significant impact. This has been tested
by Matwey on PWC USB cameras on x86_64 and ARM platforms.
Quoting [1] (where kmalloc-ed buffers use streaming mappings):

"""
[..] average memcpy() data transfer rate (rate) and handler
completion time (time) have been measured when running video stream at
640x480 resolution at 10fps.

x86_64 based system (Intel Core i5-3470). This platform has hardware
coherent DMA support and proposed change doesn't make big difference here.

* kmalloc: rate = (2.0 +- 0.4) GBps
time = (5.0 +- 3.0) usec
* usb_alloc_coherent: rate = (3.4 +- 1.2) GBps
time = (3.5 +- 3.0) usec

armv7l based system (TI AM335x BeagleBone Black @ 300MHz). This platform
has no hardware coherent DMA support. DMA coherence is implemented via
disabled page caching that slows down memcpy() due to memory controller
behaviour.

* kmalloc: rate = (114 +- 5) MBps
time = (84 +- 4) usec
* usb_alloc_coherent: rate = (28.1 +- 0.1) MBps
time = (341 +- 2) usec
""

Introduce a pair of usb_{alloc,free}_noncoherent helper functions,
for drivers that want to use non-coherent transfer buffers.

[1]: https://lkml.org/lkml/2018/8/9/734

Signed-off-by: Ezequiel Garcia <[email protected]>
---
drivers/usb/core/buffer.c | 29 ++++++++++++--------
drivers/usb/core/hcd.c | 5 ++--
drivers/usb/core/usb.c | 56 +++++++++++++++++++++++++++++++++++++--
include/linux/usb.h | 5 ++++
include/linux/usb/hcd.h | 4 +--
5 files changed, 82 insertions(+), 17 deletions(-)

diff --git a/drivers/usb/core/buffer.c b/drivers/usb/core/buffer.c
index 77eef8acff94..1bc9df883337 100644
--- a/drivers/usb/core/buffer.c
+++ b/drivers/usb/core/buffer.c
@@ -119,7 +119,8 @@ void *hcd_buffer_alloc(
struct usb_bus *bus,
size_t size,
gfp_t mem_flags,
- dma_addr_t *dma
+ dma_addr_t *dma,
+ unsigned long attrs
)
{
struct usb_hcd *hcd = bus_to_hcd(bus);
@@ -136,18 +137,22 @@ void *hcd_buffer_alloc(
return kmalloc(size, mem_flags);
}

- for (i = 0; i < HCD_BUFFER_POOLS; i++) {
- if (size <= pool_max[i])
- return dma_pool_alloc(hcd->pool[i], mem_flags, dma);
+ /* Only use pools for coherent buffer requests */
+ if (!attrs) {
+ for (i = 0; i < HCD_BUFFER_POOLS; i++)
+ if (size <= pool_max[i])
+ return dma_pool_alloc(hcd->pool[i],
+ mem_flags, dma);
}
- return dma_alloc_coherent(hcd->self.sysdev, size, dma, mem_flags);
+ return dma_alloc_attrs(hcd->self.sysdev, size, dma, mem_flags, attrs);
}

void hcd_buffer_free(
struct usb_bus *bus,
size_t size,
void *addr,
- dma_addr_t dma
+ dma_addr_t dma,
+ unsigned long attrs
)
{
struct usb_hcd *hcd = bus_to_hcd(bus);
@@ -163,11 +168,13 @@ void hcd_buffer_free(
return;
}

- for (i = 0; i < HCD_BUFFER_POOLS; i++) {
- if (size <= pool_max[i]) {
- dma_pool_free(hcd->pool[i], addr, dma);
- return;
+ if (!attrs) {
+ for (i = 0; i < HCD_BUFFER_POOLS; i++) {
+ if (size <= pool_max[i]) {
+ dma_pool_free(hcd->pool[i], addr, dma);
+ return;
+ }
}
}
- dma_free_coherent(hcd->self.sysdev, size, addr, dma);
+ dma_free_attrs(hcd->self.sysdev, size, addr, dma, attrs);
}
diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 1c21955fe7c0..25303738eb28 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1383,7 +1383,7 @@ static int hcd_alloc_coherent(struct usb_bus *bus,
}

vaddr = hcd_buffer_alloc(bus, size + sizeof(vaddr),
- mem_flags, dma_handle);
+ mem_flags, dma_handle, 0);
if (!vaddr)
return -ENOMEM;

@@ -1416,7 +1416,8 @@ static void hcd_free_coherent(struct usb_bus *bus, dma_addr_t *dma_handle,
if (dir == DMA_FROM_DEVICE)
memcpy(vaddr, *vaddr_handle, size);

- hcd_buffer_free(bus, size + sizeof(vaddr), *vaddr_handle, *dma_handle);
+ hcd_buffer_free(bus, size + sizeof(vaddr),
+ *vaddr_handle, *dma_handle, 0);

*vaddr_handle = vaddr;
*dma_handle = 0;
diff --git a/drivers/usb/core/usb.c b/drivers/usb/core/usb.c
index 623be3174fb3..234ea5ab4bb7 100644
--- a/drivers/usb/core/usb.c
+++ b/drivers/usb/core/usb.c
@@ -858,6 +858,58 @@ int __usb_get_extra_descriptor(char *buffer, unsigned size,
}
EXPORT_SYMBOL_GPL(__usb_get_extra_descriptor);

+/**
+ * usb_alloc_noncoherent - allocate dma-non-coherent buffer
+ * @dev: device the buffer will be used with
+ * @size: requested buffer size
+ * @mem_flags: affect whether allocation may block
+ * @dma: used to return DMA address of buffer
+ *
+ * Return: Either null (indicating no buffer could be allocated), or the
+ * cpu-space pointer to a non-coherent buffer that may be used to perform
+ * DMA to the specified device. Such cpu-space buffers are returned along
+ * with the DMA address (through the pointer provided).
+ *
+ * Note:
+ * These non-conherent buffers are used with URB_NO_xxx_DMA_MAP set in
+ * urb->transfer_flags to avoid using "DMA bounce buffers". When using
+ * this API, you must have the necessary syncs points. If you are unsure
+ * about this, you should be using coherent buffers via usb_alloc_coherent.
+ *
+ * When the buffer is no longer used, free it with usb_free_noncoherent().
+ */
+void *usb_alloc_noncoherent(struct usb_device *dev, size_t size, gfp_t mem_flags,
+ dma_addr_t *dma)
+{
+ if (!dev || !dev->bus)
+ return NULL;
+ return hcd_buffer_alloc(dev->bus, size,
+ mem_flags, dma, DMA_ATTR_NON_CONSISTENT);
+}
+EXPORT_SYMBOL_GPL(usb_alloc_noncoherent);
+
+/**
+ * usb_free_noncoherent - free memory allocated with usb_alloc_noncoherent()
+ * @dev: device the buffer was used with
+ * @size: requested buffer size
+ * @addr: CPU address of buffer
+ * @dma: DMA address of buffer
+ *
+ * This reclaims an I/O buffer, letting it be reused. The memory must have
+ * been allocated using usb_alloc_noncoherent(), and the parameters must match
+ * those provided in that allocation request.
+ */
+void usb_free_noncoherent(struct usb_device *dev, size_t size, void *addr,
+ dma_addr_t dma)
+{
+ if (!dev || !dev->bus)
+ return;
+ if (!addr)
+ return;
+ hcd_buffer_free(dev->bus, size, addr, dma, DMA_ATTR_NON_CONSISTENT);
+}
+EXPORT_SYMBOL_GPL(usb_free_noncoherent);
+
/**
* usb_alloc_coherent - allocate dma-consistent buffer for URB_NO_xxx_DMA_MAP
* @dev: device the buffer will be used with
@@ -886,7 +938,7 @@ void *usb_alloc_coherent(struct usb_device *dev, size_t size, gfp_t mem_flags,
{
if (!dev || !dev->bus)
return NULL;
- return hcd_buffer_alloc(dev->bus, size, mem_flags, dma);
+ return hcd_buffer_alloc(dev->bus, size, mem_flags, dma, 0);
}
EXPORT_SYMBOL_GPL(usb_alloc_coherent);

@@ -908,7 +960,7 @@ void usb_free_coherent(struct usb_device *dev, size_t size, void *addr,
return;
if (!addr)
return;
- hcd_buffer_free(dev->bus, size, addr, dma);
+ hcd_buffer_free(dev->bus, size, addr, dma, 0);
}
EXPORT_SYMBOL_GPL(usb_free_coherent);

diff --git a/include/linux/usb.h b/include/linux/usb.h
index 4cdd515a4385..7fddd6c2a61e 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -1750,6 +1750,11 @@ static inline int usb_urb_dir_out(struct urb *urb)

int usb_urb_ep_type_check(const struct urb *urb);

+void *usb_alloc_noncoherent(struct usb_device *dev, size_t size,
+ gfp_t mem_flags, dma_addr_t *dma);
+void usb_free_noncoherent(struct usb_device *dev, size_t size,
+ void *addr, dma_addr_t dma);
+
void *usb_alloc_coherent(struct usb_device *dev, size_t size,
gfp_t mem_flags, dma_addr_t *dma);
void usb_free_coherent(struct usb_device *dev, size_t size,
diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
index 97e2ddec18b1..41dd5f0acaad 100644
--- a/include/linux/usb/hcd.h
+++ b/include/linux/usb/hcd.h
@@ -486,9 +486,9 @@ int hcd_buffer_create(struct usb_hcd *hcd);
void hcd_buffer_destroy(struct usb_hcd *hcd);

void *hcd_buffer_alloc(struct usb_bus *bus, size_t size,
- gfp_t mem_flags, dma_addr_t *dma);
+ gfp_t mem_flags, dma_addr_t *dma, unsigned long attrs);
void hcd_buffer_free(struct usb_bus *bus, size_t size,
- void *addr, dma_addr_t dma);
+ void *addr, dma_addr_t dma, unsigned long attrs);

/* generic bus glue, needed for host controllers that don't use PCI */
extern irqreturn_t usb_hcd_irq(int irq, void *__hcd);
--
2.18.0


2018-08-30 18:00:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 2/3] USB: core: Add non-coherent buffer allocation helpers

Please don't introduce new DMA_ATTR_NON_CONSISTENT users, it is
a rather horrible interface, and I plan to kill it off rather sooner
than later. I plan to post some patches for a better interface
that can reuse the normal dma_sync_single_* interfaces for ownership
transfers. I can happily include usb in that initial patch set based
on your work here if that helps.

2018-08-30 18:02:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 3/3] stk1160: Use non-coherent buffers for USB transfers

> + dma_sync_single_for_cpu(&urb->dev->dev, urb->transfer_dma,
> + urb->transfer_buffer_length, DMA_FROM_DEVICE);

You can't ue dma_sync_single_for_cpu on non-coherent dma buffers,
which is one of the major issues with them.

2018-08-30 22:13:07

by Ezequiel Garcia

[permalink] [raw]
Subject: Re: [RFC 2/3] USB: core: Add non-coherent buffer allocation helpers

On Thu, 2018-08-30 at 10:58 -0700, Christoph Hellwig wrote:
> Please don't introduce new DMA_ATTR_NON_CONSISTENT users, it is
> a rather horrible interface, and I plan to kill it off rather sooner
> than later. I plan to post some patches for a better interface
> that can reuse the normal dma_sync_single_* interfaces for ownership
> transfers. I can happily include usb in that initial patch set based
> on your work here if that helps.

Please do. Until we have proper allocators that go thru the DMA API,
drivers will have to kmalloc the USB transfer buffers, and have
streaming mappings. Which in turns mean not using IOMMU or CMA.

Regards,
Eze

2018-08-31 06:00:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 2/3] USB: core: Add non-coherent buffer allocation helpers

On Thu, Aug 30, 2018 at 07:11:35PM -0300, Ezequiel Garcia wrote:
> On Thu, 2018-08-30 at 10:58 -0700, Christoph Hellwig wrote:
> > Please don't introduce new DMA_ATTR_NON_CONSISTENT users, it is
> > a rather horrible interface, and I plan to kill it off rather sooner
> > than later. I plan to post some patches for a better interface
> > that can reuse the normal dma_sync_single_* interfaces for ownership
> > transfers. I can happily include usb in that initial patch set based
> > on your work here if that helps.
>
> Please do. Until we have proper allocators that go thru the DMA API,
> drivers will have to kmalloc the USB transfer buffers, and have
> streaming mappings. Which in turns mean not using IOMMU or CMA.

dma_map_page will of course use the iommu.

2018-08-31 06:53:48

by Tomasz Figa

[permalink] [raw]
Subject: Re: [RFC 2/3] USB: core: Add non-coherent buffer allocation helpers

On Fri, Aug 31, 2018 at 2:50 PM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, Aug 30, 2018 at 07:11:35PM -0300, Ezequiel Garcia wrote:
> > On Thu, 2018-08-30 at 10:58 -0700, Christoph Hellwig wrote:
> > > Please don't introduce new DMA_ATTR_NON_CONSISTENT users, it is
> > > a rather horrible interface, and I plan to kill it off rather sooner
> > > than later. I plan to post some patches for a better interface
> > > that can reuse the normal dma_sync_single_* interfaces for ownership
> > > transfers. I can happily include usb in that initial patch set based
> > > on your work here if that helps.
> >
> > Please do. Until we have proper allocators that go thru the DMA API,
> > drivers will have to kmalloc the USB transfer buffers, and have
> > streaming mappings. Which in turns mean not using IOMMU or CMA.
>
> dma_map_page will of course use the iommu.

Sure, dma_map*() will, but using kmalloc() defeats (half of) the
purpose of it, since contiguous memory would be allocated
unnecessarily, risking failures due to fragmentation.

Best regards,
Tomasz

2018-09-07 08:56:02

by Tomasz Figa

[permalink] [raw]
Subject: Re: [RFC 3/3] stk1160: Use non-coherent buffers for USB transfers

On Fri, Aug 31, 2018 at 2:59 AM Christoph Hellwig <[email protected]> wrote:
>
> > + dma_sync_single_for_cpu(&urb->dev->dev, urb->transfer_dma,
> > + urb->transfer_buffer_length, DMA_FROM_DEVICE);
>
> You can't ue dma_sync_single_for_cpu on non-coherent dma buffers,
> which is one of the major issues with them.

It's not an issue of DMA API, but just an API mismatch. By design,
memory allocated for device (e.g. by DMA API) doesn't have to be
physically contiguous, while dma_*_single() API expects a _single_,
physically contiguous region of memory.

We need a way to allocate non-coherent memory using DMA API to handle
(on USB example, but applies to virtually any class of devices doing
DMA):
- DMA address range limitations (e.g. dma_mask) - while a USB HCD
driver is normally aware of those, USB device driver should have no
idea,
- memory mapping capability === whether contiguous memory or a set of
random pages can be allocated - this is a platform integration detail,
which even a USB HCD driver may not be aware of, if a SoC IOMMU is
just stuffed between the bus and HCD,
- platform coherency specifics - there are practical scenarios when
on a coherent-by-default system it's more efficient to allocate
non-coherent memory and manage caches explicitly to avoid the costs of
cache snooping.

If DMA_ATTR_NON_CONSISTENT is not the right way to do it, there should
be definitely a new API introduced, coupled closely to DMA API
implementation on given platform, since it's the only place which can
solve all the constraints above.

Best regards,
Tomasz

2018-10-31 01:57:39

by Tomasz Figa

[permalink] [raw]
Subject: Re: [RFC 2/3] USB: core: Add non-coherent buffer allocation helpers

Hi Christoph and everyone,

On Fri, Aug 31, 2018 at 3:51 PM Tomasz Figa <[email protected]> wrote:
>
> On Fri, Aug 31, 2018 at 2:50 PM Christoph Hellwig <[email protected]> wrote:
> >
> > On Thu, Aug 30, 2018 at 07:11:35PM -0300, Ezequiel Garcia wrote:
> > > On Thu, 2018-08-30 at 10:58 -0700, Christoph Hellwig wrote:
> > > > Please don't introduce new DMA_ATTR_NON_CONSISTENT users, it is
> > > > a rather horrible interface, and I plan to kill it off rather sooner
> > > > than later. I plan to post some patches for a better interface
> > > > that can reuse the normal dma_sync_single_* interfaces for ownership
> > > > transfers. I can happily include usb in that initial patch set based
> > > > on your work here if that helps.
> > >
> > > Please do. Until we have proper allocators that go thru the DMA API,
> > > drivers will have to kmalloc the USB transfer buffers, and have
> > > streaming mappings. Which in turns mean not using IOMMU or CMA.
> >
> > dma_map_page will of course use the iommu.
>
> Sure, dma_map*() will, but using kmalloc() defeats (half of) the
> purpose of it, since contiguous memory would be allocated
> unnecessarily, risking failures due to fragmentation.

Have we reached a conclusion here?

It sounds like it's a quite significant problem, at least for some of
the camera (media) devices over there and there are people interested
in solving it, so all we need here is a conclusion on how to do it. :)

Best regards,
Tomasz