2021-01-28 15:04:07

by Christoph Hellwig

Subject: add a new dma_alloc_noncontiguous API

Hi all,

this series adds the new noncontiguous DMA allocation API requested by
various media driver maintainers.
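
For quick reference, the new interface boils down to these prototypes
(copied from patch 3, where they are documented):

	struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
			dma_addr_t *dma_handle, enum dma_data_direction dir,
			gfp_t gfp);
	void dma_free_noncontiguous(struct device *dev, size_t size,
			struct sg_table *sgt, dma_addr_t dma_handle,
			enum dma_data_direction dir);
	void *dma_vmap_noncontiguous(struct device *dev, size_t size,
			struct sg_table *sgt);
	void dma_vunmap_noncontiguous(struct device *dev, void *vaddr);
	int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
			size_t size, struct sg_table *sgt);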


2021-01-28 15:06:11

by Christoph Hellwig

Subject: [PATCH 1/6] dma-mapping: remove the {alloc,free}_noncoherent methods

It turns out allowing non-contiguous allocations here was a rather bad
idea, as we'll now need to define ways to get the pages for mmapping
or dma_buf sharing. Revert this change and stick to the original
concept. A different API for the use case of non-contiguous allocations
will be added back later.
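
For reference, this is what usage of the remaining non-coherent
interface looks like after the revert (a minimal sketch; dev, the
buffer size and the transfer direction are made up for illustration):

	dma_addr_t dma;
	void *buf;

	buf = dma_alloc_noncoherent(dev, SZ_64K, &dma, DMA_FROM_DEVICE,
			GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* hand the buffer to the device ... */
	dma_sync_single_for_device(dev, dma, SZ_64K, DMA_FROM_DEVICE);

	/* ... and claim it back before the CPU reads it */
	dma_sync_single_for_cpu(dev, dma, SZ_64K, DMA_FROM_DEVICE);

	dma_free_noncoherent(dev, SZ_64K, buf, dma, DMA_FROM_DEVICE);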

Signed-off-by: Christoph Hellwig <[email protected]>
---
Documentation/core-api/dma-api.rst | 64 ++++++++++--------------------
drivers/iommu/dma-iommu.c | 30 --------------
include/linux/dma-map-ops.h | 5 ---
include/linux/dma-mapping.h | 17 ++++++--
kernel/dma/mapping.c | 40 -------------------
5 files changed, 35 insertions(+), 121 deletions(-)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 75cb757bbff00a..e6d23f117308df 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -528,16 +528,14 @@ an I/O device, you should not be using this part of the API.

::

- void *
- dma_alloc_noncoherent(struct device *dev, size_t size,
- dma_addr_t *dma_handle, enum dma_data_direction dir,
- gfp_t gfp)
+ struct page *
+ dma_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle,
+ enum dma_data_direction dir, gfp_t gfp)

-This routine allocates a region of <size> bytes of consistent memory. It
-returns a pointer to the allocated region (in the processor's virtual address
-space) or NULL if the allocation failed. The returned memory may or may not
-be in the kernel direct mapping. Drivers must not call virt_to_page on
-the returned memory region.
+This routine allocates a region of <size> bytes of non-coherent memory. It
+returns a pointer to the first struct page for the region, or NULL if the
+allocation failed. The resulting struct page can be used for everything a
+struct page is suitable for.

It also returns a <dma_handle> which may be cast to an unsigned integer the
same width as the bus and given to the device as the DMA address base of
@@ -558,51 +556,33 @@ reused.
::

void
- dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr,
+ dma_free_pages(struct device *dev, size_t size, struct page *page,
dma_addr_t dma_handle, enum dma_data_direction dir)

-Free a region of memory previously allocated using dma_alloc_noncoherent().
-dev, size and dma_handle and dir must all be the same as those passed into
-dma_alloc_noncoherent(). cpu_addr must be the virtual address returned by
-dma_alloc_noncoherent().
+Free a region of memory previously allocated using dma_alloc_pages().
+dev, size, dma_handle and dir must all be the same as those passed into
+dma_alloc_pages(). page must be the pointer returned by dma_alloc_pages().

::

- struct page *
- dma_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle,
- enum dma_data_direction dir, gfp_t gfp)
-
-This routine allocates a region of <size> bytes of non-coherent memory. It
-returns a pointer to first struct page for the region, or NULL if the
-allocation failed. The resulting struct page can be used for everything a
-struct page is suitable for.
-
-It also returns a <dma_handle> which may be cast to an unsigned integer the
-same width as the bus and given to the device as the DMA address base of
-the region.
-
-The dir parameter specified if data is read and/or written by the device,
-see dma_map_single() for details.
-
-The gfp parameter allows the caller to specify the ``GFP_`` flags (see
-kmalloc()) for the allocation, but rejects flags used to specify a memory
-zone such as GFP_DMA or GFP_HIGHMEM.
+ void *
+ dma_alloc_noncoherent(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, enum dma_data_direction dir,
+ gfp_t gfp)

-Before giving the memory to the device, dma_sync_single_for_device() needs
-to be called, and before reading memory written by the device,
-dma_sync_single_for_cpu(), just like for streaming DMA mappings that are
-reused.
+This routine is a convenient wrapper around dma_alloc_pages that returns the
+kernel virtual address for the allocated memory instead of the page structure.

::

void
- dma_free_pages(struct device *dev, size_t size, struct page *page,
+ dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr,
dma_addr_t dma_handle, enum dma_data_direction dir)

-Free a region of memory previously allocated using dma_alloc_pages().
-dev, size and dma_handle and dir must all be the same as those passed into
-dma_alloc_noncoherent(). page must be the pointer returned by
-dma_alloc_pages().
+Free a region of memory previously allocated using dma_alloc_noncoherent().
+dev, size, dma_handle and dir must all be the same as those passed into
+dma_alloc_noncoherent(). cpu_addr must be the virtual address returned by
+dma_alloc_noncoherent().

::

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 4078358ed66ea8..255533faf90599 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1197,34 +1197,6 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
return cpu_addr;
}

-#ifdef CONFIG_DMA_REMAP
-static void *iommu_dma_alloc_noncoherent(struct device *dev, size_t size,
- dma_addr_t *handle, enum dma_data_direction dir, gfp_t gfp)
-{
- if (!gfpflags_allow_blocking(gfp)) {
- struct page *page;
-
- page = dma_common_alloc_pages(dev, size, handle, dir, gfp);
- if (!page)
- return NULL;
- return page_address(page);
- }
-
- return iommu_dma_alloc_remap(dev, size, handle, gfp | __GFP_ZERO,
- PAGE_KERNEL, 0);
-}
-
-static void iommu_dma_free_noncoherent(struct device *dev, size_t size,
- void *cpu_addr, dma_addr_t handle, enum dma_data_direction dir)
-{
- __iommu_dma_unmap(dev, handle, size);
- __iommu_dma_free(dev, size, cpu_addr);
-}
-#else
-#define iommu_dma_alloc_noncoherent NULL
-#define iommu_dma_free_noncoherent NULL
-#endif /* CONFIG_DMA_REMAP */
-
static int iommu_dma_mmap(struct device *dev, struct vm_area_struct *vma,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
unsigned long attrs)
@@ -1295,8 +1267,6 @@ static const struct dma_map_ops iommu_dma_ops = {
.free = iommu_dma_free,
.alloc_pages = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
- .alloc_noncoherent = iommu_dma_alloc_noncoherent,
- .free_noncoherent = iommu_dma_free_noncoherent,
.mmap = iommu_dma_mmap,
.get_sgtable = iommu_dma_get_sgtable,
.map_page = iommu_dma_map_page,
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 70fcd0f610ea48..11e02537b9e01b 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -22,11 +22,6 @@ struct dma_map_ops {
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
dma_addr_t dma_handle, enum dma_data_direction dir);
- void *(*alloc_noncoherent)(struct device *dev, size_t size,
- dma_addr_t *dma_handle, enum dma_data_direction dir,
- gfp_t gfp);
- void (*free_noncoherent)(struct device *dev, size_t size, void *vaddr,
- dma_addr_t dma_handle, enum dma_data_direction dir);
int (*mmap)(struct device *, struct vm_area_struct *,
void *, dma_addr_t, size_t, unsigned long attrs);

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2e49996a8f391a..fbfa3f5abd9498 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -263,10 +263,19 @@ struct page *dma_alloc_pages(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp);
void dma_free_pages(struct device *dev, size_t size, struct page *page,
dma_addr_t dma_handle, enum dma_data_direction dir);
-void *dma_alloc_noncoherent(struct device *dev, size_t size,
- dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp);
-void dma_free_noncoherent(struct device *dev, size_t size, void *vaddr,
- dma_addr_t dma_handle, enum dma_data_direction dir);
+
+static inline void *dma_alloc_noncoherent(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
+{
+ struct page *page = dma_alloc_pages(dev, size, dma_handle, dir, gfp);
+ return page ? page_address(page) : NULL;
+}
+
+static inline void dma_free_noncoherent(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_handle, enum dma_data_direction dir)
+{
+ dma_free_pages(dev, size, virt_to_page(vaddr), dma_handle, dir);
+}

static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index f87a89d086544b..68992e35c8c3a7 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -515,46 +515,6 @@ void dma_free_pages(struct device *dev, size_t size, struct page *page,
}
EXPORT_SYMBOL_GPL(dma_free_pages);

-void *dma_alloc_noncoherent(struct device *dev, size_t size,
- dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
-{
- const struct dma_map_ops *ops = get_dma_ops(dev);
- void *vaddr;
-
- if (!ops || !ops->alloc_noncoherent) {
- struct page *page;
-
- page = dma_alloc_pages(dev, size, dma_handle, dir, gfp);
- if (!page)
- return NULL;
- return page_address(page);
- }
-
- size = PAGE_ALIGN(size);
- vaddr = ops->alloc_noncoherent(dev, size, dma_handle, dir, gfp);
- if (vaddr)
- debug_dma_map_page(dev, virt_to_page(vaddr), 0, size, dir,
- *dma_handle);
- return vaddr;
-}
-EXPORT_SYMBOL_GPL(dma_alloc_noncoherent);
-
-void dma_free_noncoherent(struct device *dev, size_t size, void *vaddr,
- dma_addr_t dma_handle, enum dma_data_direction dir)
-{
- const struct dma_map_ops *ops = get_dma_ops(dev);
-
- if (!ops || !ops->free_noncoherent) {
- dma_free_pages(dev, size, virt_to_page(vaddr), dma_handle, dir);
- return;
- }
-
- size = PAGE_ALIGN(size);
- debug_dma_unmap_page(dev, dma_handle, size, dir);
- ops->free_noncoherent(dev, size, vaddr, dma_handle, dir);
-}
-EXPORT_SYMBOL_GPL(dma_free_noncoherent);
-
int dma_supported(struct device *dev, u64 mask)
{
const struct dma_map_ops *ops = get_dma_ops(dev);
--
2.29.2

2021-01-28 15:06:13

by Christoph Hellwig

Subject: [PATCH 3/6] dma-mapping: add a dma_alloc_noncontiguous API

Add a new API that returns a potentially virtually non-contiguous sg_table
and a DMA address. This API is only properly implemented for dma-iommu
and will simply return a contiguous chunk as a fallback.

The intent is that media drivers can use this API if either:

- no kernel mapping or only temporary kernel mappings are required.
  That is, as a better replacement for DMA_ATTR_NO_KERNEL_MAPPING.
- a kernel mapping is required for cached and DMA mapped pages, but
  the driver also needs the pages to e.g. map them to userspace.
  In that sense it is a replacement for some aspects of the recently
  removed and never fully implemented DMA_ATTR_NON_CONSISTENT.
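
A rough sketch of the intended usage (dev, size and the transfer
direction are placeholders, error handling trimmed to the essentials):

	struct sg_table *sgt;
	dma_addr_t dma;
	void *vaddr;

	sgt = dma_alloc_noncontiguous(dev, size, &dma, DMA_FROM_DEVICE,
			GFP_KERNEL);
	if (!sgt)
		return -ENOMEM;

	/* optional kernel mapping for drivers that need CPU access */
	vaddr = dma_vmap_noncontiguous(dev, size, sgt);

	dma_sync_sgtable_for_device(dev, sgt, DMA_FROM_DEVICE);
	/* ... device writes to the buffer ... */
	dma_sync_sgtable_for_cpu(dev, sgt, DMA_FROM_DEVICE);

	if (vaddr)
		dma_vunmap_noncontiguous(dev, vaddr);
	dma_free_noncontiguous(dev, size, sgt, dma, DMA_FROM_DEVICE);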

Signed-off-by: Christoph Hellwig <[email protected]>
---
Documentation/core-api/dma-api.rst | 72 +++++++++++++++++++++++
include/linux/dma-map-ops.h | 20 +++++++
include/linux/dma-mapping.h | 34 +++++++++++
kernel/dma/mapping.c | 92 ++++++++++++++++++++++++++++++
4 files changed, 218 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 157a474ae54416..1dd676a8217137 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -594,6 +594,78 @@ dev, size, dma_handle and dir must all be the same as those passed into
dma_alloc_noncoherent(). cpu_addr must be the virtual address returned by
dma_alloc_noncoherent().

+::
+
+ struct sg_table *
+ dma_alloc_noncontiguous(struct device *dev, size_t size,
+ dma_addr_t *dma_handle,
+ enum dma_data_direction dir, gfp_t gfp)
+
+This routine allocates <size> bytes of non-coherent and possibly non-contiguous
+memory. It returns a pointer to a struct sg_table that describes the allocated
+memory, or NULL if the allocation failed. The resulting memory can be used for
+everything a struct page is suitable for.
+
+It also returns a <dma_handle> which may be cast to an unsigned integer the
+same width as the bus and given to the device as the DMA address base of
+the region.
+
+The dir parameter specifies whether data is read and/or written by the device,
+see dma_map_single() for details.
+
+The gfp parameter allows the caller to specify the ``GFP_`` flags (see
+kmalloc()) for the allocation, but rejects flags used to specify a memory
+zone such as GFP_DMA or GFP_HIGHMEM.
+
+Before giving the memory to the device, dma_sync_sgtable_for_device() needs
+to be called, and before reading memory written by the device,
+dma_sync_sgtable_for_cpu(), just like for streaming DMA mappings that are
+reused.
+
+::
+
+ void
+ dma_free_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt, dma_addr_t dma_handle,
+ enum dma_data_direction dir)
+
+Free memory previously allocated using dma_alloc_noncontiguous(). dev, size,
+dma_handle and dir must all be the same as those passed into
+dma_alloc_noncontiguous(). sgt must be the pointer returned by
+dma_alloc_noncontiguous().
+
+::
+
+ void *
+ dma_vmap_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt)
+
+Return a contiguous kernel mapping for an allocation returned from
+dma_alloc_noncontiguous(). dev and size must be the same as those passed into
+dma_alloc_noncontiguous(). sgt must be the pointer returned by
+dma_alloc_noncontiguous().
+
+::
+
+ void
+ dma_vunmap_noncontiguous(struct device *dev, void *vaddr)
+
+Unmap a kernel mapping returned by dma_vmap_noncontiguous(). dev must be the
+same as the one passed into dma_alloc_noncontiguous(). vaddr must be the pointer
+returned by dma_vmap_noncontiguous().
+
+
+::
+
+ int
+ dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
+ size_t size, struct sg_table *sgt)
+
+Map an allocation returned from dma_alloc_noncontiguous() into a user address
+space. dev and size must be the same as those passed into
+dma_alloc_noncontiguous(). sgt must be the pointer returned by
+dma_alloc_noncontiguous().
+
::

int
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 11e02537b9e01b..82efa36d8b09c4 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -22,6 +22,12 @@ struct dma_map_ops {
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
dma_addr_t dma_handle, enum dma_data_direction dir);
+ struct sg_table *(*alloc_noncontiguous)(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, enum dma_data_direction dir,
+ gfp_t gfp);
+ void (*free_noncontiguous)(struct device *dev, size_t size,
+ struct sg_table *sgt, dma_addr_t dma_handle,
+ enum dma_data_direction dir);
int (*mmap)(struct device *, struct vm_area_struct *,
void *, dma_addr_t, size_t, unsigned long attrs);

@@ -198,6 +204,20 @@ static inline int dma_mmap_from_global_coherent(struct vm_area_struct *vma,
}
#endif /* CONFIG_DMA_DECLARE_COHERENT */

+/*
+ * This is the actual return value from the ->alloc_noncontiguous method.
+ * The users of the DMA API should only care about the sg_table, but to make
+ * the DMA-API internal vmapping and freeing easier we stash away the page
+ * array as well (except for the fallback case). This can go away any time,
+ * e.g. when a vmap-variant that takes a scatterlist comes along.
+ */
+struct dma_sgt_handle {
+ struct sg_table sgt;
+ struct page **pages;
+};
+#define sgt_handle(sgt) \
+ container_of((sgt), struct dma_sgt_handle, sgt)
+
int dma_common_get_sgtable(struct device *dev, struct sg_table *sgt,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 4977a748cb9483..8694c938e7c271 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,16 @@ u64 dma_get_required_mask(struct device *dev);
size_t dma_max_mapping_size(struct device *dev);
bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
unsigned long dma_get_merge_boundary(struct device *dev);
+struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp);
+void dma_free_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt, dma_addr_t dma_handle,
+ enum dma_data_direction dir);
+void *dma_vmap_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt);
+void dma_vunmap_noncontiguous(struct device *dev, void *vaddr);
+int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
+ size_t size, struct sg_table *sgt);
#else /* CONFIG_HAS_DMA */
static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
@@ -257,6 +267,30 @@ static inline unsigned long dma_get_merge_boundary(struct device *dev)
{
return 0;
}
+static inline struct sg_table *dma_alloc_noncontiguous(struct device *dev,
+ size_t size, dma_addr_t *dma_handle,
+ enum dma_data_direction dir, gfp_t gfp)
+{
+ return NULL;
+}
+static inline void dma_free_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt, dma_addr_t dma_handle,
+ enum dma_data_direction dir)
+{
+}
+static inline void *dma_vmap_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt)
+{
+ return NULL;
+}
+static inline void dma_vunmap_noncontiguous(struct device *dev, void *vaddr)
+{
+}
+static inline int dma_mmap_noncontiguous(struct device *dev,
+ struct vm_area_struct *vma, size_t size, struct sg_table *sgt)
+{
+ return -EINVAL;
+}
#endif /* CONFIG_HAS_DMA */

struct page *dma_alloc_pages(struct device *dev, size_t size,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index c1e515496c067b..d34dfd2ba6e320 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -528,6 +528,98 @@ int dma_mmap_pages(struct device *dev, struct vm_area_struct *vma,
}
EXPORT_SYMBOL_GPL(dma_mmap_pages);

+static struct sg_table *alloc_single_sgt(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
+{
+ struct sg_table *sgt;
+ struct page *page;
+
+ sgt = kmalloc(sizeof(*sgt), gfp);
+ if (!sgt)
+ return NULL;
+ if (sg_alloc_table(sgt, 1, gfp))
+ goto out_free_sgt;
+ page = dma_alloc_pages(dev, size, dma_handle, dir, gfp);
+ if (!page)
+ goto out_free_table;
+ sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0);
+ return sgt;
+out_free_table:
+ sg_free_table(sgt);
+out_free_sgt:
+ kfree(sgt);
+ return NULL;
+}
+
+static void free_single_sgt(struct device *dev, size_t size,
+ struct sg_table *sgt, dma_addr_t dma_handle,
+ enum dma_data_direction dir)
+{
+ dma_free_pages(dev, size, sg_page(sgt->sgl), dma_handle, dir);
+ sg_free_table(sgt);
+ kfree(sgt);
+}
+
+struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
+{
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (!ops || !ops->alloc_noncontiguous)
+ return alloc_single_sgt(dev, size, dma_handle, dir, gfp);
+ return ops->alloc_noncontiguous(dev, size, dma_handle, dir, gfp);
+}
+EXPORT_SYMBOL_GPL(dma_alloc_noncontiguous);
+
+void dma_free_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt, dma_addr_t dma_handle,
+ enum dma_data_direction dir)
+{
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (!ops || !ops->free_noncontiguous)
+ free_single_sgt(dev, size, sgt, dma_handle, dir);
+ else
+ ops->free_noncontiguous(dev, size, sgt, dma_handle, dir);
+}
+EXPORT_SYMBOL_GPL(dma_free_noncontiguous);
+
+void *dma_vmap_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt)
+{
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+ unsigned long count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+ if (!ops || !ops->alloc_noncontiguous)
+ return page_address(sg_page(sgt->sgl));
+ return vmap(sgt_handle(sgt)->pages, count, VM_MAP, PAGE_KERNEL);
+}
+EXPORT_SYMBOL_GPL(dma_vmap_noncontiguous);
+
+void dma_vunmap_noncontiguous(struct device *dev, void *vaddr)
+{
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (ops && ops->alloc_noncontiguous)
+ vunmap(vaddr);
+}
+EXPORT_SYMBOL_GPL(dma_vunmap_noncontiguous);
+
+int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
+ size_t size, struct sg_table *sgt)
+{
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+ unsigned long count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+ if (!ops || !ops->alloc_noncontiguous)
+ return dma_mmap_pages(dev, vma, size, sg_page(sgt->sgl));
+
+ if (vma->vm_pgoff >= count || vma_pages(vma) > count - vma->vm_pgoff)
+ return -ENXIO;
+ return vm_map_pages(vma, sgt_handle(sgt)->pages, count);
+}
+EXPORT_SYMBOL_GPL(dma_mmap_noncontiguous);
+
int dma_supported(struct device *dev, u64 mask)
{
const struct dma_map_ops *ops = get_dma_ops(dev);
--
2.29.2

2021-01-28 15:06:46

by Christoph Hellwig

Subject: [PATCH 5/6] dma-iommu: implement ->alloc_noncontiguous

Implement support for allocating a non-contiguous DMA region.

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/iommu/dma-iommu.c | 47 +++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 65af875ba8495c..938a2856b4a455 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -756,6 +756,49 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
return NULL;
}

+#ifdef CONFIG_DMA_REMAP
+static struct sg_table *iommu_dma_alloc_noncontiguous(struct device *dev,
+ size_t size, dma_addr_t *dma_handle,
+ enum dma_data_direction dir, gfp_t gfp)
+{
+ struct dma_sgt_handle *sh;
+
+ sh = kmalloc(sizeof(*sh), gfp);
+ if (!sh)
+ return NULL;
+
+ /* __iommu_dma_alloc_noncontiguous() also fills in sh->sgt */
+ sh->pages = __iommu_dma_alloc_noncontiguous(dev, size, &sh->sgt,
+ dma_handle, gfp,
+ PAGE_KERNEL, 0);
+ if (!sh->pages) {
+ kfree(sh);
+ return NULL;
+ }
+ return &sh->sgt;
+}
+
+static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt, dma_addr_t dma_handle,
+ enum dma_data_direction dir)
+{
+ struct dma_sgt_handle *sh = sgt_handle(sgt);
+
+ __iommu_dma_unmap(dev, dma_handle, size);
+ __iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+ sg_free_table(&sh->sgt);
+}
+#endif /* CONFIG_DMA_REMAP */
+
static void iommu_dma_sync_single_for_cpu(struct device *dev,
dma_addr_t dma_handle, size_t size, enum dma_data_direction dir)
{
@@ -1271,6 +1314,10 @@ static const struct dma_map_ops iommu_dma_ops = {
.free = iommu_dma_free,
.alloc_pages = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
+#ifdef CONFIG_DMA_REMAP
+ .alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
+ .free_noncontiguous = iommu_dma_free_noncontiguous,
+#endif
.mmap = iommu_dma_mmap,
.get_sgtable = iommu_dma_get_sgtable,
.map_page = iommu_dma_map_page,
--
2.29.2

2021-01-28 15:09:03

by Christoph Hellwig

Subject: [PATCH 2/6] dma-mapping: add a dma_mmap_pages helper

Add a helper to map memory allocated using dma_alloc_pages into
a user address space, similar to the dma_alloc_attrs function for
coherent allocations.
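
A minimal sketch of a driver ->mmap handler built on top of it
(struct foo_dev and its fields are hypothetical):

	static int foo_mmap(struct file *file, struct vm_area_struct *vma)
	{
		struct foo_dev *foo = file->private_data;

		/* foo->page and foo->size come from dma_alloc_pages() */
		return dma_mmap_pages(foo->dev, vma, foo->size, foo->page);
	}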

Signed-off-by: Christoph Hellwig <[email protected]>
---
Documentation/core-api/dma-api.rst | 10 ++++++++++
include/linux/dma-mapping.h | 2 ++
kernel/dma/mapping.c | 13 +++++++++++++
3 files changed, 25 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index e6d23f117308df..157a474ae54416 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -563,6 +563,16 @@ Free a region of memory previously allocated using dma_alloc_pages().
dev, size, dma_handle and dir must all be the same as those passed into
dma_alloc_pages(). page must be the pointer returned by dma_alloc_pages().

+::
+
+ int
+ dma_mmap_pages(struct device *dev, struct vm_area_struct *vma,
+ size_t size, struct page *page)
+
+Map an allocation returned from dma_alloc_pages() into a user address space.
+dev and size must be the same as those passed into dma_alloc_pages().
+page must be the pointer returned by dma_alloc_pages().
+
::

void *
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index fbfa3f5abd9498..4977a748cb9483 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -263,6 +263,8 @@ struct page *dma_alloc_pages(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp);
void dma_free_pages(struct device *dev, size_t size, struct page *page,
dma_addr_t dma_handle, enum dma_data_direction dir);
+int dma_mmap_pages(struct device *dev, struct vm_area_struct *vma,
+ size_t size, struct page *page);

static inline void *dma_alloc_noncoherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 68992e35c8c3a7..c1e515496c067b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -515,6 +515,19 @@ void dma_free_pages(struct device *dev, size_t size, struct page *page,
}
EXPORT_SYMBOL_GPL(dma_free_pages);

+int dma_mmap_pages(struct device *dev, struct vm_area_struct *vma,
+ size_t size, struct page *page)
+{
+ unsigned long count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+ if (vma->vm_pgoff >= count || vma_pages(vma) > count - vma->vm_pgoff)
+ return -ENXIO;
+ return remap_pfn_range(vma, vma->vm_start,
+ page_to_pfn(page) + vma->vm_pgoff,
+ vma_pages(vma) << PAGE_SHIFT, vma->vm_page_prot);
+}
+EXPORT_SYMBOL_GPL(dma_mmap_pages);
+
int dma_supported(struct device *dev, u64 mask)
{
const struct dma_map_ops *ops = get_dma_ops(dev);
--
2.29.2

2021-01-28 15:10:03

by Christoph Hellwig

Subject: [PATCH 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

From: Ricardo Ribalda <[email protected]>

On architectures where there is no coherent caching, such as ARM, use
the dma_alloc_noncontiguous API and handle the cache flushing manually
using dma_sync_sgtable().

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Eg: aarch64 with an external usb camera

NON_CONTIGUOUS
frames: 999
packets: 999
empty: 0 (0 %)
errors: 0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 67034480 : duration 33303
FPS: 29.99
URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max (uS)
decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
raw decode speed: 9.931 Gbits/s
raw URB handling speed: 1.025 Gbits/s
throughput: 16.102 Mbits/s
URB decode CPU usage 0.162600 %

COHERENT
frames: 999
packets: 999
empty: 0 (0 %)
errors: 0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 54683536 : duration 33302
FPS: 29.99
URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max (uS)
header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max (uS)
decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max (uS)
raw decode speed: 365.470 Mbits/s
raw URB handling speed: 295.986 Mbits/s
throughput: 13.136 Mbits/s
URB decode CPU usage 3.594500 %

Signed-off-by: Ricardo Ribalda <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/media/usb/uvc/uvc_video.c | 76 ++++++++++++++++++++++---------
drivers/media/usb/uvc/uvcvideo.h | 4 +-
2 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c b/drivers/media/usb/uvc/uvc_video.c
index a6a441d92b9488..9c051b55dc7bc6 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -11,6 +11,7 @@
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/usb.h>
+#include <linux/usb/hcd.h>
#include <linux/videodev2.h>
#include <linux/vmalloc.h>
#include <linux/wait.h>
@@ -1097,6 +1098,23 @@ static int uvc_video_decode_start(struct uvc_streaming *stream,
return data[0];
}

+static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
+{
+ return bus_to_hcd(stream->dev->udev->bus)->self.sysdev;
+}
+
+static void uvc_urb_dma_sync(struct uvc_urb *uvc_urb, bool for_device)
+{
+ struct device *dma_dev = stream_to_dmadev(uvc_urb->stream);
+
+ if (for_device)
+ dma_sync_sgtable_for_device(dma_dev, uvc_urb->sgt,
+ DMA_FROM_DEVICE);
+ else
+ dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
+ DMA_FROM_DEVICE);
+}
+
/*
* uvc_video_decode_data_work: Asynchronous memcpy processing
*
@@ -1118,6 +1136,8 @@ static void uvc_video_copy_data_work(struct work_struct *work)
uvc_queue_buffer_release(op->buf);
}

+ uvc_urb_dma_sync(uvc_urb, true);
+
ret = usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
if (ret < 0)
uvc_printk(KERN_ERR, "Failed to resubmit video URB (%d).\n",
@@ -1539,10 +1559,12 @@ static void uvc_video_complete(struct urb *urb)
* Process the URB headers, and optionally queue expensive memcpy tasks
* to be deferred to a work queue.
*/
+ uvc_urb_dma_sync(uvc_urb, false);
stream->decode(uvc_urb, buf, buf_meta);

/* If no async work is needed, resubmit the URB immediately. */
if (!uvc_urb->async_operations) {
+ uvc_urb_dma_sync(uvc_urb, true);
ret = usb_submit_urb(uvc_urb->urb, GFP_ATOMIC);
if (ret < 0)
uvc_printk(KERN_ERR,
@@ -1559,24 +1581,47 @@ static void uvc_video_complete(struct urb *urb)
*/
static void uvc_free_urb_buffers(struct uvc_streaming *stream)
{
+ struct device *dma_dev = stream_to_dmadev(stream);
struct uvc_urb *uvc_urb;

for_each_uvc_urb(uvc_urb, stream) {
if (!uvc_urb->buffer)
continue;

-#ifndef CONFIG_DMA_NONCOHERENT
- usb_free_coherent(stream->dev->udev, stream->urb_size,
- uvc_urb->buffer, uvc_urb->dma);
-#else
- kfree(uvc_urb->buffer);
-#endif
+ dma_vunmap_noncontiguous(dma_dev, uvc_urb->buffer);
+ dma_free_noncontiguous(dma_dev, stream->urb_size, uvc_urb->sgt,
+ uvc_urb->dma, DMA_BIDIRECTIONAL);
+
uvc_urb->buffer = NULL;
}

stream->urb_size = 0;
}

+static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream,
+ struct uvc_urb *uvc_urb, gfp_t gfp_flags)
+{
+ struct device *dma_dev = stream_to_dmadev(stream);
+
+ uvc_urb->sgt = dma_alloc_noncontiguous(dma_dev, stream->urb_size,
+ &uvc_urb->dma, DMA_BIDIRECTIONAL,
+ gfp_flags);
+ if (!uvc_urb->sgt)
+ return false;
+
+ uvc_urb->buffer = dma_vmap_noncontiguous(dma_dev, stream->urb_size,
+ uvc_urb->sgt);
+ if (!uvc_urb->buffer) {
+ dma_free_noncontiguous(dma_dev, stream->urb_size,
+ uvc_urb->sgt, uvc_urb->dma,
+ DMA_BIDIRECTIONAL);
+ return false;
+ }
+
+ return true;
+}
+
/*
* Allocate transfer buffers. This function can be called with buffers
* already allocated when resuming from suspend, in which case it will
@@ -1607,19 +1652,11 @@ static int uvc_alloc_urb_buffers(struct uvc_streaming *stream,

/* Retry allocations until one succeed. */
for (; npackets > 1; npackets /= 2) {
+ stream->urb_size = psize * npackets;
for (i = 0; i < UVC_URBS; ++i) {
struct uvc_urb *uvc_urb = &stream->uvc_urb[i];

- stream->urb_size = psize * npackets;
-#ifndef CONFIG_DMA_NONCOHERENT
- uvc_urb->buffer = usb_alloc_coherent(
- stream->dev->udev, stream->urb_size,
- gfp_flags | __GFP_NOWARN, &uvc_urb->dma);
-#else
- uvc_urb->buffer =
- kmalloc(stream->urb_size, gfp_flags | __GFP_NOWARN);
-#endif
- if (!uvc_urb->buffer) {
+ if (!uvc_alloc_urb_buffer(stream, uvc_urb, gfp_flags)) {
uvc_free_urb_buffers(stream);
break;
}
@@ -1728,12 +1765,8 @@ static int uvc_init_video_isoc(struct uvc_streaming *stream,
urb->context = uvc_urb;
urb->pipe = usb_rcvisocpipe(stream->dev->udev,
ep->desc.bEndpointAddress);
-#ifndef CONFIG_DMA_NONCOHERENT
urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP;
urb->transfer_dma = uvc_urb->dma;
-#else
- urb->transfer_flags = URB_ISO_ASAP;
-#endif
urb->interval = ep->desc.bInterval;
urb->transfer_buffer = uvc_urb->buffer;
urb->complete = uvc_video_complete;
@@ -1793,10 +1826,8 @@ static int uvc_init_video_bulk(struct uvc_streaming *stream,

usb_fill_bulk_urb(urb, stream->dev->udev, pipe, uvc_urb->buffer,
size, uvc_video_complete, uvc_urb);
-#ifndef CONFIG_DMA_NONCOHERENT
urb->transfer_flags = URB_NO_TRANSFER_DMA_MAP;
urb->transfer_dma = uvc_urb->dma;
-#endif

uvc_urb->urb = urb;
}
@@ -1891,6 +1922,7 @@ static int uvc_video_start_transfer(struct uvc_streaming *stream,

/* Submit the URBs. */
for_each_uvc_urb(uvc_urb, stream) {
+ uvc_urb_dma_sync(uvc_urb, true);
ret = usb_submit_urb(uvc_urb->urb, gfp_flags);
if (ret < 0) {
uvc_printk(KERN_ERR, "Failed to submit URB %u (%d).\n",
diff --git a/drivers/media/usb/uvc/uvcvideo.h b/drivers/media/usb/uvc/uvcvideo.h
index a3dfacf069c44d..a386114bd22999 100644
--- a/drivers/media/usb/uvc/uvcvideo.h
+++ b/drivers/media/usb/uvc/uvcvideo.h
@@ -521,7 +521,8 @@ struct uvc_copy_op {
* @urb: the URB described by this context structure
* @stream: UVC streaming context
* @buffer: memory storage for the URB
- * @dma: DMA coherent addressing for the urb_buffer
+ * @dma: Allocated DMA handle
+ * @sgt: sg_table with the URB locations in memory
* @async_operations: counter to indicate the number of copy operations
* @copy_operations: work descriptors for asynchronous copy operations
* @work: work queue entry for asynchronous decode
@@ -532,6 +533,7 @@ struct uvc_urb {

char *buffer;
dma_addr_t dma;
+ struct sg_table *sgt;

unsigned int async_operations;
struct uvc_copy_op copy_operations[UVC_MAX_PACKETS];
--
2.29.2

2021-01-28 15:10:49

by Christoph Hellwig

Subject: [PATCH 4/6] dma-iommu: refactor iommu_dma_alloc_remap

Split out a new helper that only allocates a sg_table worth of
memory without mapping it into contiguous kernel address space.

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/iommu/dma-iommu.c | 66 +++++++++++++++++++++------------------
1 file changed, 35 insertions(+), 31 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 255533faf90599..65af875ba8495c 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -661,24 +661,13 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
return pages;
}

-/**
- * iommu_dma_alloc_remap - Allocate and map a buffer contiguous in IOVA space
- * @dev: Device to allocate memory for. Must be a real device
- * attached to an iommu_dma_domain
- * @size: Size of buffer in bytes
- * @dma_handle: Out argument for allocated DMA handle
- * @gfp: Allocation flags
- * @prot: pgprot_t to use for the remapped mapping
- * @attrs: DMA attributes for this allocation
- *
- * If @size is less than PAGE_SIZE, then a full CPU page will be allocated,
+/*
+ * If size is less than PAGE_SIZE, then a full CPU page will be allocated,
* but an IOMMU which supports smaller pages might not map the whole thing.
- *
- * Return: Mapped virtual address, or NULL on failure.
*/
-static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
- dma_addr_t *dma_handle, gfp_t gfp, pgprot_t prot,
- unsigned long attrs)
+static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
+ size_t size, struct sg_table *sgt, dma_addr_t *dma_handle,
+ gfp_t gfp, pgprot_t prot, unsigned long attrs)
{
struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -687,9 +676,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
int ioprot = dma_info_to_prot(DMA_BIDIRECTIONAL, coherent, attrs);
unsigned int count, min_size, alloc_sizes = domain->pgsize_bitmap;
struct page **pages;
- struct sg_table sgt;
dma_addr_t iova;
- void *vaddr;

*dma_handle = DMA_MAPPING_ERROR;

@@ -717,34 +704,26 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
if (!iova)
goto out_free_pages;

- if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
+ if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, GFP_KERNEL))
goto out_free_iova;

if (!(ioprot & IOMMU_CACHE)) {
struct scatterlist *sg;
int i;

- for_each_sg(sgt.sgl, sg, sgt.orig_nents, i)
+ for_each_sg(sgt->sgl, sg, sgt->orig_nents, i)
arch_dma_prep_coherent(sg_page(sg), sg->length);
}

- if (iommu_map_sg_atomic(domain, iova, sgt.sgl, sgt.orig_nents, ioprot)
+ if (iommu_map_sg_atomic(domain, iova, sgt->sgl, sgt->orig_nents, ioprot)
< size)
goto out_free_sg;

- vaddr = dma_common_pages_remap(pages, size, prot,
- __builtin_return_address(0));
- if (!vaddr)
- goto out_unmap;
-
*dma_handle = iova;
- sg_free_table(&sgt);
- return vaddr;
+ return pages;

-out_unmap:
- __iommu_dma_unmap(dev, iova, size);
out_free_sg:
- sg_free_table(&sgt);
+ sg_free_table(sgt);
out_free_iova:
iommu_dma_free_iova(cookie, iova, size, NULL);
out_free_pages:
@@ -752,6 +731,31 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
return NULL;
}

+static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, gfp_t gfp, pgprot_t prot,
+ unsigned long attrs)
+{
+ struct page **pages;
+ struct sg_table sgt;
+ void *vaddr;
+
+ pages = __iommu_dma_alloc_noncontiguous(dev, size, &sgt, dma_handle,
+ gfp, prot, attrs);
+ if (!pages)
+ return NULL;
+ sg_free_table(&sgt);
+ vaddr = dma_common_pages_remap(pages, size, prot,
+ __builtin_return_address(0));
+ if (!vaddr)
+ goto out_unmap;
+ return vaddr;
+
+out_unmap:
+ __iommu_dma_unmap(dev, *dma_handle, size);
+ __iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+ return NULL;
+}
+
static void iommu_dma_sync_single_for_cpu(struct device *dev,
dma_addr_t dma_handle, size_t size, enum dma_data_direction dir)
{
--
2.29.2

2021-01-28 15:13:18

by Christoph Hellwig

Subject: Re: [PATCH 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

I just included this patch as-is, but here are a few comments:

On Thu, Jan 28, 2021 at 03:58:37PM +0100, Christoph Hellwig wrote:
> +static void uvc_urb_dma_sync(struct uvc_urb *uvc_urb, bool for_device)
> +{
> + struct device *dma_dev = stream_to_dmadev(uvc_urb->stream);
> +
> + if (for_device)
> + dma_sync_sgtable_for_device(dma_dev, uvc_urb->sgt,
> + DMA_FROM_DEVICE);
> + else
> + dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
> + DMA_FROM_DEVICE);
> +}

Given that we vmap the addresses this also needs
flush_kernel_vmap_range / invalidate_kernel_vmap_range calls for
VIVT architectures.
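
Untested sketch of what I mean, reusing the names from the patch (the
flush side only matters once the CPU also writes to the buffer; for a
receive-only buffer it could be dropped):

	if (for_device) {
		flush_kernel_vmap_range(uvc_urb->buffer,
					uvc_urb->stream->urb_size);
		dma_sync_sgtable_for_device(dma_dev, uvc_urb->sgt,
					    DMA_FROM_DEVICE);
	} else {
		dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
					 DMA_FROM_DEVICE);
		invalidate_kernel_vmap_range(uvc_urb->buffer,
					     uvc_urb->stream->urb_size);
	}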

2021-01-28 15:44:45

by David Laight

Subject: RE: [PATCH 2/6] dma-mapping: add a dma_mmap_pages helper

From: Christoph Hellwig
> Sent: 28 January 2021 14:59
>
> Add a helper to map memory allocated using dma_alloc_pages into
> a user address space, similar to the dma_alloc_attrs function for
> coherent allocations.
>
...
> +::
> +
> + int
> + dma_mmap_pages(struct device *dev, struct vm_area_struct *vma,
> + size_t size, struct page *page)
> +
> +Map an allocation returned from dma_alloc_pages() into a user address space.
> +dev and size must be the same as those passed into dma_alloc_pages().
> +page must be the pointer returned by dma_alloc_pages().

To be useful this needs to specify the offset into the user address area.
(ie the offset in the mmap() buffer.)

For example, we have an FPGA-based PCIe card that converts internal
addresses referring to one of 512 16k 'pages' into 64-bit PCIe bus
master addresses.
So it (sort of) contains its own iommu.

So we can allocate (aligned) 16k kernel memory buffers with
dma_alloc_coherent() and make them appear contiguous to the
on-board PCIe bus master users.
We then mmap() them into contiguous user addresses.
So both 'ends' see contiguous addresses without requiring
contiguous physical memory or requiring all the memory to be
allocated at the same time.

Clearly in-kernel users have to allow for the 16k boundaries.
But the large structures are accessed from user space.
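
For illustration, the userspace side does something like this
(hypothetical sketch; base is a region reserved up front with an
anonymous PROT_NONE mmap() and fd is our device node):

	/* stitch the 16k kernel buffers into one contiguous user view */
	for (i = 0; i < nbufs; i++)
		mmap(base + i * 16384, 16384, PROT_READ | PROT_WRITE,
		     MAP_SHARED | MAP_FIXED, fd, i * 16384);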

David


2021-01-28 16:50:19

by Ricardo Ribalda

Subject: Re: [PATCH 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

On Thu, Jan 28, 2021 at 4:03 PM Christoph Hellwig <[email protected]> wrote:
>
> From: Ricardo Ribalda <[email protected]>
>
> On architectures where there is no coherent caching, such as ARM, use
> the dma_alloc_noncontiguous API and handle the cache flushing manually
> using dma_sync_sgtable().
>
> With this patch on the affected architectures we can measure up to 20x
> performance improvement in uvc_video_copy_data_work().
>
> Eg: aarch64 with an external usb camera
>
> NON_CONTIGUOUS
> frames: 999
> packets: 999
> empty: 0 (0 %)
> errors: 0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 67034480 : duration 33303
> FPS: 29.99
> URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
> header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
> latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max (uS)
> decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
> raw decode speed: 9.931 Gbits/s
> raw URB handling speed: 1.025 Gbits/s
> throughput: 16.102 Mbits/s
> URB decode CPU usage 0.162600 %
>
> COHERENT
> frames: 999
> packets: 999
> empty: 0 (0 %)
> errors: 0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 54683536 : duration 33302
> FPS: 29.99
> URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max (uS)
> header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
> latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max (uS)
> decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max (uS)
> raw decode speed: 365.470 Mbits/s
> raw URB handling speed: 295.986 Mbits/s
> throughput: 13.136 Mbits/s
> URB decode CPU usage 3.594500 %
>
> Signed-off-by: Ricardo Ribalda <[email protected]>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/media/usb/uvc/uvc_video.c | 76 ++++++++++++++++++++++---------
> drivers/media/usb/uvc/uvcvideo.h | 4 +-
> 2 files changed, 57 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/media/usb/uvc/uvc_video.c b/drivers/media/usb/uvc/uvc_video.c
> index a6a441d92b9488..9c051b55dc7bc6 100644
> --- a/drivers/media/usb/uvc/uvc_video.c
> +++ b/drivers/media/usb/uvc/uvc_video.c
> @@ -11,6 +11,7 @@
> #include <linux/module.h>
> #include <linux/slab.h>
> #include <linux/usb.h>
> +#include <linux/usb/hcd.h>
> #include <linux/videodev2.h>
> #include <linux/vmalloc.h>
> #include <linux/wait.h>
> @@ -1097,6 +1098,23 @@ static int uvc_video_decode_start(struct uvc_streaming *stream,
> return data[0];
> }
>
> +static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
> +{
> + return bus_to_hcd(stream->dev->udev->bus)->self.sysdev;
> +}
> +
> +static void uvc_urb_dma_sync(struct uvc_urb *uvc_urb, bool for_device)
> +{
> + struct device *dma_dev = stream_to_dmadev(uvc_urb->stream);
> +
> + if (for_device)
> + dma_sync_sgtable_for_device(dma_dev, uvc_urb->sgt,
> + DMA_FROM_DEVICE);
> + else
> + dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
> + DMA_FROM_DEVICE);
> +}
> +
> /*
> * uvc_video_decode_data_work: Asynchronous memcpy processing
> *
> @@ -1118,6 +1136,8 @@ static void uvc_video_copy_data_work(struct work_struct *work)
> uvc_queue_buffer_release(op->buf);
> }
>
> + uvc_urb_dma_sync(uvc_urb, true);
> +
> ret = usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
> if (ret < 0)
> uvc_printk(KERN_ERR, "Failed to resubmit video URB (%d).\n",
> @@ -1539,10 +1559,12 @@ static void uvc_video_complete(struct urb *urb)
> * Process the URB headers, and optionally queue expensive memcpy tasks
> * to be deferred to a work queue.
> */
> + uvc_urb_dma_sync(uvc_urb, false);
> stream->decode(uvc_urb, buf, buf_meta);
>
> /* If no async work is needed, resubmit the URB immediately. */
> if (!uvc_urb->async_operations) {
> + uvc_urb_dma_sync(uvc_urb, true);
> ret = usb_submit_urb(uvc_urb->urb, GFP_ATOMIC);
> if (ret < 0)
> uvc_printk(KERN_ERR,
> @@ -1559,24 +1581,47 @@ static void uvc_video_complete(struct urb *urb)
> */
> static void uvc_free_urb_buffers(struct uvc_streaming *stream)
> {
> + struct device *dma_dev = stream_to_dmadev(stream);
> struct uvc_urb *uvc_urb;
>
> for_each_uvc_urb(uvc_urb, stream) {
> if (!uvc_urb->buffer)
> continue;
>
> -#ifndef CONFIG_DMA_NONCOHERENT
> - usb_free_coherent(stream->dev->udev, stream->urb_size,
> - uvc_urb->buffer, uvc_urb->dma);
> -#else
> - kfree(uvc_urb->buffer);
> -#endif
> + dma_vunmap_noncontiguous(dma_dev, uvc_urb->buffer);
> + dma_free_noncontiguous(dma_dev, stream->urb_size, uvc_urb->sgt,
> + uvc_urb->dma, DMA_BIDIRECTIONAL);

Maybe DMA_FROM_DEVICE instead of DMA_BIDIRECTIONAL ?

> +
> uvc_urb->buffer = NULL;
> }
>
> stream->urb_size = 0;
> }
>
> +static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream,
> + struct uvc_urb *uvc_urb, gfp_t gfp_flags)
> +{
> + struct device *dma_dev = stream_to_dmadev(stream);
> +
> + uvc_urb->sgt = dma_alloc_noncontiguous(dma_dev, stream->urb_size,
> + &uvc_urb->dma, DMA_BIDIRECTIONAL,
> + gfp_flags);
> + if (!uvc_urb->sgt)
> + return false;
> +
> + uvc_urb->buffer = dma_vmap_noncontiguous(dma_dev, stream->urb_size,
> + uvc_urb->sgt);
> + if (!uvc_urb->buffer) {
> + dma_free_noncontiguous(dma_dev, stream->urb_size,
> + uvc_urb->sgt, uvc_urb->dma,
> + DMA_BIDIRECTIONAL);
> + return false;
> + }
> +
> + return true;
> +}
> +
> /*
> * Allocate transfer buffers. This function can be called with buffers
> * already allocated when resuming from suspend, in which case it will
> @@ -1607,19 +1652,11 @@ static int uvc_alloc_urb_buffers(struct uvc_streaming *stream,
>
> /* Retry allocations until one succeed. */
> for (; npackets > 1; npackets /= 2) {
> + stream->urb_size = psize * npackets;
> for (i = 0; i < UVC_URBS; ++i) {
> struct uvc_urb *uvc_urb = &stream->uvc_urb[i];
>
> - stream->urb_size = psize * npackets;
> -#ifndef CONFIG_DMA_NONCOHERENT
> - uvc_urb->buffer = usb_alloc_coherent(
> - stream->dev->udev, stream->urb_size,
> - gfp_flags | __GFP_NOWARN, &uvc_urb->dma);
> -#else
> - uvc_urb->buffer =
> - kmalloc(stream->urb_size, gfp_flags | __GFP_NOWARN);
> -#endif
> - if (!uvc_urb->buffer) {
> + if (!uvc_alloc_urb_buffer(stream, uvc_urb, gfp_flags)) {
> uvc_free_urb_buffers(stream);
> break;
> }
> @@ -1728,12 +1765,8 @@ static int uvc_init_video_isoc(struct uvc_streaming *stream,
> urb->context = uvc_urb;
> urb->pipe = usb_rcvisocpipe(stream->dev->udev,
> ep->desc.bEndpointAddress);
> -#ifndef CONFIG_DMA_NONCOHERENT
> urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP;
> urb->transfer_dma = uvc_urb->dma;
> -#else
> - urb->transfer_flags = URB_ISO_ASAP;
> -#endif
> urb->interval = ep->desc.bInterval;
> urb->transfer_buffer = uvc_urb->buffer;
> urb->complete = uvc_video_complete;
> @@ -1793,10 +1826,8 @@ static int uvc_init_video_bulk(struct uvc_streaming *stream,
>
> usb_fill_bulk_urb(urb, stream->dev->udev, pipe, uvc_urb->buffer,
> size, uvc_video_complete, uvc_urb);
> -#ifndef CONFIG_DMA_NONCOHERENT
> urb->transfer_flags = URB_NO_TRANSFER_DMA_MAP;
> urb->transfer_dma = uvc_urb->dma;
> -#endif
>
> uvc_urb->urb = urb;
> }
> @@ -1891,6 +1922,7 @@ static int uvc_video_start_transfer(struct uvc_streaming *stream,
>
> /* Submit the URBs. */
> for_each_uvc_urb(uvc_urb, stream) {
> + uvc_urb_dma_sync(uvc_urb, true);
> ret = usb_submit_urb(uvc_urb->urb, gfp_flags);
> if (ret < 0) {
> uvc_printk(KERN_ERR, "Failed to submit URB %u (%d).\n",
> diff --git a/drivers/media/usb/uvc/uvcvideo.h b/drivers/media/usb/uvc/uvcvideo.h
> index a3dfacf069c44d..a386114bd22999 100644
> --- a/drivers/media/usb/uvc/uvcvideo.h
> +++ b/drivers/media/usb/uvc/uvcvideo.h
> @@ -521,7 +521,8 @@ struct uvc_copy_op {
> * @urb: the URB described by this context structure
> * @stream: UVC streaming context
> * @buffer: memory storage for the URB
> - * @dma: DMA coherent addressing for the urb_buffer
> + * @dma: Allocated DMA handle
> + * @sgt: sg_table with the URB locations in memory
> * @async_operations: counter to indicate the number of copy operations
> * @copy_operations: work descriptors for asynchronous copy operations
> * @work: work queue entry for asynchronous decode
> @@ -532,6 +533,7 @@ struct uvc_urb {
>
> char *buffer;
> dma_addr_t dma;
> + struct sg_table *sgt;
>
> unsigned int async_operations;
> struct uvc_copy_op copy_operations[UVC_MAX_PACKETS];
> --
> 2.29.2
>


--
Ricardo Ribalda

2021-01-28 17:04:45

by Ricardo Ribalda

Subject: Re: [PATCH 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

Hi Christoph

Thanks for your comments

On Thu, Jan 28, 2021 at 4:09 PM Christoph Hellwig <[email protected]> wrote:
>
> I just included this patch as-is, but here are a few comments:
>
> On Thu, Jan 28, 2021 at 03:58:37PM +0100, Christoph Hellwig wrote:
> > +static void uvc_urb_dma_sync(struct uvc_urb *uvc_urb, bool for_device)
> > +{
> > + struct device *dma_dev = dma_dev = stream_to_dmadev(uvc_urb->stream);
> > +
> > + if (for_device)
> > + dma_sync_sgtable_for_device(dma_dev, uvc_urb->sgt,
> > + DMA_FROM_DEVICE);
> > + else
> > + dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
> > + DMA_FROM_DEVICE);
> > +}
>
> Given that we vmap the addresses this also needs
> flush_kernel_vmap_range / invalidate_kernel_vmap_range calls for
> VIVT architectures.

We only read from the device to the CPU. Can we then run only
invalidate_kernel_vmap_range()?

Something like:

	else {
		dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
					 DMA_FROM_DEVICE);
		invalidate_kernel_vmap_range(uvc_urb->buffer,
					     uvc_urb->stream->urb_size);
	}

Thanks!

--
Ricardo Ribalda

2021-02-01 16:39:44

by Christoph Hellwig

Subject: Re: [PATCH 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

On Thu, Jan 28, 2021 at 06:00:57PM +0100, Ricardo Ribalda wrote:
> > Given that we vmap the addresses this also needs
> > flush_kernel_vmap_range / invalidate_kernel_vmap_range calls for
> > VIVT architectures.
>
> We only read from the device to the CPU. Can we then run only
> invalidate_kernel_vmap_range()?
>
> Something like:
>
> 	else {
> 		dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
> 					 DMA_FROM_DEVICE);
> 		invalidate_kernel_vmap_range(uvc_urb->buffer,
> 					     uvc_urb->stream->urb_size);
> 	}

Yes. Right now we don't have a proper state machine for the
*_kernel_vmap_range, but we should probably add one once usage of this
grows. Until then I need to respin my API patch to document how callers
need to use *_kernel_vmap_range, as well as adding the so far missing
dma-debug support.

As we're getting toward the end of the merge window I'll try to get this
done ASAP.

How should we plan to merge this code? Do you have a tree you'd like
to pick up the whole thing for? Or should I create a dma-mapping
tree branch that can be pulled in?