2014-04-15 13:09:09

by Akinobu Mita

Subject: [PATCH v3 0/5] enhance DMA CMA on x86

This patch set enhances the DMA Contiguous Memory Allocator on x86.

Currently, DMA CMA is only supported with the pci-nommu dma_map_ops,
and it cannot be enabled on x86_64 at all. But I would like to allocate
a big contiguous memory region with dma_alloc_coherent() and hand it to
a device that requires it, regardless of which dma mapping
implementation is actually used in the system.

So this series makes DMA CMA work with the swiotlb and intel-iommu
dma_map_ops, too. It also extends the "cma=" kernel parameter so that a
placement constraint, i.e. a physical address range, can be specified
for the allocation. For example, "cma=64M@0-4G" makes CMA allocate its
area below 4GB, which is required for devices that only support 32-bit
addressing on 64-bit systems without an iommu.
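
As an illustration (a sketch only, not part of this series; the device
pointer and buffer size are made up), a driver that needs such a buffer
would request it roughly like this:

	/*
	 * Hypothetical driver snippet: ask for a 64MB DMA-coherent buffer.
	 * With this series the allocation can be backed by the CMA area
	 * whether nommu, swiotlb or intel-iommu dma_map_ops are in use.
	 */
	dma_addr_t bus_addr;
	void *buf;

	buf = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* ... program the device with bus_addr, use buf from the CPU ... */

	dma_free_coherent(dev, 64 * 1024 * 1024, buf, bus_addr);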

* Changes from v2
- Rebased on current Linus tree
- Add Acked-by line
- Fix gfp flags check for __GFP_ATOMIC, reported by Marek Szyprowski
- Avoid CMA area on highmem with cma= option, reported by Marek Szyprowski

* Changes from v1
- fix dma_alloc_coherent() with __GFP_ZERO
- add placement specifier for "cma=" kernel parameter

Akinobu Mita (5):
x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled
x86: enable DMA CMA with swiotlb
intel-iommu: integrate DMA CMA
memblock: introduce memblock_alloc_range()
cma: add placement specifier for "cma=" kernel parameter

Documentation/kernel-parameters.txt | 7 +++++--
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/swiotlb.h | 7 +++++++
arch/x86/kernel/amd_gart_64.c | 2 +-
arch/x86/kernel/pci-dma.c | 3 +--
arch/x86/kernel/pci-swiotlb.c | 9 +++++---
arch/x86/kernel/setup.c | 2 +-
arch/x86/pci/sta2x11-fixup.c | 6 ++----
drivers/base/dma-contiguous.c | 42 ++++++++++++++++++++++++++++---------
drivers/iommu/intel-iommu.c | 32 +++++++++++++++++++++-------
include/linux/dma-contiguous.h | 9 +++++---
include/linux/memblock.h | 2 ++
include/linux/swiotlb.h | 2 ++
lib/swiotlb.c | 2 +-
mm/memblock.c | 21 +++++++++++++++----
15 files changed, 108 insertions(+), 40 deletions(-)

Cc: Marek Szyprowski <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Don Dutile <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: [email protected]
Cc: [email protected]
--
1.8.3.2


2014-04-15 13:09:12

by Akinobu Mita

Subject: [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled

Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.

But when the contiguous memory allocator (CMA) is enabled on x86 and
the memory region is allocated by dma_alloc_from_contiguous(), the
returned memory is not zeroed, because dma_generic_alloc_coherent()
forgets to fill the region with zeros when it comes from
dma_alloc_from_contiguous().

Most implementations of dma_alloc_coherent() return zeroed memory
regardless of whether __GFP_ZERO is specified. So this fixes it by
unconditionally zeroing the allocated memory region.

Alternatively, we could fix dma_alloc_from_contiguous() to return
zeroed memory and remove the memset() from all of its callers. But we
can't simply remove the memset on arm, because __dma_clear_buffer() is
used there to ensure cache flushing and it is used in many places. Of
course dma_alloc_from_contiguous() could do a redundant memset, but I
think this patch has less impact as a fix for this problem.

Cc: Marek Szyprowski <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Don Dutile <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Akinobu Mita <[email protected]>
---
* Change from v2
- update commit log to describe a possible alternative fix

arch/x86/kernel/pci-dma.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index f7d0672..a0ffe44 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -97,7 +97,6 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,

dma_mask = dma_alloc_coherent_mask(dev, flag);

- flag |= __GFP_ZERO;
again:
page = NULL;
/* CMA can be used only in the context which permits sleeping */
@@ -120,7 +119,7 @@ again:

return NULL;
}
-
+ memset(page_address(page), 0, size);
*dma_addr = addr;
return page_address(page);
}
--
1.8.3.2

2014-04-15 13:09:17

by Akinobu Mita

Subject: [PATCH v3 2/5] x86: enable DMA CMA with swiotlb

The DMA Contiguous Memory Allocator support on x86 is disabled when
the swiotlb config option is enabled. As a result, DMA CMA is always
disabled on x86_64, because swiotlb is always enabled there. This patch
makes DMA CMA usable even when the swiotlb config option is enabled.

The contiguous memory allocator on x86 is integrated into
dma_generic_alloc_coherent(), which is the .alloc callback in
nommu_dma_ops used by dma_alloc_coherent().

x86_swiotlb_alloc_coherent(), the .alloc callback in swiotlb_dma_ops,
first tries to allocate with dma_generic_alloc_coherent() and then
calls swiotlb_alloc_coherent() as a fallback.

The main part of supporting DMA CMA with swiotlb is changing
x86_swiotlb_free_coherent(), the .free callback in swiotlb_dma_ops used
by dma_free_coherent(), so that it can distinguish memory allocated by
dma_generic_alloc_coherent() from memory allocated by
swiotlb_alloc_coherent(), and release the former with
dma_generic_free_coherent(), which can handle contiguous memory. This
change requires making is_swiotlb_buffer() a global function.

The .free callbacks in the dma_map_ops for amd_gart and sta2x11 also
need to change, because those dma_ops use dma_generic_alloc_coherent()
as well.

Cc: Marek Szyprowski <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Don Dutile <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Acked-by: Konrad Rzeszutek Wilk <[email protected]>
---
* Change from v2
- Add Acked-by line

arch/x86/Kconfig | 2 +-
arch/x86/include/asm/swiotlb.h | 7 +++++++
arch/x86/kernel/amd_gart_64.c | 2 +-
arch/x86/kernel/pci-swiotlb.c | 9 ++++++---
arch/x86/pci/sta2x11-fixup.c | 6 ++----
include/linux/swiotlb.h | 2 ++
lib/swiotlb.c | 2 +-
7 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..7fa3f83 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -41,7 +41,7 @@ config X86
select ARCH_WANT_OPTIONAL_GPIOLIB
select ARCH_WANT_FRAME_POINTERS
select HAVE_DMA_ATTRS
- select HAVE_DMA_CONTIGUOUS if !SWIOTLB
+ select HAVE_DMA_CONTIGUOUS
select HAVE_KRETPROBES
select GENERIC_EARLY_IOREMAP
select HAVE_OPTPROBES
diff --git a/arch/x86/include/asm/swiotlb.h b/arch/x86/include/asm/swiotlb.h
index 977f176..ab05d73 100644
--- a/arch/x86/include/asm/swiotlb.h
+++ b/arch/x86/include/asm/swiotlb.h
@@ -29,4 +29,11 @@ static inline void pci_swiotlb_late_init(void)

static inline void dma_mark_clean(void *addr, size_t size) {}

+extern void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
+ dma_addr_t *dma_handle, gfp_t flags,
+ struct dma_attrs *attrs);
+extern void x86_swiotlb_free_coherent(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_addr,
+ struct dma_attrs *attrs);
+
#endif /* _ASM_X86_SWIOTLB_H */
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index b574b29..8e3842f 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -512,7 +512,7 @@ gart_free_coherent(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_addr, struct dma_attrs *attrs)
{
gart_unmap_page(dev, dma_addr, size, DMA_BIDIRECTIONAL, NULL);
- free_pages((unsigned long)vaddr, get_order(size));
+ dma_generic_free_coherent(dev, size, vaddr, dma_addr, attrs);
}

static int gart_mapping_error(struct device *dev, dma_addr_t dma_addr)
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index 6c483ba..77dd0ad 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -14,7 +14,7 @@
#include <asm/iommu_table.h>
int swiotlb __read_mostly;

-static void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
+void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
dma_addr_t *dma_handle, gfp_t flags,
struct dma_attrs *attrs)
{
@@ -28,11 +28,14 @@ static void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
return swiotlb_alloc_coherent(hwdev, size, dma_handle, flags);
}

-static void x86_swiotlb_free_coherent(struct device *dev, size_t size,
+void x86_swiotlb_free_coherent(struct device *dev, size_t size,
void *vaddr, dma_addr_t dma_addr,
struct dma_attrs *attrs)
{
- swiotlb_free_coherent(dev, size, vaddr, dma_addr);
+ if (is_swiotlb_buffer(dma_to_phys(dev, dma_addr)))
+ swiotlb_free_coherent(dev, size, vaddr, dma_addr);
+ else
+ dma_generic_free_coherent(dev, size, vaddr, dma_addr, attrs);
}

static struct dma_map_ops swiotlb_dma_ops = {
diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index 9d8a509..5ceda85 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -173,9 +173,7 @@ static void *sta2x11_swiotlb_alloc_coherent(struct device *dev,
{
void *vaddr;

- vaddr = dma_generic_alloc_coherent(dev, size, dma_handle, flags, attrs);
- if (!vaddr)
- vaddr = swiotlb_alloc_coherent(dev, size, dma_handle, flags);
+ vaddr = x86_swiotlb_alloc_coherent(dev, size, dma_handle, flags, attrs);
*dma_handle = p2a(*dma_handle, to_pci_dev(dev));
return vaddr;
}
@@ -183,7 +181,7 @@ static void *sta2x11_swiotlb_alloc_coherent(struct device *dev,
/* We have our own dma_ops: the same as swiotlb but from alloc (above) */
static struct dma_map_ops sta2x11_dma_ops = {
.alloc = sta2x11_swiotlb_alloc_coherent,
- .free = swiotlb_free_coherent,
+ .free = x86_swiotlb_free_coherent,
.map_page = swiotlb_map_page,
.unmap_page = swiotlb_unmap_page,
.map_sg = swiotlb_map_sg_attrs,
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index a5ffd32..e7a018e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -116,4 +116,6 @@ static inline void swiotlb_free(void) { }
#endif

extern void swiotlb_print_info(void);
+extern int is_swiotlb_buffer(phys_addr_t paddr);
+
#endif /* __LINUX_SWIOTLB_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 7f57f24..caaab5d 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -374,7 +374,7 @@ void __init swiotlb_free(void)
io_tlb_nslabs = 0;
}

-static int is_swiotlb_buffer(phys_addr_t paddr)
+int is_swiotlb_buffer(phys_addr_t paddr)
{
return paddr >= io_tlb_start && paddr < io_tlb_end;
}
--
1.8.3.2

2014-04-15 13:09:27

by Akinobu Mita

Subject: [PATCH v3 4/5] memblock: introduce memblock_alloc_range()

This introduces memblock_alloc_range(), which allocates memory from
memblock within the specified physical address range. I would like to
use this function to specify the placement of the CMA area.
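
As an illustration (a sketch only, assuming the signature added by this
patch), a boot-time caller that has to keep a reservation inside a
physical address window would use it roughly like this:

	/*
	 * Sketch: reserve 'size' bytes of physical memory somewhere inside
	 * [base, limit) with the given alignment. memblock_alloc_range()
	 * returns the physical address of the reservation, or 0 on failure.
	 */
	phys_addr_t addr;

	addr = memblock_alloc_range(size, alignment, base, limit);
	if (!addr)
		return -ENOMEM;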

Cc: Marek Szyprowski <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Don Dutile <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Akinobu Mita <[email protected]>
---
* Change from v2
- Rebased on current Linus tree

include/linux/memblock.h | 2 ++
mm/memblock.c | 21 +++++++++++++++++----
2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 8a20a51..c5a61d9 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -221,6 +221,8 @@ static inline bool memblock_bottom_up(void) { return false; }
#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE 0

+phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
+ phys_addr_t start, phys_addr_t end);
phys_addr_t memblock_alloc_base(phys_addr_t size, phys_addr_t align,
phys_addr_t max_addr);
phys_addr_t __memblock_alloc_base(phys_addr_t size, phys_addr_t align,
diff --git a/mm/memblock.c b/mm/memblock.c
index e9d6ca9..9a3bed0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -975,22 +975,35 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
}
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

-static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
- phys_addr_t align, phys_addr_t max_addr,
- int nid)
+static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
+ phys_addr_t align, phys_addr_t start,
+ phys_addr_t end, int nid)
{
phys_addr_t found;

if (!align)
align = SMP_CACHE_BYTES;

- found = memblock_find_in_range_node(size, align, 0, max_addr, nid);
+ found = memblock_find_in_range_node(size, align, start, end, nid);
if (found && !memblock_reserve(found, size))
return found;

return 0;
}

+phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
+ phys_addr_t start, phys_addr_t end)
+{
+ return memblock_alloc_range_nid(size, align, start, end, NUMA_NO_NODE);
+}
+
+static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
+ phys_addr_t align, phys_addr_t max_addr,
+ int nid)
+{
+ return memblock_alloc_range_nid(size, align, 0, max_addr, nid);
+}
+
phys_addr_t __init memblock_alloc_nid(phys_addr_t size, phys_addr_t align, int nid)
{
return memblock_alloc_base_nid(size, align, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
--
1.8.3.2

2014-04-15 13:10:17

by Akinobu Mita

Subject: [PATCH v3 5/5] cma: add placement specifier for "cma=" kernel parameter

Currently, the "cma=" kernel parameter is used to specify the size of
CMA, but we can't specify where the CMA area is located. We want to
locate it below 4GB for devices that only support 32-bit addressing on
64-bit systems without an iommu.

This extends the "cma=" kernel parameter so that the placement of the
CMA area can also be specified.

Examples:
1. place a 64MB CMA area below 4GB with "cma=64M@0-4G"
2. place a 64MB CMA area at exactly 512MB with "cma=64M@512M"

Note that the DMA contiguous memory allocator on x86 assumes that
page_address() works for the allocated pages. So this change limits the
end address of the contiguous memory area to max_pfn_mapped, via the
argument of dma_contiguous_reserve(), to prevent the area from being
placed in highmem.

Cc: Marek Szyprowski <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Don Dutile <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Akinobu Mita <[email protected]>
---
* Change from v2
- Avoid CMA area on highmem with cma= option, reported by Marek Szyprowski

Documentation/kernel-parameters.txt | 7 +++++--
arch/x86/kernel/setup.c | 2 +-
drivers/base/dma-contiguous.c | 42 ++++++++++++++++++++++++++++---------
include/linux/dma-contiguous.h | 9 +++++---
4 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 03e50b4..8488e68 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -617,8 +617,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Also note the kernel might malfunction if you disable
some critical bits.

- cma=nn[MG] [ARM,KNL]
- Sets the size of kernel global memory area for contiguous
+ cma=nn[MG]@[start[MG][-end[MG]]]
+ [ARM,X86,KNL]
+ Sets the size of kernel global memory area for
+ contiguous memory allocations and optionally the
+ placement constraint by the physical address range of
memory allocations. For more information, see
include/linux/dma-contiguous.h

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 09c76d2..78a0e62 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1119,7 +1119,7 @@ void __init setup_arch(char **cmdline_p)
setup_real_mode();

memblock_set_current_limit(get_max_mapped());
- dma_contiguous_reserve(0);
+ dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);

/*
* NOTE: On x86-32, only from this point on, fixmaps are ready for use.
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 165c2c2..b056661 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -59,11 +59,22 @@ struct cma *dma_contiguous_default_area;
*/
static const phys_addr_t size_bytes = CMA_SIZE_MBYTES * SZ_1M;
static phys_addr_t size_cmdline = -1;
+static phys_addr_t base_cmdline;
+static phys_addr_t limit_cmdline;

static int __init early_cma(char *p)
{
pr_debug("%s(%s)\n", __func__, p);
size_cmdline = memparse(p, &p);
+ if (*p != '@')
+ return 0;
+ base_cmdline = memparse(p + 1, &p);
+ if (*p != '-') {
+ limit_cmdline = base_cmdline + size_cmdline;
+ return 0;
+ }
+ limit_cmdline = memparse(p + 1, &p);
+
return 0;
}
early_param("cma", early_cma);
@@ -107,11 +118,18 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
void __init dma_contiguous_reserve(phys_addr_t limit)
{
phys_addr_t selected_size = 0;
+ phys_addr_t selected_base = 0;
+ phys_addr_t selected_limit = limit;
+ bool fixed = false;

pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);

if (size_cmdline != -1) {
selected_size = size_cmdline;
+ selected_base = base_cmdline;
+ selected_limit = min_not_zero(limit_cmdline, limit);
+ if (base_cmdline + size_cmdline == limit_cmdline)
+ fixed = true;
} else {
#ifdef CONFIG_CMA_SIZE_SEL_MBYTES
selected_size = size_bytes;
@@ -128,10 +146,12 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
pr_debug("%s: reserving %ld MiB for global area\n", __func__,
(unsigned long)selected_size / SZ_1M);

- dma_contiguous_reserve_area(selected_size, 0, limit,
- &dma_contiguous_default_area);
+ dma_contiguous_reserve_area(selected_size, selected_base,
+ selected_limit,
+ &dma_contiguous_default_area,
+ fixed);
}
-};
+}

static DEFINE_MUTEX(cma_mutex);

@@ -187,15 +207,20 @@ core_initcall(cma_init_reserved_areas);
* @base: Base address of the reserved area optional, use 0 for any
* @limit: End address of the reserved memory (optional, 0 for any).
* @res_cma: Pointer to store the created cma region.
+ * @fixed: hint about where to place the reserved area
*
* This function reserves memory from early allocator. It should be
* called by arch specific code once the early allocator (memblock or bootmem)
* has been activated and all other subsystems have already allocated/reserved
* memory. This function allows to create custom reserved areas for specific
* devices.
+ *
+ * If @fixed is true, reserve contiguous area at exactly @base. If false,
+ * reserve in range from @base to @limit.
*/
int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
- phys_addr_t limit, struct cma **res_cma)
+ phys_addr_t limit, struct cma **res_cma,
+ bool fixed)
{
struct cma *cma = &cma_areas[cma_area_count];
phys_addr_t alignment;
@@ -221,18 +246,15 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
limit &= ~(alignment - 1);

/* Reserve memory */
- if (base) {
+ if (base && fixed) {
if (memblock_is_region_reserved(base, size) ||
memblock_reserve(base, size) < 0) {
ret = -EBUSY;
goto err;
}
} else {
- /*
- * Use __memblock_alloc_base() since
- * memblock_alloc_base() panic()s.
- */
- phys_addr_t addr = __memblock_alloc_base(size, alignment, limit);
+ phys_addr_t addr = memblock_alloc_range(size, alignment, base,
+ limit);
if (!addr) {
ret = -ENOMEM;
goto err;
diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
index 3b28f93..772eab5 100644
--- a/include/linux/dma-contiguous.h
+++ b/include/linux/dma-contiguous.h
@@ -88,7 +88,8 @@ static inline void dma_contiguous_set_default(struct cma *cma)
void dma_contiguous_reserve(phys_addr_t addr_limit);

int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
- phys_addr_t limit, struct cma **res_cma);
+ phys_addr_t limit, struct cma **res_cma,
+ bool fixed);

/**
* dma_declare_contiguous() - reserve area for contiguous memory handling
@@ -108,7 +109,7 @@ static inline int dma_declare_contiguous(struct device *dev, phys_addr_t size,
{
struct cma *cma;
int ret;
- ret = dma_contiguous_reserve_area(size, base, limit, &cma);
+ ret = dma_contiguous_reserve_area(size, base, limit, &cma, true);
if (ret == 0)
dev_set_cma_area(dev, cma);

@@ -136,7 +137,9 @@ static inline void dma_contiguous_set_default(struct cma *cma) { }
static inline void dma_contiguous_reserve(phys_addr_t limit) { }

static inline int dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
- phys_addr_t limit, struct cma **res_cma) {
+ phys_addr_t limit, struct cma **res_cma,
+ bool fixed)
+{
return -ENOSYS;
}

--
1.8.3.2

2014-04-15 13:10:53

by Akinobu Mita

Subject: [PATCH v3 3/5] intel-iommu: integrate DMA CMA

This adds support for the DMA Contiguous Memory Allocator to
intel-iommu. The change enables dma_alloc_coherent() to allocate large
contiguous memory regions.

This is achieved in the same way nommu_dma_ops currently does it:
memory is first allocated with dma_alloc_from_contiguous(), and
alloc_pages() is used as a fallback.

Cc: Marek Szyprowski <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Don Dutile <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Akinobu Mita <[email protected]>
---
* Changes from v2
- Fix gfp flags check for __GFP_ATOMIC, reported by Marek Szyprowski
- Rebased on current Linus tree

drivers/iommu/intel-iommu.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index cdb97c4..78c68cb 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3185,7 +3185,7 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t flags,
struct dma_attrs *attrs)
{
- void *vaddr;
+ struct page *page = NULL;
int order;

size = PAGE_ALIGN(size);
@@ -3200,17 +3200,31 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
flags |= GFP_DMA32;
}

- vaddr = (void *)__get_free_pages(flags, order);
- if (!vaddr)
+ if (flags & __GFP_WAIT) {
+ unsigned int count = size >> PAGE_SHIFT;
+
+ page = dma_alloc_from_contiguous(dev, count, order);
+ if (page && iommu_no_mapping(dev) &&
+ page_to_phys(page) + size > dev->coherent_dma_mask) {
+ dma_release_from_contiguous(dev, page, count);
+ page = NULL;
+ }
+ }
+
+ if (!page)
+ page = alloc_pages(flags, order);
+ if (!page)
return NULL;
- memset(vaddr, 0, size);
+ memset(page_address(page), 0, size);

- *dma_handle = __intel_map_single(dev, virt_to_bus(vaddr), size,
+ *dma_handle = __intel_map_single(dev, page_to_phys(page), size,
DMA_BIDIRECTIONAL,
dev->coherent_dma_mask);
if (*dma_handle)
- return vaddr;
- free_pages((unsigned long)vaddr, order);
+ return page_address(page);
+ if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
+ __free_pages(page, order);
+
return NULL;
}

@@ -3218,12 +3232,14 @@ static void intel_free_coherent(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, struct dma_attrs *attrs)
{
int order;
+ struct page *page = virt_to_page(vaddr);

size = PAGE_ALIGN(size);
order = get_order(size);

intel_unmap_page(dev, dma_handle, size, DMA_BIDIRECTIONAL, NULL);
- free_pages((unsigned long)vaddr, order);
+ if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
+ __free_pages(page, order);
}

static void intel_unmap_sg(struct device *dev, struct scatterlist *sglist,
--
1.8.3.2

2014-04-16 19:44:10

by Andrew Morton

Subject: Re: [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled

On Tue, 15 Apr 2014 22:08:45 +0900 Akinobu Mita <[email protected]> wrote:

> Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.
>
> But when the contiguous memory allocator (CMA) is enabled on x86 and
> the memory region is allocated by dma_alloc_from_contiguous(), it
> doesn't return zeroed memory. Because dma_generic_alloc_coherent()
> forgot to fill the memory region with zero if it was allocated by
> dma_alloc_from_contiguous()
>
> Most implementations of dma_alloc_coherent() return zeroed memory
> regardless of whether __GFP_ZERO is specified. So this fixes it by
> unconditionally zeroing the allocated memory region.
>
> Alternatively, we could fix dma_alloc_from_contiguous() to return
> zeroed out memory and remove memset() from all caller of it. But we
> can't simply remove the memset on arm because __dma_clear_buffer() is
> used there for ensuring cache flushing and it is used in many places.
> Of course we can do redundant memset in dma_alloc_from_contiguous(),
> but I think this patch is less impact for fixing this problem.

But this patch does a duplicated memset if the page was allocated by
alloc_pages_node()?

Would it not be better to pass the gfp_t to dma_alloc_from_contiguous()
and have it implement __GFP_ZERO? That will fix this inefficiency,
will be symmetrical with the other underlying allocators and should
permit the appropriate fixups in arm?
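
(A minimal sketch of what that could look like, assuming a hypothetical
extended signature for dma_alloc_from_contiguous(); this is not the
current interface and the body is elided:)

	/*
	 * Hypothetical: let dma_alloc_from_contiguous() take the caller's
	 * gfp flags and honour __GFP_ZERO itself, so callers no longer
	 * need their own memset after a CMA allocation.
	 */
	struct page *dma_alloc_from_contiguous(struct device *dev, int count,
					       unsigned int align, gfp_t gfp)
	{
		struct page *page = NULL;

		/* ... existing CMA allocation path fills in 'page' ... */

		if (page && (gfp & __GFP_ZERO))
			memset(page_address(page), 0,
			       (size_t)count << PAGE_SHIFT);

		return page;
	}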


> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -97,7 +97,6 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
>
> dma_mask = dma_alloc_coherent_mask(dev, flag);
>
> - flag |= __GFP_ZERO;
> again:
> page = NULL;
> /* CMA can be used only in the context which permits sleeping */
> @@ -120,7 +119,7 @@ again:
>
> return NULL;
> }
> -
> + memset(page_address(page), 0, size);
> *dma_addr = addr;
> return page_address(page);
> }
> --
> 1.8.3.2

2014-04-17 15:40:52

by Akinobu Mita

Subject: Re: [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled

2014-04-17 4:44 GMT+09:00 Andrew Morton <[email protected]>:
> On Tue, 15 Apr 2014 22:08:45 +0900 Akinobu Mita <[email protected]> wrote:
>
>> Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.
>>
>> But when the contiguous memory allocator (CMA) is enabled on x86 and
>> the memory region is allocated by dma_alloc_from_contiguous(), it
>> doesn't return zeroed memory. Because dma_generic_alloc_coherent()
>> forgot to fill the memory region with zero if it was allocated by
>> dma_alloc_from_contiguous()
>>
>> Most implementations of dma_alloc_coherent() return zeroed memory
>> regardless of whether __GFP_ZERO is specified. So this fixes it by
>> unconditionally zeroing the allocated memory region.
>>
>> Alternatively, we could fix dma_alloc_from_contiguous() to return
>> zeroed out memory and remove memset() from all caller of it. But we
>> can't simply remove the memset on arm because __dma_clear_buffer() is
>> used there for ensuring cache flushing and it is used in many places.
>> Of course we can do redundant memset in dma_alloc_from_contiguous(),
>> but I think this patch is less impact for fixing this problem.
>
> But this patch does a duplicated memset if the page was allocated by
> alloc_pages_node()?

You're right. Clearing the __GFP_ZERO bit in the gfp flags before
allocating with alloc_pages_node() can fix this duplicated memset.

> Would it not be better to pass the gfp_t to dma_alloc_from_contiguous()
> and have it implement __GFP_ZERO? That will fix thsi inefficiency,
> will be symmetrical with the other underlying allocators and should
> permit the appropriate fixups in arm?

Sounds good. If it also handles __GFP_WAIT, we can remove the
__GFP_WAIT check that is almost always required before calling
dma_alloc_from_contiguous().

>> --- a/arch/x86/kernel/pci-dma.c
>> +++ b/arch/x86/kernel/pci-dma.c
>> @@ -97,7 +97,6 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
>>
>> dma_mask = dma_alloc_coherent_mask(dev, flag);
>>
>> - flag |= __GFP_ZERO;

I'll soon prepare a follow-up patch to clear __GFP_ZERO like

+ flag &= ~__GFP_ZERO
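
(A sketch only of where that would land in dma_generic_alloc_coherent(),
assuming the fallback path still uses alloc_pages_node() as discussed
above; this is not the actual follow-up patch:)

	/* Zero explicitly instead of via __GFP_ZERO, and clear the flag so
	 * the fallback page allocator does not zero the buffer a second
	 * time; the memset() at the end covers both allocation paths.
	 */
	flag &= ~__GFP_ZERO;
again:
	page = NULL;
	/* CMA can be used only in the context which permits sleeping */
	if (flag & __GFP_WAIT)
		page = dma_alloc_from_contiguous(dev, count, get_order(size));
	/* fallback */
	if (!page)
		page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));
	if (!page)
		return NULL;
	...
	memset(page_address(page), 0, size);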

>> again:
>> page = NULL;
>> /* CMA can be used only in the context which permits sleeping */
>> @@ -120,7 +119,7 @@ again:
>>
>> return NULL;
>> }
>> -
>> + memset(page_address(page), 0, size);
>> *dma_addr = addr;
>> return page_address(page);
>> }
>> --
>> 1.8.3.2

2014-10-01 09:05:23

by Thomas Gleixner

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

On Tue, 30 Sep 2014, Peter Hurley wrote:
> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
> > Whether the proposed patchset is the correct solution to support it is
> > a completely different question.
>
> This patchset has been in mainline since 3.16 and has already caused
> regressions, so the question of whether this is the correct solution has
> already been answered.

Agreed.

> > So either you stop this right now and help Akinobu to find the proper
> > solution
>
> If this is only a test platform for ARM parts then I don't think it
> unreasonable to suggest forking x86 swiotlb support into a iommu=cma
> selector that gets DMA mapping working for this test platform and doesn't
> cause a bunch of breakage.

Breakage is not acceptable in any case.

> Which is different than if the plan is to ship production units for x86;
> then a general purpose solution will be required.
>
> As to the good design of a general purpose solution for allocating and
> mapping huge order pages, you are certainly more qualified to help Akinobu
> than I am.

Fair enough. Still, this does not make the case for outright rejecting
the idea of supporting that kind of device, even if it is an esoteric
case. We deal with enough esoteric hardware in Linux, and if done
right, it does no harm to anyone.

I'll have a look at the technical details.

Thanks,

tglx

2014-10-02 16:42:28

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
> > Whether the proposed patchset is the correct solution to support it is
> > a completely different question.
>
> This patchset has been in mainline since 3.16 and has already caused
> regressions, so the question of whether this is the correct solution has
> already been answered.
>
> > So either you stop this right now and help Akinobu to find the proper
> > solution
>
> If this is only a test platform for ARM parts then I don't think it
> unreasonable to suggest forking x86 swiotlb support into a iommu=cma

Not sure what you mean by 'forking x86 swiotlb'? As in having SWIOTLB
work under ARM?

> selector that gets DMA mapping working for this test platform and doesn't
> cause a bunch of breakage.

I think you might want to take a look at the IOMMU_DETECT macros
and enable CMA there only if certain devices are available.

That way the normal flow of detecting which IOMMU to use is still present
and will turn off CMA if there is no device that would use it.

>
> Which is different than if the plan is to ship production units for x86;
> then a general purpose solution will be required.
>
> As to the good design of a general purpose solution for allocating and
> mapping huge order pages, you are certainly more qualified to help Akinobu
> than I am.
>
> Regards,
> Peter Hurley
>

2014-10-02 22:03:22

by Peter Hurley

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>> Whether the proposed patchset is the correct solution to support it is
>>> a completely different question.
>>
>> This patchset has been in mainline since 3.16 and has already caused
>> regressions, so the question of whether this is the correct solution has
>> already been answered.
>>
>>> So either you stop this right now and help Akinobu to find the proper
>>> solution
>>
>> If this is only a test platform for ARM parts then I don't think it
>> unreasonable to suggest forking x86 swiotlb support into a iommu=cma
>
> Not sure what you mean by 'forking x86 swiotlb' ? As in have SWIOTLB
> work under ARM?

No, that's not what I meant.

>> selector that gets DMA mapping working for this test platform and doesn't
>> cause a bunch of breakage.
>
> I think you might want to take a look at the IOMMU_DETECT macros
> and enable CMA there only if the certain devices are available.
>
> That way the normal flow of detecting which IOMMU to use is still present
> and will turn of CMA if there is no device that would use it.
>
>>
>> Which is different than if the plan is to ship production units for x86;
>> then a general purpose solution will be required.
>>
>> As to the good design of a general purpose solution for allocating and
>> mapping huge order pages, you are certainly more qualified to help Akinobu
>> than I am.

What Akinobu's patches intend to support is:

phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);

which raises three issues:

1. Where do coherent blocks of this size come from?
2. How to prevent fragmentation of these reserved blocks over time by
existing DMA users?
3. Is this support generically required across all iommu implementations on x86?

Questions 1 and 2 are non-trivial, in the general case, otherwise the page
allocator would already do this. Simply dropping in the contiguous memory
allocator doesn't work because CMA does not have the same policy and performance
as the page allocator, and is already causing performance regressions even
in the absence of huge page allocations.

So that's why I raised question 3; is making the necessary compromises to support
64MB coherent DMA allocations across all x86 iommu implementations actually
required?

Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
designed to be limited to testing configurations, as the introductory
commit states:

commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
Author: Marek Szyprowski <[email protected]>
Date: Thu Dec 29 13:09:51 2011 +0100

X86: integrate CMA with DMA-mapping subsystem

This patch adds support for CMA to dma-mapping subsystem for x86
architecture that uses common pci-dma/pci-nommu implementation. This
allows to test CMA on KVM/QEMU and a lot of common x86 boxes.

Signed-off-by: Marek Szyprowski <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
CC: Michal Nazarewicz <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>


Which brings me to my suggestion: if support for huge coherent DMA is
required only for a special test platform, then could not this support
be specific to a new iommu configuration, namely iommu=cma, which would
get initialized much the same way that iommu=calgary is now.

The code for such an iommu configuration would mostly duplicate
arch/x86/kernel/pci-swiotlb.c, and the CMA support would get removed from
the other x86 iommu implementations.

Regards,
Peter Hurley

2014-10-02 23:08:36

by Akinobu Mita

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

2014-10-03 7:03 GMT+09:00 Peter Hurley <[email protected]>:
> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:

>>> Which is different than if the plan is to ship production units for x86;
>>> then a general purpose solution will be required.
>>>
>>> As to the good design of a general purpose solution for allocating and
>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>> than I am.
>
> What Akinobu's patches intend to support is:
>
> phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>
> which raises three issues:
>
> 1. Where do coherent blocks of this size come from?
> 2. How to prevent fragmentation of these reserved blocks over time by
> existing DMA users?
> 3. Is this support generically required across all iommu implementations on x86?
>
> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
> allocator would already do this. Simply dropping in the contiguous memory
> allocator doesn't work because CMA does not have the same policy and performance
> as the page allocator, and is already causing performance regressions even
> in the absence of huge page allocations.

Could you take a look at the patches I sent? Can they fix these issues?
https://lkml.org/lkml/2014/9/28/110

With these patches, normal alloc_pages() is used for allocation first
and dma_alloc_from_contiguous() is used as a fallback.
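
(A sketch of that allocation order; the __GFP_WAIT check is an assumption
based on CMA's requirement for a sleepable context, see the posted
patches for the exact code:)

	/* Sketch: try the normal page allocator first and fall back to the
	 * reserved CMA area only when that fails, so ordinary allocations
	 * do not fragment the contiguous region.
	 */
	page = alloc_pages(flags, order);
	if (!page && (flags & __GFP_WAIT))
		page = dma_alloc_from_contiguous(dev, count, order);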

> So that's why I raised question 3; is making the necessary compromises to support
> 64MB coherent DMA allocations across all x86 iommu implementations actually
> required?
>
> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
> designed to be limited to testing configurations, as the introductory
> commit states:
>
> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
> Author: Marek Szyprowski <[email protected]>
> Date: Thu Dec 29 13:09:51 2011 +0100
>
> X86: integrate CMA with DMA-mapping subsystem
>
> This patch adds support for CMA to dma-mapping subsystem for x86
> architecture that uses common pci-dma/pci-nommu implementation. This
> allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>
> Signed-off-by: Marek Szyprowski <[email protected]>
> Signed-off-by: Kyungmin Park <[email protected]>
> CC: Michal Nazarewicz <[email protected]>
> Acked-by: Arnd Bergmann <[email protected]>
>
>
> Which brings me to my suggestion: if support for huge coherent DMA is
> required only for a special test platform, then could not this support
> be specific to a new iommu configuration, namely iommu=cma, which would
> get initialized much the same way that iommu=calgary is now.
>
> The code for such a iommu configuration would mostly duplicate
> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
> the other x86 iommu implementations.

I'm not sure I read this correctly, though. Can the boot option 'cma=0'
also help keep the IOMMU implementations from using CMA?

2014-10-03 13:41:37

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

On Fri, Oct 03, 2014 at 08:08:33AM +0900, Akinobu Mita wrote:
> 2014-10-03 7:03 GMT+09:00 Peter Hurley <[email protected]>:
> > On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
> >> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
> >>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>
> >>> Which is different than if the plan is to ship production units for x86;
> >>> then a general purpose solution will be required.
> >>>
> >>> As to the good design of a general purpose solution for allocating and
> >>> mapping huge order pages, you are certainly more qualified to help Akinobu
> >>> than I am.
> >
> > What Akinobu's patches intend to support is:
> >
> > phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
> >
> > which raises three issues:
> >
> > 1. Where do coherent blocks of this size come from?
> > 2. How to prevent fragmentation of these reserved blocks over time by
> > existing DMA users?
> > 3. Is this support generically required across all iommu implementations on x86?
> >
> > Questions 1 and 2 are non-trivial, in the general case, otherwise the page
> > allocator would already do this. Simply dropping in the contiguous memory
> > allocator doesn't work because CMA does not have the same policy and performance
> > as the page allocator, and is already causing performance regressions even
> > in the absence of huge page allocations.
>
> Could you take a look at the patches I sent? Can they fix these issues?
> https://lkml.org/lkml/2014/9/28/110
>
> With these patches, normal alloc_pages() is used for allocation first
> and dma_alloc_from_contiguous() is used as a fallback.
>
> > So that's why I raised question 3; is making the necessary compromises to support
> > 64MB coherent DMA allocations across all x86 iommu implementations actually
> > required?
> >
> > Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
> > designed to be limited to testing configurations, as the introductory
> > commit states:
> >
> > commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
> > Author: Marek Szyprowski <[email protected]>
> > Date: Thu Dec 29 13:09:51 2011 +0100
> >
> > X86: integrate CMA with DMA-mapping subsystem
> >
> > This patch adds support for CMA to dma-mapping subsystem for x86
> > architecture that uses common pci-dma/pci-nommu implementation. This
> > allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
> >
> > Signed-off-by: Marek Szyprowski <[email protected]>
> > Signed-off-by: Kyungmin Park <[email protected]>
> > CC: Michal Nazarewicz <[email protected]>
> > Acked-by: Arnd Bergmann <[email protected]>
> >
> >
> > Which brings me to my suggestion: if support for huge coherent DMA is
> > required only for a special test platform, then could not this support
> > be specific to a new iommu configuration, namely iommu=cma, which would
> > get initialized much the same way that iommu=calgary is now.
> >
> > The code for such a iommu configuration would mostly duplicate
> > arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
> > the other x86 iommu implementations.

Right. That sounds like a good plan ..
>
> I'm not sure I read correctly, though. Can boot option 'cma=0' also
> help avoiding CMA from IOMMU implementation?

.. it would be done automatically now instead of having to pass 'cma=0'.

2014-10-03 14:27:21

by Peter Hurley

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

On 10/02/2014 07:08 PM, Akinobu Mita wrote:
> 2014-10-03 7:03 GMT+09:00 Peter Hurley <[email protected]>:
>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>
>>>> Which is different than if the plan is to ship production units for x86;
>>>> then a general purpose solution will be required.
>>>>
>>>> As to the good design of a general purpose solution for allocating and
>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>> than I am.
>>
>> What Akinobu's patches intend to support is:
>>
>> phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>
>> which raises three issues:
>>
>> 1. Where do coherent blocks of this size come from?
>> 2. How to prevent fragmentation of these reserved blocks over time by
>> existing DMA users?
>> 3. Is this support generically required across all iommu implementations on x86?
>>
>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>> allocator would already do this. Simply dropping in the contiguous memory
>> allocator doesn't work because CMA does not have the same policy and performance
>> as the page allocator, and is already causing performance regressions even
>> in the absence of huge page allocations.
>
> Could you take a look at the patches I sent? Can they fix these issues?
> https://lkml.org/lkml/2014/9/28/110
>
> With these patches, normal alloc_pages() is used for allocation first
> and dma_alloc_from_contiguous() is used as a fallback.

Sure, I can test these patches this weekend.
Where are the unit tests?

>> So that's why I raised question 3; is making the necessary compromises to support
>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>> required?
>>
>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>> designed to be limited to testing configurations, as the introductory
>> commit states:
>>
>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>> Author: Marek Szyprowski <[email protected]>
>> Date: Thu Dec 29 13:09:51 2011 +0100
>>
>> X86: integrate CMA with DMA-mapping subsystem
>>
>> This patch adds support for CMA to dma-mapping subsystem for x86
>> architecture that uses common pci-dma/pci-nommu implementation. This
>> allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>
>> Signed-off-by: Marek Szyprowski <[email protected]>
>> Signed-off-by: Kyungmin Park <[email protected]>
>> CC: Michal Nazarewicz <[email protected]>
>> Acked-by: Arnd Bergmann <[email protected]>
>>
>>
>> Which brings me to my suggestion: if support for huge coherent DMA is
>> required only for a special test platform, then could not this support
>> be specific to a new iommu configuration, namely iommu=cma, which would
>> get initialized much the same way that iommu=calgary is now.
>>
>> The code for such a iommu configuration would mostly duplicate
>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>> the other x86 iommu implementations.
>
> I'm not sure I read correctly, though. Can boot option 'cma=0' also
> help avoiding CMA from IOMMU implementation?

Maybe, but that's not an appropriate solution for distro kernels.

Nor does this address configurations that want a really large CMA so
1GB huge pages can be allocated (not for DMA though).

Regards,
Peter Hurley

2014-10-03 16:06:23

by Akinobu Mita

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

2014-10-03 23:27 GMT+09:00 Peter Hurley <[email protected]>:
> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <[email protected]>:
>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>
>>>>> Which is different than if the plan is to ship production units for x86;
>>>>> then a general purpose solution will be required.
>>>>>
>>>>> As to the good design of a general purpose solution for allocating and
>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>> than I am.
>>>
>>> What Akinobu's patches intend to support is:
>>>
>>> phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>
>>> which raises three issues:
>>>
>>> 1. Where do coherent blocks of this size come from?
>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>> existing DMA users?
>>> 3. Is this support generically required across all iommu implementations on x86?
>>>
>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>> allocator would already do this. Simply dropping in the contiguous memory
>>> allocator doesn't work because CMA does not have the same policy and performance
>>> as the page allocator, and is already causing performance regressions even
>>> in the absence of huge page allocations.
>>
>> Could you take a look at the patches I sent? Can they fix these issues?
>> https://lkml.org/lkml/2014/9/28/110
>>
>> With these patches, normal alloc_pages() is used for allocation first
>> and dma_alloc_from_contiguous() is used as a fallback.
>
> Sure, I can test these patches this weekend.
> Where are the unit tests?

Thanks a lot. I would like to know whether the performance regression
you see disappears with these patches, as if CONFIG_DMA_CMA were
disabled.

>>> So that's why I raised question 3; is making the necessary compromises to support
>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>> required?
>>>
>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>> designed to be limited to testing configurations, as the introductory
>>> commit states:
>>>
>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>> Author: Marek Szyprowski <[email protected]>
>>> Date: Thu Dec 29 13:09:51 2011 +0100
>>>
>>> X86: integrate CMA with DMA-mapping subsystem
>>>
>>> This patch adds support for CMA to dma-mapping subsystem for x86
>>> architecture that uses common pci-dma/pci-nommu implementation. This
>>> allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>
>>> Signed-off-by: Marek Szyprowski <[email protected]>
>>> Signed-off-by: Kyungmin Park <[email protected]>
>>> CC: Michal Nazarewicz <[email protected]>
>>> Acked-by: Arnd Bergmann <[email protected]>
>>>
>>>
>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>> required only for a special test platform, then could not this support
>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>> get initialized much the same way that iommu=calgary is now.
>>>
>>> The code for such a iommu configuration would mostly duplicate
>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>> the other x86 iommu implementations.
>>
>> I'm not sure I read correctly, though. Can boot option 'cma=0' also
>> help avoiding CMA from IOMMU implementation?
>
> Maybe, but that's not an appropriate solution for distro kernels.
>
> Nor does this address configurations that want a really large CMA so
> 1GB huge pages can be allocated (not for DMA though).

Now I see the point of the iommu=cma you suggested. But what should we do
when CONFIG_SWIOTLB is disabled, especially for x86_32?
Should we just introduce yet another flag that says not to use DMA_CMA,
instead of adding a new swiotlb-like iommu implementation?

2014-10-03 16:35:08

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

On 10/3/2014 12:06 PM, Akinobu Mita wrote:
> 2014-10-03 23:27 GMT+09:00 Peter Hurley <[email protected]>:
>> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <[email protected]>:
>>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>>
>>>>>> Which is different than if the plan is to ship production units for x86;
>>>>>> then a general purpose solution will be required.
>>>>>>
>>>>>> As to the good design of a general purpose solution for allocating and
>>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>>> than I am.
>>>>
>>>> What Akinobu's patches intend to support is:
>>>>
>>>> phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>>
>>>> which raises three issues:
>>>>
>>>> 1. Where do coherent blocks of this size come from?
>>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>>> existing DMA users?
>>>> 3. Is this support generically required across all iommu implementations on x86?
>>>>
>>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>>> allocator would already do this. Simply dropping in the contiguous memory
>>>> allocator doesn't work because CMA does not have the same policy and performance
>>>> as the page allocator, and is already causing performance regressions even
>>>> in the absence of huge page allocations.
>>>
>>> Could you take a look at the patches I sent? Can they fix these issues?
>>> https://lkml.org/lkml/2014/9/28/110
>>>
>>> With these patches, normal alloc_pages() is used for allocation first
>>> and dma_alloc_from_contiguous() is used as a fallback.
>>
>> Sure, I can test these patches this weekend.
>> Where are the unit tests?
>
> Thanks a lot. I would like to know whether the performance regression
> you see will disappear or not with these patches as if CONFIG_DMA_CMA is
> disabled.
>
>>>> So that's why I raised question 3; is making the necessary compromises to support
>>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>>> required?
>>>>
>>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>>> designed to be limited to testing configurations, as the introductory
>>>> commit states:
>>>>
>>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>>> Author: Marek Szyprowski <[email protected]>
>>>> Date: Thu Dec 29 13:09:51 2011 +0100
>>>>
>>>> X86: integrate CMA with DMA-mapping subsystem
>>>>
>>>> This patch adds support for CMA to dma-mapping subsystem for x86
>>>> architecture that uses common pci-dma/pci-nommu implementation. This
>>>> allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>>
>>>> Signed-off-by: Marek Szyprowski <[email protected]>
>>>> Signed-off-by: Kyungmin Park <[email protected]>
>>>> CC: Michal Nazarewicz <[email protected]>
>>>> Acked-by: Arnd Bergmann <[email protected]>
>>>>
>>>>
>>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>>> required only for a special test platform, then could not this support
>>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>>> get initialized much the same way that iommu=calgary is now.
>>>>
>>>> The code for such a iommu configuration would mostly duplicate
>>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>>> the other x86 iommu implementations.
>>>
>>> I'm not sure I read correctly, though. Can boot option 'cma=0' also
>>> help avoiding CMA from IOMMU implementation?
>>
>> Maybe, but that's not an appropriate solution for distro kernels.
>>
>> Nor does this address configurations that want a really large CMA so
>> 1GB huge pages can be allocated (not for DMA though).
>
> Now I see the point of iommu=cma you suggested. But what should we do
> when CONFIG_SWIOTLB is disabled, especially for x86_32?
> Should we just introduce yet another flag to tell not using DMA_CMA
> instead of adding new swiotlb-like iommu implementation?
>

If you implement a DMA API producer - aka dma_ops (which is what Peter
is thinking, I believe) - it won't matter which IOMMUs / DMA producers
are selected, right?

Or are you saying that CMA needs SWIOTLB to handle certain types of
pages as a fallback mechanism - and hence there needs to be a tight
relationship?

In which case I would look at making SWIOTLB more library-like - the
Xen-SWIOTLB already does that by using certain parts of the SWIOTLB code
which are exposed to the rest of the kernel.

2014-10-03 16:39:24

by Peter Hurley

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

On 10/03/2014 12:06 PM, Akinobu Mita wrote:
> 2014-10-03 23:27 GMT+09:00 Peter Hurley <[email protected]>:
>> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <[email protected]>:
>>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>>
>>>>>> Which is different than if the plan is to ship production units for x86;
>>>>>> then a general purpose solution will be required.
>>>>>>
>>>>>> As to the good design of a general purpose solution for allocating and
>>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>>> than I am.
>>>>
>>>> What Akinobu's patches intend to support is:
>>>>
>>>> phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>>
>>>> which raises three issues:
>>>>
>>>> 1. Where do coherent blocks of this size come from?
>>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>>> existing DMA users?
>>>> 3. Is this support generically required across all iommu implementations on x86?
>>>>
>>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>>> allocator would already do this. Simply dropping in the contiguous memory
>>>> allocator doesn't work because CMA does not have the same policy and performance
>>>> as the page allocator, and is already causing performance regressions even
>>>> in the absence of huge page allocations.
>>>
>>> Could you take a look at the patches I sent? Can they fix these issues?
>>> https://lkml.org/lkml/2014/9/28/110
>>>
>>> With these patches, normal alloc_pages() is used for allocation first
>>> and dma_alloc_from_contiguous() is used as a fallback.
>>
>> Sure, I can test these patches this weekend.
>> Where are the unit tests?
>
> Thanks a lot. I would like to know whether the performance regression
> you see will disappear or not with these patches as if CONFIG_DMA_CMA is
> disabled.

I think something may have gotten lost in translation.

My "test" consists of doing my daily work (email, emacs, kernel builds,
web breaks, etc).

I don't have a testsuite that validates a page allocator or records any
performance metrics (for TTM allocations under load, as an example).

Without a unit test and performance metrics, my "test" is not really
positive affirmation of a correct implementation.


>>>> So that's why I raised question 3; is making the necessary compromises to support
>>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>>> required?
>>>>
>>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>>> designed to be limited to testing configurations, as the introductory
>>>> commit states:
>>>>
>>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>>> Author: Marek Szyprowski <[email protected]>
>>>> Date: Thu Dec 29 13:09:51 2011 +0100
>>>>
>>>> X86: integrate CMA with DMA-mapping subsystem
>>>>
>>>> This patch adds support for CMA to dma-mapping subsystem for x86
>>>> architecture that uses common pci-dma/pci-nommu implementation. This
>>>> allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>>
>>>> Signed-off-by: Marek Szyprowski <[email protected]>
>>>> Signed-off-by: Kyungmin Park <[email protected]>
>>>> CC: Michal Nazarewicz <[email protected]>
>>>> Acked-by: Arnd Bergmann <[email protected]>
>>>>
>>>>
>>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>>> required only for a special test platform, then could not this support
>>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>>> get initialized much the same way that iommu=calgary is now.
>>>>
>>>> The code for such a iommu configuration would mostly duplicate
>>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>>> the other x86 iommu implementations.
>>>
>>> I'm not sure I read correctly, though. Can boot option 'cma=0' also
>>> help avoiding CMA from IOMMU implementation?
>>
>> Maybe, but that's not an appropriate solution for distro kernels.
>>
>> Nor does this address configurations that want a really large CMA so
>> 1GB huge pages can be allocated (not for DMA though).
>
> Now I see the point of iommu=cma you suggested. But what should we do
> when CONFIG_SWIOTLB is disabled, especially for x86_32?
> Should we just introduce yet another flag to tell not using DMA_CMA
> instead of adding new swiotlb-like iommu implementation?

Again, since I don't know what you're using this for and
there are no existing mainline users, I can't really design this for
you.

I'm just trying to do my best to come up with alternative solutions
that limit the impact to existing x86 configurations, while still
achieving your goals (without really knowing what those design
constraints are).

Regards,
Peter Hurley

2014-10-05 06:01:46

by Akinobu Mita

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

2014-10-04 1:39 GMT+09:00 Peter Hurley <[email protected]>:
> On 10/03/2014 12:06 PM, Akinobu Mita wrote:
>> 2014-10-03 23:27 GMT+09:00 Peter Hurley <[email protected]>:
>>> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>>>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <[email protected]>:
>>>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>>>
>>>>>>> Which is different than if the plan is to ship production units for x86;
>>>>>>> then a general purpose solution will be required.
>>>>>>>
>>>>>>> As to the good design of a general purpose solution for allocating and
>>>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>>>> than I am.
>>>>>
>>>>> What Akinobu's patches intend to support is:
>>>>>
>>>>> phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>>>
>>>>> which raises three issues:
>>>>>
>>>>> 1. Where do coherent blocks of this size come from?
>>>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>>>> existing DMA users?
>>>>> 3. Is this support generically required across all iommu implementations on x86?
>>>>>
>>>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>>>> allocator would already do this. Simply dropping in the contiguous memory
>>>>> allocator doesn't work because CMA does not have the same policy and performance
>>>>> as the page allocator, and is already causing performance regressions even
>>>>> in the absence of huge page allocations.
>>>>
>>>> Could you take a look at the patches I sent? Can they fix these issues?
>>>> https://lkml.org/lkml/2014/9/28/110
>>>>
>>>> With these patches, normal alloc_pages() is used for allocation first
>>>> and dma_alloc_from_contiguous() is used as a fallback.
>>>
>>> Sure, I can test these patches this weekend.
>>> Where are the unit tests?
>>
>> Thanks a lot. I would like to know whether the performance regression
>> you see will disappear or not with these patches as if CONFIG_DMA_CMA is
>> disabled.
>
> I think something may have gotten lost in translation.
>
> My "test" consists of doing my daily work (email, emacs, kernel builds,
> web breaks, etc).
>
> I don't have a testsuite that validates a page allocator or records any
> performance metrics (for TTM allocations under load, as an example).
>
> Without a unit test and performance metrics, my "test" is not really
> positive affirmation of a correct implementation.
>
>
>>>>> So that's why I raised question 3; is making the necessary compromises to support
>>>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>>>> required?
>>>>>
>>>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>>>> designed to be limited to testing configurations, as the introductory
>>>>> commit states:
>>>>>
>>>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>>>> Author: Marek Szyprowski <[email protected]>
>>>>> Date: Thu Dec 29 13:09:51 2011 +0100
>>>>>
>>>>> X86: integrate CMA with DMA-mapping subsystem
>>>>>
>>>>> This patch adds support for CMA to dma-mapping subsystem for x86
>>>>> architecture that uses common pci-dma/pci-nommu implementation. This
>>>>> allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>>>
>>>>> Signed-off-by: Marek Szyprowski <[email protected]>
>>>>> Signed-off-by: Kyungmin Park <[email protected]>
>>>>> CC: Michal Nazarewicz <[email protected]>
>>>>> Acked-by: Arnd Bergmann <[email protected]>
>>>>>
>>>>>
>>>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>>>> required only for a special test platform, then could not this support
>>>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>>>> get initialized much the same way that iommu=calgary is now.
>>>>>
>>>>> The code for such a iommu configuration would mostly duplicate
>>>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>>>> the other x86 iommu implementations.
>>>>
>>>> I'm not sure I read correctly, though. Can boot option 'cma=0' also
>>>> help avoiding CMA from IOMMU implementation?
>>>
>>> Maybe, but that's not an appropriate solution for distro kernels.
>>>
>>> Nor does this address configurations that want a really large CMA so
>>> 1GB huge pages can be allocated (not for DMA though).

The kernel parameter 'cma=' is only available when CONFIG_DMA_CMA is
enabled, and cma=0 doesn't disable 1GB huge pages as far as I can see.
So I will prepare a patch which makes the default cma size zero on x86.
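
Something along these lines - just a sketch of the idea, not the actual
patch; the CONFIG_X86 special case is made up, while the other two
macros already exist in drivers/base/dma-contiguous.c:

    /* default to no CMA area on x86 unless cma= is given explicitly */
    #if defined(CONFIG_CMA_SIZE_MBYTES) && !defined(CONFIG_X86)
    #define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
    #else
    #define CMA_SIZE_MBYTES 0
    #endif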

>> Now I see the point of iommu=cma you suggested. But what should we do
>> when CONFIG_SWIOTLB is disabled, especially for x86_32?
>> Should we just introduce yet another flag to tell not using DMA_CMA
>> instead of adding new swiotlb-like iommu implementation?
>
> Again, since I don't know what you're using this for and
> there are no existing mainline users, I can't really design this for
> you.
>
> I'm just trying to do my best to come up with alternative solutions
> that limit the impact to existing x86 configurations, while still
> achieving your goals (without really knowing what those design
> constraints are).

Thanks a lot for your advice.