From: Petr Tesarik <[email protected]>
The goal of my work is to provide more flexibility in the sizing of
SWIOTLB. This patch series is a request for comments from the wider
community. The code is more of a crude hack than a final solution.
I would appreciate suggestions for measuring the performance impact
of changes in SWIOTLB. More info at the end of this cover letter.
The software IO TLB was designed with these assumptions:
1. It would not be used much, especially on 64-bit systems.
2. A small fixed memory area (64 MiB by default) is sufficient to
handle the few cases which require a bounce buffer.
3. 64 MiB is little enough that it has no impact on the rest of the
system.
First, if SEV is active, all DMA must be done through shared
unencrypted pages, and SWIOTLB is used to make this happen without
changing device drivers. The software IO TLB size is increased to
6% of total memory in sev_setup_arch(), but that is more of an
approximation. The actual requirements may vary depending on the
amount of I/O and which drivers are used. These factors may not be
known at boot time, i.e. when SWIOTLB is allocated.
Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
moved from the rendering GPU (v3d driver), which can access all
memory, to the display output (vc4 driver), which is connected to a
bus with an address limit of 1 GiB and no IOMMU. These buffers can
be large (8 MiB with a FullHD monitor, 34 MiB with a 4K monitor)
and cannot even be handled by the current SWIOTLB, because they exceed
the maximum segment size of 256 KiB. Mapping failures can be
easily reproduced with GNOME remote desktop on a Raspberry Pi 4.
Third, other colleagues have noticed that they can reliably get rid
of occasional OOM kills on an Arm embedded device by reducing the
SWIOTLB size. This can be achieved with a kernel parameter, but
determining the right value puts additional burden on pre-release
testing, which could be avoided if SWIOTLB is allocated small and
grows only when necessary.
I have tried to measure the expected performance degradation so
that I could reduce it and/or compare it to alternative approaches.
I have performed all tests on an otherwise idle Raspberry Pi 4 with
swiotlb=force (which, admittedly, is a bit artificial). I quickly
ran into trouble.
I ran fio against an ext3 filesystem mounted from a UAS drive. To
my surprise, forcing swiotlb (without my patches) *improved* IOPS
and bandwidth for 4K and 64K blocks by 3 to 7 percent, and made no
visible difference for 1M blocks. I also observed smaller minimum
and average completion latencies, and even smaller maximum
latencies for 4K blocks. However, when I ran the tests again later
to verify some oddities, there was a performance drop. It appears
that IOPS, bandwidth and latencies reported by two consecutive fio
runs may differ by as much as 10%, so the results are invalid.
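For reference, the runs described above used a job of roughly this shape. This is a reconstruction, not the original job file: the mount point /mnt/uas, the io engine, and the queue depth are assumptions; only the block sizes (4K/64K/1M) and the ext3-on-UAS target come from the text:

```ini
; Hypothetical fio job approximating the described test.
; /mnt/uas is an assumed mount point for the ext3 filesystem
; on the UAS drive; duplicate the job with bs=64k and bs=1m
; for the other block sizes mentioned.
[global]
directory=/mnt/uas
size=1g
direct=1
ioengine=libaio
iodepth=16
runtime=60
time_based

[randread-4k]
rw=randread
bs=4k
```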
I tried to micro-benchmark dma_map_page_attrs() using the
bcc tool funclatency, but just loading the eBPF program was enough
to change the behaviour of the system wildly.
I wonder if anyone can give me advice on measuring SWIOTLB
performance. I can see that AMD, IBM and Microsoft people have
mentioned performance in their patches, but AFAICS without
explaining how it was measured. Knowing a bit more would be much
appreciated.
Petr Tesarik (4):
dma-mapping: introduce the DMA_ATTR_MAY_SLEEP attribute
swiotlb: Move code around in preparation for dynamic bounce buffers
swiotlb: Allow dynamic allocation of bounce buffers
swiotlb: Add an option to allow dynamic bounce buffers
.../admin-guide/kernel-parameters.txt | 6 +-
Documentation/core-api/dma-attributes.rst | 10 +
include/linux/dma-mapping.h | 6 +
include/linux/swiotlb.h | 17 +-
kernel/dma/swiotlb.c | 233 +++++++++++++++---
5 files changed, 241 insertions(+), 31 deletions(-)
--
2.25.1
From: Petr Tesarik <[email protected]>
Introduce a DMA attribute to tell the DMA-mapping subsystem that
the operation is allowed to sleep.
This patch merely adds the flag, which is not used for anything at
the moment. It should be used by users who can sleep (e.g. dma-buf
ioctls) to allow page reclaim and/or allocations from CMA.
Signed-off-by: Petr Tesarik <[email protected]>
---
Documentation/core-api/dma-attributes.rst | 10 ++++++++++
include/linux/dma-mapping.h | 6 ++++++
2 files changed, 16 insertions(+)
diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e92..6481ce2acf5d 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,13 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
+
+DMA_ATTR_MAY_SLEEP
+------------------
+
+This tells the DMA-mapping subsystem that it is allowed to sleep. For example,
+if mapping needs a bounce buffer, software IO TLB may use CMA for the
+allocation if this flag is given.
+
+This attribute is not used for dma_alloc_* functions. Instead, the provided
+GFP flags are used to determine whether the allocation may sleep.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0ee20b764000..7a75c503ac38 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,12 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)
+/*
+ * DMA_ATTR_MAY_SLEEP: This tells the DMA-mapping subsystem that it is allowed
+ * to sleep.
+ */
+#define DMA_ATTR_MAY_SLEEP (1UL << 10)
+
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
* be given to a device to use as a DMA source or target. It is specific to a
--
2.25.1
From: Petr Tesarik <[email protected]>
The software IO TLB was designed with the assumption that it is not
used much, especially on 64-bit systems, so a small fixed memory
area (currently 64 MiB) is sufficient to handle the few cases which
still require a bounce buffer. However, these cases are not so rare
in some circumstances.
First, if SEV is active, all DMA must be done through shared
unencrypted pages, and SWIOTLB is used to make this happen without
changing device drivers. The software IO TLB size is increased to
6% of total memory in sev_setup_arch(), but that is more of an
approximation. The actual requirements may vary depending on which
drivers are used and the amount of I/O.
Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
moved from the rendering GPU (v3d driver), which can access all
memory, to the display output (vc4 driver), which is connected to a
bus with an address limit of 1 GiB and no IOMMU. These buffers can
be large (several megabytes) and cannot be handled by SWIOTLB,
because they exceed the maximum segment size of 256 KiB. Such mapping
failures can be easily reproduced on a Raspberry Pi 4: Starting
GNOME remote desktop results in a flood of kernel messages like
these:
[ 387.937625] vc4-drm gpu: swiotlb buffer is full (sz: 524288 bytes), total 32768 (slots), used 3136 (slots)
[ 387.960381] vc4-drm gpu: swiotlb buffer is full (sz: 815104 bytes), total 32768 (slots), used 2 (slots)
This second example cannot even be solved without increasing the
segment size (and the complexity of the {map,unmap}_single code). At
that point, it's better to allocate bounce buffers dynamically with
dma_direct_alloc_pages().
One caveat is that the DMA API often takes only the address of a
buffer, and the implementation (direct or IOMMU) checks whether it
belongs to the software IO TLB. This is easy if the IO TLB is a
single chunk of physically contiguous memory, but not if some
buffers are allocated dynamically. Testing on a Raspberry Pi 4
shows that there can be 1k+ such buffers. This requires something
better than a linked list. I'm using a maple tree to track
dynamically allocated buffers. This data structure was invented for
a similar use case, but there are some challenges:
1. The value is limited to ULONG_MAX, which is too small both for
physical addresses (e.g. x86 PAE or 32-bit ARM LPAE) and DMA
addresses (e.g. Xen guests on 32-bit ARM).
2. Since buffers are currently allocated with page granularity, a
PFN can be used instead. However, some values are reserved by
the maple tree implementation. Liam suggests using
xa_mk_value() in that case, but that reduces the usable range by
half. Luckily, 31 bits are still enough to hold a PFN on all
32-bit platforms.
3. Software IO TLB is used from interrupt context. The maple tree
implementation is not IRQ-safe (MT_FLAGS_LOCK_IRQ does nothing
AFAICS). Instead, I use an external lock, spin_lock_irqsave() and
spin_unlock_irqrestore().
Note that bounce buffers are never allocated dynamically if the
software IO TLB is in fact a DMA restricted pool, which is intended
to stay in its designated location in physical memory.
Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/swiotlb.h | 11 ++-
kernel/dma/swiotlb.c | 156 +++++++++++++++++++++++++++++++++++++---
2 files changed, 157 insertions(+), 10 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b71adba03dc7..0ef27d6491b9 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -7,6 +7,7 @@
#include <linux/init.h>
#include <linux/types.h>
#include <linux/limits.h>
+#include <linux/maple_tree.h>
#include <linux/spinlock.h>
struct device;
@@ -87,6 +88,8 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
* @for_alloc: %true if the pool is used for memory allocation
* @nareas: The area number in the pool.
* @area_nslabs: The slot number in the area.
+ * @dyn_lock: Protect dynamically allocated slots.
+ * @dyn_slots: Dynamically allocated slots.
*/
struct io_tlb_mem {
phys_addr_t start;
@@ -102,9 +105,13 @@ struct io_tlb_mem {
unsigned int area_nslabs;
struct io_tlb_area *areas;
struct io_tlb_slot *slots;
+ spinlock_t dyn_lock;
+ struct maple_tree dyn_slots;
};
extern struct io_tlb_mem io_tlb_default_mem;
+bool is_swiotlb_dyn(struct io_tlb_mem *mem, phys_addr_t paddr);
+
static inline bool is_swiotlb_fixed(struct io_tlb_mem *mem, phys_addr_t paddr)
{
return paddr >= mem->start && paddr < mem->end;
@@ -114,7 +121,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- return mem && is_swiotlb_fixed(mem, paddr);
+ return mem &&
+ (is_swiotlb_fixed(mem, paddr) ||
+ is_swiotlb_dyn(mem, paddr));
}
static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index e8608bcb205e..c6a0b8f2aa6f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -41,6 +41,7 @@
#include <linux/string.h>
#include <linux/swiotlb.h>
#include <linux/types.h>
+#include <linux/xarray.h>
#ifdef CONFIG_DMA_RESTRICTED_POOL
#include <linux/of.h>
#include <linux/of_fdt.h>
@@ -68,6 +69,13 @@ struct io_tlb_slot {
unsigned int list;
};
+struct io_tlb_dyn_slot {
+ phys_addr_t orig_addr;
+ size_t alloc_size;
+ struct page *page;
+ dma_addr_t dma_addr;
+};
+
static bool swiotlb_force_bounce;
static bool swiotlb_force_disable;
@@ -292,6 +300,10 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
mem->slots[i].alloc_size = 0;
}
+ spin_lock_init(&mem->dyn_lock);
+ mt_init_flags(&mem->dyn_slots, MT_FLAGS_LOCK_EXTERN);
+ mt_set_external_lock(&mem->dyn_slots, &mem->dyn_lock);
+
/*
* If swiotlb_unencrypted_base is set, the bounce buffer memory will
* be remapped and cleared in swiotlb_update_mem_attributes.
@@ -516,6 +528,115 @@ void __init swiotlb_exit(void)
memset(mem, 0, sizeof(*mem));
}
+static struct io_tlb_dyn_slot *swiotlb_dyn_slot(struct io_tlb_mem *mem,
+ phys_addr_t paddr)
+{
+ unsigned long index = (uintptr_t)xa_mk_value(PHYS_PFN(paddr));
+ struct io_tlb_dyn_slot *slot;
+ unsigned long flags;
+
+ spin_lock_irqsave(&mem->dyn_lock, flags);
+ slot = mt_find(&mem->dyn_slots, &index, index);
+ spin_unlock_irqrestore(&mem->dyn_lock, flags);
+ return slot;
+}
+
+bool is_swiotlb_dyn(struct io_tlb_mem *mem, phys_addr_t paddr)
+{
+ return !!swiotlb_dyn_slot(mem, paddr);
+}
+
+static void swiotlb_dyn_bounce(struct device *dev, phys_addr_t tlb_addr,
+ size_t size, enum dma_data_direction dir)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_dyn_slot *slot = swiotlb_dyn_slot(mem, tlb_addr);
+ unsigned int tlb_offset;
+ unsigned char *vaddr;
+
+ if (!slot)
+ return;
+
+ tlb_offset = tlb_addr - page_to_phys(slot->page);
+ vaddr = page_address(slot->page) + tlb_offset;
+
+ swiotlb_copy(dev, slot->orig_addr, vaddr, size, slot->alloc_size,
+ tlb_offset, dir);
+}
+
+static phys_addr_t swiotlb_dyn_map(struct device *dev, phys_addr_t orig_addr,
+ size_t alloc_size, unsigned int alloc_align_mask,
+ enum dma_data_direction dir, unsigned long attrs)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_dyn_slot *slot;
+ unsigned long index;
+ unsigned long flags;
+ phys_addr_t paddr;
+ gfp_t gfp;
+ int err;
+
+ /* Allocation has page granularity. Avoid small buffers. */
+ if (alloc_size < PAGE_SIZE)
+ goto err;
+
+ /* DMA direct does not deal with physical address constraints. */
+ if (alloc_align_mask || dma_get_min_align_mask(dev))
+ goto err;
+
+ gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_KERNEL : GFP_NOWAIT;
+ slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
+ if (!slot)
+ goto err;
+
+ slot->orig_addr = orig_addr;
+ slot->alloc_size = alloc_size;
+ slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
+ &slot->dma_addr, dir,
+ gfp | __GFP_NOWARN);
+ if (!slot->page)
+ goto err_free_slot;
+
+ paddr = page_to_phys(slot->page);
+ index = (uintptr_t)xa_mk_value(PHYS_PFN(paddr));
+ spin_lock_irqsave(&mem->dyn_lock, flags);
+ err = mtree_store_range(&mem->dyn_slots, index,
+ index + PFN_UP(alloc_size) - 1,
+ slot, GFP_NOWAIT | __GFP_NOWARN);
+ spin_unlock_irqrestore(&mem->dyn_lock, flags);
+ if (err)
+ goto err_free_dma;
+
+ return paddr;
+
+err_free_dma:
+ dma_direct_free_pages(dev, slot->alloc_size, slot->page,
+ slot->dma_addr, dir);
+
+err_free_slot:
+ kfree(slot);
+err:
+ return (phys_addr_t)DMA_MAPPING_ERROR;
+}
+
+static void swiotlb_dyn_unmap(struct device *dev, phys_addr_t tlb_addr,
+ enum dma_data_direction dir)
+{
+ unsigned long index = (uintptr_t)xa_mk_value(PHYS_PFN(tlb_addr));
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_dyn_slot *slot;
+ unsigned long flags;
+
+ spin_lock_irqsave(&mem->dyn_lock, flags);
+ slot = mt_find(&mem->dyn_slots, &index, index);
+ mtree_erase(&mem->dyn_slots, index);
+ spin_unlock_irqrestore(&mem->dyn_lock, flags);
+
+ dma_direct_free_pages(dev, slot->alloc_size, slot->page,
+ slot->dma_addr, dir);
+ kfree(slot);
+}
+
/*
* Return the offset into a iotlb slot required to keep the device happy.
*/
@@ -524,11 +645,8 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
}
-/*
- * Bounce: copy the swiotlb buffer from or back to the original dma location
- */
-static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
- enum dma_data_direction dir)
+static void swiotlb_fixed_bounce(struct device *dev, phys_addr_t tlb_addr,
+ size_t size, enum dma_data_direction dir)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
@@ -608,6 +726,18 @@ static void swiotlb_copy(struct device *dev, phys_addr_t orig_addr,
}
}
+/*
+ * Bounce: copy the swiotlb buffer from or back to the original dma location
+ */
+static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
+ enum dma_data_direction dir)
+{
+ if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
+ swiotlb_fixed_bounce(dev, tlb_addr, size, dir);
+ else
+ swiotlb_dyn_bounce(dev, tlb_addr, size, dir);
+}
+
static inline phys_addr_t slot_addr(phys_addr_t start, phys_addr_t idx)
{
return start + (idx << IO_TLB_SHIFT);
@@ -799,8 +929,13 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
return (phys_addr_t)DMA_MAPPING_ERROR;
}
- tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
- alloc_align_mask, attrs);
+ tlb_addr = (phys_addr_t)DMA_MAPPING_ERROR;
+ if (!is_swiotlb_for_alloc(dev))
+ tlb_addr = swiotlb_dyn_map(dev, orig_addr, alloc_size,
+ alloc_align_mask, dir, attrs);
+ if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
+ tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
+ alloc_align_mask, attrs);
if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR) {
if (!(attrs & DMA_ATTR_NO_WARN))
@@ -882,7 +1017,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
- swiotlb_release_slots(dev, tlb_addr);
+ if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
+ swiotlb_release_slots(dev, tlb_addr);
+ else
+ swiotlb_dyn_unmap(dev, tlb_addr, dir);
}
void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
@@ -1013,7 +1151,7 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size)
{
phys_addr_t tlb_addr = page_to_phys(page);
- if (!is_swiotlb_buffer(dev, tlb_addr))
+ if (!is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
return false;
swiotlb_release_slots(dev, tlb_addr);
--
2.25.1
From: Petr Tesarik <[email protected]>
Dynamic allocation of bounce buffers may introduce regression for
some workloads. The expected outcomes are bigger worst-case I/O
latency and reduced performance for some workloads. Unfortunately,
real-world testing has been too unstable to draw any conclusion.
To stay on the safe side, disable the feature by default and
let people turn it on with "swiotlb=dynamic" if needed. Since this
option can be combined with "force", the parser must be modified to
allow multiple options separated by commas.
A new bool field is added to struct io_tlb_mem to tell whether
dynamic allocations are allowed. This field is always false for
DMA restricted pools. It is also false for other software IO
TLBs unless "swiotlb=dynamic" was specified.
Signed-off-by: Petr Tesarik <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 6 +++++-
include/linux/swiotlb.h | 3 ++-
kernel/dma/swiotlb.c | 19 ++++++++++++++-----
3 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6cfa6e3996cf..6240a463631b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6081,14 +6081,18 @@
Execution Facility on pSeries.
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
- Format: { <int> [,<int>] | force | noforce }
+ Format: { <int> [,<int>] [,option-list] | option-list }
<int> -- Number of I/O TLB slabs
<int> -- Second integer after comma. Number of swiotlb
areas with their own lock. Will be rounded up
to a power of 2.
+ <option-list> -- Comma-separated list of options.
+
+ Available options:
force -- force using of bounce buffers even if they
wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
+ dynamic -- allow dynamic allocation of bounce buffers
switches= [HW,M68k]
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 0ef27d6491b9..628e25ad7db7 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -101,6 +101,7 @@ struct io_tlb_mem {
bool late_alloc;
bool force_bounce;
bool for_alloc;
+ bool allow_dyn;
unsigned int nareas;
unsigned int area_nslabs;
struct io_tlb_area *areas;
@@ -123,7 +124,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
return mem &&
(is_swiotlb_fixed(mem, paddr) ||
- is_swiotlb_dyn(mem, paddr));
+ (mem->allow_dyn && is_swiotlb_dyn(mem, paddr)));
}
static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c6a0b8f2aa6f..3efaefebb6af 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -78,6 +78,7 @@ struct io_tlb_dyn_slot {
static bool swiotlb_force_bounce;
static bool swiotlb_force_disable;
+static bool swiotlb_dynamic;
struct io_tlb_mem io_tlb_default_mem;
@@ -159,10 +160,17 @@ setup_io_tlb_npages(char *str)
swiotlb_adjust_nareas(simple_strtoul(str, &str, 0));
if (*str == ',')
++str;
- if (!strcmp(str, "force"))
- swiotlb_force_bounce = true;
- else if (!strcmp(str, "noforce"))
- swiotlb_force_disable = true;
+ while (str && *str) {
+ char *opt = strsep(&str, ",");
+ if (!strcmp(opt, "force"))
+ swiotlb_force_bounce = true;
+ else if (!strcmp(opt, "noforce"))
+ swiotlb_force_disable = true;
+ else if (!strcmp(opt, "dynamic"))
+ swiotlb_dynamic = true;
+ else
+ pr_warn("Invalid swiotlb option: %s\n", opt);
+ }
return 0;
}
@@ -287,6 +295,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
mem->area_nslabs = nslabs / mem->nareas;
mem->force_bounce = swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
+ mem->allow_dyn = swiotlb_dynamic;
for (i = 0; i < mem->nareas; i++) {
spin_lock_init(&mem->areas[i].lock);
@@ -930,7 +939,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
}
tlb_addr = (phys_addr_t)DMA_MAPPING_ERROR;
- if (!is_swiotlb_for_alloc(dev))
+ if (mem->allow_dyn)
tlb_addr = swiotlb_dyn_map(dev, orig_addr, alloc_size,
alloc_align_mask, dir, attrs);
if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
--
2.25.1
From: Petr Tesarik <[email protected]>
In preparation for the introduction of dynamically allocated
bounce buffers, separate out common code and code which handles
non-dynamic bounce buffers.
No functional change, but this commit should make the addition of
dynamic allocations easier to review.
Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/swiotlb.h | 7 ++++-
kernel/dma/swiotlb.c | 64 +++++++++++++++++++++++++++++------------
2 files changed, 52 insertions(+), 19 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 35bc4e281c21..b71adba03dc7 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -105,11 +105,16 @@ struct io_tlb_mem {
};
extern struct io_tlb_mem io_tlb_default_mem;
+static inline bool is_swiotlb_fixed(struct io_tlb_mem *mem, phys_addr_t paddr)
+{
+ return paddr >= mem->start && paddr < mem->end;
+}
+
static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- return mem && paddr >= mem->start && paddr < mem->end;
+ return mem && is_swiotlb_fixed(mem, paddr);
}
static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index a34c38bbe28f..e8608bcb205e 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -78,6 +78,10 @@ phys_addr_t swiotlb_unencrypted_base;
static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
static unsigned long default_nareas;
+static void swiotlb_copy(struct device *dev, phys_addr_t orig_addr,
+ unsigned char *vaddr, size_t size, size_t alloc_size,
+ unsigned int tlb_offset, enum dma_data_direction dir);
+
/**
* struct io_tlb_area - IO TLB memory area descriptor
*
@@ -530,7 +534,6 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
- unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = mem->vaddr + tlb_addr - mem->start;
unsigned int tlb_offset, orig_addr_offset;
@@ -547,6 +550,18 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
}
tlb_offset -= orig_addr_offset;
+ swiotlb_copy(dev, orig_addr, vaddr, size, alloc_size, tlb_offset, dir);
+}
+
+/*
+ * Copy swiotlb buffer content, checking for overflows.
+ */
+static void swiotlb_copy(struct device *dev, phys_addr_t orig_addr,
+ unsigned char *vaddr, size_t size, size_t alloc_size,
+ unsigned int tlb_offset, enum dma_data_direction dir)
+{
+ unsigned long pfn = PFN_DOWN(orig_addr);
+
if (tlb_offset > alloc_size) {
dev_WARN_ONCE(dev, 1,
"Buffer overflow detected. Allocation size: %zu. Mapping size: %zu+%u.\n",
@@ -738,15 +753,35 @@ static unsigned long mem_used(struct io_tlb_mem *mem)
return used;
}
+static phys_addr_t swiotlb_fixed_map(struct device *dev, phys_addr_t orig_addr,
+ size_t alloc_size, unsigned int alloc_align_mask,
+ unsigned long attrs)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ unsigned int offset = swiotlb_align_offset(dev, orig_addr);
+ int index = swiotlb_find_slots(dev, orig_addr,
+ alloc_size + offset, alloc_align_mask);
+ unsigned int i;
+
+ if (index == -1)
+ return (phys_addr_t)DMA_MAPPING_ERROR;
+
+ /*
+ * Save away the mapping from the original address to the DMA address.
+ * This is needed when we sync the memory. Then we sync the buffer if
+ * needed.
+ */
+ for (i = 0; i < nr_slots(alloc_size + offset); i++)
+ mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
+ return slot_addr(mem->start, index) + offset;
+}
+
phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
size_t mapping_size, size_t alloc_size,
unsigned int alloc_align_mask, enum dma_data_direction dir,
unsigned long attrs)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- unsigned int offset = swiotlb_align_offset(dev, orig_addr);
- unsigned int i;
- int index;
phys_addr_t tlb_addr;
if (!mem || !mem->nslabs) {
@@ -764,24 +799,17 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
return (phys_addr_t)DMA_MAPPING_ERROR;
}
- index = swiotlb_find_slots(dev, orig_addr,
- alloc_size + offset, alloc_align_mask);
- if (index == -1) {
+ tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
+ alloc_align_mask, attrs);
+
+ if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR) {
if (!(attrs & DMA_ATTR_NO_WARN))
dev_warn_ratelimited(dev,
- "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
- alloc_size, mem->nslabs, mem_used(mem));
- return (phys_addr_t)DMA_MAPPING_ERROR;
+ "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
+ alloc_size, mem->nslabs, mem_used(mem));
+ return tlb_addr;
}
- /*
- * Save away the mapping from the original address to the DMA address.
- * This is needed when we sync the memory. Then we sync the buffer if
- * needed.
- */
- for (i = 0; i < nr_slots(alloc_size + offset); i++)
- mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
- tlb_addr = slot_addr(mem->start, index) + offset;
/*
* When dir == DMA_FROM_DEVICE we could omit the copy from the orig
* to the tlb buffer, if we knew for sure the device will
--
2.25.1
Hello after a week,
I'm top-posting, because I'm not replying to anything particular in my
original RFC. It seems there are no specific comments from other people
either. My understanding is that the general idea of making SWIOTLB grow
and shrink as needed is welcome, but nobody can suggest a good method to
measure the expected performance impact.
FWIW I have seen zero impact if the patches are applied, but not enabled
through the "swiotlb=dynamic" boot parameter. I believe this is sufficient
to start developing dynamic SWIOTLB as an experimental feature.
@Christoph: I was kind of expecting you to jump up and tell me that the
non-coherent DMA API cannot be used to allocate bounce buffers. ;-) But
since that has not happened, I will assume that there is no weird
architecture which treats SWIOTLB differently from other DMA-capable
regions.
Anyway, I have some comments of my own to the v1 RFC series.
First, the maple tree cannot be used to track buffers, because it is not
safe to use from interrupt context. The maple tree implementation
currently allocates new nodes with kmem_cache_alloc_bulk(), which must
be called with interrupts enabled. This can be changed in the maple tree
code, but I could also pre-allocate nodes from non-irq context, or use
another data structure instead.
Second, based on discussions with my dear colleague Roberto, I have
considered alternative approaches:
A. Keep track of SWIOTLB utilization and allocate additional io_tlb_mem
structures from a separate kernel thread when necessary. The advantage
is that allocations are moved away from the hot path (map/unmap). The
disadvantage is that these additional tables would necessarily be small
(limited by MAX_ORDER and physical memory fragmentation). Another
disadvantage is that such a multi-table SWIOTLB is not very likely to
shrink, because a table can be freed only if ALL slots are free.
B. Allocate a very big SWIOTLB, but allow it to be used for normal
allocations (similar to the CMA approach). The advantage is that there
is only one table, pushing performance impact down to almost zero. The
main challenge is migrating pages to/from the SWIOTLB. Existing CMA code
cannot be reused, because CMA cannot be used from atomic contexts,
unlike SWIOTLB.
For the time being, I propose to start with my RFC series, accepting
some performance drop if the feature is explicitly enabled. With more
feedback from the field (especially from you, SEV SNP guys), I can work
on improving performance.
Petr T
On 3/20/2023 1:28 PM, Petr Tesarik wrote:
> From: Petr Tesarik <[email protected]>
>
> The goal of my work is to provide more flexibility in the sizing of
> SWIOTLB. This patch series is a request for comments from the wider
> community. The code is more of a crude hack than final solution.
>
> I would appreciate suggestions for measuring the performance impact
> of changes in SWIOTLB. More info at the end of this cover letter.
>
> The software IO TLB was designed with these assumptions:
>
> 1. It would not be used much, especially on 64-bit systems.
> 2. A small fixed memory area (64 MiB by default) is sufficient to
> handle the few cases which require a bounce buffer.
> 3. 64 MiB is little enough that it has no impact on the rest of the
> system.
>
> First, if SEV is active, all DMA must be done through shared
> unencrypted pages, and SWIOTLB is used to make this happen without
> changing device drivers. The software IO TLB size is increased to
> 6% of total memory in sev_setup_arch(), but that is more of an
> approximation. The actual requirements may vary depending on the
> amount of I/O and which drivers are used. These factors may not be
> known at boot time, i.e. when SWIOTLB is allocated.
>
> Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
> moved from the rendering GPU (v3d driver), which can access all
> memory, to the display output (vc4 driver), which is connected to a
> bus with an address limit of 1 GiB and no IOMMU. These buffers can
> be large (8 MiB with a FullHD monitor, 34 MiB with a 4K monitor)
> and cannot even be handled by the current SWIOTLB, because they exceed
> the maximum segment size of 256 KiB. Mapping failures can be
> easily reproduced with GNOME remote desktop on a Raspberry Pi 4.
>
> Third, other colleagues have noticed that they can reliably get rid
> of occasional OOM kills on an Arm embedded device by reducing the
> SWIOTLB size. This can be achieved with a kernel parameter, but
> determining the right value puts additional burden on pre-release
> testing, which could be avoided if SWIOTLB is allocated small and
> grows only when necessary.
>
> I have tried to measure the expected performance degradation so
> that I could reduce it and/or compare it to alternative approaches.
> I have performed all tests on an otherwise idle Raspberry Pi 4 with
> swiotlb=force (which, admittedly, is a bit artificial). I quickly
> ran into trouble.
>
> I ran fio against an ext3 filesystem mounted from a UAS drive. To
> my surprise, forcing swiotlb (without my patches) *improved* IOPS
> and bandwidth for 4K and 64K blocks by 3 to 7 percent, and made no
> visible difference for 1M blocks. I also observed smaller minimum
> and average completion latencies, and even smaller maximum
> latencies for 4K blocks. However, when I ran the tests again later
> to verify some oddities, there was a performance drop. It appears
> that IOPS, bandwidth and latencies reported by two consecutive fio
> runs may differ by as much as 10%, so the results are invalid.
>
> I tried to make a micro-benchmark on dma_map_page_attrs() using the
> bcc tool funclatency, but just loading the eBPF program was enough
> to change the behaviour of the system wildly.
>
> I wonder if anyone can give me advice on measuring SWIOTLB
> performance. I can see that AMD, IBM and Microsoft people have
> mentioned performance in their patches, but AFAICS without
> explaining how it was measured. Knowing a bit more would be much
> appreciated.
>
> Petr Tesarik (4):
> dma-mapping: introduce the DMA_ATTR_MAY_SLEEP attribute
> swiotlb: Move code around in preparation for dynamic bounce buffers
> swiotlb: Allow dynamic allocation of bounce buffers
> swiotlb: Add an option to allow dynamic bounce buffers
>
> .../admin-guide/kernel-parameters.txt | 6 +-
> Documentation/core-api/dma-attributes.rst | 10 +
> include/linux/dma-mapping.h | 6 +
> include/linux/swiotlb.h | 17 +-
> kernel/dma/swiotlb.c | 233 +++++++++++++++---
> 5 files changed, 241 insertions(+), 31 deletions(-)
>
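The 10% run-to-run variance makes any vanilla-vs-patched comparison above suspect. One way to make that judgment explicit is to compare the measured difference against a noise floor established by repeated identical runs. A stand-alone sketch (plain C, placeholder numbers, not tied to fio's actual output format):

```c
#include <assert.h>

/* Relative spread between two runs of the same benchmark, in percent. */
double rel_spread_pct(double a, double b)
{
	double hi = a > b ? a : b;
	double lo = a > b ? b : a;

	return (hi - lo) * 100.0 / lo;
}

/* A measured difference only means something if it clearly exceeds the
 * noise floor established by repeating identical runs. */
int difference_significant(double run1, double run2, double noise_pct)
{
	return rel_spread_pct(run1, run2) > noise_pct;
}
```

With a 10% noise floor, a 3 to 7 percent swing between kernels is indistinguishable from noise, which is exactly the problem described above.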
On Mon, Mar 20, 2023 at 01:28:13PM +0100, Petr Tesarik wrote:
> From: Petr Tesarik <[email protected]>
>
> Introduce a DMA attribute to tell the DMA-mapping subsystem that
> the operation is allowed to sleep.
>
> This patch merely adds the flag, which is not used for anything at
> the moment. It should be used by users who can sleep (e.g. dma-buf
> ioctls) to allow page reclaim and/or allocations from CMA.
So what drivers would call this? As-is it doesn't have any users in
the series.
[adding Alex as he has been interested in this in the past]
On Mon, Mar 20, 2023 at 01:28:15PM +0100, Petr Tesarik wrote:
> Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
> moved from the rendering GPU (v3d driver), which can access all
> memory, to the display output (vc4 driver), which is connected to a
> bus with an address limit of 1 GiB and no IOMMU. These buffers can
> be large (several megabytes) and cannot be handled by SWIOTLB,
> because they exceed the maximum segment size of 256 KiB. Such mapping
> failures can be easily reproduced on a Raspberry Pi 4: Starting
> GNOME remote desktop results in a flood of kernel messages like
> these:
Shouldn't we make sure dma-buf allocates the buffers for the most
restricted devices, and more importantly does something like a dma
coherent allocation instead of a dynamic mapping of random memory?
While a larger swiotlb works around this I don't think this fixes the root
cause.
> 1. The value is limited to ULONG_MAX, which is too little both for
> physical addresses (e.g. x86 PAE or 32-bit ARM LPAE) and DMA
> addresses (e.g. Xen guests on 32-bit ARM).
>
> 2. Since buffers are currently allocated with page granularity, a
> PFN can be used instead. However, some values are reserved by
> the maple tree implementation. Liam suggests to use
> xa_mk_value() in that case, but that reduces the usable range by
> half. Luckily, 31 bits are still enough to hold a PFN on all
> 32-bit platforms.
>
> 3. Software IO TLB is used from interrupt context. The maple tree
> implementation is not IRQ-safe (MT_FLAGS_LOCK_IRQ does nothing
> AFAICS). Instead, I use an external lock, spin_lock_irqsave() and
> spin_unlock_irqrestore().
>
> Note that bounce buffers are never allocated dynamically if the
> software IO TLB is in fact a DMA restricted pool, which is intended
> to stay in its designated location in physical memory.
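The xa_mk_value() trick mentioned in point 2 can be modeled outside the kernel. The sketch below reproduces the tagging scheme (shift left, set the low bit to mark the entry as a value rather than a pointer), which is where the lost bit goes; it is an illustration of the encoding, not the kernel's xarray API:

```c
#include <assert.h>
#include <stdint.h>

/* Tagged-value encoding: shift left and set bit 0 to mark the entry as
 * a value rather than a pointer. This halves the usable range, which is
 * why only 31 bits remain for a PFN on 32-bit platforms. */
uintptr_t mk_value(uintptr_t v)
{
	return (v << 1) | 1;
}

uintptr_t to_value(uintptr_t entry)
{
	return entry >> 1;
}

int is_value(uintptr_t entry)
{
	return entry & 1;
}
```

On a 32-bit platform the maximum encodable value is 2^31 - 1, which still covers every valid PFN there, as the cover letter notes.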
I'm a little worried about all that because it causes quite a bit
of overhead even for callers that don't end up going into the
dynamic range or do not use swiotlb at all. I don't really have a
good answer here except for the usual avoid bounce buffering whenever
you can that might not always be easy to do.
> + gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_KERNEL : GFP_NOWAIT;
> + slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
> + if (!slot)
> + goto err;
> +
> + slot->orig_addr = orig_addr;
> + slot->alloc_size = alloc_size;
> + slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
> + &slot->dma_addr, dir,
> + gfp | __GFP_NOWARN);
> + if (!slot->page)
> + goto err_free_slot;
Without GFP_NOIO allocations this will deadlock eventually.
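Christoph's point can be made concrete with a small model of the flag selection: in the sleeping case, GFP_KERNEL must be replaced by GFP_NOIO, so that reclaim triggered by the allocation cannot recurse into the I/O path that is itself waiting on a bounce buffer. The names and flag values below are placeholders for illustration, not the actual patch:

```c
#include <assert.h>

/* Placeholder flag values for illustration only. */
#define DMA_ATTR_MAY_SLEEP	0x1000UL
#define GFP_NOWAIT		0x1UL
#define GFP_NOIO		0x2UL

/* Pick allocation flags for a dynamically allocated bounce buffer:
 * sleeping is allowed only when the caller said so, and even then the
 * allocation must not recurse into the I/O path via reclaim. */
unsigned long swiotlb_dyn_gfp(unsigned long attrs)
{
	return (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_NOIO : GFP_NOWAIT;
}
```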
On 3/28/2023 5:57 AM, Christoph Hellwig wrote:
> On Mon, Mar 20, 2023 at 01:28:13PM +0100, Petr Tesarik wrote:
>> From: Petr Tesarik <[email protected]>
>>
>> Introduce a DMA attribute to tell the DMA-mapping subsystem that
>> the operation is allowed to sleep.
>>
>> This patch merely adds the flag, which is not used for anything at
>> the moment. It should be used by users who can sleep (e.g. dma-buf
>> ioctls) to allow page reclaim and/or allocations from CMA.
>
> So what drivers would call this? As-is it doesn't have any users in
> the series.
Yes, I removed one patch from the RFC series to reduce the Cc list while
I wasn't sure if the proposal would be considered at all.
The full series in my local tree added it to the implementation of
DRM_IOCTL_PRIME_FD_TO_HANDLE:
diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index f924b8b4ab6b..f32e12445570 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -635,7 +635,7 @@ struct sg_table *drm_gem_map_dma_buf(struct
dma_buf_attachment *attach,
return sgt;
ret = dma_map_sgtable(attach->dev, sgt, dir,
- DMA_ATTR_SKIP_CPU_SYNC);
+ DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MAY_SLEEP);
if (ret) {
sg_free_table(sgt);
kfree(sgt);
I also noticed a similar place in udmabuf, but since I don't have a use
case ATM, I haven't added the flag there (yet).
Petr T
On 3/28/2023 6:07 AM, Christoph Hellwig wrote:
> [adding Alex as he has been interested in this in the past]
>
> On Mon, Mar 20, 2023 at 01:28:15PM +0100, Petr Tesarik wrote:
>> Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
>> moved from the rendering GPU (v3d driver), which can access all
>> memory, to the display output (vc4 driver), which is connected to a
>> bus with an address limit of 1 GiB and no IOMMU. These buffers can
>> be large (several megabytes) and cannot be handled by SWIOTLB,
>> because they exceed the maximum segment size of 256 KiB. Such mapping
>> failures can be easily reproduced on a Raspberry Pi 4: Starting
>> GNOME remote desktop results in a flood of kernel messages like
>> these:
>
> Shouldn't we make sure dma-buf allocates the buffers for the most
> restricted devices, and more importantly does something like a dma
> coherent allocation instead of a dynamic mapping of random memory?
>
> While a larger swiotlb works around this I don't think this fixes the root
> cause.
I tend to agree here. However, it's the DMABUF design itself that causes
some trouble. The buffer is allocated by the v3d driver, which does not
have the restriction, so the DMA API typically allocates an address
somewhere near the 4G boundary. Userspace then exports the buffer, sends
it to another process as a file descriptor and imports it into the vc4
driver, which requires DMA below 1G. In the beginning, v3d had no idea
that the buffer would be exported to userspace, much less that it would
be later imported into vc4.
Anyway, I suspected that the buffers need not be imported into the vc4
driver (also hinted by Eric Anholt in a 2018 blog post [1]), and it
seems I was right. I encountered the issue with Ubuntu 22.10; I
installed latest openSUSE Tumbleweed yesterday, and I was not able to
reproduce the issue there, most likely because the Mesa drivers have
been fixed meanwhile. This makes the specific case of the Raspberry Pi 4
drivers moot. The issue may still affect other combinations of drivers,
but I don't have any other real-world example ATM.
[1] https://anholt.github.io/twivc4/2018/02/12/twiv/
>> 1. The value is limited to ULONG_MAX, which is too little both for
>> physical addresses (e.g. x86 PAE or 32-bit ARM LPAE) and DMA
>> addresses (e.g. Xen guests on 32-bit ARM).
>>
>> 2. Since buffers are currently allocated with page granularity, a
>> PFN can be used instead. However, some values are reserved by
>> the maple tree implementation. Liam suggests to use
>> xa_mk_value() in that case, but that reduces the usable range by
>> half. Luckily, 31 bits are still enough to hold a PFN on all
>> 32-bit platforms.
>>
>> 3. Software IO TLB is used from interrupt context. The maple tree
>> implementation is not IRQ-safe (MT_FLAGS_LOCK_IRQ does nothing
>> AFAICS). Instead, I use an external lock, spin_lock_irqsave() and
>> spin_unlock_irqrestore().
>>
>> Note that bounce buffers are never allocated dynamically if the
>> software IO TLB is in fact a DMA restricted pool, which is intended
>> to stay in its designated location in physical memory.
>
> I'm a little worried about all that because it causes quite a bit
> of overhead even for callers that don't end up going into the
> dynamic range or do not use swiotlb at all. I don't really have a
> good answer here except for the usual avoid bounce buffering whenever
> you can that might not always be easy to do.
I'm also worried about all this overhead. OTOH I was not able to confirm
it, because the difference between two successive fio test runs on an
unmodified kernel was bigger than the difference between a vanilla and a
patched kernel, except the maximum completion latency, which OTOH
affected less than 0.01% of all requests.
BTW my testing also suggests that the streaming DMA API is quite
inefficient, because UAS performance _improved_ with swiotlb=force.
Sure, this should probably be addressed in the UAS and/or xHCI driver,
but what I mean is that moving away from swiotlb may even cause
performance regressions, which is counter-intuitive. At least I would
_not_ have expected it.
>> + gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_KERNEL : GFP_NOWAIT;
>> + slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
>> + if (!slot)
>> + goto err;
>> +
>> + slot->orig_addr = orig_addr;
>> + slot->alloc_size = alloc_size;
>> + slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
>> + &slot->dma_addr, dir,
>> + gfp | __GFP_NOWARN);
>> + if (!slot->page)
>> + goto err_free_slot;
>
> Without GFP_NOIO allocations this will deadlock eventually.
Ah, that would affect the non-sleeping case (GFP_KERNEL), right?
Petr T
On 3/28/2023 9:54 AM, Petr Tesarik wrote:
> On 3/28/2023 6:07 AM, Christoph Hellwig wrote:
>> [adding Alex as he has been interested in this in the past]
>>
> [...]
>> I'm a little worried about all that because it causes quite a bit
>> of overhead even for callers that don't end up going into the
>> dynamic range or do not use swiotlb at all. I don't really have a
>> good answer here except for the usual avoid bounce buffering whenever
>> you can that might not always be easy to do.
>
> I'm also worried about all this overhead.
Oh, wait! I can do at least something for devices which do not use
swiotlb at all.
If a device does not use bounce buffers, it cannot pass an address
that belongs to the swiotlb. Consequently, the potentially
expensive check can be skipped. This avoids the dynamic lookup
penalty for devices which do not need the swiotlb.
Note that the counter always remains zero if dma_io_tlb_mem is
NULL, so the NULL check is not required.
diff --git a/drivers/base/core.c b/drivers/base/core.c
index a3e14143ec0c..f36638f207b8 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2957,6 +2957,7 @@ void device_initialize(struct device *dev)
#endif
#ifdef CONFIG_SWIOTLB
dev->dma_io_tlb_mem = &io_tlb_default_mem;
+ atomic_set(&dev->dma_io_tlb_cnt, 0);
#endif
}
EXPORT_SYMBOL_GPL(device_initialize);
diff --git a/include/linux/device.h b/include/linux/device.h
index 44e3acae7b36..cfdddce4cc30 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -504,6 +504,7 @@ struct device_physical_location {
* @dma_mem: Internal for coherent mem override.
* @cma_area: Contiguous memory area for dma allocations
* @dma_io_tlb_mem: Pointer to the swiotlb pool used. Not for driver use.
+ * @dma_io_tlb_cnt: Number of buffers mapped from the swiotlb pool.
* @archdata: For arch-specific additions.
* @of_node: Associated device tree node.
* @fwnode: Associated device node supplied by platform firmware.
@@ -609,6 +610,7 @@ struct device {
#endif
#ifdef CONFIG_SWIOTLB
struct io_tlb_mem *dma_io_tlb_mem;
+ atomic_t dma_io_tlb_cnt;
#endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 628e25ad7db7..7a115f4db49d 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -122,7 +122,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- return mem &&
+ return atomic_read(&dev->dma_io_tlb_cnt) &&
(is_swiotlb_fixed(mem, paddr) ||
(mem->allow_dyn && is_swiotlb_dyn(mem, paddr)));
}
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 3efaefebb6af..3dda1d3a39e8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -954,6 +954,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
return tlb_addr;
}
+ atomic_inc(&dev->dma_io_tlb_cnt);
+
/*
* When dir == DMA_FROM_DEVICE we could omit the copy from the orig
* to the tlb buffer, if we knew for sure the device will
@@ -1030,6 +1032,7 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
swiotlb_release_slots(dev, tlb_addr);
else
swiotlb_dyn_unmap(dev, tlb_addr, dir);
+ atomic_dec(&dev->dma_io_tlb_cnt);
}
void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
On Tue, 28 Mar 2023 09:54:35 +0200
Petr Tesarik <[email protected]> wrote:
> On 3/28/2023 6:07 AM, Christoph Hellwig wrote:
> > [adding Alex as he has been interested in this in the past]
> >
> > On Mon, Mar 20, 2023 at 01:28:15PM +0100, Petr Tesarik wrote:
> >> Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
> >> moved from the rendering GPU (v3d driver), which can access all
> >> memory, to the display output (vc4 driver), which is connected to a
> >> bus with an address limit of 1 GiB and no IOMMU. These buffers can
> >> be large (several megabytes) and cannot be handled by SWIOTLB,
> >> because they exceed the maximum segment size of 256 KiB. Such mapping
> >> failures can be easily reproduced on a Raspberry Pi 4: Starting
> >> GNOME remote desktop results in a flood of kernel messages like
> >> these:
> >
> > Shouldn't we make sure dma-buf allocates the buffers for the most
> > restricted devices, and more importantly does something like a dma
> > coherent allocation instead of a dynamic mapping of random memory?
> >
> > While a larger swiotlb works around this I don't think this fixes the root
> > cause.
>
> I tend to agree here. However, it's the DMABUF design itself that causes
> some trouble. The buffer is allocated by the v3d driver, which does not
> have the restriction, so the DMA API typically allocates an address
> somewhere near the 4G boundary. Userspace then exports the buffer, sends
> it to another process as a file descriptor and imports it into the vc4
> driver, which requires DMA below 1G. In the beginning, v3d had no idea
> that the buffer would be exported to userspace, much less that it would
> be later imported into vc4.
>
> Anyway, I suspected that the buffers need not be imported into the vc4
> driver (also hinted by Eric Anholt in a 2018 blog post [1]), and it
> seems I was right. I encountered the issue with Ubuntu 22.10; I
> installed latest openSUSE Tumbleweed yesterday, and I was not able to
> reproduce the issue there, most likely because the Mesa drivers have
> been fixed meanwhile. This makes the specific case of the Raspberry Pi 4
> drivers moot. The issue may still affect other combinations of drivers,
> but I don't have any other real-world example ATM.
I'm only seeing this problem with Wayland, no issue when switching Ubuntu to
X. It seems Tumbleweed is using X by default.
...Juerg
> [1] https://anholt.github.io/twivc4/2018/02/12/twiv/
>
> >> 1. The value is limited to ULONG_MAX, which is too little both for
> >> physical addresses (e.g. x86 PAE or 32-bit ARM LPAE) and DMA
> >> addresses (e.g. Xen guests on 32-bit ARM).
> >>
> >> 2. Since buffers are currently allocated with page granularity, a
> >> PFN can be used instead. However, some values are reserved by
> >> the maple tree implementation. Liam suggests to use
> >> xa_mk_value() in that case, but that reduces the usable range by
> >> half. Luckily, 31 bits are still enough to hold a PFN on all
> >> 32-bit platforms.
> >>
> >> 3. Software IO TLB is used from interrupt context. The maple tree
> >> implementation is not IRQ-safe (MT_FLAGS_LOCK_IRQ does nothing
> >> AFAICS). Instead, I use an external lock, spin_lock_irqsave() and
> >> spin_unlock_irqrestore().
> >>
> >> Note that bounce buffers are never allocated dynamically if the
> >> software IO TLB is in fact a DMA restricted pool, which is intended
> >> to stay in its designated location in physical memory.
> >
> > I'm a little worried about all that because it causes quite a bit
> > of overhead even for callers that don't end up going into the
> > dynamic range or do not use swiotlb at all. I don't really have a
> > good answer here except for the usual avoid bounce buffering whenever
> > you can that might not always be easy to do.
>
> I'm also worried about all this overhead. OTOH I was not able to confirm
> it, because the difference between two successive fio test runs on an
> unmodified kernel was bigger than the difference between a vanilla and a
> patched kernel, except the maximum completion latency, which OTOH
> affected less than 0.01% of all requests.
>
> BTW my testing also suggests that the streaming DMA API is quite
> inefficient, because UAS performance _improved_ with swiotlb=force.
> Sure, this should probably be addressed in the UAS and/or xHCI driver,
> but what I mean is that moving away from swiotlb may even cause
> performance regressions, which is counter-intuitive. At least I would
> _not_ have expected it.
>
> >> + gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_KERNEL : GFP_NOWAIT;
> >> + slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
> >> + if (!slot)
> >> + goto err;
> >> +
> >> + slot->orig_addr = orig_addr;
> >> + slot->alloc_size = alloc_size;
> >> + slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
> >> + &slot->dma_addr, dir,
> >> + gfp | __GFP_NOWARN);
> >> + if (!slot->page)
> >> + goto err_free_slot;
> >
> > Without GFP_NOIO allocations this will deadlock eventually.
>
> Ah, that would affect the non-sleeping case (GFP_KERNEL), right?
>
> Petr T
>
Hi Juerg,
On Fri, 31 Mar 2023 07:26:09 +0000
Juerg Haefliger <[email protected]> wrote:
> On Tue, 28 Mar 2023 09:54:35 +0200
> Petr Tesarik <[email protected]> wrote:
>
>[...]
> > Anyway, I suspected that the buffers need not be imported into the vc4
> > driver (also hinted by Eric Anholt in a 2018 blog post [1]), and it
> > seems I was right. I encountered the issue with Ubuntu 22.10; I
> > installed latest openSUSE Tumbleweed yesterday, and I was not able to
> > reproduce the issue there, most likely because the Mesa drivers have
> > been fixed meanwhile. This makes the specific case of the Raspberry Pi 4
> > drivers moot. The issue may still affect other combinations of drivers,
> > but I don't have any other real-world example ATM.
>
> I'm only seeing this problem with Wayland, no issue when switching Ubuntu to
> X. It seems Tumbleweed is using X by default.
I know; I was the team lead of SUSE low-level graphics engineers until
end of last year... I have just double-checked, but this output of
wayland-info in the GNOME session accessed over RDP is quite convincing:
interface: 'wl_compositor', version: 5, name: 1
interface: 'wl_shm', version: 1, name: 2
formats (fourcc):
0x48344258 = 'XB4H'
0x48344241 = 'AB4H'
0x48345258 = 'XR4H'
0x48345241 = 'AR4H'
0x30334258 = 'XB30'
0x30334241 = 'AB30'
0x30335258 = 'XR30'
0x30335241 = 'AR30'
0x36314752 = 'RG16'
0x34324258 = 'XB24'
0x34324241 = 'AB24'
1 = 'XR24'
0 = 'AR24'
interface: 'wl_output', version: 3, name: 3
x: 0, y: 0, scale: 1,
physical_width: 430 mm, physical_height: 270 mm,
make: 'FUS', model: 'P20W-5 ECO',
subpixel_orientation: unknown, output_transform: normal,
mode:
width: 1680 px, height: 1050 px, refresh: 59.954 Hz,
flags: current preferred
interface: 'zxdg_output_manager_v1', version: 3, name: 4
xdg_output_v1
output: 3
name: 'HDMI-1'
description: 'Fujitsu Siemens Computers GmbH 20"'
logical_x: 0, logical_y: 0
logical_width: 1680, logical_height: 1050
interface: 'wl_data_device_manager', version: 3, name: 5
interface: 'zwp_primary_selection_device_manager_v1', version: 1, name: 6
interface: 'wl_subcompositor', version: 1, name: 7
interface: 'xdg_wm_base', version: 4, name: 8
interface: 'gtk_shell1', version: 5, name: 9
interface: 'wp_viewporter', version: 1, name: 10
interface: 'zwp_pointer_gestures_v1', version: 3, name: 11
interface: 'zwp_tablet_manager_v2', version: 1, name: 12
interface: 'wl_seat', version: 8, name: 13
name: seat0
capabilities: pointer keyboard
keyboard repeat rate: 33
keyboard repeat delay: 500
interface: 'zwp_relative_pointer_manager_v1', version: 1, name: 14
interface: 'zwp_pointer_constraints_v1', version: 1, name: 15
interface: 'zxdg_exporter_v1', version: 1, name: 16
interface: 'zxdg_importer_v1', version: 1, name: 17
interface: 'zwp_linux_dmabuf_v1', version: 3, name: 18
formats (fourcc) and modifiers (names):
0x48344258 = 'XB4H'; 0x00ffffffffffffff = INVALID
0x48344241 = 'AB4H'; 0x00ffffffffffffff = INVALID
0x36314752 = 'RG16'; 0x00ffffffffffffff = INVALID
0x30334258 = 'XB30'; 0x00ffffffffffffff = INVALID
0x30335258 = 'XR30'; 0x00ffffffffffffff = INVALID
0x30334241 = 'AB30'; 0x00ffffffffffffff = INVALID
0x30335241 = 'AR30'; 0x00ffffffffffffff = INVALID
0x34324258 = 'XB24'; 0x00ffffffffffffff = INVALID
0x34325258 = 'XR24'; 0x00ffffffffffffff = INVALID
0x34324241 = 'AB24'; 0x00ffffffffffffff = INVALID
0x34325241 = 'AR24'; 0x00ffffffffffffff = INVALID
interface: 'wp_single_pixel_buffer_manager_v1', version: 1, name: 19
interface: 'zwp_keyboard_shortcuts_inhibit_manager_v1', version: 1, name: 20
interface: 'zwp_text_input_manager_v3', version: 1, name: 21
interface: 'wp_presentation', version: 1, name: 22
presentation clock id: 1 (CLOCK_MONOTONIC)
interface: 'xdg_activation_v1', version: 1, name: 23
Petr T
On 3/20/23 19:28, Petr Tesarik wrote:
> +
> +DMA_ATTR_MAY_SLEEP
> +------------------
> +
> +This tells the DMA-mapping subsystem that it is allowed to sleep. For example,
> +if mapping needs a bounce buffer, software IO TLB may use CMA for the
> +allocation if this flag is given.
> +
> +This attribute is not used for dma_alloc_* functions. Instead, the provided
dma_alloc_\* (escape wildcard in order to
not confuse Sphinx for emphasis).
> +GFP flags are used to determine whether the allocation may sleep.
Otherwise the doc LGTM.
--
An old man doll... just what I always wanted! - Clara
On Fri, 31 Mar 2023 11:00:43 +0200
Petr Tesařík <[email protected]> wrote:
> Hi Juerg,
>
> On Fri, 31 Mar 2023 07:26:09 +0000
> Juerg Haefliger <[email protected]> wrote:
>
> > On Tue, 28 Mar 2023 09:54:35 +0200
> > Petr Tesarik <[email protected]> wrote:
> >
> >[...]
> > > Anyway, I suspected that the buffers need not be imported into the vc4
> > > driver (also hinted by Eric Anholt in a 2018 blog post [1]), and it
> > > seems I was right. I encountered the issue with Ubuntu 22.10; I
> > > installed latest openSUSE Tumbleweed yesterday, and I was not able to
> > > reproduce the issue there, most likely because the Mesa drivers have
> > > been fixed meanwhile. This makes the specific case of the Raspberry Pi 4
> > > drivers moot. The issue may still affect other combinations of drivers,
> > > but I don't have any other real-world example ATM.
> >
> > I'm only seeing this problem with Wayland, no issue when switching Ubuntu to
> > X. It seems Tumbleweed is using X by default.
>
> I know; I was the team lead of SUSE low-level graphics engineers until
> end of last year... I have just double-checked, but this output of
> wayland-info in the GNOME session accessed over RDP is quite convincing:
It sure is but how did you get that?? For me it's:
$ wayland-info
failed to create display: No such file or directory
And from strace:
connect(3, {sa_family=AF_UNIX, sun_path="/run/user/1000/wayland-0"}, 27) = -1 ENOENT (No such file or directory)
Which is kind of expected when running X, no?
$ ps -ef | grep -iP 'xorg|wayland'
opensuse 1377 1375 0 09:13 tty2 00:00:16 /usr/bin/Xorg.bin vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -nolisten tcp -background none -noreset -keeptty -novtswitch -verbose 3
What am I missing?
...Juerg
> interface: 'wl_compositor', version: 5, name: 1
> interface: 'wl_shm', version: 1, name: 2
> formats (fourcc):
> 0x48344258 = 'XB4H'
> 0x48344241 = 'AB4H'
> 0x48345258 = 'XR4H'
> 0x48345241 = 'AR4H'
> 0x30334258 = 'XB30'
> 0x30334241 = 'AB30'
> 0x30335258 = 'XR30'
> 0x30335241 = 'AR30'
> 0x36314752 = 'RG16'
> 0x34324258 = 'XB24'
> 0x34324241 = 'AB24'
> 1 = 'XR24'
> 0 = 'AR24'
> interface: 'wl_output', version: 3, name: 3
> x: 0, y: 0, scale: 1,
> physical_width: 430 mm, physical_height: 270 mm,
> make: 'FUS', model: 'P20W-5 ECO',
> subpixel_orientation: unknown, output_transform: normal,
> mode:
> width: 1680 px, height: 1050 px, refresh: 59.954 Hz,
> flags: current preferred
> interface: 'zxdg_output_manager_v1', version: 3, name: 4
> xdg_output_v1
> output: 3
> name: 'HDMI-1'
> description: 'Fujitsu Siemens Computers GmbH 20"'
> logical_x: 0, logical_y: 0
> logical_width: 1680, logical_height: 1050
> interface: 'wl_data_device_manager', version: 3, name: 5
> interface: 'zwp_primary_selection_device_manager_v1', version: 1, name: 6
> interface: 'wl_subcompositor', version: 1, name: 7
> interface: 'xdg_wm_base', version: 4, name: 8
> interface: 'gtk_shell1', version: 5, name: 9
> interface: 'wp_viewporter', version: 1, name: 10
> interface: 'zwp_pointer_gestures_v1', version: 3, name: 11
> interface: 'zwp_tablet_manager_v2', version: 1, name: 12
> interface: 'wl_seat', version: 8, name: 13
> name: seat0
> capabilities: pointer keyboard
> keyboard repeat rate: 33
> keyboard repeat delay: 500
> interface: 'zwp_relative_pointer_manager_v1', version: 1, name: 14
> interface: 'zwp_pointer_constraints_v1', version: 1, name: 15
> interface: 'zxdg_exporter_v1', version: 1, name: 16
> interface: 'zxdg_importer_v1', version: 1, name: 17
> interface: 'zwp_linux_dmabuf_v1', version: 3, name: 18
> formats (fourcc) and modifiers (names):
> 0x48344258 = 'XB4H'; 0x00ffffffffffffff = INVALID
> 0x48344241 = 'AB4H'; 0x00ffffffffffffff = INVALID
> 0x36314752 = 'RG16'; 0x00ffffffffffffff = INVALID
> 0x30334258 = 'XB30'; 0x00ffffffffffffff = INVALID
> 0x30335258 = 'XR30'; 0x00ffffffffffffff = INVALID
> 0x30334241 = 'AB30'; 0x00ffffffffffffff = INVALID
> 0x30335241 = 'AR30'; 0x00ffffffffffffff = INVALID
> 0x34324258 = 'XB24'; 0x00ffffffffffffff = INVALID
> 0x34325258 = 'XR24'; 0x00ffffffffffffff = INVALID
> 0x34324241 = 'AB24'; 0x00ffffffffffffff = INVALID
> 0x34325241 = 'AR24'; 0x00ffffffffffffff = INVALID
> interface: 'wp_single_pixel_buffer_manager_v1', version: 1, name: 19
> interface: 'zwp_keyboard_shortcuts_inhibit_manager_v1', version: 1, name: 20
> interface: 'zwp_text_input_manager_v3', version: 1, name: 21
> interface: 'wp_presentation', version: 1, name: 22
> presentation clock id: 1 (CLOCK_MONOTONIC)
> interface: 'xdg_activation_v1', version: 1, name: 23
>
> Petr T
On Tue, Mar 28, 2023 at 09:21:10AM +0200, Petr Tesarik wrote:
> The full series in my local tree added it to the implementation of
> DRM_IOCTL_PRIME_FD_TO_HANDLE:
Umm, all these are callers that absolutely never should even
end up in swiotlb. If we have large buffers allocated by media
subsystems, we need to make sure they are fully addressable.
On Tue, Mar 28, 2023 at 09:54:35AM +0200, Petr Tesarik wrote:
> I tend to agree here. However, it's the DMABUF design itself that causes
> some trouble. The buffer is allocated by the v3d driver, which does not
> have the restriction, so the DMA API typically allocates an address
> somewhere near the 4G boundary. Userspace then exports the buffer, sends
> it to another process as a file descriptor and imports it into the vc4
> driver, which requires DMA below 1G. In the beginning, v3d had no idea
> that the buffer would be exported to userspace, much less that it would
> be later imported into vc4.
Then we need to either:
a) figure out a way to communicate these addressing limitations
b) find a way to migrate a buffer into other memory, similar to
how page migration works for page cache
> BTW my testing also suggests that the streaming DMA API is quite
> inefficient, because UAS performance _improved_ with swiotlb=force.
> Sure, this should probably be addressed in the UAS and/or xHCI driver,
> but what I mean is that moving away from swiotlb may even cause
> performance regressions, which is counter-intuitive. At least I would
> _not_ have expected it.
That is indeed very odd. Are you running with a very slow iommu
driver there? Or what is the actual use case there in general?
> >> + gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_KERNEL : GFP_NOWAIT;
> >> + slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
> >> + if (!slot)
> >> + goto err;
> >> +
> >> + slot->orig_addr = orig_addr;
> >> + slot->alloc_size = alloc_size;
> >> + slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
> >> + &slot->dma_addr, dir,
> >> + gfp | __GFP_NOWARN);
> >> + if (!slot->page)
> >> + goto err_free_slot;
> >
> > Without GFP_NOIO allocations this will deadlock eventually.
>
> Ah, that would affect the non-sleeping case (GFP_KERNEL), right?
Yes.
On Tue, Mar 28, 2023 at 02:43:03PM +0200, Petr Tesarik wrote:
> Oh, wait! I can do at least something for devices which do not use
> swiotlb at all.
>
> If a device does not use bounce buffers, it cannot pass an address
> that belongs to the swiotlb. Consequently, the potentially
> expensive check can be skipped. This avoids the dynamic lookup
> penalty for devices which do not need the swiotlb.
>
> Note that the counter always remains zero if dma_io_tlb_mem is
> NULL, so the NULL check is not required.
Hmm, that's yet another atomic for each map/unmap, and bloats
struct device.
(Btw, in case anyone is interested, we really need to get started
on moving the dma fields out of struct device into a sub-struct
only allocated for DMA capable busses)
On Mon, Mar 27, 2023 at 01:06:34PM +0200, Petr Tesarik wrote:
> B. Allocate a very big SWIOTLB, but allow to use it for normal
> allocations (similar to the CMA approach). The advantage is that there
> is only one table, pushing performance impact down to almost zero. The
> main challenge is migrating pages to/from the SWIOTLB. Existing CMA code
> cannot be reused, because CMA cannot be used from atomic contexts,
> unlike SWIOTLB.
That actually sounds very interesting, although I'd go further and
figure out if we:
a) could get away with only allowing the CMA allocation for sleeping
contexts, if we have enough sleeping contexts to matter
b) check with the CMA maintainers if it is feasible and acceptable
to them to extend CMA for irq allocations.
That being said, I think cases like dma-buf sharing really need to
be addressed at a higher level instead of basically double allocating
these long-term memory allocations.
I'd also really love to hear some feedback from the various confidential
computing implementors, as that seems to be ther big driving force for
swiotlb currently.
On Fri, 7 Apr 2023 07:57:04 +0200
Christoph Hellwig <[email protected]> wrote:
> On Tue, Mar 28, 2023 at 02:43:03PM +0200, Petr Tesarik wrote:
> > Oh, wait! I can do at least something for devices which do not use
> > swiotlb at all.
> >
> > If a device does not use bounce buffers, it cannot pass an address
> > that belongs to the swiotlb. Consequently, the potentially
> > expensive check can be skipped. This avoids the dynamic lookup
> > penalty for devices which do not need the swiotlb.
> >
> > Note that the counter always remains zero if dma_io_tlb_mem is
> > NULL, so the NULL check is not required.
>
> Hmm, that's yet another atomic for each map/unmap, and bloats
> struct device.
I'm not sure how bad it is to bloat struct device. It is already quite
large, e.g. in my x86 build it is 768 bytes (exact size depends on
config options), and nobody seems to be concerned...
Regarding the atomic operations, I am currently testing a slightly
different approach, which merely sets a flag if there are any
dynamically allocated bounce buffers. The atomic check changes to
smp_load_acquire(), and the atomic inc/dec to smp_store_release()
only if the flag changes. That said, if I hammer this path with heavy
parallel I/O, I can still see some performance cost for devices that
use swiotlb, but at least devices that do not need such bounce buffers
seem to be unaffected then.
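The flag-plus-counter scheme described above can be sketched with C11
atomics in userspace (the kernel code would use smp_load_acquire() and
smp_store_release() instead; all names here are invented for
illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device state; not the real struct device layout. */
struct dev_swiotlb_state {
	atomic_long dyn_bounce_cnt; /* number of dynamic bounce buffers */
	atomic_bool has_dyn_bounce; /* cheap flag checked on unmap */
};

/* Hot path: a single load-acquire; the expensive swiotlb lookup is
 * skipped entirely while the flag is clear. */
static bool may_be_bounce_buffer(struct dev_swiotlb_state *s)
{
	return atomic_load_explicit(&s->has_dyn_bounce,
				    memory_order_acquire);
}

/* Slow path: the flag is store-released only when the counter crosses
 * zero, i.e. on the 0 -> 1 and 1 -> 0 transitions. */
static void dyn_bounce_get(struct dev_swiotlb_state *s)
{
	if (atomic_fetch_add(&s->dyn_bounce_cnt, 1) == 0)
		atomic_store_explicit(&s->has_dyn_bounce, true,
				      memory_order_release);
}

static void dyn_bounce_put(struct dev_swiotlb_state *s)
{
	if (atomic_fetch_sub(&s->dyn_bounce_cnt, 1) == 1)
		atomic_store_explicit(&s->has_dyn_bounce, false,
				      memory_order_release);
}
```

The point is that devices which never allocate a dynamic bounce buffer
only ever pay for the load-acquire, which is free on x86 and cheap on
arm64, instead of an atomic read-modify-write per map/unmap.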
> (Btw, in case anyone is interested, we really need to get started
> on moving the dma fields out of struct device into a sub-struct
> only allocated for DMA capable busses)
I like this idea. In fact, my WIP topic branch now moves the swiotlb
fields into a separate struct, but I can surely go further and move all
DMA-related fields. I doubt it is worth allocating it separately,
though. We are talking about replacing some 100 bytes (in the worst
case) with a pointer to a dynamically allocated struct, but the
dynamic allocator adds some overhead. I believe it pays off only if the
vast majority of struct device instances do not need these DMA fields,
but is that really the case?
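To make the trade-off concrete, here is a toy comparison of the two
layouts (all structs are invented placeholders, not the real struct
device):

```c
#include <stddef.h>

/* Stand-in for the roughly 100 bytes of DMA-related fields. */
struct dev_dma_info {
	unsigned long long coherent_dma_mask;
	char swiotlb_and_cma_fields[96]; /* placeholder for the rest */
};

/* Variant 1: embedded -- no extra dereference, but every device,
 * DMA-capable or not, carries the full footprint. */
struct toy_device_embedded {
	void *driver_data;
	struct dev_dma_info dma;
};

/* Variant 2: pointer -- only pointer-sized per device, but each access
 * costs a dereference, and the struct needs explicit allocation and
 * release; NULL would mean "not DMA-capable". */
struct toy_device_with_ptr {
	void *driver_data;
	struct dev_dma_info *dma;
};
```

The pointer variant only wins if enough devices leave the pointer NULL
to amortize the allocator overhead and the extra indirection, which is
exactly the open question.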
Petr T
On Fri, 7 Apr 2023 07:55:48 +0200
Christoph Hellwig <[email protected]> wrote:
> On Tue, Mar 28, 2023 at 09:54:35AM +0200, Petr Tesarik wrote:
> > I tend to agree here. However, it's the DMABUF design itself that causes
> > some trouble. The buffer is allocated by the v3d driver, which does not
> > have the restriction, so the DMA API typically allocates an address
> > somewhere near the 4G boundary. Userspace then exports the buffer, sends
> > it to another process as a file descriptor and imports it into the vc4
> > driver, which requires DMA below 1G. In the beginning, v3d had no idea
> > that the buffer would be exported to userspace, much less that it would
> > be later imported into vc4.
>
> Then we need to either:
>
> a) figure out a way to communicate these addressing limitations
AFAICS this would require a complete overhaul of the dma-buf userspace
API so that intended imports are communicated at export time. In other
words, it would be quite intrusive. Not my preference.
> b) find a way to migrate a buffer into other memory, similar to
> how page migration works for page cache
Let me express the idea in my own words to make sure I get it right.
When a DMA buffer is imported, but before it is ultimately pinned in
memory, the importing device driver checks whether the buffer meets its
DMA constraints. If not, it calls a function provided by the exporting
device driver to migrate the buffer. This makes sense, but:
1) The operation must be implemented in the exporting driver; this
will take some time.
2) In theory, there may be no overlap between the exporting device
and the importing device. OTOH I'm not aware of any real-world
example, so we can probably return a suitable error code, and
that's it.
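As a rough illustration of the import-time check being discussed, here
is a toy model (dma-buf has no such migrate callback today; every name
here is invented):

```c
typedef unsigned long long u64;

/* Toy buffer with a hypothetical exporter-provided migration hook. */
struct toy_dmabuf {
	u64 dma_addr;   /* current placement of the backing memory */
	int (*migrate)(struct toy_dmabuf *buf, u64 limit);
};

/* Sample exporter hook: pretend to move the buffer below the limit;
 * a real exporter would reallocate and copy the backing pages. */
static int toy_migrate(struct toy_dmabuf *buf, u64 limit)
{
	buf->dma_addr = limit;
	return 0;
}

/* Importer side: before pinning, check the DMA constraint and ask the
 * exporter to migrate the buffer if its placement is out of reach. */
static int toy_import(struct toy_dmabuf *buf, u64 dma_limit)
{
	if (buf->dma_addr <= dma_limit)
		return 0;	/* already reachable, nothing to do */
	if (!buf->migrate)
		return -1;	/* point 1): exporter lacks the op */
	return buf->migrate(buf, dma_limit); /* may still fail: point 2) */
}
```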
Anyway, I have already written in another reply that my original use
case is moot, because a more recent distribution can do the job without
using dma-buf, so it has been fixed in user space, be it in GNOME,
pipewire, or Mesa (I don't really have to know).
At this point I would go with the assumption that large buffers
allocated by media subsystems will not hit swiotlb. Consequently, I
don't plan to spend more time on this branch of the story.
> > BTW my testing also suggests that the streaming DMA API is quite
> > inefficient, because UAS performance _improved_ with swiotlb=force.
> > Sure, this should probably be addressed in the UAS and/or xHCI driver,
> > but what I mean is that moving away from swiotlb may even cause
> > performance regressions, which is counter-intuitive. At least I would
> > _not_ have expected it.
>
> That is indeed very odd. Are you running with a very slow iommu
> driver there? Or what is the actual use case there in general?
This was on a Raspberry Pi 4, which does not have any IOMMU. IOW it
looks like copying data around can be faster than sending it straight
to the device. When I have some more time, I must investigate what is
really happening there, because it does not make any sense to me.
Petr T
On Fri, Apr 07, 2023 at 12:46:27PM +0200, Petr Tesařík wrote:
> > b) find a way to migrate a buffer into other memory, similar to
> > how page migration works for page cache
>
> Let me express the idea in my own words to make sure I get it right.
> When a DMA buffer is imported, but before it is ultimately pinned in
> memory, the importing device driver checks whether the buffer meets its
> DMA constraints. If not, it calls a function provided by the exporting
> device driver to migrate the buffer.
Yes.
> This makes sense, but:
>
> 1) The operation must be implemented in the exporting driver; this
> will take some time.
>
> 2) In theory, there may be no overlap between the exporting device
> and the importing device. OTOH I'm not aware of any real-world
> example, so we can probably return a suitable error code, and
> that's it.
Indeed. And if there is no overlap, which as you said is indeed
very unlikely but in theory possible, we could still keep migrating
back and forth.
One important thing that we should do is to consolidate more of the
dma-buf implementation code. Right now they just seem to be a wild
mess of copy-and-pasted boilerplate code, unfortunately.
> Anyway, I have already written in another reply that my original use
> case is moot, because a more recent distribution can do the job without
> using dma-buf, so it has been fixed in user space, be it in GNOME,
> pipewire, or Mesa (I don't really have to know).
>
> At this point I would go with the assumption that large buffers
> allocated by media subsystems will not hit swiotlb. Consequently, I
> don't plan to spend more time on this branch of the story.
Sounds fine to me, and thanks for taking the effort so far.
> > > BTW my testing also suggests that the streaming DMA API is quite
> > > inefficient, because UAS performance _improved_ with swiotlb=force.
> > > Sure, this should probably be addressed in the UAS and/or xHCI driver,
> > > but what I mean is that moving away from swiotlb may even cause
> > > performance regressions, which is counter-intuitive. At least I would
> > > _not_ have expected it.
> >
> > That is indeed very odd. Are you running with a very slow iommu
> > driver there? Or what is the actual use case there in general?
>
> This was on a Raspberry Pi 4, which does not have any IOMMU. IOW it
> looks like copying data around can be faster than sending it straight
> to the device. When I have some more time, I must investigate what is
> really happening there, because it does not make any sense to me.
If you're not using an IOMMU that doesn't actually make any sense to
me. swiotlb calls into exactly the same routines as dma-direct does
for the dma setup on each I/O, just after copying the data. So if you
do have some spare cycles to investigate what is going on here, I'd
be really curious about the results.
On 4/7/2023 12:15 PM, Petr Tesařík wrote:
> On Fri, 7 Apr 2023 07:57:04 +0200
> Christoph Hellwig <[email protected]> wrote:
> [...]
>> (Btw, in case anyone is interested, we really need to get started
>> on moving the dma fields out of struct device into a sub-struct
>> only allocated for DMA capable busses)
>
> I like this idea. In fact, my WIP topic branch now moves the swiotlb
> fields into a separate struct, but I can surely go further and move all
> DMA-related fields.
I have looked into this now, and it looks like a nice cleanup. The
challenge is to get these patches reviewed by all affected maintainers,
and it would be blocking my work on dynamically allocated bounce buffers.
How about moving only some fields initially (coherent override, cma and
swiotlb)? These few are not used outside kernel/dma, but at least the
swiotlb part makes my patches easier to follow. I can move the rest as
soon as the dynamic patch series is merged.
Petr Tesarik
Hi Christoph!
I'd like to follow up on this sub-thread:
On Fri, 7 Apr 2023 12:15:55 +0200
Petr Tesařík <[email protected]> wrote:
> On Fri, 7 Apr 2023 07:57:04 +0200
> Christoph Hellwig <[email protected]> wrote:
>[...]
> > (Btw, in case anyone is interested, we really need to get started
> > on moving the dma fields out of struct device into a sub-struct
> > only allocated for DMA capable busses)
>
> I like this idea. In fact, my WIP topic branch now moves the swiotlb
> fields into a separate struct,
As you have noticed, I have removed that commit again in v2.
The reason is that I'm not sure about the intended goal. I have looked
around for examples of moving fields out of struct device and found
different approaches:
A. struct dev_msi_info
The MSI fields are merely grouped in a separate struct, which is
defined in device.h and embedded in struct device. I don't see much
benefit.
B. struct dev_pm_info
This struct is also embedded in struct device, but it is defined in
<linux/pm.h>, which is mentioned in MAINTAINERS. The benefit is that
further changes are reviewed by this maintainer. The downside is
that device.h includes pm.h.
C. struct dev_pin_info
This struct is merely declared in device.h and defined in
pinctrl/devinfo.h (which is not included). Only a pointer to this
struct is stored in struct device. Of course, the pointer must be
initialized (and released) somehow.
Here my question: What did you want for DMA fields?
A. Only grouping those fields in their own struct?
B. Or move the definition to another include file (cf. MAINTAINERS)?
C. Or store a pointer in struct device?
Since you mentioned "allocated", it sounds like you want to achieve C,
but:
1. Is it worth the extra dereference for every use?
2. How should the struct be allocated? Presumably not with kmalloc() in
device_initialize(), because I don't know how to determine if a
device is DMA capable this low in the call stack. So, should it be
allocated together with the containing structure? AFAICS this would
mean changing nearly all device drivers...
As you can see, I need some more guidance from you before I can start
working on this. ;-)
Petr T
On 2023-04-21 14:03, Petr Tesařík wrote:
> Hi Christoph!
>
> I'd like to follow up on this sub-thread:
>
> On Fri, 7 Apr 2023 12:15:55 +0200
> Petr Tesařík <[email protected]> wrote:
>
>> On Fri, 7 Apr 2023 07:57:04 +0200
>> Christoph Hellwig <[email protected]> wrote:
>> [...]
>>> (Btw, in case anyone is interested, we really need to get started
>>> on moving the dma fields out of struct device into a sub-struct
>>> only allocated for DMA capable busses)
>>
>> I like this idea. In fact, my WIP topic branch now moves the swiotlb
>> fields into a separate struct,
>
> As you have noticed, I have removed that commit again in v2.
>
> The reason is that I'm not sure about the intended goal. I have looked
> around for examples of moving fields out of struct device and found
> different approaches:
>
> A. struct dev_msi_info
> The MSI fields are merely grouped in a separate struct, which is
> defined in device.h and embedded in struct device. I don't see much
> benefit.
>
> B. struct dev_pm_info
> This struct is also embedded in struct device, but it is defined in
> <linux/pm.h>, which is mentioned in MAINTAINERS. The benefit is that
> further changes are reviewed by this maintainer. The downside is
> that device.h includes pm.h.
>
> C. struct dev_pin_info
>     This struct is merely declared in device.h and defined in
>     pinctrl/devinfo.h (which is not included). Only a pointer to this
> struct is stored in struct device. Of course, the pointer must be
> initialized (and released) somehow.
>
> Here my question: What did you want for DMA fields?
>
> A. Only grouping those fields in their own struct?
> B. Or move the definition to another include file (cf. MAINTAINERS)?
> C. Or store a pointer in struct device?
dev->dma_parms is already this, and IIRC still has some very old
comments somewhere about consolidating the other DMA-related fields in
there.
> Since you mentioned "allocated", it sounds like you want to achieve C,
> but:
>
> 1. Is it worth the extra dereference for every use?
> 2. How should the struct be allocated? Presumably not with kmalloc() in
> device_initialize(), because I don't know how to determine if a
> device is DMA capable this low in the call stack. So, should it be
> allocated together with the containing structure? AFAICS this would
> mean changing nearly all device drivers...
The bus code knows whether it's a DMA-capable bus or not, and as such
should already be providing a .dma_configure method and/or performing
some initialisation of DMA fields. Many of the ones that would need to
are already providing dma_parms, in fact.
Thanks,
Robin.
>
> As you can see, I need some more guidance from you before I can start
> working on this. ;-)
>
> Petr T
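Robin's point can be sketched as follows (a userspace toy with invented
names; in the kernel this would happen in the bus's .dma_configure
path, analogous to how dma_parms is populated today):

```c
#include <stdlib.h>

/* Stand-in for the DMA fields that would move out of struct device. */
struct dev_dma_info {
	unsigned long long dma_mask;
};

struct toy_device {
	struct dev_dma_info *dma; /* NULL means "not DMA-capable" */
};

/* The bus knows it is DMA-capable, so it allocates the DMA info
 * lazily while configuring the device. */
static int toy_bus_dma_configure(struct toy_device *dev)
{
	if (!dev->dma) {
		dev->dma = calloc(1, sizeof(*dev->dma));
		if (!dev->dma)
			return -1; /* would be -ENOMEM in the kernel */
	}
	dev->dma->dma_mask = ~0ULL; /* assume a 64-bit capable bus */
	return 0;
}
```

Devices on non-DMA busses never call this, so their pointer stays NULL
and they pay only pointer-sized overhead.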
Hi Robin,
On Fri, 21 Apr 2023 15:58:18 +0100
Robin Murphy <[email protected]> wrote:
> On 2023-04-21 14:03, Petr Tesařík wrote:
>[...]
> > Here my question: What did you want for DMA fields?
> >
> > A. Only grouping those fields in their own struct?
> > B. Or move the definition to another include file (cf. MAINTAINERS)?
> > C. Or store a pointer in struct device?
>
> dev->dma_parms is already this, and IIRC still has some very old
> comments somewhere about consolidating the other DMA-related fields in
> there.
Thank you for the hint! I have actually seen dma_parms, but since it
can be NULL and was initialized from various drivers, it did not occur
to me that NULL simply means not DMA-capable.
This is really helpful.
Petr T
> > Since you mentioned "allocated", it sounds like you want to achieve C,
> > but:
> >
> > 1. Is it worth the extra dereference for every use?
> > 2. How should the struct be allocated? Presumably not with kmalloc() in
> > device_initialize(), because I don't know how to determine if a
> > device is DMA capable this low in the call stack. So, should it be
> > allocated together with the containing structure? AFAICS this would
> > mean changing nearly all device drivers...
>
> The bus code knows whether it's a DMA-capable bus or not, and as such
> should already be providing a .dma_configure method and/or performing
> some initialisation of DMA fields. Many of the ones that would need to
> are already providing dma_parms, in fact.
>
> Thanks,
> Robin.
>
> >
> > As you can see, I need some more guidance from you before I can start
> > working on this. ;-)
> >
> > Petr T
On Fri, Apr 21, 2023 at 05:09:34PM +0200, Petr Tesařík wrote:
> > > A. Only grouping those fields in their own struct?
> > > B. Or move the definition to another include file (cf. MAINTAINERS)?
> > > C. Or store a pointer in struct device?
> >
> > dev->dma_parms is already this, and IIRC still has some very old
> > comments somewhere about consolidating the other DMA-related fields in
> > there.
>
> Thank you for the hint! I have actually seen dma_parms, but since it
> can be NULL and was initialized from various drivers, it did not occur
> to me that NULL simply means not DMA-capable.
Yes, dma_parms are still optional. A much better hint is the dma_mask
itself, which for historic reasons is implemented in a completely
awful way as a pointer to something stored in the containing structure,
with all kinds of platform devices or minor buses doing nasty hacks
there.
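The scheme described above can be sketched like this (a simplified toy;
real platform devices differ in detail):

```c
typedef unsigned long long u64;

/* struct device stores only a pointer; the containing structure
 * (here a simplified platform device) owns the actual storage. */
struct toy_device {
	u64 *dma_mask; /* points into the containing structure */
};

struct toy_platform_device {
	struct toy_device dev;
	u64 platform_dma_mask; /* the storage dma_mask points at */
};

static void toy_platform_device_init(struct toy_platform_device *pdev,
				     u64 mask)
{
	pdev->platform_dma_mask = mask;
	/* the "nasty hack": aim the pointer at our own member */
	pdev->dev.dma_mask = &pdev->platform_dma_mask;
}
```

The upside for the present discussion is that a non-NULL dma_mask
pointer is a reliable hint that the device is DMA-capable.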
On Thu, 06 Apr 2023 11:44:55 +0000
Juerg Haefliger <[email protected]> wrote:
> On Fri, 31 Mar 2023 11:00:43 +0200
> Petr Tesařík <[email protected]> wrote:
>
> > Hi Juerg,
> >
> > On Fri, 31 Mar 2023 07:26:09 +0000
> > Juerg Haefliger <[email protected]> wrote:
> >
> > > On Tue, 28 Mar 2023 09:54:35 +0200
> > > Petr Tesarik <[email protected]> wrote:
> > >
> > >[...]
> > > > Anyway, I suspected that the buffers need not be imported into the vc4
> > > > driver (also hinted by Eric Anholt in a 2018 blog post [1]), and it
> > > > seems I was right. I encountered the issue with Ubuntu 22.10; I
> > > > installed latest openSUSE Tumbleweed yesterday, and I was not able to
> > > > reproduce the issue there, most likely because the Mesa drivers have
> > > > been fixed meanwhile. This makes the specific case of the Raspberry Pi 4
> > > > drivers moot. The issue may still affect other combinations of drivers,
> > > > but I don't have any other real-world example ATM.
> > >
> > > I'm only seeing this problem with Wayland, no issue when switching Ubuntu to
> > > X. It seems Tumbleweed is using X by default.
> >
> > I know; I was the team lead of SUSE low-level graphics engineers until
> > end of last year... I have just double-checked, but this output of
> > wayland-info in the GNOME session accessed over RDP is quite convincing:
>
> It sure is but how did you get that??
I'm sorry for the late reply.
Yes, the default is GNOME on X.org, but you can change it on the GDM
login screen. After choosing your account, a settings gear icon appears
in the lower right corner. Click on it and choose GNOME. That starts a
Wayland session.
HTH
Petr T