From: Petr Tesarik <[email protected]>
Motivation
==========
The software IO TLB was designed with these assumptions:
1) It would not be used much. Small systems (little RAM) don't need it, and
big systems (lots of RAM) would have modern DMA controllers and an IOMMU
chip to handle legacy devices.
2) A small fixed memory area (64 MiB by default) is sufficient to
handle the few cases which require a bounce buffer.
3) 64 MiB is little enough that it has no impact on the rest of the
system.
4) Bounce buffers require large contiguous chunks of low memory. Such
memory is precious and can be allocated only early at boot.
It turns out they are not always true:
1) Embedded systems may have more than 4GiB RAM but no IOMMU and legacy
32-bit peripheral busses and/or DMA controllers.
2) CoCo VMs use bounce buffers for all I/O but may need substantially more
than 64 MiB.
3) Embedded developers put as many features as possible into the available
memory. A few dozen "missing" megabytes may limit what features can be
implemented.
4) If CMA is available, it can allocate large continuous chunks even after
the system has run for some time.
Goals
=====
The goal of this work is to start with a small software IO TLB at boot and
expand it later when/if needed.
Design
======
This version of the patch series retains the current slot allocation
algorithm with multiple areas to reduce lock contention, but additional
slots can be added when necessary.
These alternatives have been considered:
- Allocate and free buffers as needed using direct DMA API. This works
quite well, except in CoCo VMs where each allocation/free requires
decrypting/encrypting memory, which is a very expensive operation.
- Allocate a very large software IO TLB at boot, but allow to migrate pages
to/from it (like CMA does). For systems with CMA, this would mean two big
allocations at boot. Finding the balance between CMA, SWIOTLB and rest of
available RAM can be challenging. More importantly, there is no clear
benefit compared to allocating SWIOTLB memory pools from the CMA.
Implementation Constraints
==========================
These constraints have been taken into account:
1) Minimize impact on devices which do not benefit from the change.
2) Minimize the number of memory decryption/encryption operations.
3) Avoid contention on a lock or atomic variable to preserve parallel
scalability.
Additionally, the software IO TLB code is also used to implement restricted
DMA pools. These pools are restricted to a pre-defined physical memory
region and must not use any other memory. In other words, dynamic
allocation of memory pools must be disabled for restricted DMA pools.
Data Structures
===============
The existing struct io_tlb_mem is the central type for a SWIOTLB allocator,
but it now contains multiple memory pools::
io_tlb_mem
+---------+ io_tlb_pool
| SWIOTLB | +-------+ +-------+ +-------+
|allocator|-->|default|-->|dynamic|-->|dynamic|-->...
| | |memory | |memory | |memory |
+---------+ | pool | | pool | | pool |
+-------+ +-------+ +-------+
The allocator structure contains global state (such as flags and counters)
and structures needed to schedule new allocations. Each memory pool
contains the actual buffer slots and metadata. The first memory pool in the
list is the default memory pool allocated statically at early boot.
New memory pools are allocated from a kernel worker thread. That's because
bounce buffers are allocated when mapping a DMA buffer, which may happen in
interrupt context where large atomic allocations would probably fail.
Allocation from process context is much more likely to succeed, especially
if it can use CMA.
Nonetheless, the onset of a load spike may fill up the SWIOTLB before the
worker has a chance to run. In that case, try to allocate a small transient
memory pool to accommodate the request. If memory is encrypted and the
device cannot do DMA to encrypted memory, this buffer is allocated from the
coherent atomic DMA memory pool. Reducing the size of SWIOTLB may therefore
require increasing the size of the coherent pool with the "coherent_pool"
command-line parameter.
Performance
===========
All testing compared a vanilla v6.4-rc6 kernel with a fully patched
kernel. The kernel was booted with "swiotlb=force" to allow stress-testing
the software IO TLB on a high-performance device that would otherwise not
need it. CONFIG_DEBUG_FS was set to 'y' to match the configuration of
popular distribution kernels; it is understood that parallel workloads
suffer from contention on the recently added debugfs atomic counters.
These benchmarks were run:
- small: single-threaded I/O of 4 KiB blocks,
- big: single-threaded I/O of 64 KiB blocks,
- 4way: 4-way parallel I/O of 4 KiB blocks.
In all tested cases, the default 64 MiB SWIOTLB would be sufficient (but
wasteful). The "default" pair of columns shows performance impact when
booted with 64 MiB SWIOTLB (i.e. current state). The "growing" pair of
columns shows the impact when booted with a 1 MiB initial SWIOTLB, which
grew to 5 MiB at run time. The "var" column in the tables below is the
coefficient of variance over 5 runs of the test, the "diff" column is the
difference in read-write I/O bandwidth (MiB/s). The very first column is
the coefficient of variance in the results of the base unpatched kernel.
First, on an x86 VM against a QEMU virtio SATA driver backed by a RAM-based
block device on the host:
base default growing
var var diff var diff
small 1.96% 0.47% -1.5% 0.52% -2.2%
big 2.03% 1.35% +0.9% 2.22% +2.9%
4way 0.80% 0.45% -0.7% 1.22% <0.1%
Second, on a Raspberry Pi4 with 8G RAM and a class 10 A1 microSD card:
base default growing
var var diff var diff
small 1.09% 1.69% +0.5% 2.14% -0.2%
big 0.03% 0.28% -0.5% 0.03% -0.1%
4way 5.15% 2.39% +0.2% 0.66% <0.1%
Third, on a CoCo VM. This was a bigger system, so I also added a 24-thread
parallel I/O test:
base default growing
var var diff var diff
small 2.41% 6.02% +1.1% 10.33% +6.7%
big 9.20% 2.81% -0.6% 16.84% -0.2%
4way 0.86% 2.66% -0.1% 2.22% -4.9%
24way 3.19% 6.19% +4.4% 4.08% -5.9%
Note the increased variance of the CoCo VM, although the host was not
otherwise loaded. These are caused by the first run, which includes the
overhead of allocating additional bounce buffers and sharing them with the
hypervisor. The system was not rebooted between successive runs.
Parallel tests suffer from a reduced number of areas in the dynamically
allocated memory pools. This can be improved by allocating a larger pool
from CMA (not implemented in this series yet).
I have no good explanation for the increase in performance of the
24-thread I/O test with the default (non-growing) memory pool. Although the
difference is within variance, it seems to be real. The average bandwidth
is consistently above that of the unpatched kernel.
To sum it up:
- All workloads benefit from reduced memory footprint.
- No performance regressions have been observed with the default size of
the software IO TLB.
- Most workloads retain their former performance even if the software IO
TLB grows at run time.
Changelog
=========
Changes from v6:
- Rebase on dma-mapping for-next tree. Drop changes to pci-dma.c.
Changes from v5:
- Re-introduce is_swiotlb_allocated(), now again required because of commit
b035f5a6d852 ("mm: slab: reduce the kmalloc() minimum alignment if DMA
bouncing possible").
Changes from v4:
- Guard the code with a CONFIG_SWIOTLB_DYNAMIC option
- Remove is_swiotlb_allocated(); instead, prevent repeated initialization
in swiotlb_init_late()
- Rename default_swiotlb_start() to default_swiotlb_base()
- Embed the default struct io_tlb_pool into struct io_tlb_mem
- Do not re-introduce struct io_tlb_pool.used
Changes from v3:
- Provide swiotlb_is_allocated() instead of extending swiotlb_is_active().
- Do not grow SWIOTLB if its address has been queried (affects Octeon).
- Do not grow SWIOTLB if a remap function is used (affects Xen PV).
- Use dma_mask instead of coherent_dma_mask.
- Replace complex ternary operators with if-else blocks.
Changes from v2:
- Complete rewrite using dynamically allocated memory pools rather
than a list of individual buffers
- Depend on other SWIOTLB fixes (already sent)
- Fix Xen and MIPS Octeon builds
Changes from RFC:
- Track dynamic buffers per device instead of per swiotlb
- Use a linked list instead of a maple tree
- Move initialization of swiotlb fields of struct device to a
helper function
- Rename __lookup_dyn_slot() to lookup_dyn_slot_locked()
- Introduce per-device flag if dynamic buffers are in use
- Add one more user of DMA_ATTR_MAY_SLEEP
- Add kernel-doc comments for new (and some old) code
- Properly escape '*' in dma-attributes.rst
Petr Tesarik (9):
swiotlb: bail out of swiotlb_init_late() if swiotlb is already
allocated
swiotlb: make io_tlb_default_mem local to swiotlb.c
swiotlb: add documentation and rename swiotlb_do_find_slots()
swiotlb: separate memory pool data from other allocator data
swiotlb: add a flag whether SWIOTLB is allowed to grow
swiotlb: if swiotlb is full, fall back to a transient memory pool
swiotlb: determine potential physical address limit
swiotlb: allocate a new memory pool when existing pools are full
swiotlb: search the software IO TLB only if the device makes use of it
arch/arm/xen/mm.c | 10 +-
arch/mips/pci/pci-octeon.c | 2 +-
drivers/base/core.c | 4 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/linux/device.h | 10 +-
include/linux/dma-mapping.h | 2 +
include/linux/swiotlb.h | 131 +++++--
kernel/dma/Kconfig | 13 +
kernel/dma/direct.c | 2 +-
kernel/dma/swiotlb.c | 683 ++++++++++++++++++++++++++++++++----
mm/slab_common.c | 5 +-
11 files changed, 764 insertions(+), 100 deletions(-)
--
2.25.1
From: Petr Tesarik <[email protected]>
Try to allocate a transient memory pool if no suitable slots can be found
and the respective SWIOTLB is allowed to grow. The transient pool is just
enough big for this one bounce buffer. It is inserted into a per-device
list of transient memory pools, and it is freed again when the bounce
buffer is unmapped.
Transient memory pools are kept in an RCU list. A memory barrier is
required after adding a new entry, because any address within a transient
buffer must be immediately recognized as belonging to the SWIOTLB, even if
it is passed to another CPU.
Deletion does not require any synchronization beyond RCU ordering
guarantees. After a buffer is unmapped, its physical addresses may no
longer be passed to the DMA API, so the memory range of the corresponding
stale entry in the RCU list never matches. If the memory range gets
allocated again, then it happens only after a RCU quiescent state.
Since bounce buffers can now be allocated from different pools, add a
parameter to swiotlb_alloc_pool() to let the caller know which memory pool
is used. Add swiotlb_find_pool() to find the memory pool corresponding to
an address. This function is now also used by is_swiotlb_buffer(), because
a simple boundary check is no longer sufficient.
The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
simplified and enhanced to use coherent memory pools if needed.
Note that this is not the most efficient way to provide a bounce buffer,
but when a DMA buffer can't be mapped, something may (and will) actually
break. At that point it is better to make an allocation, even if it may be
an expensive operation.
Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/device.h | 6 +
include/linux/dma-mapping.h | 2 +
include/linux/swiotlb.h | 29 +++-
kernel/dma/direct.c | 2 +-
kernel/dma/swiotlb.c | 316 +++++++++++++++++++++++++++++++++++-
5 files changed, 345 insertions(+), 10 deletions(-)
diff --git a/include/linux/device.h b/include/linux/device.h
index d9754a68ba95..5fd89c9d005c 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -626,6 +626,8 @@ struct device_physical_location {
* @dma_mem: Internal for coherent mem override.
* @cma_area: Contiguous memory area for dma allocations
* @dma_io_tlb_mem: Software IO TLB allocator. Not for driver use.
+ * @dma_io_tlb_pools: List of transient swiotlb memory pools.
+ * @dma_io_tlb_lock: Protects changes to the list of active pools.
* @archdata: For arch-specific additions.
* @of_node: Associated device tree node.
* @fwnode: Associated device node supplied by platform firmware.
@@ -731,6 +733,10 @@ struct device {
#endif
#ifdef CONFIG_SWIOTLB
struct io_tlb_mem *dma_io_tlb_mem;
+#endif
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+ struct list_head dma_io_tlb_pools;
+ spinlock_t dma_io_tlb_lock;
#endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index e13050eb9777..f0ccca16a0ac 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -418,6 +418,8 @@ static inline void dma_sync_sgtable_for_device(struct device *dev,
#define dma_get_sgtable(d, t, v, h, s) dma_get_sgtable_attrs(d, t, v, h, s, 0)
#define dma_mmap_coherent(d, v, c, h, s) dma_mmap_attrs(d, v, c, h, s, 0)
+bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size);
+
static inline void *dma_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp)
{
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 57be2a0a9fbf..66867d2188ba 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -80,6 +80,9 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
* @area_nslabs: Number of slots in each area.
* @areas: Array of memory area descriptors.
* @slots: Array of slot descriptors.
+ * @node: Member of the IO TLB memory pool list.
+ * @rcu: RCU head for swiotlb_dyn_free().
+ * @transient: %true if transient memory pool.
*/
struct io_tlb_pool {
phys_addr_t start;
@@ -91,6 +94,11 @@ struct io_tlb_pool {
unsigned int area_nslabs;
struct io_tlb_area *areas;
struct io_tlb_slot *slots;
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+ struct list_head node;
+ struct rcu_head rcu;
+ bool transient;
+#endif
};
/**
@@ -122,6 +130,20 @@ struct io_tlb_mem {
#endif
};
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+
+struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr);
+
+#else
+
+static inline struct io_tlb_pool *swiotlb_find_pool(struct device *dev,
+ phys_addr_t paddr)
+{
+ return &dev->dma_io_tlb_mem->defpool;
+}
+
+#endif
+
/**
* is_swiotlb_buffer() - check if a physical address belongs to a swiotlb
* @dev: Device which has mapped the buffer.
@@ -137,7 +159,12 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- return mem && paddr >= mem->defpool.start && paddr < mem->defpool.end;
+ if (!mem)
+ return false;
+
+ if (IS_ENABLED(CONFIG_SWIOTLB_DYNAMIC))
+ return swiotlb_find_pool(dev, paddr);
+ return paddr >= mem->defpool.start && paddr < mem->defpool.end;
}
static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index d29cade048db..9596ae1aa0da 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -66,7 +66,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
return 0;
}
-static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
+bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
{
dma_addr_t dma_addr = phys_to_dma_direct(dev, phys);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 767c8fb36a6b..30d0fcc3ccb9 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -35,6 +35,7 @@
#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/pfn.h>
+#include <linux/rculist.h>
#include <linux/scatterlist.h>
#include <linux/set_memory.h>
#include <linux/spinlock.h>
@@ -510,6 +511,211 @@ void __init swiotlb_exit(void)
memset(mem, 0, sizeof(*mem));
}
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+
+/**
+ * alloc_dma_pages() - allocate pages to be used for DMA
+ * @gfp: GFP flags for the allocation.
+ * @bytes: Size of the buffer.
+ *
+ * Allocate pages from the buddy allocator. If successful, make the allocated
+ * pages decrypted that they can be used for DMA.
+ *
+ * Return: Decrypted pages, or %NULL on failure.
+ */
+static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes)
+{
+ unsigned int order = get_order(bytes);
+ struct page *page;
+ void *vaddr;
+
+ page = alloc_pages(gfp, order);
+ if (!page)
+ return NULL;
+
+ vaddr = page_address(page);
+ if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
+ goto error;
+ return page;
+
+error:
+ __free_pages(page, order);
+ return NULL;
+}
+
+/**
+ * swiotlb_alloc_tlb() - allocate a dynamic IO TLB buffer
+ * @dev: Device for which a memory pool is allocated.
+ * @bytes: Size of the buffer.
+ * @phys_limit: Maximum allowed physical address of the buffer.
+ * @gfp: GFP flags for the allocation.
+ *
+ * Return: Allocated pages, or %NULL on allocation failure.
+ */
+static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
+ u64 phys_limit, gfp_t gfp)
+{
+ struct page *page;
+
+ /*
+ * Allocate from the atomic pools if memory is encrypted and
+ * the allocation is atomic, because decrypting may block.
+ */
+ if (!gfpflags_allow_blocking(gfp) && dev && force_dma_unencrypted(dev)) {
+ void *vaddr;
+
+ if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
+ return NULL;
+
+ return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
+ dma_coherent_ok);
+ }
+
+ gfp &= ~GFP_ZONEMASK;
+ if (phys_limit <= DMA_BIT_MASK(zone_dma_bits))
+ gfp |= __GFP_DMA;
+ else if (phys_limit <= DMA_BIT_MASK(32))
+ gfp |= __GFP_DMA32;
+
+ while ((page = alloc_dma_pages(gfp, bytes)) &&
+ page_to_phys(page) + bytes - 1 > phys_limit) {
+ /* allocated, but too high */
+ __free_pages(page, get_order(bytes));
+
+ if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
+ phys_limit < DMA_BIT_MASK(64) &&
+ !(gfp & (__GFP_DMA32 | __GFP_DMA)))
+ gfp |= __GFP_DMA32;
+ else if (IS_ENABLED(CONFIG_ZONE_DMA) &&
+ !(gfp & __GFP_DMA))
+ gfp = (gfp & ~__GFP_DMA32) | __GFP_DMA;
+ else
+ return NULL;
+ }
+
+ return page;
+}
+
+/**
+ * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
+ * @vaddr: Virtual address of the buffer.
+ * @bytes: Size of the buffer.
+ */
+static void swiotlb_free_tlb(void *vaddr, size_t bytes)
+{
+ if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
+ dma_free_from_pool(NULL, vaddr, bytes))
+ return;
+
+ /* Intentional leak if pages cannot be encrypted again. */
+ if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
+ __free_pages(virt_to_page(vaddr), get_order(bytes));
+}
+
+/**
+ * swiotlb_alloc_pool() - allocate a new IO TLB memory pool
+ * @dev: Device for which a memory pool is allocated.
+ * @nslabs: Desired number of slabs.
+ * @phys_limit: Maximum DMA buffer physical address.
+ * @gfp: GFP flags for the allocations.
+ *
+ * Allocate and initialize a new IO TLB memory pool.
+ *
+ * Return: New memory pool, or %NULL on allocation failure.
+ */
+static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
+ unsigned int nslabs, u64 phys_limit, gfp_t gfp)
+{
+ struct io_tlb_pool *pool;
+ struct page *tlb;
+ size_t pool_size;
+ size_t tlb_size;
+
+ pool_size = sizeof(*pool) + array_size(sizeof(*pool->areas), 1) +
+ array_size(sizeof(*pool->slots), nslabs);
+ pool = kzalloc(pool_size, gfp);
+ if (!pool)
+ goto error;
+ pool->areas = (void *)pool + sizeof(*pool);
+ pool->slots = (void *)pool->areas + sizeof(*pool->areas);
+
+ tlb_size = nslabs << IO_TLB_SHIFT;
+ tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp);
+ if (!tlb)
+ goto error_tlb;
+
+ swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, 1);
+ return pool;
+
+error_tlb:
+ kfree(pool);
+error:
+ return NULL;
+}
+
+/**
+ * swiotlb_dyn_free() - RCU callback to free a memory pool
+ * @rcu: RCU head in the corresponding struct io_tlb_pool.
+ */
+static void swiotlb_dyn_free(struct rcu_head *rcu)
+{
+ struct io_tlb_pool *pool = container_of(rcu, struct io_tlb_pool, rcu);
+ size_t tlb_size = pool->end - pool->start;
+
+ swiotlb_free_tlb(pool->vaddr, tlb_size);
+ kfree(pool);
+}
+
+/**
+ * swiotlb_find_pool() - find the IO TLB pool for a physical address
+ * @dev: Device which has mapped the DMA buffer.
+ * @paddr: Physical address within the DMA buffer.
+ *
+ * Find the IO TLB memory pool descriptor which contains the given physical
+ * address, if any.
+ *
+ * Return: Memory pool which contains @paddr, or %NULL if none.
+ */
+struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_pool *pool = &mem->defpool;
+
+ if (paddr >= pool->start && paddr < pool->end)
+ return pool;
+
+ /* Pairs with smp_wmb() in swiotlb_find_slots(). */
+ smp_rmb();
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(pool, &dev->dma_io_tlb_pools, node) {
+ if (paddr >= pool->start && paddr < pool->end)
+ goto out;
+ }
+ pool = NULL;
+out:
+ rcu_read_unlock();
+ return pool;
+}
+
+/**
+ * swiotlb_del_pool() - remove an IO TLB pool from a device
+ * @dev: Owning device.
+ * @pool: Memory pool to be removed.
+ */
+static void swiotlb_del_pool(struct device *dev, struct io_tlb_pool *pool)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&dev->dma_io_tlb_lock, flags);
+ list_del_rcu(&pool->node);
+ spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
+
+ call_rcu(&pool->rcu, swiotlb_dyn_free);
+}
+
+#endif /* CONFIG_SWIOTLB_DYNAMIC */
+
/**
* swiotlb_dev_init() - initialize swiotlb fields in &struct device
* @dev: Device to be initialized.
@@ -517,6 +723,10 @@ void __init swiotlb_exit(void)
void swiotlb_dev_init(struct device *dev)
{
dev->dma_io_tlb_mem = &io_tlb_default_mem;
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+ INIT_LIST_HEAD(&dev->dma_io_tlb_pools);
+ spin_lock_init(&dev->dma_io_tlb_lock);
+#endif
}
/*
@@ -533,7 +743,7 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
enum dma_data_direction dir)
{
- struct io_tlb_pool *mem = &dev->dma_io_tlb_mem->defpool;
+ struct io_tlb_pool *mem = swiotlb_find_pool(dev, tlb_addr);
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
@@ -799,6 +1009,8 @@ static int swiotlb_pool_find_slots(struct device *dev, struct io_tlb_pool *pool,
return -1;
}
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+
/**
* swiotlb_find_slots() - search for slots in the whole swiotlb
* @dev: Device which maps the buffer.
@@ -806,6 +1018,7 @@ static int swiotlb_pool_find_slots(struct device *dev, struct io_tlb_pool *pool,
* @alloc_size: Total requested size of the bounce buffer,
* including initial alignment padding.
* @alloc_align_mask: Required alignment of the allocated buffer.
+ * @retpool: Used memory pool, updated on return.
*
* Search through the whole software IO TLB to find a sequence of slots that
* match the allocation constraints.
@@ -813,12 +1026,64 @@ static int swiotlb_pool_find_slots(struct device *dev, struct io_tlb_pool *pool,
* Return: Index of the first allocated slot, or -1 on error.
*/
static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
- size_t alloc_size, unsigned int alloc_align_mask)
+ size_t alloc_size, unsigned int alloc_align_mask,
+ struct io_tlb_pool **retpool)
{
- return swiotlb_pool_find_slots(dev, &dev->dma_io_tlb_mem->defpool,
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_pool *pool;
+ unsigned long nslabs;
+ unsigned long flags;
+ u64 phys_limit;
+ int index;
+
+ pool = &mem->defpool;
+ index = swiotlb_pool_find_slots(dev, pool, orig_addr,
+ alloc_size, alloc_align_mask);
+ if (index >= 0)
+ goto found;
+
+ if (!mem->can_grow)
+ return -1;
+
+ nslabs = nr_slots(alloc_size);
+ phys_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
+ pool = swiotlb_alloc_pool(dev, nslabs, phys_limit,
+ GFP_NOWAIT | __GFP_NOWARN);
+ if (!pool)
+ return -1;
+
+ index = swiotlb_pool_find_slots(dev, pool, orig_addr,
+ alloc_size, alloc_align_mask);
+ if (index < 0) {
+ swiotlb_dyn_free(&pool->rcu);
+ return -1;
+ }
+
+ pool->transient = true;
+ spin_lock_irqsave(&dev->dma_io_tlb_lock, flags);
+ list_add_rcu(&pool->node, &dev->dma_io_tlb_pools);
+ spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
+
+ /* Pairs with smp_rmb() in swiotlb_find_pool(). */
+ smp_wmb();
+found:
+ *retpool = pool;
+ return index;
+}
+
+#else /* !CONFIG_SWIOTLB_DYNAMIC */
+
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+ size_t alloc_size, unsigned int alloc_align_mask,
+ struct io_tlb_pool **retpool)
+{
+ *retpool = &dev->dma_io_tlb_mem->defpool;
+ return swiotlb_pool_find_slots(dev, *retpool,
orig_addr, alloc_size, alloc_align_mask);
}
+#endif /* CONFIG_SWIOTLB_DYNAMIC */
+
#ifdef CONFIG_DEBUG_FS
/**
@@ -899,7 +1164,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
}
index = swiotlb_find_slots(dev, orig_addr,
- alloc_size + offset, alloc_align_mask);
+ alloc_size + offset, alloc_align_mask, &pool);
if (index == -1) {
if (!(attrs & DMA_ATTR_NO_WARN))
dev_warn_ratelimited(dev,
@@ -913,7 +1178,6 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
* This is needed when we sync the memory. Then we sync the buffer if
* needed.
*/
- pool = &mem->defpool;
for (i = 0; i < nr_slots(alloc_size + offset); i++)
pool->slots[index + i].orig_addr = slot_addr(orig_addr, i);
tlb_addr = slot_addr(pool->start, index) + offset;
@@ -930,7 +1194,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
{
- struct io_tlb_pool *mem = &dev->dma_io_tlb_mem->defpool;
+ struct io_tlb_pool *mem = swiotlb_find_pool(dev, tlb_addr);
unsigned long flags;
unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
@@ -977,6 +1241,41 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
dec_used(dev->dma_io_tlb_mem, nslots);
}
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+
+/**
+ * swiotlb_del_transient() - delete a transient memory pool
+ * @dev: Device which mapped the buffer.
+ * @tlb_addr: Physical address within a bounce buffer.
+ *
+ * Check whether the address belongs to a transient SWIOTLB memory pool.
+ * If yes, then delete the pool.
+ *
+ * Return: %true if @tlb_addr belonged to a transient pool that was released.
+ */
+static bool swiotlb_del_transient(struct device *dev, phys_addr_t tlb_addr)
+{
+ struct io_tlb_pool *pool;
+
+ pool = swiotlb_find_pool(dev, tlb_addr);
+ if (!pool->transient)
+ return false;
+
+ dec_used(dev->dma_io_tlb_mem, pool->nslabs);
+ swiotlb_del_pool(dev, pool);
+ return true;
+}
+
+#else /* !CONFIG_SWIOTLB_DYNAMIC */
+
+static inline bool swiotlb_del_transient(struct device *dev,
+ phys_addr_t tlb_addr)
+{
+ return false;
+}
+
+#endif /* CONFIG_SWIOTLB_DYNAMIC */
+
/*
* tlb_addr is the physical address of the bounce buffer to unmap.
*/
@@ -991,6 +1290,8 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
+ if (swiotlb_del_transient(dev, tlb_addr))
+ return;
swiotlb_release_slots(dev, tlb_addr);
}
@@ -1179,11 +1480,10 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
if (!mem)
return NULL;
- index = swiotlb_find_slots(dev, 0, size, 0);
+ index = swiotlb_find_slots(dev, 0, size, 0, &pool);
if (index == -1)
return NULL;
- pool = &mem->defpool;
tlb_addr = slot_addr(pool->start, index);
return pfn_to_page(PFN_DOWN(tlb_addr));
--
2.25.1
From: Petr Tesarik <[email protected]>
When swiotlb_find_slots() cannot find suitable slots, schedule the
allocation of a new memory pool. It is not possible to allocate the pool
immediately, because this code may run in interrupt context, which is not
suitable for large memory allocations. This means that the memory pool will
be available too late for the currently requested mapping, but the stress
on the software IO TLB allocator is likely to continue, and subsequent
allocations will benefit from the additional pool eventually.
Keep all memory pools for an allocator in an RCU list to avoid locking on
the read side. For modifications, add a new spinlock to struct io_tlb_mem.
The spinlock also protects updates to the total number of slabs (nslabs in
struct io_tlb_mem), but not reads of the value. Readers may therefore
encounter a stale value, but this is not an issue:
- swiotlb_tbl_map_single() and is_swiotlb_active() only check for non-zero
value. This is ensured by the existence of the default memory pool,
allocated at boot.
- The exact value is used only for non-critical purposes (debugfs, kernel
messages).
Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/swiotlb.h | 8 +++
kernel/dma/swiotlb.c | 148 +++++++++++++++++++++++++++++++++-------
2 files changed, 131 insertions(+), 25 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 9825fa14abe4..8371c92a0271 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -8,6 +8,7 @@
#include <linux/types.h>
#include <linux/limits.h>
#include <linux/spinlock.h>
+#include <linux/workqueue.h>
struct device;
struct page;
@@ -104,12 +105,16 @@ struct io_tlb_pool {
/**
* struct io_tlb_mem - Software IO TLB allocator
* @defpool: Default (initial) IO TLB memory pool descriptor.
+ * @pool: IO TLB memory pool descriptor (if not dynamic).
* @nslabs: Total number of IO TLB slabs in all pools.
* @debugfs: The dentry to debugfs.
* @force_bounce: %true if swiotlb bouncing is forced
* @for_alloc: %true if the pool is used for memory allocation
* @can_grow: %true if more pools can be allocated dynamically.
* @phys_limit: Maximum allowed physical address.
+ * @lock: Lock to synchronize changes to the list.
+ * @pools: List of IO TLB memory pool descriptors (if dynamic).
+ * @dyn_alloc: Dynamic IO TLB pool allocation work.
* @total_used: The total number of slots in the pool that are currently used
* across all areas. Used only for calculating used_hiwater in
* debugfs.
@@ -125,6 +130,9 @@ struct io_tlb_mem {
#ifdef CONFIG_SWIOTLB_DYNAMIC
bool can_grow;
u64 phys_limit;
+ spinlock_t lock;
+ struct list_head pools;
+ struct work_struct dyn_alloc;
#endif
#ifdef CONFIG_DEBUG_FS
atomic_long_t total_used;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 0fa081defdbd..adf80dec42d7 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -79,8 +79,23 @@ struct io_tlb_slot {
static bool swiotlb_force_bounce;
static bool swiotlb_force_disable;
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+
+static void swiotlb_dyn_alloc(struct work_struct *work);
+
+static struct io_tlb_mem io_tlb_default_mem = {
+ .lock = __SPIN_LOCK_UNLOCKED(io_tlb_default_mem.lock),
+ .pools = LIST_HEAD_INIT(io_tlb_default_mem.pools),
+ .dyn_alloc = __WORK_INITIALIZER(io_tlb_default_mem.dyn_alloc,
+ swiotlb_dyn_alloc),
+};
+
+#else /* !CONFIG_SWIOTLB_DYNAMIC */
+
static struct io_tlb_mem io_tlb_default_mem;
+#endif /* CONFIG_SWIOTLB_DYNAMIC */
+
static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
static unsigned long default_nareas;
@@ -278,6 +293,23 @@ static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
return;
}
+/**
+ * add_mem_pool() - add a memory pool to the allocator
+ * @mem: Software IO TLB allocator.
+ * @pool: Memory pool to be added.
+ */
+static void add_mem_pool(struct io_tlb_mem *mem, struct io_tlb_pool *pool)
+{
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+ spin_lock(&mem->lock);
+ list_add_rcu(&pool->node, &mem->pools);
+ mem->nslabs += pool->nslabs;
+ spin_unlock(&mem->lock);
+#else
+ mem->nslabs = pool->nslabs;
+#endif
+}
+
static void __init *swiotlb_memblock_alloc(unsigned long nslabs,
unsigned int flags,
int (*remap)(void *tlb, unsigned long nslabs))
@@ -375,7 +407,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
swiotlb_init_io_tlb_pool(mem, __pa(tlb), nslabs, false,
default_nareas);
- io_tlb_default_mem.nslabs = nslabs;
+ add_mem_pool(&io_tlb_default_mem, mem);
if (flags & SWIOTLB_VERBOSE)
swiotlb_print_info();
@@ -474,7 +506,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
(nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
nareas);
- io_tlb_default_mem.nslabs = nslabs;
+ add_mem_pool(&io_tlb_default_mem, mem);
swiotlb_print_info();
return 0;
@@ -625,44 +657,83 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
/**
* swiotlb_alloc_pool() - allocate a new IO TLB memory pool
* @dev: Device for which a memory pool is allocated.
- * @nslabs: Desired number of slabs.
+ * @minslabs: Minimum number of slabs.
+ * @nslabs: Desired (maximum) number of slabs.
+ * @nareas: Number of areas.
* @phys_limit: Maximum DMA buffer physical address.
* @gfp: GFP flags for the allocations.
*
- * Allocate and initialize a new IO TLB memory pool.
+ * Allocate and initialize a new IO TLB memory pool. The actual number of
+ * slabs may be reduced if allocation of @nslabs fails. If even
+ * @minslabs cannot be allocated, this function fails.
*
* Return: New memory pool, or %NULL on allocation failure.
*/
static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
- unsigned int nslabs, u64 phys_limit, gfp_t gfp)
+ unsigned long minslabs, unsigned long nslabs,
+ unsigned int nareas, u64 phys_limit, gfp_t gfp)
{
struct io_tlb_pool *pool;
+ unsigned int slot_order;
struct page *tlb;
size_t pool_size;
size_t tlb_size;
- pool_size = sizeof(*pool) + array_size(sizeof(*pool->areas), 1) +
- array_size(sizeof(*pool->slots), nslabs);
+ pool_size = sizeof(*pool) + array_size(sizeof(*pool->areas), nareas);
pool = kzalloc(pool_size, gfp);
if (!pool)
goto error;
pool->areas = (void *)pool + sizeof(*pool);
- pool->slots = (void *)pool->areas + sizeof(*pool->areas);
tlb_size = nslabs << IO_TLB_SHIFT;
- tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp);
- if (!tlb)
- goto error_tlb;
+ while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp))) {
+ if (nslabs <= minslabs)
+ goto error_tlb;
+ nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
+ nareas = limit_nareas(nareas, nslabs);
+ tlb_size = nslabs << IO_TLB_SHIFT;
+ }
- swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, 1);
+ slot_order = get_order(array_size(sizeof(*pool->slots), nslabs));
+ pool->slots = (struct io_tlb_slot *)
+ __get_free_pages(gfp, slot_order);
+ if (!pool->slots)
+ goto error_slots;
+
+ swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, nareas);
return pool;
+error_slots:
+ swiotlb_free_tlb(page_address(tlb), tlb_size);
error_tlb:
kfree(pool);
error:
return NULL;
}
+/**
+ * swiotlb_dyn_alloc() - dynamic memory pool allocation worker
+ * @work: Pointer to dyn_alloc in struct io_tlb_mem.
+ */
+static void swiotlb_dyn_alloc(struct work_struct *work)
+{
+ struct io_tlb_mem *mem =
+ container_of(work, struct io_tlb_mem, dyn_alloc);
+ struct io_tlb_pool *pool;
+
+ pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
+ default_nareas, mem->phys_limit, GFP_KERNEL);
+ if (!pool) {
+ pr_warn_ratelimited("Failed to allocate new pool");
+ return;
+ }
+
+ add_mem_pool(mem, pool);
+
+ /* Pairs with smp_rmb() in swiotlb_find_pool(). */
+ smp_wmb();
+}
+
/**
* swiotlb_dyn_free() - RCU callback to free a memory pool
* @rcu: RCU head in the corresponding struct io_tlb_pool.
@@ -670,8 +741,10 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
static void swiotlb_dyn_free(struct rcu_head *rcu)
{
struct io_tlb_pool *pool = container_of(rcu, struct io_tlb_pool, rcu);
+ size_t slots_size = array_size(sizeof(*pool->slots), pool->nslabs);
size_t tlb_size = pool->end - pool->start;
+ free_pages((unsigned long)pool->slots, get_order(slots_size));
swiotlb_free_tlb(pool->vaddr, tlb_size);
kfree(pool);
}
@@ -689,15 +762,19 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- struct io_tlb_pool *pool = &mem->defpool;
-
- if (paddr >= pool->start && paddr < pool->end)
- return pool;
+ struct io_tlb_pool *pool;
- /* Pairs with smp_wmb() in swiotlb_find_slots(). */
+ /* Pairs with smp_wmb() in swiotlb_find_slots() and
+ * swiotlb_dyn_alloc(), which modify the RCU lists.
+ */
smp_rmb();
rcu_read_lock();
+ list_for_each_entry_rcu(pool, &mem->pools, node) {
+ if (paddr >= pool->start && paddr < pool->end)
+ goto out;
+ }
+
list_for_each_entry_rcu(pool, &dev->dma_io_tlb_pools, node) {
if (paddr >= pool->start && paddr < pool->end)
goto out;
@@ -1046,18 +1123,24 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
u64 phys_limit;
int index;
- pool = &mem->defpool;
- index = swiotlb_pool_find_slots(dev, pool, orig_addr,
- alloc_size, alloc_align_mask);
- if (index >= 0)
- goto found;
-
+ rcu_read_lock();
+ list_for_each_entry_rcu(pool, &mem->pools, node) {
+ index = swiotlb_pool_find_slots(dev, pool, orig_addr,
+ alloc_size, alloc_align_mask);
+ if (index >= 0) {
+ rcu_read_unlock();
+ goto found;
+ }
+ }
+ rcu_read_unlock();
if (!mem->can_grow)
return -1;
+ schedule_work(&mem->dyn_alloc);
+
nslabs = nr_slots(alloc_size);
phys_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
- pool = swiotlb_alloc_pool(dev, nslabs, phys_limit,
+ pool = swiotlb_alloc_pool(dev, nslabs, nslabs, 1, phys_limit,
GFP_NOWAIT | __GFP_NOWARN);
if (!pool)
return -1;
@@ -1141,7 +1224,19 @@ static unsigned long mem_pool_used(struct io_tlb_pool *pool)
*/
static unsigned long mem_used(struct io_tlb_mem *mem)
{
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+ struct io_tlb_pool *pool;
+ unsigned long used = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(pool, &mem->pools, node)
+ used += mem_pool_used(pool);
+ rcu_read_unlock();
+
+ return used;
+#else
return mem_pool_used(&mem->defpool);
+#endif
}
#endif /* CONFIG_DEBUG_FS */
@@ -1562,7 +1657,10 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
false, nareas);
mem->force_bounce = true;
mem->for_alloc = true;
- mem->nslabs = nslabs;
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+ spin_lock_init(&mem->lock);
+#endif
+ add_mem_pool(mem, pool);
rmem->priv = mem;
--
2.25.1
From: Petr Tesarik <[email protected]>
SWIOTLB implementation details should not be exposed to the rest of the
kernel. This will allow to make changes to the implementation without
modifying non-swiotlb code.
To avoid breaking existing users, provide helper functions for the few
required fields.
As a bonus, using a helper function to initialize struct device allows to
get rid of an #ifdef in driver core.
Signed-off-by: Petr Tesarik <[email protected]>
---
arch/mips/pci/pci-octeon.c | 2 +-
drivers/base/core.c | 4 +---
drivers/xen/swiotlb-xen.c | 2 +-
include/linux/swiotlb.h | 25 +++++++++++++++++++++++-
kernel/dma/swiotlb.c | 39 +++++++++++++++++++++++++++++++++++++-
mm/slab_common.c | 5 ++---
6 files changed, 67 insertions(+), 10 deletions(-)
diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
index e457a18cbdc5..d19d9d456309 100644
--- a/arch/mips/pci/pci-octeon.c
+++ b/arch/mips/pci/pci-octeon.c
@@ -664,7 +664,7 @@ static int __init octeon_pci_setup(void)
/* BAR1 movable regions contiguous to cover the swiotlb */
octeon_bar1_pci_phys =
- io_tlb_default_mem.start & ~((1ull << 22) - 1);
+ default_swiotlb_base() & ~((1ull << 22) - 1);
for (index = 0; index < 32; index++) {
union cvmx_pci_bar1_indexx bar1_index;
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 3dff5037943e..46d1d78c5beb 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3108,9 +3108,7 @@ void device_initialize(struct device *dev)
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
dev->dma_coherent = dma_default_coherent;
#endif
-#ifdef CONFIG_SWIOTLB
- dev->dma_io_tlb_mem = &io_tlb_default_mem;
-#endif
+ swiotlb_dev_init(dev);
}
EXPORT_SYMBOL_GPL(device_initialize);
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 67aa74d20162..946bd56f0ac5 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -381,7 +381,7 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
static int
xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
{
- return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask;
+ return xen_phys_to_dma(hwdev, default_swiotlb_limit()) <= mask;
}
const struct dma_map_ops xen_swiotlb_dma_ops = {
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 4e52cd5e0bdc..2d453b3e7771 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -110,7 +110,6 @@ struct io_tlb_mem {
atomic_long_t used_hiwater;
#endif
};
-extern struct io_tlb_mem io_tlb_default_mem;
static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
@@ -128,13 +127,22 @@ static inline bool is_swiotlb_force_bounce(struct device *dev)
void swiotlb_init(bool addressing_limited, unsigned int flags);
void __init swiotlb_exit(void);
+void swiotlb_dev_init(struct device *dev);
size_t swiotlb_max_mapping_size(struct device *dev);
+bool is_swiotlb_allocated(void);
bool is_swiotlb_active(struct device *dev);
void __init swiotlb_adjust_size(unsigned long size);
+phys_addr_t default_swiotlb_base(void);
+phys_addr_t default_swiotlb_limit(void);
#else
static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
{
}
+
+static inline void swiotlb_dev_init(struct device *dev)
+{
+}
+
static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
return false;
@@ -151,6 +159,11 @@ static inline size_t swiotlb_max_mapping_size(struct device *dev)
return SIZE_MAX;
}
+static inline bool is_swiotlb_allocated(void)
+{
+ return false;
+}
+
static inline bool is_swiotlb_active(struct device *dev)
{
return false;
@@ -159,6 +172,16 @@ static inline bool is_swiotlb_active(struct device *dev)
static inline void swiotlb_adjust_size(unsigned long size)
{
}
+
+static inline phys_addr_t default_swiotlb_base(void)
+{
+ return 0;
+}
+
+static inline phys_addr_t default_swiotlb_limit(void)
+{
+ return 0;
+}
#endif /* CONFIG_SWIOTLB */
extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ee57fd9949dc..0840bc15fb53 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -71,7 +71,7 @@ struct io_tlb_slot {
static bool swiotlb_force_bounce;
static bool swiotlb_force_disable;
-struct io_tlb_mem io_tlb_default_mem;
+static struct io_tlb_mem io_tlb_default_mem;
static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
static unsigned long default_nareas;
@@ -489,6 +489,15 @@ void __init swiotlb_exit(void)
memset(mem, 0, sizeof(*mem));
}
+/**
+ * swiotlb_dev_init() - initialize swiotlb fields in &struct device
+ * @dev: Device to be initialized.
+ */
+void swiotlb_dev_init(struct device *dev)
+{
+ dev->dma_io_tlb_mem = &io_tlb_default_mem;
+}
+
/*
* Return the offset into a iotlb slot required to keep the device happy.
*/
@@ -953,6 +962,14 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align;
}
+/**
+ * is_swiotlb_allocated() - check if the default software IO TLB is initialized
+ */
+bool is_swiotlb_allocated(void)
+{
+ return io_tlb_default_mem.nslabs;
+}
+
bool is_swiotlb_active(struct device *dev)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
@@ -960,6 +977,26 @@ bool is_swiotlb_active(struct device *dev)
return mem && mem->nslabs;
}
+/**
+ * default_swiotlb_base() - get the base address of the default SWIOTLB
+ *
+ * Get the lowest physical address used by the default software IO TLB pool.
+ */
+phys_addr_t default_swiotlb_base(void)
+{
+ return io_tlb_default_mem.start;
+}
+
+/**
+ * default_swiotlb_limit() - get the address limit of the default SWIOTLB
+ *
+ * Get the highest physical address used by the default software IO TLB pool.
+ */
+phys_addr_t default_swiotlb_limit(void)
+{
+ return io_tlb_default_mem.end - 1;
+}
+
#ifdef CONFIG_DEBUG_FS
static int io_tlb_used_get(void *data, u64 *val)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index d1555ea2981a..c639dba9a3b7 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -864,10 +864,9 @@ void __init setup_kmalloc_cache_index_table(void)
static unsigned int __kmalloc_minalign(void)
{
-#ifdef CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC
- if (io_tlb_default_mem.nslabs)
+ if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
+ is_swiotlb_allocated())
return ARCH_KMALLOC_MINALIGN;
-#endif
return dma_get_cache_alignment();
}
--
2.25.1
From: Petr Tesarik <[email protected]>
The value returned by default_swiotlb_limit() should be constant, because
it is used to decide whether DMA can be used. To allow allocating memory
pools on the fly, use the maximum possible physical address rather than the
highest address used by the default pool.
For swiotlb_init_remap(), this is either an arch-specific limit used by
memblock_alloc_low(), or the highest directly mapped physical address if
the initialization flags include SWIOTLB_ANY. For swiotlb_init_late(), the
highest address is determined by the GFP flags.
Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/swiotlb.h | 2 ++
kernel/dma/swiotlb.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 66867d2188ba..9825fa14abe4 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -109,6 +109,7 @@ struct io_tlb_pool {
* @force_bounce: %true if swiotlb bouncing is forced
* @for_alloc: %true if the pool is used for memory allocation
* @can_grow: %true if more pools can be allocated dynamically.
+ * @phys_limit: Maximum allowed physical address.
* @total_used: The total number of slots in the pool that are currently used
* across all areas. Used only for calculating used_hiwater in
* debugfs.
@@ -123,6 +124,7 @@ struct io_tlb_mem {
bool for_alloc;
#ifdef CONFIG_SWIOTLB_DYNAMIC
bool can_grow;
+ u64 phys_limit;
#endif
#ifdef CONFIG_DEBUG_FS
atomic_long_t total_used;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 30d0fcc3ccb9..0fa081defdbd 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -334,6 +334,10 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
#ifdef CONFIG_SWIOTLB_DYNAMIC
if (!remap)
io_tlb_default_mem.can_grow = true;
+ if (flags & SWIOTLB_ANY)
+ io_tlb_default_mem.phys_limit = virt_to_phys(high_memory - 1);
+ else
+ io_tlb_default_mem.phys_limit = ARCH_LOW_ADDRESS_LIMIT;
#endif
if (!default_nareas)
@@ -409,6 +413,12 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
#ifdef CONFIG_SWIOTLB_DYNAMIC
if (!remap)
io_tlb_default_mem.can_grow = true;
+ if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp_mask & __GFP_DMA))
+ io_tlb_default_mem.phys_limit = DMA_BIT_MASK(zone_dma_bits);
+ else if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp_mask & __GFP_DMA32))
+ io_tlb_default_mem.phys_limit = DMA_BIT_MASK(32);
+ else
+ io_tlb_default_mem.phys_limit = virt_to_phys(high_memory - 1);
#endif
if (!default_nareas)
@@ -1397,7 +1407,11 @@ phys_addr_t default_swiotlb_base(void)
*/
phys_addr_t default_swiotlb_limit(void)
{
+#ifdef CONFIG_SWIOTLB_DYNAMIC
+ return io_tlb_default_mem.phys_limit;
+#else
return io_tlb_default_mem.defpool.end - 1;
+#endif
}
#ifdef CONFIG_DEBUG_FS
--
2.25.1
On Tue, Aug 01, 2023 at 08:23:55AM +0200, Petr Tesarik wrote:
> From: Petr Tesarik <[email protected]>
>
> Motivation
> ==========
>
> The software IO TLB was designed with these assumptions:
>
> 1) It would not be used much. Small systems (little RAM) don't need it, and
> big systems (lots of RAM) would have modern DMA controllers and an IOMMU
> chip to handle legacy devices.
> 2) A small fixed memory area (64 MiB by default) is sufficient to
> handle the few cases which require a bounce buffer.
> 3) 64 MiB is little enough that it has no impact on the rest of the
> system.
> 4) Bounce buffers require large contiguous chunks of low memory. Such
> memory is precious and can be allocated only early at boot.
>
> It turns out they are not always true:
>
> 1) Embedded systems may have more than 4GiB RAM but no IOMMU and legacy
> 32-bit peripheral busses and/or DMA controllers.
> 2) CoCo VMs use bounce buffers for all I/O but may need substantially more
> than 64 MiB.
> 3) Embedded developers put as many features as possible into the available
> memory. A few dozen "missing" megabytes may limit what features can be
> implemented.
> 4) If CMA is available, it can allocate large continuous chunks even after
> the system has run for some time.
>
> Goals
> =====
>
> The goal of this work is to start with a small software IO TLB at boot and
> expand it later when/if needed.
>
> Design
> ======
>
> This version of the patch series retains the current slot allocation
> algorithm with multiple areas to reduce lock contention, but additional
> slots can be added when necessary.
>
> These alternatives have been considered:
>
> - Allocate and free buffers as needed using direct DMA API. This works
> quite well, except in CoCo VMs where each allocation/free requires
> decrypting/encrypting memory, which is a very expensive operation.
>
> - Allocate a very large software IO TLB at boot, but allow to migrate pages
> to/from it (like CMA does). For systems with CMA, this would mean two big
> allocations at boot. Finding the balance between CMA, SWIOTLB and rest of
> available RAM can be challenging. More importantly, there is no clear
> benefit compared to allocating SWIOTLB memory pools from the CMA.
>
> Implementation Constraints
> ==========================
>
> These constraints have been taken into account:
>
> 1) Minimize impact on devices which do not benefit from the change.
> 2) Minimize the number of memory decryption/encryption operations.
> 3) Avoid contention on a lock or atomic variable to preserve parallel
> scalability.
>
> Additionally, the software IO TLB code is also used to implement restricted
> DMA pools. These pools are restricted to a pre-defined physical memory
> region and must not use any other memory. In other words, dynamic
> allocation of memory pools must be disabled for restricted DMA pools.
>
> Data Structures
> ===============
>
> The existing struct io_tlb_mem is the central type for a SWIOTLB allocator,
> but it now contains multiple memory pools::
>
> io_tlb_mem
> +---------+ io_tlb_pool
> | SWIOTLB | +-------+ +-------+ +-------+
> |allocator|-->|default|-->|dynamic|-->|dynamic|-->...
> | | |memory | |memory | |memory |
> +---------+ | pool | | pool | | pool |
> +-------+ +-------+ +-------+
>
> The allocator structure contains global state (such as flags and counters)
> and structures needed to schedule new allocations. Each memory pool
> contains the actual buffer slots and metadata. The first memory pool in the
> list is the default memory pool allocated statically at early boot.
>
> New memory pools are allocated from a kernel worker thread. That's because
> bounce buffers are allocated when mapping a DMA buffer, which may happen in
> interrupt context where large atomic allocations would probably fail.
> Allocation from process context is much more likely to succeed, especially
> if it can use CMA.
>
> Nonetheless, the onset of a load spike may fill up the SWIOTLB before the
> worker has a chance to run. In that case, try to allocate a small transient
> memory pool to accommodate the request. If memory is encrypted and the
> device cannot do DMA to encrypted memory, this buffer is allocated from the
> coherent atomic DMA memory pool. Reducing the size of SWIOTLB may therefore
> require increasing the size of the coherent pool with the "coherent_pool"
> command-line parameter.
>
> Performance
> ===========
>
> All testing compared a vanilla v6.4-rc6 kernel with a fully patched
> kernel. The kernel was booted with "swiotlb=force" to allow stress-testing
> the software IO TLB on a high-performance device that would otherwise not
> need it. CONFIG_DEBUG_FS was set to 'y' to match the configuration of
> popular distribution kernels; it is understood that parallel workloads
> suffer from contention on the recently added debugfs atomic counters.
>
> These benchmarks were run:
>
> - small: single-threaded I/O of 4 KiB blocks,
> - big: single-threaded I/O of 64 KiB blocks,
> - 4way: 4-way parallel I/O of 4 KiB blocks.
>
> In all tested cases, the default 64 MiB SWIOTLB would be sufficient (but
> wasteful). The "default" pair of columns shows performance impact when
> booted with 64 MiB SWIOTLB (i.e. current state). The "growing" pair of
> columns shows the impact when booted with a 1 MiB initial SWIOTLB, which
> grew to 5 MiB at run time. The "var" column in the tables below is the
> coefficient of variance over 5 runs of the test, the "diff" column is the
> difference in read-write I/O bandwidth (MiB/s). The very first column is
> the coefficient of variance in the results of the base unpatched kernel.
>
> First, on an x86 VM against a QEMU virtio SATA driver backed by a RAM-based
> block device on the host:
>
> base default growing
> var var diff var diff
> small 1.96% 0.47% -1.5% 0.52% -2.2%
> big 2.03% 1.35% +0.9% 2.22% +2.9%
> 4way 0.80% 0.45% -0.7% 1.22% <0.1%
>
> Second, on a Raspberry Pi4 with 8G RAM and a class 10 A1 microSD card:
>
> base default growing
> var var diff var diff
> small 1.09% 1.69% +0.5% 2.14% -0.2%
> big 0.03% 0.28% -0.5% 0.03% -0.1%
> 4way 5.15% 2.39% +0.2% 0.66% <0.1%
>
> Third, on a CoCo VM. This was a bigger system, so I also added a 24-thread
> parallel I/O test:
>
> base default growing
> var var diff var diff
> small 2.41% 6.02% +1.1% 10.33% +6.7%
> big 9.20% 2.81% -0.6% 16.84% -0.2%
> 4way 0.86% 2.66% -0.1% 2.22% -4.9%
> 24way 3.19% 6.19% +4.4% 4.08% -5.9%
>
> Note the increased variance of the CoCo VM, although the host was not
> otherwise loaded. These are caused by the first run, which includes the
> overhead of allocating additional bounce buffers and sharing them with the
> hypervisor. The system was not rebooted between successive runs.
>
> Parallel tests suffer from a reduced number of areas in the dynamically
> allocated memory pools. This can be improved by allocating a larger pool
> from CMA (not implemented in this series yet).
>
> I have no good explanation for the increase in performance of the
> 24-thread I/O test with the default (non-growing) memory pool. Although the
> difference is within variance, it seems to be real. The average bandwidth
> is consistently above that of the unpatched kernel.
>
> To sum it up:
>
> - All workloads benefit from reduced memory footprint.
> - No performance regressions have been observed with the default size of
> the software IO TLB.
> - Most workloads retain their former performance even if the software IO
> TLB grows at run time.
>
For the driver-core touched portions:
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Thanks,
I've applied this to a new swiotlb-dynamic branch that I'll pull into
the dma-mapping for-next tree.
On 8/1/2023 6:03 PM, Christoph Hellwig wrote:
> Thanks,
>
> I've applied this to a new swiotlb-dynamic branch that I'll pull into
> the dma-mapping for-next tree.
Thank you.
I guess I can prepare some follow-up series now. ;-)
Petr T