2023-06-27 09:59:57

by Petr Tesarik

Subject: [PATCH v3 0/7] Allow dynamic allocation of software IO TLB bounce buffers

From: Petr Tesarik <[email protected]>

Note: This patch series depends on fixes from this thread:

https://lore.kernel.org/linux-iommu/[email protected]/T/

Motivation
==========

The software IO TLB was designed with these assumptions:

1) It would not be used much. Small systems (little RAM) don't need it, and
   big systems (lots of RAM) would have modern DMA controllers and an IOMMU
   chip to handle legacy devices.
2) A small fixed memory area (64 MiB by default) is sufficient to
   handle the few cases which require a bounce buffer.
3) 64 MiB is little enough that it has no impact on the rest of the
   system.
4) Bounce buffers require large contiguous chunks of low memory. Such
   memory is precious and can be allocated only early at boot.

It turns out that these assumptions are not always true:

1) Embedded systems may have more than 4 GiB RAM but no IOMMU and legacy
   32-bit peripheral busses and/or DMA controllers.
2) CoCo VMs use bounce buffers for all I/O but may need substantially more
   than 64 MiB.
3) Embedded developers put as many features as possible into the available
   memory. A few dozen "missing" megabytes may limit what features can be
   implemented.
4) If CMA is available, it can allocate large contiguous chunks even after
   the system has run for some time.

Goals
=====

The goal of this work is to start with a small software IO TLB at boot and
expand it later when/if needed.

Design
======

This version of the patch series retains the current slot allocation
algorithm with multiple areas to reduce lock contention, but additional
slots can be added when necessary.

These alternatives have been considered:

- Allocate and free buffers as needed using the direct DMA API. This works
  quite well, except in CoCo VMs, where each allocation/free requires
  decrypting/encrypting memory, which is a very expensive operation.

- Allocate a very large software IO TLB at boot, but allow migrating pages
  to/from it (like CMA does). For systems with CMA, this would mean two big
  allocations at boot. Finding the balance between CMA, SWIOTLB and the
  rest of the available RAM can be challenging. More importantly, there is
  no clear benefit compared to allocating SWIOTLB memory pools from the CMA.

Implementation Constraints
==========================

These constraints have been taken into account:

1) Minimize impact on devices which do not benefit from the change.
2) Minimize the number of memory decryption/encryption operations.
3) Avoid contention on a lock or atomic variable to preserve parallel
   scalability.

Additionally, the software IO TLB code is also used to implement restricted
DMA pools. These pools are restricted to a pre-defined physical memory
region and must not use any other memory. In other words, dynamic
allocation of memory pools must be disabled for restricted DMA pools.

Data Structures
===============

The existing struct io_tlb_mem is the central type for a SWIOTLB allocator,
but it now contains multiple memory pools::

   io_tlb_mem
  +---------+   io_tlb_pool
  | SWIOTLB |   +-------+   +-------+   +-------+
  |allocator|-->|default|-->|dynamic|-->|dynamic|-->...
  |         |   |memory |   |memory |   |memory |
  +---------+   | pool  |   | pool  |   | pool  |
                +-------+   +-------+   +-------+

The allocator structure contains global state (such as flags and counters)
and structures needed to schedule new allocations. Each memory pool
contains the actual buffer slots and metadata. The first memory pool in the
list is the default memory pool allocated statically at early boot.
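
For orientation, the relationship looks roughly like this in C (a trimmed
sketch only; the authoritative definitions are introduced in patches 3/7
and 6/7)::

	struct io_tlb_pool {
		phys_addr_t start;		/* physical range covered... */
		phys_addr_t end;		/* ...by this pool */
		struct io_tlb_area *areas;	/* areas reduce lock contention */
		struct io_tlb_slot *slots;	/* per-slot metadata */
		struct list_head node;		/* linkage in the list of pools */
		/* ... */
	};

	struct io_tlb_mem {
		struct list_head pools;		/* default + dynamic pools */
		unsigned long nslabs;		/* total slabs in all pools */
		u64 phys_limit;			/* max physical address for new pools */
		struct work_struct dyn_alloc;	/* worker that grows the swiotlb */
		/* ... */
	};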

New memory pools are allocated from a kernel worker thread. That's because
bounce buffers are allocated when mapping a DMA buffer, which may happen in
interrupt context where large atomic allocations would probably fail.
Allocation from process context is much more likely to succeed, especially
if it can use CMA.
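
Condensed from patch 6/7 (error handling and the memory barrier are left
out here), the mapping path only schedules the work, and the allocation
itself then runs in process context::

	static void swiotlb_dyn_alloc(struct work_struct *work)
	{
		struct io_tlb_mem *mem =
			container_of(work, struct io_tlb_mem, dyn_alloc);
		struct io_tlb_pool *pool;

		/* Process context: GFP_KERNEL (and CMA) can be used here. */
		pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
					  default_nareas, mem->phys_limit,
					  GFP_KERNEL);
		if (pool)
			add_mem_pool(mem, pool);
	}

	/* In swiotlb_find_slots(), when all existing pools are full: */
	schedule_work(&mem->dyn_alloc);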

Nonetheless, the onset of a load spike may fill up the SWIOTLB before the
worker has a chance to run. In that case, try to allocate a small transient
memory pool to accommodate the request. If memory is encrypted and the
device cannot do DMA to encrypted memory, this buffer is allocated from the
coherent atomic DMA memory pool. Reducing the size of SWIOTLB may therefore
require increasing the size of the coherent pool with the "coherent_pool"
command-line parameter.
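
For illustration only (the sizes below are made up, not a recommendation),
a kernel that boots with a deliberately small SWIOTLB might therefore also
enlarge the atomic coherent pool on the command line::

	swiotlb=2048 coherent_pool=2M

Here "swiotlb=2048" caps the initial bounce buffer at 2048 slots (2 KiB
each, i.e. 4 MiB), and "coherent_pool=2M" reserves more atomic DMA memory
for transient pools allocated under load.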

Performance
===========

All testing compared a vanilla v6.4-rc6 kernel with a fully patched
kernel. The kernel was booted with "swiotlb=force" to allow stress-testing
the software IO TLB on a high-performance device that would otherwise not
need it. CONFIG_DEBUG_FS was set to 'y' to match the configuration of
popular distribution kernels; it is understood that parallel workloads
suffer from contention on the recently added debugfs atomic counters.

These benchmarks were run:

- small: single-threaded I/O of 4 KiB blocks,
- big: single-threaded I/O of 64 KiB blocks,
- 4way: 4-way parallel I/O of 4 KiB blocks.

In all tested cases, the default 64 MiB SWIOTLB would be sufficient (but
wasteful). The "default" pair of columns shows performance impact when
booted with 64 MiB SWIOTLB (i.e. current state). The "growing" pair of
columns shows the impact when booted with a 1 MiB initial SWIOTLB, which
grew to 5 MiB at run time. The "var" column in the tables below is the
coefficient of variation over 5 runs of the test, and the "diff" column is
the difference in read-write I/O bandwidth (MiB/s). The very first column
is the coefficient of variation in the results of the base, unpatched
kernel.

First, on an x86 VM against a QEMU virtio SATA driver backed by a RAM-based
block device on the host:

        base      default           growing
        var       var      diff     var      diff
small   1.96%     0.47%    -1.5%    0.52%    -2.2%
big     2.03%     1.35%    +0.9%    2.22%    +2.9%
4way    0.80%     0.45%    -0.7%    1.22%    <0.1%

Second, on a Raspberry Pi 4 with 8 GB of RAM and a class 10 A1 microSD card:

        base      default           growing
        var       var      diff     var      diff
small   1.09%     1.69%    +0.5%    2.14%    -0.2%
big     0.03%     0.28%    -0.5%    0.03%    -0.1%
4way    5.15%     2.39%    +0.2%    0.66%    <0.1%

Third, on a CoCo VM. This was a bigger system, so I also added a 24-thread
parallel I/O test:

        base      default           growing
        var       var      diff     var      diff
small   2.41%     6.02%    +1.1%    10.33%   +6.7%
big     9.20%     2.81%    -0.6%    16.84%   -0.2%
4way    0.86%     2.66%    -0.1%    2.22%    -4.9%
24way   3.19%     6.19%    +4.4%    4.08%    -5.9%

Note the increased variance of the CoCo VM results, even though the host
was not otherwise loaded. The outliers are caused by the first run, which
includes the overhead of allocating additional bounce buffers and sharing
them with the hypervisor. The system was not rebooted between successive
runs.

Parallel tests suffer from a reduced number of areas in the dynamically
allocated memory pools. This can be improved by allocating a larger pool
from CMA (not implemented in this series yet).

I have no good explanation for the increase in performance of the
24-thread I/O test with the default (non-growing) memory pool. Although the
difference is within variance, it seems to be real. The average bandwidth
is consistently above that of the unpatched kernel.

To sum it up:

- All workloads benefit from the reduced memory footprint.
- No performance regressions have been observed with the default size of
  the software IO TLB.
- Most workloads retain their former performance even if the software IO
  TLB grows at run time.

Changelog
=========

Changes from v2:
- Complete rewrite using dynamically allocated memory pools rather
  than a list of individual buffers
- Depend on other SWIOTLB fixes (already sent)
- Fix Xen and MIPS Octeon builds

Changes from RFC:
- Track dynamic buffers per device instead of per swiotlb
- Use a linked list instead of a maple tree
- Move initialization of swiotlb fields of struct device to a
  helper function
- Rename __lookup_dyn_slot() to lookup_dyn_slot_locked()
- Introduce per-device flag if dynamic buffers are in use
- Add one more user of DMA_ATTR_MAY_SLEEP
- Add kernel-doc comments for new (and some old) code
- Properly escape '*' in dma-attributes.rst

Petr Tesarik (7):
  swiotlb: make io_tlb_default_mem local to swiotlb.c
  swiotlb: add documentation and rename swiotlb_do_find_slots()
  swiotlb: separate memory pool data from other allocator data
  swiotlb: if swiotlb is full, fall back to a transient memory pool
  swiotlb: determine potential physical address limit
  swiotlb: allocate a new memory pool when existing pools are full
  swiotlb: search the software IO TLB only if a device makes use of it

 arch/arm/xen/mm.c           |   2 +-
 arch/mips/pci/pci-octeon.c  |   2 +-
 arch/x86/kernel/pci-dma.c   |   2 +-
 drivers/base/core.c         |   4 +-
 drivers/xen/swiotlb-xen.c   |   2 +-
 include/linux/device.h      |   8 +-
 include/linux/dma-mapping.h |   2 +
 include/linux/swiotlb.h     | 103 +++++--
 kernel/dma/direct.c         |   2 +-
 kernel/dma/swiotlb.c        | 591 ++++++++++++++++++++++++++++++++----
 10 files changed, 618 insertions(+), 100 deletions(-)

--
2.25.1



2023-06-27 10:13:43

by Petr Tesarik

Subject: [PATCH v3 2/7] swiotlb: add documentation and rename swiotlb_do_find_slots()

From: Petr Tesarik <[email protected]>

Add some kernel-doc comments and move the existing documentation of struct
io_tlb_slot to its correct location. The latter was forgotten in commit
942a8186eb445 ("swiotlb: move struct io_tlb_slot to swiotlb.c").

Use the opportunity to give swiotlb_do_find_slots() a more descriptive
name, which makes it clear how it differs from swiotlb_find_slots().

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/swiotlb.h | 15 ++++++++----
kernel/dma/swiotlb.c | 52 ++++++++++++++++++++++++++++++++++++-----
2 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b38045be532d..9f486843a66a 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -77,10 +77,6 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
* @end. For default swiotlb, this is command line adjustable via
* setup_io_tlb_npages.
* @used: The number of used IO TLB block.
- * @list: The free list describing the number of free entries available
- * from each index.
- * @orig_addr: The original address corresponding to a mapped entry.
- * @alloc_size: Size of the allocated buffer.
* @debugfs: The dentry to debugfs.
* @late_alloc: %true if allocated using the page allocator
* @force_bounce: %true if swiotlb bouncing is forced
@@ -113,6 +109,17 @@ struct io_tlb_mem {
#endif
};

+/**
+ * is_swiotlb_buffer() - check if a physical address belongs to a swiotlb
+ * @dev: Device which has mapped the buffer.
+ * @paddr: Physical address within the DMA buffer.
+ *
+ * Check if @paddr points into a bounce buffer.
+ *
+ * Return:
+ * * %true if @paddr points into a bounce buffer
+ * * %false otherwise
+ */
static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 63d318059f27..888c25a9bfcc 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -62,6 +62,13 @@

#define INVALID_PHYS_ADDR (~(phys_addr_t)0)

+/**
+ * struct io_tlb_slot - IO TLB slot descriptor
+ * @orig_addr: The original address corresponding to a mapped entry.
+ * @alloc_size: Size of the allocated buffer.
+ * @list: The free list describing the number of free entries available
+ * from each index.
+ */
struct io_tlb_slot {
phys_addr_t orig_addr;
size_t alloc_size;
@@ -632,11 +639,22 @@ static void dec_used(struct io_tlb_mem *mem, unsigned int nslots)
}
#endif /* CONFIG_DEBUG_FS */

-/*
- * Find a suitable number of IO TLB entries size that will fit this request and
- * allocate a buffer from that IO TLB pool.
+/**
+ * area_find_slots() - search for slots in one IO TLB memory area
+ * @dev: Device which maps the buffer.
+ * @area_index: Index of the IO TLB memory area to be searched.
+ * @orig_addr: Original (non-bounced) IO buffer address.
+ * @alloc_size: Total requested size of the bounce buffer,
+ * including initial alignment padding.
+ * @alloc_align_mask: Required alignment of the allocated buffer.
+ *
+ * Find a suitable sequence of IO TLB entries for the request and allocate
+ * a buffer from the given IO TLB memory area.
+ * This function takes care of locking.
+ *
+ * Return: Index of the first allocated slot, or -1 on error.
*/
-static int swiotlb_do_find_slots(struct device *dev, int area_index,
+static int area_find_slots(struct device *dev, int area_index,
phys_addr_t orig_addr, size_t alloc_size,
unsigned int alloc_align_mask)
{
@@ -731,6 +749,19 @@ static int swiotlb_do_find_slots(struct device *dev, int area_index,
return slot_index;
}

+/**
+ * swiotlb_find_slots() - search for slots in the whole swiotlb
+ * @dev: Device which maps the buffer.
+ * @orig_addr: Original (non-bounced) IO buffer address.
+ * @alloc_size: Total requested size of the bounce buffer,
+ * including initial alignment padding.
+ * @alloc_align_mask: Required alignment of the allocated buffer.
+ *
+ * Search through the whole software IO TLB to find a sequence of slots that
+ * match the allocation constraints.
+ *
+ * Return: Index of the first allocated slot, or -1 on error.
+ */
static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
size_t alloc_size, unsigned int alloc_align_mask)
{
@@ -739,8 +770,8 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
int i = start, index;

do {
- index = swiotlb_do_find_slots(dev, i, orig_addr, alloc_size,
- alloc_align_mask);
+ index = area_find_slots(dev, i, orig_addr, alloc_size,
+ alloc_align_mask);
if (index >= 0)
return index;
if (++i >= mem->nareas)
@@ -750,6 +781,15 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
return -1;
}

+/**
+ * mem_used() - get number of used slots in an allocator
+ * @mem: Software IO TLB allocator.
+ *
+ * The result is not accurate, because there is no locking of individual
+ * areas.
+ *
+ * Return: Approximate number of used slots.
+ */
static unsigned long mem_used(struct io_tlb_mem *mem)
{
int i;
--
2.25.1


2023-06-27 10:14:18

by Petr Tesarik

Subject: [PATCH v3 5/7] swiotlb: determine potential physical address limit

From: Petr Tesarik <[email protected]>

The value returned by default_swiotlb_limit() should not change, because it
is used to decide whether DMA can be used. To allow allocating memory pools
on the fly, use the maximum possible physical address rather than the
highest address used by the default pool.

For swiotlb_init_remap(), this is either an arch-specific limit used by
memblock_alloc_low(), or the highest directly mapped physical address if
the initialization flags include SWIOTLB_ANY. For swiotlb_init_late(), the
highest address is determined by the GFP flags.

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/swiotlb.h | 2 ++
kernel/dma/swiotlb.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index ae1688438850..4a3af1c216d0 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -105,6 +105,7 @@ struct io_tlb_pool {
* struct io_tlb_mem - Software IO TLB allocator
* @pool: IO TLB memory pool descriptor.
* @nslabs: Total number of IO TLB slabs in all pools.
+ * @phys_limit: Maximum allowed physical address.
* @debugfs: The dentry to debugfs.
* @force_bounce: %true if swiotlb bouncing is forced
* @for_alloc: %true if the pool is used for memory allocation
@@ -117,6 +118,7 @@ struct io_tlb_pool {
struct io_tlb_mem {
struct io_tlb_pool *pool;
unsigned long nslabs;
+ u64 phys_limit;
struct dentry *debugfs;
bool force_bounce;
bool for_alloc;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 06b4fa7c2e9b..5bb83097ade6 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -333,6 +333,9 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,

io_tlb_default_mem.force_bounce =
swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
+ io_tlb_default_mem.phys_limit = flags & SWIOTLB_ANY
+ ? virt_to_phys(high_memory - 1)
+ : ARCH_LOW_ADDRESS_LIMIT;

if (!default_nareas)
swiotlb_adjust_nareas(num_possible_cpus());
@@ -400,6 +403,12 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
return 0;

io_tlb_default_mem.force_bounce = swiotlb_force_bounce;
+ io_tlb_default_mem.phys_limit =
+ IS_ENABLED(CONFIG_ZONE_DMA) && (gfp_mask & __GFP_DMA)
+ ? DMA_BIT_MASK(zone_dma_bits)
+ : (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp_mask & __GFP_DMA32)
+ ? DMA_BIT_MASK(32)
+ : virt_to_phys(high_memory - 1));

if (!default_nareas)
swiotlb_adjust_nareas(num_possible_cpus());
@@ -1308,7 +1317,7 @@ phys_addr_t default_swiotlb_start(void)
*/
phys_addr_t default_swiotlb_limit(void)
{
- return io_tlb_default_pool.end - 1;
+ return io_tlb_default_mem.phys_limit;
}

#ifdef CONFIG_DEBUG_FS
--
2.25.1


2023-06-27 10:14:44

by Petr Tesarik

Subject: [PATCH v3 7/7] swiotlb: search the software IO TLB only if a device makes use of it

From: Petr Tesarik <[email protected]>

Skip searching the software IO TLB if a device has never used it, making
sure these devices are not affected by the introduction of multiple IO TLB
memory pools.

An additional memory barrier is required to ensure that the new value of the
flag is visible to other CPUs after mapping a new bounce buffer. For
efficiency, the flag check should be inlined, and then the memory barrier
must be moved to is_swiotlb_buffer(). However, it can replace the existing
barrier in swiotlb_find_pool(), because all callers use is_swiotlb_buffer()
first to verify that the buffer address belongs to the software IO TLB.

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/device.h | 2 ++
include/linux/swiotlb.h | 6 +++++-
kernel/dma/swiotlb.c | 14 ++++++--------
3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index a1ee4c5924b8..d4b35925a6d1 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -512,6 +512,7 @@ struct device_physical_location {
* @dma_io_tlb_mem: Software IO TLB allocator. Not for driver use.
* @dma_io_tlb_pools: List of transient swiotlb memory pools.
* @dma_io_tlb_lock: Protects changes to the list of active pools.
+ * @dma_uses_io_tlb: %true if device has used the software IO TLB.
* @archdata: For arch-specific additions.
* @of_node: Associated device tree node.
* @fwnode: Associated device node supplied by platform firmware.
@@ -619,6 +620,7 @@ struct device {
struct io_tlb_mem *dma_io_tlb_mem;
struct list_head dma_io_tlb_pools;
spinlock_t dma_io_tlb_lock;
+ bool dma_uses_io_tlb;
#endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index ae402890ba41..eb30b5c624f1 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -148,7 +148,11 @@ struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr);
*/
static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
- return dev->dma_io_tlb_mem &&
+ /* Pairs with smp_wmb() in swiotlb_find_slots() and
+ * swiotlb_dyn_alloc(), which modify the RCU lists.
+ */
+ smp_rmb();
+ return dev->dma_uses_io_tlb &&
!!swiotlb_find_pool(dev, paddr);
}

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 7661a6402e80..26ab9ed2921b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -701,7 +701,7 @@ static void swiotlb_dyn_alloc(struct work_struct *work)

add_mem_pool(mem, pool);

- /* Pairs with smp_rmb() in swiotlb_find_pool(). */
+ /* Pairs with smp_rmb() in is_swiotlb_buffer(). */
smp_wmb();
}

@@ -729,6 +729,7 @@ void swiotlb_dev_init(struct device *dev)
dev->dma_io_tlb_mem = &io_tlb_default_mem;
INIT_LIST_HEAD(&dev->dma_io_tlb_pools);
spin_lock_init(&dev->dma_io_tlb_lock);
+ dev->dma_uses_io_tlb = false;
}

/**
@@ -746,11 +747,6 @@ struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr)
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
struct io_tlb_pool *pool;

- /* Pairs with smp_wmb() in swiotlb_find_slots() and
- * swiotlb_dyn_alloc(), which modify the RCU lists.
- */
- smp_rmb();
-
rcu_read_lock();
list_for_each_entry_rcu(pool, &mem->pools, node) {
if (paddr >= pool->start && paddr < pool->end)
@@ -1125,9 +1121,11 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
list_add_rcu(&pool->node, &dev->dma_io_tlb_pools);
spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);

- /* Pairs with smp_rmb() in swiotlb_find_pool(). */
- smp_wmb();
found:
+ dev->dma_uses_io_tlb = true;
+ /* Pairs with smp_rmb() in is_swiotlb_buffer() */
+ smp_wmb();
+
*retpool = pool;
return index;
}
--
2.25.1


2023-06-27 10:17:05

by Petr Tesarik

Subject: [PATCH v3 3/7] swiotlb: separate memory pool data from other allocator data

From: Petr Tesarik <[email protected]>

Carve out memory pool specific fields from struct io_tlb_mem. The original
struct now contains shared data for the whole allocator, while the new
struct io_tlb_pool contains data that is specific to one memory pool of
(potentially) many.

Allocate both structures together for restricted DMA pools to keep the
error cleanup path simple.

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/device.h | 2 +-
include/linux/swiotlb.h | 49 ++++++-----
kernel/dma/swiotlb.c | 181 +++++++++++++++++++++++++---------------
3 files changed, 147 insertions(+), 85 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 472dd24d4823..83081aa99e6a 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -509,7 +509,7 @@ struct device_physical_location {
* @dma_pools: Dma pools (if dma'ble device).
* @dma_mem: Internal for coherent mem override.
* @cma_area: Contiguous memory area for dma allocations
- * @dma_io_tlb_mem: Pointer to the swiotlb pool used. Not for driver use.
+ * @dma_io_tlb_mem: Software IO TLB allocator. Not for driver use.
* @archdata: For arch-specific additions.
* @of_node: Associated device tree node.
* @fwnode: Associated device node supplied by platform firmware.
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 9f486843a66a..0aa6868cb68c 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -62,8 +62,7 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
#ifdef CONFIG_SWIOTLB

/**
- * struct io_tlb_mem - IO TLB Memory Pool Descriptor
- *
+ * struct io_tlb_pool - IO TLB memory pool descriptor
* @start: The start address of the swiotlb memory pool. Used to do a quick
* range check to see if the memory was in fact allocated by this
* API.
@@ -73,16 +72,36 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
* @vaddr: The vaddr of the swiotlb memory pool. The swiotlb memory pool
* may be remapped in the memory encrypted case and store virtual
* address for bounce buffer operation.
- * @nslabs: The number of IO TLB blocks (in groups of 64) between @start and
- * @end. For default swiotlb, this is command line adjustable via
- * setup_io_tlb_npages.
- * @used: The number of used IO TLB block.
+ * @nslabs: The number of IO TLB slots between @start and @end. For the
+ * default swiotlb, this can be adjusted with a boot parameter,
+ * see setup_io_tlb_npages().
+ * @used: The number of used IO TLB slots.
+ * @late_alloc: %true if allocated using the page allocator.
+ * @nareas: Number of areas in the pool.
+ * @area_nslabs: Number of slots in each area.
+ * @areas: Array of memory area descriptors.
+ * @slots: Array of slot descriptors.
+ */
+struct io_tlb_pool {
+ phys_addr_t start;
+ phys_addr_t end;
+ void *vaddr;
+ unsigned long nslabs;
+ unsigned long used;
+ bool late_alloc;
+ unsigned int nareas;
+ unsigned int area_nslabs;
+ struct io_tlb_area *areas;
+ struct io_tlb_slot *slots;
+};
+
+/**
+ * struct io_tlb_mem - Software IO TLB allocator
+ * @pool: IO TLB memory pool descriptor.
+ * @nslabs: Total number of IO TLB slabs in all pools.
* @debugfs: The dentry to debugfs.
- * @late_alloc: %true if allocated using the page allocator
* @force_bounce: %true if swiotlb bouncing is forced
* @for_alloc: %true if the pool is used for memory allocation
- * @nareas: The area number in the pool.
- * @area_nslabs: The slot number in the area.
* @total_used: The total number of slots in the pool that are currently used
* across all areas. Used only for calculating used_hiwater in
* debugfs.
@@ -90,19 +109,11 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
* in debugfs.
*/
struct io_tlb_mem {
- phys_addr_t start;
- phys_addr_t end;
- void *vaddr;
+ struct io_tlb_pool *pool;
unsigned long nslabs;
- unsigned long used;
struct dentry *debugfs;
- bool late_alloc;
bool force_bounce;
bool for_alloc;
- unsigned int nareas;
- unsigned int area_nslabs;
- struct io_tlb_area *areas;
- struct io_tlb_slot *slots;
#ifdef CONFIG_DEBUG_FS
atomic_long_t total_used;
atomic_long_t used_hiwater;
@@ -124,7 +135,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;

- return mem && paddr >= mem->start && paddr < mem->end;
+ return mem && paddr >= mem->pool->start && paddr < mem->pool->end;
}

static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 888c25a9bfcc..4c5de91bda57 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -78,7 +78,10 @@ struct io_tlb_slot {
static bool swiotlb_force_bounce;
static bool swiotlb_force_disable;

-static struct io_tlb_mem io_tlb_default_mem;
+static struct io_tlb_pool io_tlb_default_pool;
+static struct io_tlb_mem io_tlb_default_mem = {
+ .pool = &io_tlb_default_pool,
+};

static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
static unsigned long default_nareas;
@@ -209,7 +212,7 @@ void __init swiotlb_adjust_size(unsigned long size)

void swiotlb_print_info(void)
{
- struct io_tlb_mem *mem = &io_tlb_default_mem;
+ struct io_tlb_pool *mem = &io_tlb_default_pool;

if (!mem->nslabs) {
pr_warn("No low mem\n");
@@ -238,7 +241,7 @@ static inline unsigned long nr_slots(u64 val)
*/
void __init swiotlb_update_mem_attributes(void)
{
- struct io_tlb_mem *mem = &io_tlb_default_mem;
+ struct io_tlb_pool *mem = &io_tlb_default_pool;
unsigned long bytes;

if (!mem->nslabs || mem->late_alloc)
@@ -247,9 +250,8 @@ void __init swiotlb_update_mem_attributes(void)
set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
}

-static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
- unsigned long nslabs, unsigned int flags,
- bool late_alloc, unsigned int nareas)
+static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
+ unsigned long nslabs, bool late_alloc, unsigned int nareas)
{
void *vaddr = phys_to_virt(start);
unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
@@ -261,8 +263,6 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
mem->nareas = nareas;
mem->area_nslabs = nslabs / mem->nareas;

- mem->force_bounce = swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
-
for (i = 0; i < mem->nareas; i++) {
spin_lock_init(&mem->areas[i].lock);
mem->areas[i].index = 0;
@@ -319,7 +319,7 @@ static void __init *swiotlb_memblock_alloc(unsigned long nslabs,
void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
int (*remap)(void *tlb, unsigned long nslabs))
{
- struct io_tlb_mem *mem = &io_tlb_default_mem;
+ struct io_tlb_pool *mem = &io_tlb_default_pool;
unsigned long nslabs;
unsigned int nareas;
size_t alloc_size;
@@ -330,6 +330,9 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
if (swiotlb_force_disable)
return;

+ io_tlb_default_mem.force_bounce =
+ swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
+
if (!default_nareas)
swiotlb_adjust_nareas(num_possible_cpus());

@@ -363,8 +366,9 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
return;
}

- swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, flags, false,
- default_nareas);
+ swiotlb_init_io_tlb_pool(mem, __pa(tlb), nslabs, false,
+ default_nareas);
+ io_tlb_default_mem.nslabs = nslabs;

if (flags & SWIOTLB_VERBOSE)
swiotlb_print_info();
@@ -383,7 +387,7 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
int swiotlb_init_late(size_t size, gfp_t gfp_mask,
int (*remap)(void *tlb, unsigned long nslabs))
{
- struct io_tlb_mem *mem = &io_tlb_default_mem;
+ struct io_tlb_pool *mem = &io_tlb_default_pool;
unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
unsigned int nareas;
unsigned char *vstart = NULL;
@@ -394,6 +398,8 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
if (swiotlb_force_disable)
return 0;

+ io_tlb_default_mem.force_bounce = swiotlb_force_bounce;
+
if (!default_nareas)
swiotlb_adjust_nareas(num_possible_cpus());

@@ -445,8 +451,9 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,

set_memory_decrypted((unsigned long)vstart,
(nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
- swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, 0, true,
- nareas);
+ swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
+ nareas);
+ io_tlb_default_mem.nslabs = nslabs;

swiotlb_print_info();
return 0;
@@ -460,7 +467,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,

void __init swiotlb_exit(void)
{
- struct io_tlb_mem *mem = &io_tlb_default_mem;
+ struct io_tlb_pool *mem = &io_tlb_default_pool;
unsigned long tbl_vaddr;
size_t tbl_size, slots_size;
unsigned int area_order;
@@ -516,7 +523,7 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
enum dma_data_direction dir)
{
- struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_pool *mem = dev->dma_io_tlb_mem->pool;
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
@@ -598,7 +605,7 @@ static inline unsigned long get_max_slots(unsigned long boundary_mask)
return nr_slots(boundary_mask + 1);
}

-static unsigned int wrap_area_index(struct io_tlb_mem *mem, unsigned int index)
+static unsigned int wrap_area_index(struct io_tlb_pool *mem, unsigned int index)
{
if (index >= mem->area_nslabs)
return 0;
@@ -642,6 +649,7 @@ static void dec_used(struct io_tlb_mem *mem, unsigned int nslots)
/**
* area_find_slots() - search for slots in one IO TLB memory area
* @dev: Device which maps the buffer.
+ * @pool: Memory pool to be searched.
* @area_index: Index of the IO TLB memory area to be searched.
* @orig_addr: Original (non-bounced) IO buffer address.
* @alloc_size: Total requested size of the bounce buffer,
@@ -654,15 +662,14 @@ static void dec_used(struct io_tlb_mem *mem, unsigned int nslots)
*
* Return: Index of the first allocated slot, or -1 on error.
*/
-static int area_find_slots(struct device *dev, int area_index,
- phys_addr_t orig_addr, size_t alloc_size,
+static int area_find_slots(struct device *dev, struct io_tlb_pool *pool,
+ int area_index, phys_addr_t orig_addr, size_t alloc_size,
unsigned int alloc_align_mask)
{
- struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- struct io_tlb_area *area = mem->areas + area_index;
+ struct io_tlb_area *area = pool->areas + area_index;
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
- phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
+ phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
unsigned long max_slots = get_max_slots(boundary_mask);
unsigned int iotlb_align_mask =
dma_get_min_align_mask(dev) | alloc_align_mask;
@@ -674,7 +681,7 @@ static int area_find_slots(struct device *dev, int area_index,
unsigned int slot_index;

BUG_ON(!nslots);
- BUG_ON(area_index >= mem->nareas);
+ BUG_ON(area_index >= pool->nareas);

/*
* For allocations of PAGE_SIZE or larger only look for page aligned
@@ -691,19 +698,19 @@ static int area_find_slots(struct device *dev, int area_index,
stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;

spin_lock_irqsave(&area->lock, flags);
- if (unlikely(nslots > mem->area_nslabs - area->used))
+ if (unlikely(nslots > pool->area_nslabs - area->used))
goto not_found;

- slot_base = area_index * mem->area_nslabs;
+ slot_base = area_index * pool->area_nslabs;
index = area->index;

- for (slots_checked = 0; slots_checked < mem->area_nslabs; ) {
+ for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
slot_index = slot_base + index;

if (orig_addr &&
(slot_addr(tbl_dma_addr, slot_index) &
iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
- index = wrap_area_index(mem, index + 1);
+ index = wrap_area_index(pool, index + 1);
slots_checked++;
continue;
}
@@ -716,10 +723,10 @@ static int area_find_slots(struct device *dev, int area_index,
if (!iommu_is_span_boundary(slot_index, nslots,
nr_slots(tbl_dma_addr),
max_slots)) {
- if (mem->slots[slot_index].list >= nslots)
+ if (pool->slots[slot_index].list >= nslots)
goto found;
}
- index = wrap_area_index(mem, index + stride);
+ index = wrap_area_index(pool, index + stride);
slots_checked += stride;
}

@@ -729,58 +736,97 @@ static int area_find_slots(struct device *dev, int area_index,

found:
for (i = slot_index; i < slot_index + nslots; i++) {
- mem->slots[i].list = 0;
- mem->slots[i].alloc_size = alloc_size - (offset +
+ pool->slots[i].list = 0;
+ pool->slots[i].alloc_size = alloc_size - (offset +
((i - slot_index) << IO_TLB_SHIFT));
}
for (i = slot_index - 1;
io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
- mem->slots[i].list; i--)
- mem->slots[i].list = ++count;
+ pool->slots[i].list; i--)
+ pool->slots[i].list = ++count;

/*
* Update the indices to avoid searching in the next round.
*/
- area->index = wrap_area_index(mem, index + nslots);
+ area->index = wrap_area_index(pool, index + nslots);
area->used += nslots;
spin_unlock_irqrestore(&area->lock, flags);

- inc_used_and_hiwater(mem, nslots);
+ inc_used_and_hiwater(dev->dma_io_tlb_mem, nslots);
return slot_index;
}

/**
- * swiotlb_find_slots() - search for slots in the whole swiotlb
+ * pool_find_slots() - search for slots in one memory pool
* @dev: Device which maps the buffer.
+ * @pool: Memory pool to be searched.
* @orig_addr: Original (non-bounced) IO buffer address.
* @alloc_size: Total requested size of the bounce buffer,
* including initial alignment padding.
* @alloc_align_mask: Required alignment of the allocated buffer.
*
- * Search through the whole software IO TLB to find a sequence of slots that
- * match the allocation constraints.
+ * Search through one memory pool to find a sequence of slots that match the
+ * allocation constraints.
*
* Return: Index of the first allocated slot, or -1 on error.
*/
-static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
- size_t alloc_size, unsigned int alloc_align_mask)
+static int pool_find_slots(struct device *dev, struct io_tlb_pool *pool,
+ phys_addr_t orig_addr, size_t alloc_size,
+ unsigned int alloc_align_mask)
{
- struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- int start = raw_smp_processor_id() & (mem->nareas - 1);
+ int start = raw_smp_processor_id() & (pool->nareas - 1);
int i = start, index;

do {
- index = area_find_slots(dev, i, orig_addr, alloc_size,
+ index = area_find_slots(dev, pool, i, orig_addr, alloc_size,
alloc_align_mask);
if (index >= 0)
return index;
- if (++i >= mem->nareas)
+ if (++i >= pool->nareas)
i = 0;
} while (i != start);

return -1;
}

+/**
+ * swiotlb_find_slots() - search for slots in the whole swiotlb
+ * @dev: Device which maps the buffer.
+ * @orig_addr: Original (non-bounced) IO buffer address.
+ * @alloc_size: Total requested size of the bounce buffer,
+ * including initial alignment padding.
+ * @alloc_align_mask: Required alignment of the allocated buffer.
+ *
+ * Search through the whole software IO TLB to find a sequence of slots that
+ * match the allocation constraints.
+ *
+ * Return: Index of the first allocated slot, or -1 on error.
+ */
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+ size_t alloc_size, unsigned int alloc_align_mask)
+{
+ return pool_find_slots(dev, dev->dma_io_tlb_mem->pool, orig_addr,
+ alloc_size, alloc_align_mask);
+}
+
+/**
+ * mem_pool_used() - get number of used slots in a memory pool
+ * @pool: Software IO TLB memory pool.
+ *
+ * The result is not accurate, see mem_used().
+ *
+ * Return: Approximate number of used slots.
+ */
+static unsigned long mem_pool_used(struct io_tlb_pool *pool)
+{
+ int i;
+ unsigned long used = 0;
+
+ for (i = 0; i < pool->nareas; i++)
+ used += pool->areas[i].used;
+ return used;
+}
+
/**
* mem_used() - get number of used slots in an allocator
* @mem: Software IO TLB allocator.
@@ -792,12 +838,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
*/
static unsigned long mem_used(struct io_tlb_mem *mem)
{
- int i;
- unsigned long used = 0;
-
- for (i = 0; i < mem->nareas; i++)
- used += mem->areas[i].used;
- return used;
+ return mem_pool_used(mem->pool);
}

phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
@@ -807,6 +848,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
+ struct io_tlb_pool *pool;
unsigned int i;
int index;
phys_addr_t tlb_addr;
@@ -841,9 +883,10 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
* This is needed when we sync the memory. Then we sync the buffer if
* needed.
*/
+ pool = mem->pool;
for (i = 0; i < nr_slots(alloc_size + offset); i++)
- mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
- tlb_addr = slot_addr(mem->start, index) + offset;
+ pool->slots[index + i].orig_addr = slot_addr(orig_addr, i);
+ tlb_addr = slot_addr(pool->start, index) + offset;
/*
* When dir == DMA_FROM_DEVICE we could omit the copy from the orig
* to the tlb buffer, if we knew for sure the device will
@@ -857,7 +900,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,

static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
{
- struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_pool *mem = dev->dma_io_tlb_mem->pool;
unsigned long flags;
unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
@@ -901,7 +944,7 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
area->used -= nslots;
spin_unlock_irqrestore(&area->lock, flags);

- dec_used(mem, nslots);
+ dec_used(dev->dma_io_tlb_mem, nslots);
}

/*
@@ -1009,7 +1052,7 @@ EXPORT_SYMBOL_GPL(is_swiotlb_active);
*/
phys_addr_t default_swiotlb_start(void)
{
- return io_tlb_default_mem.start;
+ return io_tlb_default_pool.start;
}

/**
@@ -1019,7 +1062,7 @@ phys_addr_t default_swiotlb_start(void)
*/
phys_addr_t default_swiotlb_limit(void)
{
- return io_tlb_default_mem.end - 1;
+ return io_tlb_default_pool.end - 1;
}

#ifdef CONFIG_DEBUG_FS
@@ -1095,6 +1138,7 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
struct page *swiotlb_alloc(struct device *dev, size_t size)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_pool *pool;
phys_addr_t tlb_addr;
int index;

@@ -1105,7 +1149,8 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
if (index == -1)
return NULL;

- tlb_addr = slot_addr(mem->start, index);
+ pool = mem->pool;
+ tlb_addr = slot_addr(pool->start, index);

return pfn_to_page(PFN_DOWN(tlb_addr));
}
@@ -1142,29 +1187,35 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
* to it.
*/
if (!mem) {
- mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+ struct io_tlb_pool *pool;
+
+ mem = kzalloc(sizeof(*mem) + sizeof(*pool), GFP_KERNEL);
if (!mem)
return -ENOMEM;
+ pool = (void*)mem + sizeof(*mem);

- mem->slots = kcalloc(nslabs, sizeof(*mem->slots), GFP_KERNEL);
- if (!mem->slots) {
+ pool->slots = kcalloc(nslabs, sizeof(*pool->slots), GFP_KERNEL);
+ if (!pool->slots) {
kfree(mem);
return -ENOMEM;
}

- mem->areas = kcalloc(nareas, sizeof(*mem->areas),
+ pool->areas = kcalloc(nareas, sizeof(*pool->areas),
GFP_KERNEL);
- if (!mem->areas) {
- kfree(mem->slots);
+ if (!pool->areas) {
+ kfree(pool->slots);
kfree(mem);
return -ENOMEM;
}

set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
rmem->size >> PAGE_SHIFT);
- swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, SWIOTLB_FORCE,
- false, nareas);
+ swiotlb_init_io_tlb_pool(pool, rmem->base, nslabs,
+ false, nareas);
+ mem->force_bounce = true;
mem->for_alloc = true;
+ mem->pool = pool;
+ mem->nslabs = nslabs;

rmem->priv = mem;

--
2.25.1


2023-06-27 10:17:24

by Petr Tesarik

Subject: [PATCH v3 1/7] swiotlb: make io_tlb_default_mem local to swiotlb.c

From: Petr Tesarik <[email protected]>

SWIOTLB implementation details should not be exposed to the rest of the
kernel. This will allow making changes to the implementation without
modifying non-swiotlb code.

To avoid breaking existing users, provide helper functions for the few
required fields. Enhance is_swiotlb_active() to work with the default IO
TLB pool when passed a NULL device.

As a bonus, using a helper function to initialize struct device makes it
possible to get rid of an #ifdef in driver core.

Signed-off-by: Petr Tesarik <[email protected]>
---
arch/arm/xen/mm.c | 2 +-
arch/mips/pci/pci-octeon.c | 2 +-
arch/x86/kernel/pci-dma.c | 2 +-
drivers/base/core.c | 4 +---
drivers/xen/swiotlb-xen.c | 2 +-
include/linux/swiotlb.h | 17 ++++++++++++++++-
kernel/dma/swiotlb.c | 39 ++++++++++++++++++++++++++++++++++++--
7 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index 3d826c0b5fee..529aabc1be38 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -125,7 +125,7 @@ static int __init xen_mm_init(void)
return 0;

/* we can work with the default swiotlb */
- if (!io_tlb_default_mem.nslabs) {
+ if (!is_swiotlb_active(NULL)) {
rc = swiotlb_init_late(swiotlb_size_or_default(),
xen_swiotlb_gfp(), NULL);
if (rc < 0)
diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
index e457a18cbdc5..c5c4c1f7d5e4 100644
--- a/arch/mips/pci/pci-octeon.c
+++ b/arch/mips/pci/pci-octeon.c
@@ -664,7 +664,7 @@ static int __init octeon_pci_setup(void)

/* BAR1 movable regions contiguous to cover the swiotlb */
octeon_bar1_pci_phys =
- io_tlb_default_mem.start & ~((1ull << 22) - 1);
+ default_swiotlb_start() & ~((1ull << 22) - 1);

for (index = 0; index < 32; index++) {
union cvmx_pci_bar1_indexx bar1_index;
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index de6be0a3965e..33e960672837 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -90,7 +90,7 @@ int pci_xen_swiotlb_init_late(void)
return 0;

/* we can work with the default swiotlb */
- if (!io_tlb_default_mem.nslabs) {
+ if (!is_swiotlb_active(NULL)) {
int rc = swiotlb_init_late(swiotlb_size_or_default(),
GFP_KERNEL, xen_swiotlb_fixup);
if (rc < 0)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 3dff5037943e..46d1d78c5beb 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3108,9 +3108,7 @@ void device_initialize(struct device *dev)
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
dev->dma_coherent = dma_default_coherent;
#endif
-#ifdef CONFIG_SWIOTLB
- dev->dma_io_tlb_mem = &io_tlb_default_mem;
-#endif
+ swiotlb_dev_init(dev);
}
EXPORT_SYMBOL_GPL(device_initialize);

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 67aa74d20162..946bd56f0ac5 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -381,7 +381,7 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
static int
xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
{
- return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask;
+ return xen_phys_to_dma(hwdev, default_swiotlb_limit()) <= mask;
}

const struct dma_map_ops xen_swiotlb_dma_ops = {
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7af2673b47ba..b38045be532d 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -112,7 +112,6 @@ struct io_tlb_mem {
atomic_long_t used_hiwater;
#endif
};
-extern struct io_tlb_mem io_tlb_default_mem;

static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
@@ -130,13 +129,19 @@ static inline bool is_swiotlb_force_bounce(struct device *dev)

void swiotlb_init(bool addressing_limited, unsigned int flags);
void __init swiotlb_exit(void);
+void swiotlb_dev_init(struct device *dev);
size_t swiotlb_max_mapping_size(struct device *dev);
bool is_swiotlb_active(struct device *dev);
void __init swiotlb_adjust_size(unsigned long size);
+phys_addr_t default_swiotlb_start(void);
+phys_addr_t default_swiotlb_limit(void);
#else
static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
{
}
+static inline void swiotlb_dev_init(struct device *dev)
+{
+}
static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
return false;
@@ -161,6 +166,16 @@ static inline bool is_swiotlb_active(struct device *dev)
static inline void swiotlb_adjust_size(unsigned long size)
{
}
+
+static inline phys_addr_t default_swiotlb_start(void)
+{
+ return 0;
+}
+
+static inline phys_addr_t default_swiotlb_limit(void)
+{
+ return 0;
+}
#endif /* CONFIG_SWIOTLB */

extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 079df5ad38d0..63d318059f27 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -71,7 +71,7 @@ struct io_tlb_slot {
static bool swiotlb_force_bounce;
static bool swiotlb_force_disable;

-struct io_tlb_mem io_tlb_default_mem;
+static struct io_tlb_mem io_tlb_default_mem;

static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
static unsigned long default_nareas;
@@ -486,6 +486,15 @@ void __init swiotlb_exit(void)
memset(mem, 0, sizeof(*mem));
}

+/**
+ * swiotlb_dev_init() - initialize swiotlb fields in &struct device
+ * @dev: Device to be initialized.
+ */
+void swiotlb_dev_init(struct device *dev)
+{
+ dev->dma_io_tlb_mem = &io_tlb_default_mem;
+}
+
/*
* Return the offset into a iotlb slot required to keep the device happy.
*/
@@ -939,14 +948,40 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align;
}

+/**
+ * is_swiotlb_active() - check if the software IO TLB is initialized
+ * @dev: Device to check, or %NULL for the default IO TLB.
+ */
bool is_swiotlb_active(struct device *dev)
{
- struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_mem *mem = dev
+ ? dev->dma_io_tlb_mem
+ : &io_tlb_default_mem;

return mem && mem->nslabs;
}
EXPORT_SYMBOL_GPL(is_swiotlb_active);

+/**
+ * default_swiotlb_start() - get the start of the default SWIOTLB
+ *
+ * Get the lowest physical address used by the default software IO TLB pool.
+ */
+phys_addr_t default_swiotlb_start(void)
+{
+ return io_tlb_default_mem.start;
+}
+
+/**
+ * default_swiotlb_limit() - get the highest address in the default SWIOTLB
+ *
+ * Get the highest physical address used by the default software IO TLB pool.
+ */
+phys_addr_t default_swiotlb_limit(void)
+{
+ return io_tlb_default_mem.end - 1;
+}
+
#ifdef CONFIG_DEBUG_FS

static int io_tlb_used_get(void *data, u64 *val)
--
2.25.1


2023-06-27 10:18:02

by Petr Tesarik

Subject: [PATCH v3 6/7] swiotlb: allocate a new memory pool when existing pools are full

From: Petr Tesarik <[email protected]>

When swiotlb_find_slots() cannot find suitable slots, schedule the
allocation of a new memory pool. It is not possible to allocate the pool
immediately, because this code may run in interrupt context, which is not
suitable for large memory allocations. This means that the memory pool will
be available too late for the currently requested mapping, but the stress
on the software IO TLB allocator is likely to continue, and subsequent
allocations will benefit from the additional pool eventually.

Keep all memory pools for an allocator in an RCU list to avoid locking on
the read side. For modifications, add a new spinlock to struct io_tlb_mem.

The spinlock also protects updates to the total number of slabs (nslabs in
struct io_tlb_mem), but not reads of the value. Readers may therefore
encounter a stale value, but this is not an issue:

- swiotlb_tbl_map_single() and is_swiotlb_active() only check for non-zero
value. This is ensured by the existence of the default memory pool,
allocated at boot.

- The exact value is used only for non-critical purposes (debugfs, kernel
messages).

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/swiotlb.h | 9 ++-
kernel/dma/swiotlb.c | 141 +++++++++++++++++++++++++++++++---------
2 files changed, 119 insertions(+), 31 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 4a3af1c216d0..ae402890ba41 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -8,6 +8,7 @@
#include <linux/types.h>
#include <linux/limits.h>
#include <linux/spinlock.h>
+#include <linux/workqueue.h>

struct device;
struct page;
@@ -103,12 +104,14 @@ struct io_tlb_pool {

/**
* struct io_tlb_mem - Software IO TLB allocator
- * @pool: IO TLB memory pool descriptor.
+ * @lock: Lock to synchronize changes to the list.
+ * @pools: List of IO TLB memory pool descriptors.
* @nslabs: Total number of IO TLB slabs in all pools.
* @phys_limit: Maximum allowed physical address.
* @debugfs: The dentry to debugfs.
* @force_bounce: %true if swiotlb bouncing is forced
* @for_alloc: %true if the pool is used for memory allocation
+ * @dyn_alloc: Dynamic IO TLB pool allocation work.
* @total_used: The total number of slots in the pool that are currently used
* across all areas. Used only for calculating used_hiwater in
* debugfs.
@@ -116,12 +119,14 @@ struct io_tlb_pool {
* in debugfs.
*/
struct io_tlb_mem {
- struct io_tlb_pool *pool;
+ spinlock_t lock;
+ struct list_head pools;
unsigned long nslabs;
u64 phys_limit;
struct dentry *debugfs;
bool force_bounce;
bool for_alloc;
+ struct work_struct dyn_alloc;
#ifdef CONFIG_DEBUG_FS
atomic_long_t total_used;
atomic_long_t used_hiwater;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 5bb83097ade6..7661a6402e80 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -79,9 +79,14 @@ struct io_tlb_slot {
static bool swiotlb_force_bounce;
static bool swiotlb_force_disable;

+static void swiotlb_dyn_alloc(struct work_struct *work);
+
static struct io_tlb_pool io_tlb_default_pool;
-static struct io_tlb_mem io_tlb_default_mem = {
- .pool = &io_tlb_default_pool,
+struct io_tlb_mem io_tlb_default_mem = {
+ .lock = __SPIN_LOCK_UNLOCKED(io_tlb_default_mem.lock),
+ .pools = LIST_HEAD_INIT(io_tlb_default_mem.pools),
+ .dyn_alloc = __WORK_INITIALIZER(io_tlb_default_mem.dyn_alloc,
+ swiotlb_dyn_alloc),
};

static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
@@ -281,6 +286,19 @@ static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
return;
}

+/**
+ * add_mem_pool() - add a memory pool to the allocator
+ * @mem: Software IO TLB allocator.
+ * @pool: Memory pool to be added.
+ */
+static void add_mem_pool(struct io_tlb_mem *mem, struct io_tlb_pool *pool)
+{
+ spin_lock(&mem->lock);
+ list_add_rcu(&pool->node, &mem->pools);
+ mem->nslabs += pool->nslabs;
+ spin_unlock(&mem->lock);
+}
+
static void __init *swiotlb_memblock_alloc(unsigned long nslabs,
unsigned int flags,
int (*remap)(void *tlb, unsigned long nslabs))
@@ -372,7 +390,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,

swiotlb_init_io_tlb_pool(mem, __pa(tlb), nslabs, false,
default_nareas);
- io_tlb_default_mem.nslabs = nslabs;
+ add_mem_pool(&io_tlb_default_mem, mem);

if (flags & SWIOTLB_VERBOSE)
swiotlb_print_info();
@@ -463,7 +481,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
(nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
nareas);
- io_tlb_default_mem.nslabs = nslabs;
+ add_mem_pool(&io_tlb_default_mem, mem);

swiotlb_print_info();
return 0;
@@ -610,44 +628,83 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
/**
* swiotlb_alloc_pool() - allocate a new IO TLB memory pool
* @dev: Device for which a memory pool is allocated.
- * @nslabs: Desired number of slabs.
+ * @minslabs: Minimum number of slabs.
+ * @nslabs: Desired (maximum) number of slabs.
+ * @nareas: Number of areas.
* @phys_limit: Maximum DMA buffer physical address.
* @gfp: GFP flags for the allocations.
*
- * Allocate and initialize a new IO TLB memory pool.
+ * Allocate and initialize a new IO TLB memory pool. The actual number of
+ * slabs may be reduced if allocation of @nslabs fails. If even
+ * @minslabs cannot be allocated, this function fails.
*
* Return: New memory pool, or %NULL on allocation failure.
*/
static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
- unsigned int nslabs, u64 phys_limit, gfp_t gfp)
+ unsigned long minslabs, unsigned long nslabs,
+ unsigned int nareas, u64 phys_limit, gfp_t gfp)
{
struct io_tlb_pool *pool;
+ unsigned int slot_order;
struct page *tlb;
size_t pool_size;
size_t tlb_size;

- pool_size = sizeof(*pool) + array_size(sizeof(*pool->areas), 1) +
- array_size(sizeof(*pool->slots), nslabs);
+ pool_size = sizeof(*pool) + array_size(sizeof(*pool->areas), nareas);
pool = kzalloc(pool_size, gfp);
if (!pool)
goto error;
pool->areas = (void *)pool + sizeof(*pool);
- pool->slots = (void *)pool->areas + sizeof(*pool->areas);

tlb_size = nslabs << IO_TLB_SHIFT;
- tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp);
- if (!tlb)
- goto error_tlb;
+ while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp))) {
+ if (nslabs <= minslabs)
+ goto error_tlb;
+ nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
+ nareas = limit_nareas(nareas, nslabs);
+ tlb_size = nslabs << IO_TLB_SHIFT;
+ }

- swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, 1);
+ slot_order = get_order(array_size(sizeof(*pool->slots), nslabs));
+ pool->slots = (struct io_tlb_slot *)
+ __get_free_pages(gfp, slot_order);
+ if (!pool->slots)
+ goto error_slots;
+
+ swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, nareas);
return pool;

+error_slots:
+ swiotlb_free_tlb(page_address(tlb), tlb_size);
error_tlb:
kfree(pool);
error:
return NULL;
}

+/**
+ * swiotlb_dyn_alloc() - dynamic memory pool allocation worker
+ * @work: Pointer to dyn_alloc in struct io_tlb_mem.
+ */
+static void swiotlb_dyn_alloc(struct work_struct *work)
+{
+ struct io_tlb_mem *mem =
+ container_of(work, struct io_tlb_mem, dyn_alloc);
+ struct io_tlb_pool *pool;
+
+ pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
+ default_nareas, mem->phys_limit, GFP_KERNEL);
+ if (!pool) {
+ pr_warn_ratelimited("Failed to allocate new pool");
+ return;
+ }
+
+ add_mem_pool(mem, pool);
+
+ /* Pairs with smp_rmb() in swiotlb_find_pool(). */
+ smp_wmb();
+}
+
/**
* swiotlb_dyn_free() - RCU callback to free a memory pool
* @rcu: RCU head in the corresponding struct io_tlb_pool.
@@ -655,8 +712,10 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
static void swiotlb_dyn_free(struct rcu_head *rcu)
{
struct io_tlb_pool *pool = container_of(rcu, struct io_tlb_pool, rcu);
+ size_t slots_size = array_size(sizeof(*pool->slots), pool->nslabs);
size_t tlb_size = pool->end - pool->start;

+ free_pages((unsigned long)pool->slots, get_order(slots_size));
swiotlb_free_tlb(pool->vaddr, tlb_size);
kfree(pool);
}
@@ -685,15 +744,19 @@ void swiotlb_dev_init(struct device *dev)
struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- struct io_tlb_pool *pool = mem->pool;
-
- if (paddr >= pool->start && paddr < pool->end)
- return pool;
+ struct io_tlb_pool *pool;

- /* Pairs with smp_wmb() in swiotlb_find_slots(). */
+ /* Pairs with smp_wmb() in swiotlb_find_slots() and
+ * swiotlb_dyn_alloc(), which modify the RCU lists.
+ */
smp_rmb();

rcu_read_lock();
+ list_for_each_entry_rcu(pool, &mem->pools, node) {
+ if (paddr >= pool->start && paddr < pool->end)
+ goto out;
+ }
+
list_for_each_entry_rcu(pool, &dev->dma_io_tlb_pools, node) {
if (paddr >= pool->start && paddr < pool->end)
goto out;
@@ -1019,22 +1082,34 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
struct io_tlb_pool **retpool)
{
struct io_tlb_pool *pool;
+ struct io_tlb_mem *mem;
+ unsigned long nslabs;
unsigned long flags;
u64 phys_limit;
int index;

- pool = dev->dma_io_tlb_mem->pool;
- index = pool_find_slots(dev, pool, orig_addr,
- alloc_size, alloc_align_mask);
- if (index >= 0)
- goto found;
+ mem = dev->dma_io_tlb_mem;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(pool, &mem->pools, node) {
+ index = pool_find_slots(dev, pool, orig_addr,
+ alloc_size, alloc_align_mask);
+ if (index >= 0) {
+ rcu_read_unlock();
+ goto found;
+ }
+ }
+ rcu_read_unlock();

if (dev->dma_io_tlb_mem->for_alloc)
return -1;

+ schedule_work(&mem->dyn_alloc);
+
phys_limit = min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit);
- pool = swiotlb_alloc_pool(dev, nr_slots(alloc_size), phys_limit,
- GFP_NOWAIT | __GFP_NOWARN);
+ nslabs = nr_slots(alloc_size);
+ pool = swiotlb_alloc_pool(dev, nslabs, nslabs, 1, phys_limit,
+ GFP_NOWAIT | __GFP_NOWARN);
if (!pool)
return -1;

@@ -1086,7 +1161,15 @@ static unsigned long mem_pool_used(struct io_tlb_pool *pool)
*/
static unsigned long mem_used(struct io_tlb_mem *mem)
{
- return mem_pool_used(mem->pool);
+ struct io_tlb_pool *pool;
+ unsigned long used = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(pool, &mem->pools, node)
+ used += mem_pool_used(pool);
+ rcu_read_unlock();
+
+ return used;
}

phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
@@ -1468,8 +1551,8 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
false, nareas);
mem->force_bounce = true;
mem->for_alloc = true;
- mem->pool = pool;
- mem->nslabs = nslabs;
+ spin_lock_init(&mem->lock);
+ add_mem_pool(mem, pool);

rmem->priv = mem;

--
2.25.1


2023-06-27 10:18:53

by Petr Tesarik

[permalink] [raw]
Subject: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

From: Petr Tesarik <[email protected]>

Try to allocate a transient memory pool if no suitable slots can be found,
except when allocating from a restricted pool. The transient pool is just
big enough for this one bounce buffer. It is inserted into a per-device
list of transient memory pools, and it is freed again when the bounce
buffer is unmapped.

Transient memory pools are kept in an RCU list. A memory barrier is
required after adding a new entry, because any address within a transient
buffer must be immediately recognized as belonging to the SWIOTLB, even if
it is passed to another CPU.
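
For reference, the pairing added below looks like this (excerpted from the
diff, with the surrounding context trimmed):

    /* Producer side, in swiotlb_find_slots(): */
    spin_lock_irqsave(&dev->dma_io_tlb_lock, flags);
    list_add_rcu(&pool->node, &dev->dma_io_tlb_pools);
    spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
    /* Pairs with smp_rmb() in swiotlb_find_pool(). */
    smp_wmb();

    /* Consumer side, in swiotlb_find_pool(): */
    /* Pairs with smp_wmb() in swiotlb_find_slots(). */
    smp_rmb();
    rcu_read_lock();
    list_for_each_entry_rcu(pool, &dev->dma_io_tlb_pools, node) {
            if (paddr >= pool->start && paddr < pool->end)
                    goto out;
    }
    pool = NULL;
out:
    rcu_read_unlock();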

Deletion does not require any synchronization beyond RCU ordering
guarantees. After a buffer is unmapped, its physical addresses may no
longer be passed to the DMA API, so the memory range of the corresponding
stale entry in the RCU list never matches. If the memory range gets
allocated again, then it happens only after an RCU quiescent state.

Since bounce buffers can now be allocated from different pools, add a
parameter to swiotlb_find_slots() to let the caller know which memory pool
is used. Add swiotlb_find_pool() to find the memory pool corresponding to
an address. This function is now also used by is_swiotlb_buffer(), because
a simple boundary check is no longer sufficient.

The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
simplified and enhanced to use coherent memory pools if needed.

Note that this is not the most efficient way to provide a bounce buffer,
but when a DMA buffer can't be mapped, something may (and will) actually
break. At that point it is better to make an allocation, even if it may be
an expensive operation.

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/device.h | 4 +
include/linux/dma-mapping.h | 2 +
include/linux/swiotlb.h | 13 +-
kernel/dma/direct.c | 2 +-
kernel/dma/swiotlb.c | 265 ++++++++++++++++++++++++++++++++++--
5 files changed, 272 insertions(+), 14 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 83081aa99e6a..a1ee4c5924b8 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -510,6 +510,8 @@ struct device_physical_location {
* @dma_mem: Internal for coherent mem override.
* @cma_area: Contiguous memory area for dma allocations
* @dma_io_tlb_mem: Software IO TLB allocator. Not for driver use.
+ * @dma_io_tlb_pools: List of transient swiotlb memory pools.
+ * @dma_io_tlb_lock: Protects changes to the list of active pools.
* @archdata: For arch-specific additions.
* @of_node: Associated device tree node.
* @fwnode: Associated device node supplied by platform firmware.
@@ -615,6 +617,8 @@ struct device {
#endif
#ifdef CONFIG_SWIOTLB
struct io_tlb_mem *dma_io_tlb_mem;
+ struct list_head dma_io_tlb_pools;
+ spinlock_t dma_io_tlb_lock;
#endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0ee20b764000..c36c5a546787 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -417,6 +417,8 @@ static inline void dma_sync_sgtable_for_device(struct device *dev,
#define dma_get_sgtable(d, t, v, h, s) dma_get_sgtable_attrs(d, t, v, h, s, 0)
#define dma_mmap_coherent(d, v, c, h, s) dma_mmap_attrs(d, v, c, h, s, 0)

+bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size);
+
static inline void *dma_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp)
{
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 0aa6868cb68c..ae1688438850 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -63,6 +63,7 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,

/**
* struct io_tlb_pool - IO TLB memory pool descriptor
+ * @node: Member of the IO TLB memory pool list.
* @start: The start address of the swiotlb memory pool. Used to do a quick
* range check to see if the memory was in fact allocated by this
* API.
@@ -77,22 +78,27 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
* see setup_io_tlb_npages().
* @used: The number of used IO TLB slots.
* @late_alloc: %true if allocated using the page allocator.
+ * @transient: %true if transient memory pool.
* @nareas: Number of areas in the pool.
* @area_nslabs: Number of slots in each area.
* @areas: Array of memory area descriptors.
* @slots: Array of slot descriptors.
+ * @rcu: RCU head for swiotlb_dyn_free().
*/
struct io_tlb_pool {
+ struct list_head node;
phys_addr_t start;
phys_addr_t end;
void *vaddr;
unsigned long nslabs;
unsigned long used;
bool late_alloc;
+ bool transient;
unsigned int nareas;
unsigned int area_nslabs;
struct io_tlb_area *areas;
struct io_tlb_slot *slots;
+ struct rcu_head rcu;
};

/**
@@ -120,6 +126,8 @@ struct io_tlb_mem {
#endif
};

+struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr);
+
/**
* is_swiotlb_buffer() - check if a physical address belongs to a swiotlb
* @dev: Device which has mapped the buffer.
@@ -133,9 +141,8 @@ struct io_tlb_mem {
*/
static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
- struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
-
- return mem && paddr >= mem->pool->start && paddr < mem->pool->end;
+ return dev->dma_io_tlb_mem &&
+ !!swiotlb_find_pool(dev, paddr);
}

static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5595d1d5cdcc..820561cab38d 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -66,7 +66,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
return 0;
}

-static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
+bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
{
dma_addr_t dma_addr = phys_to_dma_direct(dev, phys);

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 4c5de91bda57..06b4fa7c2e9b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -35,6 +35,7 @@
#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/pfn.h>
+#include <linux/rculist.h>
#include <linux/scatterlist.h>
#include <linux/set_memory.h>
#include <linux/spinlock.h>
@@ -500,6 +501,157 @@ void __init swiotlb_exit(void)
memset(mem, 0, sizeof(*mem));
}

+/**
+ * alloc_dma_pages() - allocate pages to be used for DMA
+ * @gfp: GFP flags for the allocation.
+ * @bytes: Size of the buffer.
+ *
+ * Allocate pages from the buddy allocator. If successful, make the allocated
+ * pages decrypted so that they can be used for DMA.
+ *
+ * Return: Decrypted pages, or %NULL on failure.
+ */
+static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes)
+{
+ unsigned int order = get_order(bytes);
+ struct page *page;
+ void *vaddr;
+
+ page = alloc_pages(gfp, order);
+ if (!page)
+ return NULL;
+
+ vaddr = page_address(page);
+ if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes))) {
+ __free_pages(page, order);
+ return NULL;
+ }
+
+ return page;
+}
+
+/**
+ * swiotlb_alloc_tlb() - allocate a dynamic IO TLB buffer
+ * @dev: Device for which a memory pool is allocated.
+ * @bytes: Size of the buffer.
+ * @phys_limit: Maximum allowed physical address of the buffer.
+ * @gfp: GFP flags for the allocation.
+ *
+ * Return: Allocated pages, or %NULL on allocation failure.
+ */
+static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
+ u64 phys_limit, gfp_t gfp)
+{
+ struct page *page;
+
+ /*
+ * Allocate from the atomic pools if memory is encrypted and
+ * the allocation is atomic, because decrypting may block.
+ */
+ if (dev && force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp)) {
+ void *vaddr;
+
+ return IS_ENABLED(CONFIG_DMA_COHERENT_POOL)
+ ? dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
+ dma_coherent_ok)
+ : NULL;
+ }
+
+ gfp &= ~GFP_ZONEMASK;
+ if (phys_limit <= DMA_BIT_MASK(zone_dma_bits))
+ gfp |= __GFP_DMA;
+ else if (phys_limit <= DMA_BIT_MASK(32))
+ gfp |= __GFP_DMA32;
+
+ while ((page = alloc_dma_pages(gfp, bytes)) &&
+ page_to_phys(page) + bytes - 1 > phys_limit) {
+ /* allocated, but too high */
+ __free_pages(page, get_order(bytes));
+
+ if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
+ phys_limit < DMA_BIT_MASK(64) &&
+ !(gfp & (__GFP_DMA32 | __GFP_DMA)))
+ gfp |= __GFP_DMA32;
+ else if (IS_ENABLED(CONFIG_ZONE_DMA) &&
+ !(gfp & __GFP_DMA))
+ gfp = (gfp & ~__GFP_DMA32) | __GFP_DMA;
+ else
+ return NULL;
+ }
+
+ return page;
+}
+
+/**
+ * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
+ * @vaddr: Virtual address of the buffer.
+ * @bytes: Size of the buffer.
+ */
+static void swiotlb_free_tlb(void *vaddr, size_t bytes)
+{
+ if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
+ dma_free_from_pool(NULL, vaddr, bytes))
+ return;
+
+ /* Intentional leak if pages cannot be encrypted again. */
+ if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
+ __free_pages(virt_to_page(vaddr), get_order(bytes));
+}
+
+/**
+ * swiotlb_alloc_pool() - allocate a new IO TLB memory pool
+ * @dev: Device for which a memory pool is allocated.
+ * @nslabs: Desired number of slabs.
+ * @phys_limit: Maximum DMA buffer physical address.
+ * @gfp: GFP flags for the allocations.
+ *
+ * Allocate and initialize a new IO TLB memory pool.
+ *
+ * Return: New memory pool, or %NULL on allocation failure.
+ */
+static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
+ unsigned int nslabs, u64 phys_limit, gfp_t gfp)
+{
+ struct io_tlb_pool *pool;
+ struct page *tlb;
+ size_t pool_size;
+ size_t tlb_size;
+
+ pool_size = sizeof(*pool) + array_size(sizeof(*pool->areas), 1) +
+ array_size(sizeof(*pool->slots), nslabs);
+ pool = kzalloc(pool_size, gfp);
+ if (!pool)
+ goto error;
+ pool->areas = (void *)pool + sizeof(*pool);
+ pool->slots = (void *)pool->areas + sizeof(*pool->areas);
+
+ tlb_size = nslabs << IO_TLB_SHIFT;
+ tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp);
+ if (!tlb)
+ goto error_tlb;
+
+ swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, 1);
+ return pool;
+
+error_tlb:
+ kfree(pool);
+error:
+ return NULL;
+}
+
+/**
+ * swiotlb_dyn_free() - RCU callback to free a memory pool
+ * @rcu: RCU head in the corresponding struct io_tlb_pool.
+ */
+static void swiotlb_dyn_free(struct rcu_head *rcu)
+{
+ struct io_tlb_pool *pool = container_of(rcu, struct io_tlb_pool, rcu);
+ size_t tlb_size = pool->end - pool->start;
+
+ swiotlb_free_tlb(pool->vaddr, tlb_size);
+ kfree(pool);
+}
+
/**
* swiotlb_dev_init() - initialize swiotlb fields in &struct device
* @dev: Device to be initialized.
@@ -507,6 +659,56 @@ void __init swiotlb_exit(void)
void swiotlb_dev_init(struct device *dev)
{
dev->dma_io_tlb_mem = &io_tlb_default_mem;
+ INIT_LIST_HEAD(&dev->dma_io_tlb_pools);
+ spin_lock_init(&dev->dma_io_tlb_lock);
+}
+
+/**
+ * swiotlb_find_pool() - find the IO TLB pool for a physical address
+ * @dev: Device which has mapped the DMA buffer.
+ * @paddr: Physical address within the DMA buffer.
+ *
+ * Find the IO TLB memory pool descriptor which contains the given physical
+ * address, if any.
+ *
+ * Return: Memory pool which contains @paddr, or %NULL if none.
+ */
+struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_pool *pool = mem->pool;
+
+ if (paddr >= pool->start && paddr < pool->end)
+ return pool;
+
+ /* Pairs with smp_wmb() in swiotlb_find_slots(). */
+ smp_rmb();
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(pool, &dev->dma_io_tlb_pools, node) {
+ if (paddr >= pool->start && paddr < pool->end)
+ goto out;
+ }
+ pool = NULL;
+out:
+ rcu_read_unlock();
+ return pool;
+}
+
+/**
+ * swiotlb_del_pool() - remove an IO TLB pool from a device
+ * @dev: Owning device.
+ * @pool: Memory pool to be removed.
+ */
+static void swiotlb_del_pool(struct device *dev, struct io_tlb_pool *pool)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&dev->dma_io_tlb_lock, flags);
+ list_del_rcu(&pool->node);
+ spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
+
+ call_rcu(&pool->rcu, swiotlb_dyn_free);
}

/*
@@ -523,7 +725,7 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
enum dma_data_direction dir)
{
- struct io_tlb_pool *mem = dev->dma_io_tlb_mem->pool;
+ struct io_tlb_pool *mem = swiotlb_find_pool(dev, tlb_addr);
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
@@ -796,6 +998,7 @@ static int pool_find_slots(struct device *dev, struct io_tlb_pool *pool,
* @alloc_size: Total requested size of the bounce buffer,
* including initial alignment padding.
* @alloc_align_mask: Required alignment of the allocated buffer.
+ * @retpool: Used memory pool, updated on return.
*
* Search through the whole software IO TLB to find a sequence of slots that
* match the allocation constraints.
@@ -803,10 +1006,46 @@ static int pool_find_slots(struct device *dev, struct io_tlb_pool *pool,
* Return: Index of the first allocated slot, or -1 on error.
*/
static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
- size_t alloc_size, unsigned int alloc_align_mask)
+ size_t alloc_size, unsigned int alloc_align_mask,
+ struct io_tlb_pool **retpool)
{
- return pool_find_slots(dev, dev->dma_io_tlb_mem->pool, orig_addr,
- alloc_size, alloc_align_mask);
+ struct io_tlb_pool *pool;
+ unsigned long flags;
+ u64 phys_limit;
+ int index;
+
+ pool = dev->dma_io_tlb_mem->pool;
+ index = pool_find_slots(dev, pool, orig_addr,
+ alloc_size, alloc_align_mask);
+ if (index >= 0)
+ goto found;
+
+ if (dev->dma_io_tlb_mem->for_alloc)
+ return -1;
+
+ phys_limit = min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit);
+ pool = swiotlb_alloc_pool(dev, nr_slots(alloc_size), phys_limit,
+ GFP_NOWAIT | __GFP_NOWARN);
+ if (!pool)
+ return -1;
+
+ index = pool_find_slots(dev, pool, orig_addr,
+ alloc_size, alloc_align_mask);
+ if (index < 0) {
+ swiotlb_dyn_free(&pool->rcu);
+ return -1;
+ }
+
+ pool->transient = true;
+ spin_lock_irqsave(&dev->dma_io_tlb_lock, flags);
+ list_add_rcu(&pool->node, &dev->dma_io_tlb_pools);
+ spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
+
+ /* Pairs with smp_rmb() in swiotlb_find_pool(). */
+ smp_wmb();
+found:
+ *retpool = pool;
+ return index;
}

/**
@@ -869,7 +1108,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
}

index = swiotlb_find_slots(dev, orig_addr,
- alloc_size + offset, alloc_align_mask);
+ alloc_size + offset, alloc_align_mask, &pool);
if (index == -1) {
if (!(attrs & DMA_ATTR_NO_WARN))
dev_warn_ratelimited(dev,
@@ -883,7 +1122,6 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
* This is needed when we sync the memory. Then we sync the buffer if
* needed.
*/
- pool = mem->pool;
for (i = 0; i < nr_slots(alloc_size + offset); i++)
pool->slots[index + i].orig_addr = slot_addr(orig_addr, i);
tlb_addr = slot_addr(pool->start, index) + offset;
@@ -900,7 +1138,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,

static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
{
- struct io_tlb_pool *mem = dev->dma_io_tlb_mem->pool;
+ struct io_tlb_pool *mem = swiotlb_find_pool(dev, tlb_addr);
unsigned long flags;
unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
@@ -954,6 +1192,8 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
size_t mapping_size, enum dma_data_direction dir,
unsigned long attrs)
{
+ struct io_tlb_pool *pool;
+
/*
* First, sync the memory before unmapping the entry
*/
@@ -961,7 +1201,13 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);

- swiotlb_release_slots(dev, tlb_addr);
+ pool = swiotlb_find_pool(dev, tlb_addr);
+ if (pool->transient) {
+ dec_used(dev->dma_io_tlb_mem, pool->nslabs);
+ swiotlb_del_pool(dev, pool);
+ } else {
+ swiotlb_release_slots(dev, tlb_addr);
+ }
}

void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
@@ -1145,11 +1391,10 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
if (!mem)
return NULL;

- index = swiotlb_find_slots(dev, 0, size, 0);
+ index = swiotlb_find_slots(dev, 0, size, 0, &pool);
if (index == -1)
return NULL;

- pool = mem->pool;
tlb_addr = slot_addr(pool->start, index);

return pfn_to_page(PFN_DOWN(tlb_addr));
--
2.25.1


2023-06-27 11:03:19

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] swiotlb: make io_tlb_default_mem local to swiotlb.c

On Tue, Jun 27, 2023 at 11:54:23AM +0200, Petr Tesarik wrote:
> +/**
> + * is_swiotlb_active() - check if the software IO TLB is initialized
> + * @dev: Device to check, or %NULL for the default IO TLB.
> + */
> bool is_swiotlb_active(struct device *dev)
> {
> - struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> + struct io_tlb_mem *mem = dev
> + ? dev->dma_io_tlb_mem
> + : &io_tlb_default_mem;

That's impossible to read and maintain over time, sorry.

Please use real "if () else" lines, so that it can be maintained over
time.
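
Something like this, for example (just a sketch of the shape I mean):

    struct io_tlb_mem *mem;

    if (dev)
            mem = dev->dma_io_tlb_mem;
    else
            mem = &io_tlb_default_mem;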

thanks,

greg k-h

2023-06-27 11:18:13

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] swiotlb: make io_tlb_default_mem local to swiotlb.c

On 27/06/2023 11:24 am, Greg Kroah-Hartman wrote:
> On Tue, Jun 27, 2023 at 11:54:23AM +0200, Petr Tesarik wrote:
>> +/**
>> + * is_swiotlb_active() - check if the software IO TLB is initialized
>> + * @dev: Device to check, or %NULL for the default IO TLB.
>> + */
>> bool is_swiotlb_active(struct device *dev)
>> {
>> - struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>> + struct io_tlb_mem *mem = dev
>> + ? dev->dma_io_tlb_mem
>> + : &io_tlb_default_mem;
>
> That's impossible to read and maintain over time, sorry.
>
> Please use real "if () else" lines, so that it can be maintained over
> time.

Moreover, it makes for a horrible interface anyway. If there's a need
for a non-specific "is SWIOTLB present at all?" check unrelated to any
particular device (which arguably still smells of poking into
implementation details...), please encapsulate it in its own distinct
helper like, say, is_swiotlb_present(void).
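
Roughly this shape, say (completely untested sketch, assuming struct
io_tlb_mem keeps a usable nslabs count for the default allocator):

    /* "Is any SWIOTLB set up at all?", independent of any device. */
    bool is_swiotlb_present(void)
    {
            return io_tlb_default_mem.nslabs != 0;
    }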

However, the more I think about it, the more I doubt that logic like
octeon_pci_setup() can continue to work properly at all if SWIOTLB
allocation becomes dynamic... :/

Thanks,
Robin.

2023-06-27 11:38:35

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] swiotlb: make io_tlb_default_mem local to swiotlb.c

On Tue, 27 Jun 2023 11:55:00 +0100
Robin Murphy <[email protected]> wrote:

> On 27/06/2023 11:24 am, Greg Kroah-Hartman wrote:
> > On Tue, Jun 27, 2023 at 11:54:23AM +0200, Petr Tesarik wrote:
> >> +/**
> >> + * is_swiotlb_active() - check if the software IO TLB is initialized
> >> + * @dev: Device to check, or %NULL for the default IO TLB.
> >> + */
> >> bool is_swiotlb_active(struct device *dev)
> >> {
> >> - struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> >> + struct io_tlb_mem *mem = dev
> >> + ? dev->dma_io_tlb_mem
> >> + : &io_tlb_default_mem;
> >
> > That's impossible to read and maintain over time, sorry.
> >
> > Please use real "if () else" lines, so that it can be maintained over
> > time.
>
> Moreover, it makes for a horrible interface anyway. If there's a need
> for a non-specific "is SWIOTLB present at all?" check unrelated to any
> particular device (which arguably still smells of poking into
> implementation details...), please encapsulate it in its own distinct
> helper like, say, is_swiotlb_present(void).
>
> However, the more I think about it, the more I doubt that logic like
> octeon_pci_setup() can continue to work properly at all if SWIOTLB
> allocation becomes dynamic... :/

Good, so I'm not alone. I don't know enough about the Octeon hardware to
understand how much magic is behind these PCI BARs and why one of them
should be (sometimes) programmed the way it is.

OTOH it doesn't seem to me that this platform forces DMA through
SWIOTLB. At least all calls to swiotlb_init() under arch/mips take this
form:

swiotlb_init(true, SWIOTLB_VERBOSE);

This makes me believe that this PCI BAR setup is merely an optimization.

However, if nobody has a clear answer, a fallback solution is to stay
on the safe side and add a flag to struct io_tlb_mem indicating whether the SWIOTLB
can grow dynamically. The helper function would then set this flag and
make sure that on this Octeon platform, the SWIOTLB stays restricted to
the default pool.
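
Something like this, purely as a sketch (the can_grow field and the
swiotlb_may_grow() helper are invented here, nothing in v3 has them yet):

    /* can_grow would be a new field in struct io_tlb_mem, default true. */
    static bool swiotlb_may_grow(struct io_tlb_mem *mem)
    {
            return mem->can_grow && !mem->for_alloc;
    }

    /* ... and swiotlb_find_slots() would then bail out early: */
    if (!swiotlb_may_grow(mem))
            return -1;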

Hopefully, Thomas Bogendoerfer can shed some light on that code.

Petr T

2023-06-27 11:41:14

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] swiotlb: make io_tlb_default_mem local to swiotlb.c

Oops, originally sent only to Robin. Restoring the recipient list here...

On Tue, 27 Jun 2023 11:55:00 +0100
Robin Murphy <[email protected]> wrote:

> On 27/06/2023 11:24 am, Greg Kroah-Hartman wrote:
> > On Tue, Jun 27, 2023 at 11:54:23AM +0200, Petr Tesarik wrote:
> >> +/**
> >> + * is_swiotlb_active() - check if the software IO TLB is initialized
> >> + * @dev: Device to check, or %NULL for the default IO TLB.
> >> + */
> >> bool is_swiotlb_active(struct device *dev)
> >> {
> >> - struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> >> + struct io_tlb_mem *mem = dev
> >> + ? dev->dma_io_tlb_mem
> >> + : &io_tlb_default_mem;
> >
> > That's impossible to read and maintain over time, sorry.
> >
> > Please use real "if () else" lines, so that it can be maintained over
> > time.
>
> Moreover, it makes for a horrible interface anyway. If there's a need
> for a non-specific "is SWIOTLB present at all?" check unrelated to any
> particular device (which arguably still smells of poking into
> implementation details...), please encapsulate it in its own distinct
> helper like, say, is_swiotlb_present(void).

I'm sorry for writing two replies, but I realized too late that this
part is unrelated to the MIPS Octeon platform.

Xen is the only user of an "is SWIOTLB present" interface. IIUC Xen
needs bounce buffers for the PCI frontend driver, but if there is no
other reason to have a SWIOTLB, the system does not set up one at boot.

Yeah, they should probably do things differently. At least this code in
arch/x86/kernel/pci-dma.c is fishy:

/* XXX: this switches the dma ops under live devices! */
dma_ops = &xen_swiotlb_dma_ops;

However, I don't think it's up to me to fix that...

To sum it up, I can certainly provide a separate function instead of
overloading the is_swiotlb_active() API.

Thanks for the suggestion!

Petr T

2023-06-27 16:19:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] swiotlb: make io_tlb_default_mem local to swiotlb.c

On Tue, Jun 27, 2023 at 01:30:06PM +0200, Petr Tesařík wrote:
> Xen is the only user of an "is SWIOTLB present" interface. IIUC Xen
> needs bounce buffers for the PCI frontend driver, but if there is no
> other reason to have a SWIOTLB, the system does not set up one at boot.

Please take a look at my "unexport swiotlb_active v2" series that
unfortunately missed the 6.5 merge window waiting for reviews.

2023-06-27 17:37:36

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] swiotlb: make io_tlb_default_mem local to swiotlb.c

On Tue, 27 Jun 2023 17:48:02 +0200
Christoph Hellwig <[email protected]> wrote:

> On Tue, Jun 27, 2023 at 01:30:06PM +0200, Petr Tesařík wrote:
> > Xen is the only user of an "is SWIOTLB present" interface. IIUC Xen
> > needs bounce buffers for the PCI frontend driver, but if there is no
> > other reason to have a SWIOTLB, the system does not set up one at boot.
>
> Please take a look at my "unexport swiotlb_active v2" series that
> unfortunately missed the 6.5 merge window waiting for reviews.

I noticed it, but it seems I missed the part which completely removes
pci_xen_swiotlb_init_late().

Then we're left only with a reference from xen_mm_init() in
arch/arm/xen/mm.c, and I believe this one can also be solved
differently.

Petr T

2023-07-06 04:06:28

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

From: Petr Tesarik <[email protected]> Sent: Tuesday, June 27, 2023 2:54 AM
>
> Try to allocate a transient memory pool if no suitable slots can be found,
> except when allocating from a restricted pool. The transient pool is just
> enough big for this one bounce buffer. It is inserted into a per-device
> list of transient memory pools, and it is freed again when the bounce
> buffer is unmapped.
>
> Transient memory pools are kept in an RCU list. A memory barrier is
> required after adding a new entry, because any address within a transient
> buffer must be immediately recognized as belonging to the SWIOTLB, even if
> it is passed to another CPU.
>
> Deletion does not require any synchronization beyond RCU ordering
> guarantees. After a buffer is unmapped, its physical addresses may no
> longer be passed to the DMA API, so the memory range of the corresponding
> stale entry in the RCU list never matches. If the memory range gets
> allocated again, then it happens only after a RCU quiescent state.
>
> Since bounce buffers can now be allocated from different pools, add a
> parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> an address. This function is now also used by is_swiotlb_buffer(), because
> a simple boundary check is no longer sufficient.
>
> The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> simplified and enhanced to use coherent memory pools if needed.
>
> Note that this is not the most efficient way to provide a bounce buffer,
> but when a DMA buffer can't be mapped, something may (and will) actually
> break. At that point it is better to make an allocation, even if it may be
> an expensive operation.

I continue to think about swiotlb memory management from the standpoint
of CoCo VMs that may be quite large with high network and storage loads.
These VMs are often running mission-critical workloads that can't tolerate
a bounce buffer allocation failure. To prevent such failures, the swiotlb
memory size must be overly large, which wastes memory.

Your new approach helps by using the coherent memory pools as an overflow
space. But in a lot of ways, it only pushes the problem around. As you
noted in your cover letter, reducing the initial size of the swiotlb might
require increasing the size of the coherent pools.

What might be really useful is to pend bounce buffer requests while the
new worker thread is adding more swiotlb pools. Of course, requests made
from interrupt level can't be pended, but at least in my experience with large
CoCo VMs, storage I/O is the biggest consumer of bounce buffers. A lot
(most?) storage requests make the swiotlb_map() call in a context where
it is OK to pend. If the coherent pool overflow space could be used only
for swiotlb_map() calls that can't pend, it's more likely to be sufficient to
bridge the gap until new pools are added.

Could swiotlb code detect if it's OK to pend, and then pend a bounce
buffer request until the worker thread adds a new pool? Even an overly
conversative check would help reduce pressure on the coherent pools
as overflow space.
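
To make that concrete, I'm imagining something vaguely like the following
(pure sketch; swiotlb_find_slots_can_sleep() and the dyn_alloc_done
completion are names I just made up, they are not in this series):

    /* Sleepable path: wait for the worker to add a pool instead of
     * falling back to the coherent pools right away. */
    static int swiotlb_find_slots_can_sleep(struct device *dev,
                    phys_addr_t orig_addr, size_t alloc_size,
                    unsigned int alloc_align_mask,
                    struct io_tlb_pool **retpool)
    {
            struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
            int tries = 3;
            int index;

            while ((index = swiotlb_find_slots(dev, orig_addr, alloc_size,
                                               alloc_align_mask, retpool)) < 0) {
                    if (!tries--)
                            break;
                    schedule_work(&mem->dyn_alloc);
                    /* dyn_alloc_done would be signalled by swiotlb_dyn_alloc()
                     * right after add_mem_pool(). */
                    wait_for_completion(&mem->dyn_alloc_done);
            }
            return index;
    }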

Michael


2023-07-06 08:39:13

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> From: Petr Tesarik <[email protected]> Sent: Tuesday, June 27, 2023 2:54 AM
> >
> > Try to allocate a transient memory pool if no suitable slots can be found,
> > except when allocating from a restricted pool. The transient pool is just
> > enough big for this one bounce buffer. It is inserted into a per-device
> > list of transient memory pools, and it is freed again when the bounce
> > buffer is unmapped.
> >
> > Transient memory pools are kept in an RCU list. A memory barrier is
> > required after adding a new entry, because any address within a transient
> > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > it is passed to another CPU.
> >
> > Deletion does not require any synchronization beyond RCU ordering
> > guarantees. After a buffer is unmapped, its physical addresses may no
> > longer be passed to the DMA API, so the memory range of the corresponding
> > stale entry in the RCU list never matches. If the memory range gets
> > allocated again, then it happens only after a RCU quiescent state.
> >
> > Since bounce buffers can now be allocated from different pools, add a
> > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > an address. This function is now also used by is_swiotlb_buffer(), because
> > a simple boundary check is no longer sufficient.
> >
> > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > simplified and enhanced to use coherent memory pools if needed.
> >
> > Note that this is not the most efficient way to provide a bounce buffer,
> > but when a DMA buffer can't be mapped, something may (and will) actually
> > break. At that point it is better to make an allocation, even if it may be
> > an expensive operation.
>
> I continue to think about swiotlb memory management from the standpoint
> of CoCo VMs that may be quite large with high network and storage loads.
> These VMs are often running mission-critical workloads that can't tolerate
> a bounce buffer allocation failure. To prevent such failures, the swiotlb
> memory size must be overly large, which wastes memory.

If "mission critical workloads" are in a vm that allowes overcommit and
no control over other vms in that same system, then you have worse
problems, sorry.

Just don't do that.

thanks,

greg k-h

2023-07-06 15:43:23

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

From: Greg Kroah-Hartman <[email protected]> Sent: Thursday, July 6, 2023 1:07 AM
>
> On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > From: Petr Tesarik <[email protected]> Sent: Tuesday, June 27, 2023 2:54 AM
> > >
> > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > except when allocating from a restricted pool. The transient pool is just
> > > enough big for this one bounce buffer. It is inserted into a per-device
> > > list of transient memory pools, and it is freed again when the bounce
> > > buffer is unmapped.
> > >
> > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > required after adding a new entry, because any address within a transient
> > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > it is passed to another CPU.
> > >
> > > Deletion does not require any synchronization beyond RCU ordering
> > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > longer be passed to the DMA API, so the memory range of the corresponding
> > > stale entry in the RCU list never matches. If the memory range gets
> > > allocated again, then it happens only after a RCU quiescent state.
> > >
> > > Since bounce buffers can now be allocated from different pools, add a
> > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > a simple boundary check is no longer sufficient.
> > >
> > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > simplified and enhanced to use coherent memory pools if needed.
> > >
> > > Note that this is not the most efficient way to provide a bounce buffer,
> > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > break. At that point it is better to make an allocation, even if it may be
> > > an expensive operation.
> >
> > I continue to think about swiotlb memory management from the standpoint
> > of CoCo VMs that may be quite large with high network and storage loads.
> > These VMs are often running mission-critical workloads that can't tolerate
> > a bounce buffer allocation failure. To prevent such failures, the swiotlb
> > memory size must be overly large, which wastes memory.
>
> If "mission critical workloads" are in a vm that allowes overcommit and
> no control over other vms in that same system, then you have worse
> problems, sorry.
>
> Just don't do that.
>

No, the cases I'm concerned about don't involve memory overcommit.

CoCo VMs must use swiotlb bounce buffers to do DMA I/O. Current swiotlb
code in the Linux guest allocates a configurable, but fixed, amount of guest
memory at boot time for this purpose. But it's hard to know how much
swiotlb bounce buffer memory will be needed to handle peak I/O loads.
This patch set does dynamic allocation of swiotlb bounce buffer memory,
which can help avoid needing to configure an overly large fixed size at boot.

Michael

2023-07-07 10:15:14

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> From: Greg Kroah-Hartman <[email protected]> Sent: Thursday, July 6, 2023 1:07 AM
> >
> > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > From: Petr Tesarik <[email protected]> Sent: Tuesday, June 27, 2023 2:54 AM
> > > >
> > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > except when allocating from a restricted pool. The transient pool is just
> > > > enough big for this one bounce buffer. It is inserted into a per-device
> > > > list of transient memory pools, and it is freed again when the bounce
> > > > buffer is unmapped.
> > > >
> > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > required after adding a new entry, because any address within a transient
> > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > it is passed to another CPU.
> > > >
> > > > Deletion does not require any synchronization beyond RCU ordering
> > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > stale entry in the RCU list never matches. If the memory range gets
> > > > allocated again, then it happens only after a RCU quiescent state.
> > > >
> > > > Since bounce buffers can now be allocated from different pools, add a
> > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > a simple boundary check is no longer sufficient.
> > > >
> > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > simplified and enhanced to use coherent memory pools if needed.
> > > >
> > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > break. At that point it is better to make an allocation, even if it may be
> > > > an expensive operation.
> > >
> > > I continue to think about swiotlb memory management from the standpoint
> > > of CoCo VMs that may be quite large with high network and storage loads.
> > > These VMs are often running mission-critical workloads that can't tolerate
> > > a bounce buffer allocation failure. To prevent such failures, the swiotlb
> > > memory size must be overly large, which wastes memory.
> >
> > If "mission critical workloads" are in a vm that allowes overcommit and
> > no control over other vms in that same system, then you have worse
> > problems, sorry.
> >
> > Just don't do that.
> >
>
> No, the cases I'm concerned about don't involve memory overcommit.
>
> CoCo VMs must use swiotlb bounce buffers to do DMA I/O. Current swiotlb
> code in the Linux guest allocates a configurable, but fixed, amount of guest
> memory at boot time for this purpose. But it's hard to know how much
> swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> This patch set does dynamic allocation of swiotlb bounce buffer memory,
> which can help avoid needing to configure an overly large fixed size at boot.

But, as you point out, memory allocation can fail at runtime, so how can
you "guarantee" that this will work properly anymore if you are going to
make it dynamic?

confused,

greg k-h

2023-07-07 11:06:37

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

On Fri, 7 Jul 2023 10:29:00 +0100
Greg Kroah-Hartman <[email protected]> wrote:

> On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > From: Greg Kroah-Hartman <[email protected]> Sent: Thursday, July 6, 2023 1:07 AM
> > >
> > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > > From: Petr Tesarik <[email protected]> Sent: Tuesday, June 27, 2023 2:54 AM
> > > > >
> > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > enough big for this one bounce buffer. It is inserted into a per-device
> > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > buffer is unmapped.
> > > > >
> > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > required after adding a new entry, because any address within a transient
> > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > it is passed to another CPU.
> > > > >
> > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > allocated again, then it happens only after a RCU quiescent state.
> > > > >
> > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > a simple boundary check is no longer sufficient.
> > > > >
> > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > >
> > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > an expensive operation.
> > > >
> > > > I continue to think about swiotlb memory management from the standpoint
> > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > a bounce buffer allocation failure. To prevent such failures, the swiotlb
> > > > memory size must be overly large, which wastes memory.
> > >
> > > If "mission critical workloads" are in a vm that allowes overcommit and
> > > no control over other vms in that same system, then you have worse
> > > problems, sorry.
> > >
> > > Just don't do that.
> > >
> >
> > No, the cases I'm concerned about don't involve memory overcommit.
> >
> > CoCo VMs must use swiotlb bounce buffers to do DMA I/O. Current swiotlb
> > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > memory at boot time for this purpose. But it's hard to know how much
> > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > which can help avoid needing to configure an overly large fixed size at boot.
>
> But, as you point out, memory allocation can fail at runtime, so how can
> you "guarantee" that this will work properly anymore if you are going to
> make it dynamic?

In general, there is no guarantee, of course, because bounce buffers
may be requested from interrupt context. I believe Michael is looking
for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
possible, and then there is no need to dip into the coherent pool.
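
To make that concrete, here is a rough sketch of the gfp selection when a
new pool has to be allocated at run time. The helper name and its argument
are made up for illustration; this is not code from the posted patches:

/* Illustration only: pick the allocation mask for a new dynamic pool. */
static gfp_t swiotlb_dyn_gfp(unsigned int alloc_flags)
{
	/*
	 * SWIOTLB_MAY_SLEEP callers run in a context that allows
	 * blocking, so the pool can come straight from the page
	 * allocator with GFP_KERNEL.
	 */
	if (alloc_flags & SWIOTLB_MAY_SLEEP)
		return GFP_KERNEL;

	/*
	 * Everybody else must not block; GFP_NOWAIT can fail, and the
	 * fallback is then the scarce coherent atomic pool.
	 */
	return GFP_NOWAIT | __GFP_NOWARN;
}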

Well, I have deliberately removed all complexities from my v3 series,
but I have more WIP local topic branches in my local repo:

- allow blocking allocations if possible
- allocate a new pool before existing pools are full
- free unused memory pools

I can make a bigger series, or I can send another series as RFC if this
is desired. ATM I don't feel confident enough that my v3 series will be
accepted without major changes, so I haven't invested time into
finalizing the other topic branches.

@Michael: Given that my plan is to introduce blocking allocations
with a follow-up patch series, is the present approach acceptable?

Petr T

2023-07-08 15:29:33

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

From: Petr Tesařík <[email protected]> Sent: Friday, July 7, 2023 3:22 AM
>
> On Fri, 7 Jul 2023 10:29:00 +0100
> Greg Kroah-Hartman <[email protected]> wrote:
>
> > On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > > From: Greg Kroah-Hartman <[email protected]> Sent: Thursday, July 6,
> 2023 1:07 AM
> > > >
> > > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > > > From: Petr Tesarik <[email protected]> Sent: Tuesday, June 27, 2023
> > > > 2:54 AM
> > > > > >
> > > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > > big enough for this one bounce buffer. It is inserted into a per-device
> > > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > > buffer is unmapped.
> > > > > >
> > > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > > required after adding a new entry, because any address within a transient
> > > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > > it is passed to another CPU.
> > > > > >
> > > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > > allocated again, then it happens only after a RCU quiescent state.
> > > > > >
> > > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > > a simple boundary check is no longer sufficient.
> > > > > >
> > > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > > >
> > > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > > an expensive operation.
> > > > >
> > > > > I continue to think about swiotlb memory management from the standpoint
> > > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > > a bounce buffer allocation failure. To prevent such failures, the swiotlb
> > > > > memory size must be overly large, which wastes memory.
> > > >
> > > > If "mission critical workloads" are in a vm that allowes overcommit and
> > > > no control over other vms in that same system, then you have worse
> > > > problems, sorry.
> > > >
> > > > Just don't do that.
> > > >
> > >
> > > No, the cases I'm concerned about don't involve memory overcommit.
> > >
> > > CoCo VMs must use swiotlb bounce buffers to do DMA I/O. Current swiotlb
> > > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > > memory at boot time for this purpose. But it's hard to know how much
> > > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > > which can help avoid needing to configure an overly large fixed size at boot.
> >
> > But, as you point out, memory allocation can fail at runtime, so how can
> > you "guarantee" that this will work properly anymore if you are going to
> > make it dynamic?
>
> In general, there is no guarantee, of course, because bounce buffers
> may be requested from interrupt context. I believe Michael is looking
> for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
> new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
> possible, and then there is no need to dip into the coherent pool.
>
> Well, I have deliberately removed all complexities from my v3 series,
> but I have more WIP local topic branches in my local repo:
>
> - allow blocking allocations if possible
> - allocate a new pool before existing pools are full
> - free unused memory pools
>
> I can make a bigger series, or I can send another series as RFC if this
> is desired. ATM I don't feel confident enough that my v3 series will be
> accepted without major changes, so I haven't invested time into
> finalizing the other topic branches.
>
> @Michael: Given that my plan is to introduce blocking allocations
> with a follow-up patch series, is the present approach acceptable?
>

Yes, I think the present approach is acceptable as a first step. But
let me elaborate a bit on my thinking.

I was originally wondering if it is possible for swiotlb_map() to detect
whether it is called from a context that allows sleeping, without the use
of SWIOTLB_MAY_SLEEP. This would get the benefits without having to
explicitly update drivers to add the flag. But maybe that's too risky. For
the CoCo VM scenario that I'm most interested in, being a VM implicitly
reduces the set of drivers that are being used, and so it's not that hard
to add the flag in the key drivers that generate most of the bounce
buffer traffic.

Then I was thinking about a slightly different usage for the flag than what
you implemented in v2 of the series. In the case where swiotlb_map()
can't allocate slots because of the swiotlb pool being full (or mostly full),
kick the background thread (if it is not already awake) to allocate a
dynamic pool and grow the total size of the swiotlb. Then if
SWIOTLB_MAY_SLEEP is *not* set, allocate a transient pool just as you
have implemented in this v3 of the series. But if SWIOTLB_MAY_SLEEP
*is* set, swiotlb_map() should sleep until the background thread has
completed the memory allocation and grown the size of the swiotlb.
After the sleep, retry the slot allocation. Maybe what I'm describing
is what you mean by "allow blocking allocations". :-)
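
In rough pseudo-code (every name below is invented purely to illustrate
the flow; none of it is taken from the actual patches):

	pool = swiotlb_search_pools(dev, size);
	if (!pool) {
		/* Pool full or nearly full: ask the kthread to grow the swiotlb. */
		wake_up_process(mem->grow_thread);

		if (!(flags & SWIOTLB_MAY_SLEEP)) {
			/* Can't sleep: transient pool, exactly as in this v3. */
			pool = swiotlb_alloc_transient(dev, size);
		} else {
			/* Can sleep: wait for the new pool, then retry the search. */
			wait_event(mem->grow_done, swiotlb_has_free_slots(mem, size));
			pool = swiotlb_search_pools(dev, size);
		}
	}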

This approach effectively throttles incoming swiotlb requests when space
is exhausted, and gives the dynamic sizing mechanism a chance to catch
up in an efficient fashion. Limiting transient pools to requests that can't
sleep will reduce the likelihood of exhausting the coherent memory
pools. And as you mentioned above, kicking the background thread at the
90% full mark (or some such heuristic) also helps the dynamic sizing
mechanism keep up with demand.

Michael

2023-07-10 09:58:50

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

On Sat, 8 Jul 2023 15:18:32 +0000
"Michael Kelley (LINUX)" <[email protected]> wrote:

> From: Petr Tesařík <[email protected]> Sent: Friday, July 7, 2023 3:22 AM
> >
> > On Fri, 7 Jul 2023 10:29:00 +0100
> > Greg Kroah-Hartman <[email protected]> wrote:
> >
> > > On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > > > From: Greg Kroah-Hartman <[email protected]> Sent: Thursday, July 6,
> > 2023 1:07 AM
> > > > >
> > > > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > > > > From: Petr Tesarik <[email protected]> Sent: Tuesday, June 27, 2023
> > > > > 2:54 AM
> > > > > > >
> > > > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > > > big enough for this one bounce buffer. It is inserted into a per-device
> > > > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > > > buffer is unmapped.
> > > > > > >
> > > > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > > > required after adding a new entry, because any address within a transient
> > > > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > > > it is passed to another CPU.
> > > > > > >
> > > > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > > > allocated again, then it happens only after a RCU quiescent state.
> > > > > > >
> > > > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > > > a simple boundary check is no longer sufficient.
> > > > > > >
> > > > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > > > >
> > > > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > > > an expensive operation.
> > > > > >
> > > > > > I continue to think about swiotlb memory management from the standpoint
> > > > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > > > a bounce buffer allocation failure. To prevent such failures, the swiotlb
> > > > > > memory size must be overly large, which wastes memory.
> > > > >
> > > > > If "mission critical workloads" are in a vm that allowes overcommit and
> > > > > no control over other vms in that same system, then you have worse
> > > > > problems, sorry.
> > > > >
> > > > > Just don't do that.
> > > > >
> > > >
> > > > No, the cases I'm concerned about don't involve memory overcommit.
> > > >
> > > > CoCo VMs must use swiotlb bounce buffers to do DMA I/O. Current swiotlb
> > > > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > > > memory at boot time for this purpose. But it's hard to know how much
> > > > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > > > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > > > which can help avoid needing to configure an overly large fixed size at boot.
> > >
> > > But, as you point out, memory allocation can fail at runtime, so how can
> > > you "guarantee" that this will work properly anymore if you are going to
> > > make it dynamic?
> >
> > In general, there is no guarantee, of course, because bounce buffers
> > may be requested from interrupt context. I believe Michael is looking
> > for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
> > new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
> > possible, and then there is no need to dip into the coherent pool.
> >
> > Well, I have deliberately removed all complexities from my v3 series,
> > but I have more WIP local topic branches in my local repo:
> >
> > - allow blocking allocations if possible
> > - allocate a new pool before existing pools are full
> > - free unused memory pools
> >
> > I can make a bigger series, or I can send another series as RFC if this
> > is desired. ATM I don't feel confident enough that my v3 series will be
> > accepted without major changes, so I haven't invested time into
> > finalizing the other topic branches.
> >
> > @Michael: Given that my plan is to introduce blocking allocations
> > with a follow-up patch series, is the present approach acceptable?
> >
>
> Yes, I think the present approach is acceptable as a first step. But
> let me elaborate a bit on my thinking.
>
> I was originally wondering if it is possible for swiotlb_map() to detect
> whether it is called from a context that allows sleeping, without the use
> of SWIOTLB_MAY_SLEEP. This would get the benefits without having to
> explicitly update drivers to add the flag. But maybe that's too risky.

This is a recurring topic and it has been discussed several times in
the mailing lists. If you ask me, the best answer is this one by Andrew
Morton, albeit a bit dated:

https://lore.kernel.org/lkml/[email protected]/

> For
> the CoCo VM scenario that I'm most interested in, being a VM implicitly
> reduces the set of drivers that are being used, and so it's not that hard
> to add the flag in the key drivers that generate most of the bounce
> buffer traffic.

Yes, that's my thinking as well.

> Then I was thinking about a slightly different usage for the flag than what
> you implemented in v2 of the series. In the case where swiotlb_map()
> can't allocate slots because of the swiotlb pool being full (or mostly full),
> kick the background thread (if it is not already awake) to allocate a
> dynamic pool and grow the total size of the swiotlb. Then if
> SWIOTLB_MAY_SLEEP is *not* set, allocate a transient pool just as you
> have implemented in this v3 of the series. But if SWIOTLB_MAY_SLEEP
> *is* set, swiotlb_map() should sleep until the background thread has
> completed the memory allocation and grown the size of the swiotlb.
> After the sleep, retry the slot allocation. Maybe what I'm describing
> is what you mean by "allow blocking allocations". :-)

Not really, but I like the idea. After all, the only reason to have
transient pools is when something is needed immediately while the
background allocation is running.

> This approach effectively throttles incoming swiotlb requests when space
> is exhausted, and gives the dynamic sizing mechanism a chance to catch
> up in an efficient fashion. Limiting transient pools to requests that can't
> sleep will reduce the likelihood of exhausting the coherent memory
> pools. And as you mentioned above, kicking the background thread at the
> 90% full mark (or some such heuristic) also helps the dynamic sizing
> mechanism keep up with demand.

FWIW I did some testing, and my systems were not able to survive a
sudden I/O peak without transient pools, no matter how low I set the
threshold for kicking the background worker. OTOH I always tested with
the smallest possible SWIOTLB (256 KiB times the number of CPUs rounded
up to a power of two, e.g. 16 MiB on my VM with 48 CPUs). Other sizes
may lead to different results.
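
For reference, the 256 KiB per area is IO_TLB_SEGSIZE slots of
IO_TLB_SIZE bytes (128 * 2 KiB); a quick illustrative helper for the
arithmetic, not code from the series:

#include <linux/log2.h>
#include <linux/sizes.h>

/* Illustration only: smallest pool size for a given CPU count. */
static unsigned long swiotlb_min_pool_bytes(unsigned int nr_cpus)
{
	/* 128 slots * 2 KiB per slot = 256 KiB per area */
	return roundup_pow_of_two(nr_cpus) * SZ_256K; /* 48 -> 64 * 256 KiB = 16 MiB */
}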

As a matter of fact, the size of the initial SWIOTLB memory pool and the
size(s) of additional pool(s) sound like interesting tunable parameters
that I haven't explored in depth yet.

Petr T

2023-07-11 16:18:48

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

From: Petr Tesařík <[email protected]> Sent: Monday, July 10, 2023 2:36 AM
>
> On Sat, 8 Jul 2023 15:18:32 +0000
> "Michael Kelley (LINUX)" <[email protected]> wrote:
>
> > From: Petr Tesařík <[email protected]> Sent: Friday, July 7, 2023 3:22 AM
> > >
> > > On Fri, 7 Jul 2023 10:29:00 +0100
> > > Greg Kroah-Hartman <[email protected]> wrote:
> > >
> > > > On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > > > > From: Greg Kroah-Hartman <[email protected]> Sent: Thursday, July 6,
> > > 2023 1:07 AM
> > > > > >
> > > > > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > > > > > From: Petr Tesarik <[email protected]> Sent: Tuesday, June 27, 2023
> > > > > > 2:54 AM
> > > > > > > >
> > > > > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > > > > big enough for this one bounce buffer. It is inserted into a per-device
> > > > > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > > > > buffer is unmapped.
> > > > > > > >
> > > > > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > > > > required after adding a new entry, because any address within a transient
> > > > > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > > > > it is passed to another CPU.
> > > > > > > >
> > > > > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > > > > allocated again, then it happens only after a RCU quiescent state.
> > > > > > > >
> > > > > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > > > > a simple boundary check is no longer sufficient.
> > > > > > > >
> > > > > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > > > > >
> > > > > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > > > > an expensive operation.
> > > > > > >
> > > > > > > I continue to think about swiotlb memory management from the standpoint
> > > > > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > > > > a bounce buffer allocation failure. To prevent such failures, the swiotlb
> > > > > > > memory size must be overly large, which wastes memory.
> > > > > >
> > > > > > If "mission critical workloads" are in a vm that allowes overcommit and
> > > > > > no control over other vms in that same system, then you have worse
> > > > > > problems, sorry.
> > > > > >
> > > > > > Just don't do that.
> > > > > >
> > > > >
> > > > > No, the cases I'm concerned about don't involve memory overcommit.
> > > > >
> > > > > CoCo VMs must use swiotlb bounce buffers to do DMA I/O. Current swiotlb
> > > > > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > > > > memory at boot time for this purpose. But it's hard to know how much
> > > > > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > > > > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > > > > which can help avoid needing to configure an overly large fixed size at boot.
> > > >
> > > > But, as you point out, memory allocation can fail at runtime, so how can
> > > > you "guarantee" that this will work properly anymore if you are going to
> > > > make it dynamic?
> > >
> > > In general, there is no guarantee, of course, because bounce buffers
> > > may be requested from interrupt context. I believe Michael is looking
> > > for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
> > > new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
> > > possible, and then there is no need to dip into the coherent pool.
> > >
> > > Well, I have deliberately removed all complexities from my v3 series,
> > > but I have more WIP local topic branches in my local repo:
> > >
> > > - allow blocking allocations if possible
> > > - allocate a new pool before existing pools are full
> > > - free unused memory pools
> > >
> > > I can make a bigger series, or I can send another series as RFC if this
> > > is desired. ATM I don't feel confident enough that my v3 series will be
> > > accepted without major changes, so I haven't invested time into
> > > finalizing the other topic branches.
> > >
> > > @Michael: Given that my plan is to introduce blocking allocations
> > > with a follow-up patch series, is the present approach acceptable?
> > >
> >
> > Yes, I think the present approach is acceptable as a first step. But
> > let me elaborate a bit on my thinking.
> >
> > I was originally wondering if it is possible for swiotlb_map() to detect
> > whether it is called from a context that allows sleeping, without the use
> > of SWIOTLB_MAY_SLEEP. This would get the benefits without having to
> > explicitly update drivers to add the flag. But maybe that's too risky.
>
> This is a recurring topic and it has been discussed several times in
> the mailing lists. If you ask me, the best answer is this one by Andrew
> Morton, albeit a bit dated:
>
> https://lore.kernel.org/lkml/[email protected]/

Thanks. That's useful context.

>
> > For
> > the CoCo VM scenario that I'm most interested in, being a VM implicitly
> > reduces the set of drivers that are being used, and so it's not that hard
> > to add the flag in the key drivers that generate most of the bounce
> > buffer traffic.
>
> Yes, that's my thinking as well.
>
> > Then I was thinking about a slightly different usage for the flag than what
> > you implemented in v2 of the series. In the case where swiotlb_map()
> > can't allocate slots because of the swiotlb pool being full (or mostly full),
> > kick the background thread (if it is not already awake) to allocate a
> > dynamic pool and grow the total size of the swiotlb. Then if
> > SWIOTLB_MAY_SLEEP is *not* set, allocate a transient pool just as you
> > have implemented in this v3 of the series. But if SWIOTLB_MAY_SLEEP
> > *is* set, swiotlb_map() should sleep until the background thread has
> > completed the memory allocation and grown the size of the swiotlb.
> > After the sleep, retry the slot allocation. Maybe what I'm describing
> > is what you mean by "allow blocking allocations". :-)
>
> Not really, but I like the idea. After all, the only reason to have
> transient pools is when something is needed immediately while the
> background allocation is running.

You can also take the thinking one step further: For bounce buffer
requests that allow blocking, you could decide not to grow the pool
above a specified maximum. If the max has been reached and space
is not available, sleep until space is released by some other in-progress
request. This could be a valid way to handle peak demand while
capping the memory allocated to the bounce buffer pool. There
would be a latency hit because of the waiting, but that could
be a valid tradeoff for rare peaks. Of course, for requests that can't
block, you'd still need to allocate a transient pool.
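
Something along these lines, with every name below made up purely for
illustration:

	if (!(flags & SWIOTLB_MAY_SLEEP))
		return swiotlb_alloc_transient(dev, size);	/* can't block */

	if (mem->total_bytes < mem->max_bytes)
		swiotlb_kick_grower(mem);	/* below the cap: grow in background */

	/* At the cap (or waiting for growth): sleep until space shows up. */
	wait_event(mem->space_waitq, swiotlb_has_free_slots(mem, size));
	return swiotlb_search_pools(dev, size);

The matching wake_up() would come both from the grower and from the
unmap/release path, so waiters make progress either way.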

Michael

>
> > This approach effectively throttles incoming swiotlb requests when space
> > is exhausted, and gives the dynamic sizing mechanism a chance to catch
> > up in an efficient fashion. Limiting transient pools to requests that can't
> > sleep will reduce the likelihood of exhausting the coherent memory
> > pools. And as you mentioned above, kicking the background thread at the
> > 90% full mark (or some such heuristic) also helps the dynamic sizing
> > mechanism keep up with demand.
>
> FWIW I did some testing, and my systems were not able to survive a
> sudden I/O peak without transient pools, no matter how low I set the
> threshold for kicking the background worker. OTOH I always tested with
> the smallest possible SWIOTLB (256 KiB times the number of CPUs rounded
> up to a power of two, e.g. 16 MiB on my VM with 48 CPUs). Other sizes
> may lead to different results.
>
> As a matter of fact, the size of the initial SWIOTLB memory pool and the
> size(s) of additional pool(s) sound like interesting tunable parameters
> that I haven't explored in depth yet.
>
> Petr T