2023-05-09 09:29:26

by Petr Tesarik

Subject: [PATCH v2 RESEND 0/7] Allow dynamic allocation of software IO TLB bounce buffers

From: Petr Tesarik <[email protected]>

The goal of my work is to provide more flexibility in the sizing of
SWIOTLB.

The software IO TLB was designed with these assumptions:

1. It would not be used much, especially on 64-bit systems.
2. A small fixed memory area (64 MiB by default) is sufficient to
handle the few cases which require a bounce buffer.
3. 64 MiB is little enough that it has no impact on the rest of the
system.

These assumptions do not always hold in practice. First, if SEV is
active, all DMA must be done through shared
unencrypted pages, and SWIOTLB is used to make this happen without
changing device drivers. The software IO TLB size is increased to
6% of total memory in sev_setup_arch(), but that is more of an
approximation. The actual requirements may vary depending on the
amount of I/O and which drivers are used. These factors may not be
known at boot time, i.e. when the SWIOTLB is allocated.

Second, other colleagues have noticed that they can reliably get
rid of occasional OOM kills on an Arm embedded device by reducing
the SWIOTLB size. This can be achieved with a kernel parameter, but
determining the right value puts an additional burden on pre-release
testing, which could be avoided if the SWIOTLB were allocated small
and grew only when necessary.

Changes from RFC:
- Track dynamic buffers per device instead of per swiotlb
- Use a linked list instead of a maple tree
- Move initialization of swiotlb fields of struct device to a
helper function
- Rename __lookup_dyn_slot() to lookup_dyn_slot_locked()
- Introduce per-device flag if dynamic buffers are in use
- Add one more user of DMA_ATTR_MAY_SLEEP
- Add kernel-doc comments for new (and some old) code
- Properly escape '*' in dma-attributes.rst

Petr Tesarik (7):
swiotlb: Use a helper to initialize swiotlb fields in struct device
swiotlb: Move code around in preparation for dynamic bounce buffers
dma-mapping: introduce the DMA_ATTR_MAY_SLEEP attribute
swiotlb: Dynamically allocated bounce buffers
swiotlb: Add a boot option to enable dynamic bounce buffers
drm: Use DMA_ATTR_MAY_SLEEP from process context
swiotlb: per-device flag if there are dynamically allocated buffers

.../admin-guide/kernel-parameters.txt | 6 +-
Documentation/core-api/dma-attributes.rst | 10 +
drivers/base/core.c | 4 +-
drivers/gpu/drm/drm_gem_shmem_helper.c | 2 +-
drivers/gpu/drm/drm_prime.c | 2 +-
include/linux/device.h | 12 +
include/linux/dma-mapping.h | 6 +
include/linux/swiotlb.h | 54 ++-
kernel/dma/swiotlb.c | 382 ++++++++++++++++--
9 files changed, 443 insertions(+), 35 deletions(-)

--
2.25.1


2023-05-09 09:30:25

by Petr Tesarik

Subject: [PATCH v2 RESEND 3/7] dma-mapping: introduce the DMA_ATTR_MAY_SLEEP attribute

From: Petr Tesarik <[email protected]>

Introduce a DMA attribute to tell the DMA-mapping subsystem that
the operation is allowed to sleep.

This patch merely adds the flag, but it does not do anything at
the moment.
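
For illustration, a driver mapping a buffer from process context could
pass the new attribute like this (a hypothetical call site; dev, buf
and len are assumed to exist in the driver, and dma_map_single_attrs()
is the existing attribute-taking DMA API helper):

	dma_addr_t dma;

	dma = dma_map_single_attrs(dev, buf, len, DMA_TO_DEVICE,
				   DMA_ATTR_MAY_SLEEP);
	if (dma_mapping_error(dev, dma))
		return -ENOMEM;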

Signed-off-by: Petr Tesarik <[email protected]>
---
Documentation/core-api/dma-attributes.rst | 10 ++++++++++
include/linux/dma-mapping.h | 6 ++++++
2 files changed, 16 insertions(+)

diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e92..9ce00926455f 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,13 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
+
+DMA_ATTR_MAY_SLEEP
+------------------
+
+This tells the DMA-mapping subsystem that it is allowed to sleep. For example,
+if mapping needs a bounce buffer, software IO TLB may use CMA for the
+allocation if this flag is given.
+
+This attribute is not used for dma_alloc\* functions. Instead, the provided
+GFP flags are used to determine whether the allocation may sleep.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0ee20b764000..7a75c503ac38 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,12 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)

+/*
+ * DMA_ATTR_MAY_SLEEP: This tells the DMA-mapping subsystem that it is allowed
+ * to sleep.
+ */
+#define DMA_ATTR_MAY_SLEEP (1UL << 10)
+
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
* be given to a device to use as a DMA source or target. It is specific to a
--
2.25.1

2023-05-09 09:36:19

by Petr Tesarik

Subject: [PATCH v2 RESEND 2/7] swiotlb: Move code around in preparation for dynamic bounce buffers

From: Petr Tesarik <[email protected]>

To prepare for the introduction of dynamically allocated bounce
buffers, separate out common code and code which handles non-dynamic
(aka fixed) bounce buffers.

No functional change, but this commit should make the addition of
dynamic allocations easier to review.

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/swiotlb.h | 31 ++++++++++-
kernel/dma/swiotlb.c | 110 +++++++++++++++++++++++++++++++++-------
2 files changed, 122 insertions(+), 19 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index d851dbce1143..281ecc6b9bcc 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -114,11 +114,40 @@ struct io_tlb_mem {
};
extern struct io_tlb_mem io_tlb_default_mem;

+/**
+ * is_swiotlb_fixed() - check if a physical address belongs to a swiotlb slot
+ * @mem: relevant swiotlb pool
+ * @paddr: physical address within the DMA buffer
+ *
+ * Check if @paddr points into a fixed bounce buffer slot.
+ * This check should be as fast as possible.
+ *
+ * Return:
+ * * %true if @paddr points into a @mem fixed slot
+ * * %false otherwise
+ */
+static inline bool is_swiotlb_fixed(struct io_tlb_mem *mem, phys_addr_t paddr)
+{
+ return paddr >= mem->start && paddr < mem->end;
+}
+
+/**
+ * is_swiotlb_buffer() - check if a physical address is allocated from the
+ * swiotlb pool
+ * @dev: device which has mapped the buffer
+ * @paddr: physical address within the DMA buffer
+ *
+ * Check if @paddr points into a bounce buffer.
+ *
+ * Return:
+ * * %true if @paddr points into a bounce buffer
+ * * %false otherwise
+ */
static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;

- return mem && paddr >= mem->start && paddr < mem->end;
+ return mem && is_swiotlb_fixed(mem, paddr);
}

static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index af2e304c672c..96ba93be6772 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -76,6 +76,10 @@ struct io_tlb_mem io_tlb_default_mem;
static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
static unsigned long default_nareas;

+static void swiotlb_copy(struct device *dev, phys_addr_t orig_addr,
+ unsigned char *vaddr, size_t size, size_t alloc_size,
+ unsigned int tlb_offset, enum dma_data_direction dir);
+
/**
* struct io_tlb_area - IO TLB memory area descriptor
*
@@ -480,7 +484,6 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
- unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = mem->vaddr + tlb_addr - mem->start;
unsigned int tlb_offset, orig_addr_offset;

@@ -497,6 +500,34 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
}

tlb_offset -= orig_addr_offset;
+ swiotlb_copy(dev, orig_addr, vaddr, size, alloc_size, tlb_offset, dir);
+}
+
+/**
+ * swiotlb_copy() - copy swiotlb buffer content, checking for overflows.
+ * @dev: device which has mapped the bounce buffer
+ * @orig_addr: physical address of the original buffer
+ * @vaddr: virtual address inside the bounce buffer
+ * @size: number of bytes to copy
+ * @alloc_size: total allocated size of the bounce buffer
+ * @tlb_offset: offset within the bounce buffer
+ * @dir: direction of the data transfer
+ *
+ * If @dir is %DMA_TO_DEVICE, copy data from the original buffer to the
+ * bounce buffer, otherwise copy from the bounce buffer to the original
+ * buffer.
+ *
+ * The original buffer may be in high memory; that's why @orig_addr is
+ * a physical address. Note that this is the address of the beginning
+ * of the bounce buffer. Copying starts at offset @tlb_offset. This is
+ * needed to check accesses beyond the allocated size.
+ */
+static void swiotlb_copy(struct device *dev, phys_addr_t orig_addr,
+ unsigned char *vaddr, size_t size, size_t alloc_size,
+ unsigned int tlb_offset, enum dma_data_direction dir)
+{
+ unsigned long pfn = PFN_DOWN(orig_addr);
+
if (tlb_offset > alloc_size) {
dev_WARN_ONCE(dev, 1,
"Buffer overflow detected. Allocation size: %zu. Mapping size: %zu+%u.\n",
@@ -727,15 +758,65 @@ static unsigned long mem_used(struct io_tlb_mem *mem)
return used;
}

+/**
+ * swiotlb_fixed_map() - allocate a bounce buffer from fixed slots
+ * @dev: device which maps the buffer
+ * @orig_addr: address of the original buffer
+ * @alloc_size: total size of the original buffer
+ * @alloc_align_mask:
+ * required physical alignment of the I/O buffer
+ * @attrs: optional DMA attributes for the map operation
+ *
+ * Search for a suitable slot or sequence of slots and initialize them
+ * for use with the original buffer.
+ *
+ * Return: Physical address of the bounce buffer, or %DMA_MAPPING_ERROR.
+ */
+static phys_addr_t swiotlb_fixed_map(struct device *dev, phys_addr_t orig_addr,
+ size_t alloc_size, unsigned int alloc_align_mask,
+ unsigned long attrs)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ unsigned int offset = swiotlb_align_offset(dev, orig_addr);
+ int index = swiotlb_find_slots(dev, orig_addr,
+ alloc_size + offset, alloc_align_mask);
+ unsigned int i;
+
+ if (index == -1)
+ return (phys_addr_t)DMA_MAPPING_ERROR;
+
+ /*
+ * Save away the mapping from the original address to the DMA address.
+ * This is needed when we sync the memory. Then we sync the buffer if
+ * needed.
+ */
+ for (i = 0; i < nr_slots(alloc_size + offset); i++)
+ mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
+ return slot_addr(mem->start, index) + offset;
+}
+
+/**
+ * swiotlb_tbl_map_single() - map DMA buffer to a bounce buffer
+ * @dev: device which maps the buffer
+ * @orig_addr: address of the original buffer
+ * @mapping_size: size of the original buffer to be synced now
+ * @alloc_size: total size of the original buffer
+ * @alloc_align_mask:
+ * required physical alignment of the I/O buffer
+ * @dir: direction of the data transfer
+ * @attrs: optional DMA attributes for the map operation
+ *
+ * Create a mapping of the DMA buffer into a bounce buffer and copy the
+ * original data.
+ *
+ * Return: Physical address of the bounce buffer, or %DMA_MAPPING_ERROR.
+ */
phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
size_t mapping_size, size_t alloc_size,
unsigned int alloc_align_mask, enum dma_data_direction dir,
unsigned long attrs)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
- unsigned int offset = swiotlb_align_offset(dev, orig_addr);
- unsigned int i;
- int index;
phys_addr_t tlb_addr;

if (!mem || !mem->nslabs) {
@@ -753,24 +834,17 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
return (phys_addr_t)DMA_MAPPING_ERROR;
}

- index = swiotlb_find_slots(dev, orig_addr,
- alloc_size + offset, alloc_align_mask);
- if (index == -1) {
+ tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
+ alloc_align_mask, attrs);
+
+ if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR) {
if (!(attrs & DMA_ATTR_NO_WARN))
dev_warn_ratelimited(dev,
- "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
- alloc_size, mem->nslabs, mem_used(mem));
- return (phys_addr_t)DMA_MAPPING_ERROR;
+ "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
+ alloc_size, mem->nslabs, mem_used(mem));
+ return tlb_addr;
}

- /*
- * Save away the mapping from the original address to the DMA address.
- * This is needed when we sync the memory. Then we sync the buffer if
- * needed.
- */
- for (i = 0; i < nr_slots(alloc_size + offset); i++)
- mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
- tlb_addr = slot_addr(mem->start, index) + offset;
/*
* When dir == DMA_FROM_DEVICE we could omit the copy from the orig
* to the tlb buffer, if we knew for sure the device will
--
2.25.1

2023-05-09 09:37:50

by Petr Tesarik

Subject: [PATCH v2 RESEND 1/7] swiotlb: Use a helper to initialize swiotlb fields in struct device

From: Petr Tesarik <[email protected]>

Move the initialization of the swiotlb fields in struct device to a
helper in swiotlb.h. This also makes it possible to provide a stub
implementation when swiotlb is not configured, getting rid of an
#ifdef in the driver core.

Signed-off-by: Petr Tesarik <[email protected]>
---
drivers/base/core.c | 4 +---
include/linux/swiotlb.h | 12 ++++++++++++
2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 3dff5037943e..46d1d78c5beb 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3108,9 +3108,7 @@ void device_initialize(struct device *dev)
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
dev->dma_coherent = dma_default_coherent;
#endif
-#ifdef CONFIG_SWIOTLB
- dev->dma_io_tlb_mem = &io_tlb_default_mem;
-#endif
+ swiotlb_dev_init(dev);
}
EXPORT_SYMBOL_GPL(device_initialize);

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7af2673b47ba..d851dbce1143 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -128,6 +128,15 @@ static inline bool is_swiotlb_force_bounce(struct device *dev)
return mem && mem->force_bounce;
}

+/**
+ * swiotlb_dev_init() - initialize swiotlb fields in &struct device
+ * @dev: device to be initialized
+ */
+static inline void swiotlb_dev_init(struct device *dev)
+{
+ dev->dma_io_tlb_mem = &io_tlb_default_mem;
+}
+
void swiotlb_init(bool addressing_limited, unsigned int flags);
void __init swiotlb_exit(void);
size_t swiotlb_max_mapping_size(struct device *dev);
@@ -137,6 +146,9 @@ void __init swiotlb_adjust_size(unsigned long size);
static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
{
}
+static inline void swiotlb_dev_init(struct device *dev)
+{
+}
static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
return false;
--
2.25.1

2023-05-09 09:43:14

by Petr Tesarik

Subject: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

From: Petr Tesarik <[email protected]>

The software IO TLB was designed with the assumption that it is not
used much, especially on 64-bit systems, so a small fixed memory
area (currently 64 MiB) is sufficient to handle the few cases which
still require a bounce buffer. However, these cases are not so rare
in some circumstances.

First, if SEV is active, all DMA must be done through shared
unencrypted pages, and SWIOTLB is used to make this happen without
changing device drivers. The software IO TLB size is increased to 6%
of total memory in sev_setup_arch(), but that is more of an
approximation. The actual requirements may vary depending on which
drivers are used and the amount of I/O.

Second, some embedded devices have very little RAM, so 64 MiB is not
negligible. Sadly, these are exactly the devices that also often
need a software IO TLB. Although the minimum swiotlb size can be found
empirically by extensive testing, it would be easier to allocate a
small swiotlb at boot and let it grow on demand.

Growing the SWIOTLB data structures at run time is impossible. The
whole SWIOTLB region is contiguous in physical memory to allow
combining adjacent slots and also to ensure that alignment
constraints can be met. The SWIOTLB is too big for the buddy
allocator (cf. MAX_ORDER). More importantly, even if a new SWIOTLB
could be allocated (e.g. from CMA), it cannot be extended in-place
(because surrounding pages may be already allocated for other
purposes), and there is no mechanism for relocating already mapped
bounce buffers: The DMA API gets only the address of a buffer, and
the implementation (direct or IOMMU) checks whether it belongs to
the software IO TLB.

It is possible to allocate multiple smaller struct io_tlb_mem
instances. However, they would have to be stored in a non-constant
container (list or tree), which needs synchronization between
readers and writers, creating contention in a hot path for all
devices, not only those which need software IO TLB.

Another option is to allocate a very large SWIOTLB at boot, but
allow migrating pages to other users (like CMA does). This approach
might work, but there are many open issues:

1. After a page is migrated away from SWIOTLB, it must not be used
as a (direct) DMA buffer. Otherwise SWIOTLB code would have to
check which pages have been migrated to determine whether a given
buffer address belongs to a bounce buffer or not, effectively
introducing all the issues of multiple SWIOTLB instances.

2. Unlike SWIOTLB, CMA cannot be used from atomic contexts, for a
number of different reasons. This might be changed in theory, but
it would take a lot of investigation and time. OTOH an improvement
to the SWIOTLB is needed now.

3. If SWIOTLB is implemented separately from CMA and not as its
part, users have to solve the dilemma of how to distribute
precious DMA-able memory between them.

The present patch is a simplistic solution. Bounce buffers are
allocated using the non-coherent DMA API, removing the need to
implement a new custom allocator. These buffers are then tracked in
a per-device linked list, reducing the impact on devices that do not
need the SWIOTLB.

Analysis of real-world I/O patterns has shown that the same buffer
is typically looked up repeatedly (for further sync operations, or
to be unmapped). The most recently used bounce buffer is therefore
always moved to the beginning of the list. The list performed better
than a maple tree when tested with fio against a QEMU SATA drive
backed by a RAM block device in the host (4 cores, 16 iodepth).
Other scenarios are also likely to benefit from this MRU order but
have not been tested.
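
For illustration, this is the kind of (hypothetical) driver pattern
that keeps hitting the same bounce buffer; every sync and the final
unmap go through the swiotlb lookup when the mapping is bounced:

	dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;
	/* ... the device writes into the buffer ... */
	dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);	/* lookup */
	/* ... the CPU reads the data, hands the buffer back ... */
	dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);	/* lookup */
	/* ... */
	dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);		/* lookup */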

Operations on the list are serialized with a spinlock. It is
unfortunately not possible to use an RCU list, because quiescent
state is not guaranteed to happen before an asynchronous event (e.g.
hardware interrupt) on another CPU. If that CPU used an old version
of the list, it would fail to copy data to and/or from the newly
allocated bounce buffer.

Last but not least, bounce buffers are never allocated dynamically
if the SWIOTLB is in fact a DMA restricted pool, because that would
defeat the purpose of a restricted pool.

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/device.h | 8 ++
include/linux/swiotlb.h | 8 +-
kernel/dma/swiotlb.c | 252 ++++++++++++++++++++++++++++++++++++++--
3 files changed, 259 insertions(+), 9 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 472dd24d4823..d1d2b8557b30 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -510,6 +510,12 @@ struct device_physical_location {
* @dma_mem: Internal for coherent mem override.
* @cma_area: Contiguous memory area for dma allocations
* @dma_io_tlb_mem: Pointer to the swiotlb pool used. Not for driver use.
+ * @dma_io_tlb_dyn_lock:
+ * Spinlock to protect the list of dynamically allocated bounce
+ * buffers.
+ * @dma_io_tlb_dyn_slots:
+ * Dynamically allocated bounce buffers for this device.
+ * Not for driver use.
* @archdata: For arch-specific additions.
* @of_node: Associated device tree node.
* @fwnode: Associated device node supplied by platform firmware.
@@ -615,6 +621,8 @@ struct device {
#endif
#ifdef CONFIG_SWIOTLB
struct io_tlb_mem *dma_io_tlb_mem;
+ spinlock_t dma_io_tlb_dyn_lock;
+ struct list_head dma_io_tlb_dyn_slots;
#endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 281ecc6b9bcc..6aada6ac31e2 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -114,6 +114,8 @@ struct io_tlb_mem {
};
extern struct io_tlb_mem io_tlb_default_mem;

+bool is_swiotlb_dyn(struct device *dev, phys_addr_t paddr);
+
/**
* is_swiotlb_fixed() - check if a physical address belongs to a swiotlb slot
* @mem: relevant swiotlb pool
@@ -147,7 +149,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;

- return mem && is_swiotlb_fixed(mem, paddr);
+ return mem &&
+ (is_swiotlb_fixed(mem, paddr) ||
+ is_swiotlb_dyn(dev, paddr));
}

static inline bool is_swiotlb_force_bounce(struct device *dev)
@@ -164,6 +168,8 @@ static inline bool is_swiotlb_force_bounce(struct device *dev)
static inline void swiotlb_dev_init(struct device *dev)
{
dev->dma_io_tlb_mem = &io_tlb_default_mem;
+ spin_lock_init(&dev->dma_io_tlb_dyn_lock);
+ INIT_LIST_HEAD(&dev->dma_io_tlb_dyn_slots);
}

void swiotlb_init(bool addressing_limited, unsigned int flags);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 96ba93be6772..612e1c2e9573 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -68,6 +68,22 @@ struct io_tlb_slot {
unsigned int list;
};

+/**
+ * struct io_tlb_dyn_slot - dynamically allocated swiotlb slot
+ * @node: node in the per-device list
+ * @orig_addr: physical address of the original DMA buffer
+ * @alloc_size: total size of the DMA buffer
+ * @page: first page of the bounce buffer
+ * @dma_addr: DMA address of the bounce buffer
+ */
+struct io_tlb_dyn_slot {
+ struct list_head node;
+ phys_addr_t orig_addr;
+ size_t alloc_size;
+ struct page *page;
+ dma_addr_t dma_addr;
+};
+
static bool swiotlb_force_bounce;
static bool swiotlb_force_disable;

@@ -466,6 +482,191 @@ void __init swiotlb_exit(void)
memset(mem, 0, sizeof(*mem));
}

+/**
+ * lookup_dyn_slot_locked() - look up a dynamically allocated bounce buffer
+ * @dev: device which has mapped the buffer
+ * @paddr: physical address within the bounce buffer
+ *
+ * Walk through the list of bounce buffers dynamically allocated for @dev,
+ * looking for a buffer which contains @paddr.
+ *
+ * Context: Any context. Expects dma_io_tlb_dyn_lock lock to be held.
+ * Return:
+ * Address of a &struct io_tlb_dyn_slot, or %NULL if not found.
+ */
+static struct io_tlb_dyn_slot *lookup_dyn_slot_locked(struct device *dev,
+ phys_addr_t paddr)
+{
+ struct io_tlb_dyn_slot *slot;
+
+ list_for_each_entry(slot, &dev->dma_io_tlb_dyn_slots, node) {
+ phys_addr_t start = page_to_phys(slot->page);
+ phys_addr_t end = start + slot->alloc_size - 1;
+
+ if (start <= paddr && paddr <= end)
+ return slot;
+ }
+ return NULL;
+}
+
+/**
+ * lookup_dyn_slot() - look up a dynamically allocated bounce buffer
+ * @dev: device which has mapped the buffer
+ * @paddr: physical address within the bounce buffer
+ *
+ * Search for a dynamically allocated bounce buffer which contains
+ * @paddr. If found, the buffer is moved to the beginning of the linked
+ * list. Use lookup_dyn_slot_locked() directly where this behavior is not
+ * required/desired.
+ *
+ * Context: Any context. Takes and releases dma_io_tlb_dyn_lock.
+ * Return:
+ * Address of a &struct io_tlb_dyn_slot, or %NULL if not found.
+ */
+static struct io_tlb_dyn_slot *lookup_dyn_slot(struct device *dev,
+ phys_addr_t paddr)
+{
+ struct io_tlb_dyn_slot *slot;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
+ slot = lookup_dyn_slot_locked(dev, paddr);
+ list_move(&slot->node, &dev->dma_io_tlb_dyn_slots);
+ spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
+ return slot;
+}
+
+/**
+ * is_swiotlb_dyn() - check if a physical address belongs to a dynamically
+ * allocated bounce buffer
+ * @dev: device which has mapped the buffer
+ * @paddr: physical address within the bounce buffer
+ *
+ * Check whether there is a dynamically allocated bounce buffer which
+ * contains @paddr. If found, the buffer is moved to the beginning of
+ * the MRU list.
+ *
+ * Return:
+ * * %true if @paddr points into a dynamically allocated bounce buffer
+ * * %false otherwise
+ */
+bool is_swiotlb_dyn(struct device *dev, phys_addr_t paddr)
+{
+ return !!lookup_dyn_slot(dev, paddr);
+}
+
+/**
+ * swiotlb_dyn_bounce() - copy a dynamically allocated buffer from or back
+ * to the original dma location
+ * @dev: device which has mapped the buffer
+ * @tlb_addr: physical address inside the bounce buffer
+ * @size: size of the region to be copied
+ * @dir: direction of the data transfer
+ *
+ * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
+ * This function works only for dynamically allocated bounce buffers.
+ */
+static void swiotlb_dyn_bounce(struct device *dev, phys_addr_t tlb_addr,
+ size_t size, enum dma_data_direction dir)
+{
+ struct io_tlb_dyn_slot *slot = lookup_dyn_slot(dev, tlb_addr);
+ unsigned int tlb_offset;
+ unsigned char *vaddr;
+
+ if (!slot)
+ return;
+
+ tlb_offset = tlb_addr - page_to_phys(slot->page);
+ vaddr = page_address(slot->page) + tlb_offset;
+
+ swiotlb_copy(dev, slot->orig_addr, vaddr, size, slot->alloc_size,
+ tlb_offset, dir);
+}
+
+/**
+ * swiotlb_dyn_map() - allocate a bounce buffer dynamically
+ * @dev: device which maps the buffer
+ * @orig_addr: address of the original buffer
+ * @alloc_size: total size of the original buffer
+ * @alloc_align_mask:
+ * required physical alignment of the I/O buffer
+ * @dir: direction of the data transfer
+ * @attrs: optional DMA attributes for the map operation
+ *
+ * Allocate a new bounce buffer using DMA non-coherent API. This function
+ * assumes that there is a fallback allocation scheme if the allocation
+ * fails. In fact, it always fails for buffers smaller than a page and
+ * for address constraints that are not (yet) correctly handled by
+ * dma_direct_alloc_pages().
+ *
+ * Return: Physical address of the bounce buffer, or %DMA_MAPPING_ERROR.
+ */
+static phys_addr_t swiotlb_dyn_map(struct device *dev, phys_addr_t orig_addr,
+ size_t alloc_size, unsigned int alloc_align_mask,
+ enum dma_data_direction dir, unsigned long attrs)
+{
+ struct io_tlb_dyn_slot *slot;
+ unsigned long flags;
+ gfp_t gfp;
+
+ /* Allocation has page granularity. Avoid small buffers. */
+ if (alloc_size < PAGE_SIZE)
+ goto err;
+
+ /* DMA direct does not deal with physical address constraints. */
+ if (alloc_align_mask || dma_get_min_align_mask(dev))
+ goto err;
+
+ gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_NOIO : GFP_NOWAIT;
+ slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
+ if (!slot)
+ goto err;
+
+ slot->orig_addr = orig_addr;
+ slot->alloc_size = alloc_size;
+ slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
+ &slot->dma_addr, dir,
+ gfp | __GFP_NOWARN);
+ if (!slot->page)
+ goto err_free_slot;
+
+ spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
+ list_add(&slot->node, &dev->dma_io_tlb_dyn_slots);
+ spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
+
+ return page_to_phys(slot->page);
+
+err_free_slot:
+ kfree(slot);
+err:
+ return (phys_addr_t)DMA_MAPPING_ERROR;
+}
+
+/**
+ * swiotlb_dyn_unmap() - unmap a dynamically allocated bounce buffer
+ * @dev: device which mapped the buffer
+ * @tlb_addr: physical address of the bounce buffer
+ * @dir: direction of the data transfer
+ *
+ * Release all resources associated with a dynamically allocated bounce
+ * buffer.
+ */
+static void swiotlb_dyn_unmap(struct device *dev, phys_addr_t tlb_addr,
+ enum dma_data_direction dir)
+{
+ struct io_tlb_dyn_slot *slot;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
+ slot = lookup_dyn_slot_locked(dev, tlb_addr);
+ list_del(&slot->node);
+ spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
+
+ dma_direct_free_pages(dev, slot->alloc_size, slot->page,
+ slot->dma_addr, dir);
+ kfree(slot);
+}
+
/*
* Return the offset into a iotlb slot required to keep the device happy.
*/
@@ -474,11 +675,19 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
}

-/*
- * Bounce: copy the swiotlb buffer from or back to the original dma location
+/**
+ * swiotlb_fixed_bounce() - copy a fixed-slot swiotlb buffer from or back
+ * to the original dma location
+ * @dev: device which has mapped the buffer
+ * @tlb_addr: physical address inside the bounce buffer
+ * @size: size of the region to be copied
+ * @dir: direction of the data transfer
+ *
+ * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
+ * This function works only for fixed bounce buffers.
*/
-static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
- enum dma_data_direction dir)
+static void swiotlb_fixed_bounce(struct device *dev, phys_addr_t tlb_addr,
+ size_t size, enum dma_data_direction dir)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
@@ -574,6 +783,25 @@ static void swiotlb_copy(struct device *dev, phys_addr_t orig_addr,
}
}

+/**
+ * swiotlb_bounce() - copy the swiotlb buffer from or back to the original
+ * dma location
+ * @dev: device which has mapped the buffer
+ * @tlb_addr: physical address inside the bounce buffer
+ * @size: size of the region to be copied
+ * @dir: direction of the data transfer
+ *
+ * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
+ */
+static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
+ enum dma_data_direction dir)
+{
+ if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
+ swiotlb_fixed_bounce(dev, tlb_addr, size, dir);
+ else
+ swiotlb_dyn_bounce(dev, tlb_addr, size, dir);
+}
+
static inline phys_addr_t slot_addr(phys_addr_t start, phys_addr_t idx)
{
return start + (idx << IO_TLB_SHIFT);
@@ -834,8 +1062,13 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
return (phys_addr_t)DMA_MAPPING_ERROR;
}

- tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
- alloc_align_mask, attrs);
+ tlb_addr = (phys_addr_t)DMA_MAPPING_ERROR;
+ if (!is_swiotlb_for_alloc(dev))
+ tlb_addr = swiotlb_dyn_map(dev, orig_addr, alloc_size,
+ alloc_align_mask, dir, attrs);
+ if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
+ tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
+ alloc_align_mask, attrs);

if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR) {
if (!(attrs & DMA_ATTR_NO_WARN))
@@ -919,7 +1152,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);

- swiotlb_release_slots(dev, tlb_addr);
+ if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
+ swiotlb_release_slots(dev, tlb_addr);
+ else
+ swiotlb_dyn_unmap(dev, tlb_addr, dir);
}

void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
@@ -1089,7 +1325,7 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size)
{
phys_addr_t tlb_addr = page_to_phys(page);

- if (!is_swiotlb_buffer(dev, tlb_addr))
+ if (!is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
return false;

swiotlb_release_slots(dev, tlb_addr);
--
2.25.1

2023-05-09 09:50:42

by Petr Tesarik

Subject: [PATCH v2 RESEND 6/7] drm: Use DMA_ATTR_MAY_SLEEP from process context

From: Petr Tesarik <[email protected]>

These mappings are never done from atomic context. If a dynamically
allocated bounce buffer is used for the mapping, this change allows
it to be allocated from CMA.

Signed-off-by: Petr Tesarik <[email protected]>
---
drivers/gpu/drm/drm_gem_shmem_helper.c | 2 +-
drivers/gpu/drm/drm_prime.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
index 4ea6507a77e5..b32dcefb17a9 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -709,7 +709,7 @@ static struct sg_table *drm_gem_shmem_get_pages_sgt_locked(struct drm_gem_shmem_
goto err_put_pages;
}
/* Map the pages for use by the h/w. */
- ret = dma_map_sgtable(obj->dev->dev, sgt, DMA_BIDIRECTIONAL, 0);
+ ret = dma_map_sgtable(obj->dev->dev, sgt, DMA_BIDIRECTIONAL, DMA_ATTR_MAY_SLEEP);
if (ret)
goto err_free_sgt;

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index d29dafce9bb0..ac28489943b8 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -639,7 +639,7 @@ struct sg_table *drm_gem_map_dma_buf(struct dma_buf_attachment *attach,
return sgt;

ret = dma_map_sgtable(attach->dev, sgt, dir,
- DMA_ATTR_SKIP_CPU_SYNC);
+ DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MAY_SLEEP);
if (ret) {
sg_free_table(sgt);
kfree(sgt);
--
2.25.1

2023-05-09 09:52:09

by Petr Tesarik

Subject: [PATCH v2 RESEND 5/7] swiotlb: Add a boot option to enable dynamic bounce buffers

From: Petr Tesarik <[email protected]>

The main goal of allocating bounce buffers dynamically is to allow
allocating a minimal fixed swiotlb at boot time while avoiding hard
limits on the amount of I/O that can be handled later.

Compared to fixed IO TLB slots, dynamic allocation of bounce buffers
typically increases the worst-case I/O latency and may also reduce
performance for some workloads.

I did some basic testing with fio against a QEMU SATA drive backed
by a RAM block device in the host to minimize external factors. The
kernel was booted with "swiotlb=force,dynamic". I performed testing
of single-threaded I/O of 4-KiB segments, single-threaded I/O of
1-MiB segments, and 4-core parallel I/O of 64-KiB segments. The last
column is the coefficient of variation over 5 runs of the test:

                Read     Write    Coeff
single 4-KiB   +1.9%    +1.9%     1.7%
single 1-MiB   -8.1%    -8.2%     2.2%
parallel       -9.4%    -9.5%     2.6%

There is a slight increase in bandwidth for single-threaded 4-KiB
segments. This is because the buddy allocator is quite efficient for
order-0 allocations, so the overhead is offset by faster allocation
from an almost empty fixed swiotlb (which is still used for buffers
smaller than one page).

Anyway, since the feature is new and does not benefit all
workloads, make it disabled by default and let people turn it on
with "swiotlb=dynamic" if needed. Since this option can be combined
with "force", the parser is modified to allow multiple options
separated by commas.
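For example, "swiotlb=65536,4,force,dynamic" combines an explicit slab
and area count with both options (the numbers here are purely
illustrative).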

A new bool field is added to struct io_tlb_mem to tell whether
dynamic allocations are allowed. This field is always false for DMA
restricted pools. It is also false for other software IO TLBs
unless "swiotlb=dynamic" was specified.

Signed-off-by: Petr Tesarik <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 6 +++++-
include/linux/swiotlb.h | 3 ++-
kernel/dma/swiotlb.c | 20 ++++++++++++++-----
3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e5bab29685f..9f7f64edf6b5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6152,14 +6152,18 @@
Execution Facility on pSeries.

swiotlb= [ARM,IA-64,PPC,MIPS,X86]
- Format: { <int> [,<int>] | force | noforce }
+ Format: { <int> [,<int>] [,option-list] | option-list }
<int> -- Number of I/O TLB slabs
<int> -- Second integer after comma. Number of swiotlb
areas with their own lock. Will be rounded up
to a power of 2.
+ <option-list> -- Comma-separated list of options.
+
+ Available options:
force -- force using of bounce buffers even if they
wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
+ dynamic -- allow dynamic allocation of bounce buffers

switches= [HW,M68k]

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 6aada6ac31e2..daa2064f2ede 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -103,6 +103,7 @@ struct io_tlb_mem {
bool late_alloc;
bool force_bounce;
bool for_alloc;
+ bool allow_dyn;
unsigned int nareas;
unsigned int area_nslabs;
struct io_tlb_area *areas;
@@ -151,7 +152,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)

return mem &&
(is_swiotlb_fixed(mem, paddr) ||
- is_swiotlb_dyn(dev, paddr));
+ (mem->allow_dyn && is_swiotlb_dyn(dev, paddr)));
}

static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 612e1c2e9573..81eab1c72c50 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -86,6 +86,7 @@ struct io_tlb_dyn_slot {

static bool swiotlb_force_bounce;
static bool swiotlb_force_disable;
+static bool swiotlb_dynamic;

struct io_tlb_mem io_tlb_default_mem;

@@ -165,10 +166,18 @@ setup_io_tlb_npages(char *str)
swiotlb_adjust_nareas(simple_strtoul(str, &str, 0));
if (*str == ',')
++str;
- if (!strcmp(str, "force"))
- swiotlb_force_bounce = true;
- else if (!strcmp(str, "noforce"))
- swiotlb_force_disable = true;
+ while (str && *str) {
+ char *opt = strsep(&str, ",");
+
+ if (!strcmp(opt, "force"))
+ swiotlb_force_bounce = true;
+ else if (!strcmp(opt, "noforce"))
+ swiotlb_force_disable = true;
+ else if (!strcmp(opt, "dynamic"))
+ swiotlb_dynamic = true;
+ else
+ pr_warn("Invalid swiotlb option: %s", opt);
+ }

return 0;
}
@@ -251,6 +260,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
mem->area_nslabs = nslabs / mem->nareas;

mem->force_bounce = swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
+ mem->allow_dyn = swiotlb_dynamic;

for (i = 0; i < mem->nareas; i++) {
spin_lock_init(&mem->areas[i].lock);
@@ -1063,7 +1073,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
}

tlb_addr = (phys_addr_t)DMA_MAPPING_ERROR;
- if (!is_swiotlb_for_alloc(dev))
+ if (mem->allow_dyn)
tlb_addr = swiotlb_dyn_map(dev, orig_addr, alloc_size,
alloc_align_mask, dir, attrs);
if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
--
2.25.1

2023-05-09 09:52:39

by Petr Tesarik

Subject: [PATCH v2 RESEND 7/7] swiotlb: per-device flag if there are dynamically allocated buffers

From: Petr Tesarik <[email protected]>

Do not walk the list of dynamically allocated bounce buffers if the
list is empty. This avoids taking dma_io_tlb_dyn_lock for devices
which do not use any dynamically allocated bounce buffers.

When unmapping the last dynamically allocated bounce buffer, the
flag is set to false as soon as possible to allow skipping the
spinlock even before the list itself is updated.

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/device.h | 4 ++++
include/linux/swiotlb.h | 6 +++++-
kernel/dma/swiotlb.c | 6 ++++++
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index d1d2b8557b30..e340e0f06dce 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -516,6 +516,9 @@ struct device_physical_location {
* @dma_io_tlb_dyn_slots:
* Dynamically allocated bounce buffers for this device.
* Not for driver use.
+ * @dma_io_tlb_have_dyn:
+ * Does this device have any dynamically allocated bounce
+ * buffers? Not for driver use.
* @archdata: For arch-specific additions.
* @of_node: Associated device tree node.
* @fwnode: Associated device node supplied by platform firmware.
@@ -623,6 +626,7 @@ struct device {
struct io_tlb_mem *dma_io_tlb_mem;
spinlock_t dma_io_tlb_dyn_lock;
struct list_head dma_io_tlb_dyn_slots;
+ bool dma_io_tlb_have_dyn;
#endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index daa2064f2ede..8cbb0bebb0bc 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -152,7 +152,11 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)

return mem &&
(is_swiotlb_fixed(mem, paddr) ||
- (mem->allow_dyn && is_swiotlb_dyn(dev, paddr)));
+ /* Pairs with smp_store_release() in swiotlb_dyn_map()
+ * and swiotlb_dyn_unmap().
+ */
+ (smp_load_acquire(&dev->dma_io_tlb_have_dyn) &&
+ is_swiotlb_dyn(dev, paddr)));
}

static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 81eab1c72c50..e8be3ee50f18 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -642,6 +642,9 @@ static phys_addr_t swiotlb_dyn_map(struct device *dev, phys_addr_t orig_addr,

spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
list_add(&slot->node, &dev->dma_io_tlb_dyn_slots);
+ if (!dev->dma_io_tlb_have_dyn)
+ /* Pairs with smp_load_acquire() in is_swiotlb_buffer() */
+ smp_store_release(&dev->dma_io_tlb_have_dyn, true);
spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);

return page_to_phys(slot->page);
@@ -668,6 +671,9 @@ static void swiotlb_dyn_unmap(struct device *dev, phys_addr_t tlb_addr,
unsigned long flags;

spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
+ if (list_is_singular(&dev->dma_io_tlb_dyn_slots))
+ /* Pairs with smp_load_acquire() in is_swiotlb_buffer() */
+ smp_store_release(&dev->dma_io_tlb_have_dyn, false);
slot = lookup_dyn_slot_locked(dev, tlb_addr);
list_del(&slot->node);
spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
--
2.25.1

2023-05-14 19:20:31

by Catalin Marinas

Subject: Re: [PATCH v2 RESEND 7/7] swiotlb: per-device flag if there are dynamically allocated buffers

On Tue, May 09, 2023 at 11:18:19AM +0200, Petr Tesarik wrote:
> diff --git a/include/linux/device.h b/include/linux/device.h
> index d1d2b8557b30..e340e0f06dce 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -516,6 +516,9 @@ struct device_physical_location {
> * @dma_io_tlb_dyn_slots:
> * Dynamically allocated bounce buffers for this device.
> * Not for driver use.
> + * @dma_io_tlb_have_dyn:
> + * Does this device have any dynamically allocated bounce
> + * buffers? Not for driver use.
> * @archdata: For arch-specific additions.
> * @of_node: Associated device tree node.
> * @fwnode: Associated device node supplied by platform firmware.
> @@ -623,6 +626,7 @@ struct device {
> struct io_tlb_mem *dma_io_tlb_mem;
> spinlock_t dma_io_tlb_dyn_lock;
> struct list_head dma_io_tlb_dyn_slots;
> + bool dma_io_tlb_have_dyn;
> #endif
> /* arch specific additions */
> struct dev_archdata archdata;
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index daa2064f2ede..8cbb0bebb0bc 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -152,7 +152,11 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
>
> return mem &&
> (is_swiotlb_fixed(mem, paddr) ||
> - (mem->allow_dyn && is_swiotlb_dyn(dev, paddr)));
> + /* Pairs with smp_store_release() in swiotlb_dyn_map()
> + * and swiotlb_dyn_unmap().
> + */
> + (smp_load_acquire(&dev->dma_io_tlb_have_dyn) &&
> + is_swiotlb_dyn(dev, paddr)));
> }
>
> static inline bool is_swiotlb_force_bounce(struct device *dev)
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 81eab1c72c50..e8be3ee50f18 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -642,6 +642,9 @@ static phys_addr_t swiotlb_dyn_map(struct device *dev, phys_addr_t orig_addr,
>
> spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> list_add(&slot->node, &dev->dma_io_tlb_dyn_slots);
> + if (!dev->dma_io_tlb_have_dyn)
> + /* Pairs with smp_load_acquire() in is_swiotlb_buffer() */
> + smp_store_release(&dev->dma_io_tlb_have_dyn, true);
> spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);

I'm not sure this works. What this seems to do is that if the caller of
is_swiotlb_buffer() sees the flag set, it's guaranteed that something
was added to the dma_io_tlb_dyn_slots list. But the reverse is not
necessarily true. IOW, if something was added to the list, there is a
brief window where the dma_io_tlb_have_dyn flag is still false. In the
general case, I doubt any ordering between list_add() and the flag
setting changes anything since neither of them may be visible to another
CPU.

What you need is for a 'paddr' added to the dynamic list to be correctly
identified by another CPU as dynamic swiotlb. That other CPU doesn't
check random addresses but only those returned by the DMA API. Such
values would be normally passed through a memory location (e.g. driver
local structures) and that's what you want to order against.

What I mean is that a 'dev->blah = paddr' needs to be ordered _after_
your flag setting. So you need either the 'blah = paddr' assignment
to have release semantics or the flag setting to be an
smp_store_acquire() (but we don't have such thing). You'd have to use an
explicit smp_wmb() barrier after the flag setting (it can be outside the
lock). The spin_unlock() is not sufficient since it only has release
semantics. I also don't think the ordering between list_add() and flag
setting matters since the smp_wmb() would ensure that both are visible
when the 'paddr' value made it to the other CPU.

Similarly on the is_swiotlb_buffer() side, you want the flag reading to
be ordered after the 'blah = paddr' is observed. Here the
smp_load_acquire() is sufficient.
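
To put that into a rough sketch (drv->dma_handle stands in for the
hypothetical 'blah' location; this only illustrates the ordering
described above, it is not code from the series):

P0 (map):
	WRITE_ONCE(dev->dma_io_tlb_have_dyn, true);	/* under the lock */
	smp_wmb();					/* can be outside the lock */
	drv->dma_handle = paddr;			/* 'blah = paddr' */

P1 (is_swiotlb_buffer() caller):
	paddr = drv->dma_handle;
	if (smp_load_acquire(&dev->dma_io_tlb_have_dyn) &&
	    is_swiotlb_dyn(dev, paddr))
		/* paddr is in a dynamic bounce buffer */;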

> return page_to_phys(slot->page);
> @@ -668,6 +671,9 @@ static void swiotlb_dyn_unmap(struct device *dev, phys_addr_t tlb_addr,
> unsigned long flags;
>
> spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> + if (list_is_singular(&dev->dma_io_tlb_dyn_slots))
> + /* Pairs with smp_load_acquire() in is_swiotlb_buffer() */
> + smp_store_release(&dev->dma_io_tlb_have_dyn, false);
> slot = lookup_dyn_slot_locked(dev, tlb_addr);
> list_del(&slot->node);
> spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);

As with the map case, I don't think the ordering between list_del() and
the flag setting matters. If you unmap the last dynamic buffer, the
worst that can happen is that an is_swiotlb_buffer() call attempts a
read of the list but the flag will eventually become visible. There
shouldn't be another caller trying to unmap the same paddr (in well
behaved drivers).

Now, thinking about the list_head access and the flag ordering, since it
doesn't matter, you might as well not bother with the flag at all and
rely on list_add() and list_empty() ordering vs the hypothetical 'blah'
access. Both of these use READ/WRITE_ONCE() for setting
dma_io_tlb_dyn_slots.next. You only need an smp_wmb() after the
list_add() and an smp_rmb() before a list_empty() check in
is_swiotlb_buffer(), no dma_io_tlb_have_dyn variable.

That's my reasoning but to be absolutely sure, you can pass that through
some formal modelling.

--
Catalin

2023-05-15 09:08:27

by Petr Tesařík

Subject: Re: [PATCH v2 RESEND 7/7] swiotlb: per-device flag if there are dynamically allocated buffers

Hi Catalin,

On Sun, 14 May 2023 19:54:27 +0100
Catalin Marinas <[email protected]> wrote:

> On Tue, May 09, 2023 at 11:18:19AM +0200, Petr Tesarik wrote:
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index d1d2b8557b30..e340e0f06dce 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -516,6 +516,9 @@ struct device_physical_location {
> > * @dma_io_tlb_dyn_slots:
> > * Dynamically allocated bounce buffers for this device.
> > * Not for driver use.
> > + * @dma_io_tlb_have_dyn:
> > + * Does this device have any dynamically allocated bounce
> > + * buffers? Not for driver use.
> > * @archdata: For arch-specific additions.
> > * @of_node: Associated device tree node.
> > * @fwnode: Associated device node supplied by platform firmware.
> > @@ -623,6 +626,7 @@ struct device {
> > struct io_tlb_mem *dma_io_tlb_mem;
> > spinlock_t dma_io_tlb_dyn_lock;
> > struct list_head dma_io_tlb_dyn_slots;
> > + bool dma_io_tlb_have_dyn;
> > #endif
> > /* arch specific additions */
> > struct dev_archdata archdata;
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index daa2064f2ede..8cbb0bebb0bc 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -152,7 +152,11 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
> >
> > return mem &&
> > (is_swiotlb_fixed(mem, paddr) ||
> > - (mem->allow_dyn && is_swiotlb_dyn(dev, paddr)));
> > + /* Pairs with smp_store_release() in swiotlb_dyn_map()
> > + * and swiotlb_dyn_unmap().
> > + */
> > + (smp_load_acquire(&dev->dma_io_tlb_have_dyn) &&
> > + is_swiotlb_dyn(dev, paddr)));
> > }
> >
> > static inline bool is_swiotlb_force_bounce(struct device *dev)
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index 81eab1c72c50..e8be3ee50f18 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -642,6 +642,9 @@ static phys_addr_t swiotlb_dyn_map(struct device *dev, phys_addr_t orig_addr,
> >
> > spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> > list_add(&slot->node, &dev->dma_io_tlb_dyn_slots);
> > + if (!dev->dma_io_tlb_have_dyn)
> > + /* Pairs with smp_load_acquire() in is_swiotlb_buffer() */
> > + smp_store_release(&dev->dma_io_tlb_have_dyn, true);
> > spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
>
> I'm not sure this works. What this seems to do is that if the caller of
> is_swiotlb_buffer() sees the flag set, it's guaranteed that something
> was added to the dma_io_tlb_dyn_slots list. But the reverse is not
> necessarily true. IOW, if something was added to the list, there is a
> brief window where the dma_io_tlb_have_dyn flag is still false. In the
> general case, I doubt any ordering between list_add() and the flag
> setting changes anything since neither of them may be visible to another
> CPU.

Thank you for the review! This patch probably needs a bit more
explanation.

The goal is to avoid taking a spinlock in the most common case, where
the dynamic list is empty. The only required invariant is:

When the flag is clear, it is safe to skip the list.

It's not a bug to walk an empty list, it's merely less efficient. Such
a race window would be acceptable. OTOH that's not your concern if I
understand you correctly.

> What you need is for a 'paddr' added to the dynamic list to be correctly
> identified by another CPU as dynamic swiotlb. That other CPU doesn't
> check random addresses but only those returned by the DMA API.

Yes, that's correct.

> Such
> values would be normally passed through a memory location (e.g. driver
> local structures) and that's what you want to order against.

This would certainly work, but I'm not sure I need it. My only goal is
that when the flag is set, the new value is observed by all CPUs on the
next call to is_swiotlb_buffer().

> What I mean is that a 'dev->blah = paddr' needs to be ordered _after_
> your flag setting. So you need either the 'blah = paddr' assignment
> to have release semantics or the flag setting to be an
> smp_store_acquire() (but we don't have such thing). You'd have to use an
> explicit smp_wmb() barrier after the flag setting (it can be outside the
> lock). The spin_unlock() is not sufficient since it only has release
> semantics.

Understood. The spinlock is supposed to order changes to the list, not
to the flag.

> I also don't think the ordering between list_add() and flag
> setting matters since the smp_wmb() would ensure that both are visible
> when the 'paddr' value made it to the other CPU.

If the flag makes it before the list, then the other CPU will walk the
list only after acquiring dma_io_tlb_dyn_lock, and that's what the list
is ordered against.

I don't think there is any concern if the list makes it before the
flag, as long as the new value of the flag is observed on the next call
to is_swiotlb_buffer (on any CPU).

> Similarly on the is_swiotlb_buffer() side, you want the flag reading to
> be ordered after the 'blah = paddr' is observed. Here the
> smp_load_acquire() is sufficient.
>
> > return page_to_phys(slot->page);
> > @@ -668,6 +671,9 @@ static void swiotlb_dyn_unmap(struct device *dev, phys_addr_t tlb_addr,
> > unsigned long flags;
> >
> > spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> > + if (list_is_singular(&dev->dma_io_tlb_dyn_slots))
> > + /* Pairs with smp_load_acquire() in is_swiotlb_buffer() */
> > + smp_store_release(&dev->dma_io_tlb_have_dyn, false);
> > slot = lookup_dyn_slot_locked(dev, tlb_addr);
> > list_del(&slot->node);
> > spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
>
> As with the map case, I don't think the ordering between list_del() and
> the flag setting matters. If you unmap the last dynamic buffer, the
> worst that can happen is that an is_swiotlb_buffer() call attempts a
> read of the list but the flag will eventually become visible. There
> shouldn't be another caller trying to unmap the same paddr (in well
> behaved drivers).
>
> Now, thinking about the list_head access and the flag ordering, since it
> doesn't matter, you might as well not bother with the flag at all and
> rely on list_add() and list_empty() ordering vs the hypothetical 'blah'
> access. Both of these use READ/WRITE_ONCE() for setting
> dma_io_tlb_dyn_slots.next. You only need an smp_wmb() after the
> list_add() and an smp_rmb() before a list_empty() check in
> is_swiotlb_buffer(), no dma_iotlb_have_dyn variable.

Wait, let me check that I understand you right. Do you suggest that I
convert dma_io_tlb_dyn_slots to a lockless list and get rid of the
spinlock?

I'm sure it can be done for list_add() and list_del(). I'll have
to think about list_move().

> That's my reasoning but to be absolutely sure, you can pass that through
> some formal modelling.

Good idea. Especially if I try to get rid of the lock.

Petr T

2023-05-15 10:24:53

by Petr Tesařík

Subject: Re: [PATCH v2 RESEND 7/7] swiotlb: per-device flag if there are dynamically allocated buffers

On Mon, 15 May 2023 10:48:47 +0200
Petr Tesařík <[email protected]> wrote:

> Hi Catalin,
>
> On Sun, 14 May 2023 19:54:27 +0100
> Catalin Marinas <[email protected]> wrote:
>[...]
> > Now, thinking about the list_head access and the flag ordering, since it
> > doesn't matter, you might as well not bother with the flag at all and
> > rely on list_add() and list_empty() ordering vs the hypothetical 'blah'
> > access. Both of these use READ/WRITE_ONCE() for setting
> > dma_io_tlb_dyn_slots.next. You only need an smp_wmb() after the
> > list_add() and an smp_rmb() before a list_empty() check in
                      ^^^^^^^^^
Got it, finally. Well, that's exactly something I don't want to do.
For example, on arm64 (seeing your email address), smp_rmb() translates
to a "dsb ld" instruction. I would expect that this is more expensive
than a "ldar", generated by smp_load_acquire().

I mean, for devices that do not need swiotlb, the flag never changes
from zero, so reading it must be as cheap as possible.

Petr T

> > is_swiotlb_buffer(), no dma_iotlb_have_dyn variable.
>
> Wait, let me check that I understand you right. Do you suggest that I
> convert dma_io_tlb_dyn_slots to a lockless list and get rid of the
> spinlock?
>
> I'm sure it can be done for list_add() and list_del(). I'll have
> to think about list_move().

Hm, even the documentation of llist_empty() says that it is "not
guaranteed to be accurate or up to date". If it could be, I'm quite
sure the authors would have gladly implemented it as such.

Petr T

2023-05-15 16:41:36

by Catalin Marinas

Subject: Re: [PATCH v2 RESEND 7/7] swiotlb: per-device flag if there are dynamically allocated buffers

(some of your replies may have been filtered to various of my mailboxes,
depending on which lists you cc'ed; replying here)

On Mon, May 15, 2023 at 12:00:54PM +0200, Petr Tesařík wrote:
> On Mon, 15 May 2023 10:48:47 +0200
> Petr Tesařík <[email protected]> wrote:
> > On Sun, 14 May 2023 19:54:27 +0100
> > Catalin Marinas <[email protected]> wrote:
> > > Now, thinking about the list_head access and the flag ordering, since it
> > > doesn't matter, you might as well not bother with the flag at all and
> > > rely on list_add() and list_empty() ordering vs the hypothetical 'blah'
> > > access. Both of these use READ/WRITE_ONCE() for setting
> > > dma_io_tlb_dyn_slots.next. You only need an smp_wmb() after the
> > > list_add() and an smp_rmb() before a list_empty() check in
>                       ^^^^^^^^^
> Got it, finally. Well, that's exactly something I don't want to do.
> For example, on arm64 (seeing your email address), smp_rmb() translates
> to a "dsb ld" instruction. I would expect that this is more expensive
> than a "ldar", generated by smp_load_acquire().

It translates to a dmb ishld which is on par with ldar (dsb is indeed a
lot more expensive but that's not generated here).

> > > is_swiotlb_buffer(), no dma_iotlb_have_dyn variable.
> >
> > Wait, let me check that I understand you right. Do you suggest that I
> > convert dma_io_tlb_dyn_slots to a lockless list and get rid of the
> > spinlock?
> >
> > I'm sure it can be done for list_add() and list_del(). I'll have
> > to think about list_move().
>
> Hm, even the documentation of llist_empty() says that it is "not
> guaranteed to be accurate or up to date". If it could be, I'm quite
> sure the authors would have gladly implemented it as such.

It doesn't but neither does your flag. If you want a guarantee, you'd
need locks because a llist_empty() on its own can race with other
llist_add/del_*() that may not yet be visible to a CPU at exactly that
moment. BTW, the llist implementation cannot delete a random element, so
not sure this is suitable for your implementation (it can only delete
the first element or the whole list).

I do think you need to change your invariants and not rely on an
absolute list_empty() or some flag:

P0:
	list_add(paddr);
	WRITE_ONCE(blah, paddr);

P1:
	paddr = READ_ONCE(blah);
	list_empty();

Your invariant (on P1) should be blah == paddr => !list_empty(). If
there is another P2 removing paddr from the list, this wouldn't work
(nor your flag) but the assumption is that a correctly written driver
that still has a reference to paddr doesn't use it after being removed
from the list (i.e. it doesn't do a dma_unmap(paddr) and still continue
to use this paddr for e.g. dma_sync()).

For such an invariant, you'd need ordering between list_add() and the
write of paddr (smp_wmb() would do). On P1, you need an smp_rmb() before
list_empty(), since the implementation only does a READ_ONCE().

You still need the locks for list modifications and list traversal as I
don't see how you can use the llist implementation with random element
removal.
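
A minimal sketch of that ordering, using the list and lock names from
patch 4/7 (whether these barriers can really replace the per-device flag
is exactly the question being discussed):

/* publish side: runs after the new slot has been allocated */
static phys_addr_t swiotlb_dyn_publish_sketch(struct device *dev,
					      struct io_tlb_dyn_slot *slot)
{
	unsigned long flags;

	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
	list_add(&slot->node, &dev->dma_io_tlb_dyn_slots);
	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);

	/* order list_add() before the caller stores the returned paddr */
	smp_wmb();
	return page_to_phys(slot->page);
}

/* lookup side: the caller has already read the paddr it checks */
static bool is_swiotlb_dyn_sketch(struct device *dev, phys_addr_t paddr)
{
	/* order the caller's read of paddr before the list_empty() check */
	smp_rmb();
	if (list_empty(&dev->dma_io_tlb_dyn_slots))
		return false;
	return !!lookup_dyn_slot(dev, paddr);	/* takes dma_io_tlb_dyn_lock */
}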


There is another scenario to take into account on the list_del() side.
Let's assume that there are other elements on the list, so
list_empty() == false:

P0:
	list_del(paddr);
	/* the memory gets freed, added to some slab or page free list */
	WRITE_ONCE(slab_free_list, __va(paddr));

P1:
	paddr = __pa(READ_ONCE(slab_free_list)); /* re-allocating paddr freed on P0 */
	if (!list_empty()) {	/* assuming other elements on the list */
		/* searching the list */
		list_for_each() {
			if (pos->paddr == __pa(vaddr))
				/* match */
		}
	}

On P0, you want the list update to be visible before the memory is freed
(and potentially reallocated on P1). An smp_wmb() on P0 would do. For
P1, we don't care about list_empty() as there can be other elements
already. But we do want any list element reads during the search to
be ordered after the slab_free_list read. The smp_rmb() you'd add for
the case above would suffice.
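
A sketch of what that means on the unmap path, again with the names from
patch 4/7; the smp_wmb() placement is the point being illustrated, the
rest is simplified (the real code looks the slot up by address under the
lock):

static void swiotlb_dyn_unmap_sketch(struct device *dev,
				     struct io_tlb_dyn_slot *slot,
				     enum dma_data_direction dir)
{
	unsigned long flags;

	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
	list_del(&slot->node);
	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);

	/* make the list update visible before the pages can be reused */
	smp_wmb();
	dma_direct_free_pages(dev, slot->alloc_size, slot->page,
			      slot->dma_addr, dir);
	kfree(slot);
}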

--
Catalin

2023-05-15 20:06:14

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

From: Petr Tesarik <[email protected]> Sent: Tuesday, May 9, 2023 2:18 AM
>
> The software IO TLB was designed with the assumption that it is not
> used much, especially on 64-bit systems, so a small fixed memory
> area (currently 64 MiB) is sufficient to handle the few cases which
> still require a bounce buffer. However, these cases are not so rare
> in some circumstances.
>
> First, if SEV is active, all DMA must be done through shared
> unencrypted pages, and SWIOTLB is used to make this happen without
> changing device drivers. The software IO TLB size is increased to 6%
> of total memory in sev_setup_arch(), but that is more of an
> approximation. The actual requirements may vary depending on which
> drivers are used and the amount of I/O.

FWIW, I don't think the approach you have implemented here will be
practical to use for CoCo VMs (SEV, TDX, whatever else). The problem
is that dma_direct_alloc_pages() and dma_direct_free_pages() must
call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo
VMs, these calls are expensive because they require a hypercall to the host,
and the operation on the host isn't trivial either. I haven't measured the
overhead, but doing a hypercall on every DMA map operation and on
every unmap operation has long been something we thought we must
avoid. The fixed swiotlb bounce buffer space solves this problem by
doing set_decrypted() in batch at boot time, and never
doing set_encrypted().

In Microsoft's first implementation of bounce buffering for SEV-SNP VMs,
we created custom bounce buffer code separate from swiotlb. This code
did something similar to what you've done, but maintained a per-device pool of allocated
buffers that could be reused, rather than freeing the memory (and marking
the memory encrypted again) on every DMA unmap operation. (The pool
was actually per-VMBus channel, but VMBus channels are per-device, so
the effect was the same.) The reusable pool avoided most of the calls to
set_decrypted()/set_encrypted() and made it practical from a performance
standpoint. But of course, the pool could grow arbitrarily large, so there
was additional complexity to decay and trim the pool size. LKML feedback
early on was to use swiotlb instead, which made sense, but at the cost of
needing to figure out the appropriate fixed size of the swiotlb, and likely
over-provisioning to avoid running out of bounce buffer space.

Now we're considering again a more dynamic approach, which is good, but
we're encountering the same problems.

See https://lore.kernel.org/linux-hyperv/[email protected]/
for this historical example.

Michael

>
> Second, some embedded devices have very little RAM, so 64 MiB is not
> negligible. Sadly, these are exactly the devices that also often
> need a software IO TLB. Although minimum swiotlb size can be found
> empirically by extensive testing, it would be easier to allocate a
> small swiotlb at boot and let it grow on demand.
>
> Growing the SWIOTLB data structures at run time is impossible. The
> whole SWIOTLB region is contiguous in physical memory to allow
> combining adjacent slots and also to ensure that alignment
> constraints can be met. The SWIOTLB is too big for the buddy
> allocator (cf. MAX_ORDER). More importantly, even if a new SWIOTLB
> could be allocated (e.g. from CMA), it cannot be extended in-place
> (because surrounding pages may be already allocated for other
> purposes), and there is no mechanism for relocating already mapped
> bounce buffers: The DMA API gets only the address of a buffer, and
> the implementation (direct or IOMMU) checks whether it belongs to
> the software IO TLB.
>
> It is possible to allocate multiple smaller struct io_tlb_mem
> instances. However, they would have to be stored in a non-constant
> container (list or tree), which needs synchronization between
> readers and writers, creating contention in a hot path for all
> devices, not only those which need software IO TLB.
>
> Another option is to allocate a very large SWIOTLB at boot, but
> allow migrating pages to other users (like CMA does). This approach
> might work, but there are many open issues:
>
> 1. After a page is migrated away from SWIOTLB, it must not be used
> as a (direct) DMA buffer. Otherwise SWIOTLB code would have to
> check which pages have been migrated to determine whether a given
> buffer address belongs to a bounce buffer or not, effectively
> introducing all the issues of multiple SWIOTLB instances.
>
> 2. Unlike SWIOTLB, CMA cannot be used from atomic contexts, for many
> different reasons. This might be changed in theory, but
> it would take a lot of investigation and time. OTOH improvement
> to the SWIOTLB is needed now.
>
> 3. If SWIOTLB is implemented separately from CMA and not as its
> part, users have to solve the dilemma of how to distribute
> precious DMA-able memory between them.
>
> The present patch is a simplistic solution. Bounce buffers are
> allocated using the non-coherent DMA API, removing the need to
> implement a new custom allocator. These buffers are then tracked in
> a per-device linked list, reducing the impact on devices that do not
> need the SWIOTLB.
>
> Analysis of real-world I/O patterns has shown that the same buffer
> is typically looked up repeatedly (for further sync operations, or
> to be unmapped). The most recently used bounce buffer is therefore
> always moved to the beginning of the list. The list performed better
> than a maple tree when tested with fio against a QEMU SATA drive
> backed by a RAM block device in the host (4 cores, 16 iodepth).
> Other scenarios are also likely to benefit from this MRU order but
> have not been tested.
>
> Operations on the list are serialized with a spinlock. It is
> unfortunately not possible to use an RCU list, because quiescent
> state is not guaranteed to happen before an asynchronous event (e.g.
> hardware interrupt) on another CPU. If that CPU used an old version
> of the list, it would fail to copy data to and/or from the newly
> allocated bounce buffer.
>
> Last but not least, bounce buffers are never allocated dynamically
> if the SWIOTLB is in fact a DMA restricted pool, because that would
> defeat the purpose of a restricted pool.
>
> Signed-off-by: Petr Tesarik <[email protected]>
> ---
> include/linux/device.h | 8 ++
> include/linux/swiotlb.h | 8 +-
> kernel/dma/swiotlb.c | 252 ++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 259 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 472dd24d4823..d1d2b8557b30 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -510,6 +510,12 @@ struct device_physical_location {
> * @dma_mem: Internal for coherent mem override.
> * @cma_area: Contiguous memory area for dma allocations
> * @dma_io_tlb_mem: Pointer to the swiotlb pool used. Not for driver use.
> + * @dma_io_tlb_dyn_lock:
> + * Spinlock to protect the list of dynamically allocated bounce
> + * buffers.
> + * @dma_io_tlb_dyn_slots:
> + * Dynamically allocated bounce buffers for this device.
> + * Not for driver use.
> * @archdata: For arch-specific additions.
> * @of_node: Associated device tree node.
> * @fwnode: Associated device node supplied by platform firmware.
> @@ -615,6 +621,8 @@ struct device {
> #endif
> #ifdef CONFIG_SWIOTLB
> struct io_tlb_mem *dma_io_tlb_mem;
> + spinlock_t dma_io_tlb_dyn_lock;
> + struct list_head dma_io_tlb_dyn_slots;
> #endif
> /* arch specific additions */
> struct dev_archdata archdata;
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 281ecc6b9bcc..6aada6ac31e2 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -114,6 +114,8 @@ struct io_tlb_mem {
> };
> extern struct io_tlb_mem io_tlb_default_mem;
>
> +bool is_swiotlb_dyn(struct device *dev, phys_addr_t paddr);
> +
> /**
> * is_swiotlb_fixed() - check if a physical address belongs to a swiotlb slot
> * @mem: relevant swiotlb pool
> @@ -147,7 +149,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
> {
> struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>
> - return mem && is_swiotlb_fixed(mem, paddr);
> + return mem &&
> + (is_swiotlb_fixed(mem, paddr) ||
> + is_swiotlb_dyn(dev, paddr));
> }
>
> static inline bool is_swiotlb_force_bounce(struct device *dev)
> @@ -164,6 +168,8 @@ static inline bool is_swiotlb_force_bounce(struct device *dev)
> static inline void swiotlb_dev_init(struct device *dev)
> {
> dev->dma_io_tlb_mem = &io_tlb_default_mem;
> + spin_lock_init(&dev->dma_io_tlb_dyn_lock);
> + INIT_LIST_HEAD(&dev->dma_io_tlb_dyn_slots);
> }
>
> void swiotlb_init(bool addressing_limited, unsigned int flags);
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 96ba93be6772..612e1c2e9573 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -68,6 +68,22 @@ struct io_tlb_slot {
> unsigned int list;
> };
>
> +/**
> + * struct io_tlb_dyn_slot - dynamically allocated swiotlb slot
> + * @node: node in the per-device list
> + * @orig_addr: physical address of the original DMA buffer
> + * @alloc_size: total size of the DMA buffer
> + * @page: first page of the bounce buffer
> + * @dma_addr: DMA address of the bounce buffer
> + */
> +struct io_tlb_dyn_slot {
> + struct list_head node;
> + phys_addr_t orig_addr;
> + size_t alloc_size;
> + struct page *page;
> + dma_addr_t dma_addr;
> +};
> +
> static bool swiotlb_force_bounce;
> static bool swiotlb_force_disable;
>
> @@ -466,6 +482,191 @@ void __init swiotlb_exit(void)
> memset(mem, 0, sizeof(*mem));
> }
>
> +/**
> + * lookup_dyn_slot_locked() - look up a dynamically allocated bounce buffer
> + * @dev: device which has mapped the buffer
> + * @paddr: physical address within the bounce buffer
> + *
> + * Walk through the list of bounce buffers dynamically allocated for @dev,
> + * looking for a buffer which contains @paddr.
> + *
> + * Context: Any context. Expects dma_io_tlb_dyn_lock lock to be held.
> + * Return:
> + * Address of a &struct io_tlb_dyn_slot, or %NULL if not found.
> + */
> +static struct io_tlb_dyn_slot *lookup_dyn_slot_locked(struct device *dev,
> + phys_addr_t paddr)
> +{
> + struct io_tlb_dyn_slot *slot;
> +
> + list_for_each_entry(slot, &dev->dma_io_tlb_dyn_slots, node) {
> + phys_addr_t start = page_to_phys(slot->page);
> + phys_addr_t end = start + slot->alloc_size - 1;
> +
> + if (start <= paddr && paddr <= end)
> + return slot;
> + }
> + return NULL;
> +}
> +
> +/**
> + * lookup_dyn_slot() - look up a dynamically allocated bounce buffer
> + * @dev: device which has mapped the buffer
> + * @paddr: physical address within the bounce buffer
> + *
> + * Search for a dynamically allocated bounce buffer which contains
> + * @paddr. If found, the buffer is moved to the beginning of the linked
> + * list. Use lookup_dyn_slot_locked() directly where this behavior is not
> + * required/desired.
> + *
> + * Context: Any context. Takes and releases dma_io_tlb_dyn_lock.
> + * Return:
> + * Address of a &struct io_tlb_dyn_slot, or %NULL if not found.
> + */
> +static struct io_tlb_dyn_slot *lookup_dyn_slot(struct device *dev,
> + phys_addr_t paddr)
> +{
> + struct io_tlb_dyn_slot *slot;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> + slot = lookup_dyn_slot_locked(dev, paddr);
> + list_move(&slot->node, &dev->dma_io_tlb_dyn_slots);
> + spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
> + return slot;
> +}
> +
> +/**
> + * is_swiotlb_dyn() - check if a physical address belongs to a dynamically
> + * allocated bounce buffer
> + * @dev: device which has mapped the buffer
> + * @paddr: physical address within the bounce buffer
> + *
> + * Check whether there is a dynamically allocated bounce buffer which
> + * contains @paddr. If found, the buffer is moved to the beginning of
> + * the MRU list.
> + *
> + * Return:
> + * * %true if @paddr points into a dynamically allocated bounce buffer
> + * * %false otherwise
> + */
> +bool is_swiotlb_dyn(struct device *dev, phys_addr_t paddr)
> +{
> + return !!lookup_dyn_slot(dev, paddr);
> +}
> +
> +/**
> + * swiotlb_dyn_bounce() - copy a dynamically allocated buffer from or back
> + * to the original dma location
> + * @dev: device which has mapped the buffer
> + * @tlb_addr: physical address inside the bounce buffer
> + * @size: size of the region to be copied
> + * @dir: direction of the data transfer
> + *
> + * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
> + * This function works only for dynamically allocated bounce buffers.
> + */
> +static void swiotlb_dyn_bounce(struct device *dev, phys_addr_t tlb_addr,
> + size_t size, enum dma_data_direction dir)
> +{
> + struct io_tlb_dyn_slot *slot = lookup_dyn_slot(dev, tlb_addr);
> + unsigned int tlb_offset;
> + unsigned char *vaddr;
> +
> + if (!slot)
> + return;
> +
> + tlb_offset = tlb_addr - page_to_phys(slot->page);
> + vaddr = page_address(slot->page) + tlb_offset;
> +
> + swiotlb_copy(dev, slot->orig_addr, vaddr, size, slot->alloc_size,
> + tlb_offset, dir);
> +}
> +
> +/**
> + * swiotlb_dyn_map() - allocate a bounce buffer dynamically
> + * @dev: device which maps the buffer
> + * @orig_addr: address of the original buffer
> + * @alloc_size: total size of the original buffer
> + * @alloc_align_mask:
> + * required physical alignment of the I/O buffer
> + * @dir: direction of the data transfer
> + * @attrs: optional DMA attributes for the map operation
> + *
> + * Allocate a new bounce buffer using DMA non-coherent API. This function
> + * assumes that there is a fallback allocation scheme if the allocation
> + * fails. In fact, it always fails for buffers smaller than a page and
> + * for address constraints that are not (yet) correctly handled by
> + * dma_direct_alloc_pages().
> + *
> + * Return: Physical address of the bounce buffer, or %DMA_MAPPING_ERROR.
> + */
> +static phys_addr_t swiotlb_dyn_map(struct device *dev, phys_addr_t orig_addr,
> + size_t alloc_size, unsigned int alloc_align_mask,
> + enum dma_data_direction dir, unsigned long attrs)
> +{
> + struct io_tlb_dyn_slot *slot;
> + unsigned long flags;
> + gfp_t gfp;
> +
> + /* Allocation has page granularity. Avoid small buffers. */
> + if (alloc_size < PAGE_SIZE)
> + goto err;
> +
> + /* DMA direct does not deal with physical address constraints. */
> + if (alloc_align_mask || dma_get_min_align_mask(dev))
> + goto err;
> +
> + gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_NOIO : GFP_NOWAIT;
> + slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
> + if (!slot)
> + goto err;
> +
> + slot->orig_addr = orig_addr;
> + slot->alloc_size = alloc_size;
> + slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
> + &slot->dma_addr, dir,
> + gfp | __GFP_NOWARN);
> + if (!slot->page)
> + goto err_free_slot;
> +
> + spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> + list_add(&slot->node, &dev->dma_io_tlb_dyn_slots);
> + spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
> +
> + return page_to_phys(slot->page);
> +
> +err_free_slot:
> + kfree(slot);
> +err:
> + return (phys_addr_t)DMA_MAPPING_ERROR;
> +}
> +
> +/**
> + * swiotlb_dyn_unmap() - unmap a dynamically allocated bounce buffer
> + * @dev: device which mapped the buffer
> + * @tlb_addr: physical address of the bounce buffer
> + * @dir: direction of the data transfer
> + *
> + * Release all resources associated with a dynamically allocated bounce
> + * buffer.
> + */
> +static void swiotlb_dyn_unmap(struct device *dev, phys_addr_t tlb_addr,
> + enum dma_data_direction dir)
> +{
> + struct io_tlb_dyn_slot *slot;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> + slot = lookup_dyn_slot_locked(dev, tlb_addr);
> + list_del(&slot->node);
> + spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
> +
> + dma_direct_free_pages(dev, slot->alloc_size, slot->page,
> + slot->dma_addr, dir);
> + kfree(slot);
> +}
> +
> /*
> * Return the offset into a iotlb slot required to keep the device happy.
> */
> @@ -474,11 +675,19 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
> return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
> }
>
> -/*
> - * Bounce: copy the swiotlb buffer from or back to the original dma location
> +/**
> + * swiotlb_fixed_bounce() - copy a fixed-slot swiotlb buffer from or back
> + * to the original dma location
> + * @dev: device which has mapped the buffer
> + * @tlb_addr: physical address inside the bounce buffer
> + * @size: size of the region to be copied
> + * @dir: direction of the data transfer
> + *
> + * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
> + * This function works only for fixed bounce buffers.
> */
> -static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
> - enum dma_data_direction dir)
> +static void swiotlb_fixed_bounce(struct device *dev, phys_addr_t tlb_addr,
> + size_t size, enum dma_data_direction dir)
> {
> struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
> @@ -574,6 +783,25 @@ static void swiotlb_copy(struct device *dev, phys_addr_t orig_addr,
> }
> }
>
> +/**
> + * swiotlb_bounce() - copy the swiotlb buffer from or back to the original
> + * dma location
> + * @dev: device which has mapped the buffer
> + * @tlb_addr: physical address inside the bounce buffer
> + * @size: size of the region to be copied
> + * @dir: direction of the data transfer
> + *
> + * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
> + */
> +static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
> + enum dma_data_direction dir)
> +{
> + if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
> + swiotlb_fixed_bounce(dev, tlb_addr, size, dir);
> + else
> + swiotlb_dyn_bounce(dev, tlb_addr, size, dir);
> +}
> +
> static inline phys_addr_t slot_addr(phys_addr_t start, phys_addr_t idx)
> {
> return start + (idx << IO_TLB_SHIFT);
> @@ -834,8 +1062,13 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> return (phys_addr_t)DMA_MAPPING_ERROR;
> }
>
> - tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
> - alloc_align_mask, attrs);
> + tlb_addr = (phys_addr_t)DMA_MAPPING_ERROR;
> + if (!is_swiotlb_for_alloc(dev))
> + tlb_addr = swiotlb_dyn_map(dev, orig_addr, alloc_size,
> + alloc_align_mask, dir, attrs);
> + if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
> + tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
> + alloc_align_mask, attrs);
>
> if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR) {
> if (!(attrs & DMA_ATTR_NO_WARN))
> @@ -919,7 +1152,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
> (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
> swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
>
> - swiotlb_release_slots(dev, tlb_addr);
> + if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
> + swiotlb_release_slots(dev, tlb_addr);
> + else
> + swiotlb_dyn_unmap(dev, tlb_addr, dir);
> }
>
> void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
> @@ -1089,7 +1325,7 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size)
> {
> phys_addr_t tlb_addr = page_to_phys(page);
>
> - if (!is_swiotlb_buffer(dev, tlb_addr))
> + if (!is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
> return false;
>
> swiotlb_release_slots(dev, tlb_addr);
> --
> 2.25.1


2023-05-16 06:21:57

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

Hi Michael,

On Mon, 15 May 2023 19:43:52 +0000
"Michael Kelley (LINUX)" <[email protected]> wrote:

> From: Petr Tesarik <[email protected]> Sent: Tuesday, May 9, 2023 2:18 AM
> >
> > The software IO TLB was designed with the assumption that it is not
> > used much, especially on 64-bit systems, so a small fixed memory
> > area (currently 64 MiB) is sufficient to handle the few cases which
> > still require a bounce buffer. However, these cases are not so rare
> > in some circumstances.
> >
> > First, if SEV is active, all DMA must be done through shared
> > unencrypted pages, and SWIOTLB is used to make this happen without
> > changing device drivers. The software IO TLB size is increased to 6%
> > of total memory in sev_setup_arch(), but that is more of an
> > approximation. The actual requirements may vary depending on which
> > drivers are used and the amount of I/O.
>
> FWIW, I don't think the approach you have implemented here will be
> practical to use for CoCo VMs (SEV, TDX, whatever else). The problem
> is that dma_direct_alloc_pages() and dma_direct_free_pages() must
> call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo
> VMs, these calls are expensive because they require a hypercall to the host,
> and the operation on the host isn't trivial either. I haven't measured the
> overhead, but doing a hypercall on every DMA map operation and on
> every unmap operation has long been something we thought we must
> avoid. The fixed swiotlb bounce buffer space solves this problem by
> doing set_decrypted() in batch at boot time, and never
> doing set_encrypted().

I know. For performance, alloc() and free() on each DMA map/unmap is
not ideal even without CoCo. I have already tested some code locally
to keep buffers around after unmap, effectively creating a per-device
pool as described below. Right now, I don't have SEV-capable hardware
for testing, but on bare metal, this pool brought performance back to
that of fixed swiotlb buffers, for some scenarios making it even better.

However, these per-device pools add more complexity, so I decided to
start with a smaller patch series that can be reviewed more easily. If
there is no objection in general, I'll send the second part with the
per-device pools.
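
In rough outline, such a per-device pool could look like the sketch
below. This is not the code that was tested; the dma_io_tlb_dyn_free
list is an assumed addition to struct device:

/* on unmap: park the slot instead of freeing the (decrypted) pages */
static void swiotlb_dyn_recycle(struct device *dev,
				struct io_tlb_dyn_slot *slot)
{
	unsigned long flags;

	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
	list_move(&slot->node, &dev->dma_io_tlb_dyn_free);
	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
}

/* on map: try to reuse a parked slot that is big enough */
static struct io_tlb_dyn_slot *swiotlb_dyn_reuse(struct device *dev,
						 size_t alloc_size)
{
	struct io_tlb_dyn_slot *slot, *found = NULL;
	unsigned long flags;

	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
	list_for_each_entry(slot, &dev->dma_io_tlb_dyn_free, node) {
		if (slot->alloc_size >= alloc_size) {
			list_move(&slot->node, &dev->dma_io_tlb_dyn_slots);
			found = slot;
			break;
		}
	}
	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
	return found;
}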

> In Microsoft's first implementation of bounce buffering for SEV-SNP VMs,
> we created custom bounce buffer code separate from swiotlb. This code
> did something similar to what you've done, but maintained a per-device pool of allocated
> buffers that could be reused, rather than freeing the memory (and marking
> the memory encrypted again) on every DMA unmap operation. (The pool
> was actually per-VMBus channel, but VMBus channels are per-device, so
> the effect was the same.) The reusable pool avoided most of the calls to
> set_decrypted()/set_encrypted() and made it practical from a performance
> standpoint. But of course, the pool could grow arbitrarily large, so there
> was additional complexity to decay and trim the pool size. LKML feedback
> early on was to use swiotlb instead, which made sense, but at the cost of
> needing to figure out the appropriate fixed size of the swiotlb, and likely
> over-provisioning to avoid running out of bounce buffer space.
>
> Now we're considering again a more dynamic approach, which is good, but
> we're encountering the same problems.
>
> See https://lore.kernel.org/linux-hyperv/[email protected]/
> for this historical example.

Thanks for the pointer. I'll definitely have a look!

Petr T

> Michael
>
> > [...]


2023-05-16 06:22:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

On Mon, May 15, 2023 at 07:43:52PM +0000, Michael Kelley (LINUX) wrote:
> FWIW, I don't think the approach you have implemented here will be
> practical to use for CoCo VMs (SEV, TDX, whatever else). The problem
> is that dma_direct_alloc_pages() and dma_direct_free_pages() must
> call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo
> VMs, these calls are expensive because they require a hypercall to the host,
> and the operation on the host isn't trivial either. I haven't measured the
> overhead, but doing a hypercall on every DMA map operation and on
> every unmap operation has long been something we thought we must
> avoid. The fixed swiotlb bounce buffer space solves this problem by
> doing set_decrypted() in batch at boot time, and never
> doing set_encrypted().

I also suspect it doesn't really scale too well due to the number of
allocations. I suspect a better way to implement things would be to
add more large chunks that are used just like the main swiotlb buffers.

That is, when we run out of space, try to allocate another chunk of the
same size in the background, similar to what we do with the pool in
dma-pool.c. This means we'll do a fairly large allocation, so we'll
need compaction or even CMA to back it up, but the other big upside
is that it also reduces the number of buffers that need to be checked
in is_swiotlb_buffer or the free / sync side.

2023-05-16 06:49:47

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

Hi Christoph,

On Tue, 16 May 2023 08:13:09 +0200
Christoph Hellwig <[email protected]> wrote:

> On Mon, May 15, 2023 at 07:43:52PM +0000, Michael Kelley (LINUX) wrote:
> > FWIW, I don't think the approach you have implemented here will be
> > practical to use for CoCo VMs (SEV, TDX, whatever else). The problem
> > is that dma_direct_alloc_pages() and dma_direct_free_pages() must
> > call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo
> > VMs, these calls are expensive because they require a hypercall to the host,
> > and the operation on the host isn't trivial either. I haven't measured the
> > overhead, but doing a hypercall on every DMA map operation and on
> > every unmap operation has long been something we thought we must
> > avoid. The fixed swiotlb bounce buffer space solves this problem by
> > doing set_decrypted() in batch at boot time, and never
> > doing set_encrypted().
>
> I also suspect it doesn't really scale too well due to the number of
> allocations. I suspect a better way to implement things would be to
> add more large chunks that are used just like the main swiotlb buffers.
>
> That is when we run out of space try to allocate another chunk of the
> same size in the background, similar to what we do with the pool in
> dma-pool.c. This means we'll do a fairly large allocation, so we'll
> need compaction or even CMA to back it up, but the other big upside
> is that it also reduces the number of buffers that need to be checked
> in is_swiotlb_buffer or the free / sync side.

I have considered this approach. The two main issues I ran into were:

1. MAX_ORDER allocations were too small (at least with 4K pages), and
even then they would often fail.

2. Allocating from CMA did work but only from process context.
I made a stab at modifying the CMA allocator to work from interrupt
context, but there are non-trivial interactions with the buddy
allocator. Making them safe from interrupt context looked like a
major task.

I also had some fears about the length of the dynamic buffer list. I
observed the maximum length with block devices, where it roughly followed
the queue depth. Walking a few hundred buffers was still fast enough.
I admit the list length may become an issue with high-end NVMe and
I/O-intensive applications.

Last but not least, when many smaller swiotlb chunks are allocated, they
must be kept in a list (or another data structure), somewhat reducing the
performance win. OK, one thing I did *not* consider back then was
allocating these additional swiotlb chunks per device. It looks a bit
too wasteful.

Petr T

2023-05-16 08:11:14

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 7/7] swiotlb: per-device flag if there are dynamically allocated buffers

On Mon, 15 May 2023 17:28:38 +0100
Catalin Marinas <[email protected]> wrote:

> (some of your replies may have been filtered to various of my mailboxes,
> depending on which lists you cc'ed; replying here)
>
> On Mon, May 15, 2023 at 12:00:54PM +0200, Petr Tesařík wrote:
> > On Mon, 15 May 2023 10:48:47 +0200
> > Petr Tesařík <[email protected]> wrote:
> > > On Sun, 14 May 2023 19:54:27 +0100
> > > Catalin Marinas <[email protected]> wrote:
> > > > Now, thinking about the list_head access and the flag ordering, since it
> > > > doesn't matter, you might as well not bother with the flag at all and
> > > > rely on list_add() and list_empty() ordering vs the hypothetical 'blah'
> > > > access. Both of these use READ/WRITE_ONCE() for setting
> > > > dma_io_tlb_dyn_slots.next. You only need an smp_wmb() after the
> > > > list_add() and an smp_rmb() before a list_empty() check in
> > ^^^^^^^^^
> > Got it, finally. Well, that's exactly something I don't want to do.
> > For example, on arm64 (seeing your email address), smp_rmb() translates
> > to a "dsb ld" instruction. I would expect that this is more expensive
> > than a "ldar", generated by smp_load_acquire().
>
> It translates to a dmb ishld which is on par with ldar (dsb is indeed a
> lot more expensive but that's not generated here).

You're right, dsb is generated for the non-smp barrier variants. Thanks
for the correction.

> > > > is_swiotlb_buffer(), no dma_iotlb_have_dyn variable.
> > >
> > > Wait, let me check that I understand you right. Do you suggest that I
> > > convert dma_io_tlb_dyn_slots to a lockless list and get rid of the
> > > spinlock?
> > >
> > > I'm sure it can be done for list_add() and list_del(). I'll have
> > > to think about list_move().
> >
> > Hm, even the documentation of llist_empty() says that it is "not
> > guaranteed to be accurate or up to date". If it could be, I'm quite
> > sure the authors would have gladly implemented it as such.
>
> It doesn't but neither does your flag.

Yes, I have already agreed in another sub-thread.

> If you want a guarantee, you'd
> need locks because a llist_empty() on its own can race with other
> llist_add/del_*() that may not yet be visible to a CPU at exactly that
> moment. BTW, the llist implementation cannot delete a random element, so
> not sure this is suitable for your implementation (it can only delete
> the first element or the whole list).
>
> I do think you need to change your invariants and not rely on an
> absolute list_empty() or some flag:
>
> P0:
> list_add(paddr);
> WRITE_ONCE(blah, paddr);
>
> P1:
> paddr = READ_ONCE(blah);
> list_empty();
>
> Your invariant (on P1) should be blah == paddr => !list_empty(). If
> there is another P2 removing paddr from the list, this wouldn't work
> (nor your flag) but the assumption is that a correctly written driver
> that still has a reference to paddr doesn't use it after being removed
> from the list (i.e. it doesn't do a dma_unmap(paddr) and still continue
> to use this paddr for e.g. dma_sync()).

Right. In other words, given any paddr:

a. Either it is on the list, and then the list cannot become empty by
any concurrent code.

b. Or it is not on the list, but then we may skip the search
regardless of any races with other CPUs.

> For such invariant, you'd need ordering between list_add() and the
> write of paddr (smp_wmb() would do). On P1, you need an smp_rmb() before
> list_empty() since the implementation does a READ_ONCE only).

I agree.

> You still need the locks for list modifications and list traversal as I
> don't see how you can use the llist implementation with random element
> removal.

That's right. It might even perform better than a truly non-blocking
list (cf. Valois, Harris, Zhang).

> There is another scenario to take into account on the list_del() side.
> Let's assume that there are other elements on the list, so
> list_empty() == false:
>
> P0:
> list_del(paddr);
> /* the memory gets freed, added to some slab or page free list */
> WRITE_ONCE(slab_free_list, __va(paddr));
>
> P1:
> paddr = __pa(READ_ONCE(slab_free_list));/* re-allocating paddr freed on P0 */
> if (!list_empty()) { /* assuming other elements on the list */
> /* searching the list */
> list_for_each() {
> if (pos->paddr == __pa(vaddr))
> /* match */
> }
> }
>
> On P0, you want the list update to be visible before the memory is freed
> (and potentially reallocated on P1). An smp_wmb() on P0 would do. For
> P1, we don't care about list_empty() as there can be other elements
> already. But we do want any list elements reading during the search to
> be ordered after the slab_free_list reading. The smp_rmb() you'd add for
> the case above would suffice.

Yes, but to protect against concurrent insertions/deletions, a spinlock
is held while searching the list. The spinlock provides the necessary
memory barriers implicitly.

Again, thank you very much for your time!

Petr T

2023-05-16 11:27:44

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 7/7] swiotlb: per-device flag if there are dynamically allocated buffers

On Tue, May 16, 2023 at 09:55:12AM +0200, Petr Tesařík wrote:
> On Mon, 15 May 2023 17:28:38 +0100
> Catalin Marinas <[email protected]> wrote:
> > There is another scenario to take into account on the list_del() side.
> > Let's assume that there are other elements on the list, so
> > list_empty() == false:
> >
> > P0:
> > list_del(paddr);
> > /* the memory gets freed, added to some slab or page free list */
> > WRITE_ONCE(slab_free_list, __va(paddr));
> >
> > P1:
> > paddr = __pa(READ_ONCE(slab_free_list));/* re-allocating paddr freed on P0 */
> > if (!list_empty()) { /* assuming other elements on the list */
> > /* searching the list */
> > list_for_each() {
> > if (pos->paddr == __pa(vaddr))
> > /* match */
> > }
> > }
> >
> > On P0, you want the list update to be visible before the memory is freed
> > (and potentially reallocated on P1). An smp_wmb() on P0 would do. For
> > P1, we don't care about list_empty() as there can be other elements
> > already. But we do want any list elements reading during the search to
> > be ordered after the slab_free_list reading. The smp_rmb() you'd add for
> > the case above would suffice.
>
> Yes, but to protect against concurrent insertions/deletions, a spinlock
> is held while searching the list. The spin lock provides the necessary
> memory barriers implicitly.

Well, mostly. The spinlock acquire/release semantics ensure that
accesses within the locked region are not observed outside the
lock/unlock. But they don't guarantee anything about accesses outside
such a region in relation to the accesses within it. For example:

P0:
spin_lock_irqsave(&swiotlb_dyn_lock);
list_del(paddr);
spin_unlock_irqrestore(&swiotlb_dyn_lock);

/* the blah write below can be observed before list_del() above */
WRITE_ONCE(blah, paddr);

/* that's somewhat trickier but the slab_free_list update can also be
* seen before list_del() above on certain architectures */
spin_lock_irqsave(&slab_lock);
WRITE_ONCE(slab_free_list, __va(paddr));
spin_unlock_irqrestore(&slab_lock);

On most architectures, the write of the pointer to a slab structure
(assuming some spinlocks) would be ordered against the list_del() from
the swiotlb code. The exception is powerpc, where a spin_unlock() is not
necessarily ordered against a subsequent spin_lock(). That architecture
selects ARCH_WEAK_RELEASE_ACQUIRE, which in turn makes
smp_mb__after_unlock_lock() an smp_mb() (rather than a no-op, as on all
the other architectures).

On arm64 we have smp_mb__after_spinlock() which ensures that memory
accesses prior to spin_lock() are not observed after accesses within the
locked region. I don't think this matters for your case but I thought
I'd mention it.
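
Purely for illustration, a fragment showing where such a barrier would
sit if that full ordering were ever needed here (it is not, as said
above):

	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
	/* full barrier: accesses issued before spin_lock() are ordered
	 * against the accesses inside the critical section */
	smp_mb__after_spinlock();
	list_del(&slot->node);
	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);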

--
Catalin

2023-05-16 18:29:37

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

On Tue, May 16, 2023 at 08:39:42AM +0200, Petr Tesařík wrote:
> On Tue, 16 May 2023 08:13:09 +0200
> Christoph Hellwig <[email protected]> wrote:
> > On Mon, May 15, 2023 at 07:43:52PM +0000, Michael Kelley (LINUX) wrote:
> > > FWIW, I don't think the approach you have implemented here will be
> > > practical to use for CoCo VMs (SEV, TDX, whatever else). The problem
> > > is that dma_direct_alloc_pages() and dma_direct_free_pages() must
> > > call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo
> > > VMs, these calls are expensive because they require a hypercall to the host,
> > > and the operation on the host isn't trivial either. I haven't measured the
> > > overhead, but doing a hypercall on every DMA map operation and on
> > > every unmap operation has long been something we thought we must
> > > avoid. The fixed swiotlb bounce buffer space solves this problem by
> > > doing set_decrypted() in batch at boot time, and never
> > > doing set_encrypted().
> >
> > I also suspect it doesn't really scale too well due to the number of
> > allocations. I suspect a better way to implement things would be to
> > add more large chunks that are used just like the main swiotlb buffers.
> >
> > That is when we run out of space try to allocate another chunk of the
> > same size in the background, similar to what we do with the pool in
> > dma-pool.c. This means we'll do a fairly large allocation, so we'll
> > need compaction or even CMA to back it up, but the other big upside
> > is that it also reduces the number of buffers that need to be checked
> > in is_swiotlb_buffer or the free / sync side.
>
> I have considered this approach. The two main issues I ran into were:
>
> 1. MAX_ORDER allocations were too small (at least with 4K pages), and
> even then they would often fail.
>
> 2. Allocating from CMA did work but only from process context.
> I made a stab at modifying the CMA allocator to work from interrupt
> context, but there are non-trivial interactions with the buddy
> allocator. Making them safe from interrupt context looked like a
> major task.

Can you kick off a worker thread when the swiotlb allocation gets past
some reserve limit? It still has a risk of failing to bounce until the
swiotlb buffer is extended.
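
Roughly what I have in mind, as a sketch only (the counter, the
watermark and the helper below are all made up, not taken from your
series):

#include <linux/atomic.h>
#include <linux/workqueue.h>

#define SWIOTLB_RESERVE_SLOTS	64	/* arbitrary reserve watermark */

static atomic_t swiotlb_free_slots;	/* assumed to be maintained by the
					 * slot allocator */

static void swiotlb_grow_workfn(struct work_struct *work)
{
	/*
	 * Sleepable context: allocate another bounce buffer chunk here
	 * (page allocator or CMA) and publish it to the pool.
	 */
}

static DECLARE_WORK(swiotlb_grow_work, swiotlb_grow_workfn);

/* Would be called on the map path after a slot has been taken. */
static void swiotlb_check_watermark(void)
{
	if (atomic_read(&swiotlb_free_slots) < SWIOTLB_RESERVE_SLOTS)
		schedule_work(&swiotlb_grow_work);
}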

> I also had some fears about the length of the dynamic buffer list. I
> observed maximum length for block devices, and then it roughly followed
> the queue depth. Walking a few hundred buffers was still fast enough.
> I admit the list length may become an issue with high-end NVMe and
> I/O-intensive applications.

You could replace the list with an rbtree, O(log n) look-up vs O(n),
could be faster if you have many bounces active.
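
Something along these lines, as a sketch (the node layout is made up):

#include <linux/rbtree.h>
#include <linux/types.h>

struct dyn_buf {			/* hypothetical per-buffer node */
	struct rb_node rb;
	phys_addr_t start;
	size_t size;
};

/* Walk the tree looking for the buffer whose range contains paddr. */
static struct dyn_buf *dyn_buf_lookup(struct rb_root *root, phys_addr_t paddr)
{
	struct rb_node *node = root->rb_node;

	while (node) {
		struct dyn_buf *buf = rb_entry(node, struct dyn_buf, rb);

		if (paddr < buf->start)
			node = node->rb_left;
		else if (paddr >= buf->start + buf->size)
			node = node->rb_right;
		else
			return buf;	/* paddr falls inside this buffer */
	}
	return NULL;
}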

--
Catalin

2023-05-17 06:57:12

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

Hi Catalin,

On Tue, 16 May 2023 18:59:30 +0100
Catalin Marinas <[email protected]> wrote:

> On Tue, May 16, 2023 at 08:39:42AM +0200, Petr Tesařík wrote:
> > On Tue, 16 May 2023 08:13:09 +0200
> > Christoph Hellwig <[email protected]> wrote:
> > > On Mon, May 15, 2023 at 07:43:52PM +0000, Michael Kelley (LINUX) wrote:
> > > > FWIW, I don't think the approach you have implemented here will be
> > > > practical to use for CoCo VMs (SEV, TDX, whatever else). The problem
> > > > is that dma_direct_alloc_pages() and dma_direct_free_pages() must
> > > > call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo
> > > > VMs, these calls are expensive because they require a hypercall to the host,
> > > > and the operation on the host isn't trivial either. I haven't measured the
> > > > overhead, but doing a hypercall on every DMA map operation and on
> > > > every unmap operation has long been something we thought we must
> > > > avoid. The fixed swiotlb bounce buffer space solves this problem by
> > > > doing set_decrypted() in batch at boot time, and never
> > > > doing set_encrypted().
> > >
> > > I also suspect it doesn't really scale too well due to the number of
> > > allocations. I suspect a better way to implement things would be to
> > > add more large chunks that are used just like the main swiotlb buffers.
> > >
> > > That is when we run out of space try to allocate another chunk of the
> > > same size in the background, similar to what we do with the pool in
> > > dma-pool.c. This means we'll do a fairly large allocation, so we'll
> > > need compaction or even CMA to back it up, but the other big upside
> > > is that it also reduces the number of buffers that need to be checked
> > > in is_swiotlb_buffer or the free / sync side.
> >
> > I have considered this approach. The two main issues I ran into were:
> >
> > 1. MAX_ORDER allocations were too small (at least with 4K pages), and
> > even then they would often fail.
> >
> > 2. Allocating from CMA did work but only from process context.
> > I made a stab at modifying the CMA allocator to work from interrupt
> > context, but there are non-trivial interactions with the buddy
> > allocator. Making them safe from interrupt context looked like a
> > major task.
>
> Can you kick off a worker thread when the swiotlb allocation gets past
> some reserve limit? It still has a risk of failing to bounce until the
> swiotlb buffer is extended.

Yes, this can be done, and some form of a worker thread is in fact on
my roadmap. Initially, I want to see the impact of a "dumb" algorithm
and get some feedback from people like you. Glad your ideas move in a
similar direction as mine. :-)

> > I also had some fears about the length of the dynamic buffer list. I
> > observed maximum length for block devices, and then it roughly followed
> > the queue depth. Walking a few hundred buffers was still fast enough.
> > I admit the list length may become an issue with high-end NVMe and
> > I/O-intensive applications.
>
> You could replace the list with an rbtree, O(log n) look-up vs O(n),
> could be faster if you have many bounces active.

I could also do it for individual allocations. And I did it.

First, the paddr lookup does not search for a specific key in the tree,
but rather for a match within a range. The maple tree was invented for
exactly this purpose, so that's what I tried. There are some caveats
when using maple trees from interrupt context, but I made the necessary
modifications in my local tree.
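
For reference, the tracking side then boils down to a range store and a
point lookup, roughly like this (a sketch only; the names are made up
and this is not the exact code I benchmarked):

#include <linux/gfp.h>
#include <linux/maple_tree.h>

static DEFINE_MTREE(swiotlb_dyn_mt);	/* tree name made up for this sketch */

/* Track a bounce buffer covering [paddr, paddr + size - 1].
 * Assumes phys_addr_t fits in unsigned long (true on 64-bit). */
static int dyn_track(void *entry, phys_addr_t paddr, size_t size)
{
	/* GFP_ATOMIC because this may run in atomic context; node
	 * allocation is one of the interrupt-context caveats mentioned
	 * above. */
	return mtree_store_range(&swiotlb_dyn_mt, paddr, paddr + size - 1,
				 entry, GFP_ATOMIC);
}

/* Return the entry whose tracked range contains paddr, or NULL. */
static void *dyn_lookup(phys_addr_t paddr)
{
	return mtree_load(&swiotlb_dyn_mt, paddr);
}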

I ran my tests against a SATA virtio disk in an x86-64 VM booted with
swiotlb=force. The results were:

Compared to plain linked list
small-rw -6.6% (4KiB, 1 thread)
parallel-rw -10.5% (64KiB, 4 threads)
big-rw -8.5% (1MiB, 1 thread)

Of course, these numbers say nothing about the performance of a maple
tree for tracking additional swiotlb chunks, because the mix of
additions, deletions and lookups will be different, but my point is
that a "better" data structure may not always be better.

My testing also suggests that the lookup path is extremely hot. It was
very sensitive even to small changes (like moving the flag setting
before deletion).

Anyway, my greatest objection to allocating additional swiotlb chunks is
that _all_ of them must be searched to determine that the physical
address does _not_ belong to a swiotlb, incurring performance penalty
for non-constrained (presumably fast) devices that do not need a
swiotlb. Sure, this is irrelevant for CoCo VMs where all devices must
use swiotlb, but we've seen other scenarios which benefit from a
dynamically sized swiotlb. It's a bit frustrating if a cheap wifi
adapter reduces your disk performance...

Besides, if the list or tree of swiotlb chunks is protected with a
lock, this lock becomes contended.

Last but not least, the allocation size is dynamic in theory, but it
would most likely never shrink: a swiotlb chunk can be freed only if it
is completely unused, and that may never happen after a spike, because
some mappings are rather long-lived (which is probably wrong, but it's
the current state).

Much of the above can be solved if additional swiotlb chunks are
allocated per device. OTOH I am a bit concerned about increasing memory
requirements. After all, one of the use cases is to reduce kernel
memory footprint on mobile devices.

To say something positive, I have also found one upside to additional
swiotlb chunks: They make it possible to meet all alignment and boundary
crossing constraints (unlike plain CMA allocations).

Petr T

2023-05-17 07:10:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

Just thinking out loud:

- what if we always way overallocate the swiotlb buffer
- and then mark the second half / two thirds / <pull some number out
of the thin air> slots as used, and make that region available
through a special CMA mechanism as ZONE_MOVABLE (but not allowing
other CMA allocations to dip into it).

This allows us to have a single slot management for the entire
area, but allow reclaiming from it. We'd probably also need to make
this CMA variant irq safe.

This could still be combined with more aggressive use of per-device
swiotlb area, which is probably a good idea based on some hints.
E.g. device could hint an amount of inflight DMA to the DMA layer,
and if there are addressing limitations and the amount is large enough
that could cause the allocation of a per-device swiotlb area.


2023-05-17 07:49:21

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

Hi Christoph,

On Wed, 17 May 2023 08:56:53 +0200
Christoph Hellwig <[email protected]> wrote:

> Just thinking out loud:
>
> - what if we always way overallocate the swiotlb buffer
> - and then mark the second half / two thirds / <pull some number out
> of the thin air> slots as used, and make that region available
> through a special CMA mechanism as ZONE_MOVABLE (but not allowing
> other CMA allocations to dip into it).

This approach has also been considered internally at Huawei, and it
looked like a viable option, just more complex. We decided to send the
simple approach first to get some feedback and find out who else might
be interested in the dynamic sizing of swiotlb (if anyone).

> This allows us to have a single slot management for the entire
> area, but allow reclaiming from it. We'd probably also need to make
> this CMA variant irq safe.

Let me recap my internal analysis.

On the pro side:

- no performance penalty for devices that do not use swiotlb
- all alignment and boundary constraints can be met
- efficient use of memory for buffers smaller than 1 page

On the con side:

- ZONE_MOVABLE cannot be used for most kernel allocations
- competition with CMA over precious physical address space
(How much should be reserved for CMA and how much for SWIOTLB?)

To quote from Memory hotplug documentation:

Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. [...]
Actual safe zone ratios depend on the workload. Extreme cases, like
excessive long-term pinning of pages, might not be able to deal with
ZONE_MOVABLE at all.

This should be no big issue on bare metal (where the motivation is
addressing limitations), but the size of SWIOTLB in CoCo VMs probably
needs some consideration.

> This could still be combined with more aggressive use of per-device
> swiotlb area, which is probably a good idea based on some hints.
> E.g. device could hint an amount of inflight DMA to the DMA layer,
> and if there are addressing limitations and the amount is large enough
> that could cause the allocation of a per-device swiotlb area.

I would not rely on device hints, because it probably depends on
workload rather than type of device. I'd rather implement some logic
based on the actual runtime usage pattern. I have some ideas already.

Petr T

2023-05-17 09:44:30

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

On Wed, May 17, 2023 at 08:56:53AM +0200, Christoph Hellwig wrote:
> Just thinking out loud:
>
> - what if we always way overallocate the swiotlb buffer
> - and then mark the second half / two thirds / <pull some number out
> of the thin air> slots as used, and make that region available
> through a special CMA mechanism as ZONE_MOVABLE (but not allowing
> other CMA allocations to dip into it).
>
> This allows us to have a single slot management for the entire
> area, but allow reclaiming from it. We'd probably also need to make
> this CMA variant irq safe.

I think this could work. It doesn't need to be ZONE_MOVABLE (and we
actually need this buffer in ZONE_DMA). But we can introduce a new
migrate type, MIGRATE_SWIOTLB, and movable page allocations can use this
range. The CMA allocations go to free_list[MIGRATE_CMA], so they won't
overlap.

One of the downsides is that migrating movable pages still needs a
sleep-able context.

Another potential confusion is is_swiotlb_buffer() for pages in this
range allocated through the normal page allocator. We may need to check
the slots as well rather than just the buffer boundaries.

(we are actually looking at a MIGRATE_METADATA type for the arm64 memory
tagging extension which uses a 3% carveout of the RAM for storing the
tags and people want that reused somehow; we have some WIP patches but
we'll post them later this summer)

> This could still be combined with more aggressive use of per-device
> swiotlb area, which is probably a good idea based on some hints.
> E.g. device could hint an amount of inflight DMA to the DMA layer,
> and if there are addressing limitations and the amount is large enough
> that could cause the allocation of a per-device swiotlb area.

If we go for one large-ish per-device buffer for specific cases, maybe
something similar to the rmem_swiotlb_setup() but which can be
dynamically allocated at run-time and may live alongside the default
swiotlb. The advantage is that it uses a similar slot tracking to the
default swiotlb, no need to invent another. This per-device buffer could
also be allocated from the MIGRATE_SWIOTLB range if we make it large
enough at boot. It would be seen as just a local accelerator for devices
that use bouncing frequently or from irq context.

--
Catalin

2023-05-17 10:17:08

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

On Wed, 17 May 2023 10:41:19 +0100
Catalin Marinas <[email protected]> wrote:

> On Wed, May 17, 2023 at 08:56:53AM +0200, Christoph Hellwig wrote:
> > Just thinking out loud:
> >
> > - what if we always way overallocate the swiotlb buffer
> > - and then mark the second half / two thirds / <pull some number out
> > of the thin air> slots as used, and make that region available
> > through a special CMA mechanism as ZONE_MOVABLE (but not allowing
> > other CMA allocations to dip into it).
> >
> > This allows us to have a single slot management for the entire
> > area, but allow reclaiming from it. We'd probably also need to make
> > this CMA variant irq safe.
>
> I think this could work. It doesn't need to be ZONE_MOVABLE (and we
> actually need this buffer in ZONE_DMA). But we can introduce a new
> migrate type, MIGRATE_SWIOTLB, and movable page allocations can use this
> range. The CMA allocations go to free_list[MIGRATE_CMA], so they won't
> overlap.
>
> One of the downsides is that migrating movable pages still needs a
> sleep-able context.

Pages can be migrated by a separate worker thread when the number of
free slots reaches a low watermark.

> Another potential confusion is is_swiotlb_buffer() for pages in this
> range allocated through the normal page allocator. We may need to check
> the slots as well rather than just the buffer boundaries.

Ah, yes, I forgot about this part; thanks for the reminder.

Indeed, movable pages can be used for the page cache, and drivers do
DMA to/from buffers in the page cache.

Let me recap:

- Allocated chunks must still be tracked with this approach.
- The pool of available slots cannot be grown from interrupt context.

So, what exactly is the advantage compared to allocating additional
swiotlb chunks from CMA?

> (we are actually looking at a MIGRATE_METADATA type for the arm64 memory
> tagging extension which uses a 3% carveout of the RAM for storing the
> tags and people want that reused somehow; we have some WIP patches but
> we'll post them later this summer)
>
> > This could still be combined with more aggressive use of per-device
> > swiotlb area, which is probably a good idea based on some hints.
> > E.g. device could hint an amount of inflight DMA to the DMA layer,
> > > and if there are addressing limitations and the amount is large enough
> > that could cause the allocation of a per-device swiotlb area.
>
> If we go for one large-ish per-device buffer for specific cases, maybe
> something similar to the rmem_swiotlb_setup() but which can be
> dynamically allocated at run-time and may live alongside the default
> swiotlb. The advantage is that it uses a similar slot tracking to the
> default swiotlb, no need to invent another. This per-device buffer could
> also be allocated from the MIGRATE_SWIOTLB range if we make it large
> enough at boot. It would be seen as just a local accelerator for devices
> that use bouncing frequently or from irq context.

A per-device pool could also be used for small buffers. IIRC somebody
was interested in that.

Petr T

2023-05-17 11:30:11

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

On Wed, May 17, 2023 at 11:58:21AM +0200, Petr Tesařík wrote:
> On Wed, 17 May 2023 10:41:19 +0100
> Catalin Marinas <[email protected]> wrote:
> > On Wed, May 17, 2023 at 08:56:53AM +0200, Christoph Hellwig wrote:
> > > Just thinking out loud:
> > >
> > > - what if we always way overallocate the swiotlb buffer
> > > - and then mark the second half / two thirds / <pull some number out
> > > of the thin air> slots as used, and make that region available
> > > through a special CMA mechanism as ZONE_MOVABLE (but not allowing
> > > other CMA allocations to dip into it).
> > >
> > > This allows us to have a single slot management for the entire
> > > area, but allow reclaiming from it. We'd probably also need to make
> > > this CMA variant irq safe.
> >
> > I think this could work. It doesn't need to be ZONE_MOVABLE (and we
> > actually need this buffer in ZONE_DMA). But we can introduce a new
> > migrate type, MIGRATE_SWIOTLB, and movable page allocations can use this
> > range. The CMA allocations go to free_list[MIGRATE_CMA], so they won't
> > overlap.
> >
> > One of the downsides is that migrating movable pages still needs a
> > sleep-able context.
>
> Pages can be migrated by a separate worker thread when the number of
> free slots reaches a low watermark.

Indeed, you still need such a worker thread.

> > Another potential confusion is is_swiotlb_buffer() for pages in this
> > range allocated through the normal page allocator. We may need to check
> > the slots as well rather than just the buffer boundaries.
>
> Ah, yes, I forgot about this part; thanks for the reminder.
>
> Indeed, movable pages can be used for the page cache, and drivers do
> DMA to/from buffers in the page cache.
>
> Let me recap:
>
> - Allocated chunks must still be tracked with this approach.
> - The pool of available slots cannot be grown from interrupt context.
>
> So, what exactly is the advantage compared to allocating additional
> swiotlb chunks from CMA?

This would work as well, but it depends on how many other drivers
allocate from the CMA range. Maybe it's simpler to do this initially (I
haven't got to your other emails yet).

> > > This could still be combined with more aggressive use of per-device
> > > swiotlb area, which is probably a good idea based on some hints.
> > > E.g. device could hint an amount of inflight DMA to the DMA layer,
> > > and if there are addressing limitations and the amount is large enough
> > > that could cause the allocation of a per-device swiotlb area.
> >
> > If we go for one large-ish per-device buffer for specific cases, maybe
> > something similar to the rmem_swiotlb_setup() but which can be
> > dynamically allocated at run-time and may live alongside the default
> > swiotlb. The advantage is that it uses a similar slot tracking to the
> > default swiotlb, no need to invent another. This per-device buffer could
> > also be allocated from the MIGRATE_SWIOTLB range if we make it large
> > enough at boot. It would be seen as just a local accelerator for devices
> > that use bouncing frequently or from irq context.
>
> A per-device pool could also be used for small buffers. IIRC somebody
> was interested in that.

That was me ;) but TBH, I don't care how large the bounce buffer is,
only that it can bounce small structures. If there's some critical path,
people can change the kmalloc() allocation for those structures to make
them cacheline-aligned.

--
Catalin

2023-05-17 11:43:39

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

On Wed, 17 May 2023 08:35:10 +0200
Petr Tesařík <[email protected]> wrote:

>[...]
> Anyway, my greatest objection to allocating additional swiotlb chunks is
> that _all_ of them must be searched to determine that the physical
> address does _not_ belong to a swiotlb, incurring performance penalty

I thought about this part again, and I overlooked one option. We can
track only the _active_ swiotlbs for each device. If a device never
needs a swiotlb, there is no active swiotlb, and is_swiotlb_buffer()
short-circuits to false. This should avoid all collateral damage to
innocent devices.

We would also maintain a (global) list of all allocated swiotlbs, used
by swiotlb_map() to find free slots and add the respective swiotlb to
the per-device active list.
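
Roughly this shape, as a sketch only (all structure and field names are
made up; the per-device members would be new fields in struct device,
similar to the swiotlb fields this series already adds):

#include <linux/device.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct swiotlb_chunk {			/* one allocated swiotlb chunk */
	struct list_head global_node;	/* on the global list of all chunks */
	struct list_head active_node;	/* on the owning device's active list */
	phys_addr_t start;
	phys_addr_t end;
};

static bool dev_is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
{
	struct swiotlb_chunk *chunk;
	unsigned long flags;
	bool found = false;

	/*
	 * Lockless fast path: a device that never bounced has an empty
	 * list (dma_active_chunks and dma_active_lock being hypothetical
	 * new struct device fields), so innocent devices pay almost
	 * nothing. This check needs the memory-ordering care discussed
	 * earlier in the thread.
	 */
	if (list_empty(&dev->dma_active_chunks))
		return false;

	spin_lock_irqsave(&dev->dma_active_lock, flags);
	list_for_each_entry(chunk, &dev->dma_active_chunks, active_node) {
		if (paddr >= chunk->start && paddr < chunk->end) {
			found = true;
			break;
		}
	}
	spin_unlock_irqrestore(&dev->dma_active_lock, flags);

	return found;
}

The trade-off is an extra list insertion on a device's first bounce in
exchange for a cheap negative answer ever after.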

One potential advantage is that we could use mapping size and alignment
to choose a swiotlb cleverly and minimize internal fragmentation...

OK, I'm dreaming. Let's agree on the general approach first.

Petr T

2023-05-23 10:40:55

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

On Wed, May 17, 2023 at 01:27:48PM +0200, Petr Tesařík wrote:
> On Wed, 17 May 2023 08:35:10 +0200
> Petr Tesařík <[email protected]> wrote:
> > Anyway, my greatest objection to allocating additional swiotlb chunks is
> > that _all_ of them must be searched to determine that the physical
> > address does _not_ belong to a swiotlb, incurring performance penalty
>
> I thought about this part again, and I overlooked one option. We can
> track only the _active_ swiotlbs for each device. If a device never
> needs a swiotlb, there is no active swiotlb, and is_swiotlb_buffer()
> short-circuits to false. This should avoid all collateral damage to
> innocent devices.

Does this work with dma-buf or does dma-buf not allow swiotlb bouncing?

--
Catalin

2023-05-23 12:03:43

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

On Tue, 23 May 2023 10:54:11 +0100
Catalin Marinas <[email protected]> wrote:

> On Wed, May 17, 2023 at 01:27:48PM +0200, Petr Tesařík wrote:
> > On Wed, 17 May 2023 08:35:10 +0200
> > Petr Tesařík <[email protected]> wrote:
> > > Anyway, my greatest objection to allocating additional swiotlb chunks is
> > > that _all_ of them must be searched to determine that the physical
> > > address does _not_ belong to a swiotlb, incurring performance penalty
> >
> > I thought about this part again, and I overlooked one option. We can
> > track only the _active_ swiotlbs for each device. If a device never
> > needs a swiotlb, there is no active swiotlb, and is_swiotlb_buffer()
> > short-circuits to false. This should avoid all collateral damage to
> > innocent devices.
>
> Does this work with dma-buf or does dma-buf not allow swiotlb bouncing?

Currently, it does work with dma-buf. OTOH Christoph is apparently not
very happy about it and would rather implement alternative mechanisms to
let dma-buf allocate buffers so that they do not require swiotlb. See
his reply here:

https://lkml.org/lkml/2023/4/7/38

OTOH if you're asking because of swiotlb use by encrypted VM guests,
the answer might be different.

Cheers
Petr T