swiotlb currently reports the total number of slabs and the instantaneous
in-use slabs in debugfs. But with increased usage of swiotlb for all I/O
in Confidential Computing (coco) VMs, it has become difficult to know
how much memory to allocate for swiotlb bounce buffers, either via the
automatic algorithm in the kernel or by specifying a value on the
kernel boot line. The current automatic algorithm generously allocates
swiotlb bounce buffer memory, and may be wasting significant memory in
many use cases.
To support better understanding swiotlb usage, add tracking of the
the high water mark usage of swiotlb bounce buffer memory. Report the
high water mark in debugfs along with the other swiotlb metrics. Allow
the high water to be reset to zero at runtime by writing to it.
Since a global in-use slab count is added alongside the existing
per-area in-use count, the mem_used() function that sums across all
areas is no longer needed. Remove it and replace with the global
in-use count.
Signed-off-by: Michael Kelley <[email protected]>
---
kernel/dma/swiotlb.c | 46 ++++++++++++++++++++++++++++++++++------------
1 file changed, 34 insertions(+), 12 deletions(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d3d6be0..2dc501e 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -76,6 +76,9 @@ struct io_tlb_slot {
static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
static unsigned long default_nareas;
+static atomic_long_t total_used = ATOMIC_LONG_INIT(0);
+static atomic_long_t used_hiwater = ATOMIC_LONG_INIT(0);
+
/**
* struct io_tlb_area - IO TLB memory area descriptor
*
@@ -594,6 +597,7 @@ static int swiotlb_do_find_slots(struct device *dev, int area_index,
unsigned long flags;
unsigned int slot_base;
unsigned int slot_index;
+ unsigned long old_hiwater, new_used;
BUG_ON(!nslots);
BUG_ON(area_index >= mem->nareas);
@@ -663,6 +667,14 @@ static int swiotlb_do_find_slots(struct device *dev, int area_index,
area->index = 0;
area->used += nslots;
spin_unlock_irqrestore(&area->lock, flags);
+
+ new_used = atomic_long_add_return(nslots, &total_used);
+ old_hiwater = atomic_long_read(&used_hiwater);
+ do {
+ if (new_used <= old_hiwater)
+ break;
+ } while (!atomic_long_try_cmpxchg(&used_hiwater, &old_hiwater, new_used));
+
return slot_index;
}
@@ -685,16 +697,6 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
return -1;
}
-static unsigned long mem_used(struct io_tlb_mem *mem)
-{
- int i;
- unsigned long used = 0;
-
- for (i = 0; i < mem->nareas; i++)
- used += mem->areas[i].used;
- return used;
-}
-
phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
size_t mapping_size, size_t alloc_size,
unsigned int alloc_align_mask, enum dma_data_direction dir,
@@ -727,7 +729,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
if (!(attrs & DMA_ATTR_NO_WARN))
dev_warn_ratelimited(dev,
"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
- alloc_size, mem->nslabs, mem_used(mem));
+ alloc_size, mem->nslabs, atomic_long_read(&total_used));
return (phys_addr_t)DMA_MAPPING_ERROR;
}
@@ -795,6 +797,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
mem->slots[i].list = ++count;
area->used -= nslots;
spin_unlock_irqrestore(&area->lock, flags);
+
+ atomic_long_sub(nslots, &total_used);
}
/*
@@ -891,10 +895,26 @@ bool is_swiotlb_active(struct device *dev)
static int io_tlb_used_get(void *data, u64 *val)
{
- *val = mem_used(&io_tlb_default_mem);
+ *val = (u64)atomic_long_read(&total_used);
return 0;
}
+
+static int io_tlb_hiwater_get(void *data, u64 *val)
+{
+ *val = (u64)atomic_long_read(&used_hiwater);
+ return 0;
+}
+
+static int io_tlb_hiwater_set(void *data, u64 val)
+{
+ /* Write of any value resets high water mark to zero */
+ atomic_long_set(&used_hiwater, 0);
+ return 0;
+}
+
DEFINE_DEBUGFS_ATTRIBUTE(fops_io_tlb_used, io_tlb_used_get, NULL, "%llu\n");
+DEFINE_DEBUGFS_ATTRIBUTE(fops_io_tlb_hiwater, io_tlb_hiwater_get,
+ io_tlb_hiwater_set, "%llu\n");
static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
const char *dirname)
@@ -906,6 +926,8 @@ static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs);
debugfs_create_file("io_tlb_used", 0400, mem->debugfs, NULL,
&fops_io_tlb_used);
+ debugfs_create_file("io_tlb_used_hiwater", 0600, mem->debugfs, NULL,
+ &fops_io_tlb_hiwater);
}
static int __init __maybe_unused swiotlb_create_default_debugfs(void)
--
1.8.3.1
> From: Michael Kelley (LINUX) <[email protected]>
> Sent: Monday, February 27, 2023 8:42 AM
> ...
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> @@ -76,6 +76,9 @@ struct io_tlb_slot {
> static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
> static unsigned long default_nareas;
>
> +static atomic_long_t total_used = ATOMIC_LONG_INIT(0);
> +static atomic_long_t used_hiwater = ATOMIC_LONG_INIT(0);
> ...
> @@ -594,6 +597,7 @@ static int swiotlb_do_find_slots(struct device *dev, int
> area_index,
> unsigned long flags;
> unsigned int slot_base;
> unsigned int slot_index;
> + unsigned long old_hiwater, new_used;
> ...
> @@ -663,6 +667,14 @@ static int swiotlb_do_find_slots(struct device *dev, int
> area_index,
> area->index = 0;
> area->used += nslots;
> spin_unlock_irqrestore(&area->lock, flags);
> +
> + new_used = atomic_long_add_return(nslots, &total_used);
> + old_hiwater = atomic_long_read(&used_hiwater);
> + do {
> + if (new_used <= old_hiwater)
> + break;
> + } while (!atomic_long_try_cmpxchg(&used_hiwater, &old_hiwater,
> new_used));
Here 'old_hiwater' is not updated in the loop, which looks suspicious to me.
Imagine the below scenario:
Initially total_used is 0, used_hiwater is 0.
Thread A: total_used = 10; new_used = 10; old_hiwater = 0;
Thread B: total_used = 20; new_used = 20; old_hiwater = 0;
Thread A enters the 'do' loop:
used_hiwater is 0; old_hiwater is 0, so used_hiwater = 10, and we
exit from the loop.
Thread B enters the 'do' loop:
new_used is 20, old_hiwater *always* is 0, used_hiwater is 10;
because used_hiwater always doesn't equal old_hiwater,
atomic_long_try_cmpxchg() always returns 0, and we get stuck in
the loop forever.
I think the line
+ old_hiwater = atomic_long_read(&used_hiwater);
should be moved into the loop to resolve the issue.
> +static int io_tlb_hiwater_set(void *data, u64 val)
> +{
> + /* Write of any value resets high water mark to zero */
Maybe it's better if we return -EINVAL when 'val' isn't 0 ?
> + atomic_long_set(&used_hiwater, 0);
> + return 0;
> +}
Hi,
On 2/27/2023 5:41 PM, Michael Kelley wrote:
> swiotlb currently reports the total number of slabs and the instantaneous
> in-use slabs in debugfs. But with increased usage of swiotlb for all I/O
> in Confidential Computing (coco) VMs, it has become difficult to know
> how much memory to allocate for swiotlb bounce buffers, either via the
> automatic algorithm in the kernel or by specifying a value on the
> kernel boot line. The current automatic algorithm generously allocates
> swiotlb bounce buffer memory, and may be wasting significant memory in
> many use cases.
I like the idea. Tracking and exposing the high watermark is definitely a step in the right direction. I would also appreciate an indicator of fragmentation, but that can be implemented later.
However, you apparently have a scenario where swiotlb is used intensely. Are you able to measure swiotlb performance? If yes, can you suggest a suitable method? I'm asking for my work on making swiotlb more flexible (able to grow and shrink).
Petr T
From: Dexuan Cui <[email protected]> Sent: Wednesday, March 22, 2023 7:10 AM
>
> > From: Michael Kelley (LINUX) <[email protected]>
> > Sent: Monday, February 27, 2023 8:42 AM
> > ...
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > @@ -76,6 +76,9 @@ struct io_tlb_slot {
> > static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
> > static unsigned long default_nareas;
> >
> > +static atomic_long_t total_used = ATOMIC_LONG_INIT(0);
> > +static atomic_long_t used_hiwater = ATOMIC_LONG_INIT(0);
> > ...
> > @@ -594,6 +597,7 @@ static int swiotlb_do_find_slots(struct device *dev, int
> > area_index,
> > unsigned long flags;
> > unsigned int slot_base;
> > unsigned int slot_index;
> > + unsigned long old_hiwater, new_used;
> > ...
> > @@ -663,6 +667,14 @@ static int swiotlb_do_find_slots(struct device *dev, int
> > area_index,
> > area->index = 0;
> > area->used += nslots;
> > spin_unlock_irqrestore(&area->lock, flags);
> > +
> > + new_used = atomic_long_add_return(nslots, &total_used);
> > + old_hiwater = atomic_long_read(&used_hiwater);
> > + do {
> > + if (new_used <= old_hiwater)
> > + break;
> > + } while (!atomic_long_try_cmpxchg(&used_hiwater, &old_hiwater, new_used));
>
> Here 'old_hiwater' is not updated in the loop, which looks suspicious to me.
Actually, it *is* updated in the loop. The atomic_long_try_cmpxchg()
function updates old_hiwater to the current value if the exchange
fails. The address of old_hiwater is passed as the argument, and
not just the value, so that it can be updated. See the documentation
for "CMPXCHG vs TRY_CMPXCHG" in Documentation/atomic_t.txt,
which includes a usage example.
> Imagine the below scenario:
>
> Initially total_used is 0, used_hiwater is 0.
>
> Thread A: total_used = 10; new_used = 10; old_hiwater = 0;
> Thread B: total_used = 20; new_used = 20; old_hiwater = 0;
>
> Thread A enters the 'do' loop:
> used_hiwater is 0; old_hiwater is 0, so used_hiwater = 10, and we
> exit from the loop.
>
> Thread B enters the 'do' loop:
> new_used is 20, old_hiwater *always* is 0, used_hiwater is 10;
> because used_hiwater always doesn't equal old_hiwater,
> atomic_long_try_cmpxchg() always returns 0, and we get stuck in
> the loop forever.
>
> I think the line
> + old_hiwater = atomic_long_read(&used_hiwater);
> should be moved into the loop to resolve the issue.
>
> > +static int io_tlb_hiwater_set(void *data, u64 val)
> > +{
> > + /* Write of any value resets high water mark to zero */
>
> Maybe it's better if we return -EINVAL when 'val' isn't 0 ?
Yes, that would probably be clearer, to prevent someone from
thinking they could reset the value to something non-zero. We
*could* allow resetting the value to something non-zero, but I
don't see a real use case for that.
Thanks for the review ....
Michael
>
> > + atomic_long_set(&used_hiwater, 0);
> > + return 0;
> > +}
From: Petr Tesarik <[email protected]> Sent: Wednesday, March 22, 2023 7:33 AM
>
> On 2/27/2023 5:41 PM, Michael Kelley wrote:
> > swiotlb currently reports the total number of slabs and the instantaneous
> > in-use slabs in debugfs. But with increased usage of swiotlb for all I/O
> > in Confidential Computing (coco) VMs, it has become difficult to know
> > how much memory to allocate for swiotlb bounce buffers, either via the
> > automatic algorithm in the kernel or by specifying a value on the
> > kernel boot line. The current automatic algorithm generously allocates
> > swiotlb bounce buffer memory, and may be wasting significant memory in
> > many use cases.
>
> I like the idea. Tracking and exposing the high watermark is definitely a step in the right
> direction. I would also appreciate an indicator of fragmentation, but that can be
> implemented later.
>
> However, you apparently have a scenario where swiotlb is used intensely. Are you able
> to measure swiotlb performance? If yes, can you suggest a suitable method? I'm asking
> for my work on making swiotlb more flexible (able to grow and shrink).
>
Yes, in the CoCo VM scenario, all DMA transfers must use bounce buffers
because the hardware doesn't support I/O devices doing DMA to encrypted
memory. Our measurements have primarily been of disk I/O bandwidth
and latency using 'fio'. We compare the measurements in a normal VM
vs. a CoCo VM, with the delta mostly being the overhead of doing the
bounce buffering. The delta includes the cost of swiotlb allocate/free,
plus the data copying. Admittedly, this is a more macro-level measurement,
but it is directly relevant because it's ultimately the impact that users of
CoCo VMs see. We have not done micro-level measurements of just the
swiotlb allocate/free functions.
Making the swiotlb able to grow and shrink would likely be very desirable
for CoCo VMs. I'll look forward to seeing how that comes out. :-)
Michael
> From: Michael Kelley (LINUX) <[email protected]>
> Sent: Wednesday, March 22, 2023 7:39 AM
> > > ...
> > > + new_used = atomic_long_add_return(nslots, &total_used);
> > > + old_hiwater = atomic_long_read(&used_hiwater);
> > > + do {
> > > + if (new_used <= old_hiwater)
> > > + break;
> > > + } while (!atomic_long_try_cmpxchg(&used_hiwater, &old_hiwater,
> new_used));
> >
> > Here 'old_hiwater' is not updated in the loop, which looks suspicious to me.
>
> Actually, it *is* updated in the loop. The atomic_long_try_cmpxchg()
> function updates old_hiwater to the current value if the exchange
> fails.
Yes, you're correct. Thanks for the explanation!