2022-06-08 02:16:32

by Tony Battersby

Subject: [PATCH v6 00/11] mpt3sas and dmapool scalability

This patch series improves dmapool scalability by replacing linear scans
with red-black trees.

Note that Keith Busch is also working on improving dmapool scalability,
so for now I would recommend not merging my scalability patches until
Keith's approach can be evaluated. In the meantime, my patches can
serve as a benchmark comparison. I also have a number of cleanup
patches in my series that could be useful on their own.

Changes since v5:
1. inline pool_free_page() into dma_pool_destroy() to avoid adding
unused code
2. convert scnprintf() to sysfs_emit()
3. avoid adding a hole in struct dma_pool
4. fix big O usage in description

References:

v5
https://lore.kernel.org/linux-mm/[email protected]/

Keith Busch's dmapool performance enhancements
https://lore.kernel.org/linux-mm/[email protected]/

Below is my original description of the motivation for these patches.

drivers/scsi/mpt3sas is running into a scalability problem with the
kernel's DMA pool implementation. With an LSI/Broadcom SAS 9300-8i
12Gb/s HBA and max_sgl_entries=256, during modprobe, mpt3sas does the
equivalent of:

chain_dma_pool = dma_pool_create(size = 128);
for (i = 0; i < 373959; i++) {
	dma_addr[i] = dma_pool_alloc(chain_dma_pool);
}

And at rmmod, system shutdown, or system reboot, mpt3sas does the
equivalent of:

for (i = 0; i < 373959; i++) {
	dma_pool_free(chain_dma_pool, dma_addr[i]);
}
dma_pool_destroy(chain_dma_pool);
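
For readers unfamiliar with the dmapool API, here is a minimal sketch of
what those two loops look like with the real function signatures. It is
illustrative only: the function names, NUM_CHAINS, the static arrays, and
the alignment/boundary arguments (16, 0) are placeholders, not the actual
mpt3sas code.

#include <linux/dmapool.h>

#define NUM_CHAINS 373959		/* placeholder for the chain count */

static struct dma_pool *chain_dma_pool;
static void *chain_va[NUM_CHAINS];	/* static arrays only for brevity */
static dma_addr_t chain_dma[NUM_CHAINS];

/* modprobe path: create one pool, then allocate one block per chain */
static int chain_pool_init(struct device *dev)
{
	int i;

	chain_dma_pool = dma_pool_create("chain pool", dev, 128, 16, 0);
	if (!chain_dma_pool)
		return -ENOMEM;

	for (i = 0; i < NUM_CHAINS; i++) {
		chain_va[i] = dma_pool_alloc(chain_dma_pool, GFP_KERNEL,
					     &chain_dma[i]);
		if (!chain_va[i])
			return -ENOMEM;
	}
	return 0;
}

/* rmmod/shutdown path: free every block, then destroy the pool */
static void chain_pool_fini(void)
{
	int i;

	for (i = 0; i < NUM_CHAINS; i++)
		dma_pool_free(chain_dma_pool, chain_va[i], chain_dma[i]);
	dma_pool_destroy(chain_dma_pool);
}

Every dma_pool_free() call here has to locate the page that owns the
given DMA address, which is where the linear scan hurts.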

With this usage, both dma_pool_alloc() and dma_pool_free() exhibit
O(n) complexity, although dma_pool_free() is much worse due to
implementation details. On my system, the dma_pool_free() loop above
takes about 9 seconds to run. Note that the problem was even worse
before commit 74522a92bbf0 ("scsi: mpt3sas: Optimize I/O memory
consumption in driver."), where the dma_pool_free() loop could take ~30
seconds.

mpt3sas also has some other DMA pools, but chain_dma_pool is the only
one with so many allocations:

cat /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/pools
(manually cleaned up column alignment; the columns are pool name,
blocks in use, total blocks, block size, and pages)
poolinfo - 0.1
reply_post_free_array pool       1      21     192     1
reply_free pool                  1       1   41728     1
reply pool                       1       1 1335296     1
sense pool                       1       1  970272     1
chain pool                  373959  386048     128 12064
reply_post_free pool            12      12  166528    12

The patches in this series improve the scalability of the DMA pool
implementation, which significantly reduces the running time of the
DMA alloc/free loops. With the patches applied, "modprobe mpt3sas",
"rmmod mpt3sas", and system shutdown/reboot with mpt3sas loaded are
significantly faster. Here are some benchmarks (of DMA alloc/free
only, not the entire modprobe/rmmod):

dma_pool_create() + dma_pool_alloc() loop, size = 128, count = 373959
  original:           350 ms (   1x)
  dmapool patches:     18 ms (  19x)

dma_pool_free() loop + dma_pool_destroy(), size = 128, count = 373959
  original:          8901 ms (   1x)
  dmapool patches:     19 ms ( 477x)



2022-06-08 02:27:18

by Tony Battersby

Subject: [PATCH v6 03/11] dmapool: cleanup integer types

To represent the size of a single allocation, dmapool currently uses
'unsigned int' in some places and 'size_t' in other places. Standardize
on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
the blocks in the entire pool.

Signed-off-by: Tony Battersby <[email protected]>
---

Changes since v5:
moved 'size' down in struct dma_pool to be near the other 32-bit ints.
blks_per_alloc will fill out the hole in a later patch.

This puts an upper bound on 'size' of INT_MAX to avoid overflowing the
following comparison in pool_initialise_page():

unsigned int offset = 0;
unsigned int next = offset + pool->size;
if (unlikely((next + pool->size) > ...

'boundary' is passed in as a size_t but gets stored as an unsigned int.
'boundary' values >= 'allocation' have no effect, so clipping 'boundary'
to 'allocation' keeps it within the range of unsigned int without
changing behavior. (For example, with allocation = 4096, a caller
passing boundary = 8192 now has it clipped to 4096, which changes
nothing, since no block within a 4096-byte page could cross an
8192-byte boundary anyway.) A few lines above (not in the diff), a
'boundary' of 0 is already replaced with 'allocation', so storing
'allocation' as the boundary is nothing new. For reference, here is the
relevant code after being patched:

	if (!boundary)
		boundary = allocation;
	else if ((boundary < size) || (boundary & (boundary - 1)))
		return NULL;

	boundary = min(boundary, allocation);

mm/dmapool.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 1829291f5d70..f85d6bde2205 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -43,10 +43,10 @@
struct dma_pool { /* the pool */
struct list_head page_list;
spinlock_t lock;
- size_t size;
struct device *dev;
- size_t allocation;
- size_t boundary;
+ unsigned int size;
+ unsigned int allocation;
+ unsigned int boundary;
char name[32];
struct list_head pools;
};
@@ -73,7 +73,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
mutex_lock(&pools_lock);
list_for_each_entry(pool, &dev->dma_pools, pools) {
unsigned pages = 0;
- unsigned blocks = 0;
+ size_t blocks = 0;

spin_lock_irq(&pool->lock);
list_for_each_entry(page, &pool->page_list, page_list) {
@@ -83,9 +83,10 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
spin_unlock_irq(&pool->lock);

/* per-pool info, no real statistics yet */
- size += sysfs_emit_at(buf, size, "%-16s %4u %4zu %4zu %2u\n",
+ size += sysfs_emit_at(buf, size, "%-16s %4zu %4zu %4u %2u\n",
pool->name, blocks,
- pages * (pool->allocation / pool->size),
+ (size_t) pages *
+ (pool->allocation / pool->size),
pool->size, pages);
}
mutex_unlock(&pools_lock);
@@ -130,7 +131,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
else if (align & (align - 1))
return NULL;

- if (size == 0)
+ if (size == 0 || size > INT_MAX)
return NULL;
else if (size < 4)
size = 4;
@@ -143,6 +144,8 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
else if ((boundary < size) || (boundary & (boundary - 1)))
return NULL;

+ boundary = min(boundary, allocation);
+
retval = kmalloc(sizeof(*retval), GFP_KERNEL);
if (!retval)
return retval;
@@ -303,7 +306,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
{
unsigned long flags;
struct dma_page *page;
- size_t offset;
+ unsigned int offset;
void *retval;

might_alloc(mem_flags);
--
2.25.1

2022-06-08 02:43:18

by Tony Battersby

Subject: [PATCH v6 11/11] dmapool: improve scalability of dma_pool_free

dma_pool_free() scales poorly when the pool contains many pages because
pool_find_page() does a linear scan of all allocated pages. Improve its
scalability by replacing the linear scan with a red-black tree lookup.
In big O notation, this improves the algorithm from O(n) to O(log n).

Signed-off-by: Tony Battersby <[email protected]>
---

Changes since v5:
pool_free_page() no longer exists.
Less churn in dma_pool_destroy().
Updated big O usage in description.
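
As a side note for reviewers: the new pool_find_page() treats each page
as owning the DMA range [page->dma, page->dma + pool->allocation) and
applies the same three-way comparison at every tree node. Below is a
standalone userspace sketch of that comparison, using a sorted array
plus binary search in place of the red-black tree purely for
illustration; the names and addresses are made up.

#include <stdio.h>

#define ALLOCATION 4096u	/* stand-in for pool->allocation */

/* base DMA addresses of the allocated pages, kept in sorted order */
static const unsigned long bases[] = { 0x10000, 0x22000, 0x5a000, 0x9f000 };

static long find_page(unsigned long dma)
{
	long lo = 0, hi = (long)(sizeof(bases) / sizeof(bases[0])) - 1;

	while (lo <= hi) {
		long mid = lo + (hi - lo) / 2;

		if (dma < bases[mid])
			hi = mid - 1;		/* descend left */
		else if (dma - bases[mid] >= ALLOCATION)
			lo = mid + 1;		/* descend right */
		else
			return mid;		/* dma falls inside this page */
	}
	return -1;				/* not managed by the pool */
}

int main(void)
{
	printf("%ld\n", find_page(0x22080));	/* 1: inside the 0x22000 page */
	printf("%ld\n", find_page(0x30000));	/* -1: between pages */
	return 0;
}

Because pool_insert_page() refuses (with a WARN) to insert a page whose
range overlaps an existing one, every lookup either finds the unique
owning page or reports that the address did not come from the pool.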

mm/dmapool.c | 114 ++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 90 insertions(+), 24 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index fc9ae0683c20..31102a00fa7c 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -12,11 +12,12 @@
* Many older drivers still have their own code to do this.
*
* The current design of this allocator is fairly simple. The pool is
- * represented by the 'struct dma_pool' which keeps a doubly-linked list of
- * allocated pages. Each page in the page_list is split into blocks of at
- * least 'size' bytes. Free blocks are tracked in an unsorted singly-linked
- * list of free blocks within the page. Used blocks aren't tracked, but we
- * keep a count of how many are currently allocated from each page.
+ * represented by the 'struct dma_pool' which keeps a red-black tree of all
+ * allocated pages, keyed by DMA address for fast lookup when freeing.
+ * Each page in the page_tree is split into blocks of at least 'size' bytes.
+ * Free blocks are tracked in an unsorted singly-linked list of free blocks
+ * within the page. Used blocks aren't tracked, but we keep a count of how
+ * many are currently allocated from each page.
*
* The avail_page_list keeps track of pages that have one or more free blocks
* available to (re)allocate. Pages are moved in and out of avail_page_list
@@ -36,6 +37,7 @@
#include <linux/slab.h>
#include <linux/stat.h>
#include <linux/spinlock.h>
+#include <linux/rbtree.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/wait.h>
@@ -45,7 +47,7 @@
#endif

struct dma_pool { /* the pool */
- struct list_head page_list;
+ struct rb_root page_tree;
struct list_head avail_page_list;
spinlock_t lock;
struct device *dev;
@@ -58,7 +60,7 @@ struct dma_pool { /* the pool */
};

struct dma_page { /* cacheable header for 'allocation' bytes */
- struct list_head page_list;
+ struct rb_node page_node;
struct list_head avail_page_link;
void *vaddr;
dma_addr_t dma;
@@ -69,11 +71,17 @@ struct dma_page { /* cacheable header for 'allocation' bytes */
static DEFINE_MUTEX(pools_lock);
static DEFINE_MUTEX(pools_reg_lock);

+static inline struct dma_page *rb_to_dma_page(struct rb_node *node)
+{
+ return rb_entry(node, struct dma_page, page_node);
+}
+
static ssize_t pools_show(struct device *dev, struct device_attribute *attr, char *buf)
{
int size;
struct dma_page *page;
struct dma_pool *pool;
+ struct rb_node *node;

size = sysfs_emit(buf, "poolinfo - 0.1\n");

@@ -83,7 +91,10 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
size_t blocks = 0;

spin_lock_irq(&pool->lock);
- list_for_each_entry(page, &pool->page_list, page_list) {
+ for (node = rb_first(&pool->page_tree);
+ node;
+ node = rb_next(node)) {
+ page = rb_to_dma_page(node);
pages++;
blocks += page->in_use;
}
@@ -160,7 +171,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,

retval->dev = dev;

- INIT_LIST_HEAD(&retval->page_list);
+ retval->page_tree = RB_ROOT;
INIT_LIST_HEAD(&retval->avail_page_list);
spin_lock_init(&retval->lock);
retval->size = size;
@@ -204,6 +215,63 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
}
EXPORT_SYMBOL(dma_pool_create);

+/*
+ * Find the dma_page that manages the given DMA address.
+ */
+static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
+{
+ struct rb_node *node = pool->page_tree.rb_node;
+
+ while (node) {
+ struct dma_page *page = rb_to_dma_page(node);
+
+ if (dma < page->dma)
+ node = node->rb_left;
+ else if ((dma - page->dma) >= pool->allocation)
+ node = node->rb_right;
+ else
+ return page;
+ }
+ return NULL;
+}
+
+/*
+ * Insert a dma_page into the page_tree.
+ */
+static int pool_insert_page(struct dma_pool *pool, struct dma_page *new_page)
+{
+ dma_addr_t dma = new_page->dma;
+ struct rb_node **node = &(pool->page_tree.rb_node), *parent = NULL;
+
+ while (*node) {
+ struct dma_page *this_page = rb_to_dma_page(*node);
+
+ parent = *node;
+ if (dma < this_page->dma)
+ node = &((*node)->rb_left);
+ else if (likely((dma - this_page->dma) >= pool->allocation))
+ node = &((*node)->rb_right);
+ else {
+ /*
+ * A page that overlaps the new DMA range is already
+ * present in the tree. This should not happen.
+ */
+ WARN(1,
+ "%s: %s: DMA address overlap: old %pad new %pad len %u\n",
+ dev_name(pool->dev),
+ pool->name, &this_page->dma, &dma,
+ pool->allocation);
+ return -1;
+ }
+ }
+
+ /* Add new node and rebalance tree. */
+ rb_link_node(&new_page->page_node, parent, node);
+ rb_insert_color(&new_page->page_node, &pool->page_tree);
+
+ return 0;
+}
+
static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
{
unsigned int offset = 0;
@@ -274,7 +342,10 @@ void dma_pool_destroy(struct dma_pool *pool)
device_remove_file(pool->dev, &dev_attr_pools);
mutex_unlock(&pools_reg_lock);

- list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
+ rbtree_postorder_for_each_entry_safe(page,
+ tmp,
+ &pool->page_tree,
+ page_node) {
void *vaddr = page->vaddr;

if (is_page_busy(page)) {
@@ -333,7 +404,15 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,

spin_lock_irqsave(&pool->lock, flags);

- list_add(&page->page_list, &pool->page_list);
+ if (unlikely(pool_insert_page(pool, page))) {
+ /*
+ * This should not happen, so something must have gone horribly
+ * wrong. Instead of crashing, intentionally leak the memory
+ * and make for the exit.
+ */
+ spin_unlock_irqrestore(&pool->lock, flags);
+ return NULL;
+ }
list_add(&page->avail_page_link, &pool->avail_page_list);
ready:
page->in_use++;
@@ -375,19 +454,6 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
}
EXPORT_SYMBOL(dma_pool_alloc);

-static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
-{
- struct dma_page *page;
-
- list_for_each_entry(page, &pool->page_list, page_list) {
- if (dma < page->dma)
- continue;
- if ((dma - page->dma) < pool->allocation)
- return page;
- }
- return NULL;
-}
-
/**
* dma_pool_free - put block back into dma pool
* @pool: the dma pool holding the block
--
2.25.1

2022-06-08 04:21:07

by Tony Battersby

Subject: [PATCH v6 06/11] dmapool: debug: prevent endless loop in case of corruption

Prevent a possible endless loop with DMAPOOL_DEBUG enabled if a buggy
driver corrupts DMA pool memory.

A well-formed freelist for a page contains exactly blks_per_alloc -
in_use entries, so the debug walk that checks for double frees can bail
out once it has followed more links than that. This catches corruption
(for example a cycle introduced by a use-after-free or buffer overflow)
that would otherwise make the walk loop forever.

Signed-off-by: Tony Battersby <[email protected]>
---
mm/dmapool.c | 37 ++++++++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index d3e5a6151fb4..facdb3571976 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -417,16 +417,39 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
}
{
unsigned int chain = page->offset;
+ unsigned int free_blks = 0;
+
while (chain < pool->allocation) {
- if (chain != offset) {
- chain = *(int *)(page->vaddr + chain);
- continue;
+ if (unlikely(chain == offset)) {
+ spin_unlock_irqrestore(&pool->lock, flags);
+ dev_err(pool->dev,
+ "%s %s, dma %pad already free\n",
+ __func__, pool->name, &dma);
+ return;
}
- spin_unlock_irqrestore(&pool->lock, flags);
- dev_err(pool->dev, "%s %s, dma %pad already free\n",
- __func__, pool->name, &dma);
- return;
+
+ /*
+ * A buggy driver could corrupt the freelist by
+ * use-after-free, buffer overflow, etc. Besides
+ * checking for corruption, this also prevents an
+ * endless loop in case corruption causes a circular
+ * loop in the freelist.
+ */
+ if (unlikely(++free_blks + page->in_use >
+ pool->blks_per_alloc)) {
+ freelist_corrupt:
+ spin_unlock_irqrestore(&pool->lock, flags);
+ dev_err(pool->dev,
+ "%s %s, freelist corrupted\n",
+ __func__, pool->name);
+ return;
+ }
+
+ chain = *(int *)(page->vaddr + chain);
}
+ if (unlikely(free_blks + page->in_use !=
+ pool->blks_per_alloc))
+ goto freelist_corrupt;
}
memset(vaddr, POOL_POISON_FREED, pool->size);
#endif
--
2.25.1

2022-06-08 04:21:16

by Tony Battersby

Subject: [PATCH v6 04/11] dmapool: fix boundary comparison

Fix an off-by-one in the boundary comparison used when constructing the
list of free blocks. It matters when 'size' is a power of two: since
'boundary' is also a power of two, 'boundary' is then a multiple of
'size', so a block that ends exactly on a boundary never actually
crosses it, yet the old '>=' comparison skipped it. This bug caused
some of the allocated memory to be wasted (but not leaked).

Example:

size = 512
boundary = 2048
allocation = 4096

Address range
0 - 511
512 - 1023
1024 - 1535
1536 - 2047 *
2048 - 2559
2560 - 3071
3072 - 3583
3584 - 4095 *

Prior to this fix, the address ranges marked with "*" would not have
been used even though they didn't cross the given boundary.
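
To reproduce those counts, here is a standalone userspace sketch (not
the kernel code) of the free-block layout loop from
pool_initialise_page(), run with both the old ">=" and the fixed ">"
comparison:

#include <stdbool.h>
#include <stdio.h>

static unsigned int count_blocks(unsigned int size, unsigned int boundary,
				 unsigned int allocation, bool old_cmp)
{
	unsigned int offset = 0, next_boundary = boundary, blocks = 0;

	do {
		unsigned int next = offset + size;

		if (old_cmp ? (next + size) >= next_boundary
			    : (next + size) > next_boundary) {
			next = next_boundary;
			next_boundary += boundary;
		}
		blocks++;		/* a free block starts at 'offset' */
		offset = next;
	} while (offset < allocation);

	return blocks;
}

int main(void)
{
	/* values from the example above */
	printf("old >= : %u blocks\n", count_blocks(512, 2048, 4096, true));
	printf("new >  : %u blocks\n", count_blocks(512, 2048, 4096, false));
	return 0;
}

With these values it prints 6 blocks for the old comparison and 8 for
the new one, matching the table above.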

Fixes: e34f44b3517f ("pool: Improve memory usage for devices which can't cross boundaries")
Signed-off-by: Tony Battersby <[email protected]>
---
mm/dmapool.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index f85d6bde2205..122781fe2c03 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -201,7 +201,7 @@ static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)

do {
unsigned int next = offset + pool->size;
- if (unlikely((next + pool->size) >= next_boundary)) {
+ if (unlikely((next + pool->size) > next_boundary)) {
next = next_boundary;
next_boundary += pool->boundary;
}
--
2.25.1

2022-06-08 04:24:47

by Tony Battersby

Subject: [PATCH v6 05/11] dmapool: improve accuracy of debug statistics

The "total number of blocks in pool" debug statistic currently does not
take the boundary value into account, so it diverges from the "total
number of blocks in use" statistic when a boundary is in effect. Add a
calculation for the number of blocks per allocation that takes the
boundary into account, and use it to replace the inaccurate calculation.

This depends on the patch "dmapool: fix boundary comparison" for the
calculated blks_per_alloc value to be correct.

Signed-off-by: Tony Battersby <[email protected]>
---

The added blks_per_alloc value will also be used in the next patch.
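
As a quick sanity check of the formula, here is a worked example with
made-up values chosen so that the old and new numbers differ
(size = 768, boundary = 2048, allocation = 4096):

	old estimate:   allocation / size = 4096 / 768                    = 5
	blks_per_alloc: (4096 / 2048) * (2048 / 768) + (4096 % 2048) / 768
	              = 2 * 2 + 0                                         = 4

Only two 768-byte blocks fit in each 2048-byte boundary segment, so 4 is
the real number of blocks per page; the old figure overstated it by one.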

mm/dmapool.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 122781fe2c03..d3e5a6151fb4 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -47,6 +47,7 @@ struct dma_pool { /* the pool */
unsigned int size;
unsigned int allocation;
unsigned int boundary;
+ unsigned int blks_per_alloc;
char name[32];
struct list_head pools;
};
@@ -85,8 +86,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
/* per-pool info, no real statistics yet */
size += sysfs_emit_at(buf, size, "%-16s %4zu %4zu %4u %2u\n",
pool->name, blocks,
- (size_t) pages *
- (pool->allocation / pool->size),
+ (size_t) pages * pool->blks_per_alloc,
pool->size, pages);
}
mutex_unlock(&pools_lock);
@@ -159,6 +159,9 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
retval->size = size;
retval->boundary = boundary;
retval->allocation = allocation;
+ retval->blks_per_alloc =
+ (allocation / boundary) * (boundary / size) +
+ (allocation % boundary) / size;

INIT_LIST_HEAD(&retval->pools);

--
2.25.1