Changelog:
RFC -> v2:
* Added callback variant for page bulk allocator and mempool bulk allocator
per Mel Gorman.
* Used the callback version in dm-crypt driver.
* Some code cleanup and refactor to reduce duplicate code.
rfc: https://lore.kernel.org/linux-mm/[email protected]/
We have full disk encryption enabled, and profiling shows that page allocations
may incur noticeable overhead when writing. dm-crypt creates an "out" bio for
writing and fills it with the same number of pages as the "in" bio, but the
driver allocates one page at a time in a loop. For a 1M bio this means the
driver has to call the page allocator 256 times, which is not that efficient.
Since v5.13 we have had the page bulk allocator, so dm-crypt could use it to
do page allocations more efficiently.
I could just call the page bulk allocator in the dm-crypt driver before falling
back to the mempool allocator, but that seems ad-hoc, and a quick search shows
that others do a similar thing, for example, f2fs compress, block bounce,
g2fs, ufs, etc. So it seems neater to implement a general bulk allocation
API for mempool.
Currently the bulk allocator supports consuming the pages via a list or an
array, but neither is the best fit for the dm-crypt usecase. So introduce a
new bulk allocator API variant, a callback, per the suggestion from Mel Gorman:
it consumes the pages by calling a callback with a caller-supplied parameter.
So introduce the mempool page bulk allocator.
The below APIs are introduced:
- mempool_init_pages_bulk()
- mempool_create_pages_bulk()
They initialize the mempool for page bulk allocator. The pool is filled
by alloc_page() in a loop.
- mempool_alloc_pages_bulk_cb()
- mempool_alloc_pages_bulk_array()
They do bulk allocation from the mempool. The list version is not implemented
since there is no user of the list bulk allocator so far and it may be gone
soon.
They do the below conceptually:
1. Call bulk page allocator
2. If the allocation is fulfilled then return otherwise try to
allocate the remaining pages from the mempool
3. If it is fulfilled then return otherwise retry from #1 with sleepable
gfp
4. If it still fails, sleep for a while to wait for the mempool to be
refilled, then retry from #1
The populated pages will stay in the array until the callers consume or free
them, or they will be consumed by the callback immediately.
Since the mempool allocator is guaranteed to succeed in sleepable context, the
two APIs return true for success or false for failure. It is the caller's
responsibility to handle the failure case (partial allocation), just like
with the page bulk allocator.
The mempool is typically an object-agnostic allocator, but bulk allocation is
only supported for pages, so the mempool bulk allocator is for page
allocation only as well.
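For illustration, a minimal caller sketch of the callback variant (the names
my_bulk_ctx, my_consume_page, my_pool and my_free_partial are hypothetical;
dm-crypt's real callback, crypt_bulk_alloc_cb(), is added in the last patch):

/* Hypothetical per-page consumer: add each bulk-allocated page to a bio. */
struct my_bulk_ctx {
	struct bio *bio;
	unsigned int remaining;
};

static void my_consume_page(struct page *page, void *data)
{
	struct my_bulk_ctx *ctx = data;
	unsigned int len = min_t(unsigned int, ctx->remaining, PAGE_SIZE);

	bio_add_page(ctx->bio, page, len, 0);
	ctx->remaining -= len;
}

/* In the I/O path, assuming my_pool was set up with mempool_init_pages_bulk(): */
struct my_bulk_ctx ctx = { .bio = clone, .remaining = size };

if (!mempool_alloc_pages_bulk_cb(&my_pool, GFP_NOWAIT | __GFP_HIGHMEM,
				 nr_pages, my_consume_page, &ctx))
	/* Partial allocation: free the pages already added, then retry. */
	my_free_partial(clone);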
With the mempool bulk allocator the IOPS of dm-crypt with 1M I/O improves by
approximately 6%. The test was done on a machine with 80 CPUs and 128GB memory,
with an encrypted ram device (so the impact from storage hardware is minimized
and we can benchmark the dm-crypt layer more accurately).
Before the patch:
Jobs: 1 (f=1): [w(1)][100.0%][w=1301MiB/s][w=1301 IOPS][eta 00m:00s]
crypt: (groupid=0, jobs=1): err= 0: pid=48512: Wed Feb 1 18:11:30 2023
write: IOPS=1300, BW=1301MiB/s (1364MB/s)(76.2GiB/60001msec); 0 zone resets
slat (usec): min=724, max=867, avg=765.71, stdev=19.27
clat (usec): min=4, max=196297, avg=195688.86, stdev=6450.50
lat (usec): min=801, max=197064, avg=196454.90, stdev=6450.35
clat percentiles (msec):
| 1.00th=[ 197], 5.00th=[ 197], 10.00th=[ 197], 20.00th=[ 197],
| 30.00th=[ 197], 40.00th=[ 197], 50.00th=[ 197], 60.00th=[ 197],
| 70.00th=[ 197], 80.00th=[ 197], 90.00th=[ 197], 95.00th=[ 197],
| 99.00th=[ 197], 99.50th=[ 197], 99.90th=[ 197], 99.95th=[ 197],
| 99.99th=[ 197]
bw ( MiB/s): min= 800, max= 1308, per=99.69%, avg=1296.94, stdev=46.02, samples=119
iops : min= 800, max= 1308, avg=1296.94, stdev=46.02, samples=119
lat (usec) : 10=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.05%
lat (msec) : 100=0.08%, 250=99.83%
cpu : usr=3.88%, sys=96.02%, ctx=69, majf=1, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,78060,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
WRITE: bw=1301MiB/s (1364MB/s), 1301MiB/s-1301MiB/s (1364MB/s-1364MB/s), io=76.2GiB (81.9GB), run=60001-60001msec
After the patch:
Jobs: 1 (f=1): [w(1)][100.0%][w=1401MiB/s][w=1401 IOPS][eta 00m:00s]
crypt: (groupid=0, jobs=1): err= 0: pid=2171: Wed Feb 1 21:08:16 2023
write: IOPS=1401, BW=1402MiB/s (1470MB/s)(82.1GiB/60001msec); 0 zone resets
slat (usec): min=685, max=815, avg=710.77, stdev=13.24
clat (usec): min=4, max=182206, avg=181658.31, stdev=5810.58
lat (usec): min=709, max=182913, avg=182369.36, stdev=5810.67
clat percentiles (msec):
| 1.00th=[ 182], 5.00th=[ 182], 10.00th=[ 182], 20.00th=[ 182],
| 30.00th=[ 182], 40.00th=[ 182], 50.00th=[ 182], 60.00th=[ 182],
| 70.00th=[ 182], 80.00th=[ 182], 90.00th=[ 182], 95.00th=[ 182],
| 99.00th=[ 182], 99.50th=[ 182], 99.90th=[ 182], 99.95th=[ 182],
| 99.99th=[ 182]
bw ( MiB/s): min= 900, max= 1408, per=99.71%, avg=1397.60, stdev=46.04, samples=119
iops : min= 900, max= 1408, avg=1397.60, stdev=46.04, samples=119
lat (usec) : 10=0.01%, 750=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.05%
lat (msec) : 100=0.08%, 250=99.83%
cpu : usr=3.66%, sys=96.23%, ctx=76, majf=1, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,84098,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
WRITE: bw=1402MiB/s (1470MB/s), 1402MiB/s-1402MiB/s (1470MB/s-1470MB/s), io=82.1GiB (88.2GB), run=60001-60001msec
And the benchmark with 4K size I/O doesn't show a measurable regression.
Yang Shi (5):
mm: page_alloc: add API for bulk allocator with callback
mm: mempool: extract the common initialization and alloc code
mm: mempool: introduce page bulk allocator
md: dm-crypt: move crypt_free_buffer_pages ahead
md: dm-crypt: use mempool page bulk allocator
drivers/md/dm-crypt.c | 95 ++++++++++++++++++++++++++++++---------------------
include/linux/gfp.h | 21 +++++++++---
include/linux/mempool.h | 21 ++++++++++++
mm/mempolicy.c | 12 ++++---
mm/mempool.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------
mm/page_alloc.c | 21 ++++++++----
6 files changed, 323 insertions(+), 95 deletions(-)
Currently the bulk allocator supports passing the pages via a list or an
array, but neither is suitable for some usecases, for example, dm-crypt, which
doesn't need a list, while an array may be too big to fit on the stack. So
add a new bulk allocator API which passes in a callback function that deals
with the allocated pages.
The API defined in this patch will be used by the following patches.
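For illustration only (the names my_sg_fill and my_fill_sg are hypothetical,
not part of this patch), a caller that wants neither a list nor a large
on-stack array can consume each page directly from the callback, for example
to fill a scatterlist:

struct my_sg_fill {
	struct scatterlist *sgl;
	unsigned int idx;
};

static void my_fill_sg(struct page *page, void *data)
{
	struct my_sg_fill *f = data;

	sg_set_page(&f->sgl[f->idx++], page, PAGE_SIZE, 0);
}

/* At the call site; fewer than nr_pages may be returned. */
struct my_sg_fill fill = { .sgl = sgl, .idx = 0 };
unsigned long allocated;

allocated = alloc_pages_bulk_cb(GFP_KERNEL, nr_pages, my_fill_sg, &fill);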
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/gfp.h | 21 +++++++++++++++++----
mm/mempolicy.c | 12 +++++++-----
mm/page_alloc.c | 21 +++++++++++++++------
3 files changed, 39 insertions(+), 15 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 65a78773dcca..265c19b4822f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -182,7 +182,9 @@ struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
- struct page **page_array);
+ struct page **page_array,
+ void (*cb)(struct page *, void *),
+ void *data);
unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
unsigned long nr_pages,
@@ -192,13 +194,15 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
static inline unsigned long
alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
+ return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL,
+ NULL, NULL);
}
static inline unsigned long
alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
+ return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array,
+ NULL, NULL);
}
static inline unsigned long
@@ -207,7 +211,16 @@ alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct p
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
- return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
+ return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array,
+ NULL, NULL);
+}
+
+static inline unsigned long
+alloc_pages_bulk_cb(gfp_t gfp, unsigned long nr_pages,
+ void (*cb)(struct page *page, void *data), void *data)
+{
+ return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, NULL,
+ cb, data);
}
static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0919c7a719d4..00b2d5341790 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2318,12 +2318,13 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
nr_allocated = __alloc_pages_bulk(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node + 1, NULL,
- page_array);
+ page_array, NULL, NULL);
delta--;
} else {
nr_allocated = __alloc_pages_bulk(gfp,
interleave_nodes(pol), NULL,
- nr_pages_per_node, NULL, page_array);
+ nr_pages_per_node, NULL, page_array,
+ NULL, NULL);
}
page_array += nr_allocated;
@@ -2344,12 +2345,13 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
- nr_pages, NULL, page_array);
+ nr_pages, NULL, page_array,
+ NULL, NULL);
if (nr_allocated < nr_pages)
nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
nr_pages - nr_allocated, NULL,
- page_array + nr_allocated);
+ page_array + nr_allocated, NULL, NULL);
return nr_allocated;
}
@@ -2377,7 +2379,7 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol), nr_pages, NULL,
- page_array);
+ page_array, NULL, NULL);
}
int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1113483fa6c5..d23b8e49a8cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5402,22 +5402,27 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
* @nr_pages: The number of pages desired on the list or array
* @page_list: Optional list to store the allocated pages
* @page_array: Optional array to store the pages
+ * @cb: Optional callback to handle the page
+ * @data: The parameter passed to the callback
*
* This is a batched version of the page allocator that attempts to
* allocate nr_pages quickly. Pages are added to page_list if page_list
- * is not NULL, otherwise it is assumed that the page_array is valid.
+ * is not NULL, stored in page_array if it is valid, or passed to the
+ * callback if cb is valid.
*
- * For lists, nr_pages is the number of pages that should be allocated.
+ * For lists and cb, nr_pages is the number of pages that should be allocated.
*
* For arrays, only NULL elements are populated with pages and nr_pages
* is the maximum number of pages that will be stored in the array.
*
- * Returns the number of pages on the list or array.
+ * Returns the number of pages on the list or array or consumed by cb.
*/
unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
- struct page **page_array)
+ struct page **page_array,
+ void (*cb)(struct page *, void *),
+ void *data)
{
struct page *page;
unsigned long __maybe_unused UP_flags;
@@ -5532,8 +5537,10 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
prep_new_page(page, 0, gfp, 0);
if (page_list)
list_add(&page->lru, page_list);
- else
+ else if (page_array)
page_array[nr_populated] = page;
+ else
+ cb(page, data);
nr_populated++;
}
@@ -5554,8 +5561,10 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
if (page) {
if (page_list)
list_add(&page->lru, page_list);
- else
+ else if (page_array)
page_array[nr_populated] = page;
+ else
+ cb(page, data);
nr_populated++;
}
--
2.39.0
Extract the common initialization and creation code into __mempool_init() and
__mempool_create(), and extract the common alloc code into an internal
function. This will make the following patch easier and avoid duplicate code.
Signed-off-by: Yang Shi <[email protected]>
---
mm/mempool.c | 93 ++++++++++++++++++++++++++++++++--------------------
1 file changed, 57 insertions(+), 36 deletions(-)
diff --git a/mm/mempool.c b/mm/mempool.c
index 734bcf5afbb7..975c9d1491b6 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -182,9 +182,10 @@ void mempool_destroy(mempool_t *pool)
}
EXPORT_SYMBOL(mempool_destroy);
-int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data,
- gfp_t gfp_mask, int node_id)
+static inline int __mempool_init(mempool_t *pool, int min_nr,
+ mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data,
+ gfp_t gfp_mask, int node_id)
{
spin_lock_init(&pool->lock);
pool->min_nr = min_nr;
@@ -214,6 +215,14 @@ int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
return 0;
}
+
+int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data,
+ gfp_t gfp_mask, int node_id)
+{
+ return __mempool_init(pool, min_nr, alloc_fn, free_fn, pool_data,
+ gfp_mask, node_id);
+}
EXPORT_SYMBOL(mempool_init_node);
/**
@@ -233,12 +242,30 @@ EXPORT_SYMBOL(mempool_init_node);
int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data)
{
- return mempool_init_node(pool, min_nr, alloc_fn, free_fn,
- pool_data, GFP_KERNEL, NUMA_NO_NODE);
-
+ return __mempool_init(pool, min_nr, alloc_fn, free_fn,
+ pool_data, GFP_KERNEL, NUMA_NO_NODE);
}
EXPORT_SYMBOL(mempool_init);
+static mempool_t *__mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data,
+ gfp_t gfp_mask, int node_id)
+{
+ mempool_t *pool;
+
+ pool = kzalloc_node(sizeof(*pool), gfp_mask, node_id);
+ if (!pool)
+ return NULL;
+
+ if (__mempool_init(pool, min_nr, alloc_fn, free_fn, pool_data,
+ gfp_mask, node_id)) {
+ kfree(pool);
+ return NULL;
+ }
+
+ return pool;
+}
+
/**
* mempool_create - create a memory pool
* @min_nr: the minimum number of elements guaranteed to be
@@ -258,8 +285,8 @@ EXPORT_SYMBOL(mempool_init);
mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data)
{
- return mempool_create_node(min_nr, alloc_fn, free_fn, pool_data,
- GFP_KERNEL, NUMA_NO_NODE);
+ return __mempool_create(min_nr, alloc_fn, free_fn, pool_data,
+ GFP_KERNEL, NUMA_NO_NODE);
}
EXPORT_SYMBOL(mempool_create);
@@ -267,19 +294,8 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id)
{
- mempool_t *pool;
-
- pool = kzalloc_node(sizeof(*pool), gfp_mask, node_id);
- if (!pool)
- return NULL;
-
- if (mempool_init_node(pool, min_nr, alloc_fn, free_fn, pool_data,
- gfp_mask, node_id)) {
- kfree(pool);
- return NULL;
- }
-
- return pool;
+ return __mempool_create(min_nr, alloc_fn, free_fn, pool_data,
+ gfp_mask, node_id);
}
EXPORT_SYMBOL(mempool_create_node);
@@ -363,21 +379,7 @@ int mempool_resize(mempool_t *pool, int new_min_nr)
}
EXPORT_SYMBOL(mempool_resize);
-/**
- * mempool_alloc - allocate an element from a specific memory pool
- * @pool: pointer to the memory pool which was allocated via
- * mempool_create().
- * @gfp_mask: the usual allocation bitmask.
- *
- * this function only sleeps if the alloc_fn() function sleeps or
- * returns NULL. Note that due to preallocation, this function
- * *never* fails when called from process contexts. (it might
- * fail if called from an IRQ context.)
- * Note: using __GFP_ZERO is not supported.
- *
- * Return: pointer to the allocated element or %NULL on error.
- */
-void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+static void *__mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
{
void *element;
unsigned long flags;
@@ -444,6 +446,25 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
finish_wait(&pool->wait, &wait);
goto repeat_alloc;
}
+
+/**
+ * mempool_alloc - allocate an element from a specific memory pool
+ * @pool: pointer to the memory pool which was allocated via
+ * mempool_create().
+ * @gfp_mask: the usual allocation bitmask.
+ *
+ * this function only sleeps if the alloc_fn() function sleeps or
+ * returns NULL. Note that due to preallocation, this function
+ * *never* fails when called from process contexts. (it might
+ * fail if called from an IRQ context.)
+ * Note: using __GFP_ZERO is not supported.
+ *
+ * Return: pointer to the allocated element or %NULL on error.
+ */
+void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+{
+ return __mempool_alloc(pool, gfp_mask);
+}
EXPORT_SYMBOL(mempool_alloc);
/**
--
2.39.0
By moving crypt_free_buffer_pages() ahead of crypt_alloc_buffer(), we no
longer need the forward declaration.
Signed-off-by: Yang Shi <[email protected]>
---
drivers/md/dm-crypt.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 2653516bcdef..73069f200cc5 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1639,7 +1639,17 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
return 0;
}
-static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone);
+
+static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
+{
+ struct bio_vec *bv;
+ struct bvec_iter_all iter_all;
+
+ bio_for_each_segment_all(bv, clone, iter_all) {
+ BUG_ON(!bv->bv_page);
+ mempool_free(bv->bv_page, &cc->page_pool);
+ }
+}
/*
* Generate a new unfragmented bio with the given size
@@ -1707,17 +1717,6 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
return clone;
}
-static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
-{
- struct bio_vec *bv;
- struct bvec_iter_all iter_all;
-
- bio_for_each_segment_all(bv, clone, iter_all) {
- BUG_ON(!bv->bv_page);
- mempool_free(bv->bv_page, &cc->page_pool);
- }
-}
-
static void crypt_io_init(struct dm_crypt_io *io, struct crypt_config *cc,
struct bio *bio, sector_t sector)
{
--
2.39.0
Since v5.13 the page bulk allocator has been available to allocate order-0
pages in bulk. There are a few mempool allocator callers which do order-0
page allocation in a loop, for example, dm-crypt, f2fs compress, etc. A
mempool page bulk allocator seems useful, so introduce the mempool page
bulk allocator.
It introduces the below APIs:
- mempool_init_pages_bulk()
- mempool_create_pages_bulk()
They initialize the mempool for page bulk allocator. The pool is filled
by alloc_page() in a loop.
- mempool_alloc_pages_bulk_array()
- mempool_alloc_pages_bulk_cb()
They do bulk allocation from mempool.
They do the below conceptually:
1. Call bulk page allocator
2. If the allocation is fulfilled then return otherwise try to
allocate the remaining pages from the mempool
3. If it is fulfilled then return otherwise retry from #1 with sleepable
gfp
4. If it still fails, sleep for a while to wait for the mempool to be
refilled, then retry from #1
The populated pages stay in the array until the callers consume or free them,
or they are consumed by the callback immediately.
Since the mempool allocator is guaranteed to succeed in sleepable context, the
two APIs return true for success or false for failure. It is the caller's
responsibility to handle the failure case (partial allocation), just like
with the page bulk allocator.
The mempool is typically an object-agnostic allocator, but bulk allocation is
only supported for pages, so the mempool bulk allocator is for page
allocation only as well.
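For illustration, a hypothetical user of the new APIs (the names
my_alloc_pages_bulk, my_free_page, my_pool and my_free_partial are made up;
dm-crypt's real hook, crypt_alloc_pages_bulk(), is added in the last patch):

/* Pool-level bulk allocation hook; a real user adds its own accounting here. */
static unsigned int my_alloc_pages_bulk(gfp_t gfp_mask, unsigned int nr,
					void *pool_data,
					struct page **page_array,
					void (*cb)(struct page *, void *),
					void *data)
{
	if (page_array)
		return alloc_pages_bulk_array(gfp_mask, nr, page_array);
	return alloc_pages_bulk_cb(gfp_mask, nr, cb, data);
}

static void my_free_page(void *page, void *pool_data)
{
	__free_page(page);
}

/* At init time: guarantee at least 16 pages, like any other mempool. */
int ret = mempool_init_pages_bulk(&my_pool, 16, my_alloc_pages_bulk,
				  my_free_page, NULL);

/* At run time: only NULL entries of page_array are populated. */
struct page *pages[16] = { NULL };

if (!mempool_alloc_pages_bulk_array(&my_pool, GFP_NOIO, 16, pages))
	/* Partial allocation: free the populated entries, then retry or bail. */
	my_free_partial(pages, 16);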
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/mempool.h | 21 +++++
mm/mempool.c | 177 ++++++++++++++++++++++++++++++++++++----
2 files changed, 181 insertions(+), 17 deletions(-)
diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 4aae6c06c5f2..1907395b2ef5 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -13,6 +13,12 @@ struct kmem_cache;
typedef void * (mempool_alloc_t)(gfp_t gfp_mask, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);
+typedef unsigned int (mempool_alloc_pages_bulk_t)(gfp_t gfp_mask,
+ unsigned int nr, void *pool_data,
+ struct page **page_array,
+ void (*cb)(struct page *, void *),
+ void *data);
+
typedef struct mempool_s {
spinlock_t lock;
int min_nr; /* nr of elements at *elements */
@@ -22,6 +28,7 @@ typedef struct mempool_s {
void *pool_data;
mempool_alloc_t *alloc;
mempool_free_t *free;
+ mempool_alloc_pages_bulk_t *alloc_pages_bulk;
wait_queue_head_t wait;
} mempool_t;
@@ -41,18 +48,32 @@ int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
gfp_t gfp_mask, int node_id);
int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
+int mempool_init_pages_bulk(mempool_t *pool, int min_nr,
+ mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+ mempool_free_t *free_fn, void *pool_data);
extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int nid);
+extern mempool_t *mempool_create_pages_bulk(int min_nr,
+ mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+ mempool_free_t *free_fn, void *pool_data);
extern int mempool_resize(mempool_t *pool, int new_min_nr);
extern void mempool_destroy(mempool_t *pool);
extern void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
extern void mempool_free(void *element, mempool_t *pool);
+extern bool mempool_alloc_pages_bulk_array(mempool_t *pool, gfp_t gfp_mask,
+ unsigned int nr,
+ struct page **page_array);
+extern bool mempool_alloc_pages_bulk_cb(mempool_t *pool, gfp_t gfp_mask,
+ unsigned int nr,
+ void (*cb)(struct page *, void *),
+ void *data);
+
/*
* A mempool_alloc_t and mempool_free_t that get the memory from
* a slab cache that is passed in through pool_data.
diff --git a/mm/mempool.c b/mm/mempool.c
index 975c9d1491b6..dddcd847d765 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -183,6 +183,7 @@ void mempool_destroy(mempool_t *pool)
EXPORT_SYMBOL(mempool_destroy);
static inline int __mempool_init(mempool_t *pool, int min_nr,
+ mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id)
@@ -192,8 +193,11 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
pool->pool_data = pool_data;
pool->alloc = alloc_fn;
pool->free = free_fn;
+ pool->alloc_pages_bulk = alloc_pages_bulk_fn;
init_waitqueue_head(&pool->wait);
+ WARN_ON_ONCE(alloc_pages_bulk_fn && alloc_fn);
+
pool->elements = kmalloc_array_node(min_nr, sizeof(void *),
gfp_mask, node_id);
if (!pool->elements)
@@ -205,7 +209,10 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
while (pool->curr_nr < pool->min_nr) {
void *element;
- element = pool->alloc(gfp_mask, pool->pool_data);
+ if (pool->alloc_pages_bulk)
+ element = alloc_page(gfp_mask);
+ else
+ element = pool->alloc(gfp_mask, pool->pool_data);
if (unlikely(!element)) {
mempool_exit(pool);
return -ENOMEM;
@@ -220,7 +227,7 @@ int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id)
{
- return __mempool_init(pool, min_nr, alloc_fn, free_fn, pool_data,
+ return __mempool_init(pool, min_nr, NULL, alloc_fn, free_fn, pool_data,
gfp_mask, node_id);
}
EXPORT_SYMBOL(mempool_init_node);
@@ -242,14 +249,39 @@ EXPORT_SYMBOL(mempool_init_node);
int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data)
{
- return __mempool_init(pool, min_nr, alloc_fn, free_fn,
+ return __mempool_init(pool, min_nr, NULL, alloc_fn, free_fn,
pool_data, GFP_KERNEL, NUMA_NO_NODE);
}
EXPORT_SYMBOL(mempool_init);
-static mempool_t *__mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data,
- gfp_t gfp_mask, int node_id)
+/**
+ * mempool_init_pages_bulk - initialize a pages pool for bulk allocator
+ * @pool: pointer to the memory pool that should be initialized
+ * @min_nr: the minimum number of elements guaranteed to be
+ * allocated for this pool.
+ * @alloc_pages_bulk_fn: user-defined pages bulk allocation function.
+ * @free_fn: user-defined element-freeing function.
+ * @pool_data: optional private data available to the user-defined functions.
+ *
+ * Like mempool_create(), but initializes the pool in (i.e. embedded in another
+ * structure).
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int mempool_init_pages_bulk(mempool_t *pool, int min_nr,
+ mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+ mempool_free_t *free_fn, void *pool_data)
+{
+ return __mempool_init(pool, min_nr, alloc_pages_bulk_fn, NULL,
+ free_fn, pool_data, GFP_KERNEL, NUMA_NO_NODE);
+}
+EXPORT_SYMBOL(mempool_init_pages_bulk);
+
+static mempool_t *__mempool_create(int min_nr,
+ mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+ mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data,
+ gfp_t gfp_mask, int node_id)
{
mempool_t *pool;
@@ -257,8 +289,8 @@ static mempool_t *__mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
if (!pool)
return NULL;
- if (__mempool_init(pool, min_nr, alloc_fn, free_fn, pool_data,
- gfp_mask, node_id)) {
+ if (__mempool_init(pool, min_nr, alloc_pages_bulk_fn, alloc_fn,
+ free_fn, pool_data, gfp_mask, node_id)) {
kfree(pool);
return NULL;
}
@@ -285,7 +317,7 @@ static mempool_t *__mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data)
{
- return __mempool_create(min_nr, alloc_fn, free_fn, pool_data,
+ return __mempool_create(min_nr, NULL, alloc_fn, free_fn, pool_data,
GFP_KERNEL, NUMA_NO_NODE);
}
EXPORT_SYMBOL(mempool_create);
@@ -294,11 +326,21 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id)
{
- return __mempool_create(min_nr, alloc_fn, free_fn, pool_data,
+ return __mempool_create(min_nr, NULL, alloc_fn, free_fn, pool_data,
gfp_mask, node_id);
}
EXPORT_SYMBOL(mempool_create_node);
+mempool_t *mempool_create_pages_bulk(int min_nr,
+ mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+ mempool_free_t *free_fn, void *pool_data)
+{
+ return __mempool_create(min_nr, alloc_pages_bulk_fn, NULL,
+ free_fn, pool_data, GFP_KERNEL,
+ NUMA_NO_NODE);
+}
+EXPORT_SYMBOL(mempool_create_pages_bulk);
+
/**
* mempool_resize - resize an existing memory pool
* @pool: pointer to the memory pool which was allocated via
@@ -379,12 +421,23 @@ int mempool_resize(mempool_t *pool, int new_min_nr)
}
EXPORT_SYMBOL(mempool_resize);
-static void *__mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+#define MEMPOOL_BULK_SUCCESS_PTR ((void *)16)
+
+static void *__mempool_alloc(mempool_t *pool, gfp_t gfp_mask, unsigned int nr,
+ struct page **page_array,
+ void (*cb)(struct page *, void *),
+ void *data)
{
void *element;
unsigned long flags;
wait_queue_entry_t wait;
gfp_t gfp_temp;
+ int i;
+ unsigned int ret, nr_remaining;
+ struct page *page;
+ bool bulk_page_alloc = true;
+
+ ret = nr_remaining = 0;
VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
might_alloc(gfp_mask);
@@ -395,14 +448,27 @@ static void *__mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
+ if ((nr == 1) && (!page_array && !cb && !data))
+ bulk_page_alloc = false;
+
repeat_alloc:
+ i = 0;
+
+ if (bulk_page_alloc) {
+ ret = pool->alloc_pages_bulk(gfp_temp, nr, pool->pool_data,
+ page_array, cb, data);
+ if (ret == nr)
+ return MEMPOOL_BULK_SUCCESS_PTR;
+ } else {
+ element = pool->alloc(gfp_temp, pool->pool_data);
+ if (likely(element != NULL))
+ return element;
+ }
- element = pool->alloc(gfp_temp, pool->pool_data);
- if (likely(element != NULL))
- return element;
+ nr_remaining = nr - ret;
spin_lock_irqsave(&pool->lock, flags);
- if (likely(pool->curr_nr)) {
+ while (pool->curr_nr && (nr_remaining > 0)) {
element = remove_element(pool);
spin_unlock_irqrestore(&pool->lock, flags);
/* paired with rmb in mempool_free(), read comment there */
@@ -412,9 +478,34 @@ static void *__mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
* for debugging.
*/
kmemleak_update_trace(element);
- return element;
+
+ if (!bulk_page_alloc)
+ return element;
+
+ page = (struct page *)element;
+ if (page_array)
+ page_array[ret + i] = page;
+ else
+ cb(page, data);
+
+ i++;
+ nr_remaining--;
+
+ spin_lock_irqsave(&pool->lock, flags);
+ }
+
+ if (bulk_page_alloc && !nr_remaining) {
+ spin_unlock_irqrestore(&pool->lock, flags);
+ return MEMPOOL_BULK_SUCCESS_PTR;
}
+ /*
+ * The bulk allocator counts the already-populated pages for the array
+ * variant, but not for the callback variant.
+ */
+ if (bulk_page_alloc && !page_array)
+ nr = nr_remaining;
+
/*
* We use gfp mask w/o direct reclaim or IO for the first round. If
* alloc failed with that and @pool was empty, retry immediately.
@@ -463,10 +554,62 @@ static void *__mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
*/
void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
{
- return __mempool_alloc(pool, gfp_mask);
+ return __mempool_alloc(pool, gfp_mask, 1, NULL, NULL, NULL);
}
EXPORT_SYMBOL(mempool_alloc);
+/**
+ * mempool_alloc_pages_bulk - allocate a bulk of pages from a specific
+ * memory pool
+ * @pool: pointer to the memory pool which was allocated via
+ * mempool_create().
+ * @gfp_mask: the usual allocation bitmask.
+ * @nr: the number of requested pages.
+ * @page_array: the array the pages will be added to.
+ * @cb: the callback function that will handle the page.
+ * @data: the parameter used by the callback
+ *
+ * this function only sleeps if the alloc_pages_bulk_fn() function sleeps
+ * or the allocation cannot be satisfied even after the mempool is depleted.
+ * Note that due to preallocation, this function *never* fails when called
+ * from process contexts. (it might fail if called from an IRQ context.)
+ * Note: using __GFP_ZERO is not supported. And the caller should not pass
+ * in both valid page_array and callback.
+ *
+ * Return: true when nr pages are allocated or false if not. It is the
+ * caller's responsibility to free the partially allocated pages.
+ */
+static bool mempool_alloc_pages_bulk(mempool_t *pool, gfp_t gfp_mask,
+ unsigned int nr,
+ struct page **page_array,
+ void (*cb)(struct page *, void *),
+ void *data)
+{
+ if (!__mempool_alloc(pool, gfp_mask, nr, page_array, cb, data))
+ return false;
+
+ return true;
+}
+
+bool mempool_alloc_pages_bulk_array(mempool_t *pool, gfp_t gfp_mask,
+ unsigned int nr,
+ struct page **page_array)
+{
+ return mempool_alloc_pages_bulk(pool, gfp_mask, nr, page_array,
+ NULL, NULL);
+}
+EXPORT_SYMBOL(mempool_alloc_pages_bulk_array);
+
+bool mempool_alloc_pages_bulk_cb(mempool_t *pool, gfp_t gfp_mask,
+ unsigned int nr,
+ void (*cb)(struct page *, void *),
+ void *data)
+{
+ return mempool_alloc_pages_bulk(pool, gfp_mask, nr, NULL,
+ cb, data);
+}
+EXPORT_SYMBOL(mempool_alloc_pages_bulk_cb);
+
/**
* mempool_free - return an element to the pool.
* @element: pool element pointer.
--
2.39.0
When using dm-crypt for full disk encryption, dm-crypt allocates an "out" bio
and fills it with the same number of pages as the "in" bio for encryption. It
currently allocates one page at a time in a loop, which is not efficient. So
use the mempool page bulk allocator instead of allocating one page at a time.
The mempool page bulk allocator improves the IOPS with 1M I/O by approximately
6%. The test was done on a machine with 80 vCPUs and 128GB memory, with an
encrypted ram device (so the impact from storage hardware is minimized and we
can benchmark the dm-crypt layer more accurately).
Before the patch:
Jobs: 1 (f=1): [w(1)][100.0%][w=1301MiB/s][w=1301 IOPS][eta 00m:00s]
crypt: (groupid=0, jobs=1): err= 0: pid=48512: Wed Feb 1 18:11:30 2023
write: IOPS=1300, BW=1301MiB/s (1364MB/s)(76.2GiB/60001msec); 0 zone resets
slat (usec): min=724, max=867, avg=765.71, stdev=19.27
clat (usec): min=4, max=196297, avg=195688.86, stdev=6450.50
lat (usec): min=801, max=197064, avg=196454.90, stdev=6450.35
clat percentiles (msec):
| 1.00th=[ 197], 5.00th=[ 197], 10.00th=[ 197], 20.00th=[ 197],
| 30.00th=[ 197], 40.00th=[ 197], 50.00th=[ 197], 60.00th=[ 197],
| 70.00th=[ 197], 80.00th=[ 197], 90.00th=[ 197], 95.00th=[ 197],
| 99.00th=[ 197], 99.50th=[ 197], 99.90th=[ 197], 99.95th=[ 197],
| 99.99th=[ 197]
bw ( MiB/s): min= 800, max= 1308, per=99.69%, avg=1296.94, stdev=46.02, samples=119
iops : min= 800, max= 1308, avg=1296.94, stdev=46.02, samples=119
lat (usec) : 10=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.05%
lat (msec) : 100=0.08%, 250=99.83%
cpu : usr=3.88%, sys=96.02%, ctx=69, majf=1, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,78060,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
WRITE: bw=1301MiB/s (1364MB/s), 1301MiB/s-1301MiB/s (1364MB/s-1364MB/s), io=76.2GiB (81.9GB), run=60001-60001msec
After the patch:
Jobs: 1 (f=1): [w(1)][100.0%][w=1401MiB/s][w=1401 IOPS][eta 00m:00s]
crypt: (groupid=0, jobs=1): err= 0: pid=2171: Wed Feb 1 21:08:16 2023
write: IOPS=1401, BW=1402MiB/s (1470MB/s)(82.1GiB/60001msec); 0 zone resets
slat (usec): min=685, max=815, avg=710.77, stdev=13.24
clat (usec): min=4, max=182206, avg=181658.31, stdev=5810.58
lat (usec): min=709, max=182913, avg=182369.36, stdev=5810.67
clat percentiles (msec):
| 1.00th=[ 182], 5.00th=[ 182], 10.00th=[ 182], 20.00th=[ 182],
| 30.00th=[ 182], 40.00th=[ 182], 50.00th=[ 182], 60.00th=[ 182],
| 70.00th=[ 182], 80.00th=[ 182], 90.00th=[ 182], 95.00th=[ 182],
| 99.00th=[ 182], 99.50th=[ 182], 99.90th=[ 182], 99.95th=[ 182],
| 99.99th=[ 182]
bw ( MiB/s): min= 900, max= 1408, per=99.71%, avg=1397.60, stdev=46.04, samples=119
iops : min= 900, max= 1408, avg=1397.60, stdev=46.04, samples=119
lat (usec) : 10=0.01%, 750=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.05%
lat (msec) : 100=0.08%, 250=99.83%
cpu : usr=3.66%, sys=96.23%, ctx=76, majf=1, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,84098,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
WRITE: bw=1402MiB/s (1470MB/s), 1402MiB/s-1402MiB/s (1470MB/s-1470MB/s), io=82.1GiB (88.2GB), run=60001-60001msec
The function tracing also shows that the time consumed by page allocations is
reduced significantly. The test allocated a 1M (256 pages) bio in the same
environment.
Before the patch:
It took approximately 600us, excluding the bio_add_page() calls.
2720.630754 | 56) xfs_io-38859 | 2.571 us | mempool_alloc();
2720.630757 | 56) xfs_io-38859 | 0.937 us | bio_add_page();
2720.630758 | 56) xfs_io-38859 | 1.772 us | mempool_alloc();
2720.630760 | 56) xfs_io-38859 | 0.852 us | bio_add_page();
….
2720.631559 | 56) xfs_io-38859 | 2.058 us | mempool_alloc();
2720.631561 | 56) xfs_io-38859 | 0.717 us | bio_add_page();
2720.631562 | 56) xfs_io-38859 | 2.014 us | mempool_alloc();
2720.631564 | 56) xfs_io-38859 | 0.620 us | bio_add_page();
After the patch:
It took approximately 30us.
11564.266385 | 22) xfs_io-136183 | + 30.551 us | __alloc_pages_bulk();
The page allocation overhead is around 6% (600us/9853us) in the dm-crypt layer
as shown by the function trace, which also matches the IOPS data reported by
fio.
And the benchmark with 4K size I/O doesn't show a measurable regression.
Signed-off-by: Yang Shi <[email protected]>
---
drivers/md/dm-crypt.c | 72 +++++++++++++++++++++++++++----------------
1 file changed, 46 insertions(+), 26 deletions(-)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 73069f200cc5..30268ba07fd6 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1651,6 +1651,21 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
}
}
+struct crypt_bulk_cb_data {
+ struct bio *bio;
+ unsigned int size;
+};
+
+static void crypt_bulk_alloc_cb(struct page *page, void *data)
+{
+ unsigned int len;
+ struct crypt_bulk_cb_data *b_data = (struct crypt_bulk_cb_data *)data;
+
+ len = (b_data->size > PAGE_SIZE) ? PAGE_SIZE : b_data->size;
+ bio_add_page(b_data->bio, page, len, 0);
+ b_data->size -= len;
+}
+
/*
* Generate a new unfragmented bio with the given size
* This should never violate the device limitations (but only because
@@ -1674,8 +1689,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
struct bio *clone;
unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
- unsigned i, len, remaining_size;
- struct page *page;
+ struct crypt_bulk_cb_data data;
retry:
if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -1686,22 +1700,17 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
clone->bi_private = io;
clone->bi_end_io = crypt_endio;
- remaining_size = size;
-
- for (i = 0; i < nr_iovecs; i++) {
- page = mempool_alloc(&cc->page_pool, gfp_mask);
- if (!page) {
- crypt_free_buffer_pages(cc, clone);
- bio_put(clone);
- gfp_mask |= __GFP_DIRECT_RECLAIM;
- goto retry;
- }
-
- len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
-
- bio_add_page(clone, page, len, 0);
+ data.bio = clone;
+ data.size = size;
- remaining_size -= len;
+ if (!mempool_alloc_pages_bulk_cb(&cc->page_pool, gfp_mask, nr_iovecs,
+ crypt_bulk_alloc_cb, &data)) {
+ crypt_free_buffer_pages(cc, clone);
+ bio_put(clone);
+ data.bio = NULL;
+ data.size = 0;
+ gfp_mask |= __GFP_DIRECT_RECLAIM;
+ goto retry;
}
/* Allocate space for integrity tags */
@@ -2655,10 +2664,14 @@ static void crypt_calculate_pages_per_client(void)
dm_crypt_pages_per_client = pages;
}
-static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
+static unsigned int crypt_alloc_pages_bulk(gfp_t gfp_mask, unsigned int nr,
+ void *pool_data,
+ struct page **page_array,
+ void (*cb)(struct page *, void *),
+ void *data)
{
struct crypt_config *cc = pool_data;
- struct page *page;
+ unsigned int ret;
/*
* Note, percpu_counter_read_positive() may over (and under) estimate
@@ -2667,13 +2680,13 @@ static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
*/
if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >= dm_crypt_pages_per_client) &&
likely(gfp_mask & __GFP_NORETRY))
- return NULL;
+ return 0;
- page = alloc_page(gfp_mask);
- if (likely(page != NULL))
- percpu_counter_add(&cc->n_allocated_pages, 1);
+ ret = alloc_pages_bulk_cb(gfp_mask, nr, cb, data);
- return page;
+ percpu_counter_add(&cc->n_allocated_pages, ret);
+
+ return ret;
}
static void crypt_page_free(void *page, void *pool_data)
@@ -2705,11 +2718,16 @@ static void crypt_dtr(struct dm_target *ti)
bioset_exit(&cc->bs);
+ /*
+ * With mempool bulk allocator the pages in the pool are not
+ * counted in n_allocated_pages.
+ */
+ WARN_ON(percpu_counter_sum(&cc->n_allocated_pages) != 0);
+
mempool_exit(&cc->page_pool);
mempool_exit(&cc->req_pool);
mempool_exit(&cc->tag_pool);
- WARN_ON(percpu_counter_sum(&cc->n_allocated_pages) != 0);
percpu_counter_destroy(&cc->n_allocated_pages);
if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
@@ -3251,7 +3269,9 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
ALIGN(sizeof(struct dm_crypt_io) + cc->dmreq_start + additional_req_size,
ARCH_KMALLOC_MINALIGN);
- ret = mempool_init(&cc->page_pool, BIO_MAX_VECS, crypt_page_alloc, crypt_page_free, cc);
+ ret = mempool_init_pages_bulk(&cc->page_pool, BIO_MAX_VECS,
+ crypt_alloc_pages_bulk, crypt_page_free,
+ cc);
if (ret) {
ti->error = "Cannot allocate page mempool";
goto bad;
--
2.39.0
Hi Yang,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[also build test WARNING on v6.2-rc8 next-20230214]
[cannot apply to device-mapper-dm/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Yang-Shi/mm-page_alloc-add-API-for-bulk-allocator-with-callback/20230215-030305
patch link: https://lore.kernel.org/r/20230214190221.1156876-4-shy828301%40gmail.com
patch subject: [v2 PATCH 3/5] mm: mempool: introduce page bulk allocator
config: loongarch-randconfig-r006-20230212 (https://download.01.org/0day-ci/archive/20230215/[email protected]/config)
compiler: loongarch64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/ecf5ea78b27092c35d884fad653f53d599d9ddba
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Yang-Shi/mm-page_alloc-add-API-for-bulk-allocator-with-callback/20230215-030305
git checkout ecf5ea78b27092c35d884fad653f53d599d9ddba
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch SHELL=/bin/bash drivers/scsi/snic/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
In file included from drivers/scsi/snic/snic_scsi.c:4:
>> include/linux/mempool.h:18:48: warning: 'struct page' declared inside parameter list will not be visible outside of this definition or declaration
18 | struct page **page_array,
| ^~~~
include/linux/mempool.h:71:51: warning: 'struct page' declared inside parameter list will not be visible outside of this definition or declaration
71 | struct page **page_array);
| ^~~~
include/linux/mempool.h:74:59: warning: 'struct page' declared inside parameter list will not be visible outside of this definition or declaration
74 | void (*cb)(struct page *, void *),
| ^~~~
vim +18 include/linux/mempool.h
15
16 typedef unsigned int (mempool_alloc_pages_bulk_t)(gfp_t gfp_mask,
17 unsigned int nr, void *pool_data,
> 18 struct page **page_array,
19 void (*cb)(struct page *, void *),
20 void *data);
21
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
On Tue, 14 Feb 2023, Yang Shi wrote:
>
> Changelog:
> RFC -> v2:
> * Added callback variant for page bulk allocator and mempool bulk allocator
> per Mel Gorman.
> * Used the callback version in dm-crypt driver.
> * Some code cleanup and refactor to reduce duplicate code.
>
> rfc: https://lore.kernel.org/linux-mm/[email protected]/
Hi
This seems like unneeded complication to me. We have alloc_pages(), it can
allocate multiple pages efficiently, so why not use it?
I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if
alloc_pages() fails (either because the system is low on memory or because
memory is too fragmented), fall back to the existing code that does
mempool_alloc().
Mikulas
On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka <[email protected]> wrote:
>
>
>
> On Tue, 14 Feb 2023, Yang Shi wrote:
>
> >
> > Changelog:
> > RFC -> v2:
> > * Added callback variant for page bulk allocator and mempool bulk allocator
> > per Mel Gorman.
> > * Used the callback version in dm-crypt driver.
> > * Some code cleanup and refactor to reduce duplicate code.
> >
> > rfc: https://lore.kernel.org/linux-mm/[email protected]/
>
> Hi
>
> This seems like unneeded complication to me. We have alloc_pages(), it can
> allocate multiple pages efficiently, so why not use it?
The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't
need contiguous pages at all. This may incur unnecessary compaction
overhead to the dm-crypt layer when memory is fragmented. The bulk
allocator is a good fit to this usecase, which allocates multiple
order-0 pages.
In addition, filesystem writeback doesn't guarantee power-of-2 pages
every time IIUC. But alloc_pages() just can allocate power-of-2 pages.
>
> I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if
> alloc_pages() fails (either because the system is low on memory or because
> memory is too fragmented), fall back to the existing code that does
> mempool_alloc().
My PoC patches just did this way, but called bulk allocator. There may
be other potential mempool users as I listed in this cover letter,
which may get benefits from bulk allocator. So introducing a new bulk
mempool API seems better for long run although we just have one user
for now. And it makes other uses easier to gain the benefit by just
calling the new API.
>
> Mikulas
>
On Wed, 15 Feb 2023, Yang Shi wrote:
> On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka <[email protected]> wrote:
> >
> >
> >
> > On Tue, 14 Feb 2023, Yang Shi wrote:
> >
> > >
> > > Changelog:
> > > RFC -> v2:
> > > * Added callback variant for page bulk allocator and mempool bulk allocator
> > > per Mel Gorman.
> > > * Used the callback version in dm-crypt driver.
> > > * Some code cleanup and refactor to reduce duplicate code.
> > >
> > > rfc: https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Hi
> >
> > This seems like unneeded complication to me. We have alloc_pages(), it can
> > allocate multiple pages efficiently, so why not use it?
>
> The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't
> need contiguous pages at all. This may incur unnecessary compaction
It doesn't hurt that the pages are contiguous - and allocating and freeing
a few compound pages is even faster than allocating and freeing many
0-order pages.
> overhead to the dm-crypt layer when memory is fragmented.
The compaction overhead may be suppressed by the GFP flags (i.e. don't use
__GFP_DIRECT_RECLAIM).
> The bulk allocator is a good fit to this usecase, which allocates
> multiple order-0 pages.
>
> In addition, filesystem writeback doesn't guarantee power-of-2 pages
> every time IIUC. But alloc_pages() just can allocate power-of-2 pages.
So, we can allocate more compound pages for the non-power-of-2 case - see
the next patch that I'm sending.
> >
> > I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if
> > alloc_pages() fails (either because the system is low on memory or because
> > memory is too fragmented), fall back to the existing code that does
> > mempool_alloc().
>
> My PoC patches just did this way, but called bulk allocator. There may
> be other potential mempool users as I listed in this cover letter,
> which may get benefits from bulk allocator. So introducing a new bulk
> mempool API seems better for long run although we just have one user
> for now. And it makes other uses easier to gain the benefit by just
> calling the new API.
This mempool bulk refactoring just makes the code bigger. And it is not
needed - dm-crypt can fall-back to non-bulk mempool allocations.
In the next email, I'm sending a patch that is noticeably smaller and that
uses alloc_pages()/__free_pages().
Mikulas
On Thu, Feb 16, 2023 at 9:45 AM Mikulas Patocka <[email protected]> wrote:
>
>
>
> On Wed, 15 Feb 2023, Yang Shi wrote:
>
> > On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka <[email protected]> wrote:
> > >
> > >
> > >
> > > On Tue, 14 Feb 2023, Yang Shi wrote:
> > >
> > > >
> > > > Changelog:
> > > > RFC -> v2:
> > > > * Added callback variant for page bulk allocator and mempool bulk allocator
> > > > per Mel Gorman.
> > > > * Used the callback version in dm-crypt driver.
> > > > * Some code cleanup and refactor to reduce duplicate code.
> > > >
> > > > rfc: https://lore.kernel.org/linux-mm/[email protected]/
> > >
> > > Hi
> > >
> > > This seems like unneeded complication to me. We have alloc_pages(), it can
> > > allocate multiple pages efficiently, so why not use it?
> >
> > The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't
> > need contiguous pages at all. This may incur unnecessary compaction
>
> It doesn't hurt that the pages are contiguous - and allocating and freeing
> a few compound pages is even faster than allocating and freeing many
> 0-order pages.
If "allocating many order-0 pages" means calling alloc_page() multiple
times, just like what the dm-crypt code does before this patchset, then
yes, allocating a compound page may be faster.
But that may not be true with the bulk allocator. It also depends on how
bad the fragmentation is and how contended the zone lock is. When
allocating order-0 pages, the bulk allocator can take the pages from the
pcp list within one call. And the pcp list can actually hold a lot of
pages; on my test machine each pcp list can have more than 1000
pages.
>
> > overhead to the dm-crypt layer when memory is fragmented.
>
> The compaction overhead may be suppressed by the GFP flags (i.e. don't use
> __GFP_DIRECT_RECLAIM).
You could, but then the mempool may be pressured much more often when
light memory pressure and fragmentation exist. The bulk allocator may
just succeed with light reclamation without allocating from the mempool.
>
> > The bulk allocator is a good fit to this usecase, which allocates
> > multiple order-0 pages.
> >
> > In addition, filesystem writeback doesn't guarantee power-of-2 pages
> > every time IIUC. But alloc_pages() just can allocate power-of-2 pages.
>
> So, we can allocate more compound pages for the non-power-of-2 case - see
> the next patch that I'm sending.
Thanks for the patch. If the callers are willing to handle the
complexity (calculating the proper orders, dealing with the compound
pages, etc), it is definitely an option for them.
>
> > >
> > > I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if
> > > alloc_pages() fails (either because the system is low on memory or because
> > > memory is too fragmented), fall back to the existing code that does
> > > mempool_alloc().
> >
> > My PoC patches just did this way, but called bulk allocator. There may
> > be other potential mempool users as I listed in this cover letter,
> > which may get benefits from bulk allocator. So introducing a new bulk
> > mempool API seems better for long run although we just have one user
> > for now. And it makes other uses easier to gain the benefit by just
> > calling the new API.
>
> This mempool bulk refactoring just makes the code bigger. And it is not
> needed - dm-crypt can fall-back to non-bulk mempool allocations.
Do you mean the mempool code? Some growth is inevitable when adding a new
API, but it is not significantly bigger. And the API hides all the
details and complexity from the callers, as well as handling all the
allocation corner cases in the mm layer. It would make the users' lives
much easier. Of course, if the callers are happy to handle all the
complexity by themselves, they don't have to call the API.
>
> In the next email, I'm sending a patch that is noticeably smaller and that
> uses alloc_pages()/__free_pages().
Thanks for the patch. But if other potential users want to do the same
optimization, the code has to be duplicated everywhere. Maybe not
everyone is happy to handle this by themselves.
>
> Mikulas
>