2023-11-29 10:34:57

by Vlastimil Babka

Subject: [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes

Also in git [1]. Changes since v2 [2]:

- empty cache refill/full cache flush using internal bulk operations
- bulk alloc/free operations also use the cache
- memcg, KASAN etc hooks processed when the cache is used for the
operation - now fully transparent
- NUMA node-specific allocations now explicitly bypass the cache

[1] https://git.kernel.org/vbabka/l/slub-percpu-caches-v3r2
[2] https://lore.kernel.org/all/[email protected]/

----

At LSF/MM I've mentioned that I see several use cases for introducing
opt-in percpu arrays for caching alloc/free objects in SLUB. This is my
first exploration of this idea, specifically for the use case of maple
tree nodes. The assumptions are:

- percpu arrays will be faster than bulk alloc/free which needs
relatively long freelists to work well. Especially in the freeing case
we need the nodes to come from the same slab (or small set of those)

- preallocation for the worst case of needed nodes for a tree operation
that can't reclaim due to locks is wasteful. We could instead expect
that most of the time percpu arrays would satisfy the constrained
allocations, and in the rare cases it does not we can dip into
GFP_ATOMIC reserves temporarily. So instead of preallocation just
prefill the arrays (see the sketch below).

- NUMA locality of the nodes is not a concern as the nodes of a
process's VMA tree end up all over the place anyway.
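
To make the intended flow concrete, here is a rough sketch of how a cache
would opt in and how an operation would prefill instead of preallocating.
kmem_cache_prefill_percpu_array() is the call used in patch 9; the name of
the opt-in/setup call and the mas_prefill_nodes() helper are illustrative
assumptions, not the patch's actual API:

#include <linux/slab.h>
#include <linux/maple_tree.h>

/* Sketch only, not the actual patch: opt in at cache creation, prefill
 * before an operation that cannot reclaim under its locks. */
void __init maple_tree_init(void)
{
	maple_node_cache = kmem_cache_create("maple_node",
			sizeof(struct maple_node), sizeof(struct maple_node),
			SLAB_PANIC, NULL);
	/* opt in to a per-cpu array caching up to 32 nodes (assumed name) */
	kmem_cache_setup_percpu_array(maple_node_cache, 32);
}

/* hypothetical helper: make sure 'request' nodes sit in the percpu array */
int mas_prefill_nodes(int request, gfp_t gfp)
{
	return kmem_cache_prefill_percpu_array(maple_node_cache, request, gfp);
}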

Patches 1-4 are preparatory, but should also work as standalone fixes
and cleanups, so I would like to add them for 6.8 after review, probably
rebased on top of the current series in slab/for-next (mainly the SLAB
removal), as that should be easier to follow than the necessary conflict
resolutions.

Patch 5 adds the per-cpu array caches support. Locking is stolen from
Mel's recent page allocator's pcplists implementation so it can avoid
disabling IRQs and just disable preemption, but the trylocks can fail in
rare situations - in most cases the locks are uncontended so the locking
should be cheap.
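
As a rough sketch of what the alloc-side fast path looks like under this
scheme (the struct layout, the cpu_array member and the helper name here
are illustrative assumptions, not the patch's actual code):

#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Illustrative only: a per-cpu array guarded by a lock that is only ever
 * trylocked, mirroring the pcplists scheme (preemption disabled, IRQs on). */
struct slub_percpu_array {
	spinlock_t	lock;		/* only trylocked, never spun on */
	unsigned int	count;		/* objects currently cached */
	unsigned int	size;		/* capacity of objects[] */
	void		*objects[];
};

static void *pca_alloc(struct kmem_cache *s)
{
	struct slub_percpu_array *pca;
	void *object = NULL;

	/* get_cpu_ptr() disables preemption; IRQs stay enabled */
	pca = get_cpu_ptr(s->cpu_array);	/* cpu_array is hypothetical */

	/* rare failure, e.g. an IRQ on this CPU holds the lock; the caller
	 * then falls back to the regular allocation slowpath */
	if (!spin_trylock(&pca->lock))
		goto out;

	if (pca->count)
		object = pca->objects[--pca->count];

	spin_unlock(&pca->lock);
out:
	put_cpu_ptr(s->cpu_array);
	return object;
}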

Then the maple tree is modified in patches 6-9 to benefit from this. Of
those, only Liam's patches make sense; the rest are my crude hacks.
Liam is already working on a better solution for the maple tree side.
I'm including this only so the bots have something for testing that uses
the new code. The stats below thus likely don't reflect the full
benefits that can be achieved from cache prefill vs preallocation.

I've briefly tested this with a virtme VM boot, checking the stats from
CONFIG_SLUB_STATS in sysfs.

Patch 5:

slub per-cpu array caches are implemented, including new counters, but the
maple tree doesn't use them yet

/sys/kernel/slab/maple_node # grep . alloc_cpu_cache alloc_*path free_cpu_cache free_*path cpu_cache* | cut -d' ' -f1
alloc_cpu_cache:0
alloc_fastpath:20213
alloc_slowpath:1741
free_cpu_cache:0
free_fastpath:10754
free_slowpath:9232
cpu_cache_flush:0
cpu_cache_refill:0

Patch 7:

the maple node cache now creates a percpu array with 32 entries, nothing
else is changed

The majority of alloc/free operations are satisfied by the array; the
number of flushed/refilled objects is about 1/3 of the cached operations,
so the hit ratio is roughly 2/3. Note that the flush/refill operations also
increase the fastpath/slowpath counters, so the majority of those indeed
come from the flushes and refills.

alloc_cpu_cache:11880
alloc_fastpath:4131
alloc_slowpath:587
free_cpu_cache:13075
free_fastpath:437
free_slowpath:2216
cpu_cache_flush:4336
cpu_cache_refill:3216
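
As a quick check of the 1/3 figure: 4336 + 3216 = 7552 objects were
flushed/refilled versus 11880 + 13075 = 24955 operations served from the
array, i.e. roughly 30%, leaving a hit ratio of about 2/3.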

Patch 9:

This tries to replace maple tree's preallocation with the cache prefill.
This should reduce all of the counters, as many of the preallocations for
the worst-case scenarios are not needed in the end. But according to
Liam it's not the full solution, which probably explains why the
reduction is only modest.

alloc_cpu_cache:11540
alloc_fastpath:3756
alloc_slowpath:512
free_cpu_cache:12775
free_fastpath:388
free_slowpath:1944
cpu_cache_flush:3904
cpu_cache_refill:2742
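
Compared to patch 7, refills drop from 3216 to 2742 (roughly 15% fewer) and
flushes from 4336 to 3904 (roughly 10% fewer), so the reduction is indeed
modest.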

---
Liam R. Howlett (2):
tools: Add SLUB percpu array functions for testing
maple_tree: Remove MA_STATE_PREALLOC

Vlastimil Babka (7):
mm/slub: fix bulk alloc and free stats
mm/slub: introduce __kmem_cache_free_bulk() without free hooks
mm/slub: handle bulk and single object freeing separately
mm/slub: free KFENCE objects in slab_free_hook()
mm/slub: add opt-in percpu array cache of objects
maple_tree: use slub percpu array
maple_tree: replace preallocation with slub percpu array prefill

include/linux/slab.h | 4 +
include/linux/slub_def.h | 12 +
lib/maple_tree.c | 46 ++-
mm/Kconfig | 1 +
mm/slub.c | 561 +++++++++++++++++++++++++++++---
tools/include/linux/slab.h | 4 +
tools/testing/radix-tree/linux.c | 14 +
tools/testing/radix-tree/linux/kernel.h | 1 +
8 files changed, 578 insertions(+), 65 deletions(-)
---
base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
change-id: 20231128-slub-percpu-caches-9441892011d7

Best regards,
--
Vlastimil Babka <[email protected]>


2023-11-29 10:35:05

by Vlastimil Babka

Subject: [PATCH RFC v3 9/9] maple_tree: replace preallocation with slub percpu array prefill

With the percpu array we can try to avoid the preallocations in the maple
tree and instead make sure the percpu array is prefilled, using GFP_ATOMIC
in the places that relied on the preallocation (in case we miss the array
or fail its trylock), i.e. mas_store_prealloc(). For now, simply add
__GFP_NOFAIL there as well.
---
lib/maple_tree.c | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index f5c0bca2c5d7..d84a0c0fe83b 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5452,7 +5452,12 @@ void mas_store_prealloc(struct ma_state *mas, void *entry)
 
 	mas_wr_store_setup(&wr_mas);
 	trace_ma_write(__func__, mas, 0, entry);
+
+retry:
 	mas_wr_store_entry(&wr_mas);
+	if (unlikely(mas_nomem(mas, GFP_ATOMIC | __GFP_NOFAIL)))
+		goto retry;
+
 	MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas));
 	mas_destroy(mas);
 }
@@ -5471,8 +5476,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 	MA_WR_STATE(wr_mas, mas, entry);
 	unsigned char node_size;
 	int request = 1;
-	int ret;
-
 
 	if (unlikely(!mas->index && mas->last == ULONG_MAX))
 		goto ask_now;
@@ -5512,16 +5515,8 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 
 	/* node store, slot store needs one node */
 ask_now:
-	mas_node_count_gfp(mas, request, gfp);
-	if (likely(!mas_is_err(mas)))
-		return 0;
+	return kmem_cache_prefill_percpu_array(maple_node_cache, request, gfp);
 
-	mas_set_alloc_req(mas, 0);
-	ret = xa_err(mas->node);
-	mas_reset(mas);
-	mas_destroy(mas);
-	mas_reset(mas);
-	return ret;
 }
 EXPORT_SYMBOL_GPL(mas_preallocate);


--
2.43.0

by Christoph Lameter (Ampere)

Subject: Re: [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes

On Wed, 29 Nov 2023, Vlastimil Babka wrote:

> At LSF/MM I've mentioned that I see several use cases for introducing
> opt-in percpu arrays for caching alloc/free objects in SLUB. This is my
> first exploration of this idea, specifically for the use case of maple
> tree nodes. The assumptions are:

Hohumm... So we are not really removing SLAB but merging SLAB features
into SLUB. In addition to per cpu slabs, we now have per cpu queues.

> - percpu arrays will be faster than bulk alloc/free which needs
> relatively long freelists to work well. Especially in the freeing case
> we need the nodes to come from the same slab (or small set of those)

Percpu arrays require the code to handle individual objects. Handling
freelists in partial SLABS means that numerous objects can be handled at
once by handling the pointer to the list of objects.
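
For reference, this is what SLUB's bulk free builds internally, roughly as
in mm/slub.c: a per-slab chain of objects ("detached freelist") that is
handed back with a single freelist update rather than one operation per
object (the comments here are editorial):

struct detached_freelist {
	struct slab *slab;	/* all chained objects belong to this slab */
	void *tail;
	void *freelist;		/* head of the chain returned in one step */
	int cnt;
	struct kmem_cache *s;
};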

In order to make the SLUB in page freelists work better you need to have
larger freelists and that comes with larger page sizes. I.e. boot with
slub_min_order=5 or so to increase performance.

Also this means increasing TLB pressure. The in page freelists of SLUB
cause objects from the same page to be served. The SLAB queueing approach
results in objects being mixed from any address and thus neighboring
objects may require more TLB entries.

> - preallocation for the worst case of needed nodes for a tree operation
> that can't reclaim due to locks is wasteful. We could instead expect
> that most of the time percpu arrays would satisfy the constrained
> allocations, and in the rare cases it does not we can dip into
> GFP_ATOMIC reserves temporarily. So instead of preallocation just
> prefill the arrays.

The partial percpu slabs could already do the same.

> - NUMA locality of the nodes is not a concern as the nodes of a
> process's VMA tree end up all over the place anyway.

NUMA locality is already controlled by the user through the node
specification for percpu slabs. All objects coming from the same in page
freelist of SLUB have the same NUMA locality which simplifies things.

If you would consider NUMA locality for the percpu array then you'd be
back to my beloved alien caches. We were not able to avoid that when we
tuned SLAB for maximum performance.

> Patch 5 adds the per-cpu array caches support. Locking is stolen from
> Mel's recent page allocator's pcplists implementation so it can avoid
> disabling IRQs and just disable preemption, but the trylocks can fail in
> rare situations - in most cases the locks are uncontended so the locking
> should be cheap.

Ok the locking is new but the design follows basic SLAB queue handling.

2023-11-29 21:21:07

by Matthew Wilcox

Subject: Re: [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes

On Wed, Nov 29, 2023 at 12:16:17PM -0800, Christoph Lameter (Ampere) wrote:
> Percpu arrays require the code to handle individual objects. Handling
> freelists in partial SLABS means that numerous objects can be handled at
> once by handling the pointer to the list of objects.

That works great until you hit degenerate cases like having one or two free
objects per slab. Users have hit these cases and complained about them.
Arrays are much cheaper than lists, around 10x in my testing.

> In order to make the SLUB in page freelists work better you need to have
> larger freelists and that comes with larger page sizes. I.e. boot with
> slub_min_order=5 or so to increase performance.

That comes with its own problems, of course.

> Also this means increasing TLB pressure. The in page freelists of SLUB cause
> objects from the same page to be served. The SLAB queueing approach
> results in objects being mixed from any address and thus neighboring objects
> may require more TLB entries.

Is that still a concern for modern CPUs? We're using 1GB TLB entries
these days, and there are usually thousands of TLB entries. This feels
like more of a concern for a 90s era CPU.

2023-11-30 09:14:21

by Vlastimil Babka

Subject: Re: [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes

On 11/29/23 21:16, Christoph Lameter (Ampere) wrote:
> On Wed, 29 Nov 2023, Vlastimil Babka wrote:
>
>> At LSF/MM I've mentioned that I see several use cases for introducing
>> opt-in percpu arrays for caching alloc/free objects in SLUB. This is my
>> first exploration of this idea, specifically for the use case of maple
>> tree nodes. The assumptions are:
>
> Hohumm... So we are not really removing SLAB but merging SLAB features
> into SLUB.

Hey, you've tried a similar thing back in 2010 too :)
https://lore.kernel.org/all/[email protected]/

> In addition to per cpu slabs, we now have per cpu queues.

But importantly, it's very consciously opt-in. Whether the caches using
percpu arrays can also skip per-cpu (partial) slabs remains to be seen.

>> - percpu arrays will be faster than bulk alloc/free which needs
>> relatively long freelists to work well. Especially in the freeing case
>> we need the nodes to come from the same slab (or small set of those)
>
> Percpu arrays require the code to handle individual objects. Handling
> freelists in partial SLABS means that numerous objects can be handled at
> once by handling the pointer to the list of objects.
>
> In order to make the SLUB in page freelists work better you need to have
> larger freelists and that comes with larger page sizes. I.e. boot with
> slub_min_order=5 or so to increase performance.

In the freeing case, you might still end up with objects mixed from
different slab pages, so the detached freelist building will be inefficient.

> Also this means increasing TLB pressure. The in page freelists of SLUB
> cause objects from the same page to be served. The SLAB queueing approach
> results in objects being mixed from any address and thus neighboring
> objects may require more TLB entries.

As Willy noted, we have 1GB entries in the directmap. Also, we found out
that even if there are actions that cause it to fragment, it's not worth
trying to minimize the fragmentation - https://lwn.net/Articles/931406/

>> - preallocation for the worst case of needed nodes for a tree operation
>> that can't reclaim due to locks is wasteful. We could instead expect
>> that most of the time percpu arrays would satisfy the constrained
>> allocations, and in the rare cases it does not we can dip into
>> GFP_ATOMIC reserves temporarily. So instead of preallocation just
>> prefill the arrays.
>
> The partial percpu slabs could already do the same.

Possibly for the prefill, but efficient freeing will always be an issue.

>> - NUMA locality of the nodes is not a concern as the nodes of a
>> process's VMA tree end up all over the place anyway.
>
> NUMA locality is already controlled by the user through the node
> specification for percpu slabs. All objects coming from the same in page
> freelist of SLUB have the same NUMA locality which simplifies things.
>
> If you would consider NUMA locality for the percpu array then you'd be
> back to my beloved alien caches. We were not able to avoid that when we
> tuned SLAB for maximum performance.

True, it's easier not to support NUMA locality.

>> Patch 5 adds the per-cpu array caches support. Locking is stolen from
>> Mel's recent page allocator's pcplists implementation so it can avoid
>> disabling IRQs and just disable preemption, but the trylocks can fail in
>> rare situations - in most cases the locks are uncontended so the locking
>> should be cheap.
>
> Ok the locking is new but the design follows basic SLAB queue handling.

by Christoph Lameter (Ampere)

Subject: Re: [PATCH RFC v3 0/9] SLUB percpu array caches and maple tree nodes

On Wed, 29 Nov 2023, Matthew Wilcox wrote:

>> In order to make the SLUB in page freelists work better you need to have
>> larger freelists and that comes with larger page sizes. I.e. boot with
>> slub_min_order=5 or so to increase performance.
>
> That comes with its own problems, of course.

Well I thought you were solving those with the folios?

>> Also this means increasing TLB pressure. The in page freelists of SLUB cause
>> objects from the same page to be served. The SLAB queueing approach
>> results in objects being mixed from any address and thus neighboring objects
>> may require more TLB entries.
>
> Is that still a concern for modern CPUs? We're using 1GB TLB entries
> these days, and there are usually thousands of TLB entries. This feels
> like more of a concern for a 90s era CPU.

ARM kernel memory is mapped by 4K entries by default since rodata=full is
the default. Security concerns screw it up.