2024-04-02 03:45:41

by Chengming Zhou

Subject: Re: [PATCH] slub: fix slub segmentation

On 2024/4/2 11:10, Ming Yang wrote:
> When one of the NUMA nodes runs out of memory while lots of processes are
> still booting, slabinfo shows that much slub segmentation exists. The
> following shows some of them:
>
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kmalloc-512    84309 380800   1024   32    8 : tunables    0    0    0 : slabdata  11900  11900      0
> kmalloc-256    65869 365408    512   32    4 : tunables    0    0    0 : slabdata  11419  11419      0
>
> 365408 "kmalloc-256" objects are allocated but only 65869 of them are
> used, while 380800 "kmalloc-512" objects are allocated but only 84309
> of them are used.
>
> This problem exists in the following scenario:
> 1. Multiple NUMA nodes, e.g. four nodes.
> 2. Lack of memory on any one node.
> 3. Functions which allocate much slab memory on certain NUMA nodes,
> like alloc_fair_sched_group.
>
> The slub segmentation is generated for the following reason:
> In "___slab_alloc" an attempt is made to get a new slab via
> "get_partial". If the 'node' argument is assigned but that node has
> neither partial slabs nor buddy memory left, no slab can be obtained
> from it. The allocator then attempts to allocate a new slab from the
> buddy system; as mentioned before, since the assigned node has no
> buddy memory left, a new slab may be allocated directly from the buddy
> system of another node, no matter whether free partial slabs are left
> on other nodes. As a result, slub segmentation is generated.
>
> The key point of the above allocation flow is: the slab should be
> allocated from the partial lists of other nodes first, instead of from
> the buddy system of other nodes directly.
>
> In this commit a new slub allocation flow is proposed:
> 1. Attempt to get a slab via get_partial() (first step at the
> new_objects label).
> 2. If no slab is obtained and 'node' is assigned, try to allocate a
> new slab from the assigned node only, instead of from all nodes.
> 3. If no slab can be allocated from the assigned node, try to get a
> slab from the partial lists of other nodes.
> 4. If the attempt in step 3 fails, allocate a new slab from the buddy
> system of all nodes.

FYI, there is another patch to the very same problem:

https://lore.kernel.org/all/[email protected]/

>
> Signed-off-by: Ming Yang <[email protected]>
> Signed-off-by: Liang Zhang <[email protected]>
> Signed-off-by: Zhigang Wang <[email protected]>
> Reviewed-by: Shixin Liu <[email protected]>
> ---
> This patch can be tested and verified by the following steps:
> 1. First, exhaust the memory on node0: echo 1000 (depending on your memory) >
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages.
> 2. Second, start 10000 (depending on your memory) processes which use the
> setsid system call, as setsid is likely to call alloc_fair_sched_group.
> 3. Last, check slabinfo: cat /proc/slabinfo.
>
> Hardware info:
> Memory : 8GiB
> CPU (total #): 120
> numa node: 4
>
> Test C code example (built with clang):
>
> #include <stdlib.h>
> #include <unistd.h>
>
> int main(void)
> {
> 	void *p = malloc(1024);
> 	setsid();
> 	while (1)
> 		;
> }
>
> mm/slub.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7..3eb2e7d386 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3522,7 +3522,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> }
>
> slub_put_cpu_ptr(s->cpu_slab);
> + if (node != NUMA_NO_NODE) {
> + slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
> + if (slab)
> + goto slab_alloced;
> +
> + slab = get_any_partial(s, &pc);
> + if (slab)
> + goto slab_alloced;
> + }
> slab = new_slab(s, gfpflags, node);
> +
> +slab_alloced:
> c = slub_get_cpu_ptr(s->cpu_slab);
>
> if (unlikely(!slab)) {


2024-04-02 16:14:22

by Vlastimil Babka

Subject: Re: [PATCH] slub: fix slub segmentation

On 4/2/24 5:45 AM, Chengming Zhou wrote:
> On 2024/4/2 11:10, Ming Yang wrote:
>> When one of the NUMA nodes runs out of memory while lots of processes are
>> still booting, slabinfo shows that much slub segmentation exists. The following

You mean fragmentation not segmentation, right?

>> shows some of them:
>>
>> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
>> kmalloc-512    84309 380800   1024   32    8 : tunables    0    0    0 : slabdata  11900  11900      0
>> kmalloc-256    65869 365408    512   32    4 : tunables    0    0    0 : slabdata  11419  11419      0
>>
>> 365408 "kmalloc-256" objects are allocated but only 65869 of them are
>> used, while 380800 "kmalloc-512" objects are allocated but only 84309
>> of them are used.
>>
>> This problem exists in the following scenario:
>> 1. Multiple NUMA nodes, e.g. four nodes.
>> 2. Lack of memory on any one node.
>> 3. Functions which allocate much slab memory on certain NUMA nodes,
>> like alloc_fair_sched_group.
>>
>> The slub segmentation is generated for the following reason:
>> In "___slab_alloc" an attempt is made to get a new slab via
>> "get_partial". If the 'node' argument is assigned but that node has
>> neither partial slabs nor buddy memory left, no slab can be obtained
>> from it. The allocator then attempts to allocate a new slab from the
>> buddy system; as mentioned before, since the assigned node has no
>> buddy memory left, a new slab may be allocated directly from the buddy
>> system of another node, no matter whether free partial slabs are left
>> on other nodes. As a result, slub segmentation is generated.
>>
>> The key point of the above allocation flow is: the slab should be
>> allocated from the partial lists of other nodes first, instead of from
>> the buddy system of other nodes directly.
>>
>> In this commit a new slub allocation flow is proposed:
>> 1. Attempt to get a slab via get_partial() (first step at the
>> new_objects label).
>> 2. If no slab is obtained and 'node' is assigned, try to allocate a
>> new slab from the assigned node only, instead of from all nodes.
>> 3. If no slab can be allocated from the assigned node, try to get a
>> slab from the partial lists of other nodes.
>> 4. If the attempt in step 3 fails, allocate a new slab from the buddy
>> system of all nodes.
>
> FYI, there is another patch to the very same problem:
>
> https://lore.kernel.org/all/[email protected]/

Yeah and I have just taken that one to slab/for-6.10

>>
>> Signed-off-by: Ming Yang <[email protected]>
>> Signed-off-by: Liang Zhang <[email protected]>
>> Signed-off-by: Zhigang Wang <[email protected]>
>> Reviewed-by: Shixin Liu <[email protected]>
>> ---
>> This patch can be tested and verified by the following steps:
>> 1. First, exhaust the memory on node0: echo 1000 (depending on your memory) >
>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages.
>> 2. Second, start 10000 (depending on your memory) processes which use the
>> setsid system call, as setsid is likely to call alloc_fair_sched_group.
>> 3. Last, check slabinfo: cat /proc/slabinfo.
>>
>> Hardware info:
>> Memory : 8GiB
>> CPU (total #): 120
>> numa node: 4
>>
>> Test C code example (built with clang):
>>
>> #include <stdlib.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>> 	void *p = malloc(1024);
>> 	setsid();
>> 	while (1)
>> 		;
>> }
>>
>> mm/slub.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 1bb2a93cf7..3eb2e7d386 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3522,7 +3522,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> }
>>
>> slub_put_cpu_ptr(s->cpu_slab);
>> + if (node != NUMA_NO_NODE) {
>> + slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
>> + if (slab)
>> + goto slab_alloced;
>> +
>> + slab = get_any_partial(s, &pc);
>> + if (slab)
>> + goto slab_alloced;
>> + }
>> slab = new_slab(s, gfpflags, node);
>> +
>> +slab_alloced:
>> c = slub_get_cpu_ptr(s->cpu_slab);
>>
>> if (unlikely(!slab)) {