2024-05-09 14:25:41

by Vlastimil Babka

Subject: [GIT PULL] slab updates for 6.10

Hi Linus,

please pull the latest slab updates from:

git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git tags/slab-for-6.10

Sending this early due to upcoming LSF/MM travel and chances there's no rc8.

Thanks,
Vlastimil

======================================

This time it's mostly random cleanups and fixes, with two performance fixes
that might have significant impact, but limited to systems experiencing
particular bad corner case scenarios rather than general performance
improvements.

The memcg hook changes are going through the mm tree due to dependencies.

- Prevent stalls when reading /proc/slabinfo (Jianfeng Wang)

This fixes the long-standing problem that can happen with workloads that have
alloc/free patterns resulting in many partially used slabs (in e.g. dentry
cache). Reading /proc/slabinfo will traverse the long partial slab list under
spinlock with disabled irqs and thus can stall other processes or even
trigger the lockup detection. The traversal is only done to count free
objects so that <active_objs> column can be reported along with <num_objs>.

To avoid affecting fast paths with another shared counter (attempted in the
past) or complex partial list traversal schemes that allow rescheduling, the
chosen solution resorts to approximation - when the partial list is over
10000 slabs long, we will only traverse the first 5000 slabs from the head and
the last 5000 from the tail, and use the average of those to estimate the
whole list. Both head and tail are used as the slabs near the head tend to
have more free objects than the slabs towards the tail.
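
As a rough illustration of the estimate described above, here is a userspace
sketch, not the actual mm/slub.c code. The function name, the array standing
in for the partial list, and the scan_limit parameter (10000 in the text) are
assumptions made for the example; the real code walks a linked list under the
node's list lock.

```c
#include <stddef.h>

/*
 * Userspace model only. free_objs[i] is the number of free objects in the
 * i-th slab of a partial list, with index 0 being the list head.
 * scan_limit plays the role of the 10000-slab threshold.
 */
unsigned long count_free_approx(const unsigned int *free_objs,
				size_t nr_slabs, size_t scan_limit)
{
	unsigned long long sum = 0;
	size_t half = scan_limit / 2;
	size_t i;

	/* Short list (or a degenerate limit): count every slab exactly. */
	if (nr_slabs <= scan_limit || half == 0) {
		for (i = 0; i < nr_slabs; i++)
			sum += free_objs[i];
		return (unsigned long)sum;
	}

	/* Long list: sample half of the scan budget from each end. */
	for (i = 0; i < half; i++)
		sum += free_objs[i];
	for (i = 0; i < half; i++)
		sum += free_objs[nr_slabs - 1 - i];

	/* Scale the sampled total up to the full list length. */
	return (unsigned long)(sum * nr_slabs / (2 * half));
}
```

With a list whose free counts fall off evenly from head to tail, sampling both
ends gives the exact total; with free objects concentrated at the two ends the
estimate overshoots, which is the kind of error being accepted here.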

It is expected the approximation should not break existing /proc/slabinfo
consumers. The <num_objs> field is still accurate and reflects the overall
kmem_cache footprint. The <active_objs> was already imprecise due to cpu and
percpu-partial slabs, so it can't be relied upon to determine exact cache usage.
The difference between <active_objs> and <num_objs> is mainly useful to
determine the slab fragmentation, and that will be possible even with the
approximation in place.
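
For example, a consumer interested in fragmentation might do something like
the following sketch. The function name and the per-thousand unit are made up
for illustration; the only thing taken from /proc/slabinfo is that each cache
line starts with the name followed by <active_objs> and <num_objs>.

```c
#include <stdio.h>

/*
 * Parse one /proc/slabinfo cache line and return the share of unused
 * objects per thousand, or -1 if the line is not a cache line (such as
 * the version or column header lines) or reports no objects.
 */
long slabinfo_unused_permille(const char *line)
{
	char name[64];
	unsigned long long active_objs, num_objs;

	if (sscanf(line, "%63s %llu %llu", name, &active_objs, &num_objs) != 3)
		return -1;
	if (num_objs == 0)
		return -1;

	/* An estimated <active_objs> should never be read as above the total. */
	if (active_objs > num_objs)
		active_objs = num_objs;

	return (long)((num_objs - active_objs) * 1000 / num_objs);
}
```

Since only the ratio matters for this kind of use, a small error in the
estimated <active_objs> shifts the result slightly rather than invalidating it.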

- Prevent allocating many slabs when a NUMA node is full (Chen Jun)

Currently, on NUMA systems with a node under significantly bigger pressure
than other nodes, the fallback strategy may result in each kmalloc_node()
that can't be satisfied from the preferred node allocating a new slab on a
fallback node instead of reusing the slabs already on that node's partial
list.

This is now fixed and partial lists of fallback nodes are checked even for
kmalloc_node() allocations. It's still preferred to allocate a new slab on
the requested node before a fallback, but only with a GFP_NOWAIT attempt,
which will fail quickly when the node is under significant memory pressure.
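
The resulting order of attempts can be modelled roughly as below. This is a
simplified userspace sketch, not the slub code: the struct, the enum and the
function are hypothetical, "has_free_pages" stands in for whether a new slab
allocation on that node would succeed (for the preferred node, whether the
GFP_NOWAIT attempt would), and trying the preferred node's own partial list
first is an assumption of the model.

```c
#include <stdbool.h>
#include <stddef.h>

enum slab_source {
	SRC_PREFERRED_PARTIAL,	/* partial slab on the requested node */
	SRC_PREFERRED_NEW_SLAB,	/* new slab on the requested node, no waiting */
	SRC_FALLBACK_PARTIAL,	/* partial slab on some other node */
	SRC_FALLBACK_NEW_SLAB,	/* new slab on some other node */
	SRC_NONE,
};

/* Hypothetical per-node state for the model. */
struct node_model {
	unsigned int nr_partial;	/* slabs on the node's partial list */
	bool has_free_pages;		/* would a new slab allocation succeed? */
};

enum slab_source pick_slab_source(const struct node_model *nodes,
				  size_t nr_nodes, size_t preferred)
{
	size_t i;

	if (preferred >= nr_nodes)
		return SRC_NONE;

	/* 1) Reuse a partial slab on the requested node. */
	if (nodes[preferred].nr_partial > 0)
		return SRC_PREFERRED_PARTIAL;

	/* 2) Quick attempt at a new slab on the requested node. */
	if (nodes[preferred].has_free_pages)
		return SRC_PREFERRED_NEW_SLAB;

	/* 3) Check partial lists of the fallback nodes. */
	for (i = 0; i < nr_nodes; i++)
		if (i != preferred && nodes[i].nr_partial > 0)
			return SRC_FALLBACK_PARTIAL;

	/* 4) Only then allocate a new slab on a fallback node. */
	for (i = 0; i < nr_nodes; i++)
		if (i != preferred && nodes[i].has_free_pages)
			return SRC_FALLBACK_NEW_SLAB;

	return SRC_NONE;
}
```

Before the fix, step 3 was effectively skipped for kmalloc_node(), so a full
preferred node led straight to step 4 on every allocation.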

- More SLAB removal related cleanups (Xiu Jianfeng, Hyunmin Lee)

- Fix slub_kunit self-test with hardened freelists (Guenter Roeck)

- Mark racy accesses for KCSAN (linke li)

- Misc cleanups (Xiongwei Song, Haifeng Xu, Sangyun Kim)

----------------------------------------------------------------
Chen Jun (1):
      mm/slub: Reduce memory consumption in extreme scenarios

Guenter Roeck (1):
      mm/slub, kunit: Use inverted data to corrupt kmem cache

Haifeng Xu (1):
      slub: Set __GFP_COMP in kmem_cache by default

Hyunmin Lee (2):
      mm/slub: create kmalloc 96 and 192 caches regardless cache size order
      mm/slub: remove the check for NULL kmalloc_caches

Jianfeng Wang (2):
      slub: introduce count_partial_free_approx()
      slub: use count_partial_free_approx() in slab_out_of_memory()

Sangyun Kim (1):
      mm/slub: remove duplicate initialization for early_kmem_cache_node_alloc()

Xiongwei Song (3):
      mm/slub: remove the check of !kmem_cache_has_cpu_partial()
      mm/slub: add slub_get_cpu_partial() helper
      mm/slub: simplify get_partial_node()

Xiu Jianfeng (2):
      mm/slub: remove dummy slabinfo functions
      mm/slub: correct comment in do_slab_free()

linke li (2):
      mm/slub: mark racy accesses on slab->slabs
      mm/slub: mark racy access on slab->freelist

 lib/slub_kunit.c |   2 +-
 mm/slab.h        |   3 --
 mm/slab_common.c |  27 +++++--------
 mm/slub.c        | 118 ++++++++++++++++++++++++++++++++++++++++---------------
 4 files changed, 96 insertions(+), 54 deletions(-)


2024-05-13 17:34:15

by Linus Torvalds

Subject: Re: [GIT PULL] slab updates for 6.10

On Thu, 9 May 2024 at 07:25, Vlastimil Babka wrote:
>
> To avoid affecting fast paths with another shared counter (attempted in the
> past) or complex partial list traversal schemes that allow rescheduling, the
> chosen solution resorts to approximation - when the partial list is over
> 10000 slabs long, we will only traverse first 5000 slabs from head and tail
> each and use the average of those to estimate the whole list. Both head and
> tail are used as the slabs near head tend to have more free objects than
> the slabs towards the tail.

I suspect you could have cut this down by an order of magnitude, and
made the limit be just 1k slabs rather than 10k slabs. Or even
_another_ order of magnitude smaller.

Somebody was being a bit too worried about approximations, methinks -
but I think the real worry goes the other way, where it's practically
so hard to even hit the approximation situation that it gets no
testing at all.

IOW, I suspect it's better to be explicit about approximations, and
have people aware of it, rather than be overly cautious and have it be
a special case that almost never triggers in any normal loads.

But pulled.

Linus

2024-05-13 17:38:59

by pr-tracker-bot

Subject: Re: [GIT PULL] slab updates for 6.10

The pull request you sent on Thu, 9 May 2024 16:25:05 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git tags/slab-for-6.10

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/cd97950cbcabe662cd8a9fd0a08a247c1ea1fb28

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2024-05-20 10:18:54

by Vlastimil Babka

Subject: Re: [GIT PULL] slab updates for 6.10

On 5/13/24 7:33 PM, Linus Torvalds wrote:
> On Thu, 9 May 2024 at 07:25, Vlastimil Babka wrote:
>>
>> To avoid affecting fast paths with another shared counter (attempted in the
>> past) or complex partial list traversal schemes that allow rescheduling, the
>> chosen solution resorts to approximation - when the partial list is over
>> 10000 slabs long, we will only traverse first 5000 slabs from head and tail
>> each and use the average of those to estimate the whole list. Both head and
>> tail are used as the slabs near head tend to have more free objects than
>> the slabs towards the tail.
>
> I suspect you could have cut this down by an order of magnitude, and
> made the limit be just 1k slabs rather than 10k slabs. Or even
> _another_ order of magnitude smaller.
>
> Somebody was being a bit too worried about approximations, methinks -

Indeed, my focus was that we make the approximation as accurate as possible
when introducing it, to minimize the chance of possibly breaking somebody
and having to revert it. Then we can try to reduce the limit once the approach
itself is established.

> but I think the real worry goes the other way, where it's practically
> so hard to even hit the approximation situation that it gets no
> testing at all.

Good point.

> IOW, I suspect it's better to be explicit about approximations, and
> have people aware of it, rather than be overly cautious and have it be
> a special case that almost never triggers in any normal loads.

OK, we can reduce the limit sooner rather than later. As for being explicit,
there was an idea that an approximated line in slabinfo would be marked, but I
thought changing the layout would be more likely to break someone parsing it
than an unmarked approximation. We can be more explicit, e.g. in the
documentation, though, for sure.

> But pulled.

Thanks.

>
> Linus