2024-04-11 18:36:55

by Jianfeng Wang

Subject: [PATCH] slub: limit number of slabs to scan in count_partial()

When reading "/proc/slabinfo", the kernel needs to report the number
of free objects for each kmem_cache. The current implementation relies
on count_partial() that counts the number of free objects by scanning
each kmem_cache_node's list of partial slabs and summing free objects
from every partial slab in the list. This process must hold per
kmem_cache_node spinlock and disable IRQ and may take a long time.
Consequently, it can block slab allocation requests on other CPU cores
and cause timeouts for network devices etc., when the partial slab list
is long. In production, even NMI watchdog can be triggered due to this
matter: e.g., for "buffer_head", the number of partial slabs was
observed to be ~1M in one kmem_cache_node. This problem was also
confirmed by several others [1-3] in the past.
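
For context, the read path that ends up in count_partial() looks
roughly like the sketch below (paraphrased from get_slabinfo() in
mm/slub.c; the exact code differs across kernel versions):

/* Rough sketch of the /proc/slabinfo read path; not the exact upstream code. */
static void get_slabinfo_sketch(struct kmem_cache *s, struct slabinfo *sinfo)
{
	unsigned long nr_slabs = 0, nr_objs = 0, nr_free = 0;
	struct kmem_cache_node *n;
	int node;

	for_each_kmem_cache_node(s, node, n) {
		nr_slabs += node_nr_slabs(n);
		nr_objs  += node_nr_objs(n);
		/* walks n->partial with n->list_lock held and IRQs disabled */
		nr_free  += count_partial(n, count_free);
	}

	sinfo->active_objs = nr_objs - nr_free;
	sinfo->num_objs = nr_objs;
	sinfo->active_slabs = nr_slabs;
	sinfo->num_slabs = nr_slabs;
	/* remaining fields (objects per slab, order, ...) filled from s->oo */
}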

Iterating over a partial list to get the exact count of objects can
cause soft lockups for a long list, with or without the lock held
(e.g., if preemption is disabled), and is not very useful either: the
object count can change right after the lock is released. The
alternative of maintaining free-object counters requires atomic
operations on the fast path [3].
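
For illustration only (the struct and helper below are hypothetical,
not taken from any posted patch): keeping such a counter exact would
add an atomic update to every allocation and free that touches a
partial slab, along these lines:

/* Hypothetical per-node counter; illustrates the fast-path cost, nothing more. */
struct kmem_cache_node_counted {
	spinlock_t list_lock;
	unsigned long nr_partial;
	struct list_head partial;
	atomic_long_t partial_free_objs;	/* exact free-object count */
};

/* Would have to be called from the allocation/free fast paths. */
static inline void partial_free_objs_add(struct kmem_cache_node_counted *n,
					 long delta)
{
	atomic_long_add(delta, &n->partial_free_objs);	/* extra atomic per op */
}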

So, the fix is to limit the number of slabs to scan in
count_partial() and output an approximate result if the list is too
long. The limit defaults to 10000 slabs, which should be enough for
most sane cases.
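
To make the approximation concrete, here is a standalone userspace
illustration with made-up numbers; mult_frac_demo() mirrors the
kernel's mult_frac() helper (include/linux/math.h), which computes
x * n / d without overflowing the intermediate product:

#include <stdio.h>

/* Userspace stand-in for the kernel's mult_frac(): x * numer / denom,
 * split into quotient/remainder so the intermediate product cannot overflow. */
static unsigned long mult_frac_demo(unsigned long x, unsigned long numer,
				    unsigned long denom)
{
	unsigned long quot = x / denom;
	unsigned long rem = x % denom;

	return quot * numer + (rem * numer) / denom;
}

int main(void)
{
	unsigned long scanned = 10001;		/* loop breaks after MAX_PARTIAL_TO_SCAN + 1 slabs */
	unsigned long counted = 50005;		/* free objects seen in those slabs (made up) */
	unsigned long nr_partial = 1000000;	/* ~1M partial slabs, as in the report above */

	/* Estimated free objects on the whole partial list: ~5,000,000 */
	printf("approx free objects: %lu\n",
	       mult_frac_demo(counted, nr_partial, scanned));
	return 0;
}

With about 5 free objects per scanned slab, the estimate for a ~1M-slab
partial list comes out to roughly 5,000,000 free objects.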

[1] https://lore.kernel.org/linux-mm/[email protected]/T/
[2] https://lore.kernel.org/lkml/[email protected]/T/
[3] https://lore.kernel.org/lkml/[email protected]/T/

Signed-off-by: Jianfeng Wang <[email protected]>
---
mm/slub.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1bb2a93cf7b6..5ed998ec7d6d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3213,16 +3213,25 @@ static inline bool free_debug_processing(struct kmem_cache *s,
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
+#define MAX_PARTIAL_TO_SCAN 10000
+
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct slab *))
 {
 	unsigned long flags;
 	unsigned long x = 0;
+	unsigned long scanned = 0;
 	struct slab *slab;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(slab, &n->partial, slab_list)
+	list_for_each_entry(slab, &n->partial, slab_list) {
 		x += get_count(slab);
+		if (++scanned > MAX_PARTIAL_TO_SCAN) {
+			/* Approximate total count of objects */
+			x = mult_frac(x, n->nr_partial, scanned);
+			break;
+		}
+	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	return x;
 }
--
2.42.1



2024-04-12 07:41:56

by Vlastimil Babka

Subject: Re: [PATCH] slub: limit number of slabs to scan in count_partial()

On 4/11/24 6:40 PM, Jianfeng Wang wrote:
> When reading "/proc/slabinfo", the kernel needs to report the number
> of free objects for each kmem_cache. The current implementation relies
> on count_partial() that counts the number of free objects by scanning
> each kmem_cache_node's list of partial slabs and summing free objects
> from every partial slab in the list. This process must hold per
> kmem_cache_node spinlock and disable IRQ and may take a long time.
> Consequently, it can block slab allocation requests on other CPU cores
> and cause timeouts for network devices etc., when the partial slab list
> is long. In production, even NMI watchdog can be triggered due to this
> matter: e.g., for "buffer_head", the number of partial slabs was
> observed to be ~1M in one kmem_cache_node. This problem was also

Wonder what led to such a situation, and whether it could have been the
NUMA-related problem this patch is fixing:
https://lore.kernel.org/all/[email protected]/

But I guess we don't do that with buffer heads, and it's just possible
that a surge of allocations led to many slabs being allocated, and then
a slower trickle of freeing didn't happen in a matching order, leading
to many slabs becoming partially free...

> confirmed by several others [1-3] in the past.
>
> Iterating over a partial list to get the exact count of objects can
> cause soft lockups for a long list, with or without the lock held
> (e.g., if preemption is disabled), and is not very useful either: the
> object count can change right after the lock is released. The
> alternative of maintaining free-object counters requires atomic
> operations on the fast path [3].
>
> So, the fix is to limit the number of slabs to scan in
> count_partial() and output an approximate result if the list is too
> long. The limit defaults to 10000 slabs, which should be enough for
> most sane cases.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/T/
> [2] https://lore.kernel.org/lkml/[email protected]/T/
> [3] https://lore.kernel.org/lkml/[email protected]/T/
>
> Signed-off-by: Jianfeng Wang <[email protected]>
> ---
> mm/slub.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7b6..5ed998ec7d6d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3213,16 +3213,25 @@ static inline bool free_debug_processing(struct kmem_cache *s,
>  #endif /* CONFIG_SLUB_DEBUG */
> 
>  #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
> +#define MAX_PARTIAL_TO_SCAN 10000
> +
>  static unsigned long count_partial(struct kmem_cache_node *n,
>  					int (*get_count)(struct slab *))
>  {
>  	unsigned long flags;
>  	unsigned long x = 0;
> +	unsigned long scanned = 0;
>  	struct slab *slab;
> 
>  	spin_lock_irqsave(&n->list_lock, flags);
> -	list_for_each_entry(slab, &n->partial, slab_list)
> +	list_for_each_entry(slab, &n->partial, slab_list) {
>  		x += get_count(slab);
> +		if (++scanned > MAX_PARTIAL_TO_SCAN) {
> +			/* Approximate total count of objects */
> +			x = mult_frac(x, n->nr_partial, scanned);
> +			break;
> +		}
> +	}
>  	spin_unlock_irqrestore(&n->list_lock, flags);
>  	return x;
>  }