2021-08-05 15:21:49

by Vlastimil Babka

Subject: [PATCH v4 00/35] SLUB: reduce irq disabled scope and make it RT compatible

Hi Andrew,

I believe the series is ready for mmotm. There are no known bugs; Mel found no
!RT perf regressions in v3 [9], and neither did Mike (details below). The RT
folks validated it on an RT config and have already incorporated the series
into the RT tree.

Thanks, Vlastimil.

Changes since v3 [8]:
* Rebase to 5.14-rc4
* Fix unbounded percpu partial list growth reported by Sebastian Andrzej Siewior
* Prevent spurious uninitialized local variable warning reported by Mel Gorman

Changes since v2 [5]:
* Rebase to 5.14-rc3
* A number of fixes to the RT parts, big thanks to Mike Galbraith for testing
and debugging!
* The largest fix is to protect kmem_cache_cpu->partial by local_lock instead
of cmpxchg tricks, which are insufficient on RT. To avoid divergence
between RT and !RT, just do it everywhere. Affected mainly patch 25 and a
new patch 33. This also addresses a theoretical race raised earlier by Jann
Horn.
* Smaller fixes reported by Sebastian Andrzej Siewior and Cyrill Gorcunov

Changes since RFC v1 [1]:
* Addressed feedback from Christoph and Mel, added their acks.
* Finished RT conversion, adopting 2 patches from the RT tree.
* The local_lock conversion has to sacrifice lockless fastpaths on PREEMPT_RT
* Added some more cleanup patches to the front.

This series was initially inspired by Mel's pcplist local_lock rewrite, and
also by an interest in better understanding SLUB's locking, the new locking
primitives, and their RT variants and implications. It should make SLUB more
preemption-friendly, especially for RT, hopefully without noticeable
regressions, as the fast paths are not affected.

Series is based on 5.14-rc4 and also available as a git branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0

The series should now be sufficiently tested in both RT and !RT configs, mainly
thanks to Mike. The RFC/v1 version also got basic performance screening by
Mel that didn't show major regressions. Mike's testing with hackbench of v2 on
!RT reported negligible differences [6]:

virgin(ish) tip
5.13.0.g60ab3ed-tip
          7,320.67 msec task-clock        #    7.792 CPUs utilized   ( +- 0.31% )
           221,215      context-switches  #    0.030 M/sec           ( +- 3.97% )
            16,234      cpu-migrations    #    0.002 M/sec           ( +- 4.07% )
            13,233      page-faults       #    0.002 M/sec           ( +- 0.91% )
    27,592,205,252      cycles            #    3.769 GHz             ( +- 0.32% )
     8,309,495,040      instructions      #     0.30 insn per cycle  ( +- 0.37% )
     1,555,210,607      branches          #  212.441 M/sec           ( +- 0.42% )
         5,484,209      branch-misses     #    0.35% of all branches ( +- 2.13% )

           0.93949 +- 0.00423 seconds time elapsed  ( +- 0.45% )
           0.94608 +- 0.00384 seconds time elapsed  ( +- 0.41% ) (repeat)
           0.94422 +- 0.00410 seconds time elapsed  ( +- 0.43% )

5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
          7,343.57 msec task-clock        #    7.776 CPUs utilized   ( +- 0.44% )
           223,044      context-switches  #    0.030 M/sec           ( +- 3.02% )
            16,057      cpu-migrations    #    0.002 M/sec           ( +- 4.03% )
            13,164      page-faults       #    0.002 M/sec           ( +- 0.97% )
    27,684,906,017      cycles            #    3.770 GHz             ( +- 0.45% )
     8,323,273,871      instructions      #     0.30 insn per cycle  ( +- 0.28% )
     1,556,106,680      branches          #  211.901 M/sec           ( +- 0.31% )
         5,463,468      branch-misses     #    0.35% of all branches ( +- 1.33% )

           0.94440 +- 0.00352 seconds time elapsed  ( +- 0.37% )
           0.94830 +- 0.00228 seconds time elapsed  ( +- 0.24% ) (repeat)
           0.93813 +- 0.00440 seconds time elapsed  ( +- 0.47% ) (repeat)
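
To put "negligible" into a number, the three elapsed times of each kernel can
be averaged and compared. The following is only a sketch of that arithmetic;
the helper names (mean(), relative_change_percent()) are made up for
illustration, and the inputs are the elapsed times quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Elapsed times in seconds, copied from the hackbench runs quoted above. */
static const double base_s[] = { 0.93949, 0.94608, 0.94422 };
static const double patched_s[] = { 0.94440, 0.94830, 0.93813 };

/* Arithmetic mean of n samples. */
static inline double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Change of the patched mean relative to the baseline mean, in percent. */
static inline double relative_change_percent(void)
{
	double b = mean(base_s, sizeof(base_s) / sizeof(base_s[0]));
	double p = mean(patched_s, sizeof(patched_s) / sizeof(patched_s[0]));

	return (p - b) / b * 100.0;
}
```

The result is well under 0.1%, which is smaller than the run-to-run variance
reported for any single measurement (0.24% to 0.47%).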

RT configs showed some throughput regressions, but that's an expected tradeoff
for the preemption improvements through the RT mutex. It didn't prevent v2
from being incorporated into the 5.13 RT tree [7], leading to testing exposure
and bugfixes.

Before the series, SLUB is lockless in both allocation and free fast paths, but
elsewhere, it's disabling irqs for considerable periods of time - especially in
allocation slowpath and the bulk allocation, where IRQs are re-enabled only
when a new page from the page allocator is needed, and the context allows
blocking. The irq disabled sections can then include deactivate_slab() which
walks a full freelist and frees the slab back to page allocator or
unfreeze_partials() going through a list of percpu partial slabs. The RT tree
currently has some patches mitigating these, but we can do much better in
mainline too.

Patches 1-6 are straightforward improvements or cleanups that could exist
outside of this series too, but are prerequisites.

Patches 7-10 are also preparatory code changes without functional changes, but
not so useful without the rest of the series.

Patch 11 simplifies the fast paths on systems with preemption, based on
(hopefully correct) observation that the current loops to verify tid are
unnecessary.

Patches 12-21 focus on reducing irq disabled scope in the allocation slowpath.

Patch 12 moves disabling of irqs into ___slab_alloc() from its callers, which
are the allocation slowpath, and bulk allocation. Instead these callers only
disable preemption to stabilize the cpu. The following patches then gradually
reduce the scope of disabled irqs in ___slab_alloc() and the functions called
from there. As of patch 15, the re-enabling of irqs based on gfp flags before
calling the page allocator is removed from allocate_slab(). As of patch 18,
it's possible to reach the page allocator (in case of existing slabs depleted)
without disabling and re-enabling irqs a single time.

Patches 22-27 reduce the scope of disabled irqs in functions related to
unfreezing percpu partial slabs.

Patch 28 is preparatory. Patch 29 is adopted from the RT tree and converts the
flushing of percpu slabs on all cpus from using IPI to workqueue, so that the
processing isn't happening with irqs disabled in the IPI handler. The flushing
is not performance critical so it should be acceptable.

Patch 30 also comes from RT tree and makes object_map_lock RT compatible.

Patches 31-32 make slab_lock irq-safe on RT where we cannot rely on having
irq disabled from the list_lock spin lock usage.

Patch 33 changes kmem_cache_cpu->partial handling in put_cpu_partial() from
cmpxchg loop to a short irq disabled section, which is used by all other code
modifying the field. This addresses a theoretical race scenario pointed out by
Jann, and makes the critical section safe wrt RT local_lock semantics
after the conversion in patch 35.

Patch 34 changes preempt disable to migrate disable, so that the nested
list_lock spinlock is safe to take on RT. Because migrate_disable() is a
function call even on !RT, a small set of private wrappers is introduced
to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations.

As of this patch, SLUB should be compatible with RT's lock semantics, to the
best of my knowledge.
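
The wrapper idea from patch 34 can be modeled in user space. This is a minimal
sketch, not the code from the patch: MODEL_PREEMPT_RT, pin_cpu() and
unpin_cpu() are hypothetical names, and plain counters stand in for
preempt_disable() and migrate_disable().

```c
#include <assert.h>

/* Hypothetical build-time switch standing in for CONFIG_PREEMPT_RT. */
#ifndef MODEL_PREEMPT_RT
#define MODEL_PREEMPT_RT 0
#endif

/* Counters standing in for the real preemption and migration state. */
static int preempt_disabled;
static int migrate_disabled;

/* Keep the task on its current cpu with the cheapest suitable primitive. */
static inline void pin_cpu(void)
{
	if (MODEL_PREEMPT_RT)
		migrate_disabled++;	/* models migrate_disable() */
	else
		preempt_disabled++;	/* models preempt_disable() */
}

/* Undo pin_cpu(); calls must be balanced. */
static inline void unpin_cpu(void)
{
	if (MODEL_PREEMPT_RT) {
		assert(migrate_disabled > 0);
		migrate_disabled--;
	} else {
		assert(preempt_disabled > 0);
		preempt_disabled--;
	}
}
```

Since the switch is a compile-time constant, a compiler would be expected to
drop the unused branch, so each configuration pays only for its own primitive.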

Finally, patch 35 replaces irq disabled sections that protect kmem_cache_cpu
fields in the slow paths with a local lock. However on PREEMPT_RT it means the
lockless fast paths can now preempt slow paths which don't expect that, so the
local lock has to be taken also in the fast paths and they are no longer
lockless. It's up to RT folks to decide if this is a good tradeoff.
The patch also updates the locking documentation in the file's comment.

The main results of this series:

* irq disabling is only done for the minimum amount of time needed to protect the
kmem_cache_cpu data and as part of spin lock, local lock and bit lock
operations to make them irq-safe

* SLUB should be fully PREEMPT_RT compatible

This should have obvious implications for better preemptibility, especially on RT.

Some details are different than how the previous SLUB RT tree patches were
implemented:

mm: sl[au]b: Change list_lock to raw_spinlock_t [2] - the SLAB part can be
dropped as a different patch restricts RT to SLUB anyway. And after this series
the list_lock in SLUB is never used with irqs disabled before taking the lock
so it doesn't have to be converted to raw_spinlock_t.

mm: slub: Move discard_slab() invocations out of IRQ-off sections [3] should be
unnecessary as this series does move these invocations outside irq disabled
sections in a different way.

The remaining patches to upstream from the RT tree are small ones related to
KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
perhaps be re-evaluated as the series addresses some latency issues with it.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0001-mm-sl-au-b-Change-list_lock-to-raw_spinlock_t.patch?h=linux-5.12.y-rt-patches
[3] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0004-mm-slub-Move-discard_slab-invocations-out-of-IRQ-off.patch?h=linux-5.12.y-rt-patches
[4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0005-mm-slub-Move-flush_cpu_slab-invocations-__free_slab-.patch?h=linux-5.12.y-rt-patches
[5] https://lore.kernel.org/lkml/[email protected]/
[6] https://lore.kernel.org/lkml/[email protected]
[7] https://lore.kernel.org/linux-rt-users/[email protected]/
[8] https://lore.kernel.org/lkml/[email protected]/
[9] https://lore.kernel.org/lkml/[email protected]/

Sebastian Andrzej Siewior (2):
mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations
out of IRQ context
mm: slub: Make object_map_lock a raw_spinlock_t

Vlastimil Babka (33):
mm, slub: don't call flush_all() from slab_debug_trace_open()
mm, slub: allocate private object map for debugfs listings
mm, slub: allocate private object map for validate_slab_cache()
mm, slub: don't disable irq for debug_check_no_locks_freed()
mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
mm, slub: unify cmpxchg_double_slab() and __cmpxchg_double_slab()
mm, slub: extract get_partial() from new_slab_objects()
mm, slub: dissolve new_slab_objects() into ___slab_alloc()
mm, slub: return slab page from get_partial() and set c->page
afterwards
mm, slub: restructure new page checks in ___slab_alloc()
mm, slub: simplify kmem_cache_cpu and tid setup
mm, slub: move disabling/enabling irqs to ___slab_alloc()
mm, slub: do initial checks in ___slab_alloc() with irqs enabled
mm, slub: move disabling irqs closer to get_partial() in
___slab_alloc()
mm, slub: restore irqs around calling new_slab()
mm, slub: validate slab from partial list or page allocator before
making it cpu slab
mm, slub: check new pages with restored irqs
mm, slub: stop disabling irqs around get_partial()
mm, slub: move reset of c->page and freelist out of deactivate_slab()
mm, slub: make locking in deactivate_slab() irq-safe
mm, slub: call deactivate_slab() without disabling irqs
mm, slub: move irq control into unfreeze_partials()
mm, slub: discard slabs in unfreeze_partials() without irqs disabled
mm, slub: detach whole partial list at once in unfreeze_partials()
mm, slub: separate detaching of partial list in unfreeze_partials()
from unfreezing
mm, slub: only disable irq with spin_lock in __unfreeze_partials()
mm, slub: don't disable irqs in slub_cpu_dead()
mm, slab: make flush_slab() possible to call with irqs enabled
mm, slub: optionally save/restore irqs in slab_[un]lock()/
mm, slub: make slab_lock() disable irqs with PREEMPT_RT
mm, slub: protect put_cpu_partial() with disabled irqs instead of
cmpxchg
mm, slub: use migrate_disable() on PREEMPT_RT
mm, slub: convert kmem_cpu_slab protection to local_lock

include/linux/slub_def.h | 2 +
mm/slub.c | 809 +++++++++++++++++++++++++--------------
2 files changed, 531 insertions(+), 280 deletions(-)

--
2.32.0


2021-08-05 15:23:56

by Vlastimil Babka

Subject: [PATCH v4 06/35] mm, slub: unify cmpxchg_double_slab() and __cmpxchg_double_slab()

These functions differ only in irq disabling in the slow path. We can create a
common function with an extra bool parameter to control the irq disabling.
As the functions are inline and the parameter compile-time constant, there
will be no runtime overhead due to this change.

Also change the DEBUG_VM based irqs disable assert to the more standard
lockdep_assert based one.
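
The shape of the unification can be modeled in user space. This is a minimal
sketch under stated assumptions: update_pair_common(), update_pair_irqs_off()
and update_pair() are hypothetical names, model_irq_save() and
model_irq_restore() are stand-ins for local_irq_save() and
local_irq_restore(), and only the locked fallback path is modeled, without
slab_lock().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the cpu's irq state and of local_irq_save() use. */
static bool irqs_off;
static int irq_save_calls;

static inline void model_irq_save(bool *flags)
{
	*flags = irqs_off;
	irqs_off = true;
	irq_save_calls++;
}

static inline void model_irq_restore(bool flags)
{
	irqs_off = flags;
}

struct pair {
	void *freelist;
	unsigned long counters;
};

/* Common helper; disable_irqs is a constant at each inlined call site. */
static inline bool update_pair_common(struct pair *p,
		void *freelist_old, unsigned long counters_old,
		void *freelist_new, unsigned long counters_new,
		bool disable_irqs)
{
	bool flags = false;
	bool updated = false;

	if (disable_irqs)
		model_irq_save(&flags);
	else
		assert(irqs_off);	/* models lockdep_assert_irqs_disabled() */

	if (p->freelist == freelist_old && p->counters == counters_old) {
		p->freelist = freelist_new;
		p->counters = counters_new;
		updated = true;
	}

	if (disable_irqs)
		model_irq_restore(flags);
	return updated;
}

/* Variant for callers that already run with irqs disabled. */
static inline bool update_pair_irqs_off(struct pair *p,
		void *freelist_old, unsigned long counters_old,
		void *freelist_new, unsigned long counters_new)
{
	return update_pair_common(p, freelist_old, counters_old,
				  freelist_new, counters_new, false);
}

/* Variant that disables irqs itself around the update. */
static inline bool update_pair(struct pair *p,
		void *freelist_old, unsigned long counters_old,
		void *freelist_new, unsigned long counters_new)
{
	return update_pair_common(p, freelist_old, counters_old,
				  freelist_new, counters_new, true);
}
```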

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---
mm/slub.c | 62 +++++++++++++++++++++----------------------------------
1 file changed, 24 insertions(+), 38 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 812345fdf13c..15edf260632e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -371,13 +371,13 @@ static __always_inline void slab_unlock(struct page *page)
__bit_spin_unlock(PG_locked, &page->flags);
}

-/* Interrupts must be disabled (for the fallback code to work right) */
-static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
+static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
void *freelist_old, unsigned long counters_old,
void *freelist_new, unsigned long counters_new,
- const char *n)
+ const char *n, bool disable_irqs)
{
- VM_BUG_ON(!irqs_disabled());
+ if (!disable_irqs)
+ lockdep_assert_irqs_disabled();
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (s->flags & __CMPXCHG_DOUBLE) {
@@ -388,15 +388,23 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page
} else
#endif
{
+ unsigned long flags;
+
+ if (disable_irqs)
+ local_irq_save(flags);
slab_lock(page);
if (page->freelist == freelist_old &&
page->counters == counters_old) {
page->freelist = freelist_new;
page->counters = counters_new;
slab_unlock(page);
+ if (disable_irqs)
+ local_irq_restore(flags);
return true;
}
slab_unlock(page);
+ if (disable_irqs)
+ local_irq_restore(flags);
}

cpu_relax();
@@ -409,45 +417,23 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page
return false;
}

-static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
+/* Interrupts must be disabled (for the fallback code to work right) */
+static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
void *freelist_old, unsigned long counters_old,
void *freelist_new, unsigned long counters_new,
const char *n)
{
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
- defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
- if (s->flags & __CMPXCHG_DOUBLE) {
- if (cmpxchg_double(&page->freelist, &page->counters,
- freelist_old, counters_old,
- freelist_new, counters_new))
- return true;
- } else
-#endif
- {
- unsigned long flags;
-
- local_irq_save(flags);
- slab_lock(page);
- if (page->freelist == freelist_old &&
- page->counters == counters_old) {
- page->freelist = freelist_new;
- page->counters = counters_new;
- slab_unlock(page);
- local_irq_restore(flags);
- return true;
- }
- slab_unlock(page);
- local_irq_restore(flags);
- }
-
- cpu_relax();
- stat(s, CMPXCHG_DOUBLE_FAIL);
-
-#ifdef SLUB_DEBUG_CMPXCHG
- pr_info("%s %s: cmpxchg double redo ", n, s->name);
-#endif
+ return ___cmpxchg_double_slab(s, page, freelist_old, counters_old,
+ freelist_new, counters_new, n, false);
+}

- return false;
+static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
+ void *freelist_old, unsigned long counters_old,
+ void *freelist_new, unsigned long counters_new,
+ const char *n)
+{
+ return ___cmpxchg_double_slab(s, page, freelist_old, counters_old,
+ freelist_new, counters_new, n, true);
}

#ifdef CONFIG_SLUB_DEBUG
--
2.32.0

2021-08-05 15:24:48

by Vlastimil Babka

Subject: [PATCH v4 10/35] mm, slub: restructure new page checks in ___slab_alloc()

When we allocate slab object from a newly acquired page (from node's partial
list or page allocator), we usually also retain the page as a new percpu slab.
There are two exceptions - when pfmemalloc status of the page doesn't match our
gfp flags, or when the cache has debugging enabled.

The current code for these decisions is not easy to follow, so restructure it
and add comments. The new structure will also help with the following changes.
No functional change.
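
The restructured decision can be summarized as a small pure function. This is
only an illustrative model: the enum and classify_new_page() are hypothetical
names, and the three booleans stand in for kmem_cache_debug(), the result of
alloc_debug_processing() and pfmemalloc_match().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three outcomes of the new page checks. */
enum new_page_action {
	TRY_NEXT_SLAB,	/* debug checks failed, another slab is needed */
	RETURN_SINGLE,	/* hand out one object, don't keep the page as cpu slab */
	LOAD_FREELIST,	/* keep the page as the new percpu slab */
};

static inline enum new_page_action
classify_new_page(bool cache_debug, bool debug_checks_ok, bool pfmemalloc_ok)
{
	if (cache_debug)
		return debug_checks_ok ? RETURN_SINGLE : TRY_NEXT_SLAB;
	if (!pfmemalloc_ok)
		return RETURN_SINGLE;
	return LOAD_FREELIST;
}

/* Usage example: the common case and the two exceptions. */
static inline void classify_new_page_examples(void)
{
	assert(classify_new_page(false, true, true) == LOAD_FREELIST);
	assert(classify_new_page(false, true, false) == RETURN_SINGLE);
	assert(classify_new_page(true, true, true) == RETURN_SINGLE);
}
```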

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ed18fa3157ad..c32048353645 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2765,13 +2765,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->page = page;

check_new_page:
- if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
- goto load_freelist;

- /* Only entered in the debug case */
- if (kmem_cache_debug(s) &&
- !alloc_debug_processing(s, page, freelist, addr))
- goto new_slab; /* Slab failed checks. Next slab needed */
+ if (kmem_cache_debug(s)) {
+ if (!alloc_debug_processing(s, page, freelist, addr))
+ /* Slab failed checks. Next slab needed */
+ goto new_slab;
+ else
+ /*
+ * For debug case, we don't load freelist so that all
+ * allocations go through alloc_debug_processing()
+ */
+ goto return_single;
+ }
+
+ if (unlikely(!pfmemalloc_match(page, gfpflags)))
+ /*
+ * For !pfmemalloc_match() case we don't load freelist so that
+ * we don't make further mismatched allocations easier.
+ */
+ goto return_single;
+
+ goto load_freelist;
+
+return_single:

deactivate_slab(s, page, get_freepointer(s, freelist), c);
return freelist;
--
2.32.0

2021-08-05 15:24:58

by Vlastimil Babka

Subject: [PATCH v4 07/35] mm, slub: extract get_partial() from new_slab_objects()

The later patches will need more fine-grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first
preparatory step with no functional change.

This adds a goto label that appears unnecessary at this point, but will be
useful for later changes.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---
mm/slub.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 15edf260632e..25f102acf5b4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2596,17 +2596,12 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
int node, struct kmem_cache_cpu **pc)
{
- void *freelist;
+ void *freelist = NULL;
struct kmem_cache_cpu *c = *pc;
struct page *page;

WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));

- freelist = get_partial(s, flags, node, c);
-
- if (freelist)
- return freelist;
-
page = new_slab(s, flags, node);
if (page) {
c = raw_cpu_ptr(s->cpu_slab);
@@ -2770,6 +2765,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}

+ freelist = get_partial(s, gfpflags, node, c);
+ if (freelist)
+ goto check_new_page;
+
freelist = new_slab_objects(s, gfpflags, node, &c);

if (unlikely(!freelist)) {
@@ -2777,6 +2776,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return NULL;
}

+check_new_page:
page = c->page;
if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
goto load_freelist;
--
2.32.0

2021-08-05 15:25:24

by Vlastimil Babka

Subject: [PATCH v4 12/35] mm, slub: move disabling/enabling irqs to ___slab_alloc()

Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This
includes cases where this is not needed, such as when the allocation ends up in
the page allocator and has to awkwardly enable irqs back based on gfp flags.
Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when
it hits the __slab_alloc() slow path, and long periods with disabled interrupts
are undesirable.

As a first step towards reducing irq disabled periods, move irq handling into
___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer
from becoming invalid via get_cpu_ptr(), thus preempt_disable(). This does not
protect against modification by an irq handler, which is still done by disabled
irq for most of ___slab_alloc(). As a small immediate benefit,
slab_out_of_memory() from ___slab_alloc() is now called with irqs enabled.

kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them
before calling ___slab_alloc(), which then disables them at its discretion. The
whole kmem_cache_alloc_bulk() operation also disables preemption.

When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
preemption, because new_slab() will re-enable interrupts in contexts that allow
blocking (this will be improved by later patches).

The patch itself will thus increase overhead a bit due to disabled preemption
(on configs where it matters) and increased disabling/enabling irqs in
kmem_cache_alloc_bulk(), but that will be gradually improved in the following
patches.

Note in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard to
CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly paired in
all configurations. On configs without involuntary preemption and debugging
the re-read of kmem_cache_cpu pointer is still compiled out as it was before.

[ Mike Galbraith <[email protected]>: Fix kmem_cache_alloc_bulk() error path ]
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e2d803b7e3b5..db03cb972c11 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2653,7 +2653,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
* we need to allocate a new slab. This is the slowest path since it involves
* a call to the page allocator and the setup of a new slab.
*
- * Version of __slab_alloc to use when we know that interrupts are
+ * Version of __slab_alloc to use when we know that preemption is
* already disabled (which is the case for bulk allocation).
*/
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
@@ -2661,9 +2661,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
{
void *freelist;
struct page *page;
+ unsigned long flags;

stat(s, ALLOC_SLOWPATH);

+ local_irq_save(flags);
page = c->page;
if (!page) {
/*
@@ -2726,6 +2728,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
VM_BUG_ON(!c->page->frozen);
c->freelist = get_freepointer(s, freelist);
c->tid = next_tid(c->tid);
+ local_irq_restore(flags);
return freelist;

new_slab:
@@ -2743,14 +2746,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto check_new_page;
}

+ put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
+ c = get_cpu_ptr(s->cpu_slab);

if (unlikely(!page)) {
+ local_irq_restore(flags);
slab_out_of_memory(s, gfpflags, node);
return NULL;
}

- c = raw_cpu_ptr(s->cpu_slab);
if (c->page)
flush_slab(s, c);

@@ -2790,31 +2795,33 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return_single:

deactivate_slab(s, page, get_freepointer(s, freelist), c);
+ local_irq_restore(flags);
return freelist;
}

/*
- * Another one that disabled interrupt and compensates for possible
- * cpu changes by refetching the per cpu area pointer.
+ * A wrapper for ___slab_alloc() for contexts where preemption is not yet
+ * disabled. Compensates for possible cpu changes by refetching the per cpu area
+ * pointer.
*/
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
unsigned long addr, struct kmem_cache_cpu *c)
{
void *p;
- unsigned long flags;

- local_irq_save(flags);
-#ifdef CONFIG_PREEMPTION
+#ifdef CONFIG_PREEMPT_COUNT
/*
* We may have been preempted and rescheduled on a different
- * cpu before disabling interrupts. Need to reload cpu area
+ * cpu before disabling preemption. Need to reload cpu area
* pointer.
*/
- c = this_cpu_ptr(s->cpu_slab);
+ c = get_cpu_ptr(s->cpu_slab);
#endif

p = ___slab_alloc(s, gfpflags, node, addr, c);
- local_irq_restore(flags);
+#ifdef CONFIG_PREEMPT_COUNT
+ put_cpu_ptr(s->cpu_slab);
+#endif
return p;
}

@@ -3342,8 +3349,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* IRQs, which protects against PREEMPT and interrupts
* handlers invoking normal fastpath.
*/
+ c = get_cpu_ptr(s->cpu_slab);
local_irq_disable();
- c = this_cpu_ptr(s->cpu_slab);

for (i = 0; i < size; i++) {
void *object = kfence_alloc(s, s->object_size, flags);
@@ -3364,6 +3371,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
*/
c->tid = next_tid(c->tid);

+ local_irq_enable();
+
/*
* Invoking slow path likely have side-effect
* of re-populating per CPU c->freelist
@@ -3376,6 +3385,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
c = this_cpu_ptr(s->cpu_slab);
maybe_wipe_obj_freeptr(s, p[i]);

+ local_irq_disable();
+
continue; /* goto for-loop */
}
c->freelist = get_freepointer(s, object);
@@ -3384,6 +3395,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
}
c->tid = next_tid(c->tid);
local_irq_enable();
+ put_cpu_ptr(s->cpu_slab);

/*
* memcg and kmem_cache debug support and memory initialization.
@@ -3393,7 +3405,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
slab_want_init_on_alloc(flags, s));
return i;
error:
- local_irq_enable();
+ put_cpu_ptr(s->cpu_slab);
slab_post_alloc_hook(s, objcg, flags, i, p, false);
__kmem_cache_free_bulk(s, i, p);
return 0;
--
2.32.0

2021-08-05 15:26:26

by Vlastimil Babka

Subject: [PATCH v4 15/35] mm, slub: restore irqs around calling new_slab()

allocate_slab() currently re-enables irqs before calling the page allocator.
It depends on gfpflags_allow_blocking() to determine if it's safe to do so.
Now we can instead simply restore irqs before calling it through new_slab().
The other caller early_kmem_cache_node_alloc() is unaffected by this.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 868a782b6f62..9350ff5110a0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1792,9 +1792,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)

flags &= gfp_allowed_mask;

- if (gfpflags_allow_blocking(flags))
- local_irq_enable();
-
flags |= s->allocflags;

/*
@@ -1853,8 +1850,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
page->frozen = 1;

out:
- if (gfpflags_allow_blocking(flags))
- local_irq_disable();
if (!page)
return NULL;

@@ -2782,16 +2777,17 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto check_new_page;
}

+ local_irq_restore(flags);
put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
c = get_cpu_ptr(s->cpu_slab);

if (unlikely(!page)) {
- local_irq_restore(flags);
slab_out_of_memory(s, gfpflags, node);
return NULL;
}

+ local_irq_save(flags);
if (c->page)
flush_slab(s, c);

--
2.32.0

2021-08-05 15:26:28

by Vlastimil Babka

Subject: [PATCH v4 13/35] mm, slub: do initial checks in ___slab_alloc() with irqs enabled

As another step of shortening irq disabled sections in ___slab_alloc(), delay
disabling irqs until we pass the initial checks if there is a cached percpu
slab and it's suitable for our allocation.

Now we have to recheck c->page after actually disabling irqs, as an allocation
in an irq handler might have replaced it.
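
The check, disable, recheck and retry pattern can be modeled in user space.
This is a sketch under stated assumptions, not the kernel code: a flag stands
in for the irq state, irq_hook is a hypothetical hook that simulates one
interrupt arriving while irqs are still enabled, and take_freelist() is a
made-up name for the part of ___slab_alloc() being modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cpu_cache {
	void *page;
	void *freelist;
};

static bool irqs_off;	/* hypothetical model of the irq state */
static int retries;	/* how often the recheck failed */

/* Simulated irq handler; fires at most once, only while irqs are enabled. */
static void (*irq_hook)(struct cpu_cache *c);

static inline void maybe_irq(struct cpu_cache *c)
{
	if (!irqs_off && irq_hook) {
		void (*handler)(struct cpu_cache *) = irq_hook;

		irq_hook = NULL;
		handler(c);
	}
}

/* Example handler: an allocation in irq context replaced the cpu slab. */
static char other_page, other_objects;

static inline void replace_page(struct cpu_cache *c)
{
	c->page = &other_page;
	c->freelist = &other_objects;
}

static inline void *take_freelist(struct cpu_cache *c)
{
	void *page, *freelist;

reread_page:
	page = c->page;		/* initial check with irqs enabled */
	if (!page)
		return NULL;
	maybe_irq(c);		/* an interrupt may replace c->page here */

	irqs_off = true;	/* models local_irq_save() */
	if (page != c->page) {	/* recheck now that irqs are disabled */
		irqs_off = false;
		retries++;
		goto reread_page;
	}
	assert(irqs_off);	/* models lockdep_assert_irqs_disabled() */
	freelist = c->freelist;
	c->freelist = NULL;
	irqs_off = false;	/* models local_irq_restore() */
	return freelist;
}
```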

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 41 ++++++++++++++++++++++++++++++++---------
1 file changed, 32 insertions(+), 9 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index db03cb972c11..7eb06fe9d7a0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2665,8 +2665,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

stat(s, ALLOC_SLOWPATH);

- local_irq_save(flags);
- page = c->page;
+reread_page:
+
+ page = READ_ONCE(c->page);
if (!page) {
/*
* if the node is not online or has no normal memory, just
@@ -2675,6 +2676,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(node != NUMA_NO_NODE &&
!node_isset(node, slab_nodes)))
node = NUMA_NO_NODE;
+ local_irq_save(flags);
+ if (unlikely(c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
goto new_slab;
}
redo:
@@ -2689,8 +2695,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
} else {
stat(s, ALLOC_NODE_MISMATCH);
- deactivate_slab(s, page, c->freelist, c);
- goto new_slab;
+ goto deactivate_slab;
}
}

@@ -2699,12 +2704,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* PFMEMALLOC but right now, we are losing the pfmemalloc
* information when the page leaves the per-cpu allocator
*/
- if (unlikely(!pfmemalloc_match(page, gfpflags))) {
- deactivate_slab(s, page, c->freelist, c);
- goto new_slab;
- }
+ if (unlikely(!pfmemalloc_match(page, gfpflags)))
+ goto deactivate_slab;

- /* must check again c->freelist in case of cpu migration or IRQ */
+ /* must check again c->page in case IRQ handler changed it */
+ local_irq_save(flags);
+ if (unlikely(page != c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
freelist = c->freelist;
if (freelist)
goto load_freelist;
@@ -2720,6 +2728,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
stat(s, ALLOC_REFILL);

load_freelist:
+
+ lockdep_assert_irqs_disabled();
+
/*
* freelist is pointing to the list of objects to be used.
* page is pointing to the page from which the objects are obtained.
@@ -2731,11 +2742,23 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
local_irq_restore(flags);
return freelist;

+deactivate_slab:
+
+ local_irq_save(flags);
+ if (page != c->page) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
+ deactivate_slab(s, page, c->freelist, c);
+
new_slab:

+ lockdep_assert_irqs_disabled();
+
if (slub_percpu_partial(c)) {
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
+ local_irq_restore(flags);
stat(s, CPU_PARTIAL_ALLOC);
goto redo;
}
--
2.32.0

2021-08-05 15:26:36

by Vlastimil Babka

Subject: [PATCH v4 11/35] mm, slub: simplify kmem_cache_cpu and tid setup

In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee that
our kmem_cache_cpu pointer is from the same cpu as the tid value. Currently
that's done by reading the tid first using this_cpu_read(), then the
kmem_cache_cpu pointer and verifying we read the same tid using the pointer and
plain READ_ONCE().

This can be simplified to just fetching kmem_cache_cpu pointer and then reading
tid using the pointer. That guarantees they are from the same cpu. We don't
need to read the tid using this_cpu_read() because the value will be validated
by this_cpu_cmpxchg_double(), making sure we are on the correct cpu and the
freelist wasn't changed by anyone preempting us since reading the tid.
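
The role of the tid can be illustrated with a user-space model. This is a
sketch under stated assumptions: a single 64-bit C11 atomic packs the tid and
the freelist head and stands in for this_cpu_cmpxchg_double(); the names
(cpu_cache, pack(), alloc_fast()) are hypothetical; and the "same cpu" part of
the kernel guarantee is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NOBJ 4u
#define END  UINT32_MAX	/* empty freelist marker */

struct cpu_cache {
	_Atomic uint64_t state;	/* (tid << 32) | index of first free object */
	uint32_t next[NOBJ];	/* per-object "free pointer" */
};

static inline uint64_t pack(uint32_t tid, uint32_t head)
{
	return ((uint64_t)tid << 32) | head;
}

static inline void cache_init(struct cpu_cache *c)
{
	for (uint32_t i = 0; i < NOBJ; i++)
		c->next[i] = (i + 1 < NOBJ) ? i + 1 : END;
	atomic_store(&c->state, pack(0, 0));
}

/* Returns the index of the allocated object, or END if the list is empty. */
static inline uint32_t alloc_fast(struct cpu_cache *c)
{
	for (;;) {
		/* Both values are read through the same pointer, separately. */
		uint32_t tid = (uint32_t)(atomic_load(&c->state) >> 32);
		uint32_t head = (uint32_t)atomic_load(&c->state);
		uint64_t expected = pack(tid, head);

		if (head == END)
			return END;	/* a slow path would refill here */
		assert(head < NOBJ);

		/*
		 * Succeeds only if neither tid nor freelist changed since
		 * they were read; any interleaved update bumps the tid and
		 * makes this fail, so the loop retries with fresh values.
		 */
		if (atomic_compare_exchange_strong(&c->state, &expected,
						   pack(tid + 1, c->next[head])))
			return head;
	}
}
```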

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c32048353645..e2d803b7e3b5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2865,15 +2865,14 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
* reading from one cpu area. That does not matter as long
* as we end up on the original cpu again when doing the cmpxchg.
*
- * We should guarantee that tid and kmem_cache are retrieved on
- * the same cpu. It could be different if CONFIG_PREEMPTION so we need
- * to check if it is matched or not.
+ * We must guarantee that tid and kmem_cache_cpu are retrieved on the
+ * same cpu. We read first the kmem_cache_cpu pointer and use it to read
+ * the tid. If we are preempted and switched to another cpu between the
+ * two reads, it's OK as the two are still associated with the same cpu
+ * and cmpxchg later will validate the cpu.
*/
- do {
- tid = this_cpu_read(s->cpu_slab->tid);
- c = raw_cpu_ptr(s->cpu_slab);
- } while (IS_ENABLED(CONFIG_PREEMPTION) &&
- unlikely(tid != READ_ONCE(c->tid)));
+ c = raw_cpu_ptr(s->cpu_slab);
+ tid = READ_ONCE(c->tid);

/*
* Irqless object alloc/free algorithm used here depends on sequence
@@ -3147,11 +3146,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
* data is retrieved via this pointer. If we are on the same cpu
* during the cmpxchg then the free will succeed.
*/
- do {
- tid = this_cpu_read(s->cpu_slab->tid);
- c = raw_cpu_ptr(s->cpu_slab);
- } while (IS_ENABLED(CONFIG_PREEMPTION) &&
- unlikely(tid != READ_ONCE(c->tid)));
+ c = raw_cpu_ptr(s->cpu_slab);
+ tid = READ_ONCE(c->tid);

/* Same with comment on barrier() in slab_alloc_node() */
barrier();
--
2.32.0

2021-08-05 15:26:36

by Vlastimil Babka

Subject: [PATCH v4 14/35] mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()

Continue reducing the irq disabled scope. Check for per-cpu partial slabs
first with irqs enabled and then recheck with irqs disabled before grabbing
the slab page. Mostly preparatory for the following patches.
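
A rough userspace sketch of the check-then-recheck pattern follows. It is a
simplified model under stated assumptions, not the code from this patch:
struct model_cpu, the model_irq_*() helpers and model_take_partial() are
hypothetical names, pages are plain ints, and every path restores "irqs"
before returning, whereas the patch keeps irqs disabled when it jumps to
new_objects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_cpu {
	int page;	/* current cpu slab, 0 means none */
	int partial;	/* head of the percpu partial list, 0 means empty */
};

enum model_result { MODEL_TOOK_PARTIAL, MODEL_REREAD_PAGE, MODEL_NEW_OBJECTS };

/* Hypothetical hook: a simulated IRQ handler that fires once while irqs are on. */
static void (*model_pending_irq)(struct model_cpu *c);

static void model_irq_steal_partial(struct model_cpu *c) { c->partial = 0; }
static void model_irq_install_page(struct model_cpu *c) { c->page = 7; }

/* Run the pending simulated interrupt, if any. */
static void model_irq_window(struct model_cpu *c)
{
	void (*handler)(struct model_cpu *) = model_pending_irq;

	model_pending_irq = NULL;
	if (handler)
		handler(c);
}

static enum model_result model_take_partial(struct model_cpu *c)
{
	/* Cheap check with irqs enabled; the answer may become stale. */
	if (!c->partial)
		return MODEL_NEW_OBJECTS;

	model_irq_window(c);	/* an interrupt may fire before irqs go off */

	/* local_irq_save() would go here: from now on the state is stable. */
	if (c->page)
		return MODEL_REREAD_PAGE;	/* a handler installed a cpu slab */
	if (!c->partial)
		return MODEL_NEW_OBJECTS;	/* a handler emptied the list */

	c->page = c->partial;
	c->partial = 0;
	/* local_irq_restore() would go here. */
	return MODEL_TOOK_PARTIAL;
}
```

The point of the sketch is that the unlocked check only decides whether it is
worth disabling irqs at all; correctness comes from the second check.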

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 7eb06fe9d7a0..868a782b6f62 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2676,11 +2676,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(node != NUMA_NO_NODE &&
!node_isset(node, slab_nodes)))
node = NUMA_NO_NODE;
- local_irq_save(flags);
- if (unlikely(c->page)) {
- local_irq_restore(flags);
- goto reread_page;
- }
goto new_slab;
}
redo:
@@ -2721,6 +2716,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

if (!freelist) {
c->page = NULL;
+ local_irq_restore(flags);
stat(s, DEACTIVATE_BYPASS);
goto new_slab;
}
@@ -2750,12 +2746,19 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto reread_page;
}
deactivate_slab(s, page, c->freelist, c);
+ local_irq_restore(flags);

new_slab:

- lockdep_assert_irqs_disabled();
-
if (slub_percpu_partial(c)) {
+ local_irq_save(flags);
+ if (unlikely(c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
+ if (unlikely(!slub_percpu_partial(c)))
+ goto new_objects; /* stolen by an IRQ handler */
+
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
local_irq_restore(flags);
@@ -2763,6 +2766,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}

+ local_irq_save(flags);
+ if (unlikely(c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
+
+new_objects:
+
+ lockdep_assert_irqs_disabled();
+
freelist = get_partial(s, gfpflags, node, &page);
if (freelist) {
c->page = page;
@@ -2795,15 +2808,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
check_new_page:

if (kmem_cache_debug(s)) {
- if (!alloc_debug_processing(s, page, freelist, addr))
+ if (!alloc_debug_processing(s, page, freelist, addr)) {
/* Slab failed checks. Next slab needed */
+ c->page = NULL;
+ local_irq_restore(flags);
goto new_slab;
- else
+ } else {
/*
* For debug case, we don't load freelist so that all
* allocations go through alloc_debug_processing()
*/
goto return_single;
+ }
}

if (unlikely(!pfmemalloc_match(page, gfpflags)))
--
2.32.0

2021-08-05 15:26:47

by Vlastimil Babka

Subject: [PATCH v4 21/35] mm, slub: call deactivate_slab() without disabling irqs

The function is now safe to be called with irqs enabled, so move the calls
outside of irq disabled sections.

When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so to
reenable them before deactivate_slab() we need to open-code flush_slab() in
___slab_alloc() and reenable irqs after modifying the kmem_cache_cpu fields.
But that means an IRQ handler meanwhile might have assigned a new page to
kmem_cache_cpu.page so we have to retry the whole check.
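
The retry described above can be modelled in userspace roughly as follows.
This is a sketch under assumptions, not the patch itself: struct model_cpu,
model_deactivate_slab() and model_install_page() are hypothetical, pages are
ints, and the "IRQ handler" is simulated by a counter that installs a new cpu
slab while the model has irqs enabled.

```c
#include <assert.h>

struct model_cpu {
	int page;		/* current cpu slab, 0 means none */
	int freelist;		/* stand-in for the cpu freelist */
	unsigned long tid;
};

static int model_deactivated;		/* how many slabs were deactivated */
static int model_irq_installs_left;	/* simulated handlers still to fire */

/* Runs with irqs enabled, so a simulated handler may install a new slab. */
static void model_deactivate_slab(struct model_cpu *c, int page, int freelist)
{
	(void)page;
	(void)freelist;
	model_deactivated++;
	if (model_irq_installs_left > 0) {
		model_irq_installs_left--;
		c->page = 100 + model_irq_installs_left;
		c->freelist = 1;
	}
}

static void model_install_page(struct model_cpu *c, int page)
{
	for (;;) {
		int flush_page, flush_freelist;

		/* local_irq_save() would go here. */
		if (!c->page)
			break;	/* nothing to flush, irqs stay disabled */

		/* Detach the old cpu slab while irqs are disabled ... */
		flush_page = c->page;
		flush_freelist = c->freelist;
		c->page = 0;
		c->freelist = 0;
		c->tid++;
		/* local_irq_restore() would go here. */

		/* ... and deactivate it with irqs enabled, then recheck. */
		model_deactivate_slab(c, flush_page, flush_freelist);
	}
	c->page = page;
}
```

With one simulated handler pending, the loop flushes twice (the original slab
and the one the handler installed) before the new page is finally assigned.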

The remaining callers of flush_slab() are the IPI handler which has disabled
irqs anyway, and slub_cpu_dead() which will be dealt with in the following
patch.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 27c98f4fb3f5..8de4ead2dbf3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2749,8 +2749,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
freelist = c->freelist;
c->page = NULL;
c->freelist = NULL;
- deactivate_slab(s, page, freelist);
local_irq_restore(flags);
+ deactivate_slab(s, page, freelist);

new_slab:

@@ -2818,18 +2818,32 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
goto return_single;

+retry_load_page:
+
local_irq_save(flags);
- if (unlikely(c->page))
- flush_slab(s, c);
+ if (unlikely(c->page)) {
+ void *flush_freelist = c->freelist;
+ struct page *flush_page = c->page;
+
+ c->page = NULL;
+ c->freelist = NULL;
+ c->tid = next_tid(c->tid);
+
+ local_irq_restore(flags);
+
+ deactivate_slab(s, flush_page, flush_freelist);
+
+ stat(s, CPUSLAB_FLUSH);
+
+ goto retry_load_page;
+ }
c->page = page;

goto load_freelist;

return_single:

- local_irq_save(flags);
deactivate_slab(s, page, get_freepointer(s, freelist));
- local_irq_restore(flags);
return freelist;
}

--
2.32.0

2021-08-05 15:26:54

by Vlastimil Babka

Subject: [PATCH v4 20/35] mm, slub: make locking in deactivate_slab() irq-safe

deactivate_slab() no longer touches the kmem_cache_cpu structure, so it will
be possible to call it with irqs enabled. Just convert the spin_lock calls to
their irq saving/restoring variants to make it irq-safe.

Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(), because
in some situations we don't take the list_lock, which would disable irqs.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 10fd2eaf8125..27c98f4fb3f5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2206,6 +2206,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
enum slab_modes l = M_NONE, m = M_NONE;
void *nextfree, *freelist_iter, *freelist_tail;
int tail = DEACTIVATE_TO_HEAD;
+ unsigned long flags = 0;
struct page new;
struct page old;

@@ -2281,7 +2282,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
* that acquire_slab() will see a slab page that
* is frozen
*/
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
}
} else {
m = M_FULL;
@@ -2292,7 +2293,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
* slabs from diagnostic functions will not see
* any frozen slabs.
*/
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
}
}

@@ -2309,14 +2310,14 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
}

l = m;
- if (!__cmpxchg_double_slab(s, page,
+ if (!cmpxchg_double_slab(s, page,
old.freelist, old.counters,
new.freelist, new.counters,
"unfreezing slab"))
goto redo;

if (lock)
- spin_unlock(&n->list_lock);
+ spin_unlock_irqrestore(&n->list_lock, flags);

if (m == M_PARTIAL)
stat(s, tail);
--
2.32.0

2021-08-05 15:27:03

by Vlastimil Babka

Subject: [PATCH v4 27/35] mm, slub: don't disable irqs in slub_cpu_dead()

slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls only
functions that are now irq safe, so we don't need to disable irqs anymore.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b8581a9b58cc..c10f2c9b9352 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2537,14 +2537,10 @@ static void flush_all(struct kmem_cache *s)
static int slub_cpu_dead(unsigned int cpu)
{
struct kmem_cache *s;
- unsigned long flags;

mutex_lock(&slab_mutex);
- list_for_each_entry(s, &slab_caches, list) {
- local_irq_save(flags);
+ list_for_each_entry(s, &slab_caches, list)
__flush_cpu_slab(s, cpu);
- local_irq_restore(flags);
- }
mutex_unlock(&slab_mutex);
return 0;
}
--
2.32.0

2021-08-05 15:27:04

by Vlastimil Babka

Subject: [PATCH v4 17/35] mm, slub: check new pages with restored irqs

Building on top of the previous patch, re-enable irqs before checking new
pages. alloc_debug_processing() is now called with enabled irqs so we need to
remove VM_BUG_ON(!irqs_disabled()); in check_slab() - there doesn't seem to be
a need for it anyway.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 5d58fde2bd70..f7c6cebb524d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -995,8 +995,6 @@ static int check_slab(struct kmem_cache *s, struct page *page)
{
int maxobj;

- VM_BUG_ON(!irqs_disabled());
-
if (!PageSlab(page)) {
slab_err(s, page, "Not a valid slab page");
return 0;
@@ -2772,10 +2770,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
lockdep_assert_irqs_disabled();

freelist = get_partial(s, gfpflags, node, &page);
+ local_irq_restore(flags);
if (freelist)
goto check_new_page;

- local_irq_restore(flags);
put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
c = get_cpu_ptr(s->cpu_slab);
@@ -2785,7 +2783,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return NULL;
}

- local_irq_save(flags);
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
@@ -2800,7 +2797,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (kmem_cache_debug(s)) {
if (!alloc_debug_processing(s, page, freelist, addr)) {
/* Slab failed checks. Next slab needed */
- local_irq_restore(flags);
goto new_slab;
} else {
/*
@@ -2818,6 +2814,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
goto return_single;

+ local_irq_save(flags);
if (unlikely(c->page))
flush_slab(s, c);
c->page = page;
@@ -2826,6 +2823,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

return_single:

+ local_irq_save(flags);
if (unlikely(c->page))
flush_slab(s, c);
c->page = page;
--
2.32.0

2021-08-05 15:27:06

by Vlastimil Babka

Subject: [PATCH v4 25/35] mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing

Unfreezing partial list can be split into two phases - detaching the list from
struct kmem_cache_cpu, and processing the list. The whole operation does not
need to be protected by disabled irqs. Restructure the code to separate the
detaching (with disabled irqs) and unfreezing (with irq disabling to be reduced
in the next patch).

Also, unfreeze_partials() can be called from another cpu on behalf of a cpu
that is being offlined, where disabling irqs on the local cpu makes no sense, so
restructure the code as follows:

- __unfreeze_partials() is the bulk of unfreeze_partials() that processes the
detached percpu partial list
- unfreeze_partials() detaches list from current cpu with irqs disabled and
calls __unfreeze_partials()
- unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no
irq disabling, and is called from __flush_cpu_slab()
- flush_cpu_slab() is for the local cpu thus it needs to call
unfreeze_partials(). So it can't simply call
__flush_cpu_slab(smp_processor_id()) anymore and we have to open-code the
proper calls.
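
The detach/process split listed above can be sketched in userspace as below.
This is an illustrative model only: struct model_page, struct model_cpu_slab,
model_build_list(), model_detach_partials() and model_unfreeze_list() are
hypothetical names, and the irq disabling is reduced to comments.

```c
#include <assert.h>
#include <stddef.h>

struct model_page {
	struct model_page *next;
};

struct model_cpu_slab {
	struct model_page *partial;	/* head of the percpu partial list */
};

/* Helper for the example: link an array of pages into a list. */
static struct model_page *model_build_list(struct model_page *pages, int n)
{
	int i;

	for (i = 0; i < n; i++)
		pages[i].next = (i + 1 < n) ? &pages[i + 1] : NULL;
	return n ? &pages[0] : NULL;
}

/* Phase 1: detach the whole list; this is the only part touching *c. */
static struct model_page *model_detach_partials(struct model_cpu_slab *c)
{
	struct model_page *list;

	/* local_irq_save() would go here. */
	list = c->partial;
	c->partial = NULL;
	/* local_irq_restore() would go here. */
	return list;
}

/* Phase 2: process the now private list; returns the number of pages. */
static int model_unfreeze_list(struct model_page *list)
{
	int count = 0;

	while (list) {
		struct model_page *page = list;

		list = list->next;
		page->next = NULL;	/* stand-in for the real unfreezing */
		count++;
	}
	return count;
}
```

Because phase 2 works on a list nobody else can reach, the same function can
serve both the local cpu and a cpu that is being offlined.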

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 73 ++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 51 insertions(+), 22 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 984173ce8465..c8637150bf28 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2330,25 +2330,15 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
}
}

-/*
- * Unfreeze all the cpu partial slabs.
- *
- * This function must be called with preemption or migration
- * disabled with c local to the cpu.
- */
-static void unfreeze_partials(struct kmem_cache *s,
- struct kmem_cache_cpu *c)
-{
#ifdef CONFIG_SLUB_CPU_PARTIAL
+static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
+{
struct kmem_cache_node *n = NULL, *n2 = NULL;
- struct page *page, *partial_page, *discard_page = NULL;
+ struct page *page, *discard_page = NULL;
unsigned long flags;

local_irq_save(flags);

- partial_page = slub_percpu_partial(c);
- c->partial = NULL;
-
while (partial_page) {
struct page new;
struct page old;
@@ -2403,10 +2393,45 @@ static void unfreeze_partials(struct kmem_cache *s,
discard_slab(s, page);
stat(s, FREE_SLAB);
}
+}

-#endif /* CONFIG_SLUB_CPU_PARTIAL */
+/*
+ * Unfreeze all the cpu partial slabs.
+ */
+static void unfreeze_partials(struct kmem_cache *s)
+{
+ struct page *partial_page;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ partial_page = this_cpu_read(s->cpu_slab->partial);
+ this_cpu_write(s->cpu_slab->partial, NULL);
+ local_irq_restore(flags);
+
+ if (partial_page)
+ __unfreeze_partials(s, partial_page);
}

+static void unfreeze_partials_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
+{
+ struct page *partial_page;
+
+ partial_page = slub_percpu_partial(c);
+ c->partial = NULL;
+
+ if (partial_page)
+ __unfreeze_partials(s, partial_page);
+}
+
+#else /* CONFIG_SLUB_CPU_PARTIAL */
+
+static inline void unfreeze_partials(struct kmem_cache *s) { }
+static inline void unfreeze_partials_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c) { }
+
+#endif /* CONFIG_SLUB_CPU_PARTIAL */
+
/*
* Put a page that was just frozen (in __slab_free|get_partial_node) into a
* partial page slot if available.
@@ -2435,7 +2460,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
* partial array is full. Move the existing
* set to the per node partial list.
*/
- unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+ unfreeze_partials(s);
oldpage = NULL;
pobjects = 0;
pages = 0;
@@ -2470,11 +2495,6 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
stat(s, CPUSLAB_FLUSH);
}

-/*
- * Flush cpu slab.
- *
- * Called from IPI handler with interrupts disabled.
- */
static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
{
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
@@ -2482,14 +2502,23 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
if (c->page)
flush_slab(s, c);

- unfreeze_partials(s, c);
+ unfreeze_partials_cpu(s, c);
}

+/*
+ * Flush cpu slab.
+ *
+ * Called from IPI handler with interrupts disabled.
+ */
static void flush_cpu_slab(void *d)
{
struct kmem_cache *s = d;
+ struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);

- __flush_cpu_slab(s, smp_processor_id());
+ if (c->page)
+ flush_slab(s, c);
+
+ unfreeze_partials(s);
}

static bool has_cpu_slab(int cpu, void *info)
--
2.32.0

2021-08-05 15:27:10

by Vlastimil Babka

Subject: [PATCH v4 22/35] mm, slub: move irq control into unfreeze_partials()

unfreeze_partials() can be optimized so that it doesn't need irqs disabled for
the whole time. As the first step, move irq control into the function and
remove it from the put_cpu_partial() caller.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8de4ead2dbf3..51f8d83d3ea8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2333,9 +2333,8 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
/*
* Unfreeze all the cpu partial slabs.
*
- * This function must be called with interrupts disabled
- * for the cpu using c (or some other guarantee must be there
- * to guarantee no concurrent accesses).
+ * This function must be called with preemption or migration
+ * disabled with c local to the cpu.
*/
static void unfreeze_partials(struct kmem_cache *s,
struct kmem_cache_cpu *c)
@@ -2343,6 +2342,9 @@ static void unfreeze_partials(struct kmem_cache *s,
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct kmem_cache_node *n = NULL, *n2 = NULL;
struct page *page, *discard_page = NULL;
+ unsigned long flags;
+
+ local_irq_save(flags);

while ((page = slub_percpu_partial(c))) {
struct page new;
@@ -2395,6 +2397,8 @@ static void unfreeze_partials(struct kmem_cache *s,
discard_slab(s, page);
stat(s, FREE_SLAB);
}
+
+ local_irq_restore(flags);
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}

@@ -2422,14 +2426,11 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
pobjects = oldpage->pobjects;
pages = oldpage->pages;
if (drain && pobjects > slub_cpu_partial(s)) {
- unsigned long flags;
/*
* partial array is full. Move the existing
* set to the per node partial list.
*/
- local_irq_save(flags);
unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
- local_irq_restore(flags);
oldpage = NULL;
pobjects = 0;
pages = 0;
--
2.32.0

2021-08-05 15:27:19

by Vlastimil Babka

Subject: [PATCH v4 19/35] mm, slub: move reset of c->page and freelist out of deactivate_slab()

deactivate_slab() removes the cpu slab by merging the cpu freelist with slab's
freelist and putting the slab on the proper node's list. It also sets the
respective kmem_cache_cpu pointers to NULL.

By extracting the kmem_cache_cpu operations from the function, we can make it
not dependent on disabled irqs.

Also if we return a single free pointer from ___slab_alloc, we no longer have
to assign kmem_cache_cpu.page before deactivation or care if somebody preempted
us and assigned a different page to our kmem_cache_cpu in the process.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 31 ++++++++++++++++++-------------
1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b4a62aa00ae2..10fd2eaf8125 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2192,10 +2192,13 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
}

/*
- * Remove the cpu slab
+ * Finishes removing the cpu slab. Merges cpu's freelist with page's freelist,
+ * unfreezes the slabs and puts it on the proper list.
+ * Assumes the slab has been already safely taken away from kmem_cache_cpu
+ * by the caller.
*/
static void deactivate_slab(struct kmem_cache *s, struct page *page,
- void *freelist, struct kmem_cache_cpu *c)
+ void *freelist)
{
enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
struct kmem_cache_node *n = get_node(s, page_to_nid(page));
@@ -2324,9 +2327,6 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
discard_slab(s, page);
stat(s, FREE_SLAB);
}
-
- c->page = NULL;
- c->freelist = NULL;
}

/*
@@ -2451,10 +2451,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)

static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
- stat(s, CPUSLAB_FLUSH);
- deactivate_slab(s, c->page, c->freelist, c);
+ void *freelist = c->freelist;
+ struct page *page = c->page;

+ c->page = NULL;
+ c->freelist = NULL;
c->tid = next_tid(c->tid);
+
+ deactivate_slab(s, page, freelist);
+
+ stat(s, CPUSLAB_FLUSH);
}

/*
@@ -2739,7 +2745,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
local_irq_restore(flags);
goto reread_page;
}
- deactivate_slab(s, page, c->freelist, c);
+ freelist = c->freelist;
+ c->page = NULL;
+ c->freelist = NULL;
+ deactivate_slab(s, page, freelist);
local_irq_restore(flags);

new_slab:
@@ -2818,11 +2827,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return_single:

local_irq_save(flags);
- if (unlikely(c->page))
- flush_slab(s, c);
- c->page = page;
-
- deactivate_slab(s, page, get_freepointer(s, freelist), c);
+ deactivate_slab(s, page, get_freepointer(s, freelist));
local_irq_restore(flags);
return freelist;
}
--
2.32.0

2021-08-05 15:27:21

by Vlastimil Babka

Subject: [PATCH v4 31/35] mm, slub: optionally save/restore irqs in slab_[un]lock()

For PREEMPT_RT we will need to disable irqs for this bit spinlock. As a
preparation, add a flags parameter, and an internal version that takes
an additional bool parameter to control irq saving/restoring (the flags
parameter is compile-time unused if the bool is a constant false).

Convert ___cmpxchg_double_slab(), which also comes with the same bool
parameter, to use the internal version.
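
As a rough userspace illustration of the wrapper scheme, the sketch below
models the irq state and the page lock as two booleans. All names
(model_irq_save(), model_slab_lock_internal(), model_slab_lock() and so on)
are hypothetical and only mirror the shape of the helpers in this patch.

```c
#include <assert.h>
#include <stdbool.h>

static bool model_irqs_enabled = true;
static bool model_page_locked;

static inline void model_irq_save(unsigned long *flags)
{
	*flags = model_irqs_enabled;	/* remember the previous state */
	model_irqs_enabled = false;
}

static inline void model_irq_restore(unsigned long flags)
{
	model_irqs_enabled = (flags != 0);
}

/* Internal version: the bool decides whether irqs are touched at all. */
static inline void model_slab_lock_internal(unsigned long *flags,
					    bool disable_irqs)
{
	if (disable_irqs)
		model_irq_save(flags);
	model_page_locked = true;	/* stand-in for bit_spin_lock() */
}

static inline void model_slab_unlock_internal(unsigned long *flags,
					      bool disable_irqs)
{
	model_page_locked = false;	/* stand-in for __bit_spin_unlock() */
	if (disable_irqs)
		model_irq_restore(*flags);
}

/* Public wrappers pass a constant false, so flags is never read or written. */
static inline void model_slab_lock(unsigned long *flags)
{
	model_slab_lock_internal(flags, false);
}

static inline void model_slab_unlock(unsigned long *flags)
{
	model_slab_unlock_internal(flags, false);
}
```

Since the wrappers are inline and the bool is a compile-time constant, a
compiler can be expected to drop the irq branches entirely in the false case.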

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 52 +++++++++++++++++++++++++++++++++-------------------
1 file changed, 33 insertions(+), 19 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9cb58d884c58..9208020f72d5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -359,16 +359,33 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
/*
* Per slab locking using the pagelock
*/
-static __always_inline void slab_lock(struct page *page)
+static __always_inline void
+__slab_lock(struct page *page, unsigned long *flags, bool disable_irqs)
{
VM_BUG_ON_PAGE(PageTail(page), page);
+ if (disable_irqs)
+ local_irq_save(*flags);
bit_spin_lock(PG_locked, &page->flags);
}

-static __always_inline void slab_unlock(struct page *page)
+static __always_inline void
+__slab_unlock(struct page *page, unsigned long *flags, bool disable_irqs)
{
VM_BUG_ON_PAGE(PageTail(page), page);
__bit_spin_unlock(PG_locked, &page->flags);
+ if (disable_irqs)
+ local_irq_restore(*flags);
+}
+
+static __always_inline void
+slab_lock(struct page *page, unsigned long *flags)
+{
+ __slab_lock(page, flags, false);
+}
+
+static __always_inline void slab_unlock(struct page *page, unsigned long *flags)
+{
+ __slab_unlock(page, flags, false);
}

static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
@@ -388,23 +405,18 @@ static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *pag
} else
#endif
{
- unsigned long flags;
+ /* init to 0 to prevent spurious warnings */
+ unsigned long flags = 0;

- if (disable_irqs)
- local_irq_save(flags);
- slab_lock(page);
+ __slab_lock(page, &flags, disable_irqs);
if (page->freelist == freelist_old &&
page->counters == counters_old) {
page->freelist = freelist_new;
page->counters = counters_new;
- slab_unlock(page);
- if (disable_irqs)
- local_irq_restore(flags);
+ __slab_unlock(page, &flags, disable_irqs);
return true;
}
- slab_unlock(page);
- if (disable_irqs)
- local_irq_restore(flags);
+ __slab_unlock(page, &flags, disable_irqs);
}

cpu_relax();
@@ -1255,11 +1267,11 @@ static noinline int free_debug_processing(
struct kmem_cache_node *n = get_node(s, page_to_nid(page));
void *object = head;
int cnt = 0;
- unsigned long flags;
+ unsigned long flags, flags2;
int ret = 0;

spin_lock_irqsave(&n->list_lock, flags);
- slab_lock(page);
+ slab_lock(page, &flags2);

if (s->flags & SLAB_CONSISTENCY_CHECKS) {
if (!check_slab(s, page))
@@ -1292,7 +1304,7 @@ static noinline int free_debug_processing(
slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
bulk_cnt, cnt);

- slab_unlock(page);
+ slab_unlock(page, &flags2);
spin_unlock_irqrestore(&n->list_lock, flags);
if (!ret)
slab_fix(s, "Object at 0x%p not freed", object);
@@ -4048,9 +4060,10 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page,
void *addr = page_address(page);
unsigned long *map;
void *p;
+ unsigned long flags;

slab_err(s, page, text, s->name);
- slab_lock(page);
+ slab_lock(page, &flags);

map = get_map(s, page);
for_each_object(p, s, addr, page->objects) {
@@ -4061,7 +4074,7 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page,
}
}
put_map(map);
- slab_unlock(page);
+ slab_unlock(page, &flags);
#endif
}

@@ -4786,8 +4799,9 @@ static void validate_slab(struct kmem_cache *s, struct page *page,
{
void *p;
void *addr = page_address(page);
+ unsigned long flags;

- slab_lock(page);
+ slab_lock(page, &flags);

if (!check_slab(s, page) || !on_freelist(s, page, NULL))
goto unlock;
@@ -4802,7 +4816,7 @@ static void validate_slab(struct kmem_cache *s, struct page *page,
break;
}
unlock:
- slab_unlock(page);
+ slab_unlock(page, &flags);
}

static int validate_slab_node(struct kmem_cache *s,
--
2.32.0

2021-08-05 15:27:28

by Vlastimil Babka

Subject: [PATCH v4 33/35] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg

Jann Horn reported [1] the following theoretically possible race:

task A: put_cpu_partial() calls preempt_disable()
task A: oldpage = this_cpu_read(s->cpu_slab->partial)
interrupt: kfree() reaches unfreeze_partials() and discards the page
task B (on another CPU): reallocates page as page cache
task A: reads page->pages and page->pobjects, which are actually
halves of the pointer page->lru.prev
task B (on another CPU): frees page
interrupt: allocates page as SLUB page and places it on the percpu partial list
task A: this_cpu_cmpxchg() succeeds

which would cause page->pages and page->pobjects to end up containing
halves of pointers that would then influence when put_cpu_partial()
happens and show up in root-only sysfs files. Maybe that's acceptable,
I don't know. But there should probably at least be a comment for now
to point out that we're reading union fields of a page that might be
in a completely different state.

Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only safe
against s->cpu_slab->partial manipulation in ___slab_alloc() if the latter
disables irqs, otherwise a __slab_free() in an irq handler could call
put_cpu_partial() in the middle of ___slab_alloc() manipulating ->partial
and corrupt it. This becomes an issue on RT after a local_lock is introduced
in a later patch. The fix means also taking the local_lock in put_cpu_partial()
on RT.

After debugging this issue, Mike Galbraith suggested [2] that to avoid
different locking schemes on RT and !RT, we can just protect put_cpu_partial()
with disabled irqs (to be converted to local_lock_irqsave() later) everywhere.
This should be acceptable as it's not a fast path, and moving the actual
partial unfreezing outside of the irq disabled section makes it short, and with
the retry loop gone the code can also be simplified. In addition, the race
reported by Jann should no longer be possible.
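
The resulting control flow can be sketched in userspace as follows. This is
a simplified model, not the kernel function: struct model_page, struct
model_cpu_slab and model_put_cpu_partial() are hypothetical, the irq
disabling is reduced to comments, and the list to unfreeze is returned to the
caller instead of being passed to __unfreeze_partials().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page {
	struct model_page *next;
	int pages;	/* pages on the list headed by this page */
	int pobjects;	/* approximate free objects on that list */
	int objects;	/* total objects in this page */
	int inuse;	/* allocated objects in this page */
};

struct model_cpu_slab {
	struct model_page *partial;
};

/* Returns the old list if it has to be unfrozen by the caller, else NULL. */
static struct model_page *model_put_cpu_partial(struct model_cpu_slab *c,
						struct model_page *page,
						bool drain, int limit)
{
	struct model_page *oldpage, *to_unfreeze = NULL;
	int pages = 0, pobjects = 0;

	/* local_irq_save() would go here: no retry loop is needed. */
	oldpage = c->partial;
	if (oldpage) {
		if (drain && oldpage->pobjects > limit) {
			/* List is full: hand it over, start a fresh one. */
			to_unfreeze = oldpage;
			oldpage = NULL;
		} else {
			pobjects = oldpage->pobjects;
			pages = oldpage->pages;
		}
	}

	pages++;
	pobjects += page->objects - page->inuse;

	page->pages = pages;
	page->pobjects = pobjects;
	page->next = oldpage;
	c->partial = page;
	/* local_irq_restore() would go here. */

	return to_unfreeze;	/* unfrozen outside the critical section */
}
```

Note how oldpage's fields are only read while the list cannot change, which
is what closes the window described in the race above.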

[1] https://lore.kernel.org/lkml/CAG48ez1mvUuXwg0YPH5ANzhQLpbphqk-ZS+jbRz+H66fvm4FcA@mail.gmail.com/
[2] https://lore.kernel.org/linux-rt-users/[email protected]/

Reported-by: Jann Horn <[email protected]>
Suggested-by: Mike Galbraith <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 81 ++++++++++++++++++++++++++++++-------------------------
1 file changed, 44 insertions(+), 37 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 252421ff1d5f..c35ad273e3e9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2003,7 +2003,12 @@ static inline void *acquire_slab(struct kmem_cache *s,
return freelist;
}

+#ifdef CONFIG_SLUB_CPU_PARTIAL
static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
+#else
+static inline void put_cpu_partial(struct kmem_cache *s, struct page *page,
+ int drain) { }
+#endif
static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);

/*
@@ -2437,14 +2442,6 @@ static void unfreeze_partials_cpu(struct kmem_cache *s,
__unfreeze_partials(s, partial_page);
}

-#else /* CONFIG_SLUB_CPU_PARTIAL */
-
-static inline void unfreeze_partials(struct kmem_cache *s) { }
-static inline void unfreeze_partials_cpu(struct kmem_cache *s,
- struct kmem_cache_cpu *c) { }
-
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
-
/*
* Put a page that was just frozen (in __slab_free|get_partial_node) into a
* partial page slot if available.
@@ -2454,46 +2451,56 @@ static inline void unfreeze_partials_cpu(struct kmem_cache *s,
*/
static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
{
-#ifdef CONFIG_SLUB_CPU_PARTIAL
struct page *oldpage;
- int pages;
- int pobjects;
+ struct page *page_to_unfreeze = NULL;
+ unsigned long flags;
+ int pages = 0;
+ int pobjects = 0;

- preempt_disable();
- do {
- pages = 0;
- pobjects = 0;
- oldpage = this_cpu_read(s->cpu_slab->partial);
+ local_irq_save(flags);
+
+ oldpage = this_cpu_read(s->cpu_slab->partial);

- if (oldpage) {
+ if (oldpage) {
+ if (drain && oldpage->pobjects > slub_cpu_partial(s)) {
+ /*
+ * Partial array is full. Move the existing set to the
+ * per node partial list. Postpone the actual unfreezing
+ * outside of the critical section.
+ */
+ page_to_unfreeze = oldpage;
+ oldpage = NULL;
+ } else {
pobjects = oldpage->pobjects;
pages = oldpage->pages;
- if (drain && pobjects > slub_cpu_partial(s)) {
- /*
- * partial array is full. Move the existing
- * set to the per node partial list.
- */
- unfreeze_partials(s);
- oldpage = NULL;
- pobjects = 0;
- pages = 0;
- stat(s, CPU_PARTIAL_DRAIN);
- }
}
+ }

- pages++;
- pobjects += page->objects - page->inuse;
+ pages++;
+ pobjects += page->objects - page->inuse;

- page->pages = pages;
- page->pobjects = pobjects;
- page->next = oldpage;
+ page->pages = pages;
+ page->pobjects = pobjects;
+ page->next = oldpage;

- } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
- != oldpage);
- preempt_enable();
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
+ this_cpu_write(s->cpu_slab->partial, page);
+
+ local_irq_restore(flags);
+
+ if (page_to_unfreeze) {
+ __unfreeze_partials(s, page_to_unfreeze);
+ stat(s, CPU_PARTIAL_DRAIN);
+ }
}

+#else /* CONFIG_SLUB_CPU_PARTIAL */
+
+static inline void unfreeze_partials(struct kmem_cache *s) { }
+static inline void unfreeze_partials_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c) { }
+
+#endif /* CONFIG_SLUB_CPU_PARTIAL */
+
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
bool lock)
{
--
2.32.0

2021-08-05 15:27:32

by Vlastimil Babka

Subject: [PATCH v4 34/35] mm, slub: use migrate_disable() on PREEMPT_RT

We currently use preempt_disable() (directly or via get_cpu_ptr()) to stabilize
the pointer to kmem_cache_cpu. On PREEMPT_RT this would be incompatible with
the list_lock spinlock. We can use migrate_disable() instead, but that
increases overhead on !PREEMPT_RT as it's an unconditional function call.

In order to get the best available mechanism on both PREEMPT_RT and
!PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr()
wrappers and use them.
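
The config-dependent wrapper idea can be shown with a small userspace
sketch. Everything here is hypothetical: MODEL_PREEMPT_RT stands in for
CONFIG_PREEMPT_RT, two counters stand in for the preempt and migrate
disable depths, and a plain variable stands in for the percpu data.

```c
#include <assert.h>

static int model_preempt_count;	/* depth of "preempt_disable()" */
static int model_migrate_count;	/* depth of "migrate_disable()" */
static int model_percpu_slab;	/* stand-in for the percpu variable */

#ifndef MODEL_PREEMPT_RT
/* !RT: the cheap inline mechanism, as get_cpu_ptr()/put_cpu_ptr() would do. */
#define model_slub_get_cpu_ptr(var)	(model_preempt_count++, &(var))
#define model_slub_put_cpu_ptr(var)	\
do {					\
	(void)(var);			\
	model_preempt_count--;		\
} while (0)
#else
/* RT: only pin the task to the cpu, it stays preemptible. */
#define model_slub_get_cpu_ptr(var)	(model_migrate_count++, &(var))
#define model_slub_put_cpu_ptr(var)	\
do {					\
	(void)(var);			\
	model_migrate_count--;		\
} while (0)
#endif
```

Callers use the same pair of names in both configurations, so the choice of
mechanism stays in one place.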

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 39 ++++++++++++++++++++++++++++++---------
1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c35ad273e3e9..690e762912b7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -118,6 +118,26 @@
* the fast path and disables lockless freelists.
*/

+/*
+ * We could simply use migrate_disable()/enable() but as long as it's a
+ * function call even on !PREEMPT_RT, use inline preempt_disable() there.
+ */
+#ifndef CONFIG_PREEMPT_RT
+#define slub_get_cpu_ptr(var) get_cpu_ptr(var)
+#define slub_put_cpu_ptr(var) put_cpu_ptr(var)
+#else
+#define slub_get_cpu_ptr(var) \
+({ \
+ migrate_disable(); \
+ this_cpu_ptr(var); \
+})
+#define slub_put_cpu_ptr(var) \
+do { \
+ (void)(var); \
+ migrate_enable(); \
+} while (0)
+#endif
+
#ifdef CONFIG_SLUB_DEBUG
#ifdef CONFIG_SLUB_DEBUG_ON
DEFINE_STATIC_KEY_TRUE(slub_debug_enabled);
@@ -2806,7 +2826,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(!pfmemalloc_match(page, gfpflags)))
goto deactivate_slab;

- /* must check again c->page in case IRQ handler changed it */
+ /* must check again c->page in case we got preempted and it changed */
local_irq_save(flags);
if (unlikely(page != c->page)) {
local_irq_restore(flags);
@@ -2865,7 +2885,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}
if (unlikely(!slub_percpu_partial(c))) {
local_irq_restore(flags);
- goto new_objects; /* stolen by an IRQ handler */
+ /* we were preempted and partial list got empty */
+ goto new_objects;
}

page = c->page = slub_percpu_partial(c);
@@ -2881,9 +2902,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (freelist)
goto check_new_page;

- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
- c = get_cpu_ptr(s->cpu_slab);
+ c = slub_get_cpu_ptr(s->cpu_slab);

if (unlikely(!page)) {
slab_out_of_memory(s, gfpflags, node);
@@ -2966,12 +2987,12 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* cpu before disabling preemption. Need to reload cpu area
* pointer.
*/
- c = get_cpu_ptr(s->cpu_slab);
+ c = slub_get_cpu_ptr(s->cpu_slab);
#endif

p = ___slab_alloc(s, gfpflags, node, addr, c);
#ifdef CONFIG_PREEMPT_COUNT
- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);
#endif
return p;
}
@@ -3500,7 +3521,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* IRQs, which protects against PREEMPT and interrupts
* handlers invoking normal fastpath.
*/
- c = get_cpu_ptr(s->cpu_slab);
+ c = slub_get_cpu_ptr(s->cpu_slab);
local_irq_disable();

for (i = 0; i < size; i++) {
@@ -3546,7 +3567,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
}
c->tid = next_tid(c->tid);
local_irq_enable();
- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);

/*
* memcg and kmem_cache debug support and memory initialization.
@@ -3556,7 +3577,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
slab_want_init_on_alloc(flags, s));
return i;
error:
- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);
slab_post_alloc_hook(s, objcg, flags, i, p, false);
__kmem_cache_free_bulk(s, i, p);
return 0;
--
2.32.0

2021-08-05 15:28:00

by Vlastimil Babka

Subject: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

Embed local_lock into struct kmem_cache_cpu and use the irq-safe versions of
local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's
equivalent, with better lockdep visibility. On PREEMPT_RT that means better
preemption.

However, the cost on PREEMPT_RT is the loss of the lockless fast paths, which
only work with the cpu freelist. Those are designed to detect and recover from
being preempted by other conflicting operations (either fast or slow path), but
the slow path operations assume they cannot be preempted by a fast path
operation, which is guaranteed naturally with disabled irqs. With local locks on
PREEMPT_RT, the fast paths now also need to take the local lock to avoid races.

In the allocation fastpath slab_alloc_node() we can just defer to the slowpath
__slab_alloc() which also works with cpu freelist, but under the local lock.
In the free fastpath do_slab_free() we have to add a new local lock protected
version of freeing to the cpu freelist, as the existing slowpath only works
with the page freelist.
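
The lock-protected free to the cpu freelist can be pictured with the following
userspace sketch. It is only a model under stated assumptions: the mock_* types
and functions are hypothetical, and the "lock" is a plain flag standing in for
the local lock. It shows the recheck of the cpu slab under the lock, the splice
of the freed chain onto the freelist, and the tid bump.

```c
#include <assert.h>
#include <stddef.h>

struct mock_obj {
	struct mock_obj *next;		/* models the in-object free pointer */
};

struct mock_cpu_slab {
	int locked;			/* flag standing in for the local lock */
	struct mock_obj *freelist;
	unsigned long tid;
	const void *page;		/* identity of the current cpu slab */
};

static void mock_lock(struct mock_cpu_slab *c)
{
	assert(!c->locked);
	c->locked = 1;
}

static void mock_unlock(struct mock_cpu_slab *c)
{
	assert(c->locked);
	c->locked = 0;
}

/*
 * Free the chain head..tail to the cpu freelist under the lock.
 * Returns 1 on success, 0 if the cpu slab changed and the caller must retry.
 */
static int mock_free_to_cpu_freelist(struct mock_cpu_slab *c, const void *page,
				     struct mock_obj *head, struct mock_obj *tail)
{
	mock_lock(c);
	if (page != c->page) {
		/* The cpu slab was replaced before we got the lock. */
		mock_unlock(c);
		return 0;
	}
	tail->next = c->freelist;	/* splice old freelist behind the chain */
	c->freelist = head;
	c->tid++;			/* make concurrent observers notice */
	mock_unlock(c);
	return 1;
}
```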

Also update the comment about locking scheme in SLUB to reflect changes done
by this series.

[ Mike Galbraith <[email protected]>: use local_lock() without irq in PREEMPT_RT
scope; debugging of RT crashes resulting in put_cpu_partial() locking changes ]
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/slub_def.h | 2 +
mm/slub.c | 146 ++++++++++++++++++++++++++++++---------
2 files changed, 115 insertions(+), 33 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..b5bcac29b979 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -10,6 +10,7 @@
#include <linux/kfence.h>
#include <linux/kobject.h>
#include <linux/reciprocal_div.h>
+#include <linux/local_lock.h>

enum stat_item {
ALLOC_FASTPATH, /* Allocation from cpu slab */
@@ -41,6 +42,7 @@ enum stat_item {
NR_SLUB_STAT_ITEMS };

struct kmem_cache_cpu {
+ local_lock_t lock; /* Protects the fields below except stat */
void **freelist; /* Pointer to next available object */
unsigned long tid; /* Globally unique transaction id */
struct page *page; /* The slab from which we are allocating */
diff --git a/mm/slub.c b/mm/slub.c
index 690e762912b7..8052334fcc56 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -46,13 +46,21 @@
/*
* Lock order:
* 1. slab_mutex (Global Mutex)
- * 2. node->list_lock
- * 3. slab_lock(page) (Only on some arches and for debugging)
+ * 2. node->list_lock (Spinlock)
+ * 3. kmem_cache->cpu_slab->lock (Local lock)
+ * 4. slab_lock(page) (Only on some arches or for debugging)
+ * 5. object_map_lock (Only for debugging)
*
* slab_mutex
*
* The role of the slab_mutex is to protect the list of all the slabs
* and to synchronize major metadata changes to slab cache structures.
+ * Also synchronizes memory hotplug callbacks.
+ *
+ * slab_lock
+ *
+ * The slab_lock is a wrapper around the page lock, thus it is a bit
+ * spinlock.
*
* The slab_lock is only used for debugging and on arches that do not
* have the ability to do a cmpxchg_double. It only protects:
@@ -61,6 +69,8 @@
* C. page->objects -> Number of objects in page
* D. page->frozen -> frozen state
*
+ * Frozen slabs
+ *
* If a slab is frozen then it is exempt from list management. It is not
* on any list except per cpu partial list. The processor that froze the
* slab is the one who can perform list operations on the page. Other
@@ -68,6 +78,8 @@
* froze the slab is the only one that can retrieve the objects from the
* page's freelist.
*
+ * list_lock
+ *
* The list_lock protects the partial and full list on each node and
* the partial slab counter. If taken then no new slabs may be added or
* removed from the lists nor make the number of partial slabs be modified.
@@ -79,10 +91,36 @@
* slabs, operations can continue without any centralized lock. F.e.
* allocating a long series of objects that fill up slabs does not require
* the list lock.
- * Interrupts are disabled during allocation and deallocation in order to
- * make the slab allocator safe to use in the context of an irq. In addition
- * interrupts are disabled to ensure that the processor does not change
- * while handling per_cpu slabs, due to kernel preemption.
+ *
+ * cpu_slab->lock local lock
+ *
+ * This lock protects slowpath manipulation of all kmem_cache_cpu fields
+ * except the stat counters. This is a percpu structure manipulated only by
+ * the local cpu, so the lock protects against being preempted or interrupted
+ * by an irq. Fast path operations rely on lockless operations instead.
+ * On PREEMPT_RT, the local lock does not actually disable irqs (and thus
+ * prevent the lockless operations), so fastpath operations also need to take
+ * the lock and are no longer lockless.
+ *
+ * lockless fastpaths
+ *
+ * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
+ * are fully lockless when satisfied from the percpu slab (and when
+ * cmpxchg_double is possible to use, otherwise slab_lock is taken).
+ * They also don't disable preemption or migration or irqs. They rely on
+ * the transaction id (tid) field to detect being preempted or moved to
+ * another cpu.
+ *
+ * irq, preemption, migration considerations
+ *
+ * Interrupts are disabled as part of list_lock or local_lock operations, or
+ * around the slab_lock operation, in order to make the slab allocator safe
+ * to use in the context of an irq.
+ *
+ * In addition, preemption (or migration on PREEMPT_RT) is disabled in the
+ * allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
+ * local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
+ * doesn't have to be revalidated in each section protected by the local lock.
*
* SLUB assigns one slab for allocation to each processor.
* Allocations only occur from these slabs called cpu slabs.
@@ -2228,9 +2266,13 @@ static inline void note_cmpxchg_failure(const char *n,
static void init_kmem_cache_cpus(struct kmem_cache *s)
{
int cpu;
+ struct kmem_cache_cpu *c;

- for_each_possible_cpu(cpu)
- per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
+ for_each_possible_cpu(cpu) {
+ c = per_cpu_ptr(s->cpu_slab, cpu);
+ local_lock_init(&c->lock);
+ c->tid = init_tid(cpu);
+ }
}

/*
@@ -2441,10 +2483,10 @@ static void unfreeze_partials(struct kmem_cache *s)
struct page *partial_page;
unsigned long flags;

- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
partial_page = this_cpu_read(s->cpu_slab->partial);
this_cpu_write(s->cpu_slab->partial, NULL);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);

if (partial_page)
__unfreeze_partials(s, partial_page);
@@ -2477,7 +2519,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
int pages = 0;
int pobjects = 0;

- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);

oldpage = this_cpu_read(s->cpu_slab->partial);

@@ -2505,7 +2547,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)

this_cpu_write(s->cpu_slab->partial, page);

- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);

if (page_to_unfreeze) {
__unfreeze_partials(s, page_to_unfreeze);
@@ -2529,7 +2571,7 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
struct page *page;

if (lock)
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);

freelist = c->freelist;
page = c->page;
@@ -2539,7 +2581,7 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
c->tid = next_tid(c->tid);

if (lock)
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);

if (page)
deactivate_slab(s, page, freelist);
@@ -2827,9 +2869,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto deactivate_slab;

/* must check again c->page in case we got preempted and it changed */
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(page != c->page)) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_page;
}
freelist = c->freelist;
@@ -2840,7 +2882,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

if (!freelist) {
c->page = NULL;
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
stat(s, DEACTIVATE_BYPASS);
goto new_slab;
}
@@ -2849,7 +2891,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

load_freelist:

- lockdep_assert_irqs_disabled();
+#ifdef CONFIG_PREEMPT_RT
+ lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock.lock));
+#else
+ lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
+#endif

/*
* freelist is pointing to the list of objects to be used.
@@ -2859,39 +2905,39 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
VM_BUG_ON(!c->page->frozen);
c->freelist = get_freepointer(s, freelist);
c->tid = next_tid(c->tid);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
return freelist;

deactivate_slab:

- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (page != c->page) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_page;
}
freelist = c->freelist;
c->page = NULL;
c->freelist = NULL;
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
deactivate_slab(s, page, freelist);

new_slab:

if (slub_percpu_partial(c)) {
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(c->page)) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_page;
}
if (unlikely(!slub_percpu_partial(c))) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
/* we were preempted and partial list got empty */
goto new_objects;
}

page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
stat(s, CPU_PARTIAL_ALLOC);
goto redo;
}
@@ -2944,7 +2990,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

retry_load_page:

- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(c->page)) {
void *flush_freelist = c->freelist;
struct page *flush_page = c->page;
@@ -2953,7 +2999,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->freelist = NULL;
c->tid = next_tid(c->tid);

- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);

deactivate_slab(s, flush_page, flush_freelist);

@@ -3072,7 +3118,15 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,

object = c->freelist;
page = c->page;
- if (unlikely(!object || !page || !node_match(page, node))) {
+ /*
+ * We cannot use the lockless fastpath on PREEMPT_RT because if a
+ * slowpath has taken the local_lock_irqsave(), it is not protected
+ * against a fast path operation in an irq handler. So we need to take
+ * the slow path which uses local_lock. It is still relatively fast if
+ * there is a suitable cpu freelist.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
+ unlikely(!object || !page || !node_match(page, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
} else {
void *next_object = get_freepointer_safe(s, object);
@@ -3332,6 +3386,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
barrier();

if (likely(page == c->page)) {
+#ifndef CONFIG_PREEMPT_RT
void **freelist = READ_ONCE(c->freelist);

set_freepointer(s, tail_obj, freelist);
@@ -3344,6 +3399,31 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
note_cmpxchg_failure("slab_free", s, tid);
goto redo;
}
+#else /* CONFIG_PREEMPT_RT */
+ /*
+ * We cannot use the lockless fastpath on PREEMPT_RT because if
+ * a slowpath has taken the local_lock_irqsave(), it is not
+ * protected against a fast path operation in an irq handler. So
+ * we need to take the local_lock. We shouldn't simply defer to
+ * __slab_free() as that wouldn't use the cpu freelist at all.
+ */
+ void **freelist;
+
+ local_lock(&s->cpu_slab->lock);
+ c = this_cpu_ptr(s->cpu_slab);
+ if (unlikely(page != c->page)) {
+ local_unlock(&s->cpu_slab->lock);
+ goto redo;
+ }
+ tid = c->tid;
+ freelist = c->freelist;
+
+ set_freepointer(s, tail_obj, freelist);
+ c->freelist = head;
+ c->tid = next_tid(tid);
+
+ local_unlock(&s->cpu_slab->lock);
+#endif
stat(s, FREE_FASTPATH);
} else
__slab_free(s, page, head, tail_obj, cnt, addr);
@@ -3522,7 +3602,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* handlers invoking normal fastpath.
*/
c = slub_get_cpu_ptr(s->cpu_slab);
- local_irq_disable();
+ local_lock_irq(&s->cpu_slab->lock);

for (i = 0; i < size; i++) {
void *object = kfence_alloc(s, s->object_size, flags);
@@ -3543,7 +3623,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
*/
c->tid = next_tid(c->tid);

- local_irq_enable();
+ local_unlock_irq(&s->cpu_slab->lock);

/*
* Invoking slow path likely have side-effect
@@ -3557,7 +3637,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
c = this_cpu_ptr(s->cpu_slab);
maybe_wipe_obj_freeptr(s, p[i]);

- local_irq_disable();
+ local_lock_irq(&s->cpu_slab->lock);

continue; /* goto for-loop */
}
@@ -3566,7 +3646,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
maybe_wipe_obj_freeptr(s, p[i]);
}
c->tid = next_tid(c->tid);
- local_irq_enable();
+ local_unlock_irq(&s->cpu_slab->lock);
slub_put_cpu_ptr(s->cpu_slab);

/*
--
2.32.0

2021-08-05 15:28:06

by Vlastimil Babka

Subject: [PATCH v4 26/35] mm, slub: only disable irq with spin_lock in __unfreeze_partials()

__unfreeze_partials() no longer needs to have irqs disabled, except for making
the spin_lock operations irq-safe, so convert the spin_lock operations and
remove the separate irq handling.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c8637150bf28..b8581a9b58cc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2335,9 +2335,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
{
struct kmem_cache_node *n = NULL, *n2 = NULL;
struct page *page, *discard_page = NULL;
- unsigned long flags;
-
- local_irq_save(flags);
+ unsigned long flags = 0;

while (partial_page) {
struct page new;
@@ -2349,10 +2347,10 @@ static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
n2 = get_node(s, page_to_nid(page));
if (n != n2) {
if (n)
- spin_unlock(&n->list_lock);
+ spin_unlock_irqrestore(&n->list_lock, flags);

n = n2;
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
}

do {
@@ -2381,9 +2379,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
}

if (n)
- spin_unlock(&n->list_lock);
-
- local_irq_restore(flags);
+ spin_unlock_irqrestore(&n->list_lock, flags);

while (discard_page) {
page = discard_page;
--
2.32.0

2021-08-05 15:28:18

by Vlastimil Babka

Subject: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() and __free_slab() invocations out of IRQ context

From: Sebastian Andrzej Siewior <[email protected]>

flush_all() flushes a specific SLAB cache on each CPU (where the cache
is present). The deactivate_slab()/__free_slab() invocation happens
within an IPI handler and is problematic for PREEMPT_RT.

The flush operation is not a frequent operation or a hot path. The
per-CPU flush operation can be moved to within a workqueue.
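
The two-pass shape of such a workqueue-based flush can be modelled in userspace
as below. This is a simplified sketch, not the kernel implementation: all mock_*
names are hypothetical, there is no real concurrency, and running the handler
in the second pass stands in for waiting on queued work to finish.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_NR_CPUS 4

struct mock_flush_work {
	bool skip;	/* nothing to flush on this cpu, do not wait for it */
	bool queued;	/* models work queued on the cpu */
	bool done;	/* set by the handler */
};

static struct mock_flush_work mock_work[MOCK_NR_CPUS];
static bool mock_has_cpu_slab[MOCK_NR_CPUS];

/* Handler: in the real design this would run on the target cpu. */
static void mock_flush_cpu_slab(int cpu)
{
	mock_has_cpu_slab[cpu] = false;
	mock_work[cpu].done = true;
}

/* Returns the number of cpus that were actually flushed. */
static int mock_flush_all(void)
{
	int cpu, flushed = 0;

	/* Pass 1: queue work only on cpus that have something to flush. */
	for (cpu = 0; cpu < MOCK_NR_CPUS; cpu++) {
		struct mock_flush_work *w = &mock_work[cpu];

		w->queued = false;
		w->done = false;
		if (!mock_has_cpu_slab[cpu]) {
			w->skip = true;
			continue;
		}
		w->skip = false;
		w->queued = true;
	}

	/* Pass 2: wait for every item that was queued, ignore the skipped. */
	for (cpu = 0; cpu < MOCK_NR_CPUS; cpu++) {
		struct mock_flush_work *w = &mock_work[cpu];

		if (w->skip)
			continue;
		mock_flush_cpu_slab(cpu);	/* stands in for waiting on the work */
		flushed++;
	}
	return flushed;
}
```

Recording the skip decision in the first pass means the second pass never waits
on a work item that was never queued.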

[[email protected]: adapt to new SLUB changes]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 48 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index dceb289cb052..da48ada3d17f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2513,33 +2513,73 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
unfreeze_partials_cpu(s, c);
}

+struct slub_flush_work {
+ struct work_struct work;
+ struct kmem_cache *s;
+ bool skip;
+};
+
/*
* Flush cpu slab.
*
- * Called from IPI handler with interrupts disabled.
+ * Called from CPU work handler with migration disabled.
*/
-static void flush_cpu_slab(void *d)
+static void flush_cpu_slab(struct work_struct *w)
{
- struct kmem_cache *s = d;
- struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+ struct kmem_cache *s;
+ struct kmem_cache_cpu *c;
+ struct slub_flush_work *sfw;
+
+ sfw = container_of(w, struct slub_flush_work, work);
+
+ s = sfw->s;
+ c = this_cpu_ptr(s->cpu_slab);

if (c->page)
- flush_slab(s, c, false);
+ flush_slab(s, c, true);

unfreeze_partials(s);
}

-static bool has_cpu_slab(int cpu, void *info)
+static bool has_cpu_slab(int cpu, struct kmem_cache *s)
{
- struct kmem_cache *s = info;
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

return c->page || slub_percpu_partial(c);
}

+static DEFINE_MUTEX(flush_lock);
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
+
static void flush_all(struct kmem_cache *s)
{
- on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
+ struct slub_flush_work *sfw;
+ unsigned int cpu;
+
+ mutex_lock(&flush_lock);
+ cpus_read_lock();
+
+ for_each_online_cpu(cpu) {
+ sfw = &per_cpu(slub_flush, cpu);
+ if (!has_cpu_slab(cpu, s)) {
+ sfw->skip = true;
+ continue;
+ }
+ INIT_WORK(&sfw->work, flush_cpu_slab);
+ sfw->skip = false;
+ sfw->s = s;
+ schedule_work_on(cpu, &sfw->work);
+ }
+
+ for_each_online_cpu(cpu) {
+ sfw = &per_cpu(slub_flush, cpu);
+ if (sfw->skip)
+ continue;
+ flush_work(&sfw->work);
+ }
+
+ cpus_read_unlock();
+ mutex_unlock(&flush_lock);
}

/*
--
2.32.0

2021-08-05 15:28:27

by Vlastimil Babka

Subject: [PATCH v4 30/35] mm: slub: Make object_map_lock a raw_spinlock_t

From: Sebastian Andrzej Siewior <[email protected]>

The variable object_map is protected by object_map_lock. The lock is always
acquired in debug code and within an already atomic context.

Make object_map_lock a raw_spinlock_t.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index da48ada3d17f..9cb58d884c58 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -438,7 +438,7 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,

#ifdef CONFIG_SLUB_DEBUG
static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
-static DEFINE_SPINLOCK(object_map_lock);
+static DEFINE_RAW_SPINLOCK(object_map_lock);

static void __fill_map(unsigned long *obj_map, struct kmem_cache *s,
struct page *page)
@@ -483,7 +483,7 @@ static unsigned long *get_map(struct kmem_cache *s, struct page *page)
{
VM_BUG_ON(!irqs_disabled());

- spin_lock(&object_map_lock);
+ raw_spin_lock(&object_map_lock);

__fill_map(object_map, s, page);

@@ -493,7 +493,7 @@ static unsigned long *get_map(struct kmem_cache *s, struct page *page)
static void put_map(unsigned long *map) __releases(&object_map_lock)
{
VM_BUG_ON(map != object_map);
- spin_unlock(&object_map_lock);
+ raw_spin_unlock(&object_map_lock);
}

static inline unsigned int size_from_object(struct kmem_cache *s)
--
2.32.0

2021-08-05 15:28:28

by Vlastimil Babka

Subject: [PATCH v4 28/35] mm, slub: make flush_slab() possible to call with irqs enabled

Currently flush_slab() is always called with disabled IRQs if it's needed, but
the following patches will change that, so add a parameter to control IRQ
disabling within the function, which only protects the kmem_cache_cpu
manipulation and not the call to deactivate_slab() which doesn't need it.
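
The pattern is "detach the per-cpu state inside the protected section, do the
expensive work outside it". A userspace sketch under stated assumptions: the
mock_* names are hypothetical and a depth counter models the irq-disabled
section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int mock_irq_off_depth;			/* > 0 models "irqs disabled" */
static int mock_deactivate_calls;
static int mock_deactivate_with_irqs_off;	/* should stay 0 */

struct mock_cpu_slab {
	void *page;
	void *freelist;
	unsigned long tid;
};

/* Expensive part; records whether it ran inside the protected section. */
static void mock_deactivate_slab(void *page, void *freelist)
{
	(void)page;
	(void)freelist;
	mock_deactivate_calls++;
	if (mock_irq_off_depth > 0)
		mock_deactivate_with_irqs_off++;
}

static void mock_flush_slab(struct mock_cpu_slab *c, bool lock)
{
	void *page, *freelist;

	if (lock)
		mock_irq_off_depth++;	/* models disabling irqs */

	/* Only the per-cpu fields are touched while protected. */
	page = c->page;
	freelist = c->freelist;
	c->page = NULL;
	c->freelist = NULL;
	c->tid++;

	if (lock)
		mock_irq_off_depth--;	/* models restoring irqs */

	/* The detached slab is private now, so no protection is needed. */
	if (page)
		mock_deactivate_slab(page, freelist);
}
```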

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c10f2c9b9352..dceb289cb052 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2477,16 +2477,28 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}

-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
+static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
+ bool lock)
{
- void *freelist = c->freelist;
- struct page *page = c->page;
+ unsigned long flags;
+ void *freelist;
+ struct page *page;
+
+ if (lock)
+ local_irq_save(flags);
+
+ freelist = c->freelist;
+ page = c->page;

c->page = NULL;
c->freelist = NULL;
c->tid = next_tid(c->tid);

- deactivate_slab(s, page, freelist);
+ if (lock)
+ local_irq_restore(flags);
+
+ if (page)
+ deactivate_slab(s, page, freelist);

stat(s, CPUSLAB_FLUSH);
}
@@ -2496,7 +2508,7 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

if (c->page)
- flush_slab(s, c);
+ flush_slab(s, c, false);

unfreeze_partials_cpu(s, c);
}
@@ -2512,7 +2524,7 @@ static void flush_cpu_slab(void *d)
struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);

if (c->page)
- flush_slab(s, c);
+ flush_slab(s, c, false);

unfreeze_partials(s);
}
--
2.32.0

2021-08-05 15:28:34

by Vlastimil Babka

Subject: [PATCH v4 23/35] mm, slub: discard slabs in unfreeze_partials() without irqs disabled

No need for disabled irqs when discarding slabs, so restore them before
discarding.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 51f8d83d3ea8..4d60bf482735 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2389,6 +2389,8 @@ static void unfreeze_partials(struct kmem_cache *s,
if (n)
spin_unlock(&n->list_lock);

+ local_irq_restore(flags);
+
while (discard_page) {
page = discard_page;
discard_page = discard_page->next;
@@ -2398,7 +2400,6 @@ static void unfreeze_partials(struct kmem_cache *s,
stat(s, FREE_SLAB);
}

- local_irq_restore(flags);
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}

--
2.32.0

2021-08-05 15:29:07

by Vlastimil Babka

Subject: [PATCH v4 16/35] mm, slub: validate slab from partial list or page allocator before making it cpu slab

When we obtain a new slab page from node partial list or page allocator, we
assign it to kmem_cache_cpu, perform some checks, and if they fail, we undo
the assignment.

In order to allow doing the checks without irq disabled, restructure the code
so that the checks are done first, and kmem_cache_cpu.page assignment only
after they pass.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9350ff5110a0..5d58fde2bd70 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2772,10 +2772,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
lockdep_assert_irqs_disabled();

freelist = get_partial(s, gfpflags, node, &page);
- if (freelist) {
- c->page = page;
+ if (freelist)
goto check_new_page;
- }

local_irq_restore(flags);
put_cpu_ptr(s->cpu_slab);
@@ -2788,9 +2786,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}

local_irq_save(flags);
- if (c->page)
- flush_slab(s, c);
-
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
@@ -2799,14 +2794,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
page->freelist = NULL;

stat(s, ALLOC_SLAB);
- c->page = page;

check_new_page:

if (kmem_cache_debug(s)) {
if (!alloc_debug_processing(s, page, freelist, addr)) {
/* Slab failed checks. Next slab needed */
- c->page = NULL;
local_irq_restore(flags);
goto new_slab;
} else {
@@ -2825,10 +2818,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
goto return_single;

+ if (unlikely(c->page))
+ flush_slab(s, c);
+ c->page = page;
+
goto load_freelist;

return_single:

+ if (unlikely(c->page))
+ flush_slab(s, c);
+ c->page = page;
+
deactivate_slab(s, page, get_freepointer(s, freelist), c);
local_irq_restore(flags);
return freelist;
--
2.32.0

2021-08-05 15:29:59

by Vlastimil Babka

Subject: [PATCH v4 32/35] mm, slub: make slab_lock() disable irqs with PREEMPT_RT

We need to disable irqs around slab_lock() (a bit spinlock) to make it
irq-safe. The calls to slab_lock() are nested under spin_lock_irqsave() which
doesn't disable irqs on PREEMPT_RT, so add explicit disabling with PREEMPT_RT.

We also distinguish cmpxchg_double_slab() where we do the disabling explicitly
and __cmpxchg_double_slab() for contexts with already disabled irqs. However,
these contexts typically rely on spin_lock_irqsave(), which is insufficient on
PREEMPT_RT. Thus, change __cmpxchg_double_slab() to be the same as
cmpxchg_double_slab() on PREEMPT_RT.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9208020f72d5..252421ff1d5f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -380,12 +380,12 @@ __slab_unlock(struct page *page, unsigned long *flags, bool disable_irqs)
static __always_inline void
slab_lock(struct page *page, unsigned long *flags)
{
- __slab_lock(page, flags, false);
+ __slab_lock(page, flags, IS_ENABLED(CONFIG_PREEMPT_RT));
}

static __always_inline void slab_unlock(struct page *page, unsigned long *flags)
{
- __slab_unlock(page, flags, false);
+ __slab_unlock(page, flags, IS_ENABLED(CONFIG_PREEMPT_RT));
}

static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
@@ -429,14 +429,19 @@ static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *pag
return false;
}

-/* Interrupts must be disabled (for the fallback code to work right) */
+/*
+ * Interrupts must be disabled (for the fallback code to work right), typically
+ * by an _irqsave() lock variant. Except on PREEMPT_RT where locks are different
+ * so we disable interrupts explicitly here.
+ */
static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
void *freelist_old, unsigned long counters_old,
void *freelist_new, unsigned long counters_new,
const char *n)
{
return ___cmpxchg_double_slab(s, page, freelist_old, counters_old,
- freelist_new, counters_new, n, false);
+ freelist_new, counters_new, n,
+ IS_ENABLED(CONFIG_PREEMPT_RT));
}

static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
--
2.32.0

2021-08-05 15:30:33

by Vlastimil Babka

Subject: [PATCH v4 18/35] mm, slub: stop disabling irqs around get_partial()

The function get_partial() does not need to have irqs disabled as a whole. It's
sufficient to convert spin_lock operations to their irq saving/restoring
versions.

As a result, it's now possible to reach the page allocator from the slab
allocator without disabling and re-enabling interrupts on the way.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f7c6cebb524d..b4a62aa00ae2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1993,11 +1993,12 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
* Try to allocate a partial slab from a specific node.
*/
static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
- struct page **ret_page, gfp_t flags)
+ struct page **ret_page, gfp_t gfpflags)
{
struct page *page, *page2;
void *object = NULL;
unsigned int available = 0;
+ unsigned long flags;
int objects;

/*
@@ -2009,11 +2010,11 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
if (!n || !n->nr_partial)
return NULL;

- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
void *t;

- if (!pfmemalloc_match(page, flags))
+ if (!pfmemalloc_match(page, gfpflags))
continue;

t = acquire_slab(s, n, page, object == NULL, &objects);
@@ -2034,7 +2035,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
break;

}
- spin_unlock(&n->list_lock);
+ spin_unlock_irqrestore(&n->list_lock, flags);
return object;
}

@@ -2749,8 +2750,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
local_irq_restore(flags);
goto reread_page;
}
- if (unlikely(!slub_percpu_partial(c)))
+ if (unlikely(!slub_percpu_partial(c))) {
+ local_irq_restore(flags);
goto new_objects; /* stolen by an IRQ handler */
+ }

page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
@@ -2759,18 +2762,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}

- local_irq_save(flags);
- if (unlikely(c->page)) {
- local_irq_restore(flags);
- goto reread_page;
- }
-
new_objects:

- lockdep_assert_irqs_disabled();
-
freelist = get_partial(s, gfpflags, node, &page);
- local_irq_restore(flags);
if (freelist)
goto check_new_page;

--
2.32.0

2021-08-05 18:45:57

by Vlastimil Babka

Subject: [PATCH v4 02/35] mm, slub: allocate private object map for debugfs listings

SLUB has a static spinlock-protected bitmap for marking which objects are on
the freelist when it wants to list them, for situations where dynamically
allocating such a map can lead to recursion or locking issues, and an on-stack
bitmap would be too large.

The handlers of debugfs files alloc_traces and free_traces also currently use this
shared bitmap, but their syscall context makes it straightforward to allocate a
private map before entering locked sections, so switch these processing paths
to use a private bitmap.
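
A userspace sketch of the private-map approach follows. It assumes a simplified
slab model in which the freelist is a chain of object indices; every mock_* name
is hypothetical. The caller allocates its own bitmap, fills it from the
freelist, and treats any object not marked free as allocated.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define MOCK_BITS_TO_LONGS(n) \
	(((n) + MOCK_BITS_PER_LONG - 1) / MOCK_BITS_PER_LONG)

struct mock_slab {
	unsigned int objects;	/* number of objects in the slab */
	int free_head;		/* index of first free object, -1 if none */
	const int *next_free;	/* next_free[i]: next free index, or -1 */
};

/* Mark every object on the freelist in a caller-supplied map. */
static void mock_fill_map(unsigned long *map, const struct mock_slab *s)
{
	int i;

	memset(map, 0, MOCK_BITS_TO_LONGS(s->objects) * sizeof(*map));
	for (i = s->free_head; i >= 0; i = s->next_free[i])
		map[i / MOCK_BITS_PER_LONG] |= 1UL << (i % MOCK_BITS_PER_LONG);
}

static int mock_test_bit(const unsigned long *map, unsigned int i)
{
	return (map[i / MOCK_BITS_PER_LONG] >> (i % MOCK_BITS_PER_LONG)) & 1UL;
}

/* Count allocated objects using a private map; returns -1 on allocation failure. */
static int mock_count_allocated(const struct mock_slab *s)
{
	unsigned long *map;
	unsigned int i;
	int used = 0;

	/* Allocate before any locked section would begin. */
	map = calloc(MOCK_BITS_TO_LONGS(s->objects), sizeof(*map));
	if (!map)
		return -1;

	mock_fill_map(map, s);
	for (i = 0; i < s->objects; i++)
		if (!mock_test_bit(map, i))
			used++;

	free(map);
	return used;
}
```

Because the map belongs to the caller, no shared lock is needed around the fill
and the scan.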

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 44 +++++++++++++++++++++++++++++---------------
1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f5908e6b6fb1..211d380d94d1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -454,6 +454,18 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
static DEFINE_SPINLOCK(object_map_lock);

+static void __fill_map(unsigned long *obj_map, struct kmem_cache *s,
+ struct page *page)
+{
+ void *addr = page_address(page);
+ void *p;
+
+ bitmap_zero(obj_map, page->objects);
+
+ for (p = page->freelist; p; p = get_freepointer(s, p))
+ set_bit(__obj_to_index(s, addr, p), obj_map);
+}
+
#if IS_ENABLED(CONFIG_KUNIT)
static bool slab_add_kunit_errors(void)
{
@@ -483,17 +495,11 @@ static inline bool slab_add_kunit_errors(void) { return false; }
static unsigned long *get_map(struct kmem_cache *s, struct page *page)
__acquires(&object_map_lock)
{
- void *p;
- void *addr = page_address(page);
-
VM_BUG_ON(!irqs_disabled());

spin_lock(&object_map_lock);

- bitmap_zero(object_map, page->objects);
-
- for (p = page->freelist; p; p = get_freepointer(s, p))
- set_bit(__obj_to_index(s, addr, p), object_map);
+ __fill_map(object_map, s, page);

return object_map;
}
@@ -4876,17 +4882,17 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
}

static void process_slab(struct loc_track *t, struct kmem_cache *s,
- struct page *page, enum track_item alloc)
+ struct page *page, enum track_item alloc,
+ unsigned long *obj_map)
{
void *addr = page_address(page);
void *p;
- unsigned long *map;

- map = get_map(s, page);
+ __fill_map(obj_map, s, page);
+
for_each_object(p, s, addr, page->objects)
- if (!test_bit(__obj_to_index(s, addr, p), map))
+ if (!test_bit(__obj_to_index(s, addr, p), obj_map))
add_location(t, s, get_track(s, p, alloc));
- put_map(map);
}
#endif /* CONFIG_DEBUG_FS */
#endif /* CONFIG_SLUB_DEBUG */
@@ -5813,14 +5819,21 @@ static int slab_debug_trace_open(struct inode *inode, struct file *filep)
struct loc_track *t = __seq_open_private(filep, &slab_debugfs_sops,
sizeof(struct loc_track));
struct kmem_cache *s = file_inode(filep)->i_private;
+ unsigned long *obj_map;
+
+ obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+ if (!obj_map)
+ return -ENOMEM;

if (strcmp(filep->f_path.dentry->d_name.name, "alloc_traces") == 0)
alloc = TRACK_ALLOC;
else
alloc = TRACK_FREE;

- if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL))
+ if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL)) {
+ bitmap_free(obj_map);
return -ENOMEM;
+ }

for_each_kmem_cache_node(s, node, n) {
unsigned long flags;
@@ -5831,12 +5844,13 @@ static int slab_debug_trace_open(struct inode *inode, struct file *filep)

spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry(page, &n->partial, slab_list)
- process_slab(t, s, page, alloc);
+ process_slab(t, s, page, alloc, obj_map);
list_for_each_entry(page, &n->full, slab_list)
- process_slab(t, s, page, alloc);
+ process_slab(t, s, page, alloc, obj_map);
spin_unlock_irqrestore(&n->list_lock, flags);
}

+ bitmap_free(obj_map);
return 0;
}

--
2.32.0
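
Editorial illustration (not part of the patch): the idea of filling a
caller-owned bitmap from the freelist, instead of borrowing a shared locked
one, can be sketched in userspace C. This is only a sketch under assumptions:
the toy_* names, the index-based freelist and the TOY_* macros are invented
for the example and merely mimic __fill_map() and the bitmap_alloc() /
bitmap_free() pairing; they are not kernel APIs.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace equivalents of BITS_PER_LONG / BITS_TO_LONGS(). */
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_BITS_TO_LONGS(n) (((n) + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

/* Hypothetical stand-in for a slab page: objects are identified by index. */
struct toy_slab {
	size_t nr_objects;
	const int *next_free;	/* next_free[i]: next free index, -1 ends the list */
	int freelist;		/* first free index, -1 if nothing is free */
};

/* Loosely mirrors __fill_map(): zero the map, then set a bit per free object. */
static void toy_fill_map(unsigned long *map, const struct toy_slab *slab)
{
	memset(map, 0, TOY_BITS_TO_LONGS(slab->nr_objects) * sizeof(*map));

	for (int i = slab->freelist; i >= 0; i = slab->next_free[i]) {
		size_t bit = (size_t)i;

		map[bit / TOY_BITS_PER_LONG] |= 1UL << (bit % TOY_BITS_PER_LONG);
	}
}

static int toy_test_bit(const unsigned long *map, size_t bit)
{
	return (int)((map[bit / TOY_BITS_PER_LONG] >> (bit % TOY_BITS_PER_LONG)) & 1UL);
}

/* Count objects that are in use, i.e. not marked free in the private map. */
static size_t toy_count_allocated(const struct toy_slab *slab, unsigned long *map)
{
	size_t used = 0;

	toy_fill_map(map, slab);
	for (size_t i = 0; i < slab->nr_objects; i++)
		if (!toy_test_bit(map, i))
			used++;
	return used;
}

/*
 * Allocate the map once, before any section that would hold a lock,
 * reuse it for every slab, and free it at the end.
 * Returns -1 if the map cannot be allocated.
 */
static long toy_scan(const struct toy_slab *slabs, size_t nr_slabs,
		     size_t max_objects)
{
	unsigned long *map = calloc(TOY_BITS_TO_LONGS(max_objects), sizeof(*map));
	long total = 0;

	if (!map)
		return -1;

	for (size_t i = 0; i < nr_slabs; i++) {
		assert(slabs[i].nr_objects <= max_objects);
		total += (long)toy_count_allocated(&slabs[i], map);
	}

	free(map);
	return total;
}
```

Because the map belongs to the caller, no global lock is needed around the
fill-and-test sequence; the cost is one allocation per scan, which seems
acceptable for a debugfs open handler.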

2021-08-05 18:46:05

by Vlastimil Babka

Subject: [PATCH v4 04/35] mm, slub: don't disable irq for debug_check_no_locks_freed()

In slab_free_hook() we disable irqs around the debug_check_no_locks_freed()
call, which is unnecessary, as irqs are already being disabled inside the call.
This seems to be a leftover from the past, when there were more calls inside
the irq disabled sections. Remove the irq disable/enable operations.

Mel noted:
> Looks like it was needed for kmemcheck which went away back in 4.15

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e1889b26a889..4ac4ad021fca 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1588,20 +1588,8 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s,
{
kmemleak_free_recursive(x, s->flags);

- /*
- * Trouble is that we may no longer disable interrupts in the fast path
- * So in order to make the debug calls that expect irqs to be
- * disabled we need to disable interrupts temporarily.
- */
-#ifdef CONFIG_LOCKDEP
- {
- unsigned long flags;
+ debug_check_no_locks_freed(x, s->object_size);

- local_irq_save(flags);
- debug_check_no_locks_freed(x, s->object_size);
- local_irq_restore(flags);
- }
-#endif
if (!(s->flags & SLAB_DEBUG_OBJECTS))
debug_check_no_obj_freed(x, s->object_size);

--
2.32.0

2021-08-05 18:46:45

by Vlastimil Babka

Subject: [PATCH v4 05/35] mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()

Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs immediately")
introduced cpu partial flushing for kmemcg caches, based on setting the target
cpu_partial to 0 and adding a flushing check in put_cpu_partial().
This code that sets cpu_partial to 0 was later moved by c9fc586403e7 ("slab:
introduce __kmemcg_cache_deactivate()") and ultimately removed by 9855609bde03
("mm: memcg/slab: use a single set of kmem_caches for all accounted
allocations"). However, the check and flush in put_cpu_partial() were never
removed, although they are effectively dead code. So this patch removes them.

Note that d6e0b7fa1186 also added preempt_disable()/enable() to
unfreeze_partials(), which could thus also be considered unnecessary. But
further patches will rely on it, so keep it.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 7 -------
1 file changed, 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4ac4ad021fca..812345fdf13c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2463,13 +2463,6 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)

} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
!= oldpage);
- if (unlikely(!slub_cpu_partial(s))) {
- unsigned long flags;
-
- local_irq_save(flags);
- unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
- local_irq_restore(flags);
- }
preempt_enable();
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
--
2.32.0

2021-08-05 18:47:27

by Vlastimil Babka

Subject: [PATCH v4 08/35] mm, slub: dissolve new_slab_objects() into ___slab_alloc()

The later patches will need more fine grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it
there. This is a preparatory step with no functional change.

The only minor change is moving WARN_ON_ONCE() for using a constructor together
with __GFP_ZERO to new_slab(), which makes it somewhat less frequent, but still
able to catch a development change introducing a systematic misuse.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 50 ++++++++++++++++++--------------------------------
1 file changed, 18 insertions(+), 32 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 25f102acf5b4..3222ea51fb50 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1868,6 +1868,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
if (unlikely(flags & GFP_SLAB_BUG_MASK))
flags = kmalloc_fix_flags(flags);

+ WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
+
return allocate_slab(s,
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
}
@@ -2593,36 +2595,6 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
#endif
}

-static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
- int node, struct kmem_cache_cpu **pc)
-{
- void *freelist = NULL;
- struct kmem_cache_cpu *c = *pc;
- struct page *page;
-
- WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
-
- page = new_slab(s, flags, node);
- if (page) {
- c = raw_cpu_ptr(s->cpu_slab);
- if (c->page)
- flush_slab(s, c);
-
- /*
- * No other reference to the page yet so we can
- * muck around with it freely without cmpxchg
- */
- freelist = page->freelist;
- page->freelist = NULL;
-
- stat(s, ALLOC_SLAB);
- c->page = page;
- *pc = c;
- }
-
- return freelist;
-}
-
static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
{
if (unlikely(PageSlabPfmemalloc(page)))
@@ -2769,13 +2741,27 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (freelist)
goto check_new_page;

- freelist = new_slab_objects(s, gfpflags, node, &c);
+ page = new_slab(s, gfpflags, node);

- if (unlikely(!freelist)) {
+ if (unlikely(!page)) {
slab_out_of_memory(s, gfpflags, node);
return NULL;
}

+ c = raw_cpu_ptr(s->cpu_slab);
+ if (c->page)
+ flush_slab(s, c);
+
+ /*
+ * No other reference to the page yet so we can
+ * muck around with it freely without cmpxchg
+ */
+ freelist = page->freelist;
+ page->freelist = NULL;
+
+ stat(s, ALLOC_SLAB);
+ c->page = page;
+
check_new_page:
page = c->page;
if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
--
2.32.0

2021-08-05 19:00:19

by Vlastimil Babka

Subject: [PATCH v4 09/35] mm, slub: return slab page from get_partial() and set c->page afterwards

The function get_partial() finds a suitable page on a partial list, acquires
and returns its freelist and assigns the page pointer to kmem_cache_cpu.
In a later patch we will need more control over the kmem_cache_cpu.page
assignment, so instead of passing a kmem_cache_cpu pointer, pass a pointer to a
pointer to a page that get_partial() can fill and the caller can assign the
kmem_cache_cpu.page pointer. No functional change as all of this still happens
with disabled IRQs.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 3222ea51fb50..ed18fa3157ad 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2000,7 +2000,7 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
* Try to allocate a partial slab from a specific node.
*/
static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
- struct kmem_cache_cpu *c, gfp_t flags)
+ struct page **ret_page, gfp_t flags)
{
struct page *page, *page2;
void *object = NULL;
@@ -2029,7 +2029,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,

available += objects;
if (!object) {
- c->page = page;
+ *ret_page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
} else {
@@ -2049,7 +2049,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
* Get a page from somewhere. Search in increasing NUMA distances.
*/
static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
- struct kmem_cache_cpu *c)
+ struct page **ret_page)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
@@ -2091,7 +2091,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,

if (n && cpuset_zone_allowed(zone, flags) &&
n->nr_partial > s->min_partial) {
- object = get_partial_node(s, n, c, flags);
+ object = get_partial_node(s, n, ret_page, flags);
if (object) {
/*
* Don't check read_mems_allowed_retry()
@@ -2113,7 +2113,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
* Get a partial page, lock it and return it.
*/
static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
- struct kmem_cache_cpu *c)
+ struct page **ret_page)
{
void *object;
int searchnode = node;
@@ -2121,11 +2121,11 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();

- object = get_partial_node(s, get_node(s, searchnode), c, flags);
+ object = get_partial_node(s, get_node(s, searchnode), ret_page, flags);
if (object || node != NUMA_NO_NODE)
return object;

- return get_any_partial(s, flags, c);
+ return get_any_partial(s, flags, ret_page);
}

#ifdef CONFIG_PREEMPTION
@@ -2737,9 +2737,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}

- freelist = get_partial(s, gfpflags, node, c);
- if (freelist)
+ freelist = get_partial(s, gfpflags, node, &page);
+ if (freelist) {
+ c->page = page;
goto check_new_page;
+ }

page = new_slab(s, gfpflags, node);

@@ -2763,7 +2765,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->page = page;

check_new_page:
- page = c->page;
if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
goto load_freelist;

--
2.32.0
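
Editorial illustration (not part of the patch): the interface change, where
the helper reports the page through an out-parameter and the caller decides
when to publish it into the per-CPU structure, might look like this in plain
userspace C. Everything named toy_* is hypothetical and heavily simplified;
it only illustrates the shape of the call, not the real get_partial() logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much simplified stand-ins for struct page / kmem_cache_cpu. */
struct toy_page {
	void *freelist;
};

struct toy_cpu {
	struct toy_page *page;
};

/*
 * Find a page with free objects, take over its freelist and report the
 * page through *ret_page. Per-CPU state is deliberately not touched here.
 */
static void *toy_get_partial(struct toy_page *candidates, size_t n,
			     struct toy_page **ret_page)
{
	for (size_t i = 0; i < n; i++) {
		if (candidates[i].freelist) {
			void *freelist = candidates[i].freelist;

			candidates[i].freelist = NULL;	/* caller owns the objects now */
			*ret_page = &candidates[i];
			return freelist;
		}
	}
	return NULL;
}

/* The caller, not the helper, assigns c->page. */
static void *toy_alloc(struct toy_cpu *c, struct toy_page *candidates, size_t n)
{
	struct toy_page *page = NULL;
	void *freelist = toy_get_partial(candidates, n, &page);

	if (freelist)
		c->page = page;
	return freelist;
}
```

Keeping the assignment in the caller is what leaves room to change, in one
place, the conditions under which the per-CPU pointer is updated.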

2021-08-05 19:04:49

by Vlastimil Babka

Subject: [PATCH v4 03/35] mm, slub: allocate private object map for validate_slab_cache()

validate_slab_cache() is called either to handle a sysfs write, or from a
self-test context. In both situations it's straightforward to preallocate a
private object bitmap instead of grabbing the shared static one meant for
critical sections, so let's do that.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 211d380d94d1..e1889b26a889 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4676,11 +4676,11 @@ static int count_total(struct page *page)
#endif

#ifdef CONFIG_SLUB_DEBUG
-static void validate_slab(struct kmem_cache *s, struct page *page)
+static void validate_slab(struct kmem_cache *s, struct page *page,
+ unsigned long *obj_map)
{
void *p;
void *addr = page_address(page);
- unsigned long *map;

slab_lock(page);

@@ -4688,21 +4688,20 @@ static void validate_slab(struct kmem_cache *s, struct page *page)
goto unlock;

/* Now we know that a valid freelist exists */
- map = get_map(s, page);
+ __fill_map(obj_map, s, page);
for_each_object(p, s, addr, page->objects) {
- u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
+ u8 val = test_bit(__obj_to_index(s, addr, p), obj_map) ?
SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;

if (!check_object(s, page, p, val))
break;
}
- put_map(map);
unlock:
slab_unlock(page);
}

static int validate_slab_node(struct kmem_cache *s,
- struct kmem_cache_node *n)
+ struct kmem_cache_node *n, unsigned long *obj_map)
{
unsigned long count = 0;
struct page *page;
@@ -4711,7 +4710,7 @@ static int validate_slab_node(struct kmem_cache *s,
spin_lock_irqsave(&n->list_lock, flags);

list_for_each_entry(page, &n->partial, slab_list) {
- validate_slab(s, page);
+ validate_slab(s, page, obj_map);
count++;
}
if (count != n->nr_partial) {
@@ -4724,7 +4723,7 @@ static int validate_slab_node(struct kmem_cache *s,
goto out;

list_for_each_entry(page, &n->full, slab_list) {
- validate_slab(s, page);
+ validate_slab(s, page, obj_map);
count++;
}
if (count != atomic_long_read(&n->nr_slabs)) {
@@ -4743,10 +4742,17 @@ long validate_slab_cache(struct kmem_cache *s)
int node;
unsigned long count = 0;
struct kmem_cache_node *n;
+ unsigned long *obj_map;
+
+ obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+ if (!obj_map)
+ return -ENOMEM;

flush_all(s);
for_each_kmem_cache_node(s, node, n)
- count += validate_slab_node(s, n);
+ count += validate_slab_node(s, n, obj_map);
+
+ bitmap_free(obj_map);

return count;
}
--
2.32.0

2021-08-05 19:04:53

by Vlastimil Babka

Subject: [PATCH v4 24/35] mm, slub: detach whole partial list at once in unfreeze_partials()

Instead of iterating through the live percpu partial list, detach it from the
kmem_cache_cpu at once. This is simpler and will allow further optimization.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4d60bf482735..984173ce8465 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2341,16 +2341,20 @@ static void unfreeze_partials(struct kmem_cache *s,
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct kmem_cache_node *n = NULL, *n2 = NULL;
- struct page *page, *discard_page = NULL;
+ struct page *page, *partial_page, *discard_page = NULL;
unsigned long flags;

local_irq_save(flags);

- while ((page = slub_percpu_partial(c))) {
+ partial_page = slub_percpu_partial(c);
+ c->partial = NULL;
+
+ while (partial_page) {
struct page new;
struct page old;

- slub_set_percpu_partial(c, page);
+ page = partial_page;
+ partial_page = page->next;

n2 = get_node(s, page_to_nid(page));
if (n != n2) {
--
2.32.0
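
Editorial illustration (not part of the patch): detaching a whole singly
linked list from its owner in one step and then walking the private copy can
be sketched as below. The toy_* types and the release callback are
assumptions made for the example; the real code additionally handles node
locks, cmpxchg on the page state and the discard list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a page with a ->next link and a per-CPU cache. */
struct toy_page {
	struct toy_page *next;
	int id;
};

struct toy_cpu_cache {
	struct toy_page *partial;
};

/* Take the whole list away from the cache; afterwards the caller owns it. */
static struct toy_page *toy_detach_partial(struct toy_cpu_cache *c)
{
	struct toy_page *head = c->partial;

	c->partial = NULL;
	return head;
}

/*
 * Walk the detached list. ->next is read before the page is handed to
 * release(), so release() may reuse or clear the link freely.
 * Returns the number of pages processed.
 */
static int toy_unfreeze_all(struct toy_cpu_cache *c,
			    void (*release)(struct toy_page *))
{
	struct toy_page *partial_page = toy_detach_partial(c);
	int processed = 0;

	while (partial_page) {
		struct toy_page *page = partial_page;

		partial_page = page->next;
		page->next = NULL;
		if (release)
			release(page);
		processed++;
	}
	return processed;
}
```

Since the cache's head pointer is cleared up front, the loop never has to
write back to shared per-CPU state on each iteration, which is presumably the
simplification the changelog refers to.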

Subject: Re: [PATCH v4 00/35] SLUB: reduce irq disabled scope and make it RT compatible

On 2021-08-05 17:19:25 [+0200], Vlastimil Babka wrote:
> Hi Andrew,
Hi,

> I believe the series is ready for mmotm. No known bugs, Mel found no !RT perf
> regressions in v3 [9], Mike also (details below). RT guys validated it on RT
> config and already incorporated the series in the RT tree.

Correct, incl. the percpu-partial list fix.

…
> RT configs showed some throughput regressions, but that's expected tradeoff for
> the preemption improvements through the RT mutex. It didn't prevent the v2 to
> be incorporated to the 5.13 RT tree [7], leading to testing exposure and
> bugfixes.

There was a throughput regression in RT compared to previous releases
(without this series). The regression was (based on my testing) only
visible in hackbench and was addressed by adding adaptive spinning to
RT-mutex. With that we are almost back to what we had before :)

…
> The remaining patches to upstream from the RT tree are small ones related to
> KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
> perhaps be re-evaluated as the series addresses some latency issues with it.

With your rework CONFIG_SLUB_CPU_PARTIAL can be enabled in RT since
v5.14-rc3-rt1. So it has been re-evaluated :)

Regarding SLAB/SLOB: SLOB has a few design parts which are incompatible
with RT (if my memory serves me), so nobody ever attempted to get it
working. SLAB was used before SLUB and required a lot of love. SLUB
performed better compared to SLAB (in both throughput and latency) and
after a while the SLAB patches were dropped.

Sebastian

2021-08-06 07:52:04

by Mike Galbraith

Subject: Re: [PATCH v4 00/35] SLUB: reduce irq disabled scope and make it RT compatible

On Thu, 2021-08-05 at 18:42 +0200, Sebastian Andrzej Siewior wrote:
>
> There was a throughput regression in RT compared to previous releases
> (without this series). The regression was (based on my testing) only
> visible in hackbench and was addressed by adding adaptive spinning to
> RT-mutex. With that we are almost back to what we had before :)

Numbers on my box say a throughput regression remains (silly fork bomb
scenario.. yawn), which can be recouped by either turning on all
SL[AU]B features or converting the list_lock to a raw lock. They also
seem to be saying that if you turned on PREEMPT_RT because you care
about RT performance first and foremost (gee), you'll do neither of
those, because either will eliminate an RT performance progression.

-Mike

numbers...

box is old i4790 desktop
perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed

SLUB+SLUB_DEBUG only

begin previously reported numbers
5.14.0.g79e92006-tip-rt (5.12-rt based as before, 5.13-rt didn't yet exist)
7,984.52 msec task-clock # 7.565 CPUs utilized ( +- 0.66% )
353,566 context-switches # 44.281 K/sec ( +- 2.77% )
37,685 cpu-migrations # 4.720 K/sec ( +- 6.37% )
12,939 page-faults # 1.620 K/sec ( +- 0.67% )
29,901,079,227 cycles # 3.745 GHz ( +- 0.71% )
14,550,797,818 instructions # 0.49 insn per cycle ( +- 0.47% )
3,056,685,643 branches # 382.826 M/sec ( +- 0.51% )
9,598,083 branch-misses # 0.31% of all branches ( +- 2.11% )

1.05542 +- 0.00409 seconds time elapsed ( +- 0.39% )
1.05990 +- 0.00244 seconds time elapsed ( +- 0.23% ) (repeat)
1.05367 +- 0.00303 seconds time elapsed ( +- 0.29% ) (repeat)

5.14.0.g79e92006-tip-rt +slub-local-lock-v2r3 -0034-mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch
6,899.35 msec task-clock # 5.637 CPUs utilized ( +- 0.53% )
420,304 context-switches # 60.919 K/sec ( +- 2.83% )
187,130 cpu-migrations # 27.123 K/sec ( +- 1.81% )
13,206 page-faults # 1.914 K/sec ( +- 0.96% )
25,110,362,933 cycles # 3.640 GHz ( +- 0.49% )
15,853,643,635 instructions # 0.63 insn per cycle ( +- 0.64% )
3,366,261,524 branches # 487.910 M/sec ( +- 0.70% )
14,839,618 branch-misses # 0.44% of all branches ( +- 2.01% )

1.22390 +- 0.00744 seconds time elapsed ( +- 0.61% )
1.21813 +- 0.00907 seconds time elapsed ( +- 0.74% ) (repeat)
1.22097 +- 0.00952 seconds time elapsed ( +- 0.78% ) (repeat)

repeat of above with raw list_lock
8,072.62 msec task-clock # 7.605 CPUs utilized ( +- 0.49% )
359,514 context-switches # 44.535 K/sec ( +- 4.95% )
35,285 cpu-migrations # 4.371 K/sec ( +- 5.82% )
13,503 page-faults # 1.673 K/sec ( +- 0.96% )
30,247,989,681 cycles # 3.747 GHz ( +- 0.52% )
14,580,011,391 instructions # 0.48 insn per cycle ( +- 0.81% )
3,063,743,405 branches # 379.523 M/sec ( +- 0.85% )
8,907,160 branch-misses # 0.29% of all branches ( +- 3.99% )

1.06150 +- 0.00427 seconds time elapsed ( +- 0.40% )
1.05041 +- 0.00176 seconds time elapsed ( +- 0.17% ) (repeat)
1.06086 +- 0.00237 seconds time elapsed ( +- 0.22% ) (repeat)

5.14.0.g79e92006-rt3-tip-rt +slub-local-lock-v2r3 full set
7,598.44 msec task-clock # 5.813 CPUs utilized ( +- 0.85% )
488,161 context-switches # 64.245 K/sec ( +- 4.29% )
196,866 cpu-migrations # 25.909 K/sec ( +- 1.49% )
13,042 page-faults # 1.716 K/sec ( +- 0.73% )
27,695,116,746 cycles # 3.645 GHz ( +- 0.79% )
18,423,934,168 instructions # 0.67 insn per cycle ( +- 0.88% )
3,969,540,695 branches # 522.415 M/sec ( +- 0.92% )
15,493,482 branch-misses # 0.39% of all branches ( +- 2.15% )

1.30709 +- 0.00890 seconds time elapsed ( +- 0.68% )
1.3205 +- 0.0134 seconds time elapsed ( +- 1.02% ) (repeat)
1.3083 +- 0.0132 seconds time elapsed ( +- 1.01% ) (repeat)
end previously reported numbers

5.14.0.gf6a71a5-rt6-tip-rt (same config, full slub set.. obviously)
7,707.63 msec task-clock # 5.880 CPUs utilized ( +- 1.46% )
562,533 context-switches # 72.984 K/sec ( +- 7.46% )
208,475 cpu-migrations # 27.048 K/sec ( +- 2.26% )
13,022 page-faults # 1.689 K/sec ( +- 0.80% )
28,025,004,779 cycles # 3.636 GHz ( +- 1.34% )
18,487,135,489 instructions # 0.66 insn per cycle ( +- 1.58% )
3,997,110,493 branches # 518.591 M/sec ( +- 1.65% )
16,078,322 branch-misses # 0.40% of all branches ( +- 4.23% )

1.3108 +- 0.0135 seconds time elapsed ( +- 1.03% )
1.2997 +- 0.0138 seconds time elapsed ( +- 1.06% ) (repeat)
1.3009 +- 0.0166 seconds time elapsed ( +- 1.28% ) (repeat)

5.14.0.gf6a71a5-rt6-tip-rt +list_lock=raw_spinlock_t
8,252.59 msec task-clock # 7.584 CPUs utilized ( +- 0.27% )
400,991 context-switches # 48.590 K/sec ( +- 6.15% )
35,979 cpu-migrations # 4.360 K/sec ( +- 5.63% )
13,261 page-faults # 1.607 K/sec ( +- 0.73% )
30,910,310,737 cycles # 3.746 GHz ( +- 0.31% )
16,522,383,240 instructions # 0.53 insn per cycle ( +- 0.92% )
3,535,219,839 branches # 428.377 M/sec ( +- 0.96% )
10,115,967 branch-misses # 0.29% of all branches ( +- 4.32% )

1.08817 +- 0.00238 seconds time elapsed ( +- 0.22% )
1.08583 +- 0.00243 seconds time elapsed ( +- 0.22% ) (repeat)
1.09003 +- 0.00164 seconds time elapsed ( +- 0.15% ) (repeat)

5.14.0.g251a152-rt6-master-rt (+SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED)
8,170.48 msec task-clock # 7.390 CPUs utilized ( +- 0.43% )
449,994 context-switches # 55.076 K/sec ( +- 4.20% )
55,912 cpu-migrations # 6.843 K/sec ( +- 4.28% )
13,144 page-faults # 1.609 K/sec ( +- 0.53% )
30,484,114,812 cycles # 3.731 GHz ( +- 0.44% )
17,554,521,787 instructions # 0.58 insn per cycle ( +- 0.76% )
3,751,725,852 branches # 459.181 M/sec ( +- 0.81% )
13,421,985 branch-misses # 0.36% of all branches ( +- 2.40% )

1.10563 +- 0.00382 seconds time elapsed ( +- 0.35% )
1.1098 +- 0.0147 seconds time elapsed ( +- 1.32% ) (repeat)
1.11308 +- 0.00883 seconds time elapsed ( +- 0.79% ) (repeat)

5.14.0.gf6a71a5-rt6-tip-rt +SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED
8,026.39 msec task-clock # 7.320 CPUs utilized ( +- 0.70% )
496,579 context-switches # 61.868 K/sec ( +- 6.78% )
65,022 cpu-migrations # 8.101 K/sec ( +- 8.29% )
13,161 page-faults # 1.640 K/sec ( +- 0.51% )
29,870,954,733 cycles # 3.722 GHz ( +- 0.67% )
17,617,522,235 instructions # 0.59 insn per cycle ( +- 1.36% )
3,760,346,459 branches # 468.498 M/sec ( +- 1.45% )
12,863,520 branch-misses # 0.34% of all branches ( +- 4.45% )

1.0965 +- 0.0103 seconds time elapsed ( +- 0.94% )
1.08149 +- 0.00362 seconds time elapsed ( +- 0.33% ) (repeat)
1.10027 +- 0.00916 seconds time elapsed ( +- 0.83% ) (repeat)

yup, perf delta == config delta, let's have a peek at jitter

cyclictest -Smqp99& perf stat -r100 hackbench -s4096 -l500 && killall cyclictest

5.14.0.gf6a71a5-rt6-tip-rt
SLUB+SLUB_DEBUG
T: 1 ( 5903) P:99 I:1500 C: 92330 Min: 1 Act: 2 Avg: 6 Max: 19
T: 2 ( 5904) P:99 I:2000 C: 69247 Min: 1 Act: 2 Avg: 6 Max: 21
T: 3 ( 5905) P:99 I:2500 C: 55395 Min: 1 Act: 3 Avg: 6 Max: 22
T: 4 ( 5906) P:99 I:3000 C: 46163 Min: 1 Act: 4 Avg: 7 Max: 22
T: 5 ( 5907) P:99 I:3500 C: 39568 Min: 1 Act: 3 Avg: 6 Max: 23
T: 6 ( 5909) P:99 I:4000 C: 34621 Min: 1 Act: 2 Avg: 7 Max: 22
T: 7 ( 5910) P:99 I:4500 C: 30774 Min: 1 Act: 3 Avg: 7 Max: 18

SLUB+SLUB_DEBUG+list_lock=raw_spinlock_t
T: 1 ( 4044) P:99 I:1500 C: 73340 Min: 1 Act: 3 Avg: 10 Max: 28
T: 2 ( 4045) P:99 I:2000 C: 55004 Min: 1 Act: 4 Avg: 10 Max: 33
T: 3 ( 4046) P:99 I:2500 C: 44002 Min: 1 Act: 2 Avg: 10 Max: 26
T: 4 ( 4047) P:99 I:3000 C: 36668 Min: 1 Act: 3 Avg: 10 Max: 24
T: 5 ( 4048) P:99 I:3500 C: 31429 Min: 1 Act: 3 Avg: 10 Max: 27
T: 6 ( 4049) P:99 I:4000 C: 27500 Min: 1 Act: 3 Avg: 11 Max: 30
T: 7 ( 4050) P:99 I:4500 C: 24444 Min: 1 Act: 4 Avg: 11 Max: 25

SLUB+SLUB_DEBUG+SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED
T: 1 ( 4036) P:99 I:1500 C: 74039 Min: 1 Act: 3 Avg: 9 Max: 31
T: 2 ( 4037) P:99 I:2000 C: 55528 Min: 1 Act: 3 Avg: 10 Max: 29
T: 3 ( 4038) P:99 I:2500 C: 44422 Min: 1 Act: 2 Avg: 10 Max: 31
T: 4 ( 4039) P:99 I:3000 C: 37017 Min: 1 Act: 2 Avg: 9 Max: 23
T: 5 ( 4040) P:99 I:3500 C: 31729 Min: 1 Act: 3 Avg: 10 Max: 29
T: 6 ( 4041) P:99 I:4000 C: 27762 Min: 1 Act: 2 Avg: 8 Max: 26
T: 7 ( 4042) P:99 I:4500 C: 24677 Min: 1 Act: 3 Avg: 9 Max: 27

conclusion: gee, pi both works and ain't free - ditto add more stuff=cycles :)
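
Editorial illustration (not part of the original mail): to put the elapsed
times above into relative terms, a small helper can compute the slowdown of
each configuration against a baseline. The function name is hypothetical, and
treating the first "previously reported" run (1.05542 s) as the baseline is
an assumption; that run used a different kernel base than the rt6 runs, so
the percentages are only a rough guide.

```c
#include <assert.h>
#include <stdio.h>

/* Relative slowdown of candidate vs. baseline, in percent (positive = slower). */
static double slowdown_pct(double baseline, double candidate)
{
	return (candidate - baseline) / baseline * 100.0;
}

/* Print the deltas for the first elapsed-time sample of each rt6 configuration. */
static void print_hackbench_deltas(void)
{
	const double baseline = 1.05542;	/* 5.14.0.g79e92006-tip-rt, no series */
	const struct {
		const char *config;
		double elapsed;
	} runs[] = {
		{ "rt6, full slub set",          1.3108  },
		{ "rt6 + raw list_lock",         1.08817 },
		{ "rt6 + all SL[AU]B features",  1.0965  },
	};

	for (size_t i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
		printf("%-28s %+.1f%%\n", runs[i].config,
		       slowdown_pct(baseline, runs[i].elapsed));
}
```

With the numbers quoted above this gives roughly +24% for the full series,
and roughly +3% to +4% once list_lock is raw or the extra SLUB features are
enabled, which matches the "regression remains, but can be recouped" reading.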

2021-08-06 08:17:20

by Vlastimil Babka

Subject: Re: [PATCH v4 00/35] SLUB: reduce irq disabled scope and make it RT compatible

On 8/6/21 7:14 AM, Mike Galbraith wrote:
> On Thu, 2021-08-05 at 18:42 +0200, Sebastian Andrzej Siewior wrote:
>>
>> There was a throughput regression in RT compared to previous releases
>> (without this series). The regression was (based on my testing) only
>> visible in hackbench and was addressed by adding adaptive spinning to
>> RT-mutex. With that we are almost back to what we had before :)
>
> Numbers on my box say a throughput regression remains (silly fork bomb
> scenario.. yawn), which can be recouped by either turning on all
> SL[AU]B features or converting the list_lock to a raw lock.

I'm surprised you can still use a raw lock there in v3/v4, because there's now
a path where get_partial_node() takes the list_lock and can call
put_cpu_partial(), which takes the local_lock. But it seems your results below
indicate that this was without CONFIG_SLUB_CPU_PARTIAL, so that would still work.

> They also
> seem to be saying that if you turned on PREEMPT_RT because you care
> about RT performance first and foremost (gee), you'll do neither of
> those, because either will eliminate an RT performance progression.

That was my assumption: that there would be some tradeoff and RT is willing to
sacrifice some throughput here... which should only be visible if your benchmark
is close to a slab microbenchmark, as hackbench is.

Thanks again!

2021-08-09 14:19:28

by Qian Cai

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context



On 8/5/2021 11:19 AM, Vlastimil Babka wrote:
> From: Sebastian Andrzej Siewior <[email protected]>
>
> flush_all() flushes a specific SLAB cache on each CPU (where the cache
> is present). The deactivate_slab()/__free_slab() invocation happens
> within IPI handler and is problematic for PREEMPT_RT.
>
> The flush operation is not a frequent operation or a hot path. The
> per-CPU flush operation can be moved to within a workqueue.
>
> [[email protected]: adapt to new SLUB changes]
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/slub.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 48 insertions(+), 8 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index dceb289cb052..da48ada3d17f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2513,33 +2513,73 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
> unfreeze_partials_cpu(s, c);
> }
>
> +struct slub_flush_work {
> + struct work_struct work;
> + struct kmem_cache *s;
> + bool skip;
> +};
> +
> /*
> * Flush cpu slab.
> *
> - * Called from IPI handler with interrupts disabled.
> + * Called from CPU work handler with migration disabled.
> */
> -static void flush_cpu_slab(void *d)
> +static void flush_cpu_slab(struct work_struct *w)
> {
> - struct kmem_cache *s = d;
> - struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
> + struct kmem_cache *s;
> + struct kmem_cache_cpu *c;
> + struct slub_flush_work *sfw;
> +
> + sfw = container_of(w, struct slub_flush_work, work);
> +
> + s = sfw->s;
> + c = this_cpu_ptr(s->cpu_slab);
>
> if (c->page)
> - flush_slab(s, c, false);
> + flush_slab(s, c, true);
>
> unfreeze_partials(s);
> }
>
> -static bool has_cpu_slab(int cpu, void *info)
> +static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> {
> - struct kmem_cache *s = info;
> struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
>
> return c->page || slub_percpu_partial(c);
> }
>
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> +
> static void flush_all(struct kmem_cache *s)
> {
> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
> + struct slub_flush_work *sfw;
> + unsigned int cpu;
> +
> + mutex_lock(&flush_lock);

Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:

slab_mutex -> flush_lock

[ 91.374541] WARNING: possible circular locking dependency detected
[ 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
[ 91.387149] ------------------------------------------------------
[ 91.394016] lsbug/1523 is trying to acquire lock:
[ 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
[ 91.407425]
but task is already holding lock:
[ 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
[ 91.423603]
which lock already depends on the new lock.

[ 91.433854]
the existing dependency chain (in reverse order) is:
[ 91.442715]
-> #4 (slab_mutex){+.+.}-{3:3}:
[ 91.449766] __lock_acquire+0xb0c/0x1aa8
[ 91.454901] lock_acquire+0x34c/0xb20
[ 91.459773] __mutex_lock+0x194/0x1470
[ 91.464732] mutex_lock_nested+0x6c/0xc0
[ 91.469864] slab_memory_callback+0x44/0x280
[ 91.475344] blocking_notifier_call_chain+0xd0/0x138
[ 91.481519] memory_notify+0x28/0x38
[ 91.486304] offline_pages+0x2cc/0xce4
[ 91.491262] memory_subsys_offline+0xd8/0x280
[ 91.496827] device_offline+0x154/0x1e0
[ 91.501872] online_store+0xa4/0x118
[ 91.506656] dev_attr_store+0x44/0x78
[ 91.511527] sysfs_kf_write+0xe8/0x138
[ 91.516485] kernfs_fop_write_iter+0x26c/0x3d0
[ 91.522138] new_sync_write+0x2bc/0x4f8
[ 91.527185] vfs_write+0x718/0xc88
[ 91.531795] ksys_write+0xf8/0x1e0
[ 91.536404] __arm64_sys_write+0x74/0xa8
[ 91.541535] invoke_syscall.constprop.0+0xdc/0x1d8
[ 91.547536] do_el0_svc+0xe4/0x2a8
[ 91.552146] el0_svc+0x64/0x130
[ 91.556498] el0t_64_sync_handler+0xb0/0xb8
[ 91.561889] el0t_64_sync+0x180/0x184
[ 91.566760]
-> #3 ((memory_chain).rwsem){++++}-{3:3}:
[ 91.574680] __lock_acquire+0xb0c/0x1aa8
[ 91.579814] lock_acquire+0x34c/0xb20
[ 91.584685] down_read+0xf0/0x488
[ 91.589210] blocking_notifier_call_chain+0x58/0x138
[ 91.595383] memory_notify+0x28/0x38
[ 91.600167] offline_pages+0x2cc/0xce4
[ 91.605124] memory_subsys_offline+0xd8/0x280
[ 91.610689] device_offline+0x154/0x1e0
[ 91.615734] online_store+0xa4/0x118
[ 91.620518] dev_attr_store+0x44/0x78
[ 91.625388] sysfs_kf_write+0xe8/0x138
[ 91.630346] kernfs_fop_write_iter+0x26c/0x3d0
[ 91.635997] new_sync_write+0x2bc/0x4f8
[ 91.641043] vfs_write+0x718/0xc88
[ 91.645652] ksys_write+0xf8/0x1e0
[ 91.650262] __arm64_sys_write+0x74/0xa8
[ 91.655393] invoke_syscall.constprop.0+0xdc/0x1d8
[ 91.661394] do_el0_svc+0xe4/0x2a8
[ 91.666004] el0_svc+0x64/0x130
[ 91.670355] el0t_64_sync_handler+0xb0/0xb8
[ 91.675747] el0t_64_sync+0x180/0x184
[ 91.680617]
-> #2 (pcp_batch_high_lock){+.+.}-{3:3}:
[ 91.688449] __lock_acquire+0xb0c/0x1aa8
[ 91.693582] lock_acquire+0x34c/0xb20
[ 91.698452] __mutex_lock+0x194/0x1470
[ 91.703410] mutex_lock_nested+0x6c/0xc0
[ 91.708541] zone_pcp_update+0x3c/0x68
[ 91.713500] page_alloc_cpu_online+0x64/0x90
[ 91.718978] cpuhp_invoke_callback+0x588/0x2ba8
[ 91.724718] cpuhp_invoke_callback_range+0xa4/0x108
[ 91.730804] cpu_up+0x598/0xb78
[ 91.735154] bringup_nonboot_cpus+0x110/0x168
[ 91.740719] smp_init+0x4c/0xe0
[ 91.745070] kernel_init_freeable+0x554/0x7c8
[ 91.750637] kernel_init+0x2c/0x140
[ 91.755334] ret_from_fork+0x10/0x20
[ 91.760118]
-> #1 (cpu_hotplug_lock){++++}-{0:0}:
[ 91.767688] __lock_acquire+0xb0c/0x1aa8
[ 91.772820] lock_acquire+0x34c/0xb20
[ 91.777691] cpus_read_lock+0x98/0x308
[ 91.782649] flush_all+0x54/0x1c8
[ 91.787173] __kmem_cache_shrink+0x38/0x2f0
[ 91.792566] kmem_cache_shrink+0x28/0x38
[ 91.797699] acpi_os_purge_cache+0x18/0x28
[ 91.803006] acpi_purge_cached_objects+0x44/0xdc
[ 91.808832] acpi_initialize_objects+0x24/0x88
[ 91.814487] acpi_bus_init+0xe0/0x47c
[ 91.819357] acpi_init+0x130/0x27c
[ 91.823967] do_one_initcall+0x180/0xbe8
[ 91.829098] kernel_init_freeable+0x710/0x7c8
[ 91.834663] kernel_init+0x2c/0x140
[ 91.839360] ret_from_fork+0x10/0x20
[ 91.844143]
-> #0 (flush_lock){+.+.}-{3:3}:
[ 91.851193] check_prev_add+0x194/0x1170
[ 91.856326] validate_chain+0xfe8/0x1c20
[ 91.861458] __lock_acquire+0xb0c/0x1aa8
[ 91.866589] lock_acquire+0x34c/0xb20
[ 91.871460] __mutex_lock+0x194/0x1470
[ 91.876418] mutex_lock_nested+0x6c/0xc0
[ 91.881549] flush_all+0x50/0x1c8
[ 91.886072] __kmem_cache_shrink+0x38/0x2f0
[ 91.891465] slab_memory_callback+0x68/0x280
[ 91.896943] blocking_notifier_call_chain+0xd0/0x138
[ 91.903117] memory_notify+0x28/0x38
[ 91.907901] offline_pages+0x2cc/0xce4
[ 91.912859] memory_subsys_offline+0xd8/0x280
[ 91.918424] device_offline+0x154/0x1e0
[ 91.923470] online_store+0xa4/0x118
[ 91.928254] dev_attr_store+0x44/0x78
[ 91.933125] sysfs_kf_write+0xe8/0x138
[ 91.938083] kernfs_fop_write_iter+0x26c/0x3d0
[ 91.943735] new_sync_write+0x2bc/0x4f8
[ 91.948781] vfs_write+0x718/0xc88
[ 91.953391] ksys_write+0xf8/0x1e0
[ 91.958000] __arm64_sys_write+0x74/0xa8
[ 91.963130] invoke_syscall.constprop.0+0xdc/0x1d8
[ 91.969131] do_el0_svc+0xe4/0x2a8
[ 91.973741] el0_svc+0x64/0x130
[ 91.978093] el0t_64_sync_handler+0xb0/0xb8
[ 91.983484] el0t_64_sync+0x180/0x184
[ 91.988354]
other info that might help us debug this:

[ 91.998431] Chain exists of:
flush_lock --> (memory_chain).rwsem --> slab_mutex

[ 92.010867] Possible unsafe locking scenario:

[ 92.018166] CPU0 CPU1
[ 92.023380] ---- ----
[ 92.028595] lock(slab_mutex);
[ 92.032425] lock((memory_chain).rwsem);
[ 92.039641] lock(slab_mutex);
[ 92.045989] lock(flush_lock);
[ 92.049819]
*** DEADLOCK ***

[ 92.057811] 10 locks held by lsbug/1523:
[ 92.062420] #0: ffff0000505a8430 (sb_writers#6){.+.+}-{0:0}, at: ksys_write+0xf8/0x1e0
[ 92.071128] #1: ffff000870f99e88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x1dc/0x3d0
[ 92.080701] #2: ffff0000145b2ab8 (kn->active#175){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1f8/0x3d0
[ 92.090623] #3: ffff800018f84f08 (device_hotplug_lock){+.+.}-{3:3}, at: lock_device_hotplug_sysfs+0x24/0x88
[ 92.101151] #4: ffff0000145e9190 (&dev->mutex){....}-{3:3}, at: device_offline+0xa0/0x1e0
[ 92.110115] #5: ffff800011d26450 (cpu_hotplug_lock){++++}-{0:0}, at: offline_pages+0x10c/0xce4
[ 92.119514] #6: ffff800018e60570 (mem_hotplug_lock){++++}-{0:0}, at: offline_pages+0x11c/0xce4
[ 92.128919] #7: ffff800018e5bb68 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x30/0x60
[ 92.138668] #8: ffff800018fa0610 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x58/0x138
[ 92.149633] #9: ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
[ 92.159033]
stack backtrace:
[ 92.164772] CPU: 29 PID: 1523 Comm: lsbug Not tainted 5.14.0-rc5-next-20210809+ #84
[ 92.173116] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
[ 92.181631] Call trace:
[ 92.184763] dump_backtrace+0x0/0x3b8
[ 92.189115] show_stack+0x20/0x30
[ 92.193118] dump_stack_lvl+0x8c/0xb8
[ 92.197469] dump_stack+0x1c/0x38
[ 92.201472] print_circular_bug.isra.0+0x530/0x540
[ 92.206953] check_noncircular+0x27c/0x2f0
[ 92.211738] check_prev_add+0x194/0x1170
[ 92.216349] validate_chain+0xfe8/0x1c20
[ 92.220961] __lock_acquire+0xb0c/0x1aa8
[ 92.225571] lock_acquire+0x34c/0xb20
[ 92.229921] __mutex_lock+0x194/0x1470
[ 92.234358] mutex_lock_nested+0x6c/0xc0
[ 92.238968] flush_all+0x50/0x1c8
flush_all at /usr/src/linux-next/mm/slub.c:2649
[ 92.242971] __kmem_cache_shrink+0x38/0x2f0
[ 92.247842] slab_memory_callback+0x68/0x280
slab_mem_going_offline_callback at /usr/src/linux-next/mm/slub.c:4586
(inlined by) slab_memory_callback at /usr/src/linux-next/mm/slub.c:4678
[ 92.252800] blocking_notifier_call_chain+0xd0/0x138
notifier_call_chain at /usr/src/linux-next/kernel/notifier.c:83
(inlined by) blocking_notifier_call_chain at /usr/src/linux-next/kernel/notifier.c:337
(inlined by) blocking_notifier_call_chain at /usr/src/linux-next/kernel/notifier.c:325
[ 92.258453] memory_notify+0x28/0x38
[ 92.262717] offline_pages+0x2cc/0xce4
[ 92.267153] memory_subsys_offline+0xd8/0x280
[ 92.272198] device_offline+0x154/0x1e0
[ 92.276723] online_store+0xa4/0x118
[ 92.280986] dev_attr_store+0x44/0x78
[ 92.285336] sysfs_kf_write+0xe8/0x138
[ 92.289774] kernfs_fop_write_iter+0x26c/0x3d0
[ 92.294906] new_sync_write+0x2bc/0x4f8
[ 92.299431] vfs_write+0x718/0xc88
[ 92.303520] ksys_write+0xf8/0x1e0
[ 92.307608] __arm64_sys_write+0x74/0xa8
[ 92.312219] invoke_syscall.constprop.0+0xdc/0x1d8
[ 92.317698] do_el0_svc+0xe4/0x2a8
[ 92.321789] el0_svc+0x64/0x130
[ 92.325619] el0t_64_sync_handler+0xb0/0xb8
[ 92.330489] el0t_64_sync+0x180/0x184

> +	cpus_read_lock();
> +
> +	for_each_online_cpu(cpu) {
> +		sfw = &per_cpu(slub_flush, cpu);
> +		if (!has_cpu_slab(cpu, s)) {
> +			sfw->skip = true;
> +			continue;
> +		}
> +		INIT_WORK(&sfw->work, flush_cpu_slab);
> +		sfw->skip = false;
> +		sfw->s = s;
> +		schedule_work_on(cpu, &sfw->work);
> +	}
> +
> +	for_each_online_cpu(cpu) {
> +		sfw = &per_cpu(slub_flush, cpu);
> +		if (sfw->skip)
> +			continue;
> +		flush_work(&sfw->work);
> +	}
> +
> +	cpus_read_unlock();
> +	mutex_unlock(&flush_lock);
>  }
>
> /*
>

2021-08-09 18:50:20

by Mike Galbraith

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On Mon, 2021-08-09 at 09:41 -0400, Qian Cai wrote:
>
>
> On 8/5/2021 11:19 AM, Vlastimil Babka wrote:
> >
> >  
> > +static DEFINE_MUTEX(flush_lock);
> > +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> > +
> >  static void flush_all(struct kmem_cache *s)
> >  {
> > -       on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
> > +       struct slub_flush_work *sfw;
> > +       unsigned int cpu;
> > +
> > +       mutex_lock(&flush_lock);
>
> Vlastimil, taking the lock here could trigger a warning during memory
> offline/online due to the locking order:
>
> slab_mutex -> flush_lock

Bugger. That chain ending with cpu_hotplug_lock makes slub_cpu_dead()
taking slab_mutex a non-starter for cpu hotplug as well. It's
established early by kernel_init_freeable()..kmem_cache_destroy() as
well as by slab_mem_going_offline_callback().

>
> > +       cpus_read_lock();
> > +
> > +       for_each_online_cpu(cpu) {
> > +               sfw = &per_cpu(slub_flush, cpu);
> > +               if (!has_cpu_slab(cpu, s)) {
> > +                       sfw->skip = true;
> > +                       continue;
> > +               }
> > +               INIT_WORK(&sfw->work, flush_cpu_slab);
> > +               sfw->skip = false;
> > +               sfw->s = s;
> > +               schedule_work_on(cpu, &sfw->work);
> > +       }
> > +
> > +       for_each_online_cpu(cpu) {
> > +               sfw = &per_cpu(slub_flush, cpu);
> > +               if (sfw->skip)
> > +                       continue;
> > +               flush_work(&sfw->work);
> > +       }
> > +
> > +       cpus_read_unlock();
> > +       mutex_unlock(&flush_lock);
> >  }
> >  
> >  /*
> >


2021-08-09 23:32:39

by Vlastimil Babka

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On 8/9/2021 8:44 PM, Mike Galbraith wrote:
> On Mon, 2021-08-09 at 09:41 -0400, Qian Cai wrote:
>>
>>
>> On 8/5/2021 11:19 AM, Vlastimil Babka wrote:
>>>
>>>  
>>> +static DEFINE_MUTEX(flush_lock);
>>> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>>> +
>>>  static void flush_all(struct kmem_cache *s)
>>>  {
>>> -       on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
>>> +       struct slub_flush_work *sfw;
>>> +       unsigned int cpu;
>>> +
>>> +       mutex_lock(&flush_lock);
>>
>> Vlastimil, taking the lock here could trigger a warning during memory
>> offline/online due to the locking order:
>>
>> slab_mutex -> flush_lock
>
> Bugger. That chain ending with cpu_hotplug_lock makes slub_cpu_dead()
> taking slab_mutex a non-starter for cpu hotplug as well. It's
> established early by kernel_init_freeable()..kmem_cache_destroy() as
> well as by slab_mem_going_offline_callback().

I suck at reading lockdep splats, so I don't yet see how the "existing
reverse order" occurs - I do understand the order in the "lsbug" trace.
What I also wonder is why this didn't occur in the older RT trees with this
patch. I did change the order of locks in flush_all() to take flush_lock first
and cpus_read_lock() second, as Cyrill Gorcunov suggested. Would the original
order prevent this? Or would we fail anyway because we already took
cpus_read_lock() in offline_pages() and are now taking it again - do these nest
or not?

2021-08-10 02:10:19

by Qian Cai

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context



On 8/9/2021 4:08 PM, Vlastimil Babka wrote:
> On 8/9/2021 8:44 PM, Mike Galbraith wrote:
>> On Mon, 2021-08-09 at 09:41 -0400, Qian Cai wrote:
>>>
>>>
>>> On 8/5/2021 11:19 AM, Vlastimil Babka wrote:
>>>>
>>>>  
>>>> +static DEFINE_MUTEX(flush_lock);
>>>> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>>>> +
>>>>  static void flush_all(struct kmem_cache *s)
>>>>  {
>>>> -       on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
>>>> +       struct slub_flush_work *sfw;
>>>> +       unsigned int cpu;
>>>> +
>>>> +       mutex_lock(&flush_lock);
>>>
>>> Vlastimil, taking the lock here could trigger a warning during memory
>>> offline/online due to the locking order:
>>>
>>> slab_mutex -> flush_lock
>>
>> Bugger. That chain ending with cpu_hotplug_lock makes slub_cpu_dead()
>> taking slab_mutex a non-starter for cpu hotplug as well. It's
>> established early by kernel_init_freeable()..kmem_cache_destroy() as
>> well as by slab_mem_going_offline_callback().
>
> I suck at reading the lockdep splats, so I don't see yet how the "existing
> reverse order" occurs - I do understand the order in the "lsbug".
> What I also wonder is why didn't this occur also in the older RT trees with this
> patch. I did change the order of locks in flush_all() to take flush_lock first
> and cpus_read_lock() second, as Cyrill Gorcunov suggested. Would the original
> order prevent this? Or we would fail anyway because we already took
> cpus_read_lock() in offline_pages() and now are taking it again - do these nest
> or not?

"lsbug" is just a user-space tool that runs workloads like memory offline/online
via sysfs. The splat indicates that the locking orders seen so far on the running
system are:

flush_lock -> cpu_hotplug_lock (in #1)
cpu_hotplug_lock -> pcp_batch_high_lock (in #2)
pcp_batch_high_lock -> (memory_chain).rwsem (in #3)
(memory_chain).rwsem -> slab_mutex (in #4)

Thus, lockdep infers that taking flush_lock first could later lead to taking
slab_mutex. Then, with this commit, memory offline (in #0) started to take the locks
in the order slab_mutex -> flush_lock. Hence the potential deadlock warning.

2021-08-10 05:24:51

by Mike Galbraith

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On Mon, 2021-08-09 at 22:08 +0200, Vlastimil Babka wrote:
> On 8/9/2021 8:44 PM, Mike Galbraith wrote:
> > >
> > > slab_mutex -> flush_lock
> >
> > Bugger.  That chain ending with cpu_hotplug_lock makes slub_cpu_dead()
> > taking slab_mutex a non-starter for cpu hotplug as well.  It's
> > established early by kernel_init_freeable()..kmem_cache_destroy() as
> > well as by slab_mem_going_offline_callback().
>
> I suck at reading the lockdep splats, so I don't see yet how the "existing
> reverse order" occurs - I do understand the order in the "lsbug".
> What I also wonder is why didn't this occur also in the older RT trees with this
> patch.

Apparently (oops) nobody got around to hotplug+lockdep testing, RT or
otherwise. I know I didn't, goldfish-like attention span being used up
by explosion testing ;-)

-Mike

2021-08-10 11:49:27

by Vlastimil Babka

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On 8/9/21 3:41 PM, Qian Cai wrote:
>>
>> +static DEFINE_MUTEX(flush_lock);
>> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>> +
>> static void flush_all(struct kmem_cache *s)
>> {
>> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
>> + struct slub_flush_work *sfw;
>> + unsigned int cpu;
>> +
>> + mutex_lock(&flush_lock);
>
> Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
>
> slab_mutex -> flush_lock
>
> [ 91.374541] WARNING: possible circular locking dependency detected
> [ 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
> [ 91.387149] ------------------------------------------------------
> [ 91.394016] lsbug/1523 is trying to acquire lock:
> [ 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
> [ 91.407425]
> but task is already holding lock:
> [ 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
> [ 91.423603]
> which lock already depends on the new lock.
>

OK, managed to reproduce in qemu and this fixes it for me on top of
next-20210809. Could you test as well, as your testing might be more
comprehensive? I will then format it as a fixup for the proper patch in the series.

----8<----
From 7ce71c7f9455e8b96dc1b728ea566b6ef5e424e4 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Tue, 10 Aug 2021 10:58:07 +0200
Subject: [PATCH] mm, slub: fix memory offline lockdep splat

Reverse order of flush_lock and cpus_read_lock() to prevent lockdep splat.
In slab_mem_going_offline_callback() we already have cpus_read_lock()
held so make sure it's not taken again.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 88a6c3ed2751..073cdd4b020f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2640,13 +2640,13 @@ static bool has_cpu_slab(int cpu, struct kmem_cache *s)
 static DEFINE_MUTEX(flush_lock);
 static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);

-static void flush_all(struct kmem_cache *s)
+static void flush_all_cpus_locked(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
 	unsigned int cpu;

+	lockdep_assert_cpus_held();
 	mutex_lock(&flush_lock);
-	cpus_read_lock();

 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
@@ -2667,10 +2667,16 @@ static void flush_all(struct kmem_cache *s)
 		flush_work(&sfw->work);
 	}

-	cpus_read_unlock();
 	mutex_unlock(&flush_lock);
 }

+static void flush_all(struct kmem_cache *s)
+{
+	cpus_read_lock();
+	flush_all_cpus_locked(s);
+	cpus_read_unlock();
+}
+
 /*
  * Use the cpu notifier to insure that the cpu slabs are flushed when
  * necessary.
@@ -4516,7 +4522,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+int __kmem_cache_do_shrink(struct kmem_cache *s)
 {
 	int node;
 	int i;
@@ -4528,7 +4534,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	unsigned long flags;
 	int ret = 0;

-	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		INIT_LIST_HEAD(&discard);
 		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
@@ -4578,13 +4583,21 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	return ret;
 }

+int __kmem_cache_shrink(struct kmem_cache *s)
+{
+	flush_all(s);
+	return __kmem_cache_do_shrink(s);
+}
+
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;

 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s);
+	list_for_each_entry(s, &slab_caches, list) {
+		flush_all_cpus_locked(s);
+		__kmem_cache_do_shrink(s);
+	}
 	mutex_unlock(&slab_mutex);

 	return 0;
--
2.32.0

2021-08-10 16:28:16

by Mike Galbraith

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On Tue, 2021-08-10 at 11:03 +0200, Vlastimil Babka wrote:
> On 8/9/21 3:41 PM, Qian Cai wrote:
> > >  
> > > +static DEFINE_MUTEX(flush_lock);
> > > +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> > > +
> > >  static void flush_all(struct kmem_cache *s)
> > >  {
> > > -       on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
> > > +       struct slub_flush_work *sfw;
> > > +       unsigned int cpu;
> > > +
> > > +       mutex_lock(&flush_lock);
> >
> > Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
> >
> > slab_mutex -> flush_lock
> >
> > [   91.374541] WARNING: possible circular locking dependency detected
> > [   91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
> > [   91.387149] ------------------------------------------------------
> > [   91.394016] lsbug/1523 is trying to acquire lock:
> > [   91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
> > [   91.407425]
> >                but task is already holding lock:
> > [   91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
> > [   91.423603]
> >                which lock already depends on the new lock.
> >
>
> OK, managed to reproduce in qemu and this fixes it for me on top of
> next-20210809. Could you test as well, as your testing might be more
> comprehensive? I will then format it as a fixup for the proper patch in the series.

As it appeared it should, moving cpu_hotplug_lock outside slab_mutex in
kmem_cache_destroy() on top of that silenced the cpu offline gripe.

---
mm/slab_common.c | 2 ++
mm/slub.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)

--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -502,6 +502,7 @@ void kmem_cache_destroy(struct kmem_cach
 	if (unlikely(!s))
 		return;

+	cpus_read_lock();
 	mutex_lock(&slab_mutex);

 	s->refcount--;
@@ -516,6 +517,7 @@ void kmem_cache_destroy(struct kmem_cach
 	}
 out_unlock:
 	mutex_unlock(&slab_mutex);
+	cpus_read_unlock();
 }
 EXPORT_SYMBOL(kmem_cache_destroy);

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4234,7 +4234,7 @@ int __kmem_cache_shutdown(struct kmem_ca
 	int node;
 	struct kmem_cache_node *n;

-	flush_all(s);
+	flush_all_cpus_locked(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
 		free_partial(s, n);




2021-08-10 16:39:35

by Vlastimil Babka

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On 8/9/21 3:41 PM, Qian Cai wrote:

>> static void flush_all(struct kmem_cache *s)
>> {
>> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
>> + struct slub_flush_work *sfw;
>> + unsigned int cpu;
>> +
>> + mutex_lock(&flush_lock);
>
> Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
>
> slab_mutex -> flush_lock

Here's the full fixup, also incorporating Mike's fix. Thanks.

----8<----
From c2df67d5116d4615c322e262556e34117e268104 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Tue, 10 Aug 2021 10:58:07 +0200
Subject: [PATCH] mm, slub: fix memory and cpu hotplug related lock ordering
issues

Qian Cai reported [1] a lockdep splat on memory offline.

[ 91.374541] WARNING: possible circular locking dependency detected
[ 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
[ 91.387149] ------------------------------------------------------
[ 91.394016] lsbug/1523 is trying to acquire lock:
[ 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
[ 91.407425] but task is already holding lock:
[ 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
[ 91.423603] which lock already depends on the new lock.

To fix it, we need to change the order in flush_all() so that cpus_read_lock()
is first and mutex_lock(&flush_lock) second.

Also when called from slab_mem_going_offline_callback() we are already under
cpus_read_lock() and cannot take it again, so create a flush_all_cpus_locked()
variant and decouple flushing from actual shrinking for this call path.

Additionally, Mike Galbraith reported [2] wrong order of cpus_read_lock() and
slab_mutex in kmem_cache_destroy() path and proposed a fix to reverse it.

This patch is a fixup for the mmotm patch
mm-slub-move-flush_cpu_slab-invocations-__free_slab-invocations-out-of-irq-context.patch

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

Reported-by: Qian Cai <[email protected]>
Reported-by: Mike Galbraith <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slab_common.c | 2 ++
mm/slub.c | 29 +++++++++++++++++++++--------
2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1c673c323baf..ec2bb0beed75 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -502,6 +502,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (unlikely(!s))
return;

+ cpus_read_lock();
mutex_lock(&slab_mutex);

s->refcount--;
@@ -516,6 +517,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
}
out_unlock:
mutex_unlock(&slab_mutex);
+ cpus_read_unlock();
}
EXPORT_SYMBOL(kmem_cache_destroy);

diff --git a/mm/slub.c b/mm/slub.c
index da48ada3d17f..152487f84025 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2551,13 +2551,13 @@ static bool has_cpu_slab(int cpu, struct kmem_cache *s)
static DEFINE_MUTEX(flush_lock);
static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);

-static void flush_all(struct kmem_cache *s)
+static void flush_all_cpus_locked(struct kmem_cache *s)
{
struct slub_flush_work *sfw;
unsigned int cpu;

+ lockdep_assert_cpus_held();
mutex_lock(&flush_lock);
- cpus_read_lock();

for_each_online_cpu(cpu) {
sfw = &per_cpu(slub_flush, cpu);
@@ -2578,10 +2578,16 @@ static void flush_all(struct kmem_cache *s)
flush_work(&sfw->work);
}

- cpus_read_unlock();
mutex_unlock(&flush_lock);
}

+static void flush_all(struct kmem_cache *s)
+{
+ cpus_read_lock();
+ flush_all_cpus_locked(s);
+ cpus_read_unlock();
+}
+
/*
* Use the cpu notifier to insure that the cpu slabs are flushed when
* necessary.
@@ -4111,7 +4117,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
int node;
struct kmem_cache_node *n;

- flush_all(s);
+ flush_all_cpus_locked(s);
/* Attempt to free all objects */
for_each_kmem_cache_node(s, node, n) {
free_partial(s, n);
@@ -4387,7 +4393,7 @@ EXPORT_SYMBOL(kfree);
* being allocated from last increasing the chance that the last objects
* are freed in them.
*/
-int __kmem_cache_shrink(struct kmem_cache *s)
+int __kmem_cache_do_shrink(struct kmem_cache *s)
{
int node;
int i;
@@ -4399,7 +4405,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
unsigned long flags;
int ret = 0;

- flush_all(s);
for_each_kmem_cache_node(s, node, n) {
INIT_LIST_HEAD(&discard);
for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
@@ -4449,13 +4454,21 @@ int __kmem_cache_shrink(struct kmem_cache *s)
return ret;
}

+int __kmem_cache_shrink(struct kmem_cache *s)
+{
+ flush_all(s);
+ return __kmem_cache_do_shrink(s);
+}
+
static int slab_mem_going_offline_callback(void *arg)
{
struct kmem_cache *s;

mutex_lock(&slab_mutex);
- list_for_each_entry(s, &slab_caches, list)
- __kmem_cache_shrink(s);
+ list_for_each_entry(s, &slab_caches, list) {
+ flush_all_cpus_locked(s);
+ __kmem_cache_do_shrink(s);
+ }
mutex_unlock(&slab_mutex);

return 0;
--
2.32.0
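
For readers following the ordering argument in the fixup above, here is a minimal userspace sketch of the same pattern. It is not kernel code and not the author's implementation: plain pthread mutexes stand in for cpu_hotplug_lock (which the kernel takes in read mode via cpus_read_lock()), slab_mutex and flush_lock, and every name ending in _sketch, along with hotplug_depth, nr_flushes and flushes_done(), is made up for illustration. It only shows the single order the fixup settles on (hotplug lock, then slab_mutex, then flush_lock) and the split between a wrapper that takes the outer lock and a _locked variant that merely asserts it.

```c
#include <assert.h>
#include <pthread.h>

/*
 * Userspace stand-ins for the kernel locks named in the patch.
 * Assumption: plain mutexes are used for simplicity; the real
 * cpu_hotplug_lock is taken in read mode via cpus_read_lock().
 */
static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t slab_mutex   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t flush_lock   = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical bookkeeping so the sketch can imitate lockdep_assert_cpus_held(). */
static _Thread_local int hotplug_depth;
static int nr_flushes;

static void cpus_read_lock_sketch(void)
{
	pthread_mutex_lock(&hotplug_lock);
	hotplug_depth++;
}

static void cpus_read_unlock_sketch(void)
{
	hotplug_depth--;
	pthread_mutex_unlock(&hotplug_lock);
}

/* Caller must already hold the hotplug lock; only flush_lock is taken here. */
static void flush_all_cpus_locked_sketch(void)
{
	assert(hotplug_depth > 0);
	pthread_mutex_lock(&flush_lock);
	nr_flushes++;	/* stands in for queueing and flushing the per-cpu work */
	pthread_mutex_unlock(&flush_lock);
}

/* Wrapper for callers that do not hold the hotplug lock yet. */
static void flush_all_sketch(void)
{
	cpus_read_lock_sketch();
	flush_all_cpus_locked_sketch();
	cpus_read_unlock_sketch();
}

/* Destroy path: hotplug lock first, slab_mutex second, flush_lock last. */
static void cache_destroy_sketch(void)
{
	cpus_read_lock_sketch();
	pthread_mutex_lock(&slab_mutex);
	flush_all_cpus_locked_sketch();
	pthread_mutex_unlock(&slab_mutex);
	cpus_read_unlock_sketch();
}

/* Number of flushes performed so far (single-threaded use only in this sketch). */
static int flushes_done(void)
{
	return nr_flushes;
}
```

Calling flush_all_cpus_locked_sketch() without the outer lock held trips the assert, which is roughly the situation the lockdep_assert_cpus_held() warning later in this thread reports when only the mm/slub.c half of the fix is applied.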


2021-08-10 16:40:21

by Vlastimil Babka

Subject: Re: [PATCH v4 00/35] SLUB: reduce irq disabled scope and make it RT compatible

On 8/5/21 5:19 PM, Vlastimil Babka wrote:
> Series is based on 5.14-rc4 and also available as a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0

New branch with fixed up locking orders in patch 29/35:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1

2021-08-10 20:27:27

by Paul E. McKenney

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On Tue, Aug 10, 2021 at 11:03:02AM +0200, Vlastimil Babka wrote:
> On 8/9/21 3:41 PM, Qian Cai wrote:
> >>
> >> +static DEFINE_MUTEX(flush_lock);
> >> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> >> +
> >> static void flush_all(struct kmem_cache *s)
> >> {
> >> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
> >> + struct slub_flush_work *sfw;
> >> + unsigned int cpu;
> >> +
> >> + mutex_lock(&flush_lock);
> >
> > Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
> >
> > slab_mutex -> flush_lock
> >
> > [ 91.374541] WARNING: possible circular locking dependency detected
> > [ 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
> > [ 91.387149] ------------------------------------------------------
> > [ 91.394016] lsbug/1523 is trying to acquire lock:
> > [ 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
> > [ 91.407425]
> > but task is already holding lock:
> > [ 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
> > [ 91.423603]
> > which lock already depends on the new lock.

From the series in -next, I got a three-way deadlock similar to what
Qian Cai got.

> OK, managed to reproduce in qemu and this fixes it for me on top of
> next-20210809. Could you test as well, as your testing might be more
> comprehensive? I will format it as a fixup for the proper patch in the series then.
>
> ----8<----
> From 7ce71c7f9455e8b96dc1b728ea566b6ef5e424e4 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <[email protected]>
> Date: Tue, 10 Aug 2021 10:58:07 +0200
> Subject: [PATCH] mm, slub: fix memory offline lockdep splat
>
> Reverse order of flush_lock and cpus_read_lock() to prevent lockdep splat.
> In slab_mem_going_offline_callback() we already have cpus_read_lock()
> held so make sure it's not taken again.
>
> Signed-off-by: Vlastimil Babka <[email protected]>

With this patch, it reduces to a two-way deadlock as shown at the end
of this message.

My reproducer is the following on a two-socket system:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10m --configs RUDE01 --trust-make

This likely needs the RCU commits in -next to reproduce quickly, though
you never know.

Thanx, Paul

> ---
> mm/slub.c | 27 ++++++++++++++++++++-------
> 1 file changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 88a6c3ed2751..073cdd4b020f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2640,13 +2640,13 @@ static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> static DEFINE_MUTEX(flush_lock);
> static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>
> -static void flush_all(struct kmem_cache *s)
> +static void flush_all_cpus_locked(struct kmem_cache *s)
> {
> struct slub_flush_work *sfw;
> unsigned int cpu;
>
> + lockdep_assert_cpus_held();
> mutex_lock(&flush_lock);
> - cpus_read_lock();
>
> for_each_online_cpu(cpu) {
> sfw = &per_cpu(slub_flush, cpu);
> @@ -2667,10 +2667,16 @@ static void flush_all(struct kmem_cache *s)
> flush_work(&sfw->work);
> }
>
> - cpus_read_unlock();
> mutex_unlock(&flush_lock);
> }
>
> +static void flush_all(struct kmem_cache *s)
> +{
> + cpus_read_lock();
> + flush_all_cpus_locked(s);
> + cpus_read_unlock();
> +}
> +
> /*
> * Use the cpu notifier to insure that the cpu slabs are flushed when
> * necessary.
> @@ -4516,7 +4522,7 @@ EXPORT_SYMBOL(kfree);
> * being allocated from last increasing the chance that the last objects
> * are freed in them.
> */
> -int __kmem_cache_shrink(struct kmem_cache *s)
> +int __kmem_cache_do_shrink(struct kmem_cache *s)
> {
> int node;
> int i;
> @@ -4528,7 +4534,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
> unsigned long flags;
> int ret = 0;
>
> - flush_all(s);
> for_each_kmem_cache_node(s, node, n) {
> INIT_LIST_HEAD(&discard);
> for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
> @@ -4578,13 +4583,21 @@ int __kmem_cache_shrink(struct kmem_cache *s)
> return ret;
> }
>
> +int __kmem_cache_shrink(struct kmem_cache *s)
> +{
> + flush_all(s);
> + return __kmem_cache_do_shrink(s);
> +}
> +
> static int slab_mem_going_offline_callback(void *arg)
> {
> struct kmem_cache *s;
>
> mutex_lock(&slab_mutex);
> - list_for_each_entry(s, &slab_caches, list)
> - __kmem_cache_shrink(s);
> + list_for_each_entry(s, &slab_caches, list) {
> + flush_all_cpus_locked(s);
> + __kmem_cache_do_shrink(s);
> + }
> mutex_unlock(&slab_mutex);
>
> return 0;
> --
> 2.32.0

[ 602.668050] ========================================================
[ 602.668924] WARNING: possible circular locking dependency detected
[ 602.669796] 5.14.0-rc5-next-20210809+ #3298 Not tainted
[ 602.670537] ------------------------------------------------------
[ 602.671408] torture_shutdow/88 is trying to acquire lock:
[ 602.672169] ffffffffb00686b0 (cpu_hotplug_lock){++++}-{0:0}, at: __kmem_cache_shutdown+0x26/0x210
[ 602.673416]
[ 602.673416] but task is already holding lock:
[ 602.674240] ffffffffb0178368 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x1c/0x110
[ 602.675379]
[ 602.675379] which lock already depends on the new lock.
[ 602.675379]
[ 602.676525]
[ 602.676525] the existing dependency chain (in reverse order) is:
[ 602.677576]
[ 602.677576] -> #1 (slab_mutex){+.+.}-{3:3}:
[ 602.678377] __mutex_lock+0x81/0x9a0
[ 602.678964] slub_cpu_dead+0x17/0xb0
[ 602.679547] cpuhp_invoke_callback+0x180/0x890
[ 602.680255] cpuhp_invoke_callback_range+0x3b/0x80
[ 602.681009] _cpu_down+0xe4/0x2b0
[ 602.681556] cpu_down+0x29/0x50
[ 602.682082] device_offline+0x7e/0xb0
[ 602.682677] remove_cpu+0x17/0x30
[ 602.683225] torture_offline+0x7d/0x140
[ 602.683844] torture_onoff+0x14f/0x260
[ 602.684455] kthread+0x132/0x160
[ 602.684994] ret_from_fork+0x22/0x30
[ 602.685574]
[ 602.685574] -> #0 (cpu_hotplug_lock){++++}-{0:0}:
[ 602.686460] __lock_acquire+0x13d2/0x2470
[ 602.687107] lock_acquire+0xc9/0x2e0
[ 602.687686] cpus_read_lock+0x26/0xb0
[ 602.688284] __kmem_cache_shutdown+0x26/0x210
[ 602.688973] kmem_cache_destroy+0x38/0x110
[ 602.689625] rcu_torture_cleanup.cold.36+0x192/0x421
[ 602.690399] torture_shutdown+0xdd/0x1c0
[ 602.691032] kthread+0x132/0x160
[ 602.691563] ret_from_fork+0x22/0x30
[ 602.692147]
[ 602.692147] other info that might help us debug this:
[ 602.692147]
[ 602.693268] Possible unsafe locking scenario:
[ 602.693268]
[ 602.694128] CPU0 CPU1
[ 602.694766] ---- ----
[ 602.695409] lock(slab_mutex);
[ 602.695858] lock(cpu_hotplug_lock);
[ 602.696731] lock(slab_mutex);
[ 602.697531] lock(cpu_hotplug_lock);
[ 602.698057]
[ 602.698057] *** DEADLOCK ***
[ 602.698057]
[ 602.698884] 1 lock held by torture_shutdow/88:
[ 602.699517] #0: ffffffffb0178368 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x1c/0x110
[ 602.700716]
[ 602.700716] stack backtrace:
[ 602.701334] CPU: 3 PID: 88 Comm: torture_shutdow Not tainted 5.14.0-rc5-next-20210809+ #3298
[ 602.702518] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module_el8.5.0+746+bbd5d70c 04/01/2014
[ 602.703799] Call Trace:
[ 602.704160] dump_stack_lvl+0x44/0x57
[ 602.704686] check_noncircular+0xfe/0x110
[ 602.705264] __lock_acquire+0x13d2/0x2470
[ 602.705836] lock_acquire+0xc9/0x2e0
[ 602.706389] ? __kmem_cache_shutdown+0x26/0x210
[ 602.707059] cpus_read_lock+0x26/0xb0
[ 602.707582] ? __kmem_cache_shutdown+0x26/0x210
[ 602.708226] __kmem_cache_shutdown+0x26/0x210
[ 602.708843] ? lock_is_held_type+0xd6/0x130
[ 602.709442] ? torture_onoff+0x260/0x260
[ 602.710007] kmem_cache_destroy+0x38/0x110
[ 602.710590] rcu_torture_cleanup.cold.36+0x192/0x421
[ 602.711298] ? wait_woken+0x60/0x60
[ 602.711796] ? torture_onoff+0x260/0x260
[ 602.712359] torture_shutdown+0xdd/0x1c0
[ 602.712918] kthread+0x132/0x160
[ 602.713386] ? set_kthread_struct+0x40/0x40
[ 602.713985] ret_from_fork+0x22/0x30

2021-08-10 20:33:32

by Paul E. McKenney

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On Tue, Aug 10, 2021 at 01:47:42PM +0200, Mike Galbraith wrote:
> On Tue, 2021-08-10 at 11:03 +0200, Vlastimil Babka wrote:
> > On 8/9/21 3:41 PM, Qian Cai wrote:
> > > >
> > > > +static DEFINE_MUTEX(flush_lock);
> > > > +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> > > > +
> > > >  static void flush_all(struct kmem_cache *s)
> > > >  {
> > > > -       on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
> > > > +       struct slub_flush_work *sfw;
> > > > +       unsigned int cpu;
> > > > +
> > > > +       mutex_lock(&flush_lock);
> > >
> > > Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
> > >
> > > slab_mutex -> flush_lock
> > >
> > > [?? 91.374541] WARNING: possible circular locking dependency detected
> > > [?? 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
> > > [?? 91.387149] ------------------------------------------------------
> > > [?? 91.394016] lsbug/1523 is trying to acquire lock:
> > > [?? 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
> > > [?? 91.407425]
> > > ?????????????? but task is already holding lock:
> > > [?? 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
> > > [?? 91.423603]
> > > ?????????????? which lock already depends on the new lock.
> > >
> >
> > OK, managed to reproduce in qemu and this fixes it for me on top of
> > next-20210809. Could you test as well, as your testing might be more
> > comprehensive? I will format it as a fixup for the proper patch in the series then.
>
> As it appeared it should, moving cpu_hotplug_lock outside slab_mutex in
> kmem_cache_destroy() on top of that silenced the cpu offline gripe.

And this one got rid of the remainder of the deadlock, but gets me the
splat shown at the end of this message. So some sort of middle ground
may be needed.

(Same reproducer as in my previous reply to Vlastimil.)

Thanx, Paul

> ---
> mm/slab_common.c | 2 ++
> mm/slub.c | 2 +-
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -502,6 +502,7 @@ void kmem_cache_destroy(struct kmem_cach
> if (unlikely(!s))
> return;
>
> + cpus_read_lock();
> mutex_lock(&slab_mutex);
>
> s->refcount--;
> @@ -516,6 +517,7 @@ void kmem_cache_destroy(struct kmem_cach
> }
> out_unlock:
> mutex_unlock(&slab_mutex);
> + cpus_read_unlock();
> }
> EXPORT_SYMBOL(kmem_cache_destroy);
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4234,7 +4234,7 @@ int __kmem_cache_shutdown(struct kmem_ca
> int node;
> struct kmem_cache_node *n;
>
> - flush_all(s);
> + flush_all_cpus_locked(s);
> /* Attempt to free all objects */
> for_each_kmem_cache_node(s, node, n) {
> free_partial(s, n);

[ 602.539109] ------------[ cut here ]------------
[ 602.539804] WARNING: CPU: 3 PID: 88 at kernel/cpu.c:335 lockdep_assert_cpus_held+0x29/0x30
[ 602.540940] Modules linked in:
[ 602.541377] CPU: 3 PID: 88 Comm: torture_shutdow Not tainted 5.14.0-rc5-next-20210809+ #3299
[ 602.542536] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module_el8.5.0+746+bbd5d70c 04/01/2014
[ 602.543786] RIP: 0010:lockdep_assert_cpus_held+0x29/0x30
[ 602.544524] Code: 00 83 3d 4d f1 a4 01 01 76 0a 8b 05 4d 23 a5 01 85 c0 75 01 c3 be ff ff ff ff 48 c7 c7 b0 86 66 a3 e8 9b 05 c9 00 85 c0 75 ea <0f> 0b c3 0f 1f 40 00 41 57 41 89 ff 41 56 4d 89 c6 41 55 49 89 cd
[ 602.547051] RSP: 0000:ffffb382802efdb8 EFLAGS: 00010246
[ 602.547783] RAX: 0000000000000000 RBX: ffffa23301a44000 RCX: 0000000000000001
[ 602.548764] RDX: 0000000000000001 RSI: ffffffffa335f5c0 RDI: ffffffffa33adbbf
[ 602.549747] RBP: ffffa23301a44000 R08: ffffa23302810000 R09: 974cf0ba5c48ad3c
[ 602.550727] R10: ffffb382802efe78 R11: 0000000000000001 R12: ffffa23301a44000
[ 602.551709] R13: 00000000000249c0 R14: 00000000ffffffff R15: 0000000fffffffe0
[ 602.552694] FS: 0000000000000000(0000) GS:ffffa2331f580000(0000) knlGS:0000000000000000
[ 602.553805] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 602.554606] CR2: 0000000000000000 CR3: 0000000017222000 CR4: 00000000000006e0
[ 602.555601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 602.556590] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 602.557585] Call Trace:
[ 602.557927] flush_all_cpus_locked+0x29/0x140
[ 602.558535] __kmem_cache_shutdown+0x26/0x200
[ 602.559145] ? lock_is_held_type+0xd6/0x130
[ 602.559739] ? torture_onoff+0x260/0x260
[ 602.560284] kmem_cache_destroy+0x38/0x110
[ 602.560859] rcu_torture_cleanup.cold.36+0x192/0x421
[ 602.561539] ? wait_woken+0x60/0x60
[ 602.562035] ? torture_onoff+0x260/0x260
[ 602.562591] torture_shutdown+0xdd/0x1c0
[ 602.563131] kthread+0x132/0x160
[ 602.563592] ? set_kthread_struct+0x40/0x40
[ 602.564172] ret_from_fork+0x22/0x30
[ 602.564696] irq event stamp: 1307
[ 602.565161] hardirqs last enabled at (1315): [<ffffffffa1eddced>] __up_console_sem+0x4d/0x50
[ 602.566321] hardirqs last disabled at (1324): [<ffffffffa1eddcd2>] __up_console_sem+0x32/0x50
[ 602.567479] softirqs last enabled at (1304): [<ffffffffa2e00311>] __do_softirq+0x311/0x473
[ 602.568616] softirqs last disabled at (1299): [<ffffffffa1e72eb8>] irq_exit_rcu+0xe8/0xf0
[ 602.569735] ---[ end trace 26fd643e1df331c9 ]---

2021-08-10 22:40:02

by Vlastimil Babka

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On 8/10/2021 10:31 PM, Paul E. McKenney wrote:
> On Tue, Aug 10, 2021 at 01:47:42PM +0200, Mike Galbraith wrote:
>> On Tue, 2021-08-10 at 11:03 +0200, Vlastimil Babka wrote:
>>> On 8/9/21 3:41 PM, Qian Cai wrote:
>>>>>  
>>>>> +static DEFINE_MUTEX(flush_lock);
>>>>> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>>>>> +
>>>>>  static void flush_all(struct kmem_cache *s)
>>>>>  {
>>>>> -       on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
>>>>> +       struct slub_flush_work *sfw;
>>>>> +       unsigned int cpu;
>>>>> +
>>>>> +       mutex_lock(&flush_lock);
>>>>
>>>> Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
>>>>
>>>> slab_mutex -> flush_lock
>>>>
>>>> [   91.374541] WARNING: possible circular locking dependency detected
>>>> [   91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
>>>> [   91.387149] ------------------------------------------------------
>>>> [   91.394016] lsbug/1523 is trying to acquire lock:
>>>> [   91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
>>>> [   91.407425]
>>>>                but task is already holding lock:
>>>> [   91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
>>>> [   91.423603]
>>>>                which lock already depends on the new lock.
>>>>
>>>
>>> OK, managed to reproduce in qemu and this fixes it for me on top of
>>> next-20210809. Could you test as well, as your testing might be more
>>> comprehensive? I will format it as a fixup for the proper patch in the series then.
>>
>> As it appeared it should, moving cpu_hotplug_lock outside slab_mutex in
>> kmem_cache_destroy() on top of that silenced the cpu offline gripe.
>
> And this one got rid of the remainder of the deadlock, but gets me the
> splat shown at the end of this message. So some sort of middle ground
> may be needed.
>
> (Same reproducer as in my previous reply to Vlastimil.)
>
> Thanx, Paul
>
>> ---
>> mm/slab_common.c | 2 ++
>> mm/slub.c | 2 +-
>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> --- a/mm/slab_common.c
>> +++ b/mm/slab_common.c
>> @@ -502,6 +502,7 @@ void kmem_cache_destroy(struct kmem_cach
>> if (unlikely(!s))
>> return;
>>
>> + cpus_read_lock();
>> mutex_lock(&slab_mutex);
>>
>> s->refcount--;
>> @@ -516,6 +517,7 @@ void kmem_cache_destroy(struct kmem_cach
>> }
>> out_unlock:
>> mutex_unlock(&slab_mutex);
>> + cpus_read_unlock();
>> }
>> EXPORT_SYMBOL(kmem_cache_destroy);
>>
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -4234,7 +4234,7 @@ int __kmem_cache_shutdown(struct kmem_ca
>> int node;
>> struct kmem_cache_node *n;
>>
>> - flush_all(s);
>> + flush_all_cpus_locked(s);
>> /* Attempt to free all objects */
>> for_each_kmem_cache_node(s, node, n) {
>> free_partial(s, n);
>
> [ 602.539109] ------------[ cut here ]------------
> [ 602.539804] WARNING: CPU: 3 PID: 88 at kernel/cpu.c:335 lockdep_assert_cpus_held+0x29/0x30

So this says the assert failed and we don't have the cpus_read_lock(), right, but...

> [ 602.540940] Modules linked in:
> [ 602.541377] CPU: 3 PID: 88 Comm: torture_shutdow Not tainted 5.14.0-rc5-next-20210809+ #3299
> [ 602.542536] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module_el8.5.0+746+bbd5d70c 04/01/2014
> [ 602.543786] RIP: 0010:lockdep_assert_cpus_held+0x29/0x30
> [ 602.544524] Code: 00 83 3d 4d f1 a4 01 01 76 0a 8b 05 4d 23 a5 01 85 c0 75 01 c3 be ff ff ff ff 48 c7 c7 b0 86 66 a3 e8 9b 05 c9 00 85 c0 75 ea <0f> 0b c3 0f 1f 40 00 41 57 41 89 ff 41 56 4d 89 c6 41 55 49 89 cd
> [ 602.547051] RSP: 0000:ffffb382802efdb8 EFLAGS: 00010246
> [ 602.547783] RAX: 0000000000000000 RBX: ffffa23301a44000 RCX: 0000000000000001
> [ 602.548764] RDX: 0000000000000001 RSI: ffffffffa335f5c0 RDI: ffffffffa33adbbf
> [ 602.549747] RBP: ffffa23301a44000 R08: ffffa23302810000 R09: 974cf0ba5c48ad3c
> [ 602.550727] R10: ffffb382802efe78 R11: 0000000000000001 R12: ffffa23301a44000
> [ 602.551709] R13: 00000000000249c0 R14: 00000000ffffffff R15: 0000000fffffffe0
> [ 602.552694] FS: 0000000000000000(0000) GS:ffffa2331f580000(0000) knlGS:0000000000000000
> [ 602.553805] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 602.554606] CR2: 0000000000000000 CR3: 0000000017222000 CR4: 00000000000006e0
> [ 602.555601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 602.556590] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 602.557585] Call Trace:
> [ 602.557927] flush_all_cpus_locked+0x29/0x140
> [ 602.558535] __kmem_cache_shutdown+0x26/0x200
> [ 602.559145] ? lock_is_held_type+0xd6/0x130
> [ 602.559739] ? torture_onoff+0x260/0x260
> [ 602.560284] kmem_cache_destroy+0x38/0x110

It should have been taken here. I don't understand. It's as if only
mm/slub.c was patched by Mike's patch, but not mm/slab_common.c?

> [ 602.560859] rcu_torture_cleanup.cold.36+0x192/0x421
> [ 602.561539] ? wait_woken+0x60/0x60
> [ 602.562035] ? torture_onoff+0x260/0x260
> [ 602.562591] torture_shutdown+0xdd/0x1c0
> [ 602.563131] kthread+0x132/0x160
> [ 602.563592] ? set_kthread_struct+0x40/0x40
> [ 602.564172] ret_from_fork+0x22/0x30
> [ 602.564696] irq event stamp: 1307
> [ 602.565161] hardirqs last enabled at (1315): [<ffffffffa1eddced>] __up_console_sem+0x4d/0x50
> [ 602.566321] hardirqs last disabled at (1324): [<ffffffffa1eddcd2>] __up_console_sem+0x32/0x50
> [ 602.567479] softirqs last enabled at (1304): [<ffffffffa2e00311>] __do_softirq+0x311/0x473
> [ 602.568616] softirqs last disabled at (1299): [<ffffffffa1e72eb8>] irq_exit_rcu+0xe8/0xf0
> [ 602.569735] ---[ end trace 26fd643e1df331c9 ]---
>

2021-08-10 23:58:25

by Paul E. McKenney

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On Wed, Aug 11, 2021 at 12:36:00AM +0200, Vlastimil Babka wrote:
> On 8/10/2021 10:31 PM, Paul E. McKenney wrote:
> > On Tue, Aug 10, 2021 at 01:47:42PM +0200, Mike Galbraith wrote:
> >> On Tue, 2021-08-10 at 11:03 +0200, Vlastimil Babka wrote:
> >>> On 8/9/21 3:41 PM, Qian Cai wrote:
> >>>>> ?
> >>>>> +static DEFINE_MUTEX(flush_lock);
> >>>>> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> >>>>> +
> >>>>> ?static void flush_all(struct kmem_cache *s)
> >>>>> ?{
> >>>>> -???????on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
> >>>>> +???????struct slub_flush_work *sfw;
> >>>>> +???????unsigned int cpu;
> >>>>> +
> >>>>> +???????mutex_lock(&flush_lock);
> >>>>
> >>>> Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
> >>>>
> >>>> slab_mutex -> flush_lock
> >>>>
> >>>> [?? 91.374541] WARNING: possible circular locking dependency detected
> >>>> [?? 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
> >>>> [?? 91.387149] ------------------------------------------------------
> >>>> [?? 91.394016] lsbug/1523 is trying to acquire lock:
> >>>> [?? 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
> >>>> [?? 91.407425]
> >>>> ?????????????? but task is already holding lock:
> >>>> [?? 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
> >>>> [?? 91.423603]
> >>>> ?????????????? which lock already depends on the new lock.
> >>>>
> >>>
> >>> OK, managed to reproduce in qemu and this fixes it for me on top of
> >>> next-20210809. Could you test as well, as your testing might be more
> > >>> comprehensive? I will format it as a fixup for the proper patch in the series then.
> >>
> >> As it appeared it should, moving cpu_hotplug_lock outside slab_mutex in
> >> kmem_cache_destroy() on top of that silenced the cpu offline gripe.
> >
> > And this one got rid of the remainder of the deadlock, but gets me the
> > splat shown at the end of this message. So some sort of middle ground
> > may be needed.
> >
> > (Same reproducer as in my previous reply to Vlastimil.)
> >
> > Thanx, Paul
> >
> >> ---
> >> mm/slab_common.c | 2 ++
> >> mm/slub.c | 2 +-
> >> 2 files changed, 3 insertions(+), 1 deletion(-)
> >>
> >> --- a/mm/slab_common.c
> >> +++ b/mm/slab_common.c
> >> @@ -502,6 +502,7 @@ void kmem_cache_destroy(struct kmem_cach
> >> if (unlikely(!s))
> >> return;
> >>
> >> + cpus_read_lock();
> >> mutex_lock(&slab_mutex);
> >>
> >> s->refcount--;
> >> @@ -516,6 +517,7 @@ void kmem_cache_destroy(struct kmem_cach
> >> }
> >> out_unlock:
> >> mutex_unlock(&slab_mutex);
> >> + cpus_read_unlock();
> >> }
> >> EXPORT_SYMBOL(kmem_cache_destroy);
> >>
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -4234,7 +4234,7 @@ int __kmem_cache_shutdown(struct kmem_ca
> >> int node;
> >> struct kmem_cache_node *n;
> >>
> >> - flush_all(s);
> >> + flush_all_cpus_locked(s);
> >> /* Attempt to free all objects */
> >> for_each_kmem_cache_node(s, node, n) {
> >> free_partial(s, n);
> >
> > [ 602.539109] ------------[ cut here ]------------
> > [ 602.539804] WARNING: CPU: 3 PID: 88 at kernel/cpu.c:335 lockdep_assert_cpus_held+0x29/0x30
>
> So this says the assert failed and we don't have the cpus_read_lock(), right, but...
>
> > [ 602.540940] Modules linked in:
> > [ 602.541377] CPU: 3 PID: 88 Comm: torture_shutdow Not tainted 5.14.0-rc5-next-20210809+ #3299
> > [ 602.542536] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module_el8.5.0+746+bbd5d70c 04/01/2014
> > [ 602.543786] RIP: 0010:lockdep_assert_cpus_held+0x29/0x30
> > [ 602.544524] Code: 00 83 3d 4d f1 a4 01 01 76 0a 8b 05 4d 23 a5 01 85 c0 75 01 c3 be ff ff ff ff 48 c7 c7 b0 86 66 a3 e8 9b 05 c9 00 85 c0 75 ea <0f> 0b c3 0f 1f 40 00 41 57 41 89 ff 41 56 4d 89 c6 41 55 49 89 cd
> > [ 602.547051] RSP: 0000:ffffb382802efdb8 EFLAGS: 00010246
> > [ 602.547783] RAX: 0000000000000000 RBX: ffffa23301a44000 RCX: 0000000000000001
> > > [ 602.548764] RDX: 0000000000000001 RSI: ffffffffa335f5c0 RDI: ffffffffa33adbbf
> > > [ 602.549747] RBP: ffffa23301a44000 R08: ffffa23302810000 R09: 974cf0ba5c48ad3c
> > > [ 602.550727] R10: ffffb382802efe78 R11: 0000000000000001 R12: ffffa23301a44000
> > > [ 602.551709] R13: 00000000000249c0 R14: 00000000ffffffff R15: 0000000fffffffe0
> > [ 602.552694] FS: 0000000000000000(0000) GS:ffffa2331f580000(0000) knlGS:0000000000000000
> > [ 602.553805] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 602.554606] CR2: 0000000000000000 CR3: 0000000017222000 CR4: 00000000000006e0
> > [ 602.555601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 602.556590] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 602.557585] Call Trace:
> > [ 602.557927] flush_all_cpus_locked+0x29/0x140
> > [ 602.558535] __kmem_cache_shutdown+0x26/0x200
> > [ 602.559145] ? lock_is_held_type+0xd6/0x130
> > [ 602.559739] ? torture_onoff+0x260/0x260
> > [ 602.560284] kmem_cache_destroy+0x38/0x110
>
> It should have been taken here. I don't understand. It's as if only
> mm/slub.c was patched by Mike's patch, but not mm/slab_common.c?

You know, you would think that I would have learned how to reliably
apply a patch by now. Apparently not this morning.

Anyway, right in one! I will try again with the full patch later.

Thanx, Paul

> > [ 602.560859] rcu_torture_cleanup.cold.36+0x192/0x421
> > [ 602.561539] ? wait_woken+0x60/0x60
> > [ 602.562035] ? torture_onoff+0x260/0x260
> > [ 602.562591] torture_shutdown+0xdd/0x1c0
> > [ 602.563131] kthread+0x132/0x160
> > [ 602.563592] ? set_kthread_struct+0x40/0x40
> > [ 602.564172] ret_from_fork+0x22/0x30
> > [ 602.564696] irq event stamp: 1307
> > [ 602.565161] hardirqs last enabled at (1315): [<ffffffffa1eddced>] __up_console_sem+0x4d/0x50
> > [ 602.566321] hardirqs last disabled at (1324): [<ffffffffa1eddcd2>] __up_console_sem+0x32/0x50
> > [ 602.567479] softirqs last enabled at (1304): [<ffffffffa2e00311>] __do_softirq+0x311/0x473
> > [ 602.568616] softirqs last disabled at (1299): [<ffffffffa1e72eb8>] irq_exit_rcu+0xe8/0xf0
> > [ 602.569735] ---[ end trace 26fd643e1df331c9 ]---
> >
>

2021-08-11 01:45:21

by Qian Cai

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context



On 8/10/2021 10:33 AM, Vlastimil Babka wrote:
> On 8/9/21 3:41 PM, Qian Cai wrote:
>
>>> static void flush_all(struct kmem_cache *s)
>>> {
>>> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
>>> + struct slub_flush_work *sfw;
>>> + unsigned int cpu;
>>> +
>>> + mutex_lock(&flush_lock);
>>
>> Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
>>
>> slab_mutex -> flush_lock
>
> Here's the full fixup, also incorporating Mike's fix. Thanks.
>
> ----8<----
> From c2df67d5116d4615c322e262556e34117e268104 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <[email protected]>
> Date: Tue, 10 Aug 2021 10:58:07 +0200
> Subject: [PATCH] mm, slub: fix memory and cpu hotplug related lock ordering
> issues
>
> Qian Cai reported [1] a lockdep splat on memory offline.
>
> [ 91.374541] WARNING: possible circular locking dependency detected
> [ 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
> [ 91.387149] ------------------------------------------------------
> [ 91.394016] lsbug/1523 is trying to acquire lock:
> [ 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
> [ 91.407425] but task is already holding lock:
> [ 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
> [ 91.423603] which lock already depends on the new lock.
>
> To fix it, we need to change the order in flush_all() so that cpus_read_lock()
> is first and mutex_lock(&flush_lock) second.
>
> Also when called from slab_mem_going_offline_callback() we are already under
> cpus_read_lock() and cannot take it again, so create a flush_all_cpus_locked()
> variant and decouple flushing from actual shrinking for this call path.
>
> Additionally, Mike Galbraith reported [2] wrong order of cpus_read_lock() and
> slab_mutex in kmem_cache_destroy() path and proposed a fix to reverse it.
>
> This patch is a fixup for the mmotm patch
> mm-slub-move-flush_cpu_slab-invocations-__free_slab-invocations-out-of-irq-context.patch
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> Reported-by: Qian Cai <[email protected]>
> Reported-by: Mike Galbraith <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>

This is running fine for me. There is a separate hugetlb crash while fuzzing, which I will
report where it belongs.

2021-08-11 08:57:33

by Vlastimil Babka

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On 8/10/21 4:33 PM, Vlastimil Babka wrote:
> On 8/9/21 3:41 PM, Qian Cai wrote:
>
>>> static void flush_all(struct kmem_cache *s)
>>> {
>>> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
>>> + struct slub_flush_work *sfw;
>>> + unsigned int cpu;
>>> +
>>> + mutex_lock(&flush_lock);
>>
>> Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
>>
>> slab_mutex -> flush_lock
>
> Here's the full fixup, also incorporating Mike's fix. Thanks.
>

One more fixup, sorry for the churn.
----8<----
From 7cfe3fb1bcd6e589199b10bef480ed097ba9de14 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Wed, 11 Aug 2021 10:51:14 +0200
Subject: [PATCH] mm, slub: fix memory and cpu hotplug related lock ordering
issues - fix

Make __kmem_cache_do_shrink static to silence "no previous prototype" warning.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 152487f84025..c9531e03addd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4393,7 +4393,7 @@ EXPORT_SYMBOL(kfree);
* being allocated from last increasing the chance that the last objects
* are freed in them.
*/
-int __kmem_cache_do_shrink(struct kmem_cache *s)
+static int __kmem_cache_do_shrink(struct kmem_cache *s)
{
int node;
int i;
--
2.32.0
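
As a small illustration of why the one-word change above silences the warning: a non-static function defined without a prior declaration is what "no previous prototype" complains about, while a static helper has internal linkage and needs none. The sketch below uses hypothetical names (do_shrink(), cache_shrink()) and is not the SLUB code.

```c
#include <assert.h>

/*
 * File-local helper: 'static' gives it internal linkage, so a compiler
 * run with -Wmissing-prototypes does not expect a declaration for it.
 */
static int do_shrink(int nr_partial)
{
	return nr_partial > 0 ? 0 : 1;
}

/*
 * Externally visible entry point: declared before it is defined, as a
 * header would normally do, so the same warning stays quiet here too.
 */
int cache_shrink(int nr_partial);

int cache_shrink(int nr_partial)
{
	return do_shrink(nr_partial);
}
```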

2021-08-11 14:18:16

by Paul E. McKenney

Subject: Re: [PATCH v4 29/35] mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context

On Tue, Aug 10, 2021 at 04:53:36PM -0700, Paul E. McKenney wrote:
> On Wed, Aug 11, 2021 at 12:36:00AM +0200, Vlastimil Babka wrote:
> > On 8/10/2021 10:31 PM, Paul E. McKenney wrote:
> > > On Tue, Aug 10, 2021 at 01:47:42PM +0200, Mike Galbraith wrote:
> > >> On Tue, 2021-08-10 at 11:03 +0200, Vlastimil Babka wrote:
> > >>> On 8/9/21 3:41 PM, Qian Cai wrote:
> > >>>>>
> > >>>>> +static DEFINE_MUTEX(flush_lock);
> > >>>>> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> > >>>>> +
> > >>>>>  static void flush_all(struct kmem_cache *s)
> > >>>>>  {
> > >>>>> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
> > >>>>> + struct slub_flush_work *sfw;
> > >>>>> + unsigned int cpu;
> > >>>>> +
> > >>>>> + mutex_lock(&flush_lock);
> > >>>>
> > >>>> Vlastimil, taking the lock here could trigger a warning during memory offline/online due to the locking order:
> > >>>>
> > >>>> slab_mutex -> flush_lock
> > >>>>
> > >>>> [ 91.374541] WARNING: possible circular locking dependency detected
> > >>>> [ 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
> > >>>> [ 91.387149] ------------------------------------------------------
> > >>>> [ 91.394016] lsbug/1523 is trying to acquire lock:
> > >>>> [ 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
> > >>>> [ 91.407425]
> > >>>> but task is already holding lock:
> > >>>> [ 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
> > >>>> [ 91.423603]
> > >>>> which lock already depends on the new lock.
> > >>>>
> > >>>
> > >>> OK, managed to reproduce in qemu and this fixes it for me on top of
> > >>> next-20210809. Could you test as well, as your testing might be more
> > >>> comprehensive? I will format it as a fixup for the proper patch in the series then.
> > >>
> > >> As it appeared it should, moving cpu_hotplug_lock outside slab_mutex in
> > >> kmem_cache_destroy() on top of that silenced the cpu offline gripe.
> > >
> > > And this one got rid of the remainder of the deadlock, but gets me the
> > > splat shown at the end of this message. So some sort of middle ground
> > > may be needed.
> > >
> > > (Same reproducer as in my previous reply to Vlastimil.)
> > >
> > > Thanx, Paul
> > >
> > >> ---
> > >> mm/slab_common.c | 2 ++
> > >> mm/slub.c | 2 +-
> > >> 2 files changed, 3 insertions(+), 1 deletion(-)
> > >>
> > >> --- a/mm/slab_common.c
> > >> +++ b/mm/slab_common.c
> > >> @@ -502,6 +502,7 @@ void kmem_cache_destroy(struct kmem_cach
> > >> if (unlikely(!s))
> > >> return;
> > >>
> > >> + cpus_read_lock();
> > >> mutex_lock(&slab_mutex);
> > >>
> > >> s->refcount--;
> > >> @@ -516,6 +517,7 @@ void kmem_cache_destroy(struct kmem_cach
> > >> }
> > >> out_unlock:
> > >> mutex_unlock(&slab_mutex);
> > >> + cpus_read_unlock();
> > >> }
> > >> EXPORT_SYMBOL(kmem_cache_destroy);
> > >>
> > >> --- a/mm/slub.c
> > >> +++ b/mm/slub.c
> > >> @@ -4234,7 +4234,7 @@ int __kmem_cache_shutdown(struct kmem_ca
> > >> int node;
> > >> struct kmem_cache_node *n;
> > >>
> > >> - flush_all(s);
> > >> + flush_all_cpus_locked(s);
> > >> /* Attempt to free all objects */
> > >> for_each_kmem_cache_node(s, node, n) {
> > >> free_partial(s, n);
> > >
> > > [ 602.539109] ------------[ cut here ]------------
> > > [ 602.539804] WARNING: CPU: 3 PID: 88 at kernel/cpu.c:335 lockdep_assert_cpus_held+0x29/0x30
> >
> > So this says the assert failed and we don't have the cpus_read_lock(), right, but...
> >
> > > [ 602.540940] Modules linked in:
> > > [ 602.541377] CPU: 3 PID: 88 Comm: torture_shutdow Not tainted 5.14.0-rc5-next-20210809+ #3299
> > > [ 602.542536] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module_el8.5.0+746+bbd5d70c 04/01/2014
> > > [ 602.543786] RIP: 0010:lockdep_assert_cpus_held+0x29/0x30
> > > [ 602.544524] Code: 00 83 3d 4d f1 a4 01 01 76 0a 8b 05 4d 23 a5 01 85 c0 75 01 c3 be ff ff ff ff 48 c7 c7 b0 86 66 a3 e8 9b 05 c9 00 85 c0 75 ea <0f> 0b c3 0f 1f 40 00 41 57 41 89 ff 41 56 4d 89 c6 41 55 49 89 cd
> > > [ 602.547051] RSP: 0000:ffffb382802efdb8 EFLAGS: 00010246
> > > [ 602.547783] RAX: 0000000000000000 RBX: ffffa23301a44000 RCX: 0000000000000001
> > > [ 602.548764] RDX: 0000000000000001 RSI: ffffffffa335f5c0 RDI: ffffffffa33adbbf
> > > [ 602.549747] RBP: ffffa23301a44000 R08: ffffa23302810000 R09: 974cf0ba5c48ad3c
> > > [ 602.550727] R10: ffffb382802efe78 R11: 0000000000000001 R12: ffffa23301a44000
> > > [ 602.551709] R13: 00000000000249c0 R14: 00000000ffffffff R15: 0000000fffffffe0
> > > [ 602.552694] FS: 0000000000000000(0000) GS:ffffa2331f580000(0000) knlGS:0000000000000000
> > > [ 602.553805] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 602.554606] CR2: 0000000000000000 CR3: 0000000017222000 CR4: 00000000000006e0
> > > [ 602.555601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [ 602.556590] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [ 602.557585] Call Trace:
> > > [ 602.557927] flush_all_cpus_locked+0x29/0x140
> > > [ 602.558535] __kmem_cache_shutdown+0x26/0x200
> > > [ 602.559145] ? lock_is_held_type+0xd6/0x130
> > > [ 602.559739] ? torture_onoff+0x260/0x260
> > > [ 602.560284] kmem_cache_destroy+0x38/0x110
> >
> > It should have been taken here. I don't understand. It's as if only the
> > mm/slub.c was patched by Mike's patch, but mm/slab_common.c not?
>
> You know, you would think that I would have learned how to reliably
> apply a patch by now. Apparently not this morning.
>
> Anyway, right in one! I will try again with the full patch later.

And with both patches:

Tested-by: Paul E. McKenney <[email protected]>

Thanx, Paul

> > > [ 602.560859] rcu_torture_cleanup.cold.36+0x192/0x421
> > > [ 602.561539] ? wait_woken+0x60/0x60
> > > [ 602.562035] ? torture_onoff+0x260/0x260
> > > [ 602.562591] torture_shutdown+0xdd/0x1c0
> > > [ 602.563131] kthread+0x132/0x160
> > > [ 602.563592] ? set_kthread_struct+0x40/0x40
> > > [ 602.564172] ret_from_fork+0x22/0x30
> > > [ 602.564696] irq event stamp: 1307
> > > [ 602.565161] hardirqs last enabled at (1315): [<ffffffffa1eddced>] __up_console_sem+0x4d/0x50
> > > [ 602.566321] hardirqs last disabled at (1324): [<ffffffffa1eddcd2>] __up_console_sem+0x32/0x50
> > > [ 602.567479] softirqs last enabled at (1304): [<ffffffffa2e00311>] __do_softirq+0x311/0x473
> > > [ 602.568616] softirqs last disabled at (1299): [<ffffffffa1e72eb8>] irq_exit_rcu+0xe8/0xf0
> > > [ 602.569735] ---[ end trace 26fd643e1df331c9 ]---
> > >
> >

2021-08-15 10:18:39

by Vlastimil Babka

Subject: Re: [PATCH v4 13/35] mm, slub: do initial checks in ___slab_alloc() with irqs enabled

On 8/5/21 5:19 PM, Vlastimil Babka wrote:
> As another step of shortening irq disabled sections in ___slab_alloc(), delay
> disabling irqs until we pass the initial checks if there is a cached percpu
> slab and it's suitable for our allocation.
>
> Now we have to recheck c->page after actually disabling irqs as an allocation
> in irq handler might have replaced it.

Please add an extra paragraph that relates to the fixup below (which I
assume will be squashed as usual):

Because we call pfmemalloc_match() as one of the checks, we might hit
VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get
interrupted and the page is freed. Thus introduce a
pfmemalloc_match_unsafe() variant that lacks the PageSlab check.

> Signed-off-by: Vlastimil Babka <[email protected]>
> Acked-by: Mel Gorman <[email protected]>

And the fixup:
----8<----
From bf81bca38b127a8d717978467cf7264580c81248 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Sun, 15 Aug 2021 11:49:46 +0200
Subject: [PATCH] mm, slub: prevent VM_BUG_ON in PageSlabPfmemalloc from
___slab_alloc

Clark Williams reported [1] a VM_BUG_ON in PageSlabPfmemalloc:

page:000000009ac5dd73 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1ab3db
flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0000000 ffffee1286aceb88 ffffee1287b66288 0000000000000000
raw: 0000000000000000 0000000000100000 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!PageSlab(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:814!
invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
CPU: 3 PID: 12345 Comm: hackbench Not tainted 5.14.0-rc5-rt8+ #12
Hardware name: /NUC5i7RYB, BIOS RYBDWi35.86A.0359.2016.0906.1028 09/06/2016
RIP: 0010:___slab_alloc+0x340/0x940
Code: c6 48 0f a3 05 b1 7b 57 03 72 99 c7 85 78 ff ff ff ff ff ff ff 48 8b 7d 88 e9 8d fd ff ff 48 c7 c6 50 5a 7c b0 e>
RSP: 0018:ffffba1c4a8b7ab0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffff9bb765118000
RDX: 0000000000000000 RSI: ffffffffaf426050 RDI: 00000000ffffffff
RBP: ffffba1c4a8b7b70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9bb7410d3600
R13: 0000000000400cc0 R14: 00000000001f7770 R15: ffff9bbe76df7770
FS: 00007f474b1be740(0000) GS:ffff9bbe76c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f60c04bdaf8 CR3: 0000000124f3a003 CR4: 00000000003706e0
Call Trace:
? __alloc_skb+0x1db/0x270
? __alloc_skb+0x1db/0x270
? kmem_cache_alloc_node+0xa4/0x2b0
kmem_cache_alloc_node+0xa4/0x2b0
__alloc_skb+0x1db/0x270
alloc_skb_with_frags+0x64/0x250
sock_alloc_send_pskb+0x260/0x2b0
? bpf_lsm_socket_getpeersec_dgram+0xa/0x10
unix_stream_sendmsg+0x27c/0x550
? unix_seqpacket_recvmsg+0x60/0x60
sock_sendmsg+0xbd/0xd0
sock_write_iter+0xb9/0x120
new_sync_write+0x175/0x200
vfs_write+0x3c4/0x510
ksys_write+0xc9/0x110
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

The problem is that we are opportunistically checking flags on a page in an irq
enabled section. If we are interrupted and the page is freed, it's not an
issue as we detect it after disabling irqs. But on kernels with
CONFIG_DEBUG_VM, the check for the PageSlab flag in PageSlabPfmemalloc() can fail.

Fix this by creating an "unsafe" version of the check that doesn't check
PageSlab.

This is a fixup for mmotm patch
mm-slub-do-initial-checks-in-___slab_alloc-with-irqs-enabled.patch

[1] https://lore.kernel.org/lkml/[email protected]/

Reported-by: Clark Williams <[email protected]>
Tested-by: Mike Galbraith <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/page-flags.h | 9 +++++++++
mm/slub.c | 15 ++++++++++++++-
2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5922031ffab6..7fda4fb85bdc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -815,6 +815,15 @@ static inline int PageSlabPfmemalloc(struct page *page)
return PageActive(page);
}

+/*
+ * A version of PageSlabPfmemalloc() for opportunistic checks where the page
+ * might have been freed under us and not be a PageSlab anymore.
+ */
+static inline int __PageSlabPfmemalloc(struct page *page)
+{
+ return PageActive(page);
+}
+
static inline void SetPageSlabPfmemalloc(struct page *page)
{
VM_BUG_ON_PAGE(!PageSlab(page), page);
diff --git a/mm/slub.c b/mm/slub.c
index 7eb06fe9d7a0..d60d48c35f98 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2603,6 +2603,19 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
return true;
}

+/*
+ * A variant of pfmemalloc_match() that tests page flags without asserting
+ * PageSlab. Intended for opportunistic checks before taking a lock and
+ * rechecking that nobody else freed the page under us.
+ */
+static inline bool pfmemalloc_match_unsafe(struct page *page, gfp_t gfpflags)
+{
+ if (unlikely(__PageSlabPfmemalloc(page)))
+ return gfp_pfmemalloc_allowed(gfpflags);
+
+ return true;
+}
+
/*
* Check the page->freelist of a page and either transfer the freelist to the
* per cpu freelist or deactivate the page.
@@ -2704,7 +2717,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* PFMEMALLOC but right now, we are losing the pfmemalloc
* information when the page leaves the per-cpu allocator
*/
- if (unlikely(!pfmemalloc_match(page, gfpflags)))
+ if (unlikely(!try_pfmemalloc_match(page, gfpflags)))
goto deactivate_slab;

/* must check again c->page in case IRQ handler changed it */
--
2.32.0
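
The pattern that patch 13/35 relies on (check the cached percpu page with irqs enabled, then recheck c->page once irqs are disabled) can be sketched in plain C. This is a simplified userspace model under stated assumptions: irq_hook is a hypothetical stand-in for an interrupt that replaces the cached page, irq_disable()/irq_enable() only flip a flag, and the struct names are invented rather than taken from mm/slub.c.

```c
#include <assert.h>
#include <stddef.h>

struct slab_page { int node; };
struct cpu_cache { struct slab_page *page; };

/* Hypothetical hook: simulates an irq handler that swaps c->page once. */
static void (*irq_hook)(struct cpu_cache *c);

static int irqs_off;
static void irq_disable(void) { irqs_off = 1; }
static void irq_enable(void)  { irqs_off = 0; }

static struct slab_page other = { 0 };
static void swap_page(struct cpu_cache *c) { c->page = &other; }

/*
 * Returns how many attempts were needed to confirm a suitable cached
 * page, or -1 when the cached page does not fit (the slow path in the
 * real allocator).
 */
static int alloc_from_cpu_page(struct cpu_cache *c, int node)
{
	int attempts = 0;

	for (;;) {
		struct slab_page *page = c->page;	/* unlocked snapshot */

		attempts++;
		if (!page || page->node != node)
			return -1;

		/* Window with irqs enabled: an "interrupt" may fire here. */
		if (irq_hook) {
			void (*hook)(struct cpu_cache *) = irq_hook;

			irq_hook = NULL;
			hook(c);
		}

		irq_disable();
		if (page != c->page) {
			/* The snapshot went stale: drop it and start over. */
			irq_enable();
			continue;
		}
		/* c->page is stable from here until irqs are enabled again. */
		irq_enable();
		return attempts;
	}
}
```

The checks done on the snapshot are only a hint; the comparison under irq_disable() is what decides whether the result may be used.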

2021-08-15 10:21:41

by Vlastimil Babka

Subject: Re: [PATCH v4 00/35] SLUB: reduce irq disabled scope and make it RT compatible

On 8/10/21 4:36 PM, Vlastimil Babka wrote:
> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>> Series is based on 5.14-rc4 and also available as a git branch:
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0
>
> New branch with fixed up locking orders in patch 29/35:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1

New branch with fixed up VM_BUG_ON in patch 13/35:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r2

2021-08-15 10:25:05

by Vlastimil Babka

Subject: Re: [PATCH v4 13/35] mm, slub: do initial checks in ___slab_alloc() with irqs enabled

On 8/15/21 12:14 PM, Vlastimil Babka wrote:
> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>> As another step of shortening irq disabled sections in ___slab_alloc(), delay
>> disabling irqs until we pass the initial checks if there is a cached percpu
>> slab and it's suitable for our allocation.
>>
>> Now we have to recheck c->page after actually disabling irqs as an allocation
>> in irq handler might have replaced it.
>
> Please add an extra paragraph that relates to the fixup below (which I
> assume will be squashed as usual):
>
> Because we call pfmemalloc_match() as one of the checks, we might hit
> VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get
> interrupted and the page is freed. Thus introduce a
> pfmemalloc_match_unsafe() variant that lacks the PageSlab check.
>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> Acked-by: Mel Gorman <[email protected]>
>
> And the fixup:

Oops, renaming snafu. Again.

----8<----
From bf81bca38b127a8d717978467cf7264580c81248 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Sun, 15 Aug 2021 11:49:46 +0200
Subject: [PATCH] mm, slub: prevent VM_BUG_ON in PageSlabPfmemalloc from
___slab_alloc

Clark Williams reported [1] a VM_BUG_ON in PageSlabPfmemalloc:

page:000000009ac5dd73 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1ab3db
flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0000000 ffffee1286aceb88 ffffee1287b66288 0000000000000000
raw: 0000000000000000 0000000000100000 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!PageSlab(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:814!
invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
CPU: 3 PID: 12345 Comm: hackbench Not tainted 5.14.0-rc5-rt8+ #12
Hardware name: /NUC5i7RYB, BIOS RYBDWi35.86A.0359.2016.0906.1028 09/06/2016
RIP: 0010:___slab_alloc+0x340/0x940
Code: c6 48 0f a3 05 b1 7b 57 03 72 99 c7 85 78 ff ff ff ff ff ff ff 48 8b 7d 88 e9 8d fd ff ff 48 c7 c6 50 5a 7c b0 e>
RSP: 0018:ffffba1c4a8b7ab0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffff9bb765118000
RDX: 0000000000000000 RSI: ffffffffaf426050 RDI: 00000000ffffffff
RBP: ffffba1c4a8b7b70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9bb7410d3600
R13: 0000000000400cc0 R14: 00000000001f7770 R15: ffff9bbe76df7770
FS: 00007f474b1be740(0000) GS:ffff9bbe76c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f60c04bdaf8 CR3: 0000000124f3a003 CR4: 00000000003706e0
Call Trace:
? __alloc_skb+0x1db/0x270
? __alloc_skb+0x1db/0x270
? kmem_cache_alloc_node+0xa4/0x2b0
kmem_cache_alloc_node+0xa4/0x2b0
__alloc_skb+0x1db/0x270
alloc_skb_with_frags+0x64/0x250
sock_alloc_send_pskb+0x260/0x2b0
? bpf_lsm_socket_getpeersec_dgram+0xa/0x10
unix_stream_sendmsg+0x27c/0x550
? unix_seqpacket_recvmsg+0x60/0x60
sock_sendmsg+0xbd/0xd0
sock_write_iter+0xb9/0x120
new_sync_write+0x175/0x200
vfs_write+0x3c4/0x510
ksys_write+0xc9/0x110
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

The problem is that we are opportunistically checking flags on a page in an irq
enabled section. If we are interrupted and the page is freed, it's not an
issue as we detect it after disabling irqs. But on kernels with
CONFIG_DEBUG_VM, the check for the PageSlab flag in PageSlabPfmemalloc() can fail.

Fix this by creating an "unsafe" version of the check that doesn't check
PageSlab.

This is a fixup for mmotm patch
mm-slub-do-initial-checks-in-___slab_alloc-with-irqs-enabled.patch

[1] https://lore.kernel.org/lkml/[email protected]/

Reported-by: Clark Williams <[email protected]>
Tested-by: Mike Galbraith <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/page-flags.h | 9 +++++++++
mm/slub.c | 15 ++++++++++++++-
2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5922031ffab6..7fda4fb85bdc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -815,6 +815,15 @@ static inline int PageSlabPfmemalloc(struct page *page)
return PageActive(page);
}

+/*
+ * A version of PageSlabPfmemalloc() for opportunistic checks where the page
+ * might have been freed under us and not be a PageSlab anymore.
+ */
+static inline int __PageSlabPfmemalloc(struct page *page)
+{
+ return PageActive(page);
+}
+
static inline void SetPageSlabPfmemalloc(struct page *page)
{
VM_BUG_ON_PAGE(!PageSlab(page), page);
diff --git a/mm/slub.c b/mm/slub.c
index 7eb06fe9d7a0..d60d48c35f98 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2603,6 +2603,19 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
return true;
}

+/*
+ * A variant of pfmemalloc_match() that tests page flags without asserting
+ * PageSlab. Intended for opportunistic checks before taking a lock and
+ * rechecking that nobody else freed the page under us.
+ */
+static inline bool pfmemalloc_match_unsafe(struct page *page, gfp_t gfpflags)
+{
+ if (unlikely(__PageSlabPfmemalloc(page)))
+ return gfp_pfmemalloc_allowed(gfpflags);
+
+ return true;
+}
+
/*
* Check the page->freelist of a page and either transfer the freelist to the
* per cpu freelist or deactivate the page.
@@ -2704,7 +2717,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* PFMEMALLOC but right now, we are losing the pfmemalloc
* information when the page leaves the per-cpu allocator
*/
- if (unlikely(!pfmemalloc_match(page, gfpflags)))
+ if (unlikely(!pfmemalloc_match_unsafe(page, gfpflags)))
goto deactivate_slab;

/* must check again c->page in case IRQ handler changed it */
--
2.32.0
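
To make the difference between the asserting and the "unsafe" accessor concrete, here is a minimal userspace model. It assumes a two-bit flags word and counts what a debug assertion would have caught instead of crashing; struct page_model, PG_SLAB, PG_ACTIVE and debug_failures are hypothetical names for illustration, not the definitions from include/linux/page-flags.h.

```c
#include <assert.h>
#include <stdbool.h>

#define PG_SLAB   (1u << 0)
#define PG_ACTIVE (1u << 1)	/* doubles as the pfmemalloc marker on slab pages */

struct page_model { unsigned int flags; };

/* Counts the cases where a VM_BUG_ON_PAGE()-style check would fire. */
static int debug_failures;

/* Asserting accessor: only valid while the page is known to be a slab page. */
static bool page_slab_pfmemalloc(const struct page_model *p)
{
	if (!(p->flags & PG_SLAB))
		debug_failures++;
	return p->flags & PG_ACTIVE;
}

/* Non-asserting accessor for opportunistic checks on a possibly freed page. */
static bool page_slab_pfmemalloc_unsafe(const struct page_model *p)
{
	return p->flags & PG_ACTIVE;
}

/* The result is only a hint; the caller rechecks under its lock. */
static bool pfmemalloc_match_unsafe(const struct page_model *p,
				    bool pfmemalloc_allowed)
{
	if (page_slab_pfmemalloc_unsafe(p))
		return pfmemalloc_allowed;
	return true;
}
```

A page that was freed between the snapshot and the check no longer carries the slab flag, so only the asserting accessor objects to it.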


2021-08-15 12:29:35

by Sven Eckelmann

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

> mm, slub: convert kmem_cpu_slab protection to local_lock
>
> Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions
> of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT
> that's equivalent, with better lockdep visibility. On PREEMPT_RT that
> means better preemption.
[...]

Looks like this change breaks booting when 64BIT+LOCK_STAT is enabled on
x86_64:

general protection fault, maybe for address 0xffff888007fcf1c8: 0000 [#1] NOPTI
CPU: 0 PID: 0 Comm: swapper Not tainted 5.14.0-rc5+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
RIP: 0010:kmem_cache_alloc+0x81/0x180
Code: 79 48 00 4c 8b 41 38 0f 84 89 00 00 00 4d 85 c0 0f 84 80 00 00 00 41 8b 44 24 28 49 8b 3c 24 48 8d 4a 01 49 8b 1c 00 4c 89 c0 <48> 0f c7 4f 38 0f 943
RSP: 0000:ffffffff81803c10 EFLAGS: 00000286
RAX: ffff88800244e7c0 RBX: ffff88800244e800 RCX: 0000000000000024
RDX: 0000000000000023 RSI: 0000000000000100 RDI: ffff888007fcf190
RBP: ffffffff81803c38 R08: ffff88800244e7c0 R09: 0000000000000dc0
R10: 0000000000004000 R11: 0000000000000000 R12: ffff8880024413c0
R13: ffffffff810d18f4 R14: 0000000000000dc0 R15: 0000000000000100
FS: 0000000000000000(0000) GS:ffffffff81840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888002001000 CR3: 0000000001824000 CR4: 00000000000006b0
Call Trace:
__get_vm_area_node.constprop.0.isra.0+0x74/0x150
__vmalloc_node_range+0x5a/0x2b0
? kernel_clone+0x88/0x390
? copy_process+0x1ac/0x17e0
copy_process+0x768/0x17e0
? kernel_clone+0x88/0x390
kernel_clone+0x88/0x390
? _vm_unmap_aliases.part.0+0xe9/0x110
? change_page_attr_set_clr+0x10d/0x180
kernel_thread+0x43/0x50
? rest_init+0x100/0x100
rest_init+0x1e/0x100
arch_call_rest_init+0x9/0xc
start_kernel+0x481/0x493
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0x80/0x84
secondary_startup_64_no_verify+0xc2/0xcb
random: get_random_bytes called from oops_exit+0x34/0x60 with crng_init=0
---[ end trace 2cac18ac38f640c1 ]---
RIP: 0010:kmem_cache_alloc+0x81/0x180
Code: 79 48 00 4c 8b 41 38 0f 84 89 00 00 00 4d 85 c0 0f 84 80 00 00 00 41 8b 44 24 28 49 8b 3c 24 48 8d 4a 01 49 8b 1c 00 4c 89 c0 <48> 0f c7 4f 38 0f 943
RSP: 0000:ffffffff81803c10 EFLAGS: 00000286
RAX: ffff88800244e7c0 RBX: ffff88800244e800 RCX: 0000000000000024
RDX: 0000000000000023 RSI: 0000000000000100 RDI: ffff888007fcf190
RBP: ffffffff81803c38 R08: ffff88800244e7c0 R09: 0000000000000dc0
R10: 0000000000004000 R11: 0000000000000000 R12: ffff8880024413c0
R13: ffffffff810d18f4 R14: 0000000000000dc0 R15: 0000000000000100
FS: 0000000000000000(0000) GS:ffffffff81840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888002001000 CR3: 0000000001824000 CR4: 00000000000006b0
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---


This was tested in qemu-system-x86_64 (from Debian Bullseye) with

cat > .config << "EOF"
# CONFIG_LOCALVERSION_AUTO is not set
# CONFIG_SWAP is not set
# CONFIG_CROSS_MEMORY_ATTACH is not set
# CONFIG_UTS_NS is not set
# CONFIG_TIME_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB_MERGE_DEFAULT is not set
# CONFIG_RETPOLINE is not set
# CONFIG_X86_EXTENDED_PLATFORM is not set
# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
# CONFIG_X86_MCE is not set
# CONFIG_X86_IOPL_IOPERM is not set
# CONFIG_MICROCODE is not set
# CONFIG_MTRR_SANITIZER is not set
# CONFIG_RELOCATABLE is not set
# CONFIG_SUSPEND is not set
# CONFIG_ACPI is not set
# CONFIG_DMIID is not set
# CONFIG_VIRTUALIZATION is not set
# CONFIG_SECCOMP is not set
# CONFIG_STACKPROTECTOR is not set
# CONFIG_BLK_DEV_BSG is not set
# CONFIG_MQ_IOSCHED_DEADLINE is not set
# CONFIG_MQ_IOSCHED_KYBER is not set
# CONFIG_BINFMT_ELF is not set
# CONFIG_BINFMT_SCRIPT is not set
# CONFIG_COMPACTION is not set
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set
# CONFIG_BLK_DEV is not set
# CONFIG_INPUT_KEYBOARD is not set
# CONFIG_INPUT_MOUSE is not set
# CONFIG_SERIO is not set
# CONFIG_LEGACY_PTYS is not set
# CONFIG_LDISC_AUTOLOAD is not set
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
# CONFIG_HW_RANDOM is not set
# CONFIG_DEVMEM is not set
# CONFIG_HWMON is not set
# CONFIG_HID is not set
# CONFIG_USB_SUPPORT is not set
# CONFIG_VIRTIO_MENU is not set
# CONFIG_VHOST_MENU is not set
# CONFIG_X86_PLATFORM_DEVICES is not set
# CONFIG_IOMMU_SUPPORT is not set
# CONFIG_MANDATORY_FILE_LOCKING is not set
# CONFIG_DNOTIFY is not set
# CONFIG_INOTIFY_USER is not set
# CONFIG_MISC_FILESYSTEMS is not set
# CONFIG_SYMBOLIC_ERRNAME is not set
CONFIG_FRAME_WARN=1024
# CONFIG_SECTION_MISMATCH_WARN_ONLY is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_LOCK_STAT=y
# CONFIG_FTRACE is not set
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_UNWINDER_FRAME_POINTER=y
# CONFIG_RUNTIME_TESTING_MENU is not set
CONFIG_64BIT=y
EOF

make ARCH=x86_64 olddefconfig
make ARCH=x86_64 all

qemu-system-x86_64 -nographic -kernel arch/x86/boot/bzImage -append console=ttyS0

Here is the bisect log:

# bad: [4b358aabb93a2c654cd1dcab1a25a589f6e2b153] Add linux-next specific files for 20210813
# good: [36a21d51725af2ce0700c6ebcb6b9594aac658a6] Linux 5.14-rc5
git bisect start 'HEAD' 'v5.14-rc5'
# good: [204808b2ca750e27cbad3455f7cb4368c4f5b260] Merge remote-tracking branch 'crypto/master'
git bisect good 204808b2ca750e27cbad3455f7cb4368c4f5b260
# good: [2201162fca73b487152bcff2ebb0f85c1dde8479] Merge remote-tracking branch 'tip/auto-latest'
git bisect good 2201162fca73b487152bcff2ebb0f85c1dde8479
# good: [41f97b6df1c8fd9fa828967a687693454c4ce888] Merge remote-tracking branch 'staging/staging-next'
git bisect good 41f97b6df1c8fd9fa828967a687693454c4ce888
# good: [797896d32d52af43fc9b0099a198ef29c2ca0138] Merge remote-tracking branch 'userns/for-next'
git bisect good 797896d32d52af43fc9b0099a198ef29c2ca0138
# bad: [5c7e12cc3d39b5cfc0d1a470139e4e89f3b147ed] arch: Kconfig: fix spelling mistake "seperate" -> "separate"
git bisect bad 5c7e12cc3d39b5cfc0d1a470139e4e89f3b147ed
# bad: [3535cf93a31ecd8595744881dbbda666cf61be48] add-mmap_assert_locked-annotations-to-find_vma-fix
git bisect bad 3535cf93a31ecd8595744881dbbda666cf61be48
# bad: [ac90b3dacf327cecda9f3dabc0051c7332770224] mm/debug_vm_pgtable: use struct pgtable_debug_args in PGD and P4D modifying tests
git bisect bad ac90b3dacf327cecda9f3dabc0051c7332770224
# good: [3a7ac8f97abde2d6ec973a00a449c45b9642a15a] mm, slub: do initial checks in ___slab_alloc() with irqs enabled
git bisect good 3a7ac8f97abde2d6ec973a00a449c45b9642a15a
# good: [1c84f3c916405dc3d62cfca55fb2a84de9d7f31e] mm, slub: fix memory and cpu hotplug related lock ordering issues
git bisect good 1c84f3c916405dc3d62cfca55fb2a84de9d7f31e
# bad: [6ac9c394652dbba1181268cb09513edbd733685c] mm/debug_vm_pgtable: introduce struct pgtable_debug_args
git bisect bad 6ac9c394652dbba1181268cb09513edbd733685c
# good: [35a6f4bcf4ad9c8c0208ea48044539a952859b3a] mm, slub: make slab_lock() disable irqs with PREEMPT_RT
git bisect good 35a6f4bcf4ad9c8c0208ea48044539a952859b3a
# good: [03e736e3ca2c0a48822609a89ffa0329f4eb5aae] mm, slub: use migrate_disable() on PREEMPT_RT
git bisect good 03e736e3ca2c0a48822609a89ffa0329f4eb5aae
# bad: [3f57fd12e8b7eb77412623c347566fb83ec5e764] mm, slub: convert kmem_cpu_slab protection to local_lock
git bisect bad 3f57fd12e8b7eb77412623c347566fb83ec5e764
# first bad commit: [3f57fd12e8b7eb77412623c347566fb83ec5e764] mm, slub: convert kmem_cpu_slab protection to local_lock

I haven't checked what part of the change is actually causing the problem.

Kind regards,
Sven
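
One possible reading of the oops above, offered as an assumption and not as a finding from this thread: the faulting bytes `48 0f c7 4f 38` decode to `cmpxchg16b 0x38(%rdi)`, and RDI + 0x38 equals the reported address 0xffff888007fcf1c8, which sits 8 bytes past a 16-byte boundary. That instruction requires a 16-byte-aligned operand. The sketch below uses hypothetical structs (not the real struct kmem_cache_cpu or local_lock_t) to show how a lock that grows with a debug option could push a following pair of words off that alignment, and how forcing the alignment of the first word avoids it.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lock layouts: the debug variant carries extra bookkeeping. */
struct lock_plain { uint64_t words[2]; };	/* 16 bytes */
struct lock_debug { uint64_t words[3]; };	/* 24 bytes */

/* freelist and tid model a pair updated by one double-word cmpxchg. */
struct cache_plain {
	struct lock_plain lock;
	uint64_t freelist;
	uint64_t tid;
};

struct cache_debug {
	struct lock_debug lock;
	uint64_t freelist;	/* lands at offset 24, not a multiple of 16 */
	uint64_t tid;
};

/* Forcing the alignment keeps the pair usable whatever the lock size is. */
struct cache_fixed {
	struct lock_debug lock;
	alignas(16) uint64_t freelist;
	uint64_t tid;
};

static int pair_is_aligned(size_t offset)
{
	return offset % 16 == 0;
}
```

Moving the pair to the front of the struct would be another way to get the same effect; which layout the real fix should use is left to the investigation.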



2021-08-17 08:40:32

by Vlastimil Babka

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 8/15/21 2:27 PM, Sven Eckelmann wrote:
>> mm, slub: convert kmem_cpu_slab protection to local_lock
>>
>> Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions
>> of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT
>> that's equivalent, with better lockdep visibility. On PREEMPT_RT that
>> means better preemption.
> [...]
>
> Looks like this change breaks booting when 64BIT+LOCK_STAT is enabled on
> x86_64:

OK reproduced. Thanks, will investigate.

> general protection fault, maybe for address 0xffff888007fcf1c8: 0000 [#1] NOPTI
> CPU: 0 PID: 0 Comm: swapper Not tainted 5.14.0-rc5+ #7
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> RIP: 0010:kmem_cache_alloc+0x81/0x180
> Code: 79 48 00 4c 8b 41 38 0f 84 89 00 00 00 4d 85 c0 0f 84 80 00 00 00 41 8b 44 24 28 49 8b 3c 24 48 8d 4a 01 49 8b 1c 00 4c 89 c0 <48> 0f c7 4f 38 0f 943
> RSP: 0000:ffffffff81803c10 EFLAGS: 00000286
> RAX: ffff88800244e7c0 RBX: ffff88800244e800 RCX: 0000000000000024
> RDX: 0000000000000023 RSI: 0000000000000100 RDI: ffff888007fcf190
> RBP: ffffffff81803c38 R08: ffff88800244e7c0 R09: 0000000000000dc0
> R10: 0000000000004000 R11: 0000000000000000 R12: ffff8880024413c0
> R13: ffffffff810d18f4 R14: 0000000000000dc0 R15: 0000000000000100
> FS: 0000000000000000(0000) GS:ffffffff81840000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff888002001000 CR3: 0000000001824000 CR4: 00000000000006b0
> Call Trace:
> __get_vm_area_node.constprop.0.isra.0+0x74/0x150
> __vmalloc_node_range+0x5a/0x2b0
> ? kernel_clone+0x88/0x390
> ? copy_process+0x1ac/0x17e0
> copy_process+0x768/0x17e0
> ? kernel_clone+0x88/0x390
> kernel_clone+0x88/0x390
> ? _vm_unmap_aliases.part.0+0xe9/0x110
> ? change_page_attr_set_clr+0x10d/0x180
> kernel_thread+0x43/0x50
> ? rest_init+0x100/0x100
> rest_init+0x1e/0x100
> arch_call_rest_init+0x9/0xc
> start_kernel+0x481/0x493
> x86_64_start_reservations+0x24/0x26
> x86_64_start_kernel+0x80/0x84
> secondary_startup_64_no_verify+0xc2/0xcb
> random: get_random_bytes called from oops_exit+0x34/0x60 with crng_init=0
> ---[ end trace 2cac18ac38f640c1 ]---
> RIP: 0010:kmem_cache_alloc+0x81/0x180
> Code: 79 48 00 4c 8b 41 38 0f 84 89 00 00 00 4d 85 c0 0f 84 80 00 00 00 41 8b 44 24 28 49 8b 3c 24 48 8d 4a 01 49 8b 1c 00 4c 89 c0 <48> 0f c7 4f 38 0f 943
> RSP: 0000:ffffffff81803c10 EFLAGS: 00000286
> RAX: ffff88800244e7c0 RBX: ffff88800244e800 RCX: 0000000000000024
> RDX: 0000000000000023 RSI: 0000000000000100 RDI: ffff888007fcf190
> RBP: ffffffff81803c38 R08: ffff88800244e7c0 R09: 0000000000000dc0
> R10: 0000000000004000 R11: 0000000000000000 R12: ffff8880024413c0
> R13: ffffffff810d18f4 R14: 0000000000000dc0 R15: 0000000000000100
> FS: 0000000000000000(0000) GS:ffffffff81840000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff888002001000 CR3: 0000000001824000 CR4: 00000000000006b0
> Kernel panic - not syncing: Attempted to kill the idle task!
> ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
>
>
> This was tested in qemu-system-x86_64 (from Debian Bullseye) with
>
> cat > .config << "EOF"
> # CONFIG_LOCALVERSION_AUTO is not set
> # CONFIG_SWAP is not set
> # CONFIG_CROSS_MEMORY_ATTACH is not set
> # CONFIG_UTS_NS is not set
> # CONFIG_TIME_NS is not set
> # CONFIG_PID_NS is not set
> # CONFIG_COMPAT_BRK is not set
> # CONFIG_SLAB_MERGE_DEFAULT is not set
> # CONFIG_RETPOLINE is not set
> # CONFIG_X86_EXTENDED_PLATFORM is not set
> # CONFIG_SCHED_OMIT_FRAME_POINTER is not set
> # CONFIG_X86_MCE is not set
> # CONFIG_X86_IOPL_IOPERM is not set
> # CONFIG_MICROCODE is not set
> # CONFIG_MTRR_SANITIZER is not set
> # CONFIG_RELOCATABLE is not set
> # CONFIG_SUSPEND is not set
> # CONFIG_ACPI is not set
> # CONFIG_DMIID is not set
> # CONFIG_VIRTUALIZATION is not set
> # CONFIG_SECCOMP is not set
> # CONFIG_STACKPROTECTOR is not set
> # CONFIG_BLK_DEV_BSG is not set
> # CONFIG_MQ_IOSCHED_DEADLINE is not set
> # CONFIG_MQ_IOSCHED_KYBER is not set
> # CONFIG_BINFMT_ELF is not set
> # CONFIG_BINFMT_SCRIPT is not set
> # CONFIG_COMPACTION is not set
> # CONFIG_STANDALONE is not set
> # CONFIG_PREVENT_FIRMWARE_BUILD is not set
> # CONFIG_BLK_DEV is not set
> # CONFIG_INPUT_KEYBOARD is not set
> # CONFIG_INPUT_MOUSE is not set
> # CONFIG_SERIO is not set
> # CONFIG_LEGACY_PTYS is not set
> # CONFIG_LDISC_AUTOLOAD is not set
> CONFIG_SERIAL_8250=y
> CONFIG_SERIAL_8250_CONSOLE=y
> # CONFIG_HW_RANDOM is not set
> # CONFIG_DEVMEM is not set
> # CONFIG_HWMON is not set
> # CONFIG_HID is not set
> # CONFIG_USB_SUPPORT is not set
> # CONFIG_VIRTIO_MENU is not set
> # CONFIG_VHOST_MENU is not set
> # CONFIG_X86_PLATFORM_DEVICES is not set
> # CONFIG_IOMMU_SUPPORT is not set
> # CONFIG_MANDATORY_FILE_LOCKING is not set
> # CONFIG_DNOTIFY is not set
> # CONFIG_INOTIFY_USER is not set
> # CONFIG_MISC_FILESYSTEMS is not set
> # CONFIG_SYMBOLIC_ERRNAME is not set
> CONFIG_FRAME_WARN=1024
> # CONFIG_SECTION_MISMATCH_WARN_ONLY is not set
> CONFIG_DEBUG_KERNEL=y
> CONFIG_LOCK_STAT=y
> # CONFIG_FTRACE is not set
> # CONFIG_X86_VERBOSE_BOOTUP is not set
> CONFIG_UNWINDER_FRAME_POINTER=y
> # CONFIG_RUNTIME_TESTING_MENU is not set
> CONFIG_64BIT=y
> EOF
>
> make ARCH=x86_64 olddefconfig
> make ARCH=x86_64 all
>
> qemu-system-x86_64 -nographic -kernel arch/x86/boot/bzImage -append console=ttyS0
>
> Here is the bisect log:
>
> # bad: [4b358aabb93a2c654cd1dcab1a25a589f6e2b153] Add linux-next specific files for 20210813
> # good: [36a21d51725af2ce0700c6ebcb6b9594aac658a6] Linux 5.14-rc5
> git bisect start 'HEAD' 'v5.14-rc5'
> # good: [204808b2ca750e27cbad3455f7cb4368c4f5b260] Merge remote-tracking branch 'crypto/master'
> git bisect good 204808b2ca750e27cbad3455f7cb4368c4f5b260
> # good: [2201162fca73b487152bcff2ebb0f85c1dde8479] Merge remote-tracking branch 'tip/auto-latest'
> git bisect good 2201162fca73b487152bcff2ebb0f85c1dde8479
> # good: [41f97b6df1c8fd9fa828967a687693454c4ce888] Merge remote-tracking branch 'staging/staging-next'
> git bisect good 41f97b6df1c8fd9fa828967a687693454c4ce888
> # good: [797896d32d52af43fc9b0099a198ef29c2ca0138] Merge remote-tracking branch 'userns/for-next'
> git bisect good 797896d32d52af43fc9b0099a198ef29c2ca0138
> # bad: [5c7e12cc3d39b5cfc0d1a470139e4e89f3b147ed] arch: Kconfig: fix spelling mistake "seperate" -> "separate"
> git bisect bad 5c7e12cc3d39b5cfc0d1a470139e4e89f3b147ed
> # bad: [3535cf93a31ecd8595744881dbbda666cf61be48] add-mmap_assert_locked-annotations-to-find_vma-fix
> git bisect bad 3535cf93a31ecd8595744881dbbda666cf61be48
> # bad: [ac90b3dacf327cecda9f3dabc0051c7332770224] mm/debug_vm_pgtable: use struct pgtable_debug_args in PGD and P4D modifying tests
> git bisect bad ac90b3dacf327cecda9f3dabc0051c7332770224
> # good: [3a7ac8f97abde2d6ec973a00a449c45b9642a15a] mm, slub: do initial checks in ___slab_alloc() with irqs enabled
> git bisect good 3a7ac8f97abde2d6ec973a00a449c45b9642a15a
> # good: [1c84f3c916405dc3d62cfca55fb2a84de9d7f31e] mm, slub: fix memory and cpu hotplug related lock ordering issues
> git bisect good 1c84f3c916405dc3d62cfca55fb2a84de9d7f31e
> # bad: [6ac9c394652dbba1181268cb09513edbd733685c] mm/debug_vm_pgtable: introduce struct pgtable_debug_args
> git bisect bad 6ac9c394652dbba1181268cb09513edbd733685c
> # good: [35a6f4bcf4ad9c8c0208ea48044539a952859b3a] mm, slub: make slab_lock() disable irqs with PREEMPT_RT
> git bisect good 35a6f4bcf4ad9c8c0208ea48044539a952859b3a
> # good: [03e736e3ca2c0a48822609a89ffa0329f4eb5aae] mm, slub: use migrate_disable() on PREEMPT_RT
> git bisect good 03e736e3ca2c0a48822609a89ffa0329f4eb5aae
> # bad: [3f57fd12e8b7eb77412623c347566fb83ec5e764] mm, slub: convert kmem_cpu_slab protection to local_lock
> git bisect bad 3f57fd12e8b7eb77412623c347566fb83ec5e764
> # first bad commit: [3f57fd12e8b7eb77412623c347566fb83ec5e764] mm, slub: convert kmem_cpu_slab protection to local_lock
>
> I haven't checked what part of the change is actually causing the problem
>
> Kind regards,
> Sven
>

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 2021-08-17 10:37:48 [+0200], Vlastimil Babka wrote:
> OK reproduced. Thanks, will investigate.

With the local_lock at the top, the needed alignment gets broken for dbl
cmpxchg. On RT it was working ;)
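
The fault address in the report can be reproduced by hand. This is a minimal sketch, assuming the marked opcode bytes `<48> 0f c7 4f 38` in the oops decode to `cmpxchg16b 0x38(%rdi)` (a reading of the Code: line, not something stated in the thread). The macro and helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the oops: RDI and the displacement of the faulting insn. */
#define OOPS_RDI	UINT64_C(0xffff888007fcf190)
#define INSN_DISP	UINT64_C(0x38)	/* 56, the offset of freelist behind a 56-byte lock */

/* Effective address of the 16-byte memory operand (base register + displacement). */
static uint64_t operand_address(uint64_t base, uint64_t disp)
{
	return base + disp;
}

/* cmpxchg16b raises #GP unless its memory operand is 16-byte aligned. */
static int is_16_byte_aligned(uint64_t addr)
{
	return (addr & 0xf) == 0;
}
```

Under that assumption, `operand_address(OOPS_RDI, INSN_DISP)` gives the address named in the "general protection fault" line, and it ends in 0x8, so it is not 16-byte aligned.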

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index b5bcac29b979c..cd14aa1f9bc3c 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -42,9 +42,9 @@ enum stat_item {
 	NR_SLUB_STAT_ITEMS };

 struct kmem_cache_cpu {
-	local_lock_t lock;	/* Protects the fields below except stat */
 	void **freelist;	/* Pointer to next available object */
 	unsigned long tid;	/* Globally unique transaction id */
+	local_lock_t lock;	/* Protects the fields below except stat */
 	struct page *page;	/* The slab from which we are allocating */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *partial;	/* Partially allocated frozen slabs */
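
The effect of the reordering can be checked in user space. This is a rough sketch, assuming an LP64 target and using a 56-byte stand-in for local_lock_t (the size pahole reports for the LOCK_STAT config elsewhere in this thread). All struct and macro names are hypothetical, and the check only covers offsets within the structure; it assumes the structure itself starts on a 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for local_lock_t with LOCK_STAT: 7 * 8 = 56 bytes on LP64. */
struct sketch_lock {
	unsigned long words[7];
};

/* Layout before the fix: lock is the first member. */
struct cpu_slab_lock_first {
	struct sketch_lock lock;
	void **freelist;
	unsigned long tid;
	void *page;
	void *partial;
};

/* Layout from the diff above: lock moved behind tid. */
struct cpu_slab_lock_after_tid {
	void **freelist;
	unsigned long tid;
	struct sketch_lock lock;
	void *page;
	void *partial;
};

/*
 * The double cmpxchg needs freelist and tid to be adjacent and the pair
 * to start at a multiple of 2 * sizeof(void *).
 */
#define PAIR_OK(type) \
	(offsetof(type, freelist) % (2 * sizeof(void *)) == 0 && \
	 offsetof(type, tid) == offsetof(type, freelist) + sizeof(void *))
```

With the lock first, freelist sits at offset 56 and `PAIR_OK()` is false; with the lock behind tid, the pair starts at offset 0 regardless of the lock's size.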

Sebastian

2021-08-17 09:18:12

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 8/15/21 2:27 PM, Sven Eckelmann wrote:
>> mm, slub: convert kmem_cpu_slab protection to local_lock
>>
>> Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions
>> of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT
>> that's equivalent, with better lockdep visibility. On PREEMPT_RT that
>> means better preemption.
> [...]
>
> Looks like this change breaks booting when 64BIT+LOCK_STAT is enabled on
> x86_64:
>
> general protection fault, maybe for address 0xffff888007fcf1c8: 0000 [#1] NOPTI
> CPU: 0 PID: 0 Comm: swapper Not tainted 5.14.0-rc5+ #7

faddr2line points to slab_alloc_node(), namely this:

	if (unlikely(!this_cpu_cmpxchg_double(
			s->cpu_slab->freelist, s->cpu_slab->tid,
			object, tid,
			next_object, next_tid(tid)))) {

pahole looks like this:

struct kmem_cache_cpu {
	local_lock_t               lock;                 /*     0    56 */
	void * *                   freelist;             /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	long unsigned int          tid;                  /*    64     8 */
	struct page *              page;                 /*    72     8 */
	struct page *              partial;              /*    80     8 */

	/* size: 88, cachelines: 2, members: 5 */
	/* last cacheline: 24 bytes */
};

I had a hunch and added padding between lock and freelist, which indeed
made the problem go away. Now I don't know whether LOCK_STAT has a bug
that makes it write past the local_lock (I guess not) or whether it's a
result of false sharing, which I would have expected to be only a perf
issue, not a correctness issue.
Anyway, it's a good idea to have the data in the same cache line, so I
guess I'll enforce a cacheline boundary between lock and the data?

> [...]

2021-08-17 09:21:41

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 8/17/21 11:12 AM, Sebastian Andrzej Siewior wrote:
> On 2021-08-17 10:37:48 [+0200], Vlastimil Babka wrote:
>> OK reproduced. Thanks, will investigate.
>
> With the local_lock at the top, the needed alignment gets broken for dbl
> cmpxchg. On RT it was working ;)

I'd rather have page and partial in the same cacheline as well. Is it OK
to just move the lock to the end and not care whether it straddles
cachelines? (With CONFIG_SLUB_CPU_PARTIAL it will naturally start
with the next cacheline.)

> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index b5bcac29b979c..cd14aa1f9bc3c 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -42,9 +42,9 @@ enum stat_item {
> NR_SLUB_STAT_ITEMS };
>
> struct kmem_cache_cpu {
> - local_lock_t lock; /* Protects the fields below except stat */
> void **freelist; /* Pointer to next available object */
> unsigned long tid; /* Globally unique transaction id */
> + local_lock_t lock; /* Protects the fields below except stat */
> struct page *page; /* The slab from which we are allocating */
> #ifdef CONFIG_SLUB_CPU_PARTIAL
> struct page *partial; /* Partially allocated frozen slabs */
>
> Sebastian
>

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 2021-08-17 11:17:18 [+0200], Vlastimil Babka wrote:
> On 8/17/21 11:12 AM, Sebastian Andrzej Siewior wrote:
> > On 2021-08-17 10:37:48 [+0200], Vlastimil Babka wrote:
> >> OK reproduced. Thanks, will investigate.
> >
> > With the local_lock at the top, the needed alignment gets broken for dbl
> > cmpxchg. On RT it was working ;)
>
> I'd rather have page and partial in the same cacheline as well. Is it OK
> to just move the lock to the end and not care whether it straddles
> cachelines? (With CONFIG_SLUB_CPU_PARTIAL it will naturally start
> with the next cacheline.)

Moving like you suggested appears to be more efficient and saves a bit
of memory:

RT+ debug:
struct kmem_cache_cpu {
	void * *                   freelist;             /*     0     8 */
	long unsigned int          tid;                  /*     8     8 */
	struct page *              page;                 /*    16     8 */
	struct page *              partial;              /*    24     8 */
	local_lock_t               lock;                 /*    32   144 */

	/* size: 176, cachelines: 3, members: 5 */
	/* last cacheline: 48 bytes */
};

RT no debug:
struct kmem_cache_cpu {
	void * *                   freelist;             /*     0     8 */
	long unsigned int          tid;                  /*     8     8 */
	struct page *              page;                 /*    16     8 */
	struct page *              partial;              /*    24     8 */
	local_lock_t               lock;                 /*    32    32 */

	/* size: 64, cachelines: 1, members: 5 */
};

no-RT, no-debug:
struct kmem_cache_cpu {
	void * *                   freelist;             /*     0     8 */
	long unsigned int          tid;                  /*     8     8 */
	struct page *              page;                 /*    16     8 */
	struct page *              partial;              /*    24     8 */
	local_lock_t               lock;                 /*    32     0 */

	/* size: 32, cachelines: 1, members: 5 */
	/* last cacheline: 32 bytes */
};

no-RT, debug:
struct kmem_cache_cpu {
	void * *                   freelist;             /*     0     8 */
	long unsigned int          tid;                  /*     8     8 */
	struct page *              page;                 /*    16     8 */
	struct page *              partial;              /*    24     8 */
	local_lock_t               lock;                 /*    32    56 */

	/* size: 88, cachelines: 2, members: 5 */
	/* last cacheline: 24 bytes */
};
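
The four totals above follow directly from the lock sizes once the lock is the last member. This is a small sketch that recomputes them, assuming LP64 and modelling the lock as an array of unsigned long (a zero-length array is a GNU C extension). The struct names and the DEFINE_CPU_SLAB() macro are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Generate a kmem_cache_cpu-like struct whose trailing lock occupies
 * `bytes` bytes. Only the sizes are modelled, not the real lock type.
 */
#define DEFINE_CPU_SLAB(name, bytes)			\
	struct name {					\
		void **freelist;			\
		unsigned long tid;			\
		void *page;				\
		void *partial;				\
		unsigned long lock[(bytes) / 8];	\
	}

DEFINE_CPU_SLAB(rt_debug, 144);
DEFINE_CPU_SLAB(rt_nodebug, 32);
DEFINE_CPU_SLAB(nort_nodebug, 0);
DEFINE_CPU_SLAB(nort_debug, 56);

/* The lock always starts right after the four 8-byte fields. */
static_assert(offsetof(struct rt_debug, lock) == 32, "lock starts at offset 32");
static_assert(offsetof(struct nort_debug, lock) == 32, "lock starts at offset 32");
```

In every variant freelist and tid stay at offsets 0 and 8, so the lock size only changes the total size, not the alignment of the cmpxchg pair.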

Sebastian

2021-08-17 09:34:33

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 8/17/21 11:12 AM, Sebastian Andrzej Siewior wrote:
> On 2021-08-17 10:37:48 [+0200], Vlastimil Babka wrote:
>> OK reproduced. Thanks, will investigate.
>
> With the local_lock at the top, the needed alignment gets broken for dbl
> cmpxchg.

Right. I wondered why the checks in __pcpu_double_call_return_bool
didn't trigger. They are VM_BUG_ON() so they did trigger after enabling
DEBUG_VM.
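
The pattern can be sketched in user space: a debug-only assertion that type-checks its condition but never evaluates it unless a debug switch is defined. The SKETCH_* names, the hit counter, and the helper are hypothetical, and the two conditions only approximate the kind of alignment and adjacency checks discussed here.

```c
#include <assert.h>
#include <stdint.h>

static int sketch_bug_hits;	/* counts violated checks instead of crashing */

#ifdef SKETCH_DEBUG_VM
#define SKETCH_VM_BUG_ON(cond)	do { if (cond) sketch_bug_hits++; } while (0)
#else
/* Type-check the condition, but generate no code for it. */
#define SKETCH_VM_BUG_ON(cond)	((void)sizeof(!!(cond)))
#endif

/* Two sanity checks on a pair: 16-byte alignment and adjacency. */
static void sketch_check_pair(const void *p1, const void *p2)
{
	SKETCH_VM_BUG_ON((uintptr_t)p1 % (2 * sizeof(void *)));
	SKETCH_VM_BUG_ON((uintptr_t)p2 != (uintptr_t)p1 + sizeof(void *));
}
```

Built without SKETCH_DEBUG_VM, a misaligned pair passes silently, which mirrors why the crash showed up as a general protection fault rather than as a failed assertion.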

>
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index b5bcac29b979c..cd14aa1f9bc3c 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -42,9 +42,9 @@ enum stat_item {
> NR_SLUB_STAT_ITEMS };
>
> struct kmem_cache_cpu {
> - local_lock_t lock; /* Protects the fields below except stat */
> void **freelist; /* Pointer to next available object */
> unsigned long tid; /* Globally unique transaction id */
> + local_lock_t lock; /* Protects the fields below except stat */
> struct page *page; /* The slab from which we are allocating */
> #ifdef CONFIG_SLUB_CPU_PARTIAL
> struct page *partial; /* Partially allocated frozen slabs */
>
> Sebastian
>

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 2021-08-17 11:31:56 [+0200], Vlastimil Babka wrote:
> On 8/17/21 11:12 AM, Sebastian Andrzej Siewior wrote:
> > On 2021-08-17 10:37:48 [+0200], Vlastimil Babka wrote:
> >> OK reproduced. Thanks, will investigate.
> >
> > With the local_lock at the top, the needed alignment gets broken for dbl
> > cmpxchg.
>
> Right. I wondered why the checks in __pcpu_double_call_return_bool
> didn't trigger. They are VM_BUG_ON() so they did trigger after enabling
> DEBUG_VM.

Without the right debugging enabled

| typedef struct {
| #ifdef CONFIG_DEBUG_LOCK_ALLOC
| 	struct lockdep_map	dep_map;
| 	struct task_struct	*owner;
| #endif
| } local_lock_t;

the struct is just empty.
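
A user-space sketch of the same effect, assuming GNU C, where an empty struct has size 0. The type names and the SKETCH_DEBUG_LOCK_ALLOC switch are made up, and the placeholder members only stand in for the debug fields; they do not reproduce the real sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the typedef quoted above. */
typedef struct {
#ifdef SKETCH_DEBUG_LOCK_ALLOC
	void *dep_map_placeholder[6];
	void *owner;
#endif
} sketch_local_lock_t;

/* Lock as first member, as in the layout that crashed with LOCK_STAT. */
struct sketch_cpu_slab {
	sketch_local_lock_t lock;
	void **freelist;
	unsigned long tid;
};
```

Without the debug switch the lock takes no space, so freelist still lands at offset 0 even though the lock is declared first. That is consistent with the problem not showing up in non-debug !RT builds.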

Sebastian

2021-08-17 10:17:31

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 8/5/21 5:20 PM, Vlastimil Babka wrote:
> Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions of
> local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's
> equivalent, with better lockdep visibility. On PREEMPT_RT that means better
> preemption.
>
> However, the cost on PREEMPT_RT is the loss of lockless fast paths which only
> work with cpu freelist. Those are designed to detect and recover from being
> preempted by other conflicting operations (both fast or slow path), but the
> slow path operations assume they cannot be preempted by a fast path operation,
> which is guaranteed naturally with disabled irqs. With local locks on
> PREEMPT_RT, the fast paths now also need to take the local lock to avoid races.
>
> In the allocation fastpath slab_alloc_node() we can just defer to the slowpath
> __slab_alloc() which also works with cpu freelist, but under the local lock.
> In the free fastpath do_slab_free() we have to add a new local lock protected
> version of freeing to the cpu freelist, as the existing slowpath only works
> with the page freelist.
>
> Also update the comment about locking scheme in SLUB to reflect changes done
> by this series.
>
> [ Mike Galbraith <[email protected]>: use local_lock() without irq in PREEMPT_RT
> scope; debugging of RT crashes resulting in put_cpu_partial() locking changes ]
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---

Another fixup. Is it too many and should we replace it all with a v5?
----8<----
From b13291ca13effc2b22a55619aada688ad5defa4b Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Tue, 17 Aug 2021 11:47:16 +0200
Subject: [PATCH] mm, slub: fix kmem_cache_cpu fields alignment for double
cmpxchg

Sven Eckelmann reports [1] that the addition of local_lock to kmem_cache_cpu
breaks a config with 64BIT+LOCK_STAT:

general protection fault, maybe for address 0xffff888007fcf1c8: 0000 [#1] NOPTI
CPU: 0 PID: 0 Comm: swapper Not tainted 5.14.0-rc5+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
RIP: 0010:kmem_cache_alloc+0x81/0x180
Code: 79 48 00 4c 8b 41 38 0f 84 89 00 00 00 4d 85 c0 0f 84 80 00 00 00 41 8b 44 24 28 49 8b 3c 24 48 8d 4a 01 49 8b 1c 00 4c 89 c0 <48> 0f c7 4f 38 0f 943
RSP: 0000:ffffffff81803c10 EFLAGS: 00000286
RAX: ffff88800244e7c0 RBX: ffff88800244e800 RCX: 0000000000000024
RDX: 0000000000000023 RSI: 0000000000000100 RDI: ffff888007fcf190
RBP: ffffffff81803c38 R08: ffff88800244e7c0 R09: 0000000000000dc0
R10: 0000000000004000 R11: 0000000000000000 R12: ffff8880024413c0
R13: ffffffff810d18f4 R14: 0000000000000dc0 R15: 0000000000000100
FS: 0000000000000000(0000) GS:ffffffff81840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888002001000 CR3: 0000000001824000 CR4: 00000000000006b0
Call Trace:
__get_vm_area_node.constprop.0.isra.0+0x74/0x150
__vmalloc_node_range+0x5a/0x2b0
? kernel_clone+0x88/0x390
? copy_process+0x1ac/0x17e0
copy_process+0x768/0x17e0
? kernel_clone+0x88/0x390
kernel_clone+0x88/0x390
? _vm_unmap_aliases.part.0+0xe9/0x110
? change_page_attr_set_clr+0x10d/0x180
kernel_thread+0x43/0x50
? rest_init+0x100/0x100
rest_init+0x1e/0x100
arch_call_rest_init+0x9/0xc
start_kernel+0x481/0x493
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0x80/0x84
secondary_startup_64_no_verify+0xc2/0xcb
random: get_random_bytes called from oops_exit+0x34/0x60 with crng_init=0
---[ end trace 2cac18ac38f640c1 ]---
RIP: 0010:kmem_cache_alloc+0x81/0x180
Code: 79 48 00 4c 8b 41 38 0f 84 89 00 00 00 4d 85 c0 0f 84 80 00 00 00 41 8b 44 24 28 49 8b 3c 24 48 8d 4a 01 49 8b 1c 00 4c 89 c0 <48> 0f c7 4f 38 0f 943
RSP: 0000:ffffffff81803c10 EFLAGS: 00000286
RAX: ffff88800244e7c0 RBX: ffff88800244e800 RCX: 0000000000000024
RDX: 0000000000000023 RSI: 0000000000000100 RDI: ffff888007fcf190
RBP: ffffffff81803c38 R08: ffff88800244e7c0 R09: 0000000000000dc0
R10: 0000000000004000 R11: 0000000000000000 R12: ffff8880024413c0
R13: ffffffff810d18f4 R14: 0000000000000dc0 R15: 0000000000000100
FS: 0000000000000000(0000) GS:ffffffff81840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888002001000 CR3: 0000000001824000 CR4: 00000000000006b0
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Decoding the RIP points to this_cpu_cmpxchg_double() call in slab_alloc_node().

The problem is the particular size of local_lock_t with LOCK_STAT resulting
in the following layout:

struct kmem_cache_cpu {
	local_lock_t               lock;                 /*     0    56 */
	void * *                   freelist;             /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	long unsigned int          tid;                  /*    64     8 */
	struct page *              page;                 /*    72     8 */
	struct page *              partial;              /*    80     8 */

	/* size: 88, cachelines: 2, members: 5 */
	/* last cacheline: 24 bytes */
};

As pointed out by Sebastian Andrzej Siewior, this_cpu_cmpxchg_double()
needs the freelist and tid fields to be aligned to the sum of their sizes
(16 bytes), but they are not in this configuration. This didn't happen
with non-debug RT and !RT configs, nor with lockdep.

To fix this, move the lock field below the partial field, so that it
doesn't affect the layout.

[1] https://lore.kernel.org/linux-mm/2666777.vCjUEy5FO1@sven-desktop/

This is a fixup for mmotm patch
mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch

Reported-by: Sven Eckelmann <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/slub_def.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index b5bcac29b979..85499f0586b0 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -41,14 +41,18 @@ enum stat_item {
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
 	NR_SLUB_STAT_ITEMS };

+/*
+ * When changing the layout, make sure freelist and tid are still compatible
+ * with this_cpu_cmpxchg_double() alignment requirements.
+ */
 struct kmem_cache_cpu {
-	local_lock_t lock;	/* Protects the fields below except stat */
 	void **freelist;	/* Pointer to next available object */
 	unsigned long tid;	/* Globally unique transaction id */
 	struct page *page;	/* The slab from which we are allocating */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *partial;	/* Partially allocated frozen slabs */
 #endif
+	local_lock_t lock;	/* Protects the fields above */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
--
2.32.0


2021-08-17 10:26:38

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 00/35] SLUB: reduce irq disabled scope and make it RT compatible

On 8/15/21 12:18 PM, Vlastimil Babka wrote:
> On 8/10/21 4:36 PM, Vlastimil Babka wrote:
>> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>>> Series is based on 5.14-rc4 and also available as a git branch:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0
>>
>> New branch with fixed up locking orders in patch 29/35:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1
>
> New branch with fixed up VM_BUG_ON in patch 13/35:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r2

New branch with fixed struct kmem_cache_cpu layout in patch 35/35
(and a rebase to 5.14-rc6)

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r3

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 2021-08-05 17:20:00 [+0200], Vlastimil Babka wrote:
> @@ -2849,7 +2891,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
> load_freelist:
>
> - lockdep_assert_irqs_disabled();
> +#ifdef CONFIG_PREEMPT_RT
> + lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock.lock));
> +#else
> + lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
> +#endif

Could you please make this hunk only

lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));

i.e. the non-RT version?

> /*
> * freelist is pointing to the list of objects to be used.


Sebastian

2021-08-17 15:43:55

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 8/17/21 5:39 PM, Sebastian Andrzej Siewior wrote:
> On 2021-08-05 17:20:00 [+0200], Vlastimil Babka wrote:
>> @@ -2849,7 +2891,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>
>> load_freelist:
>>
>> - lockdep_assert_irqs_disabled();
>> +#ifdef CONFIG_PREEMPT_RT
>> + lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock.lock));
>> +#else
>> + lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
>> +#endif
>
> Could you please make this hunk only
>
> lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
>
> i.e. the non-RT version?

Does it mean that version works fine on RT now?

>> /*
>> * freelist is pointing to the list of objects to be used.
>
>
> Sebastian
>

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 2021-08-17 17:41:57 [+0200], Vlastimil Babka wrote:
> Does it mean that version works fine on RT now?

Yes. There is no difference now between RT and !RT regarding the
dep_map member.

Sebastian

2021-08-17 16:00:18

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 8/5/21 5:20 PM, Vlastimil Babka wrote:
> Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions of
> local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's
> equivalent, with better lockdep visibility. On PREEMPT_RT that means better
> preemption.
>
> However, the cost on PREEMPT_RT is the loss of lockless fast paths which only
> work with cpu freelist. Those are designed to detect and recover from being
> preempted by other conflicting operations (both fast or slow path), but the
> slow path operations assume they cannot be preempted by a fast path operation,
> which is guaranteed naturally with disabled irqs. With local locks on
> PREEMPT_RT, the fast paths now also need to take the local lock to avoid races.
>
> In the allocation fastpath slab_alloc_node() we can just defer to the slowpath
> __slab_alloc() which also works with cpu freelist, but under the local lock.
> In the free fastpath do_slab_free() we have to add a new local lock protected
> version of freeing to the cpu freelist, as the existing slowpath only works
> with the page freelist.
>
> Also update the comment about locking scheme in SLUB to reflect changes done
> by this series.
>
> [ Mike Galbraith <[email protected]>: use local_lock() without irq in PREEMPT_RT
> scope; debugging of RT crashes resulting in put_cpu_partial() locking changes ]
> Signed-off-by: Vlastimil Babka <[email protected]>

Improvements on the RT side have made the following fixup/cleanup
possible.
----8<----
From 8b87e5de5d79a9d3ab4627f5530f1888fa7824f8 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Tue, 17 Aug 2021 17:51:54 +0200
Subject: [PATCH] mm, slub: simplify lockdep_assert_held() in
___slab_alloc()

Sebastian reports [1] that the special version of lockdep_assert_held() for a
local lock with PREEMPT_RT is no longer necessary, and we can simplify.

[1] https://lore.kernel.org/linux-mm/[email protected]/

This is a fixup for mmotm patch
mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch

Reported-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index be57687062aa..df1ac8aff86f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2913,11 +2913,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

 load_freelist:

-#ifdef CONFIG_PREEMPT_RT
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock.lock));
-#else
 	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
-#endif

 	/*
 	 * freelist is pointing to the list of objects to be used.
--
2.32.0

2021-08-17 16:01:24

by Vlastimil Babka

Subject: Re: [PATCH v4 00/35] SLUB: reduce irq disabled scope and make it RT compatible

On 8/17/21 12:23 PM, Vlastimil Babka wrote:
> On 8/15/21 12:18 PM, Vlastimil Babka wrote:
>> On 8/10/21 4:36 PM, Vlastimil Babka wrote:
>>> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>>>> Series is based on 5.14-rc4 and also available as a git branch:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0
>>>
>>> New branch with fixed up locking orders in patch 29/35:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1
>>
>> New branch with fixed up VM_BUG_ON in patch 13/35:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r2
>
> New branch with fixed struct kmem_cache_cpu layout in patch 35/35
> (and a rebase to 5.14-rc6)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r3

Another update to patch 35/35, simplifying lockdep_assert_held() as requested by RT:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r4

2021-08-17 19:57:35

by Andrew Morton

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On Tue, 17 Aug 2021 12:14:58 +0200 Vlastimil Babka <[email protected]> wrote:

> Another fixup. Is it too many and should we replace it all with a v5?

Maybe do a full resend when things have settled down and I can at least
check that we match.

What's your confidence level for a 5.15-rc1 merge? It isn't terribly
well reviewed?

2021-08-18 11:54:36

by Vlastimil Babka

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

On 8/17/21 9:53 PM, Andrew Morton wrote:
> On Tue, 17 Aug 2021 12:14:58 +0200 Vlastimil Babka <[email protected]> wrote:
>
>> Another fixup. Is it too many and should we replace it all with a v5?
>
> Maybe do a full resend when things have settled down and I can at least
> check that we match.

OK.

> What's your confidence level for a 5.15-rc1 merge?

I'd say pretty good. It's been part of the RT patchset for a while (since
early July IIRC?) and there has been a lot of testing there. Mike and Mel
also tested under !RT configs, and the bug report from Sven means the mmotm
in -next also gets testing. The fixups were all thanks to the testing
and have recently shifted to smaller unusual-config-specific issues that
could be dealt with even during rcX stabilization in case there are more.

> It isn't terribly
> well reviewed?

Yeah that could be better, the pool of people deeply familiar with SLUB
is not large, unfortunately. I hope folks will still step up!


2021-08-23 20:37:37

by Thomas Gleixner

Subject: Re: [PATCH v4 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock

Andrew,

On Wed, Aug 18 2021 at 13:52, Vlastimil Babka wrote:
> On 8/17/21 9:53 PM, Andrew Morton wrote:
>> On Tue, 17 Aug 2021 12:14:58 +0200 Vlastimil Babka <[email protected]> wrote:
>
>> What's your confidence level for a 5.15-rc1 merge?
>
> I'd say pretty good. It's been part of the RT patchset for a while (since
> early July IIRC?) and there has been a lot of testing there. Mike and Mel
> also tested under !RT configs, and the bug report from Sven means the mmotm
> in -next also gets testing. The fixups were all thanks to the testing
> and have recently shifted to smaller unusual-config-specific issues that
> could be dealt with even during rcX stabilization in case there are
> more.

I can confirm that the series converged nicely from the very beginning
and Vlastimil was quickly addressing review feedback and the really
moderate fallout.

Various stress tests on both RT and mainline with the latest patches
applied look rock solid now. There might still be some small dragons
lurking, but I don't think there is a danger of big fallout.

>> It isn't terribly
>> well reviewed?
>
> Yeah that could be better, the pool of people deeply familiar with SLUB
> is not large, unfortunately. I hope folks will still step up!

I've reviewed the lot several times with my RT hat on. I surely don't
qualify as deeply familiar, but I've been dealing with taming SLUB
and the page allocator to play nicely with RT for almost 10 years now...

Thanks,

tglx