Changes since RFC v1 [1]:
* Addressed feedback from Christoph and Mel, added their acks.
* Finished RT conversion, including adopting 2 patches from the RT tree.
* The optional local_lock conversion has to sacrifice lockless fathpaths on RT
* Added some more cleanup patches to the front.
This series was initially inspired by Mel's pcplist local_lock rewrite, and
also by interest to better understand SLUB's locking and the new locking
primitives and their RT variants and implications. It should make SLUB more
preemption-friendly and fully RT compatible, hopefully without noticeable
regressions on !RT kernels, as the fast paths are not affected there.
Series is based on 5.13-rc5 and also available as a git branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1
It received some light stability testing on !RT and no testing within RT
kernel. The previous version also got basic performance screening (thanks Mel)
that didn't show major regressions. This version shouldn't be introducing
further regressions. But I'm still interested in e.g. Jesper's tests whether
the bulk allocator or high speed networking in general didn't regress.
Before the series, SLUB is lockless in both allocation and free fast paths, but
elsewhere, it's disabling irqs for considerable periods of time - especially in
allocation slowpath and the bulk allocation, where IRQs are re-enabled only
when a new page from the page allocator is needed, and the context allows
blocking. The irq disabled sections can then include deactivate_slab() which
walks a full freelist and frees the slab back to page allocator or
unfreeze_partials() going through a list of percpu partial slabs. The RT tree
currently has some patches mitigating these, but we can do much better in
mainline too.
Patches 1-6 are straightforward improvements or cleanups that could exist
outside of this series too, but are prerequsities.
Patches 7-10 are also preparatory code changes without functional changes, but
not so useful without the rest of the series.
Patch 11 simplifies the fast paths on systems with preemption, based on
(hopefully correct) observation that the current loops to verify tid are
unnecessary.
Patches 12-21 focus on reducing irq disabled scope in the allocation slowpath:
Patch 12 moves disabling of irqs into ___slab_alloc() from its callers, which
are the allocation slowpath, and bulk allocation. Instead these callers only
disable preemption to stabilize the cpu. The following patches then gradually
reduce the scope of disabled irqs in ___slab_alloc() and the functions called
from there. As of patch 15, the re-enabling of irqs based on gfp flags before
calling the page allocator is removed from allocate_slab(). As of patch 18,
it's possible to reach the page allocator (when existing slabs are depleted of
free objects) without disabling and re-enabling irqs a single time on the way.
Pathces 22-27 similarly reduce the scope of disabled irqs in functions related
to unfreezing percpu partial slab.
Patch 28 is preparatory. Patch 29 is adopted from the RT tree and converts the
flushing of percpu slabs on all cpus from using IPIs to workqueue, so that the
processing isn't happening with irqs disabled in the IPI handler. The flushing
is not called from performance critical contexts, so it should be acceptable.
Patch 30 also comes from RT tree and makes object_map_lock RT compatible.
Patches 31-32 make slab_lock irq-safe on RT where we cannot rely on having
irq already disabled in the context of spin_lock_irqsave().
Patch 33 changes preempt disable to migrate disable, so that the nested
list_lock spinlock is safe to take on RT. Because migrate_disable() is a
function call even on !RT, a small set of private wrappers is introduced
to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations.
As of this patch, SLUB should be compatible with RT's lock semantics, to the
best of my knowledge.
Finally, patch 34 changes irq disabled sections that protect kmem_cache_cpu
fields in the slow paths, with a local lock. However on PREEMPT_RT it means the
lockless fast paths can now preempt slow paths which don't expect that, so the
local lock has to be taken also in the fast paths and they are no longer
lockless. It's up to RT folks to decide if this is a good tradeoff.
The patch also updates the locking documentation in the file's comment.
The main results of this series:
* irq disabling is only done for minimum amount of time needed to protect the
kmem_cache_cpu data or as part of spin lock, local lock and bit spinlock
operations to make them irq-safe
* SLUB should be fully PREEMPT_RT compatible
This should have obvious implications for better preemptibility, especially on RT.
Some details are different than how the current SLUB RT tree patches are
implemented:
mm: sl[au]b: Change list_lock to raw_spinlock_t [2] - the SLAB part can be
dropped as a different patch restricts RT to SLUB anyway. And after this series
the list_lock in SLUB is never taken with irqs or preemption already disabled
so it doesn't have to be converted to raw_spinlock_t.
mm: slub: Move discard_slab() invocations out of IRQ-off sections [3] should be
unnecessary as this series does move these invocations outside irq disabled
sections in a different way.
The remaining patches to upstream from the RT tree are small ones related to
KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
perhaps be re-evaluated as the series also addresses some latency issues with
percpu partial slabs.
[1] [RFC 00/26] SLUB: use local_lock for kmem_cache_cpu protection and reduce disabling irqs
https://lore.kernel.org/lkml/[email protected]/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0001-mm-sl-au-b-Change-list_lock-to-raw_spinlock_t.patch?h=linux-5.12.y-rt-patches
[3] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0004-mm-slub-Move-discard_slab-invocations-out-of-IRQ-off.patch?h=linux-5.12.y-rt-patches
[4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0005-mm-slub-Move-flush_cpu_slab-invocations-__free_slab-.patch?h=linux-5.12.y-rt-patches
Sebastian Andrzej Siewior (2):
mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations
out of IRQ context
mm: slub: Make object_map_lock a raw_spinlock_t
Vlastimil Babka (32):
mm, slub: don't call flush_all() from list_locations()
mm, slub: allocate private object map for sysfs listings
mm, slub: allocate private object map for validate_slab_cache()
mm, slub: don't disable irq for debug_check_no_locks_freed()
mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
mm, slub: unify cmpxchg_double_slab() and __cmpxchg_double_slab()
mm, slub: extract get_partial() from new_slab_objects()
mm, slub: dissolve new_slab_objects() into ___slab_alloc()
mm, slub: return slab page from get_partial() and set c->page
afterwards
mm, slub: restructure new page checks in ___slab_alloc()
mm, slub: simplify kmem_cache_cpu and tid setup
mm, slub: move disabling/enabling irqs to ___slab_alloc()
mm, slub: do initial checks in ___slab_alloc() with irqs enabled
mm, slub: move disabling irqs closer to get_partial() in
___slab_alloc()
mm, slub: restore irqs around calling new_slab()
mm, slub: validate slab from partial list or page allocator before
making it cpu slab
mm, slub: check new pages with restored irqs
mm, slub: stop disabling irqs around get_partial()
mm, slub: move reset of c->page and freelist out of deactivate_slab()
mm, slub: make locking in deactivate_slab() irq-safe
mm, slub: call deactivate_slab() without disabling irqs
mm, slub: move irq control into unfreeze_partials()
mm, slub: discard slabs in unfreeze_partials() without irqs disabled
mm, slub: detach whole partial list at once in unfreeze_partials()
mm, slub: detach percpu partial list in unfreeze_partials() using
this_cpu_cmpxchg()
mm, slub: only disable irq with spin_lock in __unfreeze_partials()
mm, slub: don't disable irqs in slub_cpu_dead()
mm, slab: make flush_slab() possible to call with irqs enabled
mm, slub: optionally save/restore irqs in slab_[un]lock()/
mm, slub: make slab_lock() disable irqs with PREEMPT_RT
mm, slub: use migrate_disable() on PREEMPT_RT
mm, slub: convert kmem_cpu_slab protection to local_lock
include/linux/slub_def.h | 2 +
mm/slub.c | 750 ++++++++++++++++++++++++++-------------
2 files changed, 499 insertions(+), 253 deletions(-)
--
2.31.1
validate_slab_cache() is called either to handle a sysfs write, or from a
self-test context. In both situations it's straightforward to preallocate a
private object bitmap instead of grabbing the shared static one meant for
critical sections, so let's do that.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 92c3ab3a95ba..5a940bd3cebb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4622,11 +4622,11 @@ static int count_total(struct page *page)
#endif
#ifdef CONFIG_SLUB_DEBUG
-static void validate_slab(struct kmem_cache *s, struct page *page)
+static void validate_slab(struct kmem_cache *s, struct page *page,
+ unsigned long *obj_map)
{
void *p;
void *addr = page_address(page);
- unsigned long *map;
slab_lock(page);
@@ -4634,21 +4634,20 @@ static void validate_slab(struct kmem_cache *s, struct page *page)
goto unlock;
/* Now we know that a valid freelist exists */
- map = get_map(s, page);
+ __fill_map(obj_map, s, page);
for_each_object(p, s, addr, page->objects) {
- u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
+ u8 val = test_bit(__obj_to_index(s, addr, p), obj_map) ?
SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
if (!check_object(s, page, p, val))
break;
}
- put_map(map);
unlock:
slab_unlock(page);
}
static int validate_slab_node(struct kmem_cache *s,
- struct kmem_cache_node *n)
+ struct kmem_cache_node *n, unsigned long *obj_map)
{
unsigned long count = 0;
struct page *page;
@@ -4657,7 +4656,7 @@ static int validate_slab_node(struct kmem_cache *s,
spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry(page, &n->partial, slab_list) {
- validate_slab(s, page);
+ validate_slab(s, page, obj_map);
count++;
}
if (count != n->nr_partial)
@@ -4668,7 +4667,7 @@ static int validate_slab_node(struct kmem_cache *s,
goto out;
list_for_each_entry(page, &n->full, slab_list) {
- validate_slab(s, page);
+ validate_slab(s, page, obj_map);
count++;
}
if (count != atomic_long_read(&n->nr_slabs))
@@ -4685,10 +4684,17 @@ static long validate_slab_cache(struct kmem_cache *s)
int node;
unsigned long count = 0;
struct kmem_cache_node *n;
+ unsigned long *obj_map;
+
+ obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+ if (!obj_map)
+ return -ENOMEM;
flush_all(s);
for_each_kmem_cache_node(s, node, n)
- count += validate_slab_node(s, n);
+ count += validate_slab_node(s, n, obj_map);
+
+ bitmap_free(obj_map);
return count;
}
--
2.31.1
When we allocate slab object from a newly acquired page (from node's partial
list or page allocator), we usually also retain the page as a new percpu slab.
There are two exceptions - when pfmemalloc status of the page doesn't match our
gfp flags, or when the cache has debugging enabled.
The current code for these decisions is not easy to follow, so restructure it
and add comments. The new structure will also help with the following changes.
No functional change.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index ca7cfc7706cc..0d5c253a39b2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2722,13 +2722,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->page = page;
check_new_page:
- if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
- goto load_freelist;
- /* Only entered in the debug case */
- if (kmem_cache_debug(s) &&
- !alloc_debug_processing(s, page, freelist, addr))
- goto new_slab; /* Slab failed checks. Next slab needed */
+ if (kmem_cache_debug(s)) {
+ if (!alloc_debug_processing(s, page, freelist, addr))
+ /* Slab failed checks. Next slab needed */
+ goto new_slab;
+ else
+ /*
+ * For debug case, we don't load freelist so that all
+ * allocations go through alloc_debug_processing()
+ */
+ goto return_single;
+ }
+
+ if (unlikely(!pfmemalloc_match(page, gfpflags)))
+ /*
+ * For !pfmemalloc_match() case we don't load freelist so that
+ * we don't make further mismatched allocations easier.
+ */
+ goto return_single;
+
+ goto load_freelist;
+
+return_single:
deactivate_slab(s, page, get_freepointer(s, freelist), c);
return freelist;
--
2.31.1
When we obtain a new slab page from node partial list or page allocator, we
assign it to kmem_cache_cpu, perform some checks, and if they fail, we undo
the assignment.
In order to allow doing the checks without irq disabled, restructure the code
so that the checks are done first, and kmem_cache_cpu.page assignment only
after they pass.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index f4ce372e3dd3..378f1cb040b3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2729,10 +2729,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
lockdep_assert_irqs_disabled();
freelist = get_partial(s, gfpflags, node, &page);
- if (freelist) {
- c->page = page;
+ if (freelist)
goto check_new_page;
- }
local_irq_restore(flags);
put_cpu_ptr(s->cpu_slab);
@@ -2745,9 +2743,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}
local_irq_save(flags);
- if (c->page)
- flush_slab(s, c);
-
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
@@ -2756,14 +2751,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
page->freelist = NULL;
stat(s, ALLOC_SLAB);
- c->page = page;
check_new_page:
if (kmem_cache_debug(s)) {
if (!alloc_debug_processing(s, page, freelist, addr)) {
/* Slab failed checks. Next slab needed */
- c->page = NULL;
local_irq_restore(flags);
goto new_slab;
} else {
@@ -2782,10 +2775,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
goto return_single;
+ if (unlikely(c->page))
+ flush_slab(s, c);
+ c->page = page;
+
goto load_freelist;
return_single:
+ if (unlikely(c->page))
+ flush_slab(s, c);
+ c->page = page;
+
deactivate_slab(s, page, get_freepointer(s, freelist), c);
local_irq_restore(flags);
return freelist;
--
2.31.1
Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This
includes cases where this is not needed, such as when the allocation ends up in
the page allocator and has to awkwardly enable irqs back based on gfp flags.
Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when
it hits the __slab_alloc() slow path, and long periods with disabled interrupts
are undesirable.
As a first step towards reducing irq disabled periods, move irq handling into
___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer
from becoming invalid via get_cpu_ptr(), thus preempt_disable(). This does not
protect against modification by an irq handler, which is still done by disabled
irq for most of ___slab_alloc(). As a small immediate benefit,
slab_out_of_memory() from ___slab_alloc() is now called with irqs enabled.
kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them
before calling ___slab_alloc(), which then disables them at its discretion. The
whole kmem_cache_alloc_bulk() operation also disables preemption.
When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
preemption, because new_slab() will re-enable interrupts in contexts that allow
blocking (this will be improved by later patches).
The patch itself will thus increase overhead a bit due to disabled preemption
and increased disabling/enabling irqs in kmem_cache_alloc_bulk(), but that will
be gradually improved in the following patches.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 6d6a9a69db8a..4800d768d0d3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2610,7 +2610,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
* we need to allocate a new slab. This is the slowest path since it involves
* a call to the page allocator and the setup of a new slab.
*
- * Version of __slab_alloc to use when we know that interrupts are
+ * Version of __slab_alloc to use when we know that preemption is
* already disabled (which is the case for bulk allocation).
*/
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
@@ -2618,9 +2618,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
{
void *freelist;
struct page *page;
+ unsigned long flags;
stat(s, ALLOC_SLOWPATH);
+ local_irq_save(flags);
page = c->page;
if (!page) {
/*
@@ -2683,6 +2685,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
VM_BUG_ON(!c->page->frozen);
c->freelist = get_freepointer(s, freelist);
c->tid = next_tid(c->tid);
+ local_irq_restore(flags);
return freelist;
new_slab:
@@ -2700,14 +2703,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto check_new_page;
}
+ put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
+ c = get_cpu_ptr(s->cpu_slab);
if (unlikely(!page)) {
+ local_irq_restore(flags);
slab_out_of_memory(s, gfpflags, node);
return NULL;
}
- c = raw_cpu_ptr(s->cpu_slab);
if (c->page)
flush_slab(s, c);
@@ -2747,31 +2752,33 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return_single:
deactivate_slab(s, page, get_freepointer(s, freelist), c);
+ local_irq_restore(flags);
return freelist;
}
/*
- * Another one that disabled interrupt and compensates for possible
- * cpu changes by refetching the per cpu area pointer.
+ * A wrapper for ___slab_alloc() for contexts where preemption is not yet
+ * disabled. Compensates for possible cpu changes by refetching the per cpu area
+ * pointer.
*/
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
unsigned long addr, struct kmem_cache_cpu *c)
{
void *p;
- unsigned long flags;
- local_irq_save(flags);
#ifdef CONFIG_PREEMPTION
/*
* We may have been preempted and rescheduled on a different
- * cpu before disabling interrupts. Need to reload cpu area
+ * cpu before disabling preemption. Need to reload cpu area
* pointer.
*/
- c = this_cpu_ptr(s->cpu_slab);
+ c = get_cpu_ptr(s->cpu_slab);
#endif
p = ___slab_alloc(s, gfpflags, node, addr, c);
- local_irq_restore(flags);
+#ifdef CONFIG_PREEMPTION
+ put_cpu_ptr(s->cpu_slab);
+#endif
return p;
}
@@ -3291,8 +3298,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* IRQs, which protects against PREEMPT and interrupts
* handlers invoking normal fastpath.
*/
+ c = get_cpu_ptr(s->cpu_slab);
local_irq_disable();
- c = this_cpu_ptr(s->cpu_slab);
for (i = 0; i < size; i++) {
void *object = kfence_alloc(s, s->object_size, flags);
@@ -3313,6 +3320,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
*/
c->tid = next_tid(c->tid);
+ local_irq_enable();
+
/*
* Invoking slow path likely have side-effect
* of re-populating per CPU c->freelist
@@ -3325,6 +3334,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
c = this_cpu_ptr(s->cpu_slab);
maybe_wipe_obj_freeptr(s, p[i]);
+ local_irq_disable();
+
continue; /* goto for-loop */
}
c->freelist = get_freepointer(s, object);
@@ -3333,6 +3344,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
}
c->tid = next_tid(c->tid);
local_irq_enable();
+ put_cpu_ptr(s->cpu_slab);
/*
* memcg and kmem_cache debug support and memory initialization.
--
2.31.1
The function is now safe to be called with irqs enabled, so move the calls
outside of irq disabled sections.
When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so to
reenable them before deactivate_slab() we need to open-code flush_slab() in
___slab_alloc() and reenable irqs after modifying the kmem_cache_cpu fields.
But that means a IRQ handler meanwhile might have assigned a new page to
kmem_cache_cpu.page so we have to retry the whole check.
The remaining callers of flush_slab() are the IPI handler which has disabled
irqs anyway, and slub_cpu_dead() which will be dealt with in the following
patch.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index bba18146def6..a99a254f2410 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2706,8 +2706,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
freelist = c->freelist;
c->page = NULL;
c->freelist = NULL;
- deactivate_slab(s, page, freelist);
local_irq_restore(flags);
+ deactivate_slab(s, page, freelist);
new_slab:
@@ -2775,18 +2775,32 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
goto return_single;
+retry_load_page:
+
local_irq_save(flags);
- if (unlikely(c->page))
- flush_slab(s, c);
+ if (unlikely(c->page)) {
+ void *flush_freelist = c->freelist;
+ struct page *flush_page = c->page;
+
+ c->page = NULL;
+ c->freelist = NULL;
+ c->tid = next_tid(c->tid);
+
+ local_irq_restore(flags);
+
+ deactivate_slab(s, flush_page, flush_freelist);
+
+ stat(s, CPUSLAB_FLUSH);
+
+ goto retry_load_page;
+ }
c->page = page;
goto load_freelist;
return_single:
- local_irq_save(flags);
deactivate_slab(s, page, get_freepointer(s, freelist));
- local_irq_restore(flags);
return freelist;
}
--
2.31.1
As another step of shortening irq disabled sections in ___slab_alloc(), delay
disabling irqs until we pass the initial checks if there is a cached percpu
slab and it's suitable for our allocation.
Now we have to recheck c->page after actually disabling irqs as an allocation
in irq handler might have replaced it.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 41 ++++++++++++++++++++++++++++++++---------
1 file changed, 32 insertions(+), 9 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 4800d768d0d3..22b5bfef8aa6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2622,8 +2622,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
stat(s, ALLOC_SLOWPATH);
- local_irq_save(flags);
- page = c->page;
+reread_page:
+
+ page = READ_ONCE(c->page);
if (!page) {
/*
* if the node is not online or has no normal memory, just
@@ -2632,6 +2633,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(node != NUMA_NO_NODE &&
!node_isset(node, slab_nodes)))
node = NUMA_NO_NODE;
+ local_irq_save(flags);
+ if (unlikely(c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
goto new_slab;
}
redo:
@@ -2646,8 +2652,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
} else {
stat(s, ALLOC_NODE_MISMATCH);
- deactivate_slab(s, page, c->freelist, c);
- goto new_slab;
+ goto deactivate_slab;
}
}
@@ -2656,12 +2661,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* PFMEMALLOC but right now, we are losing the pfmemalloc
* information when the page leaves the per-cpu allocator
*/
- if (unlikely(!pfmemalloc_match(page, gfpflags))) {
- deactivate_slab(s, page, c->freelist, c);
- goto new_slab;
- }
+ if (unlikely(!pfmemalloc_match(page, gfpflags)))
+ goto deactivate_slab;
- /* must check again c->freelist in case of cpu migration or IRQ */
+ /* must check again c->page in case IRQ handler changed it */
+ local_irq_save(flags);
+ if (unlikely(page != c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
freelist = c->freelist;
if (freelist)
goto load_freelist;
@@ -2677,6 +2685,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
stat(s, ALLOC_REFILL);
load_freelist:
+
+ lockdep_assert_irqs_disabled();
+
/*
* freelist is pointing to the list of objects to be used.
* page is pointing to the page from which the objects are obtained.
@@ -2688,11 +2699,23 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
local_irq_restore(flags);
return freelist;
+deactivate_slab:
+
+ local_irq_save(flags);
+ if (page != c->page) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
+ deactivate_slab(s, page, c->freelist, c);
+
new_slab:
+ lockdep_assert_irqs_disabled();
+
if (slub_percpu_partial(c)) {
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
+ local_irq_restore(flags);
stat(s, CPU_PARTIAL_ALLOC);
goto redo;
}
--
2.31.1
Building on top of the previous patch, re-enable irqs before checking new
pages. alloc_debug_processing() is now called with enabled irqs so we need to
remove VM_BUG_ON(!irqs_disabled()); in check_slab() - there doesn't seem to be
a need for it anyway.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 378f1cb040b3..760080b58957 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -954,8 +954,6 @@ static int check_slab(struct kmem_cache *s, struct page *page)
{
int maxobj;
- VM_BUG_ON(!irqs_disabled());
-
if (!PageSlab(page)) {
slab_err(s, page, "Not a valid slab page");
return 0;
@@ -2729,10 +2727,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
lockdep_assert_irqs_disabled();
freelist = get_partial(s, gfpflags, node, &page);
+ local_irq_restore(flags);
if (freelist)
goto check_new_page;
- local_irq_restore(flags);
put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
c = get_cpu_ptr(s->cpu_slab);
@@ -2742,7 +2740,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return NULL;
}
- local_irq_save(flags);
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
@@ -2757,7 +2754,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (kmem_cache_debug(s)) {
if (!alloc_debug_processing(s, page, freelist, addr)) {
/* Slab failed checks. Next slab needed */
- local_irq_restore(flags);
goto new_slab;
} else {
/*
@@ -2775,6 +2771,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
goto return_single;
+ local_irq_save(flags);
if (unlikely(c->page))
flush_slab(s, c);
c->page = page;
@@ -2783,6 +2780,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return_single:
+ local_irq_save(flags);
if (unlikely(c->page))
flush_slab(s, c);
c->page = page;
--
2.31.1
allocate_slab() currently re-enables irqs before calling to the page allocator.
It depends on gfpflags_allow_blocking() to determine if it's safe to do so.
Now we can instead simply restore irq before calling it through new_slab().
The other caller early_kmem_cache_node_alloc() is unaffected by this.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index acb5e8f1d9da..f4ce372e3dd3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1749,9 +1749,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
flags &= gfp_allowed_mask;
- if (gfpflags_allow_blocking(flags))
- local_irq_enable();
-
flags |= s->allocflags;
/*
@@ -1810,8 +1807,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
page->frozen = 1;
out:
- if (gfpflags_allow_blocking(flags))
- local_irq_disable();
if (!page)
return NULL;
@@ -2739,16 +2734,17 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto check_new_page;
}
+ local_irq_restore(flags);
put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
c = get_cpu_ptr(s->cpu_slab);
if (unlikely(!page)) {
- local_irq_restore(flags);
slab_out_of_memory(s, gfpflags, node);
return NULL;
}
+ local_irq_save(flags);
if (c->page)
flush_slab(s, c);
--
2.31.1
Continue reducing the irq disabled scope. Check for per-cpu partial slabs with
first with irqs enabled and then recheck with irqs disabled before grabbing
the slab page. Mostly preparatory for the following patches.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 22b5bfef8aa6..acb5e8f1d9da 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2633,11 +2633,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(node != NUMA_NO_NODE &&
!node_isset(node, slab_nodes)))
node = NUMA_NO_NODE;
- local_irq_save(flags);
- if (unlikely(c->page)) {
- local_irq_restore(flags);
- goto reread_page;
- }
goto new_slab;
}
redo:
@@ -2678,6 +2673,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (!freelist) {
c->page = NULL;
+ local_irq_restore(flags);
stat(s, DEACTIVATE_BYPASS);
goto new_slab;
}
@@ -2707,12 +2703,19 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto reread_page;
}
deactivate_slab(s, page, c->freelist, c);
+ local_irq_restore(flags);
new_slab:
- lockdep_assert_irqs_disabled();
-
if (slub_percpu_partial(c)) {
+ local_irq_save(flags);
+ if (unlikely(c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
+ if (unlikely(!slub_percpu_partial(c)))
+ goto new_objects; /* stolen by an IRQ handler */
+
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
local_irq_restore(flags);
@@ -2720,6 +2723,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}
+ local_irq_save(flags);
+ if (unlikely(c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
+
+new_objects:
+
+ lockdep_assert_irqs_disabled();
+
freelist = get_partial(s, gfpflags, node, &page);
if (freelist) {
c->page = page;
@@ -2752,15 +2765,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
check_new_page:
if (kmem_cache_debug(s)) {
- if (!alloc_debug_processing(s, page, freelist, addr))
+ if (!alloc_debug_processing(s, page, freelist, addr)) {
/* Slab failed checks. Next slab needed */
+ c->page = NULL;
+ local_irq_restore(flags);
goto new_slab;
- else
+ } else {
/*
* For debug case, we don't load freelist so that all
* allocations go through alloc_debug_processing()
*/
goto return_single;
+ }
}
if (unlikely(!pfmemalloc_match(page, gfpflags)))
--
2.31.1
unfreeze_partials() can be optimized so that it doesn't need irqs disabled for
the whole time. As the first step, move irq control into the function and
remove it from the put_cpu_partial() caller.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index a99a254f2410..4a3cd8174cd2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2290,9 +2290,8 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
/*
* Unfreeze all the cpu partial slabs.
*
- * This function must be called with interrupts disabled
- * for the cpu using c (or some other guarantee must be there
- * to guarantee no concurrent accesses).
+ * This function must be called with preemption or migration
+ * disabled with c local to the cpu.
*/
static void unfreeze_partials(struct kmem_cache *s,
struct kmem_cache_cpu *c)
@@ -2300,6 +2299,9 @@ static void unfreeze_partials(struct kmem_cache *s,
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct kmem_cache_node *n = NULL, *n2 = NULL;
struct page *page, *discard_page = NULL;
+ unsigned long flags;
+
+ local_irq_save(flags);
while ((page = slub_percpu_partial(c))) {
struct page new;
@@ -2352,6 +2354,8 @@ static void unfreeze_partials(struct kmem_cache *s,
discard_slab(s, page);
stat(s, FREE_SLAB);
}
+
+ local_irq_restore(flags);
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
@@ -2379,14 +2383,11 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
pobjects = oldpage->pobjects;
pages = oldpage->pages;
if (drain && pobjects > slub_cpu_partial(s)) {
- unsigned long flags;
/*
* partial array is full. Move the existing
* set to the per node partial list.
*/
- local_irq_save(flags);
unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
- local_irq_restore(flags);
oldpage = NULL;
pobjects = 0;
pages = 0;
--
2.31.1
No need for disabled irqs when discarding slabs, so restore them before
discarding.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index 4a3cd8174cd2..dcbc5baea702 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2346,6 +2346,8 @@ static void unfreeze_partials(struct kmem_cache *s,
if (n)
spin_unlock(&n->list_lock);
+ local_irq_restore(flags);
+
while (discard_page) {
page = discard_page;
discard_page = discard_page->next;
@@ -2355,7 +2357,6 @@ static void unfreeze_partials(struct kmem_cache *s,
stat(s, FREE_SLAB);
}
- local_irq_restore(flags);
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
--
2.31.1
Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs immediately")
introduced cpu partial flushing for kmemcg caches, based on setting the target
cpu_partial to 0 and adding a flushing check in put_cpu_partial().
This code that sets cpu_partial to 0 was later moved by c9fc586403e7 ("slab:
introduce __kmemcg_cache_deactivate()") and ultimately removed by 9855609bde03
("mm: memcg/slab: use a single set of kmem_caches for all accounted
allocations"). However the check and flush in put_cpu_partial() was never
removed, although it's effectively a dead code. So this patch removes it.
Note that d6e0b7fa1186 also added preempt_disable()/enable() to
unfreeze_partials() which could be thus also considered unnecessary. But
further patches will rely on it, so keep it.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 2953f6e43cae..f740598696b4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2420,13 +2420,6 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
!= oldpage);
- if (unlikely(!slub_cpu_partial(s))) {
- unsigned long flags;
-
- local_irq_save(flags);
- unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
- local_irq_restore(flags);
- }
preempt_enable();
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
--
2.31.1
list_locations() can only be called on caches with SLAB_STORE_USER flag and as
with all slub debugging flags, such caches avoid cpu or percpu partial slabs
altogether, so there's nothing to flush.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 3f96e099817a..f928607230b2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4839,8 +4839,6 @@ static int list_locations(struct kmem_cache *s, char *buf,
GFP_KERNEL)) {
return sysfs_emit(buf, "Out of memory\n");
}
- /* Push back cpu slabs */
- flush_all(s);
for_each_kmem_cache_node(s, node, n) {
unsigned long flags;
--
2.31.1
Slub has a static spinlock protected bitmap for marking which objects are on
freelist when it wants to list them, for situations where dynamically
allocating such map can lead to recursion or locking issues, and on-stack
bitmap would be too large.
The handlers of sysfs files alloc_calls and free_calls also currently use this
shared bitmap, but their syscall context makes it straightforward to allocate a
private map before entering locked sections, so switch these processing paths
to use a private bitmap.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 42 ++++++++++++++++++++++++++++--------------
1 file changed, 28 insertions(+), 14 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index f928607230b2..92c3ab3a95ba 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -448,6 +448,18 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
static DEFINE_SPINLOCK(object_map_lock);
+static void __fill_map(unsigned long *obj_map, struct kmem_cache *s,
+ struct page *page)
+{
+ void *addr = page_address(page);
+ void *p;
+
+ bitmap_zero(obj_map, page->objects);
+
+ for (p = page->freelist; p; p = get_freepointer(s, p))
+ set_bit(__obj_to_index(s, addr, p), obj_map);
+}
+
/*
* Determine a map of object in use on a page.
*
@@ -457,17 +469,11 @@ static DEFINE_SPINLOCK(object_map_lock);
static unsigned long *get_map(struct kmem_cache *s, struct page *page)
__acquires(&object_map_lock)
{
- void *p;
- void *addr = page_address(page);
-
VM_BUG_ON(!irqs_disabled());
spin_lock(&object_map_lock);
- bitmap_zero(object_map, page->objects);
-
- for (p = page->freelist; p; p = get_freepointer(s, p))
- set_bit(__obj_to_index(s, addr, p), object_map);
+ __fill_map(object_map, s, page);
return object_map;
}
@@ -4813,17 +4819,17 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
}
static void process_slab(struct loc_track *t, struct kmem_cache *s,
- struct page *page, enum track_item alloc)
+ struct page *page, enum track_item alloc,
+ unsigned long *obj_map)
{
void *addr = page_address(page);
void *p;
- unsigned long *map;
- map = get_map(s, page);
+ __fill_map(obj_map, s, page);
+
for_each_object(p, s, addr, page->objects)
- if (!test_bit(__obj_to_index(s, addr, p), map))
+ if (!test_bit(__obj_to_index(s, addr, p), obj_map))
add_location(t, s, get_track(s, p, alloc));
- put_map(map);
}
static int list_locations(struct kmem_cache *s, char *buf,
@@ -4834,9 +4840,15 @@ static int list_locations(struct kmem_cache *s, char *buf,
struct loc_track t = { 0, 0, NULL };
int node;
struct kmem_cache_node *n;
+ unsigned long *obj_map;
+
+ obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+ if (!obj_map)
+ return sysfs_emit(buf, "Out of memory\n");
if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
GFP_KERNEL)) {
+ bitmap_free(obj_map);
return sysfs_emit(buf, "Out of memory\n");
}
@@ -4849,12 +4861,14 @@ static int list_locations(struct kmem_cache *s, char *buf,
spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry(page, &n->partial, slab_list)
- process_slab(&t, s, page, alloc);
+ process_slab(&t, s, page, alloc, obj_map);
list_for_each_entry(page, &n->full, slab_list)
- process_slab(&t, s, page, alloc);
+ process_slab(&t, s, page, alloc, obj_map);
spin_unlock_irqrestore(&n->list_lock, flags);
}
+ bitmap_free(obj_map);
+
for (i = 0; i < t.count; i++) {
struct location *l = &t.loc[i];
--
2.31.1
In slab_free_hook() we disable irqs around the debug_check_no_locks_freed()
call, which is unnecessary, as irqs are already being disabled inside the call.
This seems to be leftover from the past where there were more calls inside the
irq disabled sections. Remove the irq disable/enable operations.
Mel noted:
> Looks like it was needed for kmemcheck which went away back in 4.15
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 5a940bd3cebb..2953f6e43cae 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1545,20 +1545,8 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s,
{
kmemleak_free_recursive(x, s->flags);
- /*
- * Trouble is that we may no longer disable interrupts in the fast path
- * So in order to make the debug calls that expect irqs to be
- * disabled we need to disable interrupts temporarily.
- */
-#ifdef CONFIG_LOCKDEP
- {
- unsigned long flags;
+ debug_check_no_locks_freed(x, s->object_size);
- local_irq_save(flags);
- debug_check_no_locks_freed(x, s->object_size);
- local_irq_restore(flags);
- }
-#endif
if (!(s->flags & SLAB_DEBUG_OBJECTS))
debug_check_no_obj_freed(x, s->object_size);
--
2.31.1
The later patches will need more fine grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first
preparatory step with no functional change.
This adds a goto label that appears unnecessary at this point, but will be
useful for later changes.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---
mm/slub.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 76af5065baeb..38256e87dde8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2553,17 +2553,12 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
int node, struct kmem_cache_cpu **pc)
{
- void *freelist;
+ void *freelist = NULL;
struct kmem_cache_cpu *c = *pc;
struct page *page;
WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
- freelist = get_partial(s, flags, node, c);
-
- if (freelist)
- return freelist;
-
page = new_slab(s, flags, node);
if (page) {
c = raw_cpu_ptr(s->cpu_slab);
@@ -2727,6 +2722,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}
+ freelist = get_partial(s, gfpflags, node, c);
+ if (freelist)
+ goto check_new_page;
+
freelist = new_slab_objects(s, gfpflags, node, &c);
if (unlikely(!freelist)) {
@@ -2734,6 +2733,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return NULL;
}
+check_new_page:
page = c->page;
if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
goto load_freelist;
--
2.31.1
The later patches will need more fine grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it
there. This is a preparatory step with no functional change.
The only minor change is moving WARN_ON_ONCE() for using a constructor together
with __GFP_ZERO to new_slab(), which makes it somewhat less frequent, but still
able to catch a development change introducing a systematic misuse.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 50 ++++++++++++++++++--------------------------------
1 file changed, 18 insertions(+), 32 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 38256e87dde8..87f652bf8e4d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1825,6 +1825,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
if (unlikely(flags & GFP_SLAB_BUG_MASK))
flags = kmalloc_fix_flags(flags);
+ WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
+
return allocate_slab(s,
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
}
@@ -2550,36 +2552,6 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
#endif
}
-static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
- int node, struct kmem_cache_cpu **pc)
-{
- void *freelist = NULL;
- struct kmem_cache_cpu *c = *pc;
- struct page *page;
-
- WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
-
- page = new_slab(s, flags, node);
- if (page) {
- c = raw_cpu_ptr(s->cpu_slab);
- if (c->page)
- flush_slab(s, c);
-
- /*
- * No other reference to the page yet so we can
- * muck around with it freely without cmpxchg
- */
- freelist = page->freelist;
- page->freelist = NULL;
-
- stat(s, ALLOC_SLAB);
- c->page = page;
- *pc = c;
- }
-
- return freelist;
-}
-
static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
{
if (unlikely(PageSlabPfmemalloc(page)))
@@ -2726,13 +2698,27 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (freelist)
goto check_new_page;
- freelist = new_slab_objects(s, gfpflags, node, &c);
+ page = new_slab(s, gfpflags, node);
- if (unlikely(!freelist)) {
+ if (unlikely(!page)) {
slab_out_of_memory(s, gfpflags, node);
return NULL;
}
+ c = raw_cpu_ptr(s->cpu_slab);
+ if (c->page)
+ flush_slab(s, c);
+
+ /*
+ * No other reference to the page yet so we can
+ * muck around with it freely without cmpxchg
+ */
+ freelist = page->freelist;
+ page->freelist = NULL;
+
+ stat(s, ALLOC_SLAB);
+ c->page = page;
+
check_new_page:
page = c->page;
if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
--
2.31.1
dectivate_slab() now no longer touches the kmem_cache_cpu structure, so it will
be possible to call it with irqs enabled. Just convert the spin_lock calls to
their irq saving/restoring variants to make it irq-safe.
Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(), because
in some situations we don't take the list_lock, which would disable irqs.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 8b0b975bfa36..bba18146def6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2163,6 +2163,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
enum slab_modes l = M_NONE, m = M_NONE;
void *nextfree, *freelist_iter, *freelist_tail;
int tail = DEACTIVATE_TO_HEAD;
+ unsigned long flags = 0;
struct page new;
struct page old;
@@ -2238,7 +2239,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
* that acquire_slab() will see a slab page that
* is frozen
*/
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
}
} else {
m = M_FULL;
@@ -2249,7 +2250,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
* slabs from diagnostic functions will not see
* any frozen slabs.
*/
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
}
}
@@ -2266,14 +2267,14 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
}
l = m;
- if (!__cmpxchg_double_slab(s, page,
+ if (!cmpxchg_double_slab(s, page,
old.freelist, old.counters,
new.freelist, new.counters,
"unfreezing slab"))
goto redo;
if (lock)
- spin_unlock(&n->list_lock);
+ spin_unlock_irqrestore(&n->list_lock, flags);
if (m == M_PARTIAL)
stat(s, tail);
--
2.31.1
Instead of iterating through the live percpu partial list, detach it from the
kmem_cache_cpu at once. This is simpler and will allow further optimization.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index dcbc5baea702..3e0ec7211d55 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2298,16 +2298,20 @@ static void unfreeze_partials(struct kmem_cache *s,
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct kmem_cache_node *n = NULL, *n2 = NULL;
- struct page *page, *discard_page = NULL;
+ struct page *page, *partial_page, *discard_page = NULL;
unsigned long flags;
local_irq_save(flags);
- while ((page = slub_percpu_partial(c))) {
+ partial_page = slub_percpu_partial(c);
+ c->partial = NULL;
+
+ while (partial_page) {
struct page new;
struct page old;
- slub_set_percpu_partial(c, page);
+ page = partial_page;
+ partial_page = page->next;
n2 = get_node(s, page_to_nid(page));
if (n != n2) {
--
2.31.1
From: Sebastian Andrzej Siewior <[email protected]>
flush_all() flushes a specific SLAB cache on each CPU (where the cache
is present). The deactivate_slab()/__free_slab() invocation happens
within IPI handler and is problematic for PREEMPT_RT.
The flush operation is not a frequent operation or a hot path. The
per-CPU flush operation can be moved to within a workqueue.
[[email protected]: adapt to new SLUB changes]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 48 insertions(+), 8 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 5c70fc17e9be..6e747e0d7dcd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2474,33 +2474,73 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
unfreeze_partials_cpu(s, c);
}
+struct slub_flush_work {
+ struct work_struct work;
+ struct kmem_cache *s;
+ bool skip;
+};
+
/*
* Flush cpu slab.
*
- * Called from IPI handler with interrupts disabled.
+ * Called from CPU work handler with migration disabled.
*/
-static void flush_cpu_slab(void *d)
+static void flush_cpu_slab(struct work_struct *w)
{
- struct kmem_cache *s = d;
- struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+ struct kmem_cache *s;
+ struct kmem_cache_cpu *c;
+ struct slub_flush_work *sfw;
+
+ sfw = container_of(w, struct slub_flush_work, work);
+
+ s = sfw->s;
+ c = this_cpu_ptr(s->cpu_slab);
if (c->page)
- flush_slab(s, c, false);
+ flush_slab(s, c, true);
unfreeze_partials(s);
}
-static bool has_cpu_slab(int cpu, void *info)
+static bool has_cpu_slab(int cpu, struct kmem_cache *s)
{
- struct kmem_cache *s = info;
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
return c->page || slub_percpu_partial(c);
}
+static DEFINE_MUTEX(flush_lock);
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
+
static void flush_all(struct kmem_cache *s)
{
- on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
+ struct slub_flush_work *sfw;
+ unsigned int cpu;
+
+ cpus_read_lock();
+ mutex_lock(&flush_lock);
+
+ for_each_online_cpu(cpu) {
+ sfw = &per_cpu(slub_flush, cpu);
+ if (!has_cpu_slab(cpu, s)) {
+ sfw->skip = true;
+ continue;
+ }
+ INIT_WORK(&sfw->work, flush_cpu_slab);
+ sfw->skip = false;
+ sfw->s = s;
+ schedule_work_on(cpu, &sfw->work);
+ }
+
+ for_each_online_cpu(cpu) {
+ sfw = &per_cpu(slub_flush, cpu);
+ if (sfw->skip)
+ continue;
+ flush_work(&sfw->work);
+ }
+
+ mutex_unlock(&flush_lock);
+ cpus_read_unlock();
}
/*
--
2.31.1
We need to disable irqs around slab_lock() (a bit spinlock) to make it
irq-safe. The calls to slab_lock() are nested under spin_lock_irqsave() which
doesn't disable irqs on PREEMPT_RT, so add explicit disabling with PREEMPT_RT.
We also distinguish cmpxchg_double_slab() where we do the disabling explicitly
and __cmpxchg_double_slab() for contexts with already disabled irqs. However
these context are also typically spin_lock_irqsave() thus insufficient on
PREEMPT_RT. Thus, change __cmpxchg_double_slab() to be same as
cmpxchg_double_slab() on PREEMPT_RT.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 6721169f816d..12e966f07f19 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -374,12 +374,12 @@ __slab_unlock(struct page *page, unsigned long *flags, bool disable_irqs)
static __always_inline void
slab_lock(struct page *page, unsigned long *flags)
{
- __slab_lock(page, flags, false);
+ __slab_lock(page, flags, IS_ENABLED(CONFIG_PREEMPT_RT));
}
static __always_inline void slab_unlock(struct page *page, unsigned long *flags)
{
- __slab_unlock(page, flags, false);
+ __slab_unlock(page, flags, IS_ENABLED(CONFIG_PREEMPT_RT));
}
static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
@@ -422,14 +422,19 @@ static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *pag
return false;
}
-/* Interrupts must be disabled (for the fallback code to work right) */
+/*
+ * Interrupts must be disabled (for the fallback code to work right), typically
+ * by an _irqsave() lock variant. Except on PREEMPT_RT where locks are different
+ * so we disable interrupts explicitly here.
+ */
static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
void *freelist_old, unsigned long counters_old,
void *freelist_new, unsigned long counters_new,
const char *n)
{
return ___cmpxchg_double_slab(s, page, freelist_old, counters_old,
- freelist_new, counters_new, n, false);
+ freelist_new, counters_new, n,
+ IS_ENABLED(CONFIG_PREEMPT_RT));
}
static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
--
2.31.1
__unfreeze_partials() no longer needs to have irqs disabled, except for making
the spin_lock operations irq-safe, so convert the spin_locks operations and
remove the separate irq handling.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 0867801e3086..5a81189608e6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2292,9 +2292,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
{
struct kmem_cache_node *n = NULL, *n2 = NULL;
struct page *page, *discard_page = NULL;
- unsigned long flags;
-
- local_irq_save(flags);
+ unsigned long flags = 0;
while (partial_page) {
struct page new;
@@ -2306,10 +2304,10 @@ static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
n2 = get_node(s, page_to_nid(page));
if (n != n2) {
if (n)
- spin_unlock(&n->list_lock);
+ spin_unlock_irqrestore(&n->list_lock, flags);
n = n2;
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
}
do {
@@ -2338,9 +2336,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
}
if (n)
- spin_unlock(&n->list_lock);
-
- local_irq_restore(flags);
+ spin_unlock_irqrestore(&n->list_lock, flags);
while (discard_page) {
page = discard_page;
--
2.31.1
The function get_partial() does not need to have irqs disabled as a whole. It's
sufficient to convert spin_lock operations to their irq saving/restoring
versions.
As a result, it's now possible to reach the page allocator from the slab
allocator without disabling and re-enabling interrupts on the way.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 760080b58957..6f894d1e84d6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1950,11 +1950,12 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
* Try to allocate a partial slab from a specific node.
*/
static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
- struct page **ret_page, gfp_t flags)
+ struct page **ret_page, gfp_t gfpflags)
{
struct page *page, *page2;
void *object = NULL;
unsigned int available = 0;
+ unsigned long flags;
int objects;
/*
@@ -1966,11 +1967,11 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
if (!n || !n->nr_partial)
return NULL;
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
void *t;
- if (!pfmemalloc_match(page, flags))
+ if (!pfmemalloc_match(page, gfpflags))
continue;
t = acquire_slab(s, n, page, object == NULL, &objects);
@@ -1991,7 +1992,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
break;
}
- spin_unlock(&n->list_lock);
+ spin_unlock_irqrestore(&n->list_lock, flags);
return object;
}
@@ -2706,8 +2707,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
local_irq_restore(flags);
goto reread_page;
}
- if (unlikely(!slub_percpu_partial(c)))
+ if (unlikely(!slub_percpu_partial(c))) {
+ local_irq_restore(flags);
goto new_objects; /* stolen by an IRQ handler */
+ }
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
@@ -2716,18 +2719,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}
- local_irq_save(flags);
- if (unlikely(c->page)) {
- local_irq_restore(flags);
- goto reread_page;
- }
-
new_objects:
- lockdep_assert_irqs_disabled();
-
freelist = get_partial(s, gfpflags, node, &page);
- local_irq_restore(flags);
if (freelist)
goto check_new_page;
--
2.31.1
slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls only
functions that are now irq safe, so we don't need to disable irqs anymore.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 5a81189608e6..12d35d6a6eaa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2498,14 +2498,10 @@ static void flush_all(struct kmem_cache *s)
static int slub_cpu_dead(unsigned int cpu)
{
struct kmem_cache *s;
- unsigned long flags;
mutex_lock(&slab_mutex);
- list_for_each_entry(s, &slab_caches, list) {
- local_irq_save(flags);
+ list_for_each_entry(s, &slab_caches, list)
__flush_cpu_slab(s, cpu);
- local_irq_restore(flags);
- }
mutex_unlock(&slab_mutex);
return 0;
}
--
2.31.1
deactivate_slab() removes the cpu slab by merging the cpu freelist with slab's
freelist and putting the slab on the proper node's list. It also sets the
respective kmem_cache_cpu pointers to NULL.
By extracting the kmem_cache_cpu operations from the function, we can make it
not dependent on disabled irqs.
Also if we return a single free pointer from ___slab_alloc, we no longer have
to assign kmem_cache_cpu.page before deactivation or care if somebody preempted
us and assigned a different page to our kmem_cache_cpu in the process.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 31 ++++++++++++++++++-------------
1 file changed, 18 insertions(+), 13 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 6f894d1e84d6..8b0b975bfa36 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2149,10 +2149,13 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
}
/*
- * Remove the cpu slab
+ * Finishes removing the cpu slab. Merges cpu's freelist with page's freelist,
+ * unfreezes the slabs and puts it on the proper list.
+ * Assumes the slab has been already safely taken away from kmem_cache_cpu
+ * by the caller.
*/
static void deactivate_slab(struct kmem_cache *s, struct page *page,
- void *freelist, struct kmem_cache_cpu *c)
+ void *freelist)
{
enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
struct kmem_cache_node *n = get_node(s, page_to_nid(page));
@@ -2281,9 +2284,6 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
discard_slab(s, page);
stat(s, FREE_SLAB);
}
-
- c->page = NULL;
- c->freelist = NULL;
}
/*
@@ -2408,10 +2408,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
- stat(s, CPUSLAB_FLUSH);
- deactivate_slab(s, c->page, c->freelist, c);
+ void *freelist = c->freelist;
+ struct page *page = c->page;
+ c->page = NULL;
+ c->freelist = NULL;
c->tid = next_tid(c->tid);
+
+ deactivate_slab(s, page, freelist);
+
+ stat(s, CPUSLAB_FLUSH);
}
/*
@@ -2696,7 +2702,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
local_irq_restore(flags);
goto reread_page;
}
- deactivate_slab(s, page, c->freelist, c);
+ freelist = c->freelist;
+ c->page = NULL;
+ c->freelist = NULL;
+ deactivate_slab(s, page, freelist);
local_irq_restore(flags);
new_slab:
@@ -2775,11 +2784,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return_single:
local_irq_save(flags);
- if (unlikely(c->page))
- flush_slab(s, c);
- c->page = page;
-
- deactivate_slab(s, page, get_freepointer(s, freelist), c);
+ deactivate_slab(s, page, get_freepointer(s, freelist));
local_irq_restore(flags);
return freelist;
}
--
2.31.1
Currently flush_slab() is always called with disabled IRQs if it's needed, but
the following patches will change that, so add a parameter to control IRQ
disabling within the function, which only protects the kmem_cache_cpu
manipulation and not the call to deactivate_slab() which doesn't need it.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 12d35d6a6eaa..5c70fc17e9be 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2438,16 +2438,28 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
+static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
+ bool lock)
{
- void *freelist = c->freelist;
- struct page *page = c->page;
+ unsigned long flags;
+ void *freelist;
+ struct page *page;
+
+ if (lock)
+ local_irq_save(flags);
+
+ freelist = c->freelist;
+ page = c->page;
c->page = NULL;
c->freelist = NULL;
c->tid = next_tid(c->tid);
- deactivate_slab(s, page, freelist);
+ if (lock)
+ local_irq_restore(flags);
+
+ if (page)
+ deactivate_slab(s, page, freelist);
stat(s, CPUSLAB_FLUSH);
}
@@ -2457,7 +2469,7 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
if (c->page)
- flush_slab(s, c);
+ flush_slab(s, c, false);
unfreeze_partials_cpu(s, c);
}
@@ -2473,7 +2485,7 @@ static void flush_cpu_slab(void *d)
struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
if (c->page)
- flush_slab(s, c);
+ flush_slab(s, c, false);
unfreeze_partials(s);
}
--
2.31.1
Instead of relying on disabled irqs for atomicity when detaching the percpu
partial list, we can use this_cpu_cmpxchg() and detach without irqs disabled.
However, unfreeze_partials() can be also called from another cpu on behalf of
a cpu that is being offlined, so we need to restructure the code accordingly:
- __unfreeze_partials() is the bulk of unfreeze_partials() that processes the
detached percpu partial list
- unfreeze_partials() uses this_cpu_cmpxchg() to detach list from current cpu
- unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no
protection, and is called from __flush_cpu_slab()
- flush_cpu_slab() is for the local cpu thus it needs to call
unfreeze_partials(). So it can't simply call
__flush_cpu_slab(smp_processor_id()) anymore and we have to open-code it
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 77 +++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 55 insertions(+), 22 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 3e0ec7211d55..0867801e3086 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2287,25 +2287,15 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
}
}
-/*
- * Unfreeze all the cpu partial slabs.
- *
- * This function must be called with preemption or migration
- * disabled with c local to the cpu.
- */
-static void unfreeze_partials(struct kmem_cache *s,
- struct kmem_cache_cpu *c)
-{
#ifdef CONFIG_SLUB_CPU_PARTIAL
+static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
+{
struct kmem_cache_node *n = NULL, *n2 = NULL;
- struct page *page, *partial_page, *discard_page = NULL;
+ struct page *page, *discard_page = NULL;
unsigned long flags;
local_irq_save(flags);
- partial_page = slub_percpu_partial(c);
- c->partial = NULL;
-
while (partial_page) {
struct page new;
struct page old;
@@ -2360,10 +2350,49 @@ static void unfreeze_partials(struct kmem_cache *s,
discard_slab(s, page);
stat(s, FREE_SLAB);
}
+}
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
+/*
+ * Unfreeze all the cpu partial slabs.
+ *
+ * This function must be called with preemption or migration
+ * disabled.
+ */
+static void unfreeze_partials(struct kmem_cache *s)
+{
+ struct page *partial_page;
+
+ do {
+ partial_page = this_cpu_read(s->cpu_slab->partial);
+
+ } while (partial_page &&
+ this_cpu_cmpxchg(s->cpu_slab->partial, partial_page, NULL)
+ != partial_page);
+
+ if (partial_page)
+ __unfreeze_partials(s, partial_page);
}
+static void unfreeze_partials_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
+{
+ struct page *partial_page;
+
+ partial_page = slub_percpu_partial(c);
+ c->partial = NULL;
+
+ if (partial_page)
+ __unfreeze_partials(s, partial_page);
+}
+
+#else /* CONFIG_SLUB_CPU_PARTIAL */
+
+static void unfreeze_partials(struct kmem_cache *s) { }
+static void unfreeze_partials_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c) { }
+
+#endif /* CONFIG_SLUB_CPU_PARTIAL */
+
/*
* Put a page that was just frozen (in __slab_free|get_partial_node) into a
* partial page slot if available.
@@ -2392,7 +2421,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
* partial array is full. Move the existing
* set to the per node partial list.
*/
- unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+ unfreeze_partials(s);
oldpage = NULL;
pobjects = 0;
pages = 0;
@@ -2427,11 +2456,6 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
stat(s, CPUSLAB_FLUSH);
}
-/*
- * Flush cpu slab.
- *
- * Called from IPI handler with interrupts disabled.
- */
static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
{
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
@@ -2439,14 +2463,23 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
if (c->page)
flush_slab(s, c);
- unfreeze_partials(s, c);
+ unfreeze_partials_cpu(s, c);
}
+/*
+ * Flush cpu slab.
+ *
+ * Called from IPI handler with interrupts disabled.
+ */
static void flush_cpu_slab(void *d)
{
struct kmem_cache *s = d;
+ struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
- __flush_cpu_slab(s, smp_processor_id());
+ if (c->page)
+ flush_slab(s, c);
+
+ unfreeze_partials(s);
}
static bool has_cpu_slab(int cpu, void *info)
--
2.31.1
From: Sebastian Andrzej Siewior <[email protected]>
The variable object_map is protected by object_map_lock. The lock is always
acquired in debug code and within already atomic context
Make object_map_lock a raw_spinlock_t.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 6e747e0d7dcd..cfd5a7660375 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -432,7 +432,7 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
#ifdef CONFIG_SLUB_DEBUG
static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
-static DEFINE_SPINLOCK(object_map_lock);
+static DEFINE_RAW_SPINLOCK(object_map_lock);
static void __fill_map(unsigned long *obj_map, struct kmem_cache *s,
struct page *page)
@@ -457,7 +457,7 @@ static unsigned long *get_map(struct kmem_cache *s, struct page *page)
{
VM_BUG_ON(!irqs_disabled());
- spin_lock(&object_map_lock);
+ raw_spin_lock(&object_map_lock);
__fill_map(object_map, s, page);
@@ -467,7 +467,7 @@ static unsigned long *get_map(struct kmem_cache *s, struct page *page)
static void put_map(unsigned long *map) __releases(&object_map_lock)
{
VM_BUG_ON(map != object_map);
- spin_unlock(&object_map_lock);
+ raw_spin_unlock(&object_map_lock);
}
static inline unsigned int size_from_object(struct kmem_cache *s)
--
2.31.1
We currently use preempt_disable() (directly or via get_cpu_ptr()) to stabilize
the pointer to kmem_cache_cpu. On PREEMPT_RT this would be incompatible with
the list_lock spinlock. We can use migrate_disable() instead, but that
increases overhead on !PREEMPT_RT as it's an unconditional function call even
though it's ultimately a migrate_disable() there.
In order to get the best available mechanism on both PREEMPT_RT and
!PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr()
wrappers and use them.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 41 +++++++++++++++++++++++++++++++----------
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 12e966f07f19..caa206213e72 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -115,6 +115,26 @@
* the fast path and disables lockless freelists.
*/
+/*
+ * We could simply use migrate_disable()/enable() but as long as it's a
+ * function call even on !PREEMPT_RT, use inline preempt_disable() there.
+ */
+#ifdef CONFIG_PREEMPT_RT
+#define slub_get_cpu_ptr(var) get_cpu_ptr(var)
+#define slub_put_cpu_ptr(var) put_cpu_ptr(var)
+#else
+#define slub_get_cpu_ptr(var) \
+({ \
+ migrate_disable(); \
+ this_cpu_ptr(var); \
+})
+#define slub_put_cpu_ptr(var) \
+do { \
+ (void)(var); \
+ migrate_enable(); \
+} while (0)
+#endif
+
#ifdef CONFIG_SLUB_DEBUG
#ifdef CONFIG_SLUB_DEBUG_ON
DEFINE_STATIC_KEY_TRUE(slub_debug_enabled);
@@ -2419,7 +2439,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
int pages;
int pobjects;
- preempt_disable();
+ slub_get_cpu_ptr(s->cpu_slab);
do {
pages = 0;
pobjects = 0;
@@ -2450,7 +2470,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
!= oldpage);
- preempt_enable();
+ slub_put_cpu_ptr(s->cpu_slab);
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
@@ -2759,7 +2779,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(!pfmemalloc_match(page, gfpflags)))
goto deactivate_slab;
- /* must check again c->page in case IRQ handler changed it */
+ /* must check again c->page in case we got preempted and it changed */
local_irq_save(flags);
if (unlikely(page != c->page)) {
local_irq_restore(flags);
@@ -2818,7 +2838,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}
if (unlikely(!slub_percpu_partial(c))) {
local_irq_restore(flags);
- goto new_objects; /* stolen by an IRQ handler */
+ /* we were preempted and partial list got empty */
+ goto new_objects;
}
page = c->page = slub_percpu_partial(c);
@@ -2834,9 +2855,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (freelist)
goto check_new_page;
- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
- c = get_cpu_ptr(s->cpu_slab);
+ c = slub_get_cpu_ptr(s->cpu_slab);
if (unlikely(!page)) {
slab_out_of_memory(s, gfpflags, node);
@@ -2919,12 +2940,12 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* cpu before disabling preemption. Need to reload cpu area
* pointer.
*/
- c = get_cpu_ptr(s->cpu_slab);
+ c = slub_get_cpu_ptr(s->cpu_slab);
#endif
p = ___slab_alloc(s, gfpflags, node, addr, c);
#ifdef CONFIG_PREEMPTION
- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);
#endif
return p;
}
@@ -3445,7 +3466,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* IRQs, which protects against PREEMPT and interrupts
* handlers invoking normal fastpath.
*/
- c = get_cpu_ptr(s->cpu_slab);
+ c = slub_get_cpu_ptr(s->cpu_slab);
local_irq_disable();
for (i = 0; i < size; i++) {
@@ -3491,7 +3512,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
}
c->tid = next_tid(c->tid);
local_irq_enable();
- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);
/*
* memcg and kmem_cache debug support and memory initialization.
--
2.31.1
Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions of
local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's
equivalent, with better lockdep visibility. On PREEMPT_RT that means better
preemption.
However, the cost on PREEMPT_RT is the loss of lockless fast paths which only
work with cpu freelist. Those are designed to detect and recover from being
preempted by other conflicting operations (both fast or slow path), but the
slow path operations assume they cannot be preempted by a fast path operation,
which is guaranteed naturally with disabled irqs. With local locks on
PREEMPT_RT, the fast paths now also need to take the local lock to avoid races.
In the allocation fastpath slab_alloc_node() we can just defer to the slowpath
__slab_alloc() which also works with cpu freelist, but under the local lock.
In the free fastpath do_slab_free() we have to add a new local lock protected
version of freeing to the cpu freelist, as the existing slowpath only works
with the page freelist.
Also update the comment about locking scheme in SLUB to reflect changes done
by this series.
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/slub_def.h | 2 +
mm/slub.c | 138 ++++++++++++++++++++++++++++++---------
2 files changed, 110 insertions(+), 30 deletions(-)
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..b5bcac29b979 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -10,6 +10,7 @@
#include <linux/kfence.h>
#include <linux/kobject.h>
#include <linux/reciprocal_div.h>
+#include <linux/local_lock.h>
enum stat_item {
ALLOC_FASTPATH, /* Allocation from cpu slab */
@@ -41,6 +42,7 @@ enum stat_item {
NR_SLUB_STAT_ITEMS };
struct kmem_cache_cpu {
+ local_lock_t lock; /* Protects the fields below except stat */
void **freelist; /* Pointer to next available object */
unsigned long tid; /* Globally unique transaction id */
struct page *page; /* The slab from which we are allocating */
diff --git a/mm/slub.c b/mm/slub.c
index caa206213e72..500720ec1e57 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -43,13 +43,22 @@
/*
* Lock order:
* 1. slab_mutex (Global Mutex)
- * 2. node->list_lock
- * 3. slab_lock(page) (Only on some arches and for debugging)
+ * 2. node->list_lock (Spinlock)
+ * OR
+ * kmem_cache->cpu_slab->lock (Local lock)
+ * 3. slab_lock(page) (Only on some arches or for debugging)
+ * 4. object_map_lock (Only for debugging)
*
* slab_mutex
*
* The role of the slab_mutex is to protect the list of all the slabs
* and to synchronize major metadata changes to slab cache structures.
+ * Also synchronizes memory hotplug callbacks.
+ *
+ * slab_lock
+ *
+ * The slab_lock is a wrapper around the page lock, thus it is a bit
+ * spinlock.
*
* The slab_lock is only used for debugging and on arches that do not
* have the ability to do a cmpxchg_double. It only protects:
@@ -58,6 +67,8 @@
* C. page->objects -> Number of objects in page
* D. page->frozen -> frozen state
*
+ * Frozen slabs
+ *
* If a slab is frozen then it is exempt from list management. It is not
* on any list except per cpu partial list. The processor that froze the
* slab is the one who can perform list operations on the page. Other
@@ -65,6 +76,8 @@
* froze the slab is the only one that can retrieve the objects from the
* page's freelist.
*
+ * list_lock
+ *
* The list_lock protects the partial and full list on each node and
* the partial slab counter. If taken then no new slabs may be added or
* removed from the lists nor make the number of partial slabs be modified.
@@ -76,10 +89,36 @@
* slabs, operations can continue without any centralized lock. F.e.
* allocating a long series of objects that fill up slabs does not require
* the list lock.
- * Interrupts are disabled during allocation and deallocation in order to
- * make the slab allocator safe to use in the context of an irq. In addition
- * interrupts are disabled to ensure that the processor does not change
- * while handling per_cpu slabs, due to kernel preemption.
+ *
+ * cpu_slab->lock local lock
+ *
+ * This locks protect slowpath manipulation of all kmem_cache_cpu fields
+ * except the stat counters. This is a percpu structure manipulated only by
+ * the local cpu, so the lock protects against being preempted or interrupted
+ * by an irq. Fast path operations rely on lockless operations instead.
+ * On PREEMPT_RT, the local lock does not actually disable irqs (and thus
+ * prevent the lockless operations), so fastpath operations also need to take
+ * the lock and are no longer lockless.
+ *
+ * lockless fastpaths
+ *
+ * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
+ * are fully lockless when satisfied from the percpu slab (and when
+ * cmpxchg_double is possible to use, otherwise slab_lock is taken).
+ * They also don't disable preemption or migration or irqs. They rely on
+ * the transaction id (tid) field to detect being preempted or moved to
+ * another cpu.
+ *
+ * irq, preemption, migration considerations
+ *
+ * Interrupts are disabled as part of list_lock or local_lock operations, or
+ * around the slab_lock operation, in order to make the slab allocator safe
+ * to use in the context of an irq.
+ *
+ * In addition, preemption (or migration on PREEMPT_RT) is disabled in the
+ * allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
+ * local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
+ * doesn't have to be revalidated in each section protected by the local lock.
*
* SLUB assigns one slab for allocation to each processor.
* Allocations only occur from these slabs called cpu slabs.
@@ -2179,9 +2218,13 @@ static inline void note_cmpxchg_failure(const char *n,
static void init_kmem_cache_cpus(struct kmem_cache *s)
{
int cpu;
+ struct kmem_cache_cpu *c;
- for_each_possible_cpu(cpu)
- per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
+ for_each_possible_cpu(cpu) {
+ c = per_cpu_ptr(s->cpu_slab, cpu);
+ local_lock_init(&c->lock);
+ c->tid = init_tid(cpu);
+ }
}
/*
@@ -2482,7 +2525,7 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
struct page *page;
if (lock)
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
freelist = c->freelist;
page = c->page;
@@ -2492,7 +2535,7 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
c->tid = next_tid(c->tid);
if (lock)
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
if (page)
deactivate_slab(s, page, freelist);
@@ -2780,9 +2823,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto deactivate_slab;
/* must check again c->page in case we got preempted and it changed */
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(page != c->page)) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_page;
}
freelist = c->freelist;
@@ -2793,7 +2836,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (!freelist) {
c->page = NULL;
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
stat(s, DEACTIVATE_BYPASS);
goto new_slab;
}
@@ -2802,7 +2845,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
load_freelist:
- lockdep_assert_irqs_disabled();
+ lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
/*
* freelist is pointing to the list of objects to be used.
@@ -2812,39 +2855,39 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
VM_BUG_ON(!c->page->frozen);
c->freelist = get_freepointer(s, freelist);
c->tid = next_tid(c->tid);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
return freelist;
deactivate_slab:
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (page != c->page) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_page;
}
freelist = c->freelist;
c->page = NULL;
c->freelist = NULL;
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
deactivate_slab(s, page, freelist);
new_slab:
if (slub_percpu_partial(c)) {
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(c->page)) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_page;
}
if (unlikely(!slub_percpu_partial(c))) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
/* we were preempted and partial list got empty */
goto new_objects;
}
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
stat(s, CPU_PARTIAL_ALLOC);
goto redo;
}
@@ -2897,7 +2940,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
retry_load_page:
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(c->page)) {
void *flush_freelist = c->freelist;
struct page *flush_page = c->page;
@@ -2906,7 +2949,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->freelist = NULL;
c->tid = next_tid(c->tid);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
deactivate_slab(s, flush_page, flush_freelist);
@@ -3025,7 +3068,15 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
object = c->freelist;
page = c->page;
- if (unlikely(!object || !page || !node_match(page, node))) {
+ /*
+ * We cannot use the lockless fastpath on PREEMPT_RT because if a
+ * slowpath has taken the local_lock_irqsave(), it is not protected
+ * against a fast path operation in an irq handler. So we need to take
+ * the slow path which uses local_lock. It is still relatively fast if
+ * there is a suitable cpu freelist.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
+ unlikely(!object || !page || !node_match(page, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
} else {
void *next_object = get_freepointer_safe(s, object);
@@ -3285,6 +3336,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
barrier();
if (likely(page == c->page)) {
+#ifndef CONFIG_PREEMPT_RT
void **freelist = READ_ONCE(c->freelist);
set_freepointer(s, tail_obj, freelist);
@@ -3297,6 +3349,32 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
note_cmpxchg_failure("slab_free", s, tid);
goto redo;
}
+#else /* CONFIG_PREEMPT_RT */
+ /*
+ * We cannot use the lockless fastpath on PREEMPT_RT because if
+ * a slowpath has taken the local_lock_irqsave(), it is not
+ * protected against a fast path operation in an irq handler. So
+ * we need to take the local_lock. We shouldn't simply defer to
+ * __slab_free() as that wouldn't use the cpu freelist at all.
+ */
+ unsigned long flags;
+ void **freelist;
+
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
+ c = this_cpu_ptr(s->cpu_slab);
+ if (unlikely(page != c->page)) {
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ goto redo;
+ }
+ tid = c->tid;
+ freelist = c->freelist;
+
+ set_freepointer(s, tail_obj, freelist);
+ c->freelist = head;
+ c->tid = next_tid(tid);
+
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+#endif
stat(s, FREE_FASTPATH);
} else
__slab_free(s, page, head, tail_obj, cnt, addr);
@@ -3467,7 +3545,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* handlers invoking normal fastpath.
*/
c = slub_get_cpu_ptr(s->cpu_slab);
- local_irq_disable();
+ local_lock_irq(&s->cpu_slab->lock);
for (i = 0; i < size; i++) {
void *object = kfence_alloc(s, s->object_size, flags);
@@ -3488,7 +3566,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
*/
c->tid = next_tid(c->tid);
- local_irq_enable();
+ local_unlock_irq(&s->cpu_slab->lock);
/*
* Invoking slow path likely have side-effect
@@ -3502,7 +3580,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
c = this_cpu_ptr(s->cpu_slab);
maybe_wipe_obj_freeptr(s, p[i]);
- local_irq_disable();
+ local_lock_irq(&s->cpu_slab->lock);
continue; /* goto for-loop */
}
@@ -3511,7 +3589,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
maybe_wipe_obj_freeptr(s, p[i]);
}
c->tid = next_tid(c->tid);
- local_irq_enable();
+ local_unlock_irq(&s->cpu_slab->lock);
slub_put_cpu_ptr(s->cpu_slab);
/*
@@ -3522,7 +3600,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
slab_want_init_on_alloc(flags, s));
return i;
error:
- local_irq_enable();
+ local_unlock_irq(&s->cpu_slab->lock);
slab_post_alloc_hook(s, objcg, flags, i, p, false);
__kmem_cache_free_bulk(s, i, p);
return 0;
--
2.31.1
In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee that
our kmem_cache_cpu pointer is from the same cpu as the tid value. Currently
that's done by reading the tid first using this_cpu_read(), then the
kmem_cache_cpu pointer and verifying we read the same tid using the pointer and
plain READ_ONCE().
This can be simplified to just fetching kmem_cache_cpu pointer and then reading
tid using the pointer. That guarantees they are from the same cpu. We don't
need to read the tid using this_cpu_read() because the value will be validated
by this_cpu_cmpxchg_double(), making sure we are on the correct cpu and the
freelist didn't change by anyone preempting us since reading the tid.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/slub.c | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 0d5c253a39b2..6d6a9a69db8a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2822,15 +2822,14 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
* reading from one cpu area. That does not matter as long
* as we end up on the original cpu again when doing the cmpxchg.
*
- * We should guarantee that tid and kmem_cache are retrieved on
- * the same cpu. It could be different if CONFIG_PREEMPTION so we need
- * to check if it is matched or not.
+ * We must guarantee that tid and kmem_cache_cpu are retrieved on the
+ * same cpu. We read first the kmem_cache_cpu pointer and use it to read
+ * the tid. If we are preempted and switched to another cpu between the
+ * two reads, it's OK as the two are still associated with the same cpu
+ * and cmpxchg later will validate the cpu.
*/
- do {
- tid = this_cpu_read(s->cpu_slab->tid);
- c = raw_cpu_ptr(s->cpu_slab);
- } while (IS_ENABLED(CONFIG_PREEMPTION) &&
- unlikely(tid != READ_ONCE(c->tid)));
+ c = raw_cpu_ptr(s->cpu_slab);
+ tid = READ_ONCE(c->tid);
/*
* Irqless object alloc/free algorithm used here depends on sequence
@@ -3104,11 +3103,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
* data is retrieved via this pointer. If we are on the same cpu
* during the cmpxchg then the free will succeed.
*/
- do {
- tid = this_cpu_read(s->cpu_slab->tid);
- c = raw_cpu_ptr(s->cpu_slab);
- } while (IS_ENABLED(CONFIG_PREEMPTION) &&
- unlikely(tid != READ_ONCE(c->tid)));
+ c = raw_cpu_ptr(s->cpu_slab);
+ tid = READ_ONCE(c->tid);
/* Same with comment on barrier() in slab_alloc_node() */
barrier();
--
2.31.1
These functions differ only in irq disabling in the slow path. We can create a
common function with an extra bool parameter to control the irq disabling.
As the functions are inline and the parameter compile-time constant, there
will be no runtime overhead due to this change.
Also change the DEBUG_VM based irqs disable assert to the more standard
lockdep_assert based one.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 62 +++++++++++++++++++++----------------------------------
1 file changed, 24 insertions(+), 38 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index f740598696b4..76af5065baeb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -365,13 +365,13 @@ static __always_inline void slab_unlock(struct page *page)
__bit_spin_unlock(PG_locked, &page->flags);
}
-/* Interrupts must be disabled (for the fallback code to work right) */
-static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
+static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
void *freelist_old, unsigned long counters_old,
void *freelist_new, unsigned long counters_new,
- const char *n)
+ const char *n, bool disable_irqs)
{
- VM_BUG_ON(!irqs_disabled());
+ if (!disable_irqs)
+ lockdep_assert_irqs_disabled();
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (s->flags & __CMPXCHG_DOUBLE) {
@@ -382,15 +382,23 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page
} else
#endif
{
+ unsigned long flags;
+
+ if (disable_irqs)
+ local_irq_save(flags);
slab_lock(page);
if (page->freelist == freelist_old &&
page->counters == counters_old) {
page->freelist = freelist_new;
page->counters = counters_new;
slab_unlock(page);
+ if (disable_irqs)
+ local_irq_restore(flags);
return true;
}
slab_unlock(page);
+ if (disable_irqs)
+ local_irq_restore(flags);
}
cpu_relax();
@@ -403,45 +411,23 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page
return false;
}
-static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
+/* Interrupts must be disabled (for the fallback code to work right) */
+static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
void *freelist_old, unsigned long counters_old,
void *freelist_new, unsigned long counters_new,
const char *n)
{
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
- defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
- if (s->flags & __CMPXCHG_DOUBLE) {
- if (cmpxchg_double(&page->freelist, &page->counters,
- freelist_old, counters_old,
- freelist_new, counters_new))
- return true;
- } else
-#endif
- {
- unsigned long flags;
-
- local_irq_save(flags);
- slab_lock(page);
- if (page->freelist == freelist_old &&
- page->counters == counters_old) {
- page->freelist = freelist_new;
- page->counters = counters_new;
- slab_unlock(page);
- local_irq_restore(flags);
- return true;
- }
- slab_unlock(page);
- local_irq_restore(flags);
- }
-
- cpu_relax();
- stat(s, CMPXCHG_DOUBLE_FAIL);
-
-#ifdef SLUB_DEBUG_CMPXCHG
- pr_info("%s %s: cmpxchg double redo ", n, s->name);
-#endif
+ return ___cmpxchg_double_slab(s, page, freelist_old, counters_old,
+ freelist_new, counters_new, n, false);
+}
- return false;
+static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
+ void *freelist_old, unsigned long counters_old,
+ void *freelist_new, unsigned long counters_new,
+ const char *n)
+{
+ return ___cmpxchg_double_slab(s, page, freelist_old, counters_old,
+ freelist_new, counters_new, n, true);
}
#ifdef CONFIG_SLUB_DEBUG
--
2.31.1
For PREEMPT_RT we will need to disable irqs for this bit spinlock. As a
preparation, add a flags parameter, and an internal version that takes
additional bool parameter to control irq saving/restoring (the flags
parameter is compile-time unused if the bool is a constant false).
Convert ___cmpxchg_double_slab(), which also comes with the same bool
parameter, to use the internal version.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 49 +++++++++++++++++++++++++++++++------------------
1 file changed, 31 insertions(+), 18 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index cfd5a7660375..6721169f816d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -353,18 +353,35 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
/*
* Per slab locking using the pagelock
*/
-static __always_inline void slab_lock(struct page *page)
+static __always_inline void
+__slab_lock(struct page *page, unsigned long *flags, bool disable_irqs)
{
VM_BUG_ON_PAGE(PageTail(page), page);
+ if (disable_irqs)
+ local_irq_save(*flags);
bit_spin_lock(PG_locked, &page->flags);
}
-static __always_inline void slab_unlock(struct page *page)
+static __always_inline void
+__slab_unlock(struct page *page, unsigned long *flags, bool disable_irqs)
{
VM_BUG_ON_PAGE(PageTail(page), page);
+ if (disable_irqs)
+ local_irq_restore(*flags);
__bit_spin_unlock(PG_locked, &page->flags);
}
+static __always_inline void
+slab_lock(struct page *page, unsigned long *flags)
+{
+ __slab_lock(page, flags, false);
+}
+
+static __always_inline void slab_unlock(struct page *page, unsigned long *flags)
+{
+ __slab_unlock(page, flags, false);
+}
+
static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
void *freelist_old, unsigned long counters_old,
void *freelist_new, unsigned long counters_new,
@@ -384,21 +401,15 @@ static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *pag
{
unsigned long flags;
- if (disable_irqs)
- local_irq_save(flags);
- slab_lock(page);
+ __slab_lock(page, &flags, disable_irqs);
if (page->freelist == freelist_old &&
page->counters == counters_old) {
page->freelist = freelist_new;
page->counters = counters_new;
- slab_unlock(page);
- if (disable_irqs)
- local_irq_restore(flags);
+ __slab_unlock(page, &flags, disable_irqs);
return true;
}
- slab_unlock(page);
- if (disable_irqs)
- local_irq_restore(flags);
+ __slab_unlock(page, &flags, disable_irqs);
}
cpu_relax();
@@ -1214,11 +1225,11 @@ static noinline int free_debug_processing(
struct kmem_cache_node *n = get_node(s, page_to_nid(page));
void *object = head;
int cnt = 0;
- unsigned long flags;
+ unsigned long flags, flags2;
int ret = 0;
spin_lock_irqsave(&n->list_lock, flags);
- slab_lock(page);
+ slab_lock(page, &flags2);
if (s->flags & SLAB_CONSISTENCY_CHECKS) {
if (!check_slab(s, page))
@@ -1251,7 +1262,7 @@ static noinline int free_debug_processing(
slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
bulk_cnt, cnt);
- slab_unlock(page);
+ slab_unlock(page, &flags2);
spin_unlock_irqrestore(&n->list_lock, flags);
if (!ret)
slab_fix(s, "Object at 0x%p not freed", object);
@@ -4007,9 +4018,10 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page,
void *addr = page_address(page);
unsigned long *map;
void *p;
+ unsigned long flags;
slab_err(s, page, text, s->name);
- slab_lock(page);
+ slab_lock(page, &flags);
map = get_map(s, page);
for_each_object(p, s, addr, page->objects) {
@@ -4020,7 +4032,7 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page,
}
}
put_map(map);
- slab_unlock(page);
+ slab_unlock(page, &flags);
#endif
}
@@ -4736,8 +4748,9 @@ static void validate_slab(struct kmem_cache *s, struct page *page,
{
void *p;
void *addr = page_address(page);
+ unsigned long flags;
- slab_lock(page);
+ slab_lock(page, &flags);
if (!check_slab(s, page) || !on_freelist(s, page, NULL))
goto unlock;
@@ -4752,7 +4765,7 @@ static void validate_slab(struct kmem_cache *s, struct page *page,
break;
}
unlock:
- slab_unlock(page);
+ slab_unlock(page, &flags);
}
static int validate_slab_node(struct kmem_cache *s,
--
2.31.1
The function get_partial() finds a suitable page on a partial list, acquires
and returns its freelist and assigns the page pointer to kmem_cache_cpu.
In later patch we will need more control over the kmem_cache_cpu.page
assignment, so instead of passing a kmem_cache_cpu pointer, pass a pointer to a
pointer to a page that get_partial() can fill and the caller can assign the
kmem_cache_cpu.page pointer. No functional change as all of this still happens
with disabled IRQs.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 87f652bf8e4d..ca7cfc7706cc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1957,7 +1957,7 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
* Try to allocate a partial slab from a specific node.
*/
static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
- struct kmem_cache_cpu *c, gfp_t flags)
+ struct page **ret_page, gfp_t flags)
{
struct page *page, *page2;
void *object = NULL;
@@ -1986,7 +1986,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
available += objects;
if (!object) {
- c->page = page;
+ *ret_page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
} else {
@@ -2006,7 +2006,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
* Get a page from somewhere. Search in increasing NUMA distances.
*/
static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
- struct kmem_cache_cpu *c)
+ struct page **ret_page)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
@@ -2048,7 +2048,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
if (n && cpuset_zone_allowed(zone, flags) &&
n->nr_partial > s->min_partial) {
- object = get_partial_node(s, n, c, flags);
+ object = get_partial_node(s, n, ret_page, flags);
if (object) {
/*
* Don't check read_mems_allowed_retry()
@@ -2070,7 +2070,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
* Get a partial page, lock it and return it.
*/
static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
- struct kmem_cache_cpu *c)
+ struct page **ret_page)
{
void *object;
int searchnode = node;
@@ -2078,11 +2078,11 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();
- object = get_partial_node(s, get_node(s, searchnode), c, flags);
+ object = get_partial_node(s, get_node(s, searchnode), ret_page, flags);
if (object || node != NUMA_NO_NODE)
return object;
- return get_any_partial(s, flags, c);
+ return get_any_partial(s, flags, ret_page);
}
#ifdef CONFIG_PREEMPTION
@@ -2694,9 +2694,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}
- freelist = get_partial(s, gfpflags, node, c);
- if (freelist)
+ freelist = get_partial(s, gfpflags, node, &page);
+ if (freelist) {
+ c->page = page;
goto check_new_page;
+ }
page = new_slab(s, gfpflags, node);
@@ -2720,7 +2722,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->page = page;
check_new_page:
- page = c->page;
if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
goto load_freelist;
--
2.31.1
On Wed, Jun 09, 2021 at 01:38:58PM +0200, Vlastimil Babka wrote:
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> +
> static void flush_all(struct kmem_cache *s)
> {
> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
> + struct slub_flush_work *sfw;
> + unsigned int cpu;
> +
> + cpus_read_lock();
> + mutex_lock(&flush_lock);
> +
Hi, Vlastimil! Could you please point why do you lock cpus first and
mutex only after? Why not mutex_lock + cpus_read_lock instead?
On 6/10/21 12:29 AM, Cyrill Gorcunov wrote:
> On Wed, Jun 09, 2021 at 01:38:58PM +0200, Vlastimil Babka wrote:
>> +static DEFINE_MUTEX(flush_lock);
>> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>> +
>> static void flush_all(struct kmem_cache *s)
>> {
>> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
>> + struct slub_flush_work *sfw;
>> + unsigned int cpu;
>> +
>> + cpus_read_lock();
>> + mutex_lock(&flush_lock);
>> +
>
> Hi, Vlastimil! Could you please point why do you lock cpus first and
> mutex only after? Why not mutex_lock + cpus_read_lock instead?
Good question! I must admit I didn't think about it much and just followed the
order that was in the original Sebastian's patch [1]
But there was a good reason for this order as some paths via
__kmem_cache_shutdown() and __kmem_cache_shrink() were alreadu called under
cpus_read_lock. Meanwhile mainline (me, actually) removed those, so now it
doesn't seem to be a need to keep this order anymore and we could switch it.
Thanks,
Vlastimil
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0005-mm-slub-Move-flush_cpu_slab-invocations-__free_slab-.patch?h=linux-5.12.y-rt-patches
On Thu, Jun 10, 2021 at 10:32:14AM +0200, Vlastimil Babka wrote:
> >> static void flush_all(struct kmem_cache *s)
> >> {
> >> - on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
> >> + struct slub_flush_work *sfw;
> >> + unsigned int cpu;
> >> +
> >> + cpus_read_lock();
> >> + mutex_lock(&flush_lock);
> >> +
> >
> > Hi, Vlastimil! Could you please point why do you lock cpus first and
> > mutex only after? Why not mutex_lock + cpus_read_lock instead?
>
> Good question! I must admit I didn't think about it much and just followed the
> order that was in the original Sebastian's patch [1]
> But there was a good reason for this order as some paths via
> __kmem_cache_shutdown() and __kmem_cache_shrink() were alreadu called under
> cpus_read_lock. Meanwhile mainline (me, actually) removed those, so now it
> doesn't seem to be a need to keep this order anymore and we could switch it.
I bet we should switch :) But we can do that on top later, once series is
settled down and merged.
On Wed, Jun 09, 2021 at 01:38:29PM +0200, Vlastimil Babka wrote:
> Changes since RFC v1 [1]:
> * Addressed feedback from Christoph and Mel, added their acks.
> * Finished RT conversion, including adopting 2 patches from the RT tree.
> * The optional local_lock conversion has to sacrifice lockless fathpaths on RT
> * Added some more cleanup patches to the front.
>
> This series was initially inspired by Mel's pcplist local_lock rewrite, and
> also by interest to better understand SLUB's locking and the new locking
> primitives and their RT variants and implications. It should make SLUB more
> preemption-friendly and fully RT compatible, hopefully without noticeable
> regressions on !RT kernels, as the fast paths are not affected there.
>
> Series is based on 5.13-rc5 and also available as a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1
>
This failed to boot for me inside KVM.
[ 0.273576] SRBDS: Unknown: Dependent on hypervisor status
[ 0.273576] MDS: Mitigation: Clear CPU buffers
[ 0.273576] Freeing SMP alternatives memory: 36K
[ 0.273576] bad: scheduling from the idle thread!
[ 0.273576] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc5-unspecified+ #1
[ 0.273576] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 0.273576] Call Trace:
[ 0.273576] dump_stack+0x76/0x94
[ 0.273576] dequeue_task_idle+0x28/0x40
[ 0.273576] __do_set_cpus_allowed+0xe0/0x1a0
[ 0.273576] __schedule+0x7f7/0x8f0
[ 0.273576] __cond_resched+0x22/0x40
[ 0.273576] alloc_vmap_area+0x72/0x8b0
[ 0.273576] ? kmem_cache_alloc_node_trace+0x189/0x300
[ 0.273576] ? __get_vm_area_node+0x76/0x150
[ 0.273576] __get_vm_area_node+0xab/0x150
[ 0.273576] __vmalloc_node_range+0x6d/0x2c0
[ 0.273576] ? kernel_clone+0x9b/0x3e0
[ 0.273576] ? kmem_cache_alloc_node+0x18d/0x2f0
[ 0.273576] ? copy_process+0x218/0x1c40
[ 0.273576] copy_process+0x2c1/0x1c40
[ 0.273576] ? kernel_clone+0x9b/0x3e0
[ 0.273576] ? enqueue_task_fair+0xa1/0x590
[ 0.273576] kernel_clone+0x9b/0x3e0
[ 0.273576] ? acpi_hw_register_read+0x146/0x166
[ 0.273576] kernel_thread+0x55/0x70
[ 0.273576] ? kthread_is_per_cpu+0x30/0x30
[ 0.273576] rest_init+0x75/0xc0
[ 0.273576] start_kernel+0x7fb/0x822
[ 0.273576] secondary_startup_64_no_verify+0xc2/0xcb
[ 0.273576] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 0.273576] #PF: supervisor instruction fetch in kernel mode
[ 0.273576] #PF: error_code(0x0010) - not-present page
[ 0.273576] PGD 0 P4D 0
[ 0.273576] Oops: 0010 [#1] SMP PTI
[ 0.273576] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc5-unspecified+ #1
[ 0.273576] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 0.273576] RIP: 0010:0x0
[ 0.273576] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 0.273576] RSP: 0000:ffffffffbb803bb0 EFLAGS: 00010002
[ 0.273576] RAX: 0000000000000000 RBX: ffff9fa64ea2ecc0 RCX: ffffffffb9e00101
[ 0.273576] RDX: 000000000000000a RSI: ffffffffbb81a940 RDI: ffff9fa64ea2ecc0
[ 0.273576] RBP: ffffffffbb803c08 R08: 0000000000000000 R09: ffffffffbbaa699c
[ 0.273576] R10: 0000000000000000 R11: ffffffffbb803888 R12: ffffffffbb81a940
[ 0.273576] R13: ffff9f9a80245f40 R14: ffffffffbb1f0060 R15: 0000000000003fff
[ 0.273576] FS: 0000000000000000(0000) GS:ffff9fa64ea00000(0000) knlGS:0000000000000000
[ 0.273576] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.273576] CR2: ffffffffffffffd6 CR3: 0000000a60e10001 CR4: 0000000000170ef0
[ 0.273576] Call Trace:
[ 0.273576] __schedule+0x7f7/0x8f0
[ 0.273576] __cond_resched+0x22/0x40
[ 0.273576] alloc_vmap_area+0x72/0x8b0
[ 0.273576] ? kmem_cache_alloc_node_trace+0x189/0x300
[ 0.273576] ? __get_vm_area_node+0x76/0x150
[ 0.273576] __get_vm_area_node+0xab/0x150
[ 0.273576] __vmalloc_node_range+0x6d/0x2c0
[ 0.273576] ? kernel_clone+0x9b/0x3e0
[ 0.273576] ? kmem_cache_alloc_node+0x18d/0x2f0
[ 0.273576] ? copy_process+0x218/0x1c40
[ 0.273576] copy_process+0x2c1/0x1c40
[ 0.273576] ? kernel_clone+0x9b/0x3e0
[ 0.273576] ? enqueue_task_fair+0xa1/0x590
[ 0.273576] kernel_clone+0x9b/0x3e0
[ 0.273576] ? acpi_hw_register_read+0x146/0x166
[ 0.273576] kernel_thread+0x55/0x70
[ 0.273576] ? kthread_is_per_cpu+0x30/0x30
[ 0.273576] rest_init+0x75/0xc0
[ 0.273576] start_kernel+0x7fb/0x822
[ 0.273576] secondary_startup_64_no_verify+0xc2/0xcb
[ 0.273576] Modules linked in:
[ 0.273576] CR2: 0000000000000000
[ 0.273576] ---[ end trace 7199d6fbb50b4cf7 ]---
[ 0.273576] RIP: 0010:0x0
[ 0.273576] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 0.273576] RSP: 0000:ffffffffbb803bb0 EFLAGS: 00010002
[ 0.273576] RAX: 0000000000000000 RBX: ffff9fa64ea2ecc0 RCX: ffffffffb9e00101
[ 0.273576] RDX: 000000000000000a RSI: ffffffffbb81a940 RDI: ffff9fa64ea2ecc0
[ 0.273576] RBP: ffffffffbb803c08 R08: 0000000000000000 R09: ffffffffbbaa699c
[ 0.273576] R10: 0000000000000000 R11: ffffffffbb803888 R12: ffffffffbb81a940
[ 0.273576] R13: ffff9f9a80245f40 R14: ffffffffbb1f0060 R15: 0000000000003fff
[ 0.273576] FS: 0000000000000000(0000) GS:ffff9fa64ea00000(0000) knlGS:0000000000000000
[ 0.273576] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.273576] CR2: ffffffffffffffd6 CR3: 0000000a60e10001 CR4: 0000000000170ef0
[ 0.273576] Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.273576] Rebooting in 90 seconds..
On 6/9/21 1:39 PM, Vlastimil Babka wrote:
> We currently use preempt_disable() (directly or via get_cpu_ptr()) to stabilize
> the pointer to kmem_cache_cpu. On PREEMPT_RT this would be incompatible with
> the list_lock spinlock. We can use migrate_disable() instead, but that
> increases overhead on !PREEMPT_RT as it's an unconditional function call even
> though it's ultimately a migrate_disable() there.
>
> In order to get the best available mechanism on both PREEMPT_RT and
> !PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr()
> wrappers and use them.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/slub.c | 41 +++++++++++++++++++++++++++++++----------
> 1 file changed, 31 insertions(+), 10 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 12e966f07f19..caa206213e72 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -115,6 +115,26 @@
> * the fast path and disables lockless freelists.
> */
>
> +/*
> + * We could simply use migrate_disable()/enable() but as long as it's a
> + * function call even on !PREEMPT_RT, use inline preempt_disable() there.
> + */
> +#ifdef CONFIG_PREEMPT_RT
> +#define slub_get_cpu_ptr(var) get_cpu_ptr(var)
> +#define slub_put_cpu_ptr(var) put_cpu_ptr(var)
After Mel's report and bisect pointing to this patch, I realized I got the
#ifdef wrong and it should be #ifnded
Fixed patch follows:
----8<----
From e79e166414b8c22e3bcf73a9383bafe802fa64d2 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Fri, 21 May 2021 14:03:23 +0200
Subject: [PATCH] mm, slub: use migrate_disable() on PREEMPT_RT
We currently use preempt_disable() (directly or via get_cpu_ptr()) to stabilize
the pointer to kmem_cache_cpu. On PREEMPT_RT this would be incompatible with
the list_lock spinlock. We can use migrate_disable() instead, but that
increases overhead on !PREEMPT_RT as it's an unconditional function call even
though it's ultimately a migrate_disable() there.
In order to get the best available mechanism on both PREEMPT_RT and
!PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr()
wrappers and use them.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 41 +++++++++++++++++++++++++++++++----------
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 12e966f07f19..f0359b0c8154 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -115,6 +115,26 @@
* the fast path and disables lockless freelists.
*/
+/*
+ * We could simply use migrate_disable()/enable() but as long as it's a
+ * function call even on !PREEMPT_RT, use inline preempt_disable() there.
+ */
+#ifndef CONFIG_PREEMPT_RT
+#define slub_get_cpu_ptr(var) get_cpu_ptr(var)
+#define slub_put_cpu_ptr(var) put_cpu_ptr(var)
+#else
+#define slub_get_cpu_ptr(var) \
+({ \
+ migrate_disable(); \
+ this_cpu_ptr(var); \
+})
+#define slub_put_cpu_ptr(var) \
+do { \
+ (void)(var); \
+ migrate_enable(); \
+} while (0)
+#endif
+
#ifdef CONFIG_SLUB_DEBUG
#ifdef CONFIG_SLUB_DEBUG_ON
DEFINE_STATIC_KEY_TRUE(slub_debug_enabled);
@@ -2419,7 +2439,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
int pages;
int pobjects;
- preempt_disable();
+ slub_get_cpu_ptr(s->cpu_slab);
do {
pages = 0;
pobjects = 0;
@@ -2450,7 +2470,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
!= oldpage);
- preempt_enable();
+ slub_put_cpu_ptr(s->cpu_slab);
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
@@ -2759,7 +2779,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(!pfmemalloc_match(page, gfpflags)))
goto deactivate_slab;
- /* must check again c->page in case IRQ handler changed it */
+ /* must check again c->page in case we got preempted and it changed */
local_irq_save(flags);
if (unlikely(page != c->page)) {
local_irq_restore(flags);
@@ -2818,7 +2838,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}
if (unlikely(!slub_percpu_partial(c))) {
local_irq_restore(flags);
- goto new_objects; /* stolen by an IRQ handler */
+ /* we were preempted and partial list got empty */
+ goto new_objects;
}
page = c->page = slub_percpu_partial(c);
@@ -2834,9 +2855,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (freelist)
goto check_new_page;
- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
- c = get_cpu_ptr(s->cpu_slab);
+ c = slub_get_cpu_ptr(s->cpu_slab);
if (unlikely(!page)) {
slab_out_of_memory(s, gfpflags, node);
@@ -2919,12 +2940,12 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* cpu before disabling preemption. Need to reload cpu area
* pointer.
*/
- c = get_cpu_ptr(s->cpu_slab);
+ c = slub_get_cpu_ptr(s->cpu_slab);
#endif
p = ___slab_alloc(s, gfpflags, node, addr, c);
#ifdef CONFIG_PREEMPTION
- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);
#endif
return p;
}
@@ -3445,7 +3466,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* IRQs, which protects against PREEMPT and interrupts
* handlers invoking normal fastpath.
*/
- c = get_cpu_ptr(s->cpu_slab);
+ c = slub_get_cpu_ptr(s->cpu_slab);
local_irq_disable();
for (i = 0; i < size; i++) {
@@ -3491,7 +3512,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
}
c->tid = next_tid(c->tid);
local_irq_enable();
- put_cpu_ptr(s->cpu_slab);
+ slub_put_cpu_ptr(s->cpu_slab);
/*
* memcg and kmem_cache debug support and memory initialization.
--
2.32.0
On 6/9/21 1:38 PM, Vlastimil Babka wrote:
> Changes since RFC v1 [1]:
> * Addressed feedback from Christoph and Mel, added their acks.
> * Finished RT conversion, including adopting 2 patches from the RT tree.
> * The optional local_lock conversion has to sacrifice lockless fathpaths on RT
> * Added some more cleanup patches to the front.
>
> This series was initially inspired by Mel's pcplist local_lock rewrite, and
> also by interest to better understand SLUB's locking and the new locking
> primitives and their RT variants and implications. It should make SLUB more
> preemption-friendly and fully RT compatible, hopefully without noticeable
> regressions on !RT kernels, as the fast paths are not affected there.
>
> Series is based on 5.13-rc5 and also available as a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1
Pushed a new revision branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r2
which fixes a bug in
[RFC v2 33/34] mm, slub: use migrate_disable() on PREEMPT_RT
that was reported by Mel
On 2021-06-14 13:07:14 [+0200], Vlastimil Babka wrote:
> > +#ifdef CONFIG_PREEMPT_RT
> > +#define slub_get_cpu_ptr(var) get_cpu_ptr(var)
> > +#define slub_put_cpu_ptr(var) put_cpu_ptr(var)
>
> After Mel's report and bisect pointing to this patch, I realized I got the
> #ifdef wrong and it should be #ifnded
So if you got the ifdef wrong (and kept everything as-is) then you
tested the RT version on !RT. migrate_disable() behaves on !RT as on RT.
As per changelog you don't use migrate_disable() unconditionally because
it increases the overhead on !RT.
I haven't looked at the series and I have just this tiny question: why
did migrate_disable() crash for Mel on !RT and why do you expect that it
does not happen on PREEMPT_RT?
Sebastian
On Mon, Jun 14, 2021 at 10:49:02AM +0100, Mel Gorman wrote:
> On Wed, Jun 09, 2021 at 01:38:29PM +0200, Vlastimil Babka wrote:
> > Changes since RFC v1 [1]:
> > * Addressed feedback from Christoph and Mel, added their acks.
> > * Finished RT conversion, including adopting 2 patches from the RT tree.
> > * The optional local_lock conversion has to sacrifice lockless fathpaths on RT
> > * Added some more cleanup patches to the front.
> >
> > This series was initially inspired by Mel's pcplist local_lock rewrite, and
> > also by interest to better understand SLUB's locking and the new locking
> > primitives and their RT variants and implications. It should make SLUB more
> > preemption-friendly and fully RT compatible, hopefully without noticeable
> > regressions on !RT kernels, as the fast paths are not affected there.
> >
> > Series is based on 5.13-rc5 and also available as a git branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1
> >
>
> This failed to boot for me inside KVM.
>
Bisection pointed to "mm, slub: use migrate_disable() on PREEMPT_RT".
Off-list, Vlastimil noted that the #ifdef was wrong and the patch below
booted. Wider battery of tests is queued but will take a few days to
complete.
---8<---
mm, slub: use migrate_disable() on PREEMPT_RT -fix
Signed-off-by: Mel Gorman <[email protected]>
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index caa206213e72..f0359b0c8154 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -119,7 +119,7 @@
* We could simply use migrate_disable()/enable() but as long as it's a
* function call even on !PREEMPT_RT, use inline preempt_disable() there.
*/
-#ifdef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPT_RT
#define slub_get_cpu_ptr(var) get_cpu_ptr(var)
#define slub_put_cpu_ptr(var) put_cpu_ptr(var)
#else
--
Mel Gorman
SUSE Labs
On 6/14/21 1:16 PM, Sebastian Andrzej Siewior wrote:
> On 2021-06-14 13:07:14 [+0200], Vlastimil Babka wrote:
>> > +#ifdef CONFIG_PREEMPT_RT
>> > +#define slub_get_cpu_ptr(var) get_cpu_ptr(var)
>> > +#define slub_put_cpu_ptr(var) put_cpu_ptr(var)
>>
>> After Mel's report and bisect pointing to this patch, I realized I got the
>> #ifdef wrong and it should be #ifnded
>
> So if you got the ifdef wrong (and kept everything as-is) then you
> tested the RT version on !RT. migrate_disable() behaves on !RT as on RT.
> As per changelog you don't use migrate_disable() unconditionally because
> it increases the overhead on !RT.
Correct.
> I haven't looked at the series and I have just this tiny question: why
> did migrate_disable() crash for Mel on !RT and why do you expect that it
> does not happen on PREEMPT_RT?
Right, so it's because __slab_alloc() has this optimization to avoid re-reading
'c' in case there is no preemption enabled at all (or it's just voluntary).
#ifdef CONFIG_PREEMPTION
/*
* We may have been preempted and rescheduled on a different
* cpu before disabling preemption. Need to reload cpu area
* pointer.
*/
c = slub_get_cpu_ptr(s->cpu_slab);
#endif
Mel's config has CONFIG_PREEMPT_VOLUNTARY, which means CONFIG_PREEMPTION is not
enabled.
But then later in ___slab_alloc() we have
slub_put_cpu_ptr(s->cpu_slab);
page = new_slab(s, gfpflags, node);
c = slub_get_cpu_ptr(s->cpu_slab);
And this is not hidden under CONFIG_PREEMPTION, so with the #ifdef bug the
slub_put_cpu_ptr did a migrate_enable() with Mel's config, without prior
migrate_disable().
If there wasn't the #ifdef PREEMPT_RT bug:
- this slub_put_cpu_ptr() would translate to put_cpu_ptr() thus
preempt_enable(), which on this config is just a barrier(), so it doesn't matter
that there was no matching preempt_disable() before.
- with PREEMPT_RT the CONFIG_PREEMPTION would be enabled, so the
slub_get_cpu_ptr() would do a migrate_disable() and there's no imbalance.
But now that I dig into this in detail, I can see there might be another
instance of this imbalance bug, if CONFIG_PREEMPTION is disabled, but
CONFIG_PREEMPT_COUNT is enabled, which seems to be possible in some debug
scenarios. Because then preempt_disable()/preempt_enable() still manipulate the
preempt counter and compiling them out in __slab_alloc() will cause imbalance.
So I think the guards in __slab_alloc() should be using CONFIG_PREEMPT_COUNT
instead of CONFIG_PREEMPT to be correct on all configs. I dare not remove them
completely :)
On 6/14/21 1:33 PM, Vlastimil Babka wrote:
> On 6/14/21 1:16 PM, Sebastian Andrzej Siewior wrote:
> But now that I dig into this in detail, I can see there might be another
> instance of this imbalance bug, if CONFIG_PREEMPTION is disabled, but
> CONFIG_PREEMPT_COUNT is enabled, which seems to be possible in some debug
> scenarios. Because then preempt_disable()/preempt_enable() still manipulate the
> preempt counter and compiling them out in __slab_alloc() will cause imbalance.
>
> So I think the guards in __slab_alloc() should be using CONFIG_PREEMPT_COUNT
> instead of CONFIG_PREEMPT to be correct on all configs. I dare not remove them
> completely :)
Yep, it's possible to get such scenario with PREEMPT_VOLUNTARY plus
PROVE_LOCKING - CONFIG_PREEMPTION is disabled but CONFIG_PREEMPT_COUNT is
enabled, and RCU then complains in the page allocator due to the unpaired
preempt_disable() before entering it.
I've pushed a new branch revision with this fixed:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r3
On 2021-06-14 13:33:43 [+0200], Vlastimil Babka wrote:
> On 6/14/21 1:16 PM, Sebastian Andrzej Siewior wrote:
> > I haven't looked at the series and I have just this tiny question: why
> > did migrate_disable() crash for Mel on !RT and why do you expect that it
> > does not happen on PREEMPT_RT?
>
> Right, so it's because __slab_alloc() has this optimization to avoid re-reading
> 'c' in case there is no preemption enabled at all (or it's just voluntary).
>
> #ifdef CONFIG_PREEMPTION
> /*
> * We may have been preempted and rescheduled on a different
> * cpu before disabling preemption. Need to reload cpu area
> * pointer.
> */
> c = slub_get_cpu_ptr(s->cpu_slab);
> #endif
>
> Mel's config has CONFIG_PREEMPT_VOLUNTARY, which means CONFIG_PREEMPTION is not
> enabled.
>
> But then later in ___slab_alloc() we have
>
> slub_put_cpu_ptr(s->cpu_slab);
> page = new_slab(s, gfpflags, node);
> c = slub_get_cpu_ptr(s->cpu_slab);
>
> And this is not hidden under CONFIG_PREEMPTION, so with the #ifdef bug the
> slub_put_cpu_ptr did a migrate_enable() with Mel's config, without prior
> migrate_disable().
Ach, right. The update to this field is done with cmpxchg-double (if I
remember correctly) but I don't remember if this is also re-entry safe.
> If there wasn't the #ifdef PREEMPT_RT bug:
> - this slub_put_cpu_ptr() would translate to put_cpu_ptr() thus
> preempt_enable(), which on this config is just a barrier(), so it doesn't matter
> that there was no matching preempt_disable() before.
> - with PREEMPT_RT the CONFIG_PREEMPTION would be enabled, so the
> slub_get_cpu_ptr() would do a migrate_disable() and there's no imbalance.
>
> But now that I dig into this in detail, I can see there might be another
> instance of this imbalance bug, if CONFIG_PREEMPTION is disabled, but
> CONFIG_PREEMPT_COUNT is enabled, which seems to be possible in some debug
> scenarios. Because then preempt_disable()/preempt_enable() still manipulate the
> preempt counter and compiling them out in __slab_alloc() will cause imbalance.
>
> So I think the guards in __slab_alloc() should be using CONFIG_PREEMPT_COUNT
> instead of CONFIG_PREEMPT to be correct on all configs. I dare not remove them
> completely :)
:)
Sebastian
On 2021-06-09 13:39:00 [+0200], Vlastimil Babka wrote:
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -353,18 +353,35 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
> /*
> * Per slab locking using the pagelock
> */
> -static __always_inline void slab_lock(struct page *page)
> +static __always_inline void
> +__slab_lock(struct page *page, unsigned long *flags, bool disable_irqs)
> {
> VM_BUG_ON_PAGE(PageTail(page), page);
> + if (disable_irqs)
> + local_irq_save(*flags);
> bit_spin_lock(PG_locked, &page->flags);
> }
>
> -static __always_inline void slab_unlock(struct page *page)
> +static __always_inline void
> +__slab_unlock(struct page *page, unsigned long *flags, bool disable_irqs)
> {
> VM_BUG_ON_PAGE(PageTail(page), page);
> + if (disable_irqs)
> + local_irq_restore(*flags);
> __bit_spin_unlock(PG_locked, &page->flags);
> }
You should first unlock then enable IRQs.
Sebastian
I replaced my slub changes with slub-local-lock-v2r3.
I haven't seen any complains from lockdep or so which is good. Then I
did this with RT enabled (and no debug):
- A "time make -j32" run of allmodconfig on /dev/shm.
Old:
| real 20m6,217s
| user 568m22,553s
| sys 48m33,126s
New:
| real 20m9,049s
| user 569m32,096s
| sys 48m47,670s
These 3 seconds here are probably in the noise range.
- perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
Old:
| 464.967,20 msec task-clock # 27,220 CPUs utilized ( +- 0,16% )
| 7.683.944 context-switches # 0,017 M/sec ( +- 0,86% )
| 931.380 cpu-migrations # 0,002 M/sec ( +- 4,94% )
| 219.569 page-faults # 0,472 K/sec ( +- 0,39% )
| 1.104.727.599.918 cycles # 2,376 GHz ( +- 0,18% )
| 941.428.898.087 stalled-cycles-frontend # 85,22% frontend cycles idle ( +- 0,24% )
| 729.016.546.572 stalled-cycles-backend # 65,99% backend cycles idle ( +- 0,32% )
| 340.133.571.519 instructions # 0,31 insn per cycle
| # 2,77 stalled cycles per insn ( +- 0,12% )
| 73.746.821.314 branches # 158,607 M/sec ( +- 0,13% )
| 377.838.006 branch-misses # 0,51% of all branches ( +- 1,01% )
|
| 17,0820 +- 0,0202 seconds time elapsed ( +- 0,12% )
New:
| 422.865,71 msec task-clock # 4,782 CPUs utilized ( +- 0,34% )
| 14.594.238 context-switches # 0,035 M/sec ( +- 0,43% )
| 3.737.926 cpu-migrations # 0,009 M/sec ( +- 0,46% )
| 218.474 page-faults # 0,517 K/sec ( +- 0,74% )
| 940.715.812.020 cycles # 2,225 GHz ( +- 0,34% )
| 716.593.827.820 stalled-cycles-frontend # 76,18% frontend cycles idle ( +- 0,39% )
| 550.730.862.839 stalled-cycles-backend # 58,54% backend cycles idle ( +- 0,43% )
| 417.274.588.907 instructions # 0,44 insn per cycle
| # 1,72 stalled cycles per insn ( +- 0,17% )
| 92.814.150.290 branches # 219,488 M/sec ( +- 0,17% )
| 822.102.170 branch-misses # 0,89% of all branches ( +- 0,41% )
|
| 88,427 +- 0,618 seconds time elapsed ( +- 0,70% )
So this is outside of the noise range.
I'm not sure where this is coming from. My guess would be higher lock
contention within the memory allocator.
> The remaining patches to upstream from the RT tree are small ones related to
> KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
> perhaps be re-evaluated as the series also addresses some latency issues with
> percpu partial slabs.
With that series the PARTIAL slab can be indeed enabled. I have (had) a
half done series where I had PARTIAL enabled and noticed a slight
increase in latency so made it "default y on !RT". It wasn't dramatic
but appeared to be outside of noise.
Sebastian
On 7/2/21 8:29 PM, Sebastian Andrzej Siewior wrote:
> I replaced my slub changes with slub-local-lock-v2r3.
> I haven't seen any complains from lockdep or so which is good. Then I
> did this with RT enabled (and no debug):
Thanks for testing!
> - A "time make -j32" run of allmodconfig on /dev/shm.
> Old:
> | real 20m6,217s
> | user 568m22,553s
> | sys 48m33,126s
>
> New:
> | real 20m9,049s
> | user 569m32,096s
> | sys 48m47,670s
>
> These 3 seconds here are probably in the noise range.
>
> - perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
> Old:
> | 464.967,20 msec task-clock # 27,220 CPUs utilized ( +- 0,16% )
> | 7.683.944 context-switches # 0,017 M/sec ( +- 0,86% )
> | 931.380 cpu-migrations # 0,002 M/sec ( +- 4,94% )
> | 219.569 page-faults # 0,472 K/sec ( +- 0,39% )
> | 1.104.727.599.918 cycles # 2,376 GHz ( +- 0,18% )
> | 941.428.898.087 stalled-cycles-frontend # 85,22% frontend cycles idle ( +- 0,24% )
> | 729.016.546.572 stalled-cycles-backend # 65,99% backend cycles idle ( +- 0,32% )
> | 340.133.571.519 instructions # 0,31 insn per cycle
> | # 2,77 stalled cycles per insn ( +- 0,12% )
> | 73.746.821.314 branches # 158,607 M/sec ( +- 0,13% )
> | 377.838.006 branch-misses # 0,51% of all branches ( +- 1,01% )
> |
> | 17,0820 +- 0,0202 seconds time elapsed ( +- 0,12% )
>
> New:
> | 422.865,71 msec task-clock # 4,782 CPUs utilized ( +- 0,34% )
> | 14.594.238 context-switches # 0,035 M/sec ( +- 0,43% )
> | 3.737.926 cpu-migrations # 0,009 M/sec ( +- 0,46% )
> | 218.474 page-faults # 0,517 K/sec ( +- 0,74% )
> | 940.715.812.020 cycles # 2,225 GHz ( +- 0,34% )
> | 716.593.827.820 stalled-cycles-frontend # 76,18% frontend cycles idle ( +- 0,39% )
> | 550.730.862.839 stalled-cycles-backend # 58,54% backend cycles idle ( +- 0,43% )
> | 417.274.588.907 instructions # 0,44 insn per cycle
> | # 1,72 stalled cycles per insn ( +- 0,17% )
> | 92.814.150.290 branches # 219,488 M/sec ( +- 0,17% )
> | 822.102.170 branch-misses # 0,89% of all branches ( +- 0,41% )
> |
> | 88,427 +- 0,618 seconds time elapsed ( +- 0,70% )
>
> So this is outside of the noise range.
> I'm not sure where this is coming from. My guess would be higher lock
> contention within the memory allocator.
The series shouldn't significantly change the memory allocator
interaction, though.
Seems there's less cycles, but more time elapsed, thus more sleeping -
is it locks becoming mutexes on RT?
My first guess - the last, local_lock patch. What would happen if you
take that one out? Should be still RT-compatible. If it improves a lot,
maybe that conversion to local_lock is not worth it then.
My second guess - list_lock remains spinlock with my series, thus RT
mutex, but the current RT tree converts it to raw_spinlock. I'd hope
leaving that one as non-raw spinlock would still be much better for RT
goals, even if hackbench (which is AFAIK very slab intensive) throughput
regresses - hopefully not that much.
>> The remaining patches to upstream from the RT tree are small ones related to
>> KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
>> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
>> perhaps be re-evaluated as the series also addresses some latency issues with
>> percpu partial slabs.
>
> With that series the PARTIAL slab can be indeed enabled. I have (had) a
> half done series where I had PARTIAL enabled and noticed a slight
> increase in latency so made it "default y on !RT". It wasn't dramatic
> but appeared to be outside of noise.
>
> Sebastian
>
On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> I replaced my slub changes with slub-local-lock-v2r3.
> I haven't seen any complains from lockdep or so which is good. Then I
> did this with RT enabled (and no debug):
Below is some raw hackbench data from my little i4790 desktop box. It
says we'll definitely still want list_lock to be raw.
It also appears to be saying that there's something RT specific to
stare at in addition to the list_lock business, but add a pinch of salt
to that due to the config of the virgin(ish) tip tree being much
lighter than the enterprise(ish) config of the tip-rt tree.
perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed
5.13.0.g60ab3ed-tip-rt
8,898.51 msec task-clock # 7.525 CPUs utilized ( +- 0.33% )
368,922 context-switches # 0.041 M/sec ( +- 5.20% )
42,281 cpu-migrations # 0.005 M/sec ( +- 5.28% )
13,180 page-faults # 0.001 M/sec ( +- 0.70% )
33,343,378,867 cycles # 3.747 GHz ( +- 0.30% )
21,656,783,887 instructions # 0.65 insn per cycle ( +- 0.67% )
4,408,569,663 branches # 495.428 M/sec ( +- 0.73% )
12,040,125 branch-misses # 0.27% of all branches ( +- 2.93% )
1.18260 +- 0.00473 seconds time elapsed ( +- 0.40% )
1.19018 +- 0.00441 seconds time elapsed ( +- 0.37% ) (repeat)
1.18260 +- 0.00473 seconds time elapsed ( +- 0.40% ) (repeat)
5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3 list_lock=raw_spinlock_t
9,642.00 msec task-clock # 7.521 CPUs utilized ( +- 0.46% )
462,091 context-switches # 0.048 M/sec ( +- 4.79% )
44,411 cpu-migrations # 0.005 M/sec ( +- 4.34% )
12,980 page-faults # 0.001 M/sec ( +- 0.43% )
36,098,859,429 cycles # 3.744 GHz ( +- 0.44% )
25,462,853,462 instructions # 0.71 insn per cycle ( +- 0.50% )
5,260,898,360 branches # 545.623 M/sec ( +- 0.52% )
16,088,686 branch-misses # 0.31% of all branches ( +- 2.02% )
1.28207 +- 0.00568 seconds time elapsed ( +- 0.44% )
1.28744 +- 0.00713 seconds time elapsed ( +- 0.55% ) (repeat)
1.28085 +- 0.00850 seconds time elapsed ( +- 0.66% ) (repeat)
5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3 list_lock=spinlock_t
10,004.89 msec task-clock # 6.029 CPUs utilized ( +- 1.37% )
654,311 context-switches # 0.065 M/sec ( +- 5.16% )
211,070 cpu-migrations # 0.021 M/sec ( +- 1.38% )
13,262 page-faults # 0.001 M/sec ( +- 0.79% )
36,585,914,931 cycles # 3.657 GHz ( +- 1.35% )
27,682,240,511 instructions # 0.76 insn per cycle ( +- 1.06% )
5,766,064,432 branches # 576.325 M/sec ( +- 1.11% )
24,269,069 branch-misses # 0.42% of all branches ( +- 2.03% )
1.6595 +- 0.0116 seconds time elapsed ( +- 0.70% )
1.6270 +- 0.0180 seconds time elapsed ( +- 1.11% ) (repeat)
1.6213 +- 0.0150 seconds time elapsed ( +- 0.93% ) (repeat)
virgin(ish) tip
5.13.0.g60ab3ed-tip
7,320.67 msec task-clock # 7.792 CPUs utilized ( +- 0.31% )
221,215 context-switches # 0.030 M/sec ( +- 3.97% )
16,234 cpu-migrations # 0.002 M/sec ( +- 4.07% )
13,233 page-faults # 0.002 M/sec ( +- 0.91% )
27,592,205,252 cycles # 3.769 GHz ( +- 0.32% )
8,309,495,040 instructions # 0.30 insn per cycle ( +- 0.37% )
1,555,210,607 branches # 212.441 M/sec ( +- 0.42% )
5,484,209 branch-misses # 0.35% of all branches ( +- 2.13% )
0.93949 +- 0.00423 seconds time elapsed ( +- 0.45% )
0.94608 +- 0.00384 seconds time elapsed ( +- 0.41% ) (repeat)
0.94422 +- 0.00410 seconds time elapsed ( +- 0.43% )
5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
7,343.57 msec task-clock # 7.776 CPUs utilized ( +- 0.44% )
223,044 context-switches # 0.030 M/sec ( +- 3.02% )
16,057 cpu-migrations # 0.002 M/sec ( +- 4.03% )
13,164 page-faults # 0.002 M/sec ( +- 0.97% )
27,684,906,017 cycles # 3.770 GHz ( +- 0.45% )
8,323,273,871 instructions # 0.30 insn per cycle ( +- 0.28% )
1,556,106,680 branches # 211.901 M/sec ( +- 0.31% )
5,463,468 branch-misses # 0.35% of all branches ( +- 1.33% )
0.94440 +- 0.00352 seconds time elapsed ( +- 0.37% )
0.94830 +- 0.00228 seconds time elapsed ( +- 0.24% ) (repeat)
0.93813 +- 0.00440 seconds time elapsed ( +- 0.47% ) (repeat)
On Sat, 2021-07-03 at 09:24 +0200, Mike Galbraith wrote:
>
> It also appears to be saying that there's something RT specific to
> stare at in addition to the list_lock business.
The what is ___slab_alloc() consuming 3.9% CPU in tip-rt-slub whereas
it consumes < 1% in both tip-rt (sans slub patches) and tip-slub.
The why remains to ponder.
5.13.0.g60ab3ed-tip-rt 5.13.0.g60ab3ed-tip-rt-slub 5.13.0.g60ab3ed-tip-slub
25.18% copy_user_enhanced_fast_string copy_user_enhanced_fast_string copy_user_enhanced_fast_string
5.08% unix_stream_read_generic unix_stream_read_generic unix_stream_read_generic
3.39% rt_spin_lock *** ___slab_alloc *** __skb_datagram_iter
2.80% __skb_datagram_iter rt_spin_lock _raw_spin_lock
2.11% get_page_from_freelist __skb_datagram_iter __alloc_skb
2.01% skb_release_data rt_spin_unlock skb_release_data
1.94% rt_spin_unlock get_page_from_freelist __alloc_pages
1.85% __alloc_skb migrate_enable unix_stream_sendmsg
1.68% __schedule skb_release_data _raw_spin_lock_irqsave
1.67% unix_stream_sendmsg __schedule free_pcppages_bulk
1.50% free_pcppages_bulk unix_stream_sendmsg __slab_free
1.38% migrate_enable free_pcppages_bulk __fget_light
1.24% __fget_light __alloc_pages vfs_write
1.16% __slab_free migrate_disable __schedule
1.14% __alloc_pages __fget_light get_page_from_freelist
1.10% fsnotify __slab_free new_sync_write
1.07% kfree fsnotify fsnotify
5.13.0.g60ab3ed-tip-rt-slub ___slab_alloc() consumes 3.90%
0.40 │ mov 0x28(%r13),%edx
0.42 │ add %r15,%rdx
│ __swab():
│ #endif
│
│ static __always_inline unsigned long __swab(const unsigned long y)
│ {
│ #if __BITS_PER_LONG == 64
│ return __swab64(y);
0.05 │ mov %rdx,%rax
1.14 │ bswap %rax
│ freelist_ptr():
│ return (void *)((unsigned long)ptr ^ s->random ^ <== CONFIG_SLAB_FREELIST_HARDENED
0.72 │ xor 0xb0(%r13),%rax
65.41 │ xor (%rdx),%rax <== huh? miss = 65% of that 3.9% kernel util?
│ next_tid():
│ return tid + TID_STEP;
0.09 │ addq $0x200,0x48(%r12)
│ ___slab_alloc():
│ * freelist is pointing to the list of objects to be used.
│ * page is pointing to the page from which the objects are obtained.
│ * That page must be frozen for per cpu allocations to work.
│ */
│ VM_BUG_ON(!c->page->frozen);
│ c->freelist = get_freepointer(s, freelist);
0.05 │ mov %rax,0x40(%r12)
│ c->tid = next_tid(c->tid);
│ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
5.13.0.g60ab3ed-tip-rt ___slab_alloc() consumes < 1%
Percent│ }
│
│ /* must check again c->freelist in case of cpu migration or IRQ */
│ freelist = c->freelist;
0.02 │ a1: mov (%r14),%r13
│ if (freelist)
│ test %r13,%r13
0.02 │ ↓ je 460
│ get_freepointer():
│ return freelist_dereference(s, object + s->offset);
0.23 │ ad: mov 0x28(%r12),%edx
0.18 │ add %r13,%rdx
│ __swab():
│ #endif
│
│ static __always_inline unsigned long __swab(const unsigned long y)
│ {
│ #if __BITS_PER_LONG == 64
│ return __swab64(y);
0.06 │ mov %rdx,%rax
1.16 │ bswap %rax
│ freelist_ptr():
│ return (void *)((unsigned long)ptr ^ s->random ^
0.23 │ xor 0xb0(%r12),%rax
35.25 │ xor (%rdx),%rax <== 35% of < 1% kernel util
│ next_tid():
│ return tid + TID_STEP;
0.28 │ addq $0x200,0x8(%r14)
│ ___slab_alloc():
│ * freelist is pointing to the list of objects to be used.
│ * page is pointing to the page from which the objects are obtained.
│ * That page must be frozen for per cpu allocations to work.
│ */
│ VM_BUG_ON(!c->page->frozen);
│ c->freelist = get_freepointer(s, freelist);
5.13.0.g60ab3ed-tip-slub ___slab_alloc() also consumes < 1%
Percent│ load_freelist:
│
│ lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
0.28 │ 84: add this_cpu_off,%rax
│ get_freepointer():
│ return freelist_dereference(s, object + s->offset);
0.14 │ mov 0x28(%r14),%eax
│ ___slab_alloc():
│ * freelist is pointing to the list of objects to be used.
│ * page is pointing to the page from which the objects are obtained.
│ * That page must be frozen for per cpu allocations to work.
│ */
│ VM_BUG_ON(!c->page->frozen);
│ c->freelist = get_freepointer(s, freelist);
34.36 │ mov 0x0(%r13,%rax,1),%rax
│ next_tid():
│ return tid + TID_STEP;
0.10 │ addq $0x1,0x8(%r12)
│ ___slab_alloc():
│ c->freelist = get_freepointer(s, freelist);
0.04 │ mov %rax,(%r12)
│ c->tid = next_tid(c->tid);
│ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
0.12 │ mov (%r14),%rax
0.03 │ add this_cpu_off,%rax
│ arch_local_irq_restore():
│ return arch_irqs_disabled_flags(flags);
│ }
On Sat, 2021-07-03 at 17:47 +0200, Mike Galbraith wrote:
> On Sat, 2021-07-03 at 09:24 +0200, Mike Galbraith wrote:
> > It also appears to be saying that there's something RT specific to
> > stare at in addition to the list_lock business.
>
> The what is ___slab_alloc() consuming 3.9% CPU in tip-rt-slub whereas
> it consumes < 1% in both tip-rt (sans slub patches) and tip-slub.
>
> The why remains to ponder.
Ignoring odd distribution in the profile (red herring whackable via
prefetch games), and removing perf/debug options from the picture, the
remaining RT specific cost is just that of forced slow path. For
hackbench in my little box, it adds up to right at 3%.
-Mike
On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
>
> > The remaining patches to upstream from the RT tree are small ones
> > related to KConfig. The patch that restricts PREEMPT_RT to SLUB
> > (not SLAB or SLOB) makes sense. The patch that disables
> > CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could perhaps be re-
> > evaluated as the series also addresses some latency issues with
> > percpu partial slabs.
>
> With that series the PARTIAL slab can be indeed enabled. I have (had)
> a half done series where I had PARTIAL enabled and noticed a slight
> increase in latency so made it "default y on !RT". It wasn't dramatic
> but appeared to be outside of noise.
I'm seeing warnings/explosions while exercising box IFF PARTIAL slab
thingy is enabled. I aborted -PARTIAL after a little over 4 hours,
whereas the longest survival of 4 +PARTIAL runs was 50 minutes, so I'm
fairly confident that PARTIAL really really is the trigger.
My reproducer is a keep box busy loop of parallel kbuild along with
alternating ltp::controllers (sans swap-from-hell bits), ltp::zram and
hackbench.
The warning may or may not trigger before box explodes.
[ 224.232200] ------------[ cut here ]------------
[ 224.232202] WARNING: CPU: 3 PID: 9112 at mm/slub.c:2018 ___slab_alloc.constprop.100+0xb09/0x13a0
[ 224.232210] Modules linked in: sr_mod(E) cdrom(E) zram(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) xfs(E) libcrc32c(E) af_packet(E) ip6table_mangle(E) ip6table_raw(E) iptable_raw(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nfnetlink(E) ebtable_filter(E) rfkill(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) bpfilter(E) joydev(E) usblp(E) intel_rapl_msr(E) intel_rapl_common(E) nls_iso8859_1(E) nls_cp437(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) snd_hda_intel(E) aesni_intel(E) snd_intel_dspcfg(E) crypto_simd(E) iTCO_wdt(E) at24(E) intel_pmc_bxt(E) regmap_i2c(E) mei_hdcp(E) iTCO_vendor_support(E) cryptd(E) snd_hda_codec(E) snd_hwdep(E) pcspkr(E) i2c_i801(E) snd_hda_core(E) i2c_smbus(E) r8169(E) snd_pcm(E) realtek(E) snd_timer(E)
[ 224.232253] mei_me(E) mdio_devres(E) lpc_ich(E) snd(E) libphy(E) mei(E) soundcore(E) mfd_core(E) fan(E) thermal(E) nfsd(E) intel_smartconnect(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sch_fq_codel(E) sunrpc(E) fuse(E) configfs(E) hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) wmi(E) drm_ttm_helper(E) ttm(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) ahci(E) rc_core(E) xhci_pci(E) ehci_pci(E) libahci(E) xhci_hcd(E) ehci_hcd(E) drm(E) libata(E) usbcore(E) video(E) button(E) sd_mod(E) t10_pi(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) virtio_ring(E) virtio(E) ext4(E) crc32c_intel(E) crc16(E) mbcache(E) jbd2(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) msr(E) autofs4(E)
[ 224.232295] CPU: 3 PID: 9112 Comm: dd Kdump: loaded Tainted: G E 5.13.0.g60ab3ed-tip-rt-slub #25
[ 224.232297] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 224.232299] RIP: 0010:___slab_alloc.constprop.100+0xb09/0x13a0
[ 224.232302] Code: 48 8b 7d 88 4c 89 ee 8b 47 28 4c 01 f8 48 89 c2 48 0f ca 48 33 97 b8 00 00 00 48 33 10 e8 cf e5 ff ff e9 b5 00 00 00 4d 89 ce <0f> 0b e9 2d fa ff ff 8b 53 38 8b 05 8f 47 4a 01 48 8b 75 80 83 c2
[ 224.232303] RSP: 0018:ffff8882047e3ba0 EFLAGS: 00010046
[ 224.232305] RAX: ffff888100040f80 RBX: 0000000000000000 RCX: 0000000080190019
[ 224.232307] RDX: ffffea0005d59a08 RSI: 0000000000000000 RDI: ffffffff8177ab54
[ 224.232308] RBP: ffff8882047e3c70 R08: ffffffffffffffff R09: 0000000000000000
[ 224.232309] R10: ffff8882047e3c90 R11: 0000000000000000 R12: ffffea0004d60780
[ 224.232310] R13: 0000000000000019 R14: 0000000000000000 R15: 0000000080190019
[ 224.232311] FS: 00007f88ffa09580(0000) GS:ffff88840ecc0000(0000) knlGS:0000000000000000
[ 224.232312] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.232313] CR2: 00007f88ffa4a000 CR3: 000000015606a005 CR4: 00000000001706e0
[ 224.232315] Call Trace:
[ 224.232318] ? __alloc_file+0x26/0xf0
[ 224.232322] ? prep_new_page+0xab/0x100
[ 224.232326] ? __alloc_file+0x26/0xf0
[ 224.232329] ? __slab_alloc.isra.87.constprop.99+0x2e/0x40
[ 224.232331] __slab_alloc.isra.87.constprop.99+0x2e/0x40
[ 224.232333] ? __alloc_file+0x26/0xf0
[ 224.232335] kmem_cache_alloc+0x4d/0x160
[ 224.232338] __alloc_file+0x26/0xf0
[ 224.232340] alloc_empty_file+0x43/0xe0
[ 224.232343] path_openat+0x35/0xe30
[ 224.232345] ? getname_flags+0x32/0x170
[ 224.232347] ? ___slab_alloc.constprop.100+0xbbb/0x13a0
[ 224.232349] ? filemap_map_pages+0xf0/0x3e0
[ 224.232352] do_filp_open+0xa2/0x100
[ 224.232355] ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[ 224.232357] ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[ 224.232359] ? getname_flags+0x32/0x170
[ 224.232361] ? alloc_fd+0xe2/0x1b0
[ 224.232363] ? do_sys_openat2+0x248/0x310
[ 224.232365] do_sys_openat2+0x248/0x310
[ 224.232368] do_sys_open+0x47/0x60
[ 224.232370] do_syscall_64+0x37/0x80
[ 224.232373] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 224.232376] RIP: 0033:0x7f88ff54ab41
[ 224.232378] Code: 41 83 e2 40 48 89 54 24 30 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 4c 24 18 64 48 33 0c 25 28 00 00 00
[ 224.232379] RSP: 002b:00007ffc0916a720 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[ 224.232381] RAX: ffffffffffffffda RBX: 0000558d793efcf0 RCX: 00007f88ff54ab41
[ 224.232382] RDX: 0000000000080000 RSI: 0000558d793efcb0 RDI: 00000000ffffff9c
[ 224.232383] RBP: 00007ffc0916a860 R08: 0000000000000000 R09: 00007ffc0916a873
[ 224.232384] R10: 0000000000000000 R11: 0000000000000287 R12: 0000558d793efa88
[ 224.232385] R13: 000000000000000b R14: 0000558d793efa60 R15: 000000000000000b
[ 224.232387] ---[ end trace 0000000000000002 ]---
...
[ 2010.420941] general protection fault, maybe for address 0xffffea00054ca398: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 2010.420946] CPU: 4 PID: 17541 Comm: dd Kdump: loaded Tainted: G W E 5.13.0.g60ab3ed-tip-rt-slub #25
[ 2010.420948] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 2010.420949] RIP: 0010:___slab_alloc.constprop.100+0x52b/0x13a0
[ 2010.420955] Code: b7 44 24 2a 31 db 66 25 ff 7f 66 89 45 b8 48 8b 45 88 80 4d bb 80 48 8b 4d b8 f6 40 0b 40 0f 84 5b 02 00 00 48 89 f0 4c 89 fa <f0> 49 0f c7 4c 24 20 0f 84 8f 01 00 00 4d 89 ce f3 90 48 8b b5 30
[ 2010.420957] RSP: 0018:ffff888102cf7ba0 EFLAGS: 00010002
[ 2010.420959] RAX: ffff888100047e00 RBX: 0000000000000000 RCX: 0000000080000000
[ 2010.420961] RDX: 0000000000000000 RSI: ffff888100047e00 RDI: ffff888100040f80
[ 2010.420962] RBP: ffff888102cf7c70 R08: ffffffffffffffff R09: 0000000000000000
[ 2010.420963] R10: ffff888102cf7c90 R11: 0000000000000000 R12: ffffea00054ca378
[ 2010.420964] R13: 0000000000000000 R14: 80000000000101f8 R15: 0000000000000000
[ 2010.420966] FS: 00007f9e8edbb580(0000) GS:ffff88840ed00000(0000) knlGS:0000000000000000
[ 2010.420967] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2010.420968] CR2: 00007f9e8edfa000 CR3: 0000000102ede004 CR4: 00000000001706e0
[ 2010.420970] Call Trace:
[ 2010.420973] ? __alloc_file+0x26/0xf0
[ 2010.420977] ? prep_new_page+0xab/0x100
[ 2010.420982] ? vm_area_alloc+0x1a/0x60
[ 2010.420984] ? __alloc_file+0x26/0xf0
[ 2010.420987] ? __slab_alloc.isra.87.constprop.99+0x2e/0x40
[ 2010.420989] __slab_alloc.isra.87.constprop.99+0x2e/0x40
[ 2010.420992] ? __alloc_file+0x26/0xf0
[ 2010.420994] kmem_cache_alloc+0x4d/0x160
[ 2010.420996] __alloc_file+0x26/0xf0
[ 2010.420999] alloc_empty_file+0x43/0xe0
[ 2010.421002] path_openat+0x35/0xe30
[ 2010.421004] ? getname_flags+0x32/0x170
[ 2010.421006] ? ___slab_alloc.constprop.100+0xbbb/0x13a0
[ 2010.421008] ? filemap_map_pages+0xf0/0x3e0
[ 2010.421012] do_filp_open+0xa2/0x100
[ 2010.421015] ? __handle_mm_fault+0x8b8/0xa40
[ 2010.421017] ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[ 2010.421019] ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[ 2010.421021] ? getname_flags+0x32/0x170
[ 2010.421023] ? alloc_fd+0xe2/0x1b0
[ 2010.421026] ? do_sys_openat2+0x248/0x310
[ 2010.421028] do_sys_openat2+0x248/0x310
[ 2010.421030] do_sys_open+0x47/0x60
[ 2010.421033] do_syscall_64+0x37/0x80
[ 2010.421035] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2010.421039] RIP: 0033:0x7f9e8e8fcb41
[ 2010.421041] Code: 41 83 e2 40 48 89 54 24 30 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 4c 24 18 64 48 33 0c 25 28 00 00 00
[ 2010.421043] RSP: 002b:00007ffe8964b7b0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[ 2010.421044] RAX: ffffffffffffffda RBX: 0000562e1e31b450 RCX: 00007f9e8e8fcb41
[ 2010.421045] RDX: 0000000000080000 RSI: 0000562e1e31b230 RDI: 00000000ffffff9c
[ 2010.421047] RBP: 00007ffe8964b8f0 R08: 0000000000000000 R09: 00007ffe8964b903
[ 2010.421048] R10: 0000000000000000 R11: 0000000000000287 R12: 0000562e1e31b400
[ 2010.421049] R13: 0000000000000009 R14: 0000562e1e31b3e0 R15: 0000000000000009
[ 2010.421051] Modules linked in: zram(E) sr_mod(E) cdrom(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) xfs(E) libcrc32c(E) af_packet(E) ip6table_mangle(E) ip6table_raw(E) iptable_raw(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nfnetlink(E) ebtable_filter(E) rfkill(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) bpfilter(E) joydev(E) usblp(E) intel_rapl_msr(E) intel_rapl_common(E) nls_iso8859_1(E) nls_cp437(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) snd_hda_intel(E) aesni_intel(E) snd_intel_dspcfg(E) crypto_simd(E) iTCO_wdt(E) at24(E) intel_pmc_bxt(E) regmap_i2c(E) mei_hdcp(E) iTCO_vendor_support(E) cryptd(E) snd_hda_codec(E) snd_hwdep(E) pcspkr(E) i2c_i801(E) snd_hda_core(E) i2c_smbus(E) r8169(E) snd_pcm(E) realtek(E) snd_timer(E)
[ 2010.421093] mei_me(E) mdio_devres(E) lpc_ich(E) snd(E) libphy(E) mei(E) soundcore(E) mfd_core(E) fan(E) thermal(E) nfsd(E) intel_smartconnect(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sch_fq_codel(E) sunrpc(E) fuse(E) configfs(E) hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) wmi(E) drm_ttm_helper(E) ttm(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) ahci(E) rc_core(E) xhci_pci(E) ehci_pci(E) libahci(E) xhci_hcd(E) ehci_hcd(E) drm(E) libata(E) usbcore(E) video(E) button(E) sd_mod(E) t10_pi(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) virtio_ring(E) virtio(E) ext4(E) crc32c_intel(E) crc16(E) mbcache(E) jbd2(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) msr(E) autofs4(E) [last unloaded: zram]
[ 2010.421138] Dumping ftrace buffer:
[ 2010.421141] (ftrace buffer empty)
Greetings,
On Wed, 2021-06-09 at 13:38 +0200, Vlastimil Babka wrote:
> @@ -3313,6 +3320,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s,
> gfp_t flags, size_t size,
> */
> c->tid = next_tid(c->tid);
>
> + local_irq_enable();
> +
> /*
> * Invoking slow path likely have side-effect
> * of re-populating per CPU c->freelist
This addition should have been followed by removal of the one in the
error path. At the end of the series, RT ends up doing a double
local_unlock_irq() instead of post ___slab_alloc() slub_put_cpu_ptr().
-Mike
On Mon, 2021-07-05 at 18:00 +0200, Mike Galbraith wrote:
> On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> >
> > > The remaining patches to upstream from the RT tree are small ones
> > > related to KConfig. The patch that restricts PREEMPT_RT to SLUB
> > > (not SLAB or SLOB) makes sense. The patch that disables
> > > CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could perhaps be re-
> > > evaluated as the series also addresses some latency issues with
> > > percpu partial slabs.
> >
> > With that series the PARTIAL slab can be indeed enabled. I have
> > (had)
> > a half done series where I had PARTIAL enabled and noticed a slight
> > increase in latency so made it "default y on !RT". It wasn't
> > dramatic
> > but appeared to be outside of noise.
>
> I'm seeing warnings/explosions while exercising box IFF PARTIAL slab
> thingy is enabled. I aborted -PARTIAL after a little over 4 hours,
> whereas the longest survival of 4 +PARTIAL runs was 50 minutes, so
> I'm fairly confident that PARTIAL really really is the trigger.
Resurrecting local exclusion around unfreeze_partials() seems to have
put an end to that. Guess I can chop these trees down now.
---
mm/slub.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2497,7 +2497,9 @@ static void put_cpu_partial(struct kmem_
* partial array is full. Move the existing
* set to the per node partial list.
*/
+ local_lock(&s->cpu_slab->lock);
unfreeze_partials(s);
+ local_unlock(&s->cpu_slab->lock);
oldpage = NULL;
pobjects = 0;
pages = 0;
@@ -2579,7 +2581,9 @@ static void flush_cpu_slab(struct work_s
if (c->page)
flush_slab(s, c, true);
+ local_lock(&s->cpu_slab->lock);
unfreeze_partials(s);
+ local_unlock(&s->cpu_slab->lock);
}
static bool has_cpu_slab(int cpu, struct kmem_cache *s)
@@ -3358,13 +3362,12 @@ static __always_inline void do_slab_free
* we need to take the local_lock. We shouldn't simply defer to
* __slab_free() as that wouldn't use the cpu freelist at all.
*/
- unsigned long flags;
void **freelist;
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock(&s->cpu_slab->lock);
c = this_cpu_ptr(s->cpu_slab);
if (unlikely(page != c->page)) {
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock(&s->cpu_slab->lock);
goto redo;
}
tid = c->tid;
@@ -3374,7 +3377,7 @@ static __always_inline void do_slab_free
c->freelist = head;
c->tid = next_tid(tid);
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock(&s->cpu_slab->lock);
#endif
stat(s, FREE_FASTPATH);
} else
@@ -3601,7 +3604,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
slab_want_init_on_alloc(flags, s));
return i;
error:
- local_unlock_irq(&s->cpu_slab->lock);
+ slub_put_cpu_ptr(s->cpu_slab);
slab_post_alloc_hook(s, objcg, flags, i, p, false);
__kmem_cache_free_bulk(s, i, p);
return 0;
On 7/3/21 9:24 AM, Mike Galbraith wrote:
> On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
>> I replaced my slub changes with slub-local-lock-v2r3.
>> I haven't seen any complains from lockdep or so which is good. Then I
>> did this with RT enabled (and no debug):
>
> Below is some raw hackbench data from my little i4790 desktop box. It
> says we'll definitely still want list_lock to be raw.
Hi Mike, thanks a lot for the testing, sorry for late reply.
Did you try, instead of raw list_lock, not applying the last, local lock
patch, as I suggested in reply to bigeasy? I think the impact at
reducing the RT-specific overhead would be larger (than raw list_lock),
the result should still be RT compatible, and it would also deal with
the bugs you found there... (which I'll look into).
Thanks,
Vlastimil
> It also appears to be saying that there's something RT specific to
> stare at in addition to the list_lock business, but add a pinch of salt
> to that due to the config of the virgin(ish) tip tree being much
> lighter than the enterprise(ish) config of the tip-rt tree.
>
> perf stat -r10 hackbench -s4096 -l500
> full warmup, record, repeat twice for elapsed
>
> 5.13.0.g60ab3ed-tip-rt
> 8,898.51 msec task-clock # 7.525 CPUs utilized ( +- 0.33% )
> 368,922 context-switches # 0.041 M/sec ( +- 5.20% )
> 42,281 cpu-migrations # 0.005 M/sec ( +- 5.28% )
> 13,180 page-faults # 0.001 M/sec ( +- 0.70% )
> 33,343,378,867 cycles # 3.747 GHz ( +- 0.30% )
> 21,656,783,887 instructions # 0.65 insn per cycle ( +- 0.67% )
> 4,408,569,663 branches # 495.428 M/sec ( +- 0.73% )
> 12,040,125 branch-misses # 0.27% of all branches ( +- 2.93% )
>
> 1.18260 +- 0.00473 seconds time elapsed ( +- 0.40% )
> 1.19018 +- 0.00441 seconds time elapsed ( +- 0.37% ) (repeat)
> 1.18260 +- 0.00473 seconds time elapsed ( +- 0.40% ) (repeat)
>
> 5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3 list_lock=raw_spinlock_t
> 9,642.00 msec task-clock # 7.521 CPUs utilized ( +- 0.46% )
> 462,091 context-switches # 0.048 M/sec ( +- 4.79% )
> 44,411 cpu-migrations # 0.005 M/sec ( +- 4.34% )
> 12,980 page-faults # 0.001 M/sec ( +- 0.43% )
> 36,098,859,429 cycles # 3.744 GHz ( +- 0.44% )
> 25,462,853,462 instructions # 0.71 insn per cycle ( +- 0.50% )
> 5,260,898,360 branches # 545.623 M/sec ( +- 0.52% )
> 16,088,686 branch-misses # 0.31% of all branches ( +- 2.02% )
>
> 1.28207 +- 0.00568 seconds time elapsed ( +- 0.44% )
> 1.28744 +- 0.00713 seconds time elapsed ( +- 0.55% ) (repeat)
> 1.28085 +- 0.00850 seconds time elapsed ( +- 0.66% ) (repeat)
>
> 5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3 list_lock=spinlock_t
> 10,004.89 msec task-clock # 6.029 CPUs utilized ( +- 1.37% )
> 654,311 context-switches # 0.065 M/sec ( +- 5.16% )
> 211,070 cpu-migrations # 0.021 M/sec ( +- 1.38% )
> 13,262 page-faults # 0.001 M/sec ( +- 0.79% )
> 36,585,914,931 cycles # 3.657 GHz ( +- 1.35% )
> 27,682,240,511 instructions # 0.76 insn per cycle ( +- 1.06% )
> 5,766,064,432 branches # 576.325 M/sec ( +- 1.11% )
> 24,269,069 branch-misses # 0.42% of all branches ( +- 2.03% )
>
> 1.6595 +- 0.0116 seconds time elapsed ( +- 0.70% )
> 1.6270 +- 0.0180 seconds time elapsed ( +- 1.11% ) (repeat)
> 1.6213 +- 0.0150 seconds time elapsed ( +- 0.93% ) (repeat)
>
> virgin(ish) tip
> 5.13.0.g60ab3ed-tip
> 7,320.67 msec task-clock # 7.792 CPUs utilized ( +- 0.31% )
> 221,215 context-switches # 0.030 M/sec ( +- 3.97% )
> 16,234 cpu-migrations # 0.002 M/sec ( +- 4.07% )
> 13,233 page-faults # 0.002 M/sec ( +- 0.91% )
> 27,592,205,252 cycles # 3.769 GHz ( +- 0.32% )
> 8,309,495,040 instructions # 0.30 insn per cycle ( +- 0.37% )
> 1,555,210,607 branches # 212.441 M/sec ( +- 0.42% )
> 5,484,209 branch-misses # 0.35% of all branches ( +- 2.13% )
>
> 0.93949 +- 0.00423 seconds time elapsed ( +- 0.45% )
> 0.94608 +- 0.00384 seconds time elapsed ( +- 0.41% ) (repeat)
> 0.94422 +- 0.00410 seconds time elapsed ( +- 0.43% )
>
> 5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
> 7,343.57 msec task-clock # 7.776 CPUs utilized ( +- 0.44% )
> 223,044 context-switches # 0.030 M/sec ( +- 3.02% )
> 16,057 cpu-migrations # 0.002 M/sec ( +- 4.03% )
> 13,164 page-faults # 0.002 M/sec ( +- 0.97% )
> 27,684,906,017 cycles # 3.770 GHz ( +- 0.45% )
> 8,323,273,871 instructions # 0.30 insn per cycle ( +- 0.28% )
> 1,556,106,680 branches # 211.901 M/sec ( +- 0.31% )
> 5,463,468 branch-misses # 0.35% of all branches ( +- 1.33% )
>
> 0.94440 +- 0.00352 seconds time elapsed ( +- 0.37% )
> 0.94830 +- 0.00228 seconds time elapsed ( +- 0.24% ) (repeat)
> 0.93813 +- 0.00440 seconds time elapsed ( +- 0.47% ) (repeat)
>
On Sun, 2021-07-18 at 09:41 +0200, Vlastimil Babka wrote:
> On 7/3/21 9:24 AM, Mike Galbraith wrote:
> > On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> > > I replaced my slub changes with slub-local-lock-v2r3.
> > > I haven't seen any complains from lockdep or so which is good.
> > > Then I
> > > did this with RT enabled (and no debug):
> >
> > Below is some raw hackbench data from my little i4790 desktop
> > box. It
> > says we'll definitely still want list_lock to be raw.
>
> Hi Mike, thanks a lot for the testing, sorry for late reply.
>
> Did you try, instead of raw list_lock, not applying the last, local
> lock patch, as I suggested in reply to bigeasy?
No, but I suppose I can give that a go.
-Mike
On Sun, 2021-07-18 at 10:29 +0200, Mike Galbraith wrote:
> On Sun, 2021-07-18 at 09:41 +0200, Vlastimil Babka wrote:
> > On 7/3/21 9:24 AM, Mike Galbraith wrote:
> > > On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior
> > > wrote:
> > > > I replaced my slub changes with slub-local-lock-v2r3.
> > > > I haven't seen any complains from lockdep or so which is good.
> > > > Then I
> > > > did this with RT enabled (and no debug):
> > >
> > > Below is some raw hackbench data from my little i4790 desktop
> > > box. It
> > > says we'll definitely still want list_lock to be raw.
> >
> > Hi Mike, thanks a lot for the testing, sorry for late reply.
> >
> > Did you try, instead of raw list_lock, not applying the last, local
> > lock patch, as I suggested in reply to bigeasy?
>
> No, but I suppose I can give that a go.
The kernel has of course moved forward, so measure 'em again, but
starting before replacing 5.12-rt patches with slub-local-lock-v2r3.
box is old i4790 desktop
perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed
Config has only SLUB+SLUB_DEBUG, as originally measured.
5.14.0.g79e92006-tip-rt (5.12-rt, 5.13-rt didn't exist when first measured)
7,984.52 msec task-clock # 7.565 CPUs utilized ( +- 0.66% )
353,566 context-switches # 44.281 K/sec ( +- 2.77% )
37,685 cpu-migrations # 4.720 K/sec ( +- 6.37% )
12,939 page-faults # 1.620 K/sec ( +- 0.67% )
29,901,079,227 cycles # 3.745 GHz ( +- 0.71% )
14,550,797,818 instructions # 0.49 insn per cycle ( +- 0.47% )
3,056,685,643 branches # 382.826 M/sec ( +- 0.51% )
9,598,083 branch-misses # 0.31% of all branches ( +- 2.11% )
1.05542 +- 0.00409 seconds time elapsed ( +- 0.39% )
1.05990 +- 0.00244 seconds time elapsed ( +- 0.23% ) (repeat)
1.05367 +- 0.00303 seconds time elapsed ( +- 0.29% ) (repeat)
5.14.0.g79e92006-tip-rt +slub-local-lock-v2r3 -0034-mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch
6,899.35 msec task-clock # 5.637 CPUs utilized ( +- 0.53% )
420,304 context-switches # 60.919 K/sec ( +- 2.83% )
187,130 cpu-migrations # 27.123 K/sec ( +- 1.81% )
13,206 page-faults # 1.914 K/sec ( +- 0.96% )
25,110,362,933 cycles # 3.640 GHz ( +- 0.49% )
15,853,643,635 instructions # 0.63 insn per cycle ( +- 0.64% )
3,366,261,524 branches # 487.910 M/sec ( +- 0.70% )
14,839,618 branch-misses # 0.44% of all branches ( +- 2.01% )
1.22390 +- 0.00744 seconds time elapsed ( +- 0.61% )
1.21813 +- 0.00907 seconds time elapsed ( +- 0.74% ) (repeat)
1.22097 +- 0.00952 seconds time elapsed ( +- 0.78% ) (repeat)
repeat of above with raw list_lock
8,072.62 msec task-clock # 7.605 CPUs utilized ( +- 0.49% )
359,514 context-switches # 44.535 K/sec ( +- 4.95% )
35,285 cpu-migrations # 4.371 K/sec ( +- 5.82% )
13,503 page-faults # 1.673 K/sec ( +- 0.96% )
30,247,989,681 cycles # 3.747 GHz ( +- 0.52% )
14,580,011,391 instructions # 0.48 insn per cycle ( +- 0.81% )
3,063,743,405 branches # 379.523 M/sec ( +- 0.85% )
8,907,160 branch-misses # 0.29% of all branches ( +- 3.99% )
1.06150 +- 0.00427 seconds time elapsed ( +- 0.40% )
1.05041 +- 0.00176 seconds time elapsed ( +- 0.17% ) (repeat)
1.06086 +- 0.00237 seconds time elapsed ( +- 0.22% ) (repeat)
5.14.0.g79e92006-rt3-tip-rt +slub-local-lock-v2r3 full set
7,598.44 msec task-clock # 5.813 CPUs utilized ( +- 0.85% )
488,161 context-switches # 64.245 K/sec ( +- 4.29% )
196,866 cpu-migrations # 25.909 K/sec ( +- 1.49% )
13,042 page-faults # 1.716 K/sec ( +- 0.73% )
27,695,116,746 cycles # 3.645 GHz ( +- 0.79% )
18,423,934,168 instructions # 0.67 insn per cycle ( +- 0.88% )
3,969,540,695 branches # 522.415 M/sec ( +- 0.92% )
15,493,482 branch-misses # 0.39% of all branches ( +- 2.15% )
1.30709 +- 0.00890 seconds time elapsed ( +- 0.68% )
1.3205 +- 0.0134 seconds time elapsed ( +- 1.02% ) (repeat)
1.3083 +- 0.0132 seconds time elapsed ( +- 1.01% ) (repeat)
now that I'm slowly catching up…
On 2021-07-02 22:25:05 [+0200], Vlastimil Babka wrote:
> > - perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
> > Old:
> > | 464.967,20 msec task-clock # 27,220 CPUs utilized ( +- 0,16% )
> > New:
> > | 422.865,71 msec task-clock # 4,782 CPUs utilized ( +- 0,34% )
>
> The series shouldn't significantly change the memory allocator
> interaction, though.
> Seems there's less cycles, but more time elapsed, thus more sleeping -
> is it locks becoming mutexes on RT?
yes, most likely since the !RT parts are mostly unchanged.
> My second guess - list_lock remains spinlock with my series, thus RT
> mutex, but the current RT tree converts it to raw_spinlock. I'd hope
> leaving that one as non-raw spinlock would still be much better for RT
> goals, even if hackbench (which is AFAIK very slab intensive) throughput
> regresses - hopefully not that much.
Yes, the list_lock seems to be the case. I picked your
slub-local-lock-v3r0 and changed the list_lock (+slab_lock()) to use
raw_spinlock_t and disable interrupts and CPUs utilisation went to
~23CPUs (plus a bunch of warnings which probably made it a little slower
again).
The difference between a sleeping lock (spinlock_t) and a mutex is
that we attempt not to preempt a task that acquired a spinlock_t even if
it is running for some time and the scheduler would preempt it (like it
would do if the task had a mutex acquired. These are the "lazy preempt"
bits in the RT patch).
By making the list_lock a raw_spinlock_t a lot of IRQ-flags dancing
needs to be done as the page-allocator must be entered with enabled
interrupts. And then there is the possibility that you may need to free
some memory even if you allocate memory which requires some extra steps
on RT due to the IRQ-off part. All this vanishes by keeping list_lock a
spinlock_t.
The kernel-build test on /dev/shm remained unchanged so that is good.
Unless there is a real-world use-case, that gets worse, I don't mind
keeping the spinlock_t here. I haven't seen tglx complaining so far.
Sebastian
On 7/29/21 3:49 PM, Sebastian Andrzej Siewior wrote:
> now that I'm slowly catching up…
>
> On 2021-07-02 22:25:05 [+0200], Vlastimil Babka wrote:
>> > - perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
>> > Old:
>> > | 464.967,20 msec task-clock # 27,220 CPUs utilized ( +- 0,16% )
>> > New:
>> > | 422.865,71 msec task-clock # 4,782 CPUs utilized ( +- 0,34% )
>>
>> The series shouldn't significantly change the memory allocator
>> interaction, though.
>> Seems there's less cycles, but more time elapsed, thus more sleeping -
>> is it locks becoming mutexes on RT?
>
> yes, most likely since the !RT parts are mostly unchanged.
>
>> My second guess - list_lock remains spinlock with my series, thus RT
>> mutex, but the current RT tree converts it to raw_spinlock. I'd hope
>> leaving that one as non-raw spinlock would still be much better for RT
>> goals, even if hackbench (which is AFAIK very slab intensive) throughput
>> regresses - hopefully not that much.
>
> Yes, the list_lock seems to be the case. I picked your
> slub-local-lock-v3r0 and changed the list_lock (+slab_lock()) to use
> raw_spinlock_t and disable interrupts and CPUs utilisation went to
> ~23CPUs (plus a bunch of warnings which probably made it a little slower
> again).
I forgot to point that out in the cover letter, but with v3 this change to
raw_spinlock_t is AFAICS no longer possible (at least with
CONFIG_SLUB_CPU_PARTIAL) because in put_cpu_partial() we now take the local_lock
and it can be called from get_partial_node() which takes the list_lock.
> The difference between a sleeping lock (spinlock_t) and a mutex is
> that we attempt not to preempt a task that acquired a spinlock_t even if
> it is running for some time and the scheduler would preempt it (like it
> would do if the task had a mutex acquired. These are the "lazy preempt"
> bits in the RT patch).
>
> By making the list_lock a raw_spinlock_t a lot of IRQ-flags dancing
> needs to be done as the page-allocator must be entered with enabled
> interrupts.
Hm but SLUB should never call the page allocator from under list_lock in my series?
> And then there is the possibility that you may need to free
> some memory even if you allocate memory which requires some extra steps
> on RT due to the IRQ-off part. All this vanishes by keeping list_lock a
> spinlock_t.
> The kernel-build test on /dev/shm remained unchanged so that is good.
> Unless there is a real-world use-case, that gets worse, I don't mind
> keeping the spinlock_t here. I haven't seen tglx complaining so far.
Good. IIRC hackbench is very close to being a slab microbenchmark, so
regressions there are expected, but should not translate to notable real world
regressions.
> Sebastian
>
On 2021-07-29 16:17:26 [+0200], Vlastimil Babka wrote:
> I forgot to point that out in the cover letter, but with v3 this change to
> raw_spinlock_t is AFAICS no longer possible (at least with
> CONFIG_SLUB_CPU_PARTIAL) because in put_cpu_partial() we now take the local_lock
> and it can be called from get_partial_node() which takes the list_lock.
I saw increased latency numbers with CONFIG_SLUB_CPU_PARTIAL before it
got disabled for other reasons so I'm not too sad if it remains
disabled.
> Hm but SLUB should never call the page allocator from under list_lock in my series?
oh yes. I run into CPU_PARTIAL instead. Sorry for the confusion. So
without PARTIAL it says "24,643 CPUs utilized" and no warnings :)
Sebastian