This series was inspired by Mel's pcplist local_lock rewrite, and also by an
interest in better understanding SLUB's locking and the implications of the
new locking primitives and their RT variants. It should make SLUB more
preemption-friendly, especially for RT, hopefully without noticeable
regressions, as the fast paths are not affected.
The series is based on 5.13-rc3 and is also available as a git branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v1r9
It received some light stability testing and also basic performance screening
(thanks Mel) that didn't show major regressions. But I'm interested in e.g.
Jesper's tests to verify that the bulk allocator didn't regress.
Before the series, SLUB is lockless in both the allocation and free fast
paths, but elsewhere it disables irqs for considerable periods of time -
especially in the allocation slowpath and the bulk allocation, where irqs are
re-enabled only when a new page from the page allocator is needed and the
context allows blocking. The irq disabled sections can then include
deactivate_slab(), which walks a full freelist and frees the slab back to the
page allocator, or unfreeze_partials(), which goes through a list of percpu
partial slabs. The RT tree currently has some patches mitigating these, but we
can do much better in mainline too.
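For reference, the pre-series shape of the slowpath entry, roughly (a
simplified sketch; the actual code is visible as removed lines in patch 9):

static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
			  unsigned long addr, struct kmem_cache_cpu *c)
{
	void *p;
	unsigned long flags;

	/*
	 * irqs stay disabled for everything ___slab_alloc() does, including
	 * deactivate_slab() and reaching the page allocator
	 */
	local_irq_save(flags);
	p = ___slab_alloc(s, gfpflags, node, addr, c);
	local_irq_restore(flags);
	return p;
}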
Patches 1-2 are straightforward optimizations removing unnecessary usages of
object_map_lock.
Patch 3 is a cleanup of an obviously unnecessary local_irq_save/restore
instance.
Patch 4 simplifies the fast paths on systems with preemption, based on the
(hopefully correct) observation that the current loops to verify the tid are
unnecessary.
Patches 5-18 focus on the allocation slowpath. Patches 5-8 are preparatory
code refactoring.
Patch 9 moves the disabling of irqs from the callers of ___slab_alloc() - the
allocation slowpath and the bulk allocation - into ___slab_alloc() itself.
Instead, these callers only disable migration to stabilize the cpu (see the
sketch below). The following patches then gradually reduce the scope of
disabled irqs in ___slab_alloc() and the functions called from there. As of
patch 12, the re-enabling of irqs based on gfp flags before calling the page
allocator is removed from allocate_slab(). As of patch 15, it's possible to
reach the page allocator (in case the existing slabs are depleted) without
disabling and re-enabling irqs even once.
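The caller-side change from patch 9, as a simplified sketch (the
CONFIG_PREEMPTION reload of the percpu pointer is omitted here; see the patch
for the real diff):

static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
			  unsigned long addr, struct kmem_cache_cpu *c)
{
	void *p;

	/* only pin the cpu; ___slab_alloc() now does its own irq handling */
	migrate_disable();
	p = ___slab_alloc(s, gfpflags, node, addr, c);
	migrate_enable();
	return p;
}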
Patches 19-24 reduce the scope of disabled irqs in the remaining functions.
Patch 25 replaces a preempt_disable() with migrate_disable() in
put_cpu_partial().
Patch 26 replaces the remaining explicitly irq disabled sections that protect
percpu variables with a local_lock, and updates the locking documentation in
the file's comment. A minimal sketch of the resulting scheme follows.
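This is simplified from patch 26 (see its diff at the end of the series): the
local_lock is embedded in the percpu structure and replaces the explicit
local_irq_save()/restore() pairs in the slowpaths:

struct kmem_cache_cpu {
	local_lock_t lock;	/* Protects the fields below except stat */
	void **freelist;	/* Pointer to next available object */
	unsigned long tid;	/* Globally unique transaction id */
	struct page *page;	/* The slab from which we are allocating */
	/* ... other fields omitted ... */
};

	local_lock_irqsave(&s->cpu_slab->lock, flags);
	/* manipulate c->page, c->freelist, c->tid */
	local_unlock_irqrestore(&s->cpu_slab->lock, flags);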
The result is that irq disabling is only done for the minimum amount of time
needed, and as part of spin lock or local lock operations to make them
irq-safe, except for one case around slab_lock(), which is a bit spinlock.
This should have obvious implications for better preemption, especially on RT.
Also, some RT patches should now be unnecessary, IIUC:
mm: slub: Enable irqs for __GFP_WAIT [1] becomes unnecessary as of patch 12.
The following two become unnecessary once the IPI flush_slab() handler is
dealt with, as discussed later:
mm: sl[au]b: Change list_lock to raw_spinlock_t [2] - the SLAB part can be
dropped as a different patch restricts RT to SLUB anyway. And after this
series, the list_lock in SLUB is never taken with irqs already disabled.
mm: slub: Move discard_slab() invocations out of IRQ-off sections [3] should
be unnecessary, as this series does move these invocations outside irq
disabled sections.
Some caveats that will probably have to be solved on PREEMPT_RT - I'm just not
sure enough from reading Documentation/locking/locktypes.rst how some things
work. Advice welcome.
* There are paths such as:
  get_partial_node() - does spin_lock_irqsave(&n->list_lock)
    acquire_slab()
      __cmpxchg_double_slab()
        slab_lock() - a bit spinlock, without explicit irqsave
On !PREEMPT_RT this is fine, as spin_lock_irqsave() disables irqs, so
slab_lock() doesn't need to, and it's still irq-safe. I assume there are no
such guarantees on PREEMPT_RT, where spin_lock_irqsave() is just a mutex with
migration disabled? So RT will have to make sure all paths to slab_lock() go
through an explicit irqsave?
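To illustrate, roughly what the mainline code does here (simplified):
slab_lock() is a bit spinlock with no irq handling of its own, and the
__cmpxchg_double_slab() fallback relies on the caller having already disabled
irqs:

static __always_inline void slab_lock(struct page *page)
{
	bit_spin_lock(PG_locked, &page->flags);
}

/*
 * Fallback in __cmpxchg_double_slab() when cmpxchg_double is not available;
 * irq-safe only because e.g. get_partial_node() already disabled irqs via
 * spin_lock_irqsave(&n->list_lock).
 */
	slab_lock(page);
	if (page->freelist == freelist_old && page->counters == counters_old) {
		page->freelist = freelist_new;
		page->counters = counters_new;
		slab_unlock(page);
		return true;
	}
	slab_unlock(page);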
* There is this path involving an IPI:
  flush_all()
    on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
      IPI handler runs with interrupts disabled (is that still true on RT?)
        flush_cpu_slab()
          flush_slab()
            manipulate kmem_cache_cpu variables
            deactivate_slab();
The problem here is that flush_slab() manipulates variables that are normally
protected by the local_lock. On !PREEMPT_RT we don't need the local_lock here,
because local_lock_irqsave() just disables irqs and we already have them
disabled from the IPI. On PREEMPT_RT, IIUC, we actually can't even take the
local_lock, because irqs are already disabled. So that's a problem.
Another issue is that deactivate_slab() above will take the node's list_lock
spinlock, so with irqs disabled it would still have to be a raw spinlock, as
patch [2] makes it. And it will also call discard_slab(), which should also be
called without irqs disabled. So for these reasons, the RT patch "mm: slub:
Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ
context" [4] converting the IPIs to workqueues will still be needed. Then the
work handler can use the local_lock normally, which should solve the issues
with flush_all() and hopefully allow ditching patch [2]. A rough sketch of
that direction follows.
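This is only illustrative - the names and details below are made up, not
taken from the actual RT patch [4] - but it shows the idea of flushing from
task context, where the local_lock can be taken normally:

struct slub_flush_work {
	struct work_struct work;
	struct kmem_cache *s;
};

static void flush_cpu_slab_workfn(struct work_struct *w)
{
	struct slub_flush_work *sfw = container_of(w, struct slub_flush_work, work);
	struct kmem_cache *s = sfw->s;
	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);

	/* task context: protecting kmem_cache_cpu with the local_lock works on RT */
	if (c->page)
		flush_slab(s, c);
	unfreeze_partials(s);
}

flush_all() would then queue this work on each cpu that has a cpu slab and
wait for it to finish, instead of on_each_cpu_cond() sending IPIs.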
Or is there perhaps a simpler way to make this flush IPI not disable IRQ on
PREEMPT_RT?
[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0003-mm-slub-Enable-irqs-for-__GFP_WAIT.patch?h=linux-5.12.y-rt-patches
[2] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0001-mm-sl-au-b-Change-list_lock-to-raw_spinlock_t.patch?h=linux-5.12.y-rt-patches
[3] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0004-mm-slub-Move-discard_slab-invocations-out-of-IRQ-off.patch?h=linux-5.12.y-rt-patches
[4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0005-mm-slub-Move-flush_cpu_slab-invocations-__free_slab-.patch?h=linux-5.12.y-rt-patches
Vlastimil Babka (26):
mm, slub: allocate private object map for sysfs listings
mm, slub: allocate private object map for validate_slab_cache()
mm, slub: don't disable irq for debug_check_no_locks_freed()
mm, slub: simplify kmem_cache_cpu and tid setup
mm, slub: extract get_partial() from new_slab_objects()
mm, slub: dissolve new_slab_objects() into ___slab_alloc()
mm, slub: return slab page from get_partial() and set c->page
afterwards
mm, slub: restructure new page checks in ___slab_alloc()
mm, slub: move disabling/enabling irqs to ___slab_alloc()
mm, slub: do initial checks in ___slab_alloc() with irqs enabled
mm, slub: move disabling irqs closer to get_partial() in
___slab_alloc()
mm, slub: restore irqs around calling new_slab()
mm, slub: validate partial and newly allocated slabs before loading
them
mm, slub: check new pages with restored irqs
mm, slub: stop disabling irqs around get_partial()
mm, slub: move reset of c->page and freelist out of deactivate_slab()
mm, slub: make locking in deactivate_slab() irq-safe
mm, slub: call deactivate_slab() without disabling irqs
mm, slub: move irq control into unfreeze_partials()
mm, slub: discard slabs in unfreeze_partials() without irqs disabled
mm, slub: detach whole partial list at once in unfreeze_partials()
mm, slub: detach percpu partial list in unfreeze_partials() using
this_cpu_cmpxchg()
mm, slub: only disable irq with spin_lock in __unfreeze_partials()
mm, slub: don't disable irqs in slub_cpu_dead()
mm, slub: use migrate_disable() in put_cpu_partial()
mm, slub: convert kmem_cpu_slab protection to local_lock
include/linux/slub_def.h | 2 +
mm/slub.c | 496 ++++++++++++++++++++++++---------------
2 files changed, 314 insertions(+), 184 deletions(-)
--
2.31.1
In slab_free_hook() we disable irqs around the debug_check_no_locks_freed()
call, which is unnecessary, as irqs are already disabled inside that call.
This seems to be a leftover from the past, when there were more calls inside
the irq disabled section. Remove the irq disable/enable operations.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index ed0f2620b19b..83ad64c1d9da 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1545,20 +1545,8 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s,
{
kmemleak_free_recursive(x, s->flags);
- /*
- * Trouble is that we may no longer disable interrupts in the fast path
- * So in order to make the debug calls that expect irqs to be
- * disabled we need to disable interrupts temporarily.
- */
-#ifdef CONFIG_LOCKDEP
- {
- unsigned long flags;
+ debug_check_no_locks_freed(x, s->object_size);
- local_irq_save(flags);
- debug_check_no_locks_freed(x, s->object_size);
- local_irq_restore(flags);
- }
-#endif
if (!(s->flags & SLAB_DEBUG_OBJECTS))
debug_check_no_obj_freed(x, s->object_size);
--
2.31.1
validate_slab_cache() is called either to handle a sysfs write, or from a
self-test context. In both situations it's straightforward to preallocate a
private object bitmap instead of grabbing the shared static one meant for
critical sections, so let's do that.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 4c876749f322..ed0f2620b19b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4622,7 +4622,8 @@ static int count_total(struct page *page)
#endif
#ifdef CONFIG_SLUB_DEBUG
-static void validate_slab(struct kmem_cache *s, struct page *page)
+static void validate_slab(struct kmem_cache *s, struct page *page,
+ unsigned long *obj_map)
{
void *p;
void *addr = page_address(page);
@@ -4634,7 +4635,7 @@ static void validate_slab(struct kmem_cache *s, struct page *page)
goto unlock;
/* Now we know that a valid freelist exists */
- map = get_map(s, page);
+ __fill_map(obj_map, s, page);
for_each_object(p, s, addr, page->objects) {
u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
@@ -4642,13 +4643,12 @@ static void validate_slab(struct kmem_cache *s, struct page *page)
if (!check_object(s, page, p, val))
break;
}
- put_map(map);
unlock:
slab_unlock(page);
}
static int validate_slab_node(struct kmem_cache *s,
- struct kmem_cache_node *n)
+ struct kmem_cache_node *n, unsigned long *obj_map)
{
unsigned long count = 0;
struct page *page;
@@ -4657,7 +4657,7 @@ static int validate_slab_node(struct kmem_cache *s,
spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry(page, &n->partial, slab_list) {
- validate_slab(s, page);
+ validate_slab(s, page, obj_map);
count++;
}
if (count != n->nr_partial)
@@ -4668,7 +4668,7 @@ static int validate_slab_node(struct kmem_cache *s,
goto out;
list_for_each_entry(page, &n->full, slab_list) {
- validate_slab(s, page);
+ validate_slab(s, page, obj_map);
count++;
}
if (count != atomic_long_read(&n->nr_slabs))
@@ -4685,10 +4685,17 @@ static long validate_slab_cache(struct kmem_cache *s)
int node;
unsigned long count = 0;
struct kmem_cache_node *n;
+ unsigned long *obj_map;
+
+ obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+ if (!obj_map)
+ return -ENOMEM;
flush_all(s);
for_each_kmem_cache_node(s, node, n)
- count += validate_slab_node(s, n);
+ count += validate_slab_node(s, n, obj_map);
+
+ bitmap_free(obj_map);
return count;
}
--
2.31.1
In the slab_alloc_node() and do_slab_free() fastpaths we need to guarantee
that our kmem_cache_cpu pointer is from the same cpu as the tid value.
Currently that's done by first reading the tid using this_cpu_read(), then
reading the kmem_cache_cpu pointer, and verifying that the tid read through
that pointer with a plain READ_ONCE() matches.
This can be simplified to just fetching the kmem_cache_cpu pointer and then
reading the tid through that pointer. That guarantees they are from the same
cpu. We don't need to read the tid using this_cpu_read(), because the value
will be validated by this_cpu_cmpxchg_double(), making sure we are on the
correct cpu and that nobody who preempted us changed the freelist since the
tid was read.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 83ad64c1d9da..7b4cdc59b9ff 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2840,15 +2840,14 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
* reading from one cpu area. That does not matter as long
* as we end up on the original cpu again when doing the cmpxchg.
*
- * We should guarantee that tid and kmem_cache are retrieved on
- * the same cpu. It could be different if CONFIG_PREEMPTION so we need
- * to check if it is matched or not.
+ * We must guarantee that tid and kmem_cache_cpu are retrieved on the
+ * same cpu. We read first the kmem_cache_cpu pointer and use it to read
+ * the tid. If we are preempted and switched to another cpu between the
+ * two reads, it's OK as the two are still associated with the same cpu
+ * and cmpxchg later will validate the cpu.
*/
- do {
- tid = this_cpu_read(s->cpu_slab->tid);
- c = raw_cpu_ptr(s->cpu_slab);
- } while (IS_ENABLED(CONFIG_PREEMPTION) &&
- unlikely(tid != READ_ONCE(c->tid)));
+ c = raw_cpu_ptr(s->cpu_slab);
+ tid = READ_ONCE(c->tid);
/*
* Irqless object alloc/free algorithm used here depends on sequence
@@ -3122,11 +3121,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
* data is retrieved via this pointer. If we are on the same cpu
* during the cmpxchg then the free will succeed.
*/
- do {
- tid = this_cpu_read(s->cpu_slab->tid);
- c = raw_cpu_ptr(s->cpu_slab);
- } while (IS_ENABLED(CONFIG_PREEMPTION) &&
- unlikely(tid != READ_ONCE(c->tid)));
+ c = raw_cpu_ptr(s->cpu_slab);
+ tid = READ_ONCE(c->tid);
/* Same with comment on barrier() in slab_alloc_node() */
barrier();
--
2.31.1
SLUB has a static, spinlock protected bitmap for marking which objects are on
a freelist when it wants to list them, for situations where dynamically
allocating such a map can lead to recursion or locking issues, and an on-stack
bitmap would be too large.
The handlers of the sysfs files alloc_calls and free_calls also currently use
this shared bitmap, but their syscall context makes it straightforward to
allocate a private map before entering the locked sections, so switch these
processing paths to use a private bitmap.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 43 +++++++++++++++++++++++++++++--------------
1 file changed, 29 insertions(+), 14 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 3f96e099817a..4c876749f322 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -448,6 +448,18 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
static DEFINE_SPINLOCK(object_map_lock);
+static void __fill_map(unsigned long *obj_map, struct kmem_cache *s,
+ struct page *page)
+{
+ void *addr = page_address(page);
+ void *p;
+
+ bitmap_zero(obj_map, page->objects);
+
+ for (p = page->freelist; p; p = get_freepointer(s, p))
+ set_bit(__obj_to_index(s, addr, p), obj_map);
+}
+
/*
* Determine a map of object in use on a page.
*
@@ -457,17 +469,11 @@ static DEFINE_SPINLOCK(object_map_lock);
static unsigned long *get_map(struct kmem_cache *s, struct page *page)
__acquires(&object_map_lock)
{
- void *p;
- void *addr = page_address(page);
-
VM_BUG_ON(!irqs_disabled());
spin_lock(&object_map_lock);
- bitmap_zero(object_map, page->objects);
-
- for (p = page->freelist; p; p = get_freepointer(s, p))
- set_bit(__obj_to_index(s, addr, p), object_map);
+ __fill_map(object_map, s, page);
return object_map;
}
@@ -4813,17 +4819,17 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
}
static void process_slab(struct loc_track *t, struct kmem_cache *s,
- struct page *page, enum track_item alloc)
+ struct page *page, enum track_item alloc,
+ unsigned long *obj_map)
{
void *addr = page_address(page);
void *p;
- unsigned long *map;
- map = get_map(s, page);
+ __fill_map(obj_map, s, page);
+
for_each_object(p, s, addr, page->objects)
- if (!test_bit(__obj_to_index(s, addr, p), map))
+ if (!test_bit(__obj_to_index(s, addr, p), obj_map))
add_location(t, s, get_track(s, p, alloc));
- put_map(map);
}
static int list_locations(struct kmem_cache *s, char *buf,
@@ -4834,11 +4840,18 @@ static int list_locations(struct kmem_cache *s, char *buf,
struct loc_track t = { 0, 0, NULL };
int node;
struct kmem_cache_node *n;
+ unsigned long *obj_map;
+
+ obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+ if (!obj_map)
+ return sysfs_emit(buf, "Out of memory\n");
if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
GFP_KERNEL)) {
+ bitmap_free(obj_map);
return sysfs_emit(buf, "Out of memory\n");
}
+
/* Push back cpu slabs */
flush_all(s);
@@ -4851,12 +4864,14 @@ static int list_locations(struct kmem_cache *s, char *buf,
spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry(page, &n->partial, slab_list)
- process_slab(&t, s, page, alloc);
+ process_slab(&t, s, page, alloc, obj_map);
list_for_each_entry(page, &n->full, slab_list)
- process_slab(&t, s, page, alloc);
+ process_slab(&t, s, page, alloc, obj_map);
spin_unlock_irqrestore(&n->list_lock, flags);
}
+ bitmap_free(obj_map);
+
for (i = 0; i < t.count; i++) {
struct location *l = &t.loc[i];
--
2.31.1
The later patches will need more fine-grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first
preparatory step with no functional change.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 7b4cdc59b9ff..7973bcd42bc7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2574,17 +2574,12 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
int node, struct kmem_cache_cpu **pc)
{
- void *freelist;
+ void *freelist = NULL;
struct kmem_cache_cpu *c = *pc;
struct page *page;
WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
- freelist = get_partial(s, flags, node, c);
-
- if (freelist)
- return freelist;
-
page = new_slab(s, flags, node);
if (page) {
c = raw_cpu_ptr(s->cpu_slab);
@@ -2748,6 +2743,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}
+ freelist = get_partial(s, gfpflags, node, c);
+ if (freelist)
+ goto check_new_page;
+
freelist = new_slab_objects(s, gfpflags, node, &c);
if (unlikely(!freelist)) {
@@ -2755,6 +2754,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return NULL;
}
+check_new_page:
page = c->page;
if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
goto load_freelist;
--
2.31.1
When we allocate a slab object from a newly acquired page (from the node's
partial list or the page allocator), we usually also retain the page as the
new percpu slab. There are two exceptions - when the pfmemalloc status of the
page doesn't match our gfp flags, or when the cache has debugging enabled.
The current code for these decisions is not easy to follow, so restructure it
and add comments. No functional change.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index f240e424c861..06f30c9ad361 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2743,13 +2743,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->page = page;
check_new_page:
- if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
- goto load_freelist;
- /* Only entered in the debug case */
- if (kmem_cache_debug(s) &&
- !alloc_debug_processing(s, page, freelist, addr))
- goto new_slab; /* Slab failed checks. Next slab needed */
+ if (kmem_cache_debug(s)) {
+ if (!alloc_debug_processing(s, page, freelist, addr))
+ /* Slab failed checks. Next slab needed */
+ goto new_slab;
+ else
+ /*
+ * For debug case, we don't load freelist so that all
+ * allocations go through alloc_debug_processing()
+ */
+ goto return_single;
+ }
+
+ if (unlikely(!pfmemalloc_match(page, gfpflags)))
+ /*
+ * For !pfmemalloc_match() case we don't load freelist so that
+ * we don't make further mismatched allocations easier.
+ */
+ goto return_single;
+
+ goto load_freelist;
+
+return_single:
deactivate_slab(s, page, get_freepointer(s, freelist), c);
return freelist;
--
2.31.1
The later patches will need more fine-grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve
new_slab_objects() into its caller. This is a preparatory step with no
functional change.
The only minor change is moving the WARN_ON_ONCE() for using a constructor
together with __GFP_ZERO into new_slab(), which makes the check somewhat less
frequent, but still able to catch a misuse.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 50 ++++++++++++++++++--------------------------------
1 file changed, 18 insertions(+), 32 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 7973bcd42bc7..9d1d872f5ab9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1839,6 +1839,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
if (unlikely(flags & GFP_SLAB_BUG_MASK))
flags = kmalloc_fix_flags(flags);
+ WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
+
return allocate_slab(s,
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
}
@@ -2571,36 +2573,6 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
#endif
}
-static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
- int node, struct kmem_cache_cpu **pc)
-{
- void *freelist = NULL;
- struct kmem_cache_cpu *c = *pc;
- struct page *page;
-
- WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
-
- page = new_slab(s, flags, node);
- if (page) {
- c = raw_cpu_ptr(s->cpu_slab);
- if (c->page)
- flush_slab(s, c);
-
- /*
- * No other reference to the page yet so we can
- * muck around with it freely without cmpxchg
- */
- freelist = page->freelist;
- page->freelist = NULL;
-
- stat(s, ALLOC_SLAB);
- c->page = page;
- *pc = c;
- }
-
- return freelist;
-}
-
static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
{
if (unlikely(PageSlabPfmemalloc(page)))
@@ -2747,13 +2719,27 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (freelist)
goto check_new_page;
- freelist = new_slab_objects(s, gfpflags, node, &c);
+ page = new_slab(s, gfpflags, node);
- if (unlikely(!freelist)) {
+ if (unlikely(!page)) {
slab_out_of_memory(s, gfpflags, node);
return NULL;
}
+ c = raw_cpu_ptr(s->cpu_slab);
+ if (c->page)
+ flush_slab(s, c);
+
+ /*
+ * No other reference to the page yet so we can
+ * muck around with it freely without cmpxchg
+ */
+ freelist = page->freelist;
+ page->freelist = NULL;
+
+ stat(s, ALLOC_SLAB);
+ c->page = page;
+
check_new_page:
page = c->page;
if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
--
2.31.1
The function get_partial() finds a suitable page on a partial list, acquires
and returns its freelist, and assigns the page pointer to kmem_cache_cpu.
In a later patch we will need more control over the kmem_cache_cpu.page
assignment, so instead return the page pointer to get_partial()'s caller and
assign it there. No functional change, as all of this still happens with irqs
disabled.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 9d1d872f5ab9..f240e424c861 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1971,7 +1971,7 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
* Try to allocate a partial slab from a specific node.
*/
static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
- struct kmem_cache_cpu *c, gfp_t flags)
+ struct page **ret_page, gfp_t flags)
{
struct page *page, *page2;
void *object = NULL;
@@ -2000,7 +2000,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
available += objects;
if (!object) {
- c->page = page;
+ *ret_page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
} else {
@@ -2020,7 +2020,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
* Get a page from somewhere. Search in increasing NUMA distances.
*/
static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
- struct kmem_cache_cpu *c)
+ struct page **ret_page)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
@@ -2062,7 +2062,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
if (n && cpuset_zone_allowed(zone, flags) &&
n->nr_partial > s->min_partial) {
- object = get_partial_node(s, n, c, flags);
+ object = get_partial_node(s, n, ret_page, flags);
if (object) {
/*
* Don't check read_mems_allowed_retry()
@@ -2084,7 +2084,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
* Get a partial page, lock it and return it.
*/
static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
- struct kmem_cache_cpu *c)
+ struct page **ret_page)
{
void *object;
int searchnode = node;
@@ -2092,11 +2092,11 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();
- object = get_partial_node(s, get_node(s, searchnode), c, flags);
+ object = get_partial_node(s, get_node(s, searchnode), ret_page, flags);
if (object || node != NUMA_NO_NODE)
return object;
- return get_any_partial(s, flags, c);
+ return get_any_partial(s, flags, ret_page);
}
#ifdef CONFIG_PREEMPTION
@@ -2715,9 +2715,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}
- freelist = get_partial(s, gfpflags, node, c);
- if (freelist)
+ freelist = get_partial(s, gfpflags, node, &page);
+ if (freelist) {
+ c->page = page;
goto check_new_page;
+ }
page = new_slab(s, gfpflags, node);
@@ -2741,7 +2743,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->page = page;
check_new_page:
- page = c->page;
if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
goto load_freelist;
--
2.31.1
Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This
includes cases where this is not needed, such as when the allocation ends up
in the page allocator and has to awkwardly re-enable irqs based on gfp flags.
Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even
when it hits the __slab_alloc() slow path, and long periods with disabled
interrupts are undesirable.
As a first step towards reducing irq disabled periods, move irq handling into
___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer
from becoming invalid via migrate_disable(). This does not protect the
kmem_cache_cpu fields against concurrent access from preemption or irqs; that
is still done by disabling irqs for most of ___slab_alloc(). As a small
immediate benefit, the slab_out_of_memory() call from ___slab_alloc() is now
done with irqs enabled.
kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables
them before calling ___slab_alloc(), which then disables them at its
discretion. The whole kmem_cache_alloc_bulk() operation also disables cpu
migration.
When ___slab_alloc() calls new_slab() to allocate a new page, migration is
re-enabled around the call, because new_slab() will re-enable interrupts in
contexts that allow blocking.
The patch itself will thus increase overhead a bit due to the disabled
migration and the increased disabling/enabling of irqs in
kmem_cache_alloc_bulk(), but that will be gradually improved in the following
patches.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 06f30c9ad361..c5f4f9282496 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2631,7 +2631,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
* we need to allocate a new slab. This is the slowest path since it involves
* a call to the page allocator and the setup of a new slab.
*
- * Version of __slab_alloc to use when we know that interrupts are
+ * Version of __slab_alloc to use when we know that preemption is
* already disabled (which is the case for bulk allocation).
*/
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
@@ -2639,9 +2639,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
{
void *freelist;
struct page *page;
+ unsigned long flags;
stat(s, ALLOC_SLOWPATH);
+ local_irq_save(flags);
page = c->page;
if (!page) {
/*
@@ -2704,6 +2706,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
VM_BUG_ON(!c->page->frozen);
c->freelist = get_freepointer(s, freelist);
c->tid = next_tid(c->tid);
+ local_irq_restore(flags);
return freelist;
new_slab:
@@ -2721,14 +2724,17 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto check_new_page;
}
+ migrate_enable();
page = new_slab(s, gfpflags, node);
+ migrate_disable();
+ c = this_cpu_ptr(s->cpu_slab);
if (unlikely(!page)) {
+ local_irq_restore(flags);
slab_out_of_memory(s, gfpflags, node);
return NULL;
}
- c = raw_cpu_ptr(s->cpu_slab);
if (c->page)
flush_slab(s, c);
@@ -2768,6 +2774,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return_single:
deactivate_slab(s, page, get_freepointer(s, freelist), c);
+ local_irq_restore(flags);
return freelist;
}
@@ -2779,20 +2786,19 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
unsigned long addr, struct kmem_cache_cpu *c)
{
void *p;
- unsigned long flags;
- local_irq_save(flags);
+ migrate_disable();
#ifdef CONFIG_PREEMPTION
/*
* We may have been preempted and rescheduled on a different
- * cpu before disabling interrupts. Need to reload cpu area
+ * cpu before disabling preemption. Need to reload cpu area
* pointer.
*/
c = this_cpu_ptr(s->cpu_slab);
#endif
p = ___slab_alloc(s, gfpflags, node, addr, c);
- local_irq_restore(flags);
+ migrate_enable();
return p;
}
@@ -3312,8 +3318,9 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* IRQs, which protects against PREEMPT and interrupts
* handlers invoking normal fastpath.
*/
- local_irq_disable();
+ migrate_disable();
c = this_cpu_ptr(s->cpu_slab);
+ local_irq_disable();
for (i = 0; i < size; i++) {
void *object = kfence_alloc(s, s->object_size, flags);
@@ -3334,6 +3341,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
*/
c->tid = next_tid(c->tid);
+ local_irq_enable();
+
/*
* Invoking slow path likely have side-effect
* of re-populating per CPU c->freelist
@@ -3346,6 +3355,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
c = this_cpu_ptr(s->cpu_slab);
maybe_wipe_obj_freeptr(s, p[i]);
+ local_irq_disable();
+
continue; /* goto for-loop */
}
c->freelist = get_freepointer(s, object);
@@ -3354,6 +3365,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
}
c->tid = next_tid(c->tid);
local_irq_enable();
+ migrate_enable();
/*
* memcg and kmem_cache debug support and memory initialization.
--
2.31.1
Building on top of the previous patch, re-enable irqs before checking new
pages. alloc_debug_processing() is now called with irqs enabled, so we need to
remove the VM_BUG_ON(!irqs_disabled()) in check_slab() - there doesn't seem to
be a need for it anyway.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 78d7eb5be951..e092387b5932 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -968,8 +968,6 @@ static int check_slab(struct kmem_cache *s, struct page *page)
{
int maxobj;
- VM_BUG_ON(!irqs_disabled());
-
if (!PageSlab(page)) {
slab_err(s, page, "Not a valid slab page");
return 0;
@@ -2743,10 +2741,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
new_objects:
freelist = get_partial(s, gfpflags, node, &page);
+ local_irq_restore(flags);
if (freelist)
goto check_new_page;
- local_irq_restore(flags);
migrate_enable();
page = new_slab(s, gfpflags, node);
migrate_disable();
@@ -2757,7 +2755,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return NULL;
}
- local_irq_save(flags);
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
@@ -2772,7 +2769,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (kmem_cache_debug(s)) {
if (!alloc_debug_processing(s, page, freelist, addr)) {
/* Slab failed checks. Next slab needed */
- local_irq_restore(flags);
goto new_slab;
} else {
/*
@@ -2790,6 +2786,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
goto return_single;
+ local_irq_save(flags);
if (unlikely(c->page))
flush_slab(s, c);
c->page = page;
@@ -2798,6 +2795,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return_single:
+ local_irq_save(flags);
if (unlikely(c->page))
flush_slab(s, c);
c->page = page;
--
2.31.1
Continue reducing the irq disabled scope. Check for per-cpu partial slabs
first with irqs enabled and then recheck with irqs disabled before grabbing
the slab page. Mostly preparatory for the following patches.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 30 +++++++++++++++++++++++-------
1 file changed, 23 insertions(+), 7 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index e243a991b15b..16f3cbb81627 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2653,11 +2653,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(node != NUMA_NO_NODE &&
!node_isset(node, slab_nodes)))
node = NUMA_NO_NODE;
- local_irq_save(flags);
- if (unlikely(c->page)) {
- local_irq_restore(flags);
- goto reread_page;
- }
goto new_slab;
}
redo:
@@ -2698,6 +2693,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (!freelist) {
c->page = NULL;
+ local_irq_restore(flags);
stat(s, DEACTIVATE_BYPASS);
goto new_slab;
}
@@ -2723,10 +2719,19 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto reread_page;
}
deactivate_slab(s, page, c->freelist, c);
+ local_irq_restore(flags);
new_slab:
if (slub_percpu_partial(c)) {
+ local_irq_save(flags);
+ if (unlikely(c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
+ if (unlikely(!slub_percpu_partial(c))) /* stolen by IRQ? */
+ goto new_objects;
+
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
local_irq_restore(flags);
@@ -2734,6 +2739,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}
+ local_irq_save(flags);
+ if (unlikely(c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
+
+new_objects:
+
freelist = get_partial(s, gfpflags, node, &page);
if (freelist) {
c->page = page;
@@ -2767,15 +2780,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
check_new_page:
if (kmem_cache_debug(s)) {
- if (!alloc_debug_processing(s, page, freelist, addr))
+ if (!alloc_debug_processing(s, page, freelist, addr)) {
/* Slab failed checks. Next slab needed */
+ c->page = NULL;
+ local_irq_restore(flags);
goto new_slab;
- else
+ } else {
/*
* For debug case, we don't load freelist so that all
* allocations go through alloc_debug_processing()
*/
goto return_single;
+ }
}
if (unlikely(!pfmemalloc_match(page, gfpflags)))
--
2.31.1
When we obtain a new slab page from the node's partial list or the page
allocator, we assign it to kmem_cache_cpu, perform some checks, and if they
fail, we undo the assignment.
In order to allow doing the checks without irqs disabled, restructure the code
so that the checks go first and the kmem_cache_cpu assignment happens only
afterwards.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 47a3438d6a35..78d7eb5be951 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2743,10 +2743,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
new_objects:
freelist = get_partial(s, gfpflags, node, &page);
- if (freelist) {
- c->page = page;
+ if (freelist)
goto check_new_page;
- }
local_irq_restore(flags);
migrate_enable();
@@ -2760,9 +2758,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}
local_irq_save(flags);
- if (c->page)
- flush_slab(s, c);
-
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
@@ -2771,14 +2766,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
page->freelist = NULL;
stat(s, ALLOC_SLAB);
- c->page = page;
check_new_page:
if (kmem_cache_debug(s)) {
if (!alloc_debug_processing(s, page, freelist, addr)) {
/* Slab failed checks. Next slab needed */
- c->page = NULL;
local_irq_restore(flags);
goto new_slab;
} else {
@@ -2797,10 +2790,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
goto return_single;
+ if (unlikely(c->page))
+ flush_slab(s, c);
+ c->page = page;
+
goto load_freelist;
return_single:
+ if (unlikely(c->page))
+ flush_slab(s, c);
+ c->page = page;
+
deactivate_slab(s, page, get_freepointer(s, freelist), c);
local_irq_restore(flags);
return freelist;
--
2.31.1
In put_cpu_partial(), we need a stable cpu, but being preempted is not an
issue. So disable migration instead of preemption.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index bfa5e7c4da1b..8818c210cb97 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2417,7 +2417,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
int pages;
int pobjects;
- preempt_disable();
+ migrate_disable();
do {
pages = 0;
pobjects = 0;
@@ -2451,7 +2451,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
if (unlikely(!slub_cpu_partial(s)))
unfreeze_partials(s);
- preempt_enable();
+ migrate_enable();
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
--
2.31.1
deactivate_slab() now no longer touches the kmem_cache_cpu structure, so it
will be possible to call it with irqs enabled. Just convert the spin_lock
calls to their irq saving/restoring variants to make it irq-safe. Additionally
we now have to use cmpxchg_double_slab() instead of __cmpxchg_double_slab(),
so that the slab_lock() taken in its fallback is also irq-safe.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index cf855cd09802..0f58859165bf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2177,6 +2177,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
enum slab_modes l = M_NONE, m = M_NONE;
void *nextfree, *freelist_iter, *freelist_tail;
int tail = DEACTIVATE_TO_HEAD;
+ unsigned long flags = 0;
struct page new;
struct page old;
@@ -2252,7 +2253,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
* that acquire_slab() will see a slab page that
* is frozen
*/
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
}
} else {
m = M_FULL;
@@ -2263,7 +2264,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
* slabs from diagnostic functions will not see
* any frozen slabs.
*/
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
}
}
@@ -2280,14 +2281,14 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
}
l = m;
- if (!__cmpxchg_double_slab(s, page,
+ if (!cmpxchg_double_slab(s, page,
old.freelist, old.counters,
new.freelist, new.counters,
"unfreezing slab"))
goto redo;
if (lock)
- spin_unlock(&n->list_lock);
+ spin_unlock_irqrestore(&n->list_lock, flags);
if (m == M_PARTIAL)
stat(s, tail);
--
2.31.1
Instead of iterating through the live percpu partial list, detach it from the
kmem_cache_cpu at once. This is simpler and will allow further optimization.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 33c3faacf7b1..414cc621d655 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2312,16 +2312,20 @@ static void unfreeze_partials(struct kmem_cache *s,
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct kmem_cache_node *n = NULL, *n2 = NULL;
- struct page *page, *discard_page = NULL;
+ struct page *page, *partial_page, *discard_page = NULL;
unsigned long flags;
local_irq_save(flags);
- while ((page = slub_percpu_partial(c))) {
+ partial_page = slub_percpu_partial(c);
+ c->partial = NULL;
+
+ while (partial_page) {
struct page new;
struct page old;
- slub_set_percpu_partial(c, page);
+ page = partial_page;
+ partial_page = page->next;
n2 = get_node(s, page_to_nid(page));
if (n != n2) {
--
2.31.1
As another step of shortening irq disabled sections in ___slab_alloc(), don't
disable irqs until after the initial checks whether there is a cached percpu
slab and whether it's suitable for our allocation.
Now we have to recheck c->page after actually disabling irqs, as an allocation
in an irq handler might have replaced it.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index c5f4f9282496..e243a991b15b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2643,8 +2643,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
stat(s, ALLOC_SLOWPATH);
- local_irq_save(flags);
- page = c->page;
+reread_page:
+ page = READ_ONCE(c->page);
if (!page) {
/*
* if the node is not online or has no normal memory, just
@@ -2653,6 +2653,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(node != NUMA_NO_NODE &&
!node_isset(node, slab_nodes)))
node = NUMA_NO_NODE;
+ local_irq_save(flags);
+ if (unlikely(c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
goto new_slab;
}
redo:
@@ -2667,8 +2672,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
} else {
stat(s, ALLOC_NODE_MISMATCH);
- deactivate_slab(s, page, c->freelist, c);
- goto new_slab;
+ goto deactivate_slab;
}
}
@@ -2677,12 +2681,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* PFMEMALLOC but right now, we are losing the pfmemalloc
* information when the page leaves the per-cpu allocator
*/
- if (unlikely(!pfmemalloc_match(page, gfpflags))) {
- deactivate_slab(s, page, c->freelist, c);
- goto new_slab;
- }
+ if (unlikely(!pfmemalloc_match(page, gfpflags)))
+ goto deactivate_slab;
- /* must check again c->freelist in case of cpu migration or IRQ */
+ /* must check again c->page in case of IRQ */
+ local_irq_save(flags);
+ if (unlikely(page != c->page)) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
freelist = c->freelist;
if (freelist)
goto load_freelist;
@@ -2709,11 +2716,20 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
local_irq_restore(flags);
return freelist;
+deactivate_slab:
+ local_irq_save(flags);
+ if (page != c->page) {
+ local_irq_restore(flags);
+ goto reread_page;
+ }
+ deactivate_slab(s, page, c->freelist, c);
+
new_slab:
if (slub_percpu_partial(c)) {
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
+ local_irq_restore(flags);
stat(s, CPU_PARTIAL_ALLOC);
goto redo;
}
--
2.31.1
__unfreeze_partials() no longer needs to have irqs disabled, except for
making the spin_lock operations irq-safe, so convert the spin_lock operations
to their irq saving/restoring variants and remove the separate irq handling.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 92345d3840d1..aee83758bbc4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2306,9 +2306,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
{
struct kmem_cache_node *n = NULL, *n2 = NULL;
struct page *page, *discard_page = NULL;
- unsigned long flags;
-
- local_irq_save(flags);
+ unsigned long flags = 0;
while (partial_page) {
struct page new;
@@ -2320,10 +2318,10 @@ static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
n2 = get_node(s, page_to_nid(page));
if (n != n2) {
if (n)
- spin_unlock(&n->list_lock);
+ spin_unlock_irqrestore(&n->list_lock, flags);
n = n2;
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
}
do {
@@ -2352,9 +2350,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
}
if (n)
- spin_unlock(&n->list_lock);
-
- local_irq_restore(flags);
+ spin_unlock_irqrestore(&n->list_lock, flags);
while (discard_page) {
page = discard_page;
--
2.31.1
Embed a local_lock into struct kmem_cache_cpu and use the irq-safe versions
of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's
equivalent, with better lockdep visibility. On PREEMPT_RT that means better
preemption.
Also update the comment about the locking scheme in SLUB to reflect changes
done by this series.
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/slub_def.h | 2 +
mm/slub.c | 90 ++++++++++++++++++++++++++++------------
2 files changed, 66 insertions(+), 26 deletions(-)
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..b5bcac29b979 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -10,6 +10,7 @@
#include <linux/kfence.h>
#include <linux/kobject.h>
#include <linux/reciprocal_div.h>
+#include <linux/local_lock.h>
enum stat_item {
ALLOC_FASTPATH, /* Allocation from cpu slab */
@@ -41,6 +42,7 @@ enum stat_item {
NR_SLUB_STAT_ITEMS };
struct kmem_cache_cpu {
+ local_lock_t lock; /* Protects the fields below except stat */
void **freelist; /* Pointer to next available object */
unsigned long tid; /* Globally unique transaction id */
struct page *page; /* The slab from which we are allocating */
diff --git a/mm/slub.c b/mm/slub.c
index 8818c210cb97..5455c6bda997 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -43,13 +43,22 @@
/*
* Lock order:
* 1. slab_mutex (Global Mutex)
- * 2. node->list_lock
+ * 2. node->list_lock (Spinlock)
+ * OR
+ * kmem_cache->cpu_slab->lock (Local lock)
* 3. slab_lock(page) (Only on some arches and for debugging)
+ * 4. object_map_lock (Only for debugging)
*
* slab_mutex
*
* The role of the slab_mutex is to protect the list of all the slabs
* and to synchronize major metadata changes to slab cache structures.
+ * Also synchronizes memory hotplug callbacks.
+ *
+ * slab_lock
+ *
+ * The slab_lock is a wrapper around the page lock, thus it is a bit
+ * spinlock.
*
* The slab_lock is only used for debugging and on arches that do not
* have the ability to do a cmpxchg_double. It only protects:
@@ -65,6 +74,8 @@
* froze the slab is the only one that can retrieve the objects from the
* page's freelist.
*
+ * list_lock
+ *
* The list_lock protects the partial and full list on each node and
* the partial slab counter. If taken then no new slabs may be added or
* removed from the lists nor make the number of partial slabs be modified.
@@ -76,10 +87,33 @@
* slabs, operations can continue without any centralized lock. F.e.
* allocating a long series of objects that fill up slabs does not require
* the list lock.
- * Interrupts are disabled during allocation and deallocation in order to
- * make the slab allocator safe to use in the context of an irq. In addition
- * interrupts are disabled to ensure that the processor does not change
- * while handling per_cpu slabs, due to kernel preemption.
+ *
+ * cpu_slab->lock local lock
+ *
+ * This locks protect slowpath manipulation of all kmem_cache_cpu fields
+ * except the stat counters. This is a percpu structure manipulated only by
+ * the local cpu, so the lock protects against being preempted or interrupted
+ * by an irq. Fast path operations rely on lockless operations instead.
+ *
+ * lockless fastpaths
+ *
+ * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
+ * are fully lockless when satisfied from the percpu slab (and when
+ * cmpxchg_double is possible to use, otherwise slab_lock is taken).
+ * They also don't disable preemption or migration or irqs. They rely on
+ * the transaction id (tid) field to detect being preempted or moved to
+ * another cpu.
+ *
+ * irq, preemption, migration considerations
+ *
+ * Interrupts are disabled as part of list_lock or local_lock operations, or
+ * around the slab_lock operation, in order to make the slab allocator safe
+ * to use in the context of an irq.
+ *
+ * In addition, migration is disabled in the allocation slowpath, bulk
+ * allocation, and put_cpu_partial(), so that the local cpu doesn't change in
+ * the process and e.g. the kmem_cache_cpu pointer doesn't have to be
+ * revalidated in each section protected by the local lock.
*
* SLUB assigns one slab for allocation to each processor.
* Allocations only occur from these slabs called cpu slabs.
@@ -427,7 +461,7 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
page->freelist = freelist_new;
page->counters = counters_new;
slab_unlock(page);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
return true;
}
slab_unlock(page);
@@ -2157,9 +2191,13 @@ static inline void note_cmpxchg_failure(const char *n,
static void init_kmem_cache_cpus(struct kmem_cache *s)
{
int cpu;
+ struct kmem_cache_cpu *c;
- for_each_possible_cpu(cpu)
- per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
+ for_each_possible_cpu(cpu) {
+ c = per_cpu_ptr(s->cpu_slab, cpu);
+ local_lock_init(&c->lock);
+ c->tid = init_tid(cpu);
+ }
}
/*
@@ -2708,9 +2746,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto deactivate_slab;
/* must check again c->page in case of IRQ */
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(page != c->page)) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_page;
}
freelist = c->freelist;
@@ -2721,7 +2759,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (!freelist) {
c->page = NULL;
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
stat(s, DEACTIVATE_BYPASS);
goto new_slab;
}
@@ -2737,37 +2775,37 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
VM_BUG_ON(!c->page->frozen);
c->freelist = get_freepointer(s, freelist);
c->tid = next_tid(c->tid);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
return freelist;
deactivate_slab:
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (page != c->page) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_page;
}
freelist = c->freelist;
c->page = NULL;
c->freelist = NULL;
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
deactivate_slab(s, page, freelist);
new_slab:
if (slub_percpu_partial(c)) {
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(c->page)) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_page;
}
if (unlikely(!slub_percpu_partial(c))) { /* stolen by IRQ? */
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto new_objects;
}
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
stat(s, CPU_PARTIAL_ALLOC);
goto redo;
}
@@ -2820,7 +2858,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto return_single;
retry_load_page:
- local_irq_save(flags);
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(c->page)) {
void *flush_freelist = c->freelist;
struct page *flush_page = c->page;
@@ -2829,7 +2867,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->freelist = NULL;
c->tid = next_tid(c->tid);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
deactivate_slab(s, flush_page, flush_freelist);
@@ -3389,7 +3427,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
*/
migrate_disable();
c = this_cpu_ptr(s->cpu_slab);
- local_irq_disable();
+ local_lock_irq(&s->cpu_slab->lock);
for (i = 0; i < size; i++) {
void *object = kfence_alloc(s, s->object_size, flags);
@@ -3410,7 +3448,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
*/
c->tid = next_tid(c->tid);
- local_irq_enable();
+ local_unlock_irq(&s->cpu_slab->lock);
/*
* Invoking slow path likely have side-effect
@@ -3424,7 +3462,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
c = this_cpu_ptr(s->cpu_slab);
maybe_wipe_obj_freeptr(s, p[i]);
- local_irq_disable();
+ local_lock_irq(&s->cpu_slab->lock);
continue; /* goto for-loop */
}
@@ -3433,7 +3471,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
maybe_wipe_obj_freeptr(s, p[i]);
}
c->tid = next_tid(c->tid);
- local_irq_enable();
+ local_unlock_irq(&s->cpu_slab->lock);
migrate_enable();
/*
@@ -3444,7 +3482,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
slab_want_init_on_alloc(flags, s));
return i;
error:
- local_irq_enable();
+ local_unlock_irq(&s->cpu_slab->lock);
slab_post_alloc_hook(s, objcg, flags, i, p, false);
__kmem_cache_free_bulk(s, i, p);
return 0;
--
2.31.1
allocate_slab() currently re-enables irqs before calling the page allocator.
It depends on gfpflags_allow_blocking() to determine if it's safe to do so.
Now we can instead simply restore irqs before calling it through new_slab().
The other caller, early_kmem_cache_node_alloc(), is unaffected.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 16f3cbb81627..47a3438d6a35 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1763,9 +1763,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
flags &= gfp_allowed_mask;
- if (gfpflags_allow_blocking(flags))
- local_irq_enable();
-
flags |= s->allocflags;
/*
@@ -1824,8 +1821,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
page->frozen = 1;
out:
- if (gfpflags_allow_blocking(flags))
- local_irq_disable();
if (!page)
return NULL;
@@ -2753,17 +2748,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto check_new_page;
}
+ local_irq_restore(flags);
migrate_enable();
page = new_slab(s, gfpflags, node);
migrate_disable();
c = this_cpu_ptr(s->cpu_slab);
if (unlikely(!page)) {
- local_irq_restore(flags);
slab_out_of_memory(s, gfpflags, node);
return NULL;
}
+ local_irq_save(flags);
if (c->page)
flush_slab(s, c);
--
2.31.1
The function get_partial() does not need to have irqs disabled as a whole. It's
sufficient to convert spin_lock operations to their irq saving/restoring
versions.
As a result, it's now possible to reach the page allocator from the slab
allocator without disabling and re-enabling interrupts on the way.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index e092387b5932..bc7bee5d4bf2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1964,11 +1964,12 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
* Try to allocate a partial slab from a specific node.
*/
static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
- struct page **ret_page, gfp_t flags)
+ struct page **ret_page, gfp_t gfpflags)
{
struct page *page, *page2;
void *object = NULL;
unsigned int available = 0;
+ unsigned long flags;
int objects;
/*
@@ -1980,11 +1981,11 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
if (!n || !n->nr_partial)
return NULL;
- spin_lock(&n->list_lock);
+ spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
void *t;
- if (!pfmemalloc_match(page, flags))
+ if (!pfmemalloc_match(page, gfpflags))
continue;
t = acquire_slab(s, n, page, object == NULL, &objects);
@@ -2005,7 +2006,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
break;
}
- spin_unlock(&n->list_lock);
+ spin_unlock_irqrestore(&n->list_lock, flags);
return object;
}
@@ -2722,8 +2723,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
local_irq_restore(flags);
goto reread_page;
}
- if (unlikely(!slub_percpu_partial(c))) /* stolen by IRQ? */
+ if (unlikely(!slub_percpu_partial(c))) { /* stolen by IRQ? */
+ local_irq_restore(flags);
goto new_objects;
+ }
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
@@ -2732,16 +2735,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
}
- local_irq_save(flags);
- if (unlikely(c->page)) {
- local_irq_restore(flags);
- goto reread_page;
- }
-
new_objects:
freelist = get_partial(s, gfpflags, node, &page);
- local_irq_restore(flags);
if (freelist)
goto check_new_page;
--
2.31.1
slub_cpu_dead() cleans up after an offlined cpu, running on a different cpu, and
only calls functions that are now irq-safe, so we don't need to disable irqs
anymore.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index aee83758bbc4..bfa5e7c4da1b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2515,14 +2515,10 @@ static void flush_all(struct kmem_cache *s)
static int slub_cpu_dead(unsigned int cpu)
{
struct kmem_cache *s;
- unsigned long flags;
mutex_lock(&slab_mutex);
- list_for_each_entry(s, &slab_caches, list) {
- local_irq_save(flags);
+ list_for_each_entry(s, &slab_caches, list)
__flush_cpu_slab(s, cpu);
- local_irq_restore(flags);
- }
mutex_unlock(&slab_mutex);
return 0;
}
--
2.31.1
No need for disabled irqs when discarding slabs, so restore them before
discarding.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index 446d2d5344c9..33c3faacf7b1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2360,6 +2360,8 @@ static void unfreeze_partials(struct kmem_cache *s,
if (n)
spin_unlock(&n->list_lock);
+ local_irq_restore(flags);
+
while (discard_page) {
page = discard_page;
discard_page = discard_page->next;
@@ -2369,7 +2371,6 @@ static void unfreeze_partials(struct kmem_cache *s,
stat(s, FREE_SLAB);
}
- local_irq_restore(flags);
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
--
2.31.1
Instead of relying on disabled irqs for atomicity when detaching the percpu
partial list, we can use this_cpu_cmpxchg() and detach without irqs disabled.
However, unfreeze_partials() can also be called from another cpu on behalf of
a cpu that is being offlined, so we need to restructure the code accordingly:
- __unfreeze_partials() is the bulk of unfreeze_partials() that processes the
detached percpu partial list
- unfreeze_partials() uses this_cpu_cmpxchg() to detach list from current cpu
- unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no
protection, and is called from __flush_cpu_slab()
- flush_cpu_slab() needs to call unfreeze_partials() so it can't simply call
__flush_cpu_slab(smp_processor_id()) anymore
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 79 +++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 56 insertions(+), 23 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 414cc621d655..92345d3840d1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2301,25 +2301,15 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
}
}
-/*
- * Unfreeze all the cpu partial slabs.
- *
- * This function must be called with preemption or migration
- * disabled with c local to the cpu.
- */
-static void unfreeze_partials(struct kmem_cache *s,
- struct kmem_cache_cpu *c)
-{
#ifdef CONFIG_SLUB_CPU_PARTIAL
+static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
+{
struct kmem_cache_node *n = NULL, *n2 = NULL;
- struct page *page, *partial_page, *discard_page = NULL;
+ struct page *page, *discard_page = NULL;
unsigned long flags;
local_irq_save(flags);
- partial_page = slub_percpu_partial(c);
- c->partial = NULL;
-
while (partial_page) {
struct page new;
struct page old;
@@ -2374,10 +2364,49 @@ static void unfreeze_partials(struct kmem_cache *s,
discard_slab(s, page);
stat(s, FREE_SLAB);
}
+}
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
+/*
+ * Unfreeze all the cpu partial slabs.
+ *
+ * This function must be called with preemption or migration
+ * disabled.
+ */
+static void unfreeze_partials(struct kmem_cache *s)
+{
+ struct page *partial_page;
+
+ do {
+ partial_page = this_cpu_read(s->cpu_slab->partial);
+
+ } while (partial_page &&
+ this_cpu_cmpxchg(s->cpu_slab->partial, partial_page, NULL)
+ != partial_page);
+
+ if (partial_page)
+ __unfreeze_partials(s, partial_page);
}
+static void unfreeze_partials_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
+{
+ struct page *partial_page;
+
+ partial_page = slub_percpu_partial(c);
+ c->partial = NULL;
+
+ if (partial_page)
+ __unfreeze_partials(s, partial_page);
+}
+
+#else /* CONFIG_SLUB_CPU_PARTIAL */
+
+static void unfreeze_partials(struct kmem_cache *s) { }
+static void unfreeze_partials_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c) { }
+
+#endif /* CONFIG_SLUB_CPU_PARTIAL */
+
/*
* Put a page that was just frozen (in __slab_free|get_partial_node) into a
* partial page slot if available.
@@ -2406,7 +2435,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
* partial array is full. Move the existing
* set to the per node partial list.
*/
- unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+ unfreeze_partials(s);
oldpage = NULL;
pobjects = 0;
pages = 0;
@@ -2424,7 +2453,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
!= oldpage);
if (unlikely(!slub_cpu_partial(s)))
- unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+ unfreeze_partials(s);
preempt_enable();
#endif /* CONFIG_SLUB_CPU_PARTIAL */
@@ -2444,11 +2473,6 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
stat(s, CPUSLAB_FLUSH);
}
-/*
- * Flush cpu slab.
- *
- * Called from IPI handler with interrupts disabled.
- */
static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
{
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
@@ -2456,14 +2480,23 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
if (c->page)
flush_slab(s, c);
- unfreeze_partials(s, c);
+ unfreeze_partials_cpu(s, c);
}
+/*
+ * Flush cpu slab.
+ *
+ * Called from IPI handler with interrupts disabled.
+ */
static void flush_cpu_slab(void *d)
{
struct kmem_cache *s = d;
+ struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
- __flush_cpu_slab(s, smp_processor_id());
+ if (c->page)
+ flush_slab(s, c);
+
+ unfreeze_partials(s);
}
static bool has_cpu_slab(int cpu, void *info)
--
2.31.1
deactivate_slab() removes the cpu slab by merging the cpu freelist with slab's
freelist and putting the slab on the proper node's list. It also sets the
respective kmem_cache_cpu pointers to NULL.
By extracting the kmem_cache_cpu operations from the function, we can make it
not dependent on disabled irqs.
Also, if we return a single free pointer from ___slab_alloc(), we no longer have
to load the slab page before deactivation, or care whether somebody preempted us
and loaded a different slab page into our kmem_cache_cpu in the process.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 31 ++++++++++++++++++-------------
1 file changed, 18 insertions(+), 13 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index bc7bee5d4bf2..cf855cd09802 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2163,10 +2163,13 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
}
/*
- * Remove the cpu slab
+ * Finishes removing the cpu slab. Merges cpu's freelist with page's freelist,
+ * unfreezes the slabs and puts it on the proper list.
+ * Assumes the slab has been already safely taken away from kmem_cache_cpu
+ * by the caller.
*/
static void deactivate_slab(struct kmem_cache *s, struct page *page,
- void *freelist, struct kmem_cache_cpu *c)
+ void *freelist)
{
enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
struct kmem_cache_node *n = get_node(s, page_to_nid(page));
@@ -2295,9 +2298,6 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
discard_slab(s, page);
stat(s, FREE_SLAB);
}
-
- c->page = NULL;
- c->freelist = NULL;
}
/*
@@ -2429,10 +2429,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
- stat(s, CPUSLAB_FLUSH);
- deactivate_slab(s, c->page, c->freelist, c);
+ void *freelist = c->freelist;
+ struct page *page = c->page;
+ c->page = NULL;
+ c->freelist = NULL;
c->tid = next_tid(c->tid);
+
+ deactivate_slab(s, page, freelist);
+
+ stat(s, CPUSLAB_FLUSH);
}
/*
@@ -2712,7 +2718,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
local_irq_restore(flags);
goto reread_page;
}
- deactivate_slab(s, page, c->freelist, c);
+ freelist = c->freelist;
+ c->page = NULL;
+ c->freelist = NULL;
+ deactivate_slab(s, page, freelist);
local_irq_restore(flags);
new_slab:
@@ -2792,11 +2801,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return_single:
local_irq_save(flags);
- if (unlikely(c->page))
- flush_slab(s, c);
- c->page = page;
-
- deactivate_slab(s, page, get_freepointer(s, freelist), c);
+ deactivate_slab(s, page, get_freepointer(s, freelist));
local_irq_restore(flags);
return freelist;
}
--
2.31.1
unfreeze_partials() can be optimized so that it doesn't need irqs disabled for
the whole time. As the first step, move irq control into the function and
remove it from callers.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 01422c757f9b..446d2d5344c9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2304,9 +2304,8 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
/*
* Unfreeze all the cpu partial slabs.
*
- * This function must be called with interrupts disabled
- * for the cpu using c (or some other guarantee must be there
- * to guarantee no concurrent accesses).
+ * This function must be called with preemption or migration
+ * disabled with c local to the cpu.
*/
static void unfreeze_partials(struct kmem_cache *s,
struct kmem_cache_cpu *c)
@@ -2314,6 +2313,9 @@ static void unfreeze_partials(struct kmem_cache *s,
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct kmem_cache_node *n = NULL, *n2 = NULL;
struct page *page, *discard_page = NULL;
+ unsigned long flags;
+
+ local_irq_save(flags);
while ((page = slub_percpu_partial(c))) {
struct page new;
@@ -2366,6 +2368,8 @@ static void unfreeze_partials(struct kmem_cache *s,
discard_slab(s, page);
stat(s, FREE_SLAB);
}
+
+ local_irq_restore(flags);
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
@@ -2393,14 +2397,11 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
pobjects = oldpage->pobjects;
pages = oldpage->pages;
if (drain && pobjects > slub_cpu_partial(s)) {
- unsigned long flags;
/*
* partial array is full. Move the existing
* set to the per node partial list.
*/
- local_irq_save(flags);
unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
- local_irq_restore(flags);
oldpage = NULL;
pobjects = 0;
pages = 0;
@@ -2417,13 +2418,9 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
!= oldpage);
- if (unlikely(!slub_cpu_partial(s))) {
- unsigned long flags;
-
- local_irq_save(flags);
+ if (unlikely(!slub_cpu_partial(s)))
unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
- local_irq_restore(flags);
- }
+
preempt_enable();
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}
--
2.31.1
deactivate_slab() is now safe to be called with irqs enabled, so move the calls
outside of irq disabled sections.
When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so to
re-enable them before deactivate_slab() we need to open-code flush_slab() in
___slab_alloc() and re-enable irqs after modifying the kmem_cache_cpu fields.
But that means an irq handler might have assigned a new c->page in the meantime,
so we have to retry the whole check.
The remaining callers of flush_slab() are the IPI handler which has disabled
irqs anyway, and slub_cpu_dead() which will be dealt with in the following
patch.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slub.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 0f58859165bf..01422c757f9b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2722,8 +2722,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
freelist = c->freelist;
c->page = NULL;
c->freelist = NULL;
- deactivate_slab(s, page, freelist);
local_irq_restore(flags);
+ deactivate_slab(s, page, freelist);
new_slab:
@@ -2792,18 +2792,31 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
goto return_single;
+retry_load_page:
local_irq_save(flags);
- if (unlikely(c->page))
- flush_slab(s, c);
+ if (unlikely(c->page)) {
+ void *flush_freelist = c->freelist;
+ struct page *flush_page = c->page;
+
+ c->page = NULL;
+ c->freelist = NULL;
+ c->tid = next_tid(c->tid);
+
+ local_irq_restore(flags);
+
+ deactivate_slab(s, flush_page, flush_freelist);
+
+ stat(s, CPUSLAB_FLUSH);
+
+ goto retry_load_page;
+ }
c->page = page;
goto load_freelist;
return_single:
- local_irq_save(flags);
deactivate_slab(s, page, get_freepointer(s, freelist));
- local_irq_restore(flags);
return freelist;
}
--
2.31.1
On Tue, 25 May 2021, Vlastimil Babka wrote:
> Slub has a static spinlock protected bitmap for marking which objects are on
> freelist when it wants to list them, for situations where dynamically
> allocating such map can lead to recursion or locking issues, and on-stack
> bitmap would be too large.
>
> The handlers of sysfs files alloc_calls and free_calls also currently use this
> shared bitmap, but their syscall context makes it straightforward to allocate a
> private map before entering locked sections, so switch these processing paths
> to use a private bitmap.
Right in that case a GFP_KERNEL allocation is fine and you can avoid the
static map.
Acked-by: Christoph Lameter <[email protected]>
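To make the private-map approach above concrete, here is a minimal sketch (not
the actual patch; names mirror mm/slub.c, and the processing between allocation
and free is elided):

	unsigned long *obj_map;

	/* allocate a private bitmap in sleepable syscall context */
	obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
	if (!obj_map)
		return -ENOMEM;

	/* ... take the locks and mark free objects in obj_map ... */

	bitmap_free(obj_map);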
On Tue, 25 May 2021, Vlastimil Babka wrote:
> validate_slab_cache() is called either to handle a sysfs write, or from a
> self-test context. In both situations it's straightforward to preallocate a
> private object bitmap instead of grabbing the shared static one meant for
> critical sections, so let's do that.
Acked-by: Christoph Lameter <[email protected]>
On Tue, 25 May 2021, Vlastimil Babka wrote:
> The later patches will need more fine grained control over individual actions
> in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first
> preparatory step with no functional change.
Acked-by: Christoph Lameter <[email protected]>
On Tue, 25 May 2021, Vlastimil Babka wrote:
> The later patches will need more fine grained control over individual actions
> in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it
> there. This is a preparatory step with no functional change.
Acked-by: Christoph Lameter <[email protected]>
On Tue, 25 May 2021, Vlastimil Babka wrote:
> The function get_partial() finds a suitable page on a partial list, acquires
> and returns its freelist and assigns the page pointer to kmem_cache_node.
in kmem_cache_cpu ??
> In later patch we will need more control over the kmem_cache_node assignment,
kmem_cache_cpu?
> so instead return the page pointer to get_partial()'s caller and assign it
> there. No functional change as all of this still happens with disabled irq.
Instead of passing a kmem_cache_cpu pointer pass a pointer to a pointer to
a page ....
On Tue, May 25, 2021 at 01:39:22AM +0200, Vlastimil Babka wrote:
> validate_slab_cache() is called either to handle a sysfs write, or from a
> self-test context. In both situations it's straightforward to preallocate a
> private object bitmap instead of grabbing the shared static one meant for
> critical sections, so let's do that.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
>
> <SNIP>
>
> @@ -4685,10 +4685,17 @@ static long validate_slab_cache(struct kmem_cache *s)
> int node;
> unsigned long count = 0;
> struct kmem_cache_node *n;
> + unsigned long *obj_map;
> +
> + obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
> + if (!obj_map)
> + return -ENOMEM;
>
Most callers of validate_slab_cache don't care about the return value
except when the validate sysfs file is written. Should a simple
informational message be displayed for -ENOMEM in case a writer to
validate fails and it's not obvious it was because of an allocation
failure?
It's a fairly minor concern so whether you add a message or not
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, May 25, 2021 at 01:39:23AM +0200, Vlastimil Babka wrote:
> In slab_free_hook() we disable irqs around the debug_check_no_locks_freed()
> call, which is unnecessary, as irqs are already being disabled inside the call.
> This seems to be leftover from the past where there were more calls inside the
> irq disabled sections. Remove the irq disable/enable operations.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Looks like it was needed for kmemcheck which went away back in 4.15 so
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, May 25, 2021 at 12:36:52PM +0200, Vlastimil Babka wrote:
> > Most callers of validate_slab_cache don't care about the return value
> > except when the validate sysfs file is written. Should a simple
> > informational message be displayed for -ENOMEM in case a writer to
> > validate fails and it's not obvious it was because of an allocation
> > failure?
>
> The other callers are all in the effectively dead resiliency_test() code, which
> has meanwhile been replaced in mmotm by kunit tests. But it's true
> those don't check the results either for now.
>
Ok.
> > It's a fairly minor concern so whether you add a message or not
>
> I think I'll rather fix up the tests. Or do you mean that -ENOMEM for a sysfs
> write is also not enough and there should be a dmesg explanation for that case?
>
I mean the -ENOMEM for a sysfs write. While it's very unlikely, it would
explain an unexpected write failure.
--
Mel Gorman
SUSE Labs
On Tue, May 25, 2021 at 01:39:24AM +0200, Vlastimil Babka wrote:
> In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee that
> our kmem_cache_cpu pointer is from the same cpu as the tid value. Currently
> that's done by reading the tid first using this_cpu_read(), then the
> kmem_cache_cpu pointer and verifying we read the same tid using the pointer and
> plain READ_ONCE().
>
> This can be simplified to just fetching kmem_cache_cpu pointer and then reading
> tid using the pointer. That guarantees they are from the same cpu. We don't
> need to read the tid using this_cpu_read() because the value will be validated
> by this_cpu_cmpxchg_double(), making sure we are on the correct cpu and the
> freelist didn't change by anyone preempting us since reading the tid.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Wow, that's a fun approach to avoiding disabling preemption but the
validation check against preemption remains the same so;
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
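To illustrate the ordering described in the quoted changelog, a rough sketch of
the simplified fetch (not the exact hunk; names follow slab_alloc_node() in
mm/slub.c):

	/*
	 * Fetch the kmem_cache_cpu pointer first, then read the tid through
	 * it, so both are guaranteed to come from the same cpu.
	 */
	c = raw_cpu_ptr(s->cpu_slab);
	tid = READ_ONCE(c->tid);

	/*
	 * Being moved to another cpu or preempted since then is caught here,
	 * as the cmpxchg targets the current cpu's freelist/tid pair.
	 */
	if (unlikely(!this_cpu_cmpxchg_double(
			s->cpu_slab->freelist, s->cpu_slab->tid,
			object, tid,
			next_object, next_tid(tid)))) {
		note_cmpxchg_failure("slab_alloc", s, tid);
		goto redo;
	}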
On Tue, May 25, 2021 at 01:39:26AM +0200, Vlastimil Babka wrote:
> The later patches will need more fine grained control over individual actions
> in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it
> there. This is a preparatory step with no functional change.
>
> The only minor change is moving WARN_ON_ONCE() for using a constructor together
> with __GFP_ZERO to new_slab() which makes it somewhat less frequent, but still
> able to catch a misuse.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
The goto in the previous patch makes a bit more sense now
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, May 25, 2021 at 01:39:28AM +0200, Vlastimil Babka wrote:
> When we allocate slab object from a newly acquired page (from node's partial
> list or page allocator), we usually also retain the page as a new percpu slab.
> There are two exceptions - when pfmemalloc status of the page doesn't match our
> gfp flags, or when the cache has debugging enabled.
>
> The current code for these decisions is not easy to follow, so restructure it
> and add comments. No functional change.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
>
> <SNIP>
>
> + if (unlikely(!pfmemalloc_match(page, gfpflags)))
> + /*
> + * For !pfmemalloc_match() case we don't load freelist so that
> + * we don't make further mismatched allocations easier.
> + */
> + goto return_single;
> +
> + goto load_freelist;
> +
> +return_single:
>
This looked odd to me but I see other stuff goes between the two goto's
later in the series so
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, May 25, 2021 at 01:39:29AM +0200, Vlastimil Babka wrote:
> Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This
> includes cases where this is not needed, such as when the allocation ends up in
> the page allocator and has to awkwardly enable irqs back based on gfp flags.
> Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when
> it hits the __slab_alloc() slow path, and long periods with disabled interrupts
> are undesirable.
>
> As a first step towards reducing irq disabled periods, move irq handling into
> ___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer
> from becoming invalid via migrate_disable(). This does not protect against
> access preemption, which is still done by disabled irq for most of
> ___slab_alloc(). As the small immediate benefit, slab_out_of_memory() call from
> ___slab_alloc() is now done with irqs enabled.
>
> kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them
> before calling ___slab_alloc(), which then disables them at its discretion. The
> whole kmem_cache_alloc_bulk() operation also disables cpu migration.
>
> When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
> preemption, because new_slab() will re-enable interrupts in contexts that allow
> blocking.
>
> The patch itself will thus increase overhead a bit due to disabled migration
> and increased disabling/enabling irqs in kmem_cache_alloc_bulk(), but that will
> be gradually improved in the following patches.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Why did you use migrate_disable instead of preempt_disable? There is a
fairly large comment in include/linux/preempt.h on why migrate_disable
is undesirable so new users are likely to be put under the microscope
once Thomas or Peter notice it.
I think you are using it so that an allocation request can be preempted by
a higher priority task but given that the code was disabling interrupts,
there was already some preemption latency. However, migrate_disable
is more expensive than preempt_disable (function call versus a simple
increment). On that basis, I'd recommend starting with preempt_disable
and only using migrate_disable if necessary.
Bonus points for adding a comment where ___slab_alloc disables IRQs to
clarify what is protected -- I assume it's protecting kmem_cache_cpu
from being modified from interrupt context. If so, it's potentially a
local_lock candidate.
--
Mel Gorman
SUSE Labs
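For context on the migrate_disable() discussion, a rough sketch of the caller
pattern the quoted changelog describes (a simplification, not the exact patch):

	static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
				  unsigned long addr, struct kmem_cache_cpu *c)
	{
		void *p;

		/*
		 * Keep s->cpu_slab stable; irq disabling now happens inside
		 * ___slab_alloc() at its discretion.
		 */
		migrate_disable();
		c = this_cpu_ptr(s->cpu_slab);
		p = ___slab_alloc(s, gfpflags, node, addr, c);
		migrate_enable();

		return p;
	}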
On Tue, May 25, 2021 at 01:39:30AM +0200, Vlastimil Babka wrote:
> As another step of shortening irq disabled sections in ___slab_alloc(), don't
> disable irqs until doing initial checks if there is a cached percpu slab and
> it's suitable for our allocation.
>
> Now we have to recheck c->page after actually disabling irqs as an allocation
> in irq might have replaced it.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Minor nit only -- consider adding a comment at the new_slab label that
IRQs must be disabled already.
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On 5/25/21 12:17 PM, Mel Gorman wrote:
> On Tue, May 25, 2021 at 01:39:22AM +0200, Vlastimil Babka wrote:
>> validate_slab_cache() is called either to handle a sysfs write, or from a
>> self-test context. In both situations it's straightforward to preallocate a
>> private object bitmap instead of grabbing the shared static one meant for
>> critical sections, so let's do that.
>>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>>
>> <SNIP>
>>
>> @@ -4685,10 +4685,17 @@ static long validate_slab_cache(struct kmem_cache *s)
>> int node;
>> unsigned long count = 0;
>> struct kmem_cache_node *n;
>> + unsigned long *obj_map;
>> +
>> + obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
>> + if (!obj_map)
>> + return -ENOMEM;
>>
>
>
> Most callers of validate_slab_cache don't care about the return value
> except when the validate sysfs file is written. Should a simple
> informational message be displayed for -ENOMEM in case a writer to
> validate fails and it's not obvious it was because of an allocation
> failure?
The other callers are all in the effectively dead resiliency_test() code, which
has meanwhile been replaced in mmotm by kunit tests. But it's true
those don't check the results either for now.
> It's a fairly minor concern so whether you add a message or not
I think I'll rather fix up the tests. Or do you mean that -ENOMEM for a sysfs
write is also not enough and there should be a dmesg explanation for that case?
> Acked-by: Mel Gorman <[email protected]>
Thanks!
On Tue, May 25, 2021 at 01:39:21AM +0200, Vlastimil Babka wrote:
> Slub has a static spinlock protected bitmap for marking which objects are on
> freelist when it wants to list them, for situations where dynamically
> allocating such map can lead to recursion or locking issues, and on-stack
> bitmap would be too large.
>
> The handlers of sysfs files alloc_calls and free_calls also currently use this
> shared bitmap, but their syscall context makes it straightforward to allocate a
> private map before entering locked sections, so switch these processing paths
> to use a private bitmap.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, May 25, 2021 at 01:39:25AM +0200, Vlastimil Babka wrote:
> The later patches will need more fine grained control over individual actions
> in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first
> preparatory step with no functional change.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> <SNIP>
> @@ -2748,6 +2743,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> goto redo;
> }
>
> + freelist = get_partial(s, gfpflags, node, c);
> + if (freelist)
> + goto check_new_page;
> +
> freelist = new_slab_objects(s, gfpflags, node, &c);
>
> if (unlikely(!freelist)) {
It's a nit-pick, but why did you not simply do something like this
instead of the goto?
freelist = get_partial(s, gfpflags, node, c);
if (!freelist)
freelist = new_slab_objects(s, gfpflags, node, &c);
if (unlikely(!freelist))
...
If nothing else, the label may be misleading because c->page may not
be new.
--
Mel Gorman
SUSE Labs
On 5/25/21 2:35 PM, Mel Gorman wrote:
> On Tue, May 25, 2021 at 01:39:29AM +0200, Vlastimil Babka wrote:
>> Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This
>> includes cases where this is not needed, such as when the allocation ends up in
>> the page allocator and has to awkwardly enable irqs back based on gfp flags.
>> Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when
>> it hits the __slab_alloc() slow path, and long periods with disabled interrupts
>> are undesirable.
>>
>> As a first step towards reducing irq disabled periods, move irq handling into
>> ___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer
>> from becoming invalid via migrate_disable(). This does not protect against
>> access preemption, which is still done by disabled irq for most of
>> ___slab_alloc(). As the small immediate benefit, slab_out_of_memory() call from
>> ___slab_alloc() is now done with irqs enabled.
>>
>> kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them
>> before calling ___slab_alloc(), which then disables them at its discretion. The
>> whole kmem_cache_alloc_bulk() operation also disables cpu migration.
>>
>> When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
>> preemption, because new_slab() will re-enable interrupts in contexts that allow
>> blocking.
>>
>> The patch itself will thus increase overhead a bit due to disabled migration
>> and increased disabling/enabling irqs in kmem_cache_alloc_bulk(), but that will
>> be gradually improved in the following patches.
>>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>
> Why did you use migrate_disable instead of preempt_disable? There is a
> fairly large comment in include/linux/preempt.h on why migrate_disable
> is undesirable so new users are likely to be put under the microscope
> once Thomas or Peter notice it.
I understood it as while undesirable, there's nothing better for now.
> I think you are using it so that an allocation request can be preempted by
> a higher priority task but given that the code was disabling interrupts,
> there was already some preemption latency.
Yes, and the disabled interrupts will get progressively "smaller" in the series.
> However, migrate_disable
> is more expensive than preempt_disable (function call versus a simple
> increment).
That's true, I think perhaps it could be reimplemented so that on !PREEMPT_RT
and with no lockdep/preempt/whatnot debugging it could just translate to an
inline migrate_disable?
> On that basis, I'd recommend starting with preempt_disable
> and only using migrate_disable if necessary.
That's certainly possible and you're right it would be a less disruptive step.
My thinking was that on !PREEMPT_RT it's actually just preempt_disable (however
with the call overhead currently), but PREEMPT_RT would welcome the lack of
preempt disable. I'd be interested to hear RT guys opinion here.
> Bonus points for adding a comment where ___slab_alloc disables IRQs to
> clarify what is protected -- I assume it's protecting kmem_cache_cpu
> from being modified from interrupt context. If so, it's potentially a
> local_lock candidate.
Yeah that gets cleared up later :)
On Tue, May 25, 2021 at 1:40 AM Vlastimil Babka <[email protected]> wrote:
> Continue reducing the irq disabled scope. Check for per-cpu partial slabs
> first with irqs enabled and then recheck with irqs disabled before grabbing
> the slab page. Mostly preparatory for the following patches.
[...]
> diff --git a/mm/slub.c b/mm/slub.c
[...]
> if (slub_percpu_partial(c)) {
> + local_irq_save(flags);
> + if (unlikely(c->page)) {
> + local_irq_restore(flags);
> + goto reread_page;
> + }
> + if (unlikely(!slub_percpu_partial(c))) /* stolen by IRQ? */
> + goto new_objects;
nit: I think this comment is wrong by the end of the patch series,
since at that point, in RT configurations, it could also be stolen by
another task, if I understand correctly what migrate_disable() means?
Similarly the comment above ___slab_alloc() still talks about
disabling preemption for bulk allocation.
On 5/25/21 2:47 PM, Vlastimil Babka wrote:
> On 5/25/21 2:35 PM, Mel Gorman wrote:
>>
>> Why did you use migrate_disable instead of preempt_disable? There is a
>> fairly large comment in include/linux/preempt.h on why migrate_disable
>> is undesirable so new users are likely to be put under the microscope
>> once Thomas or Peter notice it.
>
> I understood it as while undesirable, there's nothing better for now.
Ah I now recalled the more important reason. By my understanding of
Documentation/locking/locktypes.rst it's not possible on PREEMPT_RT to do a
preempt_disable() and then take a spin_lock (or local_lock) which is a mutex on
RT and needs preemption enabled to take it. And one of the goals is that
list_lock would not have to be raw_spinlock on RT anymore.
>> I think you are using it so that an allocation request can be preempted by
>> a higher priority task but given that the code was disabling interrupts,
>> there was already some preemption latency.
>
> Yes, and the disabled interrupts will get progressively "smaller" in the series.
>
>> However, migrate_disable
>> is more expensive than preempt_disable (function call versus a simple
>> increment).
>
> That's true, I think perhaps it could be reimplemented so that on !PREEMPT_RT
> and with no lockdep/preempt/whatnot debugging it could just translate to an
> inline migrate_disable?
Correction: I meant "translate to an inline preempt_disable" which would then
not change anything for !PREEMPT_RT.
On Tue, May 25, 2021 at 02:47:10PM +0200, Vlastimil Babka wrote:
> On 5/25/21 2:35 PM, Mel Gorman wrote:
> > On Tue, May 25, 2021 at 01:39:29AM +0200, Vlastimil Babka wrote:
> >> Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This
> >> includes cases where this is not needed, such as when the allocation ends up in
> >> the page allocator and has to awkwardly enable irqs back based on gfp flags.
> >> Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when
> >> it hits the __slab_alloc() slow path, and long periods with disabled interrupts
> >> are undesirable.
> >>
> >> As a first step towards reducing irq disabled periods, move irq handling into
> >> ___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer
> >> from becoming invalid via migrate_disable(). This does not protect against
> >> access preemption, which is still done by disabled irq for most of
> >> ___slab_alloc(). As the small immediate benefit, slab_out_of_memory() call from
> >> ___slab_alloc() is now done with irqs enabled.
> >>
> >> kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them
> >> before calling ___slab_alloc(), which then disables them at its discretion. The
> >> whole kmem_cache_alloc_bulk() operation also disables cpu migration.
> >>
> >> When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
> >> preemption, because new_slab() will re-enable interrupts in contexts that allow
> >> blocking.
> >>
> >> The patch itself will thus increase overhead a bit due to disabled migration
> >> and increased disabling/enabling irqs in kmem_cache_alloc_bulk(), but that will
> >> be gradually improved in the following patches.
> >>
> >> Signed-off-by: Vlastimil Babka <[email protected]>
> >
> > Why did you use migrate_disable instead of preempt_disable? There is a
> > fairly large comment in include/linux/preempt.h on why migrate_disable
> > is undesirable so new users are likely to be put under the microscope
> > once Thomas or Peter notice it.
>
> I understood it as while undesirable, there's nothing better for now.
>
I think the "better" option is to reduce preempt_disable sections as
much as possible but you probably have limited options there. It might
be easier to justify if the sections you were protecting need to go to
sleep like what mm/highmem.c needs but that does not appear to be the case.
> > I think you are using it so that an allocation request can be preempted by
> > a higher priority task but given that the code was disabling interrupts,
> > there was already some preemption latency.
>
> Yes, and the disabled interrupts will get progressively "smaller" in the series.
>
> > However, migrate_disable
> > is more expensive than preempt_disable (function call versus a simple
> > increment).
>
> That's true, I think perhaps it could be reimplemented so that on !PREEMPT_RT
> and with no lockdep/preempt/whatnot debugging it could just translate to an
> inline migrate_disable?
>
It might be a bit too large for that.
> > On that basis, I'd recommend starting with preempt_disable
> > and only using migrate_disable if necessary.
>
> That's certainly possible and you're right it would be a less disruptive step.
> My thinking was that on !PREEMPT_RT it's actually just preempt_disable (however
> with the call overhead currently), but PREEMPT_RT would welcome the lack of
> preempt disable. I'd be interested to hear RT guys opinion here.
>
It does more than preempt_disable even on !PREEMPT_RT. It's only on !SMP
that it becomes inline. While it might allow a higher priority task to
preempt, PREEMPT_RT is also not the common case and I think it's better
to use the lighter-weight option for the majority of configurations.
> > Bonus points for adding a comment where ___slab_alloc disables IRQs to
> > clarify what is protected -- I assume it's protecting kmem_cache_cpu
> > from being modified from interrupt context. If so, it's potentially a
> > local_lock candidate.
>
> Yeah that gets cleared up later :)
>
I saw that after glancing through the rest of the series. While I didn't
spot anything major, I'd also like to hear from Peter or Thomas on whether
migrate_disable or preempt_disable would be preferred for mm/slub.c. The
preempt-rt tree does not help answer the question given that the slub
changes there are mostly about deferring some work until IRQs are enabled.
--
Mel Gorman
SUSE Labs
On Tue, May 25, 2021 at 1:40 AM Vlastimil Babka <[email protected]> wrote:
> In put_cpu_partial, we need a stable cpu, but being preempted is not an issue.
> So, disable migration instead of preemption.
I wouldn't say "not an issue", more like "you're not making it worse".
From what I can tell, the following race can already theoretically happen:
task A: put_cpu_partial() calls preempt_disable()
task A: oldpage = this_cpu_read(s->cpu_slab->partial)
interrupt: kfree() reaches unfreeze_partials() and discards the page
task B (on another CPU): reallocates page as page cache
task A: reads page->pages and page->pobjects, which are actually
halves of the pointer page->lru.prev
task B (on another CPU): frees page
interrupt: allocates page as SLUB page and places it on the percpu partial list
task A: this_cpu_cmpxchg() succeeds
which would cause page->pages and page->pobjects to end up containing
halves of pointers that would then influence when put_cpu_partial()
happens and show up in root-only sysfs files. Maybe that's acceptable,
I don't know. But there should probably at least be a comment for now
to point out that we're reading union fields of a page that might be
in a completely different state.
(Someone should probably fix that code sometime and get rid of
page->pobjects entirely, given how inaccurate it is...)
On 5/25/21 1:39 AM, Vlastimil Babka wrote:
> Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions of
> local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's
> equivalent, with better lockdep visibility. On PREEMPT_RT that means better
> preemption.
>
> Also update the comment about locking scheme in SLUB to reflect changes done
> by this series.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Hm just realized that this won't work on RT because, unlike a true irq disable,
the local_lock on RT won't protect against an irq handler whose alloc/free goes
through the lockless fastpath (which doesn't take the local_lock) and changes
the critical fields in kmem_cache_cpu.
Good that I've put the local_lock patch last, I guess. The reduced irq disabled
sections should still help.
> ---
> include/linux/slub_def.h | 2 +
> mm/slub.c | 90 ++++++++++++++++++++++++++++------------
> 2 files changed, 66 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index dcde82a4434c..b5bcac29b979 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -10,6 +10,7 @@
> #include <linux/kfence.h>
> #include <linux/kobject.h>
> #include <linux/reciprocal_div.h>
> +#include <linux/local_lock.h>
>
> enum stat_item {
> ALLOC_FASTPATH, /* Allocation from cpu slab */
> @@ -41,6 +42,7 @@ enum stat_item {
> NR_SLUB_STAT_ITEMS };
>
> struct kmem_cache_cpu {
> + local_lock_t lock; /* Protects the fields below except stat */
> void **freelist; /* Pointer to next available object */
> unsigned long tid; /* Globally unique transaction id */
> struct page *page; /* The slab from which we are allocating */
> diff --git a/mm/slub.c b/mm/slub.c
> index 8818c210cb97..5455c6bda997 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -43,13 +43,22 @@
> /*
> * Lock order:
> * 1. slab_mutex (Global Mutex)
> - * 2. node->list_lock
> + * 2. node->list_lock (Spinlock)
> + * OR
> + * kmem_cache->cpu_slab->lock (Local lock)
> * 3. slab_lock(page) (Only on some arches and for debugging)
> + * 4. object_map_lock (Only for debugging)
> *
> * slab_mutex
> *
> * The role of the slab_mutex is to protect the list of all the slabs
> * and to synchronize major metadata changes to slab cache structures.
> + * Also synchronizes memory hotplug callbacks.
> + *
> + * slab_lock
> + *
> + * The slab_lock is a wrapper around the page lock, thus it is a bit
> + * spinlock.
> *
> * The slab_lock is only used for debugging and on arches that do not
> * have the ability to do a cmpxchg_double. It only protects:
> @@ -65,6 +74,8 @@
> * froze the slab is the only one that can retrieve the objects from the
> * page's freelist.
> *
> + * list_lock
> + *
> * The list_lock protects the partial and full list on each node and
> * the partial slab counter. If taken then no new slabs may be added or
> * removed from the lists nor make the number of partial slabs be modified.
> @@ -76,10 +87,33 @@
> * slabs, operations can continue without any centralized lock. F.e.
> * allocating a long series of objects that fill up slabs does not require
> * the list lock.
> - * Interrupts are disabled during allocation and deallocation in order to
> - * make the slab allocator safe to use in the context of an irq. In addition
> - * interrupts are disabled to ensure that the processor does not change
> - * while handling per_cpu slabs, due to kernel preemption.
> + *
> + * cpu_slab->lock local lock
> + *
> + * This locks protect slowpath manipulation of all kmem_cache_cpu fields
> + * except the stat counters. This is a percpu structure manipulated only by
> + * the local cpu, so the lock protects against being preempted or interrupted
> + * by an irq. Fast path operations rely on lockless operations instead.
> + *
> + * lockless fastpaths
> + *
> + * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
> + * are fully lockless when satisfied from the percpu slab (and when
> + * cmpxchg_double is possible to use, otherwise slab_lock is taken).
> + * They also don't disable preemption or migration or irqs. They rely on
> + * the transaction id (tid) field to detect being preempted or moved to
> + * another cpu.
> + *
> + * irq, preemption, migration considerations
> + *
> + * Interrupts are disabled as part of list_lock or local_lock operations, or
> + * around the slab_lock operation, in order to make the slab allocator safe
> + * to use in the context of an irq.
> + *
> + * In addition, migration is disabled in the allocation slowpath, bulk
> + * allocation, and put_cpu_partial(), so that the local cpu doesn't change in
> + * the process and e.g. the kmem_cache_cpu pointer doesn't have to be
> + * revalidated in each section protected by the local lock.
> *
> * SLUB assigns one slab for allocation to each processor.
> * Allocations only occur from these slabs called cpu slabs.
> @@ -427,7 +461,7 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
> page->freelist = freelist_new;
> page->counters = counters_new;
> slab_unlock(page);
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> return true;
> }
> slab_unlock(page);
> @@ -2157,9 +2191,13 @@ static inline void note_cmpxchg_failure(const char *n,
> static void init_kmem_cache_cpus(struct kmem_cache *s)
> {
> int cpu;
> + struct kmem_cache_cpu *c;
>
> - for_each_possible_cpu(cpu)
> - per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
> + for_each_possible_cpu(cpu) {
> + c = per_cpu_ptr(s->cpu_slab, cpu);
> + local_lock_init(&c->lock);
> + c->tid = init_tid(cpu);
> + }
> }
>
> /*
> @@ -2708,9 +2746,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> goto deactivate_slab;
>
> /* must check again c->page in case of IRQ */
> - local_irq_save(flags);
> + local_lock_irqsave(&s->cpu_slab->lock, flags);
> if (unlikely(page != c->page)) {
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> goto reread_page;
> }
> freelist = c->freelist;
> @@ -2721,7 +2759,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
> if (!freelist) {
> c->page = NULL;
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> stat(s, DEACTIVATE_BYPASS);
> goto new_slab;
> }
> @@ -2737,37 +2775,37 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> VM_BUG_ON(!c->page->frozen);
> c->freelist = get_freepointer(s, freelist);
> c->tid = next_tid(c->tid);
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> return freelist;
>
> deactivate_slab:
> - local_irq_save(flags);
> + local_lock_irqsave(&s->cpu_slab->lock, flags);
> if (page != c->page) {
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> goto reread_page;
> }
> freelist = c->freelist;
> c->page = NULL;
> c->freelist = NULL;
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> deactivate_slab(s, page, freelist);
>
> new_slab:
>
> if (slub_percpu_partial(c)) {
> - local_irq_save(flags);
> + local_lock_irqsave(&s->cpu_slab->lock, flags);
> if (unlikely(c->page)) {
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> goto reread_page;
> }
> if (unlikely(!slub_percpu_partial(c))) { /* stolen by IRQ? */
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> goto new_objects;
> }
>
> page = c->page = slub_percpu_partial(c);
> slub_set_percpu_partial(c, page);
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> stat(s, CPU_PARTIAL_ALLOC);
> goto redo;
> }
> @@ -2820,7 +2858,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> goto return_single;
>
> retry_load_page:
> - local_irq_save(flags);
> + local_lock_irqsave(&s->cpu_slab->lock, flags);
> if (unlikely(c->page)) {
> void *flush_freelist = c->freelist;
> struct page *flush_page = c->page;
> @@ -2829,7 +2867,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> c->freelist = NULL;
> c->tid = next_tid(c->tid);
>
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>
> deactivate_slab(s, flush_page, flush_freelist);
>
> @@ -3389,7 +3427,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> */
> migrate_disable();
> c = this_cpu_ptr(s->cpu_slab);
> - local_irq_disable();
> + local_lock_irq(&s->cpu_slab->lock);
>
> for (i = 0; i < size; i++) {
> void *object = kfence_alloc(s, s->object_size, flags);
> @@ -3410,7 +3448,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> */
> c->tid = next_tid(c->tid);
>
> - local_irq_enable();
> + local_unlock_irq(&s->cpu_slab->lock);
>
> /*
> * Invoking slow path likely have side-effect
> @@ -3424,7 +3462,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> c = this_cpu_ptr(s->cpu_slab);
> maybe_wipe_obj_freeptr(s, p[i]);
>
> - local_irq_disable();
> + local_lock_irq(&s->cpu_slab->lock);
>
> continue; /* goto for-loop */
> }
> @@ -3433,7 +3471,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> maybe_wipe_obj_freeptr(s, p[i]);
> }
> c->tid = next_tid(c->tid);
> - local_irq_enable();
> + local_unlock_irq(&s->cpu_slab->lock);
> migrate_enable();
>
> /*
> @@ -3444,7 +3482,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> slab_want_init_on_alloc(flags, s));
> return i;
> error:
> - local_irq_enable();
> + local_unlock_irq(&s->cpu_slab->lock);
> slab_post_alloc_hook(s, objcg, flags, i, p, false);
> __kmem_cache_free_bulk(s, i, p);
> return 0;
>
On 5/25/21 1:33 PM, Mel Gorman wrote:
> On Tue, May 25, 2021 at 12:36:52PM +0200, Vlastimil Babka wrote:
>> > Most callers of validate_slab_cache don't care about the return value
>> > except when the validate sysfs file is written. Should a simple
>> > informational message be displayed for -ENOMEM in case a writer to
>> > validate fails and it's not obvious it was because of an allocation
>> > failure?
>>
>> The other callers are all in the effectively dead resiliency_test() code, which
>> has meanwhile been replaced in mmotm by kunit tests. But it's true
>> those don't check the results either for now.
>>
>
> Ok.
>
>> > It's a fairly minor concern so whether you add a message or not
>>
>> I think I'll rather fix up the tests. Or do you mean that -ENOMEM for a sysfs
>> write is also not enough and there should be a dmesg explanation for that case?
>>
>
> I mean the -ENOMEM for a sysfs write. While it's very unlikely, it would
> explain an unexpected write failure.
On second thought, a failed GFP_KERNEL allocation will already generate a
prominent warning, so an extra message looks arbitrary.
On 5/25/21 11:12 AM, Christoph Lameter wrote:
> On Tue, 25 May 2021, Vlastimil Babka wrote:
>
>> The function get_partial() finds a suitable page on a partial list, acquires
>> and returns its freelist and assigns the page pointer to kmem_cache_node.
> in kmem_cache_cpu ??
>
>> In later patch we will need more control over the kmem_cache_node assignment,
>
> kmem_cache_cpu?
>
>> so instead return the page pointer to get_partial()'s caller and assign it
>> there. No functional change as all of this still happens with disabled irq.
>
> Instead of passing a kmem_cache_cpu pointer pass a pointer to a pointer to
> a page ....
You're right, confused the two structures here. Fixed, thanks.
On 5/25/21 3:04 PM, Mel Gorman wrote:
> On Tue, May 25, 2021 at 01:39:30AM +0200, Vlastimil Babka wrote:
>> As another step of shortening irq disabled sections in ___slab_alloc(), don't
>> disable irqs until doing initial checks if there is a cached percpu slab and
>> it's suitable for our allocation.
>>
>> Now we have to recheck c->page after actually disabling irqs as an allocation
>> in irq might have replaced it.
>>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>
> Minor nit only -- consider adding a comment at the new_slab label that
> IRQs must be disabled already.
Good point, will use lockdep_assert_irqs_disabled() as that's a functional comment.
> Acked-by: Mel Gorman <[email protected]>
Thanks.
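For illustration, the functional comment mentioned above would look roughly like
this at the label (a sketch, not the committed hunk):

	new_slab:

		lockdep_assert_irqs_disabled();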
On 5/25/21 5:33 PM, Jann Horn wrote:
> On Tue, May 25, 2021 at 1:40 AM Vlastimil Babka <[email protected]> wrote:
>> In put_cpu_partial, we need a stable cpu, but being preempted is not an issue.
>> So, disable migration instead of preemption.
>
> I wouldn't say "not an issue", more like "you're not making it worse".
>
> From what I can tell, the following race can already theoretically happen:
>
> task A: put_cpu_partial() calls preempt_disable()
> task A: oldpage = this_cpu_read(s->cpu_slab->partial)
> interrupt: kfree() reaches unfreeze_partials() and discards the page
> task B (on another CPU): reallocates page as page cache
> task A: reads page->pages and page->pobjects, which are actually
> halves of the pointer page->lru.prev
> task B (on another CPU): frees page
> interrupt: allocates page as SLUB page and places it on the percpu partial list
> task A: this_cpu_cmpxchg() succeeds
Oops, nice find. Thanks.
> which would cause page->pages and page->pobjects to end up containing
> halves of pointers that would then influence when put_cpu_partial()
> happens and show up in root-only sysfs files. Maybe that's acceptable,
> I don't know. But there should probably at least be a comment for now
> to point out that we're reading union fields of a page that might be
> in a completely different state.
>
> (Someone should probably fix that code sometime and get rid of
> page->pobjects entirely, given how inaccurate it is...)
I'll try to address it separately later. Probably just target a number of pages,
instead of objects, on the list and store the number as part of struct
kmem_cache_cpu, not struct page. The inaccuracy leading to potentially long
lists is a good enough reason; the race scenario above is another one...