From: Chengming Zhou <[email protected]>
Changes in RFC v3:
- Directly use __set_bit() and __clear_bit() for the slab_node_partial
flag operations to avoid exporting non-atomic "workingset" interfaces.
- Change get_partial() related functions to return a slab instead of
returning the freelist or single object.
- Don't freeze any slab under the node list_lock to further reduce
list_lock holding times, as suggested by Vlastimil Babka.
- Introduce freeze_slab() to do the delay freezing and return freelist.
- Reorder patches.
- RFC v2: https://lore.kernel.org/all/[email protected]/
Changes in RFC v2:
- Reuse PG_workingset bit to keep track of whether slub is on the
per-node partial list, as suggested by Matthew Wilcox.
- Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
is caused by leak of partial slabs when get_partial_node().
- Add a patch to simplify acquire_slab().
- Reorder patches a little.
- RFC v1: https://lore.kernel.org/all/[email protected]/
1. Problem
==========
Now we have to freeze the slab when get from the node partial list, and
unfreeze the slab when put to the node partial list. Because we need to
rely on the node list_lock to synchronize the "frozen" bit changes.
This implementation has some drawbacks:
- Alloc path: twice cmpxchg_double.
It has to get some partial slabs from node when the allocator has used
up the CPU partial slabs. So it freeze the slab (one cmpxchg_double)
with node list_lock held, put those frozen slabs on its CPU partial
list. Later ___slab_alloc() will cmpxchg_double try-loop again if that
slab is picked to use.
- Alloc path: amplified contention on node list_lock.
Since we have to synchronize the "frozen" bit changes under the node
list_lock, the contention of slab (struct page) can be transferred
to the node list_lock. On machine with many CPUs in one node, the
contention of list_lock will be amplified by all CPUs' alloc path.
The current code has to workaround this problem by avoiding using
cmpxchg_double try-loop, which will just break and return when
contention of page encountered and the first cmpxchg_double failed.
But this workaround has its own problem. For more context, see
9b1ea29bc0d7 ("Revert "mm, slub: consider rest of partial list if
acquire_slab() fails"").
- Free path: redundant unfreeze.
__slab_free() will freeze and cache some slabs on its partial list,
and flush them to the node partial list when exceed, which has to
unfreeze those slabs again under the node list_lock. Actually we
don't need to freeze slab on CPU partial list, in which case we
can save the unfreeze cmpxchg_double operations in flush path.
2. Solution
===========
We solve these problems by leaving slabs unfrozen when moving out of
the node partial list and on CPU partial list, so "frozen" bit is 0.
These partial slabs won't be manipulate concurrently by alloc path,
the only racer is free path, which may manipulate its list when !inuse.
So we need to introduce another synchronization way to avoid it, we
reuse PG_workingset to keep track of whether the slab is on node partial
list or not, only in that case we can manipulate the slab list.
The slab will be delay frozen when it's picked to actively use by the
CPU, it becomes full at the same time, in which case we still need to
rely on "frozen" bit to avoid manipulating its list. So the slab will
be frozen only when activate use and be unfrozen only when deactivate.
3. Testing
==========
We just did some simple testing on a server with 128 CPUs (2 nodes) to
compare performance for now.
- perf bench sched messaging -g 5 -t -l 100000
baseline RFC
7.042s 6.966s
7.022s 7.045s
7.054s 6.985s
- stress-ng --rawpkt 128 --rawpkt-ops 100000000
baseline RFC
2.42s 2.15s
2.45s 2.16s
2.44s 2.17s
It shows above there is about 10% improvement on stress-ng rawpkt
testcase, although no much improvement on perf sched bench testcase.
Thanks for any comment and code review!
Chengming Zhou (7):
slub: Keep track of whether slub is on the per-node partial list
slub: Prepare __slab_free() for unfrozen partial slab out of node
partial list
slub: Reflow ___slab_alloc()
slub: Change get_partial() interfaces to return slab
slub: Introduce freeze_slab()
slub: Delay freezing of partial slabs
slub: Optimize deactivate_slab()
mm/slab.h | 19 ++++
mm/slub.c | 306 +++++++++++++++++++++++-------------------------------
2 files changed, 149 insertions(+), 176 deletions(-)
--
2.40.1
From: Chengming Zhou <[email protected]>
The get_partial() interface used in ___slab_alloc() may return a single
object in the "kmem_cache_debug(s)" case, in which we will just return
the "freelist" object.
Move this handling up to prepare for later changes.
And the "pfmemalloc_match()" part is not needed for node partial slab,
since we already check this in the get_partial_node().
Signed-off-by: Chengming Zhou <[email protected]>
---
mm/slub.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index f568a32d7332..cd8aa68c156e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3218,8 +3218,21 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
pc.slab = &slab;
pc.orig_size = orig_size;
freelist = get_partial(s, node, &pc);
- if (freelist)
- goto check_new_slab;
+ if (freelist) {
+ if (kmem_cache_debug(s)) {
+ /*
+ * For debug caches here we had to go through
+ * alloc_single_from_partial() so just store the
+ * tracking info and return the object.
+ */
+ if (s->flags & SLAB_STORE_USER)
+ set_track(s, freelist, TRACK_ALLOC, addr);
+
+ return freelist;
+ }
+
+ goto retry_load_slab;
+ }
slub_put_cpu_ptr(s->cpu_slab);
slab = new_slab(s, gfpflags, node);
@@ -3255,20 +3268,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
inc_slabs_node(s, slab_nid(slab), slab->objects);
-check_new_slab:
-
- if (kmem_cache_debug(s)) {
- /*
- * For debug caches here we had to go through
- * alloc_single_from_partial() so just store the tracking info
- * and return the object
- */
- if (s->flags & SLAB_STORE_USER)
- set_track(s, freelist, TRACK_ALLOC, addr);
-
- return freelist;
- }
-
if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
/*
* For !pfmemalloc_match() case we don't load freelist so that
--
2.40.1
From: Chengming Zhou <[email protected]>
Now the partial slub will be frozen when taken out of node partial list,
so the __slab_free() will know from "was_frozen" that the partial slab
is not on node partial list and is used by one kmem_cache_cpu.
But we will change this, make partial slabs leave the node partial list
with unfrozen state, so we need to change __slab_free() to use the new
slab_test_node_partial() we just introduced.
Signed-off-by: Chengming Zhou <[email protected]>
---
mm/slub.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/mm/slub.c b/mm/slub.c
index 3fad4edca34b..f568a32d7332 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3610,6 +3610,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
unsigned long counters;
struct kmem_cache_node *n = NULL;
unsigned long flags;
+ bool on_node_partial;
stat(s, FREE_SLOWPATH);
@@ -3657,6 +3658,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
*/
spin_lock_irqsave(&n->list_lock, flags);
+ on_node_partial = slab_test_node_partial(slab);
}
}
@@ -3685,6 +3687,15 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
return;
}
+ /*
+ * This slab was partial but not on the per-node partial list,
+ * in which case we shouldn't manipulate its list, just return.
+ */
+ if (prior && !on_node_partial) {
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ return;
+ }
+
if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
goto slab_empty;
--
2.40.1
From: Chengming Zhou <[email protected]>
We will have unfrozen slabs out of the node partial list later, so we
need a freeze_slab() function to freeze the partial slab and get its
freelist.
Signed-off-by: Chengming Zhou <[email protected]>
---
mm/slub.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/mm/slub.c b/mm/slub.c
index 7d0234bffad3..5b428648021f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3079,6 +3079,33 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
return freelist;
}
+/*
+ * Freeze the partial slab and return the pointer to the freelist.
+ */
+static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
+{
+ struct slab new;
+ unsigned long counters;
+ void *freelist;
+
+ do {
+ freelist = slab->freelist;
+ counters = slab->counters;
+
+ new.counters = counters;
+ VM_BUG_ON(new.frozen);
+
+ new.inuse = slab->objects;
+ new.frozen = 1;
+
+ } while (!__slab_update_freelist(s, slab,
+ freelist, counters,
+ NULL, new.counters,
+ "freeze_slab"));
+
+ return freelist;
+}
+
/*
* Slow path. The lockless freelist is empty or we need to perform
* debugging duties.
--
2.40.1
From: Chengming Zhou <[email protected]>
We need all get_partial() related interfaces to return a slab, instead
of returning the freelist (or object).
Use the partial_context.object to return back freelist or object for
now. This patch shouldn't have any functional changes.
Signed-off-by: Chengming Zhou <[email protected]>
---
mm/slub.c | 63 +++++++++++++++++++++++++++++--------------------------
1 file changed, 33 insertions(+), 30 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index cd8aa68c156e..7d0234bffad3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -204,9 +204,9 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
/* Structure holding parameters for get_partial() call chain */
struct partial_context {
- struct slab **slab;
gfp_t flags;
unsigned int orig_size;
+ void *object;
};
static inline bool kmem_cache_debug(struct kmem_cache *s)
@@ -2271,10 +2271,11 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
/*
* Try to allocate a partial slab from a specific node.
*/
-static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
- struct partial_context *pc)
+static struct slab *get_partial_node(struct kmem_cache *s,
+ struct kmem_cache_node *n,
+ struct partial_context *pc)
{
- struct slab *slab, *slab2;
+ struct slab *slab, *slab2, *partial = NULL;
void *object = NULL;
unsigned long flags;
unsigned int partial_slabs = 0;
@@ -2290,27 +2291,28 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
- void *t;
-
if (!pfmemalloc_match(slab, pc->flags))
continue;
if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
object = alloc_single_from_partial(s, n, slab,
pc->orig_size);
- if (object)
+ if (object) {
+ partial = slab;
+ pc->object = object;
break;
+ }
continue;
}
- t = acquire_slab(s, n, slab, object == NULL);
- if (!t)
+ object = acquire_slab(s, n, slab, object == NULL);
+ if (!object)
break;
- if (!object) {
- *pc->slab = slab;
+ if (!partial) {
+ partial = slab;
+ pc->object = object;
stat(s, ALLOC_FROM_PARTIAL);
- object = t;
} else {
put_cpu_partial(s, slab, 0);
stat(s, CPU_PARTIAL_NODE);
@@ -2326,20 +2328,21 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
}
spin_unlock_irqrestore(&n->list_lock, flags);
- return object;
+ return partial;
}
/*
* Get a slab from somewhere. Search in increasing NUMA distances.
*/
-static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
+static struct slab *get_any_partial(struct kmem_cache *s,
+ struct partial_context *pc)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
struct zoneref *z;
struct zone *zone;
enum zone_type highest_zoneidx = gfp_zone(pc->flags);
- void *object;
+ struct slab *slab;
unsigned int cpuset_mems_cookie;
/*
@@ -2374,8 +2377,8 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
if (n && cpuset_zone_allowed(zone, pc->flags) &&
n->nr_partial > s->min_partial) {
- object = get_partial_node(s, n, pc);
- if (object) {
+ slab = get_partial_node(s, n, pc);
+ if (slab) {
/*
* Don't check read_mems_allowed_retry()
* here - if mems_allowed was updated in
@@ -2383,7 +2386,7 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
* between allocation and the cpuset
* update
*/
- return object;
+ return slab;
}
}
}
@@ -2395,17 +2398,18 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
/*
* Get a partial slab, lock it and return it.
*/
-static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
+static struct slab *get_partial(struct kmem_cache *s, int node,
+ struct partial_context *pc)
{
- void *object;
+ struct slab *slab;
int searchnode = node;
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();
- object = get_partial_node(s, get_node(s, searchnode), pc);
- if (object || node != NUMA_NO_NODE)
- return object;
+ slab = get_partial_node(s, get_node(s, searchnode), pc);
+ if (slab || node != NUMA_NO_NODE)
+ return slab;
return get_any_partial(s, pc);
}
@@ -3215,10 +3219,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
new_objects:
pc.flags = gfpflags;
- pc.slab = &slab;
pc.orig_size = orig_size;
- freelist = get_partial(s, node, &pc);
- if (freelist) {
+ slab = get_partial(s, node, &pc);
+ if (slab) {
+ freelist = pc.object;
if (kmem_cache_debug(s)) {
/*
* For debug caches here we had to go through
@@ -3410,12 +3414,11 @@ static void *__slab_alloc_node(struct kmem_cache *s,
void *object;
pc.flags = gfpflags;
- pc.slab = &slab;
pc.orig_size = orig_size;
- object = get_partial(s, node, &pc);
+ slab = get_partial(s, node, &pc);
- if (object)
- return object;
+ if (slab)
+ return pc.object;
slab = new_slab(s, gfpflags, node);
if (unlikely(!slab)) {
--
2.40.1
On 10/24/23 11:33, [email protected] wrote:
> From: Chengming Zhou <[email protected]>
>
> Now the partial slub will be frozen when taken out of node partial list,
partially empty slab
> so the __slab_free() will know from "was_frozen" that the partial slab
> is not on node partial list and is used by one kmem_cache_cpu.
... is a cpu or cpu partial slab of some cpu.
> But we will change this, make partial slabs leave the node partial list
> with unfrozen state, so we need to change __slab_free() to use the new
> slab_test_node_partial() we just introduced.
>
> Signed-off-by: Chengming Zhou <[email protected]>
> ---
> mm/slub.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 3fad4edca34b..f568a32d7332 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3610,6 +3610,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> unsigned long counters;
> struct kmem_cache_node *n = NULL;
> unsigned long flags;
> + bool on_node_partial;
>
> stat(s, FREE_SLOWPATH);
>
> @@ -3657,6 +3658,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> */
> spin_lock_irqsave(&n->list_lock, flags);
>
> + on_node_partial = slab_test_node_partial(slab);
> }
> }
>
> @@ -3685,6 +3687,15 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> return;
> }
>
> + /*
> + * This slab was partial but not on the per-node partial list,
This slab was partially empty ...
Otherwise LGTM!
> + * in which case we shouldn't manipulate its list, just return.
> + */
> + if (prior && !on_node_partial) {
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + return;
> + }
> +
> if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
> goto slab_empty;
>
On 10/24/23 11:33, [email protected] wrote:
> From: Chengming Zhou <[email protected]>
>
> The get_partial() interface used in ___slab_alloc() may return a single
> object in the "kmem_cache_debug(s)" case, in which we will just return
> the "freelist" object.
>
> Move this handling up to prepare for later changes.
>
> And the "pfmemalloc_match()" part is not needed for node partial slab,
> since we already check this in the get_partial_node().
Good catch.
> Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Thanks!
On Tue, 24 Oct 2023, [email protected] wrote:
> 2. Solution
> ===========
> We solve these problems by leaving slabs unfrozen when moving out of
> the node partial list and on CPU partial list, so "frozen" bit is 0.
>
> These partial slabs won't be manipulate concurrently by alloc path,
> the only racer is free path, which may manipulate its list when !inuse.
> So we need to introduce another synchronization way to avoid it, we
> reuse PG_workingset to keep track of whether the slab is on node partial
> list or not, only in that case we can manipulate the slab list.
>
> The slab will be delay frozen when it's picked to actively use by the
> CPU, it becomes full at the same time, in which case we still need to
> rely on "frozen" bit to avoid manipulating its list. So the slab will
> be frozen only when activate use and be unfrozen only when deactivate.
I think we have to clear our terminology a bit about what a "frozen" slab
is.
Before this patch a frozen slab is not on the node partial list and
therefore its state on the list does not have to be considered during
freeing and other operations. The frozen slab could be actively allocated
from.
From the source:
* Frozen slabs
*
* If a slab is frozen then it is exempt from list management. It is not
* on any list except per cpu partial list. The processor that froze the
* slab is the one who can perform list operations on the slab. Other
* processors may put objects onto the freelist but the processor that
* froze the slab is the only one that can retrieve the objects from the
* slab's freelist.
*
After this patch the PG_workingset indicates the state of being on
the partial lists.
What does "frozen slab" then mean? The slab is being allocated from? Is
that information useful or can we drop the frozen flag?
Update the definition?
On 2023/10/27 23:18, Vlastimil Babka wrote:
> On 10/24/23 11:33, [email protected] wrote:
>> From: Chengming Zhou <[email protected]>
>>
>> Now the partial slub will be frozen when taken out of node partial list,
>
> partially empty slab
>
>> so the __slab_free() will know from "was_frozen" that the partial slab
>> is not on node partial list and is used by one kmem_cache_cpu.
>
> ... is a cpu or cpu partial slab of some cpu.
>
>> But we will change this, make partial slabs leave the node partial list
>> with unfrozen state, so we need to change __slab_free() to use the new
>> slab_test_node_partial() we just introduced.
>>
>> Signed-off-by: Chengming Zhou <[email protected]>
>> ---
>> mm/slub.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 3fad4edca34b..f568a32d7332 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3610,6 +3610,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> unsigned long counters;
>> struct kmem_cache_node *n = NULL;
>> unsigned long flags;
>> + bool on_node_partial;
>>
>> stat(s, FREE_SLOWPATH);
>>
>> @@ -3657,6 +3658,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> */
>> spin_lock_irqsave(&n->list_lock, flags);
>>
>> + on_node_partial = slab_test_node_partial(slab);
>> }
>> }
>>
>> @@ -3685,6 +3687,15 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> return;
>> }
>>
>> + /*
>> + * This slab was partial but not on the per-node partial list,
>
> This slab was partially empty ...
>
> Otherwise LGTM!
Ok, will fix.
Thanks!
>
>> + * in which case we shouldn't manipulate its list, just return.
>> + */
>> + if (prior && !on_node_partial) {
>> + spin_unlock_irqrestore(&n->list_lock, flags);
>> + return;
>> + }
>> +
>> if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
>> goto slab_empty;
>>
>
On 2023/10/28 01:57, Christoph Lameter wrote:
> On Tue, 24 Oct 2023, [email protected] wrote:
>
>> 2. Solution
>> ===========
>> We solve these problems by leaving slabs unfrozen when moving out of
>> the node partial list and on CPU partial list, so "frozen" bit is 0.
>>
>> These partial slabs won't be manipulate concurrently by alloc path,
>> the only racer is free path, which may manipulate its list when !inuse.
>> So we need to introduce another synchronization way to avoid it, we
>> reuse PG_workingset to keep track of whether the slab is on node partial
>> list or not, only in that case we can manipulate the slab list.
>>
>> The slab will be delay frozen when it's picked to actively use by the
>> CPU, it becomes full at the same time, in which case we still need to
>> rely on "frozen" bit to avoid manipulating its list. So the slab will
>> be frozen only when activate use and be unfrozen only when deactivate.
>
> I think we have to clear our terminology a bit about what a "frozen" slab is.
Yes, we need to clean up these inconsistent documentations in the source.
>
> Before this patch a frozen slab is not on the node partial list and therefore its state on the list does not have to be considered during freeing and other operations. The frozen slab could be actively allocated from.
>
> From the source:
>
> * Frozen slabs
> *
> * If a slab is frozen then it is exempt from list management. It is not
> * on any list except per cpu partial list. The processor that froze the
~~ except per cpu partial list ~~
Frozen slab is not on any list, it's actively allocated from by the processor
that froze it. IOW, frozen slab is the cpu slab.
> * slab is the one who can perform list operations on the slab. Other
> * processors may put objects onto the freelist but the processor that
> * froze the slab is the only one that can retrieve the objects from the
> * slab's freelist.
> *
This part I think is unchanged.
>
>
> After this patch the PG_workingset indicates the state of being on the partial lists.
>
> What does "frozen slab" then mean? The slab is being allocated from? Is that information useful or can we drop the frozen flag?
Right, frozen slab is the cpu slab, which is being allocated from by the cpu that froze it.
IMHO, the "frozen" bit is useful because:
1. PG_workingset is only useful on partial slab, which indicates the slab is on the node
partial list, so we can manipulate its list in the __slab_free() path.
2. But for full slab (slab->freelist == NULL), PG_workingset is not much useful, we don't
safely know whether it's used as the cpu slab or not just from this flag. So __slab_free()
still rely on the "frozen" bit to know it.
3. And the maintaining of "frozen" has no extra cost now, since it's changed together with "freelist"
and other counter using cmpxchg, we already have the cmpxchg when start to use a slab as the cpu slab.
Maybe I missed something, I don't know how to drop the frozen flag.
>
> Update the definition?
>
Ok, will add a cleanup patch to update.
Thanks!
On 10/28/23 04:36, Chengming Zhou wrote:
>>
>>
>> After this patch the PG_workingset indicates the state of being on the partial lists.
>>
>> What does "frozen slab" then mean? The slab is being allocated from? Is that information useful or can we drop the frozen flag?
>
> Right, frozen slab is the cpu slab, which is being allocated from by the cpu that froze it.
>
> IMHO, the "frozen" bit is useful because:
>
> 1. PG_workingset is only useful on partial slab, which indicates the slab is on the node
> partial list, so we can manipulate its list in the __slab_free() path.
>
> 2. But for full slab (slab->freelist == NULL), PG_workingset is not much useful, we don't
> safely know whether it's used as the cpu slab or not just from this flag. So __slab_free()
> still rely on the "frozen" bit to know it.
Well, we could extend the meaning of PG_workingset to mean "not a cpu slab
or pecpu partial slab" i.e. both on node partial list and full. However it
would increase the number of cases where __slab_free() has to lock the
list_lock and check the PG_working set. "slab->freelist == NULL" might
happen often exactly because the freelist became cpu freelist.
> 3. And the maintaining of "frozen" has no extra cost now, since it's changed together with "freelist"
> and other counter using cmpxchg, we already have the cmpxchg when start to use a slab as the cpu slab.
And together with this point, I don't see a reason to drop the frozen bit.
It's still useful for cpu slabs. It just wasn't the best possible solution
for percpu partial slabs.
> Maybe I missed something, I don't know how to drop the frozen flag.
Should be possible, but not worth it IMHO.
>>
>> Update the definition?
>>
>
> Ok, will add a cleanup patch to update.
>
> Thanks!
On 10/24/23 11:33, [email protected] wrote:
> From: Chengming Zhou <[email protected]>
>
> We need all get_partial() related interfaces to return a slab, instead
> of returning the freelist (or object).
>
> Use the partial_context.object to return back freelist or object for
> now. This patch shouldn't have any functional changes.
>
> Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
I think you could even move patches 3/7 and 4/7 to the front of the series,
as cleanups that are useful on their own.
Thanks!
On 10/24/23 11:33, [email protected] wrote:
> From: Chengming Zhou <[email protected]>
>
> We will have unfrozen slabs out of the node partial list later, so we
> need a freeze_slab() function to freeze the partial slab and get its
> freelist.
>
> Signed-off-by: Chengming Zhou <[email protected]>
As you noted, we'll need slab_update_freelist().
Otherwise,
Reviewed-by: Vlastimil Babka <[email protected]>
> ---
> mm/slub.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 7d0234bffad3..5b428648021f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3079,6 +3079,33 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
> return freelist;
> }
>
> +/*
> + * Freeze the partial slab and return the pointer to the freelist.
> + */
> +static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
> +{
> + struct slab new;
> + unsigned long counters;
> + void *freelist;
> +
> + do {
> + freelist = slab->freelist;
> + counters = slab->counters;
> +
> + new.counters = counters;
> + VM_BUG_ON(new.frozen);
> +
> + new.inuse = slab->objects;
> + new.frozen = 1;
> +
> + } while (!__slab_update_freelist(s, slab,
> + freelist, counters,
> + NULL, new.counters,
> + "freeze_slab"));
> +
> + return freelist;
> +}
> +
> /*
> * Slow path. The lockless freelist is empty or we need to perform
> * debugging duties.
On Sat, 28 Oct 2023, Chengming Zhou wrote:
> 2. But for full slab (slab->freelist == NULL), PG_workingset is not much useful, we don't
> safely know whether it's used as the cpu slab or not just from this flag. So __slab_free()
> still rely on the "frozen" bit to know it.
>
> 3. And the maintaining of "frozen" has no extra cost now, since it's changed together with "freelist"
> and other counter using cmpxchg, we already have the cmpxchg when start to use a slab as the cpu slab.
>
> Maybe I missed something, I don't know how to drop the frozen flag.
Maybe frozen is now = PG_Workingset | cmpxchg-frozen?
On 2023/10/31 00:55, Vlastimil Babka wrote:
> On 10/24/23 11:33, [email protected] wrote:
>> From: Chengming Zhou <[email protected]>
>>
>> We need all get_partial() related interfaces to return a slab, instead
>> of returning the freelist (or object).
>>
>> Use the partial_context.object to return back freelist or object for
>> now. This patch shouldn't have any functional changes.
>>
>> Signed-off-by: Chengming Zhou <[email protected]>
>
> Reviewed-by: Vlastimil Babka <[email protected]>
>
> I think you could even move patches 3/7 and 4/7 to the front of the series,
> as cleanups that are useful on their own.
>
Right, I will move these two to the front in the next version.
Thanks!
On 2023/10/31 00:19, Vlastimil Babka wrote:
> On 10/28/23 04:36, Chengming Zhou wrote:
>>>
>>>
>>> After this patch the PG_workingset indicates the state of being on the partial lists.
>>>
>>> What does "frozen slab" then mean? The slab is being allocated from? Is that information useful or can we drop the frozen flag?
>>
>> Right, frozen slab is the cpu slab, which is being allocated from by the cpu that froze it.
>>
>> IMHO, the "frozen" bit is useful because:
>>
>> 1. PG_workingset is only useful on partial slab, which indicates the slab is on the node
>> partial list, so we can manipulate its list in the __slab_free() path.
>>
>> 2. But for full slab (slab->freelist == NULL), PG_workingset is not much useful, we don't
>> safely know whether it's used as the cpu slab or not just from this flag. So __slab_free()
>> still rely on the "frozen" bit to know it.
>
> Well, we could extend the meaning of PG_workingset to mean "not a cpu slab
> or pecpu partial slab" i.e. both on node partial list and full. However it
> would increase the number of cases where __slab_free() has to lock the
> list_lock and check the PG_working set. "slab->freelist == NULL" might
> happen often exactly because the freelist became cpu freelist.
Ah, right, it's possible to do like this.
>
>> 3. And the maintaining of "frozen" has no extra cost now, since it's changed together with "freelist"
>> and other counter using cmpxchg, we already have the cmpxchg when start to use a slab as the cpu slab.
>
> And together with this point, I don't see a reason to drop the frozen bit.
> It's still useful for cpu slabs. It just wasn't the best possible solution
> for percpu partial slabs.
>
>> Maybe I missed something, I don't know how to drop the frozen flag.
>
> Should be possible, but not worth it IMHO.
Agree, we'll just keep "frozen" for the cpu slabs.
Thanks!
On 2023/10/31 03:25, Christoph Lameter wrote:
> On Sat, 28 Oct 2023, Chengming Zhou wrote:
>
>> 2. But for full slab (slab->freelist == NULL), PG_workingset is not much useful, we don't
>> safely know whether it's used as the cpu slab or not just from this flag. So __slab_free()
>> still rely on the "frozen" bit to know it.
>>
>> 3. And the maintaining of "frozen" has no extra cost now, since it's changed together with "freelist"
>> and other counter using cmpxchg, we already have the cmpxchg when start to use a slab as the cpu slab.
>>
>> Maybe I missed something, I don't know how to drop the frozen flag.
>
>
> Maybe frozen is now = PG_Workingset | cmpxchg-frozen?
>
>
The current scheme (which this series implemented) is:
- node partial slabs: PG_Workingset (set or clear with per-node list_lock protection)
- cpu partial slabs: !PG_Workingset
- cpu slabs: !PG_Workingset && frozen (set or clear using cmpxchg together with freelist)
- full slabs: !PG_Workingset
As Vlastimil noted, it's possible to drop "frozen" bit for cpu slabs, but
we keep it for performance, since we don't need to grab node list_lock to
check whether PG_Workingset is set or not if the "frozen" bit is set.
Thanks!
On Tue, 31 Oct 2023, Chengming Zhou wrote:
> The current scheme (which this series implemented) is:
>
> - node partial slabs: PG_Workingset (set or clear with per-node list_lock protection)
> - cpu partial slabs: !PG_Workingset
And then the frozen flag needs to be set. Otherwise slab_free() would
conclude it is on a partial list?
> - cpu slabs: !PG_Workingset && frozen (set or clear using cmpxchg together with freelist)
> - full slabs: !PG_Workingset
And frozen is clear? Otherwise it is the same as a cpu partial slab.
> As Vlastimil noted, it's possible to drop "frozen" bit for cpu slabs, but
> we keep it for performance, since we don't need to grab node list_lock to
> check whether PG_Workingset is set or not if the "frozen" bit is set.
>
> Thanks!
>
On 2023/10/31 11:47, Christoph Lameter wrote:
> On Tue, 31 Oct 2023, Chengming Zhou wrote:
>
>> The current scheme (which this series implemented) is:
>>
>> - node partial slabs: PG_Workingset (set or clear with per-node list_lock protection)
>> - cpu partial slabs: !PG_Workingset
>
> And then the frozen flag needs to be set. Otherwise slab_free() would conclude it is on a partial list?
>
- cpu partial slabs: !PG_Workingset && !frozen
Here comes the optimization that "frozen" is not set for the cpu partial slabs,
slab_free() will grab node list_lock then check by !PG_Workingset that it's not
on a node partial list.
>> - cpu slabs: !PG_Workingset && frozen (set or clear using cmpxchg together with freelist)
>
>
>
>> - full slabs: !PG_Workingset
>
> And frozen is clear? Otherwise it is the same as a cpu partial slab.
>
Right, - full slabs: !PG_Workingset && !frozen
Yes, it's the same as a cpu partial slab from only these two flags, but
slab_free() also know whether it was full or not.
>> As Vlastimil noted, it's possible to drop "frozen" bit for cpu slabs, but
>> we keep it for performance, since we don't need to grab node list_lock to
>> check whether PG_Workingset is set or not if the "frozen" bit is set.
>>
>> Thanks!
>>