2023-10-21 14:44:31

by Chengming Zhou

[permalink] [raw]
Subject: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

From: Chengming Zhou <[email protected]>

Changes in RFC v2:
- Reuse PG_workingset bit to keep track of whether slub is on the
per-node partial list, as suggested by Matthew Wilcox.
- Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
is caused by leak of partial slabs when get_partial_node().
- Add a patch to simplify acquire_slab().
- Reorder patches a little.
- v1: https://lore.kernel.org/all/[email protected]/

1. Problem
==========
Now we have to freeze the slab when get from the node partial list, and
unfreeze the slab when put to the node partial list. Because we need to
rely on the node list_lock to synchronize the "frozen" bit changes.

This implementation has some drawbacks:

- Alloc path: twice cmpxchg_double.
It has to get some partial slabs from node when the allocator has used
up the CPU partial slabs. So it freeze the slab (one cmpxchg_double)
with node list_lock held, put those frozen slabs on its CPU partial
list. Later ___slab_alloc() will cmpxchg_double try-loop again if that
slab is picked to use.

- Alloc path: amplified contention on node list_lock.
Since we have to synchronize the "frozen" bit changes under the node
list_lock, the contention of slab (struct page) can be transferred
to the node list_lock. On machine with many CPUs in one node, the
contention of list_lock will be amplified by all CPUs' alloc path.

The current code has to workaround this problem by avoiding using
cmpxchg_double try-loop, which will just break and return when
contention of page encountered and the first cmpxchg_double failed.
But this workaround has its own problem.

- Free path: redundant unfreeze.
__slab_free() will freeze and cache some slabs on its partial list,
and flush them to the node partial list when exceed, which has to
unfreeze those slabs again under the node list_lock. Actually we
don't need to freeze slab on CPU partial list, in which case we
can save the unfreeze cmpxchg_double operations in flush path.

2. Solution
===========
We solve these problems by leaving slabs unfrozen when moving out of
the node partial list and on CPU partial list, so "frozen" bit is 0.

These partial slabs won't be manipulate concurrently by alloc path,
the only racer is free path, which may manipulate its list when !inuse.
So we need to introduce another synchronization way to avoid it, we
reuse PG_workingset to keep track of whether the slab is on node partial
list or not, only in that case we can manipulate the slab list.

The slab will be delay frozen when it's picked to actively use by the
CPU, it becomes full at the same time, in which case we still need to
rely on "frozen" bit to avoid manipulating its list. So the slab will
be frozen only when activate use and be unfrozen only when deactivate.

3. Testing
==========
We just did some simple testing on a server with 128 CPUs (2 nodes) to
compare performance for now.

- perf bench sched messaging -g 5 -t -l 100000
baseline RFC
7.042s 6.966s
7.022s 7.045s
7.054s 6.985s

- stress-ng --rawpkt 128 --rawpkt-ops 100000000
baseline RFC
2.42s 2.15s
2.45s 2.16s
2.44s 2.17s

It shows above there is about 10% improvement on stress-ng rawpkt
testcase, although no much improvement on perf sched bench testcase.

Thanks for any comment and code review!

Chengming Zhou (6):
slub: Keep track of whether slub is on the per-node partial list
slub: Prepare __slab_free() for unfrozen partial slab out of node
partial list
slub: Don't freeze slabs for cpu partial
slub: Simplify acquire_slab()
slub: Introduce get_cpu_partial()
slub: Optimize deactivate_slab()

include/linux/page-flags.h | 2 +
mm/slab.h | 19 +++
mm/slub.c | 245 +++++++++++++++++++------------------
3 files changed, 150 insertions(+), 116 deletions(-)

--
2.20.1


2023-10-21 14:44:43

by Chengming Zhou

[permalink] [raw]
Subject: [RFC PATCH v2 1/6] slub: Keep track of whether slub is on the per-node partial list

From: Chengming Zhou <[email protected]>

Now we rely on the "frozen" bit to see if we should manipulate the
slab->slab_list, which will be changed in the following patch.

Instead we introduce another way to keep track of whether slub is on
the per-node partial list, here we reuse the PG_workingset bit.

Signed-off-by: Chengming Zhou <[email protected]>
---
include/linux/page-flags.h | 2 ++
mm/slab.h | 19 +++++++++++++++++++
mm/slub.c | 3 +++
3 files changed, 24 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a88e64acebfe..e8b1be71d722 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -478,6 +478,8 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
PAGEFLAG(Workingset, workingset, PF_HEAD)
TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
+ __SETPAGEFLAG(Workingset, workingset, PF_HEAD)
+ __CLEARPAGEFLAG(Workingset, workingset, PF_HEAD)
__PAGEFLAG(Slab, slab, PF_NO_TAIL)
PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */

diff --git a/mm/slab.h b/mm/slab.h
index 8cd3294fedf5..9cff64cae8de 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -193,6 +193,25 @@ static inline void __slab_clear_pfmemalloc(struct slab *slab)
__folio_clear_active(slab_folio(slab));
}

+/*
+ * Slub reuse PG_workingset bit to keep track of whether it's on
+ * the per-node partial list.
+ */
+static inline bool slab_test_node_partial(const struct slab *slab)
+{
+ return folio_test_workingset((struct folio *)slab_folio(slab));
+}
+
+static inline void slab_set_node_partial(struct slab *slab)
+{
+ __folio_set_workingset(slab_folio(slab));
+}
+
+static inline void slab_clear_node_partial(struct slab *slab)
+{
+ __folio_clear_workingset(slab_folio(slab));
+}
+
static inline void *slab_address(const struct slab *slab)
{
return folio_address(slab_folio(slab));
diff --git a/mm/slub.c b/mm/slub.c
index 63d281dfacdb..3fad4edca34b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2127,6 +2127,7 @@ __add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
list_add_tail(&slab->slab_list, &n->partial);
else
list_add(&slab->slab_list, &n->partial);
+ slab_set_node_partial(slab);
}

static inline void add_partial(struct kmem_cache_node *n,
@@ -2141,6 +2142,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
{
lockdep_assert_held(&n->list_lock);
list_del(&slab->slab_list);
+ slab_clear_node_partial(slab);
n->nr_partial--;
}

@@ -4831,6 +4833,7 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)

if (free == slab->objects) {
list_move(&slab->slab_list, &discard);
+ slab_clear_node_partial(slab);
n->nr_partial--;
dec_slabs_node(s, node, slab->objects);
} else if (free <= SHRINK_PROMOTE_MAX)
--
2.20.1

2023-10-21 14:45:08

by Chengming Zhou

[permalink] [raw]
Subject: [RFC PATCH v2 4/6] slub: Simplify acquire_slab()

From: Chengming Zhou <[email protected]>

Now the object == NULL is always true, simplify acquire_slab().

Signed-off-by: Chengming Zhou <[email protected]>
---
mm/slub.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 61ee82ea21b6..9f0b80fefc70 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2222,8 +2222,7 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s,
* Returns a list of objects or NULL if it fails.
*/
static inline void *acquire_slab(struct kmem_cache *s,
- struct kmem_cache_node *n, struct slab *slab,
- int mode)
+ struct kmem_cache_node *n, struct slab *slab)
{
void *freelist;
unsigned long counters;
@@ -2239,12 +2238,8 @@ static inline void *acquire_slab(struct kmem_cache *s,
freelist = slab->freelist;
counters = slab->counters;
new.counters = counters;
- if (mode) {
- new.inuse = slab->objects;
- new.freelist = NULL;
- } else {
- new.freelist = freelist;
- }
+ new.inuse = slab->objects;
+ new.freelist = NULL;

VM_BUG_ON(new.frozen);
new.frozen = 1;
@@ -2306,7 +2301,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
}

if (!object) {
- t = acquire_slab(s, n, slab, object == NULL);
+ t = acquire_slab(s, n, slab);
if (t) {
*pc->slab = slab;
stat(s, ALLOC_FROM_PARTIAL);
--
2.20.1

2023-10-21 14:45:31

by Chengming Zhou

[permalink] [raw]
Subject: [RFC PATCH v2 5/6] slub: Introduce get_cpu_partial()

From: Chengming Zhou <[email protected]>

Since the slabs on cpu partial list are not frozen anymore, we introduce
get_cpu_partial() to get a frozen slab with its freelist from cpu partial
list. It's now much like getting a frozen slab with its freelist from
node partial list.

Another change is about get_partial(), which can return no frozen slab
when all slabs are failed when acquire_slab(), but get some unfreeze slabs
in its cpu partial list, so we need to check this rare case to avoid
allocating a new slab.

Signed-off-by: Chengming Zhou <[email protected]>
---
mm/slub.c | 87 +++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 68 insertions(+), 19 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9f0b80fefc70..7fae959c56eb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3055,6 +3055,68 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
return freelist;
}

+#ifdef CONFIG_SLUB_CPU_PARTIAL
+
+static void *get_cpu_partial(struct kmem_cache *s, struct kmem_cache_cpu *c,
+ struct slab **slabptr, int node, gfp_t gfpflags)
+{
+ unsigned long flags;
+ struct slab *slab;
+ struct slab new;
+ unsigned long counters;
+ void *freelist;
+
+ while (slub_percpu_partial(c)) {
+ local_lock_irqsave(&s->cpu_slab->lock, flags);
+ if (unlikely(!slub_percpu_partial(c))) {
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ /* we were preempted and partial list got empty */
+ return NULL;
+ }
+
+ slab = slub_percpu_partial(c);
+ slub_set_percpu_partial(c, slab);
+ local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ stat(s, CPU_PARTIAL_ALLOC);
+
+ if (unlikely(!node_match(slab, node) ||
+ !pfmemalloc_match(slab, gfpflags))) {
+ slab->next = NULL;
+ __unfreeze_partials(s, slab);
+ continue;
+ }
+
+ do {
+ freelist = slab->freelist;
+ counters = slab->counters;
+
+ new.counters = counters;
+ VM_BUG_ON(new.frozen);
+
+ new.inuse = slab->objects;
+ new.frozen = 1;
+ } while (!__slab_update_freelist(s, slab,
+ freelist, counters,
+ NULL, new.counters,
+ "get_cpu_partial"));
+
+ *slabptr = slab;
+ return freelist;
+ }
+
+ return NULL;
+}
+
+#else /* CONFIG_SLUB_CPU_PARTIAL */
+
+static void *get_cpu_partial(struct kmem_cache *s, struct kmem_cache_cpu *c,
+ struct slab **slabptr, int node, gfp_t gfpflags)
+{
+ return NULL;
+}
+
+#endif
+
/*
* Slow path. The lockless freelist is empty or we need to perform
* debugging duties.
@@ -3097,7 +3159,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
node = NUMA_NO_NODE;
goto new_slab;
}
-redo:

if (unlikely(!node_match(slab, node))) {
/*
@@ -3173,24 +3234,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

new_slab:

- if (slub_percpu_partial(c)) {
- local_lock_irqsave(&s->cpu_slab->lock, flags);
- if (unlikely(c->slab)) {
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
- goto reread_slab;
- }
- if (unlikely(!slub_percpu_partial(c))) {
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
- /* we were preempted and partial list got empty */
- goto new_objects;
- }
-
- slab = c->slab = slub_percpu_partial(c);
- slub_set_percpu_partial(c, slab);
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
- stat(s, CPU_PARTIAL_ALLOC);
- goto redo;
- }
+ freelist = get_cpu_partial(s, c, &slab, node, gfpflags);
+ if (freelist)
+ goto retry_load_slab;

new_objects:

@@ -3201,6 +3247,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (freelist)
goto check_new_slab;

+ if (slub_percpu_partial(c))
+ goto new_slab;
+
slub_put_cpu_ptr(s->cpu_slab);
slab = new_slab(s, gfpflags, node);
c = slub_get_cpu_ptr(s->cpu_slab);
--
2.20.1

2023-10-21 14:45:41

by Chengming Zhou

[permalink] [raw]
Subject: [RFC PATCH v2 2/6] slub: Prepare __slab_free() for unfrozen partial slab out of node partial list

From: Chengming Zhou <[email protected]>

Now the partial slub will be frozen when taken out of node partial list,
so the __slab_free() will know from "was_frozen" that the partial slab
is not on node partial list and is used by one kmem_cache_cpu.

But we will change this, make partial slabs leave the node partial list
with unfrozen state, so we need to change __slab_free() to use the new
slab_test_node_partial() we just introduced.

Signed-off-by: Chengming Zhou <[email protected]>
---
mm/slub.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 3fad4edca34b..adeff8df85ec 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3610,6 +3610,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
unsigned long counters;
struct kmem_cache_node *n = NULL;
unsigned long flags;
+ bool on_node_partial;

stat(s, FREE_SLOWPATH);

@@ -3657,6 +3658,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
*/
spin_lock_irqsave(&n->list_lock, flags);

+ on_node_partial = slab_test_node_partial(slab);
}
}

@@ -3685,6 +3687,15 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
return;
}

+ /*
+ * This slab was not full and not on the per-node partial list either,
+ * in which case we shouldn't manipulate its list, just early return.
+ */
+ if (prior && !on_node_partial) {
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ return;
+ }
+
if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
goto slab_empty;

--
2.20.1

2023-10-21 14:46:02

by Chengming Zhou

[permalink] [raw]
Subject: [RFC PATCH v2 3/6] slub: Don't freeze slabs for cpu partial

From: Chengming Zhou <[email protected]>

Now we will freeze slabs when moving them out of node partial list to
cpu partial list, this method needs two cmpxchg_double operations:

1. freeze slab (acquire_slab()) under the node list_lock
2. get_freelist() when pick used in ___slab_alloc()

Actually we don't need to freeze when moving slabs out of node partial
list, we can delay freeze to use slab freelist in ___slab_alloc(), so
we can save one cmpxchg_double().

And there are other good points:

1. The moving of slabs between node partial list and cpu partial list
becomes simpler, since we don't need to freeze or unfreeze at all.

2. The node list_lock contention would be less, since we only need to
freeze one slab under the node list_lock. (In fact, we can first
move slabs out of node partial list, don't need to freeze any slab
at all, so the contention on slab won't transfer to the node list_lock
contention.)

We can achieve this because there is no concurrent path would manipulate
the partial slab list except the __slab_free() path, which is serialized
now.

Note this patch just change the parts of moving the partial slabs for
easy code review, we will fix other parts in the following patches.
Specifically this patch change three paths:
1. get partial slab from node: get_partial_node()
2. put partial slab to node: __unfreeze_partials()
3. cache partail slab on cpu when __slab_free()

Signed-off-by: Chengming Zhou <[email protected]>
---
mm/slub.c | 63 +++++++++++++++++--------------------------------------
1 file changed, 19 insertions(+), 44 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index adeff8df85ec..61ee82ea21b6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2277,7 +2277,9 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
struct slab *slab, *slab2;
void *object = NULL;
unsigned long flags;
+#ifdef CONFIG_SLUB_CPU_PARTIAL
unsigned int partial_slabs = 0;
+#endif

/*
* Racy check. If we mistakenly see no partial slabs then we
@@ -2303,20 +2305,22 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
continue;
}

- t = acquire_slab(s, n, slab, object == NULL);
- if (!t)
- break;
-
if (!object) {
- *pc->slab = slab;
- stat(s, ALLOC_FROM_PARTIAL);
- object = t;
- } else {
- put_cpu_partial(s, slab, 0);
- stat(s, CPU_PARTIAL_NODE);
- partial_slabs++;
+ t = acquire_slab(s, n, slab, object == NULL);
+ if (t) {
+ *pc->slab = slab;
+ stat(s, ALLOC_FROM_PARTIAL);
+ object = t;
+ continue;
+ }
}
+
#ifdef CONFIG_SLUB_CPU_PARTIAL
+ remove_partial(n, slab);
+ put_cpu_partial(s, slab, 0);
+ stat(s, CPU_PARTIAL_NODE);
+ partial_slabs++;
+
if (!kmem_cache_has_cpu_partial(s)
|| partial_slabs > s->cpu_partial_slabs / 2)
break;
@@ -2606,9 +2610,6 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
unsigned long flags = 0;

while (partial_slab) {
- struct slab new;
- struct slab old;
-
slab = partial_slab;
partial_slab = slab->next;

@@ -2621,23 +2622,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
spin_lock_irqsave(&n->list_lock, flags);
}

- do {
-
- old.freelist = slab->freelist;
- old.counters = slab->counters;
- VM_BUG_ON(!old.frozen);
-
- new.counters = old.counters;
- new.freelist = old.freelist;
-
- new.frozen = 0;
-
- } while (!__slab_update_freelist(s, slab,
- old.freelist, old.counters,
- new.freelist, new.counters,
- "unfreezing slab"));
-
- if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
+ if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
slab->next = slab_to_discard;
slab_to_discard = slab;
} else {
@@ -3634,18 +3619,8 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
was_frozen = new.frozen;
new.inuse -= cnt;
if ((!new.inuse || !prior) && !was_frozen) {
-
- if (kmem_cache_has_cpu_partial(s) && !prior) {
-
- /*
- * Slab was on no list before and will be
- * partially empty
- * We can defer the list move and instead
- * freeze it.
- */
- new.frozen = 1;
-
- } else { /* Needs to be taken off a list */
+ /* Needs to be taken off a list */
+ if (!kmem_cache_has_cpu_partial(s) || prior) {

n = get_node(s, slab_nid(slab));
/*
@@ -3675,7 +3650,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
* activity can be necessary.
*/
stat(s, FREE_FROZEN);
- } else if (new.frozen) {
+ } else if (kmem_cache_has_cpu_partial(s) && !prior) {
/*
* If we just froze the slab then put it onto the
* per cpu partial list.
--
2.20.1

2023-10-21 14:46:03

by Chengming Zhou

[permalink] [raw]
Subject: [RFC PATCH v2 6/6] slub: Optimize deactivate_slab()

From: Chengming Zhou <[email protected]>

Since the introduce of unfrozen slabs on cpu partial list, we don't
need to synchronize the slab frozen state under the node list_lock.

The caller of deactivate_slab() and the caller of __slab_free() won't
manipulate the slab list concurrently.

So we can get node list_lock in the stage three if we need to manipulate
the slab list in this path.

Signed-off-by: Chengming Zhou <[email protected]>
---
mm/slub.c | 70 ++++++++++++++++++++-----------------------------------
1 file changed, 25 insertions(+), 45 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 7fae959c56eb..29a60bfbf9c5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2493,10 +2493,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
void *freelist)
{
- enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
struct kmem_cache_node *n = get_node(s, slab_nid(slab));
int free_delta = 0;
- enum slab_modes mode = M_NONE;
void *nextfree, *freelist_iter, *freelist_tail;
int tail = DEACTIVATE_TO_HEAD;
unsigned long flags = 0;
@@ -2543,58 +2541,40 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
* unfrozen and number of objects in the slab may have changed.
* Then release lock and retry cmpxchg again.
*/
-redo:
-
- old.freelist = READ_ONCE(slab->freelist);
- old.counters = READ_ONCE(slab->counters);
- VM_BUG_ON(!old.frozen);
-
- /* Determine target state of the slab */
- new.counters = old.counters;
- if (freelist_tail) {
- new.inuse -= free_delta;
- set_freepointer(s, freelist_tail, old.freelist);
- new.freelist = freelist;
- } else
- new.freelist = old.freelist;
+ do {
+ old.freelist = READ_ONCE(slab->freelist);
+ old.counters = READ_ONCE(slab->counters);
+ VM_BUG_ON(!old.frozen);
+
+ /* Determine target state of the slab */
+ new.counters = old.counters;
+ new.frozen = 0;
+ if (freelist_tail) {
+ new.inuse -= free_delta;
+ set_freepointer(s, freelist_tail, old.freelist);
+ new.freelist = freelist;
+ } else
+ new.freelist = old.freelist;

- new.frozen = 0;
+ } while (!slab_update_freelist(s, slab,
+ old.freelist, old.counters,
+ new.freelist, new.counters,
+ "unfreezing slab"));

+ /*
+ * Stage three: Manipulate the slab list based on the updated state.
+ */
if (!new.inuse && n->nr_partial >= s->min_partial) {
- mode = M_FREE;
+ stat(s, DEACTIVATE_EMPTY);
+ discard_slab(s, slab);
+ stat(s, FREE_SLAB);
} else if (new.freelist) {
- mode = M_PARTIAL;
- /*
- * Taking the spinlock removes the possibility that
- * acquire_slab() will see a slab that is frozen
- */
spin_lock_irqsave(&n->list_lock, flags);
- } else {
- mode = M_FULL_NOLIST;
- }
-
-
- if (!slab_update_freelist(s, slab,
- old.freelist, old.counters,
- new.freelist, new.counters,
- "unfreezing slab")) {
- if (mode == M_PARTIAL)
- spin_unlock_irqrestore(&n->list_lock, flags);
- goto redo;
- }
-
-
- if (mode == M_PARTIAL) {
add_partial(n, slab, tail);
spin_unlock_irqrestore(&n->list_lock, flags);
stat(s, tail);
- } else if (mode == M_FREE) {
- stat(s, DEACTIVATE_EMPTY);
- discard_slab(s, slab);
- stat(s, FREE_SLAB);
- } else if (mode == M_FULL_NOLIST) {
+ } else
stat(s, DEACTIVATE_FULL);
- }
}

#ifdef CONFIG_SLUB_CPU_PARTIAL
--
2.20.1

2023-10-22 14:53:52

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On Sat, Oct 21, 2023 at 11:43 PM <[email protected]> wrote:
>
> From: Chengming Zhou <[email protected]>
>
> Changes in RFC v2:
> - Reuse PG_workingset bit to keep track of whether slub is on the
> per-node partial list, as suggested by Matthew Wilcox.
> - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
> is caused by leak of partial slabs when get_partial_node().
> - Add a patch to simplify acquire_slab().
> - Reorder patches a little.
> - v1: https://lore.kernel.org/all/[email protected]/

I've picked [1] and tested this patch series and it passed a simple MM
& slab test
in 30 different SLUB configurations [2].

Also there's code coverage information [3] if you're interested :P

For the series,
Tested-by: Hyeonggon Yoo <[email protected]>

Will review when I have free time ;)
Thanks!

[1] https://git.kerneltesting.org/slab-experimental/log/
[2] https://jenkins.kerneltesting.org/job/slab-experimental/
[3] https://coverage.kerneltesting.org/slab-experimental-6283c415/mm/index.html

> Chengming Zhou (6):
> slub: Keep track of whether slub is on the per-node partial list
> slub: Prepare __slab_free() for unfrozen partial slab out of node
> partial list
> slub: Don't freeze slabs for cpu partial
> slub: Simplify acquire_slab()
> slub: Introduce get_cpu_partial()
> slub: Optimize deactivate_slab()
>
> include/linux/page-flags.h | 2 +
> mm/slab.h | 19 +++
> mm/slub.c | 245 +++++++++++++++++++------------------
> 3 files changed, 150 insertions(+), 116 deletions(-)
>
> --
> 2.20.1
>

2023-10-23 12:33:10

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/6] slub: Keep track of whether slub is on the per-node partial list

On Sat, Oct 21, 2023 at 02:43:12PM +0000, [email protected] wrote:
> +++ b/include/linux/page-flags.h
> @@ -478,6 +478,8 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
> TESTCLEARFLAG(Active, active, PF_HEAD)
> PAGEFLAG(Workingset, workingset, PF_HEAD)
> TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
> + __SETPAGEFLAG(Workingset, workingset, PF_HEAD)
> + __CLEARPAGEFLAG(Workingset, workingset, PF_HEAD)

This makes me nervous. The __ versions can only be used when there are
guaranteed to be no other accesses to the flags. It's never going to
be the case that we want code to call __folio_set_workingset().

_Assuming_ that it's safe to use the non-atomic flag setting, I'd
rather see this done as ...

static inline void slab_set_node_partial(struct slab *slab)
{
__folio_set_workingset(slab_folio(slab));
__set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
}

2023-10-23 15:46:34

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On 10/21/23 16:43, [email protected] wrote:
> From: Chengming Zhou <[email protected]>

Hi!

> Changes in RFC v2:
> - Reuse PG_workingset bit to keep track of whether slub is on the
> per-node partial list, as suggested by Matthew Wilcox.
> - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
> is caused by leak of partial slabs when get_partial_node().
> - Add a patch to simplify acquire_slab().
> - Reorder patches a little.
> - v1: https://lore.kernel.org/all/[email protected]/
>
> 1. Problem
> ==========
> Now we have to freeze the slab when get from the node partial list, and
> unfreeze the slab when put to the node partial list. Because we need to
> rely on the node list_lock to synchronize the "frozen" bit changes.
>
> This implementation has some drawbacks:
>
> - Alloc path: twice cmpxchg_double.
> It has to get some partial slabs from node when the allocator has used
> up the CPU partial slabs. So it freeze the slab (one cmpxchg_double)
> with node list_lock held, put those frozen slabs on its CPU partial
> list. Later ___slab_alloc() will cmpxchg_double try-loop again if that
> slab is picked to use.
>
> - Alloc path: amplified contention on node list_lock.
> Since we have to synchronize the "frozen" bit changes under the node
> list_lock, the contention of slab (struct page) can be transferred
> to the node list_lock. On machine with many CPUs in one node, the
> contention of list_lock will be amplified by all CPUs' alloc path.
>
> The current code has to workaround this problem by avoiding using
> cmpxchg_double try-loop, which will just break and return when
> contention of page encountered and the first cmpxchg_double failed.
> But this workaround has its own problem.

I'd note here: For more context, see 9b1ea29bc0d7 ("Revert "mm, slub:
consider rest of partial list if acquire_slab() fails"")

> - Free path: redundant unfreeze.
> __slab_free() will freeze and cache some slabs on its partial list,
> and flush them to the node partial list when exceed, which has to
> unfreeze those slabs again under the node list_lock. Actually we
> don't need to freeze slab on CPU partial list, in which case we
> can save the unfreeze cmpxchg_double operations in flush path.
>
> 2. Solution
> ===========
> We solve these problems by leaving slabs unfrozen when moving out of
> the node partial list and on CPU partial list, so "frozen" bit is 0.
>
> These partial slabs won't be manipulate concurrently by alloc path,
> the only racer is free path, which may manipulate its list when !inuse.
> So we need to introduce another synchronization way to avoid it, we
> reuse PG_workingset to keep track of whether the slab is on node partial
> list or not, only in that case we can manipulate the slab list.
>
> The slab will be delay frozen when it's picked to actively use by the
> CPU, it becomes full at the same time, in which case we still need to
> rely on "frozen" bit to avoid manipulating its list. So the slab will
> be frozen only when activate use and be unfrozen only when deactivate.

Interesting solution! I wonder if we could go a bit further and remove
acquire_slab() completely. Because AFAICS even after your changes,
acquire_slab() is still attempted including freezing the slab, which means
still doing an cmpxchg_double under the list_lock, and now also handling the
special case when it failed, but we at least filled percpu partial lists.
What if we only filled the partial list without freezing, and then froze the
first slab outside of the list_lock?

Or more precisely, instead of returning the acquired "object" we would
return the first slab removed from partial list. I think it would simplify
the code a bit, and further reduce list_lock holding times.

I'll also point out a few more details, but it's not a full detailed review
as the suggestion above, and another for 4/5, could mean a rather
significant change for v3.

Thanks!

> 3. Testing
> ==========
> We just did some simple testing on a server with 128 CPUs (2 nodes) to
> compare performance for now.
>
> - perf bench sched messaging -g 5 -t -l 100000
> baseline RFC
> 7.042s 6.966s
> 7.022s 7.045s
> 7.054s 6.985s
>
> - stress-ng --rawpkt 128 --rawpkt-ops 100000000
> baseline RFC
> 2.42s 2.15s
> 2.45s 2.16s
> 2.44s 2.17s
>
> It shows above there is about 10% improvement on stress-ng rawpkt
> testcase, although no much improvement on perf sched bench testcase.
>
> Thanks for any comment and code review!
>
> Chengming Zhou (6):
> slub: Keep track of whether slub is on the per-node partial list
> slub: Prepare __slab_free() for unfrozen partial slab out of node
> partial list
> slub: Don't freeze slabs for cpu partial
> slub: Simplify acquire_slab()
> slub: Introduce get_cpu_partial()
> slub: Optimize deactivate_slab()
>
> include/linux/page-flags.h | 2 +
> mm/slab.h | 19 +++
> mm/slub.c | 245 +++++++++++++++++++------------------
> 3 files changed, 150 insertions(+), 116 deletions(-)
>

2023-10-23 16:01:17

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/6] slub: Don't freeze slabs for cpu partial

On 10/21/23 16:43, [email protected] wrote:
> From: Chengming Zhou <[email protected]>
>
> Now we will freeze slabs when moving them out of node partial list to
> cpu partial list, this method needs two cmpxchg_double operations:
>
> 1. freeze slab (acquire_slab()) under the node list_lock
> 2. get_freelist() when pick used in ___slab_alloc()
>
> Actually we don't need to freeze when moving slabs out of node partial
> list, we can delay freeze to use slab freelist in ___slab_alloc(), so
> we can save one cmpxchg_double().
>
> And there are other good points:
>
> 1. The moving of slabs between node partial list and cpu partial list
> becomes simpler, since we don't need to freeze or unfreeze at all.
>
> 2. The node list_lock contention would be less, since we only need to
> freeze one slab under the node list_lock. (In fact, we can first
> move slabs out of node partial list, don't need to freeze any slab
> at all, so the contention on slab won't transfer to the node list_lock
> contention.)
>
> We can achieve this because there is no concurrent path would manipulate
> the partial slab list except the __slab_free() path, which is serialized
> now.
>
> Note this patch just change the parts of moving the partial slabs for
> easy code review, we will fix other parts in the following patches.
> Specifically this patch change three paths:
> 1. get partial slab from node: get_partial_node()
> 2. put partial slab to node: __unfreeze_partials()
> 3. cache partail slab on cpu when __slab_free()

So the problem with this approach that one patch breaks things and another
one fixes them up, is that git bisect becomes problematic, so we shouldn't
do that and instead bite the bullet and deal with a potentially large patch.
At some level it's unavoidable as one has to grasp all the moving pieces
anyway to see e.g. if the changes in allocation path are compatible with the
changes in freeing.
When possible, we can do preparatory stuff that doesn't break things like in
patches 1 and 2, maybe get_cpu_partial() could also be introduced (even if
unused) before switching the world over to the new scheme in a single patch
(and possible cleanups afterwards). So would it be possible to redo it in
such way please?

<snip>

> @@ -2621,23 +2622,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
> spin_lock_irqsave(&n->list_lock, flags);
> }
>
> - do {
> -
> - old.freelist = slab->freelist;
> - old.counters = slab->counters;
> - VM_BUG_ON(!old.frozen);
> -
> - new.counters = old.counters;
> - new.freelist = old.freelist;
> -
> - new.frozen = 0;
> -
> - } while (!__slab_update_freelist(s, slab,
> - old.freelist, old.counters,
> - new.freelist, new.counters,
> - "unfreezing slab"));
> -
> - if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
> + if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
> slab->next = slab_to_discard;
> slab_to_discard = slab;
> } else {
> @@ -3634,18 +3619,8 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> was_frozen = new.frozen;
> new.inuse -= cnt;
> if ((!new.inuse || !prior) && !was_frozen) {
> -
> - if (kmem_cache_has_cpu_partial(s) && !prior) {
> -
> - /*
> - * Slab was on no list before and will be
> - * partially empty
> - * We can defer the list move and instead
> - * freeze it.
> - */
> - new.frozen = 1;
> -
> - } else { /* Needs to be taken off a list */
> + /* Needs to be taken off a list */
> + if (!kmem_cache_has_cpu_partial(s) || prior) {
>
> n = get_node(s, slab_nid(slab));
> /*
> @@ -3675,7 +3650,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> * activity can be necessary.
> */
> stat(s, FREE_FROZEN);
> - } else if (new.frozen) {
> + } else if (kmem_cache_has_cpu_partial(s) && !prior) {
> /*
> * If we just froze the slab then put it onto the
> * per cpu partial list.

I think this comment is now misleading, we didn't freeze the slab, so it's
now something like "if we started with a full slab...".

2023-10-23 16:22:51

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/6] slub: Keep track of whether slub is on the per-node partial list

On Mon, Oct 23, 2023 at 01:32:27PM +0100, Matthew Wilcox wrote:
> _Assuming_ that it's safe to use the non-atomic flag setting, I'd
> rather see this done as ...
>
> static inline void slab_set_node_partial(struct slab *slab)
> {
> __folio_set_workingset(slab_folio(slab));

Ugh, I meant to delete this line. I meant to just write the next line.

> __set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
> }

Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On Mon, 23 Oct 2023, Vlastimil Babka wrote:

>>
>> The slab will be delay frozen when it's picked to actively use by the
>> CPU, it becomes full at the same time, in which case we still need to
>> rely on "frozen" bit to avoid manipulating its list. So the slab will
>> be frozen only when activate use and be unfrozen only when deactivate.
>
> Interesting solution! I wonder if we could go a bit further and remove
> acquire_slab() completely. Because AFAICS even after your changes,
> acquire_slab() is still attempted including freezing the slab, which means
> still doing an cmpxchg_double under the list_lock, and now also handling the
> special case when it failed, but we at least filled percpu partial lists.
> What if we only filled the partial list without freezing, and then froze the
> first slab outside of the list_lock?
>
> Or more precisely, instead of returning the acquired "object" we would
> return the first slab removed from partial list. I think it would simplify
> the code a bit, and further reduce list_lock holding times.
>
> I'll also point out a few more details, but it's not a full detailed review
> as the suggestion above, and another for 4/5, could mean a rather
> significant change for v3.

This is not that easy. The frozen bit indicates that list management does
not have to be done for a slab if its processed in free. If you take a
slab off the list without setting that bit then something else needs to
provide the information that "frozen" provided.

If the frozen bit changes can be handled in a different way than
with cmpxchg then that is a good optimization.

For much of the frozen handling we must be holding the node list lock
anyways in order to add/remove from the list. So we already have a lock
that could be used to protect flag operations.

2023-10-23 18:44:56

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On 10/23/23 19:00, Christoph Lameter (Ampere) wrote:
> On Mon, 23 Oct 2023, Vlastimil Babka wrote:
>
>>>
>>> The slab will be delay frozen when it's picked to actively use by the
>>> CPU, it becomes full at the same time, in which case we still need to
>>> rely on "frozen" bit to avoid manipulating its list. So the slab will
>>> be frozen only when activate use and be unfrozen only when deactivate.
>>
>> Interesting solution! I wonder if we could go a bit further and remove
>> acquire_slab() completely. Because AFAICS even after your changes,
>> acquire_slab() is still attempted including freezing the slab, which means
>> still doing an cmpxchg_double under the list_lock, and now also handling the
>> special case when it failed, but we at least filled percpu partial lists.
>> What if we only filled the partial list without freezing, and then froze the
>> first slab outside of the list_lock?
>>
>> Or more precisely, instead of returning the acquired "object" we would
>> return the first slab removed from partial list. I think it would simplify
>> the code a bit, and further reduce list_lock holding times.
>>
>> I'll also point out a few more details, but it's not a full detailed review
>> as the suggestion above, and another for 4/5, could mean a rather
>> significant change for v3.
>
> This is not that easy. The frozen bit indicates that list management does
> not have to be done for a slab if its processed in free. If you take a
> slab off the list without setting that bit then something else needs to
> provide the information that "frozen" provided.

Yes, that's the new slab_node_partial flag in patch 1, protected by list_lock.

> If the frozen bit changes can be handled in a different way than
> with cmpxchg then that is a good optimization.

Frozen bit stays the same, but some scenarios can now avoid it.

> For much of the frozen handling we must be holding the node list lock
> anyways in order to add/remove from the list. So we already have a lock
> that could be used to protect flag operations.

I can see the following differences between the traditional frozen bit and
the new flag:

frozen bit advantage:
- __slab_free() on an already-frozen slab can ignore list operations and
list_lock completely

frozen bit disadvantage:
- acquire_slab() trying to do cmpxchg_double() under list_lock (see commit
9b1ea29bc0d7)

slab_node_partial flag advantage:
- we can take slabs off from node partial list without cmpxchg_double()
- probably less cmpxchg_double() operations overall

slab_node_partial flag disadvantage:
- a __slab_free() that encouters a slab that's not frozen (but
slab_node_partial flag is not set) might have to do more work, including
taking the list_lock only to find out that slab_node_partial flag is false
(but AFAICS that happens only when the slab becomes fully free by the free
operation, thus relatively rarely).

Put together, I think we might indeed get the best of both if the frozen
flag is kept to use for cpu slabs, and we rely on slab_node_partial flag for
cpu partial slabs, as the series does.

Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On Mon, 23 Oct 2023, Vlastimil Babka wrote:

>> For much of the frozen handling we must be holding the node list lock
>> anyways in order to add/remove from the list. So we already have a lock
>> that could be used to protect flag operations.
>
> I can see the following differences between the traditional frozen bit and
> the new flag:
>
> frozen bit advantage:
> - __slab_free() on an already-frozen slab can ignore list operations and
> list_lock completely
>
> frozen bit disadvantage:
> - acquire_slab() trying to do cmpxchg_double() under list_lock (see commit
> 9b1ea29bc0d7)


Ok so a slab is frozen if either of those conditions are met. That gets a
bit complicated to test for. Can we just get away with the
slab_node_partial flag?

The advantage with the frozen state is that it can be changed with a
cmpxchg together with some other values (list pointer, counter) that need
updating at free and allocation.

But frozen updates are rarer so maybe its worth to completely drop the
frozen bit. If both need to be updates then we would have two atomic ops.
One is the cmpxchg and the other the operation on the page flag.

2023-10-24 01:57:41

by Chengming Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/6] slub: Keep track of whether slub is on the per-node partial list

On 2023/10/24 00:22, Matthew Wilcox wrote:
> On Mon, Oct 23, 2023 at 01:32:27PM +0100, Matthew Wilcox wrote:
>> _Assuming_ that it's safe to use the non-atomic flag setting, I'd
>> rather see this done as ...
>>
>> static inline void slab_set_node_partial(struct slab *slab)
>> {
>> __folio_set_workingset(slab_folio(slab));
>
> Ugh, I meant to delete this line. I meant to just write the next line.
>
>> __set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
>> }

Yes, it's safe to use the non-atomic version here, since it's protected
by the slub per-node list_lock and we want better performance here.

Ok, will change to directly use __set_bit() and __clear_bit() instead of
polluting the "workingset" interfaces there.

Thanks.

2023-10-24 02:02:45

by Chengming Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On 2023/10/22 22:52, Hyeonggon Yoo wrote:
> On Sat, Oct 21, 2023 at 11:43 PM <[email protected]> wrote:
>>
>> From: Chengming Zhou <[email protected]>
>>
>> Changes in RFC v2:
>> - Reuse PG_workingset bit to keep track of whether slub is on the
>> per-node partial list, as suggested by Matthew Wilcox.
>> - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
>> is caused by leak of partial slabs when get_partial_node().
>> - Add a patch to simplify acquire_slab().
>> - Reorder patches a little.
>> - v1: https://lore.kernel.org/all/[email protected]/
>
> I've picked [1] and tested this patch series and it passed a simple MM
> & slab test
> in 30 different SLUB configurations [2].
>
> Also there's code coverage information [3] if you're interested :P
>
> For the series,
> Tested-by: Hyeonggon Yoo <[email protected]>

Thank you! These are very helpful to me, I will also do more workloads
stress testing with more configurations.

>
> Will review when I have free time ;)

Ok, thanks for your time.

> Thanks!
>
> [1] https://git.kerneltesting.org/slab-experimental/log/
> [2] https://jenkins.kerneltesting.org/job/slab-experimental/
> [3] https://coverage.kerneltesting.org/slab-experimental-6283c415/mm/index.html
>
>> Chengming Zhou (6):
>> slub: Keep track of whether slub is on the per-node partial list
>> slub: Prepare __slab_free() for unfrozen partial slab out of node
>> partial list
>> slub: Don't freeze slabs for cpu partial
>> slub: Simplify acquire_slab()
>> slub: Introduce get_cpu_partial()
>> slub: Optimize deactivate_slab()
>>
>> include/linux/page-flags.h | 2 +
>> mm/slab.h | 19 +++
>> mm/slub.c | 245 +++++++++++++++++++------------------
>> 3 files changed, 150 insertions(+), 116 deletions(-)
>>
>> --
>> 2.20.1
>>

2023-10-24 02:22:38

by Chengming Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On 2023/10/23 23:46, Vlastimil Babka wrote:
> On 10/21/23 16:43, [email protected] wrote:
>> From: Chengming Zhou <[email protected]>
>
> Hi!
>
>> Changes in RFC v2:
>> - Reuse PG_workingset bit to keep track of whether slub is on the
>> per-node partial list, as suggested by Matthew Wilcox.
>> - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
>> is caused by leak of partial slabs when get_partial_node().
>> - Add a patch to simplify acquire_slab().
>> - Reorder patches a little.
>> - v1: https://lore.kernel.org/all/[email protected]/
>>
>> 1. Problem
>> ==========
>> Now we have to freeze the slab when get from the node partial list, and
>> unfreeze the slab when put to the node partial list. Because we need to
>> rely on the node list_lock to synchronize the "frozen" bit changes.
>>
>> This implementation has some drawbacks:
>>
>> - Alloc path: twice cmpxchg_double.
>> It has to get some partial slabs from node when the allocator has used
>> up the CPU partial slabs. So it freeze the slab (one cmpxchg_double)
>> with node list_lock held, put those frozen slabs on its CPU partial
>> list. Later ___slab_alloc() will cmpxchg_double try-loop again if that
>> slab is picked to use.
>>
>> - Alloc path: amplified contention on node list_lock.
>> Since we have to synchronize the "frozen" bit changes under the node
>> list_lock, the contention of slab (struct page) can be transferred
>> to the node list_lock. On machine with many CPUs in one node, the
>> contention of list_lock will be amplified by all CPUs' alloc path.
>>
>> The current code has to workaround this problem by avoiding using
>> cmpxchg_double try-loop, which will just break and return when
>> contention of page encountered and the first cmpxchg_double failed.
>> But this workaround has its own problem.
>
> I'd note here: For more context, see 9b1ea29bc0d7 ("Revert "mm, slub:
> consider rest of partial list if acquire_slab() fails"")

Good, will add it.

>
>> - Free path: redundant unfreeze.
>> __slab_free() will freeze and cache some slabs on its partial list,
>> and flush them to the node partial list when exceed, which has to
>> unfreeze those slabs again under the node list_lock. Actually we
>> don't need to freeze slab on CPU partial list, in which case we
>> can save the unfreeze cmpxchg_double operations in flush path.
>>
>> 2. Solution
>> ===========
>> We solve these problems by leaving slabs unfrozen when moving out of
>> the node partial list and on CPU partial list, so "frozen" bit is 0.
>>
>> These partial slabs won't be manipulate concurrently by alloc path,
>> the only racer is free path, which may manipulate its list when !inuse.
>> So we need to introduce another synchronization way to avoid it, we
>> reuse PG_workingset to keep track of whether the slab is on node partial
>> list or not, only in that case we can manipulate the slab list.
>>
>> The slab will be delay frozen when it's picked to actively use by the
>> CPU, it becomes full at the same time, in which case we still need to
>> rely on "frozen" bit to avoid manipulating its list. So the slab will
>> be frozen only when activate use and be unfrozen only when deactivate.
>
> Interesting solution! I wonder if we could go a bit further and remove
> acquire_slab() completely. Because AFAICS even after your changes,
> acquire_slab() is still attempted including freezing the slab, which means
> still doing an cmpxchg_double under the list_lock, and now also handling the
> special case when it failed, but we at least filled percpu partial lists.
> What if we only filled the partial list without freezing, and then froze the
> first slab outside of the list_lock?

Good idea, we can return one slab and put other slabs to the CPU partial list.
So we can remove the acquire_slab() completely and don't need to handle the
fail case. The code will be cleaner, too.

>
> Or more precisely, instead of returning the acquired "object" we would
> return the first slab removed from partial list. I think it would simplify
> the code a bit, and further reduce list_lock holding times.

Ok, I will do this in the next version. But I find we have to return the object
in the "IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)" case, in which
we need to allocate a single object under the node list_lock.

Maybe we can use "struct partial_context" to return the object in this case?

struct partial_context {
- struct slab **slab;
gfp_t flags;
unsigned int orig_size;
+ void *object;
};

Then we can change all get_partial interfaces to return a slab. Do you agree
with this way?

>
> I'll also point out a few more details, but it's not a full detailed review
> as the suggestion above, and another for 4/5, could mean a rather
> significant change for v3.

Thank you!

>
> Thanks!
>
>> 3. Testing
>> ==========
>> We just did some simple testing on a server with 128 CPUs (2 nodes) to
>> compare performance for now.
>>
>> - perf bench sched messaging -g 5 -t -l 100000
>> baseline RFC
>> 7.042s 6.966s
>> 7.022s 7.045s
>> 7.054s 6.985s
>>
>> - stress-ng --rawpkt 128 --rawpkt-ops 100000000
>> baseline RFC
>> 2.42s 2.15s
>> 2.45s 2.16s
>> 2.44s 2.17s
>>
>> It shows above there is about 10% improvement on stress-ng rawpkt
>> testcase, although no much improvement on perf sched bench testcase.
>>
>> Thanks for any comment and code review!
>>
>> Chengming Zhou (6):
>> slub: Keep track of whether slub is on the per-node partial list
>> slub: Prepare __slab_free() for unfrozen partial slab out of node
>> partial list
>> slub: Don't freeze slabs for cpu partial
>> slub: Simplify acquire_slab()
>> slub: Introduce get_cpu_partial()
>> slub: Optimize deactivate_slab()
>>
>> include/linux/page-flags.h | 2 +
>> mm/slab.h | 19 +++
>> mm/slub.c | 245 +++++++++++++++++++------------------
>> 3 files changed, 150 insertions(+), 116 deletions(-)
>>
>

2023-10-24 02:41:42

by Chengming Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/6] slub: Don't freeze slabs for cpu partial

On 2023/10/24 00:00, Vlastimil Babka wrote:
> On 10/21/23 16:43, [email protected] wrote:
>> From: Chengming Zhou <[email protected]>
>>
>> Now we will freeze slabs when moving them out of node partial list to
>> cpu partial list, this method needs two cmpxchg_double operations:
>>
>> 1. freeze slab (acquire_slab()) under the node list_lock
>> 2. get_freelist() when pick used in ___slab_alloc()
>>
>> Actually we don't need to freeze when moving slabs out of node partial
>> list, we can delay freeze to use slab freelist in ___slab_alloc(), so
>> we can save one cmpxchg_double().
>>
>> And there are other good points:
>>
>> 1. The moving of slabs between node partial list and cpu partial list
>> becomes simpler, since we don't need to freeze or unfreeze at all.
>>
>> 2. The node list_lock contention would be less, since we only need to
>> freeze one slab under the node list_lock. (In fact, we can first
>> move slabs out of node partial list, don't need to freeze any slab
>> at all, so the contention on slab won't transfer to the node list_lock
>> contention.)
>>
>> We can achieve this because there is no concurrent path would manipulate
>> the partial slab list except the __slab_free() path, which is serialized
>> now.
>>
>> Note this patch just change the parts of moving the partial slabs for
>> easy code review, we will fix other parts in the following patches.
>> Specifically this patch change three paths:
>> 1. get partial slab from node: get_partial_node()
>> 2. put partial slab to node: __unfreeze_partials()
>> 3. cache partail slab on cpu when __slab_free()
>
> So the problem with this approach that one patch breaks things and another
> one fixes them up, is that git bisect becomes problematic, so we shouldn't
> do that and instead bite the bullet and deal with a potentially large patch.
> At some level it's unavoidable as one has to grasp all the moving pieces
> anyway to see e.g. if the changes in allocation path are compatible with the
> changes in freeing.
> When possible, we can do preparatory stuff that doesn't break things like in
> patches 1 and 2, maybe get_cpu_partial() could also be introduced (even if
> unused) before switching the world over to the new scheme in a single patch
> (and possible cleanups afterwards). So would it be possible to redo it in
> such way please?

Ok, I will change to this way. I didn't know how to handle this potentially
large patch gracefully at first. Your detailed advice is very helpful to me,
I will prepare all possible preparatory stuff, then change all parts over
in one patch.

>
> <snip>
>
>> @@ -2621,23 +2622,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
>> spin_lock_irqsave(&n->list_lock, flags);
>> }
>>
>> - do {
>> -
>> - old.freelist = slab->freelist;
>> - old.counters = slab->counters;
>> - VM_BUG_ON(!old.frozen);
>> -
>> - new.counters = old.counters;
>> - new.freelist = old.freelist;
>> -
>> - new.frozen = 0;
>> -
>> - } while (!__slab_update_freelist(s, slab,
>> - old.freelist, old.counters,
>> - new.freelist, new.counters,
>> - "unfreezing slab"));
>> -
>> - if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
>> + if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
>> slab->next = slab_to_discard;
>> slab_to_discard = slab;
>> } else {
>> @@ -3634,18 +3619,8 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> was_frozen = new.frozen;
>> new.inuse -= cnt;
>> if ((!new.inuse || !prior) && !was_frozen) {
>> -
>> - if (kmem_cache_has_cpu_partial(s) && !prior) {
>> -
>> - /*
>> - * Slab was on no list before and will be
>> - * partially empty
>> - * We can defer the list move and instead
>> - * freeze it.
>> - */
>> - new.frozen = 1;
>> -
>> - } else { /* Needs to be taken off a list */
>> + /* Needs to be taken off a list */
>> + if (!kmem_cache_has_cpu_partial(s) || prior) {
>>
>> n = get_node(s, slab_nid(slab));
>> /*
>> @@ -3675,7 +3650,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> * activity can be necessary.
>> */
>> stat(s, FREE_FROZEN);
>> - } else if (new.frozen) {
>> + } else if (kmem_cache_has_cpu_partial(s) && !prior) {
>> /*
>> * If we just froze the slab then put it onto the
>> * per cpu partial list.
>
> I think this comment is now misleading, we didn't freeze the slab, so it's
> now something like "if we started with a full slab...".

Ok, will check and change these inconsistent comments by the way.

Thanks!

2023-10-24 08:19:30

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On 10/23/23 23:05, Christoph Lameter (Ampere) wrote:
> On Mon, 23 Oct 2023, Vlastimil Babka wrote:
>
>>> For much of the frozen handling we must be holding the node list lock
>>> anyways in order to add/remove from the list. So we already have a lock
>>> that could be used to protect flag operations.
>>
>> I can see the following differences between the traditional frozen bit and
>> the new flag:
>>
>> frozen bit advantage:
>> - __slab_free() on an already-frozen slab can ignore list operations and
>> list_lock completely
>>
>> frozen bit disadvantage:
>> - acquire_slab() trying to do cmpxchg_double() under list_lock (see commit
>> 9b1ea29bc0d7)
>
>
> Ok so a slab is frozen if either of those conditions are met. That gets a
> bit complicated to test for. Can we just get away with the
> slab_node_partial flag?

Might be worth trying, but I'd try only as a next separate step. I think
freezing the slab that becomes cpu slab (not partial cpu) still has benefits
and no extra cost as that's when we're doing the cmpxchg_double anyway. And
the complicated tests are confined to __slab_free() and it's not *that* bad
IMHO, one condition checks for was_frozen, another for slab_test_node_partial().

> The advantage with the frozen state is that it can be changed with a
> cmpxchg together with some other values (list pointer, counter) that need
> updating at free and allocation.

Exactly, but for taking a slab off the node partial list we don't need to
deal with those, so that's where it makes sense to delay the frozen bit
handling.

> But frozen updates are rarer so maybe its worth to completely drop the
> frozen bit. If both need to be updates then we would have two atomic ops.
> One is the cmpxchg and the other the operation on the page flag.

The flag update doesn't even have to be atomic as it's only done under
list_lock.

2023-10-24 08:20:59

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On 10/24/23 04:20, Chengming Zhou wrote:
> On 2023/10/23 23:46, Vlastimil Babka wrote:
>> Or more precisely, instead of returning the acquired "object" we would
>> return the first slab removed from partial list. I think it would simplify
>> the code a bit, and further reduce list_lock holding times.
>
> Ok, I will do this in the next version. But I find we have to return the object
> in the "IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)" case, in which
> we need to allocate a single object under the node list_lock.

Ah, right.

> Maybe we can use "struct partial_context" to return the object in this case?
>
> struct partial_context {
> - struct slab **slab;
> gfp_t flags;
> unsigned int orig_size;
> + void *object;
> };
>
> Then we can change all get_partial interfaces to return a slab. Do you agree
> with this way?

Yeah, good idea! Thanks!

2023-10-24 11:03:37

by Chengming Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs

On 2023/10/24 05:05, Christoph Lameter (Ampere) wrote:
> On Mon, 23 Oct 2023, Vlastimil Babka wrote:
>
>>> For much of the frozen handling we must be holding the node list lock
>>> anyways in order to add/remove from the list. So we already have a lock
>>> that could be used to protect flag operations.
>>
>> I can see the following differences between the traditional frozen bit and
>> the new flag:
>>
>> frozen bit advantage:
>> - __slab_free() on an already-frozen slab can ignore list operations and
>> list_lock completely
>>
>> frozen bit disadvantage:
>> - acquire_slab() trying to do cmpxchg_double() under list_lock (see commit
>> 9b1ea29bc0d7)
>
>
> Ok so a slab is frozen if either of those conditions are met. That gets a bit complicated to test for. Can we just get away with the slab_node_partial flag?
>
> The advantage with the frozen state is that it can be changed with a cmpxchg together with some other values (list pointer, counter) that need updating at free and allocation.
>
> But frozen updates are rarer so maybe its worth to completely drop the frozen bit. If both need to be updates then we would have two atomic ops. One is the cmpxchg and the other the operation on the page flag.
>

This introduced page flag bit is using non-atomic operations, which is protected
by the node list_lock.

As for completely dropping the "frozen" bit, I find it hard because we have the
DEACTIVATE_BYPASS optimization in get_freelist(), which clear the "frozen" bit
without the synchronization of node list_lock. So __slab_free() still need to
rely on the "frozen" bit for CPU active slab.

This patch series mainly optimize the cmpxchg cost in moving partial slabs
between node partial list and CPU partial list, and alleviate the contention
of node list_lock meanwhile.

Thanks!