2023-02-23 13:28:22

by Qi Zheng

Subject: [PATCH v2 0/7] make slab shrink lockless

Hi all,

This patch series aims to make slab shrink lockless.

1. Background
=============

On our servers, we often find the following system cpu hotspots:

44.16% [kernel] [k] down_read_trylock
14.12% [kernel] [k] up_read
13.43% [kernel] [k] shrink_slab
5.25% [kernel] [k] count_shadow_nodes
3.42% [kernel] [k] idr_find

Then we used bpftrace to capture its calltrace as follows:

@[
down_read_trylock+5
shrink_slab+292
shrink_node+640
do_try_to_free_pages+211
try_to_free_mem_cgroup_pages+266
try_charge_memcg+386
charge_memcg+51
__mem_cgroup_charge+44
__handle_mm_fault+1416
handle_mm_fault+260
do_user_addr_fault+459
exc_page_fault+104
asm_exc_page_fault+38
clear_user_rep_good+18
read_zero+100
vfs_read+176
ksys_read+93
do_syscall_64+62
entry_SYSCALL_64_after_hwframe+114
]: 1868979

It is easy to see that this is caused by the frequent failure to obtain the
read lock of shrinker_rwsem when reclaiming slab memory.

Currently, shrinker_rwsem is a global lock, and the following cases can cause
the above system cpu hotspots:

a. the write lock of shrinker_rwsem is held for too long. For example, when
   there are many memcgs in the system, some paths hold the lock while
   traversing all of them, which takes a long time (e.g. expand_shrinker_info()).
b. the read lock of shrinker_rwsem is held for too long and a writer arrives
   in the meantime. The writer is then forced to wait and, in turn, blocks all
   subsequent readers.
   For example:
   - a reader is scheduled out while holding the read lock of shrinker_rwsem
     in do_shrink_slab()
   - some shrinkers are blocked for too long, like the case mentioned in the
     patchset [1].

[1]. https://lore.kernel.org/lkml/[email protected]/

And all the down_read_trylock() hotspots caused by the above cases can be
solved by replacing the shrinker_rwsem trylocks with SRCU.
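
To illustrate the direction of the change (a simplified read-side sketch only,
not the actual patch; the real conversion is done by the patches in this
series):

```
/* Before: slab reclaim gives up whenever the trylock fails. */
if (!down_read_trylock(&shrinker_rwsem))
	return 0;				/* <- the hotspot above */
/* ... iterate shrinkers and call do_shrink_slab() ... */
up_read(&shrinker_rwsem);

/* After: readers use SRCU, so they never block and never bail out. */
int srcu_idx = srcu_read_lock(&shrinker_srcu);
/* ... iterate shrinkers and call do_shrink_slab() ... */
srcu_read_unlock(&shrinker_srcu, srcu_idx);
```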

2. Survey
=========

Before doing the code implementation, I found that there were many similar
submissions in the community:

a. Davidlohr Bueso submitted a patch in 2015.
Subject: [PATCH -next v2] mm: srcu-ify shrinkers
Link: https://lore.kernel.org/all/[email protected]/
Result: It was finally merged into the linux-next branch, but failed on arm
allnoconfig (without CONFIG_SRCU)

b. Tetsuo Handa submitted a patchset in 2017.
Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
Result: In the end, the current simple approach (break out when
rwsem_is_contended()) was chosen. Christoph Hellwig suggested using SRCU,
but SRCU was not unconditionally enabled at the time.

c. Kirill Tkhai submitted a patchset in 2018.
Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
Result: At that time, SRCU was not unconditionally enabled, and there were
some objections to enabling it. Later, because Kirill's focus moved to
other things, the patchset was no longer updated.

d. Sultan Alsawaf submitted a patch in 2021.
Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
Link: https://lore.kernel.org/lkml/[email protected]/
Result: Rejected because SRCU was not unconditionally enabled.

Almost all of these historical submissions were abandoned because SRCU was not
unconditionally enabled. But SRCU was made unconditional by Paul E. McKenney
in 2023 [2], so it is now time to replace the shrinker_rwsem trylocks with
SRCU.

[2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/

3. Reproduction and testing
===========================

We can reproduce the down_read_trylock() hotspot through the following script:

```
#!/bin/bash
DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir /sys/fs/cgroup/memory/test
    echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Save and execute the above script, and we get the following perf hotspots:

46.60% [kernel] [k] down_read_trylock
18.70% [kernel] [k] up_read
15.44% [kernel] [k] shrink_slab
4.37% [kernel] [k] _find_next_bit
2.75% [kernel] [k] xa_load
2.07% [kernel] [k] idr_find
1.73% [kernel] [k] do_shrink_slab
1.42% [kernel] [k] shrink_lruvec
0.74% [kernel] [k] shrink_node
0.60% [kernel] [k] list_lru_count_one

After applying this patchset, the hotspot becomes as follows:

19.53% [kernel] [k] _find_next_bit
14.63% [kernel] [k] do_shrink_slab
14.58% [kernel] [k] shrink_slab
11.83% [kernel] [k] shrink_lruvec
9.33% [kernel] [k] __blk_flush_plug
6.67% [kernel] [k] mem_cgroup_iter
3.73% [kernel] [k] list_lru_count_one
2.43% [kernel] [k] shrink_node
1.96% [kernel] [k] super_cache_count
1.78% [kernel] [k] __rcu_read_unlock
1.38% [kernel] [k] __srcu_read_lock
1.30% [kernel] [k] xas_descend

We can see that slab reclaim is no longer blocked by the shrinker_rwsem
trylock, which makes slab reclaim lockless.

This series is based on next-20230217.

Comments and suggestions are welcome.

Thanks,
Qi.

Changelog in v1 -> v2:
- add a map_nr_max field to shrinker_info (suggested by Kirill)
- use shrinker_mutex in reparent_shrinker_deferred() (pointed out by Kirill)

Qi Zheng (7):
mm: vmscan: add a map_nr_max field to shrinker_info
mm: vmscan: make global slab shrink lockless
mm: vmscan: make memcg slab shrink lockless
mm: shrinkers: make count and scan in shrinker debugfs lockless
mm: vmscan: hold write lock to reparent shrinker nr_deferred
mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
mm: shrinkers: convert shrinker_rwsem to mutex

drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
fs/super.c | 2 +-
include/linux/memcontrol.h | 1 +
mm/shrinker_debug.c | 38 ++++-----
mm/vmscan.c | 142 +++++++++++++++++----------------
6 files changed, 92 insertions(+), 95 deletions(-)

--
2.20.1



2023-02-23 13:28:28

by Qi Zheng

Subject: [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info

To prepare for the subsequent lockless memcg slab shrink,
add a map_nr_max field to struct shrinker_info to record
its own real shrinker_nr_max.

No functional changes.
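
For context, a simplified sketch (not part of this patch) of how the new field
is used once the memcg slab shrink becomes lockless later in this series:

```
	/* Readers iterate the bitmap using the bound stored in the
	 * shrinker_info they dereferenced, not the global shrinker_nr_max,
	 * so a reader still holding an older (smaller) info cannot walk
	 * past the end of its map.
	 */
	int i, srcu_idx;
	struct shrinker_info *info;

	srcu_idx = srcu_read_lock(&shrinker_srcu);	/* introduced in patch 3/7 */
	info = shrinker_info_srcu(memcg, nid);
	if (info) {
		for_each_set_bit(i, info->map, info->map_nr_max) {
			/* ... do_shrink_slab() for shrinker i ... */
		}
	}
	srcu_read_unlock(&shrinker_srcu, srcu_idx);
```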

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/memcontrol.h | 1 +
mm/vmscan.c | 29 ++++++++++++++++++-----------
2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..aa69ea98e2d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,6 +97,7 @@ struct shrinker_info {
struct rcu_head rcu;
atomic_long_t *nr_deferred;
unsigned long *map;
+ int map_nr_max;
};

struct lruvec_stats_percpu {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..9f895ca6216c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
lockdep_is_held(&shrinker_rwsem));
}

+static inline bool need_expand(int new_nr_max, int old_nr_max)
+{
+ return round_up(new_nr_max, BITS_PER_LONG) >
+ round_up(old_nr_max, BITS_PER_LONG);
+}
+
static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int map_size, int defer_size,
- int old_map_size, int old_defer_size)
+ int old_map_size, int old_defer_size,
+ int new_nr_max)
{
struct shrinker_info *new, *old;
struct mem_cgroup_per_node *pn;
@@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
if (!old)
return 0;

+ if (!need_expand(new_nr_max, old->map_nr_max))
+ return 0;
+
new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
if (!new)
return -ENOMEM;

new->nr_deferred = (atomic_long_t *)(new + 1);
new->map = (void *)new->nr_deferred + defer_size;
+ new->map_nr_max = new_nr_max;

/* map: set all old bits, clear all new bits */
memset(new->map, (int)0xff, old_map_size);
@@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
}
info->nr_deferred = (atomic_long_t *)(info + 1);
info->map = (void *)info->nr_deferred + defer_size;
+ info->map_nr_max = shrinker_nr_max;
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
up_write(&shrinker_rwsem);
@@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
return ret;
}

-static inline bool need_expand(int nr_max)
-{
- return round_up(nr_max, BITS_PER_LONG) >
- round_up(shrinker_nr_max, BITS_PER_LONG);
-}
-
static int expand_shrinker_info(int new_id)
{
int ret = 0;
@@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
int old_map_size, old_defer_size = 0;
struct mem_cgroup *memcg;

- if (!need_expand(new_nr_max))
+ if (!need_expand(new_nr_max, shrinker_nr_max))
goto out;

if (!root_mem_cgroup)
@@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
ret = expand_one_shrinker_info(memcg, map_size, defer_size,
- old_map_size, old_defer_size);
+ old_map_size, old_defer_size,
+ new_nr_max);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
goto out;
@@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
for_each_node(nid) {
child_info = shrinker_info_protected(memcg, nid);
parent_info = shrinker_info_protected(parent, nid);
- for (i = 0; i < shrinker_nr_max; i++) {
+ for (i = 0; i < child_info->map_nr_max; i++) {
nr = atomic_long_read(&child_info->nr_deferred[i]);
atomic_long_add(nr, &parent_info->nr_deferred[i]);
}
@@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
if (unlikely(!info))
goto unlock;

- for_each_set_bit(i, info->map, shrinker_nr_max) {
+ for_each_set_bit(i, info->map, info->map_nr_max) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
--
2.20.1


2023-02-23 13:28:32

by Qi Zheng

Subject: [PATCH v2 5/7] mm: vmscan: hold write lock to reparent shrinker nr_deferred

For now, reparent_shrinker_deferred() is the only holder of the
read lock of shrinker_rwsem. It already holds the global
cgroup_mutex, so it will not be called in parallel.

Therefore, in order to convert shrinker_rwsem to shrinker_mutex
later, switch to holding the write lock of shrinker_rwsem here
while reparenting.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25a4a660e45f..89602e97583a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -450,7 +450,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
parent = root_mem_cgroup;

/* Prevent from concurrent shrinker_info expand */
- down_read(&shrinker_rwsem);
+ down_write(&shrinker_rwsem);
for_each_node(nid) {
child_info = shrinker_info_protected(memcg, nid);
parent_info = shrinker_info_protected(parent, nid);
@@ -459,7 +459,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
atomic_long_add(nr, &parent_info->nr_deferred[i]);
}
}
- up_read(&shrinker_rwsem);
+ up_write(&shrinker_rwsem);
}

static bool cgroup_reclaim(struct scan_control *sc)
--
2.20.1


2023-02-23 13:28:47

by Qi Zheng

Subject: [PATCH v2 3/7] mm: vmscan: make memcg slab shrink lockless

As with the global slab shrink, since commit 1cd0bd06093c
("rcu: Remove CONFIG_SRCU"), it's time to use SRCU
to protect readers that previously held shrinker_rwsem.

We can test with the following script:

```
DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir /sys/fs/cgroup/memory/test
    echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Before applying:

46.60% [kernel] [k] down_read_trylock
18.70% [kernel] [k] up_read
15.44% [kernel] [k] shrink_slab
4.37% [kernel] [k] _find_next_bit
2.75% [kernel] [k] xa_load
2.07% [kernel] [k] idr_find
1.73% [kernel] [k] do_shrink_slab
1.42% [kernel] [k] shrink_lruvec
0.74% [kernel] [k] shrink_node
0.60% [kernel] [k] list_lru_count_one

After applying:

19.53% [kernel] [k] _find_next_bit
14.63% [kernel] [k] do_shrink_slab
14.58% [kernel] [k] shrink_slab
11.83% [kernel] [k] shrink_lruvec
9.33% [kernel] [k] __blk_flush_plug
6.67% [kernel] [k] mem_cgroup_iter
3.73% [kernel] [k] list_lru_count_one
2.43% [kernel] [k] shrink_node
1.96% [kernel] [k] super_cache_count
1.78% [kernel] [k] __rcu_read_unlock
1.38% [kernel] [k] __srcu_read_lock
1.30% [kernel] [k] xas_descend

We can see that the readers are no longer blocked.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 46 +++++++++++++++++++++++++++-------------------
1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 02987a6f95d1..25a4a660e45f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/srcu.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -221,8 +222,21 @@ static inline int shrinker_defer_size(int nr_items)
static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
int nid)
{
- return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
- lockdep_is_held(&shrinker_rwsem));
+ return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
+ &shrinker_srcu,
+ lockdep_is_held(&shrinker_rwsem));
+}
+
+static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
+ int nid)
+{
+ return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
+ &shrinker_srcu);
+}
+
+static void free_shrinker_info_rcu(struct rcu_head *head)
+{
+ kvfree(container_of(head, struct shrinker_info, rcu));
}

static inline bool need_expand(int new_nr_max, int old_nr_max)
@@ -268,7 +282,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
defer_size - old_defer_size);

rcu_assign_pointer(pn->shrinker_info, new);
- kvfree_rcu(old, rcu);
+ call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
}

return 0;
@@ -357,13 +371,14 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
{
if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
struct shrinker_info *info;
+ int srcu_idx;

- rcu_read_lock();
- info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
+ info = shrinker_info_srcu(memcg, nid);
/* Pairs with smp mb in shrink_slab() */
smp_mb__before_atomic();
set_bit(shrinker_id, info->map);
- rcu_read_unlock();
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
}
}

@@ -377,7 +392,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
return -ENOSYS;

down_write(&shrinker_rwsem);
- /* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -411,7 +425,7 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
{
struct shrinker_info *info;

- info = shrinker_info_protected(memcg, nid);
+ info = shrinker_info_srcu(memcg, nid);
return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
}

@@ -420,7 +434,7 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
{
struct shrinker_info *info;

- info = shrinker_info_protected(memcg, nid);
+ info = shrinker_info_srcu(memcg, nid);
return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
}

@@ -898,15 +912,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
{
struct shrinker_info *info;
unsigned long ret, freed = 0;
+ int srcu_idx;
int i;

if (!mem_cgroup_online(memcg))
return 0;

- if (!down_read_trylock(&shrinker_rwsem))
- return 0;
-
- info = shrinker_info_protected(memcg, nid);
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
+ info = shrinker_info_srcu(memcg, nid);
if (unlikely(!info))
goto unlock;

@@ -956,14 +969,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
set_shrinker_bit(memcg, nid, i);
}
freed += ret;
-
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
unlock:
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
return freed;
}
#else /* CONFIG_MEMCG */
--
2.20.1


2023-02-23 13:28:49

by Qi Zheng

Subject: [PATCH v2 6/7] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()

Now that there are no readers of shrinker_rwsem,
synchronize_shrinkers() no longer needs to take the write
lock of shrinker_rwsem to wait for all running shrinkers
to complete; synchronize_srcu() is enough.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 89602e97583a..d1a95d60d127 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -794,15 +794,11 @@ EXPORT_SYMBOL(unregister_shrinker);
/**
* synchronize_shrinkers - Wait for all running shrinkers to complete.
*
- * This is equivalent to calling unregister_shrink() and register_shrinker(),
- * but atomically and with less overhead. This is useful to guarantee that all
- * shrinker invocations have seen an update, before freeing memory, similar to
- * rcu.
+ * This is useful to guarantee that all shrinker invocations have seen an
+ * update, before freeing memory.
*/
void synchronize_shrinkers(void)
{
- down_write(&shrinker_rwsem);
- up_write(&shrinker_rwsem);
synchronize_srcu(&shrinker_srcu);
}
EXPORT_SYMBOL(synchronize_shrinkers);
--
2.20.1


2023-02-23 13:29:05

by Qi Zheng

Subject: [PATCH v2 7/7] mm: shrinkers: convert shrinker_rwsem to mutex

Now that there are no readers of shrinker_rwsem, we can
simply replace it with a mutex.

Signed-off-by: Qi Zheng <[email protected]>
---
drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
fs/super.c | 2 +-
mm/shrinker_debug.c | 14 +++++++-------
mm/vmscan.c | 34 +++++++++++++++++-----------------
5 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-cache-metadata.c b/drivers/md/dm-cache-metadata.c
index acffed750e3e..9e0c69958587 100644
--- a/drivers/md/dm-cache-metadata.c
+++ b/drivers/md/dm-cache-metadata.c
@@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs cmd root_lock).
- * - must take shrinker_rwsem without holding cmd->root_lock
+ * - must take shrinker_mutex without holding cmd->root_lock
*/
new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
CACHE_MAX_CONCURRENT_LOCKS);
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index fd464fb024c3..9f5cb52c5763 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs pmd root_lock).
- * - must take shrinker_rwsem without holding pmd->root_lock
+ * - must take shrinker_mutex without holding pmd->root_lock
*/
new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
THIN_MAX_CONCURRENT_LOCKS);
diff --git a/fs/super.c b/fs/super.c
index 84332d5cb817..91a4037b1d95 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -54,7 +54,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
* One thing we have to be careful of with a per-sb shrinker is that we don't
* drop the last active reference to the superblock from within the shrinker.
* If that happens we could trigger unregistering the shrinker from within the
- * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+ * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
* take a passive reference to the superblock to avoid this from occurring.
*/
static unsigned long super_cache_scan(struct shrinker *shrink,
diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 6aa7a7ec69da..b0f6aff372df 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -7,7 +7,7 @@
#include <linux/memcontrol.h>

/* defined in vmscan.c */
-extern struct rw_semaphore shrinker_rwsem;
+extern struct mutex shrinker_mutex;
extern struct list_head shrinker_list;
extern struct srcu_struct shrinker_srcu;

@@ -167,7 +167,7 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
char buf[128];
int id;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

/* debugfs isn't initialized yet, add debugfs entries later. */
if (!shrinker_debugfs_root)
@@ -210,7 +210,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
if (!new)
return -ENOMEM;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);

old = shrinker->name;
shrinker->name = new;
@@ -228,7 +228,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
shrinker->debugfs_entry = entry;
}

- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

kfree_const(old);

@@ -240,7 +240,7 @@ struct dentry *shrinker_debugfs_remove(struct shrinker *shrinker)
{
struct dentry *entry = shrinker->debugfs_entry;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

kfree_const(shrinker->name);
shrinker->name = NULL;
@@ -265,14 +265,14 @@ static int __init shrinker_debugfs_init(void)
shrinker_debugfs_root = dentry;

/* Create debugfs entries for shrinkers registered at boot */
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_for_each_entry(shrinker, &shrinker_list, list)
if (!shrinker->debugfs_entry) {
ret = shrinker_debugfs_add(shrinker);
if (ret)
break;
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

return ret;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d1a95d60d127..27ef9946ae8a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,7 @@
#include <linux/cpuset.h>
#include <linux/compaction.h>
#include <linux/notifier.h>
-#include <linux/rwsem.h>
+#include <linux/mutex.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -202,7 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
}

LIST_HEAD(shrinker_list);
-DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_MUTEX(shrinker_mutex);
DEFINE_SRCU(shrinker_srcu);

#ifdef CONFIG_MEMCG
@@ -224,7 +224,7 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
{
return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
&shrinker_srcu,
- lockdep_is_held(&shrinker_rwsem));
+ lockdep_is_held(&shrinker_mutex));
}

static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
@@ -308,7 +308,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
int nid, size, ret = 0;
int map_size, defer_size = 0;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
map_size = shrinker_map_size(shrinker_nr_max);
defer_size = shrinker_defer_size(shrinker_nr_max);
size = map_size + defer_size;
@@ -324,7 +324,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
info->map_nr_max = shrinker_nr_max;
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

return ret;
}
@@ -343,7 +343,7 @@ static int expand_shrinker_info(int new_id)
if (!root_mem_cgroup)
goto out;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

map_size = shrinker_map_size(new_nr_max);
defer_size = shrinker_defer_size(new_nr_max);
@@ -391,7 +391,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
if (mem_cgroup_disabled())
return -ENOSYS;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -405,7 +405,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
shrinker->id = id;
ret = 0;
unlock:
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return ret;
}

@@ -415,7 +415,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)

BUG_ON(id < 0);

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

idr_remove(&shrinker_idr, id);
}
@@ -450,7 +450,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
parent = root_mem_cgroup;

/* Prevent from concurrent shrinker_info expand */
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
for_each_node(nid) {
child_info = shrinker_info_protected(memcg, nid);
parent_info = shrinker_info_protected(parent, nid);
@@ -459,7 +459,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
atomic_long_add(nr, &parent_info->nr_deferred[i]);
}
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}

static bool cgroup_reclaim(struct scan_control *sc)
@@ -708,9 +708,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
shrinker->name = NULL;
#endif
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
unregister_memcg_shrinker(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return;
}

@@ -720,11 +720,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)

void register_shrinker_prepared(struct shrinker *shrinker)
{
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_add_tail_rcu(&shrinker->list, &shrinker_list);
shrinker->flags |= SHRINKER_REGISTERED;
shrinker_debugfs_add(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}

static int __register_shrinker(struct shrinker *shrinker)
@@ -774,13 +774,13 @@ void unregister_shrinker(struct shrinker *shrinker)
if (!(shrinker->flags & SHRINKER_REGISTERED))
return;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_del_rcu(&shrinker->list);
shrinker->flags &= ~SHRINKER_REGISTERED;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
debugfs_entry = shrinker_debugfs_remove(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

synchronize_srcu(&shrinker_srcu);

--
2.20.1


2023-02-23 18:19:53

by Paul E. McKenney

Subject: Re: [PATCH v2 0/7] make slab shrink lockless

On Thu, Feb 23, 2023 at 09:27:18PM +0800, Qi Zheng wrote:
> Hi all,
>
> This patch series aims to make slab shrink lockless.
>
> 1. Background
> =============
>
> On our servers, we often find the following system cpu hotspots:
>
> 44.16% [kernel] [k] down_read_trylock
> 14.12% [kernel] [k] up_read
> 13.43% [kernel] [k] shrink_slab
> 5.25% [kernel] [k] count_shadow_nodes
> 3.42% [kernel] [k] idr_find
>
> Then we used bpftrace to capture its calltrace as follows:
>
> @[
> down_read_trylock+5
> shrink_slab+292
> shrink_node+640
> do_try_to_free_pages+211
> try_to_free_mem_cgroup_pages+266
> try_charge_memcg+386
> charge_memcg+51
> __mem_cgroup_charge+44
> __handle_mm_fault+1416
> handle_mm_fault+260
> do_user_addr_fault+459
> exc_page_fault+104
> asm_exc_page_fault+38
> clear_user_rep_good+18
> read_zero+100
> vfs_read+176
> ksys_read+93
> do_syscall_64+62
> entry_SYSCALL_64_after_hwframe+114
> ]: 1868979
>
> It is easy to see that this is caused by the frequent failure to obtain the
> read lock of shrinker_rwsem when reclaiming slab memory.
>
> Currently, the shrinker_rwsem is a global lock. And the following cases may
> cause the above system cpu hotspots:
>
> a. the write lock of shrinker_rwsem was held for too long. For example, there
> are many memcgs in the system, which causes some paths to hold locks and
> traverse it for too long. (e.g. expand_shrinker_info())
> b. the read lock of shrinker_rwsem was held for too long, and a writer came at
> this time. Then this writer will be forced to wait and block all subsequent
> readers.
> For example:
> - be scheduled when the read lock of shrinker_rwsem is held in
> do_shrink_slab()
> - some shrinker are blocked for too long. Like the case mentioned in the
> patchset[1].
>
> [1]. https://lore.kernel.org/lkml/[email protected]/
>
> And all the down_read_trylock() hotspots caused by the above cases can be
> solved by replacing the shrinker_rwsem trylocks with SRCU.

Glad to see that making SRCU unconditional was helpful! And I do very
much like the idea of the shrinker running better!

The main thing that enabled unconditional SRCU was the code added in
v5.19 to dynamically allocate SRCU's srcu_node combining tree. This is
important for a number of Linux distributions that have NR_CPUS up in the
thousands, for which this combining tree is quite large. In v5.19 and
later, srcu_struct structures without frequent call_srcu() invocations
never allocate that combining tree. Even srcu_struct structures that
have enough call_srcu() activity to cause the lock contention that in
turn forces the combining tree to be allocated, that combining tree
is sized for the actual number of CPUs present, which is usually way
smaller than NR_CPUS.

So if you are going to backport this back past v5.19, you might also
need those SRCU changes. Or not, depending on how much memory your
systems are equipped with. ;-)

Thanx, Paul

> 2. Survey
> =========
>
> Before doing the code implementation, I found that there were many similar
> submissions in the community:
>
> a. Davidlohr Bueso submitted a patch in 2015.
> Subject: [PATCH -next v2] mm: srcu-ify shrinkers
> Link: https://lore.kernel.org/all/[email protected]/
> Result: It was finally merged into the linux-next branch, but failed on arm
> allnoconfig (without CONFIG_SRCU)
>
> b. Tetsuo Handa submitted a patchset in 2017.
> Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
> Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
> Result: Finally chose to use the current simple way (break when
> rwsem_is_contended()). And Christoph Hellwig suggested to using SRCU,
> but SRCU was not unconditionally enabled at the time.
>
> c. Kirill Tkhai submitted a patchset in 2018.
> Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
> Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> Result: At that time, SRCU was not unconditionally enabled, and there were
> some objections to enabling SRCU. Later, because Kirill's focus was
> moved to other things, this patchset was not continued to be updated.
>
> d. Sultan Alsawaf submitted a patch in 2021.
> Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
> Link: https://lore.kernel.org/lkml/[email protected]/
> Result: Rejected because SRCU was not unconditionally enabled.
>
> We can find that almost all these historical commits were abandoned because SRCU
> was not unconditionally enabled. But now SRCU has been unconditionally enable
> by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks
> with SRCU.
>
> [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
>
> 3. Reproduction and testing
> ===========================
>
> We can reproduce the down_read_trylock() hotspot through the following script:
>
> ```
> #!/bin/bash
> DIR="/root/shrinker/memcg/mnt"
>
> do_create()
> {
> mkdir /sys/fs/cgroup/memory/test
> echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> for i in `seq 0 $1`;
> do
> mkdir /sys/fs/cgroup/memory/test/$i;
> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
> mkdir -p $DIR/$i;
> done
> }
>
> do_mount()
> {
> for i in `seq $1 $2`;
> do
> mount -t tmpfs $i $DIR/$i;
> done
> }
>
> do_touch()
> {
> for i in `seq $1 $2`;
> do
> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
> dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
> done
> }
>
> do_create 2000
> do_mount 0 2000
> do_touch 0 1000
> ```
>
> Save the above script and execute it, we can get the following perf hotspots:
>
> 46.60% [kernel] [k] down_read_trylock
> 18.70% [kernel] [k] up_read
> 15.44% [kernel] [k] shrink_slab
> 4.37% [kernel] [k] _find_next_bit
> 2.75% [kernel] [k] xa_load
> 2.07% [kernel] [k] idr_find
> 1.73% [kernel] [k] do_shrink_slab
> 1.42% [kernel] [k] shrink_lruvec
> 0.74% [kernel] [k] shrink_node
> 0.60% [kernel] [k] list_lru_count_one
>
> After applying this patchset, the hotspot becomes as follows:
>
> 19.53% [kernel] [k] _find_next_bit
> 14.63% [kernel] [k] do_shrink_slab
> 14.58% [kernel] [k] shrink_slab
> 11.83% [kernel] [k] shrink_lruvec
> 9.33% [kernel] [k] __blk_flush_plug
> 6.67% [kernel] [k] mem_cgroup_iter
> 3.73% [kernel] [k] list_lru_count_one
> 2.43% [kernel] [k] shrink_node
> 1.96% [kernel] [k] super_cache_count
> 1.78% [kernel] [k] __rcu_read_unlock
> 1.38% [kernel] [k] __srcu_read_lock
> 1.30% [kernel] [k] xas_descend
>
> We can see that the slab reclaim is no longer blocked by shinker_rwsem trylock,
> which realizes the lockless slab reclaim.
>
> This series is based on next-20230217.
>
> Comments and suggestions are welcome.
>
> Thanks,
> Qi.
>
> Changelog in v1 -> v2:
> - add a map_nr_max field to shrinker_info (suggested by Kirill)
> - use shrinker_mutex in reparent_shrinker_deferred() (pointed by Kirill)
>
> Qi Zheng (7):
> mm: vmscan: add a map_nr_max field to shrinker_info
> mm: vmscan: make global slab shrink lockless
> mm: vmscan: make memcg slab shrink lockless
> mm: shrinkers: make count and scan in shrinker debugfs lockless
> mm: vmscan: hold write lock to reparent shrinker nr_deferred
> mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
> mm: shrinkers: convert shrinker_rwsem to mutex
>
> drivers/md/dm-cache-metadata.c | 2 +-
> drivers/md/dm-thin-metadata.c | 2 +-
> fs/super.c | 2 +-
> include/linux/memcontrol.h | 1 +
> mm/shrinker_debug.c | 38 ++++-----
> mm/vmscan.c | 142 +++++++++++++++++----------------
> 6 files changed, 92 insertions(+), 95 deletions(-)
>
> --
> 2.20.1
>

2023-02-24 04:12:22

by Qi Zheng

Subject: Re: [PATCH v2 0/7] make slab shrink lockless



On 2023/2/24 02:19, Paul E. McKenney wrote:
> On Thu, Feb 23, 2023 at 09:27:18PM +0800, Qi Zheng wrote:
>> Hi all,
>>
>> This patch series aims to make slab shrink lockless.
>>
>> 1. Background
>> =============
>>
>> On our servers, we often find the following system cpu hotspots:
>>
>> 44.16% [kernel] [k] down_read_trylock
>> 14.12% [kernel] [k] up_read
>> 13.43% [kernel] [k] shrink_slab
>> 5.25% [kernel] [k] count_shadow_nodes
>> 3.42% [kernel] [k] idr_find
>>
>> Then we used bpftrace to capture its calltrace as follows:
>>
>> @[
>> down_read_trylock+5
>> shrink_slab+292
>> shrink_node+640
>> do_try_to_free_pages+211
>> try_to_free_mem_cgroup_pages+266
>> try_charge_memcg+386
>> charge_memcg+51
>> __mem_cgroup_charge+44
>> __handle_mm_fault+1416
>> handle_mm_fault+260
>> do_user_addr_fault+459
>> exc_page_fault+104
>> asm_exc_page_fault+38
>> clear_user_rep_good+18
>> read_zero+100
>> vfs_read+176
>> ksys_read+93
>> do_syscall_64+62
>> entry_SYSCALL_64_after_hwframe+114
>> ]: 1868979
>>
>> It is easy to see that this is caused by the frequent failure to obtain the
>> read lock of shrinker_rwsem when reclaiming slab memory.
>>
>> Currently, the shrinker_rwsem is a global lock. And the following cases may
>> cause the above system cpu hotspots:
>>
>> a. the write lock of shrinker_rwsem was held for too long. For example, there
>> are many memcgs in the system, which causes some paths to hold locks and
>> traverse it for too long. (e.g. expand_shrinker_info())
>> b. the read lock of shrinker_rwsem was held for too long, and a writer came at
>> this time. Then this writer will be forced to wait and block all subsequent
>> readers.
>> For example:
>> - be scheduled when the read lock of shrinker_rwsem is held in
>> do_shrink_slab()
>> - some shrinker are blocked for too long. Like the case mentioned in the
>> patchset[1].
>>
>> [1]. https://lore.kernel.org/lkml/[email protected]/
>>
>> And all the down_read_trylock() hotspots caused by the above cases can be
>> solved by replacing the shrinker_rwsem trylocks with SRCU.

Hi Paul,

>
> Glad to see that making SRCU unconditional was helpful! And I do very
> much like the idea of the shrinker running better!

+1 :)

>
> The main thing that enabled unconditional SRCU was the code added in
> v5.19 to dynamically allocate SRCU's srcu_node combining tree. This is
> important for a number of Linux distributions that have NR_CPUS up in the
> thousands, for which this combining tree is quite large. In v5.19 and
> later, srcu_struct structures without frequent call_srcu() invocations
> never allocate that combining tree. Even srcu_struct structures that
> have enough call_srcu() activity to cause the lock contention that in
> turn forces the combining tree to be allocated, that combining tree
> is sized for the actual number of CPUs present, which is usually way
> smaller than NR_CPUS.

Thank you very much for such a detailed background introduction. :)

>
> So if you are going to backport this back past v5.19, you might also
> need those SRCU changes. Or not, depending on how much memory your
> systems are equipped with. ;-)

Got it.

Thanks,
Qi

>
> Thanx, Paul
>
>> 2. Survey
>> =========
>>
>> Before doing the code implementation, I found that there were many similar
>> submissions in the community:
>>
>> a. Davidlohr Bueso submitted a patch in 2015.
>> Subject: [PATCH -next v2] mm: srcu-ify shrinkers
>> Link: https://lore.kernel.org/all/[email protected]/
>> Result: It was finally merged into the linux-next branch, but failed on arm
>> allnoconfig (without CONFIG_SRCU)
>>
>> b. Tetsuo Handa submitted a patchset in 2017.
>> Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
>> Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>> Result: Finally chose to use the current simple way (break when
>> rwsem_is_contended()). And Christoph Hellwig suggested to using SRCU,
>> but SRCU was not unconditionally enabled at the time.
>>
>> c. Kirill Tkhai submitted a patchset in 2018.
>> Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
>> Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>> Result: At that time, SRCU was not unconditionally enabled, and there were
>> some objections to enabling SRCU. Later, because Kirill's focus was
>> moved to other things, this patchset was not continued to be updated.
>>
>> d. Sultan Alsawaf submitted a patch in 2021.
>> Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Result: Rejected because SRCU was not unconditionally enabled.
>>
>> We can find that almost all these historical commits were abandoned because SRCU
>> was not unconditionally enabled. But now SRCU has been unconditionally enable
>> by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks
>> with SRCU.
>>
>> [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
>>
>> 3. Reproduction and testing
>> ===========================
>>
>> We can reproduce the down_read_trylock() hotspot through the following script:
>>
>> ```
>> #!/bin/bash
>> DIR="/root/shrinker/memcg/mnt"
>>
>> do_create()
>> {
>> mkdir /sys/fs/cgroup/memory/test
>> echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> for i in `seq 0 $1`;
>> do
>> mkdir /sys/fs/cgroup/memory/test/$i;
>> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>> mkdir -p $DIR/$i;
>> done
>> }
>>
>> do_mount()
>> {
>> for i in `seq $1 $2`;
>> do
>> mount -t tmpfs $i $DIR/$i;
>> done
>> }
>>
>> do_touch()
>> {
>> for i in `seq $1 $2`;
>> do
>> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>> dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>> done
>> }
>>
>> do_create 2000
>> do_mount 0 2000
>> do_touch 0 1000
>> ```
>>
>> Save the above script and execute it, we can get the following perf hotspots:
>>
>> 46.60% [kernel] [k] down_read_trylock
>> 18.70% [kernel] [k] up_read
>> 15.44% [kernel] [k] shrink_slab
>> 4.37% [kernel] [k] _find_next_bit
>> 2.75% [kernel] [k] xa_load
>> 2.07% [kernel] [k] idr_find
>> 1.73% [kernel] [k] do_shrink_slab
>> 1.42% [kernel] [k] shrink_lruvec
>> 0.74% [kernel] [k] shrink_node
>> 0.60% [kernel] [k] list_lru_count_one
>>
>> After applying this patchset, the hotspot becomes as follows:
>>
>> 19.53% [kernel] [k] _find_next_bit
>> 14.63% [kernel] [k] do_shrink_slab
>> 14.58% [kernel] [k] shrink_slab
>> 11.83% [kernel] [k] shrink_lruvec
>> 9.33% [kernel] [k] __blk_flush_plug
>> 6.67% [kernel] [k] mem_cgroup_iter
>> 3.73% [kernel] [k] list_lru_count_one
>> 2.43% [kernel] [k] shrink_node
>> 1.96% [kernel] [k] super_cache_count
>> 1.78% [kernel] [k] __rcu_read_unlock
>> 1.38% [kernel] [k] __srcu_read_lock
>> 1.30% [kernel] [k] xas_descend
>>
>> We can see that the slab reclaim is no longer blocked by shinker_rwsem trylock,
>> which realizes the lockless slab reclaim.
>>
>> This series is based on next-20230217.
>>
>> Comments and suggestions are welcome.
>>
>> Thanks,
>> Qi.
>>
>> Changelog in v1 -> v2:
>> - add a map_nr_max field to shrinker_info (suggested by Kirill)
>> - use shrinker_mutex in reparent_shrinker_deferred() (pointed by Kirill)
>>
>> Qi Zheng (7):
>> mm: vmscan: add a map_nr_max field to shrinker_info
>> mm: vmscan: make global slab shrink lockless
>> mm: vmscan: make memcg slab shrink lockless
>> mm: shrinkers: make count and scan in shrinker debugfs lockless
>> mm: vmscan: hold write lock to reparent shrinker nr_deferred
>> mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
>> mm: shrinkers: convert shrinker_rwsem to mutex
>>
>> drivers/md/dm-cache-metadata.c | 2 +-
>> drivers/md/dm-thin-metadata.c | 2 +-
>> fs/super.c | 2 +-
>> include/linux/memcontrol.h | 1 +
>> mm/shrinker_debug.c | 38 ++++-----
>> mm/vmscan.c | 142 +++++++++++++++++----------------
>> 6 files changed, 92 insertions(+), 95 deletions(-)
>>
>> --
>> 2.20.1
>>

--
Thanks,
Qi

2023-02-25 08:18:54

by Qi Zheng

Subject: Re: [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info



On 2023/2/23 21:27, Qi Zheng wrote:
> To prepare for the subsequent lockless memcg slab shrink,
> add a map_nr_max field to struct shrinker_info to records
> its own real shrinker_nr_max.
>
> No functional changes.
>
> Signed-off-by: Qi Zheng <[email protected]>

I missed Suggested-by here, hi Kirill, can I add it?

Suggested-by: Kirill Tkhai <[email protected]>

> ---
> include/linux/memcontrol.h | 1 +
> mm/vmscan.c | 29 ++++++++++++++++++-----------
> 2 files changed, 19 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b6eda2ab205d..aa69ea98e2d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -97,6 +97,7 @@ struct shrinker_info {
> struct rcu_head rcu;
> atomic_long_t *nr_deferred;
> unsigned long *map;
> + int map_nr_max;
> };
>
> struct lruvec_stats_percpu {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b24b8..9f895ca6216c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> lockdep_is_held(&shrinker_rwsem));
> }
>
> +static inline bool need_expand(int new_nr_max, int old_nr_max)
> +{
> + return round_up(new_nr_max, BITS_PER_LONG) >
> + round_up(old_nr_max, BITS_PER_LONG);
> +}
> +
> static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> int map_size, int defer_size,
> - int old_map_size, int old_defer_size)
> + int old_map_size, int old_defer_size,
> + int new_nr_max)
> {
> struct shrinker_info *new, *old;
> struct mem_cgroup_per_node *pn;
> @@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> if (!old)
> return 0;
>
> + if (!need_expand(new_nr_max, old->map_nr_max))
> + return 0;
> +
> new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> if (!new)
> return -ENOMEM;
>
> new->nr_deferred = (atomic_long_t *)(new + 1);
> new->map = (void *)new->nr_deferred + defer_size;
> + new->map_nr_max = new_nr_max;
>
> /* map: set all old bits, clear all new bits */
> memset(new->map, (int)0xff, old_map_size);
> @@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> }
> info->nr_deferred = (atomic_long_t *)(info + 1);
> info->map = (void *)info->nr_deferred + defer_size;
> + info->map_nr_max = shrinker_nr_max;
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> }
> up_write(&shrinker_rwsem);
> @@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> return ret;
> }
>
> -static inline bool need_expand(int nr_max)
> -{
> - return round_up(nr_max, BITS_PER_LONG) >
> - round_up(shrinker_nr_max, BITS_PER_LONG);
> -}
> -
> static int expand_shrinker_info(int new_id)
> {
> int ret = 0;
> @@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
> int old_map_size, old_defer_size = 0;
> struct mem_cgroup *memcg;
>
> - if (!need_expand(new_nr_max))
> + if (!need_expand(new_nr_max, shrinker_nr_max))
> goto out;
>
> if (!root_mem_cgroup)
> @@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> - old_map_size, old_defer_size);
> + old_map_size, old_defer_size,
> + new_nr_max);
> if (ret) {
> mem_cgroup_iter_break(NULL, memcg);
> goto out;
> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
> for_each_node(nid) {
> child_info = shrinker_info_protected(memcg, nid);
> parent_info = shrinker_info_protected(parent, nid);
> - for (i = 0; i < shrinker_nr_max; i++) {
> + for (i = 0; i < child_info->map_nr_max; i++) {
> nr = atomic_long_read(&child_info->nr_deferred[i]);
> atomic_long_add(nr, &parent_info->nr_deferred[i]);
> }
> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> if (unlikely(!info))
> goto unlock;
>
> - for_each_set_bit(i, info->map, shrinker_nr_max) {
> + for_each_set_bit(i, info->map, info->map_nr_max) {
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> .nid = nid,

--
Thanks,
Qi

2023-02-25 15:14:59

by Kirill Tkhai

Subject: Re: [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info

Hi Qi,

On 25.02.2023 11:18, Qi Zheng wrote:
>
>
> On 2023/2/23 21:27, Qi Zheng wrote:
>> To prepare for the subsequent lockless memcg slab shrink,
>> add a map_nr_max field to struct shrinker_info to records
>> its own real shrinker_nr_max.
>>
>> No functional changes.
>>
>> Signed-off-by: Qi Zheng <[email protected]>
>
> I missed Suggested-by here, hi Kirill, can I add it?
>
> Suggested-by: Kirill Tkhai <[email protected]>

Yes, feel free to add this tag.

There is a comment below.

>> ---
>>   include/linux/memcontrol.h |  1 +
>>   mm/vmscan.c                | 29 ++++++++++++++++++-----------
>>   2 files changed, 19 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index b6eda2ab205d..aa69ea98e2d8 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -97,6 +97,7 @@ struct shrinker_info {
>>       struct rcu_head rcu;
>>       atomic_long_t *nr_deferred;
>>       unsigned long *map;
>> +    int map_nr_max;
>>   };
>>     struct lruvec_stats_percpu {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 9c1c5e8b24b8..9f895ca6216c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>>                        lockdep_is_held(&shrinker_rwsem));
>>   }
>>   +static inline bool need_expand(int new_nr_max, int old_nr_max)
>> +{
>> +    return round_up(new_nr_max, BITS_PER_LONG) >
>> +           round_up(old_nr_max, BITS_PER_LONG);
>> +}
>> +
>>   static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>                       int map_size, int defer_size,
>> -                    int old_map_size, int old_defer_size)
>> +                    int old_map_size, int old_defer_size,
>> +                    int new_nr_max)
>>   {
>>       struct shrinker_info *new, *old;
>>       struct mem_cgroup_per_node *pn;
>> @@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>           if (!old)
>>               return 0;
>>   +        if (!need_expand(new_nr_max, old->map_nr_max))
>> +            return 0;
>> +
>>           new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>>           if (!new)
>>               return -ENOMEM;
>>             new->nr_deferred = (atomic_long_t *)(new + 1);
>>           new->map = (void *)new->nr_deferred + defer_size;
>> +        new->map_nr_max = new_nr_max;
>>             /* map: set all old bits, clear all new bits */
>>           memset(new->map, (int)0xff, old_map_size);
>> @@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>           }
>>           info->nr_deferred = (atomic_long_t *)(info + 1);
>>           info->map = (void *)info->nr_deferred + defer_size;
>> +        info->map_nr_max = shrinker_nr_max;
>>           rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>>       }
>>       up_write(&shrinker_rwsem);
>> @@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>       return ret;
>>   }
>>   -static inline bool need_expand(int nr_max)
>> -{
>> -    return round_up(nr_max, BITS_PER_LONG) >
>> -           round_up(shrinker_nr_max, BITS_PER_LONG);
>> -}
>> -
>>   static int expand_shrinker_info(int new_id)
>>   {
>>       int ret = 0;
>> @@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
>>       int old_map_size, old_defer_size = 0;
>>       struct mem_cgroup *memcg;
>>   -    if (!need_expand(new_nr_max))
>> +    if (!need_expand(new_nr_max, shrinker_nr_max))
>>           goto out;
>>         if (!root_mem_cgroup)
>> @@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
>>       memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>       do {
>>           ret = expand_one_shrinker_info(memcg, map_size, defer_size,
>> -                           old_map_size, old_defer_size);
>> +                           old_map_size, old_defer_size,
>> +                           new_nr_max);
>>           if (ret) {
>>               mem_cgroup_iter_break(NULL, memcg);
>>               goto out;
>> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>       for_each_node(nid) {
>>           child_info = shrinker_info_protected(memcg, nid);
>>           parent_info = shrinker_info_protected(parent, nid);
>> -        for (i = 0; i < shrinker_nr_max; i++) {
>> +        for (i = 0; i < child_info->map_nr_max; i++) {
>>               nr = atomic_long_read(&child_info->nr_deferred[i]);
>>               atomic_long_add(nr, &parent_info->nr_deferred[i]);
>>           }
>> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>       if (unlikely(!info))
>>           goto unlock;
>>   -    for_each_set_bit(i, info->map, shrinker_nr_max) {
>> +    for_each_set_bit(i, info->map, info->map_nr_max) {
>>           struct shrink_control sc = {
>>               .gfp_mask = gfp_mask,
>>               .nid = nid,

The patch as a whole won't work as expected. It won't ever call shrinkers with ids in [round_down(shrinker_nr_max, sizeof(unsigned long)) + 1, shrinker_nr_max - 1].

Just replay the sequence in which we add new shrinkers:

1)We add shrinker #0:
shrinker_nr_max = 0;

prealloc_memcg_shrinker()
id = 0;
expand_shrinker_info(0)
new_nr_max = 1;
expand_one_shrinker_info(new_nr_max = 1)
new->map_nr_max = 1;
shrinker_nr_max = 1;

2)We add shrinker #1:
prealloc_memcg_shrinker()
id = 1;
expand_shrinker_info(1)
new_nr_max = 2;
need_expand(2, 1) => false => ignore expand
shrinker_nr_max = 2;

3)Then we call shrinker:
shrink_slab_memcg()
for_each_set_bit(i, info->map, 1/* info->map_nr_max */ ) {
} => ignore shrinker #1
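
To make the arithmetic in step 2 concrete (an illustrative userspace snippet,
not part of the original mail, assuming BITS_PER_LONG == 64):

```
#include <stdio.h>

#define BITS_PER_LONG 64	/* assumption: 64-bit build */
/* same arithmetic as the kernel's round_up() for power-of-two alignment */
#define round_up(x, y) ((((x) - 1) | ((y) - 1)) + 1)

int main(void)
{
	/* step 2 above: new_nr_max = 2, old map_nr_max = 1 */
	int expand = round_up(2, BITS_PER_LONG) > round_up(1, BITS_PER_LONG);

	/* both round up to 64, so no expand happens: map_nr_max stays 1
	 * while shrinker_nr_max becomes 2, and shrinker #1 is never visited */
	printf("need_expand(2, 1) = %d\n", expand);	/* prints 0 */
	return 0;
}
```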

I would fix this patch with something like the below:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f895ca6216c..bb617a3871f1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -224,12 +224,6 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
lockdep_is_held(&shrinker_rwsem));
}

-static inline bool need_expand(int new_nr_max, int old_nr_max)
-{
- return round_up(new_nr_max, BITS_PER_LONG) >
- round_up(old_nr_max, BITS_PER_LONG);
-}
-
static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int map_size, int defer_size,
int old_map_size, int old_defer_size,
@@ -247,9 +241,6 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
if (!old)
return 0;

- if (!need_expand(new_nr_max, old->map_nr_max))
- return 0;
-
new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
if (!new)
return -ENOMEM;
@@ -317,14 +308,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
static int expand_shrinker_info(int new_id)
{
int ret = 0;
- int new_nr_max = new_id + 1;
+ int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
int map_size, defer_size = 0;
int old_map_size, old_defer_size = 0;
struct mem_cgroup *memcg;

- if (!need_expand(new_nr_max, shrinker_nr_max))
- goto out;
-
if (!root_mem_cgroup)
goto out;

@@ -359,9 +347,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)

rcu_read_lock();
info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
- /* Pairs with smp mb in shrink_slab() */
- smp_mb__before_atomic();
- set_bit(shrinker_id, info->map);
+ if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
+ /* Pairs with smp mb in shrink_slab() */
+ smp_mb__before_atomic();
+ set_bit(shrinker_id, info->map);
+ }
rcu_read_unlock();
}
}

(I also added a new check into set_shrinker_bit() for safety).

Kirill

2023-02-25 15:53:23

by Qi Zheng

Subject: Re: [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info



On 2023/2/25 23:14, Kirill Tkhai wrote:
> Hi Qi,
>
> On 25.02.2023 11:18, Qi Zheng wrote:
>>
>>
>> On 2023/2/23 21:27, Qi Zheng wrote:
>>> To prepare for the subsequent lockless memcg slab shrink,
>>> add a map_nr_max field to struct shrinker_info to records
>>> its own real shrinker_nr_max.
>>>
>>> No functional changes.
>>>
>>> Signed-off-by: Qi Zheng <[email protected]>
>>
>> I missed Suggested-by here, hi Kirill, can I add it?
>>
>> Suggested-by: Kirill Tkhai <[email protected]>
>
> Yes, feel free to add this tag.

Thanks.

>
> There is a comment below.
>
>>> ---
>>>   include/linux/memcontrol.h |  1 +
>>>   mm/vmscan.c                | 29 ++++++++++++++++++-----------
>>>   2 files changed, 19 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index b6eda2ab205d..aa69ea98e2d8 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -97,6 +97,7 @@ struct shrinker_info {
>>>       struct rcu_head rcu;
>>>       atomic_long_t *nr_deferred;
>>>       unsigned long *map;
>>> +    int map_nr_max;
>>>   };
>>>     struct lruvec_stats_percpu {
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 9c1c5e8b24b8..9f895ca6216c 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>>>                        lockdep_is_held(&shrinker_rwsem));
>>>   }
>>>   +static inline bool need_expand(int new_nr_max, int old_nr_max)
>>> +{
>>> +    return round_up(new_nr_max, BITS_PER_LONG) >
>>> +           round_up(old_nr_max, BITS_PER_LONG);
>>> +}
>>> +
>>>   static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>                       int map_size, int defer_size,
>>> -                    int old_map_size, int old_defer_size)
>>> +                    int old_map_size, int old_defer_size,
>>> +                    int new_nr_max)
>>>   {
>>>       struct shrinker_info *new, *old;
>>>       struct mem_cgroup_per_node *pn;
>>> @@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>           if (!old)
>>>               return 0;
>>>   +        if (!need_expand(new_nr_max, old->map_nr_max))
>>> +            return 0;
>>> +
>>>           new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>>>           if (!new)
>>>               return -ENOMEM;
>>>             new->nr_deferred = (atomic_long_t *)(new + 1);
>>>           new->map = (void *)new->nr_deferred + defer_size;
>>> +        new->map_nr_max = new_nr_max;
>>>             /* map: set all old bits, clear all new bits */
>>>           memset(new->map, (int)0xff, old_map_size);
>>> @@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>>           }
>>>           info->nr_deferred = (atomic_long_t *)(info + 1);
>>>           info->map = (void *)info->nr_deferred + defer_size;
>>> +        info->map_nr_max = shrinker_nr_max;
>>>           rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>>>       }
>>>       up_write(&shrinker_rwsem);
>>> @@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>>       return ret;
>>>   }
>>>   -static inline bool need_expand(int nr_max)
>>> -{
>>> -    return round_up(nr_max, BITS_PER_LONG) >
>>> -           round_up(shrinker_nr_max, BITS_PER_LONG);
>>> -}
>>> -
>>>   static int expand_shrinker_info(int new_id)
>>>   {
>>>       int ret = 0;
>>> @@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
>>>       int old_map_size, old_defer_size = 0;
>>>       struct mem_cgroup *memcg;
>>>   -    if (!need_expand(new_nr_max))
>>> +    if (!need_expand(new_nr_max, shrinker_nr_max))
>>>           goto out;
>>>         if (!root_mem_cgroup)
>>> @@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
>>>       memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>>       do {
>>>           ret = expand_one_shrinker_info(memcg, map_size, defer_size,
>>> -                           old_map_size, old_defer_size);
>>> +                           old_map_size, old_defer_size,
>>> +                           new_nr_max);
>>>           if (ret) {
>>>               mem_cgroup_iter_break(NULL, memcg);
>>>               goto out;
>>> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>>       for_each_node(nid) {
>>>           child_info = shrinker_info_protected(memcg, nid);
>>>           parent_info = shrinker_info_protected(parent, nid);
>>> -        for (i = 0; i < shrinker_nr_max; i++) {
>>> +        for (i = 0; i < child_info->map_nr_max; i++) {
>>>               nr = atomic_long_read(&child_info->nr_deferred[i]);
>>>               atomic_long_add(nr, &parent_info->nr_deferred[i]);
>>>           }
>>> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>       if (unlikely(!info))
>>>           goto unlock;
>>>   -    for_each_set_bit(i, info->map, shrinker_nr_max) {
>>> +    for_each_set_bit(i, info->map, info->map_nr_max) {
>>>           struct shrink_control sc = {
>>>               .gfp_mask = gfp_mask,
>>>               .nid = nid,
>
> The patch as a whole won't work as expected. It will never call a shrinker with an id in [round_down(shrinker_nr_max, sizeof(unsigned long)) + 1, shrinker_nr_max - 1]
>
> Just replay the sequence we add new shrinkers:
>
> 1)We add shrinker #0:
> shrinker_nr_max = 0;
>
> prealloc_memcg_shrinker()
> id = 0;
> expand_shrinker_info(0)
> new_nr_max = 1;
> expand_one_shrinker_info(new_nr_max = 1)
> new->map_nr_max = 1;
> shrinker_nr_max = 1;
>
> 2)We add shrinker #1:
> prealloc_memcg_shrinker()
> id = 1;
> expand_shrinker_info(1)
> new_nr_max = 2;
> need_expand(2, 1) => false => ignore expand
> shrinker_nr_max = 2;
>
> 3)Then we call shrinker:
> shrink_slab_memcg()
> for_each_set_bit(i, info->map, 1/* info->map_nr_max */ ) {
> } => ignore shrinker #1

Oh, I got it.
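
A minimal userspace model of that replay (just a sketch, with need_expand() copied from the patch and BITS_PER_LONG fixed at 64) makes the missed bit concrete:

#include <stdio.h>

#define BITS_PER_LONG   64
#define round_up(x, y)  ((((x) + (y) - 1) / (y)) * (y))

/* same check as in the patch above */
static int need_expand(int new_nr_max, int old_nr_max)
{
        return round_up(new_nr_max, BITS_PER_LONG) >
               round_up(old_nr_max, BITS_PER_LONG);
}

int main(void)
{
        int shrinker_nr_max = 0;
        int map_nr_max = 0;             /* models info->map_nr_max */

        /* add shrinker #0, then shrinker #1 */
        for (int id = 0; id <= 1; id++) {
                int new_nr_max = id + 1;

                /* models expand_one_shrinker_info() */
                if (need_expand(new_nr_max, map_nr_max))
                        map_nr_max = new_nr_max;
                shrinker_nr_max = new_nr_max;
        }

        /* shrink_slab_memcg() only walks bits [0, info->map_nr_max) */
        printf("shrinker_nr_max=%d, map_nr_max=%d -> bit #1 is never visited\n",
               shrinker_nr_max, map_nr_max);
        return 0;
}

It ends with map_nr_max == 1 while shrinker_nr_max == 2, which is exactly the skipped shrinker #1 described above.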

>
> I'd fixed this patch by something like the below:

The fix below looks good to me; I will add it in the next version. :)

Thanks,
Qi

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9f895ca6216c..bb617a3871f1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -224,12 +224,6 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> lockdep_is_held(&shrinker_rwsem));
> }
>
> -static inline bool need_expand(int new_nr_max, int old_nr_max)
> -{
> - return round_up(new_nr_max, BITS_PER_LONG) >
> - round_up(old_nr_max, BITS_PER_LONG);
> -}
> -
> static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> int map_size, int defer_size,
> int old_map_size, int old_defer_size,
> @@ -247,9 +241,6 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> if (!old)
> return 0;
>
> - if (!need_expand(new_nr_max, old->map_nr_max))
> - return 0;
> -
> new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> if (!new)
> return -ENOMEM;
> @@ -317,14 +308,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> static int expand_shrinker_info(int new_id)
> {
> int ret = 0;
> - int new_nr_max = new_id + 1;
> + int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
> int map_size, defer_size = 0;
> int old_map_size, old_defer_size = 0;
> struct mem_cgroup *memcg;
>
> - if (!need_expand(new_nr_max, shrinker_nr_max))
> - goto out;
> -
> if (!root_mem_cgroup)
> goto out;
>
> @@ -359,9 +347,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>
> rcu_read_lock();
> info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> - /* Pairs with smp mb in shrink_slab() */
> - smp_mb__before_atomic();
> - set_bit(shrinker_id, info->map);
> + if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
> + /* Pairs with smp mb in shrink_slab() */
> + smp_mb__before_atomic();
> + set_bit(shrinker_id, info->map);
> + }
> rcu_read_unlock();
> }
> }
>
> (I also added a new check into set_shrinker_bit() for safety).
>
> Kirill

--
Thanks,
Qi

2023-02-26 13:55:07

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info



On 2023/2/25 23:14, Kirill Tkhai wrote:
> Hi Qi,
>
> On 25.02.2023 11:18, Qi Zheng wrote:
>>
>>
>> On 2023/2/23 21:27, Qi Zheng wrote:
>>> To prepare for the subsequent lockless memcg slab shrink,
>>> add a map_nr_max field to struct shrinker_info to record
>>> its own real shrinker_nr_max.
>>>
>>> No functional changes.
>>>
>>> Signed-off-by: Qi Zheng <[email protected]>
>>
>> I missed a Suggested-by here. Hi Kirill, can I add it?
>>
>> Suggested-by: Kirill Tkhai <[email protected]>
>
> Yes, feel free to add this tag.
>
> There is a comment below.
>
>>> ---
>>>   include/linux/memcontrol.h |  1 +
>>>   mm/vmscan.c                | 29 ++++++++++++++++++-----------
>>>   2 files changed, 19 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index b6eda2ab205d..aa69ea98e2d8 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -97,6 +97,7 @@ struct shrinker_info {
>>>       struct rcu_head rcu;
>>>       atomic_long_t *nr_deferred;
>>>       unsigned long *map;
>>> +    int map_nr_max;
>>>   };
>>>     struct lruvec_stats_percpu {
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 9c1c5e8b24b8..9f895ca6216c 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>>>                        lockdep_is_held(&shrinker_rwsem));
>>>   }
>>>   +static inline bool need_expand(int new_nr_max, int old_nr_max)
>>> +{
>>> +    return round_up(new_nr_max, BITS_PER_LONG) >
>>> +           round_up(old_nr_max, BITS_PER_LONG);
>>> +}
>>> +
>>>   static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>                       int map_size, int defer_size,
>>> -                    int old_map_size, int old_defer_size)
>>> +                    int old_map_size, int old_defer_size,
>>> +                    int new_nr_max)
>>>   {
>>>       struct shrinker_info *new, *old;
>>>       struct mem_cgroup_per_node *pn;
>>> @@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>           if (!old)
>>>               return 0;
>>>   +        if (!need_expand(new_nr_max, old->map_nr_max))
>>> +            return 0;
>>> +
>>>           new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>>>           if (!new)
>>>               return -ENOMEM;
>>>             new->nr_deferred = (atomic_long_t *)(new + 1);
>>>           new->map = (void *)new->nr_deferred + defer_size;
>>> +        new->map_nr_max = new_nr_max;
>>>             /* map: set all old bits, clear all new bits */
>>>           memset(new->map, (int)0xff, old_map_size);
>>> @@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>>           }
>>>           info->nr_deferred = (atomic_long_t *)(info + 1);
>>>           info->map = (void *)info->nr_deferred + defer_size;
>>> +        info->map_nr_max = shrinker_nr_max;
>>>           rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>>>       }
>>>       up_write(&shrinker_rwsem);
>>> @@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>>       return ret;
>>>   }
>>>   -static inline bool need_expand(int nr_max)
>>> -{
>>> -    return round_up(nr_max, BITS_PER_LONG) >
>>> -           round_up(shrinker_nr_max, BITS_PER_LONG);
>>> -}
>>> -
>>>   static int expand_shrinker_info(int new_id)
>>>   {
>>>       int ret = 0;
>>> @@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
>>>       int old_map_size, old_defer_size = 0;
>>>       struct mem_cgroup *memcg;
>>>   -    if (!need_expand(new_nr_max))
>>> +    if (!need_expand(new_nr_max, shrinker_nr_max))
>>>           goto out;
>>>         if (!root_mem_cgroup)
>>> @@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
>>>       memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>>       do {
>>>           ret = expand_one_shrinker_info(memcg, map_size, defer_size,
>>> -                           old_map_size, old_defer_size);
>>> +                           old_map_size, old_defer_size,
>>> +                           new_nr_max);
>>>           if (ret) {
>>>               mem_cgroup_iter_break(NULL, memcg);
>>>               goto out;
>>> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>>       for_each_node(nid) {
>>>           child_info = shrinker_info_protected(memcg, nid);
>>>           parent_info = shrinker_info_protected(parent, nid);
>>> -        for (i = 0; i < shrinker_nr_max; i++) {
>>> +        for (i = 0; i < child_info->map_nr_max; i++) {
>>>               nr = atomic_long_read(&child_info->nr_deferred[i]);
>>>               atomic_long_add(nr, &parent_info->nr_deferred[i]);
>>>           }
>>> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>       if (unlikely(!info))
>>>           goto unlock;
>>>   -    for_each_set_bit(i, info->map, shrinker_nr_max) {
>>> +    for_each_set_bit(i, info->map, info->map_nr_max) {
>>>           struct shrink_control sc = {
>>>               .gfp_mask = gfp_mask,
>>>               .nid = nid,
>
> The patch as a whole won't work as expected. It will never call a shrinker with an id in [round_down(shrinker_nr_max, sizeof(unsigned long)) + 1, shrinker_nr_max - 1]
>
> Just replay the sequence we add new shrinkers:
>
> 1)We add shrinker #0:
> shrinker_nr_max = 0;
>
> prealloc_memcg_shrinker()
> id = 0;
> expand_shrinker_info(0)
> new_nr_max = 1;
> expand_one_shrinker_info(new_nr_max = 1)
> new->map_nr_max = 1;
> shrinker_nr_max = 1;
>
> 2)We add shrinker #1:
> prealloc_memcg_shrinker()
> id = 1;
> expand_shrinker_info(1)
> new_nr_max = 2;
> need_expand(2, 1) => false => ignore expand
> shrinker_nr_max = 2;
>
> 3)Then we call shrinker:
> shrink_slab_memcg()
> for_each_set_bit(i, info->map, 1/* info->map_nr_max */ ) {
> } => ignore shrinker #1
>
> I'd fixed this patch by something like the below:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9f895ca6216c..bb617a3871f1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -224,12 +224,6 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> lockdep_is_held(&shrinker_rwsem));
> }
>
> -static inline bool need_expand(int new_nr_max, int old_nr_max)
> -{
> - return round_up(new_nr_max, BITS_PER_LONG) >
> - round_up(old_nr_max, BITS_PER_LONG);
> -}
> -
> static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> int map_size, int defer_size,
> int old_map_size, int old_defer_size,
> @@ -247,9 +241,6 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> if (!old)
> return 0;
>
> - if (!need_expand(new_nr_max, old->map_nr_max))
> - return 0;
> -

Maybe we can keep this check. For example, if a previous call failed partway
through (kvmalloc_node() returning NULL), the shrinker_info of some memcgs may
already have been expanded, and those do not need to be expanded again.
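
Concretely, something like the following early return in expand_one_shrinker_info(), right after the !old check, would keep that behaviour (just a sketch of the idea, not the exact code; since new_nr_max is now always a multiple of BITS_PER_LONG, a plain comparison is enough):

                /*
                 * This shrinker_info was already expanded far enough by an
                 * earlier, partially failed attempt, so the reallocation can
                 * be skipped for this node.
                 */
                if (new_nr_max <= old->map_nr_max)
                        return 0;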

> new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> if (!new)
> return -ENOMEM;
> @@ -317,14 +308,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> static int expand_shrinker_info(int new_id)
> {
> int ret = 0;
> - int new_nr_max = new_id + 1;
> + int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
> int map_size, defer_size = 0;
> int old_map_size, old_defer_size = 0;
> struct mem_cgroup *memcg;
>
> - if (!need_expand(new_nr_max, shrinker_nr_max))
> - goto out;
> -
> if (!root_mem_cgroup)
> goto out;
>
> @@ -359,9 +347,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>
> rcu_read_lock();
> info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> - /* Pairs with smp mb in shrink_slab() */
> - smp_mb__before_atomic();
> - set_bit(shrinker_id, info->map);
> + if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
> + /* Pairs with smp mb in shrink_slab() */
> + smp_mb__before_atomic();
> + set_bit(shrinker_id, info->map);
> + }
> rcu_read_unlock();
> }
> }
>
> (I also added a new check into set_shrinker_bit() for safety).
>
> Kirill

--
Thanks,
Qi