2023-02-20 09:17:31

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 0/5] make slab shrink lockless

Hi all,

This patch series aims to make slab shrink lockless.

1. Background
=============

On our servers, we often find the following system cpu hotspots:

44.16% [kernel] [k] down_read_trylock
14.12% [kernel] [k] up_read
13.43% [kernel] [k] shrink_slab
5.25% [kernel] [k] count_shadow_nodes
3.42% [kernel] [k] idr_find

Then we used bpftrace to capture its calltrace as follows:

@[
down_read_trylock+5
shrink_slab+292
shrink_node+640
do_try_to_free_pages+211
try_to_free_mem_cgroup_pages+266
try_charge_memcg+386
charge_memcg+51
__mem_cgroup_charge+44
__handle_mm_fault+1416
handle_mm_fault+260
do_user_addr_fault+459
exc_page_fault+104
asm_exc_page_fault+38
clear_user_rep_good+18
read_zero+100
vfs_read+176
ksys_read+93
do_syscall_64+62
entry_SYSCALL_64_after_hwframe+114
]: 1868979

It is easy to see that this is caused by the frequent failure to obtain the
read lock of shrinker_rwsem when reclaiming slab memory.

Currently, the shrinker_rwsem is a global lock. And the following cases may
cause the above system cpu hotspots:

a. the write lock of shrinker_rwsem was held for too long. For example, there
are many memcgs in the system, which causes some paths to hold locks and
traverse it for too long. (e.g. expand_shrinker_info())
b. the read lock of shrinker_rwsem was held for too long, and a writer came at
this time. Then this writer will be forced to wait and block all subsequent
readers.
For example:
- be scheduled when the read lock of shrinker_rwsem is held in
do_shrink_slab()
- some shrinker are blocked for too long. Like the case mentioned in the
patchset[1].

[1]. https://lore.kernel.org/lkml/[email protected]/

And all the down_read_trylock() hotspots caused by the above cases can be
solved by replacing the shrinker_rwsem trylocks with SRCU.

2. Survey
=========

Before doing the code implementation, I found that there were many similar
submissions in the community:

a. Davidlohr Bueso submitted a patch in 2015.
Subject: [PATCH -next v2] mm: srcu-ify shrinkers
Link: https://lore.kernel.org/all/[email protected]/
Result: It was finally merged into the linux-next branch, but failed on arm
allnoconfig (without CONFIG_SRCU)

b. Tetsuo Handa submitted a patchset in 2017.
Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
Result: Finally chose to use the current simple way (break when
rwsem_is_contended()). And Christoph Hellwig suggested to using SRCU,
but SRCU was not unconditionally enabled at the time.

c. Kirill Tkhai submitted a patchset in 2018.
Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
Result: At that time, SRCU was not unconditionally enabled, and there were
some objections to enabling SRCU. Later, because Kirill's focus was
moved to other things, this patchset was not continued to be updated.

d. Sultan Alsawaf submitted a patch in 2021.
Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
Link: https://lore.kernel.org/lkml/[email protected]/
Result: Rejected because SRCU was not unconditionally enabled.

We can find that almost all these historical commits were abandoned because SRCU
was not unconditionally enabled. But now SRCU has been unconditionally enable
by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks
with SRCU.

[2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/

3. Reproduction and testing
===========================

We can reproduce the down_read_trylock() hotspot through the following script:

```
#!/bin/bash
DIR="/root/shrinker/memcg/mnt"

do_create()
{
mkdir /sys/fs/cgroup/memory/test
echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
for i in `seq 0 $1`;
do
mkdir /sys/fs/cgroup/memory/test/$i;
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
mkdir -p $DIR/$i;
done
}

do_mount()
{
for i in `seq $1 $2`;
do
mount -t tmpfs $i $DIR/$i;
done
}

do_touch()
{
for i in `seq $1 $2`;
do
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
done
}

do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Save the above script and execute it, we can get the following perf hotspots:

46.60% [kernel] [k] down_read_trylock
18.70% [kernel] [k] up_read
15.44% [kernel] [k] shrink_slab
4.37% [kernel] [k] _find_next_bit
2.75% [kernel] [k] xa_load
2.07% [kernel] [k] idr_find
1.73% [kernel] [k] do_shrink_slab
1.42% [kernel] [k] shrink_lruvec
0.74% [kernel] [k] shrink_node
0.60% [kernel] [k] list_lru_count_one

After applying this patchset, the hotspot becomes as follows:

19.53% [kernel] [k] _find_next_bit
14.63% [kernel] [k] do_shrink_slab
14.58% [kernel] [k] shrink_slab
11.83% [kernel] [k] shrink_lruvec
9.33% [kernel] [k] __blk_flush_plug
6.67% [kernel] [k] mem_cgroup_iter
3.73% [kernel] [k] list_lru_count_one
2.43% [kernel] [k] shrink_node
1.96% [kernel] [k] super_cache_count
1.78% [kernel] [k] __rcu_read_unlock
1.38% [kernel] [k] __srcu_read_lock
1.30% [kernel] [k] xas_descend

We can see that the slab reclaim is no longer blocked by shinker_rwsem trylock,
which realizes the lockless slab reclaim.

This series is based on next-20230217.

Comments and suggestions are welcome.

Thanks,
Qi.

Qi Zheng (5):
mm: vmscan: make global slab shrink lockless
mm: vmscan: make memcg slab shrink lockless
mm: shrinkers: make count and scan in shrinker debugfs lockless
mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
mm: shrinkers: convert shrinker_rwsem to mutex

drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
fs/super.c | 2 +-
mm/shrinker_debug.c | 38 ++++-------
mm/vmscan.c | 119 ++++++++++++++++-----------------
5 files changed, 76 insertions(+), 87 deletions(-)

--
2.20.1



2023-02-20 09:17:35

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 1/5] mm: vmscan: make global slab shrink lockless

The shrinker_rwsem is a global lock in shrinkers subsystem,
it is easy to cause blocking in the following cases:

a. the write lock of shrinker_rwsem was held for too long.
For example, there are many memcgs in the system, which
causes some paths to hold locks and traverse it for too
long. (e.g. expand_shrinker_info())
b. the read lock of shrinker_rwsem was held for too long,
and a writer came at this time. Then this writer will be
forced to wait and block all subsequent readers.
For example:
- be scheduled when the read lock of shrinker_rwsem is
held in do_shrink_slab()
- some shrinker are blocked for too long. Like the case
mentioned in the patchset[1].

Therefore, many times in history ([2],[3],[4],[5]), some
people wanted to replace shrinker_rwsem reader with SRCU,
but they all gave up because SRCU was not unconditionally
enabled.

But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
the SRCU is unconditionally enabled. So it's time to use
SRCU to protect readers who previously held shrinker_rwsem.

[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/all/[email protected]/
[3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
[4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
[5]. https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..95a3d6ddc6c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,

LIST_HEAD(shrinker_list);
DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_SRCU(shrinker_srcu);

#ifdef CONFIG_MEMCG
static int shrinker_nr_max;
@@ -699,7 +700,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
- list_add_tail(&shrinker->list, &shrinker_list);
+ list_add_tail_rcu(&shrinker->list, &shrinker_list);
shrinker->flags |= SHRINKER_REGISTERED;
shrinker_debugfs_add(shrinker);
up_write(&shrinker_rwsem);
@@ -753,13 +754,15 @@ void unregister_shrinker(struct shrinker *shrinker)
return;

down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
+ list_del_rcu(&shrinker->list);
shrinker->flags &= ~SHRINKER_REGISTERED;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
debugfs_entry = shrinker_debugfs_remove(shrinker);
up_write(&shrinker_rwsem);

+ synchronize_srcu(&shrinker_srcu);
+
debugfs_remove_recursive(debugfs_entry);

kfree(shrinker->nr_deferred);
@@ -779,6 +782,7 @@ void synchronize_shrinkers(void)
{
down_write(&shrinker_rwsem);
up_write(&shrinker_rwsem);
+ synchronize_srcu(&shrinker_srcu);
}
EXPORT_SYMBOL(synchronize_shrinkers);

@@ -989,6 +993,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
{
unsigned long ret, freed = 0;
struct shrinker *shrinker;
+ int srcu_idx;

/*
* The root memcg might be allocated even though memcg is disabled
@@ -1000,10 +1005,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

- if (!down_read_trylock(&shrinker_rwsem))
- goto out;
+ srcu_idx = srcu_read_lock(&shrinker_srcu);

- list_for_each_entry(shrinker, &shrinker_list, list) {
+ list_for_each_entry_srcu(shrinker, &shrinker_list, list,
+ srcu_read_lock_held(&shrinker_srcu)) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -1014,19 +1019,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
- /*
- * Bail out if someone want to register a new shrinker to
- * prevent the registration from being stalled for long periods
- * by parallel ongoing shrinking.
- */
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}

- up_read(&shrinker_rwsem);
-out:
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
cond_resched();
return freed;
}
--
2.20.1


2023-02-20 09:17:40

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 5/5] mm: shrinkers: convert shrinker_rwsem to mutex

Now there are no readers of shrinker_rwsem, so we
can simply replace it with mutex lock.

Signed-off-by: Qi Zheng <[email protected]>
---
drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
fs/super.c | 2 +-
mm/shrinker_debug.c | 14 +++++++-------
mm/vmscan.c | 30 +++++++++++++++---------------
5 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/md/dm-cache-metadata.c b/drivers/md/dm-cache-metadata.c
index acffed750e3e..9e0c69958587 100644
--- a/drivers/md/dm-cache-metadata.c
+++ b/drivers/md/dm-cache-metadata.c
@@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs cmd root_lock).
- * - must take shrinker_rwsem without holding cmd->root_lock
+ * - must take shrinker_mutex without holding cmd->root_lock
*/
new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
CACHE_MAX_CONCURRENT_LOCKS);
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index fd464fb024c3..9f5cb52c5763 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs pmd root_lock).
- * - must take shrinker_rwsem without holding pmd->root_lock
+ * - must take shrinker_mutex without holding pmd->root_lock
*/
new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
THIN_MAX_CONCURRENT_LOCKS);
diff --git a/fs/super.c b/fs/super.c
index 84332d5cb817..91a4037b1d95 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -54,7 +54,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
* One thing we have to be careful of with a per-sb shrinker is that we don't
* drop the last active reference to the superblock from within the shrinker.
* If that happens we could trigger unregistering the shrinker from within the
- * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+ * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
* take a passive reference to the superblock to avoid this from occurring.
*/
static unsigned long super_cache_scan(struct shrinker *shrink,
diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 6a26e8ac40aa..4c3a5d9afb5c 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -7,7 +7,7 @@
#include <linux/memcontrol.h>

/* defined in vmscan.c */
-extern struct rw_semaphore shrinker_rwsem;
+extern struct mutex shrinker_mutex;
extern struct list_head shrinker_list;
extern struct srcu_struct shrinker_srcu;

@@ -167,7 +167,7 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
char buf[128];
int id;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

/* debugfs isn't initialized yet, add debugfs entries later. */
if (!shrinker_debugfs_root)
@@ -210,7 +210,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
if (!new)
return -ENOMEM;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);

old = shrinker->name;
shrinker->name = new;
@@ -228,7 +228,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
shrinker->debugfs_entry = entry;
}

- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

kfree_const(old);

@@ -240,7 +240,7 @@ struct dentry *shrinker_debugfs_remove(struct shrinker *shrinker)
{
struct dentry *entry = shrinker->debugfs_entry;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

kfree_const(shrinker->name);
shrinker->name = NULL;
@@ -265,14 +265,14 @@ static int __init shrinker_debugfs_init(void)
shrinker_debugfs_root = dentry;

/* Create debugfs entries for shrinkers registered at boot */
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_for_each_entry(shrinker, &shrinker_list, list)
if (!shrinker->debugfs_entry) {
ret = shrinker_debugfs_add(shrinker);
if (ret)
break;
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

return ret;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8da56beba96..f94bfe540675 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,7 @@
#include <linux/cpuset.h>
#include <linux/compaction.h>
#include <linux/notifier.h>
-#include <linux/rwsem.h>
+#include <linux/mutex.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -202,7 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
}

LIST_HEAD(shrinker_list);
-DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_MUTEX(shrinker_mutex);
DEFINE_SRCU(shrinker_srcu);

#ifdef CONFIG_MEMCG
@@ -224,7 +224,7 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
{
return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
&shrinker_srcu,
- lockdep_is_held(&shrinker_rwsem));
+ lockdep_is_held(&shrinker_mutex));
}

static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
@@ -297,7 +297,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
int nid, size, ret = 0;
int map_size, defer_size = 0;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
map_size = shrinker_map_size(shrinker_nr_max);
defer_size = shrinker_defer_size(shrinker_nr_max);
size = map_size + defer_size;
@@ -312,7 +312,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
info->map = (void *)info->nr_deferred + defer_size;
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

return ret;
}
@@ -337,7 +337,7 @@ static int expand_shrinker_info(int new_id)
if (!root_mem_cgroup)
goto out;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

map_size = shrinker_map_size(new_nr_max);
defer_size = shrinker_defer_size(new_nr_max);
@@ -384,7 +384,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
if (mem_cgroup_disabled())
return -ENOSYS;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -398,7 +398,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
shrinker->id = id;
ret = 0;
unlock:
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return ret;
}

@@ -408,7 +408,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)

BUG_ON(id < 0);

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

idr_remove(&shrinker_idr, id);
}
@@ -701,9 +701,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
shrinker->name = NULL;
#endif
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
unregister_memcg_shrinker(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return;
}

@@ -713,11 +713,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)

void register_shrinker_prepared(struct shrinker *shrinker)
{
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_add_tail_rcu(&shrinker->list, &shrinker_list);
shrinker->flags |= SHRINKER_REGISTERED;
shrinker_debugfs_add(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}

static int __register_shrinker(struct shrinker *shrinker)
@@ -767,13 +767,13 @@ void unregister_shrinker(struct shrinker *shrinker)
if (!(shrinker->flags & SHRINKER_REGISTERED))
return;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_del_rcu(&shrinker->list);
shrinker->flags &= ~SHRINKER_REGISTERED;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
debugfs_entry = shrinker_debugfs_remove(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

synchronize_srcu(&shrinker_srcu);

--
2.20.1


2023-02-20 09:17:49

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless

Like global and memcg slab shrink, also use SRCU to
make count and scan operations in memory shrinker
debugfs lockless.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/shrinker_debug.c | 24 +++++++-----------------
1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 39c3491e28a3..6a26e8ac40aa 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -9,6 +9,7 @@
/* defined in vmscan.c */
extern struct rw_semaphore shrinker_rwsem;
extern struct list_head shrinker_list;
+extern struct srcu_struct shrinker_srcu;

static DEFINE_IDA(shrinker_debugfs_ida);
static struct dentry *shrinker_debugfs_root;
@@ -49,18 +50,13 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
struct mem_cgroup *memcg;
unsigned long total;
bool memcg_aware;
- int ret, nid;
+ int ret, nid, srcu_idx;

count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
if (!count_per_node)
return -ENOMEM;

- ret = down_read_killable(&shrinker_rwsem);
- if (ret) {
- kfree(count_per_node);
- return ret;
- }
- rcu_read_lock();
+ srcu_idx = srcu_read_lock(&shrinker_srcu);

memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;

@@ -91,8 +87,7 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
}
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);

- rcu_read_unlock();
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);

kfree(count_per_node);
return ret;
@@ -115,9 +110,8 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
.gfp_mask = GFP_KERNEL,
};
struct mem_cgroup *memcg = NULL;
- int nid;
+ int nid, srcu_idx;
char kbuf[72];
- ssize_t ret;

read_len = size < (sizeof(kbuf) - 1) ? size : (sizeof(kbuf) - 1);
if (copy_from_user(kbuf, buf, read_len))
@@ -146,11 +140,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
return -EINVAL;
}

- ret = down_read_killable(&shrinker_rwsem);
- if (ret) {
- mem_cgroup_put(memcg);
- return ret;
- }
+ srcu_idx = srcu_read_lock(&shrinker_srcu);

sc.nid = nid;
sc.memcg = memcg;
@@ -159,7 +149,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,

shrinker->scan_objects(shrinker, &sc);

- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
mem_cgroup_put(memcg);

return size;
--
2.20.1


2023-02-20 09:17:54

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 4/5] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()

Now there are no readers of shrinker_rwsem, so
synchronize_shrinkers() does not need to hold the
writer of shrinker_rwsem to wait for all running
shinkers to complete, synchronize_srcu() is enough.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc47396ecd0e..b8da56beba96 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -787,15 +787,11 @@ EXPORT_SYMBOL(unregister_shrinker);
/**
* synchronize_shrinkers - Wait for all running shrinkers to complete.
*
- * This is equivalent to calling unregister_shrink() and register_shrinker(),
- * but atomically and with less overhead. This is useful to guarantee that all
- * shrinker invocations have seen an update, before freeing memory, similar to
- * rcu.
+ * This is useful to guarantee that all shrinker invocations have seen an
+ * update, before freeing memory.
*/
void synchronize_shrinkers(void)
{
- down_write(&shrinker_rwsem);
- up_write(&shrinker_rwsem);
synchronize_srcu(&shrinker_srcu);
}
EXPORT_SYMBOL(synchronize_shrinkers);
--
2.20.1


2023-02-20 13:24:05

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless

Hi Qi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20230220]
[cannot apply to device-mapper-dm/for-next linus/master v6.2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-vmscan-make-global-slab-shrink-lockless/20230220-171954
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230220091637.64865-4-zhengqi.arch%40bytedance.com
patch subject: [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless
config: riscv-randconfig-r036-20230219 (https://download.01.org/0day-ci/archive/20230220/[email protected]/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project db89896bbbd2251fff457699635acbbedeead27f)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install riscv cross compiling tool for clang build
# apt-get install binutils-riscv64-linux-gnu
# https://github.com/intel-lab-lkp/linux/commit/b5b7259339a7a5cfae5a120356750c5a9e151d12
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Qi-Zheng/mm-vmscan-make-global-slab-shrink-lockless/20230220-171954
git checkout b5b7259339a7a5cfae5a120356750c5a9e151d12
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> mm/shrinker_debug.c:88:11: warning: variable 'ret' is used uninitialized whenever 'do' loop exits because its condition is false [-Wsometimes-uninitialized]
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/shrinker_debug.c:93:9: note: uninitialized use occurs here
return ret;
^~~
mm/shrinker_debug.c:88:11: note: remove the condition if it is always true
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1
>> mm/shrinker_debug.c:78:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
if (!memcg_aware) {
^~~~~~~~~~~~
mm/shrinker_debug.c:93:9: note: uninitialized use occurs here
return ret;
^~~
mm/shrinker_debug.c:78:3: note: remove the 'if' if its condition is always false
if (!memcg_aware) {
^~~~~~~~~~~~~~~~~~~
mm/shrinker_debug.c:53:9: note: initialize the variable 'ret' to silence this warning
int ret, nid, srcu_idx;
^
= 0
2 warnings generated.


vim +88 mm/shrinker_debug.c

5035ebc644aec9 Roman Gushchin 2022-05-31 45
5035ebc644aec9 Roman Gushchin 2022-05-31 46 static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
5035ebc644aec9 Roman Gushchin 2022-05-31 47 {
5035ebc644aec9 Roman Gushchin 2022-05-31 48 struct shrinker *shrinker = m->private;
5035ebc644aec9 Roman Gushchin 2022-05-31 49 unsigned long *count_per_node;
5035ebc644aec9 Roman Gushchin 2022-05-31 50 struct mem_cgroup *memcg;
5035ebc644aec9 Roman Gushchin 2022-05-31 51 unsigned long total;
5035ebc644aec9 Roman Gushchin 2022-05-31 52 bool memcg_aware;
b5b7259339a7a5 Qi Zheng 2023-02-20 53 int ret, nid, srcu_idx;
5035ebc644aec9 Roman Gushchin 2022-05-31 54
5035ebc644aec9 Roman Gushchin 2022-05-31 55 count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
5035ebc644aec9 Roman Gushchin 2022-05-31 56 if (!count_per_node)
5035ebc644aec9 Roman Gushchin 2022-05-31 57 return -ENOMEM;
5035ebc644aec9 Roman Gushchin 2022-05-31 58
b5b7259339a7a5 Qi Zheng 2023-02-20 59 srcu_idx = srcu_read_lock(&shrinker_srcu);
5035ebc644aec9 Roman Gushchin 2022-05-31 60
5035ebc644aec9 Roman Gushchin 2022-05-31 61 memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;
5035ebc644aec9 Roman Gushchin 2022-05-31 62
5035ebc644aec9 Roman Gushchin 2022-05-31 63 memcg = mem_cgroup_iter(NULL, NULL, NULL);
5035ebc644aec9 Roman Gushchin 2022-05-31 64 do {
5035ebc644aec9 Roman Gushchin 2022-05-31 65 if (memcg && !mem_cgroup_online(memcg))
5035ebc644aec9 Roman Gushchin 2022-05-31 66 continue;
5035ebc644aec9 Roman Gushchin 2022-05-31 67
5035ebc644aec9 Roman Gushchin 2022-05-31 68 total = shrinker_count_objects(shrinker,
5035ebc644aec9 Roman Gushchin 2022-05-31 69 memcg_aware ? memcg : NULL,
5035ebc644aec9 Roman Gushchin 2022-05-31 70 count_per_node);
5035ebc644aec9 Roman Gushchin 2022-05-31 71 if (total) {
5035ebc644aec9 Roman Gushchin 2022-05-31 72 seq_printf(m, "%lu", mem_cgroup_ino(memcg));
5035ebc644aec9 Roman Gushchin 2022-05-31 73 for_each_node(nid)
5035ebc644aec9 Roman Gushchin 2022-05-31 74 seq_printf(m, " %lu", count_per_node[nid]);
5035ebc644aec9 Roman Gushchin 2022-05-31 75 seq_putc(m, '\n');
5035ebc644aec9 Roman Gushchin 2022-05-31 76 }
5035ebc644aec9 Roman Gushchin 2022-05-31 77
5035ebc644aec9 Roman Gushchin 2022-05-31 @78 if (!memcg_aware) {
5035ebc644aec9 Roman Gushchin 2022-05-31 79 mem_cgroup_iter_break(NULL, memcg);
5035ebc644aec9 Roman Gushchin 2022-05-31 80 break;
5035ebc644aec9 Roman Gushchin 2022-05-31 81 }
5035ebc644aec9 Roman Gushchin 2022-05-31 82
5035ebc644aec9 Roman Gushchin 2022-05-31 83 if (signal_pending(current)) {
5035ebc644aec9 Roman Gushchin 2022-05-31 84 mem_cgroup_iter_break(NULL, memcg);
5035ebc644aec9 Roman Gushchin 2022-05-31 85 ret = -EINTR;
5035ebc644aec9 Roman Gushchin 2022-05-31 86 break;
5035ebc644aec9 Roman Gushchin 2022-05-31 87 }
5035ebc644aec9 Roman Gushchin 2022-05-31 @88 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
5035ebc644aec9 Roman Gushchin 2022-05-31 89
b5b7259339a7a5 Qi Zheng 2023-02-20 90 srcu_read_unlock(&shrinker_srcu, srcu_idx);
5035ebc644aec9 Roman Gushchin 2022-05-31 91
5035ebc644aec9 Roman Gushchin 2022-05-31 92 kfree(count_per_node);
5035ebc644aec9 Roman Gushchin 2022-05-31 @93 return ret;
5035ebc644aec9 Roman Gushchin 2022-05-31 94 }
5035ebc644aec9 Roman Gushchin 2022-05-31 95 DEFINE_SHOW_ATTRIBUTE(shrinker_debugfs_count);
5035ebc644aec9 Roman Gushchin 2022-05-31 96

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests