2023-02-26 14:53:24

by Qi Zheng

Subject: [PATCH v3 0/8] make slab shrink lockless

Hi all,

This patch series aims to make slab shrink lockless.

1. Background
=============

On our servers, we often see the following system CPU hotspots:

44.16% [kernel] [k] down_read_trylock
14.12% [kernel] [k] up_read
13.43% [kernel] [k] shrink_slab
5.25% [kernel] [k] count_shadow_nodes
3.42% [kernel] [k] idr_find

Then we used bpftrace to capture the calltrace as follows:

@[
down_read_trylock+5
shrink_slab+292
shrink_node+640
do_try_to_free_pages+211
try_to_free_mem_cgroup_pages+266
try_charge_memcg+386
charge_memcg+51
__mem_cgroup_charge+44
__handle_mm_fault+1416
handle_mm_fault+260
do_user_addr_fault+459
exc_page_fault+104
asm_exc_page_fault+38
clear_user_rep_good+18
read_zero+100
vfs_read+176
ksys_read+93
do_syscall_64+62
entry_SYSCALL_64_after_hwframe+114
]: 1868979
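
The exact bpftrace invocation is not part of the original report; a one-liner
along these lines (an assumption on our part) produces output in the format
shown above:

```
# Assumed example: count kernel stacks that reach down_read_trylock().
bpftrace -e 'kprobe:down_read_trylock { @[kstack] = count(); }'
```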

It is easy to see that this is caused by the frequent failure to obtain the
read lock of shrinker_rwsem when reclaiming slab memory.

Currently, shrinker_rwsem is a global lock, and the following cases can
produce the above system CPU hotspots:

a. The write lock of shrinker_rwsem is held for too long. For example, when
   there are many memcgs in the system, some paths hold the lock while
   traversing all of them (e.g. expand_shrinker_info()).
b. The read lock of shrinker_rwsem is held for too long and a writer arrives
   in the meantime. The writer is then forced to wait and blocks all
   subsequent readers.
   For example:
   - the holder is scheduled out while the read lock of shrinker_rwsem is
     held in do_shrink_slab()
   - some shrinkers are blocked for too long, like the case mentioned in
     patchset [1].

[1]. https://lore.kernel.org/lkml/[email protected]/

All the down_read_trylock() hotspots caused by the above cases can be
eliminated by replacing the shrinker_rwsem trylocks with SRCU.

2. Survey
=========

Before starting the implementation, I surveyed the community and found
several similar submissions:

a. Davidlohr Bueso submitted a patch in 2015.
Subject: [PATCH -next v2] mm: srcu-ify shrinkers
Link: https://lore.kernel.org/all/[email protected]/
Result: It was finally merged into the linux-next branch, but failed on arm
allnoconfig (without CONFIG_SRCU)

b. Tetsuo Handa submitted a patchset in 2017.
Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
Result: He finally chose the current simple approach (break out when
rwsem_is_contended()). Christoph Hellwig suggested using SRCU, but
SRCU was not unconditionally enabled at the time.

c. Kirill Tkhai submitted a patchset in 2018.
Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
Result: At that time, SRCU was not unconditionally enabled, and there were
some objections to enabling it. Later, Kirill's focus moved to other
things and the patchset was no longer updated.

d. Sultan Alsawaf submitted a patch in 2021.
Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
Link: https://lore.kernel.org/lkml/[email protected]/
Result: Rejected because SRCU was not unconditionally enabled.

We can see that almost all of these historical submissions were abandoned
because SRCU was not unconditionally enabled. But SRCU was unconditionally
enabled by Paul E. McKenney in 2023 [2], so it's time to replace the
shrinker_rwsem trylocks with SRCU.

[2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
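
To make the intended change concrete, below is a rough reader-side sketch
(a hypothetical helper with simplified arguments, not code from this series;
the real changes are in patches 2/8 and 3/8):

```
/*
 * Sketch only: the reader takes an SRCU read-side critical section
 * instead of down_read_trylock(&shrinker_rwsem), so it never fails or
 * blocks; unregister_shrinker() waits for it via synchronize_srcu().
 */
static unsigned long walk_shrinkers_sketch(struct shrink_control *sc,
                                           int priority)
{
        struct shrinker *shrinker;
        unsigned long freed = 0;
        int srcu_idx;

        srcu_idx = srcu_read_lock(&shrinker_srcu);
        list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                                 srcu_read_lock_held(&shrinker_srcu))
                freed += do_shrink_slab(sc, shrinker, priority);
        srcu_read_unlock(&shrinker_srcu, srcu_idx);

        return freed;
}
```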

3. Reproduction and testing
===========================

We can reproduce the down_read_trylock() hotspot through the following script:

```
#!/bin/bash
DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir /sys/fs/cgroup/memory/test
    echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Save and execute the above script, and we get the following perf hotspots:

46.60% [kernel] [k] down_read_trylock
18.70% [kernel] [k] up_read
15.44% [kernel] [k] shrink_slab
4.37% [kernel] [k] _find_next_bit
2.75% [kernel] [k] xa_load
2.07% [kernel] [k] idr_find
1.73% [kernel] [k] do_shrink_slab
1.42% [kernel] [k] shrink_lruvec
0.74% [kernel] [k] shrink_node
0.60% [kernel] [k] list_lru_count_one

After applying this patchset, the hotspot becomes as follows:

19.53% [kernel] [k] _find_next_bit
14.63% [kernel] [k] do_shrink_slab
14.58% [kernel] [k] shrink_slab
11.83% [kernel] [k] shrink_lruvec
9.33% [kernel] [k] __blk_flush_plug
6.67% [kernel] [k] mem_cgroup_iter
3.73% [kernel] [k] list_lru_count_one
2.43% [kernel] [k] shrink_node
1.96% [kernel] [k] super_cache_count
1.78% [kernel] [k] __rcu_read_unlock
1.38% [kernel] [k] __srcu_read_lock
1.30% [kernel] [k] xas_descend

We can see that slab reclaim is no longer blocked by the shrinker_rwsem
trylock, making slab reclaim lockless.
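
For reference, the hotspots above were sampled with perf; an invocation along
the following lines (the same command appears later in this thread) is enough:

```
# Sample kernel-side hotspots while the reproduction script is running.
perf top -U -F 999
```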

This series is based on next-20230217.

Comments and suggestions are welcome.

Thanks,
Qi.

Changelog in v2 -> v3:
- fix bug in [PATCH v2 1/7] (per Kirill)
- add Kirill's patch which restores a check similar to the rwsem_is_contended()
check by adding shrinker_srcu_generation

Changelog in v1 -> v2:
- add a map_nr_max field to shrinker_info (suggested by Kirill)
- use shrinker_mutex in reparent_shrinker_deferred() (pointed out by Kirill)

Kirill Tkhai (1):
mm: vmscan: add shrinker_srcu_generation

Qi Zheng (7):
mm: vmscan: add a map_nr_max field to shrinker_info
mm: vmscan: make global slab shrink lockless
mm: vmscan: make memcg slab shrink lockless
mm: shrinkers: make count and scan in shrinker debugfs lockless
mm: vmscan: hold write lock to reparent shrinker nr_deferred
mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
mm: shrinkers: convert shrinker_rwsem to mutex

drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
fs/super.c | 2 +-
include/linux/memcontrol.h | 1 +
mm/shrinker_debug.c | 38 +++-----
mm/vmscan.c | 166 +++++++++++++++++++--------------
6 files changed, 112 insertions(+), 99 deletions(-)

--
2.20.1



2023-02-26 14:53:54

by Qi Zheng

Subject: [PATCH v3 2/8] mm: vmscan: make global slab shrink lockless

The shrinker_rwsem is a global lock in the shrinker
subsystem, and it can easily cause blocking in the
following cases:

a. the write lock of shrinker_rwsem is held for too long.
   For example, when there are many memcgs in the system,
   some paths hold the lock while traversing all of them
   (e.g. expand_shrinker_info()).
b. the read lock of shrinker_rwsem is held for too long
   and a writer arrives in the meantime. The writer is
   then forced to wait and blocks all subsequent readers.
   For example:
   - the holder is scheduled out while the read lock of
     shrinker_rwsem is held in do_shrink_slab()
   - some shrinkers are blocked for too long, like the
     case mentioned in patchset [1].

Therefore, many times in history ([2],[3],[4],[5]), people
have wanted to replace the shrinker_rwsem readers with SRCU,
but they all gave up because SRCU was not unconditionally
enabled.

But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
SRCU is unconditionally enabled. So it's time to use SRCU to
protect the readers that previously held shrinker_rwsem.

[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/all/[email protected]/
[3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
[4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
[5]. https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 546c07ccb3bc..2a21a84d3db1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,

LIST_HEAD(shrinker_list);
DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_SRCU(shrinker_srcu);

#ifdef CONFIG_MEMCG
static int shrinker_nr_max;
@@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
- list_add_tail(&shrinker->list, &shrinker_list);
+ list_add_tail_rcu(&shrinker->list, &shrinker_list);
shrinker->flags |= SHRINKER_REGISTERED;
shrinker_debugfs_add(shrinker);
up_write(&shrinker_rwsem);
@@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
return;

down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
+ list_del_rcu(&shrinker->list);
shrinker->flags &= ~SHRINKER_REGISTERED;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
debugfs_entry = shrinker_debugfs_remove(shrinker);
up_write(&shrinker_rwsem);

+ synchronize_srcu(&shrinker_srcu);
+
debugfs_remove_recursive(debugfs_entry);

kfree(shrinker->nr_deferred);
@@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
{
down_write(&shrinker_rwsem);
up_write(&shrinker_rwsem);
+ synchronize_srcu(&shrinker_srcu);
}
EXPORT_SYMBOL(synchronize_shrinkers);

@@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
{
unsigned long ret, freed = 0;
struct shrinker *shrinker;
+ int srcu_idx;

/*
* The root memcg might be allocated even though memcg is disabled
@@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

- if (!down_read_trylock(&shrinker_rwsem))
- goto out;
+ srcu_idx = srcu_read_lock(&shrinker_srcu);

- list_for_each_entry(shrinker, &shrinker_list, list) {
+ list_for_each_entry_srcu(shrinker, &shrinker_list, list,
+ srcu_read_lock_held(&shrinker_srcu)) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
- /*
- * Bail out if someone want to register a new shrinker to
- * prevent the registration from being stalled for long periods
- * by parallel ongoing shrinking.
- */
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}

- up_read(&shrinker_rwsem);
-out:
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
cond_resched();
return freed;
}
--
2.20.1


2023-02-26 14:53:55

by Qi Zheng

Subject: [PATCH v3 1/8] mm: vmscan: add a map_nr_max field to shrinker_info

To prepare for the subsequent lockless memcg slab shrink,
add a map_nr_max field to struct shrinker_info to record
its own real shrinker_nr_max.

Suggested-by: Kirill Tkhai <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/memcontrol.h | 1 +
mm/vmscan.c | 41 ++++++++++++++++++++++----------------
2 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..aa69ea98e2d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,6 +97,7 @@ struct shrinker_info {
struct rcu_head rcu;
atomic_long_t *nr_deferred;
unsigned long *map;
+ int map_nr_max;
};

struct lruvec_stats_percpu {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..546c07ccb3bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
lockdep_is_held(&shrinker_rwsem));
}

+static inline bool need_expand(int new_nr_max, int old_nr_max)
+{
+ return round_up(new_nr_max, BITS_PER_LONG) >
+ round_up(old_nr_max, BITS_PER_LONG);
+}
+
static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int map_size, int defer_size,
- int old_map_size, int old_defer_size)
+ int old_map_size, int old_defer_size,
+ int new_nr_max)
{
struct shrinker_info *new, *old;
struct mem_cgroup_per_node *pn;
@@ -240,12 +247,17 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
if (!old)
return 0;

+ /* Already expanded this shrinker_info */
+ if (!need_expand(new_nr_max, old->map_nr_max))
+ return 0;
+
new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
if (!new)
return -ENOMEM;

new->nr_deferred = (atomic_long_t *)(new + 1);
new->map = (void *)new->nr_deferred + defer_size;
+ new->map_nr_max = new_nr_max;

/* map: set all old bits, clear all new bits */
memset(new->map, (int)0xff, old_map_size);
@@ -295,6 +307,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
}
info->nr_deferred = (atomic_long_t *)(info + 1);
info->map = (void *)info->nr_deferred + defer_size;
+ info->map_nr_max = shrinker_nr_max;
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
up_write(&shrinker_rwsem);
@@ -302,23 +315,14 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
return ret;
}

-static inline bool need_expand(int nr_max)
-{
- return round_up(nr_max, BITS_PER_LONG) >
- round_up(shrinker_nr_max, BITS_PER_LONG);
-}
-
static int expand_shrinker_info(int new_id)
{
int ret = 0;
- int new_nr_max = new_id + 1;
+ int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
int map_size, defer_size = 0;
int old_map_size, old_defer_size = 0;
struct mem_cgroup *memcg;

- if (!need_expand(new_nr_max))
- goto out;
-
if (!root_mem_cgroup)
goto out;

@@ -332,7 +336,8 @@ static int expand_shrinker_info(int new_id)
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
ret = expand_one_shrinker_info(memcg, map_size, defer_size,
- old_map_size, old_defer_size);
+ old_map_size, old_defer_size,
+ new_nr_max);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
goto out;
@@ -352,9 +357,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)

rcu_read_lock();
info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
- /* Pairs with smp mb in shrink_slab() */
- smp_mb__before_atomic();
- set_bit(shrinker_id, info->map);
+ if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
+ /* Pairs with smp mb in shrink_slab() */
+ smp_mb__before_atomic();
+ set_bit(shrinker_id, info->map);
+ }
rcu_read_unlock();
}
}
@@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
for_each_node(nid) {
child_info = shrinker_info_protected(memcg, nid);
parent_info = shrinker_info_protected(parent, nid);
- for (i = 0; i < shrinker_nr_max; i++) {
+ for (i = 0; i < child_info->map_nr_max; i++) {
nr = atomic_long_read(&child_info->nr_deferred[i]);
atomic_long_add(nr, &parent_info->nr_deferred[i]);
}
@@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
if (unlikely(!info))
goto unlock;

- for_each_set_bit(i, info->map, shrinker_nr_max) {
+ for_each_set_bit(i, info->map, info->map_nr_max) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
--
2.20.1


2023-02-26 14:54:13

by Qi Zheng

Subject: [PATCH v3 3/8] mm: vmscan: make memcg slab shrink lockless

Like the global slab shrink, since commit 1cd0bd06093c
("rcu: Remove CONFIG_SRCU") made SRCU unconditionally
enabled, it's time to use SRCU to protect the readers
that previously held shrinker_rwsem.

We can test with the following script:

```
DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir /sys/fs/cgroup/memory/test
    echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Before applying:

46.60% [kernel] [k] down_read_trylock
18.70% [kernel] [k] up_read
15.44% [kernel] [k] shrink_slab
4.37% [kernel] [k] _find_next_bit
2.75% [kernel] [k] xa_load
2.07% [kernel] [k] idr_find
1.73% [kernel] [k] do_shrink_slab
1.42% [kernel] [k] shrink_lruvec
0.74% [kernel] [k] shrink_node
0.60% [kernel] [k] list_lru_count_one

After applying:

19.53% [kernel] [k] _find_next_bit
14.63% [kernel] [k] do_shrink_slab
14.58% [kernel] [k] shrink_slab
11.83% [kernel] [k] shrink_lruvec
9.33% [kernel] [k] __blk_flush_plug
6.67% [kernel] [k] mem_cgroup_iter
3.73% [kernel] [k] list_lru_count_one
2.43% [kernel] [k] shrink_node
1.96% [kernel] [k] super_cache_count
1.78% [kernel] [k] __rcu_read_unlock
1.38% [kernel] [k] __srcu_read_lock
1.30% [kernel] [k] xas_descend

We can see that the readers are no longer blocked.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 46 +++++++++++++++++++++++++++-------------------
1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2a21a84d3db1..490764f8e085 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/srcu.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -221,8 +222,21 @@ static inline int shrinker_defer_size(int nr_items)
static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
int nid)
{
- return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
- lockdep_is_held(&shrinker_rwsem));
+ return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
+ &shrinker_srcu,
+ lockdep_is_held(&shrinker_rwsem));
+}
+
+static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
+ int nid)
+{
+ return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
+ &shrinker_srcu);
+}
+
+static void free_shrinker_info_rcu(struct rcu_head *head)
+{
+ kvfree(container_of(head, struct shrinker_info, rcu));
}

static inline bool need_expand(int new_nr_max, int old_nr_max)
@@ -269,7 +283,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
defer_size - old_defer_size);

rcu_assign_pointer(pn->shrinker_info, new);
- kvfree_rcu(old, rcu);
+ call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
}

return 0;
@@ -355,15 +369,16 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
{
if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
struct shrinker_info *info;
+ int srcu_idx;

- rcu_read_lock();
- info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
+ info = shrinker_info_srcu(memcg, nid);
if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
/* Pairs with smp mb in shrink_slab() */
smp_mb__before_atomic();
set_bit(shrinker_id, info->map);
}
- rcu_read_unlock();
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
}
}

@@ -377,7 +392,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
return -ENOSYS;

down_write(&shrinker_rwsem);
- /* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -411,7 +425,7 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
{
struct shrinker_info *info;

- info = shrinker_info_protected(memcg, nid);
+ info = shrinker_info_srcu(memcg, nid);
return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
}

@@ -420,7 +434,7 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
{
struct shrinker_info *info;

- info = shrinker_info_protected(memcg, nid);
+ info = shrinker_info_srcu(memcg, nid);
return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
}

@@ -898,15 +912,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
{
struct shrinker_info *info;
unsigned long ret, freed = 0;
+ int srcu_idx;
int i;

if (!mem_cgroup_online(memcg))
return 0;

- if (!down_read_trylock(&shrinker_rwsem))
- return 0;
-
- info = shrinker_info_protected(memcg, nid);
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
+ info = shrinker_info_srcu(memcg, nid);
if (unlikely(!info))
goto unlock;

@@ -956,14 +969,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
set_shrinker_bit(memcg, nid, i);
}
freed += ret;
-
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
unlock:
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
return freed;
}
#else /* CONFIG_MEMCG */
--
2.20.1


2023-02-26 14:54:30

by Qi Zheng

Subject: [PATCH v3 4/8] mm: vmscan: add shrinker_srcu_generation

From: Kirill Tkhai <[email protected]>

After we make slab shrink lockless with SRCU, the longest
sleep in unregister_shrinker() will be waiting for all
in-flight do_shrink_slab() calls to complete.

To avoid a long unbreakable wait in unregister_shrinker(),
add shrinker_srcu_generation to restore a check similar to
the rwsem_is_contended() check that we had before.

And for memcg slab shrink, we unlock SRCU and continue the
iteration from the next shrinker id.

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 490764f8e085..99e852c0ab9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
LIST_HEAD(shrinker_list);
DECLARE_RWSEM(shrinker_rwsem);
DEFINE_SRCU(shrinker_srcu);
+static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);

#ifdef CONFIG_MEMCG
static int shrinker_nr_max;
@@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
debugfs_entry = shrinker_debugfs_remove(shrinker);
up_write(&shrinker_rwsem);

+ atomic_inc(&shrinker_srcu_generation);
synchronize_srcu(&shrinker_srcu);

debugfs_remove_recursive(debugfs_entry);
@@ -803,6 +805,7 @@ void synchronize_shrinkers(void)
{
down_write(&shrinker_rwsem);
up_write(&shrinker_rwsem);
+ atomic_inc(&shrinker_srcu_generation);
synchronize_srcu(&shrinker_srcu);
}
EXPORT_SYMBOL(synchronize_shrinkers);
@@ -912,18 +915,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
{
struct shrinker_info *info;
unsigned long ret, freed = 0;
- int srcu_idx;
- int i;
+ int srcu_idx, generation;
+ int i = 0;

if (!mem_cgroup_online(memcg))
return 0;

+again:
srcu_idx = srcu_read_lock(&shrinker_srcu);
info = shrinker_info_srcu(memcg, nid);
if (unlikely(!info))
goto unlock;

- for_each_set_bit(i, info->map, info->map_nr_max) {
+ generation = atomic_read(&shrinker_srcu_generation);
+ for_each_set_bit_from(i, info->map, info->map_nr_max) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -969,6 +974,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
set_shrinker_bit(memcg, nid, i);
}
freed += ret;
+ if (atomic_read(&shrinker_srcu_generation) != generation) {
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
+ i++;
+ goto again;
+ }
}
unlock:
srcu_read_unlock(&shrinker_srcu, srcu_idx);
@@ -1008,7 +1018,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
{
unsigned long ret, freed = 0;
struct shrinker *shrinker;
- int srcu_idx;
+ int srcu_idx, generation;

/*
* The root memcg might be allocated even though memcg is disabled
@@ -1022,6 +1032,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,

srcu_idx = srcu_read_lock(&shrinker_srcu);

+ generation = atomic_read(&shrinker_srcu_generation);
list_for_each_entry_srcu(shrinker, &shrinker_list, list,
srcu_read_lock_held(&shrinker_srcu)) {
struct shrink_control sc = {
@@ -1034,6 +1045,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
+
+ if (atomic_read(&shrinker_srcu_generation) != generation) {
+ freed = freed ? : 1;
+ break;
+ }
}

srcu_read_unlock(&shrinker_srcu, srcu_idx);
--
2.20.1


2023-02-26 14:54:36

by Qi Zheng

Subject: [PATCH v3 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()

Now there are no readers of shrinker_rwsem, so
synchronize_shrinkers() does not need to take the write
lock of shrinker_rwsem to wait for all running shrinkers
to complete; synchronize_srcu() is enough.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 16ff64813175..2d71fd565c78 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -796,15 +796,11 @@ EXPORT_SYMBOL(unregister_shrinker);
/**
* synchronize_shrinkers - Wait for all running shrinkers to complete.
*
- * This is equivalent to calling unregister_shrink() and register_shrinker(),
- * but atomically and with less overhead. This is useful to guarantee that all
- * shrinker invocations have seen an update, before freeing memory, similar to
- * rcu.
+ * This is useful to guarantee that all shrinker invocations have seen an
+ * update, before freeing memory.
*/
void synchronize_shrinkers(void)
{
- down_write(&shrinker_rwsem);
- up_write(&shrinker_rwsem);
atomic_inc(&shrinker_srcu_generation);
synchronize_srcu(&shrinker_srcu);
}
--
2.20.1


2023-02-26 14:56:49

by Qi Zheng

Subject: [PATCH v3 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred

For now, reparent_shrinker_deferred() is the only holder
of the read lock of shrinker_rwsem. And it already holds
the global cgroup_mutex, so it will not be called in parallel.

Therefore, in order to convert shrinker_rwsem to shrinker_mutex
later, change it to hold the write lock of shrinker_rwsem
while reparenting.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e852c0ab9e..16ff64813175 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -451,7 +451,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
parent = root_mem_cgroup;

/* Prevent from concurrent shrinker_info expand */
- down_read(&shrinker_rwsem);
+ down_write(&shrinker_rwsem);
for_each_node(nid) {
child_info = shrinker_info_protected(memcg, nid);
parent_info = shrinker_info_protected(parent, nid);
@@ -460,7 +460,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
atomic_long_add(nr, &parent_info->nr_deferred[i]);
}
}
- up_read(&shrinker_rwsem);
+ up_write(&shrinker_rwsem);
}

static bool cgroup_reclaim(struct scan_control *sc)
--
2.20.1


2023-02-26 14:57:14

by Qi Zheng

Subject: [PATCH v3 8/8] mm: shrinkers: convert shrinker_rwsem to mutex

Now there are no readers of shrinker_rwsem, so we
can simply replace it with a mutex.

Signed-off-by: Qi Zheng <[email protected]>
---
drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
fs/super.c | 2 +-
mm/shrinker_debug.c | 14 +++++++-------
mm/vmscan.c | 34 +++++++++++++++++-----------------
5 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-cache-metadata.c b/drivers/md/dm-cache-metadata.c
index acffed750e3e..9e0c69958587 100644
--- a/drivers/md/dm-cache-metadata.c
+++ b/drivers/md/dm-cache-metadata.c
@@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs cmd root_lock).
- * - must take shrinker_rwsem without holding cmd->root_lock
+ * - must take shrinker_mutex without holding cmd->root_lock
*/
new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
CACHE_MAX_CONCURRENT_LOCKS);
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index fd464fb024c3..9f5cb52c5763 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs pmd root_lock).
- * - must take shrinker_rwsem without holding pmd->root_lock
+ * - must take shrinker_mutex without holding pmd->root_lock
*/
new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
THIN_MAX_CONCURRENT_LOCKS);
diff --git a/fs/super.c b/fs/super.c
index 84332d5cb817..91a4037b1d95 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -54,7 +54,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
* One thing we have to be careful of with a per-sb shrinker is that we don't
* drop the last active reference to the superblock from within the shrinker.
* If that happens we could trigger unregistering the shrinker from within the
- * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+ * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
* take a passive reference to the superblock to avoid this from occurring.
*/
static unsigned long super_cache_scan(struct shrinker *shrink,
diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 6aa7a7ec69da..b0f6aff372df 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -7,7 +7,7 @@
#include <linux/memcontrol.h>

/* defined in vmscan.c */
-extern struct rw_semaphore shrinker_rwsem;
+extern struct mutex shrinker_mutex;
extern struct list_head shrinker_list;
extern struct srcu_struct shrinker_srcu;

@@ -167,7 +167,7 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
char buf[128];
int id;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

/* debugfs isn't initialized yet, add debugfs entries later. */
if (!shrinker_debugfs_root)
@@ -210,7 +210,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
if (!new)
return -ENOMEM;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);

old = shrinker->name;
shrinker->name = new;
@@ -228,7 +228,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
shrinker->debugfs_entry = entry;
}

- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

kfree_const(old);

@@ -240,7 +240,7 @@ struct dentry *shrinker_debugfs_remove(struct shrinker *shrinker)
{
struct dentry *entry = shrinker->debugfs_entry;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

kfree_const(shrinker->name);
shrinker->name = NULL;
@@ -265,14 +265,14 @@ static int __init shrinker_debugfs_init(void)
shrinker_debugfs_root = dentry;

/* Create debugfs entries for shrinkers registered at boot */
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_for_each_entry(shrinker, &shrinker_list, list)
if (!shrinker->debugfs_entry) {
ret = shrinker_debugfs_add(shrinker);
if (ret)
break;
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

return ret;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d71fd565c78..6c5d21ba0c9a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,7 @@
#include <linux/cpuset.h>
#include <linux/compaction.h>
#include <linux/notifier.h>
-#include <linux/rwsem.h>
+#include <linux/mutex.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -202,7 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
}

LIST_HEAD(shrinker_list);
-DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_MUTEX(shrinker_mutex);
DEFINE_SRCU(shrinker_srcu);
static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);

@@ -225,7 +225,7 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
{
return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
&shrinker_srcu,
- lockdep_is_held(&shrinker_rwsem));
+ lockdep_is_held(&shrinker_mutex));
}

static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
@@ -310,7 +310,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
int nid, size, ret = 0;
int map_size, defer_size = 0;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
map_size = shrinker_map_size(shrinker_nr_max);
defer_size = shrinker_defer_size(shrinker_nr_max);
size = map_size + defer_size;
@@ -326,7 +326,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
info->map_nr_max = shrinker_nr_max;
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

return ret;
}
@@ -342,7 +342,7 @@ static int expand_shrinker_info(int new_id)
if (!root_mem_cgroup)
goto out;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

map_size = shrinker_map_size(new_nr_max);
defer_size = shrinker_defer_size(new_nr_max);
@@ -392,7 +392,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
if (mem_cgroup_disabled())
return -ENOSYS;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -406,7 +406,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
shrinker->id = id;
ret = 0;
unlock:
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return ret;
}

@@ -416,7 +416,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)

BUG_ON(id < 0);

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

idr_remove(&shrinker_idr, id);
}
@@ -451,7 +451,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
parent = root_mem_cgroup;

/* Prevent from concurrent shrinker_info expand */
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
for_each_node(nid) {
child_info = shrinker_info_protected(memcg, nid);
parent_info = shrinker_info_protected(parent, nid);
@@ -460,7 +460,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
atomic_long_add(nr, &parent_info->nr_deferred[i]);
}
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}

static bool cgroup_reclaim(struct scan_control *sc)
@@ -709,9 +709,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
shrinker->name = NULL;
#endif
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
unregister_memcg_shrinker(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return;
}

@@ -721,11 +721,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)

void register_shrinker_prepared(struct shrinker *shrinker)
{
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_add_tail_rcu(&shrinker->list, &shrinker_list);
shrinker->flags |= SHRINKER_REGISTERED;
shrinker_debugfs_add(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}

static int __register_shrinker(struct shrinker *shrinker)
@@ -775,13 +775,13 @@ void unregister_shrinker(struct shrinker *shrinker)
if (!(shrinker->flags & SHRINKER_REGISTERED))
return;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_del_rcu(&shrinker->list);
shrinker->flags &= ~SHRINKER_REGISTERED;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
debugfs_entry = shrinker_debugfs_remove(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

atomic_inc(&shrinker_srcu_generation);
synchronize_srcu(&shrinker_srcu);
--
2.20.1


2023-02-26 14:58:29

by Qi Zheng

Subject: [PATCH v3 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless

Like the global and memcg slab shrink, also use SRCU
to make the count and scan operations in the shrinker
debugfs interface lockless.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/shrinker_debug.c | 24 +++++++-----------------
1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 39c3491e28a3..6aa7a7ec69da 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -9,6 +9,7 @@
/* defined in vmscan.c */
extern struct rw_semaphore shrinker_rwsem;
extern struct list_head shrinker_list;
+extern struct srcu_struct shrinker_srcu;

static DEFINE_IDA(shrinker_debugfs_ida);
static struct dentry *shrinker_debugfs_root;
@@ -49,18 +50,13 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
struct mem_cgroup *memcg;
unsigned long total;
bool memcg_aware;
- int ret, nid;
+ int ret = 0, nid, srcu_idx;

count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
if (!count_per_node)
return -ENOMEM;

- ret = down_read_killable(&shrinker_rwsem);
- if (ret) {
- kfree(count_per_node);
- return ret;
- }
- rcu_read_lock();
+ srcu_idx = srcu_read_lock(&shrinker_srcu);

memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;

@@ -91,8 +87,7 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
}
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);

- rcu_read_unlock();
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);

kfree(count_per_node);
return ret;
@@ -115,9 +110,8 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
.gfp_mask = GFP_KERNEL,
};
struct mem_cgroup *memcg = NULL;
- int nid;
+ int nid, srcu_idx;
char kbuf[72];
- ssize_t ret;

read_len = size < (sizeof(kbuf) - 1) ? size : (sizeof(kbuf) - 1);
if (copy_from_user(kbuf, buf, read_len))
@@ -146,11 +140,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
return -EINVAL;
}

- ret = down_read_killable(&shrinker_rwsem);
- if (ret) {
- mem_cgroup_put(memcg);
- return ret;
- }
+ srcu_idx = srcu_read_lock(&shrinker_srcu);

sc.nid = nid;
sc.memcg = memcg;
@@ -159,7 +149,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,

shrinker->scan_objects(shrinker, &sc);

- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
mem_cgroup_put(memcg);

return size;
--
2.20.1


2023-02-26 15:07:17

by Qi Zheng

Subject: Re: [PATCH v3 1/8] mm: vmscan: add a map_nr_max field to shrinker_info



On 2023/2/26 22:46, Qi Zheng wrote:
> To prepare for the subsequent lockless memcg slab shrink,
> add a map_nr_max field to struct shrinker_info to records
> its own real shrinker_nr_max.
>
> Suggested-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Qi Zheng <[email protected]>
> ---
> include/linux/memcontrol.h | 1 +
> mm/vmscan.c | 41 ++++++++++++++++++++++----------------
> 2 files changed, 25 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b6eda2ab205d..aa69ea98e2d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -97,6 +97,7 @@ struct shrinker_info {
> struct rcu_head rcu;
> atomic_long_t *nr_deferred;
> unsigned long *map;
> + int map_nr_max;
> };
>
> struct lruvec_stats_percpu {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b24b8..546c07ccb3bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> lockdep_is_held(&shrinker_rwsem));
> }
>
> +static inline bool need_expand(int new_nr_max, int old_nr_max)
> +{
> + return round_up(new_nr_max, BITS_PER_LONG) >
> + round_up(old_nr_max, BITS_PER_LONG);
> +}
> +
> static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> int map_size, int defer_size,
> - int old_map_size, int old_defer_size)
> + int old_map_size, int old_defer_size,
> + int new_nr_max)
> {
> struct shrinker_info *new, *old;
> struct mem_cgroup_per_node *pn;
> @@ -240,12 +247,17 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> if (!old)
> return 0;
>
> + /* Already expanded this shrinker_info */
> + if (!need_expand(new_nr_max, old->map_nr_max))
> + return 0;

Oops, need to continue here.

> +
> new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> if (!new)
> return -ENOMEM;
>
> new->nr_deferred = (atomic_long_t *)(new + 1);
> new->map = (void *)new->nr_deferred + defer_size;
> + new->map_nr_max = new_nr_max;
>
> /* map: set all old bits, clear all new bits */
> memset(new->map, (int)0xff, old_map_size);
> @@ -295,6 +307,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> }
> info->nr_deferred = (atomic_long_t *)(info + 1);
> info->map = (void *)info->nr_deferred + defer_size;
> + info->map_nr_max = shrinker_nr_max;
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> }
> up_write(&shrinker_rwsem);
> @@ -302,23 +315,14 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> return ret;
> }
>
> -static inline bool need_expand(int nr_max)
> -{
> - return round_up(nr_max, BITS_PER_LONG) >
> - round_up(shrinker_nr_max, BITS_PER_LONG);
> -}
> -
> static int expand_shrinker_info(int new_id)
> {
> int ret = 0;
> - int new_nr_max = new_id + 1;
> + int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
> int map_size, defer_size = 0;
> int old_map_size, old_defer_size = 0;
> struct mem_cgroup *memcg;
>
> - if (!need_expand(new_nr_max))
> - goto out;
> -
> if (!root_mem_cgroup)
> goto out;
>
> @@ -332,7 +336,8 @@ static int expand_shrinker_info(int new_id)
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> - old_map_size, old_defer_size);
> + old_map_size, old_defer_size,
> + new_nr_max);
> if (ret) {
> mem_cgroup_iter_break(NULL, memcg);
> goto out;
> @@ -352,9 +357,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>
> rcu_read_lock();
> info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> - /* Pairs with smp mb in shrink_slab() */
> - smp_mb__before_atomic();
> - set_bit(shrinker_id, info->map);
> + if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
> + /* Pairs with smp mb in shrink_slab() */
> + smp_mb__before_atomic();
> + set_bit(shrinker_id, info->map);
> + }
> rcu_read_unlock();
> }
> }
> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
> for_each_node(nid) {
> child_info = shrinker_info_protected(memcg, nid);
> parent_info = shrinker_info_protected(parent, nid);
> - for (i = 0; i < shrinker_nr_max; i++) {
> + for (i = 0; i < child_info->map_nr_max; i++) {
> nr = atomic_long_read(&child_info->nr_deferred[i]);
> atomic_long_add(nr, &parent_info->nr_deferred[i]);
> }
> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> if (unlikely(!info))
> goto unlock;
>
> - for_each_set_bit(i, info->map, shrinker_nr_max) {
> + for_each_set_bit(i, info->map, info->map_nr_max) {
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> .nid = nid,

--
Thanks,
Qi

2023-02-26 19:51:10

by Andrew Morton

Subject: Re: [PATCH v3 0/8] make slab shrink lockless

On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:

> Hi all,
>
> This patch series aims to make slab shrink lockless.

What an awesome changelog.

> 2. Survey
> =========

Especially this part.

Looking through all the prior efforts and at this patchset I am not
immediately seeing any statements about the overall effect upon
real-world workloads. For a good example, does this patchset
measurably improve throughput or energy consumption on your servers?



2023-02-27 13:33:12

by Qi Zheng

Subject: Re: [PATCH v3 0/8] make slab shrink lockless



On 2023/2/27 03:51, Andrew Morton wrote:
> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:
>
>> Hi all,
>>
>> This patch series aims to make slab shrink lockless.
>
> What an awesome changelog.
>
>> 2. Survey
>> =========
>
> Especially this part.
>
> Looking through all the prior efforts and at this patchset I am not
> immediately seeing any statements about the overall effect upon
> real-world workloads. For a good example, does this patchset
> measurably improve throughput or energy consumption on your servers?

Hi Andrew,

I re-tested on the following physical machine:

Architecture: x86_64
CPU(s): 96
On-line CPU(s) list: 0-95
Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz

I found that the reason for the hotspot I described in the cover letter is
wrong. The down_read_trylock() hotspot is caused not by trylock failures,
but simply by the atomic operation (cmpxchg) itself. And this leads to a
significant reduction in IPC (insn per cycle).

To verify this, I did the following tests:

1. Run the following script to create down_read_trylock() hotspots:

```
#!/bin/bash

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    mkdir -p /sys/fs/cgroup/perf_event/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
    touch)
        do_touch $2 $3
        ;;
    test)
        do_create 4000
        do_mount 0 4000
        do_touch 0 3000
        ;;
    *)
        exit 1
        ;;
esac
```

Save the above script, then run the test and touch commands.

Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

32.31% [kernel] [k] down_read_trylock
19.40% [kernel] [k] pv_native_safe_halt
16.24% [kernel] [k] up_read
15.70% [kernel] [k] shrink_slab
4.69% [kernel] [k] _find_next_bit
2.62% [kernel] [k] shrink_node
1.78% [kernel] [k] shrink_lruvec
0.76% [kernel] [k] do_shrink_slab

2) After applying this patchset:

27.83% [kernel] [k] _find_next_bit
16.97% [kernel] [k] shrink_slab
15.82% [kernel] [k] pv_native_safe_halt
9.58% [kernel] [k] shrink_node
8.31% [kernel] [k] shrink_lruvec
5.64% [kernel] [k] do_shrink_slab
3.88% [kernel] [k] mem_cgroup_iter

2. At the same time, we use the following perf command to capture IPC
information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

Performance counter stats for 'system wide' (5 runs):

     454187219766      cycles            test    ( +-  1.84% )
      78896433101      instructions      test    # 0.17 insn per cycle  ( +-  0.44% )

10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )

2) After applying this patchset:

Performance counter stats for 'system wide' (5 runs):

     841954709443      cycles            test    ( +- 15.80% )  (98.69%)
     527258677936      instructions      test    # 0.63 insn per cycle  ( +- 15.11% )  (98.68%)

10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )

We can see that IPC drops significantly when down_read_trylock()
is called at high frequency. After switching to SRCU, the IPC
returns to a normal level.

Thanks,
Qi

>
>


2023-02-27 15:08:50

by Mike Rapoport

Subject: Re: [PATCH v3 0/8] make slab shrink lockless

Hi,

On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>
>
> On 2023/2/27 03:51, Andrew Morton wrote:
> > On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > This patch series aims to make slab shrink lockless.
> >
> > What an awesome changelog.
> >
> > > 2. Survey
> > > =========
> >
> > Especially this part.
> >
> > Looking through all the prior efforts and at this patchset I am not
> > immediately seeing any statements about the overall effect upon
> > real-world workloads. For a good example, does this patchset
> > measurably improve throughput or energy consumption on your servers?
>
> Hi Andrew,
>
> I re-tested with the following physical machines:
>
> Architecture: x86_64
> CPU(s): 96
> On-line CPU(s) list: 0-95
> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>
> I found that the reason for the hotspot I described in cover letter is
> wrong. The reason for the down_read_trylock() hotspot is not because of
> the failure to trylock, but simply because of the atomic operation
> (cmpxchg). And this will lead to a significant reduction in IPC (insn
> per cycle).

...

> Then we can use the following perf command to view hotspots:
>
> perf top -U -F 999
>
> 1) Before applying this patchset:
>
> 32.31% [kernel] [k] down_read_trylock
> 19.40% [kernel] [k] pv_native_safe_halt
> 16.24% [kernel] [k] up_read
> 15.70% [kernel] [k] shrink_slab
> 4.69% [kernel] [k] _find_next_bit
> 2.62% [kernel] [k] shrink_node
> 1.78% [kernel] [k] shrink_lruvec
> 0.76% [kernel] [k] do_shrink_slab
>
> 2) After applying this patchset:
>
> 27.83% [kernel] [k] _find_next_bit
> 16.97% [kernel] [k] shrink_slab
> 15.82% [kernel] [k] pv_native_safe_halt
> 9.58% [kernel] [k] shrink_node
> 8.31% [kernel] [k] shrink_lruvec
> 5.64% [kernel] [k] do_shrink_slab
> 3.88% [kernel] [k] mem_cgroup_iter
>
> 2. At the same time, we use the following perf command to capture IPC
> information:
>
> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>
> 1) Before applying this patchset:
>
> Performance counter stats for 'system wide' (5 runs):
>
> 454187219766 cycles test (
> +- 1.84% )
> 78896433101 instructions test # 0.17 insn per
> cycle ( +- 0.44% )
>
> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
>
> 2) After applying this patchset:
>
> Performance counter stats for 'system wide' (5 runs):
>
> 841954709443 cycles test (
> +- 15.80% ) (98.69%)
> 527258677936 instructions test # 0.63 insn per
> cycle ( +- 15.11% ) (98.68%)
>
> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
>
> We can see that IPC drops very seriously when calling
> down_read_trylock() at high frequency. After using SRCU,
> the IPC is at a normal level.

The results you present do show an improvement in IPC for an artificial test
script. But more interesting would be to see how real-world workloads
benefit from your changes.

> Thanks,
> Qi

--
Sincerely yours,
Mike.

2023-02-27 19:02:35

by Roman Gushchin

Subject: Re: [PATCH v3 0/8] make slab shrink lockless

On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>
>
> On 2023/2/27 03:51, Andrew Morton wrote:
> > On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:
> >
> Save the above script, then run test and touch commands.
>
> Then we can use the following perf command to view hotspots:
>
> perf top -U -F 999
>
> 1) Before applying this patchset:
>
> 32.31% [kernel] [k] down_read_trylock
> 19.40% [kernel] [k] pv_native_safe_halt
> 16.24% [kernel] [k] up_read
> 15.70% [kernel] [k] shrink_slab
> 4.69% [kernel] [k] _find_next_bit
> 2.62% [kernel] [k] shrink_node
> 1.78% [kernel] [k] shrink_lruvec
> 0.76% [kernel] [k] do_shrink_slab
>
> 2) After applying this patchset:
>
> 27.83% [kernel] [k] _find_next_bit
> 16.97% [kernel] [k] shrink_slab
> 15.82% [kernel] [k] pv_native_safe_halt
> 9.58% [kernel] [k] shrink_node
> 8.31% [kernel] [k] shrink_lruvec
> 5.64% [kernel] [k] do_shrink_slab
> 3.88% [kernel] [k] mem_cgroup_iter

Not opposing the intention of the patchset in any way (I actually think
it's a good idea to make the shrinkers list lockless), but looking at
both outputs above I think that the main problem is not the contention on
the semaphore, but the reason for this contention.

It seems like there is often a long list of shrinkers which can barely
reclaim any memory, but we're calling them again and again.
In order to achieve real wins with real-life workloads, I guess
that's what we should optimize.

Thanks!

2023-02-27 19:21:12

by Kirill Tkhai

Subject: Re: [PATCH v3 0/8] make slab shrink lockless

On 27.02.2023 18:08, Mike Rapoport wrote:
> Hi,
>
> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>
>>
>> On 2023/2/27 03:51, Andrew Morton wrote:
>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> This patch series aims to make slab shrink lockless.
>>>
>>> What an awesome changelog.
>>>
>>>> 2. Survey
>>>> =========
>>>
>>> Especially this part.
>>>
>>> Looking through all the prior efforts and at this patchset I am not
>>> immediately seeing any statements about the overall effect upon
>>> real-world workloads. For a good example, does this patchset
>>> measurably improve throughput or energy consumption on your servers?
>>
>> Hi Andrew,
>>
>> I re-tested with the following physical machines:
>>
>> Architecture: x86_64
>> CPU(s): 96
>> On-line CPU(s) list: 0-95
>> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>
>> I found that the reason for the hotspot I described in cover letter is
>> wrong. The reason for the down_read_trylock() hotspot is not because of
>> the failure to trylock, but simply because of the atomic operation
>> (cmpxchg). And this will lead to a significant reduction in IPC (insn
>> per cycle).
>
> ...
>
>> Then we can use the following perf command to view hotspots:
>>
>> perf top -U -F 999
>>
>> 1) Before applying this patchset:
>>
>> 32.31% [kernel] [k] down_read_trylock
>> 19.40% [kernel] [k] pv_native_safe_halt
>> 16.24% [kernel] [k] up_read
>> 15.70% [kernel] [k] shrink_slab
>> 4.69% [kernel] [k] _find_next_bit
>> 2.62% [kernel] [k] shrink_node
>> 1.78% [kernel] [k] shrink_lruvec
>> 0.76% [kernel] [k] do_shrink_slab
>>
>> 2) After applying this patchset:
>>
>> 27.83% [kernel] [k] _find_next_bit
>> 16.97% [kernel] [k] shrink_slab
>> 15.82% [kernel] [k] pv_native_safe_halt
>> 9.58% [kernel] [k] shrink_node
>> 8.31% [kernel] [k] shrink_lruvec
>> 5.64% [kernel] [k] do_shrink_slab
>> 3.88% [kernel] [k] mem_cgroup_iter
>>
>> 2. At the same time, we use the following perf command to capture IPC
>> information:
>>
>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>
>> 1) Before applying this patchset:
>>
>> Performance counter stats for 'system wide' (5 runs):
>>
>> 454187219766 cycles test (
>> +- 1.84% )
>> 78896433101 instructions test # 0.17 insn per
>> cycle ( +- 0.44% )
>>
>> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
>>
>> 2) After applying this patchset:
>>
>> Performance counter stats for 'system wide' (5 runs):
>>
>> 841954709443 cycles test (
>> +- 15.80% ) (98.69%)
>> 527258677936 instructions test # 0.63 insn per
>> cycle ( +- 15.11% ) (98.68%)
>>
>> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
>>
>> We can see that IPC drops very seriously when calling
>> down_read_trylock() at high frequency. After using SRCU,
>> the IPC is at a normal level.
>
> The results you present do show improvement in IPC for an artificial test
> script. But more interesting would be to see how a real world workloads
> benefit from your changes.

One of the real workloads from my experience is the start of an overcommitted
node containing many starting containers after a node crash (or many resuming
containers after a reboot for a kernel update). In these cases memory pressure
is huge, and the node goes through long rounds of reclaim.

This patchset makes prealloc_memcg_shrinker() independent of do_shrink_slab(),
so prealloc_memcg_shrinker() won't have to wait until shrink_slab_memcg()
completes its current bit iteration, sees rwsem_is_contended() and breaks
the iteration.

Also, it's important to mention that currently we have the strange behavior:

prealloc_memcg_shrinker()
  down_write(&shrinker_rwsem)
  idr_alloc()
    reclaim
      for each child memcg
        shrink_slab_memcg()
          down_read_trylock(&shrinker_rwsem) -> fail

All the slab reclaim in this scenario is just parasitic work; it simply wastes
our CPU time, which does not look like a good design.

Kirill

2023-02-27 19:32:48

by Roman Gushchin

Subject: Re: [PATCH v3 0/8] make slab shrink lockless

On Mon, Feb 27, 2023 at 10:20:59PM +0300, Kirill Tkhai wrote:
> On 27.02.2023 18:08, Mike Rapoport wrote:
> > Hi,
> >
> > On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
> >>
> >>
> >> On 2023/2/27 03:51, Andrew Morton wrote:
> >>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> This patch series aims to make slab shrink lockless.
> >>>
> >>> What an awesome changelog.
> >>>
> >>>> 2. Survey
> >>>> =========
> >>>
> >>> Especially this part.
> >>>
> >>> Looking through all the prior efforts and at this patchset I am not
> >>> immediately seeing any statements about the overall effect upon
> >>> real-world workloads. For a good example, does this patchset
> >>> measurably improve throughput or energy consumption on your servers?
> >>
> >> Hi Andrew,
> >>
> >> I re-tested with the following physical machines:
> >>
> >> Architecture: x86_64
> >> CPU(s): 96
> >> On-line CPU(s) list: 0-95
> >> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
> >>
> >> I found that the reason for the hotspot I described in cover letter is
> >> wrong. The reason for the down_read_trylock() hotspot is not because of
> >> the failure to trylock, but simply because of the atomic operation
> >> (cmpxchg). And this will lead to a significant reduction in IPC (insn
> >> per cycle).
> >
> > ...
> >
> >> Then we can use the following perf command to view hotspots:
> >>
> >> perf top -U -F 999
> >>
> >> 1) Before applying this patchset:
> >>
> >> 32.31% [kernel] [k] down_read_trylock
> >> 19.40% [kernel] [k] pv_native_safe_halt
> >> 16.24% [kernel] [k] up_read
> >> 15.70% [kernel] [k] shrink_slab
> >> 4.69% [kernel] [k] _find_next_bit
> >> 2.62% [kernel] [k] shrink_node
> >> 1.78% [kernel] [k] shrink_lruvec
> >> 0.76% [kernel] [k] do_shrink_slab
> >>
> >> 2) After applying this patchset:
> >>
> >> 27.83% [kernel] [k] _find_next_bit
> >> 16.97% [kernel] [k] shrink_slab
> >> 15.82% [kernel] [k] pv_native_safe_halt
> >> 9.58% [kernel] [k] shrink_node
> >> 8.31% [kernel] [k] shrink_lruvec
> >> 5.64% [kernel] [k] do_shrink_slab
> >> 3.88% [kernel] [k] mem_cgroup_iter
> >>
> >> 2. At the same time, we use the following perf command to capture IPC
> >> information:
> >>
> >> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
> >>
> >> 1) Before applying this patchset:
> >>
> >> Performance counter stats for 'system wide' (5 runs):
> >>
> >> 454187219766      cycles          test                         ( +-  1.84% )
> >> 78896433101       instructions    test    # 0.17 insn per cycle    ( +-  0.44% )
> >>
> >> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
> >>
> >> 2) After applying this patchset:
> >>
> >> Performance counter stats for 'system wide' (5 runs):
> >>
> >> 841954709443      cycles          test                         ( +- 15.80% )  (98.69%)
> >> 527258677936      instructions    test    # 0.63 insn per cycle    ( +- 15.11% )  (98.68%)
> >>
> >> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
> >>
> >> We can see that IPC drops very seriously when calling
> >> down_read_trylock() at high frequency. After using SRCU,
> >> the IPC is at a normal level.
> >
> > The results you present do show improvement in IPC for an artificial test
> > script. But more interesting would be to see how real world workloads
> > benefit from your changes.
>
> One of the real workloads from my experience is the start of an overcommitted node
> containing many starting containers after a node crash (or many resuming containers
> after a reboot for a kernel update). In these cases memory pressure is huge, and
> the node keeps going through long rounds of reclaim.
>
> This patchset makes prealloc_memcg_shrinker() independent of do_shrink_slab(),
> so prealloc_memcg_shrinker() won't have to wait until shrink_slab_memcg() completes
> its current bit iteration, sees rwsem_is_contended() and breaks out of the iteration.
>
> Also, it's important to mention that currently we have the strange behavior:
>
> prealloc_memcg_shrinker()
> down_write(&shrinker_rwsem)
> idr_alloc()
> reclaim
> for each child memcg
> shrink_slab_memcg()
> down_read_trylock(&shrinker_rwsem) -> fail

But this can happen only if we get -ENOMEM in idr_alloc()?
Doesn't seem to be a very hot path.

Thanks!

2023-02-27 19:47:59

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] make slab shrink lockless

On 27.02.2023 22:32, Roman Gushchin wrote:
> On Mon, Feb 27, 2023 at 10:20:59PM +0300, Kirill Tkhai wrote:
>> On 27.02.2023 18:08, Mike Rapoport wrote:
>>> Hi,
>>>
>>> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2023/2/27 03:51, Andrew Morton wrote:
>>>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This patch series aims to make slab shrink lockless.
>>>>>
>>>>> What an awesome changelog.
>>>>>
>>>>>> 2. Survey
>>>>>> =========
>>>>>
>>>>> Especially this part.
>>>>>
>>>>> Looking through all the prior efforts and at this patchset I am not
>>>>> immediately seeing any statements about the overall effect upon
>>>>> real-world workloads. For a good example, does this patchset
>>>>> measurably improve throughput or energy consumption on your servers?
>>>>
>>>> Hi Andrew,
>>>>
>>>> I re-tested with the following physical machines:
>>>>
>>>> Architecture: x86_64
>>>> CPU(s): 96
>>>> On-line CPU(s) list: 0-95
>>>> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>>>
>>>> I found that the reason for the hotspot I described in the cover letter is
>>>> wrong. The reason for the down_read_trylock() hotspot is not because of
>>>> the failure to trylock, but simply because of the atomic operation
>>>> (cmpxchg). And this will lead to a significant reduction in IPC (insn
>>>> per cycle).
>>>
>>> ...
>>>
>>>> Then we can use the following perf command to view hotspots:
>>>>
>>>> perf top -U -F 999
>>>>
>>>> 1) Before applying this patchset:
>>>>
>>>> 32.31% [kernel] [k] down_read_trylock
>>>> 19.40% [kernel] [k] pv_native_safe_halt
>>>> 16.24% [kernel] [k] up_read
>>>> 15.70% [kernel] [k] shrink_slab
>>>> 4.69% [kernel] [k] _find_next_bit
>>>> 2.62% [kernel] [k] shrink_node
>>>> 1.78% [kernel] [k] shrink_lruvec
>>>> 0.76% [kernel] [k] do_shrink_slab
>>>>
>>>> 2) After applying this patchset:
>>>>
>>>> 27.83% [kernel] [k] _find_next_bit
>>>> 16.97% [kernel] [k] shrink_slab
>>>> 15.82% [kernel] [k] pv_native_safe_halt
>>>> 9.58% [kernel] [k] shrink_node
>>>> 8.31% [kernel] [k] shrink_lruvec
>>>> 5.64% [kernel] [k] do_shrink_slab
>>>> 3.88% [kernel] [k] mem_cgroup_iter
>>>>
>>>> 2. At the same time, we use the following perf command to capture IPC
>>>> information:
>>>>
>>>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>>>
>>>> 1) Before applying this patchset:
>>>>
>>>> Performance counter stats for 'system wide' (5 runs):
>>>>
>>>> 454187219766      cycles          test                         ( +-  1.84% )
>>>> 78896433101       instructions    test    # 0.17 insn per cycle    ( +-  0.44% )
>>>>
>>>> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
>>>>
>>>> 2) After applying this patchset:
>>>>
>>>> Performance counter stats for 'system wide' (5 runs):
>>>>
>>>> 841954709443      cycles          test                         ( +- 15.80% )  (98.69%)
>>>> 527258677936      instructions    test    # 0.63 insn per cycle    ( +- 15.11% )  (98.68%)
>>>>
>>>> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
>>>>
>>>> We can see that IPC drops very seriously when calling
>>>> down_read_trylock() at high frequency. After using SRCU,
>>>> the IPC is at a normal level.
>>>
>>> The results you present do show improvement in IPC for an artificial test
>>> script. But more interesting would be to see how real world workloads
>>> benefit from your changes.
>>
>> One of the real workloads from my experience is the start of an overcommitted node
>> containing many starting containers after a node crash (or many resuming containers
>> after a reboot for a kernel update). In these cases memory pressure is huge, and
>> the node keeps going through long rounds of reclaim.
>>
>> This patchset makes prealloc_memcg_shrinker() independent of do_shrink_slab(),
>> so prealloc_memcg_shrinker() won't have to wait until shrink_slab_memcg() completes
>> its current bit iteration, sees rwsem_is_contended() and breaks out of the iteration.
>>
>> Also, it's important to mention that currently we have the strange behavior:
>>
>> prealloc_memcg_shrinker()
>> down_write(&shrinker_rwsem)
>> idr_alloc()
>> reclaim
>> for each child memcg
>> shrink_slab_memcg()
>> down_read_trylock(&shrinker_rwsem) -> fail
>
> But this can happen only if we get -ENOMEM in idr_alloc()?
> Doesn't seem to be a very hot path.

It is not only idr_alloc(), but expand_shrinker_info() too, and the latter is heavier.
But despite that, yes, it's not a hot path.

The memory pressure on overcommitted node start that I described above is a regular
situation. There are lots of register_shrinker() calls contending with reclaim.
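
For completeness, the writer side in such an SRCU scheme could look roughly
like the hedged sketch below (assumed names again, not the actual patch):
registration and unregistration only serialize against each other for the
list update, and unregistration waits for in-flight readers with
synchronize_srcu() instead of excluding reclaim up front.

    void register_shrinker_sketch(struct shrinker *shrinker)
    {
        down_write(&shrinker_rwsem);    /* serializes writers only */
        list_add_tail_rcu(&shrinker->list, &shrinker_list);
        up_write(&shrinker_rwsem);
    }

    void unregister_shrinker_sketch(struct shrinker *shrinker)
    {
        down_write(&shrinker_rwsem);
        list_del_rcu(&shrinker->list);
        up_write(&shrinker_rwsem);

        /* Wait for any concurrent do_shrink_slab() readers to finish. */
        synchronize_srcu(&shrinker_srcu);
    }

So on a node start with many containers, the many register_shrinker() calls
and the ongoing reclaim stop getting in each other's way.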

2023-02-28 10:05:53

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] make slab shrink lockless



On 2023/2/27 23:08, Mike Rapoport wrote:
> Hi,
>
> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>
>>
>> On 2023/2/27 03:51, Andrew Morton wrote:
>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> This patch series aims to make slab shrink lockless.
>>>
>>> What an awesome changelog.
>>>
>>>> 2. Survey
>>>> =========
>>>
>>> Especially this part.
>>>
>>> Looking through all the prior efforts and at this patchset I am not
>>> immediately seeing any statements about the overall effect upon
>>> real-world workloads. For a good example, does this patchset
>>> measurably improve throughput or energy consumption on your servers?
>>
>> Hi Andrew,
>>
>> I re-tested with the following physical machines:
>>
>> Architecture: x86_64
>> CPU(s): 96
>> On-line CPU(s) list: 0-95
>> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>
>> I found that the reason for the hotspot I described in the cover letter is
>> wrong. The reason for the down_read_trylock() hotspot is not because of
>> the failure to trylock, but simply because of the atomic operation
>> (cmpxchg). And this will lead to a significant reduction in IPC (insn
>> per cycle).
>
> ...
>
>> Then we can use the following perf command to view hotspots:
>>
>> perf top -U -F 999
>>
>> 1) Before applying this patchset:
>>
>> 32.31% [kernel] [k] down_read_trylock
>> 19.40% [kernel] [k] pv_native_safe_halt
>> 16.24% [kernel] [k] up_read
>> 15.70% [kernel] [k] shrink_slab
>> 4.69% [kernel] [k] _find_next_bit
>> 2.62% [kernel] [k] shrink_node
>> 1.78% [kernel] [k] shrink_lruvec
>> 0.76% [kernel] [k] do_shrink_slab
>>
>> 2) After applying this patchset:
>>
>> 27.83% [kernel] [k] _find_next_bit
>> 16.97% [kernel] [k] shrink_slab
>> 15.82% [kernel] [k] pv_native_safe_halt
>> 9.58% [kernel] [k] shrink_node
>> 8.31% [kernel] [k] shrink_lruvec
>> 5.64% [kernel] [k] do_shrink_slab
>> 3.88% [kernel] [k] mem_cgroup_iter
>>
>> 2. At the same time, we use the following perf command to capture IPC
>> information:
>>
>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>
>> 1) Before applying this patchset:
>>
>> Performance counter stats for 'system wide' (5 runs):
>>
>> 454187219766      cycles          test                         ( +-  1.84% )
>> 78896433101       instructions    test    # 0.17 insn per cycle    ( +-  0.44% )
>>
>> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
>>
>> 2) After applying this patchset:
>>
>> Performance counter stats for 'system wide' (5 runs):
>>
>> 841954709443      cycles          test                         ( +- 15.80% )  (98.69%)
>> 527258677936      instructions    test    # 0.63 insn per cycle    ( +- 15.11% )  (98.68%)
>>
>> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
>>
>> We can see that IPC drops very seriously when calling
>> down_read_trylock() at high frequency. After using SRCU,
>> the IPC is at a normal level.
>
> The results you present do show improvement in IPC for an artificial test
> script. But more interesting would be to see how real world workloads
> benefit from your changes.

Hi Mike and Andrew,

I did encounter this problem under the real workload of our online
servers. At the end of this email, I posted another call stack and
hotspot that I found before.

I scanned the hotspots of all our online servers yesterday and today,
but unfortunately did not find a live environment hitting this hotspot.

Some of our servers have a large number of containers, and each
container will mount some file systems. This is likely to trigger
down_read_trylock() hotspots when the memory pressure of the whole
machine or the memory pressure of memcg is high.

So yesterday I found a physical server with a configuration similar to the
online servers and ran a simulation test. The call stack and the hotspot in
the simulation test are almost exactly the same, so in theory, when such a
hotspot appears on an online server, we can also enjoy the IPC improvement.
This will improve server performance in memory exhaustion scenarios (at the
memcg or global level).

And the above scenario is only one aspect; the other aspect is the lock
contention scenario mentioned by Kirill. After applying this patchset,
slab shrink and register_shrinker() can run completely in parallel,
which fixes that problem.

These are the two main benefits for real workloads that I consider.
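
As a hedged illustration of the first benefit (not part of the patchset), the
IPC effect can be reproduced in userspace: many threads doing cmpxchg on one
shared word (which is roughly what the rwsem reader fast path does) versus
bumping private, padded counters (roughly what SRCU's per-cpu read side does).
Running the two modes under perf stat shows the same kind of insn-per-cycle gap.

    /* contention.c: build with "gcc -O2 -pthread contention.c -o contention",
     * then compare "perf stat ./contention shared" with
     * "perf stat ./contention private".
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define NTHREADS 16
    #define ITERS    (1UL << 24)

    static long shared_word;                  /* one cache line for everyone */
    static long private_word[NTHREADS][8];    /* padded, one line per thread */
    static int use_shared;

    static void *worker(void *arg)
    {
        long id = (long)arg;

        for (unsigned long i = 0; i < ITERS; i++) {
            if (use_shared) {
                /* Shared cmpxchg: constant cache-line bouncing. */
                long old = __atomic_load_n(&shared_word, __ATOMIC_RELAXED);
                __atomic_compare_exchange_n(&shared_word, &old, old + 1, 0,
                                            __ATOMIC_ACQ_REL, __ATOMIC_RELAXED);
            } else {
                /* Private counter: stays in the local cache. */
                __atomic_fetch_add(&private_word[id][0], 1, __ATOMIC_RELAXED);
            }
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        pthread_t tids[NTHREADS];

        use_shared = (argc > 1 && strcmp(argv[1], "shared") == 0);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tids[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
            pthread_join(tids[i], NULL);
        printf("done (%s mode)\n", use_shared ? "shared" : "private");
        return 0;
    }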

Thanks,
Qi

call stack
----------

@[
down_read_trylock+1
shrink_slab+128
shrink_node+371
do_try_to_free_pages+232
try_to_free_pages+243
__alloc_pages_slowpath+771
__alloc_pages_nodemask+702
pagecache_get_page+255
filemap_fault+1361
ext4_filemap_fault+44
__do_fault+76
handle_mm_fault+3543
do_user_addr_fault+442
do_page_fault+48
page_fault+62
]: 1161690
@[
down_read_trylock+1
shrink_slab+128
shrink_node+371
balance_pgdat+690
kswapd+389
kthread+246
ret_from_fork+31
]: 8424884
@[
down_read_trylock+1
shrink_slab+128
shrink_node+371
do_try_to_free_pages+232
try_to_free_pages+243
__alloc_pages_slowpath+771
__alloc_pages_nodemask+702
__do_page_cache_readahead+244
filemap_fault+1674
ext4_filemap_fault+44
__do_fault+76
handle_mm_fault+3543
do_user_addr_fault+442
do_page_fault+48
page_fault+62
]: 20917631
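
(For reference, call stacks like the ones above can be collected with a
bpftrace one-liner along these lines; the exact command used is not shown
here, so treat this as an assumed example:

    bpftrace -e 'kprobe:down_read_trylock { @[kstack] = count(); }'

The per-stack counts it prints correspond to the numbers after each "]:"
above.)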

hotspot
-------

52.22% [kernel] [k] down_read_trylock
19.60% [kernel] [k] up_read
8.86% [kernel] [k] shrink_slab
2.44% [kernel] [k] idr_find
1.25% [kernel] [k] count_shadow_nodes
1.18% [kernel] [k] shrink_lruvec
0.71% [kernel] [k] mem_cgroup_iter
0.71% [kernel] [k] shrink_node
0.55% [kernel] [k] find_next_bit


>
>> Thanks,
>> Qi
>

--
Thanks,
Qi

2023-02-28 10:09:16

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] make slab shrink lockless



On 2023/2/28 03:20, Kirill Tkhai wrote:
> On 27.02.2023 18:08, Mike Rapoport wrote:
>> Hi,
>>
>> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>>
>>>
>>> On 2023/2/27 03:51, Andrew Morton wrote:
>>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> This patch series aims to make slab shrink lockless.
>>>>
>>>> What an awesome changelog.
>>>>
>>>>> 2. Survey
>>>>> =========
>>>>
>>>> Especially this part.
>>>>
>>>> Looking through all the prior efforts and at this patchset I am not
>>>> immediately seeing any statements about the overall effect upon
>>>> real-world workloads. For a good example, does this patchset
>>>> measurably improve throughput or energy consumption on your servers?
>>>
>>> Hi Andrew,
>>>
>>> I re-tested with the following physical machines:
>>>
>>> Architecture: x86_64
>>> CPU(s): 96
>>> On-line CPU(s) list: 0-95
>>> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>>
>>> I found that the reason for the hotspot I described in the cover letter is
>>> wrong. The reason for the down_read_trylock() hotspot is not because of
>>> the failure to trylock, but simply because of the atomic operation
>>> (cmpxchg). And this will lead to a significant reduction in IPC (insn
>>> per cycle).
>>
>> ...
>>
>>> Then we can use the following perf command to view hotspots:
>>>
>>> perf top -U -F 999
>>>
>>> 1) Before applying this patchset:
>>>
>>> 32.31% [kernel] [k] down_read_trylock
>>> 19.40% [kernel] [k] pv_native_safe_halt
>>> 16.24% [kernel] [k] up_read
>>> 15.70% [kernel] [k] shrink_slab
>>> 4.69% [kernel] [k] _find_next_bit
>>> 2.62% [kernel] [k] shrink_node
>>> 1.78% [kernel] [k] shrink_lruvec
>>> 0.76% [kernel] [k] do_shrink_slab
>>>
>>> 2) After applying this patchset:
>>>
>>> 27.83% [kernel] [k] _find_next_bit
>>> 16.97% [kernel] [k] shrink_slab
>>> 15.82% [kernel] [k] pv_native_safe_halt
>>> 9.58% [kernel] [k] shrink_node
>>> 8.31% [kernel] [k] shrink_lruvec
>>> 5.64% [kernel] [k] do_shrink_slab
>>> 3.88% [kernel] [k] mem_cgroup_iter
>>>
>>> 2. At the same time, we use the following perf command to capture IPC
>>> information:
>>>
>>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>>
>>> 1) Before applying this patchset:
>>>
>>> Performance counter stats for 'system wide' (5 runs):
>>>
>>> 454187219766      cycles          test                         ( +-  1.84% )
>>> 78896433101       instructions    test    # 0.17 insn per cycle    ( +-  0.44% )
>>>
>>> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
>>>
>>> 2) After applying this patchset:
>>>
>>> Performance counter stats for 'system wide' (5 runs):
>>>
>>> 841954709443      cycles          test                         ( +- 15.80% )  (98.69%)
>>> 527258677936      instructions    test    # 0.63 insn per cycle    ( +- 15.11% )  (98.68%)
>>>
>>> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
>>>
>>> We can see that IPC drops very seriously when calling
>>> down_read_trylock() at high frequency. After using SRCU,
>>> the IPC is at a normal level.
>>
>> The results you present do show improvement in IPC for an artificial test
>> script. But more interesting would be to see how real world workloads
>> benefit from your changes.
>
> One of the real workloads from my experience is the start of an overcommitted node
> containing many starting containers after a node crash (or many resuming containers
> after a reboot for a kernel update). In these cases memory pressure is huge, and
> the node keeps going through long rounds of reclaim.

Thanks a lot for providing this real workload! :)

>
> This patchset makes prealloc_memcg_shrinker() independent of do_shrink_slab(),
> so prealloc_memcg_shrinker() won't have to wait until shrink_slab_memcg() completes
> its current bit iteration, sees rwsem_is_contended() and breaks out of the iteration.
>
> Also, it's important to mention that currently we have the strange behavior:
>
> prealloc_memcg_shrinker()
> down_write(&shrinker_rwsem)
> idr_alloc()
> reclaim
> for each child memcg
> shrink_slab_memcg()
> down_read_trylock(&shrinker_rwsem) -> fail
>
> All the slab reclaim in this scenario is just parasitic work that merely wastes
> our cpu time, which does not look like a good design.
>
> Kirill

--
Thanks,
Qi

2023-02-28 10:11:48

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] make slab shrink lockless



On 2023/2/28 03:02, Roman Gushchin wrote:
> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>
>>
>> On 2023/2/27 03:51, Andrew Morton wrote:
>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <[email protected]> wrote:
>>>
>> Save the above script, then run test and touch commands.
>>
>> Then we can use the following perf command to view hotspots:
>>
>> perf top -U -F 999
>>
>> 1) Before applying this patchset:
>>
>> 32.31% [kernel] [k] down_read_trylock
>> 19.40% [kernel] [k] pv_native_safe_halt
>> 16.24% [kernel] [k] up_read
>> 15.70% [kernel] [k] shrink_slab
>> 4.69% [kernel] [k] _find_next_bit
>> 2.62% [kernel] [k] shrink_node
>> 1.78% [kernel] [k] shrink_lruvec
>> 0.76% [kernel] [k] do_shrink_slab
>>
>> 2) After applying this patchset:
>>
>> 27.83% [kernel] [k] _find_next_bit
>> 16.97% [kernel] [k] shrink_slab
>> 15.82% [kernel] [k] pv_native_safe_halt
>> 9.58% [kernel] [k] shrink_node
>> 8.31% [kernel] [k] shrink_lruvec
>> 5.64% [kernel] [k] do_shrink_slab
>> 3.88% [kernel] [k] mem_cgroup_iter
>
> Not opposing the intention of the patchset in any way (I actually think
> it's a good idea to make the shrinkers list lockless), but looking at
> both outputs above I think that the main problem is not the contention on
> the semaphore, but the reason of this contention.

Yes, in the above scenario, there is indeed no lock contention problem.

>
> It seems like often there is a long list of shrinkers which barely
> can reclaim any memory, but we're calling them again and again.
> In order to achieve real wins with real-life workloads, I guess
> it's what we should optimize.
>
> Thanks!

--
Thanks,
Qi

2023-02-28 10:55:30

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] make slab shrink lockless



On 2023/2/28 18:04, Qi Zheng wrote:
>
>
> On 2023/2/27 23:08, Mike Rapoport wrote:
>> Hi,
>>
>> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>>
>>>
>>> On 2023/2/27 03:51, Andrew Morton wrote:
>>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> This patch series aims to make slab shrink lockless.
>>>>
>>>> What an awesome changelog.
>>>>
>>>>> 2. Survey
>>>>> =========
>>>>
>>>> Especially this part.
>>>>
>>>> Looking through all the prior efforts and at this patchset I am not
>>>> immediately seeing any statements about the overall effect upon
>>>> real-world workloads.  For a good example, does this patchset
>>>> measurably improve throughput or energy consumption on your servers?
>>>
>>> Hi Andrew,
>>>
>>> I re-tested with the following physical machines:
>>>
>>> Architecture:        x86_64
>>> CPU(s):              96
>>> On-line CPU(s) list: 0-95
>>> Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>>
>>> I found that the reason for the hotspot I described in the cover letter is
>>> wrong. The reason for the down_read_trylock() hotspot is not because of
>>> the failure to trylock, but simply because of the atomic operation
>>> (cmpxchg). And this will lead to a significant reduction in IPC (insn
>>> per cycle).
>>
>> ...
>>> Then we can use the following perf command to view hotspots:
>>>
>>> perf top -U -F 999
>>>
>>> 1) Before applying this patchset:
>>>
>>>    32.31%  [kernel]           [k] down_read_trylock
>>>    19.40%  [kernel]           [k] pv_native_safe_halt
>>>    16.24%  [kernel]           [k] up_read
>>>    15.70%  [kernel]           [k] shrink_slab
>>>     4.69%  [kernel]           [k] _find_next_bit
>>>     2.62%  [kernel]           [k] shrink_node
>>>     1.78%  [kernel]           [k] shrink_lruvec
>>>     0.76%  [kernel]           [k] do_shrink_slab
>>>
>>> 2) After applying this patchset:
>>>
>>>    27.83%  [kernel]           [k] _find_next_bit
>>>    16.97%  [kernel]           [k] shrink_slab
>>>    15.82%  [kernel]           [k] pv_native_safe_halt
>>>     9.58%  [kernel]           [k] shrink_node
>>>     8.31%  [kernel]           [k] shrink_lruvec
>>>     5.64%  [kernel]           [k] do_shrink_slab
>>>     3.88%  [kernel]           [k] mem_cgroup_iter
>>>
>>> 2. At the same time, we use the following perf command to capture IPC
>>> information:
>>>
>>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>>
>>> 1) Before applying this patchset:
>>>
>>>   Performance counter stats for 'system wide' (5 runs):
>>>
>>>        454187219766      cycles                    test      ( +-  1.84% )
>>>         78896433101      instructions              test      #    0.17  insn per cycle   ( +-  0.44% )
>>>
>>>          10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
>>>
>>> 2) After applying this patchset:
>>>
>>>   Performance counter stats for 'system wide' (5 runs):
>>>
>>>        841954709443      cycles                    test      ( +- 15.80% )  (98.69%)
>>>        527258677936      instructions              test      #    0.63  insn per cycle   ( +- 15.11% )  (98.68%)
>>>
>>>            10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
>>>
>>> We can see that IPC drops very seriously when calling
>>> down_read_trylock() at high frequency. After using SRCU,
>>> the IPC is at a normal level.
>>
>> The results you present do show improvement in IPC for an artificial test
>> script. But more interesting would be to see how real world workloads
>> benefit from your changes.
>
> Hi Mike and Andrew,
>
> I did encounter this problem under the real workload of our online
> servers. At the end of this email, I posted another call stack and
> hotspot that I found before.
>
> I scanned the hotspots of all our online servers yesterday and today,
> but unfortunately did not find a live environment hitting this hotspot.
>
> Some of our servers have a large number of containers, and each
> container will mount some file systems. This is likely to trigger
> down_read_trylock() hotspots when the memory pressure of the whole
> machine or the memory pressure of memcg is high.

And the servers where this hotspot has occurred (we have hotspot alarm
records) basically have 96 cores, 128 cores, or even more.

>
> So yesterday I found a physical server with a configuration similar to the
> online servers and ran a simulation test. The call stack and the hotspot in
> the simulation test are almost exactly the same, so in theory, when such a
> hotspot appears on an online server, we can also enjoy the IPC improvement.
> This will improve server performance in memory exhaustion scenarios (at the
> memcg or global level).
>
> And the above scenario is only one aspect; the other aspect is the lock
> contention scenario mentioned by Kirill. After applying this patchset,
> slab shrink and register_shrinker() can run completely in parallel,
> which fixes that problem.
>
> These are the two main benefits for real workloads that I consider.
>
> Thanks,
> Qi
>
> call stack
> ----------
>
> @[
>     down_read_trylock+1
>     shrink_slab+128
>     shrink_node+371
>     do_try_to_free_pages+232
>     try_to_free_pages+243
>     __alloc_pages_slowpath+771
>     __alloc_pages_nodemask+702
>     pagecache_get_page+255
>     filemap_fault+1361
>     ext4_filemap_fault+44
>     __do_fault+76
>     handle_mm_fault+3543
>     do_user_addr_fault+442
>     do_page_fault+48
>     page_fault+62
> ]: 1161690
> @[
>     down_read_trylock+1
>     shrink_slab+128
>     shrink_node+371
>     balance_pgdat+690
>     kswapd+389
>     kthread+246
>     ret_from_fork+31
> ]: 8424884
> @[
>     down_read_trylock+1
>     shrink_slab+128
>     shrink_node+371
>     do_try_to_free_pages+232
>     try_to_free_pages+243
>     __alloc_pages_slowpath+771
>     __alloc_pages_nodemask+702
>     __do_page_cache_readahead+244
>     filemap_fault+1674
>     ext4_filemap_fault+44
>     __do_fault+76
>     handle_mm_fault+3543
>     do_user_addr_fault+442
>     do_page_fault+48
>     page_fault+62
> ]: 20917631
>
> hotspot
> -------
>
> 52.22% [kernel]        [k] down_read_trylock
> 19.60% [kernel]        [k] up_read
>  8.86% [kernel]        [k] shrink_slab
>  2.44% [kernel]        [k] idr_find
>  1.25% [kernel]        [k] count_shadow_nodes
>  1.18% [kernel]        [k] shrink_lruvec
>  0.71% [kernel]        [k] mem_cgroup_iter
>  0.71% [kernel]        [k] shrink_node
>  0.55% [kernel]        [k] find_next_bit
>
>
>>> Thanks,
>>> Qi
>>
>

--
Thanks,
Qi

2023-02-28 18:41:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] make slab shrink lockless

On Mon 27-02-23 17:08:30, Mike Rapoport wrote:
[...]
> The results you present do show improvement in IPC for an artificial test
> script. But more interesting would be to see how real world workloads
> benefit from your changes.

Quite some time ago (2018ish) we saw a bug report where mount got stalled
when racing with memory reclaim. This was nasty because the said mount was
part of the login chain, and users simply had to wait a long time to get
logged in in that particular deployment.

The mount was blocked on a shrinker registration and the reclaim was
stalled in a slab shrinker IIRC. I do not remember all the details, but
the underlying problem was that a shrinker callback took a long time
because there were too many objects to scan or it had to sync with other
fs operations. I believe we ended up using Minchan's break-out from slab
shrinking when the shrinker semaphore was contended, and that helped to
some degree, but there were still some corner cases where a single slab
shrinker could take a noticeable amount of time.

In general, taking a "big" lock like shrinker_rwsem from reclaim and
potentially blocking many unrelated subsystems that just want to register
or unregister shrinkers is a source of hard-to-predict problems. So this
is a very welcome change.
--
Michal Hocko
SUSE Labs

2023-03-01 02:28:08

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] make slab shrink lockless



On 2023/3/1 02:40, Michal Hocko wrote:
> On Mon 27-02-23 17:08:30, Mike Rapoport wrote:
> [...]
>> The results you present do show improvement in IPC for an artificial test
>> script. But more interesting would be to see how real world workloads
>> benefit from your changes.
>
> Quite some time ago (2018ish) we saw a bug report where mount got stalled
> when racing with memory reclaim. This was nasty because the said mount was
> part of the login chain, and users simply had to wait a long time to get
> logged in in that particular deployment.
>
> The mount was blocked on a shrinker registration and the reclaim was
> stalled in a slab shrinker IIRC. I do not remember all the details, but
> the underlying problem was that a shrinker callback took a long time
> because there were too many objects to scan or it had to sync with other
> fs operations. I believe we ended up using Minchan's break-out from slab
> shrinking when the shrinker semaphore was contended, and that helped to
> some degree, but there were still some corner cases where a single slab
> shrinker could take a noticeable amount of time.
>
> In general, taking a "big" lock like shrinker_rwsem from reclaim and
> potentially blocking many unrelated subsystems that just want to register
> or unregister shrinkers is a source of hard-to-predict problems. So this
> is a very welcome change.

Totally agree. :)

Thanks,
Qi