2023-06-22 08:56:22

by Qi Zheng

Subject: [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink

Hi all,

1. Background
=============

We used to implement the lockless slab shrink with SRCU [1], but then the kernel
test robot reported a -88.8% regression in the stress-ng.ramfs.ops_per_sec test
case [2], so we reverted it [3].

This patch series aims to re-implement the lockless slab shrink using the
refcount+RCU method proposed by Dave Chinner [4].

[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/
[3]. https://lore.kernel.org/all/[email protected]/
[4]. https://lore.kernel.org/lkml/[email protected]/

2. Implementation
=================

Currently, the shrinker instances can be divided into the following three types:

a) a global shrinker instance statically defined in the kernel, such as
workingset_shadow_shrinker.

b) a global shrinker instance statically defined in a kernel module, such as
mmu_shrinker in x86.

c) a shrinker instance embedded in another structure.

For *case a*, the memory of the shrinker instance is never freed. For *case b*,
the memory of the shrinker instance is freed after the module is unloaded, but
free_module() already calls synchronize_rcu() to wait for RCU read-side critical
sections to exit. For *case c*, we need to dynamically allocate these shrinker
instances so that the memory of a shrinker instance can be freed asynchronously
by calling kfree_rcu(). We can then use rcu_read_{lock,unlock}() to ensure that
a shrinker instance stays valid while it is being used.
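
The conversion for *case c* then follows the same pattern in each driver. Below
is a minimal sketch of that pattern; the "foo" driver and its foo_count()/foo_scan()
callbacks are hypothetical, the real conversions are in the patches below (e.g.
dm-zoned-meta, md-bcache, xfs-buf):

```
struct foo {
	/* was: struct shrinker shrinker; */
	struct shrinker *shrinker;
};

static int foo_register_shrinker(struct foo *foo)
{
	/* foo_count()/foo_scan() are the usual ->count_objects/->scan_objects
	 * callbacks; they now get foo back via shrinker->private_data instead
	 * of container_of(). */
	foo->shrinker = shrinker_alloc_and_init(foo_count, foo_scan,
						0, DEFAULT_SEEKS, 0, foo);
	if (!foo->shrinker)
		return -ENOMEM;

	if (register_shrinker(foo->shrinker, "foo-shrinker")) {
		shrinker_free(foo->shrinker);
		return -ENOMEM;
	}
	return 0;
}

static void foo_unregister_shrinker(struct foo *foo)
{
	/* Returns without waiting for an RCU grace period. */
	unregister_and_free_shrinker(foo->shrinker);
}
```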

The shrinker::refcount mechanism ensures that a shrinker instance will not be
run again after unregistration, so the structure that holds the pointer to the
shrinker instance can be freed safely without waiting for the RCU read-side
critical section to end.
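
The refcount and a completion are added to struct shrinker in patch 23, which
is not quoted in this excerpt. One plausible shape for the helpers is the
following assumed sketch, matching how register_shrinker_prepared() initializes
refcount/completion_wait and how unregister_shrinker() waits on the completion
in the mm/shrinker.c code of patch 29 below:

```
/* Assumed sketch; the real helpers are introduced in patches 23~25 and may
 * differ in detail. */
static inline bool shrinker_try_get(struct shrinker *shrinker)
{
	return refcount_inc_not_zero(&shrinker->refcount);
}

static inline void shrinker_put(struct shrinker *shrinker)
{
	if (refcount_dec_and_test(&shrinker->refcount))
		complete(&shrinker->completion_wait);
}
```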

In this way, we implement the lockless slab shrink without having to block in
unregister_shrinker() waiting for RCU read-side critical sections to complete.
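
With that in place, the global shrink path walks shrinker_list under RCU and
pins each shrinker with the refcount while calling into it. The sketch below is
simplified from shrink_slab() as it reads after this series (see mm/shrinker.c
in patch 29 below); declarations are shown, the memcg-aware path is omitted:

```
/* Simplified from shrink_slab() in patch 29; shrink_slab_memcg() is omitted. */
unsigned long ret, freed = 0;
struct shrinker *shrinker;

rcu_read_lock();
list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
	struct shrink_control sc = {
		.gfp_mask = gfp_mask,
		.nid = nid,
		.memcg = memcg,
	};

	/* Skip shrinkers that are already being unregistered. */
	if (!shrinker_try_get(shrinker))
		continue;
	rcu_read_unlock();

	ret = do_shrink_slab(&sc, shrinker, priority);
	if (ret == SHRINK_EMPTY)
		ret = 0;
	freed += ret;

	rcu_read_lock();
	shrinker_put(shrinker);
}
rcu_read_unlock();
```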

PATCH 1 ~ 2: infrastructure for dynamically allocating shrinker instances
PATCH 3 ~ 21: dynamically allocate the shrinker instances in case c
PATCH 22: introduce pool_shrink_rwsem to implement private synchronize_shrinkers()
PATCH 23 ~ 28: implement the lockless slab shrink
PATCH 29: move shrinker-related code into a separate file

3. Testing
==========

3.1 slab shrink stress test
---------------------------

We can reproduce the down_read_trylock() hotspot through the following script:

```
#!/bin/bash

DIR="/root/shrinker/memcg/mnt"

do_create()
{
	mkdir -p /sys/fs/cgroup/memory/test
	mkdir -p /sys/fs/cgroup/perf_event/test
	echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
	for i in `seq 0 $1`;
	do
		mkdir -p /sys/fs/cgroup/memory/test/$i;
		echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
		echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
		mkdir -p $DIR/$i;
	done
}

do_mount()
{
	for i in `seq $1 $2`;
	do
		mount -t tmpfs $i $DIR/$i;
	done
}

do_touch()
{
	for i in `seq $1 $2`;
	do
		echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
		echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
		dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
	done
}

case "$1" in
touch)
	do_touch $2 $3
	;;
test)
	do_create 4000
	do_mount 0 4000
	do_touch 0 3000
	;;
*)
	exit 1
	;;
esac
```

Save the above script, then run its test and touch commands. We can then use
the following perf command to view the hotspots:

perf top -U -F 999 [-g]

1) Before applying this patchset:

35.34% [kernel] [k] down_read_trylock
18.44% [kernel] [k] shrink_slab
15.98% [kernel] [k] pv_native_safe_halt
15.08% [kernel] [k] up_read
5.33% [kernel] [k] idr_find
2.71% [kernel] [k] _find_next_bit
2.21% [kernel] [k] shrink_node
1.29% [kernel] [k] shrink_lruvec
0.66% [kernel] [k] do_shrink_slab
0.33% [kernel] [k] list_lru_count_one
0.33% [kernel] [k] __radix_tree_lookup
0.25% [kernel] [k] mem_cgroup_iter

- 82.19% 19.49% [kernel] [k] shrink_slab
- 62.00% shrink_slab
36.37% down_read_trylock
15.52% up_read
5.48% idr_find
3.38% _find_next_bit
+ 0.98% do_shrink_slab

2) After applying this patchset:

46.83% [kernel] [k] shrink_slab
20.52% [kernel] [k] pv_native_safe_halt
8.85% [kernel] [k] do_shrink_slab
7.71% [kernel] [k] _find_next_bit
1.72% [kernel] [k] xas_descend
1.70% [kernel] [k] shrink_node
1.44% [kernel] [k] shrink_lruvec
1.43% [kernel] [k] mem_cgroup_iter
1.28% [kernel] [k] xas_load
0.89% [kernel] [k] super_cache_count
0.84% [kernel] [k] xas_start
0.66% [kernel] [k] list_lru_count_one

- 65.50% 40.44% [kernel] [k] shrink_slab
- 22.96% shrink_slab
13.11% _find_next_bit
- 9.91% do_shrink_slab
- 1.59% super_cache_count
0.92% list_lru_count_one

We can see that the top perf hotspot is now shrink_slab, which is what we
expect.

3.2 registration and unregistration stress test
-----------------------------------------------

Run the command below to test:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
(secs) (secs) (secs) (real time) (usr+sys time)
ramfs 880623 60.02 7.71 226.93 14671.45 3753.09
ramfs:
1 System Management Interrupt
for a 60.03s run time:
5762.40s available CPU time
7.71s user time ( 0.13%)
226.93s system time ( 3.94%)
234.64s total time ( 4.07%)
load average: 8.54 3.06 2.11
passed: 9: ramfs (9)
failed: 0
skipped: 0
successful run completed in 60.03s (1 min, 0.03 secs)

2) After applying this patchset:

setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
(secs) (secs) (secs) (real time) (usr+sys time)
ramfs 847562 60.02 7.44 230.22 14120.66 3566.23
ramfs:
4 System Management Interrupts
for a 60.12s run time:
5771.95s available CPU time
7.44s user time ( 0.13%)
230.22s system time ( 3.99%)
237.66s total time ( 4.12%)
load average: 8.18 2.43 0.84
passed: 9: ramfs (9)
failed: 0
skipped: 0
successful run completed in 60.12s (1 min, 0.12 secs)

We can see that the ops/s has hardly changed.

This series is based on next-20230613.

Comments and suggestions are welcome.

Thanks,
Qi.

Qi Zheng (29):
mm: shrinker: add shrinker::private_data field
mm: vmscan: introduce some helpers for dynamically allocating shrinker
drm/i915: dynamically allocate the i915_gem_mm shrinker
drm/msm: dynamically allocate the drm-msm_gem shrinker
drm/panfrost: dynamically allocate the drm-panfrost shrinker
dm: dynamically allocate the dm-bufio shrinker
dm zoned: dynamically allocate the dm-zoned-meta shrinker
md/raid5: dynamically allocate the md-raid5 shrinker
bcache: dynamically allocate the md-bcache shrinker
vmw_balloon: dynamically allocate the vmw-balloon shrinker
virtio_balloon: dynamically allocate the virtio-balloon shrinker
mbcache: dynamically allocate the mbcache shrinker
ext4: dynamically allocate the ext4-es shrinker
jbd2,ext4: dynamically allocate the jbd2-journal shrinker
NFSD: dynamically allocate the nfsd-client shrinker
NFSD: dynamically allocate the nfsd-reply shrinker
xfs: dynamically allocate the xfs-buf shrinker
xfs: dynamically allocate the xfs-inodegc shrinker
xfs: dynamically allocate the xfs-qm shrinker
zsmalloc: dynamically allocate the mm-zspool shrinker
fs: super: dynamically allocate the s_shrink
drm/ttm: introduce pool_shrink_rwsem
mm: shrinker: add refcount and completion_wait fields
mm: vmscan: make global slab shrink lockless
mm: vmscan: make memcg slab shrink lockless
mm: shrinker: make count and scan in shrinker debugfs lockless
mm: vmscan: hold write lock to reparent shrinker nr_deferred
mm: shrinkers: convert shrinker_rwsem to mutex
mm: shrinker: move shrinker-related code into a separate file

drivers/gpu/drm/i915/gem/i915_gem_shrinker.c | 27 +-
drivers/gpu/drm/i915/i915_drv.h | 3 +-
drivers/gpu/drm/msm/msm_drv.h | 2 +-
drivers/gpu/drm/msm/msm_gem_shrinker.c | 25 +-
drivers/gpu/drm/panfrost/panfrost_device.h | 2 +-
.../gpu/drm/panfrost/panfrost_gem_shrinker.c | 24 +-
drivers/gpu/drm/ttm/ttm_pool.c | 15 +
drivers/md/bcache/bcache.h | 2 +-
drivers/md/bcache/btree.c | 23 +-
drivers/md/bcache/sysfs.c | 2 +-
drivers/md/dm-bufio.c | 23 +-
drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
drivers/md/dm-zoned-metadata.c | 25 +-
drivers/md/raid5.c | 28 +-
drivers/md/raid5.h | 2 +-
drivers/misc/vmw_balloon.c | 16 +-
drivers/virtio/virtio_balloon.c | 26 +-
fs/btrfs/super.c | 2 +-
fs/ext4/ext4.h | 2 +-
fs/ext4/extents_status.c | 21 +-
fs/jbd2/journal.c | 32 +-
fs/kernfs/mount.c | 2 +-
fs/mbcache.c | 39 +-
fs/nfsd/netns.h | 4 +-
fs/nfsd/nfs4state.c | 20 +-
fs/nfsd/nfscache.c | 33 +-
fs/proc/root.c | 2 +-
fs/super.c | 40 +-
fs/xfs/xfs_buf.c | 25 +-
fs/xfs/xfs_buf.h | 2 +-
fs/xfs/xfs_icache.c | 27 +-
fs/xfs/xfs_mount.c | 4 +-
fs/xfs/xfs_mount.h | 2 +-
fs/xfs/xfs_qm.c | 24 +-
fs/xfs/xfs_qm.h | 2 +-
include/linux/fs.h | 2 +-
include/linux/jbd2.h | 2 +-
include/linux/shrinker.h | 35 +-
mm/Makefile | 4 +-
mm/shrinker.c | 750 ++++++++++++++++++
mm/shrinker_debug.c | 26 +-
mm/vmscan.c | 702 ----------------
mm/zsmalloc.c | 28 +-
44 files changed, 1128 insertions(+), 953 deletions(-)
create mode 100644 mm/shrinker.c

--
2.30.2



2023-06-22 08:57:12

by Qi Zheng

Subject: [PATCH 07/29] dm zoned: dynamically allocate the dm-zoned-meta shrinker

In preparation for implementing lockless slab shrink, dynamically
allocate the dm-zoned-meta shrinker so that it can be freed
asynchronously using kfree_rcu(). This way we don't need to wait for
the RCU read-side critical section when releasing the struct
dmz_metadata.

Signed-off-by: Qi Zheng <[email protected]>
---
drivers/md/dm-zoned-metadata.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
index 9d3cca8e3dc9..41b10ffb968a 100644
--- a/drivers/md/dm-zoned-metadata.c
+++ b/drivers/md/dm-zoned-metadata.c
@@ -187,7 +187,7 @@ struct dmz_metadata {
struct rb_root mblk_rbtree;
struct list_head mblk_lru_list;
struct list_head mblk_dirty_list;
- struct shrinker mblk_shrinker;
+ struct shrinker *mblk_shrinker;

/* Zone allocation management */
struct mutex map_lock;
@@ -615,7 +615,7 @@ static unsigned long dmz_shrink_mblock_cache(struct dmz_metadata *zmd,
static unsigned long dmz_mblock_shrinker_count(struct shrinker *shrink,
struct shrink_control *sc)
{
- struct dmz_metadata *zmd = container_of(shrink, struct dmz_metadata, mblk_shrinker);
+ struct dmz_metadata *zmd = shrink->private_data;

return atomic_read(&zmd->nr_mblks);
}
@@ -626,7 +626,7 @@ static unsigned long dmz_mblock_shrinker_count(struct shrinker *shrink,
static unsigned long dmz_mblock_shrinker_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
- struct dmz_metadata *zmd = container_of(shrink, struct dmz_metadata, mblk_shrinker);
+ struct dmz_metadata *zmd = shrink->private_data;
unsigned long count;

spin_lock(&zmd->mblk_lock);
@@ -2936,17 +2936,22 @@ int dmz_ctr_metadata(struct dmz_dev *dev, int num_dev,
*/
zmd->min_nr_mblks = 2 + zmd->nr_map_blocks + zmd->zone_nr_bitmap_blocks * 16;
zmd->max_nr_mblks = zmd->min_nr_mblks + 512;
- zmd->mblk_shrinker.count_objects = dmz_mblock_shrinker_count;
- zmd->mblk_shrinker.scan_objects = dmz_mblock_shrinker_scan;
- zmd->mblk_shrinker.seeks = DEFAULT_SEEKS;
+
+ zmd->mblk_shrinker = shrinker_alloc_and_init(dmz_mblock_shrinker_count,
+ dmz_mblock_shrinker_scan,
+ 0, DEFAULT_SEEKS, 0, zmd);
+ if (!zmd->mblk_shrinker) {
+ dmz_zmd_err(zmd, "allocate metadata cache shrinker failed");
+ goto err;
+ }

/* Metadata cache shrinker */
- ret = register_shrinker(&zmd->mblk_shrinker, "dm-zoned-meta:(%u:%u)",
+ ret = register_shrinker(zmd->mblk_shrinker, "dm-zoned-meta:(%u:%u)",
MAJOR(dev->bdev->bd_dev),
MINOR(dev->bdev->bd_dev));
if (ret) {
dmz_zmd_err(zmd, "Register metadata cache shrinker failed");
- goto err;
+ goto err_shrinker;
}

dmz_zmd_info(zmd, "DM-Zoned metadata version %d", zmd->sb_version);
@@ -2982,6 +2987,8 @@ int dmz_ctr_metadata(struct dmz_dev *dev, int num_dev,
*metadata = zmd;

return 0;
+err_shrinker:
+ shrinker_free(zmd->mblk_shrinker);
err:
dmz_cleanup_metadata(zmd);
kfree(zmd);
@@ -2995,7 +3002,7 @@ int dmz_ctr_metadata(struct dmz_dev *dev, int num_dev,
*/
void dmz_dtr_metadata(struct dmz_metadata *zmd)
{
- unregister_shrinker(&zmd->mblk_shrinker);
+ unregister_and_free_shrinker(zmd->mblk_shrinker);
dmz_cleanup_metadata(zmd);
kfree(zmd);
}
--
2.30.2


2023-06-22 08:57:54

by Qi Zheng

Subject: [PATCH 05/29] drm/panfrost: dynamically allocate the drm-panfrost shrinker

In preparation for implementing lockless slab shrink, dynamically
allocate the drm-panfrost shrinker so that it can be freed
asynchronously using kfree_rcu(). This way we don't need to wait for
the RCU read-side critical section when releasing the struct
panfrost_device.

Signed-off-by: Qi Zheng <[email protected]>
---
drivers/gpu/drm/panfrost/panfrost_device.h | 2 +-
.../gpu/drm/panfrost/panfrost_gem_shrinker.c | 24 ++++++++++---------
2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index b0126b9fbadc..e667e5689353 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -118,7 +118,7 @@ struct panfrost_device {

struct mutex shrinker_lock;
struct list_head shrinker_list;
- struct shrinker shrinker;
+ struct shrinker *shrinker;

struct panfrost_devfreq pfdevfreq;
};
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
index bf0170782f25..2a5513eb9e1f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
@@ -18,8 +18,7 @@
static unsigned long
panfrost_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
{
- struct panfrost_device *pfdev =
- container_of(shrinker, struct panfrost_device, shrinker);
+ struct panfrost_device *pfdev = shrinker->private_data;
struct drm_gem_shmem_object *shmem;
unsigned long count = 0;

@@ -65,8 +64,7 @@ static bool panfrost_gem_purge(struct drm_gem_object *obj)
static unsigned long
panfrost_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
{
- struct panfrost_device *pfdev =
- container_of(shrinker, struct panfrost_device, shrinker);
+ struct panfrost_device *pfdev = shrinker->private_data;
struct drm_gem_shmem_object *shmem, *tmp;
unsigned long freed = 0;

@@ -100,10 +98,15 @@ panfrost_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
void panfrost_gem_shrinker_init(struct drm_device *dev)
{
struct panfrost_device *pfdev = dev->dev_private;
- pfdev->shrinker.count_objects = panfrost_gem_shrinker_count;
- pfdev->shrinker.scan_objects = panfrost_gem_shrinker_scan;
- pfdev->shrinker.seeks = DEFAULT_SEEKS;
- WARN_ON(register_shrinker(&pfdev->shrinker, "drm-panfrost"));
+
+ pfdev->shrinker = shrinker_alloc_and_init(panfrost_gem_shrinker_count,
+ panfrost_gem_shrinker_scan, 0,
+ DEFAULT_SEEKS, 0, pfdev);
+ if (pfdev->shrinker &&
+ register_shrinker(pfdev->shrinker, "drm-panfrost")) {
+ shrinker_free(pfdev->shrinker);
+ WARN_ON(1);
+ }
}

/**
@@ -116,7 +119,6 @@ void panfrost_gem_shrinker_cleanup(struct drm_device *dev)
{
struct panfrost_device *pfdev = dev->dev_private;

- if (pfdev->shrinker.nr_deferred) {
- unregister_shrinker(&pfdev->shrinker);
- }
+ if (pfdev->shrinker->nr_deferred)
+ unregister_and_free_shrinker(pfdev->shrinker);
}
--
2.30.2


2023-06-22 08:58:33

by Qi Zheng

Subject: [PATCH 09/29] bcache: dynamically allocate the md-bcache shrinker

In preparation for implementing lockless slab shrink, dynamically
allocate the md-bcache shrinker so that it can be freed
asynchronously using kfree_rcu(). This way we don't need to wait for
the RCU read-side critical section when releasing the struct
cache_set.

Signed-off-by: Qi Zheng <[email protected]>
---
drivers/md/bcache/bcache.h | 2 +-
drivers/md/bcache/btree.c | 23 ++++++++++++++---------
drivers/md/bcache/sysfs.c | 2 +-
3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 700dc5588d5f..53c73b372e7a 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -541,7 +541,7 @@ struct cache_set {
struct bio_set bio_split;

/* For the btree cache */
- struct shrinker shrink;
+ struct shrinker *shrink;

/* For the btree cache and anything allocation related */
struct mutex bucket_lock;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 569f48958bde..1131ae91f62a 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -667,7 +667,7 @@ static int mca_reap(struct btree *b, unsigned int min_order, bool flush)
static unsigned long bch_mca_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
- struct cache_set *c = container_of(shrink, struct cache_set, shrink);
+ struct cache_set *c = shrink->private_data;
struct btree *b, *t;
unsigned long i, nr = sc->nr_to_scan;
unsigned long freed = 0;
@@ -734,7 +734,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
static unsigned long bch_mca_count(struct shrinker *shrink,
struct shrink_control *sc)
{
- struct cache_set *c = container_of(shrink, struct cache_set, shrink);
+ struct cache_set *c = shrink->private_data;

if (c->shrinker_disabled)
return 0;
@@ -752,8 +752,8 @@ void bch_btree_cache_free(struct cache_set *c)

closure_init_stack(&cl);

- if (c->shrink.list.next)
- unregister_shrinker(&c->shrink);
+ if (c->shrink->list.next)
+ unregister_and_free_shrinker(c->shrink);

mutex_lock(&c->bucket_lock);

@@ -828,14 +828,19 @@ int bch_btree_cache_alloc(struct cache_set *c)
c->verify_data = NULL;
#endif

- c->shrink.count_objects = bch_mca_count;
- c->shrink.scan_objects = bch_mca_scan;
- c->shrink.seeks = 4;
- c->shrink.batch = c->btree_pages * 2;
+ c->shrink = shrinker_alloc_and_init(bch_mca_count, bch_mca_scan,
+ c->btree_pages * 2, 4, 0, c);
+ if (!c->shrink) {
+ pr_warn("bcache: %s: could not allocate shrinker\n",
+ __func__);
+ return -ENOMEM;
+ }

- if (register_shrinker(&c->shrink, "md-bcache:%pU", c->set_uuid))
+ if (register_shrinker(c->shrink, "md-bcache:%pU", c->set_uuid)) {
pr_warn("bcache: %s: could not register shrinker\n",
__func__);
+ shrinker_free(c->shrink);
+ }

return 0;
}
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index c6f677059214..771577581f52 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -866,7 +866,7 @@ STORE(__bch_cache_set)

sc.gfp_mask = GFP_KERNEL;
sc.nr_to_scan = strtoul_or_return(buf);
- c->shrink.scan_objects(&c->shrink, &sc);
+ c->shrink->scan_objects(c->shrink, &sc);
}

sysfs_strtoul_clamp(congested_read_threshold_us,
--
2.30.2


2023-06-22 08:59:04

by Qi Zheng

Subject: [PATCH 11/29] virtio_balloon: dynamically allocate the virtio-balloon shrinker

In preparation for implementing lockless slab shrink, dynamically
allocate the virtio-balloon shrinker so that it can be freed
asynchronously using kfree_rcu(). This way we don't need to wait for
the RCU read-side critical section when releasing the struct
virtio_balloon.

Signed-off-by: Qi Zheng <[email protected]>
---
drivers/virtio/virtio_balloon.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5b15936a5214..fa051bff8d90 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -111,7 +111,7 @@ struct virtio_balloon {
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];

/* Shrinker to return free pages - VIRTIO_BALLOON_F_FREE_PAGE_HINT */
- struct shrinker shrinker;
+ struct shrinker *shrinker;

/* OOM notifier to deflate on OOM - VIRTIO_BALLOON_F_DEFLATE_ON_OOM */
struct notifier_block oom_nb;
@@ -816,8 +816,7 @@ static unsigned long shrink_free_pages(struct virtio_balloon *vb,
static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
struct shrink_control *sc)
{
- struct virtio_balloon *vb = container_of(shrinker,
- struct virtio_balloon, shrinker);
+ struct virtio_balloon *vb = shrinker->private_data;

return shrink_free_pages(vb, sc->nr_to_scan);
}
@@ -825,8 +824,7 @@ static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
struct shrink_control *sc)
{
- struct virtio_balloon *vb = container_of(shrinker,
- struct virtio_balloon, shrinker);
+ struct virtio_balloon *vb = shrinker->private_data;

return vb->num_free_page_blocks * VIRTIO_BALLOON_HINT_BLOCK_PAGES;
}
@@ -847,16 +845,24 @@ static int virtio_balloon_oom_notify(struct notifier_block *nb,

static void virtio_balloon_unregister_shrinker(struct virtio_balloon *vb)
{
- unregister_shrinker(&vb->shrinker);
+ unregister_and_free_shrinker(vb->shrinker);
}

static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
{
- vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
- vb->shrinker.count_objects = virtio_balloon_shrinker_count;
- vb->shrinker.seeks = DEFAULT_SEEKS;
+ int ret;
+
+ vb->shrinker = shrinker_alloc_and_init(virtio_balloon_shrinker_count,
+ virtio_balloon_shrinker_scan,
+ 0, DEFAULT_SEEKS, 0, vb);
+ if (!vb->shrinker)
+ return -ENOMEM;
+
+ ret = register_shrinker(vb->shrinker, "virtio-balloon");
+ if (ret)
+ shrinker_free(vb->shrinker);

- return register_shrinker(&vb->shrinker, "virtio-balloon");
+ return ret;
}

static int virtballoon_probe(struct virtio_device *vdev)
--
2.30.2


2023-06-22 08:59:56

by Qi Zheng

Subject: [PATCH 13/29] ext4: dynamically allocate the ext4-es shrinker

In preparation for implementing lockless slab shrink, dynamically
allocate the ext4-es shrinker so that it can be freed
asynchronously using kfree_rcu(). This way we don't need to wait for
the RCU read-side critical section when releasing the struct
ext4_sb_info.

Signed-off-by: Qi Zheng <[email protected]>
---
fs/ext4/ext4.h | 2 +-
fs/ext4/extents_status.c | 21 ++++++++++++---------
2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0a2d55faa095..1bd150d454f5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1651,7 +1651,7 @@ struct ext4_sb_info {
__u32 s_csum_seed;

/* Reclaim extents from extent status tree */
- struct shrinker s_es_shrinker;
+ struct shrinker *s_es_shrinker;
struct list_head s_es_list; /* List of inodes with reclaimable extents */
long s_es_nr_inode;
struct ext4_es_stats s_es_stats;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 9b5b8951afb4..fea82339f4b4 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1596,7 +1596,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
unsigned long nr;
struct ext4_sb_info *sbi;

- sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
+ sbi = shrink->private_data;
nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
return nr;
@@ -1605,8 +1605,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
static unsigned long ext4_es_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
- struct ext4_sb_info *sbi = container_of(shrink,
- struct ext4_sb_info, s_es_shrinker);
+ struct ext4_sb_info *sbi = shrink->private_data;
int nr_to_scan = sc->nr_to_scan;
int ret, nr_shrunk;

@@ -1690,15 +1689,19 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
if (err)
goto err3;

- sbi->s_es_shrinker.scan_objects = ext4_es_scan;
- sbi->s_es_shrinker.count_objects = ext4_es_count;
- sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
- err = register_shrinker(&sbi->s_es_shrinker, "ext4-es:%s",
+ sbi->s_es_shrinker = shrinker_alloc_and_init(ext4_es_count, ext4_es_scan,
+ 0, DEFAULT_SEEKS, 0, sbi);
+ if (!sbi->s_es_shrinker)
+ goto err4;
+
+ err = register_shrinker(sbi->s_es_shrinker, "ext4-es:%s",
sbi->s_sb->s_id);
if (err)
- goto err4;
+ goto err5;

return 0;
+err5:
+ shrinker_free(sbi->s_es_shrinker);
err4:
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
err3:
@@ -1716,7 +1719,7 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
- unregister_shrinker(&sbi->s_es_shrinker);
+ unregister_and_free_shrinker(sbi->s_es_shrinker);
}

/*
--
2.30.2


2023-06-22 09:01:08

by Qi Zheng

Subject: [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker

In preparation for implementing lockless slab shrink, dynamically
allocate the nfsd-client shrinker so that it can be freed
asynchronously using kfree_rcu(). This way we don't need to wait for
the RCU read-side critical section when releasing the struct
nfsd_net.

Signed-off-by: Qi Zheng <[email protected]>
---
fs/nfsd/netns.h | 2 +-
fs/nfsd/nfs4state.c | 20 ++++++++++++--------
2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index ec49b200b797..f669444d5336 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -195,7 +195,7 @@ struct nfsd_net {
int nfs4_max_clients;

atomic_t nfsd_courtesy_clients;
- struct shrinker nfsd_client_shrinker;
+ struct shrinker *nfsd_client_shrinker;
struct work_struct nfsd_shrinker_work;
};

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 6e61fa3acaf1..a06184270548 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4388,8 +4388,7 @@ static unsigned long
nfsd4_state_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
{
int count;
- struct nfsd_net *nn = container_of(shrink,
- struct nfsd_net, nfsd_client_shrinker);
+ struct nfsd_net *nn = shrink->private_data;

count = atomic_read(&nn->nfsd_courtesy_clients);
if (!count)
@@ -8094,14 +8093,19 @@ static int nfs4_state_create_net(struct net *net)
INIT_WORK(&nn->nfsd_shrinker_work, nfsd4_state_shrinker_worker);
get_net(net);

- nn->nfsd_client_shrinker.scan_objects = nfsd4_state_shrinker_scan;
- nn->nfsd_client_shrinker.count_objects = nfsd4_state_shrinker_count;
- nn->nfsd_client_shrinker.seeks = DEFAULT_SEEKS;
-
- if (register_shrinker(&nn->nfsd_client_shrinker, "nfsd-client"))
+ nn->nfsd_client_shrinker = shrinker_alloc_and_init(nfsd4_state_shrinker_count,
+ nfsd4_state_shrinker_scan,
+ 0, DEFAULT_SEEKS, 0,
+ nn);
+ if (!nn->nfsd_client_shrinker)
goto err_shrinker;
+
+ if (register_shrinker(nn->nfsd_client_shrinker, "nfsd-client"))
+ goto err_register;
return 0;

+err_register:
+ shrinker_free(nn->nfsd_client_shrinker);
err_shrinker:
put_net(net);
kfree(nn->sessionid_hashtbl);
@@ -8197,7 +8201,7 @@ nfs4_state_shutdown_net(struct net *net)
struct list_head *pos, *next, reaplist;
struct nfsd_net *nn = net_generic(net, nfsd_net_id);

- unregister_shrinker(&nn->nfsd_client_shrinker);
+ unregister_and_free_shrinker(nn->nfsd_client_shrinker);
cancel_work(&nn->nfsd_shrinker_work);
cancel_delayed_work_sync(&nn->laundromat_work);
locks_end_grace(&nn->nfsd4_manager);
--
2.30.2


2023-06-22 09:01:47

by Qi Zheng

Subject: [PATCH 17/29] xfs: dynamically allocate the xfs-buf shrinker

In preparation for implementing lockless slab shrink, dynamically
allocate the xfs-buf shrinker so that it can be freed
asynchronously using kfree_rcu(). This way we don't need to wait for
the RCU read-side critical section when releasing the struct
xfs_buftarg.

Signed-off-by: Qi Zheng <[email protected]>
---
fs/xfs/xfs_buf.c | 25 ++++++++++++++-----------
fs/xfs/xfs_buf.h | 2 +-
2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 15d1e5a7c2d3..6657d285d26f 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1906,8 +1906,7 @@ xfs_buftarg_shrink_scan(
struct shrinker *shrink,
struct shrink_control *sc)
{
- struct xfs_buftarg *btp = container_of(shrink,
- struct xfs_buftarg, bt_shrinker);
+ struct xfs_buftarg *btp = shrink->private_data;
LIST_HEAD(dispose);
unsigned long freed;

@@ -1929,8 +1928,7 @@ xfs_buftarg_shrink_count(
struct shrinker *shrink,
struct shrink_control *sc)
{
- struct xfs_buftarg *btp = container_of(shrink,
- struct xfs_buftarg, bt_shrinker);
+ struct xfs_buftarg *btp = shrink->private_data;
return list_lru_shrink_count(&btp->bt_lru, sc);
}

@@ -1938,7 +1936,7 @@ void
xfs_free_buftarg(
struct xfs_buftarg *btp)
{
- unregister_shrinker(&btp->bt_shrinker);
+ unregister_and_free_shrinker(btp->bt_shrinker);
ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
percpu_counter_destroy(&btp->bt_io_count);
list_lru_destroy(&btp->bt_lru);
@@ -2021,15 +2019,20 @@ xfs_alloc_buftarg(
if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
goto error_lru;

- btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
- btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
- btp->bt_shrinker.seeks = DEFAULT_SEEKS;
- btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE;
- if (register_shrinker(&btp->bt_shrinker, "xfs-buf:%s",
- mp->m_super->s_id))
+ btp->bt_shrinker = shrinker_alloc_and_init(xfs_buftarg_shrink_count,
+ xfs_buftarg_shrink_scan,
+ 0, DEFAULT_SEEKS,
+ SHRINKER_NUMA_AWARE, btp);
+ if (!btp->bt_shrinker)
goto error_pcpu;
+
+ if (register_shrinker(btp->bt_shrinker, "xfs-buf:%s",
+ mp->m_super->s_id))
+ goto error_shrinker;
return btp;

+error_shrinker:
+ shrinker_free(btp->bt_shrinker);
error_pcpu:
percpu_counter_destroy(&btp->bt_io_count);
error_lru:
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 549c60942208..4e6969a675f7 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -102,7 +102,7 @@ typedef struct xfs_buftarg {
size_t bt_logical_sectormask;

/* LRU control structures */
- struct shrinker bt_shrinker;
+ struct shrinker *bt_shrinker;
struct list_lru bt_lru;

struct percpu_counter bt_io_count;
--
2.30.2


2023-06-22 09:02:31

by Qi Zheng

Subject: [PATCH 18/29] xfs: dynamically allocate the xfs-inodegc shrinker

In preparation for implementing lockless slab shrink, dynamically
allocate the xfs-inodegc shrinker so that it can be freed
asynchronously using kfree_rcu(). This way we don't need to wait for
the RCU read-side critical section when releasing the struct
xfs_mount.

Signed-off-by: Qi Zheng <[email protected]>
---
fs/xfs/xfs_icache.c | 27 ++++++++++++++++-----------
fs/xfs/xfs_mount.c | 4 ++--
fs/xfs/xfs_mount.h | 2 +-
3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 453890942d9f..1ef0c9fa57de 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -2225,8 +2225,7 @@ xfs_inodegc_shrinker_count(
struct shrinker *shrink,
struct shrink_control *sc)
{
- struct xfs_mount *mp = container_of(shrink, struct xfs_mount,
- m_inodegc_shrinker);
+ struct xfs_mount *mp = shrink->private_data;
struct xfs_inodegc *gc;
int cpu;

@@ -2247,8 +2246,7 @@ xfs_inodegc_shrinker_scan(
struct shrinker *shrink,
struct shrink_control *sc)
{
- struct xfs_mount *mp = container_of(shrink, struct xfs_mount,
- m_inodegc_shrinker);
+ struct xfs_mount *mp = shrink->private_data;
struct xfs_inodegc *gc;
int cpu;
bool no_items = true;
@@ -2284,13 +2282,20 @@ int
xfs_inodegc_register_shrinker(
struct xfs_mount *mp)
{
- struct shrinker *shrink = &mp->m_inodegc_shrinker;
+ int ret;

- shrink->count_objects = xfs_inodegc_shrinker_count;
- shrink->scan_objects = xfs_inodegc_shrinker_scan;
- shrink->seeks = 0;
- shrink->flags = SHRINKER_NONSLAB;
- shrink->batch = XFS_INODEGC_SHRINKER_BATCH;
+ mp->m_inodegc_shrinker =
+ shrinker_alloc_and_init(xfs_inodegc_shrinker_count,
+ xfs_inodegc_shrinker_scan,
+ XFS_INODEGC_SHRINKER_BATCH,
+ 0, SHRINKER_NONSLAB, mp);
+ if (!mp->m_inodegc_shrinker)
+ return -ENOMEM;
+
+ ret = register_shrinker(mp->m_inodegc_shrinker, "xfs-inodegc:%s",
+ mp->m_super->s_id);
+ if (ret)
+ shrinker_free(mp->m_inodegc_shrinker);

- return register_shrinker(shrink, "xfs-inodegc:%s", mp->m_super->s_id);
+ return ret;
}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index fb87ffb48f7f..67286859ad34 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1018,7 +1018,7 @@ xfs_mountfs(
out_log_dealloc:
xfs_log_mount_cancel(mp);
out_inodegc_shrinker:
- unregister_shrinker(&mp->m_inodegc_shrinker);
+ unregister_and_free_shrinker(mp->m_inodegc_shrinker);
out_fail_wait:
if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
xfs_buftarg_drain(mp->m_logdev_targp);
@@ -1100,7 +1100,7 @@ xfs_unmountfs(
#if defined(DEBUG)
xfs_errortag_clearall(mp);
#endif
- unregister_shrinker(&mp->m_inodegc_shrinker);
+ unregister_and_free_shrinker(mp->m_inodegc_shrinker);
xfs_free_perag(mp);

xfs_errortag_del(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e2866e7fa60c..562c294ca08e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -217,7 +217,7 @@ typedef struct xfs_mount {
atomic_t m_agirotor; /* last ag dir inode alloced */

/* Memory shrinker to throttle and reprioritize inodegc */
- struct shrinker m_inodegc_shrinker;
+ struct shrinker *m_inodegc_shrinker;
/*
* Workqueue item so that we can coalesce multiple inode flush attempts
* into a single flush.
--
2.30.2


2023-06-22 09:03:03

by Qi Zheng

Subject: [PATCH 19/29] xfs: dynamically allocate the xfs-qm shrinker

In preparation for implementing lockless slab shrink, dynamically
allocate the xfs-qm shrinker so that it can be freed
asynchronously using kfree_rcu(). This way we don't need to wait for
the RCU read-side critical section when releasing the struct
xfs_quotainfo.

Signed-off-by: Qi Zheng <[email protected]>
---
fs/xfs/xfs_qm.c | 24 +++++++++++++-----------
fs/xfs/xfs_qm.h | 2 +-
2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 6abcc34fafd8..b15320d469cc 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -504,8 +504,7 @@ xfs_qm_shrink_scan(
struct shrinker *shrink,
struct shrink_control *sc)
{
- struct xfs_quotainfo *qi = container_of(shrink,
- struct xfs_quotainfo, qi_shrinker);
+ struct xfs_quotainfo *qi = shrink->private_data;
struct xfs_qm_isolate isol;
unsigned long freed;
int error;
@@ -539,8 +538,7 @@ xfs_qm_shrink_count(
struct shrinker *shrink,
struct shrink_control *sc)
{
- struct xfs_quotainfo *qi = container_of(shrink,
- struct xfs_quotainfo, qi_shrinker);
+ struct xfs_quotainfo *qi = shrink->private_data;

return list_lru_shrink_count(&qi->qi_lru, sc);
}
@@ -680,18 +678,22 @@ xfs_qm_init_quotainfo(
if (XFS_IS_PQUOTA_ON(mp))
xfs_qm_set_defquota(mp, XFS_DQTYPE_PROJ, qinf);

- qinf->qi_shrinker.count_objects = xfs_qm_shrink_count;
- qinf->qi_shrinker.scan_objects = xfs_qm_shrink_scan;
- qinf->qi_shrinker.seeks = DEFAULT_SEEKS;
- qinf->qi_shrinker.flags = SHRINKER_NUMA_AWARE;
+ qinf->qi_shrinker = shrinker_alloc_and_init(xfs_qm_shrink_count,
+ xfs_qm_shrink_scan,
+ 0, DEFAULT_SEEKS,
+ SHRINKER_NUMA_AWARE, qinf);
+ if (!qinf->qi_shrinker)
+ goto out_free_inos;

- error = register_shrinker(&qinf->qi_shrinker, "xfs-qm:%s",
+ error = register_shrinker(qinf->qi_shrinker, "xfs-qm:%s",
mp->m_super->s_id);
if (error)
- goto out_free_inos;
+ goto out_shrinker;

return 0;

+out_shrinker:
+ shrinker_free(qinf->qi_shrinker);
out_free_inos:
mutex_destroy(&qinf->qi_quotaofflock);
mutex_destroy(&qinf->qi_tree_lock);
@@ -718,7 +720,7 @@ xfs_qm_destroy_quotainfo(
qi = mp->m_quotainfo;
ASSERT(qi != NULL);

- unregister_shrinker(&qi->qi_shrinker);
+ unregister_and_free_shrinker(qi->qi_shrinker);
list_lru_destroy(&qi->qi_lru);
xfs_qm_destroy_quotainos(qi);
mutex_destroy(&qi->qi_tree_lock);
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 9683f0457d19..d5c9fc4ba591 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -63,7 +63,7 @@ struct xfs_quotainfo {
struct xfs_def_quota qi_usr_default;
struct xfs_def_quota qi_grp_default;
struct xfs_def_quota qi_prj_default;
- struct shrinker qi_shrinker;
+ struct shrinker *qi_shrinker;

/* Minimum and maximum quota expiration timestamp values. */
time64_t qi_expiry_min;
--
2.30.2


2023-06-22 09:04:39

by Qi Zheng

Subject: [PATCH 22/29] drm/ttm: introduce pool_shrink_rwsem

Currently, synchronize_shrinkers() is only used by the TTM pool. It
only requires that no shrinkers run in parallel.

After we use the refcount+RCU method to implement the lockless slab
shrink, we can no longer use shrinker_rwsem or synchronize_rcu() to
guarantee that all shrinker invocations have seen an update before
freeing memory.

So we introduce a new pool_shrink_rwsem to implement a private
synchronize_shrinkers(), achieving the same purpose.

Signed-off-by: Qi Zheng <[email protected]>
---
drivers/gpu/drm/ttm/ttm_pool.c | 15 +++++++++++++++
include/linux/shrinker.h | 1 -
mm/vmscan.c | 15 ---------------
3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index cddb9151d20f..713b1c0a70e1 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -74,6 +74,7 @@ static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];
static spinlock_t shrinker_lock;
static struct list_head shrinker_list;
static struct shrinker mm_shrinker;
+static DECLARE_RWSEM(pool_shrink_rwsem);

/* Allocate pages of size 1 << order with the given gfp_flags */
static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
@@ -317,6 +318,7 @@ static unsigned int ttm_pool_shrink(void)
unsigned int num_pages;
struct page *p;

+ down_read(&pool_shrink_rwsem);
spin_lock(&shrinker_lock);
pt = list_first_entry(&shrinker_list, typeof(*pt), shrinker_list);
list_move_tail(&pt->shrinker_list, &shrinker_list);
@@ -329,6 +331,7 @@ static unsigned int ttm_pool_shrink(void)
} else {
num_pages = 0;
}
+ up_read(&pool_shrink_rwsem);

return num_pages;
}
@@ -572,6 +575,18 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
}
EXPORT_SYMBOL(ttm_pool_init);

+/**
+ * synchronize_shrinkers - Wait for all running shrinkers to complete.
+ *
+ * This is useful to guarantee that all shrinker invocations have seen an
+ * update, before freeing memory, similar to rcu.
+ */
+static void synchronize_shrinkers(void)
+{
+ down_write(&pool_shrink_rwsem);
+ up_write(&pool_shrink_rwsem);
+}
+
/**
* ttm_pool_fini - Cleanup a pool
*
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 8e9ba6fa3fcc..4094e4c44e80 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -105,7 +105,6 @@ extern int __printf(2, 3) register_shrinker(struct shrinker *shrinker,
const char *fmt, ...);
extern void unregister_shrinker(struct shrinker *shrinker);
extern void free_prealloced_shrinker(struct shrinker *shrinker);
-extern void synchronize_shrinkers(void);

typedef unsigned long (*count_objects_cb)(struct shrinker *s,
struct shrink_control *sc);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64ff598fbad9..3a8d50ad6ff6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -844,21 +844,6 @@ void unregister_and_free_shrinker(struct shrinker *shrinker)
}
EXPORT_SYMBOL(unregister_and_free_shrinker);

-/**
- * synchronize_shrinkers - Wait for all running shrinkers to complete.
- *
- * This is equivalent to calling unregister_shrink() and register_shrinker(),
- * but atomically and with less overhead. This is useful to guarantee that all
- * shrinker invocations have seen an update, before freeing memory, similar to
- * rcu.
- */
-void synchronize_shrinkers(void)
-{
- down_write(&shrinker_rwsem);
- up_write(&shrinker_rwsem);
-}
-EXPORT_SYMBOL(synchronize_shrinkers);
-
#define SHRINK_BATCH 128

static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
--
2.30.2


2023-06-22 09:06:38

by Qi Zheng

Subject: [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file

The mm/vmscan.c file is too large, so move the shrinker-related code
out of it into a separate file. No functional changes.

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/shrinker.h | 3 +
mm/Makefile | 4 +-
mm/shrinker.c | 750 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 746 --------------------------------------
4 files changed, 755 insertions(+), 748 deletions(-)
create mode 100644 mm/shrinker.c

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index b0c6c2df9db8..b837b4c0864a 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -104,6 +104,9 @@ struct shrinker {
*/
#define SHRINKER_NONSLAB (1 << 3)

+unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
+ int priority);
+
extern int __printf(2, 3) prealloc_shrinker(struct shrinker *shrinker,
const char *fmt, ...);
extern void register_shrinker_prepared(struct shrinker *shrinker);
diff --git a/mm/Makefile b/mm/Makefile
index 678530a07326..891899186608 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -48,8 +48,8 @@ endif

obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
maccess.o page-writeback.o folio-compat.o \
- readahead.o swap.o truncate.o vmscan.o shmem.o \
- util.o mmzone.o vmstat.o backing-dev.o \
+ readahead.o swap.o truncate.o vmscan.o shrinker.o \
+ shmem.o util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o percpu.o slab_common.o \
compaction.o show_mem.o\
interval_tree.o list_lru.o workingset.o \
diff --git a/mm/shrinker.c b/mm/shrinker.c
new file mode 100644
index 000000000000..f88966f3d6be
--- /dev/null
+++ b/mm/shrinker.c
@@ -0,0 +1,750 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/memcontrol.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/shrinker.h>
+#include <trace/events/vmscan.h>
+
+LIST_HEAD(shrinker_list);
+DEFINE_MUTEX(shrinker_mutex);
+
+#ifdef CONFIG_MEMCG
+static int shrinker_nr_max;
+
+/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
+static inline int shrinker_map_size(int nr_items)
+{
+ return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
+}
+
+static inline int shrinker_defer_size(int nr_items)
+{
+ return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
+}
+
+static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
+ int nid)
+{
+ return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+ lockdep_is_held(&shrinker_mutex));
+}
+
+static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
+ int nid)
+{
+ return rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+}
+
+static int expand_one_shrinker_info(struct mem_cgroup *memcg,
+ int map_size, int defer_size,
+ int old_map_size, int old_defer_size,
+ int new_nr_max)
+{
+ struct shrinker_info *new, *old;
+ struct mem_cgroup_per_node *pn;
+ int nid;
+ int size = map_size + defer_size;
+
+ for_each_node(nid) {
+ pn = memcg->nodeinfo[nid];
+ old = shrinker_info_protected(memcg, nid);
+ /* Not yet online memcg */
+ if (!old)
+ return 0;
+
+ /* Already expanded this shrinker_info */
+ if (new_nr_max <= old->map_nr_max)
+ continue;
+
+ new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+ if (!new)
+ return -ENOMEM;
+
+ new->nr_deferred = (atomic_long_t *)(new + 1);
+ new->map = (void *)new->nr_deferred + defer_size;
+ new->map_nr_max = new_nr_max;
+
+ /* map: set all old bits, clear all new bits */
+ memset(new->map, (int)0xff, old_map_size);
+ memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
+ /* nr_deferred: copy old values, clear all new values */
+ memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
+ memset((void *)new->nr_deferred + old_defer_size, 0,
+ defer_size - old_defer_size);
+
+ rcu_assign_pointer(pn->shrinker_info, new);
+ kvfree_rcu(old, rcu);
+ }
+
+ return 0;
+}
+
+void free_shrinker_info(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup_per_node *pn;
+ struct shrinker_info *info;
+ int nid;
+
+ for_each_node(nid) {
+ pn = memcg->nodeinfo[nid];
+ info = rcu_dereference_protected(pn->shrinker_info, true);
+ kvfree(info);
+ rcu_assign_pointer(pn->shrinker_info, NULL);
+ }
+}
+
+int alloc_shrinker_info(struct mem_cgroup *memcg)
+{
+ struct shrinker_info *info;
+ int nid, size, ret = 0;
+ int map_size, defer_size = 0;
+
+ mutex_lock(&shrinker_mutex);
+ map_size = shrinker_map_size(shrinker_nr_max);
+ defer_size = shrinker_defer_size(shrinker_nr_max);
+ size = map_size + defer_size;
+ for_each_node(nid) {
+ info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
+ if (!info) {
+ free_shrinker_info(memcg);
+ ret = -ENOMEM;
+ break;
+ }
+ info->nr_deferred = (atomic_long_t *)(info + 1);
+ info->map = (void *)info->nr_deferred + defer_size;
+ info->map_nr_max = shrinker_nr_max;
+ rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
+ }
+ mutex_unlock(&shrinker_mutex);
+
+ return ret;
+}
+
+static int expand_shrinker_info(int new_id)
+{
+ int ret = 0;
+ int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
+ int map_size, defer_size = 0;
+ int old_map_size, old_defer_size = 0;
+ struct mem_cgroup *memcg;
+
+ if (!root_mem_cgroup)
+ goto out;
+
+ lockdep_assert_held(&shrinker_mutex);
+
+ map_size = shrinker_map_size(new_nr_max);
+ defer_size = shrinker_defer_size(new_nr_max);
+ old_map_size = shrinker_map_size(shrinker_nr_max);
+ old_defer_size = shrinker_defer_size(shrinker_nr_max);
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ ret = expand_one_shrinker_info(memcg, map_size, defer_size,
+ old_map_size, old_defer_size,
+ new_nr_max);
+ if (ret) {
+ mem_cgroup_iter_break(NULL, memcg);
+ goto out;
+ }
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
+out:
+ if (!ret)
+ shrinker_nr_max = new_nr_max;
+
+ return ret;
+}
+
+void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
+{
+ if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
+ struct shrinker_info *info;
+
+ rcu_read_lock();
+ info = shrinker_info_rcu(memcg, nid);
+ if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
+ /* Pairs with smp mb in shrink_slab() */
+ smp_mb__before_atomic();
+ set_bit(shrinker_id, info->map);
+ }
+ rcu_read_unlock();
+ }
+}
+
+static DEFINE_IDR(shrinker_idr);
+
+static int prealloc_memcg_shrinker(struct shrinker *shrinker)
+{
+ int id, ret = -ENOMEM;
+
+ if (mem_cgroup_disabled())
+ return -ENOSYS;
+
+ mutex_lock(&shrinker_mutex);
+ id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
+ if (id < 0)
+ goto unlock;
+
+ if (id >= shrinker_nr_max) {
+ if (expand_shrinker_info(id)) {
+ idr_remove(&shrinker_idr, id);
+ goto unlock;
+ }
+ }
+ shrinker->id = id;
+ ret = 0;
+unlock:
+ mutex_unlock(&shrinker_mutex);
+ return ret;
+}
+
+static void unregister_memcg_shrinker(struct shrinker *shrinker)
+{
+ int id = shrinker->id;
+
+ BUG_ON(id < 0);
+
+ lockdep_assert_held(&shrinker_mutex);
+
+ idr_remove(&shrinker_idr, id);
+}
+
+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+ struct shrinker_info *info;
+ long nr_deferred;
+
+ rcu_read_lock();
+ info = shrinker_info_rcu(memcg, nid);
+ nr_deferred = atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+ rcu_read_unlock();
+
+ return nr_deferred;
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+ struct shrinker_info *info;
+ long nr_deferred;
+
+ rcu_read_lock();
+ info = shrinker_info_rcu(memcg, nid);
+ nr_deferred = atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+ rcu_read_unlock();
+
+ return nr_deferred;
+}
+
+void reparent_shrinker_deferred(struct mem_cgroup *memcg)
+{
+ int i, nid;
+ long nr;
+ struct mem_cgroup *parent;
+ struct shrinker_info *child_info, *parent_info;
+
+ parent = parent_mem_cgroup(memcg);
+ if (!parent)
+ parent = root_mem_cgroup;
+
+ /* Prevent from concurrent shrinker_info expand */
+ mutex_lock(&shrinker_mutex);
+ for_each_node(nid) {
+ child_info = shrinker_info_protected(memcg, nid);
+ parent_info = shrinker_info_protected(parent, nid);
+ for (i = 0; i < child_info->map_nr_max; i++) {
+ nr = atomic_long_read(&child_info->nr_deferred[i]);
+ atomic_long_add(nr, &parent_info->nr_deferred[i]);
+ }
+ }
+ mutex_unlock(&shrinker_mutex);
+}
+#else
+static int prealloc_memcg_shrinker(struct shrinker *shrinker)
+{
+ return -ENOSYS;
+}
+
+static void unregister_memcg_shrinker(struct shrinker *shrinker)
+{
+}
+
+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+ return 0;
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+ return 0;
+}
+#endif /* CONFIG_MEMCG */
+
+static long xchg_nr_deferred(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+ int nid = sc->nid;
+
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ nid = 0;
+
+ if (sc->memcg &&
+ (shrinker->flags & SHRINKER_MEMCG_AWARE))
+ return xchg_nr_deferred_memcg(nid, shrinker,
+ sc->memcg);
+
+ return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+}
+
+static long add_nr_deferred(long nr, struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+ int nid = sc->nid;
+
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ nid = 0;
+
+ if (sc->memcg &&
+ (shrinker->flags & SHRINKER_MEMCG_AWARE))
+ return add_nr_deferred_memcg(nr, nid, shrinker,
+ sc->memcg);
+
+ return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
+}
+
+#define SHRINK_BATCH 128
+
+static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
+ struct shrinker *shrinker, int priority)
+{
+ unsigned long freed = 0;
+ unsigned long long delta;
+ long total_scan;
+ long freeable;
+ long nr;
+ long new_nr;
+ long batch_size = shrinker->batch ? shrinker->batch
+ : SHRINK_BATCH;
+ long scanned = 0, next_deferred;
+
+ freeable = shrinker->count_objects(shrinker, shrinkctl);
+ if (freeable == 0 || freeable == SHRINK_EMPTY)
+ return freeable;
+
+ /*
+ * copy the current shrinker scan count into a local variable
+ * and zero it so that other concurrent shrinker invocations
+ * don't also do this scanning work.
+ */
+ nr = xchg_nr_deferred(shrinker, shrinkctl);
+
+ if (shrinker->seeks) {
+ delta = freeable >> priority;
+ delta *= 4;
+ do_div(delta, shrinker->seeks);
+ } else {
+ /*
+ * These objects don't require any IO to create. Trim
+ * them aggressively under memory pressure to keep
+ * them from causing refetches in the IO caches.
+ */
+ delta = freeable / 2;
+ }
+
+ total_scan = nr >> priority;
+ total_scan += delta;
+ total_scan = min(total_scan, (2 * freeable));
+
+ trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
+ freeable, delta, total_scan, priority);
+
+ /*
+ * Normally, we should not scan less than batch_size objects in one
+ * pass to avoid too frequent shrinker calls, but if the slab has less
+ * than batch_size objects in total and we are really tight on memory,
+ * we will try to reclaim all available objects, otherwise we can end
+ * up failing allocations although there are plenty of reclaimable
+ * objects spread over several slabs with usage less than the
+ * batch_size.
+ *
+ * We detect the "tight on memory" situations by looking at the total
+ * number of objects we want to scan (total_scan). If it is greater
+ * than the total number of objects on slab (freeable), we must be
+ * scanning at high prio and therefore should try to reclaim as much as
+ * possible.
+ */
+ while (total_scan >= batch_size ||
+ total_scan >= freeable) {
+ unsigned long ret;
+ unsigned long nr_to_scan = min(batch_size, total_scan);
+
+ shrinkctl->nr_to_scan = nr_to_scan;
+ shrinkctl->nr_scanned = nr_to_scan;
+ ret = shrinker->scan_objects(shrinker, shrinkctl);
+ if (ret == SHRINK_STOP)
+ break;
+ freed += ret;
+
+ count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
+ total_scan -= shrinkctl->nr_scanned;
+ scanned += shrinkctl->nr_scanned;
+
+ cond_resched();
+ }
+
+ /*
+ * The deferred work is increased by any new work (delta) that wasn't
+ * done, decreased by old deferred work that was done now.
+ *
+ * And it is capped to two times of the freeable items.
+ */
+ next_deferred = max_t(long, (nr + delta - scanned), 0);
+ next_deferred = min(next_deferred, (2 * freeable));
+
+ /*
+ * move the unused scan count back into the shrinker in a
+ * manner that handles concurrent updates.
+ */
+ new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
+
+ trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
+ return freed;
+}
+
+#ifdef CONFIG_MEMCG
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+ struct mem_cgroup *memcg, int priority)
+{
+ struct shrinker_info *info;
+ unsigned long ret, freed = 0;
+ int i = 0;
+
+ if (!mem_cgroup_online(memcg))
+ return 0;
+
+again:
+ rcu_read_lock();
+ info = shrinker_info_rcu(memcg, nid);
+ if (unlikely(!info))
+ goto unlock;
+
+ for_each_set_bit_from(i, info->map, info->map_nr_max) {
+ struct shrink_control sc = {
+ .gfp_mask = gfp_mask,
+ .nid = nid,
+ .memcg = memcg,
+ };
+ struct shrinker *shrinker;
+
+ shrinker = idr_find(&shrinker_idr, i);
+ if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
+ if (!shrinker)
+ clear_bit(i, info->map);
+ continue;
+ }
+
+ if (!shrinker_try_get(shrinker))
+ continue;
+ rcu_read_unlock();
+
+ /* Call non-slab shrinkers even though kmem is disabled */
+ if (!memcg_kmem_online() &&
+ !(shrinker->flags & SHRINKER_NONSLAB))
+ continue;
+
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY) {
+ clear_bit(i, info->map);
+ /*
+ * After the shrinker reported that it had no objects to
+ * free, but before we cleared the corresponding bit in
+ * the memcg shrinker map, a new object might have been
+ * added. To make sure, we have the bit set in this
+ * case, we invoke the shrinker one more time and reset
+ * the bit if it reports that it is not empty anymore.
+ * The memory barrier here pairs with the barrier in
+ * set_shrinker_bit():
+ *
+ * list_lru_add() shrink_slab_memcg()
+ * list_add_tail() clear_bit()
+ * <MB> <MB>
+ * set_bit() do_shrink_slab()
+ */
+ smp_mb__after_atomic();
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY)
+ ret = 0;
+ else
+ set_shrinker_bit(memcg, nid, i);
+ }
+ freed += ret;
+
+ shrinker_put(shrinker);
+
+ /*
+ * We have already exited the read-side of rcu critical section
+ * before calling do_shrink_slab(), the shrinker_info may be
+ * released in expand_one_shrinker_info(), so restart the
+ * iteration.
+ */
+ i++;
+ goto again;
+ }
+unlock:
+ rcu_read_unlock();
+ return freed;
+}
+#else /* CONFIG_MEMCG */
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+ struct mem_cgroup *memcg, int priority)
+{
+ return 0;
+}
+#endif /* CONFIG_MEMCG */
+
+/**
+ * shrink_slab - shrink slab caches
+ * @gfp_mask: allocation context
+ * @nid: node whose slab caches to target
+ * @memcg: memory cgroup whose slab caches to target
+ * @priority: the reclaim priority
+ *
+ * Call the shrink functions to age shrinkable caches.
+ *
+ * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
+ * unaware shrinkers will receive a node id of 0 instead.
+ *
+ * @memcg specifies the memory cgroup to target. Unaware shrinkers
+ * are called only if it is the root cgroup.
+ *
+ * @priority is sc->priority, we take the number of objects and >> by priority
+ * in order to get the scan target.
+ *
+ * Returns the number of reclaimed slab objects.
+ */
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
+ int priority)
+{
+ unsigned long ret, freed = 0;
+ struct shrinker *shrinker;
+
+ /*
+ * The root memcg might be allocated even though memcg is disabled
+ * via "cgroup_disable=memory" boot parameter. This could make
+ * mem_cgroup_is_root() return false, then just run memcg slab
+ * shrink, but skip global shrink. This may result in premature
+ * oom.
+ */
+ if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
+ return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
+ struct shrink_control sc = {
+ .gfp_mask = gfp_mask,
+ .nid = nid,
+ .memcg = memcg,
+ };
+
+ if (!shrinker_try_get(shrinker))
+ continue;
+ rcu_read_unlock();
+
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY)
+ ret = 0;
+ freed += ret;
+
+ rcu_read_lock();
+ shrinker_put(shrinker);
+ }
+ rcu_read_unlock();
+ cond_resched();
+ return freed;
+}
+
+/*
+ * Add a shrinker callback to be called from the vm.
+ */
+static int __prealloc_shrinker(struct shrinker *shrinker)
+{
+ unsigned int size;
+ int err;
+
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+ err = prealloc_memcg_shrinker(shrinker);
+ if (err != -ENOSYS)
+ return err;
+
+ shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
+ }
+
+ size = sizeof(*shrinker->nr_deferred);
+ if (shrinker->flags & SHRINKER_NUMA_AWARE)
+ size *= nr_node_ids;
+
+ shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
+ if (!shrinker->nr_deferred)
+ return -ENOMEM;
+
+ return 0;
+}
+
+#ifdef CONFIG_SHRINKER_DEBUG
+int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...)
+{
+ va_list ap;
+ int err;
+
+ va_start(ap, fmt);
+ shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap);
+ va_end(ap);
+ if (!shrinker->name)
+ return -ENOMEM;
+
+ err = __prealloc_shrinker(shrinker);
+ if (err) {
+ kfree_const(shrinker->name);
+ shrinker->name = NULL;
+ }
+
+ return err;
+}
+#else
+int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...)
+{
+ return __prealloc_shrinker(shrinker);
+}
+#endif
+
+void free_prealloced_shrinker(struct shrinker *shrinker)
+{
+#ifdef CONFIG_SHRINKER_DEBUG
+ kfree_const(shrinker->name);
+ shrinker->name = NULL;
+#endif
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+ mutex_lock(&shrinker_mutex);
+ unregister_memcg_shrinker(shrinker);
+ mutex_unlock(&shrinker_mutex);
+ return;
+ }
+
+ kfree(shrinker->nr_deferred);
+ shrinker->nr_deferred = NULL;
+}
+
+void register_shrinker_prepared(struct shrinker *shrinker)
+{
+ mutex_lock(&shrinker_mutex);
+ refcount_set(&shrinker->refcount, 1);
+ init_completion(&shrinker->completion_wait);
+ list_add_tail_rcu(&shrinker->list, &shrinker_list);
+ shrinker->flags |= SHRINKER_REGISTERED;
+ shrinker_debugfs_add(shrinker);
+ mutex_unlock(&shrinker_mutex);
+}
+
+static int __register_shrinker(struct shrinker *shrinker)
+{
+ int err = __prealloc_shrinker(shrinker);
+
+ if (err)
+ return err;
+ register_shrinker_prepared(shrinker);
+ return 0;
+}
+
+#ifdef CONFIG_SHRINKER_DEBUG
+int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
+{
+ va_list ap;
+ int err;
+
+ va_start(ap, fmt);
+ shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap);
+ va_end(ap);
+ if (!shrinker->name)
+ return -ENOMEM;
+
+ err = __register_shrinker(shrinker);
+ if (err) {
+ kfree_const(shrinker->name);
+ shrinker->name = NULL;
+ }
+ return err;
+}
+#else
+int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
+{
+ return __register_shrinker(shrinker);
+}
+#endif
+EXPORT_SYMBOL(register_shrinker);
+
+/*
+ * Remove one
+ */
+void unregister_shrinker(struct shrinker *shrinker)
+{
+ struct dentry *debugfs_entry;
+ int debugfs_id;
+
+ if (!(shrinker->flags & SHRINKER_REGISTERED))
+ return;
+
+ shrinker_put(shrinker);
+ wait_for_completion(&shrinker->completion_wait);
+
+ mutex_lock(&shrinker_mutex);
+ list_del_rcu(&shrinker->list);
+ shrinker->flags &= ~SHRINKER_REGISTERED;
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+ unregister_memcg_shrinker(shrinker);
+ debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
+ mutex_unlock(&shrinker_mutex);
+
+ shrinker_debugfs_remove(debugfs_entry, debugfs_id);
+
+ kfree(shrinker->nr_deferred);
+ shrinker->nr_deferred = NULL;
+}
+EXPORT_SYMBOL(unregister_shrinker);
+
+struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
+ scan_objects_cb scan, long batch,
+ int seeks, unsigned flags,
+ void *priv_data)
+{
+ struct shrinker *shrinker;
+
+ shrinker = kzalloc(sizeof(struct shrinker), GFP_KERNEL);
+ if (!shrinker)
+ return NULL;
+
+ shrinker->count_objects = count;
+ shrinker->scan_objects = scan;
+ shrinker->batch = batch;
+ shrinker->seeks = seeks;
+ shrinker->flags = flags;
+ shrinker->private_data = priv_data;
+
+ return shrinker;
+}
+EXPORT_SYMBOL(shrinker_alloc_and_init);
+
+void shrinker_free(struct shrinker *shrinker)
+{
+ kfree(shrinker);
+}
+EXPORT_SYMBOL(shrinker_free);
+
+void unregister_and_free_shrinker(struct shrinker *shrinker)
+{
+ unregister_shrinker(shrinker);
+ kfree_rcu(shrinker, rcu);
+}
+EXPORT_SYMBOL(unregister_and_free_shrinker);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bcdd97caa403..0816c6496483 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,6 @@
#include <linux/cpuset.h>
#include <linux/compaction.h>
#include <linux/notifier.h>
-#include <linux/mutex.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -57,7 +56,6 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
-#include <linux/rculist.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -189,262 +187,7 @@ struct scan_control {
*/
int vm_swappiness = 60;

-LIST_HEAD(shrinker_list);
-DEFINE_MUTEX(shrinker_mutex);
-
#ifdef CONFIG_MEMCG
-static int shrinker_nr_max;
-
-/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
-static inline int shrinker_map_size(int nr_items)
-{
- return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
-}
-
-static inline int shrinker_defer_size(int nr_items)
-{
- return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
-}
-
-static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
- int nid)
-{
- return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
- lockdep_is_held(&shrinker_mutex));
-}
-
-static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
- int nid)
-{
- return rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
-}
-
-static int expand_one_shrinker_info(struct mem_cgroup *memcg,
- int map_size, int defer_size,
- int old_map_size, int old_defer_size,
- int new_nr_max)
-{
- struct shrinker_info *new, *old;
- struct mem_cgroup_per_node *pn;
- int nid;
- int size = map_size + defer_size;
-
- for_each_node(nid) {
- pn = memcg->nodeinfo[nid];
- old = shrinker_info_protected(memcg, nid);
- /* Not yet online memcg */
- if (!old)
- return 0;
-
- /* Already expanded this shrinker_info */
- if (new_nr_max <= old->map_nr_max)
- continue;
-
- new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
- if (!new)
- return -ENOMEM;
-
- new->nr_deferred = (atomic_long_t *)(new + 1);
- new->map = (void *)new->nr_deferred + defer_size;
- new->map_nr_max = new_nr_max;
-
- /* map: set all old bits, clear all new bits */
- memset(new->map, (int)0xff, old_map_size);
- memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
- /* nr_deferred: copy old values, clear all new values */
- memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
- memset((void *)new->nr_deferred + old_defer_size, 0,
- defer_size - old_defer_size);
-
- rcu_assign_pointer(pn->shrinker_info, new);
- kvfree_rcu(old, rcu);
- }
-
- return 0;
-}
-
-void free_shrinker_info(struct mem_cgroup *memcg)
-{
- struct mem_cgroup_per_node *pn;
- struct shrinker_info *info;
- int nid;
-
- for_each_node(nid) {
- pn = memcg->nodeinfo[nid];
- info = rcu_dereference_protected(pn->shrinker_info, true);
- kvfree(info);
- rcu_assign_pointer(pn->shrinker_info, NULL);
- }
-}
-
-int alloc_shrinker_info(struct mem_cgroup *memcg)
-{
- struct shrinker_info *info;
- int nid, size, ret = 0;
- int map_size, defer_size = 0;
-
- mutex_lock(&shrinker_mutex);
- map_size = shrinker_map_size(shrinker_nr_max);
- defer_size = shrinker_defer_size(shrinker_nr_max);
- size = map_size + defer_size;
- for_each_node(nid) {
- info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
- if (!info) {
- free_shrinker_info(memcg);
- ret = -ENOMEM;
- break;
- }
- info->nr_deferred = (atomic_long_t *)(info + 1);
- info->map = (void *)info->nr_deferred + defer_size;
- info->map_nr_max = shrinker_nr_max;
- rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
- }
- mutex_unlock(&shrinker_mutex);
-
- return ret;
-}
-
-static int expand_shrinker_info(int new_id)
-{
- int ret = 0;
- int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
- int map_size, defer_size = 0;
- int old_map_size, old_defer_size = 0;
- struct mem_cgroup *memcg;
-
- if (!root_mem_cgroup)
- goto out;
-
- lockdep_assert_held(&shrinker_mutex);
-
- map_size = shrinker_map_size(new_nr_max);
- defer_size = shrinker_defer_size(new_nr_max);
- old_map_size = shrinker_map_size(shrinker_nr_max);
- old_defer_size = shrinker_defer_size(shrinker_nr_max);
-
- memcg = mem_cgroup_iter(NULL, NULL, NULL);
- do {
- ret = expand_one_shrinker_info(memcg, map_size, defer_size,
- old_map_size, old_defer_size,
- new_nr_max);
- if (ret) {
- mem_cgroup_iter_break(NULL, memcg);
- goto out;
- }
- } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-out:
- if (!ret)
- shrinker_nr_max = new_nr_max;
-
- return ret;
-}
-
-void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
-{
- if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
- struct shrinker_info *info;
-
- rcu_read_lock();
- info = shrinker_info_rcu(memcg, nid);
- if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
- /* Pairs with smp mb in shrink_slab() */
- smp_mb__before_atomic();
- set_bit(shrinker_id, info->map);
- }
- rcu_read_unlock();
- }
-}
-
-static DEFINE_IDR(shrinker_idr);
-
-static int prealloc_memcg_shrinker(struct shrinker *shrinker)
-{
- int id, ret = -ENOMEM;
-
- if (mem_cgroup_disabled())
- return -ENOSYS;
-
- mutex_lock(&shrinker_mutex);
- id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
- if (id < 0)
- goto unlock;
-
- if (id >= shrinker_nr_max) {
- if (expand_shrinker_info(id)) {
- idr_remove(&shrinker_idr, id);
- goto unlock;
- }
- }
- shrinker->id = id;
- ret = 0;
-unlock:
- mutex_unlock(&shrinker_mutex);
- return ret;
-}
-
-static void unregister_memcg_shrinker(struct shrinker *shrinker)
-{
- int id = shrinker->id;
-
- BUG_ON(id < 0);
-
- lockdep_assert_held(&shrinker_mutex);
-
- idr_remove(&shrinker_idr, id);
-}
-
-static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
- struct mem_cgroup *memcg)
-{
- struct shrinker_info *info;
- long nr_deferred;
-
- rcu_read_lock();
- info = shrinker_info_rcu(memcg, nid);
- nr_deferred = atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
- rcu_read_unlock();
-
- return nr_deferred;
-}
-
-static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
- struct mem_cgroup *memcg)
-{
- struct shrinker_info *info;
- long nr_deferred;
-
- rcu_read_lock();
- info = shrinker_info_rcu(memcg, nid);
- nr_deferred = atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
- rcu_read_unlock();
-
- return nr_deferred;
-}
-
-void reparent_shrinker_deferred(struct mem_cgroup *memcg)
-{
- int i, nid;
- long nr;
- struct mem_cgroup *parent;
- struct shrinker_info *child_info, *parent_info;
-
- parent = parent_mem_cgroup(memcg);
- if (!parent)
- parent = root_mem_cgroup;
-
- /* Prevent from concurrent shrinker_info expand */
- mutex_lock(&shrinker_mutex);
- for_each_node(nid) {
- child_info = shrinker_info_protected(memcg, nid);
- parent_info = shrinker_info_protected(parent, nid);
- for (i = 0; i < child_info->map_nr_max; i++) {
- nr = atomic_long_read(&child_info->nr_deferred[i]);
- atomic_long_add(nr, &parent_info->nr_deferred[i]);
- }
- }
- mutex_unlock(&shrinker_mutex);
-}
-
static bool cgroup_reclaim(struct scan_control *sc)
{
return sc->target_mem_cgroup;
@@ -479,27 +222,6 @@ static bool writeback_throttling_sane(struct scan_control *sc)
return false;
}
#else
-static int prealloc_memcg_shrinker(struct shrinker *shrinker)
-{
- return -ENOSYS;
-}
-
-static void unregister_memcg_shrinker(struct shrinker *shrinker)
-{
-}
-
-static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
- struct mem_cgroup *memcg)
-{
- return 0;
-}
-
-static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
- struct mem_cgroup *memcg)
-{
- return 0;
-}
-
static bool cgroup_reclaim(struct scan_control *sc)
{
return false;
@@ -568,39 +290,6 @@ static void flush_reclaim_state(struct scan_control *sc)
}
}

-static long xchg_nr_deferred(struct shrinker *shrinker,
- struct shrink_control *sc)
-{
- int nid = sc->nid;
-
- if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
- nid = 0;
-
- if (sc->memcg &&
- (shrinker->flags & SHRINKER_MEMCG_AWARE))
- return xchg_nr_deferred_memcg(nid, shrinker,
- sc->memcg);
-
- return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
-}
-
-
-static long add_nr_deferred(long nr, struct shrinker *shrinker,
- struct shrink_control *sc)
-{
- int nid = sc->nid;
-
- if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
- nid = 0;
-
- if (sc->memcg &&
- (shrinker->flags & SHRINKER_MEMCG_AWARE))
- return add_nr_deferred_memcg(nr, nid, shrinker,
- sc->memcg);
-
- return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
-}
-
static bool can_demote(int nid, struct scan_control *sc)
{
if (!numa_demotion_enabled)
@@ -682,441 +371,6 @@ static unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
return size;
}

-/*
- * Add a shrinker callback to be called from the vm.
- */
-static int __prealloc_shrinker(struct shrinker *shrinker)
-{
- unsigned int size;
- int err;
-
- if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
- err = prealloc_memcg_shrinker(shrinker);
- if (err != -ENOSYS)
- return err;
-
- shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
- }
-
- size = sizeof(*shrinker->nr_deferred);
- if (shrinker->flags & SHRINKER_NUMA_AWARE)
- size *= nr_node_ids;
-
- shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
- if (!shrinker->nr_deferred)
- return -ENOMEM;
-
- return 0;
-}
-
-#ifdef CONFIG_SHRINKER_DEBUG
-int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...)
-{
- va_list ap;
- int err;
-
- va_start(ap, fmt);
- shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap);
- va_end(ap);
- if (!shrinker->name)
- return -ENOMEM;
-
- err = __prealloc_shrinker(shrinker);
- if (err) {
- kfree_const(shrinker->name);
- shrinker->name = NULL;
- }
-
- return err;
-}
-#else
-int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...)
-{
- return __prealloc_shrinker(shrinker);
-}
-#endif
-
-void free_prealloced_shrinker(struct shrinker *shrinker)
-{
-#ifdef CONFIG_SHRINKER_DEBUG
- kfree_const(shrinker->name);
- shrinker->name = NULL;
-#endif
- if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
- mutex_lock(&shrinker_mutex);
- unregister_memcg_shrinker(shrinker);
- mutex_unlock(&shrinker_mutex);
- return;
- }
-
- kfree(shrinker->nr_deferred);
- shrinker->nr_deferred = NULL;
-}
-
-void register_shrinker_prepared(struct shrinker *shrinker)
-{
- mutex_lock(&shrinker_mutex);
- refcount_set(&shrinker->refcount, 1);
- init_completion(&shrinker->completion_wait);
- list_add_tail_rcu(&shrinker->list, &shrinker_list);
- shrinker->flags |= SHRINKER_REGISTERED;
- shrinker_debugfs_add(shrinker);
- mutex_unlock(&shrinker_mutex);
-}
-
-static int __register_shrinker(struct shrinker *shrinker)
-{
- int err = __prealloc_shrinker(shrinker);
-
- if (err)
- return err;
- register_shrinker_prepared(shrinker);
- return 0;
-}
-
-#ifdef CONFIG_SHRINKER_DEBUG
-int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
-{
- va_list ap;
- int err;
-
- va_start(ap, fmt);
- shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap);
- va_end(ap);
- if (!shrinker->name)
- return -ENOMEM;
-
- err = __register_shrinker(shrinker);
- if (err) {
- kfree_const(shrinker->name);
- shrinker->name = NULL;
- }
- return err;
-}
-#else
-int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
-{
- return __register_shrinker(shrinker);
-}
-#endif
-EXPORT_SYMBOL(register_shrinker);
-
-/*
- * Remove one
- */
-void unregister_shrinker(struct shrinker *shrinker)
-{
- struct dentry *debugfs_entry;
- int debugfs_id;
-
- if (!(shrinker->flags & SHRINKER_REGISTERED))
- return;
-
- shrinker_put(shrinker);
- wait_for_completion(&shrinker->completion_wait);
-
- mutex_lock(&shrinker_mutex);
- list_del_rcu(&shrinker->list);
- shrinker->flags &= ~SHRINKER_REGISTERED;
- if (shrinker->flags & SHRINKER_MEMCG_AWARE)
- unregister_memcg_shrinker(shrinker);
- debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
- mutex_unlock(&shrinker_mutex);
-
- shrinker_debugfs_remove(debugfs_entry, debugfs_id);
-
- kfree(shrinker->nr_deferred);
- shrinker->nr_deferred = NULL;
-}
-EXPORT_SYMBOL(unregister_shrinker);
-
-struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
- scan_objects_cb scan, long batch,
- int seeks, unsigned flags,
- void *priv_data)
-{
- struct shrinker *shrinker;
-
- shrinker = kzalloc(sizeof(struct shrinker), GFP_KERNEL);
- if (!shrinker)
- return NULL;
-
- shrinker->count_objects = count;
- shrinker->scan_objects = scan;
- shrinker->batch = batch;
- shrinker->seeks = seeks;
- shrinker->flags = flags;
- shrinker->private_data = priv_data;
-
- return shrinker;
-}
-EXPORT_SYMBOL(shrinker_alloc_and_init);
-
-void shrinker_free(struct shrinker *shrinker)
-{
- kfree(shrinker);
-}
-EXPORT_SYMBOL(shrinker_free);
-
-void unregister_and_free_shrinker(struct shrinker *shrinker)
-{
- unregister_shrinker(shrinker);
- kfree_rcu(shrinker, rcu);
-}
-EXPORT_SYMBOL(unregister_and_free_shrinker);
-
-#define SHRINK_BATCH 128
-
-static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
- struct shrinker *shrinker, int priority)
-{
- unsigned long freed = 0;
- unsigned long long delta;
- long total_scan;
- long freeable;
- long nr;
- long new_nr;
- long batch_size = shrinker->batch ? shrinker->batch
- : SHRINK_BATCH;
- long scanned = 0, next_deferred;
-
- freeable = shrinker->count_objects(shrinker, shrinkctl);
- if (freeable == 0 || freeable == SHRINK_EMPTY)
- return freeable;
-
- /*
- * copy the current shrinker scan count into a local variable
- * and zero it so that other concurrent shrinker invocations
- * don't also do this scanning work.
- */
- nr = xchg_nr_deferred(shrinker, shrinkctl);
-
- if (shrinker->seeks) {
- delta = freeable >> priority;
- delta *= 4;
- do_div(delta, shrinker->seeks);
- } else {
- /*
- * These objects don't require any IO to create. Trim
- * them aggressively under memory pressure to keep
- * them from causing refetches in the IO caches.
- */
- delta = freeable / 2;
- }
-
- total_scan = nr >> priority;
- total_scan += delta;
- total_scan = min(total_scan, (2 * freeable));
-
- trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
- freeable, delta, total_scan, priority);
-
- /*
- * Normally, we should not scan less than batch_size objects in one
- * pass to avoid too frequent shrinker calls, but if the slab has less
- * than batch_size objects in total and we are really tight on memory,
- * we will try to reclaim all available objects, otherwise we can end
- * up failing allocations although there are plenty of reclaimable
- * objects spread over several slabs with usage less than the
- * batch_size.
- *
- * We detect the "tight on memory" situations by looking at the total
- * number of objects we want to scan (total_scan). If it is greater
- * than the total number of objects on slab (freeable), we must be
- * scanning at high prio and therefore should try to reclaim as much as
- * possible.
- */
- while (total_scan >= batch_size ||
- total_scan >= freeable) {
- unsigned long ret;
- unsigned long nr_to_scan = min(batch_size, total_scan);
-
- shrinkctl->nr_to_scan = nr_to_scan;
- shrinkctl->nr_scanned = nr_to_scan;
- ret = shrinker->scan_objects(shrinker, shrinkctl);
- if (ret == SHRINK_STOP)
- break;
- freed += ret;
-
- count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
- total_scan -= shrinkctl->nr_scanned;
- scanned += shrinkctl->nr_scanned;
-
- cond_resched();
- }
-
- /*
- * The deferred work is increased by any new work (delta) that wasn't
- * done, decreased by old deferred work that was done now.
- *
- * And it is capped to two times of the freeable items.
- */
- next_deferred = max_t(long, (nr + delta - scanned), 0);
- next_deferred = min(next_deferred, (2 * freeable));
-
- /*
- * move the unused scan count back into the shrinker in a
- * manner that handles concurrent updates.
- */
- new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
-
- trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
- return freed;
-}
-
-#ifdef CONFIG_MEMCG
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
- struct mem_cgroup *memcg, int priority)
-{
- struct shrinker_info *info;
- unsigned long ret, freed = 0;
- int i = 0;
-
- if (!mem_cgroup_online(memcg))
- return 0;
-
-again:
- rcu_read_lock();
- info = shrinker_info_rcu(memcg, nid);
- if (unlikely(!info))
- goto unlock;
-
- for_each_set_bit_from(i, info->map, info->map_nr_max) {
- struct shrink_control sc = {
- .gfp_mask = gfp_mask,
- .nid = nid,
- .memcg = memcg,
- };
- struct shrinker *shrinker;
-
- shrinker = idr_find(&shrinker_idr, i);
- if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
- if (!shrinker)
- clear_bit(i, info->map);
- continue;
- }
-
- if (!shrinker_try_get(shrinker))
- continue;
- rcu_read_unlock();
-
- /* Call non-slab shrinkers even though kmem is disabled */
- if (!memcg_kmem_online() &&
- !(shrinker->flags & SHRINKER_NONSLAB))
- continue;
-
- ret = do_shrink_slab(&sc, shrinker, priority);
- if (ret == SHRINK_EMPTY) {
- clear_bit(i, info->map);
- /*
- * After the shrinker reported that it had no objects to
- * free, but before we cleared the corresponding bit in
- * the memcg shrinker map, a new object might have been
- * added. To make sure, we have the bit set in this
- * case, we invoke the shrinker one more time and reset
- * the bit if it reports that it is not empty anymore.
- * The memory barrier here pairs with the barrier in
- * set_shrinker_bit():
- *
- * list_lru_add() shrink_slab_memcg()
- * list_add_tail() clear_bit()
- * <MB> <MB>
- * set_bit() do_shrink_slab()
- */
- smp_mb__after_atomic();
- ret = do_shrink_slab(&sc, shrinker, priority);
- if (ret == SHRINK_EMPTY)
- ret = 0;
- else
- set_shrinker_bit(memcg, nid, i);
- }
- freed += ret;
-
- shrinker_put(shrinker);
-
- /*
- * We have already exited the read-side of rcu critical section
- * before calling do_shrink_slab(), the shrinker_info may be
- * released in expand_one_shrinker_info(), so restart the
- * iteration.
- */
- i++;
- goto again;
- }
-unlock:
- rcu_read_unlock();
- return freed;
-}
-#else /* CONFIG_MEMCG */
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
- struct mem_cgroup *memcg, int priority)
-{
- return 0;
-}
-#endif /* CONFIG_MEMCG */
-
-/**
- * shrink_slab - shrink slab caches
- * @gfp_mask: allocation context
- * @nid: node whose slab caches to target
- * @memcg: memory cgroup whose slab caches to target
- * @priority: the reclaim priority
- *
- * Call the shrink functions to age shrinkable caches.
- *
- * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
- * unaware shrinkers will receive a node id of 0 instead.
- *
- * @memcg specifies the memory cgroup to target. Unaware shrinkers
- * are called only if it is the root cgroup.
- *
- * @priority is sc->priority, we take the number of objects and >> by priority
- * in order to get the scan target.
- *
- * Returns the number of reclaimed slab objects.
- */
-static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
- struct mem_cgroup *memcg,
- int priority)
-{
- unsigned long ret, freed = 0;
- struct shrinker *shrinker;
-
- /*
- * The root memcg might be allocated even though memcg is disabled
- * via "cgroup_disable=memory" boot parameter. This could make
- * mem_cgroup_is_root() return false, then just run memcg slab
- * shrink, but skip global shrink. This may result in premature
- * oom.
- */
- if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
- return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
-
- rcu_read_lock();
- list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
- struct shrink_control sc = {
- .gfp_mask = gfp_mask,
- .nid = nid,
- .memcg = memcg,
- };
-
- if (!shrinker_try_get(shrinker))
- continue;
- rcu_read_unlock();
-
- ret = do_shrink_slab(&sc, shrinker, priority);
- if (ret == SHRINK_EMPTY)
- ret = 0;
- freed += ret;
-
- rcu_read_lock();
- shrinker_put(shrinker);
- }
- rcu_read_unlock();
- cond_resched();
- return freed;
-}
-
static unsigned long drop_slab_node(int nid)
{
unsigned long freed = 0;
--
2.30.2


2023-06-22 09:06:41

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 25/29] mm: vmscan: make memcg slab shrink lockless

Like the global slab shrink, this commit also uses the refcount+RCU
method to make the memcg slab shrink lockless.
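In rough outline, the per-bit iteration changes to the following pattern
(a simplified sketch of the diff below, with the SHRINK_EMPTY and
SHRINKER_REGISTERED handling omitted; it is not the literal patch code):

```
again:
	rcu_read_lock();
	info = shrinker_info_rcu(memcg, nid);
	for_each_set_bit_from(i, info->map, info->map_nr_max) {
		struct shrinker *shrinker = idr_find(&shrinker_idr, i);

		if (!shrinker || !shrinker_try_get(shrinker))
			continue;
		rcu_read_unlock();	/* do_shrink_slab() may sleep */

		freed += do_shrink_slab(&sc, shrinker, priority);

		shrinker_put(shrinker);	/* last put wakes unregister_shrinker() */
		i++;
		goto again;		/* shrinker_info may have been reallocated */
	}
	rcu_read_unlock();
```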

We can reproduce the down_read_trylock() hotspot through the
following script:

```

DIR="/root/shrinker/memcg/mnt"

do_create()
{
mkdir -p /sys/fs/cgroup/memory/test
mkdir -p /sys/fs/cgroup/perf_event/test
echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
for i in `seq 0 $1`;
do
mkdir -p /sys/fs/cgroup/memory/test/$i;
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
mkdir -p $DIR/$i;
done
}

do_mount()
{
for i in `seq $1 $2`;
do
mount -t tmpfs $i $DIR/$i;
done
}

do_touch()
{
for i in `seq $1 $2`;
do
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
done
}

case "$1" in
touch)
do_touch $2 $3
;;
test)
do_create 4000
do_mount 0 4000
do_touch 0 3000
;;
*)
exit 1
;;
esac
```

Save the above script, then run test and touch commands.
Then we can use the following perf command to view hotspots:

perf top -U -F 999 [-g]

1) Before applying this patchset:

35.34% [kernel] [k] down_read_trylock
18.44% [kernel] [k] shrink_slab
15.98% [kernel] [k] pv_native_safe_halt
15.08% [kernel] [k] up_read
5.33% [kernel] [k] idr_find
2.71% [kernel] [k] _find_next_bit
2.21% [kernel] [k] shrink_node
1.29% [kernel] [k] shrink_lruvec
0.66% [kernel] [k] do_shrink_slab
0.33% [kernel] [k] list_lru_count_one
0.33% [kernel] [k] __radix_tree_lookup
0.25% [kernel] [k] mem_cgroup_iter

- 82.19% 19.49% [kernel] [k] shrink_slab
- 62.00% shrink_slab
36.37% down_read_trylock
15.52% up_read
5.48% idr_find
3.38% _find_next_bit
+ 0.98% do_shrink_slab

2) After applying this patchset:

46.83% [kernel] [k] shrink_slab
20.52% [kernel] [k] pv_native_safe_halt
8.85% [kernel] [k] do_shrink_slab
7.71% [kernel] [k] _find_next_bit
1.72% [kernel] [k] xas_descend
1.70% [kernel] [k] shrink_node
1.44% [kernel] [k] shrink_lruvec
1.43% [kernel] [k] mem_cgroup_iter
1.28% [kernel] [k] xas_load
0.89% [kernel] [k] super_cache_count
0.84% [kernel] [k] xas_start
0.66% [kernel] [k] list_lru_count_one

- 65.50% 40.44% [kernel] [k] shrink_slab
- 22.96% shrink_slab
13.11% _find_next_bit
- 9.91% do_shrink_slab
- 1.59% super_cache_count
0.92% list_lru_count_one

We can see that the first perf hotspot becomes shrink_slab,
which is what we expect.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 58 +++++++++++++++++++++++++++++++++++++----------------
1 file changed, 41 insertions(+), 17 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 767569698946..357a1f2ad690 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,6 +213,12 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
lockdep_is_held(&shrinker_rwsem));
}

+static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
+ int nid)
+{
+ return rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+}
+
static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int map_size, int defer_size,
int old_map_size, int old_defer_size,
@@ -339,7 +345,7 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
struct shrinker_info *info;

rcu_read_lock();
- info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+ info = shrinker_info_rcu(memcg, nid);
if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
/* Pairs with smp mb in shrink_slab() */
smp_mb__before_atomic();
@@ -359,7 +365,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
return -ENOSYS;

down_write(&shrinker_rwsem);
- /* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -392,18 +397,28 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
struct mem_cgroup *memcg)
{
struct shrinker_info *info;
+ long nr_deferred;

- info = shrinker_info_protected(memcg, nid);
- return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+ rcu_read_lock();
+ info = shrinker_info_rcu(memcg, nid);
+ nr_deferred = atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+ rcu_read_unlock();
+
+ return nr_deferred;
}

static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
struct mem_cgroup *memcg)
{
struct shrinker_info *info;
+ long nr_deferred;
+
+ rcu_read_lock();
+ info = shrinker_info_rcu(memcg, nid);
+ nr_deferred = atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+ rcu_read_unlock();

- info = shrinker_info_protected(memcg, nid);
- return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+ return nr_deferred;
}

void reparent_shrinker_deferred(struct mem_cgroup *memcg)
@@ -955,19 +970,18 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
{
struct shrinker_info *info;
unsigned long ret, freed = 0;
- int i;
+ int i = 0;

if (!mem_cgroup_online(memcg))
return 0;

- if (!down_read_trylock(&shrinker_rwsem))
- return 0;
-
- info = shrinker_info_protected(memcg, nid);
+again:
+ rcu_read_lock();
+ info = shrinker_info_rcu(memcg, nid);
if (unlikely(!info))
goto unlock;

- for_each_set_bit(i, info->map, info->map_nr_max) {
+ for_each_set_bit_from(i, info->map, info->map_nr_max) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -982,6 +996,10 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
continue;
}

+ if (!shrinker_try_get(shrinker))
+ continue;
+ rcu_read_unlock();
+
/* Call non-slab shrinkers even though kmem is disabled */
if (!memcg_kmem_online() &&
!(shrinker->flags & SHRINKER_NONSLAB))
@@ -1014,13 +1032,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
}
freed += ret;

- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
+ shrinker_put(shrinker);
+
+ /*
+ * We have already exited the read-side of rcu critical section
+ * before calling do_shrink_slab(), the shrinker_info may be
+ * released in expand_one_shrinker_info(), so restart the
+ * iteration.
+ */
+ i++;
+ goto again;
}
unlock:
- up_read(&shrinker_rwsem);
+ rcu_read_unlock();
return freed;
}
#else /* CONFIG_MEMCG */
--
2.30.2


2023-06-22 09:06:52

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 28/29] mm: shrinkers: convert shrinker_rwsem to mutex

Now that there are no readers of shrinker_rwsem, we can
simply replace it with a mutex.
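The conversion itself is mechanical; every remaining critical section is
exclusive, so each site changes as illustrated below (pattern only, mirroring
the hunks in this patch):

```
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
	...
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
```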

Signed-off-by: Qi Zheng <[email protected]>
---
drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
fs/super.c | 2 +-
mm/shrinker_debug.c | 14 +++++++-------
mm/vmscan.c | 34 +++++++++++++++++-----------------
5 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-cache-metadata.c b/drivers/md/dm-cache-metadata.c
index acffed750e3e..9e0c69958587 100644
--- a/drivers/md/dm-cache-metadata.c
+++ b/drivers/md/dm-cache-metadata.c
@@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs cmd root_lock).
- * - must take shrinker_rwsem without holding cmd->root_lock
+ * - must take shrinker_mutex without holding cmd->root_lock
*/
new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
CACHE_MAX_CONCURRENT_LOCKS);
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index fd464fb024c3..9f5cb52c5763 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs pmd root_lock).
- * - must take shrinker_rwsem without holding pmd->root_lock
+ * - must take shrinker_mutex without holding pmd->root_lock
*/
new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
THIN_MAX_CONCURRENT_LOCKS);
diff --git a/fs/super.c b/fs/super.c
index 791342bb8ac9..471800ff793a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -54,7 +54,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
* One thing we have to be careful of with a per-sb shrinker is that we don't
* drop the last active reference to the superblock from within the shrinker.
* If that happens we could trigger unregistering the shrinker from within the
- * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+ * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
* take a passive reference to the superblock to avoid this from occurring.
*/
static unsigned long super_cache_scan(struct shrinker *shrink,
diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index c18fa9b6b7f0..7ad903f84463 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -7,7 +7,7 @@
#include <linux/memcontrol.h>

/* defined in vmscan.c */
-extern struct rw_semaphore shrinker_rwsem;
+extern struct mutex shrinker_mutex;
extern struct list_head shrinker_list;

static DEFINE_IDA(shrinker_debugfs_ida);
@@ -177,7 +177,7 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
char buf[128];
int id;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

/* debugfs isn't initialized yet, add debugfs entries later. */
if (!shrinker_debugfs_root)
@@ -220,7 +220,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
if (!new)
return -ENOMEM;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);

old = shrinker->name;
shrinker->name = new;
@@ -238,7 +238,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
shrinker->debugfs_entry = entry;
}

- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

kfree_const(old);

@@ -251,7 +251,7 @@ struct dentry *shrinker_debugfs_detach(struct shrinker *shrinker,
{
struct dentry *entry = shrinker->debugfs_entry;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

kfree_const(shrinker->name);
shrinker->name = NULL;
@@ -280,14 +280,14 @@ static int __init shrinker_debugfs_init(void)
shrinker_debugfs_root = dentry;

/* Create debugfs entries for shrinkers registered at boot */
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_for_each_entry(shrinker, &shrinker_list, list)
if (!shrinker->debugfs_entry) {
ret = shrinker_debugfs_add(shrinker);
if (ret)
break;
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

return ret;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0711b63e68d9..bcdd97caa403 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,7 @@
#include <linux/cpuset.h>
#include <linux/compaction.h>
#include <linux/notifier.h>
-#include <linux/rwsem.h>
+#include <linux/mutex.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -190,7 +190,7 @@ struct scan_control {
int vm_swappiness = 60;

LIST_HEAD(shrinker_list);
-DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_MUTEX(shrinker_mutex);

#ifdef CONFIG_MEMCG
static int shrinker_nr_max;
@@ -210,7 +210,7 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
int nid)
{
return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
- lockdep_is_held(&shrinker_rwsem));
+ lockdep_is_held(&shrinker_mutex));
}

static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
@@ -283,7 +283,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
int nid, size, ret = 0;
int map_size, defer_size = 0;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
map_size = shrinker_map_size(shrinker_nr_max);
defer_size = shrinker_defer_size(shrinker_nr_max);
size = map_size + defer_size;
@@ -299,7 +299,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
info->map_nr_max = shrinker_nr_max;
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

return ret;
}
@@ -315,7 +315,7 @@ static int expand_shrinker_info(int new_id)
if (!root_mem_cgroup)
goto out;

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

map_size = shrinker_map_size(new_nr_max);
defer_size = shrinker_defer_size(new_nr_max);
@@ -364,7 +364,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
if (mem_cgroup_disabled())
return -ENOSYS;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -378,7 +378,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
shrinker->id = id;
ret = 0;
unlock:
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return ret;
}

@@ -388,7 +388,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)

BUG_ON(id < 0);

- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);

idr_remove(&shrinker_idr, id);
}
@@ -433,7 +433,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
parent = root_mem_cgroup;

/* Prevent from concurrent shrinker_info expand */
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
for_each_node(nid) {
child_info = shrinker_info_protected(memcg, nid);
parent_info = shrinker_info_protected(parent, nid);
@@ -442,7 +442,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
atomic_long_add(nr, &parent_info->nr_deferred[i]);
}
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}

static bool cgroup_reclaim(struct scan_control *sc)
@@ -743,9 +743,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
shrinker->name = NULL;
#endif
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
unregister_memcg_shrinker(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return;
}

@@ -755,13 +755,13 @@ void free_prealloced_shrinker(struct shrinker *shrinker)

void register_shrinker_prepared(struct shrinker *shrinker)
{
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
refcount_set(&shrinker->refcount, 1);
init_completion(&shrinker->completion_wait);
list_add_tail_rcu(&shrinker->list, &shrinker_list);
shrinker->flags |= SHRINKER_REGISTERED;
shrinker_debugfs_add(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}

static int __register_shrinker(struct shrinker *shrinker)
@@ -815,13 +815,13 @@ void unregister_shrinker(struct shrinker *shrinker)
shrinker_put(shrinker);
wait_for_completion(&shrinker->completion_wait);

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_del_rcu(&shrinker->list);
shrinker->flags &= ~SHRINKER_REGISTERED;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

shrinker_debugfs_remove(debugfs_entry, debugfs_id);

--
2.30.2


2023-06-22 09:06:55

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 26/29] mm: shrinker: make count and scan in shrinker debugfs lockless

Like the global and memcg slab shrink paths, also make the count and
scan operations in the shrinker debugfs interface lockless.

debugfs_remove_recursive() waits for debugfs_file_put() to return,
so there is no need to call rcu_read_lock() before calling
shrinker_try_get().
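The resulting pattern in the two debugfs handlers looks roughly like this
(simplified sketch of the diff below, error paths and memcg iteration
trimmed):

```
	if (!shrinker_try_get(shrinker))
		return 0;	/* shrinker is being unregistered */

	/* ... call shrinker->count_objects() / ->scan_objects() as before ... */

	shrinker_put(shrinker);
```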

Signed-off-by: Qi Zheng <[email protected]>
---
mm/shrinker_debug.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 3ab53fad8876..c18fa9b6b7f0 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -55,8 +55,8 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
if (!count_per_node)
return -ENOMEM;

- ret = down_read_killable(&shrinker_rwsem);
- if (ret) {
+ ret = shrinker_try_get(shrinker);
+ if (!ret) {
kfree(count_per_node);
return ret;
}
@@ -92,7 +92,7 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);

rcu_read_unlock();
- up_read(&shrinker_rwsem);
+ shrinker_put(shrinker);

kfree(count_per_node);
return ret;
@@ -146,8 +146,8 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
return -EINVAL;
}

- ret = down_read_killable(&shrinker_rwsem);
- if (ret) {
+ ret = shrinker_try_get(shrinker);
+ if (!ret) {
mem_cgroup_put(memcg);
return ret;
}
@@ -159,7 +159,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,

shrinker->scan_objects(shrinker, &sc);

- up_read(&shrinker_rwsem);
+ shrinker_put(shrinker);
mem_cgroup_put(memcg);

return size;
--
2.30.2


2023-06-22 09:06:59

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 27/29] mm: vmscan: hold write lock to reparent shrinker nr_deferred

For now, reparent_shrinker_deferred() is the only holder of the
read lock of shrinker_rwsem, and it already holds the global
cgroup_mutex, so it will not be called in parallel.

Therefore, in order to convert shrinker_rwsem to shrinker_mutex
later, change it to take the write lock of shrinker_rwsem while
reparenting.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 357a1f2ad690..0711b63e68d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -433,7 +433,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
parent = root_mem_cgroup;

/* Prevent from concurrent shrinker_info expand */
- down_read(&shrinker_rwsem);
+ down_write(&shrinker_rwsem);
for_each_node(nid) {
child_info = shrinker_info_protected(memcg, nid);
parent_info = shrinker_info_protected(parent, nid);
@@ -442,7 +442,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
atomic_long_add(nr, &parent_info->nr_deferred[i]);
}
}
- up_read(&shrinker_rwsem);
+ up_write(&shrinker_rwsem);
}

static bool cgroup_reclaim(struct scan_control *sc)
--
2.30.2


2023-06-23 05:38:21

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file

On (23/06/22 16:53), Qi Zheng wrote:
> +/*
> + * Remove one
> + */
> +void unregister_shrinker(struct shrinker *shrinker)
> +{
> + struct dentry *debugfs_entry;
> + int debugfs_id;
> +
> + if (!(shrinker->flags & SHRINKER_REGISTERED))
> + return;
> +
> + shrinker_put(shrinker);
> + wait_for_completion(&shrinker->completion_wait);
> +
> + mutex_lock(&shrinker_mutex);
> + list_del_rcu(&shrinker->list);

Should this function wait for RCU grace period(s) before it goes
touching shrinker fields?

> + shrinker->flags &= ~SHRINKER_REGISTERED;
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> + unregister_memcg_shrinker(shrinker);
> + debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
> + mutex_unlock(&shrinker_mutex);
> +
> + shrinker_debugfs_remove(debugfs_entry, debugfs_id);
> +
> + kfree(shrinker->nr_deferred);
> + shrinker->nr_deferred = NULL;
> +}
> +EXPORT_SYMBOL(unregister_shrinker);

[..]

> +void shrinker_free(struct shrinker *shrinker)
> +{
> + kfree(shrinker);
> +}
> +EXPORT_SYMBOL(shrinker_free);
> +
> +void unregister_and_free_shrinker(struct shrinker *shrinker)
> +{
> + unregister_shrinker(shrinker);
> + kfree_rcu(shrinker, rcu);
> +}

Seems like this

unregister_shrinker();
shrinker_free();

is not an exact equivalent of this

unregister_and_free_shrinker();
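That is, roughly (illustration only, not part of the patch):

```
	unregister_shrinker(shrinker);
	shrinker_free(shrinker);		/* kfree(): freed immediately */

	unregister_and_free_shrinker(shrinker);	/* kfree_rcu(): freed only after
						   an RCU grace period */
```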

2023-06-23 13:17:00

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file

Hi Vlastimil,

On 2023/6/22 22:53, Vlastimil Babka wrote:
> On 6/22/23 10:53, Qi Zheng wrote:
>> The mm/vmscan.c file is too large, so separate the shrinker-related
>> code from it into a separate file. No functional changes.
>>
>> Signed-off-by: Qi Zheng <[email protected]>
>
> Maybe do this move as patch 01 so the further changes are done in the new
> file already?

Sure, I will do it in the v2.

Thanks,
Qi

>

2023-06-23 13:26:20

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file

Hi Sergey,

On 2023/6/23 13:25, Sergey Senozhatsky wrote:
> On (23/06/22 16:53), Qi Zheng wrote:
>> +/*
>> + * Remove one
>> + */
>> +void unregister_shrinker(struct shrinker *shrinker)
>> +{
>> + struct dentry *debugfs_entry;
>> + int debugfs_id;
>> +
>> + if (!(shrinker->flags & SHRINKER_REGISTERED))
>> + return;
>> +
>> + shrinker_put(shrinker);
>> + wait_for_completion(&shrinker->completion_wait);
>> +
>> + mutex_lock(&shrinker_mutex);
>> + list_del_rcu(&shrinker->list);
>
> Should this function wait for RCU grace period(s) before it goes
> touching shrinker fields?

Why? We free this shrinker instance via RCU (kfree_rcu()) only after
unregister_shrinker() has finished, so it is safe to touch the shrinker
fields here.
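The intended lifetime ordering, sketched for clarity (simplified; the full
code is in unregister_shrinker()/unregister_and_free_shrinker() quoted
above):

```
	shrinker_put(shrinker);				  /* drop the initial ref */
	wait_for_completion(&shrinker->completion_wait);  /* all users are gone */

	/* no references remain, so the fields can be modified safely */
	list_del_rcu(&shrinker->list);
	shrinker->flags &= ~SHRINKER_REGISTERED;
	...
	kfree_rcu(shrinker, rcu);	/* the memory itself is freed only after
					   an RCU grace period */
```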

>
>> + shrinker->flags &= ~SHRINKER_REGISTERED;
>> + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>> + unregister_memcg_shrinker(shrinker);
>> + debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
>> + mutex_unlock(&shrinker_mutex);
>> +
>> + shrinker_debugfs_remove(debugfs_entry, debugfs_id);
>> +
>> + kfree(shrinker->nr_deferred);
>> + shrinker->nr_deferred = NULL;
>> +}
>> +EXPORT_SYMBOL(unregister_shrinker);
>
> [..]
>
>> +void shrinker_free(struct shrinker *shrinker)
>> +{
>> + kfree(shrinker);
>> +}
>> +EXPORT_SYMBOL(shrinker_free);
>> +
>> +void unregister_and_free_shrinker(struct shrinker *shrinker)
>> +{
>> + unregister_shrinker(shrinker);
>> + kfree_rcu(shrinker, rcu);
>> +}
>
> Seems like this
>
> unregister_shrinker();
> shrinker_free();
>
> is not exact equivalent of this
>
> unregister_and_free_shrinker();

Yes, my original intention was that shrinker_free() would only be used to
handle the case where register_shrinker() fails.

I will implement the method suggested by Dave in 02/29. Those APIs are
more concise and will bring more benefits. :)

Thanks,
Qi




2023-06-23 13:36:54

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 05/29] drm/panfrost: dynamically allocate the drm-panfrost shrinker

Hi Steven,

The email you replied to was the failed version (due to the error
below), so I copied your reply and replied to you on this successful
version.

(4.7.1 Error: too many recipients from 49.7.199.173)

On 2023/6/23 18:01, Steven Price wrote:
> On 22/06/2023 09:39, Qi Zheng wrote:
>> From: Qi Zheng <[email protected]>
>>
>> In preparation for implementing lockless slab shrink,
>> we need to dynamically allocate the drm-panfrost shrinker,
>> so that it can be freed asynchronously using kfree_rcu().
>> Then it doesn't need to wait for RCU read-side critical
>> section when releasing the struct panfrost_device.
>>
>> Signed-off-by: Qi Zheng <[email protected]>
>> ---
>> drivers/gpu/drm/panfrost/panfrost_device.h | 2 +-
>> .../gpu/drm/panfrost/panfrost_gem_shrinker.c | 24 ++++++++++---------
>> 2 files changed, 14 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
>> index b0126b9fbadc..e667e5689353 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
>> @@ -118,7 +118,7 @@ struct panfrost_device {
>>
>> struct mutex shrinker_lock;
>> struct list_head shrinker_list;
>> - struct shrinker shrinker;
>> + struct shrinker *shrinker;
>>
>> struct panfrost_devfreq pfdevfreq;
>> };
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
>> index bf0170782f25..2a5513eb9e1f 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
>> @@ -18,8 +18,7 @@
>> static unsigned long
>> panfrost_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
>> {
>> - struct panfrost_device *pfdev =
>> - container_of(shrinker, struct panfrost_device, shrinker);
>> + struct panfrost_device *pfdev = shrinker->private_data;
>> struct drm_gem_shmem_object *shmem;
>> unsigned long count = 0;
>>
>> @@ -65,8 +64,7 @@ static bool panfrost_gem_purge(struct drm_gem_object *obj)
>> static unsigned long
>> panfrost_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
>> {
>> - struct panfrost_device *pfdev =
>> - container_of(shrinker, struct panfrost_device, shrinker);
>> + struct panfrost_device *pfdev = shrinker->private_data;
>> struct drm_gem_shmem_object *shmem, *tmp;
>> unsigned long freed = 0;
>>
>> @@ -100,10 +98,15 @@ panfrost_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
>> void panfrost_gem_shrinker_init(struct drm_device *dev)
>> {
>> struct panfrost_device *pfdev = dev->dev_private;
>> - pfdev->shrinker.count_objects = panfrost_gem_shrinker_count;
>> - pfdev->shrinker.scan_objects = panfrost_gem_shrinker_scan;
>> - pfdev->shrinker.seeks = DEFAULT_SEEKS;
>> - WARN_ON(register_shrinker(&pfdev->shrinker, "drm-panfrost"));
>> +
>> + pfdev->shrinker = shrinker_alloc_and_init(panfrost_gem_shrinker_count,
>> + panfrost_gem_shrinker_scan, 0,
>> + DEFAULT_SEEKS, 0, pfdev);
>> + if (pfdev->shrinker &&
>> + register_shrinker(pfdev->shrinker, "drm-panfrost")) {
>> + shrinker_free(pfdev->shrinker);
>> + WARN_ON(1);
>> + }
>
> So we didn't have good error handling here before, but this is
> significantly worse. Previously if register_shrinker() failed then the
> driver could safely continue without a shrinker - it would waste memory
> but still function.
>
> However we now have two failure conditions:
> * shrinker_alloc_and_init() returns NULL. No warning, and NULL dereferences
> will happen later.
>
> * register_shrinker() fails: shrinker_free() will free pfdev->shrinker and
> we get a warning, but it is followed by a use-after-free later.
>
> I think we need to modify panfrost_gem_shrinker_init() to be able to
> return an error, with a change something like the one below (untested)
> before your change.

Indeed. I will fix it in the v2.

Thanks,
Qi

>
> Steve
>
> ----8<---
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index bbada731bbbd..f705bbdea360 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -598,10 +598,14 @@ static int panfrost_probe(struct platform_device *pdev)
> if (err < 0)
> goto err_out1;
>
> - panfrost_gem_shrinker_init(ddev);
> + err = panfrost_gem_shrinker_init(ddev);
> + if (err)
> + goto err_out2;
>
> return 0;
>
> +err_out2:
> + drm_dev_unregister(ddev);
> err_out1:
> pm_runtime_disable(pfdev->dev);
> panfrost_device_fini(pfdev);
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h
> index ad2877eeeccd..863d2ec8d4f0 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
> @@ -81,7 +81,7 @@ panfrost_gem_mapping_get(struct panfrost_gem_object *bo,
> void panfrost_gem_mapping_put(struct panfrost_gem_mapping *mapping);
> void panfrost_gem_teardown_mappings_locked(struct panfrost_gem_object *bo);
>
> -void panfrost_gem_shrinker_init(struct drm_device *dev);
> +int panfrost_gem_shrinker_init(struct drm_device *dev);
> void panfrost_gem_shrinker_cleanup(struct drm_device *dev);
>
> #endif /* __PANFROST_GEM_H__ */
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
> index bf0170782f25..90265b37636f 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
> @@ -97,13 +97,17 @@ panfrost_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
> *
> * This function registers and sets up the panfrost shrinker.
> */
> -void panfrost_gem_shrinker_init(struct drm_device *dev)
> +int panfrost_gem_shrinker_init(struct drm_device *dev)
> {
> struct panfrost_device *pfdev = dev->dev_private;
> + int ret;
> +
> pfdev->shrinker.count_objects = panfrost_gem_shrinker_count;
> pfdev->shrinker.scan_objects = panfrost_gem_shrinker_scan;
> pfdev->shrinker.seeks = DEFAULT_SEEKS;
> - WARN_ON(register_shrinker(&pfdev->shrinker, "drm-panfrost"));
> + ret = register_shrinker(&pfdev->shrinker, "drm-panfrost");
> +
> + return ret;
> }
>
> /**
>


2023-06-23 22:00:57

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker

On Thu, Jun 22, 2023 at 04:53:21PM +0800, Qi Zheng wrote:
> In preparation for implementing lockless slab shrink,
> we need to dynamically allocate the nfsd-client shrinker,
> so that it can be freed asynchronously using kfree_rcu().
> Then it doesn't need to wait for RCU read-side critical
> section when releasing the struct nfsd_net.
>
> Signed-off-by: Qi Zheng <[email protected]>

For 15/29 and 16/29 of this series:

Acked-by: Chuck Lever <[email protected]>


> ---
> fs/nfsd/netns.h | 2 +-
> fs/nfsd/nfs4state.c | 20 ++++++++++++--------
> 2 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
> index ec49b200b797..f669444d5336 100644
> --- a/fs/nfsd/netns.h
> +++ b/fs/nfsd/netns.h
> @@ -195,7 +195,7 @@ struct nfsd_net {
> int nfs4_max_clients;
>
> atomic_t nfsd_courtesy_clients;
> - struct shrinker nfsd_client_shrinker;
> + struct shrinker *nfsd_client_shrinker;
> struct work_struct nfsd_shrinker_work;
> };
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 6e61fa3acaf1..a06184270548 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4388,8 +4388,7 @@ static unsigned long
> nfsd4_state_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
> {
> int count;
> - struct nfsd_net *nn = container_of(shrink,
> - struct nfsd_net, nfsd_client_shrinker);
> + struct nfsd_net *nn = shrink->private_data;
>
> count = atomic_read(&nn->nfsd_courtesy_clients);
> if (!count)
> @@ -8094,14 +8093,19 @@ static int nfs4_state_create_net(struct net *net)
> INIT_WORK(&nn->nfsd_shrinker_work, nfsd4_state_shrinker_worker);
> get_net(net);
>
> - nn->nfsd_client_shrinker.scan_objects = nfsd4_state_shrinker_scan;
> - nn->nfsd_client_shrinker.count_objects = nfsd4_state_shrinker_count;
> - nn->nfsd_client_shrinker.seeks = DEFAULT_SEEKS;
> -
> - if (register_shrinker(&nn->nfsd_client_shrinker, "nfsd-client"))
> + nn->nfsd_client_shrinker = shrinker_alloc_and_init(nfsd4_state_shrinker_count,
> + nfsd4_state_shrinker_scan,
> + 0, DEFAULT_SEEKS, 0,
> + nn);
> + if (!nn->nfsd_client_shrinker)
> goto err_shrinker;
> +
> + if (register_shrinker(nn->nfsd_client_shrinker, "nfsd-client"))
> + goto err_register;
> return 0;
>
> +err_register:
> + shrinker_free(nn->nfsd_client_shrinker);
> err_shrinker:
> put_net(net);
> kfree(nn->sessionid_hashtbl);
> @@ -8197,7 +8201,7 @@ nfs4_state_shutdown_net(struct net *net)
> struct list_head *pos, *next, reaplist;
> struct nfsd_net *nn = net_generic(net, nfsd_net_id);
>
> - unregister_shrinker(&nn->nfsd_client_shrinker);
> + unregister_and_free_shrinker(nn->nfsd_client_shrinker);
> cancel_work(&nn->nfsd_shrinker_work);
> cancel_delayed_work_sync(&nn->laundromat_work);
> locks_end_grace(&nn->nfsd4_manager);
> --
> 2.30.2
>

2023-06-24 11:31:24

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker

Hi Chuck,

On 2023/6/24 05:49, Chuck Lever wrote:
> On Thu, Jun 22, 2023 at 04:53:21PM +0800, Qi Zheng wrote:
>> In preparation for implementing lockless slab shrink,
>> we need to dynamically allocate the nfsd-client shrinker,
>> so that it can be freed asynchronously using kfree_rcu().
>> Then it doesn't need to wait for RCU read-side critical
>> section when releasing the struct nfsd_net.
>>
>> Signed-off-by: Qi Zheng <[email protected]>
>
> For 15/29 and 16/29 of this series:
>
> Acked-by: Chuck Lever <[email protected]>

Thanks for your review! :)

I will implement the APIs suggested by Dave in 02/29 in the v2,
so there will be some changes here, but they should not be
significant, so I will keep your Acked-bys in the v2.

Thanks,
Qi

>
>
>> ---
>> fs/nfsd/netns.h | 2 +-
>> fs/nfsd/nfs4state.c | 20 ++++++++++++--------
>> 2 files changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
>> index ec49b200b797..f669444d5336 100644
>> --- a/fs/nfsd/netns.h
>> +++ b/fs/nfsd/netns.h
>> @@ -195,7 +195,7 @@ struct nfsd_net {
>> int nfs4_max_clients;
>>
>> atomic_t nfsd_courtesy_clients;
>> - struct shrinker nfsd_client_shrinker;
>> + struct shrinker *nfsd_client_shrinker;
>> struct work_struct nfsd_shrinker_work;
>> };
>>
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index 6e61fa3acaf1..a06184270548 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -4388,8 +4388,7 @@ static unsigned long
>> nfsd4_state_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
>> {
>> int count;
>> - struct nfsd_net *nn = container_of(shrink,
>> - struct nfsd_net, nfsd_client_shrinker);
>> + struct nfsd_net *nn = shrink->private_data;
>>
>> count = atomic_read(&nn->nfsd_courtesy_clients);
>> if (!count)
>> @@ -8094,14 +8093,19 @@ static int nfs4_state_create_net(struct net *net)
>> INIT_WORK(&nn->nfsd_shrinker_work, nfsd4_state_shrinker_worker);
>> get_net(net);
>>
>> - nn->nfsd_client_shrinker.scan_objects = nfsd4_state_shrinker_scan;
>> - nn->nfsd_client_shrinker.count_objects = nfsd4_state_shrinker_count;
>> - nn->nfsd_client_shrinker.seeks = DEFAULT_SEEKS;
>> -
>> - if (register_shrinker(&nn->nfsd_client_shrinker, "nfsd-client"))
>> + nn->nfsd_client_shrinker = shrinker_alloc_and_init(nfsd4_state_shrinker_count,
>> + nfsd4_state_shrinker_scan,
>> + 0, DEFAULT_SEEKS, 0,
>> + nn);
>> + if (!nn->nfsd_client_shrinker)
>> goto err_shrinker;
>> +
>> + if (register_shrinker(nn->nfsd_client_shrinker, "nfsd-client"))
>> + goto err_register;
>> return 0;
>>
>> +err_register:
>> + shrinker_free(nn->nfsd_client_shrinker);
>> err_shrinker:
>> put_net(net);
>> kfree(nn->sessionid_hashtbl);
>> @@ -8197,7 +8201,7 @@ nfs4_state_shutdown_net(struct net *net)
>> struct list_head *pos, *next, reaplist;
>> struct nfsd_net *nn = net_generic(net, nfsd_net_id);
>>
>> - unregister_shrinker(&nn->nfsd_client_shrinker);
>> + unregister_and_free_shrinker(nn->nfsd_client_shrinker);
>> cancel_work(&nn->nfsd_shrinker_work);
>> cancel_delayed_work_sync(&nn->laundromat_work);
>> locks_end_grace(&nn->nfsd4_manager);
>> --
>> 2.30.2
>>