2021-02-17 00:16:49

by Yang Shi

Subject: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware


Changelog
v7 --> v8:
* Added lockdep assert in expand_shrinker_info() per Roman.
* Added patch 05/13 to use kvfree_rcu() instead of call_rcu() per Roman
and Kirill.
* Moved rwsem acquire/release out of unregister_memcg_shrinker() per Roman.
* Renamed count_nr_deferred_{memcg} to xchg_nr_deferred_{memcg} per Roman.
* Fixed the next_deferred logic per Vlastimil.
* Misc minor code cleanup, refactor and spelling correction per Roman
and Shakeel.
* Collected more ack and review tags from Roman, Shakeel and Vlastimil.
v6 --> v7:
* Expanded shrinker_info in a batch of BITS_PER_LONG per Kirill.
* Added patch 06/12 to introduce a helper for dereferencing shrinker_info
per Kirill.
* Renamed set_nr_deferred_memcg to add_nr_deferred_memcg per Kirill.
* Collected Acked-by from Kirill.
v5 --> v6:
* Rebased on top of https://lore.kernel.org/linux-mm/[email protected]/
per Kirill.
* Don't register shrinker idr with NULL and remove idr_replace() per Vlastimil.
* Move nr_deferred before map to guarantee the alignment per Vlastimil.
* Misc minor code cleanup and refactor per Kirill and Vlastimil.
* Added Acked-by from Vlastimil for patch #1, #2, #3, #5, #9 and #10.
v4 --> v5:
* Incorporated the comments from Kirill.
* Rebased to v5.11-rc5.
v3 --> v4:
* Removed "memcg_" prefix for shrinker_maps related functions per Roman.
* Use write lock instead of read lock per Kirill. Also removed Johannes's ack
since write lock is used.
* Incorporated the comments from Kirill.
* Removed RFC.
* Rebased to v5.11-rc4.
v2 --> v3:
* Moved shrinker_maps related code to vmscan.c per Dave.
* Removed memcg_shrinker_map_size. Calculated the size of map via shrinker_nr_max
per Johannes.
* Consolidated shrinker_deferred with shrinker_maps into one struct per Dave.
* Simplified the nr_deferred related code.
* Dropped the memory barrier from v2.
* Moved nr_deferred reparent code to vmscan.c per Dave.
* Added test coverage information in patch #11. Dave is concerned about the
potential regression. I didn't notice any regression with my tests, but suggestions
about more test coverage are definitely welcome. Having this patch in the -mm tree and then
the linux-next tree may help spot regressions, so I keep it in this version.
* The code cleanup and consolidation resulted in the series growing to 11 patches.
* Rebased onto 5.11-rc2.
v1 --> v2:
* Use shrinker->flags to store the new SHRINKER_REGISTERED flag per Roman.
* Folded patch #1 into patch #6 per Roman.
* Added memory barrier to prevent shrink_slab_memcg from seeing NULL shrinker_maps/
shrinker_deferred per Kirill.
* Removed memcg_shrinker_map_mutex. Protected shrinker_map/shrinker_deferred
allocations from expand with shrinker_rwsem per Johannes.

Recently a huge one-off slab drop was seen on some vfs metadata heavy workloads.
It turned out that the shrinker had accumulated a huge number of nr_deferred
objects.

On our production machine, I saw an absurd nr_deferred value, as shown in the
tracing result below:

<...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
9300 cache items 1667 delta 11 total_scan 833

There are 2.5 trillion deferred objects on one node. Assuming all of them
are dentries (192 bytes per object), the total size of the deferred objects on
one node is ~480TB. That is definitely ridiculous.
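
As a quick check on the estimate above, here is a minimal sketch that multiplies
the traced count by the assumed object size. The function names are hypothetical,
and the 192 bytes per dentry is the assumption stated in the text, not a measured
value.

```c
#include <stdint.h>

/* Hypothetical helper: total bytes represented by nr_objects deferred
 * objects, assuming every object is obj_size bytes. */
uint64_t deferred_bytes(uint64_t nr_objects, uint64_t obj_size)
{
	return nr_objects * obj_size;
}

/* The same figure in whole terabytes (10^12 bytes), rounded down. */
uint64_t deferred_terabytes(uint64_t nr_objects, uint64_t obj_size)
{
	return deferred_bytes(nr_objects, obj_size) / 1000000000000ULL;
}
```

With the exact count from the trace (2531805877005) this gives 486 TB, in line with
the ~480TB obtained from the rounded 2.5 trillion.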

I managed to reproduce this problem with a kernel build workload plus a negative
dentry generator.

First, run the kernel build test script below:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

cd /root/Buildarea/linux-stable

for i in `seq 1500`; do
    cgcreate -g memory:kern_build
    echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes

    echo 3 > /proc/sys/vm/drop_caches
    cgexec -g memory:kern_build make clean > /dev/null 2>&1
    cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1

    cgdelete -g memory:kern_build
done

Then run the negative dentry generator script below:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

mkdir /sys/fs/cgroup/memory/test
echo $$ > /sys/fs/cgroup/memory/test/tasks

for i in `seq $NR_CPUS`; do
    while true; do
        FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
        cat $FILE 2>/dev/null
    done &
done

Then kswapd will shrink half of the dentry cache in just one loop, as the tracing
result below shows:

kswapd0-475 [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
kswapd0-475 [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928

There was a huge number of deferred objects before the shrinker was called. The behavior
does match the code, but it might not be desirable from the user's standpoint.

An excessive amount of nr_deferred might accumulate for various reasons, for example:
* GFP_NOFS allocations
* A large number of small scans (< scan_batch, 1024 for vfs metadata)
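
To illustrate how the windup can happen, here is a simplified userspace model of
the accounting this series replaces: each call takes the deferred count, adds the
new work (delta), and puts back whatever was not scanned. The function names are
hypothetical, and the overflow handling and batching of the real do_shrink_slab()
are left out.

```c
/* One call of the simplified pre-series accounting: deferred work plus
 * new work, minus what was actually scanned, never below zero. */
long defer_one_call(long nr_deferred, long delta, long scanned)
{
	long next_deferred = nr_deferred + delta;

	if (next_deferred >= scanned)
		next_deferred -= scanned;
	else
		next_deferred = 0;
	return next_deferred;
}

/* Repeat `calls` invocations in which nothing gets scanned, e.g. because
 * the shrinker declines to do work in a GFP_NOFS context. */
long windup(long calls, long delta)
{
	long nr_deferred = 0;

	for (long i = 0; i < calls; i++)
		nr_deferred = defer_one_call(nr_deferred, delta, 0);
	return nr_deferred;
}
```

In this model every call that scans nothing adds its whole delta to the deferred
count, so the count grows linearly with the number of such calls and nothing
bounds it.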

However, the LRUs of slabs are per memcg (for memcg-aware shrinkers) while the deferred
objects are per shrinker. This may have some bad effects:
* Poor isolation among memcgs. Some memcgs which happen to have frequent limit
reclaim may accumulate a huge nr_deferred, then other innocent
memcgs may take the fall. In our case the main workload was hit.
* Unbounded deferred objects. There is no cap on deferred objects, so the count can grow
ridiculously large, as the tracing result showed.
* Easy to get out of control. Although shrinkers take deferred objects into account,
the count can go out of control easily. One misconfigured memcg could incur an absurd
amount of deferred objects in a period of time.
* Assorted reclaim problems, e.g. over-reclaim, long reclaim latency, etc. There may be
hundreds of GB of slab caches for a vfs metadata heavy workload, and shrinking half of
them may take minutes. We observed latency spikes due to the prolonged reclaim.

These issues have also been discussed in https://lore.kernel.org/linux-mm/[email protected]/.
This patchset is the outcome of that discussion.

So this patchset makes nr_deferred per-memcg to tackle the problem. It does the following:
* Have memcg_shrinker_deferred per memcg per node, just like shrinker_map.
Unlike shrinker_map, it is an atomic_long_t array in which each element represents one
shrinker, even if that shrinker is not memcg aware; this simplifies the implementation.
For memcg aware shrinkers, the deferred objects are accumulated only to their own
memcg, and the shrinkers only see nr_deferred from their own memcg. Non memcg aware
shrinkers still use the global nr_deferred from struct shrinker.
* Once a memcg is offlined, its nr_deferred is reparented to its parent along
with the LRUs.
* The root memcg has a memcg_shrinker_deferred array too. This simplifies the handling of
reparenting to the root memcg.
* Cap nr_deferred to 2x the length of the lru. The idea is borrowed from Dave Chinner's
series (https://lore.kernel.org/linux-xfs/[email protected]/)
The downside is that each memcg has to allocate extra memory to store the nr_deferred array.
In our production environment there are typically around 40 shrinkers, so each memcg
needs ~320 bytes, and 10K memcgs would need ~3.2MB of memory. That seems fine.
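
The overhead estimate above works out as follows, in a minimal sketch. The function
names are hypothetical, and the 8-byte counter is an assumption matching
atomic_long_t on a 64-bit kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of nr_deferred storage for one memcg, assuming one 8-byte
 * counter per shrinker. */
size_t nr_deferred_bytes_per_memcg(size_t nr_shrinkers)
{
	return nr_shrinkers * sizeof(int64_t);
}

/* Total bytes across nr_memcgs memcgs. */
size_t nr_deferred_bytes_total(size_t nr_memcgs, size_t nr_shrinkers)
{
	return nr_memcgs * nr_deferred_bytes_per_memcg(nr_shrinkers);
}
```

For 40 shrinkers this gives 320 bytes per memcg, and 3,200,000 bytes (~3.2MB) for
10K memcgs.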

We have been running the patched kernel on some hosts of our fleet (test and production) for
months and it works very well. The monitoring data shows the working set is sustained as expected.

Yang Shi (13):
mm: vmscan: use nid from shrink_control for tracepoint
mm: vmscan: consolidate shrinker_maps handling code
mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
mm: vmscan: remove memcg_shrinker_map_size
mm: vmscan: use kvfree_rcu instead of call_rcu
mm: memcontrol: rename shrinker_map to shrinker_info
mm: vmscan: add shrinker_info_protected() helper
mm: vmscan: use a new flag to indicate shrinker is registered
mm: vmscan: add per memcg shrinker nr_deferred
mm: vmscan: use per memcg nr_deferred of shrinker
mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
mm: memcontrol: reparent nr_deferred when memcg offline
mm: vmscan: shrink deferred objects proportional to priority

include/linux/memcontrol.h | 23 +++---
include/linux/shrinker.h | 7 +-
mm/huge_memory.c | 4 +-
mm/list_lru.c | 6 +-
mm/memcontrol.c | 130 +------------------------------
mm/vmscan.c | 394 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
6 files changed, 319 insertions(+), 245 deletions(-)


2021-02-17 00:17:08

by Yang Shi

Subject: [v8 PATCH 01/13] mm: vmscan: use nid from shrink_control for tracepoint

The tracepoint's nid should show which node the shrink happens on. The start tracepoint
uses the nid from shrinkctl, but the local nid might be set to 0 before the end tracepoint
if the shrinker is not NUMA aware, so the tracing log may show the shrink starting on one
node but ending on another. This is confusing. A following patch
will also remove the direct use of nid in do_shrink_slab(), so this patch helps clean up
the code as well.

Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1b574ad199d..b512dd5e3a1c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -535,7 +535,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
else
new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);

- trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+ trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
return freed;
}

--
2.26.2

2021-02-17 00:17:31

by Yang Shi

Subject: [v8 PATCH 04/13] mm: vmscan: remove memcg_shrinker_map_size

Both memcg_shrinker_map_size and shrinker_nr_max are maintained, but the
map size can be calculated from shrinker_nr_max, so it is unnecessary to keep both.
Remove memcg_shrinker_map_size, since shrinker_nr_max is also used for iterating the
bit map.

Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 543af6ec1e02..2e753c2516fa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -185,8 +185,12 @@ static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

#ifdef CONFIG_MEMCG
+static int shrinker_nr_max;

-static int memcg_shrinker_map_size;
+static inline int shrinker_map_size(int nr_items)
+{
+ return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
+}

static void free_shrinker_map_rcu(struct rcu_head *head)
{
@@ -247,7 +251,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
return 0;

down_write(&shrinker_rwsem);
- size = memcg_shrinker_map_size;
+ size = shrinker_map_size(shrinker_nr_max);
for_each_node(nid) {
map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
if (!map) {
@@ -265,12 +269,13 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
static int expand_shrinker_maps(int new_id)
{
int size, old_size, ret = 0;
+ int new_nr_max = new_id + 1;
struct mem_cgroup *memcg;

- size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
- old_size = memcg_shrinker_map_size;
+ size = shrinker_map_size(new_nr_max);
+ old_size = shrinker_map_size(shrinker_nr_max);
if (size <= old_size)
- return 0;
+ goto out;

if (!root_mem_cgroup)
goto out;
@@ -289,7 +294,7 @@ static int expand_shrinker_maps(int new_id)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
out:
if (!ret)
- memcg_shrinker_map_size = size;
+ shrinker_nr_max = new_nr_max;

return ret;
}
@@ -322,7 +327,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)

static DEFINE_IDR(shrinker_idr);
-static int shrinker_nr_max;

static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
@@ -339,8 +343,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
idr_remove(&shrinker_idr, id);
goto unlock;
}
-
- shrinker_nr_max = id + 1;
}
shrinker->id = id;
ret = 0;
--
2.26.2
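
For readers following the size calculation in patch 04/13 above, here is a
standalone userspace sketch of the same arithmetic. The MODEL_* macros and
model_shrinker_map_size() are hypothetical stand-ins for the kernel's
BITS_PER_LONG, DIV_ROUND_UP() and shrinker_map_size().

```c
#include <limits.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel macros used by the patch. */
#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Bytes needed for a bitmap with one bit per shrinker id, rounded up
 * to a whole number of unsigned longs. */
size_t model_shrinker_map_size(size_t nr_items)
{
	return MODEL_DIV_ROUND_UP(nr_items, MODEL_BITS_PER_LONG) *
	       sizeof(unsigned long);
}
```

Because the size is a pure function of the number of shrinker ids, tracking
shrinker_nr_max alone is enough to recover both the old and the new map size.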

2021-02-17 00:18:08

by Yang Shi

Subject: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

Use kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
We don't have to define a dedicated callback for call_rcu() anymore.

Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e753c2516fa..c2a309acd86b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
}

-static void free_shrinker_map_rcu(struct rcu_head *head)
-{
- kvfree(container_of(head, struct memcg_shrinker_map, rcu));
-}
-
static int expand_one_shrinker_map(struct mem_cgroup *memcg,
int size, int old_size)
{
@@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
memset((void *)new->map + old_size, 0, size - old_size);

rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
- call_rcu(&old->rcu, free_shrinker_map_rcu);
+ kvfree_rcu(old);
}

return 0;
--
2.26.2

2021-02-17 00:19:02

by Yang Shi

Subject: [v8 PATCH 03/13] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation

Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
exclusively, the read side can be protected by holding the read lock, so a
dedicated mutex is superfluous.

Kirill Tkhai suggested using the write lock since:

* We want the assignment to shrinker_maps to be visible to shrink_slab_memcg().
* shrink_slab_memcg() dereferences with rcu_dereference_protected(), but
if we use the READ lock in alloc_shrinker_maps(), the dereference
is not actually protected.
* The READ lock makes alloc_shrinker_info() racy against memory allocation failure.
alloc_shrinker_info()->free_shrinker_info() may free memory right after
shrink_slab_memcg() dereferenced it. You may say
shrink_slab_memcg()->mem_cgroup_online() protects us from it? Yes, sure,
but this is not the thing we want to remember in the future, since this
spreads modularity.

A test with a heavy paging workload didn't show that the write lock makes things worse.

Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 96b08c79f18d..543af6ec1e02 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
#ifdef CONFIG_MEMCG

static int memcg_shrinker_map_size;
-static DEFINE_MUTEX(memcg_shrinker_map_mutex);

static void free_shrinker_map_rcu(struct rcu_head *head)
{
@@ -200,8 +199,6 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
struct memcg_shrinker_map *new, *old;
int nid;

- lockdep_assert_held(&memcg_shrinker_map_mutex);
-
for_each_node(nid) {
old = rcu_dereference_protected(
mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
@@ -249,7 +246,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
if (mem_cgroup_is_root(memcg))
return 0;

- mutex_lock(&memcg_shrinker_map_mutex);
+ down_write(&shrinker_rwsem);
size = memcg_shrinker_map_size;
for_each_node(nid) {
map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
@@ -260,7 +257,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
}
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
}
- mutex_unlock(&memcg_shrinker_map_mutex);
+ up_write(&shrinker_rwsem);

return ret;
}
@@ -275,9 +272,10 @@ static int expand_shrinker_maps(int new_id)
if (size <= old_size)
return 0;

- mutex_lock(&memcg_shrinker_map_mutex);
if (!root_mem_cgroup)
- goto unlock;
+ goto out;
+
+ lockdep_assert_held(&shrinker_rwsem);

memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
@@ -286,13 +284,13 @@ static int expand_shrinker_maps(int new_id)
ret = expand_one_shrinker_map(memcg, size, old_size);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
- goto unlock;
+ goto out;
}
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-unlock:
+out:
if (!ret)
memcg_shrinker_map_size = size;
- mutex_unlock(&memcg_shrinker_map_mutex);
+
return ret;
}

--
2.26.2

2021-02-17 00:20:10

by Yang Shi

Subject: [v8 PATCH 08/13] mm: vmscan: use a new flag to indicate shrinker is registered

Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
This approach is fine with nr_deferred at the shrinker level, but the following
patches will move MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so their
shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
from unregistering correctly.

Remove SHRINKER_REGISTERING, since the new flag lets us check whether a shrinker
was registered successfully.

Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/shrinker.h | 7 ++++---
mm/vmscan.c | 40 +++++++++++++++-------------------------
2 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..1eac79ce57d4 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -79,13 +79,14 @@ struct shrinker {
#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */

/* Flags */
-#define SHRINKER_NUMA_AWARE (1 << 0)
-#define SHRINKER_MEMCG_AWARE (1 << 1)
+#define SHRINKER_REGISTERED (1 << 0)
+#define SHRINKER_NUMA_AWARE (1 << 1)
+#define SHRINKER_MEMCG_AWARE (1 << 2)
/*
* It just makes sense when the shrinker is also MEMCG_AWARE for now,
* non-MEMCG_AWARE shrinker should not have this flag set.
*/
-#define SHRINKER_NONSLAB (1 << 2)
+#define SHRINKER_NONSLAB (1 << 3)

extern int prealloc_shrinker(struct shrinker *shrinker);
extern void register_shrinker_prepared(struct shrinker *shrinker);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe6e25f46b55..a1047ea60ecf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -314,19 +314,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
}
}

-/*
- * We allow subsystems to populate their shrinker-related
- * LRU lists before register_shrinker_prepared() is called
- * for the shrinker, since we don't want to impose
- * restrictions on their internal registration order.
- * In this case shrink_slab_memcg() may find corresponding
- * bit is set in the shrinkers map.
- *
- * This value is used by the function to detect registering
- * shrinkers and to skip do_shrink_slab() calls for them.
- */
-#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
-
static DEFINE_IDR(shrinker_idr);

static int prealloc_memcg_shrinker(struct shrinker *shrinker)
@@ -335,7 +322,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)

down_write(&shrinker_rwsem);
/* This may call shrinker, so it must use down_read_trylock() */
- id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
+ id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;

@@ -358,9 +345,9 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)

BUG_ON(id < 0);

- down_write(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_rwsem);
+
idr_remove(&shrinker_idr, id);
- up_write(&shrinker_rwsem);
}

static bool cgroup_reclaim(struct scan_control *sc)
@@ -487,8 +474,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return;

- if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+ down_write(&shrinker_rwsem);
unregister_memcg_shrinker(shrinker);
+ up_write(&shrinker_rwsem);
+ }

kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
@@ -498,10 +488,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
list_add_tail(&shrinker->list, &shrinker_list);
-#ifdef CONFIG_MEMCG
- if (shrinker->flags & SHRINKER_MEMCG_AWARE)
- idr_replace(&shrinker_idr, shrinker, shrinker->id);
-#endif
+ shrinker->flags |= SHRINKER_REGISTERED;
up_write(&shrinker_rwsem);
}

@@ -521,13 +508,16 @@ EXPORT_SYMBOL(register_shrinker);
*/
void unregister_shrinker(struct shrinker *shrinker)
{
- if (!shrinker->nr_deferred)
+ if (!(shrinker->flags & SHRINKER_REGISTERED))
return;
- if (shrinker->flags & SHRINKER_MEMCG_AWARE)
- unregister_memcg_shrinker(shrinker);
+
down_write(&shrinker_rwsem);
list_del(&shrinker->list);
+ shrinker->flags &= ~SHRINKER_REGISTERED;
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+ unregister_memcg_shrinker(shrinker);
up_write(&shrinker_rwsem);
+
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
@@ -692,7 +682,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct shrinker *shrinker;

shrinker = idr_find(&shrinker_idr, i);
- if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
+ if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
if (!shrinker)
clear_bit(i, info->map);
continue;
--
2.26.2
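
The idea behind patch 08/13 above can be modeled in userspace as below. This is a
sketch, not the kernel code: struct shrinker_model and the model_* functions are
hypothetical, only the flag bit values are taken from the patch, and locking is
omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag values as assigned by the patch. */
#define MODEL_SHRINKER_REGISTERED	(1 << 0)
#define MODEL_SHRINKER_MEMCG_AWARE	(1 << 2)

/* Hypothetical cut-down shrinker: just the two fields that matter here. */
struct shrinker_model {
	unsigned int flags;
	long *nr_deferred;	/* may stay NULL for memcg aware shrinkers */
};

/* Mark the shrinker as registered. */
void model_register(struct shrinker_model *s)
{
	s->flags |= MODEL_SHRINKER_REGISTERED;
}

/* Returns true if the shrinker was registered and is now unregistered.
 * The decision depends only on the flag, not on nr_deferred. */
bool model_unregister(struct shrinker_model *s)
{
	if (!(s->flags & MODEL_SHRINKER_REGISTERED))
		return false;
	s->flags &= ~MODEL_SHRINKER_REGISTERED;
	return true;
}
```

A memcg aware shrinker whose nr_deferred pointer is NULL still unregisters
correctly in this model, which is the case the old non-NULL check could not handle.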

2021-02-17 00:20:20

by Yang Shi

Subject: [v8 PATCH 12/13] mm: memcontrol: reparent nr_deferred when memcg offline

Now that shrinker's nr_deferred is per memcg for memcg aware shrinkers, add it to the
parent's corresponding nr_deferred when the memcg goes offline.

Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 1 +
mm/vmscan.c | 24 ++++++++++++++++++++++++
3 files changed, 26 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c457fc7bc631..e1c4b93889ad 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1585,6 +1585,7 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
int alloc_shrinker_info(struct mem_cgroup *memcg);
void free_shrinker_info(struct mem_cgroup *memcg);
void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
+void reparent_shrinker_deferred(struct mem_cgroup *memcg);
#else
#define mem_cgroup_sockets_enabled 0
static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f64ad0d044d9..21f36b73f36a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5282,6 +5282,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
page_counter_set_low(&memcg->memory, 0);

memcg_offline_kmem(memcg);
+ reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);

drain_all_stock(memcg);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d8800e4da67d..4247a3568585 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -395,6 +395,30 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
}

+void reparent_shrinker_deferred(struct mem_cgroup *memcg)
+{
+ int i, nid;
+ long nr;
+ struct mem_cgroup *parent;
+ struct shrinker_info *child_info, *parent_info;
+
+ parent = parent_mem_cgroup(memcg);
+ if (!parent)
+ parent = root_mem_cgroup;
+
+ /* Prevent from concurrent shrinker_info expand */
+ down_read(&shrinker_rwsem);
+ for_each_node(nid) {
+ child_info = shrinker_info_protected(memcg, nid);
+ parent_info = shrinker_info_protected(parent, nid);
+ for (i = 0; i < shrinker_nr_max; i++) {
+ nr = atomic_long_read(&child_info->nr_deferred[i]);
+ atomic_long_add(nr, &parent_info->nr_deferred[i]);
+ }
+ }
+ up_read(&shrinker_rwsem);
+}
+
static bool cgroup_reclaim(struct scan_control *sc)
{
return sc->target_mem_cgroup;
--
2.26.2

2021-02-17 00:20:30

by Yang Shi

Subject: [v8 PATCH 13/13] mm: vmscan: shrink deferred objects proportional to priority

The number of deferred objects might wind up at an absurd number, which
results in the clamp of slab objects. This is undesirable for sustaining the working set.

So shrink deferred objects proportionally to priority and cap nr_deferred to twice
the number of cache items.

The idea is borrowed from Dave Chinner's patch:
https://lore.kernel.org/linux-xfs/[email protected]/

Tested with kernel build and vfs metadata heavy workloads in our production
environment; no regression has been spotted so far.

Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 46 +++++++++++-----------------------------------
1 file changed, 11 insertions(+), 35 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4247a3568585..b3bdc3ba8edc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -661,7 +661,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
*/
nr = xchg_nr_deferred(shrinker, shrinkctl);

- total_scan = nr;
if (shrinker->seeks) {
delta = freeable >> priority;
delta *= 4;
@@ -675,37 +674,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
delta = freeable / 2;
}

+ total_scan = nr >> priority;
total_scan += delta;
- if (total_scan < 0) {
- pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
- shrinker->scan_objects, total_scan);
- total_scan = freeable;
- next_deferred = nr;
- } else
- next_deferred = total_scan;
-
- /*
- * We need to avoid excessive windup on filesystem shrinkers
- * due to large numbers of GFP_NOFS allocations causing the
- * shrinkers to return -1 all the time. This results in a large
- * nr being built up so when a shrink that can do some work
- * comes along it empties the entire cache due to nr >>>
- * freeable. This is bad for sustaining a working set in
- * memory.
- *
- * Hence only allow the shrinker to scan the entire cache when
- * a large delta change is calculated directly.
- */
- if (delta < freeable / 4)
- total_scan = min(total_scan, freeable / 2);
-
- /*
- * Avoid risking looping forever due to too large nr value:
- * never try to free more than twice the estimate number of
- * freeable entries.
- */
- if (total_scan > freeable * 2)
- total_scan = freeable * 2;
+ total_scan = min(total_scan, (2 * freeable));

trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
freeable, delta, total_scan, priority);
@@ -744,10 +715,15 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
cond_resched();
}

- if (next_deferred >= scanned)
- next_deferred -= scanned;
- else
- next_deferred = 0;
+ /*
+ * The deferred work is increased by any new work (delta) that wasn't
+ * done, decreased by old deferred work that was done now.
+ *
+ * And it is capped to two times of the freeable items.
+ */
+ next_deferred = max_t(long, (nr + delta - scanned), 0);
+ next_deferred = min(next_deferred, (2 * freeable));
+
/*
* move the unused scan count back into the shrinker in a
* manner that handles concurrent updates.
--
2.26.2
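
The scan-count arithmetic changed by patch 13/13 above can be compared in a small
userspace model. The function names are hypothetical, long long replaces the
kernel's long so the model behaves the same on 32-bit hosts, and the negative
total_scan check of the old code is left out.

```c
/* Smaller of two values. */
static long long min_ll(long long a, long long b)
{
	return a < b ? a : b;
}

/* total_scan as computed by the lines the patch removes. */
long long old_total_scan(long long nr, long long delta, long long freeable)
{
	long long total_scan = nr + delta;

	if (delta < freeable / 4)
		total_scan = min_ll(total_scan, freeable / 2);
	if (total_scan > freeable * 2)
		total_scan = freeable * 2;
	return total_scan;
}

/* total_scan as computed by the lines the patch adds. */
long long new_total_scan(long long nr, long long delta, long long freeable,
			 int priority)
{
	long long total_scan = (nr >> priority) + delta;

	return min_ll(total_scan, 2 * freeable);
}

/* next_deferred as computed by the lines the patch adds. */
long long new_next_deferred(long long nr, long long delta, long long scanned,
			    long long freeable)
{
	long long next_deferred = nr + delta - scanned;

	if (next_deferred < 0)
		next_deferred = 0;
	return min_ll(next_deferred, 2 * freeable);
}
```

Fed with the kswapd trace from the cover letter (nr 4994376020, delta 45746,
freeable 93689873, priority 12), the old formula gives a total_scan of 46844936,
matching the trace, while the new one gives 1265076. The deferred count carried
forward is also capped at 2 * freeable, i.e. 187379746 for these numbers.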

2021-02-17 00:20:41

by Yang Shi

Subject: [v8 PATCH 10/13] mm: vmscan: use per memcg nr_deferred of shrinker

Use per memcg's nr_deferred for memcg aware shrinkers. The shrinker's nr_deferred
will be used in the following cases:
1. Non memcg aware shrinkers
2. !CONFIG_MEMCG
3. memcg is disabled by boot parameter

Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 78 ++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 66 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fcb399e18fc3..57cbc6bc8a49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -374,6 +374,24 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
idr_remove(&shrinker_idr, id);
}

+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+ struct shrinker_info *info;
+
+ info = shrinker_info_protected(memcg, nid);
+ return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+ struct shrinker_info *info;
+
+ info = shrinker_info_protected(memcg, nid);
+ return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+}
+
static bool cgroup_reclaim(struct scan_control *sc)
{
return sc->target_mem_cgroup;
@@ -412,6 +430,18 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
{
}

+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+ return 0;
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+ return 0;
+}
+
static bool cgroup_reclaim(struct scan_control *sc)
{
return false;
@@ -423,6 +453,39 @@ static bool writeback_throttling_sane(struct scan_control *sc)
}
#endif

+static long xchg_nr_deferred(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+ int nid = sc->nid;
+
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ nid = 0;
+
+ if (sc->memcg &&
+ (shrinker->flags & SHRINKER_MEMCG_AWARE))
+ return xchg_nr_deferred_memcg(nid, shrinker,
+ sc->memcg);
+
+ return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+}
+
+
+static long add_nr_deferred(long nr, struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+ int nid = sc->nid;
+
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ nid = 0;
+
+ if (sc->memcg &&
+ (shrinker->flags & SHRINKER_MEMCG_AWARE))
+ return add_nr_deferred_memcg(nr, nid, shrinker,
+ sc->memcg);
+
+ return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
+}
+
/*
* This misses isolated pages which are not accounted for to save counters.
* As the data only determines if reclaim or compaction continues, it is
@@ -558,14 +621,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
long freeable;
long nr;
long new_nr;
- int nid = shrinkctl->nid;
long batch_size = shrinker->batch ? shrinker->batch
: SHRINK_BATCH;
long scanned = 0, next_deferred;

- if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
- nid = 0;
-
freeable = shrinker->count_objects(shrinker, shrinkctl);
if (freeable == 0 || freeable == SHRINK_EMPTY)
return freeable;
@@ -575,7 +634,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
* and zero it so that other concurrent shrinker invocations
* don't also do this scanning work.
*/
- nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+ nr = xchg_nr_deferred(shrinker, shrinkctl);

total_scan = nr;
if (shrinker->seeks) {
@@ -666,14 +725,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
next_deferred = 0;
/*
* move the unused scan count back into the shrinker in a
- * manner that handles concurrent updates. If we exhausted the
- * scan, there is no need to do an update.
+ * manner that handles concurrent updates.
*/
- if (next_deferred > 0)
- new_nr = atomic_long_add_return(next_deferred,
- &shrinker->nr_deferred[nid]);
- else
- new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+ new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);

trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
return freed;
--
2.26.2

2021-02-17 00:20:54

by Yang Shi

Subject: [v8 PATCH 06/13] mm: memcontrol: rename shrinker_map to shrinker_info

The following patch is going to add nr_deferred into shrinker_map, so shrinker_map
will no longer contain only the map. Rename it to "memcg_shrinker_info".
This should make the patch adding nr_deferred cleaner and more readable, and make
review easier. Also remove the "memcg_" prefix.

Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/memcontrol.h | 8 +++---
mm/memcontrol.c | 6 ++--
mm/vmscan.c | 58 +++++++++++++++++++-------------------
3 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1739f17e0939..4c9253896e25 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -96,7 +96,7 @@ struct lruvec_stat {
* Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
* which have elements charged to this memcg.
*/
-struct memcg_shrinker_map {
+struct shrinker_info {
struct rcu_head rcu;
unsigned long map[];
};
@@ -118,7 +118,7 @@ struct mem_cgroup_per_node {

struct mem_cgroup_reclaim_iter iter;

- struct memcg_shrinker_map __rcu *shrinker_map;
+ struct shrinker_info __rcu *shrinker_info;

struct rb_node tree_node; /* RB tree node */
unsigned long usage_in_excess;/* Set to the value by which */
@@ -1581,8 +1581,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
return false;
}

-int alloc_shrinker_maps(struct mem_cgroup *memcg);
-void free_shrinker_maps(struct mem_cgroup *memcg);
+int alloc_shrinker_info(struct mem_cgroup *memcg);
+void free_shrinker_info(struct mem_cgroup *memcg);
void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
#else
#define mem_cgroup_sockets_enabled 0
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f5c9a0d2160b..f64ad0d044d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5246,11 +5246,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);

/*
- * A memcg must be visible for expand_shrinker_maps()
+ * A memcg must be visible for expand_shrinker_info()
* by the time the maps are allocated. So, we allocate maps
* here, when for_each_mem_cgroup() can't skip it.
*/
- if (alloc_shrinker_maps(memcg)) {
+ if (alloc_shrinker_info(memcg)) {
mem_cgroup_id_remove(memcg);
return -ENOMEM;
}
@@ -5314,7 +5314,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
vmpressure_cleanup(&memcg->vmpressure);
cancel_work_sync(&memcg->high_work);
mem_cgroup_remove_from_trees(memcg);
- free_shrinker_maps(memcg);
+ free_shrinker_info(memcg);
memcg_free_kmem(memcg);
mem_cgroup_free(memcg);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c2a309acd86b..c94861a3ea3e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,15 +192,15 @@ static inline int shrinker_map_size(int nr_items)
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
}

-static int expand_one_shrinker_map(struct mem_cgroup *memcg,
- int size, int old_size)
+static int expand_one_shrinker_info(struct mem_cgroup *memcg,
+ int size, int old_size)
{
- struct memcg_shrinker_map *new, *old;
+ struct shrinker_info *new, *old;
int nid;

for_each_node(nid) {
old = rcu_dereference_protected(
- mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+ mem_cgroup_nodeinfo(memcg, nid)->shrinker_info, true);
/* Not yet online memcg */
if (!old)
return 0;
@@ -213,17 +213,17 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
memset(new->map, (int)0xff, old_size);
memset((void *)new->map + old_size, 0, size - old_size);

- rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
+ rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
kvfree_rcu(old);
}

return 0;
}

-void free_shrinker_maps(struct mem_cgroup *memcg)
+void free_shrinker_info(struct mem_cgroup *memcg)
{
struct mem_cgroup_per_node *pn;
- struct memcg_shrinker_map *map;
+ struct shrinker_info *info;
int nid;

if (mem_cgroup_is_root(memcg))
@@ -231,15 +231,15 @@ void free_shrinker_maps(struct mem_cgroup *memcg)

for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
- map = rcu_dereference_protected(pn->shrinker_map, true);
- kvfree(map);
- rcu_assign_pointer(pn->shrinker_map, NULL);
+ info = rcu_dereference_protected(pn->shrinker_info, true);
+ kvfree(info);
+ rcu_assign_pointer(pn->shrinker_info, NULL);
}
}

-int alloc_shrinker_maps(struct mem_cgroup *memcg)
+int alloc_shrinker_info(struct mem_cgroup *memcg)
{
- struct memcg_shrinker_map *map;
+ struct shrinker_info *info;
int nid, size, ret = 0;

if (mem_cgroup_is_root(memcg))
@@ -248,20 +248,20 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
down_write(&shrinker_rwsem);
size = shrinker_map_size(shrinker_nr_max);
for_each_node(nid) {
- map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
- if (!map) {
- free_shrinker_maps(memcg);
+ info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
+ if (!info) {
+ free_shrinker_info(memcg);
ret = -ENOMEM;
break;
}
- rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
+ rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
up_write(&shrinker_rwsem);

return ret;
}

-static int expand_shrinker_maps(int new_id)
+static int expand_shrinker_info(int new_id)
{
int size, old_size, ret = 0;
int new_nr_max = new_id + 1;
@@ -281,7 +281,7 @@ static int expand_shrinker_maps(int new_id)
do {
if (mem_cgroup_is_root(memcg))
continue;
- ret = expand_one_shrinker_map(memcg, size, old_size);
+ ret = expand_one_shrinker_info(memcg, size, old_size);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
goto out;
@@ -297,13 +297,13 @@ static int expand_shrinker_maps(int new_id)
void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
{
if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
- struct memcg_shrinker_map *map;
+ struct shrinker_info *info;

rcu_read_lock();
- map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
+ info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
/* Pairs with smp mb in shrink_slab() */
smp_mb__before_atomic();
- set_bit(shrinker_id, map->map);
+ set_bit(shrinker_id, info->map);
rcu_read_unlock();
}
}
@@ -334,7 +334,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
goto unlock;

if (id >= shrinker_nr_max) {
- if (expand_shrinker_maps(id)) {
+ if (expand_shrinker_info(id)) {
idr_remove(&shrinker_idr, id);
goto unlock;
}
@@ -663,7 +663,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct mem_cgroup *memcg, int priority)
{
- struct memcg_shrinker_map *map;
+ struct shrinker_info *info;
unsigned long ret, freed = 0;
int i;

@@ -673,12 +673,12 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
if (!down_read_trylock(&shrinker_rwsem))
return 0;

- map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
- true);
- if (unlikely(!map))
+ info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+ true);
+ if (unlikely(!info))
goto unlock;

- for_each_set_bit(i, map->map, shrinker_nr_max) {
+ for_each_set_bit(i, info->map, shrinker_nr_max) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -689,7 +689,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
shrinker = idr_find(&shrinker_idr, i);
if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
if (!shrinker)
- clear_bit(i, map->map);
+ clear_bit(i, info->map);
continue;
}

@@ -700,7 +700,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,

ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY) {
- clear_bit(i, map->map);
+ clear_bit(i, info->map);
/*
* After the shrinker reported that it had no objects to
* free, but before we cleared the corresponding bit in
--
2.26.2

2021-02-17 00:21:06

by Yang Shi

Subject: [v8 PATCH 07/13] mm: vmscan: add shrinker_info_protected() helper

The shrinker_info is dereferenced in a couple of places via rcu_dereference_protected()
with different calling conventions, for example, through the mem_cgroup_nodeinfo() helper
or by dereferencing memcg->nodeinfo[nid]->shrinker_info directly. A later patch
will add more dereference sites.

So extract the dereference into a helper to make the code more readable. No
functional change.

Acked-by: Roman Gushchin <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c94861a3ea3e..fe6e25f46b55 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,6 +192,13 @@ static inline int shrinker_map_size(int nr_items)
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
}

+static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
+ int nid)
+{
+ return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+ lockdep_is_held(&shrinker_rwsem));
+}
+
static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int size, int old_size)
{
@@ -199,8 +206,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int nid;

for_each_node(nid) {
- old = rcu_dereference_protected(
- mem_cgroup_nodeinfo(memcg, nid)->shrinker_info, true);
+ old = shrinker_info_protected(memcg, nid);
/* Not yet online memcg */
if (!old)
return 0;
@@ -231,7 +237,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)

for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
- info = rcu_dereference_protected(pn->shrinker_info, true);
+ info = shrinker_info_protected(memcg, nid);
kvfree(info);
rcu_assign_pointer(pn->shrinker_info, NULL);
}
@@ -673,8 +679,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
if (!down_read_trylock(&shrinker_rwsem))
return 0;

- info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
- true);
+ info = shrinker_info_protected(memcg, nid);
if (unlikely(!info))
goto unlock;

--
2.26.2

2021-02-17 00:21:26

by Yang Shi

Subject: [v8 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

Currently the number of deferred objects is tracked per shrinker, but some slabs, for example
the vfs inode/dentry caches, are per memcg. This results in poor isolation among memcgs.

The deferred objects are typically generated by __GFP_NOFS allocations. One memcg with
excessive __GFP_NOFS allocations may blow up the deferred objects, and then other innocent memcgs
may suffer from over shrinking, excessive reclaim latency, etc.

For example, two workloads run in memcgA and memcgB respectively, and the workload in B is vfs
heavy. The workload in A generates excessive deferred objects, then B's vfs cache
might be hit heavily (dropping half of its caches) by B's limit reclaim or global reclaim.

We observed this in our production environment, which was running a vfs heavy workload,
as shown in the tracing log below:

<...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
cache items 246404277 delta 31345 total_scan 123202138
<...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
last shrinker return val 123186855

The ratio of vfs cache to page cache was 10:1 on this machine, and half of the caches were dropped.
This also caused a significant amount of page cache to be dropped due to inode eviction.

Making nr_deferred per memcg for memcg aware shrinkers solves the unfairness and brings
better isolation.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
is used. Non memcg aware shrinkers always use the shrinker's nr_deferred.

Signed-off-by: Yang Shi <[email protected]>
---
include/linux/memcontrol.h | 7 +++--
mm/vmscan.c | 60 ++++++++++++++++++++++++++------------
2 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4c9253896e25..c457fc7bc631 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -93,12 +93,13 @@ struct lruvec_stat {
};

/*
- * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
- * which have elements charged to this memcg.
+ * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
+ * shrinkers, which have elements charged to this memcg.
*/
struct shrinker_info {
struct rcu_head rcu;
- unsigned long map[];
+ atomic_long_t *nr_deferred;
+ unsigned long *map;
};

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a1047ea60ecf..fcb399e18fc3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
#ifdef CONFIG_MEMCG
static int shrinker_nr_max;

+/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
static inline int shrinker_map_size(int nr_items)
{
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
}

+static inline int shrinker_defer_size(int nr_items)
+{
+ return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
+}
+
static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
int nid)
{
@@ -200,10 +206,12 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
}

static int expand_one_shrinker_info(struct mem_cgroup *memcg,
- int size, int old_size)
+ int map_size, int defer_size,
+ int old_map_size, int old_defer_size)
{
struct shrinker_info *new, *old;
int nid;
+ int size = map_size + defer_size;

for_each_node(nid) {
old = shrinker_info_protected(memcg, nid);
@@ -215,9 +223,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
if (!new)
return -ENOMEM;

- /* Set all old bits, clear all new bits */
- memset(new->map, (int)0xff, old_size);
- memset((void *)new->map + old_size, 0, size - old_size);
+ new->nr_deferred = (atomic_long_t *)(new + 1);
+ new->map = (void *)new->nr_deferred + defer_size;
+
+ /* map: set all old bits, clear all new bits */
+ memset(new->map, (int)0xff, old_map_size);
+ memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
+ /* nr_deferred: copy old values, clear all new values */
+ memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
+ memset((void *)new->nr_deferred + old_defer_size, 0,
+ defer_size - old_defer_size);

rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
kvfree_rcu(old);
@@ -232,9 +247,6 @@ void free_shrinker_info(struct mem_cgroup *memcg)
struct shrinker_info *info;
int nid;

- if (mem_cgroup_is_root(memcg))
- return;
-
for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
info = shrinker_info_protected(memcg, nid);
@@ -247,12 +259,12 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
{
struct shrinker_info *info;
int nid, size, ret = 0;
-
- if (mem_cgroup_is_root(memcg))
- return 0;
+ int map_size, defer_size = 0;

down_write(&shrinker_rwsem);
- size = shrinker_map_size(shrinker_nr_max);
+ map_size = shrinker_map_size(shrinker_nr_max);
+ defer_size = shrinker_defer_size(shrinker_nr_max);
+ size = map_size + defer_size;
for_each_node(nid) {
info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
if (!info) {
@@ -260,6 +272,8 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
ret = -ENOMEM;
break;
}
+ info->nr_deferred = (atomic_long_t *)(info + 1);
+ info->map = (void *)info->nr_deferred + defer_size;
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
up_write(&shrinker_rwsem);
@@ -267,15 +281,21 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
return ret;
}

+static inline bool need_expand(int nr_max)
+{
+ return round_up(nr_max, BITS_PER_LONG) >
+ round_up(shrinker_nr_max, BITS_PER_LONG);
+}
+
static int expand_shrinker_info(int new_id)
{
- int size, old_size, ret = 0;
+ int ret = 0;
int new_nr_max = new_id + 1;
+ int map_size, defer_size = 0;
+ int old_map_size, old_defer_size = 0;
struct mem_cgroup *memcg;

- size = shrinker_map_size(new_nr_max);
- old_size = shrinker_map_size(shrinker_nr_max);
- if (size <= old_size)
+ if (!need_expand(new_nr_max))
goto out;

if (!root_mem_cgroup)
@@ -283,11 +303,15 @@ static int expand_shrinker_info(int new_id)

lockdep_assert_held(&shrinker_rwsem);

+ map_size = shrinker_map_size(new_nr_max);
+ defer_size = shrinker_defer_size(new_nr_max);
+ old_map_size = shrinker_map_size(shrinker_nr_max);
+ old_defer_size = shrinker_defer_size(shrinker_nr_max);
+
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
- if (mem_cgroup_is_root(memcg))
- continue;
- ret = expand_one_shrinker_info(memcg, size, old_size);
+ ret = expand_one_shrinker_info(memcg, map_size, defer_size,
+ old_map_size, old_defer_size);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
goto out;
--
2.26.2
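
Note: the patch above carves nr_deferred and map out of a single
allocation and grows both in batches of BITS_PER_LONG. The userspace model
below is a rough sketch of that layout and of the expansion step, under
several assumptions: info_model, round_up_to(), map_size(), defer_size(),
need_expand_model(), info_alloc() and info_expand() are hypothetical names,
plain long replaces atomic_long_t, calloc()/free() replace
kvzalloc_node()/kvfree_rcu(), and there is no RCU or locking.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_LONG ((int)(sizeof(long) * CHAR_BIT))

/* Model of struct shrinker_info without the rcu_head. */
struct info_model {
    long *nr_deferred;   /* one counter per shrinker id */
    unsigned long *map;  /* one bit per shrinker id */
};

static int round_up_to(int n, int step)
{
    return (n + step - 1) / step * step;
}

/* Bytes needed for the bitmap: one long per BITS_PER_LONG ids. */
static size_t map_size(int nr_items)
{
    return (size_t)(round_up_to(nr_items, BITS_PER_LONG) / BITS_PER_LONG)
           * sizeof(unsigned long);
}

/* Bytes needed for the counters, rounded up to the same batch size. */
static size_t defer_size(int nr_items)
{
    return (size_t)round_up_to(nr_items, BITS_PER_LONG) * sizeof(long);
}

/* Expansion is needed only when the new maximum crosses a batch boundary. */
static int need_expand_model(int new_nr_max, int cur_nr_max)
{
    return round_up_to(new_nr_max, BITS_PER_LONG) >
           round_up_to(cur_nr_max, BITS_PER_LONG);
}

/* One zeroed allocation: header, then counters, then bitmap. */
static struct info_model *info_alloc(int nr_items)
{
    size_t dsz = defer_size(nr_items);
    struct info_model *info = calloc(1, sizeof(*info) + dsz + map_size(nr_items));

    if (!info)
        return NULL;
    info->nr_deferred = (long *)(info + 1);
    info->map = (unsigned long *)((char *)info->nr_deferred + dsz);
    return info;
}

/* Grow: set all old bits, copy old counters, leave the new tail zeroed. */
static struct info_model *info_expand(struct info_model *old, int old_nr, int new_nr)
{
    struct info_model *new_info;

    assert(old != NULL && new_nr >= old_nr);
    new_info = info_alloc(new_nr);
    if (!new_info)
        return NULL;
    memset(new_info->map, 0xff, map_size(old_nr));
    memcpy(new_info->nr_deferred, old->nr_deferred, defer_size(old_nr));
    free(old);
    return new_info;
}
```

Placing the counters before the bitmap keeps the long-sized counters
naturally aligned right after the header, which matches the ordering the
patch uses.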

2021-02-17 00:21:44

by Yang Shi

Subject: [v8 PATCH 11/13] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers

Now that nr_deferred is available at the per memcg level for memcg aware shrinkers, there is
no need to allocate shrinker->nr_deferred for such shrinkers anymore.

prealloc_memcg_shrinker() returns -ENOSYS if !CONFIG_MEMCG or if memcg is disabled
on the kernel command line, in which case the shrinker's SHRINKER_MEMCG_AWARE flag is cleared.
This makes the implementation of this patch simpler.

Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Kirill Tkhai <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 57cbc6bc8a49..d8800e4da67d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -344,6 +344,9 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
int id, ret = -ENOMEM;

+ if (mem_cgroup_disabled())
+ return -ENOSYS;
+
down_write(&shrinker_rwsem);
/* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
@@ -423,7 +426,7 @@ static bool writeback_throttling_sane(struct scan_control *sc)
#else
static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
- return 0;
+ return -ENOSYS;
}

static void unregister_memcg_shrinker(struct shrinker *shrinker)
@@ -534,8 +537,18 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
*/
int prealloc_shrinker(struct shrinker *shrinker)
{
- unsigned int size = sizeof(*shrinker->nr_deferred);
+ unsigned int size;
+ int err;
+
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+ err = prealloc_memcg_shrinker(shrinker);
+ if (err != -ENOSYS)
+ return err;

+ shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
+ }
+
+ size = sizeof(*shrinker->nr_deferred);
if (shrinker->flags & SHRINKER_NUMA_AWARE)
size *= nr_node_ids;

@@ -543,28 +556,16 @@ int prealloc_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;

- if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
- if (prealloc_memcg_shrinker(shrinker))
- goto free_deferred;
- }
-
return 0;
-
-free_deferred:
- kfree(shrinker->nr_deferred);
- shrinker->nr_deferred = NULL;
- return -ENOMEM;
}

void free_prealloced_shrinker(struct shrinker *shrinker)
{
- if (!shrinker->nr_deferred)
- return;
-
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
down_write(&shrinker_rwsem);
unregister_memcg_shrinker(shrinker);
up_write(&shrinker_rwsem);
+ return;
}

kfree(shrinker->nr_deferred);
--
2.26.2
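
Note: the control flow that the patch above gives prealloc_shrinker() can
be modelled in userspace as follows. This is a hedged sketch rather than
the kernel implementation: shrinker_model, the MODEL_* flags,
memcg_enabled, nr_nodes, prealloc_memcg_model() and prealloc_model() are
all hypothetical names, and the id allocation done by the real
prealloc_memcg_shrinker() is elided.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MODEL_NUMA_AWARE  (1u << 0)
#define MODEL_MEMCG_AWARE (1u << 1)

struct shrinker_model {
    unsigned int flags;
    long *nr_deferred;   /* shrinker-level counters, one per node */
};

static int memcg_enabled = 1;  /* stand-in for !mem_cgroup_disabled() */
static int nr_nodes = 2;       /* stand-in for nr_node_ids */

/* Memcg path: reports -ENOSYS when memcg support is unavailable. */
static int prealloc_memcg_model(struct shrinker_model *s)
{
    (void)s;
    if (!memcg_enabled)
        return -ENOSYS;
    return 0;  /* id allocation elided in this model */
}

static int prealloc_model(struct shrinker_model *s)
{
    size_t n = 1;

    assert(s != NULL);
    if (s->flags & MODEL_MEMCG_AWARE) {
        int err = prealloc_memcg_model(s);

        /* Success or a real error: no shrinker-level counters needed. */
        if (err != -ENOSYS)
            return err;
        /* Memcg unavailable: fall back to a plain shrinker. */
        s->flags &= ~MODEL_MEMCG_AWARE;
    }

    if (s->flags & MODEL_NUMA_AWARE)
        n = (size_t)nr_nodes;
    s->nr_deferred = calloc(n, sizeof(*s->nr_deferred));
    return s->nr_deferred ? 0 : -ENOMEM;
}
```

In this model a memcg aware shrinker ends up with either a memcg-level
registration or shrinker-level counters, never both, which is what lets
the free path branch on the flag alone.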

2021-02-17 02:18:47

by Roman Gushchin

Subject: Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

On Tue, Feb 16, 2021 at 04:13:14PM -0800, Yang Shi wrote:
> Use kvfree_rcu() instead of call_rcu() to free the old shrinker_maps.
> We don't have to define a dedicated callback for call_rcu() anymore.
>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Roman Gushchin <[email protected]>

Thanks!

> ---
> mm/vmscan.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2e753c2516fa..c2a309acd86b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> }
>
> -static void free_shrinker_map_rcu(struct rcu_head *head)
> -{
> - kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> -}
> -
> static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> int size, int old_size)
> {
> @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> memset((void *)new->map + old_size, 0, size - old_size);
>
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> - call_rcu(&old->rcu, free_shrinker_map_rcu);
> + kvfree_rcu(old);
> }
>
> return 0;
> --
> 2.26.2
>

2021-02-17 02:22:13

by Roman Gushchin

Subject: Re: [v8 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

On Tue, Feb 16, 2021 at 04:13:18PM -0800, Yang Shi wrote:
> Currently the number of deferred objects is tracked per shrinker, but some slabs, for example
> the vfs inode/dentry caches, are per memcg. This results in poor isolation among memcgs.
>
> The deferred objects are typically generated by __GFP_NOFS allocations. One memcg with
> excessive __GFP_NOFS allocations may blow up the deferred objects, and then other innocent memcgs
> may suffer from over shrinking, excessive reclaim latency, etc.
>
> For example, two workloads run in memcgA and memcgB respectively, and the workload in B is vfs
> heavy. The workload in A generates excessive deferred objects, then B's vfs cache
> might be hit heavily (dropping half of its caches) by B's limit reclaim or global reclaim.
>
> We observed this in our production environment, which was running a vfs heavy workload,
> as shown in the tracing log below:
>
> <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> cache items 246404277 delta 31345 total_scan 123202138
> <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> last shrinker return val 123186855
>
> The ratio of vfs cache to page cache was 10:1 on this machine, and half of the caches were dropped.
> This also caused a significant amount of page cache to be dropped due to inode eviction.
>
> Making nr_deferred per memcg for memcg aware shrinkers solves the unfairness and brings
> better isolation.
>
> When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> is used. Non memcg aware shrinkers always use the shrinker's nr_deferred.
>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Roman Gushchin <[email protected]>

Thanks!

> ---
> include/linux/memcontrol.h | 7 +++--
> mm/vmscan.c | 60 ++++++++++++++++++++++++++------------
> 2 files changed, 46 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4c9253896e25..c457fc7bc631 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -93,12 +93,13 @@ struct lruvec_stat {
> };
>
> /*
> - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> - * which have elements charged to this memcg.
> + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> + * shrinkers, which have elements charged to this memcg.
> */
> struct shrinker_info {
> struct rcu_head rcu;
> - unsigned long map[];
> + atomic_long_t *nr_deferred;
> + unsigned long *map;
> };
>
> /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a1047ea60ecf..fcb399e18fc3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
> #ifdef CONFIG_MEMCG
> static int shrinker_nr_max;
>
> +/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> static inline int shrinker_map_size(int nr_items)
> {
> return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> }
>
> +static inline int shrinker_defer_size(int nr_items)
> +{
> + return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
> +}
> +
> static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> int nid)
> {
> @@ -200,10 +206,12 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> }
>
> static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> - int size, int old_size)
> + int map_size, int defer_size,
> + int old_map_size, int old_defer_size)
> {
> struct shrinker_info *new, *old;
> int nid;
> + int size = map_size + defer_size;
>
> for_each_node(nid) {
> old = shrinker_info_protected(memcg, nid);
> @@ -215,9 +223,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> if (!new)
> return -ENOMEM;
>
> - /* Set all old bits, clear all new bits */
> - memset(new->map, (int)0xff, old_size);
> - memset((void *)new->map + old_size, 0, size - old_size);
> + new->nr_deferred = (atomic_long_t *)(new + 1);
> + new->map = (void *)new->nr_deferred + defer_size;
> +
> + /* map: set all old bits, clear all new bits */
> + memset(new->map, (int)0xff, old_map_size);
> + memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
> + /* nr_deferred: copy old values, clear all new values */
> + memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
> + memset((void *)new->nr_deferred + old_defer_size, 0,
> + defer_size - old_defer_size);
>
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> kvfree_rcu(old);
> @@ -232,9 +247,6 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> struct shrinker_info *info;
> int nid;
>
> - if (mem_cgroup_is_root(memcg))
> - return;
> -
> for_each_node(nid) {
> pn = mem_cgroup_nodeinfo(memcg, nid);
> info = shrinker_info_protected(memcg, nid);
> @@ -247,12 +259,12 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> {
> struct shrinker_info *info;
> int nid, size, ret = 0;
> -
> - if (mem_cgroup_is_root(memcg))
> - return 0;
> + int map_size, defer_size = 0;
>
> down_write(&shrinker_rwsem);
> - size = shrinker_map_size(shrinker_nr_max);
> + map_size = shrinker_map_size(shrinker_nr_max);
> + defer_size = shrinker_defer_size(shrinker_nr_max);
> + size = map_size + defer_size;
> for_each_node(nid) {
> info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> if (!info) {
> @@ -260,6 +272,8 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> ret = -ENOMEM;
> break;
> }
> + info->nr_deferred = (atomic_long_t *)(info + 1);
> + info->map = (void *)info->nr_deferred + defer_size;
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> }
> up_write(&shrinker_rwsem);
> @@ -267,15 +281,21 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> return ret;
> }
>
> +static inline bool need_expand(int nr_max)
> +{
> + return round_up(nr_max, BITS_PER_LONG) >
> + round_up(shrinker_nr_max, BITS_PER_LONG);
> +}
> +
> static int expand_shrinker_info(int new_id)
> {
> - int size, old_size, ret = 0;
> + int ret = 0;
> int new_nr_max = new_id + 1;
> + int map_size, defer_size = 0;
> + int old_map_size, old_defer_size = 0;
> struct mem_cgroup *memcg;
>
> - size = shrinker_map_size(new_nr_max);
> - old_size = shrinker_map_size(shrinker_nr_max);
> - if (size <= old_size)
> + if (!need_expand(new_nr_max))
> goto out;
>
> if (!root_mem_cgroup)
> @@ -283,11 +303,15 @@ static int expand_shrinker_info(int new_id)
>
> lockdep_assert_held(&shrinker_rwsem);
>
> + map_size = shrinker_map_size(new_nr_max);
> + defer_size = shrinker_defer_size(new_nr_max);
> + old_map_size = shrinker_map_size(shrinker_nr_max);
> + old_defer_size = shrinker_defer_size(shrinker_nr_max);
> +
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> - if (mem_cgroup_is_root(memcg))
> - continue;
> - ret = expand_one_shrinker_info(memcg, size, old_size);
> + ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> + old_map_size, old_defer_size);
> if (ret) {
> mem_cgroup_iter_break(NULL, memcg);
> goto out;
> --
> 2.26.2
>

2021-02-17 02:43:44

by Roman Gushchin

Subject: Re: [v8 PATCH 08/13] mm: vmscan: use a new flag to indicate shrinker is registered

On Tue, Feb 16, 2021 at 04:13:17PM -0800, Yang Shi wrote:
> Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
> This approach is fine with nr_deferred at the shrinker level, but the following
> patches will move MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so their
> shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
> from unregistering correctly.
>
> Add a SHRINKER_REGISTERED flag to indicate registration instead, and remove
> SHRINKER_REGISTERING, since the new flag also tells whether a shrinker has been
> registered successfully.
>
> Acked-by: Kirill Tkhai <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Roman Gushchin <[email protected]>

> ---
> include/linux/shrinker.h | 7 ++++---
> mm/vmscan.c | 40 +++++++++++++++-------------------------
> 2 files changed, 19 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 0f80123650e2..1eac79ce57d4 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -79,13 +79,14 @@ struct shrinker {
> #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>
> /* Flags */
> -#define SHRINKER_NUMA_AWARE (1 << 0)
> -#define SHRINKER_MEMCG_AWARE (1 << 1)
> +#define SHRINKER_REGISTERED (1 << 0)
> +#define SHRINKER_NUMA_AWARE (1 << 1)
> +#define SHRINKER_MEMCG_AWARE (1 << 2)
> /*
> * It just makes sense when the shrinker is also MEMCG_AWARE for now,
> * non-MEMCG_AWARE shrinker should not have this flag set.
> */
> -#define SHRINKER_NONSLAB (1 << 2)
> +#define SHRINKER_NONSLAB (1 << 3)
>
> extern int prealloc_shrinker(struct shrinker *shrinker);
> extern void register_shrinker_prepared(struct shrinker *shrinker);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fe6e25f46b55..a1047ea60ecf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -314,19 +314,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> }
> }
>
> -/*
> - * We allow subsystems to populate their shrinker-related
> - * LRU lists before register_shrinker_prepared() is called
> - * for the shrinker, since we don't want to impose
> - * restrictions on their internal registration order.
> - * In this case shrink_slab_memcg() may find corresponding
> - * bit is set in the shrinkers map.
> - *
> - * This value is used by the function to detect registering
> - * shrinkers and to skip do_shrink_slab() calls for them.
> - */
> -#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
> -
> static DEFINE_IDR(shrinker_idr);
>
> static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> @@ -335,7 +322,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>
> down_write(&shrinker_rwsem);
> /* This may call shrinker, so it must use down_read_trylock() */
> - id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
> + id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> if (id < 0)
> goto unlock;
>
> @@ -358,9 +345,9 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
>
> BUG_ON(id < 0);
>
> - down_write(&shrinker_rwsem);
> + lockdep_assert_held(&shrinker_rwsem);
> +
> idr_remove(&shrinker_idr, id);
> - up_write(&shrinker_rwsem);
> }
>
> static bool cgroup_reclaim(struct scan_control *sc)
> @@ -487,8 +474,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
> if (!shrinker->nr_deferred)
> return;
>
> - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> + down_write(&shrinker_rwsem);
> unregister_memcg_shrinker(shrinker);
> + up_write(&shrinker_rwsem);
> + }
>
> kfree(shrinker->nr_deferred);
> shrinker->nr_deferred = NULL;
> @@ -498,10 +488,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> {
> down_write(&shrinker_rwsem);
> list_add_tail(&shrinker->list, &shrinker_list);
> -#ifdef CONFIG_MEMCG
> - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> - idr_replace(&shrinker_idr, shrinker, shrinker->id);
> -#endif
> + shrinker->flags |= SHRINKER_REGISTERED;
> up_write(&shrinker_rwsem);
> }
>
> @@ -521,13 +508,16 @@ EXPORT_SYMBOL(register_shrinker);
> */
> void unregister_shrinker(struct shrinker *shrinker)
> {
> - if (!shrinker->nr_deferred)
> + if (!(shrinker->flags & SHRINKER_REGISTERED))
> return;
> - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> - unregister_memcg_shrinker(shrinker);
> +
> down_write(&shrinker_rwsem);
> list_del(&shrinker->list);
> + shrinker->flags &= ~SHRINKER_REGISTERED;
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> + unregister_memcg_shrinker(shrinker);
> up_write(&shrinker_rwsem);
> +
> kfree(shrinker->nr_deferred);
> shrinker->nr_deferred = NULL;
> }
> @@ -692,7 +682,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> struct shrinker *shrinker;
>
> shrinker = idr_find(&shrinker_idr, i);
> - if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
> + if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
> if (!shrinker)
> clear_bit(i, info->map);
> continue;
> --
> 2.26.2
>
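
The patch above replaces the SHRINKER_REGISTERING sentinel pointer with a flag bit: the real shrinker pointer is published at prealloc time, and the reclaim path skips any entry whose flag is not yet set. A minimal userspace sketch of that idea follows. All toy_ names and flag values are hypothetical, a fixed table stands in for shrinker_idr, and locking is omitted, so this only illustrates the state transitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; the kernel defines its own. */
#define TOY_MEMCG_AWARE (1u << 0)
#define TOY_REGISTERED  (1u << 1)

#define TOY_MAX_IDS 8

struct toy_shrinker {
	unsigned int flags;
	int id;
};

/* Stand-in for shrinker_idr: a fixed table indexed by id. */
static struct toy_shrinker *toy_table[TOY_MAX_IDS];

/* Like the prealloc step: publish the real pointer, not a sentinel. */
static int toy_prealloc(struct toy_shrinker *s)
{
	for (int id = 0; id < TOY_MAX_IDS; id++) {
		if (!toy_table[id]) {
			toy_table[id] = s;
			s->id = id;
			return 0;
		}
	}
	return -1;
}

/* Like register_shrinker_prepared(): only the flag changes. */
static void toy_register(struct toy_shrinker *s)
{
	s->flags |= TOY_REGISTERED;
}

/* Like unregister_shrinker(): a no-op unless the flag is set. */
static void toy_unregister(struct toy_shrinker *s)
{
	if (!(s->flags & TOY_REGISTERED))
		return;
	assert(s->id >= 0 && s->id < TOY_MAX_IDS);
	s->flags &= ~TOY_REGISTERED;
	toy_table[s->id] = NULL;
}

/* The check a reclaim loop would make before calling into a shrinker. */
static int toy_can_shrink(int id)
{
	struct toy_shrinker *s = toy_table[id];

	return s && (s->flags & TOY_REGISTERED);
}
```

A shrinker that has been preallocated but not yet registered is visible in the table, yet toy_can_shrink() reports 0 for it until toy_register() runs.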

2021-02-17 02:44:13

by Roman Gushchin

Subject: Re: [v8 PATCH 10/13] mm: vmscan: use per memcg nr_deferred of shrinker

On Tue, Feb 16, 2021 at 04:13:19PM -0800, Yang Shi wrote:
> Use per memcg's nr_deferred for memcg aware shrinkers. The shrinker's nr_deferred
> will be used in the following cases:
> 1. Non memcg aware shrinkers
> 2. !CONFIG_MEMCG
> 3. memcg is disabled by boot parameter
>
> Signed-off-by: Yang Shi <[email protected]>

LGTM!

Acked-by: Roman Gushchin <[email protected]>

Thanks!

2021-02-17 06:28:02

by Kirill Tkhai

Subject: Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

On 17.02.2021 03:13, Yang Shi wrote:
> Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> We don't have to define a dedicated callback for call_rcu() anymore.
>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Kirill Tkhai <[email protected]>

> ---
> mm/vmscan.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2e753c2516fa..c2a309acd86b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> }
>
> -static void free_shrinker_map_rcu(struct rcu_head *head)
> -{
> - kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> -}
> -
> static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> int size, int old_size)
> {
> @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> memset((void *)new->map + old_size, 0, size - old_size);
>
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> - call_rcu(&old->rcu, free_shrinker_map_rcu);
> + kvfree_rcu(old);
> }
>
> return 0;
>

2021-02-17 06:50:26

by Kirill Tkhai

Subject: Re: [v8 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

On 17.02.2021 03:13, Yang Shi wrote:
> Currently the number of deferred objects is tracked per shrinker, but some slabs, for example
> the vfs inode/dentry caches, are per memcg. This results in poor isolation among memcgs.
>
> The deferred objects are typically generated by GFP_NOFS allocations. One memcg with
> excessive GFP_NOFS allocations may blow up the deferred objects, and then other innocent memcgs
> may suffer from over shrink, excessive reclaim latency, etc.
>
> For example, two workloads run in memcgA and memcgB respectively, and the workload in B is
> vfs heavy. If the workload in A generates excessive deferred objects, B's vfs cache
> might be hit heavily (half of the caches dropped) by B's limit reclaim or global reclaim.
>
> We observed this hit in our production environment, which was running a vfs heavy workload,
> as shown in the tracing log below:
>
> <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> cache items 246404277 delta 31345 total_scan 123202138
> <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> last shrinker return val 123186855
>
> The ratio of vfs cache to page cache was 10:1 on this machine, and half of the caches were dropped.
> This also caused a significant amount of page cache to be dropped due to inode eviction.
>
> Making nr_deferred per memcg for memcg aware shrinkers solves the unfairness and brings
> better isolation.
>
> When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> would be used. And non memcg aware shrinkers use shrinker's nr_deferred all the time.
>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Kirill Tkhai <[email protected]>

> ---
> include/linux/memcontrol.h | 7 +++--
> mm/vmscan.c | 60 ++++++++++++++++++++++++++------------
> 2 files changed, 46 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4c9253896e25..c457fc7bc631 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -93,12 +93,13 @@ struct lruvec_stat {
> };
>
> /*
> - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> - * which have elements charged to this memcg.
> + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> + * shrinkers, which have elements charged to this memcg.
> */
> struct shrinker_info {
> struct rcu_head rcu;
> - unsigned long map[];
> + atomic_long_t *nr_deferred;
> + unsigned long *map;
> };
>
> /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a1047ea60ecf..fcb399e18fc3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
> #ifdef CONFIG_MEMCG
> static int shrinker_nr_max;
>
> +/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> static inline int shrinker_map_size(int nr_items)
> {
> return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> }
>
> +static inline int shrinker_defer_size(int nr_items)
> +{
> + return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
> +}
> +
> static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> int nid)
> {
> @@ -200,10 +206,12 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> }
>
> static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> - int size, int old_size)
> + int map_size, int defer_size,
> + int old_map_size, int old_defer_size)
> {
> struct shrinker_info *new, *old;
> int nid;
> + int size = map_size + defer_size;
>
> for_each_node(nid) {
> old = shrinker_info_protected(memcg, nid);
> @@ -215,9 +223,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> if (!new)
> return -ENOMEM;
>
> - /* Set all old bits, clear all new bits */
> - memset(new->map, (int)0xff, old_size);
> - memset((void *)new->map + old_size, 0, size - old_size);
> + new->nr_deferred = (atomic_long_t *)(new + 1);
> + new->map = (void *)new->nr_deferred + defer_size;
> +
> + /* map: set all old bits, clear all new bits */
> + memset(new->map, (int)0xff, old_map_size);
> + memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
> + /* nr_deferred: copy old values, clear all new values */
> + memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
> + memset((void *)new->nr_deferred + old_defer_size, 0,
> + defer_size - old_defer_size);
>
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> kvfree_rcu(old);
> @@ -232,9 +247,6 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> struct shrinker_info *info;
> int nid;
>
> - if (mem_cgroup_is_root(memcg))
> - return;
> -
> for_each_node(nid) {
> pn = mem_cgroup_nodeinfo(memcg, nid);
> info = shrinker_info_protected(memcg, nid);
> @@ -247,12 +259,12 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> {
> struct shrinker_info *info;
> int nid, size, ret = 0;
> -
> - if (mem_cgroup_is_root(memcg))
> - return 0;
> + int map_size, defer_size = 0;
>
> down_write(&shrinker_rwsem);
> - size = shrinker_map_size(shrinker_nr_max);
> + map_size = shrinker_map_size(shrinker_nr_max);
> + defer_size = shrinker_defer_size(shrinker_nr_max);
> + size = map_size + defer_size;
> for_each_node(nid) {
> info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> if (!info) {
> @@ -260,6 +272,8 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> ret = -ENOMEM;
> break;
> }
> + info->nr_deferred = (atomic_long_t *)(info + 1);
> + info->map = (void *)info->nr_deferred + defer_size;
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> }
> up_write(&shrinker_rwsem);
> @@ -267,15 +281,21 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> return ret;
> }
>
> +static inline bool need_expand(int nr_max)
> +{
> + return round_up(nr_max, BITS_PER_LONG) >
> + round_up(shrinker_nr_max, BITS_PER_LONG);
> +}
> +
> static int expand_shrinker_info(int new_id)
> {
> - int size, old_size, ret = 0;
> + int ret = 0;
> int new_nr_max = new_id + 1;
> + int map_size, defer_size = 0;
> + int old_map_size, old_defer_size = 0;
> struct mem_cgroup *memcg;
>
> - size = shrinker_map_size(new_nr_max);
> - old_size = shrinker_map_size(shrinker_nr_max);
> - if (size <= old_size)
> + if (!need_expand(new_nr_max))
> goto out;
>
> if (!root_mem_cgroup)
> @@ -283,11 +303,15 @@ static int expand_shrinker_info(int new_id)
>
> lockdep_assert_held(&shrinker_rwsem);
>
> + map_size = shrinker_map_size(new_nr_max);
> + defer_size = shrinker_defer_size(new_nr_max);
> + old_map_size = shrinker_map_size(shrinker_nr_max);
> + old_defer_size = shrinker_defer_size(shrinker_nr_max);
> +
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> - if (mem_cgroup_is_root(memcg))
> - continue;
> - ret = expand_one_shrinker_info(memcg, size, old_size);
> + ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> + old_map_size, old_defer_size);
> if (ret) {
> mem_cgroup_iter_break(NULL, memcg);
> goto out;
>
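
The layout introduced by this patch places the nr_deferred counters and the bitmap behind a single header in one allocation, with both sized in batches of BITS_PER_LONG entries. A minimal userspace sketch of that layout follows. The toy_ names are hypothetical, plain long stands in for atomic_long_t, calloc() and free() stand in for kvzalloc_node() and kvfree_rcu(), and there is no locking or RCU, so it only illustrates the size and pointer arithmetic.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Round n up to the next multiple of TOY_BITS_PER_LONG. */
static int toy_round_up(int n)
{
	return (n + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG * TOY_BITS_PER_LONG;
}

/* Bytes needed for a bitmap with one bit per shrinker id. */
static int toy_map_size(int nr_items)
{
	return toy_round_up(nr_items) / TOY_BITS_PER_LONG * (int)sizeof(unsigned long);
}

/* Bytes needed for one counter per shrinker id, in the same batches. */
static int toy_defer_size(int nr_items)
{
	return toy_round_up(nr_items) * (int)sizeof(long);
}

struct toy_info {
	long *nr_deferred;	/* plain long stands in for atomic_long_t */
	unsigned long *map;
};

/* One zeroed allocation: header, then the counters, then the bitmap. */
static struct toy_info *toy_alloc(int nr_items)
{
	int defer_size = toy_defer_size(nr_items);
	struct toy_info *info;

	info = calloc(1, sizeof(*info) + defer_size + toy_map_size(nr_items));
	if (!info)
		return NULL;
	info->nr_deferred = (long *)(info + 1);
	info->map = (unsigned long *)((char *)info->nr_deferred + defer_size);
	return info;
}

/* Grow to new_nr ids: old map bits all set, old counters copied, rest zero. */
static struct toy_info *toy_expand(struct toy_info *old, int old_nr, int new_nr)
{
	struct toy_info *grown;

	assert(new_nr >= old_nr);
	grown = toy_alloc(new_nr);
	if (!grown)
		return NULL;
	memset(grown->map, 0xff, toy_map_size(old_nr));
	memcpy(grown->nr_deferred, old->nr_deferred, toy_defer_size(old_nr));
	free(old);	/* the patch defers this free past an RCU grace period */
	return grown;
}
```

Putting the counters first keeps them naturally aligned directly after the header, which matches the v5 to v6 changelog note about moving nr_deferred before map.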

2021-02-17 06:51:46

by Kirill Tkhai

Subject: Re: [v8 PATCH 10/13] mm: vmscan: use per memcg nr_deferred of shrinker

On 17.02.2021 03:13, Yang Shi wrote:
> Use per memcg's nr_deferred for memcg aware shrinkers. The shrinker's nr_deferred
> will be used in the following cases:
> 1. Non memcg aware shrinkers
> 2. !CONFIG_MEMCG
> 3. memcg is disabled by boot parameter
>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Kirill Tkhai <[email protected]>

> ---
> mm/vmscan.c | 78 ++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 66 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fcb399e18fc3..57cbc6bc8a49 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -374,6 +374,24 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
> idr_remove(&shrinker_idr, id);
> }
>
> +static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> + struct mem_cgroup *memcg)
> +{
> + struct shrinker_info *info;
> +
> + info = shrinker_info_protected(memcg, nid);
> + return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
> +}
> +
> +static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> + struct mem_cgroup *memcg)
> +{
> + struct shrinker_info *info;
> +
> + info = shrinker_info_protected(memcg, nid);
> + return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
> +}
> +
> static bool cgroup_reclaim(struct scan_control *sc)
> {
> return sc->target_mem_cgroup;
> @@ -412,6 +430,18 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
> {
> }
>
> +static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> + struct mem_cgroup *memcg)
> +{
> + return 0;
> +}
> +
> +static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> + struct mem_cgroup *memcg)
> +{
> + return 0;
> +}
> +
> static bool cgroup_reclaim(struct scan_control *sc)
> {
> return false;
> @@ -423,6 +453,39 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> }
> #endif
>
> +static long xchg_nr_deferred(struct shrinker *shrinker,
> + struct shrink_control *sc)
> +{
> + int nid = sc->nid;
> +
> + if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> + nid = 0;
> +
> + if (sc->memcg &&
> + (shrinker->flags & SHRINKER_MEMCG_AWARE))
> + return xchg_nr_deferred_memcg(nid, shrinker,
> + sc->memcg);
> +
> + return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +}
> +
> +
> +static long add_nr_deferred(long nr, struct shrinker *shrinker,
> + struct shrink_control *sc)
> +{
> + int nid = sc->nid;
> +
> + if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> + nid = 0;
> +
> + if (sc->memcg &&
> + (shrinker->flags & SHRINKER_MEMCG_AWARE))
> + return add_nr_deferred_memcg(nr, nid, shrinker,
> + sc->memcg);
> +
> + return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
> +}
> +
> /*
> * This misses isolated pages which are not accounted for to save counters.
> * As the data only determines if reclaim or compaction continues, it is
> @@ -558,14 +621,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> long freeable;
> long nr;
> long new_nr;
> - int nid = shrinkctl->nid;
> long batch_size = shrinker->batch ? shrinker->batch
> : SHRINK_BATCH;
> long scanned = 0, next_deferred;
>
> - if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> - nid = 0;
> -
> freeable = shrinker->count_objects(shrinker, shrinkctl);
> if (freeable == 0 || freeable == SHRINK_EMPTY)
> return freeable;
> @@ -575,7 +634,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> * and zero it so that other concurrent shrinker invocations
> * don't also do this scanning work.
> */
> - nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> + nr = xchg_nr_deferred(shrinker, shrinkctl);
>
> total_scan = nr;
> if (shrinker->seeks) {
> @@ -666,14 +725,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> next_deferred = 0;
> /*
> * move the unused scan count back into the shrinker in a
> - * manner that handles concurrent updates. If we exhausted the
> - * scan, there is no need to do an update.
> + * manner that handles concurrent updates.
> */
> - if (next_deferred > 0)
> - new_nr = atomic_long_add_return(next_deferred,
> - &shrinker->nr_deferred[nid]);
> - else
> - new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> + new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
>
> trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
> return freed;
>
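
The helpers in this patch choose between the per-memcg counter and the shrinker's global counter, then either take all deferred work (exchange with zero) or put unused work back (add and return the new total). The following userspace sketch models that dispatch with C11 atomics. The toy_ types, flag values and array bounds are hypothetical, and the RCU-protected shrinker_info lookup is reduced to a plain array access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flag values and bounds for this sketch. */
#define TOY_NUMA_AWARE  (1u << 0)
#define TOY_MEMCG_AWARE (1u << 1)
#define TOY_NR_NODES 2
#define TOY_NR_IDS   4

struct toy_memcg {
	/* per-node, per-shrinker-id deferred counts */
	atomic_long nr_deferred[TOY_NR_NODES][TOY_NR_IDS];
};

struct toy_shrinker {
	unsigned int flags;
	int id;
	atomic_long nr_deferred[TOY_NR_NODES];	/* global, per node */
};

/* Pick the counter: per-memcg for memcg aware shrinkers, else global. */
static atomic_long *toy_counter(struct toy_shrinker *s, int nid,
				struct toy_memcg *memcg)
{
	if (!(s->flags & TOY_NUMA_AWARE))
		nid = 0;
	assert(nid >= 0 && nid < TOY_NR_NODES);
	if (memcg && (s->flags & TOY_MEMCG_AWARE)) {
		assert(s->id >= 0 && s->id < TOY_NR_IDS);
		return &memcg->nr_deferred[nid][s->id];
	}
	return &s->nr_deferred[nid];
}

/* Take all deferred work and leave zero behind. */
static long toy_xchg_nr_deferred(struct toy_shrinker *s, int nid,
				 struct toy_memcg *memcg)
{
	return atomic_exchange(toy_counter(s, nid, memcg), 0L);
}

/* Put unused work back and return the new total. */
static long toy_add_nr_deferred(long nr, struct toy_shrinker *s, int nid,
				struct toy_memcg *memcg)
{
	return atomic_fetch_add(toy_counter(s, nid, memcg), nr) + nr;
}
```

Because each memcg owns its own counters, deferred work added on behalf of one memcg is never observed by an exchange on another, which is the isolation property the series is after.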

2021-02-25 17:03:19

by Yang Shi

Subject: Re: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware

Hi Andrew,

Just checking in whether this series is on your radar. Patches 01/13
through 12/13 have been reviewed and acked. Vlastimil had some
comments on patch 13/13, and I'm not sure whether he is going to continue
reviewing that one. I hope the last patch could get into the -mm tree
along with the others so that it can get broader testing. What do you
think about it?

Thanks,
Yang

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
>
> Changelog
> v7 --> v8:
> * Added lockdep assert in expand_shrinker_info() per Roman.
> * Added patch 05/13 to use kvfree_rcu() instead of call_rcu() per Roman
> and Kirill.
> * Moved rwsem acquire/release out of unregister_memcg_shrinker() per Roman.
> * Renamed count_nr_deferred_{memcg} to xchg_nr_deferred_{memcg} per Roman.
> * Fixed the next_deferred logic per Vlastimil.
> * Misc minor code cleanup, refactor and spelling correction per Roman
> and Shakeel.
> * Collected more ack and review tags from Roman, Shakeel and Vlastimil.
> v6 --> v7:
> * Expanded shrinker_info in a batch of BITS_PER_LONG per Kirill.
> * Added patch 06/12 to introduce a helper for dereferencing shrinker_info
> per Kirill.
> * Renamed set_nr_deferred_memcg to add_nr_deferred_memcg per Kirill.
> * Collected Acked-by from Kirill.
> v5 --> v6:
> * Rebased on top of https://lore.kernel.org/linux-mm/[email protected]/
> per Kirill.
> * Don't register shrinker idr with NULL and remove idr_replace() per Vlastimil.
> * Move nr_deferred before map to guarantee the alignment per Vlastimil.
> * Misc minor code cleanup and refactor per Kirill and Vlastimil.
> * Added Acked-by from Vlastimil for patch #1, #2, #3, #5, #9 and #10.
> v4 --> v5:
> * Incorporated the comments from Kirill.
> * Rebased to v5.11-rc5.
> v3 --> v4:
> * Removed "memcg_" prefix for shrinker_maps related functions per Roman.
> * Use write lock instead of read lock per Kirill. Also removed Johannes's ack
> since write lock is used.
> * Incorporated the comments from Kirill.
> * Removed RFC.
> * Rebased to v5.11-rc4.
> v2 --> v3:
> * Moved shrinker_maps related code to vmscan.c per Dave.
> * Removed memcg_shrinker_map_size. Calculated the size of map via shrinker_nr_max
> per Johannes.
> * Consolidated shrinker_deferred with shrinker_maps into one struct per Dave.
> * Simplified the nr_deferred related code.
> * Dropped the memory barrier from v2.
> * Moved nr_deferred reparent code to vmscan.c per Dave.
> * Added test coverage information in patch #11. Dave is concerned about the
> potential regression. I didn't notice any regression with my tests, but suggestions
> about more test coverage are definitely welcome. Having this patch in the -mm tree
> and then the linux-next tree may help spot regressions, so I keep it in this version.
> * The code cleanup and consolidation resulted in the series growing to 11 patches.
> * Rebased onto 5.11-rc2.
> v1 --> v2:
> * Use shrinker->flags to store the new SHRINKER_REGISTERED flag per Roman.
> * Folded patch #1 into patch #6 per Roman.
> * Added memory barrier to prevent shrink_slab_memcg from seeing NULL shrinker_maps/
> shrinker_deferred per Kirill.
> * Removed memcg_shrinker_map_mutex. Protected shrinker_map/shrinker_deferred
> allocations from expand with shrinker_rwsem per Johannes.
>
> Recently a huge one-off slab drop was seen on some vfs metadata heavy workloads.
> It turned out that the shrinker saw a huge number of accumulated nr_deferred
> objects.
>
> On our production machine, I saw an absurd nr_deferred value, as shown in the
> tracing result below:
>
> <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
> super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
> 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
> 9300 cache items 1667 delta 11 total_scan 833
>
> There are 2.5 trillion deferred objects on one node. Assuming all of them
> are dentries (192 bytes per object), the total size of the deferred objects on
> one node is ~480TB, which is clearly absurd.
>
> I managed to reproduce this problem with a kernel build workload plus a negative dentry
> generator.
>
> First step, run the below kernel build test script:
>
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
>
> cd /root/Buildarea/linux-stable
>
> for i in `seq 1500`; do
>     cgcreate -g memory:kern_build
>     echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
>
>     echo 3 > /proc/sys/vm/drop_caches
>     cgexec -g memory:kern_build make clean > /dev/null 2>&1
>     cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
>
>     cgdelete -g memory:kern_build
> done
>
> Then run the below negative dentry generator script:
>
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
>
> mkdir /sys/fs/cgroup/memory/test
> echo $$ > /sys/fs/cgroup/memory/test/tasks
>
> for i in `seq $NR_CPUS`; do
>     while true; do
>         FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
>         cat $FILE 2>/dev/null
>     done &
> done
>
> Then kswapd shrinks half of the dentry cache in just one loop, as the tracing result
> below shows:
>
> kswapd0-475 [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
> objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
> kswapd0-475 [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
> scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928
>
> There was a huge number of deferred objects before the shrinker was called. The behavior
> does match the code, but it might not be desirable from the user's standpoint.
>
> An excessive amount of nr_deferred might accumulate for various reasons, for example:
> * GFP_NOFS allocations
> * A large number of small scans (< scan_batch, 1024 for vfs metadata)
>
> However, the LRUs of slabs are per memcg (memcg-aware shrinkers) while the deferred objects
> are per shrinker. This may have some bad effects:
> * Poor isolation among memcgs. Some memcgs which happen to have frequent limit
> reclaim may get nr_deferred accumulated to a huge number, then other innocent
> memcgs may take the fall. In our case the main workload was hit.
> * Unbounded deferred objects. There is no cap for deferred objects, so they can grow
> ridiculously large, as the tracing result showed.
> * Easy to get out of control. Although shrinkers take deferred objects into account,
> the count can go out of control easily. One misconfigured memcg could incur an absurd
> amount of deferred objects in a period of time.
> * Various reclaim problems, i.e. over reclaim, long reclaim latency, etc. There may be
> hundreds of GB of slab caches for a vfs metadata heavy workload, and shrinking half of
> them may take minutes. We observed latency spikes due to the prolonged reclaim.
>
> These issues also have been discussed in https://lore.kernel.org/linux-mm/[email protected]/.
> The patchset is the outcome of that discussion.
>
> So this patchset makes nr_deferred per-memcg to tackle the problem. It does the following:
> * Have memcg_shrinker_deferred per memcg per node, just like shrinker_map.
> Instead of a bitmap, it is an atomic_long_t array in which each element represents one
> shrinker, even if the shrinker is not memcg aware; this simplifies the implementation.
> For memcg aware shrinkers, the deferred objects are accumulated to their own
> memcg, and the shrinkers only see nr_deferred from their own memcg. Non memcg aware
> shrinkers still use the global nr_deferred from struct shrinker.
> * Once the memcg is offlined, its nr_deferred will be reparented to its parent along
> with LRUs.
> * The root memcg has a memcg_shrinker_deferred array too. It simplifies the handling of
> reparenting to the root memcg.
> * Cap nr_deferred to 2x of the length of lru. The idea is borrowed from Dave Chinner's
> series (https://lore.kernel.org/linux-xfs/[email protected]/)
>
> The downside is that each memcg has to allocate extra memory to store the nr_deferred array.
> In our production environment, there are typically around 40 shrinkers, so each memcg
> needs ~320 bytes. 10K memcgs would need ~3.2MB memory. It seems fine.
>
> We have been running the patched kernel on some hosts of our fleet (test and production) for
> months, and it works very well. The monitoring data shows the working set is sustained as expected.
>
> Yang Shi (13):
> mm: vmscan: use nid from shrink_control for tracepoint
> mm: vmscan: consolidate shrinker_maps handling code
> mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
> mm: vmscan: remove memcg_shrinker_map_size
> mm: vmscan: use kvfree_rcu instead of call_rcu
> mm: memcontrol: rename shrinker_map to shrinker_info
> mm: vmscan: add shrinker_info_protected() helper
> mm: vmscan: use a new flag to indicate shrinker is registered
> mm: vmscan: add per memcg shrinker nr_deferred
> mm: vmscan: use per memcg nr_deferred of shrinker
> mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
> mm: memcontrol: reparent nr_deferred when memcg offline
> mm: vmscan: shrink deferred objects proportional to priority
>
> include/linux/memcontrol.h | 23 +++---
> include/linux/shrinker.h | 7 +-
> mm/huge_memory.c | 4 +-
> mm/list_lru.c | 6 +-
> mm/memcontrol.c | 130 +------------------------------
> mm/vmscan.c | 394 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
> 6 files changed, 319 insertions(+), 245 deletions(-)
>
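
The size estimates in the quoted cover letter can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the 8-byte counter size assumes a 64-bit atomic_long_t.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes implied by a deferred-object count at a given object size. */
static uint64_t toy_deferred_bytes(uint64_t nr_objects, uint64_t obj_size)
{
	return nr_objects * obj_size;
}

/* Bytes for one nr_deferred array, assuming 8-byte counters (64-bit build). */
static uint64_t toy_array_bytes(uint64_t nr_entries)
{
	return nr_entries * 8;
}

/* Entries after rounding up to a batch, as shrinker_defer_size() does. */
static uint64_t toy_batched_entries(uint64_t nr_shrinkers, uint64_t batch)
{
	assert(batch > 0);
	return (nr_shrinkers + batch - 1) / batch * batch;
}
```

With the exact trace value, toy_deferred_bytes(2531805877005, 192) is 486106728384960 bytes, in line with the ~480TB estimate for a rounded 2.5 trillion objects. toy_array_bytes(40) gives the 320 bytes quoted per memcg, and 10000 of those come to 3200000 bytes. Note that patch 09/13 allocates the array per node and in batches of BITS_PER_LONG entries, so on a 64-bit build 40 shrinkers would occupy toy_array_bytes(toy_batched_entries(40, 64)), that is 512 bytes, per node.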

2021-03-01 20:21:53

by Yang Shi

Subject: Re: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware

On Mon, Mar 1, 2021 at 7:05 AM Johannes Weiner <[email protected]> wrote:
>
> Hello Yang,
>
> On Thu, Feb 25, 2021 at 09:00:16AM -0800, Yang Shi wrote:
> > Hi Andrew,
> >
> > Just checking in whether this series is on your radar. Patches 01/13
> > through 12/13 have been reviewed and acked. Vlastimil had some
> > comments on patch 13/13, and I'm not sure whether he is going to continue
> > reviewing that one. I hope the last patch could get into the -mm tree
> > along with the others so that it can get broader testing. What do you
> > think about it?
>
> The merge window for 5.12 is/has been open, which is when maintainers
> are busy getting everything from the previous development cycle ready
> to send upstream. Usually, only fixes but no new features are picked
> up during that time. If you don't hear back, try resending in a week.

Thanks, Johannes. Totally understand.

>
> That reminds me, I also have patches I need to resend :)

2021-03-03 04:27:40

by Johannes Weiner

Subject: Re: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware

Hello Yang,

On Thu, Feb 25, 2021 at 09:00:16AM -0800, Yang Shi wrote:
> Hi Andrew,
>
> Just checking in whether this series is on your radar. Patches 01/13
> through 12/13 have been reviewed and acked. Vlastimil had some
> comments on patch 13/13, and I'm not sure whether he is going to continue
> reviewing that one. I hope the last patch could get into the -mm tree
> along with the others so that it can get broader testing. What do you
> think about it?

The merge window for 5.12 is/has been open, which is when maintainers
are busy getting everything from the previous development cycle ready
to send upstream. Usually, only fixes but no new features are picked
up during that time. If you don't hear back, try resending in a week.

That reminds me, I also have patches I need to resend :)

2021-03-08 08:44:49

by Shakeel Butt

Subject: Re: [v8 PATCH 06/13] mm: memcontrol: rename shrinker_map to shrinker_info

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> The following patch is going to add nr_deferred into shrinker_map, so the structure will
> no longer contain only the map. Rename it to "shrinker_info", which also removes the
> "memcg_" prefix. This should make the patch adding nr_deferred cleaner and more readable,
> and make review easier.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Acked-by: Kirill Tkhai <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2021-03-08 08:44:51

by Shakeel Butt

Subject: Re: [v8 PATCH 07/13] mm: vmscan: add shrinker_info_protected() helper

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> The shrinker_info is dereferenced in a couple of places via rcu_dereference_protected
> with different calling conventions, for example, using the mem_cgroup_nodeinfo helper
> or dereferencing memcg->nodeinfo[nid]->shrinker_info directly. A later patch
> will add more dereference sites.
>
> So extract the dereference into a helper to make the code more readable. No
> functional change.
>
> Acked-by: Roman Gushchin <[email protected]>
> Acked-by: Kirill Tkhai <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2021-03-08 08:45:00

by Shakeel Butt

Subject: Re: [v8 PATCH 04/13] mm: vmscan: remove memcg_shrinker_map_size

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> Both memcg_shrinker_map_size and shrinker_nr_max are maintained, but the
> map size can be calculated from shrinker_nr_max, so it seems unnecessary to keep both.
> Remove memcg_shrinker_map_size since shrinker_nr_max is also used when iterating the
> bitmap.
>
> Acked-by: Kirill Tkhai <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2021-03-08 08:46:01

by Shakeel Butt

Subject: Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> We don't have to define a dedicated callback for call_rcu() anymore.
>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> mm/vmscan.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2e753c2516fa..c2a309acd86b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> }
>
> -static void free_shrinker_map_rcu(struct rcu_head *head)
> -{
> - kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> -}
> -
> static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> int size, int old_size)
> {
> @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> memset((void *)new->map + old_size, 0, size - old_size);
>
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> - call_rcu(&old->rcu, free_shrinker_map_rcu);
> + kvfree_rcu(old);

Please use kvfree_rcu(old, rcu) instead of kvfree_rcu(old). The single
param can call synchronize_rcu().

2021-03-08 08:46:25

by Shakeel Butt

Subject: Re: [v8 PATCH 03/13] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
> exclusively, the read side can be protected by holding the read lock, so a
> dedicated mutex seems superfluous.
>
> Kirill Tkhai suggested using the write lock since:
>
> * We want the assignment to shrinker_maps to be visible to shrink_slab_memcg().
> * shrink_slab_memcg() dereferences with rcu_dereference_protected(), but
> if we use the READ lock in alloc_shrinker_maps(), the dereference
> is not actually protected.
> * The READ lock makes alloc_shrinker_info() racy against memory allocation failure.
> alloc_shrinker_info()->free_shrinker_info() may free memory right after
> shrink_slab_memcg() dereferenced it. You may say
> shrink_slab_memcg()->mem_cgroup_online() protects us from it? Yes, sure,
> but this is not something we want to have to remember in the future, since this
> spreads modularity.
>
> A test with a heavy paging workload didn't show that the write lock makes things worse.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Acked-by: Kirill Tkhai <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2021-03-08 14:58:40

by Paul E. McKenney

Subject: Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

On Sun, Mar 07, 2021 at 10:13:04PM -0800, Shakeel Butt wrote:
> On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
> >
> > Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> > We don't have to define a dedicated callback for call_rcu() anymore.
> >
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > mm/vmscan.c | 7 +------
> > 1 file changed, 1 insertion(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2e753c2516fa..c2a309acd86b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> > return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> > }
> >
> > -static void free_shrinker_map_rcu(struct rcu_head *head)
> > -{
> > - kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> > -}
> > -
> > static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > int size, int old_size)
> > {
> > @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > memset((void *)new->map + old_size, 0, size - old_size);
> >
> > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> > - call_rcu(&old->rcu, free_shrinker_map_rcu);
> > + kvfree_rcu(old);
>
> Please use kvfree_rcu(old, rcu) instead of kvfree_rcu(old). The single
> param can call synchronize_rcu().

Especially given that you already have the ->rcu field that the
two-argument form requires.

The reason for using the single-argument form is when you have lots of
little data structures, such that getting rid of that rcu_head structure
is valuable enough to be worth the occasional call to synchronize_rcu().
However, please note that this call to synchronize_rcu() happens only
under OOM conditions.

Thanx, Paul
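
A minimal userspace sketch of why the two-argument form can always queue the object without blocking: the embedded head gives a place to link the object, and the field offset is enough to recover the start of the object later. This is a simplified model, not the kernel implementation; struct rcu_head_model, defer_free(), defer_free_obj() and run_deferred() are hypothetical names, and the grace period is simulated by an explicit call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct rcu_head. */
struct rcu_head_model {
	struct rcu_head_model *next;
	size_t offset;	/* offset of this head inside its enclosing object */
};

/* Same shape as the map structure in this patch: embedded head, flexible map. */
struct shrinker_map_model {
	struct rcu_head_model rcu;
	unsigned long map[];
};

/* Objects waiting for the simulated grace period to end. */
static struct rcu_head_model *pending;

/* Queue an object for a later free; never blocks because the head is embedded. */
static void defer_free(struct rcu_head_model *head, size_t offset)
{
	head->offset = offset;
	head->next = pending;
	pending = head;
}

/* Models kvfree_rcu(ptr, field): only the field's offset has to be remembered. */
#define defer_free_obj(ptr, field) \
	defer_free(&(ptr)->field, offsetof(__typeof__(*(ptr)), field))

/* Simulated end of a grace period: free everything queued, return the count. */
static int run_deferred(void)
{
	int freed = 0;

	while (pending) {
		struct rcu_head_model *head = pending;

		pending = head->next;
		/* Step back from the head to the start of the object. */
		free((char *)head - head->offset);
		freed++;
	}
	return freed;
}

/* Allocate two maps, queue both, then let the simulated grace period end. */
static int demo_deferred_free(void)
{
	struct shrinker_map_model *a = calloc(1, sizeof(*a) + 8 * sizeof(unsigned long));
	struct shrinker_map_model *b = calloc(1, sizeof(*b) + 8 * sizeof(unsigned long));

	assert(a && b);
	defer_free_obj(a, rcu);
	defer_free_obj(b, rcu);
	return run_deferred();
}
```

An object without such an embedded head has nowhere to be linked, which is why, as described above, the single-argument form may have to fall back to synchronize_rcu() when it cannot allocate tracking memory.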

2021-03-08 16:51:29

by Roman Gushchin

Subject: Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

On Sun, Mar 07, 2021 at 10:13:04PM -0800, Shakeel Butt wrote:
> On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
> >
> > Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> > We don't have to define a dedicated callback for call_rcu() anymore.
> >
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > mm/vmscan.c | 7 +------
> > 1 file changed, 1 insertion(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2e753c2516fa..c2a309acd86b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> > return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> > }
> >
> > -static void free_shrinker_map_rcu(struct rcu_head *head)
> > -{
> > - kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> > -}
> > -
> > static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > int size, int old_size)
> > {
> > @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > memset((void *)new->map + old_size, 0, size - old_size);
> >
> > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> > - call_rcu(&old->rcu, free_shrinker_map_rcu);
> > + kvfree_rcu(old);
>
> Please use kvfree_rcu(old, rcu) instead of kvfree_rcu(old). The single
> param can call synchronize_rcu().

Oh, I didn't know about this difference. Thank you for noticing!

2021-03-08 17:50:30

by Shakeel Butt

Subject: Re: [v8 PATCH 08/13] mm: vmscan: use a new flag to indicate shrinker is registered

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
> This approach is fine with nr_deferred at the shrinker level, but the following
> patches will move MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so their
> shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
> from unregistering correctly.
>
> Remove SHRINKER_REGISTERING since we can check whether a shrinker is registered
> successfully by the new flag.
>
> Acked-by: Kirill Tkhai <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2021-03-08 18:17:51

by Yang Shi

Subject: Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

On Mon, Mar 8, 2021 at 6:54 AM Paul E. McKenney <[email protected]> wrote:
>
> On Sun, Mar 07, 2021 at 10:13:04PM -0800, Shakeel Butt wrote:
> > On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
> > >
> > > Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> > > We don't have to define a dedicated callback for call_rcu() anymore.
> > >
> > > Signed-off-by: Yang Shi <[email protected]>
> > > ---
> > > mm/vmscan.c | 7 +------
> > > 1 file changed, 1 insertion(+), 6 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 2e753c2516fa..c2a309acd86b 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> > > return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> > > }
> > >
> > > -static void free_shrinker_map_rcu(struct rcu_head *head)
> > > -{
> > > - kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> > > -}
> > > -
> > > static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > > int size, int old_size)
> > > {
> > > @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > > memset((void *)new->map + old_size, 0, size - old_size);
> > >
> > > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> > > - call_rcu(&old->rcu, free_shrinker_map_rcu);
> > > + kvfree_rcu(old);
> >
> > Please use kvfree_rcu(old, rcu) instead of kvfree_rcu(old). The single
> > param can call synchronize_rcu().
>
> Especially given that you already have the ->rcu field that the
> two-argument form requires.
>
> The reason for using the single-argument form is when you have lots of
> little data structures, such that getting rid of that rcu_head structure
> is valuable enough to be worth the occasional call to synchronize_rcu().
> However, please note that this call to synchronize_rcu() happens only
> under OOM conditions.

Thanks, Shakeel and Paul. I didn't realize the difference. I will use
the two-parameter form, kvfree_rcu(old, rcu), in the new version.

>
> Thanx, Paul

2021-03-08 19:14:16

by Shakeel Butt

Subject: Re: [v8 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> Currently the number of deferred objects is tracked per shrinker, but some slabs, for example,
> the vfs inode/dentry caches, are per memcg; this results in poor isolation among memcgs.
>
> The deferred objects typically are generated by __GFP_NOFS allocations. One memcg with
> excessive __GFP_NOFS allocations may blow up deferred objects, and then other innocent memcgs
> may suffer from over-shrinking, excessive reclaim latency, etc.
>
> For example, two workloads run in memcgA and memcgB respectively, and the workload in B is a
> vfs-heavy workload. The workload in A generates excessive deferred objects, then B's vfs cache
> might be hit heavily (dropping half of its caches) by B's limit reclaim or global reclaim.
>
> We observed this hit in our production environment, which was running a vfs-heavy workload,
> as shown in the tracing log below:
>
> <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> cache items 246404277 delta 31345 total_scan 123202138
> <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> last shrinker return val 123186855
>
> The vfs cache and page cache ratio was 10:1 on this machine, and half of the caches were dropped.
> This also resulted in a significant amount of page cache being dropped due to inode eviction.
>
> Making nr_deferred per memcg for memcg-aware shrinkers would solve the unfairness and bring
> better isolation.
>
> When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> would be used. And non-memcg-aware shrinkers use the shrinker's nr_deferred all the time.
>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> include/linux/memcontrol.h | 7 +++--
> mm/vmscan.c | 60 ++++++++++++++++++++++++++------------
> 2 files changed, 46 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4c9253896e25..c457fc7bc631 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -93,12 +93,13 @@ struct lruvec_stat {
> };
>
> /*
> - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> - * which have elements charged to this memcg.
> + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> + * shrinkers, which have elements charged to this memcg.
> */
> struct shrinker_info {
> struct rcu_head rcu;
> - unsigned long map[];
> + atomic_long_t *nr_deferred;
> + unsigned long *map;
> };
>
> /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a1047ea60ecf..fcb399e18fc3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
> #ifdef CONFIG_MEMCG
> static int shrinker_nr_max;
>
> +/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> static inline int shrinker_map_size(int nr_items)
> {
> return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> }
>
> +static inline int shrinker_defer_size(int nr_items)
> +{
> + return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
> +}
> +
> static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> int nid)
> {
> @@ -200,10 +206,12 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> }
>
> static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> - int size, int old_size)
> + int map_size, int defer_size,
> + int old_map_size, int old_defer_size)
> {
> struct shrinker_info *new, *old;
> int nid;
> + int size = map_size + defer_size;
>
> for_each_node(nid) {
> old = shrinker_info_protected(memcg, nid);
> @@ -215,9 +223,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> if (!new)
> return -ENOMEM;
>
> - /* Set all old bits, clear all new bits */
> - memset(new->map, (int)0xff, old_size);
> - memset((void *)new->map + old_size, 0, size - old_size);
> + new->nr_deferred = (atomic_long_t *)(new + 1);
> + new->map = (void *)new->nr_deferred + defer_size;
> +
> + /* map: set all old bits, clear all new bits */
> + memset(new->map, (int)0xff, old_map_size);
> + memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
> + /* nr_deferred: copy old values, clear all new values */
> + memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
> + memset((void *)new->nr_deferred + old_defer_size, 0,
> + defer_size - old_defer_size);
>
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> kvfree_rcu(old);
> @@ -232,9 +247,6 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> struct shrinker_info *info;
> int nid;
>
> - if (mem_cgroup_is_root(memcg))
> - return;
> -
> for_each_node(nid) {
> pn = mem_cgroup_nodeinfo(memcg, nid);
> info = shrinker_info_protected(memcg, nid);
> @@ -247,12 +259,12 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> {
> struct shrinker_info *info;
> int nid, size, ret = 0;
> -
> - if (mem_cgroup_is_root(memcg))
> - return 0;

Can you please comment on the consequences of allowing shrinker_info to
be allocated for the root memcg? Why didn't we do that before, and why
is it fine (or maybe required) now? Please add the explanation to the
commit message.

> + int map_size, defer_size = 0;
>
> down_write(&shrinker_rwsem);
> - size = shrinker_map_size(shrinker_nr_max);
> + map_size = shrinker_map_size(shrinker_nr_max);
> + defer_size = shrinker_defer_size(shrinker_nr_max);
> + size = map_size + defer_size;
> for_each_node(nid) {
> info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> if (!info) {
> @@ -260,6 +272,8 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> ret = -ENOMEM;
> break;
> }
> + info->nr_deferred = (atomic_long_t *)(info + 1);
> + info->map = (void *)info->nr_deferred + defer_size;
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> }
> up_write(&shrinker_rwsem);
> @@ -267,15 +281,21 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> return ret;
> }
>
> +static inline bool need_expand(int nr_max)
> +{
> + return round_up(nr_max, BITS_PER_LONG) >
> + round_up(shrinker_nr_max, BITS_PER_LONG);
> +}
> +
> static int expand_shrinker_info(int new_id)
> {
> - int size, old_size, ret = 0;
> + int ret = 0;
> int new_nr_max = new_id + 1;
> + int map_size, defer_size = 0;
> + int old_map_size, old_defer_size = 0;
> struct mem_cgroup *memcg;
>
> - size = shrinker_map_size(new_nr_max);
> - old_size = shrinker_map_size(shrinker_nr_max);
> - if (size <= old_size)
> + if (!need_expand(new_nr_max))
> goto out;
>
> if (!root_mem_cgroup)
> @@ -283,11 +303,15 @@ static int expand_shrinker_info(int new_id)
>
> lockdep_assert_held(&shrinker_rwsem);
>
> + map_size = shrinker_map_size(new_nr_max);
> + defer_size = shrinker_defer_size(new_nr_max);
> + old_map_size = shrinker_map_size(shrinker_nr_max);
> + old_defer_size = shrinker_defer_size(shrinker_nr_max);
> +
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> - if (mem_cgroup_is_root(memcg))
> - continue;
> - ret = expand_one_shrinker_info(memcg, size, old_size);
> + ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> + old_map_size, old_defer_size);
> if (ret) {
> mem_cgroup_iter_break(NULL, memcg);
> goto out;
> --
> 2.26.2
>
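
A minimal userspace sketch of the layout this patch builds: one allocation holding the header, then the nr_deferred array, then the map, with both arrays sized in batches of BITS_PER_LONG. It is a simplified model under stated assumptions, not the kernel code: info_model, info_alloc(), info_expand() and demo_expand() are hypothetical names, plain long stands in for atomic_long_t, calloc()/free() stand in for kvzalloc_node()/kvfree_rcu(), and there are no concurrent readers.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_LONG ((int)(sizeof(long) * CHAR_BIT))

static int round_up_to(int n, int step)
{
	return (n + step - 1) / step * step;
}

/* One bit per shrinker id, rounded up to whole words. */
static int map_size(int nr_items)
{
	return round_up_to(nr_items, BITS_PER_LONG) / BITS_PER_LONG *
	       (int)sizeof(unsigned long);
}

/* One counter per shrinker id, grown in batches of BITS_PER_LONG ids. */
static int defer_size(int nr_items)
{
	return round_up_to(nr_items, BITS_PER_LONG) * (int)sizeof(long);
}

struct info_model {
	long *nr_deferred;	/* points just past the header */
	unsigned long *map;	/* points just past nr_deferred */
};

static struct info_model *info_alloc(int nr_items)
{
	int dsz = defer_size(nr_items);
	struct info_model *info = calloc(1, sizeof(*info) + dsz + map_size(nr_items));

	if (!info)
		return NULL;
	/* nr_deferred comes first so that both arrays stay word aligned. */
	info->nr_deferred = (long *)(info + 1);
	info->map = (unsigned long *)((char *)info->nr_deferred + dsz);
	return info;
}

static struct info_model *info_expand(struct info_model *old, int old_nr, int new_nr)
{
	struct info_model *new;

	assert(new_nr >= old_nr);
	new = info_alloc(new_nr);
	if (!new)
		return NULL;
	/* map: set all old bits; the new bits stay clear from calloc() */
	memset(new->map, 0xff, (size_t)map_size(old_nr));
	/* nr_deferred: copy old values; the new values stay zero from calloc() */
	memcpy(new->nr_deferred, old->nr_deferred, (size_t)defer_size(old_nr));
	free(old);	/* the patch defers this free with kvfree_rcu() */
	return new;
}

/* Grow from 1 id to BITS_PER_LONG + 1 ids and check what survives. */
static long demo_expand(void)
{
	struct info_model *small = info_alloc(1);
	struct info_model *big;
	long result;

	if (!small)
		return -1;
	small->nr_deferred[0] = 42;
	big = info_expand(small, 1, BITS_PER_LONG + 1);
	if (!big) {
		free(small);
		return -1;
	}
	/* The old map word is all ones, the newly added word is clear. */
	assert(big->map[0] == ~0UL && big->map[1] == 0);
	result = big->nr_deferred[0] + big->nr_deferred[BITS_PER_LONG];
	free(big);
	return result;
}
```

In this model the old counter value is carried over and the newly added counters start at zero, which mirrors the memcpy()/memset() pair in expand_one_shrinker_info() above.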

2021-03-08 19:19:22

by Shakeel Butt

Subject: Re: [v8 PATCH 10/13] mm: vmscan: use per memcg nr_deferred of shrinker

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> Use per memcg's nr_deferred for memcg aware shrinkers. The shrinker's nr_deferred
> will be used in the following cases:
> 1. Non memcg aware shrinkers
> 2. !CONFIG_MEMCG
> 3. memcg is disabled by boot parameter
>
> Signed-off-by: Yang Shi <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>
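
A minimal userspace sketch of the selection rule described in this patch, assuming C11 atomics in place of the kernel's atomic_long_t helpers. The names memcg_model, shrinker_model, MODEL_MEMCG_AWARE and take_nr_deferred() are hypothetical; a NULL memcg pointer stands in for both !CONFIG_MEMCG and memcg being disabled at boot.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_MEMCG_AWARE 0x1u	/* stands in for SHRINKER_MEMCG_AWARE */
#define MODEL_NR_IDS 4

/* Per-memcg deferred counters, indexed by shrinker id. */
struct memcg_model {
	atomic_long nr_deferred[MODEL_NR_IDS];
};

/* A shrinker with its own fallback counter. */
struct shrinker_model {
	unsigned int flags;
	int id;
	atomic_long nr_deferred;
};

/* Fetch and reset the deferred count from whichever counter applies. */
static long take_nr_deferred(struct shrinker_model *s, struct memcg_model *memcg)
{
	if (memcg && (s->flags & MODEL_MEMCG_AWARE))
		return atomic_exchange(&memcg->nr_deferred[s->id], 0L);

	/* Non memcg aware shrinker, or no memcg available. */
	return atomic_exchange(&s->nr_deferred, 0L);
}

/* Set up one shrinker (id 1) and one memcg, then take the deferred count. */
static long demo_take(unsigned int flags, int use_memcg)
{
	struct memcg_model memcg;
	struct shrinker_model s;
	int i;

	for (i = 0; i < MODEL_NR_IDS; i++)
		atomic_init(&memcg.nr_deferred[i], 0L);
	s.flags = flags;
	s.id = 1;
	atomic_init(&s.nr_deferred, 100L);	/* shrinker-level count */
	atomic_store(&memcg.nr_deferred[1], 7L);	/* memcg-level count */

	return take_nr_deferred(&s, use_memcg ? &memcg : NULL);
}
```

Only the combination of a memcg aware shrinker and an available memcg reads the per-memcg counter; the other cases fall back to the shrinker's own counter, matching the three cases listed in the commit message.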

2021-03-08 20:26:28

by Yang Shi

Subject: Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

On Mon, Mar 8, 2021 at 8:49 AM Roman Gushchin <[email protected]> wrote:
>
> On Sun, Mar 07, 2021 at 10:13:04PM -0800, Shakeel Butt wrote:
> > On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
> > >
> > > Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> > > We don't have to define a dedicated callback for call_rcu() anymore.
> > >
> > > Signed-off-by: Yang Shi <[email protected]>
> > > ---
> > > mm/vmscan.c | 7 +------
> > > 1 file changed, 1 insertion(+), 6 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 2e753c2516fa..c2a309acd86b 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> > > return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> > > }
> > >
> > > -static void free_shrinker_map_rcu(struct rcu_head *head)
> > > -{
> > > - kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> > > -}
> > > -
> > > static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > > int size, int old_size)
> > > {
> > > @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > > memset((void *)new->map + old_size, 0, size - old_size);
> > >
> > > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> > > - call_rcu(&old->rcu, free_shrinker_map_rcu);
> > > + kvfree_rcu(old);
> >
> > Please use kvfree_rcu(old, rcu) instead of kvfree_rcu(old). The single
> > param can call synchronize_rcu().
>
> Oh, I didn't know about this difference. Thank you for noticing!

BTW, I think I could keep your and Kirill's Acked-by with this change
(using the two-parameter form of kvfree_rcu()) since the change seems trivial.

2021-03-08 20:34:34

by Yang Shi

Subject: Re: [v8 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

On Mon, Mar 8, 2021 at 11:12 AM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
> >
> > Currently the number of deferred objects is tracked per shrinker, but some slabs, for example,
> > the vfs inode/dentry caches, are per memcg; this results in poor isolation among memcgs.
> >
> > The deferred objects typically are generated by __GFP_NOFS allocations. One memcg with
> > excessive __GFP_NOFS allocations may blow up deferred objects, and then other innocent memcgs
> > may suffer from over-shrinking, excessive reclaim latency, etc.
> >
> > For example, two workloads run in memcgA and memcgB respectively, and the workload in B is a
> > vfs-heavy workload. The workload in A generates excessive deferred objects, then B's vfs cache
> > might be hit heavily (dropping half of its caches) by B's limit reclaim or global reclaim.
> >
> > We observed this hit in our production environment, which was running a vfs-heavy workload,
> > as shown in the tracing log below:
> >
> > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > cache items 246404277 delta 31345 total_scan 123202138
> > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > last shrinker return val 123186855
> >
> > The vfs cache and page cache ratio was 10:1 on this machine, and half of the caches were dropped.
> > This also resulted in a significant amount of page cache being dropped due to inode eviction.
> >
> > Making nr_deferred per memcg for memcg-aware shrinkers would solve the unfairness and bring
> > better isolation.
> >
> > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > would be used. And non-memcg-aware shrinkers use the shrinker's nr_deferred all the time.
> >
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > include/linux/memcontrol.h | 7 +++--
> > mm/vmscan.c | 60 ++++++++++++++++++++++++++------------
> > 2 files changed, 46 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 4c9253896e25..c457fc7bc631 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -93,12 +93,13 @@ struct lruvec_stat {
> > };
> >
> > /*
> > - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> > - * which have elements charged to this memcg.
> > + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> > + * shrinkers, which have elements charged to this memcg.
> > */
> > struct shrinker_info {
> > struct rcu_head rcu;
> > - unsigned long map[];
> > + atomic_long_t *nr_deferred;
> > + unsigned long *map;
> > };
> >
> > /*
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a1047ea60ecf..fcb399e18fc3 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > #ifdef CONFIG_MEMCG
> > static int shrinker_nr_max;
> >
> > +/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> > static inline int shrinker_map_size(int nr_items)
> > {
> > return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> > }
> >
> > +static inline int shrinker_defer_size(int nr_items)
> > +{
> > + return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
> > +}
> > +
> > static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> > int nid)
> > {
> > @@ -200,10 +206,12 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> > }
> >
> > static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> > - int size, int old_size)
> > + int map_size, int defer_size,
> > + int old_map_size, int old_defer_size)
> > {
> > struct shrinker_info *new, *old;
> > int nid;
> > + int size = map_size + defer_size;
> >
> > for_each_node(nid) {
> > old = shrinker_info_protected(memcg, nid);
> > @@ -215,9 +223,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> > if (!new)
> > return -ENOMEM;
> >
> > - /* Set all old bits, clear all new bits */
> > - memset(new->map, (int)0xff, old_size);
> > - memset((void *)new->map + old_size, 0, size - old_size);
> > + new->nr_deferred = (atomic_long_t *)(new + 1);
> > + new->map = (void *)new->nr_deferred + defer_size;
> > +
> > + /* map: set all old bits, clear all new bits */
> > + memset(new->map, (int)0xff, old_map_size);
> > + memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
> > + /* nr_deferred: copy old values, clear all new values */
> > + memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
> > + memset((void *)new->nr_deferred + old_defer_size, 0,
> > + defer_size - old_defer_size);
> >
> > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> > kvfree_rcu(old);
> > @@ -232,9 +247,6 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> > struct shrinker_info *info;
> > int nid;
> >
> > - if (mem_cgroup_is_root(memcg))
> > - return;
> > -
> > for_each_node(nid) {
> > pn = mem_cgroup_nodeinfo(memcg, nid);
> > info = shrinker_info_protected(memcg, nid);
> > @@ -247,12 +259,12 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> > {
> > struct shrinker_info *info;
> > int nid, size, ret = 0;
> > -
> > - if (mem_cgroup_is_root(memcg))
> > - return 0;
>
> Can you please comment on the consequences of allowing shrinker_info to
> be allocated for the root memcg? Why didn't we do that before, and why
> is it fine (or maybe required) now? Please add the explanation to the
> commit message.

Before this patchset, shrinker_info just tracked shrinker_maps, which is
not required for the root memcg. But the newly added nr_deferred is needed
in the root memcg; otherwise the deferred work would get lost once the
memcgs are reparented to root.

How about adding the paragraph below to the commit log:

"To preserve nr_deferred when reparenting memcgs to root, root memcg
needs shrinker_info allocated too."

>
> > + int map_size, defer_size = 0;
> >
> > down_write(&shrinker_rwsem);
> > - size = shrinker_map_size(shrinker_nr_max);
> > + map_size = shrinker_map_size(shrinker_nr_max);
> > + defer_size = shrinker_defer_size(shrinker_nr_max);
> > + size = map_size + defer_size;
> > for_each_node(nid) {
> > info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> > if (!info) {
> > @@ -260,6 +272,8 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> > ret = -ENOMEM;
> > break;
> > }
> > + info->nr_deferred = (atomic_long_t *)(info + 1);
> > + info->map = (void *)info->nr_deferred + defer_size;
> > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> > }
> > up_write(&shrinker_rwsem);
> > @@ -267,15 +281,21 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> > return ret;
> > }
> >
> > +static inline bool need_expand(int nr_max)
> > +{
> > + return round_up(nr_max, BITS_PER_LONG) >
> > + round_up(shrinker_nr_max, BITS_PER_LONG);
> > +}
> > +
> > static int expand_shrinker_info(int new_id)
> > {
> > - int size, old_size, ret = 0;
> > + int ret = 0;
> > int new_nr_max = new_id + 1;
> > + int map_size, defer_size = 0;
> > + int old_map_size, old_defer_size = 0;
> > struct mem_cgroup *memcg;
> >
> > - size = shrinker_map_size(new_nr_max);
> > - old_size = shrinker_map_size(shrinker_nr_max);
> > - if (size <= old_size)
> > + if (!need_expand(new_nr_max))
> > goto out;
> >
> > if (!root_mem_cgroup)
> > @@ -283,11 +303,15 @@ static int expand_shrinker_info(int new_id)
> >
> > lockdep_assert_held(&shrinker_rwsem);
> >
> > + map_size = shrinker_map_size(new_nr_max);
> > + defer_size = shrinker_defer_size(new_nr_max);
> > + old_map_size = shrinker_map_size(shrinker_nr_max);
> > + old_defer_size = shrinker_defer_size(shrinker_nr_max);
> > +
> > memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > do {
> > - if (mem_cgroup_is_root(memcg))
> > - continue;
> > - ret = expand_one_shrinker_info(memcg, size, old_size);
> > + ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> > + old_map_size, old_defer_size);
> > if (ret) {
> > mem_cgroup_iter_break(NULL, memcg);
> > goto out;
> > --
> > 2.26.2
> >

2021-03-08 21:13:34

by Shakeel Butt

Subject: Re: [v8 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

On Mon, Mar 8, 2021 at 12:30 PM Yang Shi <[email protected]> wrote:
>
> On Mon, Mar 8, 2021 at 11:12 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
> > >
> > > Currently the number of deferred objects is tracked per shrinker, but some slabs, for example,
> > > the vfs inode/dentry caches, are per memcg; this results in poor isolation among memcgs.
> > >
> > > The deferred objects typically are generated by __GFP_NOFS allocations. One memcg with
> > > excessive __GFP_NOFS allocations may blow up deferred objects, and then other innocent memcgs
> > > may suffer from over-shrinking, excessive reclaim latency, etc.
> > >
> > > For example, two workloads run in memcgA and memcgB respectively, and the workload in B is a
> > > vfs-heavy workload. The workload in A generates excessive deferred objects, then B's vfs cache
> > > might be hit heavily (dropping half of its caches) by B's limit reclaim or global reclaim.
> > >
> > > We observed this hit in our production environment, which was running a vfs-heavy workload,
> > > as shown in the tracing log below:
> > >
> > > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > > cache items 246404277 delta 31345 total_scan 123202138
> > > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > > last shrinker return val 123186855
> > >
> > > The vfs cache and page cache ratio was 10:1 on this machine, and half of the caches were dropped.
> > > This also resulted in a significant amount of page cache being dropped due to inode eviction.
> > >
> > > Making nr_deferred per memcg for memcg-aware shrinkers would solve the unfairness and bring
> > > better isolation.
> > >
> > > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > > would be used. And non-memcg-aware shrinkers use the shrinker's nr_deferred all the time.
> > >
> > > Signed-off-by: Yang Shi <[email protected]>
> > > ---
> > > include/linux/memcontrol.h | 7 +++--
> > > mm/vmscan.c | 60 ++++++++++++++++++++++++++------------
> > > 2 files changed, 46 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 4c9253896e25..c457fc7bc631 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -93,12 +93,13 @@ struct lruvec_stat {
> > > };
> > >
> > > /*
> > > - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> > > - * which have elements charged to this memcg.
> > > + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> > > + * shrinkers, which have elements charged to this memcg.
> > > */
> > > struct shrinker_info {
> > > struct rcu_head rcu;
> > > - unsigned long map[];
> > > + atomic_long_t *nr_deferred;
> > > + unsigned long *map;
> > > };
> > >
> > > /*
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index a1047ea60ecf..fcb399e18fc3 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > > #ifdef CONFIG_MEMCG
> > > static int shrinker_nr_max;
> > >
> > > +/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> > > static inline int shrinker_map_size(int nr_items)
> > > {
> > > return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> > > }
> > >
> > > +static inline int shrinker_defer_size(int nr_items)
> > > +{
> > > + return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
> > > +}
> > > +
> > > static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> > > int nid)
> > > {
> > > @@ -200,10 +206,12 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> > > }
> > >
> > > static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> > > - int size, int old_size)
> > > + int map_size, int defer_size,
> > > + int old_map_size, int old_defer_size)
> > > {
> > > struct shrinker_info *new, *old;
> > > int nid;
> > > + int size = map_size + defer_size;
> > >
> > > for_each_node(nid) {
> > > old = shrinker_info_protected(memcg, nid);
> > > @@ -215,9 +223,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> > > if (!new)
> > > return -ENOMEM;
> > >
> > > - /* Set all old bits, clear all new bits */
> > > - memset(new->map, (int)0xff, old_size);
> > > - memset((void *)new->map + old_size, 0, size - old_size);
> > > + new->nr_deferred = (atomic_long_t *)(new + 1);
> > > + new->map = (void *)new->nr_deferred + defer_size;
> > > +
> > > + /* map: set all old bits, clear all new bits */
> > > + memset(new->map, (int)0xff, old_map_size);
> > > + memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
> > > + /* nr_deferred: copy old values, clear all new values */
> > > + memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
> > > + memset((void *)new->nr_deferred + old_defer_size, 0,
> > > + defer_size - old_defer_size);
> > >
> > > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> > > kvfree_rcu(old);
> > > @@ -232,9 +247,6 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> > > struct shrinker_info *info;
> > > int nid;
> > >
> > > - if (mem_cgroup_is_root(memcg))
> > > - return;
> > > -
> > > for_each_node(nid) {
> > > pn = mem_cgroup_nodeinfo(memcg, nid);
> > > info = shrinker_info_protected(memcg, nid);
> > > @@ -247,12 +259,12 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> > > {
> > > struct shrinker_info *info;
> > > int nid, size, ret = 0;
> > > -
> > > - if (mem_cgroup_is_root(memcg))
> > > - return 0;
> >
> > Can you please comment on the consequences of allowing shrinker_info to
> > be allocated for the root memcg? Why didn't we do that before, and why
> > is it fine (or maybe required) now? Please add the explanation to the
> > commit message.
>
> Before this patchset, shrinker_info just tracked shrinker_maps, which is
> not required for the root memcg. But the newly added nr_deferred is needed
> in the root memcg; otherwise the deferred work would get lost once the
> memcgs are reparented to root.
>
> How about adding the paragraph below to the commit log:
>
> "To preserve nr_deferred when reparenting memcgs to root, root memcg
> needs shrinker_info allocated too."
>

LGTM and you can add:

Reviewed-by: Shakeel Butt <[email protected]>

2021-03-08 21:13:49

by Shakeel Butt

Subject: Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

On Mon, Mar 8, 2021 at 12:22 PM Yang Shi <[email protected]> wrote:
>
> On Mon, Mar 8, 2021 at 8:49 AM Roman Gushchin <[email protected]> wrote:
> >
> > On Sun, Mar 07, 2021 at 10:13:04PM -0800, Shakeel Butt wrote:
> > > On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
> > > >
> > > > Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> > > > We don't have to define a dedicated callback for call_rcu() anymore.
> > > >
> > > > Signed-off-by: Yang Shi <[email protected]>
> > > > ---
> > > > mm/vmscan.c | 7 +------
> > > > 1 file changed, 1 insertion(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 2e753c2516fa..c2a309acd86b 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> > > >         return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
> > > >  }
> > > >
> > > > -static void free_shrinker_map_rcu(struct rcu_head *head)
> > > > -{
> > > > -       kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> > > > -}
> > > > -
> > > >  static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > > >                                     int size, int old_size)
> > > >  {
> > > > @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > > >                 memset((void *)new->map + old_size, 0, size - old_size);
> > > >
> > > >                 rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> > > > -               call_rcu(&old->rcu, free_shrinker_map_rcu);
> > > > +               kvfree_rcu(old);
> > >
> > > Please use kvfree_rcu(old, rcu) instead of kvfree_rcu(old). The
> > > single-parameter form can call synchronize_rcu().
> >
> > Oh, I didn't know about this difference. Thank you for noticing!
>
> BTW, I think I could keep your and Kirill's Acked-by with this change
> (using the two-parameter form of kvfree_rcu()) since the change seems trivial.

Once you make that change, you can add:

Reviewed-by: Shakeel Butt <[email protected]>
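
For readers following the kvfree_rcu() point in this sub-thread, here is a rough userspace model of why the two forms behave differently. It is only a sketch and not the kernel implementation: every name in it (rcu_head_model, shrinker_map_model, model_kvfree_rcu_head, model_kvfree_rcu_headless, model_grace_period_end, NR_SLOTS) is hypothetical. The assumption being modelled is that the two-parameter form can always queue the object through its embedded rcu_head, while the single-parameter form needs separate bookkeeping storage and may have to block (synchronize_rcu() in the kernel) when none is available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rcu_head: a link embedded in the object. */
struct rcu_head_model {
	struct rcu_head_model *next;
};

/* Hypothetical stand-in for memcg_shrinker_map; rcu must stay the first member. */
struct shrinker_map_model {
	struct rcu_head_model rcu;
	unsigned long map[4];
};

#define NR_SLOTS 1	/* assumed, deliberately tiny bookkeeping capacity */

static struct rcu_head_model *pending;	/* objects queued via their embedded head */
static void *slots[NR_SLOTS];		/* bookkeeping for head-less objects */
static int used_slots;

/* Models kvfree_rcu(ptr, rcu): the embedded head is the queue node, so it never blocks. */
static void model_kvfree_rcu_head(struct rcu_head_model *head)
{
	head->next = pending;
	pending = head;
}

/*
 * Models kvfree_rcu(ptr): there is no embedded head, so the pointer has to be
 * recorded elsewhere. Returns true when no slot was free and the caller had to
 * wait for a grace period and free inline (the blocking fallback).
 */
static bool model_kvfree_rcu_headless(void *ptr)
{
	if (used_slots < NR_SLOTS) {
		slots[used_slots++] = ptr;
		return false;
	}
	free(ptr);	/* stands in for synchronize_rcu() followed by kvfree() */
	return true;
}

/* Models the end of a grace period: free everything queued, return the count. */
static int model_grace_period_end(void)
{
	int freed = 0;

	while (pending) {
		struct rcu_head_model *head = pending;

		pending = head->next;
		free(head);	/* same address as the object: rcu is its first member */
		freed++;
	}
	for (int i = 0; i < used_slots; i++) {
		free(slots[i]);
		freed++;
	}
	used_slots = 0;
	return freed;
}
```

In this model a caller that cannot sleep should only use model_kvfree_rcu_head(), which mirrors the request above to pass the rcu member explicitly.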

2021-03-08 21:59:17

by Shakeel Butt

Subject: Re: [v8 PATCH 11/13] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> Now that nr_deferred is available at the per-memcg level for memcg aware shrinkers, there is
> no need to allocate shrinker->nr_deferred for such shrinkers anymore.
>
> prealloc_memcg_shrinker() returns -ENOSYS if !CONFIG_MEMCG or if memcg is disabled on the
> kernel command line, in which case the shrinker's SHRINKER_MEMCG_AWARE flag is cleared.
> This makes the implementation of this patch simpler.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Kirill Tkhai <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2021-03-08 23:45:05

by Shakeel Butt

Subject: Re: [v8 PATCH 12/13] mm: memcontrol: reparent nr_deferred when memcg offline

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <[email protected]> wrote:
>
> Now that a shrinker's nr_deferred is per memcg for memcg aware shrinkers, add the offlined
> memcg's nr_deferred to its parent's corresponding nr_deferred when the memcg goes offline.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Acked-by: Kirill Tkhai <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>
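
The reparenting described in this patch, and the reason the root memcg needs shrinker_info as discussed for the earlier patch, can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the code from the series: shrinker_info_model, NR_SHRINKERS and reparent_nr_deferred are hypothetical names, C11 atomics stand in for the kernel's atomic_long_t, the counter array is a fixed size rather than sized by shrinker_nr_max, and zeroing the child's counters via an exchange is a choice made for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_SHRINKERS 4	/* hypothetical fixed count; the kernel sizes this dynamically */

/* Simplified stand-in for one per-memcg, per-node shrinker_info. */
struct shrinker_info_model {
	atomic_long nr_deferred[NR_SHRINKERS];
};

/*
 * Fold the deferred work of an offlined child into its parent so that it is
 * not lost. If the parent is the root memcg, the root needs counters too.
 */
static void reparent_nr_deferred(struct shrinker_info_model *child,
				 struct shrinker_info_model *parent)
{
	for (int i = 0; i < NR_SHRINKERS; i++) {
		/* Take the child's count and leave zero behind. */
		long nr = atomic_exchange(&child->nr_deferred[i], 0);

		atomic_fetch_add(&parent->nr_deferred[i], nr);
	}
}
```

If the parent had no counters at all, as was the case for the root memcg before this series, the value read from the child would have nowhere to go.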