2018-05-10 09:58:44

by Kirill Tkhai

Subject: [PATCH v5 00/13] Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))

Hi,

this patch set solves the problem of slow shrink_slab() occurring
on machines with many shrinkers and memory cgroups (i.e., with
many containers). The problem is that the complexity of
shrink_slab() is O(n^2), and it grows too quickly with the number
of containers.

Suppose we have 200 containers, and every container has 10 mounts
and 10 cgroups. All container tasks are isolated, and they don't
touch other containers' mounts.

In the case of global reclaim, a task has to iterate over all the
memcgs and call all the memcg-aware shrinkers for each of them.
This means the task has to visit 200 * 10 = 2000 shrinkers for
every memcg, and since there are 2000 memcgs, the total number of
do_shrink_slab() calls is 2000 * 2000 = 4000000.
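The call-count arithmetic can be checked with a few lines of userspace Python (purely an illustration of the counting, not kernel code):

```python
containers = 200
mounts_per_container = 10
cgroups_per_container = 10

shrinkers = containers * mounts_per_container    # 2000 memcg-aware shrinkers
memcgs = containers * cgroups_per_container      # 2000 memory cgroups

# Old scheme: every memcg-aware shrinker is called for every memcg.
do_shrink_slab_calls = shrinkers * memcgs        # 4,000,000 calls

# New scheme (after the patchset): only shrinkers with charged objects
# in the given memcg are visited, roughly mounts_per_container per memcg.
charged_calls = memcgs * mounts_per_container    # 20,000 calls
```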

4 million calls are not the kind of operation that takes a single cpu
cycle each. E.g., super_cache_count() accesses at least two lists and makes
arithmetic calculations. Even if there are no charged objects, we still do these
calculations and evict useful cpu cache lines with the memory reads. I observed
nodes spending almost 100% of their time in the kernel under intensive writing
and global reclaim. The writer consumes pages quickly, but reclaim has to pass
through shrink_slab() before it reaches the page-shrinking functions (which
free SWAP_CLUSTER_MAX pages). Even when nothing is being written, the
iterations just waste time and slow reclaim down.

Let's see the small test below:

$echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
$mkdir /sys/fs/cgroup/memory/ct
$echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
$for i in `seq 0 4000`;
do mkdir /sys/fs/cgroup/memory/ct/$i;
echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
done

Then, let's see drop caches time (5 sequential calls):
$time echo 3 > /proc/sys/vm/drop_caches

0.00user 13.78system 0:13.78elapsed 99%CPU
0.00user 5.59system 0:05.60elapsed 99%CPU
0.00user 5.48system 0:05.48elapsed 99%CPU
0.00user 8.35system 0:08.35elapsed 99%CPU
0.00user 8.34system 0:08.35elapsed 99%CPU


The last four calls don't actually shrink anything, so the bare
iteration over the slab shrinkers takes 5.48 seconds. Not so good
for scalability.

The patchset solves the problem by making shrink_slab() of O(n)
complexity. The functional changes are:

1)Assign an id to every registered memcg-aware shrinker.
2)Maintain a per-memcg bitmap of memcg-aware shrinkers,
and set the shrinker-related bit after the first element
is added to the lru list (also when elements of a removed
child memcg are reparented).
3)Split memcg-aware and !memcg-aware shrinkers, and call
a shrinker only if its bit is set in the memcg's shrinker
bitmap.
(There is also functionality to clear the bit again after
the last element is shrunk.)
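The three steps can be sketched as a rough userspace model (all names here are illustrative, not the kernel API):

```python
class Memcg:
    """Toy stand-in for a memory cgroup carrying a shrinker bitmap."""
    def __init__(self):
        self.shrinker_bits = set()   # models the per-memcg bitmap

def on_first_charge(memcg, shrinker_id):
    # Step 2: set the shrinker's bit when its lru gets its first element.
    memcg.shrinker_bits.add(shrinker_id)

def shrink_slab_memcg(memcg, do_shrink_slab):
    # Step 3: visit only shrinkers whose bit is set, so the cost is
    # O(set bits), not O(all registered shrinkers).
    freed = 0
    for sid in sorted(memcg.shrinker_bits):
        freed += do_shrink_slab(sid)
    return freed
```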

This gives a significant performance increase. The result after the patchset is applied:

$time echo 3 > /proc/sys/vm/drop_caches

0.00user 1.10system 0:01.10elapsed 99%CPU
0.00user 0.00system 0:00.01elapsed 64%CPU
0.00user 0.01system 0:00.01elapsed 82%CPU
0.00user 0.00system 0:00.01elapsed 64%CPU
0.00user 0.01system 0:00.01elapsed 82%CPU

The results show a performance increase of at least 548 times.

So, the patchset reduces the complexity of shrink_slab() and improves
the performance under the types of load I pointed out. It will also
help in the !global (memcg) reclaim case, since there will be fewer
do_shrink_slab() calls there as well.

This patchset is made against linux-next.git tree.

v5: Put the optimizing logic under CONFIG_MEMCG_SHRINKER instead of MEMCG && !SLOB

v4: Do not use memcg mem_cgroup_idr for iteration over mem cgroups

v3: Many changes requested in commentaries to v2:

1)rebase on prealloc_shrinker() code base
2)root_mem_cgroup is excluded from the memcg maps
3)rwsem replaced with shrinkers_nr_max_mutex
4)changes around assignment of shrinker id to list lru
5)everything renamed

v2: Many changes requested in commentaries to v1:

1)the code mostly moved to mm/memcontrol.c;
2)using IDR instead of array of shrinkers;
3)added a possibility to assign list_lru shrinker id
at the time of shrinker registering;
4)reorganized locking and renamed functions and variables.

---

Kirill Tkhai (13):
mm: Assign id to every memcg-aware shrinker
memcg: Move up for_each_mem_cgroup{,_tree} defines
mm: Assign memcg-aware shrinkers bitmap to memcg
mm: Refactoring in workingset_init()
fs: Refactoring in alloc_super()
fs: Propagate shrinker::id to list_lru
list_lru: Add memcg argument to list_lru_from_kmem()
list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()
list_lru: Pass lru argument to memcg_drain_list_lru_node()
mm: Set bit in memcg shrinker bitmap on first list_lru item appearance
mm: Iterate only over charged shrinkers during memcg shrink_slab()
mm: Add SHRINK_EMPTY shrinker methods return value
mm: Clear shrinker bit if there are no objects related to memcg


fs/super.c | 18 ++++-
include/linux/list_lru.h | 5 +
include/linux/memcontrol.h | 39 ++++++++++
include/linux/shrinker.h | 11 ++-
init/Kconfig | 5 +
mm/list_lru.c | 65 +++++++++++++----
mm/memcontrol.c | 148 ++++++++++++++++++++++++++++++++++----
mm/vmscan.c | 170 +++++++++++++++++++++++++++++++++++++++++---
mm/workingset.c | 13 +++
9 files changed, 421 insertions(+), 53 deletions(-)

--
Signed-off-by: Kirill Tkhai <[email protected]>


2018-05-10 09:53:17

by Kirill Tkhai

Subject: [PATCH v5 02/13] memcg: Move up for_each_mem_cgroup{, _tree} defines

The next patch requires these defines to be above their current
position, so move them up to the declarations section.

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/memcontrol.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bde5819be340..3df3efa7ff40 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -233,6 +233,21 @@ enum res_type {
/* Used for OOM nofiier */
#define OOM_CONTROL (0)

+/*
+ * Iteration constructs for visiting all cgroups (under a tree). If
+ * loops are exited prematurely (break), mem_cgroup_iter_break() must
+ * be used for reference counting.
+ */
+#define for_each_mem_cgroup_tree(iter, root) \
+ for (iter = mem_cgroup_iter(root, NULL, NULL); \
+ iter != NULL; \
+ iter = mem_cgroup_iter(root, iter, NULL))
+
+#define for_each_mem_cgroup(iter) \
+ for (iter = mem_cgroup_iter(NULL, NULL, NULL); \
+ iter != NULL; \
+ iter = mem_cgroup_iter(NULL, iter, NULL))
+
/* Some nice accessors for the vmpressure. */
struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
{
@@ -867,21 +882,6 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
}
}

-/*
- * Iteration constructs for visiting all cgroups (under a tree). If
- * loops are exited prematurely (break), mem_cgroup_iter_break() must
- * be used for reference counting.
- */
-#define for_each_mem_cgroup_tree(iter, root) \
- for (iter = mem_cgroup_iter(root, NULL, NULL); \
- iter != NULL; \
- iter = mem_cgroup_iter(root, iter, NULL))
-
-#define for_each_mem_cgroup(iter) \
- for (iter = mem_cgroup_iter(NULL, NULL, NULL); \
- iter != NULL; \
- iter = mem_cgroup_iter(NULL, iter, NULL))
-
/**
* mem_cgroup_scan_tasks - iterate over tasks of a memory cgroup hierarchy
* @memcg: hierarchy root


2018-05-10 09:53:40

by Kirill Tkhai

Subject: [PATCH v5 03/13] mm: Assign memcg-aware shrinkers bitmap to memcg

Imagine a big node with many cpus, memory cgroups and containers.
Suppose we have 200 containers, every container has 10 mounts
and 10 cgroups. Container tasks don't touch other containers'
mounts. If there is intensive page writing and global reclaim
happens, a writing task has to iterate over all the memcgs to
shrink slab before it's able to go to shrink_page_list().

Iterating over all the memcg slabs is very expensive:
the task has to visit 200 * 10 = 2000 shrinkers
for every memcg, and since there are 2000 memcgs,
the total number of calls is 2000 * 2000 = 4000000.

So, the shrinker makes 4 million do_shrink_slab() calls
just to try to isolate SWAP_CLUSTER_MAX pages of one
of the actively writing memcgs via shrink_page_list().
I've observed a node spending almost 100% of its time
in the kernel, uselessly iterating over already shrunk slabs.

This patch adds a bitmap of memcg-aware shrinkers to every memcg.
The size of the bitmap depends on bitmap_nr_ids, and during the
memcg's lifetime it is maintained to be large enough to fit
bitmap_nr_ids shrinkers. Every bit in the map corresponds to a
shrinker id.

The next patches will keep a bit set only for memcgs with really
charged objects. This will allow shrink_slab() to improve its
performance significantly. See the last patch for the numbers.
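The map-resizing rule the patch implements, preserving all old bits and starting the new tail all-clear, can be modeled in userspace like this (a sketch with byte-granular sizes, as in memcg_expand_one_shrinker_map(); not the kernel code):

```python
def expand_shrinker_map(old_map: bytes, new_size: int) -> bytearray:
    # Mirrors the memset() pair in memcg_expand_one_shrinker_map():
    # the old part of the new map keeps its bits, while the freshly
    # added tail starts out all-clear.
    new = bytearray(new_size)        # zero-filled allocation
    new[:len(old_map)] = old_map     # preserve already-set bits
    return new
```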

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/memcontrol.h | 21 ++++++++
mm/memcontrol.c | 116 ++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 16 ++++++
3 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6cbea2f25a87..e5e7e0fc7158 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -105,6 +105,17 @@ struct lruvec_stat {
long count[NR_VM_NODE_STAT_ITEMS];
};

+#ifdef CONFIG_MEMCG_SHRINKER
+/*
+ * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
+ * which have elements charged to this memcg.
+ */
+struct memcg_shrinker_map {
+ struct rcu_head rcu;
+ unsigned long map[0];
+};
+#endif /* CONFIG_MEMCG_SHRINKER */
+
/*
* per-zone information in memory controller.
*/
@@ -118,6 +129,9 @@ struct mem_cgroup_per_node {

struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];

+#ifdef CONFIG_MEMCG_SHRINKER
+ struct memcg_shrinker_map __rcu *shrinker_map;
+#endif
struct rb_node tree_node; /* RB tree node */
unsigned long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
@@ -1255,4 +1269,11 @@ static inline void memcg_put_cache_ids(void)

#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

+#ifdef CONFIG_MEMCG_SHRINKER
+#define MEMCG_SHRINKER_MAP(memcg, nid) (memcg->nodeinfo[nid]->shrinker_map)
+
+extern int memcg_shrinker_nr_max;
+extern int memcg_expand_shrinker_maps(int old_id, int id);
+#endif /* CONFIG_MEMCG_SHRINKER */
+
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3df3efa7ff40..18e0fdf302a9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -322,6 +322,116 @@ struct workqueue_struct *memcg_kmem_cache_wq;

#endif /* !CONFIG_SLOB */

+#ifdef CONFIG_MEMCG_SHRINKER
+int memcg_shrinker_nr_max;
+static DEFINE_MUTEX(shrinkers_nr_max_mutex);
+
+static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
+{
+ kvfree(container_of(head, struct memcg_shrinker_map, rcu));
+}
+
+static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
+ int size, int old_size)
+{
+ struct memcg_shrinker_map *new, *old;
+ int nid;
+
+ lockdep_assert_held(&shrinkers_nr_max_mutex);
+
+ for_each_node(nid) {
+ old = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
+ /* Not yet online memcg */
+ if (old_size && !old)
+ return 0;
+
+ new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ /* Set all old bits, clear all new bits */
+ memset(new->map, (int)0xff, old_size);
+ memset((void *)new->map + old_size, 0, size - old_size);
+
+ rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
+ if (old)
+ call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
+ }
+
+ return 0;
+}
+
+static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup_per_node *pn;
+ struct memcg_shrinker_map *map;
+ int nid;
+
+ if (memcg == root_mem_cgroup)
+ return;
+
+ mutex_lock(&shrinkers_nr_max_mutex);
+ for_each_node(nid) {
+ pn = mem_cgroup_nodeinfo(memcg, nid);
+ map = rcu_dereference_protected(pn->shrinker_map, true);
+ if (map)
+ call_rcu(&map->rcu, memcg_free_shrinker_map_rcu);
+ rcu_assign_pointer(pn->shrinker_map, NULL);
+ }
+ mutex_unlock(&shrinkers_nr_max_mutex);
+}
+
+static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
+{
+ int ret, size = memcg_shrinker_nr_max/BITS_PER_BYTE;
+
+ if (memcg == root_mem_cgroup)
+ return 0;
+
+ mutex_lock(&shrinkers_nr_max_mutex);
+ ret = memcg_expand_one_shrinker_map(memcg, size, 0);
+ mutex_unlock(&shrinkers_nr_max_mutex);
+
+ if (ret)
+ memcg_free_shrinker_maps(memcg);
+
+ return ret;
+}
+
+static struct idr mem_cgroup_idr;
+
+int memcg_expand_shrinker_maps(int old_nr, int nr)
+{
+ int size, old_size, ret = 0;
+ struct mem_cgroup *memcg;
+
+ old_size = old_nr / BITS_PER_BYTE;
+ size = nr / BITS_PER_BYTE;
+
+ mutex_lock(&shrinkers_nr_max_mutex);
+
+ if (!root_mem_cgroup)
+ goto unlock;
+
+ for_each_mem_cgroup(memcg) {
+ if (memcg == root_mem_cgroup)
+ continue;
+ ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
+ if (ret)
+ goto unlock;
+ }
+unlock:
+ mutex_unlock(&shrinkers_nr_max_mutex);
+ return ret;
+}
+#else /* CONFIG_MEMCG_SHRINKER */
+static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
+{
+ return 0;
+}
+static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) { }
+#endif /* CONFIG_MEMCG_SHRINKER */
+
/**
* mem_cgroup_css_from_page - css of the memcg associated with a page
* @page: page of interest
@@ -4471,6 +4581,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);

+ if (memcg_alloc_shrinker_maps(memcg)) {
+ mem_cgroup_id_remove(memcg);
+ return -ENOMEM;
+ }
+
/* Online state pins memcg ID, memcg ID pins CSS */
atomic_set(&memcg->id.ref, 1);
css_get(css);
@@ -4522,6 +4637,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
vmpressure_cleanup(&memcg->vmpressure);
cancel_work_sync(&memcg->high_work);
mem_cgroup_remove_from_trees(memcg);
+ memcg_free_shrinker_maps(memcg);
memcg_free_kmem(memcg);
mem_cgroup_free(memcg);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d691beac1048..d8a2870710e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -174,12 +174,26 @@ static DEFINE_IDR(shrinker_idr);

static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
- int id, ret;
+ int id, nr, ret;

down_write(&shrinker_rwsem);
ret = id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (ret < 0)
goto unlock;
+
+ if (id >= memcg_shrinker_nr_max) {
+ nr = memcg_shrinker_nr_max * 2;
+ if (nr == 0)
+ nr = BITS_PER_BYTE;
+ BUG_ON(id >= nr);
+
+ if (memcg_expand_shrinker_maps(memcg_shrinker_nr_max, nr)) {
+ idr_remove(&shrinker_idr, id);
+ goto unlock;
+ }
+ memcg_shrinker_nr_max = nr;
+ }
+
shrinker->id = id;
ret = 0;
unlock:


2018-05-10 09:53:56

by Kirill Tkhai

Subject: [PATCH v5 05/13] fs: Refactoring in alloc_super()

Move the two list_lru_init_memcg() calls to after prealloc_super().
destroy_unused_super() in the failure path is fine with this.
The next patch needs this ordering.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 16c153d2f4f1..2ccacb78f91c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -234,10 +234,6 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
INIT_LIST_HEAD(&s->s_inodes_wb);
spin_lock_init(&s->s_inode_wblist_lock);

- if (list_lru_init_memcg(&s->s_dentry_lru))
- goto fail;
- if (list_lru_init_memcg(&s->s_inode_lru))
- goto fail;
s->s_count = 1;
atomic_set(&s->s_active, 1);
mutex_init(&s->s_vfs_rename_mutex);
@@ -258,6 +254,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
if (prealloc_shrinker(&s->s_shrink))
goto fail;
+ if (list_lru_init_memcg(&s->s_dentry_lru))
+ goto fail;
+ if (list_lru_init_memcg(&s->s_inode_lru))
+ goto fail;
return s;

fail:


2018-05-10 09:54:10

by Kirill Tkhai

Subject: [PATCH v5 06/13] fs: Propagate shrinker::id to list_lru

The patch adds a list_lru::shrinker_id field and populates
it with the registered shrinker's id.

In the next patches, the lru code will use it to set the
correct bit in the memcg shrinkers map once the first
memcg-related element appears in the list_lru.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 4 ++++
include/linux/list_lru.h | 3 +++
mm/list_lru.c | 6 ++++++
mm/workingset.c | 3 +++
4 files changed, 16 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 2ccacb78f91c..dfa85e725e45 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -258,6 +258,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
goto fail;
if (list_lru_init_memcg(&s->s_inode_lru))
goto fail;
+#ifdef CONFIG_MEMCG_SHRINKER
+ s->s_dentry_lru.shrinker_id = s->s_shrink.id;
+ s->s_inode_lru.shrinker_id = s->s_shrink.id;
+#endif
return s;

fail:
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 96def9d15b1b..a63b7a4abc6b 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -54,6 +54,9 @@ struct list_lru {
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
struct list_head list;
#endif
+#ifdef CONFIG_MEMCG_SHRINKER
+ int shrinker_id;
+#endif
};

void list_lru_destroy(struct list_lru *lru);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index d9c84c5bda1d..8dd3f181d86f 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -567,6 +567,9 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
size_t size = sizeof(*lru->node) * nr_node_ids;
int err = -ENOMEM;

+#ifdef CONFIG_MEMCG_SHRINKER
+ lru->shrinker_id = -1;
+#endif
memcg_get_cache_ids();

lru->node = kzalloc(size, GFP_KERNEL);
@@ -609,6 +612,9 @@ void list_lru_destroy(struct list_lru *lru)
kfree(lru->node);
lru->node = NULL;

+#ifdef CONFIG_MEMCG_SHRINKER
+ lru->shrinker_id = -1;
+#endif
memcg_put_cache_ids();
}
EXPORT_SYMBOL_GPL(list_lru_destroy);
diff --git a/mm/workingset.c b/mm/workingset.c
index c3a4fe145bb7..da720f3b0a0a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -534,6 +534,9 @@ static int __init workingset_init(void)
ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key);
if (ret)
goto err_list_lru;
+#ifdef CONFIG_MEMCG_SHRINKER
+ shadow_nodes.shrinker_id = workingset_shadow_shrinker.id;
+#endif
register_shrinker_prepared(&workingset_shadow_shrinker);
return 0;
err_list_lru:


2018-05-10 09:54:37

by Kirill Tkhai

Subject: [PATCH v5 09/13] list_lru: Pass lru argument to memcg_drain_list_lru_node()

This is just a refactoring to allow the next patches to have
the lru pointer in memcg_drain_list_lru_node().

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/list_lru.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 46b805073ed0..7f6cb27aa2f5 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -516,9 +516,10 @@ int memcg_update_all_list_lrus(int new_size)
goto out;
}

-static void memcg_drain_list_lru_node(struct list_lru_node *nlru,
+static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
int src_idx, struct mem_cgroup *dst_memcg)
{
+ struct list_lru_node *nlru = &lru->node[nid];
int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;

@@ -547,7 +548,7 @@ static void memcg_drain_list_lru(struct list_lru *lru,
return;

for_each_node(i)
- memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_memcg);
+ memcg_drain_list_lru_node(lru, i, src_idx, dst_memcg);
}

void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)


2018-05-10 09:54:53

by Kirill Tkhai

Subject: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

Using the preparations made in the previous patches, during memcg
shrink we can now skip shrinkers whose bits are not set in the memcg's
shrinker bitmap. To do that, we separate the iterations over
memcg-aware and !memcg-aware shrinkers, and memcg-aware shrinkers
are chosen via for_each_set_bit() from the bitmap. On big nodes
with many isolated environments, this gives significant
performance growth. See the next patches for the details.

Note that this patch does not yet handle empty memcg shrinkers,
since we never clear a bitmap bit once it is set. Such shrinkers
will be called again, with no shrunk objects as a result. That
functionality is provided by the next patches.
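For illustration, a userspace analogue of the for_each_set_bit() iteration over such a byte-addressed bitmap (not the kernel implementation, which works on unsigned longs) could look like:

```python
def for_each_set_bit(bitmap: bytes, nbits: int):
    # Yield the indices of set bits in ascending order, like the
    # kernel's for_each_set_bit() macro walks a shrinker map.
    for i in range(nbits):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            yield i
```

Only shrinkers whose ids come out of this loop would then be looked up (via idr_find() in the kernel) and called.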

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/memcontrol.h | 1 +
mm/vmscan.c | 70 ++++++++++++++++++++++++++++++++++++++------
2 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 82f892e77637..436691a66500 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -760,6 +760,7 @@ void mem_cgroup_split_huge_fixup(struct page *head);
#define MEM_CGROUP_ID_MAX 0

struct mem_cgroup;
+#define root_mem_cgroup NULL

static inline bool mem_cgroup_disabled(void)
{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d8a2870710e0..a2e38e05adb5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -376,6 +376,7 @@ int prealloc_shrinker(struct shrinker *shrinker)
goto free_deferred;
}

+ INIT_LIST_HEAD(&shrinker->list);
return 0;

free_deferred:
@@ -547,6 +548,63 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
return freed;
}

+#ifdef CONFIG_MEMCG_SHRINKER
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+ struct mem_cgroup *memcg, int priority)
+{
+ struct memcg_shrinker_map *map;
+ unsigned long freed = 0;
+ int ret, i;
+
+ if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
+ return 0;
+
+ if (!down_read_trylock(&shrinker_rwsem))
+ return 0;
+
+ /*
+ * 1)Caller passes only alive memcg, so map can't be NULL.
+ * 2)shrinker_rwsem protects from maps expanding.
+ */
+ map = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
+ BUG_ON(!map);
+
+ for_each_set_bit(i, map->map, memcg_shrinker_nr_max) {
+ struct shrink_control sc = {
+ .gfp_mask = gfp_mask,
+ .nid = nid,
+ .memcg = memcg,
+ };
+ struct shrinker *shrinker;
+
+ shrinker = idr_find(&shrinker_idr, i);
+ if (!shrinker) {
+ clear_bit(i, map->map);
+ continue;
+ }
+ if (list_empty(&shrinker->list))
+ continue;
+
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ freed += ret;
+
+ if (rwsem_is_contended(&shrinker_rwsem)) {
+ freed = freed ? : 1;
+ break;
+ }
+ }
+
+ up_read(&shrinker_rwsem);
+ return freed;
+}
+#else /* CONFIG_MEMCG_SHRINKER */
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+ struct mem_cgroup *memcg, int priority)
+{
+ return 0;
+}
+#endif /* CONFIG_MEMCG_SHRINKER */
+
/**
* shrink_slab - shrink slab caches
* @gfp_mask: allocation context
@@ -576,8 +634,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
struct shrinker *shrinker;
unsigned long freed = 0;

- if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
- return 0;
+ if (memcg && memcg != root_mem_cgroup)
+ return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

if (!down_read_trylock(&shrinker_rwsem))
goto out;
@@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
.memcg = memcg,
};

- /*
- * If kernel memory accounting is disabled, we ignore
- * SHRINKER_MEMCG_AWARE flag and call all shrinkers
- * passing NULL for memcg.
- */
- if (memcg_kmem_enabled() &&
- !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
+ if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
continue;

if (!(shrinker->flags & SHRINKER_NUMA_AWARE))


2018-05-10 09:55:07

by Kirill Tkhai

Subject: [PATCH v5 12/13] mm: Add SHRINK_EMPTY shrinker methods return value

We need to distinguish the situation when a shrinker has
a very small number of objects (see vfs_pressure_ratio()
called from super_cache_count()) from the situation when it
has no objects at all. Currently, shrinker::count_objects()
returns 0 in both of these cases.

The patch introduces a new SHRINK_EMPTY return value,
which will be used for the "no objects at all" case.
It's mostly a refactoring: in this patch, all callers of
do_shrink_slab() replace SHRINK_EMPTY with 0, and all the
magic happens in further patches.
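The distinction between 0 and SHRINK_EMPTY can be sketched in userspace; vfs_pressure_ratio() is simplified below, and the whole model is illustrative rather than the kernel code:

```python
SHRINK_STOP  = 2**64 - 1        # ~0UL
SHRINK_EMPTY = 2**64 - 2        # ~0UL - 1

def super_cache_count_model(nr_dentries, nr_inodes, vfs_cache_pressure=100):
    total = nr_dentries + nr_inodes
    if total == 0:
        return SHRINK_EMPTY     # truly nothing cached: bit may be cleared
    # Simplified vfs_pressure_ratio(): a tiny cache under low pressure
    # may legitimately round down to 0, which is distinct from empty.
    return total * vfs_cache_pressure // 100
```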

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 3 +++
include/linux/shrinker.h | 7 +++++--
mm/vmscan.c | 12 +++++++++---
mm/workingset.c | 3 +++
4 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index dfa85e725e45..3cad04644329 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -134,6 +134,9 @@ static unsigned long super_cache_count(struct shrinker *shrink,
total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

+ if (!total_objects)
+ return SHRINK_EMPTY;
+
total_objects = vfs_pressure_ratio(total_objects);
return total_objects;
}
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index d8f3fc833e6e..82ea5012dfa0 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -34,12 +34,15 @@ struct shrink_control {
};

#define SHRINK_STOP (~0UL)
+#define SHRINK_EMPTY (~0UL - 1)
/*
* A callback you can register to apply pressure to ageable caches.
*
* @count_objects should return the number of freeable items in the cache. If
- * there are no objects to free or the number of freeable items cannot be
- * determined, it should return 0. No deadlock checks should be done during the
+ * there are no objects to free, it should return SHRINK_EMPTY, while 0 is
+ * returned in cases of the number of freeable items cannot be determined
+ * or shrinker should skip this cache for this time (e.g., their number
+ * is below shrinkable limit). No deadlock checks should be done during the
* count callback - the shrinker relies on aggregating scan counts that couldn't
* be executed due to potential deadlocks to be run at a later call when the
* deadlock condition is no longer pending.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a2e38e05adb5..7b0075612d73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -446,8 +446,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
long scanned = 0, next_deferred;

freeable = shrinker->count_objects(shrinker, shrinkctl);
- if (freeable == 0)
- return 0;
+ if (freeable == 0 || freeable == SHRINK_EMPTY)
+ return freeable;

/*
* copy the current shrinker scan count into a local variable
@@ -586,6 +586,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
continue;

ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY)
+ ret = 0;
freed += ret;

if (rwsem_is_contended(&shrinker_rwsem)) {
@@ -633,6 +635,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
{
struct shrinker *shrinker;
unsigned long freed = 0;
+ int ret;

if (memcg && memcg != root_mem_cgroup)
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
@@ -653,7 +656,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
sc.nid = 0;

- freed += do_shrink_slab(&sc, shrinker, priority);
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY)
+ ret = 0;
+ freed += ret;
/*
* Bail out if someone want to register a new shrinker to
* prevent the regsitration from being stalled for long periods
diff --git a/mm/workingset.c b/mm/workingset.c
index da720f3b0a0a..e731e21a9fca 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -402,6 +402,9 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
}
max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);

+ if (!nodes)
+ return SHRINK_EMPTY;
+
if (nodes <= max_nodes)
return 0;
return nodes - max_nodes;


2018-05-10 09:55:12

by Kirill Tkhai

Subject: [PATCH v5 13/13] mm: Clear shrinker bit if there are no objects related to memcg

To avoid further unneeded calls of do_shrink_slab()
for shrinkers which no longer have any charged objects
in a memcg, their bits have to be cleared.

This patch introduces a lockless mechanism to do that
without races with parallel list_lru_add(). After
do_shrink_slab() returns SHRINK_EMPTY the first time,
we clear the bit and call it once again. Then we restore
the bit if the new return value is different.

Note that the single smp_mb__after_atomic() in shrink_slab_memcg()
covers two situations:

1)list_lru_add()                shrink_slab_memcg()
    list_add_tail()               for_each_set_bit() <--- read bit
                                    do_shrink_slab() <--- missed list update (no barrier)
    <MB>                          <MB>
    set_bit()                     do_shrink_slab() <--- seen list update

This situation, when the first do_shrink_slab() sees the set bit
but doesn't see the list update (i.e., races with the queueing of
the first element), is rare. So, instead of adding an <MB> before
the first do_shrink_slab() call, we accept it, so as not to slow
down the generic case. The second call is also needed for case (2)
below.

2)list_lru_add()                shrink_slab_memcg()
    list_add_tail()               ...
    set_bit()                     ...
  ...                             for_each_set_bit()
  do_shrink_slab()                  do_shrink_slab()
    clear_bit()                   ...
  ...                             ...
  list_lru_add()                  ...
    list_add_tail()               clear_bit()
    <MB>                          <MB>
    set_bit()                     do_shrink_slab()

The barriers guarantee that the second do_shrink_slab()
in the right-hand task sees the list update if it really
cleared the bit. This case is drawn in the code comment.
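The clear/retry/restore protocol can be modeled in userspace as follows (a sketch with illustrative names; the real logic sits in shrink_slab_memcg(), and the comment marks where the kernel's barrier goes):

```python
SHRINK_EMPTY = 2**64 - 2   # ~0UL - 1, as introduced by the previous patch

def shrink_and_maybe_clear(memcg_bits, sid, do_shrink_slab):
    # On SHRINK_EMPTY: clear the bit, call the shrinker once more
    # (the kernel puts smp_mb__after_atomic() between the two calls),
    # and restore the bit if the retry found objects after all.
    ret = do_shrink_slab(sid)
    if ret == SHRINK_EMPTY:
        memcg_bits.discard(sid)          # clear_bit()
        # <MB>: smp_mb__after_atomic() in the kernel
        ret = do_shrink_slab(sid)
        if ret == SHRINK_EMPTY:
            ret = 0                      # bit stays clear: really empty
        else:
            memcg_bits.add(sid)          # raced with list_lru_add(): restore
    return ret
```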

[Results/performance of the patchset]

After the whole patchset is applied, the test below shows a
significant increase in performance:

$echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
$mkdir /sys/fs/cgroup/memory/ct
$echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
$for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done

Then, 5 sequential calls of drop caches:
$time echo 3 > /proc/sys/vm/drop_caches

1)Before:
0.00user 13.78system 0:13.78elapsed 99%CPU
0.00user 5.59system 0:05.60elapsed 99%CPU
0.00user 5.48system 0:05.48elapsed 99%CPU
0.00user 8.35system 0:08.35elapsed 99%CPU
0.00user 8.34system 0:08.35elapsed 99%CPU

2)After
0.00user 1.10system 0:01.10elapsed 99%CPU
0.00user 0.00system 0:00.01elapsed 64%CPU
0.00user 0.01system 0:00.01elapsed 82%CPU
0.00user 0.00system 0:00.01elapsed 64%CPU
0.00user 0.01system 0:00.01elapsed 82%CPU

The results show a performance increase of at least 548 times.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/memcontrol.h | 2 ++
mm/vmscan.c | 19 +++++++++++++++++--
2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 436691a66500..82c0bf2d0579 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1283,6 +1283,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int

rcu_read_lock();
map = MEMCG_SHRINKER_MAP(memcg, nid);
+ /* Pairs with smp mb in shrink_slab() */
+ smp_mb__before_atomic();
set_bit(nr, map->map);
rcu_read_unlock();
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b0075612d73..189b163bef4a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -586,8 +586,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
continue;

ret = do_shrink_slab(&sc, shrinker, priority);
- if (ret == SHRINK_EMPTY)
- ret = 0;
+ if (ret == SHRINK_EMPTY) {
+ clear_bit(i, map->map);
+ /*
+ * Pairs with mb in memcg_set_shrinker_bit():
+ *
+ * list_lru_add() shrink_slab_memcg()
+ * list_add_tail() clear_bit()
+ * <MB> <MB>
+ * set_bit() do_shrink_slab()
+ */
+ smp_mb__after_atomic();
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY)
+ ret = 0;
+ else
+ memcg_set_shrinker_bit(memcg, nid, i);
+ }
freed += ret;

if (rwsem_is_contended(&shrinker_rwsem)) {


2018-05-10 09:55:28

by Kirill Tkhai

Subject: [PATCH v5 07/13] list_lru: Add memcg argument to list_lru_from_kmem()

This is just a refactoring to allow the next patches to have
the memcg pointer in list_lru_from_kmem().

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/list_lru.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 8dd3f181d86f..0721381b2e3d 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -76,18 +76,24 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
}

static inline struct list_lru_one *
-list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
+list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
+ struct mem_cgroup **memcg_ptr)
{
- struct mem_cgroup *memcg;
+ struct list_lru_one *l = &nlru->lru;
+ struct mem_cgroup *memcg = NULL;

if (!nlru->memcg_lrus)
- return &nlru->lru;
+ goto out;

memcg = mem_cgroup_from_kmem(ptr);
if (!memcg)
- return &nlru->lru;
+ goto out;

- return list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+ l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+out:
+ if (memcg_ptr)
+ *memcg_ptr = memcg;
+ return l;
}
#else
static inline bool list_lru_memcg_aware(struct list_lru *lru)
@@ -102,8 +108,11 @@ list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
}

static inline struct list_lru_one *
-list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
+list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
+ struct mem_cgroup **memcg_ptr)
{
+ if (memcg_ptr)
+ *memcg_ptr = NULL;
return &nlru->lru;
}
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */
@@ -116,7 +125,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)

spin_lock(&nlru->lock);
if (list_empty(item)) {
- l = list_lru_from_kmem(nlru, item);
+ l = list_lru_from_kmem(nlru, item, NULL);
list_add_tail(item, &l->list);
l->nr_items++;
nlru->nr_items++;
@@ -142,7 +151,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)

spin_lock(&nlru->lock);
if (!list_empty(item)) {
- l = list_lru_from_kmem(nlru, item);
+ l = list_lru_from_kmem(nlru, item, NULL);
list_del_init(item);
l->nr_items--;
nlru->nr_items--;

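list_lru_from_kmem() above gains an optional out-parameter that callers pass as NULL when they do not need the memcg. A minimal standalone illustration of that idiom (the struct and names here are invented for the example, not taken from the patch):

```c
#include <stddef.h>

struct ctx { int owner; };

/* Look up a value and, if the caller asked for it, also report the
 * owner.  Callers that do not care simply pass NULL for owner_ptr,
 * exactly as list_lru_add()/list_lru_del() do in the patch. */
static int lookup(const struct ctx *c, int *owner_ptr)
{
	int owner = c ? c->owner : -1;

	if (owner_ptr)
		*owner_ptr = owner; /* fill only when requested */
	return owner >= 0;
}
```

This keeps the common call sites unchanged while letting a later caller (the next patch in the series) retrieve the extra result.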

2018-05-10 09:55:37

by Kirill Tkhai

Subject: [PATCH v5 08/13] list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()

This is just a refactoring to allow the next patches to use a
dst_memcg pointer in memcg_drain_list_lru_node().

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/list_lru.h | 2 +-
mm/list_lru.c | 11 ++++++-----
mm/memcontrol.c | 2 +-
3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index a63b7a4abc6b..a63bad2c981a 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -68,7 +68,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
#define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL)

int memcg_update_all_list_lrus(int num_memcgs);
-void memcg_drain_all_list_lrus(int src_idx, int dst_idx);
+void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);

/**
* list_lru_add: add an element to the lru list's tail
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0721381b2e3d..46b805073ed0 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -517,8 +517,9 @@ int memcg_update_all_list_lrus(int new_size)
}

static void memcg_drain_list_lru_node(struct list_lru_node *nlru,
- int src_idx, int dst_idx)
+ int src_idx, struct mem_cgroup *dst_memcg)
{
+ int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;

/*
@@ -538,7 +539,7 @@ static void memcg_drain_list_lru_node(struct list_lru_node *nlru,
}

static void memcg_drain_list_lru(struct list_lru *lru,
- int src_idx, int dst_idx)
+ int src_idx, struct mem_cgroup *dst_memcg)
{
int i;

@@ -546,16 +547,16 @@ static void memcg_drain_list_lru(struct list_lru *lru,
return;

for_each_node(i)
- memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx);
+ memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_memcg);
}

-void memcg_drain_all_list_lrus(int src_idx, int dst_idx)
+void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)
{
struct list_lru *lru;

mutex_lock(&list_lrus_mutex);
list_for_each_entry(lru, &list_lrus, list)
- memcg_drain_list_lru(lru, src_idx, dst_idx);
+ memcg_drain_list_lru(lru, src_idx, dst_memcg);
mutex_unlock(&list_lrus_mutex);
}
#else
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18e0fdf302a9..df9e7f159369 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3173,7 +3173,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
}
rcu_read_unlock();

- memcg_drain_all_list_lrus(kmemcg_id, parent->kmemcg_id);
+ memcg_drain_all_list_lrus(kmemcg_id, parent);

memcg_free_cache_id(kmemcg_id);
}


2018-05-10 09:55:57

by Kirill Tkhai

Subject: [PATCH v5 10/13] mm: Set bit in memcg shrinker bitmap on first list_lru item appearance

Introduce the memcg_set_shrinker_bit() function to set a shrinker-related
bit in the memcg shrinker bitmap. The bit is set after the first
item is added, and also when a destroyed memcg's items are reparented.

This will allow the next patch to call shrinkers only when
they actually have charged objects at the moment, and thereby
to improve shrink_slab() performance.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/memcontrol.h | 15 +++++++++++++++
mm/list_lru.c | 22 ++++++++++++++++++++--
2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e5e7e0fc7158..82f892e77637 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1274,6 +1274,21 @@ static inline void memcg_put_cache_ids(void)

extern int memcg_shrinker_nr_max;
extern int memcg_expand_shrinker_maps(int old_id, int id);
+
+static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
+{
+ if (nr >= 0 && memcg && memcg != root_mem_cgroup) {
+ struct memcg_shrinker_map *map;
+
+ rcu_read_lock();
+ map = MEMCG_SHRINKER_MAP(memcg, nid);
+ set_bit(nr, map->map);
+ rcu_read_unlock();
+ }
+}
+#else
+static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
+ int node, int id) { }
#endif /* CONFIG_MEMCG_SHRINKER */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 7f6cb27aa2f5..6ce52f80f12c 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -30,6 +30,11 @@ static void list_lru_unregister(struct list_lru *lru)
list_del(&lru->list);
mutex_unlock(&list_lrus_mutex);
}
+
+static int lru_shrinker_id(struct list_lru *lru)
+{
+ return lru->shrinker_id;
+}
#else
static void list_lru_register(struct list_lru *lru)
{
@@ -38,6 +43,11 @@ static void list_lru_register(struct list_lru *lru)
static void list_lru_unregister(struct list_lru *lru)
{
}
+
+static int lru_shrinker_id(struct list_lru *lru)
+{
+ return -1;
+}
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
@@ -121,13 +131,17 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
+ struct mem_cgroup *memcg;
struct list_lru_one *l;

spin_lock(&nlru->lock);
if (list_empty(item)) {
- l = list_lru_from_kmem(nlru, item, NULL);
+ l = list_lru_from_kmem(nlru, item, &memcg);
list_add_tail(item, &l->list);
- l->nr_items++;
+ /* Set shrinker bit if the first element was added */
+ if (!l->nr_items++)
+ memcg_set_shrinker_bit(memcg, nid,
+ lru_shrinker_id(lru));
nlru->nr_items++;
spin_unlock(&nlru->lock);
return true;
@@ -522,6 +536,7 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
struct list_lru_node *nlru = &lru->node[nid];
int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;
+ bool set;

/*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -533,7 +548,10 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
dst = list_lru_from_memcg_idx(nlru, dst_idx);

list_splice_init(&src->list, &dst->list);
+ set = (!dst->nr_items && src->nr_items);
dst->nr_items += src->nr_items;
+ if (set)
+ memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
src->nr_items = 0;

spin_unlock_irq(&nlru->lock);
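The first-item test added above, `if (!l->nr_items++)`, relies on post-increment returning the old value, so the bit is set exactly once, on the empty to non-empty transition rather than on every add. A standalone sketch of the idiom (the struct is reduced to the two fields the example needs):

```c
#include <stdbool.h>

struct lru { long nr_items; bool bit_set; };

/* Model of list_lru_add() after the patch: the shrinker bit is set
 * only when the list goes from empty to non-empty. */
static void lru_add(struct lru *l)
{
	if (!l->nr_items++)     /* true only when the old count was 0 */
		l->bit_set = true;
}
```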


2018-05-10 09:56:50

by Kirill Tkhai

Subject: [PATCH v5 04/13] mm: Refactoring in workingset_init()

Use prealloc_shrinker()/register_shrinker_prepared()
instead of register_shrinker(). This will be used
by the next patch.

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/workingset.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c83978..c3a4fe145bb7 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -528,15 +528,16 @@ static int __init workingset_init(void)
pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
timestamp_bits, max_order, bucket_order);

- ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key);
+ ret = prealloc_shrinker(&workingset_shadow_shrinker);
if (ret)
goto err;
- ret = register_shrinker(&workingset_shadow_shrinker);
+ ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key);
if (ret)
goto err_list_lru;
+ register_shrinker_prepared(&workingset_shadow_shrinker);
return 0;
err_list_lru:
- list_lru_destroy(&shadow_nodes);
+ free_prealloced_shrinker(&workingset_shadow_shrinker);
err:
return ret;
}
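The reordering above splits setup into a fallible prealloc step and an infallible publish step, so an __list_lru_init() failure only has to free the preallocation instead of unwinding a live shrinker. A sketch of this two-phase control flow (stub bodies are invented; only the error-unwind structure mirrors the patch):

```c
#include <stdbool.h>
#include <stdlib.h>

struct shrinker { long *nr_deferred; bool registered; };

/* Fallible step: allocate everything the shrinker will need. */
static int prealloc_shrinker(struct shrinker *s)
{
	s->nr_deferred = calloc(1, sizeof(*s->nr_deferred));
	return s->nr_deferred ? 0 : -1; /* -ENOMEM in the kernel */
}

/* Infallible step: publish the already-allocated shrinker. */
static void register_shrinker_prepared(struct shrinker *s)
{
	s->registered = true;
}

static void free_prealloced_shrinker(struct shrinker *s)
{
	free(s->nr_deferred);
	s->nr_deferred = NULL;
}

/* Mirrors the control flow of workingset_init() after the patch;
 * lru_init_ret stands in for the __list_lru_init() return value. */
static int init(struct shrinker *s, int lru_init_ret)
{
	int ret = prealloc_shrinker(s);

	if (ret)
		goto err;
	ret = lru_init_ret;
	if (ret)
		goto err_list_lru;
	register_shrinker_prepared(s);
	return 0;
err_list_lru:
	free_prealloced_shrinker(s);
err:
	return ret;
}
```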


2018-05-10 09:57:03

by Kirill Tkhai

Subject: [PATCH v5 01/13] mm: Assign id to every memcg-aware shrinker

The patch introduces a shrinker::id number, which is used to enumerate
memcg-aware shrinkers. The numbers start from 0, and the code tries
to keep them as small as possible.

This will be used to represent memcg-aware shrinkers in the memcg
shrinkers map.

Since all memcg-aware shrinkers are based on list_lru, which is per-memcg
only in case of !SLOB, the new functionality will be under a MEMCG && !SLOB
ifdef (aliased to CONFIG_MEMCG_SHRINKER).

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 3 ++
include/linux/shrinker.h | 4 +++
init/Kconfig | 5 ++++
mm/vmscan.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 71 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 122c402049a2..16c153d2f4f1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -248,6 +248,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_time_gran = 1000000000;
s->cleancache_poolid = CLEANCACHE_NO_POOL;

+#ifdef CONFIG_MEMCG_SHRINKER
+ s->s_shrink.id = -1;
+#endif
s->s_shrink.seeks = DEFAULT_SEEKS;
s->s_shrink.scan_objects = super_cache_scan;
s->s_shrink.count_objects = super_cache_count;
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 6794490f25b2..d8f3fc833e6e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -66,6 +66,10 @@ struct shrinker {

/* These are for internal use */
struct list_head list;
+#ifdef CONFIG_MEMCG_SHRINKER
+ /* ID in shrinker_idr */
+ int id;
+#endif
/* objs pending delete, per node */
atomic_long_t *nr_deferred;
};
diff --git a/init/Kconfig b/init/Kconfig
index 1706d963766b..09e201c2ada9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -680,6 +680,11 @@ config MEMCG_SWAP_ENABLED
select this option (if, for some reason, they need to disable it
then swapaccount=0 does the trick).

+config MEMCG_SHRINKER
+ bool
+ depends on MEMCG && !SLOB
+ default y
+
config BLK_CGROUP
bool "IO controller"
depends on BLOCK
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10c8a38c5eef..d691beac1048 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -169,6 +169,47 @@ unsigned long vm_total_pages;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

+#ifdef CONFIG_MEMCG_SHRINKER
+static DEFINE_IDR(shrinker_idr);
+
+static int prealloc_memcg_shrinker(struct shrinker *shrinker)
+{
+ int id, ret;
+
+ down_write(&shrinker_rwsem);
+ ret = id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
+ if (ret < 0)
+ goto unlock;
+ shrinker->id = id;
+ ret = 0;
+unlock:
+ up_write(&shrinker_rwsem);
+ return ret;
+}
+
+static void del_memcg_shrinker(struct shrinker *shrinker)
+{
+ int id = shrinker->id;
+
+ if (id < 0)
+ return;
+
+ down_write(&shrinker_rwsem);
+ idr_remove(&shrinker_idr, id);
+ up_write(&shrinker_rwsem);
+ shrinker->id = -1;
+}
+#else /* CONFIG_MEMCG_SHRINKER */
+static int prealloc_memcg_shrinker(struct shrinker *shrinker)
+{
+ return 0;
+}
+
+static void del_memcg_shrinker(struct shrinker *shrinker)
+{
+}
+#endif /* CONFIG_MEMCG_SHRINKER */
+
#ifdef CONFIG_MEMCG
static bool global_reclaim(struct scan_control *sc)
{
@@ -306,6 +347,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
int prealloc_shrinker(struct shrinker *shrinker)
{
size_t size = sizeof(*shrinker->nr_deferred);
+ int ret;

if (shrinker->flags & SHRINKER_NUMA_AWARE)
size *= nr_node_ids;
@@ -313,11 +355,26 @@ int prealloc_shrinker(struct shrinker *shrinker)
shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
if (!shrinker->nr_deferred)
return -ENOMEM;
+
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+ ret = prealloc_memcg_shrinker(shrinker);
+ if (ret)
+ goto free_deferred;
+ }
+
return 0;
+
+free_deferred:
+ kfree(shrinker->nr_deferred);
+ shrinker->nr_deferred = NULL;
+ return -ENOMEM;
}

void free_prealloced_shrinker(struct shrinker *shrinker)
{
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+ del_memcg_shrinker(shrinker);
+
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
@@ -347,6 +404,8 @@ void unregister_shrinker(struct shrinker *shrinker)
{
if (!shrinker->nr_deferred)
return;
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+ del_memcg_shrinker(shrinker);
down_write(&shrinker_rwsem);
list_del(&shrinker->list);
up_write(&shrinker_rwsem);


2018-05-13 05:15:39

by Vladimir Davydov

Subject: Re: [PATCH v5 01/13] mm: Assign id to every memcg-aware shrinker

On Thu, May 10, 2018 at 12:52:18PM +0300, Kirill Tkhai wrote:
> The patch introduces shrinker::id number, which is used to enumerate
> memcg-aware shrinkers. The number start from 0, and the code tries
> to maintain it as small as possible.
>
> This will be used as to represent a memcg-aware shrinkers in memcg
> shrinkers map.
>
> Since all memcg-aware shrinkers are based on list_lru, which is per-memcg
> in case of !SLOB only, the new functionality will be under MEMCG && !SLOB
> ifdef (symlinked to CONFIG_MEMCG_SHRINKER).

Using MEMCG && !SLOB instead of introducing a new config option was done
deliberately, see:

http://lkml.kernel.org/r/[email protected]

I guess, this doesn't work well any more, as there are more and more
parts depending on kmem accounting, like shrinkers. If you really want
to introduce a new option, I think you should call it CONFIG_MEMCG_KMEM
and use it consistently throughout the code instead of MEMCG && !SLOB.
And this should be done in a separate patch.

> diff --git a/fs/super.c b/fs/super.c
> index 122c402049a2..16c153d2f4f1 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -248,6 +248,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> s->s_time_gran = 1000000000;
> s->cleancache_poolid = CLEANCACHE_NO_POOL;
>
> +#ifdef CONFIG_MEMCG_SHRINKER
> + s->s_shrink.id = -1;
> +#endif

No point doing that - you are going to overwrite the id anyway in
prealloc_shrinker().

> s->s_shrink.seeks = DEFAULT_SEEKS;
> s->s_shrink.scan_objects = super_cache_scan;
> s->s_shrink.count_objects = super_cache_count;

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 10c8a38c5eef..d691beac1048 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -169,6 +169,47 @@ unsigned long vm_total_pages;
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> +#ifdef CONFIG_MEMCG_SHRINKER
> +static DEFINE_IDR(shrinker_idr);
> +
> +static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> +{
> + int id, ret;
> +
> + down_write(&shrinker_rwsem);
> + ret = id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> + if (ret < 0)
> + goto unlock;
> + shrinker->id = id;
> + ret = 0;
> +unlock:
> + up_write(&shrinker_rwsem);
> + return ret;
> +}
> +
> +static void del_memcg_shrinker(struct shrinker *shrinker)

Nit: IMO unregister_memcg_shrinker() would be a better name as it
matches unregister_shrinker(), just like prealloc_memcg_shrinker()
matches prealloc_shrinker().

> +{
> + int id = shrinker->id;
> +

> + if (id < 0)
> + return;

Nit: I think this should be BUG_ON(id < 0) as this function is only
called for memcg-aware shrinkers AFAICS.

> +
> + down_write(&shrinker_rwsem);
> + idr_remove(&shrinker_idr, id);
> + up_write(&shrinker_rwsem);
> + shrinker->id = -1;
> +}

2018-05-13 16:48:06

by Vladimir Davydov

Subject: Re: [PATCH v5 03/13] mm: Assign memcg-aware shrinkers bitmap to memcg

On Thu, May 10, 2018 at 12:52:36PM +0300, Kirill Tkhai wrote:
> Imagine a big node with many cpus, memory cgroups and containers.
> Let we have 200 containers, every container has 10 mounts,
> and 10 cgroups. All container tasks don't touch foreign
> containers mounts. If there is intensive pages write,
> and global reclaim happens, a writing task has to iterate
> over all memcgs to shrink slab, before it's able to go
> to shrink_page_list().
>
> Iteration over all the memcg slabs is very expensive:
> the task has to visit 200 * 10 = 2000 shrinkers
> for every memcg, and since there are 2000 memcgs,
> the total calls are 2000 * 2000 = 4000000.
>
> So, the shrinker makes 4 million do_shrink_slab() calls
> just to try to isolate SWAP_CLUSTER_MAX pages in one
> of the actively writing memcg via shrink_page_list().
> I've observed a node spending almost 100% in kernel,
> making useless iteration over already shrinked slab.
>
> This patch adds bitmap of memcg-aware shrinkers to memcg.
> The size of the bitmap depends on bitmap_nr_ids, and during
> memcg life it's maintained to be enough to fit bitmap_nr_ids
> shrinkers. Every bit in the map is related to corresponding
> shrinker id.
>
> Next patches will maintain set bit only for really charged
> memcg. This will allow shrink_slab() to increase its
> performance in significant way. See the last patch for
> the numbers.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/memcontrol.h | 21 ++++++++
> mm/memcontrol.c | 116 ++++++++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 16 ++++++
> 3 files changed, 152 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6cbea2f25a87..e5e7e0fc7158 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -105,6 +105,17 @@ struct lruvec_stat {
> long count[NR_VM_NODE_STAT_ITEMS];
> };
>
> +#ifdef CONFIG_MEMCG_SHRINKER
> +/*
> + * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> + * which have elements charged to this memcg.
> + */
> +struct memcg_shrinker_map {
> + struct rcu_head rcu;
> + unsigned long map[0];
> +};
> +#endif /* CONFIG_MEMCG_SHRINKER */
> +

AFAIR we don't normally ifdef structure definitions.

> /*
> * per-zone information in memory controller.
> */
> @@ -118,6 +129,9 @@ struct mem_cgroup_per_node {
>
> struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
>
> +#ifdef CONFIG_MEMCG_SHRINKER
> + struct memcg_shrinker_map __rcu *shrinker_map;
> +#endif
> struct rb_node tree_node; /* RB tree node */
> unsigned long usage_in_excess;/* Set to the value by which */
> /* the soft limit is exceeded*/
> @@ -1255,4 +1269,11 @@ static inline void memcg_put_cache_ids(void)
>
> #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
>
> +#ifdef CONFIG_MEMCG_SHRINKER

> +#define MEMCG_SHRINKER_MAP(memcg, nid) (memcg->nodeinfo[nid]->shrinker_map)

I don't really like this helper macro. Accessing shrinker_map directly
looks cleaner IMO.

> +
> +extern int memcg_shrinker_nr_max;

As I've mentioned before, the capacity of shrinker map should be a
private business of memcontrol.c IMHO. We shouldn't use it in vmscan.c
as max shrinker id, instead we should introduce another variable for
this, private to vmscan.c.

> +extern int memcg_expand_shrinker_maps(int old_id, int id);

... Then this function would take just one argument, max id, and would
update shrinker_map capacity if necessary in memcontrol.c under the
corresponding mutex, which would look much more readable IMHO as all
shrinker_map related manipulations would be isolated in memcontrol.c.

> +#endif /* CONFIG_MEMCG_SHRINKER */
> +
> #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3df3efa7ff40..18e0fdf302a9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -322,6 +322,116 @@ struct workqueue_struct *memcg_kmem_cache_wq;
>
> #endif /* !CONFIG_SLOB */
>
> +#ifdef CONFIG_MEMCG_SHRINKER
> +int memcg_shrinker_nr_max;

memcg_shrinker_map_capacity, may be?

> +static DEFINE_MUTEX(shrinkers_nr_max_mutex);

memcg_shrinker_map_mutex?

> +
> +static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> +{
> + kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> +}
> +
> +static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> + int size, int old_size)

If you followed my advice and made the shrinker_map_capacity private to
memcontrol.c, you wouldn't need to pass old_size here either, just max
shrinker id.

> +{
> + struct memcg_shrinker_map *new, *old;
> + int nid;
> +
> + lockdep_assert_held(&shrinkers_nr_max_mutex);
> +
> + for_each_node(nid) {
> + old = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
> + /* Not yet online memcg */
> + if (old_size && !old)
> + return 0;
> +
> + new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
> + if (!new)
> + return -ENOMEM;
> +
> + /* Set all old bits, clear all new bits */
> + memset(new->map, (int)0xff, old_size);
> + memset((void *)new->map + old_size, 0, size - old_size);
> +
> + rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> + if (old)
> + call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
> + }
> +
> + return 0;
> +}
> +
> +static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
> +{
> + struct mem_cgroup_per_node *pn;
> + struct memcg_shrinker_map *map;
> + int nid;
> +
> + if (memcg == root_mem_cgroup)
> + return;

Nit: there's mem_cgroup_is_root() helper.

> +
> + mutex_lock(&shrinkers_nr_max_mutex);

Why do you need to take the mutex here? You don't access shrinker map
capacity here AFAICS.

> + for_each_node(nid) {
> + pn = mem_cgroup_nodeinfo(memcg, nid);
> + map = rcu_dereference_protected(pn->shrinker_map, true);
> + if (map)
> + call_rcu(&map->rcu, memcg_free_shrinker_map_rcu);
> + rcu_assign_pointer(pn->shrinker_map, NULL);
> + }
> + mutex_unlock(&shrinkers_nr_max_mutex);
> +}
> +
> +static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> +{
> + int ret, size = memcg_shrinker_nr_max/BITS_PER_BYTE;
> +
> + if (memcg == root_mem_cgroup)
> + return 0;

Nit: mem_cgroup_is_root().

> +
> + mutex_lock(&shrinkers_nr_max_mutex);

> + ret = memcg_expand_one_shrinker_map(memcg, size, 0);

I don't think it's worth reusing the function designed for reallocating
shrinker maps for initial allocation. Please just fold the code here -
it will make both 'alloc' and 'expand' easier to follow IMHO.

> + mutex_unlock(&shrinkers_nr_max_mutex);
> +
> + if (ret)
> + memcg_free_shrinker_maps(memcg);
> +
> + return ret;
> +}
> +

> +static struct idr mem_cgroup_idr;

Stray change.

> +
> +int memcg_expand_shrinker_maps(int old_nr, int nr)
> +{
> + int size, old_size, ret = 0;
> + struct mem_cgroup *memcg;
> +
> + old_size = old_nr / BITS_PER_BYTE;
> + size = nr / BITS_PER_BYTE;
> +
> + mutex_lock(&shrinkers_nr_max_mutex);
> +

> + if (!root_mem_cgroup)
> + goto unlock;

This wants a comment.

> +
> + for_each_mem_cgroup(memcg) {
> + if (memcg == root_mem_cgroup)

Nit: mem_cgroup_is_root().

> + continue;
> + ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
> + if (ret)
> + goto unlock;
> + }
> +unlock:
> + mutex_unlock(&shrinkers_nr_max_mutex);
> + return ret;
> +}
> +#else /* CONFIG_MEMCG_SHRINKER */
> +static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> +{
> + return 0;
> +}
> +static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) { }
> +#endif /* CONFIG_MEMCG_SHRINKER */
> +
> /**
> * mem_cgroup_css_from_page - css of the memcg associated with a page
> * @page: page of interest
> @@ -4471,6 +4581,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>
> + if (memcg_alloc_shrinker_maps(memcg)) {
> + mem_cgroup_id_remove(memcg);
> + return -ENOMEM;
> + }
> +
> /* Online state pins memcg ID, memcg ID pins CSS */
> atomic_set(&memcg->id.ref, 1);
> css_get(css);
> @@ -4522,6 +4637,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
> vmpressure_cleanup(&memcg->vmpressure);
> cancel_work_sync(&memcg->high_work);
> mem_cgroup_remove_from_trees(memcg);
> + memcg_free_shrinker_maps(memcg);
> memcg_free_kmem(memcg);
> mem_cgroup_free(memcg);
> }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d691beac1048..d8a2870710e0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -174,12 +174,26 @@ static DEFINE_IDR(shrinker_idr);
>
> static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> {
> - int id, ret;
> + int id, nr, ret;
>
> down_write(&shrinker_rwsem);
> ret = id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> if (ret < 0)
> goto unlock;
> +
> + if (id >= memcg_shrinker_nr_max) {
> + nr = memcg_shrinker_nr_max * 2;
> + if (nr == 0)
> + nr = BITS_PER_BYTE;
> + BUG_ON(id >= nr);

The logic defining shrinker map capacity growth should be private to
memcontrol.c IMHO.

> +
> + if (memcg_expand_shrinker_maps(memcg_shrinker_nr_max, nr)) {
> + idr_remove(&shrinker_idr, id);
> + goto unlock;
> + }
> + memcg_shrinker_nr_max = nr;
> + }
> +
> shrinker->id = id;
> ret = 0;
> unlock:
>
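The map expansion under discussion allocates a larger bitmap, conservatively sets every old bit, clears the new tail, and publishes the result. Ignoring RCU and the per-node loop, the grow step can be sketched as follows (sizes are in bytes, as in the patch; the free() stands in for call_rcu()):

```c
#include <stdlib.h>
#include <string.h>

struct shrinker_map { unsigned char *map; int size; };

/* Grow the bitmap from m->size to size bytes.  As in the patch, every
 * old bit is set rather than copied: a spuriously set bit only costs
 * one extra do_shrink_slab() call, so over-setting is safe, while a
 * spuriously clear bit could skip a shrinker with charged objects. */
static int expand_map(struct shrinker_map *m, int size)
{
	unsigned char *new = malloc(size);

	if (!new)
		return -1; /* -ENOMEM */
	memset(new, 0xff, m->size);               /* set all old bits */
	memset(new + m->size, 0, size - m->size); /* clear all new bits */
	free(m->map);                             /* call_rcu() in the kernel */
	m->map = new;
	m->size = size;
	return 0;
}
```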

2018-05-13 16:58:25

by Vladimir Davydov

Subject: Re: [PATCH v5 06/13] fs: Propagate shrinker::id to list_lru

On Thu, May 10, 2018 at 12:53:06PM +0300, Kirill Tkhai wrote:
> The patch adds list_lru::shrinker_id field, and populates
> it by registered shrinker id.
>
> This will be used to set correct bit in memcg shrinkers
> map by lru code in next patches, after there appeared
> the first related to memcg element in list_lru.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> fs/super.c | 4 ++++
> include/linux/list_lru.h | 3 +++
> mm/list_lru.c | 6 ++++++
> mm/workingset.c | 3 +++
> 4 files changed, 16 insertions(+)
>
> diff --git a/fs/super.c b/fs/super.c
> index 2ccacb78f91c..dfa85e725e45 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -258,6 +258,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> goto fail;
> if (list_lru_init_memcg(&s->s_inode_lru))
> goto fail;
> +#ifdef CONFIG_MEMCG_SHRINKER
> + s->s_dentry_lru.shrinker_id = s->s_shrink.id;
> + s->s_inode_lru.shrinker_id = s->s_shrink.id;
> +#endif

I don't like this. Can't you simply pass struct shrinker to
list_lru_init_memcg() and let it extract the id?

2018-05-14 09:05:40

by Kirill Tkhai

Subject: Re: [PATCH v5 01/13] mm: Assign id to every memcg-aware shrinker

On 13.05.2018 08:15, Vladimir Davydov wrote:
> On Thu, May 10, 2018 at 12:52:18PM +0300, Kirill Tkhai wrote:
>> The patch introduces shrinker::id number, which is used to enumerate
>> memcg-aware shrinkers. The number start from 0, and the code tries
>> to maintain it as small as possible.
>>
>> This will be used as to represent a memcg-aware shrinkers in memcg
>> shrinkers map.
>>
>> Since all memcg-aware shrinkers are based on list_lru, which is per-memcg
>> in case of !SLOB only, the new functionality will be under MEMCG && !SLOB
>> ifdef (symlinked to CONFIG_MEMCG_SHRINKER).
>
> Using MEMCG && !SLOB instead of introducing a new config option was done
> deliberately, see:
>
> http://lkml.kernel.org/r/[email protected]
>
> I guess, this doesn't work well any more, as there are more and more
> parts depending on kmem accounting, like shrinkers. If you really want
> to introduce a new option, I think you should call it CONFIG_MEMCG_KMEM
> and use it consistently throughout the code instead of MEMCG && !SLOB.
> And this should be done in a separate patch.

What do you mean by "consistently throughout the code"? Should I replace
all MEMCG && !SLOB occurrences with CONFIG_MEMCG_KMEM over the existing code?

>> diff --git a/fs/super.c b/fs/super.c
>> index 122c402049a2..16c153d2f4f1 100644
>> --- a/fs/super.c
>> +++ b/fs/super.c
>> @@ -248,6 +248,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
>> s->s_time_gran = 1000000000;
>> s->cleancache_poolid = CLEANCACHE_NO_POOL;
>>
>> +#ifdef CONFIG_MEMCG_SHRINKER
>> + s->s_shrink.id = -1;
>> +#endif
>
> No point doing that - you are going to overwrite the id anyway in
> prealloc_shrinker().

Not so, this is done deliberately. alloc_super() has a single "fail" label,
and it handles all the allocation errors there. The patch just behaves in
the same style. It sets "-1" to make destroy_unused_super() able to distinguish
the cases when the shrinker is really initialized and when it is not.
If you don't like this, I can move "s->s_shrink.id = -1;" into
prealloc_memcg_shrinker() instead.

>> s->s_shrink.seeks = DEFAULT_SEEKS;
>> s->s_shrink.scan_objects = super_cache_scan;
>> s->s_shrink.count_objects = super_cache_count;
>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 10c8a38c5eef..d691beac1048 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -169,6 +169,47 @@ unsigned long vm_total_pages;
>> static LIST_HEAD(shrinker_list);
>> static DECLARE_RWSEM(shrinker_rwsem);
>>
>> +#ifdef CONFIG_MEMCG_SHRINKER
>> +static DEFINE_IDR(shrinker_idr);
>> +
>> +static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>> +{
>> + int id, ret;
>> +
>> + down_write(&shrinker_rwsem);
>> + ret = id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
>> + if (ret < 0)
>> + goto unlock;
>> + shrinker->id = id;
>> + ret = 0;
>> +unlock:
>> + up_write(&shrinker_rwsem);
>> + return ret;
>> +}
>> +
>> +static void del_memcg_shrinker(struct shrinker *shrinker)
>
> Nit: IMO unregister_memcg_shrinker() would be a better name as it
> matches unregister_shrinker(), just like prealloc_memcg_shrinker()
> matches prealloc_shrinker().
>
>> +{
>> + int id = shrinker->id;
>> +
>
>> + if (id < 0)
>> + return;
>
> Nit: I think this should be BUG_ON(id < 0) as this function is only
> called for memcg-aware shrinkers AFAICS.

See comment to alloc_super().

>> +
>> + down_write(&shrinker_rwsem);
>> + idr_remove(&shrinker_idr, id);
>> + up_write(&shrinker_rwsem);
>> + shrinker->id = -1;
>> +}

Kirill

2018-05-14 09:35:26

by Kirill Tkhai

Subject: Re: [PATCH v5 03/13] mm: Assign memcg-aware shrinkers bitmap to memcg

On 13.05.2018 19:47, Vladimir Davydov wrote:
> On Thu, May 10, 2018 at 12:52:36PM +0300, Kirill Tkhai wrote:
>> Imagine a big node with many cpus, memory cgroups and containers.
>> Let we have 200 containers, every container has 10 mounts,
>> and 10 cgroups. All container tasks don't touch foreign
>> containers mounts. If there is intensive pages write,
>> and global reclaim happens, a writing task has to iterate
>> over all memcgs to shrink slab, before it's able to go
>> to shrink_page_list().
>>
>> Iteration over all the memcg slabs is very expensive:
>> the task has to visit 200 * 10 = 2000 shrinkers
>> for every memcg, and since there are 2000 memcgs,
>> the total calls are 2000 * 2000 = 4000000.
>>
>> So, the shrinker makes 4 million do_shrink_slab() calls
>> just to try to isolate SWAP_CLUSTER_MAX pages in one
>> of the actively writing memcg via shrink_page_list().
>> I've observed a node spending almost 100% in kernel,
>> making useless iteration over already shrinked slab.
>>
>> This patch adds bitmap of memcg-aware shrinkers to memcg.
>> The size of the bitmap depends on bitmap_nr_ids, and during
>> memcg life it's maintained to be enough to fit bitmap_nr_ids
>> shrinkers. Every bit in the map is related to corresponding
>> shrinker id.
>>
>> Next patches will maintain set bit only for really charged
>> memcg. This will allow shrink_slab() to increase its
>> performance in significant way. See the last patch for
>> the numbers.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/memcontrol.h | 21 ++++++++
>> mm/memcontrol.c | 116 ++++++++++++++++++++++++++++++++++++++++++++
>> mm/vmscan.c | 16 ++++++
>> 3 files changed, 152 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 6cbea2f25a87..e5e7e0fc7158 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -105,6 +105,17 @@ struct lruvec_stat {
>> long count[NR_VM_NODE_STAT_ITEMS];
>> };
>>
>> +#ifdef CONFIG_MEMCG_SHRINKER
>> +/*
>> + * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
>> + * which have elements charged to this memcg.
>> + */
>> +struct memcg_shrinker_map {
>> + struct rcu_head rcu;
>> + unsigned long map[0];
>> +};
>> +#endif /* CONFIG_MEMCG_SHRINKER */
>> +
>
> AFAIR we don't normally ifdef structure definitions.
>
>> /*
>> * per-zone information in memory controller.
>> */
>> @@ -118,6 +129,9 @@ struct mem_cgroup_per_node {
>>
>> struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
>>
>> +#ifdef CONFIG_MEMCG_SHRINKER
>> + struct memcg_shrinker_map __rcu *shrinker_map;
>> +#endif
>> struct rb_node tree_node; /* RB tree node */
>> unsigned long usage_in_excess;/* Set to the value by which */
>> /* the soft limit is exceeded*/
>> @@ -1255,4 +1269,11 @@ static inline void memcg_put_cache_ids(void)
>>
>> #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
>>
>> +#ifdef CONFIG_MEMCG_SHRINKER
>
>> +#define MEMCG_SHRINKER_MAP(memcg, nid) (memcg->nodeinfo[nid]->shrinker_map)
>
> I don't really like this helper macro. Accessing shrinker_map directly
> looks cleaner IMO.
>
>> +
>> +extern int memcg_shrinker_nr_max;
>
> As I've mentioned before, the capacity of shrinker map should be a
> private business of memcontrol.c IMHO. We shouldn't use it in vmscan.c
> as max shrinker id, instead we should introduce another variable for
> this, private to vmscan.c.
>
>> +extern int memcg_expand_shrinker_maps(int old_id, int id);
>
> ... Then this function would take just one argument, max id, and would
> update shrinker_map capacity if necessary in memcontrol.c under the
> corresponding mutex, which would look much more readable IMHO as all
> shrinker_map related manipulations would be isolated in memcontrol.c.
>
>> +#endif /* CONFIG_MEMCG_SHRINKER */
>> +
>> #endif /* _LINUX_MEMCONTROL_H */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 3df3efa7ff40..18e0fdf302a9 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -322,6 +322,116 @@ struct workqueue_struct *memcg_kmem_cache_wq;
>>
>> #endif /* !CONFIG_SLOB */
>>
>> +#ifdef CONFIG_MEMCG_SHRINKER
>> +int memcg_shrinker_nr_max;
>
> memcg_shrinker_map_capacity, may be?
>
>> +static DEFINE_MUTEX(shrinkers_nr_max_mutex);
>
> memcg_shrinker_map_mutex?
>
>> +
>> +static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
>> +{
>> + kvfree(container_of(head, struct memcg_shrinker_map, rcu));
>> +}
>> +
>> +static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
>> + int size, int old_size)
>
> If you followed my advice and made the shrinker_map_capacity private to
> memcontrol.c, you wouldn't need to pass old_size here either, just max
> shrinker id.
>
>> +{
>> + struct memcg_shrinker_map *new, *old;
>> + int nid;
>> +
>> + lockdep_assert_held(&shrinkers_nr_max_mutex);
>> +
>> + for_each_node(nid) {
>> + old = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
>> + /* Not yet online memcg */
>> + if (old_size && !old)
>> + return 0;
>> +
>> + new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
>> + if (!new)
>> + return -ENOMEM;
>> +
>> + /* Set all old bits, clear all new bits */
>> + memset(new->map, (int)0xff, old_size);
>> + memset((void *)new->map + old_size, 0, size - old_size);
>> +
>> + rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
>> + if (old)
>> + call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
>> +{
>> + struct mem_cgroup_per_node *pn;
>> + struct memcg_shrinker_map *map;
>> + int nid;
>> +
>> + if (memcg == root_mem_cgroup)
>> + return;
>
> Nit: there's mem_cgroup_is_root() helper.
>
>> +
>> + mutex_lock(&shrinkers_nr_max_mutex);
>
> Why do you need to take the mutex here? You don't access shrinker map
> capacity here AFAICS.

Allocation of the shrinker maps is done in css_online() now, and that comes at a price.
memcg_expand_one_shrinker_map() must be able to distinguish memory cgroups with
allocated maps, memory cgroups whose maps are not allocated yet, and memory cgroups
whose css_online() failed or is failing. So the mutex is used for synchronization with
expanding. See the "old_size && !old" check in memcg_expand_one_shrinker_map().
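As a userspace sketch (not kernel code: plain malloc/memset stands in for kvmalloc and the RCU pointer swap, and the helper name expand_map is made up), the expand semantics discussed above look like this. Note that instead of copying the old contents, the patch conservatively marks all previously covered shrinkers as set and zeroes only the new tail, and a NULL map with a non-zero old size models a memcg that is not online yet:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace sketch of memcg_expand_one_shrinker_map(): grow a shrinker
 * bitmap by "setting all old bits, clearing all new bits". Returns the
 * new map, or NULL if the memcg is not online yet (old_size && !old)
 * or the allocation fails -- in this simplified model both are "skip". */
static unsigned char *expand_map(unsigned char *old, int old_size, int size)
{
	unsigned char *new;

	/* Not yet online memcg: there is no map to expand. */
	if (old_size && !old)
		return NULL;

	new = malloc(size);
	if (!new)
		return NULL;

	/* Set all old bits, clear all new bits */
	memset(new, 0xff, old_size);
	memset(new + old_size, 0, size - old_size);

	free(old);	/* the kernel defers this via call_rcu() instead */
	return new;
}
```
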

>> + for_each_node(nid) {
>> + pn = mem_cgroup_nodeinfo(memcg, nid);
>> + map = rcu_dereference_protected(pn->shrinker_map, true);
>> + if (map)
>> + call_rcu(&map->rcu, memcg_free_shrinker_map_rcu);
>> + rcu_assign_pointer(pn->shrinker_map, NULL);
>> + }
>> + mutex_unlock(&shrinkers_nr_max_mutex);
>> +}
>> +
>> +static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>> +{
>> + int ret, size = memcg_shrinker_nr_max/BITS_PER_BYTE;
>> +
>> + if (memcg == root_mem_cgroup)
>> + return 0;
>
> Nit: mem_cgroup_is_root().
>
>> +
>> + mutex_lock(&shrinkers_nr_max_mutex);
>
>> + ret = memcg_expand_one_shrinker_map(memcg, size, 0);
>
> I don't think it's worth reusing the function designed for reallocating
> shrinker maps for initial allocation. Please just fold the code here -
> it will make both 'alloc' and 'expand' easier to follow IMHO.

These functions would share about 80% of their code. What is the reason to duplicate
the same functionality? Two functions are more difficult to maintain, and
everywhere in the kernel we try to avoid this IMHO.
>> + mutex_unlock(&shrinkers_nr_max_mutex);
>> +
>> + if (ret)
>> + memcg_free_shrinker_maps(memcg);
>> +
>> + return ret;
>> +}
>> +
>
>> +static struct idr mem_cgroup_idr;
>
> Stray change.
>
>> +
>> +int memcg_expand_shrinker_maps(int old_nr, int nr)
>> +{
>> + int size, old_size, ret = 0;
>> + struct mem_cgroup *memcg;
>> +
>> + old_size = old_nr / BITS_PER_BYTE;
>> + size = nr / BITS_PER_BYTE;
>> +
>> + mutex_lock(&shrinkers_nr_max_mutex);
>> +
>
>> + if (!root_mem_cgroup)
>> + goto unlock;
>
> This wants a comment.

Which comment does this want? "root_mem_cgroup is not initialized, so it does not have child mem cgroups"?

>> +
>> + for_each_mem_cgroup(memcg) {
>> + if (memcg == root_mem_cgroup)
>
> Nit: mem_cgroup_is_root().
>
>> + continue;
>> + ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
>> + if (ret)
>> + goto unlock;
>> + }
>> +unlock:
>> + mutex_unlock(&shrinkers_nr_max_mutex);
>> + return ret;
>> +}
>> +#else /* CONFIG_MEMCG_SHRINKER */
>> +static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>> +{
>> + return 0;
>> +}
>> +static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) { }
>> +#endif /* CONFIG_MEMCG_SHRINKER */
>> +
>> /**
>> * mem_cgroup_css_from_page - css of the memcg associated with a page
>> * @page: page of interest
>> @@ -4471,6 +4581,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>> {
>> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>>
>> + if (memcg_alloc_shrinker_maps(memcg)) {
>> + mem_cgroup_id_remove(memcg);
>> + return -ENOMEM;
>> + }
>> +
>> /* Online state pins memcg ID, memcg ID pins CSS */
>> atomic_set(&memcg->id.ref, 1);
>> css_get(css);
>> @@ -4522,6 +4637,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>> vmpressure_cleanup(&memcg->vmpressure);
>> cancel_work_sync(&memcg->high_work);
>> mem_cgroup_remove_from_trees(memcg);
>> + memcg_free_shrinker_maps(memcg);
>> memcg_free_kmem(memcg);
>> mem_cgroup_free(memcg);
>> }
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index d691beac1048..d8a2870710e0 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -174,12 +174,26 @@ static DEFINE_IDR(shrinker_idr);
>>
>> static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>> {
>> - int id, ret;
>> + int id, nr, ret;
>>
>> down_write(&shrinker_rwsem);
>> ret = id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
>> if (ret < 0)
>> goto unlock;
>> +
>> + if (id >= memcg_shrinker_nr_max) {
>> + nr = memcg_shrinker_nr_max * 2;
>> + if (nr == 0)
>> + nr = BITS_PER_BYTE;
>> + BUG_ON(id >= nr);
>
> The logic defining shrinker map capacity growth should be private to
> memcontrol.c IMHO.
>
>> +
>> + if (memcg_expand_shrinker_maps(memcg_shrinker_nr_max, nr)) {
>> + idr_remove(&shrinker_idr, id);
>> + goto unlock;
>> + }
>> + memcg_shrinker_nr_max = nr;
>> + }
>> +
>> shrinker->id = id;
>> ret = 0;
>> unlock:
>>
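The capacity growth in prealloc_memcg_shrinker() above (double the map size, starting from BITS_PER_BYTE ids) can be sketched in userspace as follows; grow_capacity is a made-up name for illustration only:

```c
#include <assert.h>

#define BITS_PER_BYTE 8

/* Userspace sketch of the shrinker-map capacity growth: the map covers
 * BITS_PER_BYTE ids to start with, and doubles whenever a freshly
 * allocated id does not fit. Since ids are allocated one at a time,
 * a single doubling is always enough (the kernel BUG_ONs otherwise). */
static int grow_capacity(int cur, int id)
{
	int nr = cur;

	if (id >= nr) {
		nr = nr * 2;
		if (nr == 0)
			nr = BITS_PER_BYTE;
	}
	return nr;
}
```
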

2018-05-15 03:30:02

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: Assign id to every memcg-aware shrinker

On Mon, May 14, 2018 at 12:03:38PM +0300, Kirill Tkhai wrote:
> On 13.05.2018 08:15, Vladimir Davydov wrote:
> > On Thu, May 10, 2018 at 12:52:18PM +0300, Kirill Tkhai wrote:
> >> The patch introduces shrinker::id number, which is used to enumerate
> >> memcg-aware shrinkers. The numbers start from 0, and the code tries
> >> to keep them as small as possible.
> >>
> >> This will be used to represent memcg-aware shrinkers in the memcg
> >> shrinker map.
> >>
> >> Since all memcg-aware shrinkers are based on list_lru, which is per-memcg
> >> in case of !SLOB only, the new functionality will be under MEMCG && !SLOB
> >> ifdef (symlinked to CONFIG_MEMCG_SHRINKER).
> >
> > Using MEMCG && !SLOB instead of introducing a new config option was done
> > deliberately, see:
> >
> > http://lkml.kernel.org/r/[email protected]
> >
> > I guess, this doesn't work well any more, as there are more and more
> > parts depending on kmem accounting, like shrinkers. If you really want
> > to introduce a new option, I think you should call it CONFIG_MEMCG_KMEM
> > and use it consistently throughout the code instead of MEMCG && !SLOB.
> > And this should be done in a separate patch.
>
> What do you mean under "consistently throughout the code"? Should I replace
> all MEMCG && !SLOB with CONFIG_MEMCG_KMEM over existing code?

Yes, otherwise it looks messy - in some places we check !SLOB, in others
we use CONFIG_MEMCG_SHRINKER (or whatever it will be called).

>
> >> diff --git a/fs/super.c b/fs/super.c
> >> index 122c402049a2..16c153d2f4f1 100644
> >> --- a/fs/super.c
> >> +++ b/fs/super.c
> >> @@ -248,6 +248,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> >> s->s_time_gran = 1000000000;
> >> s->cleancache_poolid = CLEANCACHE_NO_POOL;
> >>
> >> +#ifdef CONFIG_MEMCG_SHRINKER
> >> + s->s_shrink.id = -1;
> >> +#endif
> >
> > No point doing that - you are going to overwrite the id anyway in
> > prealloc_shrinker().
>
> Not so, this is done deliberately. alloc_super() has a single "fail" label,
> and it handles all the allocation errors there. The patch just behaves in
> the same style. It sets "-1" to make destroy_unused_super() able to distinguish
> the cases when the shrinker is really initialized and when it is not.
> If you don't like this, I can move "s->s_shrink.id = -1;" into
> prealloc_memcg_shrinker() instead.

Yes, please do so that we don't have MEMCG ifdefs in fs code.

Thanks.

2018-05-15 03:54:58

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v5 03/13] mm: Assign memcg-aware shrinkers bitmap to memcg

On Mon, May 14, 2018 at 12:34:45PM +0300, Kirill Tkhai wrote:
> >> +static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
> >> +{
> >> + struct mem_cgroup_per_node *pn;
> >> + struct memcg_shrinker_map *map;
> >> + int nid;
> >> +
> >> + if (memcg == root_mem_cgroup)
> >> + return;
> >> +
> >> + mutex_lock(&shrinkers_nr_max_mutex);
> >
> > Why do you need to take the mutex here? You don't access shrinker map
> > capacity here AFAICS.
>
> Allocation of the shrinker maps is done in css_online() now, and that comes at a price.
> memcg_expand_one_shrinker_map() must be able to distinguish memory cgroups with
> allocated maps, memory cgroups whose maps are not allocated yet, and memory cgroups
> whose css_online() failed or is failing. So the mutex is used for synchronization with
> expanding. See the "old_size && !old" check in memcg_expand_one_shrinker_map().

Another reason to have 'expand' and 'alloc' paths separated - you
wouldn't need to take the mutex here as 'free' wouldn't be used for
undoing initial allocation, instead 'alloc' would cleanup by itself
while still holding the mutex.

>
> >> + for_each_node(nid) {
> >> + pn = mem_cgroup_nodeinfo(memcg, nid);
> >> + map = rcu_dereference_protected(pn->shrinker_map, true);
> >> + if (map)
> >> + call_rcu(&map->rcu, memcg_free_shrinker_map_rcu);
> >> + rcu_assign_pointer(pn->shrinker_map, NULL);
> >> + }
> >> + mutex_unlock(&shrinkers_nr_max_mutex);
> >> +}
> >> +
> >> +static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >> +{
> >> + int ret, size = memcg_shrinker_nr_max/BITS_PER_BYTE;
> >> +
> >> + if (memcg == root_mem_cgroup)
> >> + return 0;
> >> +
> >> + mutex_lock(&shrinkers_nr_max_mutex);
> >> + ret = memcg_expand_one_shrinker_map(memcg, size, 0);
> >
> > I don't think it's worth reusing the function designed for reallocating
> > shrinker maps for initial allocation. Please just fold the code here -
> > it will make both 'alloc' and 'expand' easier to follow IMHO.
>
> These functions would share about 80% of their code. What is the reason to duplicate
> the same functionality? Two functions are more difficult to maintain, and
> everywhere in the kernel we try to avoid this IMHO.

IMHO two functions with clear semantics are easier to maintain than
a function that does one of two things depending on some condition.
Separating 'alloc' from 'expand' would only add 10-15 SLOC.

> >> + mutex_unlock(&shrinkers_nr_max_mutex);
> >> +
> >> + if (ret)
> >> + memcg_free_shrinker_maps(memcg);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static struct idr mem_cgroup_idr;
> >> +
> >> +int memcg_expand_shrinker_maps(int old_nr, int nr)
> >> +{
> >> + int size, old_size, ret = 0;
> >> + struct mem_cgroup *memcg;
> >> +
> >> + old_size = old_nr / BITS_PER_BYTE;
> >> + size = nr / BITS_PER_BYTE;
> >> +
> >> + mutex_lock(&shrinkers_nr_max_mutex);
> >> +
> >> + if (!root_mem_cgroup)
> >> + goto unlock;
> >
> > This wants a comment.
>
> Which comment does this want? "root_mem_cgroup is not initialized, so
> it does not have child mem cgroups"?

Looking at this code again, I find it pretty self-explaining, sorry.

Thanks.

2018-05-15 04:09:06

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v5 10/13] mm: Set bit in memcg shrinker bitmap on first list_lru item apearance

On Thu, May 10, 2018 at 12:53:45PM +0300, Kirill Tkhai wrote:
> Introduce set_shrinker_bit() function to set shrinker-related
> bit in memcg shrinker bitmap, and set the bit after the first
> item is added and in case of reparenting destroyed memcg's items.
>
> This will allow the next patch to call shrinkers only when they
> have charged objects at the moment, and to improve shrink_slab()
> performance.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/memcontrol.h | 15 +++++++++++++++
> mm/list_lru.c | 22 ++++++++++++++++++++--
> 2 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e5e7e0fc7158..82f892e77637 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1274,6 +1274,21 @@ static inline void memcg_put_cache_ids(void)
>
> extern int memcg_shrinker_nr_max;
> extern int memcg_expand_shrinker_maps(int old_id, int id);
> +
> +static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)

Nit: too long line (> 80 characters)
Nit: let's rename 'nr' to 'shrinker_id'

> +{
> + if (nr >= 0 && memcg && memcg != root_mem_cgroup) {
> + struct memcg_shrinker_map *map;
> +
> + rcu_read_lock();
> + map = MEMCG_SHRINKER_MAP(memcg, nid);

Missing rcu_dereference.

> + set_bit(nr, map->map);
> + rcu_read_unlock();
> + }
> +}
> +#else
> +static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
> + int node, int id) { }

Nit: please keep the signature (including argument names) the same as in
MEMCG-enabled definition, namely 'node' => 'nid', 'id' => 'shrinker_id'.

Thanks.

2018-05-15 05:45:20

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On Thu, May 10, 2018 at 12:53:55PM +0300, Kirill Tkhai wrote:
> Using the preparations made in previous patches, in case of memcg
> shrink, we may skip shrinkers which are not set in the memcg's shrinker
> bitmap. To do that, we separate iterations over memcg-aware and
> !memcg-aware shrinkers, and memcg-aware shrinkers are chosen
> via for_each_set_bit() from the bitmap. In case of big nodes,
> having many isolated environments, this gives significant
> performance growth. See next patches for the details.
>
> Note that the patch does not take empty memcg shrinkers into account,
> since we never clear the bitmap bits after we set them once.
> Their shrinkers will be called again, with no shrunk objects
> as a result. This functionality is provided by the next patches.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/memcontrol.h | 1 +
> mm/vmscan.c | 70 ++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 62 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 82f892e77637..436691a66500 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -760,6 +760,7 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> #define MEM_CGROUP_ID_MAX 0
>
> struct mem_cgroup;
> +#define root_mem_cgroup NULL

Let's instead export mem_cgroup_is_root(). In case if MEMCG is disabled
it will always return false.

>
> static inline bool mem_cgroup_disabled(void)
> {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d8a2870710e0..a2e38e05adb5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -376,6 +376,7 @@ int prealloc_shrinker(struct shrinker *shrinker)
> goto free_deferred;
> }
>
> + INIT_LIST_HEAD(&shrinker->list);

IMO this shouldn't be here, see my comment below.

> return 0;
>
> free_deferred:
> @@ -547,6 +548,63 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> return freed;
> }
>
> +#ifdef CONFIG_MEMCG_SHRINKER
> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> + struct mem_cgroup *memcg, int priority)
> +{
> + struct memcg_shrinker_map *map;
> + unsigned long freed = 0;
> + int ret, i;
> +
> + if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
> + return 0;
> +
> + if (!down_read_trylock(&shrinker_rwsem))
> + return 0;
> +
> + /*
> + * 1)Caller passes only alive memcg, so map can't be NULL.
> + * 2)shrinker_rwsem protects from maps expanding.

^^
Nit: space missing here :-)

> + */
> + map = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
> + BUG_ON(!map);
> +
> + for_each_set_bit(i, map->map, memcg_shrinker_nr_max) {
> + struct shrink_control sc = {
> + .gfp_mask = gfp_mask,
> + .nid = nid,
> + .memcg = memcg,
> + };
> + struct shrinker *shrinker;
> +
> + shrinker = idr_find(&shrinker_idr, i);
> + if (!shrinker) {
> + clear_bit(i, map->map);
> + continue;
> + }

The shrinker must be memcg aware so please add

BUG_ON((shrinker->flags & SHRINKER_MEMCG_AWARE) == 0);

> + if (list_empty(&shrinker->list))
> + continue;

I don't like using shrinker->list as an indicator that the shrinker has
been initialized. IMO if you do need such a check, you should split
shrinker_idr registration in two steps - allocate a slot in 'prealloc'
and set the pointer in 'register'. However, can we really encounter an
unregistered shrinker here? AFAIU a bit can be set in the shrinker map
only after the corresponding shrinker has been initialized, no?

> +
> + ret = do_shrink_slab(&sc, shrinker, priority);
> + freed += ret;
> +
> + if (rwsem_is_contended(&shrinker_rwsem)) {
> + freed = freed ? : 1;
> + break;
> + }
> + }
> +
> + up_read(&shrinker_rwsem);
> + return freed;
> +}
> +#else /* CONFIG_MEMCG_SHRINKER */
> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> + struct mem_cgroup *memcg, int priority)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_MEMCG_SHRINKER */
> +
> /**
> * shrink_slab - shrink slab caches
> * @gfp_mask: allocation context
> @@ -576,8 +634,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> struct shrinker *shrinker;
> unsigned long freed = 0;
>
> - if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
> - return 0;
> + if (memcg && memcg != root_mem_cgroup)

if (!mem_cgroup_is_root(memcg))

> + return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
> if (!down_read_trylock(&shrinker_rwsem))
> goto out;
> @@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> .memcg = memcg,
> };
>
> - /*
> - * If kernel memory accounting is disabled, we ignore
> - * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> - * passing NULL for memcg.
> - */
> - if (memcg_kmem_enabled() &&
> - !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> + if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> continue;

I want this check gone. It's easy to achieve, actually - just remove the
following lines from shrink_node()

if (global_reclaim(sc))
shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
sc->priority);

>
> if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>
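The bitmap-driven loop of shrink_slab_memcg() quoted above can be modelled in userspace roughly as follows. walk_map and the shrinkers[] table are invented stand-ins for the kernel's idr_find() lookup; locking, RCU, and the shrink_control setup are omitted:

```c
#include <assert.h>

/* A shrinker is modelled as a callback returning the number of freed
 * objects; a NULL slot models a shrinker that was unregistered. */
typedef long (*shrink_fn)(void);

/* Sample shrinker used for demonstration: always frees 10 objects. */
static long count_ten(void)
{
	return 10;
}

/* Userspace sketch of shrink_slab_memcg(): walk only the set bits of a
 * per-memcg bitmap, invoke the shrinker registered under each id, and
 * clear bits whose shrinker has gone away. */
static long walk_map(unsigned long long *map, shrink_fn *shrinkers, int nbits)
{
	long freed = 0;

	for (int i = 0; i < nbits; i++) {
		if (!(*map & (1ULL << i)))
			continue;
		if (!shrinkers[i]) {
			/* Shrinker was unregistered: drop the stale bit. */
			*map &= ~(1ULL << i);
			continue;
		}
		freed += shrinkers[i]();
	}
	return freed;
}
```

The point of the sketch is that the cost is proportional to the number of set bits, not to the total number of registered shrinkers.
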

2018-05-15 06:00:28

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v5 13/13] mm: Clear shrinker bit if there are no objects related to memcg

On Thu, May 10, 2018 at 12:54:15PM +0300, Kirill Tkhai wrote:
> To avoid further unneeded calls of do_shrink_slab()
> for shrinkers, which already do not have any charged
> objects in a memcg, their bits have to be cleared.
>
> This patch introduces a lockless mechanism to do that
> without races with parallel list_lru add. After
> do_shrink_slab() returns SHRINK_EMPTY the first time,
> we clear the bit and call it once again. Then we restore
> the bit, if the new return value is different.
>
> Note, that single smp_mb__after_atomic() in shrink_slab_memcg()
> covers two situations:
>
> 1)list_lru_add() shrink_slab_memcg
> list_add_tail() for_each_set_bit() <--- read bit
> do_shrink_slab() <--- missed list update (no barrier)
> <MB> <MB>
> set_bit() do_shrink_slab() <--- seen list update
>
> This situation, when the first do_shrink_slab() sees the set bit,
> but does not see the list update (i.e., races with the first element
> queueing), is rare. So we do not add <MB> before the first call
> of do_shrink_slab(), so as not to slow down the generic case.
> Also, the second call is needed, as shown below in (2).
>
> 2)list_lru_add() shrink_slab_memcg()
> list_add_tail() ...
> set_bit() ...
> ... for_each_set_bit()
> do_shrink_slab() do_shrink_slab()
> clear_bit() ...
> ... ...
> list_lru_add() ...
> list_add_tail() clear_bit()
> <MB> <MB>
> set_bit() do_shrink_slab()
>
> The barriers guarantee that the second do_shrink_slab()
> in the right-hand task sees the list update if it really
> cleared the bit. This case is drawn in the code comment.
>
> [Results/performance of the patchset]
>
> With the whole patchset applied, the test below shows a significant
> increase in performance:
>
> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
> $mkdir /sys/fs/cgroup/memory/ct
> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
> $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done
>
> Then, 5 sequential calls of drop caches:
> $time echo 3 > /proc/sys/vm/drop_caches
>
> 1)Before:
> 0.00user 13.78system 0:13.78elapsed 99%CPU
> 0.00user 5.59system 0:05.60elapsed 99%CPU
> 0.00user 5.48system 0:05.48elapsed 99%CPU
> 0.00user 8.35system 0:08.35elapsed 99%CPU
> 0.00user 8.34system 0:08.35elapsed 99%CPU
>
> 2)After
> 0.00user 1.10system 0:01.10elapsed 99%CPU
> 0.00user 0.00system 0:00.01elapsed 64%CPU
> 0.00user 0.01system 0:00.01elapsed 82%CPU
> 0.00user 0.00system 0:00.01elapsed 64%CPU
> 0.00user 0.01system 0:00.01elapsed 82%CPU
>
> The results show that performance increases by at least 548 times.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/memcontrol.h | 2 ++
> mm/vmscan.c | 19 +++++++++++++++++--
> 2 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 436691a66500..82c0bf2d0579 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1283,6 +1283,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int
>
> rcu_read_lock();
> map = MEMCG_SHRINKER_MAP(memcg, nid);
> + /* Pairs with smp mb in shrink_slab() */
> + smp_mb__before_atomic();
> set_bit(nr, map->map);
> rcu_read_unlock();
> }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b0075612d73..189b163bef4a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -586,8 +586,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> continue;
>
> ret = do_shrink_slab(&sc, shrinker, priority);
> - if (ret == SHRINK_EMPTY)
> - ret = 0;
> + if (ret == SHRINK_EMPTY) {
> + clear_bit(i, map->map);
> + /*
> + * Pairs with mb in memcg_set_shrinker_bit():
> + *
> + * list_lru_add() shrink_slab_memcg()
> + * list_add_tail() clear_bit()
> + * <MB> <MB>
> + * set_bit() do_shrink_slab()
> + */

Please improve the comment so that it isn't just a diagram.

> + smp_mb__after_atomic();
> + ret = do_shrink_slab(&sc, shrinker, priority);
> + if (ret == SHRINK_EMPTY)
> + ret = 0;
> + else
> + memcg_set_shrinker_bit(memcg, nid, i);
> + }
> freed += ret;
>
> if (rwsem_is_contended(&shrinker_rwsem)) {
>
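The clear-bit/recheck protocol of the patch above can be sketched in userspace like this. shrink_one and count_objs are invented names, SHRINK_EMPTY's value is assumed, and the memory barriers are omitted, so this only models the control flow: clear the bit on SHRINK_EMPTY, retry once, and restore the bit if a racing add made objects appear:

```c
#include <assert.h>

#define SHRINK_EMPTY (-2L)

typedef long (*shrink_fn)(void *priv);

/* Sample shrinker: reports its object count, or SHRINK_EMPTY when
 * there is nothing to reclaim. */
static long count_objs(void *priv)
{
	long *objs = priv;

	return *objs ? *objs : SHRINK_EMPTY;
}

/* Userspace sketch of the SHRINK_EMPTY handling in shrink_slab_memcg():
 * on the first SHRINK_EMPTY the shrinker's bit is cleared and the
 * shrinker is called once more; if the retry finds objects (a racing
 * list_lru_add() happened in between), the bit is restored. */
static long shrink_one(unsigned long long *map, int id,
		       shrink_fn fn, void *priv)
{
	long ret = fn(priv);

	if (ret == SHRINK_EMPTY) {
		*map &= ~(1ULL << id);		/* clear_bit() */
		/* In the kernel an smp_mb__after_atomic() sits here,
		 * pairing with the barrier in memcg_set_shrinker_bit(). */
		ret = fn(priv);
		if (ret == SHRINK_EMPTY)
			ret = 0;
		else
			*map |= 1ULL << id;	/* restore the bit */
	}
	return ret;
}
```
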

2018-05-15 08:56:50

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v5 13/13] mm: Clear shrinker bit if there are no objects related to memcg

On 15.05.2018 08:59, Vladimir Davydov wrote:
> On Thu, May 10, 2018 at 12:54:15PM +0300, Kirill Tkhai wrote:
>> To avoid further unneeded calls of do_shrink_slab()
>> for shrinkers, which already do not have any charged
>> objects in a memcg, their bits have to be cleared.
>>
>> This patch introduces a lockless mechanism to do that
>> without races with parallel list_lru add. After
>> do_shrink_slab() returns SHRINK_EMPTY the first time,
>> we clear the bit and call it once again. Then we restore
>> the bit, if the new return value is different.
>>
>> Note, that single smp_mb__after_atomic() in shrink_slab_memcg()
>> covers two situations:
>>
>> 1)list_lru_add() shrink_slab_memcg
>> list_add_tail() for_each_set_bit() <--- read bit
>> do_shrink_slab() <--- missed list update (no barrier)
>> <MB> <MB>
>> set_bit() do_shrink_slab() <--- seen list update
>>
>> This situation, when the first do_shrink_slab() sees the set bit,
>> but does not see the list update (i.e., races with the first element
>> queueing), is rare. So we do not add <MB> before the first call
>> of do_shrink_slab(), so as not to slow down the generic case.
>> Also, the second call is needed, as shown below in (2).
>>
>> 2)list_lru_add() shrink_slab_memcg()
>> list_add_tail() ...
>> set_bit() ...
>> ... for_each_set_bit()
>> do_shrink_slab() do_shrink_slab()
>> clear_bit() ...
>> ... ...
>> list_lru_add() ...
>> list_add_tail() clear_bit()
>> <MB> <MB>
>> set_bit() do_shrink_slab()
>>
>> The barriers guarantee that the second do_shrink_slab()
>> in the right-hand task sees the list update if it really
>> cleared the bit. This case is drawn in the code comment.
>>
>> [Results/performance of the patchset]
>>
>> With the whole patchset applied, the test below shows a significant
>> increase in performance:
>>
>> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
>> $mkdir /sys/fs/cgroup/memory/ct
>> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
>> $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done
>>
>> Then, 5 sequential calls of drop caches:
>> $time echo 3 > /proc/sys/vm/drop_caches
>>
>> 1)Before:
>> 0.00user 13.78system 0:13.78elapsed 99%CPU
>> 0.00user 5.59system 0:05.60elapsed 99%CPU
>> 0.00user 5.48system 0:05.48elapsed 99%CPU
>> 0.00user 8.35system 0:08.35elapsed 99%CPU
>> 0.00user 8.34system 0:08.35elapsed 99%CPU
>>
>> 2)After
>> 0.00user 1.10system 0:01.10elapsed 99%CPU
>> 0.00user 0.00system 0:00.01elapsed 64%CPU
>> 0.00user 0.01system 0:00.01elapsed 82%CPU
>> 0.00user 0.00system 0:00.01elapsed 64%CPU
>> 0.00user 0.01system 0:00.01elapsed 82%CPU
>>
>> The results show that performance increases by at least 548 times.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/memcontrol.h | 2 ++
>> mm/vmscan.c | 19 +++++++++++++++++--
>> 2 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 436691a66500..82c0bf2d0579 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -1283,6 +1283,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int
>>
>> rcu_read_lock();
>> map = MEMCG_SHRINKER_MAP(memcg, nid);
>> + /* Pairs with smp mb in shrink_slab() */
>> + smp_mb__before_atomic();
>> set_bit(nr, map->map);
>> rcu_read_unlock();
>> }
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 7b0075612d73..189b163bef4a 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -586,8 +586,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>> continue;
>>
>> ret = do_shrink_slab(&sc, shrinker, priority);
>> - if (ret == SHRINK_EMPTY)
>> - ret = 0;
>> + if (ret == SHRINK_EMPTY) {
>> + clear_bit(i, map->map);
>> + /*
>> + * Pairs with mb in memcg_set_shrinker_bit():
>> + *
>> + * list_lru_add() shrink_slab_memcg()
>> + * list_add_tail() clear_bit()
>> + * <MB> <MB>
>> + * set_bit() do_shrink_slab()
>> + */
>
> Please improve the comment so that it isn't just a diagram.

Please say which comment you want to see here.

>> + smp_mb__after_atomic();
>> + ret = do_shrink_slab(&sc, shrinker, priority);
>> + if (ret == SHRINK_EMPTY)
>> + ret = 0;
>> + else
>> + memcg_set_shrinker_bit(memcg, nid, i);
>> + }
>> freed += ret;
>>
>> if (rwsem_is_contended(&shrinker_rwsem)) {
>>

2018-05-15 10:13:08

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On 15.05.2018 08:44, Vladimir Davydov wrote:
> On Thu, May 10, 2018 at 12:53:55PM +0300, Kirill Tkhai wrote:
>> Using the preparations made in previous patches, in case of memcg
>> shrink, we may skip shrinkers which are not set in the memcg's shrinker
>> bitmap. To do that, we separate iterations over memcg-aware and
>> !memcg-aware shrinkers, and memcg-aware shrinkers are chosen
>> via for_each_set_bit() from the bitmap. In case of big nodes,
>> having many isolated environments, this gives significant
>> performance growth. See next patches for the details.
>>
>> Note that the patch does not take empty memcg shrinkers into account,
>> since we never clear the bitmap bits after we set them once.
>> Their shrinkers will be called again, with no shrunk objects
>> as a result. This functionality is provided by the next patches.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/memcontrol.h | 1 +
>> mm/vmscan.c | 70 ++++++++++++++++++++++++++++++++++++++------
>> 2 files changed, 62 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 82f892e77637..436691a66500 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -760,6 +760,7 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>> #define MEM_CGROUP_ID_MAX 0
>>
>> struct mem_cgroup;
>> +#define root_mem_cgroup NULL
>
> Let's instead export mem_cgroup_is_root(). In case if MEMCG is disabled
> it will always return false.

export == move to header file

>>
>> static inline bool mem_cgroup_disabled(void)
>> {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index d8a2870710e0..a2e38e05adb5 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -376,6 +376,7 @@ int prealloc_shrinker(struct shrinker *shrinker)
>> goto free_deferred;
>> }
>>
>> + INIT_LIST_HEAD(&shrinker->list);
>
> IMO this shouldn't be here, see my comment below.
>
>> return 0;
>>
>> free_deferred:
>> @@ -547,6 +548,63 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>> return freed;
>> }
>>
>> +#ifdef CONFIG_MEMCG_SHRINKER
>> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>> + struct mem_cgroup *memcg, int priority)
>> +{
>> + struct memcg_shrinker_map *map;
>> + unsigned long freed = 0;
>> + int ret, i;
>> +
>> + if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
>> + return 0;
>> +
>> + if (!down_read_trylock(&shrinker_rwsem))
>> + return 0;
>> +
>> + /*
>> + * 1)Caller passes only alive memcg, so map can't be NULL.
>> + * 2)shrinker_rwsem protects from maps expanding.
>
> ^^
> Nit: space missing here :-)

I don't understand what you mean here. Please clarify...

>> + */
>> + map = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
>> + BUG_ON(!map);
>> +
>> + for_each_set_bit(i, map->map, memcg_shrinker_nr_max) {
>> + struct shrink_control sc = {
>> + .gfp_mask = gfp_mask,
>> + .nid = nid,
>> + .memcg = memcg,
>> + };
>> + struct shrinker *shrinker;
>> +
>> + shrinker = idr_find(&shrinker_idr, i);
>> + if (!shrinker) {
>> + clear_bit(i, map->map);
>> + continue;
>> + }
>
> The shrinker must be memcg aware so please add
>
> BUG_ON((shrinker->flags & SHRINKER_MEMCG_AWARE) == 0);
>
>> + if (list_empty(&shrinker->list))
>> + continue;
>
> I don't like using shrinker->list as an indicator that the shrinker has
> been initialized. IMO if you do need such a check, you should split
> shrinker_idr registration in two steps - allocate a slot in 'prealloc'
> and set the pointer in 'register'. However, can we really encounter an
> unregistered shrinker here? AFAIU a bit can be set in the shrinker map
> only after the corresponding shrinker has been initialized, no?

1) No, it's not so. Here is a race:

cpu#0                         cpu#1                         cpu#2
prealloc_shrinker()
                              prealloc_shrinker()
                                memcg_expand_shrinker_maps()
                                  memcg_expand_one_shrinker_map()
                                    memset(&new->map, 0xff);
                                                            do_shrink_slab() (on uninitialized LRUs)
init LRUs
register_shrinker_prepared()

So, the check is needed.

2) A NULL pointer can't be assigned here, since NULL is already used to detect
unregistered shrinkers and clear their bits from the map. See the check right after idr_find().

list_empty() is used since it is an already existing indicator, which does not
require an additional member in struct shrinker.

>> +
>> + ret = do_shrink_slab(&sc, shrinker, priority);
>> + freed += ret;
>> +
>> + if (rwsem_is_contended(&shrinker_rwsem)) {
>> + freed = freed ? : 1;
>> + break;
>> + }
>> + }
>> +
>> + up_read(&shrinker_rwsem);
>> + return freed;
>> +}
>> +#else /* CONFIG_MEMCG_SHRINKER */
>> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>> + struct mem_cgroup *memcg, int priority)
>> +{
>> + return 0;
>> +}
>> +#endif /* CONFIG_MEMCG_SHRINKER */
>> +
>> /**
>> * shrink_slab - shrink slab caches
>> * @gfp_mask: allocation context
>> @@ -576,8 +634,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>> struct shrinker *shrinker;
>> unsigned long freed = 0;
>>
>> - if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
>> - return 0;
>> + if (memcg && memcg != root_mem_cgroup)
>
> if (!mem_cgroup_is_root(memcg))
>
>> + return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>
>> if (!down_read_trylock(&shrinker_rwsem))
>> goto out;
>> @@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>> .memcg = memcg,
>> };
>>
>> - /*
>> - * If kernel memory accounting is disabled, we ignore
>> - * SHRINKER_MEMCG_AWARE flag and call all shrinkers
>> - * passing NULL for memcg.
>> - */
>> - if (memcg_kmem_enabled() &&
>> - !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>> + if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>> continue;
>
> I want this check gone. It's easy to achieve, actually - just remove the
> following lines from shrink_node()
>
> if (global_reclaim(sc))
> shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
> sc->priority);
>
>>
>> if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>>

2018-05-15 14:52:35

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On 15.05.2018 08:44, Vladimir Davydov wrote:
> On Thu, May 10, 2018 at 12:53:55PM +0300, Kirill Tkhai wrote:
>> @@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>> .memcg = memcg,
>> };
>>
>> - /*
>> - * If kernel memory accounting is disabled, we ignore
>> - * SHRINKER_MEMCG_AWARE flag and call all shrinkers
>> - * passing NULL for memcg.
>> - */
>> - if (memcg_kmem_enabled() &&
>> - !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>> + if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>> continue;
>
> I want this check gone. It's easy to achieve, actually - just remove the
> following lines from shrink_node()
>
> if (global_reclaim(sc))
> shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
> sc->priority);

This check is not related to the patchset. Let's not mix everything into
a single series of patches; after your last remarks it will grow to at
least 15 patches. This patchset can't be responsible for everything.

>>
>> if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>>

2018-05-17 04:20:13

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On Tue, May 15, 2018 at 05:49:59PM +0300, Kirill Tkhai wrote:
> >> @@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >> .memcg = memcg,
> >> };
> >>
> >> - /*
> >> - * If kernel memory accounting is disabled, we ignore
> >> - * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> >> - * passing NULL for memcg.
> >> - */
> >> - if (memcg_kmem_enabled() &&
> >> - !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> >> + if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> >> continue;
> >
> > I want this check gone. It's easy to achieve, actually - just remove the
> > following lines from shrink_node()
> >
> > if (global_reclaim(sc))
> > shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
> > sc->priority);
>
> This check is not related to the patchset.

Yes, it is. This patch modifies shrink_slab which is used only by
shrink_node. Simplifying shrink_node along the way looks right to me.

> Let's don't mix everything in the single series of patches, because
> after your last remarks it will grow at least up to 15 patches.

Most of which are trivial so I don't see any problem here.

> This patchset can't be responsible for everything.

I don't understand why you balk at simplifying the code a bit while you
are patching related functions anyway.

>
> >>
> >> if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> >>

2018-05-17 04:35:48

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On Tue, May 15, 2018 at 01:12:20PM +0300, Kirill Tkhai wrote:
> >> +#define root_mem_cgroup NULL
> >
> > Let's instead export mem_cgroup_is_root(). In case if MEMCG is disabled
> > it will always return false.
>
> export == move to header file

That and adding a stub function in case !MEMCG.

> >> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> >> + struct mem_cgroup *memcg, int priority)
> >> +{
> >> + struct memcg_shrinker_map *map;
> >> + unsigned long freed = 0;
> >> + int ret, i;
> >> +
> >> + if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
> >> + return 0;
> >> +
> >> + if (!down_read_trylock(&shrinker_rwsem))
> >> + return 0;
> >> +
> >> + /*
> >> + * 1)Caller passes only alive memcg, so map can't be NULL.
> >> + * 2)shrinker_rwsem protects from maps expanding.
> >
> > ^^
> > Nit: space missing here :-)
>
> I don't understand what you mean here. Please, clarify...

This is just a trivial remark regarding comment formatting. They usually
put a space between the number and the first word in the sentence, i.e.
between '1)' and 'Caller' in your case.

>
> >> + */
> >> + map = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
> >> + BUG_ON(!map);
> >> +
> >> + for_each_set_bit(i, map->map, memcg_shrinker_nr_max) {
> >> + struct shrink_control sc = {
> >> + .gfp_mask = gfp_mask,
> >> + .nid = nid,
> >> + .memcg = memcg,
> >> + };
> >> + struct shrinker *shrinker;
> >> +
> >> + shrinker = idr_find(&shrinker_idr, i);
> >> + if (!shrinker) {
> >> + clear_bit(i, map->map);
> >> + continue;
> >> + }
> >> + if (list_empty(&shrinker->list))
> >> + continue;
> >
> > I don't like using shrinker->list as an indicator that the shrinker has
> > been initialized. IMO if you do need such a check, you should split
> > shrinker_idr registration in two steps - allocate a slot in 'prealloc'
> > and set the pointer in 'register'. However, can we really encounter an
> > unregistered shrinker here? AFAIU a bit can be set in the shrinker map
> > only after the corresponding shrinker has been initialized, no?
>
> 1) No, it's not so. Here is a race:
>
> cpu#0                         cpu#1                         cpu#2
> prealloc_shrinker()
>                               prealloc_shrinker()
>                                 memcg_expand_shrinker_maps()
>                                   memcg_expand_one_shrinker_map()
>                                     memset(&new->map, 0xff);
>                                                             do_shrink_slab() (on uninitialized LRUs)
> init LRUs
> register_shrinker_prepared()
>
> So, the check is needed.

OK, I see.

>
> 2)Assigning NULL pointer can't be used here, since NULL pointer is already used
> to clear unregistered shrinkers from the map. See the check right after idr_find().

But it won't break anything if we clear the bit for preallocated, but not yet
registered shrinkers, will it?

>
> list_empty() is used since it's the already existing indicator, which does not
> require additional member in struct shrinker.

It just looks rather counter-intuitive to me to use shrinker->list to
differentiate between registered and unregistered shrinkers. Maybe I'm
wrong. If you are sure that this is OK, I'm fine with it, but then
please add a comment here explaining what this check is needed for.

Thanks.

2018-05-17 04:50:19

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v5 13/13] mm: Clear shrinker bit if there are no objects related to memcg

On Tue, May 15, 2018 at 11:55:04AM +0300, Kirill Tkhai wrote:
> >> @@ -586,8 +586,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> >> continue;
> >>
> >> ret = do_shrink_slab(&sc, shrinker, priority);
> >> - if (ret == SHRINK_EMPTY)
> >> - ret = 0;
> >> + if (ret == SHRINK_EMPTY) {
> >> + clear_bit(i, map->map);
> >> + /*
> >> + * Pairs with mb in memcg_set_shrinker_bit():
> >> + *
> >> + * list_lru_add() shrink_slab_memcg()
> >> + * list_add_tail() clear_bit()
> >> + * <MB> <MB>
> >> + * set_bit() do_shrink_slab()
> >> + */
> >
> > Please improve the comment so that it isn't just a diagram.
>
> Please, say, which comment you want to see here.

I want the reader to understand why we need to invoke the shrinker twice
if it returns SHRINK_EMPTY. The diagram doesn't really help here IMO. So
I'd write something like this:

ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY) {
clear_bit(i, map->map);
/*
* After the shrinker reported that it had no objects to free,
* but before we cleared the corresponding bit in the memcg
* shrinker map, a new object might have been added. To make
* sure we have the bit set in this case, we invoke the
* shrinker one more time and re-set the bit if it reports that
* it is not empty anymore. The memory barrier here pairs with
* the barrier in memcg_set_shrinker_bit():
*
* list_lru_add() shrink_slab_memcg()
* list_add_tail() clear_bit()
* <MB> <MB>
* set_bit() do_shrink_slab()
*/
smp_mb__after_atomic();
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY)
ret = 0;
else
memcg_set_shrinker_bit(memcg, nid, i);

2018-05-17 11:41:15

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On 17.05.2018 07:33, Vladimir Davydov wrote:
> On Tue, May 15, 2018 at 01:12:20PM +0300, Kirill Tkhai wrote:
>>>> +#define root_mem_cgroup NULL
>>>
>>> Let's instead export mem_cgroup_is_root(). In case if MEMCG is disabled
>>> it will always return false.
>>
>> export == move to header file
>
> That and adding a stub function in case !MEMCG.
>
>>>> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>> + struct mem_cgroup *memcg, int priority)
>>>> +{
>>>> + struct memcg_shrinker_map *map;
>>>> + unsigned long freed = 0;
>>>> + int ret, i;
>>>> +
>>>> + if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
>>>> + return 0;
>>>> +
>>>> + if (!down_read_trylock(&shrinker_rwsem))
>>>> + return 0;
>>>> +
>>>> + /*
>>>> + * 1)Caller passes only alive memcg, so map can't be NULL.
>>>> + * 2)shrinker_rwsem protects from maps expanding.
>>>
>>> ^^
>>> Nit: space missing here :-)
>>
>> I don't understand what you mean here. Please, clarify...
>
> This is just a trivial remark regarding comment formatting. They usually
> put a space between the number and the first word in the sentence, i.e.
> between '1)' and 'Caller' in your case.
>
>>
>>>> + */
>>>> + map = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
>>>> + BUG_ON(!map);
>>>> +
>>>> + for_each_set_bit(i, map->map, memcg_shrinker_nr_max) {
>>>> + struct shrink_control sc = {
>>>> + .gfp_mask = gfp_mask,
>>>> + .nid = nid,
>>>> + .memcg = memcg,
>>>> + };
>>>> + struct shrinker *shrinker;
>>>> +
>>>> + shrinker = idr_find(&shrinker_idr, i);
>>>> + if (!shrinker) {
>>>> + clear_bit(i, map->map);
>>>> + continue;
>>>> + }
>>>> + if (list_empty(&shrinker->list))
>>>> + continue;
>>>
>>> I don't like using shrinker->list as an indicator that the shrinker has
>>> been initialized. IMO if you do need such a check, you should split
>>> shrinker_idr registration in two steps - allocate a slot in 'prealloc'
>>> and set the pointer in 'register'. However, can we really encounter an
>>> unregistered shrinker here? AFAIU a bit can be set in the shrinker map
>>> only after the corresponding shrinker has been initialized, no?
>>
>> 1) No, it's not so. Here is a race:
>>
>> cpu#0                         cpu#1                         cpu#2
>> prealloc_shrinker()
>>                               prealloc_shrinker()
>>                                 memcg_expand_shrinker_maps()
>>                                   memcg_expand_one_shrinker_map()
>>                                     memset(&new->map, 0xff);
>>                                                             do_shrink_slab() (on uninitialized LRUs)
>> init LRUs
>> register_shrinker_prepared()
>>
>> So, the check is needed.
>
> OK, I see.
>
>>
>> 2)Assigning NULL pointer can't be used here, since NULL pointer is already used
>> to clear unregistered shrinkers from the map. See the check right after idr_find().
>
> But it won't break anything if we clear bit for prealloc-ed, but not yet
> registered shrinkers, will it?

This imposes restrictions on the code that registers a shrinker, because
there is no rule or guarantee in the kernel that a list LRU can't be
populated before its shrinker is completely registered. The separate
subsystems of the kernel have to be modular, while clearing the bit would
break that modularity and impose restrictions on the users of this interface.

Also, if we go another way and delegate this to users, and they follow this
rule, it may require a non-trivial locking scheme on their side. So, let's
keep the modularity.

Also, we can't move the memset(0xff) to register_shrinker_prepared(), since
then we would have to remember whether the maps were expanded in
prealloc_shrinker().

>>
>> list_empty() is used since it's the already existing indicator, which does not
>> require additional member in struct shrinker.
>
> It just looks rather counter-intuitive to me to use shrinker->list to
> differentiate between registered and unregistered shrinkers. May be, I'm
> wrong. If you are sure that this is OK, I'm fine with it, but then
> please add a comment here explaining what this check is needed for.

We could introduce a new flag in shrinker::flags to indicate this instead,
but to me it seems the same.

Thanks,
Kirill

2018-05-17 11:51:04

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On 17.05.2018 07:16, Vladimir Davydov wrote:
> On Tue, May 15, 2018 at 05:49:59PM +0300, Kirill Tkhai wrote:
>>>> @@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>> .memcg = memcg,
>>>> };
>>>>
>>>> - /*
>>>> - * If kernel memory accounting is disabled, we ignore
>>>> - * SHRINKER_MEMCG_AWARE flag and call all shrinkers
>>>> - * passing NULL for memcg.
>>>> - */
>>>> - if (memcg_kmem_enabled() &&
>>>> - !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>>>> + if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>>>> continue;
>>>
>>> I want this check gone. It's easy to achieve, actually - just remove the
>>> following lines from shrink_node()
>>>
>>> if (global_reclaim(sc))
>>> shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
>>> sc->priority);
>>
>> This check is not related to the patchset.
>
> Yes, it is. This patch modifies shrink_slab which is used only by
> shrink_node. Simplifying shrink_node along the way looks right to me.

shrink_slab() is used not only in this place. It does not seem a trivial
change to me.

>> Let's don't mix everything in the single series of patches, because
>> after your last remarks it will grow at least up to 15 patches.
>
> Most of which are trivial so I don't see any problem here.
>
>> This patchset can't be responsible for everything.
>
> I don't understand why you balk at simplifying the code a bit while you
> are patching related functions anyway.

Because this function is used in several places, we have some particulars
around root_mem_cgroup initialization, and this function is called from
those places with different states of root_mem_cgroup. It does not seem a
trivial fix to me.

Let's do it on top of the series later; what is the problem? It does not
seem a critical problem.

Kirill

2018-05-17 13:53:36

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On Thu, May 17, 2018 at 02:49:26PM +0300, Kirill Tkhai wrote:
> On 17.05.2018 07:16, Vladimir Davydov wrote:
> > On Tue, May 15, 2018 at 05:49:59PM +0300, Kirill Tkhai wrote:
> >>>> @@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >>>> .memcg = memcg,
> >>>> };
> >>>>
> >>>> - /*
> >>>> - * If kernel memory accounting is disabled, we ignore
> >>>> - * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> >>>> - * passing NULL for memcg.
> >>>> - */
> >>>> - if (memcg_kmem_enabled() &&
> >>>> - !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> >>>> + if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> >>>> continue;
> >>>
> >>> I want this check gone. It's easy to achieve, actually - just remove the
> >>> following lines from shrink_node()
> >>>
> >>> if (global_reclaim(sc))
> >>> shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
> >>> sc->priority);
> >>
> >> This check is not related to the patchset.
> >
> > Yes, it is. This patch modifies shrink_slab which is used only by
> > shrink_node. Simplifying shrink_node along the way looks right to me.
>
> shrink_slab() is used not only in this place.

drop_slab_node() doesn't really count, as it is an extract from shrink_node().

> It does not seem a trivial change to me.
>
> >> Let's don't mix everything in the single series of patches, because
> >> after your last remarks it will grow at least up to 15 patches.
> >
> > Most of which are trivial so I don't see any problem here.
> >
> >> This patchset can't be responsible for everything.
> >
> > I don't understand why you balk at simplifying the code a bit while you
> > are patching related functions anyway.
>
> Because this function is used in several places, and we have some particulars
> on root_mem_cgroup initialization, and this function called from these places
> with different states of root_mem_cgroup. It does not seem trivial fix for me.

Let me do it for you then:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b697323a88c..e778569538de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -486,10 +486,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
* @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
* unaware shrinkers will receive a node id of 0 instead.
*
- * @memcg specifies the memory cgroup to target. If it is not NULL,
- * only shrinkers with SHRINKER_MEMCG_AWARE set will be called to scan
- * objects from the memory cgroup specified. Otherwise, only unaware
- * shrinkers are called.
+ * @memcg specifies the memory cgroup to target. Unaware shrinkers
+ * are called only if it is the root cgroup.
*
* @priority is sc->priority, we take the number of objects and >> by priority
* in order to get the scan target.
@@ -554,6 +552,7 @@ void drop_slab_node(int nid)
struct mem_cgroup *memcg = NULL;

freed = 0;
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
@@ -2557,9 +2556,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
node_lru_pages += lru_pages;

- if (memcg)
- shrink_slab(sc->gfp_mask, pgdat->node_id,
- memcg, sc->priority);
+ shrink_slab(sc->gfp_mask, pgdat->node_id,
+ memcg, sc->priority);

/* Record the group's reclaim efficiency */
vmpressure(sc->gfp_mask, memcg, false,
@@ -2583,10 +2581,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
}
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

- if (global_reclaim(sc))
- shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
- sc->priority);
-
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;


Seems simple enough to fold it into this patch, doesn't it?