2018-03-21 13:27:15

by Kirill Tkhai

Subject: [PATCH 00/10] Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))

Imagine a big node with many CPUs, memory cgroups and containers.
Suppose we have 200 containers, and every container has 10 mounts
and 10 cgroups. No container task ever touches another container's
mounts.

During global reclaim, a task has to iterate over all the memcgs
and call all the memcg-aware shrinkers for each of them. This means
the task has to visit 200 * 10 = 2000 shrinkers for every memcg,
and since there are 2000 memcgs, the total number of do_shrink_slab()
calls is 2000 * 2000 = 4000000.

These 4 million calls are not operations that take a single cpu cycle
each. E.g., super_cache_count() accesses at least two lists and does
some arithmetic. Even if there are no charged objects, we still do
these calculations and displace useful data from the cpu caches with
the memory reads. I have observed nodes spending almost 100% of their
time in the kernel under intensive writing and global reclaim. Even
when there is no writing, the iterations simply waste time and slow
reclaim down.

Let's see the small test below:
$echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
$mkdir /sys/fs/cgroup/memory/ct
$echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
$for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done

Then, let's see drop caches time (4 sequential calls):
$time echo 3 > /proc/sys/vm/drop_caches
0.00user 6.80system 0:06.82elapsed 99%CPU
0.00user 4.61system 0:04.62elapsed 99%CPU
0.00user 4.61system 0:04.61elapsed 99%CPU
0.00user 4.61system 0:04.61elapsed 99%CPU

The last three calls don't actually shrink anything, so the bare
iteration over the slab shrinkers takes 4.61 seconds. Not so good for
scalability.

The patchset solves the problem with the following steps:
1) Assign an id to every registered memcg-aware shrinker.
2) Maintain a per-memcg bitmap of memcg-aware shrinkers, and set
   a shrinker's bit after the first element is added to its lru list
   (and also when elements of a removed child memcg are reparented).
3) Split memcg-aware and !memcg-aware shrinkers, and call a memcg-aware
   shrinker only if its bit is set in the memcg's shrinker bitmap.
   (There is also functionality to clear the bit after the last
   element is shrunk.)
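
To illustrate the approach above, the memcg-aware part of shrink_slab()
ends up doing roughly the following (a simplified sketch, not the exact
code from the patches; the function name is made up, the helpers are
the ones introduced by the series):

static unsigned long shrink_slab_memcg_sketch(gfp_t gfp_mask, int nid,
					      struct mem_cgroup *memcg,
					      int priority)
{
	struct shrinkers_map *map;
	unsigned long freed = 0;
	int i;

	/* shrinker_rwsem is held by the caller, as in shrink_slab() */
	map = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
	if (!map)
		return 0;

	/* Visit only shrinkers whose bit is set for this memcg and node */
	for_each_set_bit(i, map->map[nid], bitmap_nr_ids) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};
		struct shrinker *shrinker = mcg_shrinkers[i];

		if (!shrinker) {
			/* The id was freed, the bit is stale: drop it */
			clear_bit(i, map->map[nid]);
			continue;
		}
		freed += do_shrink_slab(&sc, shrinker, priority);
	}
	return freed;
}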

This gives a significant performance increase. The result after the patchset is applied:

$time echo 3 > /proc/sys/vm/drop_caches
0.00user 0.93system 0:00.94elapsed 99%CPU
0.00user 0.00system 0:00.01elapsed 80%CPU
0.00user 0.00system 0:00.01elapsed 80%CPU
0.00user 0.00system 0:00.01elapsed 81%CPU
(4.61s/0.01s = 461 times faster)

Currently, all memcg-aware shrinkers are implemented via list_lru.
The only exception is the XFS cached objects backlog (which is not
memcg-aware at all, but pretends to be). See
xfs_fs_nr_cached_objects() and xfs_fs_free_cached_objects() for
the details. It seems this could be reworked to fix the shortcoming.

So, the patchset reduces the complexity of shrink_slab() and improves
performance for the kind of load described above. It also helps in the
!global reclaim case, since there will be fewer do_shrink_slab()
calls there too.

This patchset is made against linux-next.git tree.

---

Kirill Tkhai (10):
mm: Assign id to every memcg-aware shrinker
mm: Maintain memcg-aware shrinkers in mcg_shrinkers array
mm: Assign memcg-aware shrinkers bitmap to memcg
fs: Propagate shrinker::id to list_lru
list_lru: Add memcg argument to list_lru_from_kmem()
list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()
list_lru: Pass lru argument to memcg_drain_list_lru_node()
mm: Set bit in memcg shrinker bitmap on first list_lru item appearance
mm: Iterate only over charged shrinkers during memcg shrink_slab()
mm: Clear shrinker bit if there are no objects related to memcg


fs/super.c | 8 +
include/linux/list_lru.h | 3
include/linux/memcontrol.h | 20 +++
include/linux/shrinker.h | 9 +
mm/list_lru.c | 65 ++++++--
mm/memcontrol.c | 7 +
mm/vmscan.c | 337 ++++++++++++++++++++++++++++++++++++++++++--
mm/workingset.c | 6 +
8 files changed, 418 insertions(+), 37 deletions(-)

--
Signed-off-by: Kirill Tkhai <[email protected]>


2018-03-21 13:23:47

by Kirill Tkhai

Subject: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

The patch introduces a shrinker::id number, which is used to enumerate
memcg-aware shrinkers. The numbers start from 0, and the code tries
to keep them as small as possible.

This will be used to represent a memcg-aware shrinker in the memcg
shrinkers map.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/shrinker.h | 1 +
mm/vmscan.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 60 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index a3894918a436..738de8ef5246 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -66,6 +66,7 @@ struct shrinker {

/* These are for internal use */
struct list_head list;
+ int id;
/* objs pending delete, per node */
atomic_long_t *nr_deferred;
};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8fcd9f8d7390..91b5120b924f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -159,6 +159,56 @@ unsigned long vm_total_pages;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+static DEFINE_IDA(bitmap_id_ida);
+static DECLARE_RWSEM(bitmap_rwsem);
+static int bitmap_id_start;
+
+static int alloc_shrinker_id(struct shrinker *shrinker)
+{
+ int id, ret;
+
+ if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
+ return 0;
+retry:
+ ida_pre_get(&bitmap_id_ida, GFP_KERNEL);
+ down_write(&bitmap_rwsem);
+ ret = ida_get_new_above(&bitmap_id_ida, bitmap_id_start, &id);
+ if (!ret) {
+ shrinker->id = id;
+ bitmap_id_start = shrinker->id + 1;
+ }
+ up_write(&bitmap_rwsem);
+ if (ret == -EAGAIN)
+ goto retry;
+
+ return ret;
+}
+
+static void free_shrinker_id(struct shrinker *shrinker)
+{
+ int id = shrinker->id;
+
+ if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
+ return;
+
+ down_write(&bitmap_rwsem);
+ ida_remove(&bitmap_id_ida, id);
+ if (bitmap_id_start > id)
+ bitmap_id_start = id;
+ up_write(&bitmap_rwsem);
+}
+#else /* CONFIG_MEMCG && !CONFIG_SLOB */
+static int alloc_shrinker_id(struct shrinker *shrinker)
+{
+ return 0;
+}
+
+static void free_shrinker_id(struct shrinker *shrinker)
+{
+}
+#endif /* CONFIG_MEMCG && !CONFIG_SLOB */
+
#ifdef CONFIG_MEMCG
static bool global_reclaim(struct scan_control *sc)
{
@@ -269,10 +319,18 @@ int register_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;

+ if (alloc_shrinker_id(shrinker))
+ goto free_deferred;
+
down_write(&shrinker_rwsem);
list_add_tail(&shrinker->list, &shrinker_list);
up_write(&shrinker_rwsem);
return 0;
+
+free_deferred:
+ kfree(shrinker->nr_deferred);
+ shrinker->nr_deferred = NULL;
+ return -ENOMEM;
}
EXPORT_SYMBOL(register_shrinker);

@@ -286,6 +344,7 @@ void unregister_shrinker(struct shrinker *shrinker)
down_write(&shrinker_rwsem);
list_del(&shrinker->list);
up_write(&shrinker_rwsem);
+ free_shrinker_id(shrinker);
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}


2018-03-21 13:23:48

by Kirill Tkhai

Subject: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

Imagine a big node with many CPUs, memory cgroups and containers.
Suppose we have 200 containers, every container has 10 mounts,
and 10 cgroups. No container task ever touches another container's
mounts. If there is intensive page writing and global reclaim
happens, a writing task has to iterate over all memcgs to shrink
slab before it is able to go to shrink_page_list().

Iterating over all the memcg slabs is very expensive:
the task has to visit 200 * 10 = 2000 shrinkers
for every memcg, and since there are 2000 memcgs,
the total number of calls is 2000 * 2000 = 4000000.

So, the shrinker code makes 4 million do_shrink_slab() calls
just to try to isolate SWAP_CLUSTER_MAX pages from one
of the actively writing memcgs via shrink_page_list().
I've observed a node spending almost 100% of its time in the kernel,
making useless iterations over already shrunk slabs.

This patch adds a bitmap of memcg-aware shrinkers to the memcg.
The size of the bitmap depends on bitmap_nr_ids, and during the
memcg's lifetime it is kept large enough to fit bitmap_nr_ids
shrinkers. Every bit in the map corresponds to a shrinker id.

Next patches will keep a bit set only for memcgs that really have
charged objects. This will allow shrink_slab() to improve its
performance significantly. See the last patch for the numbers.
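
For reference, the bit for shrinker id "id" on node "nid" lives in the
memcg's per-node unsigned long array; in simplified terms (this is only
an illustration of the layout, not code from the patch, and the helper
name is made up):

/* Caller must be in an RCU read-side section or hold shrinker_rwsem */
static bool shrinker_bit_is_set(struct mem_cgroup *memcg, int nid, int id)
{
	struct shrinkers_map *map = rcu_dereference(memcg->shrinkers_map);

	/* map->map[nid] is one bitmap of bitmap_nr_ids bits per NUMA node */
	return test_bit(id, map->map[nid]);
}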

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/memcontrol.h | 20 ++++++++
mm/memcontrol.c | 5 ++
mm/vmscan.c | 117 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 142 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4525b4404a9e..ad88a9697fb9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -151,6 +151,11 @@ struct mem_cgroup_thresholds {
struct mem_cgroup_threshold_ary *spare;
};

+struct shrinkers_map {
+ struct rcu_head rcu;
+ unsigned long *map[0];
+};
+
enum memcg_kmem_state {
KMEM_NONE,
KMEM_ALLOCATED,
@@ -182,6 +187,9 @@ struct mem_cgroup {
unsigned long low;
unsigned long high;

+ /* Bitmap of shrinker ids suitable to call for this memcg */
+ struct shrinkers_map __rcu *shrinkers_map;
+
/* Range enforcement for interrupt charges */
struct work_struct high_work;

@@ -1219,6 +1227,9 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
return memcg ? memcg->kmemcg_id : -1;
}

+int alloc_shrinker_maps(struct mem_cgroup *memcg);
+void free_shrinker_maps(struct mem_cgroup *memcg);
+
#else
#define for_each_memcg_cache_index(_idx) \
for (; NULL; )
@@ -1241,6 +1252,15 @@ static inline void memcg_put_cache_ids(void)
{
}

+static inline int alloc_shrinker_maps(struct mem_cgroup *memcg)
+{
+ return 0;
+}
+
+static inline void free_shrinker_maps(struct mem_cgroup *memcg)
+{
+}
+
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3801ac1fcfbc..2324577c62dc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4476,6 +4476,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);

+ if (alloc_shrinker_maps(memcg))
+ return -ENOMEM;
+
/* Online state pins memcg ID, memcg ID pins CSS */
atomic_set(&memcg->id.ref, 1);
css_get(css);
@@ -4487,6 +4490,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
struct mem_cgroup_event *event, *tmp;

+ free_shrinker_maps(memcg);
+
/*
* Unregister events and notify userspace.
* Notify userspace about cgroup removing only after rmdir of cgroup
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 97ce4f342fab..9d1df5d90eca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -165,6 +165,10 @@ static DECLARE_RWSEM(bitmap_rwsem);
static int bitmap_id_start;
static int bitmap_nr_ids;
static struct shrinker **mcg_shrinkers;
+struct shrinkers_map *__rcu root_shrinkers_map;
+
+#define SHRINKERS_MAP(memcg) \
+ (memcg == root_mem_cgroup || !memcg ? root_shrinkers_map : memcg->shrinkers_map)

static int expand_shrinkers_array(int old_nr, int nr)
{
@@ -188,6 +192,116 @@ static int expand_shrinkers_array(int old_nr, int nr)
return 0;
}

+static void kvfree_map_rcu(struct rcu_head *head)
+{
+ struct shrinkers_map *map;
+ int i = nr_node_ids;
+
+ map = container_of(head, struct shrinkers_map, rcu);
+ while (--i >= 0)
+ kvfree(map->map[i]);
+ kvfree(map);
+}
+
+static int memcg_expand_maps(struct mem_cgroup *memcg, int size, int old_size)
+{
+ struct shrinkers_map *new, *old;
+ int i;
+
+ new = kvmalloc(sizeof(*new) + nr_node_ids * sizeof(new->map[0]),
+ GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_node_ids; i++) {
+ new->map[i] = kvmalloc_node(size, GFP_KERNEL, i);
+ if (!new->map[i]) {
+ while (--i >= 0)
+ kvfree(new->map[i]);
+ kvfree(new);
+ return -ENOMEM;
+ }
+
+ /* Set all old bits, clear all new bits */
+ memset(new->map[i], (int)0xff, old_size);
+ memset((void *)new->map[i] + old_size, 0, size - old_size);
+ }
+
+ lockdep_assert_held(&bitmap_rwsem);
+ old = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
+
+ /*
+ * We don't want to use rcu_read_lock() in shrink_slab().
+ * Since expansion happens rare, we may just take the lock
+ * here.
+ */
+ if (old)
+ down_write(&shrinker_rwsem);
+
+ if (memcg)
+ rcu_assign_pointer(memcg->shrinkers_map, new);
+ else
+ rcu_assign_pointer(root_shrinkers_map, new);
+
+ if (old) {
+ up_write(&shrinker_rwsem);
+ call_rcu(&old->rcu, kvfree_map_rcu);
+ }
+
+ return 0;
+}
+
+int alloc_shrinker_maps(struct mem_cgroup *memcg)
+{
+ int ret;
+
+ if (memcg == root_mem_cgroup)
+ return 0;
+
+ down_read(&bitmap_rwsem);
+ ret = memcg_expand_maps(memcg, bitmap_nr_ids/BITS_PER_BYTE, 0);
+ up_read(&bitmap_rwsem);
+ return ret;
+}
+
+void free_shrinker_maps(struct mem_cgroup *memcg)
+{
+ struct shrinkers_map *map;
+
+ if (memcg == root_mem_cgroup)
+ return;
+
+ down_read(&bitmap_rwsem);
+ map = rcu_dereference_protected(memcg->shrinkers_map, true);
+ rcu_assign_pointer(memcg->shrinkers_map, NULL);
+ up_read(&bitmap_rwsem);
+
+ if (map)
+ kvfree_map_rcu(&map->rcu);
+}
+
+static int expand_shrinker_maps(int old_id, int id)
+{
+ struct mem_cgroup *memcg = NULL, *root_memcg = root_mem_cgroup;
+ int size, old_size, ret;
+
+ size = id / BITS_PER_BYTE;
+ old_size = old_id / BITS_PER_BYTE;
+
+ if (root_memcg)
+ memcg = mem_cgroup_iter(root_memcg, NULL, NULL);
+ do {
+ ret = memcg_expand_maps(memcg == root_memcg ? NULL : memcg,
+ size, old_size);
+ if (ret || !root_memcg) {
+ mem_cgroup_iter_break(root_memcg, memcg);
+ break;
+ }
+ } while (!!(memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
+
+ return ret;
+}
+
static int expand_shrinker_id(int id)
{
if (likely(id < bitmap_nr_ids))
@@ -200,6 +314,9 @@ static int expand_shrinker_id(int id)
if (expand_shrinkers_array(bitmap_nr_ids, id))
return -ENOMEM;

+ if (expand_shrinker_maps(bitmap_nr_ids, id))
+ return -ENOMEM;
+
bitmap_nr_ids = id;
return 0;
}


2018-03-21 13:24:06

by Kirill Tkhai

Subject: [PATCH 04/10] fs: Propagate shrinker::id to list_lru

The patch adds a list_lru::shrk_id field and populates it with the
registered shrinker's id.

This will be used by the lru code in next patches to set the correct
bit in the memcg shrinkers map, after the first memcg-related element
appears in the list_lru.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 5 +++++
include/linux/list_lru.h | 1 +
mm/list_lru.c | 7 ++++++-
mm/workingset.c | 3 +++
4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 0660083427fa..1f3dc4eab409 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -521,6 +521,11 @@ struct super_block *sget_userns(struct file_system_type *type,
if (err) {
deactivate_locked_super(s);
s = ERR_PTR(err);
+ } else {
+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+ s->s_dentry_lru.shrk_id = s->s_shrink.id;
+ s->s_inode_lru.shrk_id = s->s_shrink.id;
+#endif
}
return s;
}
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 96def9d15b1b..ce1d010cd3fa 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -53,6 +53,7 @@ struct list_lru {
struct list_lru_node *node;
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
struct list_head list;
+ int shrk_id;
#endif
};

diff --git a/mm/list_lru.c b/mm/list_lru.c
index d9c84c5bda1d..013bf04a9eb9 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -567,6 +567,9 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
size_t size = sizeof(*lru->node) * nr_node_ids;
int err = -ENOMEM;

+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+ lru->shrk_id = -1;
+#endif
memcg_get_cache_ids();

lru->node = kzalloc(size, GFP_KERNEL);
@@ -608,7 +611,9 @@ void list_lru_destroy(struct list_lru *lru)
memcg_destroy_list_lru(lru);
kfree(lru->node);
lru->node = NULL;
-
+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+ lru->shrk_id = -1;
+#endif
memcg_put_cache_ids();
}
EXPORT_SYMBOL_GPL(list_lru_destroy);
diff --git a/mm/workingset.c b/mm/workingset.c
index b7d616a3bbbe..62c9eb000c4f 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -534,6 +534,9 @@ static int __init workingset_init(void)
ret = register_shrinker(&workingset_shadow_shrinker);
if (ret)
goto err_list_lru;
+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+ shadow_nodes.shrk_id = workingset_shadow_shrinker.id;
+#endif
return 0;
err_list_lru:
list_lru_destroy(&shadow_nodes);


2018-03-21 13:24:24

by Kirill Tkhai

Subject: [PATCH 06/10] list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()

This is just a refactoring to allow the next patches to have a
dst_memcg pointer in memcg_drain_list_lru_node().

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/list_lru.h | 2 +-
mm/list_lru.c | 11 ++++++-----
mm/memcontrol.c | 2 +-
3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index ce1d010cd3fa..50cf8c61c609 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -66,7 +66,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
#define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL)

int memcg_update_all_list_lrus(int num_memcgs);
-void memcg_drain_all_list_lrus(int src_idx, int dst_idx);
+void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);

/**
* list_lru_add: add an element to the lru list's tail
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 78a3943190d4..a1259b88adba 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -516,8 +516,9 @@ int memcg_update_all_list_lrus(int new_size)
}

static void memcg_drain_list_lru_node(struct list_lru_node *nlru,
- int src_idx, int dst_idx)
+ int src_idx, struct mem_cgroup *dst_memcg)
{
+ int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;

/*
@@ -537,7 +538,7 @@ static void memcg_drain_list_lru_node(struct list_lru_node *nlru,
}

static void memcg_drain_list_lru(struct list_lru *lru,
- int src_idx, int dst_idx)
+ int src_idx, struct mem_cgroup *dst_memcg)
{
int i;

@@ -545,16 +546,16 @@ static void memcg_drain_list_lru(struct list_lru *lru,
return;

for_each_node(i)
- memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx);
+ memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_memcg);
}

-void memcg_drain_all_list_lrus(int src_idx, int dst_idx)
+void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)
{
struct list_lru *lru;

mutex_lock(&list_lrus_mutex);
list_for_each_entry(lru, &list_lrus, list)
- memcg_drain_list_lru(lru, src_idx, dst_idx);
+ memcg_drain_list_lru(lru, src_idx, dst_memcg);
mutex_unlock(&list_lrus_mutex);
}
#else
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2324577c62dc..8bd6d2a1f12b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3073,7 +3073,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
}
rcu_read_unlock();

- memcg_drain_all_list_lrus(kmemcg_id, parent->kmemcg_id);
+ memcg_drain_all_list_lrus(kmemcg_id, parent);

memcg_free_cache_id(kmemcg_id);
}


2018-03-21 13:24:31

by Kirill Tkhai

Subject: [PATCH 07/10] list_lru: Pass lru argument to memcg_drain_list_lru_node()

This is just a refactoring to allow the next patches to have an
lru pointer in memcg_drain_list_lru_node().

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/list_lru.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index a1259b88adba..85a0988154aa 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -515,9 +515,10 @@ int memcg_update_all_list_lrus(int new_size)
goto out;
}

-static void memcg_drain_list_lru_node(struct list_lru_node *nlru,
+static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
int src_idx, struct mem_cgroup *dst_memcg)
{
+ struct list_lru_node *nlru = &lru->node[nid];
int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;

@@ -546,7 +547,7 @@ static void memcg_drain_list_lru(struct list_lru *lru,
return;

for_each_node(i)
- memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_memcg);
+ memcg_drain_list_lru_node(lru, i, src_idx, dst_memcg);
}

void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)


2018-03-21 13:24:42

by Kirill Tkhai

Subject: [PATCH 08/10] mm: Set bit in memcg shrinker bitmap on first list_lru item appearance

Introduce a set_shrinker_bit() function to set a shrinker-related
bit in the memcg shrinker bitmap. The bit is set after the first
item is added to a list_lru, and also when a destroyed memcg's items
are reparented.

This will allow the next patch to call shrinkers only when they have
charged objects at the moment, and thereby to improve shrink_slab()
performance.
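
In list_lru_add() this boils down to the following condition (a
commented restatement of the hunk applied below, not new code):

	if (!l->nr_items++ && lru_shrk_id(lru) >= 0)
		/*
		 * The old count was 0, so this is the first item on this
		 * memcg/node list: mark the shrinker as having work to do.
		 */
		set_shrinker_bit(memcg, nid, lru_shrk_id(lru));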

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/shrinker.h | 7 +++++++
mm/list_lru.c | 22 ++++++++++++++++++++--
mm/vmscan.c | 7 +++++++
3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 738de8ef5246..24aeed1bc332 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -78,4 +78,11 @@ struct shrinker {

extern __must_check int register_shrinker(struct shrinker *);
extern void unregister_shrinker(struct shrinker *);
+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+extern void set_shrinker_bit(struct mem_cgroup *, int, int);
+#else
+static inline void set_shrinker_bit(struct mem_cgroup *memcg, int node, int id)
+{
+}
+#endif
#endif
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 85a0988154aa..9a331c790bfb 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -30,6 +30,11 @@ static void list_lru_unregister(struct list_lru *lru)
list_del(&lru->list);
mutex_unlock(&list_lrus_mutex);
}
+
+static int lru_shrk_id(struct list_lru *lru)
+{
+ return lru->shrk_id;
+}
#else
static void list_lru_register(struct list_lru *lru)
{
@@ -38,6 +43,11 @@ static void list_lru_register(struct list_lru *lru)
static void list_lru_unregister(struct list_lru *lru)
{
}
+
+static int lru_shrk_id(struct list_lru *lru)
+{
+ return -1;
+}
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
@@ -120,13 +130,15 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
+ struct mem_cgroup *memcg;
struct list_lru_one *l;

spin_lock(&nlru->lock);
if (list_empty(item)) {
- l = list_lru_from_kmem(nlru, item, NULL);
+ l = list_lru_from_kmem(nlru, item, &memcg);
list_add_tail(item, &l->list);
- l->nr_items++;
+ if (!l->nr_items++ && lru_shrk_id(lru) >= 0)
+ set_shrinker_bit(memcg, nid, lru_shrk_id(lru));
nlru->nr_items++;
spin_unlock(&nlru->lock);
return true;
@@ -521,6 +533,7 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
struct list_lru_node *nlru = &lru->node[nid];
int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;
+ bool set;

/*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -532,9 +545,14 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
dst = list_lru_from_memcg_idx(nlru, dst_idx);

list_splice_init(&src->list, &dst->list);
+
+ set = (src->nr_items && !dst->nr_items);
dst->nr_items += src->nr_items;
src->nr_items = 0;

+ if (set && lru->shrk_id >= 0)
+ set_shrinker_bit(dst_memcg, nid, lru->shrk_id);
+
spin_unlock_irq(&nlru->lock);
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9d1df5d90eca..265cf069b470 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -378,6 +378,13 @@ static void del_shrinker(struct shrinker *shrinker)
list_del(&shrinker->list);
up_write(&shrinker_rwsem);
}
+
+void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
+{
+ struct shrinkers_map *map = SHRINKERS_MAP(memcg);
+
+ set_bit(nr, map->map[nid]);
+}
#else /* CONFIG_MEMCG && !CONFIG_SLOB */
static int alloc_shrinker_id(struct shrinker *shrinker)
{


2018-03-21 13:25:27

by Kirill Tkhai

Subject: [PATCH 02/10] mm: Maintain memcg-aware shrinkers in mcg_shrinkers array

The patch introduces an mcg_shrinkers array that keeps memcg-aware
shrinkers indexed by their shrinker::id.

This allows accessing the shrinkers directly by id, without iterating
over the shrinker_list.

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/vmscan.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 81 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 91b5120b924f..97ce4f342fab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -163,6 +163,46 @@ static DECLARE_RWSEM(shrinker_rwsem);
static DEFINE_IDA(bitmap_id_ida);
static DECLARE_RWSEM(bitmap_rwsem);
static int bitmap_id_start;
+static int bitmap_nr_ids;
+static struct shrinker **mcg_shrinkers;
+
+static int expand_shrinkers_array(int old_nr, int nr)
+{
+ struct shrinker **new;
+ int old_size, size;
+
+ size = nr * sizeof(struct shrinker *);
+ new = kvmalloc(size, GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ old_size = old_nr * sizeof(struct shrinker *);
+ memset((void *)new + old_size, 0, size - old_size);
+
+ down_write(&shrinker_rwsem);
+ memcpy((void *)new, mcg_shrinkers, old_size);
+ swap(new, mcg_shrinkers);
+ up_write(&shrinker_rwsem);
+
+ kvfree(new);
+ return 0;
+}
+
+static int expand_shrinker_id(int id)
+{
+ if (likely(id < bitmap_nr_ids))
+ return 0;
+
+ id = bitmap_nr_ids * 2;
+ if (id == 0)
+ id = BITS_PER_BYTE;
+
+ if (expand_shrinkers_array(bitmap_nr_ids, id))
+ return -ENOMEM;
+
+ bitmap_nr_ids = id;
+ return 0;
+}

static int alloc_shrinker_id(struct shrinker *shrinker)
{
@@ -175,8 +215,13 @@ static int alloc_shrinker_id(struct shrinker *shrinker)
down_write(&bitmap_rwsem);
ret = ida_get_new_above(&bitmap_id_ida, bitmap_id_start, &id);
if (!ret) {
- shrinker->id = id;
- bitmap_id_start = shrinker->id + 1;
+ if (expand_shrinker_id(id)) {
+ ida_remove(&bitmap_id_ida, id);
+ ret = -ENOMEM;
+ } else {
+ shrinker->id = id;
+ bitmap_id_start = shrinker->id + 1;
+ }
}
up_write(&bitmap_rwsem);
if (ret == -EAGAIN)
@@ -198,6 +243,24 @@ static void free_shrinker_id(struct shrinker *shrinker)
bitmap_id_start = id;
up_write(&bitmap_rwsem);
}
+
+static void add_shrinker(struct shrinker *shrinker)
+{
+ down_write(&shrinker_rwsem);
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+ mcg_shrinkers[shrinker->id] = shrinker;
+ list_add_tail(&shrinker->list, &shrinker_list);
+ up_write(&shrinker_rwsem);
+}
+
+static void del_shrinker(struct shrinker *shrinker)
+{
+ down_write(&shrinker_rwsem);
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+ mcg_shrinkers[shrinker->id] = NULL;
+ list_del(&shrinker->list);
+ up_write(&shrinker_rwsem);
+}
#else /* CONFIG_MEMCG && !CONFIG_SLOB */
static int alloc_shrinker_id(struct shrinker *shrinker)
{
@@ -207,6 +270,20 @@ static int alloc_shrinker_id(struct shrinker *shrinker)
static void free_shrinker_id(struct shrinker *shrinker)
{
}
+
+static void add_shrinker(struct shrinker *shrinker)
+{
+ down_write(&shrinker_rwsem);
+ list_add_tail(&shrinker->list, &shrinker_list);
+ up_write(&shrinker_rwsem);
+}
+
+static void del_shrinker(struct shrinker *shrinker)
+{
+ down_write(&shrinker_rwsem);
+ list_del(&shrinker->list);
+ up_write(&shrinker_rwsem);
+}
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

#ifdef CONFIG_MEMCG
@@ -322,9 +399,7 @@ int register_shrinker(struct shrinker *shrinker)
if (alloc_shrinker_id(shrinker))
goto free_deferred;

- down_write(&shrinker_rwsem);
- list_add_tail(&shrinker->list, &shrinker_list);
- up_write(&shrinker_rwsem);
+ add_shrinker(shrinker);
return 0;

free_deferred:
@@ -341,9 +416,7 @@ void unregister_shrinker(struct shrinker *shrinker)
{
if (!shrinker->nr_deferred)
return;
- down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
- up_write(&shrinker_rwsem);
+ del_shrinker(shrinker);
free_shrinker_id(shrinker);
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;


2018-03-21 13:25:31

by Kirill Tkhai

Subject: [PATCH 10/10] mm: Clear shrinker bit if there are no objects related to memcg

To avoid further unneeded do_shrink_slab() calls for shrinkers
that no longer have any charged objects in a memcg, their bits
have to be cleared.

This patch introduces a new return value, SHRINK_EMPTY, which is
used when a shrinker has no charged objects. We can't use 0 for
that, because a shrinker may legitimately return 0 when it has a
very small number of objects.

To prevent a race with a parallel list_lru add, we call
do_shrink_slab() once again after the bit is cleared. So, if a new
object appears, we never miss it, and the bit is set again.
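
Schematically, the caller side in shrink_slab() becomes (a commented
restatement of the hunk applied below, not new code):

	ret = do_shrink_slab(&sc, shrinker, priority);
	if (ret == SHRINK_EMPTY) {
		/*
		 * Nothing charged: clear our bit, then check once more.
		 * The barrier pairs with the one in set_shrinker_bit(),
		 * so a racing list_lru_add() either sees the cleared bit
		 * and sets it again, or its new object is seen by the
		 * second do_shrink_slab() call.
		 */
		clear_bit(i, map->map[nid]);
		smp_mb__after_atomic();
		ret = do_shrink_slab(&sc, shrinker, priority);
		if (ret == SHRINK_EMPTY)
			ret = 0;
		else
			set_bit(i, map->map[nid]);
	}
	freed += ret;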

The test below shows the significant performance improvement
after the patchset is applied:

$echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
$mkdir /sys/fs/cgroup/memory/ct
$echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
$for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done

Then 4 drop_caches:
$time echo 3 > /proc/sys/vm/drop_caches

Times of drop_caches:

*Before (4 iterations)*
0.00user 6.80system 0:06.82elapsed 99%CPU
0.00user 4.61system 0:04.62elapsed 99%CPU
0.00user 4.61system 0:04.61elapsed 99%CPU
0.00user 4.61system 0:04.61elapsed 99%CPU

*After (4 iterations)*
0.00user 0.93system 0:00.94elapsed 99%CPU
0.00user 0.00system 0:00.01elapsed 80%CPU
0.00user 0.00system 0:00.01elapsed 80%CPU
0.00user 0.00system 0:00.01elapsed 81%CPU

4.61s/0.01s = 461 times faster.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 3 +++
include/linux/shrinker.h | 1 +
mm/vmscan.c | 21 ++++++++++++++++++---
mm/workingset.c | 3 +++
4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 1f3dc4eab409..788ea4e6a40b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -133,6 +133,9 @@ static unsigned long super_cache_count(struct shrinker *shrink,
total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

+ if (!total_objects)
+ return SHRINK_EMPTY;
+
total_objects = vfs_pressure_ratio(total_objects);
return total_objects;
}
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 24aeed1bc332..b23180deb928 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -34,6 +34,7 @@ struct shrink_control {
};

#define SHRINK_STOP (~0UL)
+#define SHRINK_EMPTY (~0UL - 1)
/*
* A callback you can register to apply pressure to ageable caches.
*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e1fd16bc7a9b..1fc05e8bde04 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -387,6 +387,7 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
{
struct shrinkers_map *map = SHRINKERS_MAP(memcg);

+ smp_mb__before_atomic(); /* Pairs with mb in shrink_slab() */
set_bit(nr, map->map[nid]);
}
#else /* CONFIG_MEMCG && !CONFIG_SLOB */
@@ -568,8 +569,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
long scanned = 0, next_deferred;

freeable = shrinker->count_objects(shrinker, shrinkctl);
- if (freeable == 0)
- return 0;
+ if (freeable == 0 || freeable == SHRINK_EMPTY)
+ return freeable;

/*
* copy the current shrinker scan count into a local variable
@@ -708,6 +709,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
if (!memcg_kmem_enabled() || memcg) {
struct shrinkers_map *map;
+ unsigned long ret;
int i;

map = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
@@ -724,7 +726,20 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
clear_bit(i, map->map[nid]);
continue;
}
- freed += do_shrink_slab(&sc, shrinker, priority);
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ sc.nid = 0;
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY) {
+ clear_bit(i, map->map[nid]);
+ /* pairs with mb in set_shrinker_bit() */
+ smp_mb__after_atomic();
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY)
+ ret = 0;
+ else
+ set_bit(i, map->map[nid]);
+ }
+ freed += ret;

if (rwsem_is_contended(&shrinker_rwsem)) {
freed = freed ? : 1;
diff --git a/mm/workingset.c b/mm/workingset.c
index 62c9eb000c4f..dc05574f0a07 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -402,6 +402,9 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
}
max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);

+ if (!nodes)
+ return SHRINK_EMPTY;
+
if (nodes <= max_nodes)
return 0;
return nodes - max_nodes;


2018-03-21 13:25:41

by Kirill Tkhai

Subject: [PATCH 05/10] list_lru: Add memcg argument to list_lru_from_kmem()

This is just a refactoring to allow the next patches to have a
memcg pointer in list_lru_from_kmem().

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/list_lru.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 013bf04a9eb9..78a3943190d4 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -76,18 +76,24 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
}

static inline struct list_lru_one *
-list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
+list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
+ struct mem_cgroup **memcg_ptr)
{
- struct mem_cgroup *memcg;
+ struct list_lru_one *l = &nlru->lru;
+ struct mem_cgroup *memcg = NULL;

if (!nlru->memcg_lrus)
- return &nlru->lru;
+ goto out;

memcg = mem_cgroup_from_kmem(ptr);
if (!memcg)
- return &nlru->lru;
+ goto out;

- return list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+ l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+out:
+ if (memcg_ptr)
+ *memcg_ptr = memcg;
+ return l;
}
#else
static inline bool list_lru_memcg_aware(struct list_lru *lru)
@@ -102,8 +108,10 @@ list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
}

static inline struct list_lru_one *
-list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
+list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
+ struct mem_cgroup **memcg_ptr)
{
+ *memcg_ptr = NULL;
return &nlru->lru;
}
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */
@@ -116,7 +124,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)

spin_lock(&nlru->lock);
if (list_empty(item)) {
- l = list_lru_from_kmem(nlru, item);
+ l = list_lru_from_kmem(nlru, item, NULL);
list_add_tail(item, &l->list);
l->nr_items++;
nlru->nr_items++;
@@ -142,7 +150,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)

spin_lock(&nlru->lock);
if (!list_empty(item)) {
- l = list_lru_from_kmem(nlru, item);
+ l = list_lru_from_kmem(nlru, item, NULL);
list_del_init(item);
l->nr_items--;
nlru->nr_items--;


2018-03-21 13:25:42

by Kirill Tkhai

Subject: Re: [PATCH 00/10] Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))

This is actually an RFC, so comments are welcome!

On 21.03.2018 16:21, Kirill Tkhai wrote:
> Imagine a big node with many CPUs, memory cgroups and containers.
> Suppose we have 200 containers, and every container has 10 mounts
> and 10 cgroups. No container task ever touches another container's
> mounts.
>
> During global reclaim, a task has to iterate over all the memcgs
> and call all the memcg-aware shrinkers for each of them. This means
> the task has to visit 200 * 10 = 2000 shrinkers for every memcg,
> and since there are 2000 memcgs, the total number of do_shrink_slab()
> calls is 2000 * 2000 = 4000000.
>
> These 4 million calls are not operations that take a single cpu cycle
> each. E.g., super_cache_count() accesses at least two lists and does
> some arithmetic. Even if there are no charged objects, we still do
> these calculations and displace useful data from the cpu caches with
> the memory reads. I have observed nodes spending almost 100% of their
> time in the kernel under intensive writing and global reclaim. Even
> when there is no writing, the iterations simply waste time and slow
> reclaim down.
>
> Let's see the small test below:
> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
> $mkdir /sys/fs/cgroup/memory/ct
> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
> $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done
>
> Then, let's see drop caches time (4 sequential calls):
> $time echo 3 > /proc/sys/vm/drop_caches
> 0.00user 6.80system 0:06.82elapsed 99%CPU
> 0.00user 4.61system 0:04.62elapsed 99%CPU
> 0.00user 4.61system 0:04.61elapsed 99%CPU
> 0.00user 4.61system 0:04.61elapsed 99%CPU
>
> The last three calls don't actually shrink anything, so the bare
> iteration over the slab shrinkers takes 4.61 seconds. Not so good for
> scalability.
>
> The patchset solves the problem with the following steps:
> 1) Assign an id to every registered memcg-aware shrinker.
> 2) Maintain a per-memcg bitmap of memcg-aware shrinkers, and set
>    a shrinker's bit after the first element is added to its lru list
>    (and also when elements of a removed child memcg are reparented).
> 3) Split memcg-aware and !memcg-aware shrinkers, and call a memcg-aware
>    shrinker only if its bit is set in the memcg's shrinker bitmap.
>    (There is also functionality to clear the bit after the last
>    element is shrunk.)
>
> This gives a significant performance increase. The result after the patchset is applied:
>
> $time echo 3 > /proc/sys/vm/drop_caches
> 0.00user 0.93system 0:00.94elapsed 99%CPU
> 0.00user 0.00system 0:00.01elapsed 80%CPU
> 0.00user 0.00system 0:00.01elapsed 80%CPU
> 0.00user 0.00system 0:00.01elapsed 81%CPU
> (4.61s/0.01s = 461 times faster)
>
> Currently, all memcg-aware shrinkers are implemented via list_lru.
> The only exception is the XFS cached objects backlog (which is not
> memcg-aware at all, but pretends to be). See
> xfs_fs_nr_cached_objects() and xfs_fs_free_cached_objects() for
> the details. It seems this could be reworked to fix the shortcoming.
>
> So, the patchset reduces the complexity of shrink_slab() and improves
> performance for the kind of load described above. It also helps in the
> !global reclaim case, since there will be fewer do_shrink_slab()
> calls there too.
>
> This patchset is made against linux-next.git tree.
>
> ---
>
> Kirill Tkhai (10):
> mm: Assign id to every memcg-aware shrinker
> mm: Maintain memcg-aware shrinkers in mcg_shrinkers array
> mm: Assign memcg-aware shrinkers bitmap to memcg
> fs: Propagate shrinker::id to list_lru
> list_lru: Add memcg argument to list_lru_from_kmem()
> list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()
> list_lru: Pass lru argument to memcg_drain_list_lru_node()
> mm: Set bit in memcg shrinker bitmap on first list_lru item appearance
> mm: Iterate only over charged shrinkers during memcg shrink_slab()
> mm: Clear shrinker bit if there are no objects related to memcg
>
>
> fs/super.c | 8 +
> include/linux/list_lru.h | 3
> include/linux/memcontrol.h | 20 +++
> include/linux/shrinker.h | 9 +
> mm/list_lru.c | 65 ++++++--
> mm/memcontrol.c | 7 +
> mm/vmscan.c | 337 ++++++++++++++++++++++++++++++++++++++++++--
> mm/workingset.c | 6 +
> 8 files changed, 418 insertions(+), 37 deletions(-)
>
> --
> Signed-off-by: Kirill Tkhai <[email protected]>
>

2018-03-21 13:26:06

by Kirill Tkhai

Subject: [PATCH 09/10] mm: Iterate only over charged shrinkers during memcg shrink_slab()

Using the preparations made in the previous patches, during memcg
shrink we may skip shrinkers whose bits are not set in the memcg's
shrinkers bitmap. To do that, we separate the iterations over
memcg-aware and !memcg-aware shrinkers, and memcg-aware shrinkers
are selected via for_each_set_bit() from the bitmap. On big nodes
with many isolated environments this gives a significant
performance improvement. See the next patch for the details.

Note that this patch does not yet handle empty memcg shrinkers,
since we never clear a bitmap bit after it has been set. Such
shrinkers will still be called, shrinking no objects as a result.
That functionality is provided by the next patch.

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/vmscan.c | 54 +++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 41 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 265cf069b470..e1fd16bc7a9b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -327,6 +327,8 @@ static int alloc_shrinker_id(struct shrinker *shrinker)

if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
return 0;
+ BUG_ON(!(shrinker->flags & SHRINKER_NUMA_AWARE));
+
retry:
ida_pre_get(&bitmap_id_ida, GFP_KERNEL);
down_write(&bitmap_rwsem);
@@ -366,7 +368,8 @@ static void add_shrinker(struct shrinker *shrinker)
down_write(&shrinker_rwsem);
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
mcg_shrinkers[shrinker->id] = shrinker;
- list_add_tail(&shrinker->list, &shrinker_list);
+ else
+ list_add_tail(&shrinker->list, &shrinker_list);
up_write(&shrinker_rwsem);
}

@@ -375,7 +378,8 @@ static void del_shrinker(struct shrinker *shrinker)
down_write(&shrinker_rwsem);
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
mcg_shrinkers[shrinker->id] = NULL;
- list_del(&shrinker->list);
+ else
+ list_del(&shrinker->list);
up_write(&shrinker_rwsem);
}

@@ -701,6 +705,39 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!down_read_trylock(&shrinker_rwsem))
goto out;

+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+ if (!memcg_kmem_enabled() || memcg) {
+ struct shrinkers_map *map;
+ int i;
+
+ map = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
+ if (map) {
+ for_each_set_bit(i, map->map[nid], bitmap_nr_ids) {
+ struct shrink_control sc = {
+ .gfp_mask = gfp_mask,
+ .nid = nid,
+ .memcg = memcg,
+ };
+
+ shrinker = mcg_shrinkers[i];
+ if (!shrinker) {
+ clear_bit(i, map->map[nid]);
+ continue;
+ }
+ freed += do_shrink_slab(&sc, shrinker, priority);
+
+ if (rwsem_is_contended(&shrinker_rwsem)) {
+ freed = freed ? : 1;
+ goto unlock;
+ }
+ }
+ }
+
+ if (memcg_kmem_enabled() && memcg)
+ goto unlock;
+ }
+#endif
+
list_for_each_entry(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
@@ -708,15 +745,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
.memcg = memcg,
};

- /*
- * If kernel memory accounting is disabled, we ignore
- * SHRINKER_MEMCG_AWARE flag and call all shrinkers
- * passing NULL for memcg.
- */
- if (memcg_kmem_enabled() &&
- !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
- continue;
-
if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
sc.nid = 0;

@@ -728,10 +756,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
*/
if (rwsem_is_contended(&shrinker_rwsem)) {
freed = freed ? : 1;
- break;
+ goto unlock;
}
}
-
+unlock:
up_read(&shrinker_rwsem);
out:
cond_resched();


2018-03-21 14:58:29

by Matthew Wilcox

Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On Wed, Mar 21, 2018 at 04:21:40PM +0300, Kirill Tkhai wrote:
> +++ b/include/linux/memcontrol.h
> @@ -151,6 +151,11 @@ struct mem_cgroup_thresholds {
> struct mem_cgroup_threshold_ary *spare;
> };
>
> +struct shrinkers_map {
> + struct rcu_head rcu;
> + unsigned long *map[0];
> +};
> +
> enum memcg_kmem_state {
> KMEM_NONE,
> KMEM_ALLOCATED,
> @@ -182,6 +187,9 @@ struct mem_cgroup {
> unsigned long low;
> unsigned long high;
>
> + /* Bitmap of shrinker ids suitable to call for this memcg */
> + struct shrinkers_map __rcu *shrinkers_map;
> +
> /* Range enforcement for interrupt charges */
> struct work_struct high_work;
>

Why use your own bitmap here? Why not use an IDA which can grow and
shrink automatically without you needing to play fun games with RCU?


2018-03-21 15:14:18

by Kirill Tkhai

Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On 21.03.2018 17:56, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 04:21:40PM +0300, Kirill Tkhai wrote:
>> +++ b/include/linux/memcontrol.h
>> @@ -151,6 +151,11 @@ struct mem_cgroup_thresholds {
>> struct mem_cgroup_threshold_ary *spare;
>> };
>>
>> +struct shrinkers_map {
>> + struct rcu_head rcu;
>> + unsigned long *map[0];
>> +};
>> +
>> enum memcg_kmem_state {
>> KMEM_NONE,
>> KMEM_ALLOCATED,
>> @@ -182,6 +187,9 @@ struct mem_cgroup {
>> unsigned long low;
>> unsigned long high;
>>
>> + /* Bitmap of shrinker ids suitable to call for this memcg */
>> + struct shrinkers_map __rcu *shrinkers_map;
>> +
>> /* Range enforcement for interrupt charges */
>> struct work_struct high_work;
>>
>
> Why use your own bitmap here? Why not use an IDA which can grow and
> shrink automatically without you needing to play fun games with RCU?

A bitmap allows using unlocked set_bit()/clear_bit() to maintain the map
of non-empty shrinkers.

So, the reason to use IDR here is to save bitmap memory? Does this mean
IDA works fast with sparse identifiers? It seems they require a per-memcg
lock to call the IDR primitives. I just don't have information about this.

If so, which IDA primitive can be used to set a particular id in the bitmap?
idr_alloc_cyclic(idr, NULL, id, id+1, GFP_KERNEL) is the only one I see
for doing that.

Kirill

2018-03-21 15:30:02

by Matthew Wilcox

Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On Wed, Mar 21, 2018 at 06:12:17PM +0300, Kirill Tkhai wrote:
> On 21.03.2018 17:56, Matthew Wilcox wrote:
> > Why use your own bitmap here? Why not use an IDA which can grow and
> > shrink automatically without you needing to play fun games with RCU?
>
> Bitmap allows to use unlocked set_bit()/clear_bit() to maintain the map
> of not empty shrinkers.
>
> So, the reason to use IDR here is to save bitmap memory? Does this mean
> IDA works fast with sparse identifiers? It seems they require per-memcg
> lock to call IDR primitives. I just don't have information about this.
>
> If so, which IDA primitive can be used to set particular id in bitmap?
> There is idr_alloc_cyclic(idr, NULL, id, id+1, GFP_KERNEL) only I see
> to do that.

You're confusing IDR and IDA in your email, which is unfortunate.

You can set a bit in an IDA by calling ida_simple_get(ida, n, n, GFP_FOO);
You clear it by calling ida_simple_remove(ida, n);

The identifiers aren't going to be all that sparse; after all you're
allocating them from a global IDA. Up to 62 identifiers will allocate
no memory; 63-1024 identifiers will allocate a single 128 byte chunk.
Between 1025 and 65536 identifiers, you'll allocate a 576-byte chunk
and then 128-byte chunks for each block of 1024 identifiers (*). One of
the big wins with the IDA is that it will shrink again after being used.
I didn't read all the way through your patchset to see if you bother to
shrink your bitmap after it's no longer used, but most resizing bitmaps
we have in the kernel don't bother with that part.

(*) Actually it's more complex than that... between 1025 and 1086,
you'll have a 576 byte chunk, a 128-byte chunk and then use 62 bits of
the next pointer before allocating a 128 byte chunk when reaching ID
1087. Similar things happen for the 62 bits after 2048, 3076 and so on.
The individual chunks aren't shrunk until they're empty so if you set ID
1025 and then ID 1100, then clear ID 1100, the 128-byte chunk will remain
allocated until ID 1025 is cleared. This probably doesn't matter to you.
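
For illustration, a minimal sketch of driving an IDA as a "bitmap" this
way (just an example of the API above; the helper names are made up and
this is not code from the patchset):

static DEFINE_IDA(example_ida);

/* "Set bit n": allocate exactly id n; returns n, or -ENOSPC if already set */
static int example_set(int n)
{
	return ida_simple_get(&example_ida, n, n + 1, GFP_KERNEL);
}

/* "Clear bit n" */
static void example_clear(int n)
{
	ida_simple_remove(&example_ida, n);
}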

2018-03-21 15:44:56

by Kirill Tkhai

Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On 21.03.2018 18:26, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 06:12:17PM +0300, Kirill Tkhai wrote:
>> On 21.03.2018 17:56, Matthew Wilcox wrote:
>>> Why use your own bitmap here? Why not use an IDA which can grow and
>>> shrink automatically without you needing to play fun games with RCU?
>>
>> Bitmap allows to use unlocked set_bit()/clear_bit() to maintain the map
>> of not empty shrinkers.
>>
>> So, the reason to use IDR here is to save bitmap memory? Does this mean
>> IDA works fast with sparse identifiers? It seems they require per-memcg
>> lock to call IDR primitives. I just don't have information about this.
>>
>> If so, which IDA primitive can be used to set particular id in bitmap?
>> There is idr_alloc_cyclic(idr, NULL, id, id+1, GFP_KERNEL) only I see
>> to do that.
>
> You're confusing IDR and IDA in your email, which is unfortunate.
>
> You can set a bit in an IDA by calling ida_simple_get(ida, n, n, GFP_FOO);
> You clear it by calling ida_simple_remove(ida, n);

I moved to IDR in my message, since IDA uses a global spinlock. It would be
taken every time the first object is added to a list_lru, or the last one
is removed. These may be frequently called operations, and they may not
scale well on big machines.

Using IDR would allow us to introduce memcg-related locks, but I'm still not
sure it's easy to introduce them in a scalable way. Simple set_bit()/clear_bit()
do not require locks at all.

> The identifiers aren't going to be all that sparse; after all you're
> allocating them from a global IDA. Up to 62 identifiers will allocate
> no memory; 63-1024 identifiers will allocate a single 128 byte chunk.
> Between 1025 and 65536 identifiers, you'll allocate a 576-byte chunk
> and then 128-byte chunks for each block of 1024 identifiers (*). One of
> the big wins with the IDA is that it will shrink again after being used.
> I didn't read all the way through your patchset to see if you bother to
> shrink your bitmap after it's no longer used, but most resizing bitmaps
> we have in the kernel don't bother with that part.
>
> (*) Actually it's more complex than that... between 1025 and 1086,
> you'll have a 576 byte chunk, a 128-byte chunk and then use 62 bits of
> the next pointer before allocating a 128 byte chunk when reaching ID
> 1087. Similar things happen for the 62 bits after 2048, 3076 and so on.
> The individual chunks aren't shrunk until they're empty so if you set ID
> 1025 and then ID 1100, then clear ID 1100, the 128-byte chunk will remain
> allocated until ID 1025 is cleared. This probably doesn't matter to you.

Sounds great, thanks for explaining this. The big problem I see is
that the IDA/IDR add primitives allocate memory, while they would be
used in places where they must not fail. There is list_lru_add(),
which is called unconditionally in the current kernel code. The
patchset populates the bitmap in this function. So, we can't use
IDR there.

Thanks,
Kirill

2018-03-21 16:23:37

by Matthew Wilcox

Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On Wed, Mar 21, 2018 at 06:43:01PM +0300, Kirill Tkhai wrote:
> On 21.03.2018 18:26, Matthew Wilcox wrote:
> > On Wed, Mar 21, 2018 at 06:12:17PM +0300, Kirill Tkhai wrote:
> >> On 21.03.2018 17:56, Matthew Wilcox wrote:
> >>> Why use your own bitmap here? Why not use an IDA which can grow and
> >>> shrink automatically without you needing to play fun games with RCU?
> >>
> >> Bitmap allows to use unlocked set_bit()/clear_bit() to maintain the map
> >> of not empty shrinkers.
> >>
> >> So, the reason to use IDR here is to save bitmap memory? Does this mean
> >> IDA works fast with sparse identifiers? It seems they require per-memcg
> >> lock to call IDR primitives. I just don't have information about this.
> >>
> >> If so, which IDA primitive can be used to set particular id in bitmap?
> >> There is idr_alloc_cyclic(idr, NULL, id, id+1, GFP_KERNEL) only I see
> >> to do that.
> >
> > You're confusing IDR and IDA in your email, which is unfortunate.
> >
> > You can set a bit in an IDA by calling ida_simple_get(ida, n, n, GFP_FOO);
> > You clear it by calling ida_simple_remove(ida, n);
>
> I moved to IDR in the message, since IDA uses global spinlock. It will be
> taken every time a first object is added to list_lru, or last is removed.
> These may be frequently called operations, and they may scale not good
> on big machines.

I'm fixing the global spinlock issue with the IDA. Not going to be ready
for 4.17, but hopefully for 4.18.

> Using IDR will allow us to introduce memcg-related locks, but I'm still not
> sure it's easy to introduce them in scalable-way. Simple set_bit()/clear_bit()
> do not require locks at all.

They're locked operations ... they may not have an explicit spinlock
associated with them, but the locking still happens.

> > The identifiers aren't going to be all that sparse; after all you're
> > allocating them from a global IDA. Up to 62 identifiers will allocate
> > no memory; 63-1024 identifiers will allocate a single 128 byte chunk.
> > Between 1025 and 65536 identifiers, you'll allocate a 576-byte chunk
> > and then 128-byte chunks for each block of 1024 identifiers (*). One of
> > the big wins with the IDA is that it will shrink again after being used.
> > I didn't read all the way through your patchset to see if you bother to
> > shrink your bitmap after it's no longer used, but most resizing bitmaps
> > we have in the kernel don't bother with that part.
> >
> > (*) Actually it's more complex than that... between 1025 and 1086,
> > you'll have a 576 byte chunk, a 128-byte chunk and then use 62 bits of
> > the next pointer before allocating a 128 byte chunk when reaching ID
> > 1087. Similar things happen for the 62 bits after 2048, 3076 and so on.
> > The individual chunks aren't shrunk until they're empty so if you set ID
> > 1025 and then ID 1100, then clear ID 1100, the 128-byte chunk will remain
> > allocated until ID 1025 is cleared. This probably doesn't matter to you.
>
> Sound great, thanks for explaining this. The big problem I see is
> that IDA/IDR add primitives allocate memory, while they will be used
> in the places, where they mustn't fail. There is list_lru_add(), and
> it's called unconditionally in current kernel code. The patchset makes
> the bitmap be populated in this function. So, we can't use IDR there.

Maybe we can use GFP_NOFAIL here. They're small allocations, so we're
only asking for single-page allocations to not fail, which shouldn't
put too much strain on the VM.

2018-03-21 16:45:48

by Kirill Tkhai

Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On 21.03.2018 19:20, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 06:43:01PM +0300, Kirill Tkhai wrote:
>> On 21.03.2018 18:26, Matthew Wilcox wrote:
>>> On Wed, Mar 21, 2018 at 06:12:17PM +0300, Kirill Tkhai wrote:
>>>> On 21.03.2018 17:56, Matthew Wilcox wrote:
>>>>> Why use your own bitmap here? Why not use an IDA which can grow and
>>>>> shrink automatically without you needing to play fun games with RCU?
>>>>
>>>> Bitmap allows to use unlocked set_bit()/clear_bit() to maintain the map
>>>> of not empty shrinkers.
>>>>
>>>> So, the reason to use IDR here is to save bitmap memory? Does this mean
>>>> IDA works fast with sparse identifiers? It seems they require per-memcg
>>>> lock to call IDR primitives. I just don't have information about this.
>>>>
>>>> If so, which IDA primitive can be used to set particular id in bitmap?
>>>> There is idr_alloc_cyclic(idr, NULL, id, id+1, GFP_KERNEL) only I see
>>>> to do that.
>>>
>>> You're confusing IDR and IDA in your email, which is unfortunate.
>>>
>>> You can set a bit in an IDA by calling ida_simple_get(ida, n, n, GFP_FOO);
>>> You clear it by calling ida_simple_remove(ida, n);
>>
>> I moved to IDR in the message, since IDA uses global spinlock. It will be
>> taken every time a first object is added to list_lru, or last is removed.
>> These may be frequently called operations, and they may scale not good
>> on big machines.
>
> I'm fixing the global spinlock issue with the IDA. Not going to be ready
> for 4.17, but hopefully for 4.18.

It will be nice to see that in the kernel.

>> Using IDR will allow us to introduce memcg-related locks, but I'm still not
>> sure it's easy to introduce them in scalable-way. Simple set_bit()/clear_bit()
>> do not require locks at all.
>
> They're locked operations ... they may not have an explicit spinlock
> associated with them, but the locking still happens.

Yes, they are not ideal in that respect.

>>> The identifiers aren't going to be all that sparse; after all you're
>>> allocating them from a global IDA. Up to 62 identifiers will allocate
>>> no memory; 63-1024 identifiers will allocate a single 128 byte chunk.
>>> Between 1025 and 65536 identifiers, you'll allocate a 576-byte chunk
>>> and then 128-byte chunks for each block of 1024 identifiers (*). One of
>>> the big wins with the IDA is that it will shrink again after being used.
>>> I didn't read all the way through your patchset to see if you bother to
>>> shrink your bitmap after it's no longer used, but most resizing bitmaps
>>> we have in the kernel don't bother with that part.
>>>
>>> (*) Actually it's more complex than that... between 1025 and 1086,
>>> you'll have a 576 byte chunk, a 128-byte chunk and then use 62 bits of
>>> the next pointer before allocating a 128 byte chunk when reaching ID
>>> 1087. Similar things happen for the 62 bits after 2048, 3076 and so on.
>>> The individual chunks aren't shrunk until they're empty so if you set ID
>>> 1025 and then ID 1100, then clear ID 1100, the 128-byte chunk will remain
>>> allocated until ID 1025 is cleared. This probably doesn't matter to you.
>>
>> Sound great, thanks for explaining this. The big problem I see is
>> that IDA/IDR add primitives allocate memory, while they will be used
>> in the places, where they mustn't fail. There is list_lru_add(), and
>> it's called unconditionally in current kernel code. The patchset makes
>> the bitmap be populated in this function. So, we can't use IDR there.
>
> Maybe we can use GFP_NOFAIL here. They're small allocations, so we're
> only asking for single-page allocations to not fail, which shouldn't
> put too much strain on the VM.

Oh, I'm not sure about this. Even if each allocation is small, a situation is
theoretically possible where many lists want to add their first element at the
same time. list_lru_add() is called from iput(), for example.

Kirill

2018-03-21 17:56:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On Wed, Mar 21, 2018 at 07:42:38PM +0300, Kirill Tkhai wrote:
> On 21.03.2018 19:20, Matthew Wilcox wrote:
> >> Sound great, thanks for explaining this. The big problem I see is
> >> that IDA/IDR add primitives allocate memory, while they will be used
> >> in the places, where they mustn't fail. There is list_lru_add(), and
> >> it's called unconditionally in current kernel code. The patchset makes
> >> the bitmap be populated in this function. So, we can't use IDR there.
> >
> > Maybe we can use GFP_NOFAIL here. They're small allocations, so we're
> > only asking for single-page allocations to not fail, which shouldn't
> > put too much strain on the VM.
>
> Oh. I'm not sure about this. Even if each allocation is small, there is
> theoretically possible a situation, when many lists will want to add first
> element. list_lru_add() is called from iput() for example.

I see. Maybe we could solve this with an IDA_NO_SHRINK flag and an
ida_resize(ida, max); call.

You'll also want something like this:


diff --git a/include/linux/idr.h b/include/linux/idr.h
index 0f650b90ced0..ee7185354fb2 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -273,6 +273,22 @@ static inline void ida_init(struct ida *ida)
ida_alloc_range(ida, start, (end) - 1, gfp)
#define ida_simple_remove(ida, id) ida_free(ida, id)

+int ida_find(struct ida *, unsigned int id);
+
+/**
+ * ida_for_each() - Iterate over all allocated IDs.
+ * @ida: IDA handle.
+ * @id: Loop cursor.
+ *
+ * For each iteration of this loop, @id will be set to an allocated ID.
+ * No locks are held across the body of the loop, so you can call ida_free()
+ * if you want or adjust @id to skip IDs or re-process earlier IDs.
+ *
+ * On successful loop exit, @id will be less than 0.
+ */
+#define ida_for_each(ida, i) \
+ for (i = ida_find(ida, 0); i >= 0; i = ida_find(ida, i + 1))
+
/**
* ida_get_new - allocate new ID
* @ida: idr handle
diff --git a/lib/idr.c b/lib/idr.c
index fab3763e8c2a..ba9fae7eb2f5 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -612,3 +612,54 @@ void ida_free(struct ida *ida, unsigned int id)
spin_unlock_irqrestore(&simple_ida_lock, flags);
}
EXPORT_SYMBOL(ida_free);
+
+/**
+ * ida_find() - Find an allocated ID.
+ * @ida: IDA handle.
+ * @id: Minimum ID to return.
+ *
+ * Context: Any context.
+ * Return: An ID which is at least as large as @id or %-ENOSPC if @id is
+ * higher than any allocated ID.
+ */
+int ida_find(struct ida *ida, unsigned int id)
+{
+ unsigned long flags;
+ unsigned long index = id / IDA_BITMAP_BITS;
+ unsigned bit = id % IDA_BITMAP_BITS;
+ struct ida_bitmap *bitmap;
+ struct radix_tree_iter iter;
+ void __rcu **slot;
+ int ret = -ENOSPC;
+
+ spin_lock_irqsave(&simple_ida_lock, flags);
+advance:
+ slot = radix_tree_iter_find(&ida->ida_rt, &iter, index);
+ if (!slot)
+ goto unlock;
+ bitmap = rcu_dereference_raw(*slot);
+ if (radix_tree_exception(bitmap)) {
+ if (bit < (BITS_PER_LONG - RADIX_TREE_EXCEPTIONAL_SHIFT)) {
+ unsigned long bits = (unsigned long)bitmap;
+
+ bits >>= bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
+ if (bits) {
+ bit += __ffs(bits);
+ goto found;
+ }
+ }
+ } else {
+ bit = find_next_bit(bitmap->bitmap, IDA_BITMAP_BITS, bit);
+ if (bit < IDA_BITMAP_BITS)
+ goto found;
+ }
+ bit = 0;
+ index++;
+ goto advance;
+found:
+ ret = iter.index * IDA_BITMAP_BITS + bit;
+unlock:
+ spin_unlock_irqrestore(&simple_ida_lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL(ida_find);
diff --git a/tools/testing/radix-tree/idr-test.c b/tools/testing/radix-tree/idr-test.c
index 6c645eb77d42..a9b5a33a4ef3 100644
--- a/tools/testing/radix-tree/idr-test.c
+++ b/tools/testing/radix-tree/idr-test.c
@@ -358,8 +358,12 @@ void ida_check_conv(void)
assert(ida_pre_get(&ida, GFP_KERNEL));
assert(!ida_get_new_above(&ida, i + 1, &id));
assert(id == i + 1);
+ ida_for_each(&ida, id)
+ BUG_ON(id != (i + 1));
assert(!ida_get_new_above(&ida, i + BITS_PER_LONG, &id));
assert(id == i + BITS_PER_LONG);
+ ida_for_each(&ida, id)
+ BUG_ON((id != (i + 1)) && (id != (i + BITS_PER_LONG)));
ida_remove(&ida, i + 1);
ida_remove(&ida, i + BITS_PER_LONG);
assert(ida_is_empty(&ida));
@@ -484,7 +488,7 @@ void ida_simple_get_remove_test(void)
void ida_checks(void)
{
DEFINE_IDA(ida);
- int id;
+ int id, id2;
unsigned long i;

radix_tree_cpu_dead(1);
@@ -496,8 +500,22 @@ void ida_checks(void)
assert(id == i);
}

+ id2 = 0;
+ ida_for_each(&ida, id) {
+ BUG_ON(id != id2++);
+ }
+ BUG_ON(id >= 0);
+ BUG_ON(id2 != 10000);
+
ida_remove(&ida, 20);
ida_remove(&ida, 21);
+ id2 = 0;
+ ida_for_each(&ida, id) {
+ if (id != id2++) {
+ BUG_ON(id != 22 || id2 != 21);
+ id2 = 23;
+ }
+ }
for (i = 0; i < 3; i++) {
assert(ida_pre_get(&ida, GFP_KERNEL));
assert(!ida_get_new(&ida, &id));
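
A hedged usage sketch of the proposed ida_for_each() helper (the surrounding function and names are hypothetical, not part of the patch above):

/* Walk every allocated shrinker id; relies on the ida_for_each() macro above. */
static void report_allocated_ids(struct ida *shrinker_ida)
{
        int id;

        ida_for_each(shrinker_ida, id)
                pr_debug("shrinker id %d is allocated\n", id);
        /* On normal loop exit, id is negative (-ENOSPC from ida_find()). */
}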

2018-03-22 16:40:31

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On 21.03.2018 20:54, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 07:42:38PM +0300, Kirill Tkhai wrote:
>> On 21.03.2018 19:20, Matthew Wilcox wrote:
>>>> Sound great, thanks for explaining this. The big problem I see is
>>>> that IDA/IDR add primitives allocate memory, while they will be used
>>>> in the places, where they mustn't fail. There is list_lru_add(), and
>>>> it's called unconditionally in current kernel code. The patchset makes
>>>> the bitmap be populated in this function. So, we can't use IDR there.
>>>
>>> Maybe we can use GFP_NOFAIL here. They're small allocations, so we're
>>> only asking for single-page allocations to not fail, which shouldn't
>>> put too much strain on the VM.
>>
>> Oh. I'm not sure about this. Even if each allocation is small, there is
>> theoretically possible a situation, when many lists will want to add first
>> element. list_lru_add() is called from iput() for example.
>
> I see. Maybe we could solve this with an IDA_NO_SHRINK flag and an
> ida_resize(ida, max); call.

But what would be the difference from a plain bitmap, if we allocate the maximal map anyway?

> You'll also want something like this:
>
>
> diff --git a/include/linux/idr.h b/include/linux/idr.h
> index 0f650b90ced0..ee7185354fb2 100644
> --- a/include/linux/idr.h
> +++ b/include/linux/idr.h
> @@ -273,6 +273,22 @@ static inline void ida_init(struct ida *ida)
> ida_alloc_range(ida, start, (end) - 1, gfp)
> #define ida_simple_remove(ida, id) ida_free(ida, id)
>
> +int ida_find(struct ida *, unsigned int id);
> +
> +/**
> + * ida_for_each() - Iterate over all allocated IDs.
> + * @ida: IDA handle.
> + * @id: Loop cursor.
> + *
> + * For each iteration of this loop, @id will be set to an allocated ID.
> + * No locks are held across the body of the loop, so you can call ida_free()
> + * if you want or adjust @id to skip IDs or re-process earlier IDs.
> + *
> + * On successful loop exit, @id will be less than 0.
> + */
> +#define ida_for_each(ida, i) \
> + for (i = ida_find(ida, 0); i >= 0; i = ida_find(ida, i + 1))
> +
> /**
> * ida_get_new - allocate new ID
> * @ida: idr handle
> diff --git a/lib/idr.c b/lib/idr.c
> index fab3763e8c2a..ba9fae7eb2f5 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -612,3 +612,54 @@ void ida_free(struct ida *ida, unsigned int id)
> spin_unlock_irqrestore(&simple_ida_lock, flags);
> }
> EXPORT_SYMBOL(ida_free);
> +
> +/**
> + * ida_find() - Find an allocated ID.
> + * @ida: IDA handle.
> + * @id: Minimum ID to return.
> + *
> + * Context: Any context.
> + * Return: An ID which is at least as large as @id or %-ENOSPC if @id is
> + * higher than any allocated ID.
> + */
> +int ida_find(struct ida *ida, unsigned int id)
> +{
> + unsigned long flags;
> + unsigned long index = id / IDA_BITMAP_BITS;
> + unsigned bit = id % IDA_BITMAP_BITS;
> + struct ida_bitmap *bitmap;
> + struct radix_tree_iter iter;
> + void __rcu **slot;
> + int ret = -ENOSPC;
> +
> + spin_lock_irqsave(&simple_ida_lock, flags);
> +advance:
> + slot = radix_tree_iter_find(&ida->ida_rt, &iter, index);
> + if (!slot)
> + goto unlock;
> + bitmap = rcu_dereference_raw(*slot);
> + if (radix_tree_exception(bitmap)) {
> + if (bit < (BITS_PER_LONG - RADIX_TREE_EXCEPTIONAL_SHIFT)) {
> + unsigned long bits = (unsigned long)bitmap;
> +
> + bits >>= bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
> + if (bits) {
> + bit += __ffs(bits);
> + goto found;
> + }
> + }
> + } else {
> + bit = find_next_bit(bitmap->bitmap, IDA_BITMAP_BITS, bit);
> + if (bit < IDA_BITMAP_BITS)
> + goto found;
> + }
> + bit = 0;
> + index++;
> + goto advance;
> +found:
> + ret = iter.index * IDA_BITMAP_BITS + bit;
> +unlock:
> + spin_unlock_irqrestore(&simple_ida_lock, flags);
> + return ret;
> +}
> +EXPORT_SYMBOL(ida_find);
> diff --git a/tools/testing/radix-tree/idr-test.c b/tools/testing/radix-tree/idr-test.c
> index 6c645eb77d42..a9b5a33a4ef3 100644
> --- a/tools/testing/radix-tree/idr-test.c
> +++ b/tools/testing/radix-tree/idr-test.c
> @@ -358,8 +358,12 @@ void ida_check_conv(void)
> assert(ida_pre_get(&ida, GFP_KERNEL));
> assert(!ida_get_new_above(&ida, i + 1, &id));
> assert(id == i + 1);
> + ida_for_each(&ida, id)
> + BUG_ON(id != (i + 1));
> assert(!ida_get_new_above(&ida, i + BITS_PER_LONG, &id));
> assert(id == i + BITS_PER_LONG);
> + ida_for_each(&ida, id)
> + BUG_ON((id != (i + 1)) && (id != (i + BITS_PER_LONG)));
> ida_remove(&ida, i + 1);
> ida_remove(&ida, i + BITS_PER_LONG);
> assert(ida_is_empty(&ida));
> @@ -484,7 +488,7 @@ void ida_simple_get_remove_test(void)
> void ida_checks(void)
> {
> DEFINE_IDA(ida);
> - int id;
> + int id, id2;
> unsigned long i;
>
> radix_tree_cpu_dead(1);
> @@ -496,8 +500,22 @@ void ida_checks(void)
> assert(id == i);
> }
>
> + id2 = 0;
> + ida_for_each(&ida, id) {
> + BUG_ON(id != id2++);
> + }
> + BUG_ON(id >= 0);
> + BUG_ON(id2 != 10000);
> +
> ida_remove(&ida, 20);
> ida_remove(&ida, 21);
> + id2 = 0;
> + ida_for_each(&ida, id) {
> + if (id != id2++) {
> + BUG_ON(id != 22 || id2 != 21);
> + id2 = 23;
> + }
> + }
> for (i = 0; i < 3; i++) {
> assert(ida_pre_get(&ida, GFP_KERNEL));
> assert(!ida_get_new(&ida, &id));
>

2018-03-23 09:07:28

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

Hi Kirill,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on mmotm/master]
[also build test WARNING on v4.16-rc6 next-20180322]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Kirill-Tkhai/Improve-shrink_slab-scalability-old-complexity-was-O-n-2-new-is-O-n/20180323-052754
base: git://git.cmpxchg.org/linux-mmotm.git master
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:79:1: sparse: incorrect type in argument 3 (different base types) @@ expected unsigned long [unsigned] flags @@ got resunsigned long [unsigned] flags @@
include/trace/events/vmscan.h:79:1: expected unsigned long [unsigned] flags
include/trace/events/vmscan.h:79:1: got restricted gfp_t [usertype] gfp_flags
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:106:1: sparse: incorrect type in argument 3 (different base types) @@ expected unsigned long [unsigned] flags @@ got resunsigned long [unsigned] flags @@
include/trace/events/vmscan.h:106:1: expected unsigned long [unsigned] flags
include/trace/events/vmscan.h:106:1: got restricted gfp_t [usertype] gfp_flags
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: cast from restricted gfp_t
include/trace/events/vmscan.h:196:1: sparse: too many warnings
>> mm/vmscan.c:231:15: sparse: incompatible types in conditional expression (different address spaces)
>> mm/vmscan.c:231:15: sparse: cast from unknown type
>> mm/vmscan.c:231:15: sparse: incompatible types in conditional expression (different address spaces)
>> mm/vmscan.c:231:15: sparse: incompatible types in conditional expression (different address spaces)
>> mm/vmscan.c:231:15: sparse: cast from unknown type

vim +231 mm/vmscan.c

205
206 static int memcg_expand_maps(struct mem_cgroup *memcg, int size, int old_size)
207 {
208 struct shrinkers_map *new, *old;
209 int i;
210
211 new = kvmalloc(sizeof(*new) + nr_node_ids * sizeof(new->map[0]),
212 GFP_KERNEL);
213 if (!new)
214 return -ENOMEM;
215
216 for (i = 0; i < nr_node_ids; i++) {
217 new->map[i] = kvmalloc_node(size, GFP_KERNEL, i);
218 if (!new->map[i]) {
219 while (--i >= 0)
220 kvfree(new->map[i]);
221 kvfree(new);
222 return -ENOMEM;
223 }
224
225 /* Set all old bits, clear all new bits */
226 memset(new->map[i], (int)0xff, old_size);
227 memset((void *)new->map[i] + old_size, 0, size - old_size);
228 }
229
230 lockdep_assert_held(&bitmap_rwsem);
> 231 old = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
232
233 /*
234 * We don't want to use rcu_read_lock() in shrink_slab().
235 * Since expansion happens rare, we may just take the lock
236 * here.
237 */
238 if (old)
239 down_write(&shrinker_rwsem);
240
241 if (memcg)
242 rcu_assign_pointer(memcg->shrinkers_map, new);
243 else
244 rcu_assign_pointer(root_shrinkers_map, new);
245
246 if (old) {
247 up_write(&shrinker_rwsem);
248 call_rcu(&old->rcu, kvfree_map_rcu);
249 }
250
251 return 0;
252 }
253

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
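
One plausible reading of the new warning at mm/vmscan.c:231 — an assumption, not a confirmed diagnosis — is that the two arms of the SHRINKERS_MAP() ternary carry different sparse annotations, since the patch declares root_shrinkers_map as "*__rcu" rather than the conventional "__rcu *". A sketch of the annotation that keeps both operands in the same address space:

/* Sketch only: both pointers annotated identically, so the ternary type-checks. */
struct shrinkers_map __rcu *root_shrinkers_map;

#define SHRINKERS_MAP(memcg)                                            \
        (memcg == root_mem_cgroup || !memcg ?                           \
                root_shrinkers_map : memcg->shrinkers_map)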

2018-03-23 11:27:25

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On 23.03.2018 12:06, kbuild test robot wrote:
> Hi Kirill,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on mmotm/master]
> [also build test WARNING on v4.16-rc6 next-20180322]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Kirill-Tkhai/Improve-shrink_slab-scalability-old-complexity-was-O-n-2-new-is-O-n/20180323-052754
> base: git://git.cmpxchg.org/linux-mmotm.git master
> reproduce:
> # apt-get install sparse
> make ARCH=x86_64 allmodconfig
> make C=1 CF=-D__CHECK_ENDIAN__
>
>
> sparse warnings: (new ones prefixed by >>)
>
> [...]
>>> mm/vmscan.c:231:15: sparse: incompatible types in conditional expression (different address spaces)
>>> mm/vmscan.c:231:15: sparse: cast from unknown type
>>> mm/vmscan.c:231:15: sparse: incompatible types in conditional expression (different address spaces)
>>> mm/vmscan.c:231:15: sparse: incompatible types in conditional expression (different address spaces)
>>> mm/vmscan.c:231:15: sparse: cast from unknown type
>
> vim +231 mm/vmscan.c

Yeah, thanks for the report.

>
> 205
> 206 static int memcg_expand_maps(struct mem_cgroup *memcg, int size, int old_size)
> 207 {
> 208 struct shrinkers_map *new, *old;
> 209 int i;
> 210
> 211 new = kvmalloc(sizeof(*new) + nr_node_ids * sizeof(new->map[0]),
> 212 GFP_KERNEL);
> 213 if (!new)
> 214 return -ENOMEM;
> 215
> 216 for (i = 0; i < nr_node_ids; i++) {
> 217 new->map[i] = kvmalloc_node(size, GFP_KERNEL, i);
> 218 if (!new->map[i]) {
> 219 while (--i >= 0)
> 220 kvfree(new->map[i]);
> 221 kvfree(new);
> 222 return -ENOMEM;
> 223 }
> 224
> 225 /* Set all old bits, clear all new bits */
> 226 memset(new->map[i], (int)0xff, old_size);
> 227 memset((void *)new->map[i] + old_size, 0, size - old_size);
> 228 }
> 229
> 230 lockdep_assert_held(&bitmap_rwsem);
> > 231 old = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
> 232
> 233 /*
> 234 * We don't want to use rcu_read_lock() in shrink_slab().
> 235 * Since expansion happens rare, we may just take the lock
> 236 * here.
> 237 */
> 238 if (old)
> 239 down_write(&shrinker_rwsem);
> 240
> 241 if (memcg)
> 242 rcu_assign_pointer(memcg->shrinkers_map, new);
> 243 else
> 244 rcu_assign_pointer(root_shrinkers_map, new);
> 245
> 246 if (old) {
> 247 up_write(&shrinker_rwsem);
> 248 call_rcu(&old->rcu, kvfree_map_rcu);
> 249 }
> 250
> 251 return 0;
> 252 }
> 253
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation
>

2018-03-24 18:42:19

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

Hello Kirill,

I don't have any objections to the idea behind this patch set.
Well, at least I don't know how to better tackle the problem you
describe in the cover letter. Please, see below for my comments
regarding implementation details.

On Wed, Mar 21, 2018 at 04:21:17PM +0300, Kirill Tkhai wrote:
> The patch introduces shrinker::id number, which is used to enumerate
> memcg-aware shrinkers. The number start from 0, and the code tries
> to maintain it as small as possible.
>
> This will be used as to represent a memcg-aware shrinkers in memcg
> shrinkers map.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/shrinker.h | 1 +
> mm/vmscan.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 60 insertions(+)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index a3894918a436..738de8ef5246 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -66,6 +66,7 @@ struct shrinker {
>
> /* These are for internal use */
> struct list_head list;
> + int id;

This definition could definitely use a comment.

BTW shouldn't we ifdef it?

> /* objs pending delete, per node */
> atomic_long_t *nr_deferred;
> };
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8fcd9f8d7390..91b5120b924f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -159,6 +159,56 @@ unsigned long vm_total_pages;
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> +static DEFINE_IDA(bitmap_id_ida);
> +static DECLARE_RWSEM(bitmap_rwsem);

Can't we reuse shrinker_rwsem for protecting the ida?

> +static int bitmap_id_start;
> +
> +static int alloc_shrinker_id(struct shrinker *shrinker)
> +{
> + int id, ret;
> +
> + if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> + return 0;
> +retry:
> + ida_pre_get(&bitmap_id_ida, GFP_KERNEL);
> + down_write(&bitmap_rwsem);
> + ret = ida_get_new_above(&bitmap_id_ida, bitmap_id_start, &id);

AFAIK ida always allocates the smallest available id so you don't need
to keep track of bitmap_id_start.

> + if (!ret) {
> + shrinker->id = id;
> + bitmap_id_start = shrinker->id + 1;
> + }
> + up_write(&bitmap_rwsem);
> + if (ret == -EAGAIN)
> + goto retry;
> +
> + return ret;
> +}

2018-03-24 18:46:49

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: Maintain memcg-aware shrinkers in mcg_shrinkers array

On Wed, Mar 21, 2018 at 04:21:29PM +0300, Kirill Tkhai wrote:
> The patch introduces mcg_shrinkers array to keep memcg-aware
> shrinkers in order of their shrinker::id.
>
> This allows to access the shrinkers dirrectly by the id,
> without iteration over shrinker_list list.

Why don't you simply use idr instead of ida? With idr you wouldn't need
the array mapping shrinker id to shrinker ptr. AFAIU you need this
mapping to look up the shrinker by id in shrink_slab. The latter doesn't
seem to be a hot path so using idr there should be acceptable. Since we
already have shrinker_rwsem, which is taken for reading by shrink_slab,
we wouldn't even need any additional locking for it.
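
A minimal sketch of the idr-based variant being suggested here (hypothetical names, not from the series): the shrinker pointer is stored in the idr at registration time and looked up by id in shrink_slab(), with shrinker_rwsem providing the locking.

static DEFINE_IDR(shrinker_idr);        /* writes protected by shrinker_rwsem (sketch) */

static int register_memcg_shrinker_id(struct shrinker *shrinker)
{
        int id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);

        if (id < 0)
                return id;
        shrinker->id = id;
        return 0;
}

/* In shrink_slab(), with shrinker_rwsem held for reading: */
static struct shrinker *shrinker_by_id(int id)
{
        return idr_find(&shrinker_idr, id);
}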

2018-03-24 18:51:33

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 04/10] fs: Propagate shrinker::id to list_lru

On Wed, Mar 21, 2018 at 04:21:51PM +0300, Kirill Tkhai wrote:
> The patch adds list_lru::shrk_id field, and populates
> it by registered shrinker id.
>
> This will be used to set correct bit in memcg shrinkers
> map by lru code in next patches, after there appeared
> the first related to memcg element in list_lru.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> fs/super.c | 5 +++++
> include/linux/list_lru.h | 1 +
> mm/list_lru.c | 7 ++++++-
> mm/workingset.c | 3 +++
> 4 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 0660083427fa..1f3dc4eab409 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -521,6 +521,11 @@ struct super_block *sget_userns(struct file_system_type *type,
> if (err) {
> deactivate_locked_super(s);
> s = ERR_PTR(err);
> + } else {
> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> + s->s_dentry_lru.shrk_id = s->s_shrink.id;
> + s->s_inode_lru.shrk_id = s->s_shrink.id;
> +#endif

I don't really like the new member name. Let's call it shrink_id or
shrinker_id, shall we?

Also, I think we'd better pass shrink_id to list_lru_init rather than
setting it explicitly.
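
A hedged sketch of what that interface change might look like (hypothetical signature and field name; the final form would be up to the author):

/* Sketch: let the lru pick up the shrinker id at init time. */
static inline int list_lru_init_memcg_shrink(struct list_lru *lru,
                                             struct shrinker *shrink)
{
        int err = __list_lru_init(lru, true, NULL);

#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
        if (!err)
                lru->shrinker_id = shrink->id;  /* "shrk_id" renamed as proposed */
#endif
        return err;
}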

2018-03-24 19:26:52

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On Wed, Mar 21, 2018 at 04:21:40PM +0300, Kirill Tkhai wrote:
> Imagine a big node with many cpus, memory cgroups and containers.
> Let we have 200 containers, every container has 10 mounts,
> and 10 cgroups. All container tasks don't touch foreign
> containers mounts. If there is intensive pages write,
> and global reclaim happens, a writing task has to iterate
> over all memcgs to shrink slab, before it's able to go
> to shrink_page_list().
>
> Iteration over all the memcg slabs is very expensive:
> the task has to visit 200 * 10 = 2000 shrinkers
> for every memcg, and since there are 2000 memcgs,
> the total calls are 2000 * 2000 = 4000000.
>
> So, the shrinker makes 4 million do_shrink_slab() calls
> just to try to isolate SWAP_CLUSTER_MAX pages in one
> of the actively writing memcg via shrink_page_list().
> I've observed a node spending almost 100% in kernel,
> making useless iteration over already shrinked slab.
>
> This patch adds bitmap of memcg-aware shrinkers to memcg.
> The size of the bitmap depends on bitmap_nr_ids, and during
> memcg life it's maintained to be enough to fit bitmap_nr_ids
> shrinkers. Every bit in the map is related to corresponding
> shrinker id.
>
> Next patches will maintain set bit only for really charged
> memcg. This will allow shrink_slab() to increase its
> performance in significant way. See the last patch for
> the numbers.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/memcontrol.h | 20 ++++++++
> mm/memcontrol.c | 5 ++
> mm/vmscan.c | 117 ++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 142 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4525b4404a9e..ad88a9697fb9 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -151,6 +151,11 @@ struct mem_cgroup_thresholds {
> struct mem_cgroup_threshold_ary *spare;
> };
>
> +struct shrinkers_map {

IMO better call it mem_cgroup_shrinker_map.

> + struct rcu_head rcu;
> + unsigned long *map[0];
> +};
> +
> enum memcg_kmem_state {
> KMEM_NONE,
> KMEM_ALLOCATED,
> @@ -182,6 +187,9 @@ struct mem_cgroup {
> unsigned long low;
> unsigned long high;
>
> + /* Bitmap of shrinker ids suitable to call for this memcg */
> + struct shrinkers_map __rcu *shrinkers_map;
> +

We keep all per-node data in the mem_cgroup_per_node struct. I think this
bitmap should be defined there as well.

> /* Range enforcement for interrupt charges */
> struct work_struct high_work;
>

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3801ac1fcfbc..2324577c62dc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4476,6 +4476,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>
> + if (alloc_shrinker_maps(memcg))
> + return -ENOMEM;
> +

This needs a comment explaining why you can't allocate the map in
css_alloc, which seems to be a better place for it.

> /* Online state pins memcg ID, memcg ID pins CSS */
> atomic_set(&memcg->id.ref, 1);
> css_get(css);
> @@ -4487,6 +4490,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> struct mem_cgroup_event *event, *tmp;
>
> + free_shrinker_maps(memcg);
> +

AFAIU this can race with shrink_slab accessing the map, resulting in
use-after-free. IMO it would be safer to free the bitmap from css_free.

> /*
> * Unregister events and notify userspace.
> * Notify userspace about cgroup removing only after rmdir of cgroup
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 97ce4f342fab..9d1df5d90eca 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -165,6 +165,10 @@ static DECLARE_RWSEM(bitmap_rwsem);
> static int bitmap_id_start;
> static int bitmap_nr_ids;
> static struct shrinker **mcg_shrinkers;
> +struct shrinkers_map *__rcu root_shrinkers_map;

Why do you need root_shrinkers_map? AFAIR the root memory cgroup doesn't
have kernel memory accounting enabled.

> +
> +#define SHRINKERS_MAP(memcg) \
> + (memcg == root_mem_cgroup || !memcg ? root_shrinkers_map : memcg->shrinkers_map)
>
> static int expand_shrinkers_array(int old_nr, int nr)
> {
> @@ -188,6 +192,116 @@ static int expand_shrinkers_array(int old_nr, int nr)
> return 0;
> }
>
> +static void kvfree_map_rcu(struct rcu_head *head)
> +{

> +static int memcg_expand_maps(struct mem_cgroup *memcg, int size, int old_size)
> +{

> +int alloc_shrinker_maps(struct mem_cgroup *memcg)
> +{

> +void free_shrinker_maps(struct mem_cgroup *memcg)
> +{

> +static int expand_shrinker_maps(int old_id, int id)
> +{

All these functions should be defined in memcontrol.c

The only public function should be mem_cgroup_grow_shrinker_map (I'm not
insisting on the name), which reallocates the shrinker bitmap for each
cgroup so that it can accommodate the new shrinker id. To do that,
you'll probably need to keep track of the bitmap capacity in
memcontrol.c
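
A rough sketch of what that single public entry point might look like — an assumption built on the memcg_expand_maps() hunk quoted by the build robot, glossing over the root_mem_cgroup special case in the posted patch:

static int shrinker_map_capacity;       /* bits; updated under the id-allocation lock (sketch) */

int mem_cgroup_grow_shrinker_map(int new_id)
{
        int size, old_size, ret = 0;
        struct mem_cgroup *memcg;

        if (new_id < shrinker_map_capacity)
                return 0;

        size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
        old_size = shrinker_map_capacity / BITS_PER_LONG * sizeof(unsigned long);

        /* Reallocate the per-node bitmaps of every cgroup to the new size. */
        for_each_mem_cgroup(memcg) {
                ret = memcg_expand_maps(memcg, size, old_size);
                if (ret) {
                        mem_cgroup_iter_break(NULL, memcg);
                        break;
                }
        }
        if (!ret)
                shrinker_map_capacity = size * BITS_PER_LONG;
        return ret;
}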

2018-03-24 19:34:12

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 06/10] list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()

On Wed, Mar 21, 2018 at 04:22:10PM +0300, Kirill Tkhai wrote:
> This is just refactoring to allow next patches to have
> dst_memcg pointer in memcg_drain_list_lru_node().
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/list_lru.h | 2 +-
> mm/list_lru.c | 11 ++++++-----
> mm/memcontrol.c | 2 +-
> 3 files changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index ce1d010cd3fa..50cf8c61c609 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -66,7 +66,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
> #define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL)
>
> int memcg_update_all_list_lrus(int num_memcgs);
> -void memcg_drain_all_list_lrus(int src_idx, int dst_idx);
> +void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);

Please, for consistency pass the source cgroup as a pointer as well.

2018-03-24 19:46:59

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: Set bit in memcg shrinker bitmap on first list_lru item appearance

On Wed, Mar 21, 2018 at 04:22:40PM +0300, Kirill Tkhai wrote:
> Introduce set_shrinker_bit() function to set shrinker-related
> bit in memcg shrinker bitmap, and set the bit after the first
> item is added and in case of reparenting destroyed memcg's items.
>
> This will allow next patch to make shrinkers be called only,
> in case of they have charged objects at the moment, and
> to improve shrink_slab() performance.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/shrinker.h | 7 +++++++
> mm/list_lru.c | 22 ++++++++++++++++++++--
> mm/vmscan.c | 7 +++++++
> 3 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 738de8ef5246..24aeed1bc332 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -78,4 +78,11 @@ struct shrinker {
>
> extern __must_check int register_shrinker(struct shrinker *);
> extern void unregister_shrinker(struct shrinker *);
> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> +extern void set_shrinker_bit(struct mem_cgroup *, int, int);
> +#else
> +static inline void set_shrinker_bit(struct mem_cgroup *memcg, int node, int id)
> +{
> +}
> +#endif

IMO this function, as well as other shrinker bitmap manipulation
functions, should be defined in memcontrol.[hc] and have mem_cgroup_
prefix.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9d1df5d90eca..265cf069b470 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -378,6 +378,13 @@ static void del_shrinker(struct shrinker *shrinker)
> list_del(&shrinker->list);
> up_write(&shrinker_rwsem);
> }
> +
> +void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
> +{
> + struct shrinkers_map *map = SHRINKERS_MAP(memcg);
> +
> + set_bit(nr, map->map[nid]);
> +}

Shouldn't we use rcu_read_lock here? After all, the map can be
reallocated right from under our feet.
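
A minimal sketch of the RCU-protected variant being asked for here — an illustration, not the author's code; since the posted patch frees old maps via call_rcu(), a read-side critical section is what keeps the map alive across the set_bit():

void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
{
        struct shrinkers_map *map;

        rcu_read_lock();
        map = rcu_dereference(SHRINKERS_MAP(memcg));
        set_bit(nr, map->map[nid]);
        rcu_read_unlock();
}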

2018-03-24 20:14:03

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 09/10] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On Wed, Mar 21, 2018 at 04:22:51PM +0300, Kirill Tkhai wrote:
> Using the preparations made in previous patches, in case of memcg
> shrink, we may avoid shrinkers, which are not set in memcg's shrinkers
> bitmap. To do that, we separate iterations over memcg-aware and
> !memcg-aware shrinkers, and memcg-aware shrinkers are chosen
> via for_each_set_bit() from the bitmap. In case of big nodes,
> having many isolated environments, this gives significant
> performance growth. See next patch for the details.
>
> Note, that the patch does not respect to empty memcg shrinkers,
> since we never clear the bitmap bits after we set it once.
> Their shrinkers will be called again, with no shrinked objects
> as result. This functionality is provided by next patch.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> mm/vmscan.c | 54 +++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 41 insertions(+), 13 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 265cf069b470..e1fd16bc7a9b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -327,6 +327,8 @@ static int alloc_shrinker_id(struct shrinker *shrinker)
>
> if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> return 0;
> + BUG_ON(!(shrinker->flags & SHRINKER_NUMA_AWARE));
> +
> retry:
> ida_pre_get(&bitmap_id_ida, GFP_KERNEL);
> down_write(&bitmap_rwsem);
> @@ -366,7 +368,8 @@ static void add_shrinker(struct shrinker *shrinker)
> down_write(&shrinker_rwsem);
> if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> mcg_shrinkers[shrinker->id] = shrinker;
> - list_add_tail(&shrinker->list, &shrinker_list);
> + else
> + list_add_tail(&shrinker->list, &shrinker_list);

I don't think we should remove per-memcg shrinkers from the global
shrinker list - this is confusing. It won't be critical if we iterate
over all shrinkers on global reclaim, will it?

> up_write(&shrinker_rwsem);
> }
>
> @@ -701,6 +705,39 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> if (!down_read_trylock(&shrinker_rwsem))
> goto out;
>
> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> + if (!memcg_kmem_enabled() || memcg) {
> + struct shrinkers_map *map;
> + int i;
> +
> + map = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
> + if (map) {
> + for_each_set_bit(i, map->map[nid], bitmap_nr_ids) {
> + struct shrink_control sc = {
> + .gfp_mask = gfp_mask,
> + .nid = nid,
> + .memcg = memcg,
> + };
> +
> + shrinker = mcg_shrinkers[i];
> + if (!shrinker) {
> + clear_bit(i, map->map[nid]);
> + continue;
> + }
> + freed += do_shrink_slab(&sc, shrinker, priority);
> +
> + if (rwsem_is_contended(&shrinker_rwsem)) {
> + freed = freed ? : 1;
> + goto unlock;
> + }
> + }
> + }
> +
> + if (memcg_kmem_enabled() && memcg)
> + goto unlock;

Maybe factor this out into a separate function, say shrink_slab_memcg?
Just for the sake of code legibility.
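
A sketch of what such a helper might look like, assembled from the hunk quoted above (names follow the posted patch; this is an illustration of the suggested split, not the final code):

static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
                                       struct mem_cgroup *memcg, int priority)
{
        struct shrinkers_map *map;
        unsigned long freed = 0;
        int i;

        map = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
        if (!map)
                return 0;

        for_each_set_bit(i, map->map[nid], bitmap_nr_ids) {
                struct shrink_control sc = {
                        .gfp_mask = gfp_mask,
                        .nid = nid,
                        .memcg = memcg,
                };
                struct shrinker *shrinker = mcg_shrinkers[i];

                if (!shrinker) {
                        clear_bit(i, map->map[nid]);
                        continue;
                }
                freed += do_shrink_slab(&sc, shrinker, priority);

                /* Bail out if someone is waiting to register a new shrinker. */
                if (rwsem_is_contended(&shrinker_rwsem))
                        return freed ? freed : 1;
        }
        return freed;
}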

2018-03-24 20:34:33

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 10/10] mm: Clear shrinker bit if there are no objects related to memcg

On Wed, Mar 21, 2018 at 04:23:01PM +0300, Kirill Tkhai wrote:
> To avoid further unneed calls of do_shrink_slab()
> for shrinkers, which already do not have any charged
> objects in a memcg, their bits have to be cleared.
>
> This patch introduces new return value SHRINK_EMPTY,
> which will be used in case of there is no charged
> objects in shrinker. We can't use 0 instead of that,
> as a shrinker may return 0, when it has very small
> amount of objects.
>
> To prevent race with parallel list lru add, we call
> do_shrink_slab() once again, after the bit is cleared.
> So, if there is a new object, we never miss it, and
> the bit will be restored again.
>
> The below test shows significant performance growths
> after using the patchset:
>
> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
> $mkdir /sys/fs/cgroup/memory/ct
> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
> $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done
>
> Then 4 drop_caches:
> $time echo 3 > /proc/sys/vm/drop_caches
>
> Times of drop_caches:
>
> *Before (4 iterations)*
> 0.00user 6.80system 0:06.82elapsed 99%CPU
> 0.00user 4.61system 0:04.62elapsed 99%CPU
> 0.00user 4.61system 0:04.61elapsed 99%CPU
> 0.00user 4.61system 0:04.61elapsed 99%CPU
>
> *After (4 iterations)*
> 0.00user 0.93system 0:00.94elapsed 99%CPU
> 0.00user 0.00system 0:00.01elapsed 80%CPU
> 0.00user 0.00system 0:00.01elapsed 80%CPU
> 0.00user 0.00system 0:00.01elapsed 81%CPU
>
> 4.61s/0.01s = 461 times faster.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> fs/super.c | 3 +++
> include/linux/shrinker.h | 1 +
> mm/vmscan.c | 21 ++++++++++++++++++---
> mm/workingset.c | 3 +++
> 4 files changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 24aeed1bc332..b23180deb928 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -34,6 +34,7 @@ struct shrink_control {
> };
>
> #define SHRINK_STOP (~0UL)
> +#define SHRINK_EMPTY (~0UL - 1)

Please update the comment below accordingly.

> /*
> * A callback you can register to apply pressure to ageable caches.
> *
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e1fd16bc7a9b..1fc05e8bde04 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -387,6 +387,7 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
> {
> struct shrinkers_map *map = SHRINKERS_MAP(memcg);
>
> + smp_mb__before_atomic(); /* Pairs with mb in shrink_slab() */

I don't understand the purpose of this barrier. Please add a comment
explaining why you need it.

> set_bit(nr, map->map[nid]);
> }
> #else /* CONFIG_MEMCG && !CONFIG_SLOB */
> @@ -568,8 +569,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> long scanned = 0, next_deferred;
>
> freeable = shrinker->count_objects(shrinker, shrinkctl);
> - if (freeable == 0)
> - return 0;
> + if (freeable == 0 || freeable == SHRINK_EMPTY)
> + return freeable;
>
> /*
> * copy the current shrinker scan count into a local variable
> @@ -708,6 +709,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> if (!memcg_kmem_enabled() || memcg) {
> struct shrinkers_map *map;
> + unsigned long ret;
> int i;
>
> map = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
> @@ -724,7 +726,20 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> clear_bit(i, map->map[nid]);
> continue;
> }
> - freed += do_shrink_slab(&sc, shrinker, priority);
> + if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> + sc.nid = 0;

Hmm, if my memory serves me right, in the previous patch you added a BUG_ON
ensuring that a memcg-aware shrinker must also be numa-aware, while here you
still check for it. Please remove either the BUG_ON or this check.
Better to remove the BUG_ON, because a memcg-aware shrinker doesn't have to
be numa-aware.

> + ret = do_shrink_slab(&sc, shrinker, priority);
> + if (ret == SHRINK_EMPTY) {

do_shrink_slab() is also called for memcg-unaware shrinkers, so you should
probably handle SHRINK_EMPTY there as well.

> + clear_bit(i, map->map[nid]);
> + /* pairs with mb in set_shrinker_bit() */
> + smp_mb__after_atomic();
> + ret = do_shrink_slab(&sc, shrinker, priority);
> + if (ret == SHRINK_EMPTY)
> + ret = 0;
> + else
> + set_bit(i, map->map[nid]);

Well, that's definitely a tricky part and hence needs a good comment.

Anyway, it would be great if we could simplify this part somehow.

> + }
> + freed += ret;
>
> if (rwsem_is_contended(&shrinker_rwsem)) {
> freed = freed ? : 1;

2018-03-26 15:11:18

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

Hi, Vladimir,

thanks for your review!

On 24.03.2018 21:40, Vladimir Davydov wrote:
> Hello Kirill,
>
> I don't have any objections to the idea behind this patch set.
> Well, at least I don't know how to better tackle the problem you
> describe in the cover letter. Please, see below for my comments
> regarding implementation details.
>
> On Wed, Mar 21, 2018 at 04:21:17PM +0300, Kirill Tkhai wrote:
>> The patch introduces shrinker::id number, which is used to enumerate
>> memcg-aware shrinkers. The number start from 0, and the code tries
>> to maintain it as small as possible.
>>
>> This will be used as to represent a memcg-aware shrinkers in memcg
>> shrinkers map.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/shrinker.h | 1 +
>> mm/vmscan.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 60 insertions(+)
>>
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index a3894918a436..738de8ef5246 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -66,6 +66,7 @@ struct shrinker {
>>
>> /* These are for internal use */
>> struct list_head list;
>> + int id;
>
> This definition could definitely use a comment.
>
> BTW shouldn't we ifdef it?

Ok

>> /* objs pending delete, per node */
>> atomic_long_t *nr_deferred;
>> };
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 8fcd9f8d7390..91b5120b924f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -159,6 +159,56 @@ unsigned long vm_total_pages;
>> static LIST_HEAD(shrinker_list);
>> static DECLARE_RWSEM(shrinker_rwsem);
>>
>> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>> +static DEFINE_IDA(bitmap_id_ida);
>> +static DECLARE_RWSEM(bitmap_rwsem);
>
> Can't we reuse shrinker_rwsem for protecting the ida?

I don't think that would be better, since we allocate memory under this semaphore.
If we used shrinker_rwsem, we would have to allocate the memory with GFP_ATOMIC,
which does not seem good. Currently, the patchset takes shrinker_rwsem only for
a short time, just to assign the already allocated memory to the maps.

>> +static int bitmap_id_start;
>> +
>> +static int alloc_shrinker_id(struct shrinker *shrinker)
>> +{
>> + int id, ret;
>> +
>> + if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>> + return 0;
>> +retry:
>> + ida_pre_get(&bitmap_id_ida, GFP_KERNEL);
>> + down_write(&bitmap_rwsem);
>> + ret = ida_get_new_above(&bitmap_id_ida, bitmap_id_start, &id);
>
> AFAIK ida always allocates the smallest available id so you don't need
> to keep track of bitmap_id_start.

I saw mnt_alloc_group_id() does the same, which is the reason the additional
variable was used. Doesn't this give a hint to ida and make it find
a free id faster?

>> + if (!ret) {
>> + shrinker->id = id;
>> + bitmap_id_start = shrinker->id + 1;
>> + }
>> + up_write(&bitmap_rwsem);
>> + if (ret == -EAGAIN)
>> + goto retry;
>> +
>> + return ret;
>> +}

Thanks,
Kirill

2018-03-26 15:16:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

On Mon, Mar 26, 2018 at 06:09:35PM +0300, Kirill Tkhai wrote:
> > AFAIK ida always allocates the smallest available id so you don't need
> > to keep track of bitmap_id_start.
>
> I saw mnt_alloc_group_id() does the same, so this was the reason, the additional
> variable was used. Doesn't this gives a good advise to ida and makes it find
> a free id faster?

No, it doesn't help the IDA in the slightest. I have a patch in my
tree to delete that silliness from mnt_alloc_group_id(); just haven't
submitted it yet.


2018-03-26 15:22:25

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: Maintain memcg-aware shrinkers in mcg_shrinkers array

On 24.03.2018 21:45, Vladimir Davydov wrote:
> On Wed, Mar 21, 2018 at 04:21:29PM +0300, Kirill Tkhai wrote:
>> The patch introduces mcg_shrinkers array to keep memcg-aware
>> shrinkers in order of their shrinker::id.
>>
>> This allows to access the shrinkers dirrectly by the id,
>> without iteration over shrinker_list list.
>
> Why don't you simply use idr instead of ida? With idr you wouldn't need
> the array mapping shrinker id to shrinker ptr. AFAIU you need this
> mapping to look up the shrinker by id in shrink_slab. The latter doesn't
> seem to be a hot path so using idr there should be acceptable. Since we
> already have shrinker_rwsem, which is taken for reading by shrink_slab,
> we wouldn't even need any additional locking for it.

The reason is that ida may allocate memory, and since list_lru_add() can't fail,
we can't do that there. If we allocate all the ida memory at the time of
memcg creation (i.e., preallocate it), that is no different from what
the bitmap does.

Meanwhile, the bitmap has the advantage of being the simplest data structure
(while ida has some radix tree overhead).

Also, the bitmap does not require a lock: setting or clearing a bit is a single
atomic operation, and it scales better, if anything.

Kirill

2018-03-26 15:31:54

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 04/10] fs: Propagate shrinker::id to list_lru

On 24.03.2018 21:50, Vladimir Davydov wrote:
> On Wed, Mar 21, 2018 at 04:21:51PM +0300, Kirill Tkhai wrote:
>> The patch adds list_lru::shrk_id field, and populates
>> it by registered shrinker id.
>>
>> This will be used to set correct bit in memcg shrinkers
>> map by lru code in next patches, after there appeared
>> the first related to memcg element in list_lru.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> fs/super.c | 5 +++++
>> include/linux/list_lru.h | 1 +
>> mm/list_lru.c | 7 ++++++-
>> mm/workingset.c | 3 +++
>> 4 files changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/super.c b/fs/super.c
>> index 0660083427fa..1f3dc4eab409 100644
>> --- a/fs/super.c
>> +++ b/fs/super.c
>> @@ -521,6 +521,11 @@ struct super_block *sget_userns(struct file_system_type *type,
>> if (err) {
>> deactivate_locked_super(s);
>> s = ERR_PTR(err);
>> + } else {
>> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>> + s->s_dentry_lru.shrk_id = s->s_shrink.id;
>> + s->s_inode_lru.shrk_id = s->s_shrink.id;
>> +#endif
>
> I don't really like the new member name. Let's call it shrink_id or
> shrinker_id, shall we?
>
> Also, I think we'd better pass shrink_id to list_lru_init rather than
> setting it explicitly.

Ok, I'll think about this in v2.

Thanks,
Kirill

2018-03-26 15:32:14

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On 24.03.2018 22:25, Vladimir Davydov wrote:
> On Wed, Mar 21, 2018 at 04:21:40PM +0300, Kirill Tkhai wrote:
>> Imagine a big node with many cpus, memory cgroups and containers.
>> Let we have 200 containers, every container has 10 mounts,
>> and 10 cgroups. All container tasks don't touch foreign
>> containers mounts. If there is intensive pages write,
>> and global reclaim happens, a writing task has to iterate
>> over all memcgs to shrink slab, before it's able to go
>> to shrink_page_list().
>>
>> Iteration over all the memcg slabs is very expensive:
>> the task has to visit 200 * 10 = 2000 shrinkers
>> for every memcg, and since there are 2000 memcgs,
>> the total calls are 2000 * 2000 = 4000000.
>>
>> So, the shrinker makes 4 million do_shrink_slab() calls
>> just to try to isolate SWAP_CLUSTER_MAX pages in one
>> of the actively writing memcg via shrink_page_list().
>> I've observed a node spending almost 100% in kernel,
>> making useless iteration over already shrinked slab.
>>
>> This patch adds bitmap of memcg-aware shrinkers to memcg.
>> The size of the bitmap depends on bitmap_nr_ids, and during
>> memcg life it's maintained to be enough to fit bitmap_nr_ids
>> shrinkers. Every bit in the map is related to corresponding
>> shrinker id.
>>
>> Next patches will maintain set bit only for really charged
>> memcg. This will allow shrink_slab() to increase its
>> performance in significant way. See the last patch for
>> the numbers.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/memcontrol.h | 20 ++++++++
>> mm/memcontrol.c | 5 ++
>> mm/vmscan.c | 117 ++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 142 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 4525b4404a9e..ad88a9697fb9 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -151,6 +151,11 @@ struct mem_cgroup_thresholds {
>> struct mem_cgroup_threshold_ary *spare;
>> };
>>
>> +struct shrinkers_map {
>
> IMO better call it mem_cgroup_shrinker_map.
>
>> + struct rcu_head rcu;
>> + unsigned long *map[0];
>> +};
>> +
>> enum memcg_kmem_state {
>> KMEM_NONE,
>> KMEM_ALLOCATED,
>> @@ -182,6 +187,9 @@ struct mem_cgroup {
>> unsigned long low;
>> unsigned long high;
>>
>> + /* Bitmap of shrinker ids suitable to call for this memcg */
>> + struct shrinkers_map __rcu *shrinkers_map;
>> +
>
> We keep all per-node data in mem_cgroup_per_node struct. I think this
> bitmap should be defined there as well.

But then we'll have to have a struct rcu_head for every node to free the map
via RCU. That is the only reason I did it this way. But if you think it's not
a problem, I'll agree with you.
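
For the per-node variant, the layout would presumably look something like this
(illustrative only, the names are not from the posted series); the extra
rcu_head per node is the overhead mentioned above:

struct memcg_shrinker_map {
	struct rcu_head rcu;
	unsigned long map[];
};

struct mem_cgroup_per_node {
	/* ... existing fields ... */
	struct memcg_shrinker_map __rcu *shrinker_map;
};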

>> /* Range enforcement for interrupt charges */
>> struct work_struct high_work;
>>
>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 3801ac1fcfbc..2324577c62dc 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4476,6 +4476,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>> {
>> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>>
>> + if (alloc_shrinker_maps(memcg))
>> + return -ENOMEM;
>> +
>
> This needs a comment explaining why you can't allocate the map in
> css_alloc, which seems to be a better place for it.

I want to use for_each_mem_cgroup_tree(), which seems to require the memcg
to be online. Otherwise map expansion will skip such a memcg.
A comment is not a problem ;)

>> /* Online state pins memcg ID, memcg ID pins CSS */
>> atomic_set(&memcg->id.ref, 1);
>> css_get(css);
>> @@ -4487,6 +4490,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>> struct mem_cgroup_event *event, *tmp;
>>
>> + free_shrinker_maps(memcg);
>> +
>
> AFAIU this can race with shrink_slab accessing the map, resulting in
> use-after-free. IMO it would be safer to free the bitmap from css_free.

But doesn't shrink_slab() iterate only over online memcgs?

>> /*
>> * Unregister events and notify userspace.
>> * Notify userspace about cgroup removing only after rmdir of cgroup
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 97ce4f342fab..9d1df5d90eca 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -165,6 +165,10 @@ static DECLARE_RWSEM(bitmap_rwsem);
>> static int bitmap_id_start;
>> static int bitmap_nr_ids;
>> static struct shrinker **mcg_shrinkers;
>> +struct shrinkers_map *__rcu root_shrinkers_map;
>
> Why do you need root_shrinkers_map? AFAIR the root memory cgroup doesn't
> have kernel memory accounting enabled.

But we can charge the corresponding lru and iterate over it during global
reclaim, can't we?

struct list_lru_node {
	...
	/* global list, used for the root cgroup in cgroup aware lrus */
	struct list_lru_one	lru;
	...
};


>> +
>> +#define SHRINKERS_MAP(memcg) \
>> + (memcg == root_mem_cgroup || !memcg ? root_shrinkers_map : memcg->shrinkers_map)
>>
>> static int expand_shrinkers_array(int old_nr, int nr)
>> {
>> @@ -188,6 +192,116 @@ static int expand_shrinkers_array(int old_nr, int nr)
>> return 0;
>> }
>>
>> +static void kvfree_map_rcu(struct rcu_head *head)
>> +{
>
>> +static int memcg_expand_maps(struct mem_cgroup *memcg, int size, int old_size)
>> +{
>
>> +int alloc_shrinker_maps(struct mem_cgroup *memcg)
>> +{
>
>> +void free_shrinker_maps(struct mem_cgroup *memcg)
>> +{
>
>> +static int expand_shrinker_maps(int old_id, int id)
>> +{
>
> All these functions should be defined in memcontrol.c
>
> The only public function should be mem_cgroup_grow_shrinker_map (I'm not
> insisting on the name), which reallocates shrinker bitmap for each
> cgroups so that it can accommodate the new shrinker id. To do that,
> you'll probably need to keep track of the bitmap capacity in
> memcontrol.c

Ok, I will do, thanks.

Kirill

2018-03-26 15:32:27

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 06/10] list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()

On 24.03.2018 22:32, Vladimir Davydov wrote:
> On Wed, Mar 21, 2018 at 04:22:10PM +0300, Kirill Tkhai wrote:
>> This is just refactoring to allow next patches to have
>> dst_memcg pointer in memcg_drain_list_lru_node().
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/list_lru.h | 2 +-
>> mm/list_lru.c | 11 ++++++-----
>> mm/memcontrol.c | 2 +-
>> 3 files changed, 8 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index ce1d010cd3fa..50cf8c61c609 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -66,7 +66,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
>> #define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL)
>>
>> int memcg_update_all_list_lrus(int num_memcgs);
>> -void memcg_drain_all_list_lrus(int src_idx, int dst_idx);
>> +void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);
>
> Please, for consistency pass the source cgroup as a pointer as well.

Ok

2018-03-26 15:33:30

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: Set bit in memcg shrinker bitmap on first list_lru item apearance

On 24.03.2018 22:45, Vladimir Davydov wrote:
> On Wed, Mar 21, 2018 at 04:22:40PM +0300, Kirill Tkhai wrote:
>> Introduce set_shrinker_bit() function to set shrinker-related
>> bit in memcg shrinker bitmap, and set the bit after the first
>> item is added and in case of reparenting destroyed memcg's items.
>>
>> This will allow next patch to make shrinkers be called only,
>> in case of they have charged objects at the moment, and
>> to improve shrink_slab() performance.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/shrinker.h | 7 +++++++
>> mm/list_lru.c | 22 ++++++++++++++++++++--
>> mm/vmscan.c | 7 +++++++
>> 3 files changed, 34 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index 738de8ef5246..24aeed1bc332 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -78,4 +78,11 @@ struct shrinker {
>>
>> extern __must_check int register_shrinker(struct shrinker *);
>> extern void unregister_shrinker(struct shrinker *);
>> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>> +extern void set_shrinker_bit(struct mem_cgroup *, int, int);
>> +#else
>> +static inline void set_shrinker_bit(struct mem_cgroup *memcg, int node, int id)
>> +{
>> +}
>> +#endif
>
> IMO this function, as well as other shrinker bitmap manipulation
> functions, should be defined in memcontrol.[hc] and have mem_cgroup_
> prefix.
>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 9d1df5d90eca..265cf069b470 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -378,6 +378,13 @@ static void del_shrinker(struct shrinker *shrinker)
>> list_del(&shrinker->list);
>> up_write(&shrinker_rwsem);
>> }
>> +
>> +void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
>> +{
>> + struct shrinkers_map *map = SHRINKERS_MAP(memcg);
>> +
>> + set_bit(nr, map->map[nid]);
>> +}
>
> Shouldn't we use rcu_read_lock here? After all, the map can be
> reallocated right from under our feet.

We do have to do that! Thanks for pointing that out.
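
Roughly, the fixed version would look like this (a sketch, assuming the map is
freed through an RCU callback as in the posted series, so it stays valid for
the duration of the read-side section):

void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
{
	struct shrinkers_map *map;

	rcu_read_lock();
	map = rcu_dereference(SHRINKERS_MAP(memcg));
	set_bit(nr, map->map[nid]);
	rcu_read_unlock();
}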

Kirill

2018-03-26 15:35:17

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 09/10] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On 24.03.2018 23:11, Vladimir Davydov wrote:
> On Wed, Mar 21, 2018 at 04:22:51PM +0300, Kirill Tkhai wrote:
>> Using the preparations made in previous patches, in case of memcg
>> shrink, we may avoid shrinkers, which are not set in memcg's shrinkers
>> bitmap. To do that, we separate iterations over memcg-aware and
>> !memcg-aware shrinkers, and memcg-aware shrinkers are chosen
>> via for_each_set_bit() from the bitmap. In case of big nodes,
>> having many isolated environments, this gives significant
>> performance growth. See next patch for the details.
>>
>> Note, that the patch does not respect to empty memcg shrinkers,
>> since we never clear the bitmap bits after we set it once.
>> Their shrinkers will be called again, with no shrinked objects
>> as result. This functionality is provided by next patch.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> mm/vmscan.c | 54 +++++++++++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 41 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 265cf069b470..e1fd16bc7a9b 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -327,6 +327,8 @@ static int alloc_shrinker_id(struct shrinker *shrinker)
>>
>> if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>> return 0;
>> + BUG_ON(!(shrinker->flags & SHRINKER_NUMA_AWARE));
>> +
>> retry:
>> ida_pre_get(&bitmap_id_ida, GFP_KERNEL);
>> down_write(&bitmap_rwsem);
>> @@ -366,7 +368,8 @@ static void add_shrinker(struct shrinker *shrinker)
>> down_write(&shrinker_rwsem);
>> if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>> mcg_shrinkers[shrinker->id] = shrinker;
>> - list_add_tail(&shrinker->list, &shrinker_list);
>> + else
>> + list_add_tail(&shrinker->list, &shrinker_list);
>
> I don't think we should remove per-memcg shrinkers from the global
> shrinker list - this is confusing. It won't be critical if we iterate
> over all shrinkers on global reclaim, will it?

It depends on the total number of shrinkers. And these are excess actions we
have to do: accessing their memory flushes cpu caches to some extent. So, unless
there is a reason we really need them on the list, I'd remove them from it.

>> up_write(&shrinker_rwsem);
>> }
>>
>> @@ -701,6 +705,39 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>> if (!down_read_trylock(&shrinker_rwsem))
>> goto out;
>>
>> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>> + if (!memcg_kmem_enabled() || memcg) {
>> + struct shrinkers_map *map;
>> + int i;
>> +
>> + map = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
>> + if (map) {
>> + for_each_set_bit(i, map->map[nid], bitmap_nr_ids) {
>> + struct shrink_control sc = {
>> + .gfp_mask = gfp_mask,
>> + .nid = nid,
>> + .memcg = memcg,
>> + };
>> +
>> + shrinker = mcg_shrinkers[i];
>> + if (!shrinker) {
>> + clear_bit(i, map->map[nid]);
>> + continue;
>> + }
>> + freed += do_shrink_slab(&sc, shrinker, priority);
>> +
>> + if (rwsem_is_contended(&shrinker_rwsem)) {
>> + freed = freed ? : 1;
>> + goto unlock;
>> + }
>> + }
>> + }
>> +
>> + if (memcg_kmem_enabled() && memcg)
>> + goto unlock;
>
> May be, factor this out to a separate function, say shrink_slab_memcg?
> Just for the sake of code legibility.

Good idea, thanks.
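
Something like the following, based on the hunk quoted above (a sketch of the
split, not tested code):

static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
				       struct mem_cgroup *memcg, int priority)
{
	struct shrinkers_map *map;
	unsigned long freed = 0;
	int i;

	map = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
	if (!map)
		return 0;

	for_each_set_bit(i, map->map[nid], bitmap_nr_ids) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};
		struct shrinker *shrinker = mcg_shrinkers[i];

		/* Shrinker went away after the bit was set */
		if (!shrinker) {
			clear_bit(i, map->map[nid]);
			continue;
		}
		freed += do_shrink_slab(&sc, shrinker, priority);

		if (rwsem_is_contended(&shrinker_rwsem)) {
			freed = freed ? : 1;
			break;
		}
	}
	return freed;
}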

Kirill

2018-03-26 15:36:17

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: Maintain memcg-aware shrinkers in mcg_shrinkers array

On Mon, Mar 26, 2018 at 06:20:55PM +0300, Kirill Tkhai wrote:
> On 24.03.2018 21:45, Vladimir Davydov wrote:
> > Why don't you simply use idr instead of ida? With idr you wouldn't need
> > the array mapping shrinker id to shrinker ptr. AFAIU you need this
> > mapping to look up the shrinker by id in shrink_slab. The latter doesn't
> > seem to be a hot path so using idr there should be acceptable. Since we
> > already have shrinker_rwsem, which is taken for reading by shrink_slab,
> > we wouldn't even need any additional locking for it.
>
> The reason is ida may allocate memory, and since list_lru_add() can't fail,
> we can't do that there. If we allocate all the ida memory at the time of
> memcg creation (i.e., preallocate it), this is not different to the way
> the bitmap makes.
>
> While bitmap has the agvantage, since it's simplest data structure (while
> ida has some radix tree overhead).

That would be true if you never wanted to resize the bitmap, but of
course you do, so you have your own interactions with RCU to contend with.
So you have the overhead of the RCU head, and you have your own code to
handle resizing which may have subtle errors.


2018-03-26 15:38:50

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 10/10] mm: Clear shrinker bit if there are no objects related to memcg

On 24.03.2018 23:33, Vladimir Davydov wrote:
> On Wed, Mar 21, 2018 at 04:23:01PM +0300, Kirill Tkhai wrote:
>> To avoid further unneed calls of do_shrink_slab()
>> for shrinkers, which already do not have any charged
>> objects in a memcg, their bits have to be cleared.
>>
>> This patch introduces new return value SHRINK_EMPTY,
>> which will be used in case of there is no charged
>> objects in shrinker. We can't use 0 instead of that,
>> as a shrinker may return 0, when it has very small
>> amount of objects.
>>
>> To prevent race with parallel list lru add, we call
>> do_shrink_slab() once again, after the bit is cleared.
>> So, if there is a new object, we never miss it, and
>> the bit will be restored again.
>>
>> The below test shows significant performance growths
>> after using the patchset:
>>
>> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
>> $mkdir /sys/fs/cgroup/memory/ct
>> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
>> $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done
>>
>> Then 4 drop_caches:
>> $time echo 3 > /proc/sys/vm/drop_caches
>>
>> Times of drop_caches:
>>
>> *Before (4 iterations)*
>> 0.00user 6.80system 0:06.82elapsed 99%CPU
>> 0.00user 4.61system 0:04.62elapsed 99%CPU
>> 0.00user 4.61system 0:04.61elapsed 99%CPU
>> 0.00user 4.61system 0:04.61elapsed 99%CPU
>>
>> *After (4 iterations)*
>> 0.00user 0.93system 0:00.94elapsed 99%CPU
>> 0.00user 0.00system 0:00.01elapsed 80%CPU
>> 0.00user 0.00system 0:00.01elapsed 80%CPU
>> 0.00user 0.00system 0:00.01elapsed 81%CPU
>>
>> 4.61s/0.01s = 461 times faster.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> fs/super.c | 3 +++
>> include/linux/shrinker.h | 1 +
>> mm/vmscan.c | 21 ++++++++++++++++++---
>> mm/workingset.c | 3 +++
>> 4 files changed, 25 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index 24aeed1bc332..b23180deb928 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -34,6 +34,7 @@ struct shrink_control {
>> };
>>
>> #define SHRINK_STOP (~0UL)
>> +#define SHRINK_EMPTY (~0UL - 1)
>
> Please update the comment below accordingly.

Ok

>> /*
>> * A callback you can register to apply pressure to ageable caches.
>> *
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index e1fd16bc7a9b..1fc05e8bde04 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -387,6 +387,7 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
>> {
>> struct shrinkers_map *map = SHRINKERS_MAP(memcg);
>>
>> + smp_mb__before_atomic(); /* Pairs with mb in shrink_slab() */
>
> I don't understand the purpose of this barrier. Please add a comment
> explaining why you need it.

Ok

>> set_bit(nr, map->map[nid]);
>> }
>> #else /* CONFIG_MEMCG && !CONFIG_SLOB */
>> @@ -568,8 +569,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>> long scanned = 0, next_deferred;
>>
>> freeable = shrinker->count_objects(shrinker, shrinkctl);
>> - if (freeable == 0)
>> - return 0;
>> + if (freeable == 0 || freeable == SHRINK_EMPTY)
>> + return freeable;
>>
>> /*
>> * copy the current shrinker scan count into a local variable
>> @@ -708,6 +709,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>> if (!memcg_kmem_enabled() || memcg) {
>> struct shrinkers_map *map;
>> + unsigned long ret;
>> int i;
>>
>> map = rcu_dereference_protected(SHRINKERS_MAP(memcg), true);
>> @@ -724,7 +726,20 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>> clear_bit(i, map->map[nid]);
>> continue;
>> }
>> - freed += do_shrink_slab(&sc, shrinker, priority);
>> + if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>> + sc.nid = 0;
>
> Hmm, if my memory doesn't fail, in the previous patch you added a BUG_ON
> ensuring that a memcg-aware shrinker must also be numa-aware while here
> you still check it. Please remove the BUG_ON or remove this check.
> Better remove the BUG_ON, because a memcg-aware shrinker doesn't have to
> be numa-aware.

Right, we do not introduce new limitations, so we just need to remove the BUG_ON.
Will do in v2.

>
>> + ret = do_shrink_slab(&sc, shrinker, priority);
>> + if (ret == SHRINK_EMPTY) {
>
> do_shrink_slab() is also called for memcg-unaware shrinkers, you should
> probably handle SHRINK_EMPTY there as well.

Ok, it looks like we just need to return 0 instead of SHRINK_EMPTY in such cases.
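
I.e., roughly, for the memcg-unaware path (sketch):

		ret = do_shrink_slab(&sc, shrinker, priority);
		if (ret == SHRINK_EMPTY)
			ret = 0;	/* no bit to clear for !memcg-aware shrinkers */
		freed += ret;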

>> + clear_bit(i, map->map[nid]);
>> + /* pairs with mb in set_shrinker_bit() */
>> + smp_mb__after_atomic();
>> + ret = do_shrink_slab(&sc, shrinker, priority);
>> + if (ret == SHRINK_EMPTY)
>> + ret = 0;
>> + else
>> + set_bit(i, map->map[nid]);
>
> Well, that's definitely a tricky part and hence needs a good comment.
>
> Anyway, it would be great if we could simplify this part somehow.

Ok, I'll add some cleanup preparations for that.
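
For reference, the hunk above restated with the kind of comment being asked for
(my reading of the ordering, not the author's wording; no functional change
intended):

	if (ret == SHRINK_EMPTY) {
		clear_bit(i, map->map[nid]);
		/*
		 * Pairs with smp_mb__before_atomic() in set_shrinker_bit().
		 * An object may have been added between the count that
		 * returned SHRINK_EMPTY and the clear_bit() above. The
		 * barrier orders clear_bit() against the second count below,
		 * so either that count sees the new object and we restore
		 * the bit, or the adder's set_bit() is ordered after our
		 * clear_bit() and the bit ends up set anyway.
		 */
		smp_mb__after_atomic();
		ret = do_shrink_slab(&sc, shrinker, priority);
		if (ret == SHRINK_EMPTY)
			ret = 0;
		else
			set_bit(i, map->map[nid]);
	}
	freed += ret;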

>> + }
>> + freed += ret;
>>
>> if (rwsem_is_contended(&shrinker_rwsem)) {
>> freed = freed ? : 1;

Thanks,
Kirill

2018-03-26 15:40:00

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

On 26.03.2018 18:14, Matthew Wilcox wrote:
> On Mon, Mar 26, 2018 at 06:09:35PM +0300, Kirill Tkhai wrote:
>>> AFAIK ida always allocates the smallest available id so you don't need
>>> to keep track of bitmap_id_start.
>>
>> I saw mnt_alloc_group_id() does the same, so this was the reason, the additional
>> variable was used. Doesn't this gives a good advise to ida and makes it find
>> a free id faster?
>
> No, it doesn't help the IDA in the slightest. I have a patch in my
> tree to delete that silliness from mnt_alloc_group_id(); just haven't
> submitted it yet.

Ok, then I'll remove this trick.

Thanks,
Kirill

2018-03-27 09:16:23

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

On Mon, Mar 26, 2018 at 06:09:35PM +0300, Kirill Tkhai wrote:
> Hi, Vladimir,
>
> thanks for your review!
>
> On 24.03.2018 21:40, Vladimir Davydov wrote:
> > Hello Kirill,
> >
> > I don't have any objections to the idea behind this patch set.
> > Well, at least I don't know how to better tackle the problem you
> > describe in the cover letter. Please, see below for my comments
> > regarding implementation details.
> >
> > On Wed, Mar 21, 2018 at 04:21:17PM +0300, Kirill Tkhai wrote:
> >> The patch introduces shrinker::id number, which is used to enumerate
> >> memcg-aware shrinkers. The number start from 0, and the code tries
> >> to maintain it as small as possible.
> >>
> >> This will be used as to represent a memcg-aware shrinkers in memcg
> >> shrinkers map.
> >>
> >> Signed-off-by: Kirill Tkhai <[email protected]>
> >> ---
> >> include/linux/shrinker.h | 1 +
> >> mm/vmscan.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++
> >> 2 files changed, 60 insertions(+)
> >>
> >> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> >> index a3894918a436..738de8ef5246 100644
> >> --- a/include/linux/shrinker.h
> >> +++ b/include/linux/shrinker.h
> >> @@ -66,6 +66,7 @@ struct shrinker {
> >>
> >> /* These are for internal use */
> >> struct list_head list;
> >> + int id;
> >
> > This definition could definitely use a comment.
> >
> > BTW shouldn't we ifdef it?
>
> Ok
>
> >> /* objs pending delete, per node */
> >> atomic_long_t *nr_deferred;
> >> };
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 8fcd9f8d7390..91b5120b924f 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -159,6 +159,56 @@ unsigned long vm_total_pages;
> >> static LIST_HEAD(shrinker_list);
> >> static DECLARE_RWSEM(shrinker_rwsem);
> >>
> >> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> >> +static DEFINE_IDA(bitmap_id_ida);
> >> +static DECLARE_RWSEM(bitmap_rwsem);
> >
> > Can't we reuse shrinker_rwsem for protecting the ida?
>
> I think it won't be better, since we allocate memory under this semaphore.
> After we use shrinker_rwsem, we'll have to allocate the memory with GFP_ATOMIC,
> which does not seems good. Currently, the patchset makes shrinker_rwsem be taken
> for a small time, just to assign already allocated memory to maps.

AFAIR it's OK to sleep under an rwsem so GFP_ATOMIC wouldn't be
necessary. Anyway, we only need to allocate memory when we extend
shrinker bitmaps, which is rare. In fact, there can only be a limited
number of such calls, as we never shrink these bitmaps (which is fine
by me).

>
> >> +static int bitmap_id_start;
> >> +
> >> +static int alloc_shrinker_id(struct shrinker *shrinker)
> >> +{
> >> + int id, ret;
> >> +
> >> + if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> >> + return 0;
> >> +retry:
> >> + ida_pre_get(&bitmap_id_ida, GFP_KERNEL);
> >> + down_write(&bitmap_rwsem);
> >> + ret = ida_get_new_above(&bitmap_id_ida, bitmap_id_start, &id);
> >
> > AFAIK ida always allocates the smallest available id so you don't need
> > to keep track of bitmap_id_start.
>
> I saw mnt_alloc_group_id() does the same, so this was the reason, the additional
> variable was used. Doesn't this gives a good advise to ida and makes it find
> a free id faster?

As Matthew pointed out, this is rather pointless.

2018-03-27 09:20:53

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: Maintain memcg-aware shrinkers in mcg_shrinkers array

On Mon, Mar 26, 2018 at 06:20:55PM +0300, Kirill Tkhai wrote:
> On 24.03.2018 21:45, Vladimir Davydov wrote:
> > On Wed, Mar 21, 2018 at 04:21:29PM +0300, Kirill Tkhai wrote:
> >> The patch introduces mcg_shrinkers array to keep memcg-aware
> >> shrinkers in order of their shrinker::id.
> >>
> >> This allows to access the shrinkers dirrectly by the id,
> >> without iteration over shrinker_list list.
> >
> > Why don't you simply use idr instead of ida? With idr you wouldn't need
> > the array mapping shrinker id to shrinker ptr. AFAIU you need this
> > mapping to look up the shrinker by id in shrink_slab. The latter doesn't
> > seem to be a hot path so using idr there should be acceptable. Since we
> > already have shrinker_rwsem, which is taken for reading by shrink_slab,
> > we wouldn't even need any additional locking for it.
>
> The reason is ida may allocate memory, and since list_lru_add() can't fail,
> we can't do that there. If we allocate all the ida memory at the time of
> memcg creation (i.e., preallocate it), this is not different to the way
> the bitmap makes.
>
> While bitmap has the agvantage, since it's simplest data structure (while
> ida has some radix tree overhead).
>
> Also, bitmap does not require a lock, there is single atomic operation
> to set or clear a bit, and it scales better, when anything.

I didn't mean the per-memcg bitmaps - I think it's OK to use plain
arrays for them and reallocate them with the aid of RCU.

What I actually mean is the mapping shrink_id => shrinker. AFAIU it
isn't accessed from list_lru, it is only needed to look up a shrinker
by id from shrink_slab(). The latter is rather a slow path so I think
we can use an IDR for this mapping instead of IDA + plain array.
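
For illustration, a minimal sketch of that scheme (the function names are
placeholders, not code from the series): the idr both allocates the id and
remembers the shrinker pointer, so no separate mcg_shrinkers[] array is needed.

static DEFINE_IDR(shrinker_idr);

static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
	int id;

	down_write(&shrinker_rwsem);
	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
	up_write(&shrinker_rwsem);
	if (id < 0)
		return id;
	shrinker->id = id;
	return 0;
}

static void unregister_memcg_shrinker(struct shrinker *shrinker)
{
	down_write(&shrinker_rwsem);
	idr_remove(&shrinker_idr, shrinker->id);
	up_write(&shrinker_rwsem);
}

/* Lookup in shrink_slab(), under down_read(&shrinker_rwsem): */
static struct shrinker *lookup_memcg_shrinker(int id)
{
	return idr_find(&shrinker_idr, id);
}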

2018-03-27 10:02:08

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On Mon, Mar 26, 2018 at 06:29:05PM +0300, Kirill Tkhai wrote:
> >> @@ -182,6 +187,9 @@ struct mem_cgroup {
> >> unsigned long low;
> >> unsigned long high;
> >>
> >> + /* Bitmap of shrinker ids suitable to call for this memcg */
> >> + struct shrinkers_map __rcu *shrinkers_map;
> >> +
> >
> > We keep all per-node data in mem_cgroup_per_node struct. I think this
> > bitmap should be defined there as well.
>
> But them we'll have to have struct rcu_head for every node to free the map
> via rcu. This is the only reason I did that. But if you think it's not a problem,
> I'll agree with you.

I think it's OK. It'd be consistent with how list_lru handles
list_lru_memcg reallocations.

> >> @@ -4487,6 +4490,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> >> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> >> struct mem_cgroup_event *event, *tmp;
> >>
> >> + free_shrinker_maps(memcg);
> >> +
> >
> > AFAIU this can race with shrink_slab accessing the map, resulting in
> > use-after-free. IMO it would be safer to free the bitmap from css_free.
>
> But doesn't shrink_slab() iterate only online memcg?

Well, yes, shrink_slab() bails out if the memcg is offline, but I
suspect there might be a race condition between shrink_slab and
css_offline when shrink_slab calls shrinkers for an offline cgroup.

>
> >> /*
> >> * Unregister events and notify userspace.
> >> * Notify userspace about cgroup removing only after rmdir of cgroup
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 97ce4f342fab..9d1df5d90eca 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -165,6 +165,10 @@ static DECLARE_RWSEM(bitmap_rwsem);
> >> static int bitmap_id_start;
> >> static int bitmap_nr_ids;
> >> static struct shrinker **mcg_shrinkers;
> >> +struct shrinkers_map *__rcu root_shrinkers_map;
> >
> > Why do you need root_shrinkers_map? AFAIR the root memory cgroup doesn't
> > have kernel memory accounting enabled.
> But we can charge the corresponding lru and iterate it over global reclaim,
> don't we?

Yes, I guess you're right. But do we need to care about it? Would it be
OK if we iterated over all shrinkers for the root cgroup? Dunno...

Anyway, please try to handle the root cgroup consistently with other
cgroups. I mean, nothing like this root_shrinkers_map should exist.
It should be either a part of root_mem_cgroup or we should iterate over
all shrinkers for the root cgroup.

>
> struct list_lru_node {
> ...
> /* global list, used for the root cgroup in cgroup aware lrus */
> struct list_lru_one lru;
> ...
> };

2018-03-27 15:11:55

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

On 27.03.2018 12:15, Vladimir Davydov wrote:
> On Mon, Mar 26, 2018 at 06:09:35PM +0300, Kirill Tkhai wrote:
>> Hi, Vladimir,
>>
>> thanks for your review!
>>
>> On 24.03.2018 21:40, Vladimir Davydov wrote:
>>> Hello Kirill,
>>>
>>> I don't have any objections to the idea behind this patch set.
>>> Well, at least I don't know how to better tackle the problem you
>>> describe in the cover letter. Please, see below for my comments
>>> regarding implementation details.
>>>
>>> On Wed, Mar 21, 2018 at 04:21:17PM +0300, Kirill Tkhai wrote:
>>>> The patch introduces shrinker::id number, which is used to enumerate
>>>> memcg-aware shrinkers. The number start from 0, and the code tries
>>>> to maintain it as small as possible.
>>>>
>>>> This will be used as to represent a memcg-aware shrinkers in memcg
>>>> shrinkers map.
>>>>
>>>> Signed-off-by: Kirill Tkhai <[email protected]>
>>>> ---
>>>> include/linux/shrinker.h | 1 +
>>>> mm/vmscan.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 60 insertions(+)
>>>>
>>>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>>>> index a3894918a436..738de8ef5246 100644
>>>> --- a/include/linux/shrinker.h
>>>> +++ b/include/linux/shrinker.h
>>>> @@ -66,6 +66,7 @@ struct shrinker {
>>>>
>>>> /* These are for internal use */
>>>> struct list_head list;
>>>> + int id;
>>>
>>> This definition could definitely use a comment.
>>>
>>> BTW shouldn't we ifdef it?
>>
>> Ok
>>
>>>> /* objs pending delete, per node */
>>>> atomic_long_t *nr_deferred;
>>>> };
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 8fcd9f8d7390..91b5120b924f 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -159,6 +159,56 @@ unsigned long vm_total_pages;
>>>> static LIST_HEAD(shrinker_list);
>>>> static DECLARE_RWSEM(shrinker_rwsem);
>>>>
>>>> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>>>> +static DEFINE_IDA(bitmap_id_ida);
>>>> +static DECLARE_RWSEM(bitmap_rwsem);
>>>
>>> Can't we reuse shrinker_rwsem for protecting the ida?
>>
>> I think it won't be better, since we allocate memory under this semaphore.
>> After we use shrinker_rwsem, we'll have to allocate the memory with GFP_ATOMIC,
>> which does not seems good. Currently, the patchset makes shrinker_rwsem be taken
>> for a small time, just to assign already allocated memory to maps.
>
> AFAIR it's OK to sleep under an rwsem so GFP_ATOMIC wouldn't be
> necessary. Anyway, we only need to allocate memory when we extend
> shrinker bitmaps, which is rare. In fact, there can only be a limited
> number of such calls, as we never shrink these bitmaps (which is fine
> by me).

We take bitmap_rwsem for writing to expand the shrinker maps. If we replace
it with shrinker_rwsem and the memory allocation gets into reclaim, there
will be a deadlock.

>>
>>>> +static int bitmap_id_start;
>>>> +
>>>> +static int alloc_shrinker_id(struct shrinker *shrinker)
>>>> +{
>>>> + int id, ret;
>>>> +
>>>> + if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>>>> + return 0;
>>>> +retry:
>>>> + ida_pre_get(&bitmap_id_ida, GFP_KERNEL);
>>>> + down_write(&bitmap_rwsem);
>>>> + ret = ida_get_new_above(&bitmap_id_ida, bitmap_id_start, &id);
>>>
>>> AFAIK ida always allocates the smallest available id so you don't need
>>> to keep track of bitmap_id_start.
>>
>> I saw mnt_alloc_group_id() does the same, so this was the reason, the additional
>> variable was used. Doesn't this gives a good advise to ida and makes it find
>> a free id faster?
>
> As Matthew pointed out, this is rather pointless.

Kirill

2018-03-27 15:20:26

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: Assign memcg-aware shrinkers bitmap to memcg

On 27.03.2018 13:00, Vladimir Davydov wrote:
> On Mon, Mar 26, 2018 at 06:29:05PM +0300, Kirill Tkhai wrote:
>>>> @@ -182,6 +187,9 @@ struct mem_cgroup {
>>>> unsigned long low;
>>>> unsigned long high;
>>>>
>>>> + /* Bitmap of shrinker ids suitable to call for this memcg */
>>>> + struct shrinkers_map __rcu *shrinkers_map;
>>>> +
>>>
>>> We keep all per-node data in mem_cgroup_per_node struct. I think this
>>> bitmap should be defined there as well.
>>
>> But them we'll have to have struct rcu_head for every node to free the map
>> via rcu. This is the only reason I did that. But if you think it's not a problem,
>> I'll agree with you.
>
> I think it's OK. It'd be consistent with how list_lru handles
> list_lru_memcg reallocations.
>
>>>> @@ -4487,6 +4490,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>>>> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>>>> struct mem_cgroup_event *event, *tmp;
>>>>
>>>> + free_shrinker_maps(memcg);
>>>> +
>>>
>>> AFAIU this can race with shrink_slab accessing the map, resulting in
>>> use-after-free. IMO it would be safer to free the bitmap from css_free.
>>
>> But doesn't shrink_slab() iterate only online memcg?
>
> Well, yes, shrink_slab() bails out if the memcg is offline, but I
> suspect there might be a race condition between shrink_slab and
> css_offline when shrink_slab calls shrinkers for an offline cgroup.
>
>>
>>>> /*
>>>> * Unregister events and notify userspace.
>>>> * Notify userspace about cgroup removing only after rmdir of cgroup
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 97ce4f342fab..9d1df5d90eca 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -165,6 +165,10 @@ static DECLARE_RWSEM(bitmap_rwsem);
>>>> static int bitmap_id_start;
>>>> static int bitmap_nr_ids;
>>>> static struct shrinker **mcg_shrinkers;
>>>> +struct shrinkers_map *__rcu root_shrinkers_map;
>>>
>>> Why do you need root_shrinkers_map? AFAIR the root memory cgroup doesn't
>>> have kernel memory accounting enabled.
>> But we can charge the corresponding lru and iterate it over global reclaim,
>> don't we?
>
> Yes, I guess you're right. But do we need to care about it? Would it be
> OK if we iterated over all shrinkers for the root cgroup? Dunno...

In the case of 2000 shrinkers, this would flush the caches. That's the reason :)

> Anyway, please try to handle the root cgroup consistently with other
> cgroups. I mean, nothing like this root_shrinkers_map should exist.
> It should be either a part of root_mem_cgroup or we should iterate over
> all shrinkers for the root cgroup.

That's not possible: root_mem_cgroup does not always exist. Even if CONFIG_MEMCG
is enabled, memcg may be disabled by boot parameters.

And even when it's not disabled, there are shrinkers which are registered
before it's initialized, while memory_cgrp_subsys can't have .early_init = 1.

>>
>> struct list_lru_node {
>> ...
>> /* global list, used for the root cgroup in cgroup aware lrus */
>> struct list_lru_one lru;
>> ...
>> };

Kirill

2018-03-27 15:32:00

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: Maintain memcg-aware shrinkers in mcg_shrinkers array

On 27.03.2018 12:18, Vladimir Davydov wrote:
> On Mon, Mar 26, 2018 at 06:20:55PM +0300, Kirill Tkhai wrote:
>> On 24.03.2018 21:45, Vladimir Davydov wrote:
>>> On Wed, Mar 21, 2018 at 04:21:29PM +0300, Kirill Tkhai wrote:
>>>> The patch introduces mcg_shrinkers array to keep memcg-aware
>>>> shrinkers in order of their shrinker::id.
>>>>
>>>> This allows to access the shrinkers dirrectly by the id,
>>>> without iteration over shrinker_list list.
>>>
>>> Why don't you simply use idr instead of ida? With idr you wouldn't need
>>> the array mapping shrinker id to shrinker ptr. AFAIU you need this
>>> mapping to look up the shrinker by id in shrink_slab. The latter doesn't
>>> seem to be a hot path so using idr there should be acceptable. Since we
>>> already have shrinker_rwsem, which is taken for reading by shrink_slab,
>>> we wouldn't even need any additional locking for it.
>>
>> The reason is ida may allocate memory, and since list_lru_add() can't fail,
>> we can't do that there. If we allocate all the ida memory at the time of
>> memcg creation (i.e., preallocate it), this is not different to the way
>> the bitmap makes.
>>
>> While bitmap has the agvantage, since it's simplest data structure (while
>> ida has some radix tree overhead).
>>
>> Also, bitmap does not require a lock, there is single atomic operation
>> to set or clear a bit, and it scales better, when anything.
>
> I didn't mean the per-memcg bitmaps - I think it's OK to use plain
> arrays for them and reallocate them with the aid of RCU.
>
> What I actually mean is the mapping shrink_id => shrinker. AFAIU it
> isn't accessed from list_lru, it is only needed to look up a shrinker
> by id from shrink_slab(). The latter is rather a slow path so I think
> we can use an IDR for this mapping instead of IDA + plain array.

This is a good idea.

Thanks,
Kirill

2018-03-27 15:49:54

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

On Tue, Mar 27, 2018 at 06:09:20PM +0300, Kirill Tkhai wrote:
> >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>> index 8fcd9f8d7390..91b5120b924f 100644
> >>>> --- a/mm/vmscan.c
> >>>> +++ b/mm/vmscan.c
> >>>> @@ -159,6 +159,56 @@ unsigned long vm_total_pages;
> >>>> static LIST_HEAD(shrinker_list);
> >>>> static DECLARE_RWSEM(shrinker_rwsem);
> >>>>
> >>>> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> >>>> +static DEFINE_IDA(bitmap_id_ida);
> >>>> +static DECLARE_RWSEM(bitmap_rwsem);
> >>>
> >>> Can't we reuse shrinker_rwsem for protecting the ida?
> >>
> >> I think it won't be better, since we allocate memory under this semaphore.
> >> After we use shrinker_rwsem, we'll have to allocate the memory with GFP_ATOMIC,
> >> which does not seems good. Currently, the patchset makes shrinker_rwsem be taken
> >> for a small time, just to assign already allocated memory to maps.
> >
> > AFAIR it's OK to sleep under an rwsem so GFP_ATOMIC wouldn't be
> > necessary. Anyway, we only need to allocate memory when we extend
> > shrinker bitmaps, which is rare. In fact, there can only be a limited
> > number of such calls, as we never shrink these bitmaps (which is fine
> > by me).
>
> We take bitmap_rwsem for writing to expand shrinkers maps. If we replace
> it with shrinker_rwsem and the memory allocation get into reclaim, there
> will be deadlock.

Hmm, AFAICS we use down_read_trylock() in shrink_slab() so no deadlock
would be possible. We wouldn't be able to reclaim slabs though, that's
true, but I don't think it would be a problem for small allocations.

That's how I see this. We use shrinker_rwsem to protect IDR mapping
shrink_id => shrinker (I still insist on IDR). It may allocate, but the
allocation size is going to be fairly small so it's OK that we don't
call shrinkers there. After we allocated a shrinker ID, we release
shrinker_rwsem and call mem_cgroup_grow_shrinker_map (or whatever it
will be called), which checks if per-memcg shrinker bitmaps need growing
and if they do it takes its own mutex used exclusively for protecting
the bitmaps and reallocates the bitmaps (we will need the mutex anyway
to synchronize css_online vs shrinker bitmap reallocation as the
shrinker_rwsem is private to vmscan.c and we don't want to export it
to memcontrol.c).
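
A rough sketch of the second half of that scheme, as I read it
(mem_cgroup_grow_shrinker_map, shrinker_map_mutex and
memcg_expand_one_shrinker_map are placeholder names; this would live in
memcontrol.c):

static DEFINE_MUTEX(shrinker_map_mutex);
static int shrinker_map_size;	/* current bitmap size in bytes */

int mem_cgroup_grow_shrinker_map(int new_id)
{
	int size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
	struct mem_cgroup *memcg;
	int ret = 0;

	mutex_lock(&shrinker_map_mutex);
	if (size <= shrinker_map_size)
		goto unlock;

	for_each_mem_cgroup(memcg) {
		/* Reallocate this memcg's per-node bitmaps to 'size' bytes */
		ret = memcg_expand_one_shrinker_map(memcg, size, shrinker_map_size);
		if (ret) {
			mem_cgroup_iter_break(NULL, memcg);
			goto unlock;
		}
	}
	shrinker_map_size = size;
unlock:
	mutex_unlock(&shrinker_map_mutex);
	return ret;
}

css_online would then take the same mutex before allocating maps of
shrinker_map_size bytes for the new cgroup, which gives the css_online vs.
reallocation synchronization mentioned at the end.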

2018-03-28 10:33:11

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

On 27.03.2018 18:48, Vladimir Davydov wrote:
> On Tue, Mar 27, 2018 at 06:09:20PM +0300, Kirill Tkhai wrote:
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index 8fcd9f8d7390..91b5120b924f 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -159,6 +159,56 @@ unsigned long vm_total_pages;
>>>>>> static LIST_HEAD(shrinker_list);
>>>>>> static DECLARE_RWSEM(shrinker_rwsem);
>>>>>>
>>>>>> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>>>>>> +static DEFINE_IDA(bitmap_id_ida);
>>>>>> +static DECLARE_RWSEM(bitmap_rwsem);
>>>>>
>>>>> Can't we reuse shrinker_rwsem for protecting the ida?
>>>>
>>>> I think it won't be better, since we allocate memory under this semaphore.
>>>> After we use shrinker_rwsem, we'll have to allocate the memory with GFP_ATOMIC,
>>>> which does not seems good. Currently, the patchset makes shrinker_rwsem be taken
>>>> for a small time, just to assign already allocated memory to maps.
>>>
>>> AFAIR it's OK to sleep under an rwsem so GFP_ATOMIC wouldn't be
>>> necessary. Anyway, we only need to allocate memory when we extend
>>> shrinker bitmaps, which is rare. In fact, there can only be a limited
>>> number of such calls, as we never shrink these bitmaps (which is fine
>>> by me).
>>
>> We take bitmap_rwsem for writing to expand shrinkers maps. If we replace
>> it with shrinker_rwsem and the memory allocation get into reclaim, there
>> will be deadlock.
>
> Hmm, AFAICS we use down_read_trylock() in shrink_slab() so no deadlock
> would be possible. We wouldn't be able to reclaim slabs though, that's
> true, but I don't think it would be a problem for small allocations.
>
> That's how I see this. We use shrinker_rwsem to protect IDR mapping
> shrink_id => shrinker (I still insist on IDR). It may allocate, but the
> allocation size is going to be fairly small so it's OK that we don't
> call shrinkers there. After we allocated a shrinker ID, we release
> shrinker_rwsem and call mem_cgroup_grow_shrinker_map (or whatever it
> will be called), which checks if per-memcg shrinker bitmaps need growing
> and if they do it takes its own mutex used exclusively for protecting
> the bitmaps and reallocates the bitmaps (we will need the mutex anyway
> to synchronize css_online vs shrinker bitmap reallocation as the
> shrinker_rwsem is private to vmscan.c and we don't want to export it
> to memcontrol.c).

But what is the benefit of prohibiting reclaim during shrinker id allocation?
Even with an IDR, it may still require a page, and may still have to go
into reclaim to get it. If we prohibit reclaim, we'll fail to register
the shrinker.

It's not a rare situation for all the memory to be occupied by page cache,
so in some situations we would fail to mount something.

What advantage do we gain that is more significant than this disadvantage?

Kirill

2018-03-28 11:05:04

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Assign id to every memcg-aware shrinker

On Wed, Mar 28, 2018 at 01:30:20PM +0300, Kirill Tkhai wrote:
> On 27.03.2018 18:48, Vladimir Davydov wrote:
> > On Tue, Mar 27, 2018 at 06:09:20PM +0300, Kirill Tkhai wrote:
> >>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>>> index 8fcd9f8d7390..91b5120b924f 100644
> >>>>>> --- a/mm/vmscan.c
> >>>>>> +++ b/mm/vmscan.c
> >>>>>> @@ -159,6 +159,56 @@ unsigned long vm_total_pages;
> >>>>>> static LIST_HEAD(shrinker_list);
> >>>>>> static DECLARE_RWSEM(shrinker_rwsem);
> >>>>>>
> >>>>>> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> >>>>>> +static DEFINE_IDA(bitmap_id_ida);
> >>>>>> +static DECLARE_RWSEM(bitmap_rwsem);
> >>>>>
> >>>>> Can't we reuse shrinker_rwsem for protecting the ida?
> >>>>
> >>>> I think it won't be better, since we allocate memory under this semaphore.
> >>>> After we use shrinker_rwsem, we'll have to allocate the memory with GFP_ATOMIC,
> >>>> which does not seems good. Currently, the patchset makes shrinker_rwsem be taken
> >>>> for a small time, just to assign already allocated memory to maps.
> >>>
> >>> AFAIR it's OK to sleep under an rwsem so GFP_ATOMIC wouldn't be
> >>> necessary. Anyway, we only need to allocate memory when we extend
> >>> shrinker bitmaps, which is rare. In fact, there can only be a limited
> >>> number of such calls, as we never shrink these bitmaps (which is fine
> >>> by me).
> >>
> >> We take bitmap_rwsem for writing to expand shrinkers maps. If we replace
> >> it with shrinker_rwsem and the memory allocation get into reclaim, there
> >> will be deadlock.
> >
> > Hmm, AFAICS we use down_read_trylock() in shrink_slab() so no deadlock
> > would be possible. We wouldn't be able to reclaim slabs though, that's
> > true, but I don't think it would be a problem for small allocations.
> >
> > That's how I see this. We use shrinker_rwsem to protect IDR mapping
> > shrink_id => shrinker (I still insist on IDR). It may allocate, but the
> > allocation size is going to be fairly small so it's OK that we don't
> > call shrinkers there. After we allocated a shrinker ID, we release
> > shrinker_rwsem and call mem_cgroup_grow_shrinker_map (or whatever it
> > will be called), which checks if per-memcg shrinker bitmaps need growing
> > and if they do it takes its own mutex used exclusively for protecting
> > the bitmaps and reallocates the bitmaps (we will need the mutex anyway
> > to synchronize css_online vs shrinker bitmap reallocation as the
> > shrinker_rwsem is private to vmscan.c and we don't want to export it
> > to memcontrol.c).
>
> But what the profit of prohibiting reclaim during shrinker id allocation?
> In case of this is a IDR, it still may require 1 page, and still may get
> in after fast reclaim. If we prohibit reclaim, we'll fail to register
> the shrinker.
>
> It's not a rare situation, when all the memory is occupied by page cache.

shrinker_rwsem doesn't block page cache reclaim, only dcache reclaim.
I don't think that dcache can occupy all available memory.

> So, we will fail to mount something in some situation.
>
> What the advantages do we have to be more significant, than this disadvantage?

The main advantage is code simplicity.

2018-03-28 14:51:20

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 06/10] list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()

On 26.03.2018 18:30, Kirill Tkhai wrote:
> On 24.03.2018 22:32, Vladimir Davydov wrote:
>> On Wed, Mar 21, 2018 at 04:22:10PM +0300, Kirill Tkhai wrote:
>>> This is just refactoring to allow next patches to have
>>> dst_memcg pointer in memcg_drain_list_lru_node().
>>>
>>> Signed-off-by: Kirill Tkhai <[email protected]>
>>> ---
>>> include/linux/list_lru.h | 2 +-
>>> mm/list_lru.c | 11 ++++++-----
>>> mm/memcontrol.c | 2 +-
>>> 3 files changed, 8 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>>> index ce1d010cd3fa..50cf8c61c609 100644
>>> --- a/include/linux/list_lru.h
>>> +++ b/include/linux/list_lru.h
>>> @@ -66,7 +66,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
>>> #define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL)
>>>
>>> int memcg_update_all_list_lrus(int num_memcgs);
>>> -void memcg_drain_all_list_lrus(int src_idx, int dst_idx);
>>> +void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);
>>
>> Please, for consistency pass the source cgroup as a pointer as well.
>
> Ok

Hm. But we call it from memcg_offline_kmem() after the cgroup's kmemcg_id has
already been set to the parent memcg's kmemcg_id:

	rcu_read_lock(); /* can be called from css_free w/o cgroup_mutex */
	css_for_each_descendant_pre(css, &memcg->css) {
		child = mem_cgroup_from_css(css);
		BUG_ON(child->kmemcg_id != kmemcg_id);
		child->kmemcg_id = parent->kmemcg_id;
		if (!memcg->use_hierarchy)
			break;
	}
	rcu_read_unlock();

	memcg_drain_all_list_lrus(kmemcg_id, parent);

It does not seem we can pass a source memcg pointer to memcg_drain_all_list_lrus()
here, or change the logic or ordering. What do you think?

Kirill

2018-04-02 03:21:40

by kernel test robot

[permalink] [raw]
Subject: [lkp-robot] [list_lru] 42658d54ce: BUG:unable_to_handle_kernel


FYI, we noticed the following commit (built with gcc-7):

commit: 42658d54ce4d9c25c8a286651c60cbc869f2f91e ("list_lru: Add memcg argument to list_lru_from_kmem()")
url: https://github.com/0day-ci/linux/commits/Kirill-Tkhai/Improve-shrink_slab-scalability-old-complexity-was-O-n-2-new-is-O-n/20180323-052754
base: git://git.cmpxchg.org/linux-mmotm.git master

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu Haswell,+smep,+smap -smp 2 -m 512M

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+------------------------------------------+------------+------------+
| | 7f23acedf7 | 42658d54ce |
+------------------------------------------+------------+------------+
| boot_successes | 10 | 0 |
| boot_failures | 0 | 19 |
| BUG:unable_to_handle_kernel | 0 | 19 |
| Oops:#[##] | 0 | 19 |
| RIP:list_lru_add | 0 | 19 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 19 |
+------------------------------------------+------------+------------+



[ 465.702558] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 465.721123] PGD 800000001740e067 P4D 800000001740e067 PUD 1740f067 PMD 0
[ 465.737033] Oops: 0002 [#1] PTI
[ 465.744456] CPU: 0 PID: 163 Comm: rc.local Not tainted 4.16.0-rc5-mm1-00298-g42658d5 #1
[ 465.760374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 465.773920] RIP: 0010:list_lru_add+0x1e/0x70
[ 465.780850] RSP: 0018:ffffc90001f0fe18 EFLAGS: 00010246
[ 465.791551] RAX: ffff88001a1f16f0 RBX: ffff88000001a508 RCX: 0000000000000001
[ 465.806362] RDX: 0000000000000000 RSI: ffff88000001a520 RDI: 0000000000000246
[ 465.821265] RBP: ffffc90001f0fe28 R08: 0000000000000000 R09: 0000000000000001
[ 465.836206] R10: ffffc90001f0fd88 R11: 0000000000000001 R12: ffff88001a1f16f0
[ 465.850706] R13: ffffffff82c213d8 R14: ffffffff811aa4d5 R15: ffff88001a1f15f0
[ 465.865285] FS: 00007fccddf45700(0000) GS:ffffffff82e3f000(0000) knlGS:0000000000000000
[ 465.881839] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 465.893412] CR2: 0000000000000000 CR3: 0000000017868004 CR4: 00000000000206f0
[ 465.907019] Call Trace:
[ 465.911369] d_lru_add+0x37/0x40
[ 465.917037] dput+0x181/0x1d0
[ 465.922225] __fput+0x1a7/0x1c0
[ 465.928052] ____fput+0x9/0x10
[ 465.933601] task_work_run+0x84/0xc0
[ 465.940404] exit_to_usermode_loop+0x4e/0x80
[ 465.948013] do_syscall_64+0x179/0x190
[ 465.954817] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 465.963476] RIP: 0033:0x7fccdd625040
[ 465.969435] RSP: 002b:00007fff689ff258 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 465.981362] RAX: 0000000000000000 RBX: 00000000006f1c08 RCX: 00007fccdd625040
[ 465.993162] RDX: 00000000fbada408 RSI: 0000000000000001 RDI: 0000000000000003
[ 466.005567] RBP: 0000000000000000 R08: 0000000000000078 R09: 0000000001000000
[ 466.018316] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
[ 466.030899] R13: 000000000046e150 R14: 00000000000004f0 R15: 0000000000000001
[ 466.043602] Code: c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 54 53 48 8b 1f 49 89 f4 48 89 df e8 db f6 f5 00 49 8b 04 24 49 39 c4 75 3d <48> c7 04 25 00 00 00 00 00 00 00 00 48 8b 43 50 48 8d 53 48 4c
[ 466.077316] RIP: list_lru_add+0x1e/0x70 RSP: ffffc90001f0fe18
[ 466.087943] CR2: 0000000000000000
[ 466.094137] ---[ end trace aeec590ab6dccbb2 ]---


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Xiaolong


Attachments:
(No filename) (3.84 kB)
config-4.16.0-rc5-mm1-00298-g42658d5 (121.77 kB)
job-script (4.03 kB)
dmesg.xz (14.63 kB)

2018-04-02 08:53:21

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [lkp-robot] [list_lru] 42658d54ce: BUG:unable_to_handle_kernel

Hi, Xiaolong,

thanks for reporting this.

I'll make the needed changes in v2.

Kirill

On 02.04.2018 06:17, kernel test robot wrote:
>
> FYI, we noticed the following commit (built with gcc-7):
>
> commit: 42658d54ce4d9c25c8a286651c60cbc869f2f91e ("list_lru: Add memcg argument to list_lru_from_kmem()")
> url: https://github.com/0day-ci/linux/commits/Kirill-Tkhai/Improve-shrink_slab-scalability-old-complexity-was-O-n-2-new-is-O-n/20180323-052754
> base: git://git.cmpxchg.org/linux-mmotm.git master
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu Haswell,+smep,+smap -smp 2 -m 512M
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> +------------------------------------------+------------+------------+
> | | 7f23acedf7 | 42658d54ce |
> +------------------------------------------+------------+------------+
> | boot_successes | 10 | 0 |
> | boot_failures | 0 | 19 |
> | BUG:unable_to_handle_kernel | 0 | 19 |
> | Oops:#[##] | 0 | 19 |
> | RIP:list_lru_add | 0 | 19 |
> | Kernel_panic-not_syncing:Fatal_exception | 0 | 19 |
> +------------------------------------------+------------+------------+
>
>
>
> [ 465.702558] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [ 465.721123] PGD 800000001740e067 P4D 800000001740e067 PUD 1740f067 PMD 0
> [ 465.737033] Oops: 0002 [#1] PTI
> [ 465.744456] CPU: 0 PID: 163 Comm: rc.local Not tainted 4.16.0-rc5-mm1-00298-g42658d5 #1
> [ 465.760374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [ 465.773920] RIP: 0010:list_lru_add+0x1e/0x70
> [ 465.780850] RSP: 0018:ffffc90001f0fe18 EFLAGS: 00010246
> [ 465.791551] RAX: ffff88001a1f16f0 RBX: ffff88000001a508 RCX: 0000000000000001
> [ 465.806362] RDX: 0000000000000000 RSI: ffff88000001a520 RDI: 0000000000000246
> [ 465.821265] RBP: ffffc90001f0fe28 R08: 0000000000000000 R09: 0000000000000001
> [ 465.836206] R10: ffffc90001f0fd88 R11: 0000000000000001 R12: ffff88001a1f16f0
> [ 465.850706] R13: ffffffff82c213d8 R14: ffffffff811aa4d5 R15: ffff88001a1f15f0
> [ 465.865285] FS: 00007fccddf45700(0000) GS:ffffffff82e3f000(0000) knlGS:0000000000000000
> [ 465.881839] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 465.893412] CR2: 0000000000000000 CR3: 0000000017868004 CR4: 00000000000206f0
> [ 465.907019] Call Trace:
> [ 465.911369] d_lru_add+0x37/0x40
> [ 465.917037] dput+0x181/0x1d0
> [ 465.922225] __fput+0x1a7/0x1c0
> [ 465.928052] ____fput+0x9/0x10
> [ 465.933601] task_work_run+0x84/0xc0
> [ 465.940404] exit_to_usermode_loop+0x4e/0x80
> [ 465.948013] do_syscall_64+0x179/0x190
> [ 465.954817] entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [ 465.963476] RIP: 0033:0x7fccdd625040
> [ 465.969435] RSP: 002b:00007fff689ff258 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> [ 465.981362] RAX: 0000000000000000 RBX: 00000000006f1c08 RCX: 00007fccdd625040
> [ 465.993162] RDX: 00000000fbada408 RSI: 0000000000000001 RDI: 0000000000000003
> [ 466.005567] RBP: 0000000000000000 R08: 0000000000000078 R09: 0000000001000000
> [ 466.018316] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
> [ 466.030899] R13: 000000000046e150 R14: 00000000000004f0 R15: 0000000000000001
> [ 466.043602] Code: c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 54 53 48 8b 1f 49 89 f4 48 89 df e8 db f6 f5 00 49 8b 04 24 49 39 c4 75 3d <48> c7 04 25 00 00 00 00 00 00 00 00 48 8b 43 50 48 8d 53 48 4c
> [ 466.077316] RIP: list_lru_add+0x1e/0x70 RSP: ffffc90001f0fe18
> [ 466.087943] CR2: 0000000000000000
> [ 466.094137] ---[ end trace aeec590ab6dccbb2 ]---
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
>
>
>
> Thanks,
> Xiaolong
>