2018-04-17 15:59:20

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 00/12] Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))

Hi,

this patch set solves the problem of slow shrink_slab() occurring
on machines with many shrinkers and memory cgroups (i.e.,
with many containers). The problem is that the complexity of
shrink_slab() is O(n^2), and it grows too fast as the number
of containers grows.

Say we have 200 containers, and every container has 10 mounts
and 10 cgroups. All container tasks are isolated, and they don't
touch foreign containers' mounts.

In case of global reclaim, a task has to iterate over all the memcgs
and call all the memcg-aware shrinkers for each of them. This means
the task has to visit 200 * 10 = 2000 shrinkers for every memcg,
and since there are 2000 memcgs, the total number of do_shrink_slab()
calls is 2000 * 2000 = 4,000,000.

4 million calls are not trivial operations that each take a single cpu cycle.
E.g., super_cache_count() accesses at least two lists and does arithmetic.
Even if there are no charged objects, we still do these calculations, and the
memory reads evict useful data from the cpu caches. I observed nodes spending
almost 100% of their time in the kernel under intensive writing and global
reclaim. The writer consumes pages fast, but the reclaimer has to get through
shrink_slab() before it reaches the page-shrinking function (and frees
SWAP_CLUSTER_MAX pages). Even when there is no writing, the iterations just
waste time and slow reclaim down.

Let's look at the small test below:

$echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
$mkdir /sys/fs/cgroup/memory/ct
$echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
$for i in `seq 0 4000`;
do mkdir /sys/fs/cgroup/memory/ct/$i;
echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
done

Then, let's look at the drop_caches time (4 sequential calls):
$time echo 3 > /proc/sys/vm/drop_caches

0.00user 8.99system 0:08.99elapsed 99%CPU
0.00user 5.97system 0:05.97elapsed 100%CPU
0.00user 5.97system 0:05.97elapsed 100%CPU
0.00user 5.85system 0:05.85elapsed 100%CPU

The last three calls don't actually shrink anything, so the bare iteration
over the slab shrinkers takes 5.85 seconds. Not so good for scalability.

The patch set solves the problem by making shrink_slab()
of O(n) complexity. The functional changes are:

1) Assign an id to every registered memcg-aware shrinker.
2) Maintain a per-memcg bitmap of memcg-aware shrinkers,
   and set the shrinker-related bit after the first element
   is added to an lru list (and also when elements of a
   removed child memcg are reparented).
3) Split memcg-aware and !memcg-aware shrinkers, and call
   a memcg-aware shrinker only if its bit is set in the
   memcg's shrinker bitmap. (There is also functionality
   to clear the bit after the last element is shrunk.)
   A conceptual sketch of the resulting iteration follows.
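
As a conceptual sketch only (not the exact code; see patch 10 for the
real shrink_slab_memcg(), which additionally takes shrinker_rwsem and,
after patches 11-12, clears the bits of empty shrinkers), the memcg-side
iteration becomes:

    static unsigned long shrink_slab_memcg_sketch(gfp_t gfp_mask, int nid,
                                                  struct mem_cgroup *memcg,
                                                  int priority)
    {
            struct memcg_shrinker_map *map = SHRINKERS_MAP(memcg, nid);
            unsigned long freed = 0;
            int i;

            /* Visit only the shrinkers whose bit is set for this memcg */
            for_each_set_bit(i, map->map, shrinkers_max_nr) {
                    struct shrink_control sc = {
                            .gfp_mask = gfp_mask,
                            .nid = nid,
                            .memcg = memcg,
                    };
                    struct shrinker *shrinker = idr_find(&shrinkers_id_idr, i);

                    if (shrinker)
                            freed += do_shrink_slab(&sc, shrinker, priority);
            }
            return freed;
    }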

This gives a significant performance increase. The result after the patch set is applied:

$time echo 3 > /proc/sys/vm/drop_caches
0.00user 1.11system 0:01.12elapsed 99%CPU
0.00user 0.00system 0:00.00elapsed 100%CPU
0.00user 0.00system 0:00.00elapsed 100%CPU
0.00user 0.00system 0:00.00elapsed 100%CPU

Even if we round 0:00.00 up to 0:00.01, the results show
a performance increase of at least 585 times.

So, the patch set reduces the complexity of shrink_slab() and improves
performance under the kind of load described above. It also helps the
!global (memcg-targeted) reclaim case, since there will be fewer
do_shrink_slab() calls there as well.

The patch set is made against the linux-next.git tree.

v2: Many changes requested in comments on v1:

1) the code is mostly moved to mm/memcontrol.c;
2) an IDR is used instead of an array of shrinkers;
3) a list_lru's shrinker id can now be assigned at shrinker
registration time;
4) locking is reorganized, and functions and variables are renamed.
---

Kirill Tkhai (12):
mm: Assign id to every memcg-aware shrinker
memcg: Refactoring in mem_cgroup_alloc()
memcg: Refactoring in alloc_mem_cgroup_per_node_info()
mm: Assign memcg-aware shrinkers bitmap to memcg
fs: Propagate shrinker::id to list_lru
list_lru: Add memcg argument to list_lru_from_kmem()
list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()
list_lru: Pass lru argument to memcg_drain_list_lru_node()
mm: Set bit in memcg shrinker bitmap on first list_lru item appearance
mm: Iterate only over charged shrinkers during memcg shrink_slab()
mm: Add SHRINK_EMPTY shrinker methods return value
mm: Clear shrinker bit if there are no objects related to memcg


fs/super.c | 7 +-
include/linux/list_lru.h | 3 -
include/linux/memcontrol.h | 30 +++++++
include/linux/shrinker.h | 17 +++-
mm/list_lru.c | 64 +++++++++++----
mm/memcontrol.c | 160 ++++++++++++++++++++++++++++++++----
mm/vmscan.c | 194 +++++++++++++++++++++++++++++++++++++++-----
mm/workingset.c | 6 +
8 files changed, 422 insertions(+), 59 deletions(-)

--
Signed-off-by: Kirill Tkhai <[email protected]>


2018-04-17 15:55:05

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 01/12] mm: Assign id to every memcg-aware shrinker

The patch introduces the shrinker::id number, which is used to enumerate
memcg-aware shrinkers. The numbers start from 0, and the code tries
to keep them as small as possible.

This will be used to represent a memcg-aware shrinker in the per-memcg
shrinkers map.
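
For illustration, a memcg-aware shrinker gets its id at registration
time roughly like this (my_shrinker, my_count and my_scan are made-up
names, not part of the patch):

    static unsigned long my_count(struct shrinker *s, struct shrink_control *sc);
    static unsigned long my_scan(struct shrinker *s, struct shrink_control *sc);

    static struct shrinker my_shrinker = {
            .count_objects  = my_count,
            .scan_objects   = my_scan,
            .seeks          = DEFAULT_SEEKS,
            .flags          = SHRINKER_MEMCG_AWARE,
    };

    /*
     * register_shrinker() calls add_memcg_shrinker(), which allocates
     * an id from shrinkers_id_idr and stores it in my_shrinker.id.
     */
    err = register_shrinker(&my_shrinker);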

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/shrinker.h | 2 ++
mm/vmscan.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index a3894918a436..86b651fa2846 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -66,6 +66,8 @@ struct shrinker {

/* These are for internal use */
struct list_head list;
+ /* ID in shrinkers_id_idr */
+ int id;
/* objs pending delete, per node */
atomic_long_t *nr_deferred;
};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b920ce3ae02..4f02fe83537e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -169,6 +169,43 @@ unsigned long vm_total_pages;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+static DEFINE_IDR(shrinkers_id_idr);
+
+static int add_memcg_shrinker(struct shrinker *shrinker)
+{
+ int id, ret;
+
+ down_write(&shrinker_rwsem);
+ ret = id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
+ if (ret < 0)
+ goto unlock;
+ shrinker->id = id;
+ ret = 0;
+unlock:
+ up_write(&shrinker_rwsem);
+ return ret;
+}
+
+static void del_memcg_shrinker(struct shrinker *shrinker)
+{
+ int id = shrinker->id;
+
+ down_write(&shrinker_rwsem);
+ idr_remove(&shrinkers_id_idr, id);
+ up_write(&shrinker_rwsem);
+}
+#else /* CONFIG_MEMCG && !CONFIG_SLOB */
+static int add_memcg_shrinker(struct shrinker *shrinker)
+{
+ return 0;
+}
+
+static void del_memcg_shrinker(struct shrinker *shrinker)
+{
+}
+#endif /* CONFIG_MEMCG && !CONFIG_SLOB */
+
#ifdef CONFIG_MEMCG
static bool global_reclaim(struct scan_control *sc)
{
@@ -306,6 +343,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
int register_shrinker(struct shrinker *shrinker)
{
size_t size = sizeof(*shrinker->nr_deferred);
+ int ret;

if (shrinker->flags & SHRINKER_NUMA_AWARE)
size *= nr_node_ids;
@@ -314,10 +352,21 @@ int register_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;

+ if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+ ret = add_memcg_shrinker(shrinker);
+ if (ret)
+ goto free_deferred;
+ }
+
down_write(&shrinker_rwsem);
list_add_tail(&shrinker->list, &shrinker_list);
up_write(&shrinker_rwsem);
return 0;
+
+free_deferred:
+ kfree(shrinker->nr_deferred);
+ shrinker->nr_deferred = NULL;
+ return -ENOMEM;
}
EXPORT_SYMBOL(register_shrinker);

@@ -328,6 +377,8 @@ void unregister_shrinker(struct shrinker *shrinker)
{
if (!shrinker->nr_deferred)
return;
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+ del_memcg_shrinker(shrinker);
down_write(&shrinker_rwsem);
list_del(&shrinker->list);
up_write(&shrinker_rwsem);


2018-04-17 15:55:21

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 02/12] memcg: Refactoring in mem_cgroup_alloc()

Call alloc_mem_cgroup_per_node_info() and do the IDR allocation later.
This is preparation for the next patches, which will require these
two actions to be made close to each other (they will be done under
a read lock, and placing them together minimizes the time it is held).

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/memcontrol.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 448db08d97a0..d99ea5680ffe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4383,20 +4383,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (!memcg)
return NULL;

- memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL,
- 1, MEM_CGROUP_ID_MAX,
- GFP_KERNEL);
- if (memcg->id.id < 0)
- goto fail;
-
memcg->stat_cpu = alloc_percpu(struct mem_cgroup_stat_cpu);
if (!memcg->stat_cpu)
goto fail;

- for_each_node(node)
- if (alloc_mem_cgroup_per_node_info(memcg, node))
- goto fail;
-
if (memcg_wb_domain_init(memcg, GFP_KERNEL))
goto fail;

@@ -4415,7 +4405,16 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&memcg->cgwb_list);
#endif
- idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
+ for_each_node(node)
+ if (alloc_mem_cgroup_per_node_info(memcg, node))
+ goto fail;
+
+ memcg->id.id = idr_alloc(&mem_cgroup_idr, memcg,
+ 1, MEM_CGROUP_ID_MAX,
+ GFP_KERNEL);
+ if (memcg->id.id < 0)
+ goto fail;
+
return memcg;
fail:
mem_cgroup_id_remove(memcg);


2018-04-17 15:55:31

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 03/12] memcg: Refactoring in alloc_mem_cgroup_per_node_info()

Organize the function in "if () goto err" style,
since the next patch will add more "if" branches.

Also assign and clear memcg->nodeinfo[node]
earlier for the same reason.

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/memcontrol.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d99ea5680ffe..2959a454a072 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4327,20 +4327,22 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
if (!pn)
return 1;
+ memcg->nodeinfo[node] = pn;

pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
- if (!pn->lruvec_stat_cpu) {
- kfree(pn);
- return 1;
- }
+ if (!pn->lruvec_stat_cpu)
+ goto err_pcpu;

lruvec_init(&pn->lruvec);
pn->usage_in_excess = 0;
pn->on_tree = false;
pn->memcg = memcg;
-
- memcg->nodeinfo[node] = pn;
return 0;
+
+err_pcpu:
+ memcg->nodeinfo[node] = NULL;
+ kfree(pn);
+ return 1;
}

static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)


2018-04-17 15:55:49

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

Imagine a big node with many cpus, memory cgroups and containers.
Say we have 200 containers, and every container has 10 mounts
and 10 cgroups. Container tasks don't touch foreign containers'
mounts. If there is intensive page writing and global reclaim
happens, a writing task has to iterate over all memcgs to shrink
slab before it's able to go to shrink_page_list().

Iterating over all the memcg slabs is very expensive:
the task has to visit 200 * 10 = 2000 shrinkers
for every memcg, and since there are 2000 memcgs,
the total number of calls is 2000 * 2000 = 4,000,000.

So, shrink_slab() makes 4 million do_shrink_slab() calls
just to try to isolate SWAP_CLUSTER_MAX pages from one
of the actively writing memcgs via shrink_page_list().
I've observed a node spending almost 100% of its time in
the kernel, making useless iterations over already shrunk
slabs.

This patch adds a bitmap of memcg-aware shrinkers to the memcg.
The size of the bitmap depends on shrinkers_max_nr, and during
the memcg's life it is kept large enough to fit shrinkers_max_nr
shrinkers. Every bit in the map corresponds to a shrinker id.

The next patches will keep a bit set only for memcgs that
really have charged objects. This will allow shrink_slab()
to increase its performance significantly. See the last
patch for the numbers.
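
For illustration only, a reader of the map is expected to test a bit
roughly as below (patch 09 accesses the map under RCU this way, while
patch 10 reads it with shrinker_rwsem held instead); this sketch is not
code added by the present patch:

    struct memcg_shrinker_map *map;

    rcu_read_lock();
    map = rcu_dereference(SHRINKERS_MAP(memcg, nid));
    if (test_bit(shrinker->id, map->map)) {
            /* the memcg may have charged objects for this shrinker */
    }
    rcu_read_unlock();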

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/memcontrol.h | 15 +++++
mm/memcontrol.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 21 +++++++
3 files changed, 160 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index af9eed2e3e04..2ec96ab46b01 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -115,6 +115,7 @@ struct mem_cgroup_per_node {
unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];

struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
+ struct memcg_shrinker_map __rcu *shrinkers_map;

struct rb_node tree_node; /* RB tree node */
unsigned long usage_in_excess;/* Set to the value by which */
@@ -153,6 +154,11 @@ struct mem_cgroup_thresholds {
struct mem_cgroup_threshold_ary *spare;
};

+struct memcg_shrinker_map {
+ struct rcu_head rcu;
+ unsigned long map[0];
+};
+
enum memcg_kmem_state {
KMEM_NONE,
KMEM_ALLOCATED,
@@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
void memcg_get_cache_ids(void);
void memcg_put_cache_ids(void);

+extern int shrinkers_max_nr;
+
/*
* Helper macro to loop through all memcg-specific caches. Callers must still
* check if the cache is valid (it is either valid or NULL).
@@ -1223,6 +1231,13 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
return memcg ? memcg->kmemcg_id : -1;
}

+extern struct memcg_shrinker_map __rcu *root_shrinkers_map[];
+#define SHRINKERS_MAP(memcg, nid) \
+ (memcg == root_mem_cgroup || !memcg ? \
+ root_shrinkers_map[nid] : memcg->nodeinfo[nid]->shrinkers_map)
+
+extern int expand_shrinker_maps(int old_id, int id);
+
#else
#define for_each_memcg_cache_index(_idx) \
for (; NULL; )
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2959a454a072..562dfb1be9ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -305,6 +305,113 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);

struct workqueue_struct *memcg_kmem_cache_wq;

+static DECLARE_RWSEM(shrinkers_max_nr_rwsem);
+struct memcg_shrinker_map __rcu *root_shrinkers_map[MAX_NUMNODES] = { 0 };
+
+static void get_shrinkers_max_nr(void)
+{
+ down_read(&shrinkers_max_nr_rwsem);
+}
+
+static void put_shrinkers_max_nr(void)
+{
+ up_read(&shrinkers_max_nr_rwsem);
+}
+
+static void kvfree_map_rcu(struct rcu_head *head)
+{
+ kvfree(container_of(head, struct memcg_shrinker_map, rcu));
+}
+
+static int memcg_expand_maps(struct mem_cgroup *memcg, int nid,
+ int size, int old_size)
+{
+ struct memcg_shrinker_map *new, *old;
+
+ lockdep_assert_held(&shrinkers_max_nr_rwsem);
+
+ new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ /* Set all old bits, clear all new bits */
+ memset(new->map, (int)0xff, old_size);
+ memset((void *)new->map + old_size, 0, size - old_size);
+
+ old = rcu_dereference_protected(SHRINKERS_MAP(memcg, nid), true);
+
+ if (memcg)
+ rcu_assign_pointer(memcg->nodeinfo[nid]->shrinkers_map, new);
+ else
+ rcu_assign_pointer(root_shrinkers_map[nid], new);
+
+ if (old)
+ call_rcu(&old->rcu, kvfree_map_rcu);
+
+ return 0;
+}
+
+static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
+{
+ /* Skip allocation, when we're initializing root_mem_cgroup */
+ if (!root_mem_cgroup)
+ return 0;
+
+ return memcg_expand_maps(memcg, nid, shrinkers_max_nr/BITS_PER_BYTE, 0);
+}
+
+static void free_shrinker_maps(struct mem_cgroup *memcg,
+ struct mem_cgroup_per_node *pn)
+{
+ struct memcg_shrinker_map *map;
+
+ if (memcg == root_mem_cgroup)
+ return;
+
+ /* IDR unhashed long ago, and expand_shrinker_maps can't race with us */
+ map = rcu_dereference_protected(pn->shrinkers_map, true);
+ kvfree_map_rcu(&map->rcu);
+}
+
+static struct idr mem_cgroup_idr;
+
+int expand_shrinker_maps(int old_nr, int nr)
+{
+ int id, size, old_size, node, ret;
+ struct mem_cgroup *memcg;
+
+ old_size = old_nr / BITS_PER_BYTE;
+ size = nr / BITS_PER_BYTE;
+
+ down_write(&shrinkers_max_nr_rwsem);
+ for_each_node(node) {
+ idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
+ if (id == 1)
+ memcg = NULL;
+ ret = memcg_expand_maps(memcg, node, size, old_size);
+ if (ret)
+ goto unlock;
+ }
+
+ /* root_mem_cgroup is not initialized yet */
+ if (id == 0)
+ ret = memcg_expand_maps(NULL, node, size, old_size);
+ }
+unlock:
+ up_write(&shrinkers_max_nr_rwsem);
+ return ret;
+}
+#else /* CONFIG_SLOB */
+static void get_shrinkers_max_nr(void) { }
+static void put_shrinkers_max_nr(void) { }
+
+static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
+{
+ return 0;
+}
+static void free_shrinker_maps(struct mem_cgroup *memcg,
+ struct mem_cgroup_per_node *pn) { }
+
#endif /* !CONFIG_SLOB */

/**
@@ -3002,6 +3109,8 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
}

#ifndef CONFIG_SLOB
+int shrinkers_max_nr;
+
static int memcg_online_kmem(struct mem_cgroup *memcg)
{
int memcg_id;
@@ -4266,7 +4375,10 @@ static DEFINE_IDR(mem_cgroup_idr);
static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
{
if (memcg->id.id > 0) {
+ /* Removing IDR must be visible for expand_shrinker_maps() */
+ get_shrinkers_max_nr();
idr_remove(&mem_cgroup_idr, memcg->id.id);
+ put_shrinkers_max_nr();
memcg->id.id = 0;
}
}
@@ -4333,12 +4445,17 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn->lruvec_stat_cpu)
goto err_pcpu;

+ if (alloc_shrinker_maps(memcg, node))
+ goto err_maps;
+
lruvec_init(&pn->lruvec);
pn->usage_in_excess = 0;
pn->on_tree = false;
pn->memcg = memcg;
return 0;

+err_maps:
+ free_percpu(pn->lruvec_stat_cpu);
err_pcpu:
memcg->nodeinfo[node] = NULL;
kfree(pn);
@@ -4352,6 +4469,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return;

+ free_shrinker_maps(memcg, pn);
free_percpu(pn->lruvec_stat_cpu);
kfree(pn);
}
@@ -4407,13 +4525,18 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&memcg->cgwb_list);
#endif
+
+ get_shrinkers_max_nr();
for_each_node(node)
- if (alloc_mem_cgroup_per_node_info(memcg, node))
+ if (alloc_mem_cgroup_per_node_info(memcg, node)) {
+ put_shrinkers_max_nr();
goto fail;
+ }

memcg->id.id = idr_alloc(&mem_cgroup_idr, memcg,
1, MEM_CGROUP_ID_MAX,
GFP_KERNEL);
+ put_shrinkers_max_nr();
if (memcg->id.id < 0)
goto fail;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f02fe83537e..f63eb5596c35 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -172,6 +172,22 @@ static DECLARE_RWSEM(shrinker_rwsem);
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
static DEFINE_IDR(shrinkers_id_idr);

+static int expand_shrinker_id(int id)
+{
+ if (likely(id < shrinkers_max_nr))
+ return 0;
+
+ id = shrinkers_max_nr * 2;
+ if (id == 0)
+ id = BITS_PER_BYTE;
+
+ if (expand_shrinker_maps(shrinkers_max_nr, id))
+ return -ENOMEM;
+
+ shrinkers_max_nr = id;
+ return 0;
+}
+
static int add_memcg_shrinker(struct shrinker *shrinker)
{
int id, ret;
@@ -180,6 +196,11 @@ static int add_memcg_shrinker(struct shrinker *shrinker)
ret = id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
if (ret < 0)
goto unlock;
+ ret = expand_shrinker_id(id);
+ if (ret < 0) {
+ idr_remove(&shrinkers_id_idr, id);
+ goto unlock;
+ }
shrinker->id = id;
ret = 0;
unlock:


2018-04-17 15:56:21

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 07/12] list_lru: Pass dst_memcg argument to memcg_drain_list_lru_node()

This is just a refactoring to allow the next patches to have
a dst_memcg pointer in memcg_drain_list_lru_node().

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/list_lru.h | 2 +-
mm/list_lru.c | 11 ++++++-----
mm/memcontrol.c | 2 +-
3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index f5b6bb7a8670..5c7db0022ce6 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -66,7 +66,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
#define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL)

int memcg_update_all_list_lrus(int num_memcgs);
-void memcg_drain_all_list_lrus(int src_idx, int dst_idx);
+void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);

/**
* list_lru_add: add an element to the lru list's tail
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 437f854eac44..a92850bc209f 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -517,8 +517,9 @@ int memcg_update_all_list_lrus(int new_size)
}

static void memcg_drain_list_lru_node(struct list_lru_node *nlru,
- int src_idx, int dst_idx)
+ int src_idx, struct mem_cgroup *dst_memcg)
{
+ int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;

/*
@@ -538,7 +539,7 @@ static void memcg_drain_list_lru_node(struct list_lru_node *nlru,
}

static void memcg_drain_list_lru(struct list_lru *lru,
- int src_idx, int dst_idx)
+ int src_idx, struct mem_cgroup *dst_memcg)
{
int i;

@@ -546,16 +547,16 @@ static void memcg_drain_list_lru(struct list_lru *lru,
return;

for_each_node(i)
- memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx);
+ memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_memcg);
}

-void memcg_drain_all_list_lrus(int src_idx, int dst_idx)
+void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)
{
struct list_lru *lru;

mutex_lock(&list_lrus_mutex);
list_for_each_entry(lru, &list_lrus, list)
- memcg_drain_list_lru(lru, src_idx, dst_idx);
+ memcg_drain_list_lru(lru, src_idx, dst_memcg);
mutex_unlock(&list_lrus_mutex);
}
#else
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 562dfb1be9ef..ffbe65d43a15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3182,7 +3182,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
}
rcu_read_unlock();

- memcg_drain_all_list_lrus(kmemcg_id, parent->kmemcg_id);
+ memcg_drain_all_list_lrus(kmemcg_id, parent);

memcg_free_cache_id(kmemcg_id);
}


2018-04-17 15:56:34

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 10/12] mm: Iterate only over charged shrinkers during memcg shrink_slab()

Using the preparations made in the previous patches, in case of memcg
shrink we may skip shrinkers whose bits are not set in the memcg's
shrinkers bitmap. To do that, we separate the iterations over
memcg-aware and !memcg-aware shrinkers, and memcg-aware shrinkers
are chosen via for_each_set_bit() from the bitmap. On big nodes
with many isolated environments this gives significant performance
growth. See the next patches for the details.

Note that this patch does not yet handle empty memcg shrinkers,
since a bitmap bit is never cleared after it has been set once.
Such shrinkers will be called again, with no shrunk objects as
a result. That functionality is provided by the next patches.

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/vmscan.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 72 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34cd1d9b8b22..b81b8a7727b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -169,6 +169,20 @@ unsigned long vm_total_pages;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

+static void link_shrinker(struct shrinker *shrinker)
+{
+ down_write(&shrinker_rwsem);
+ list_add_tail(&shrinker->list, &shrinker_list);
+ up_write(&shrinker_rwsem);
+}
+
+static void unlink_shrinker(struct shrinker *shrinker)
+{
+ down_write(&shrinker_rwsem);
+ list_del(&shrinker->list);
+ up_write(&shrinker_rwsem);
+}
+
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
static DEFINE_IDR(shrinkers_id_idr);

@@ -221,11 +235,13 @@ static void del_memcg_shrinker(struct shrinker *shrinker)
#else /* CONFIG_MEMCG && !CONFIG_SLOB */
static int add_memcg_shrinker(struct shrinker *shrinker, int nr, va_list args)
{
+ link_shrinker(shrinker);
return 0;
}

static void del_memcg_shrinker(struct shrinker *shrinker)
{
+ unlink_shrinker(shrinker);
}
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

@@ -382,11 +398,9 @@ int __register_shrinker(struct shrinker *shrinker, int nr, ...)
va_end(args);
if (ret)
goto free_deferred;
- }
+ } else
+ link_shrinker(shrinker);

- down_write(&shrinker_rwsem);
- list_add_tail(&shrinker->list, &shrinker_list);
- up_write(&shrinker_rwsem);
return 0;

free_deferred:
@@ -405,9 +419,8 @@ void unregister_shrinker(struct shrinker *shrinker)
return;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
del_memcg_shrinker(shrinker);
- down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
- up_write(&shrinker_rwsem);
+ else
+ unlink_shrinker(shrinker);
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
@@ -532,6 +545,53 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
return freed;
}

+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+ struct mem_cgroup *memcg,
+ int priority)
+{
+ struct memcg_shrinker_map *map;
+ unsigned long freed = 0;
+ int ret, i;
+
+ if (!down_read_trylock(&shrinker_rwsem))
+ return 0;
+
+ /*
+ * 1)Caller passes only alive memcg, so map can't be NULL.
+ * 2)shrinker_rwsem protects from maps expanding.
+ */
+ map = rcu_dereference_protected(SHRINKERS_MAP(memcg, nid), true);
+ BUG_ON(!map);
+
+ for_each_set_bit(i, map->map, shrinkers_max_nr) {
+ struct shrink_control sc = {
+ .gfp_mask = gfp_mask,
+ .nid = nid,
+ .memcg = memcg,
+ };
+ struct shrinker *shrinker;
+
+ shrinker = idr_find(&shrinkers_id_idr, i);
+ if (!shrinker) {
+ clear_bit(i, map->map);
+ continue;
+ }
+
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ freed += ret;
+
+ if (rwsem_is_contended(&shrinker_rwsem)) {
+ freed = freed ? : 1;
+ break;
+ }
+ }
+
+ up_read(&shrinker_rwsem);
+ return freed;
+}
+#endif
+
/**
* shrink_slab - shrink slab caches
* @gfp_mask: allocation context
@@ -564,6 +624,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
return 0;

+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+ if (memcg)
+ return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
+#endif
+
if (!down_read_trylock(&shrinker_rwsem))
goto out;

@@ -574,15 +639,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
.memcg = memcg,
};

- /*
- * If kernel memory accounting is disabled, we ignore
- * SHRINKER_MEMCG_AWARE flag and call all shrinkers
- * passing NULL for memcg.
- */
- if (memcg_kmem_enabled() &&
- !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
- continue;
-
if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
sc.nid = 0;



2018-04-17 15:56:35

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 08/12] list_lru: Pass lru argument to memcg_drain_list_lru_node()

This is just a refactoring to allow the next patches to have
an lru pointer in memcg_drain_list_lru_node().

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/list_lru.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index a92850bc209f..ed0f97b0c087 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -516,9 +516,10 @@ int memcg_update_all_list_lrus(int new_size)
goto out;
}

-static void memcg_drain_list_lru_node(struct list_lru_node *nlru,
+static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
int src_idx, struct mem_cgroup *dst_memcg)
{
+ struct list_lru_node *nlru = &lru->node[nid];
int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;

@@ -547,7 +548,7 @@ static void memcg_drain_list_lru(struct list_lru *lru,
return;

for_each_node(i)
- memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_memcg);
+ memcg_drain_list_lru_node(lru, i, src_idx, dst_memcg);
}

void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)


2018-04-17 15:57:00

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 12/12] mm: Clear shrinker bit if there are no objects related to memcg

To avoid further unneeded calls of do_shrink_slab()
for shrinkers that no longer have any charged objects
in a memcg, their bits have to be cleared.

This patch introduces a lockless mechanism to do that
without racing with a parallel list_lru add. After
do_shrink_slab() returns SHRINK_EMPTY the first time,
we clear the bit and call it once again. Then we restore
the bit if the new return value is different.

Note that the single smp_mb__after_atomic() in shrink_slab_memcg()
covers two situations:

1) list_lru_add()        shrink_slab_memcg()
     list_add_tail()       for_each_set_bit()  <--- read bit
                           do_shrink_slab()    <--- missed list update (no barrier)
     <MB>                  <MB>
     set_bit()             do_shrink_slab()    <--- seen list update

This situation, when the first do_shrink_slab() sees the bit set
but doesn't see the list update (i.e., it races with the first
element queueing), is rare. So, instead of adding an <MB> before
the first do_shrink_slab() call and slowing down the generic case,
we rely on the second call. The second call is also needed for
case (2) below.

2) list_lru_add()        shrink_slab_memcg()
     list_add_tail()       ...
     set_bit()             ...
   ...                     for_each_set_bit()
   do_shrink_slab()          do_shrink_slab()
     clear_bit()            ...
   ...                      ...
   list_lru_add()           ...
     list_add_tail()        clear_bit()
     <MB>                   <MB>
     set_bit()              do_shrink_slab()

The barriers guarantee that the second do_shrink_slab()
in the right-hand task sees the list update if it really
cleared the bit. This case is drawn in the code comment.

[Results/performance of the patchset]

With the whole patch set applied, the test below shows a significant
increase in performance:

$echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
$mkdir /sys/fs/cgroup/memory/ct
$echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
$for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done

Then, 4 sequential calls of drop caches:
$time echo 3 > /proc/sys/vm/drop_caches

1)Before:
0.00user 8.99system 0:08.99elapsed 99%CPU
0.00user 5.97system 0:05.97elapsed 100%CPU
0.00user 5.97system 0:05.97elapsed 100%CPU
0.00user 5.85system 0:05.85elapsed 100%CPU

2)After
0.00user 1.11system 0:01.12elapsed 99%CPU
0.00user 0.00system 0:00.00elapsed 100%CPU
0.00user 0.00system 0:00.00elapsed 100%CPU
0.00user 0.00system 0:00.00elapsed 100%CPU

Even if we round 0:00.00 up to 0:00.01, the results show
a performance increase of at least 585 times.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/memcontrol.h | 2 ++
mm/vmscan.c | 19 +++++++++++++++++--
2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1c1fa8e417a..1c5c68550e2f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1245,6 +1245,8 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)

rcu_read_lock();
map = SHRINKERS_MAP(memcg, nid);
+ /* Pairs with smp mb in shrink_slab() */
+ smp_mb__before_atomic();
set_bit(nr, map->map);
rcu_read_unlock();
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3be9b4d81c13..a8733bc5377b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -579,8 +579,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
}

ret = do_shrink_slab(&sc, shrinker, priority);
- if (ret == SHRINK_EMPTY)
- ret = 0;
+ if (ret == SHRINK_EMPTY) {
+ clear_bit(i, map->map);
+ /*
+ * Pairs with mb in set_shrinker_bit():
+ *
+ * list_lru_add() shrink_slab_memcg()
+ * list_add_tail() clear_bit()
+ * <MB> <MB>
+ * set_bit() do_shrink_slab()
+ */
+ smp_mb__after_atomic();
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY)
+ ret = 0;
+ else
+ set_shrinker_bit(memcg, nid, i);
+ }
freed += ret;

if (rwsem_is_contended(&shrinker_rwsem)) {


2018-04-17 15:57:02

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 05/12] fs: Propagate shrinker::id to list_lru

The patch adds the list_lru::shrinker_id field and populates
it with the registered shrinker's id.

In the next patches this will be used by the lru code to set
the correct bit in the memcg shrinkers map after the first
memcg-related element appears in a list_lru.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 4 +++-
include/linux/list_lru.h | 1 +
include/linux/shrinker.h | 8 +++++++-
mm/list_lru.c | 6 ++++++
mm/vmscan.c | 15 ++++++++++-----
mm/workingset.c | 3 ++-
6 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 5fa9a8d8d865..9bc5698c8c3c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -518,7 +518,9 @@ struct super_block *sget_userns(struct file_system_type *type,
hlist_add_head(&s->s_instances, &type->fs_supers);
spin_unlock(&sb_lock);
get_filesystem(type);
- err = register_shrinker(&s->s_shrink);
+ err = register_shrinker_args(&s->s_shrink, 2,
+ &s->s_dentry_lru.shrinker_id,
+ &s->s_inode_lru.shrinker_id);
if (err) {
deactivate_locked_super(s);
s = ERR_PTR(err);
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 96def9d15b1b..f5b6bb7a8670 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -53,6 +53,7 @@ struct list_lru {
struct list_lru_node *node;
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
struct list_head list;
+ int shrinker_id;
#endif
};

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 86b651fa2846..22ee2996c480 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -77,6 +77,12 @@ struct shrinker {
#define SHRINKER_NUMA_AWARE (1 << 0)
#define SHRINKER_MEMCG_AWARE (1 << 1)

-extern __must_check int register_shrinker(struct shrinker *);
+extern __must_check int __register_shrinker(struct shrinker *, int, ...);
+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+#define register_shrinker_args(...) __register_shrinker(__VA_ARGS__)
+#else
+#define register_shrinker_args(shrinker,...) __register_shrinker(shrinker, 0)
+#endif
+#define register_shrinker(shrinker) register_shrinker_args(shrinker, 0)
extern void unregister_shrinker(struct shrinker *);
#endif
diff --git a/mm/list_lru.c b/mm/list_lru.c
index d9c84c5bda1d..2a4d29491947 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -567,6 +567,9 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
size_t size = sizeof(*lru->node) * nr_node_ids;
int err = -ENOMEM;

+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+ lru->shrinker_id = -1;
+#endif
memcg_get_cache_ids();

lru->node = kzalloc(size, GFP_KERNEL);
@@ -609,6 +612,9 @@ void list_lru_destroy(struct list_lru *lru)
kfree(lru->node);
lru->node = NULL;

+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+ lru->shrinker_id = -1;
+#endif
memcg_put_cache_ids();
}
EXPORT_SYMBOL_GPL(list_lru_destroy);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f63eb5596c35..34cd1d9b8b22 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -188,7 +188,7 @@ static int expand_shrinker_id(int id)
return 0;
}

-static int add_memcg_shrinker(struct shrinker *shrinker)
+static int add_memcg_shrinker(struct shrinker *shrinker, int nr, va_list va_ids)
{
int id, ret;

@@ -202,6 +202,8 @@ static int add_memcg_shrinker(struct shrinker *shrinker)
goto unlock;
}
shrinker->id = id;
+ while (nr-- > 0)
+ *va_arg(va_ids, int *) = id;
ret = 0;
unlock:
up_write(&shrinker_rwsem);
@@ -217,7 +219,7 @@ static void del_memcg_shrinker(struct shrinker *shrinker)
up_write(&shrinker_rwsem);
}
#else /* CONFIG_MEMCG && !CONFIG_SLOB */
-static int add_memcg_shrinker(struct shrinker *shrinker)
+static int add_memcg_shrinker(struct shrinker *shrinker, int nr, va_list args)
{
return 0;
}
@@ -361,9 +363,10 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
/*
* Add a shrinker callback to be called from the vm.
*/
-int register_shrinker(struct shrinker *shrinker)
+int __register_shrinker(struct shrinker *shrinker, int nr, ...)
{
size_t size = sizeof(*shrinker->nr_deferred);
+ va_list args;
int ret;

if (shrinker->flags & SHRINKER_NUMA_AWARE)
@@ -374,7 +377,9 @@ int register_shrinker(struct shrinker *shrinker)
return -ENOMEM;

if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
- ret = add_memcg_shrinker(shrinker);
+ va_start(args, nr);
+ ret = add_memcg_shrinker(shrinker, nr, args);
+ va_end(args);
if (ret)
goto free_deferred;
}
@@ -389,7 +394,7 @@ int register_shrinker(struct shrinker *shrinker)
shrinker->nr_deferred = NULL;
return -ENOMEM;
}
-EXPORT_SYMBOL(register_shrinker);
+EXPORT_SYMBOL(__register_shrinker);

/*
* Remove one
diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c83978..2e2555649d13 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -531,7 +531,8 @@ static int __init workingset_init(void)
ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key);
if (ret)
goto err;
- ret = register_shrinker(&workingset_shadow_shrinker);
+ ret = register_shrinker_args(&workingset_shadow_shrinker,
+ 1, &shadow_nodes.shrinker_id);
if (ret)
goto err_list_lru;
return 0;


2018-04-17 15:57:16

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 06/12] list_lru: Add memcg argument to list_lru_from_kmem()

This is just a refactoring to allow the next patches to have
a memcg pointer in list_lru_from_kmem().

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/list_lru.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 2a4d29491947..437f854eac44 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -76,18 +76,24 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
}

static inline struct list_lru_one *
-list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
+list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
+ struct mem_cgroup **memcg_ptr)
{
- struct mem_cgroup *memcg;
+ struct list_lru_one *l = &nlru->lru;
+ struct mem_cgroup *memcg = NULL;

if (!nlru->memcg_lrus)
- return &nlru->lru;
+ goto out;

memcg = mem_cgroup_from_kmem(ptr);
if (!memcg)
- return &nlru->lru;
+ goto out;

- return list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+ l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+out:
+ if (memcg_ptr)
+ *memcg_ptr = memcg;
+ return l;
}
#else
static inline bool list_lru_memcg_aware(struct list_lru *lru)
@@ -102,8 +108,11 @@ list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
}

static inline struct list_lru_one *
-list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
+list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
+ struct mem_cgroup **memcg_ptr)
{
+ if (memcg_ptr)
+ *memcg_ptr = NULL;
return &nlru->lru;
}
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */
@@ -116,7 +125,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)

spin_lock(&nlru->lock);
if (list_empty(item)) {
- l = list_lru_from_kmem(nlru, item);
+ l = list_lru_from_kmem(nlru, item, NULL);
list_add_tail(item, &l->list);
l->nr_items++;
nlru->nr_items++;
@@ -142,7 +151,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)

spin_lock(&nlru->lock);
if (!list_empty(item)) {
- l = list_lru_from_kmem(nlru, item);
+ l = list_lru_from_kmem(nlru, item, NULL);
list_del_init(item);
l->nr_items--;
nlru->nr_items--;


2018-04-17 15:57:36

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 11/12] mm: Add SHRINK_EMPTY shrinker methods return value

We need to distinguish the situation when a shrinker has
a very small number of objects (see vfs_pressure_ratio()
called from super_cache_count()) from the situation when
it has no objects at all. Currently, shrinker::count_objects()
returns 0 in both of these cases.

The patch introduces the new SHRINK_EMPTY return value,
which will be used for the "no objects at all" case.
It is mostly a refactoring, as SHRINK_EMPTY is replaced
with 0 by all callers of do_shrink_slab() in this patch,
and all the magic will happen in further patches.
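
For example, a hypothetical memcg-aware ->count_objects() built on a
list_lru would then look roughly like this (my_lru is a made-up name;
compare with the super_cache_count() and count_shadow_nodes() hunks
below):

    static unsigned long my_count(struct shrinker *shrink,
                                  struct shrink_control *sc)
    {
            unsigned long nr = list_lru_shrink_count(&my_lru, sc);

            if (!nr)
                    return SHRINK_EMPTY;    /* no objects at all */

            return nr; /* or 0, if the count is below some shrink threshold */
    }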

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 3 +++
include/linux/shrinker.h | 7 +++++--
mm/vmscan.c | 12 +++++++++---
mm/workingset.c | 3 +++
4 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 9bc5698c8c3c..06c87f81037c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -134,6 +134,9 @@ static unsigned long super_cache_count(struct shrinker *shrink,
total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

+ if (!total_objects)
+ return SHRINK_EMPTY;
+
total_objects = vfs_pressure_ratio(total_objects);
return total_objects;
}
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 22ee2996c480..973df501c335 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -34,12 +34,15 @@ struct shrink_control {
};

#define SHRINK_STOP (~0UL)
+#define SHRINK_EMPTY (~0UL - 1)
/*
* A callback you can register to apply pressure to ageable caches.
*
* @count_objects should return the number of freeable items in the cache. If
- * there are no objects to free or the number of freeable items cannot be
- * determined, it should return 0. No deadlock checks should be done during the
+ * there are no objects to free, it should return SHRINK_EMPTY, while 0 is
+ * returned in cases of the number of freeable items cannot be determined
+ * or shrinker should skip this cache for this time (e.g., their number
+ * is below shrinkable limit). No deadlock checks should be done during the
* count callback - the shrinker relies on aggregating scan counts that couldn't
* be executed due to potential deadlocks to be run at a later call when the
* deadlock condition is no longer pending.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b81b8a7727b5..3be9b4d81c13 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -443,8 +443,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
long scanned = 0, next_deferred;

freeable = shrinker->count_objects(shrinker, shrinkctl);
- if (freeable == 0)
- return 0;
+ if (freeable == 0 || freeable == SHRINK_EMPTY)
+ return freeable;

/*
* copy the current shrinker scan count into a local variable
@@ -579,6 +579,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
}

ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY)
+ ret = 0;
freed += ret;

if (rwsem_is_contended(&shrinker_rwsem)) {
@@ -620,6 +622,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
{
struct shrinker *shrinker;
unsigned long freed = 0;
+ int ret;

if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
return 0;
@@ -642,7 +645,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
sc.nid = 0;

- freed += do_shrink_slab(&sc, shrinker, priority);
+ ret = do_shrink_slab(&sc, shrinker, priority);
+ if (ret == SHRINK_EMPTY)
+ ret = 0;
+ freed += ret;
/*
* Bail out if someone want to register a new shrinker to
* prevent the regsitration from being stalled for long periods
diff --git a/mm/workingset.c b/mm/workingset.c
index 2e2555649d13..075b990e7c81 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -402,6 +402,9 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
}
max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);

+ if (!nodes)
+ return SHRINK_EMPTY;
+
if (nodes <= max_nodes)
return 0;
return nodes - max_nodes;


2018-04-17 15:58:28

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH v2 09/12] mm: Set bit in memcg shrinker bitmap on first list_lru item appearance

Introduce the set_shrinker_bit() function to set a shrinker-related
bit in the memcg shrinker bitmap. The bit is set after the first
item is added, and also when a destroyed memcg's items are reparented.

This will allow the next patch to call shrinkers only when
they have charged objects at the moment, and thus to improve
shrink_slab() performance.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/memcontrol.h | 13 +++++++++++++
mm/list_lru.c | 21 +++++++++++++++++++--
2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2ec96ab46b01..e1c1fa8e417a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1238,6 +1238,17 @@ extern struct memcg_shrinker_map __rcu *root_shrinkers_map[];

extern int expand_shrinker_maps(int old_id, int id);

+static inline void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
+{
+ if (nr >= 0) {
+ struct memcg_shrinker_map *map;
+
+ rcu_read_lock();
+ map = SHRINKERS_MAP(memcg, nid);
+ set_bit(nr, map->map);
+ rcu_read_unlock();
+ }
+}
#else
#define for_each_memcg_cache_index(_idx) \
for (; NULL; )
@@ -1260,6 +1271,8 @@ static inline void memcg_put_cache_ids(void)
{
}

+static inline void set_shrinker_bit(struct mem_cgroup *memcg, int node, int id) { }
+
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/list_lru.c b/mm/list_lru.c
index ed0f97b0c087..52d67ca391bb 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -30,6 +30,11 @@ static void list_lru_unregister(struct list_lru *lru)
list_del(&lru->list);
mutex_unlock(&list_lrus_mutex);
}
+
+static int lru_shrinker_id(struct list_lru *lru)
+{
+ return lru->shrinker_id;
+}
#else
static void list_lru_register(struct list_lru *lru)
{
@@ -38,6 +43,11 @@ static void list_lru_register(struct list_lru *lru)
static void list_lru_unregister(struct list_lru *lru)
{
}
+
+static int lru_shrinker_id(struct list_lru *lru)
+{
+ return -1;
+}
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
@@ -121,13 +131,16 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
+ struct mem_cgroup *memcg;
struct list_lru_one *l;

spin_lock(&nlru->lock);
if (list_empty(item)) {
- l = list_lru_from_kmem(nlru, item, NULL);
+ l = list_lru_from_kmem(nlru, item, &memcg);
list_add_tail(item, &l->list);
- l->nr_items++;
+ /* Set shrinker bit if the first element was added */
+ if (!l->nr_items++)
+ set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
nlru->nr_items++;
spin_unlock(&nlru->lock);
return true;
@@ -522,6 +535,7 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
struct list_lru_node *nlru = &lru->node[nid];
int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;
+ bool set;

/*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -533,7 +547,10 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
dst = list_lru_from_memcg_idx(nlru, dst_idx);

list_splice_init(&src->list, &dst->list);
+ set = (!dst->nr_items && src->nr_items);
dst->nr_items += src->nr_items;
+ if (set)
+ set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
src->nr_items = 0;

spin_unlock_irq(&nlru->lock);


2018-04-18 12:57:08

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

Hi Kirill,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on mmotm/master]
[also build test WARNING on next-20180418]
[cannot apply to v4.17-rc1]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Kirill-Tkhai/Improve-shrink_slab-scalability-old-complexity-was-O-n-2-new-is-O-n/20180418-184501
base: git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x011-201815 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

mm/memcontrol.c: In function 'expand_shrinker_maps':
>> mm/memcontrol.c:402:9: warning: 'ret' may be used uninitialized in this function [-Wmaybe-uninitialized]
return ret;
^~~

vim +/ret +402 mm/memcontrol.c

377
378 int expand_shrinker_maps(int old_nr, int nr)
379 {
380 int id, size, old_size, node, ret;
381 struct mem_cgroup *memcg;
382
383 old_size = old_nr / BITS_PER_BYTE;
384 size = nr / BITS_PER_BYTE;
385
386 down_write(&shrinkers_max_nr_rwsem);
387 for_each_node(node) {
388 idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
389 if (id == 1)
390 memcg = NULL;
391 ret = memcg_expand_maps(memcg, node, size, old_size);
392 if (ret)
393 goto unlock;
394 }
395
396 /* root_mem_cgroup is not initialized yet */
397 if (id == 0)
398 ret = memcg_expand_maps(NULL, node, size, old_size);
399 }
400 unlock:
401 up_write(&shrinkers_max_nr_rwsem);
> 402 return ret;
403 }
404 #else /* CONFIG_SLOB */
405 static void get_shrinkers_max_nr(void) { }
406 static void put_shrinkers_max_nr(void) { }
407

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-04-18 13:07:36

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

Hi,

On 18.04.2018 15:55, kbuild test robot wrote:
> Hi Kirill,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on mmotm/master]
> [also build test WARNING on next-20180418]
> [cannot apply to v4.17-rc1]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Kirill-Tkhai/Improve-shrink_slab-scalability-old-complexity-was-O-n-2-new-is-O-n/20180418-184501
> base: git://git.cmpxchg.org/linux-mmotm.git master
> config: x86_64-randconfig-x011-201815 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64
>
> Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
> http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings
>
> All warnings (new ones prefixed by >>):
>
> mm/memcontrol.c: In function 'expand_shrinker_maps':
>>> mm/memcontrol.c:402:9: warning: 'ret' may be used uninitialized in this function [-Wmaybe-uninitialized]
> return ret;> ^~~

thanks for reporting this. In kernel terms it's actually a false positive
(since this function is only called when for_each_node() iterates over a
non-empty list), but of course I'll add a ret initialization to silence
the compiler.

This should not prevent review of the patch set, so I'm waiting for people's
comments on it before resending v3.

>
> vim +/ret +402 mm/memcontrol.c
>
> 377
> 378 int expand_shrinker_maps(int old_nr, int nr)
> 379 {
> 380 int id, size, old_size, node, ret;
> 381 struct mem_cgroup *memcg;
> 382
> 383 old_size = old_nr / BITS_PER_BYTE;
> 384 size = nr / BITS_PER_BYTE;
> 385
> 386 down_write(&shrinkers_max_nr_rwsem);
> 387 for_each_node(node) {
> 388 idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
> 389 if (id == 1)
> 390 memcg = NULL;
> 391 ret = memcg_expand_maps(memcg, node, size, old_size);
> 392 if (ret)
> 393 goto unlock;
> 394 }
> 395
> 396 /* root_mem_cgroup is not initialized yet */
> 397 if (id == 0)
> 398 ret = memcg_expand_maps(NULL, node, size, old_size);
> 399 }
> 400 unlock:
> 401 up_write(&shrinkers_max_nr_rwsem);
> > 402 return ret;
> 403 }
> 404 #else /* CONFIG_SLOB */
> 405 static void get_shrinkers_max_nr(void) { }
> 406 static void put_shrinkers_max_nr(void) { }
> 407
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation

Kirill

2018-04-18 14:18:49

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH v2 01/12] mm: Assign id to every memcg-aware shrinker

Kirill Tkhai wrote:
> The patch introduces shrinker::id number, which is used to enumerate
> memcg-aware shrinkers. The number start from 0, and the code tries
> to maintain it as small as possible.
>
> This will be used as to represent a memcg-aware shrinkers in memcg
> shrinkers map.

I'm not reading this thread. But is there a reason the "id" needs to be managed
using the smallest possible numbers? Can't we use the address of the shrinker
object as the "id" (which would give a sparse bitmap, and could be managed using
a linked list for now)?

2018-04-18 14:28:44

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 01/12] mm: Assign id to every memcg-aware shrinker

On 18.04.2018 17:14, Tetsuo Handa wrote:
> Kirill Tkhai wrote:
>> The patch introduces shrinker::id number, which is used to enumerate
>> memcg-aware shrinkers. The number start from 0, and the code tries
>> to maintain it as small as possible.
>>
>> This will be used as to represent a memcg-aware shrinkers in memcg
>> shrinkers map.
>
> I'm not reading this thread. But is there reason "id" needs to be managed
> using smallest numbers? Can't we use address of shrinker object as "id"
> (which will be sparse bitmap, and would be managed using linked list for now)?

Yes, the smallest numbers are needed, as the next patches introduce
a per-memcg bitmap indexed by shrinker ids.

Kirill

2018-04-18 14:36:34

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH v2 01/12] mm: Assign id to every memcg-aware shrinker

Kirill Tkhai wrote:
> On 18.04.2018 17:14, Tetsuo Handa wrote:
> > Kirill Tkhai wrote:
> >> The patch introduces shrinker::id number, which is used to enumerate
> >> memcg-aware shrinkers. The number start from 0, and the code tries
> >> to maintain it as small as possible.
> >>
> >> This will be used as to represent a memcg-aware shrinkers in memcg
> >> shrinkers map.
> >
> > I'm not reading this thread. But is there reason "id" needs to be managed
> > using smallest numbers? Can't we use address of shrinker object as "id"
> > (which will be sparse bitmap, and would be managed using linked list for now)?
>
> Yes, it's needed to have the smallest numbers, as next patches introduce
> per-memcg bitmap containing ids of shrinkers.

If you use a sparse bitmap (xbitmap?), I think you can do it.

2018-04-18 15:04:43

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 01/12] mm: Assign id to every memcg-aware shrinker

On 18.04.2018 17:32, Tetsuo Handa wrote:
> Kirill Tkhai wrote:
>> On 18.04.2018 17:14, Tetsuo Handa wrote:
>>> Kirill Tkhai wrote:
>>>> The patch introduces shrinker::id number, which is used to enumerate
>>>> memcg-aware shrinkers. The number start from 0, and the code tries
>>>> to maintain it as small as possible.
>>>>
>>>> This will be used as to represent a memcg-aware shrinkers in memcg
>>>> shrinkers map.
>>>
>>> I'm not reading this thread. But is there reason "id" needs to be managed
>>> using smallest numbers? Can't we use address of shrinker object as "id"
>>> (which will be sparse bitmap, and would be managed using linked list for now)?
>>
>> Yes, it's needed to have the smallest numbers, as next patches introduce
>> per-memcg bitmap containing ids of shrinkers.
>
> If you use sparse bitmap (xbitmap ?), I think you can do it.

There is no implementation in the kernel, and a search gave me this link:
https://patchwork.kernel.org/patch/10128397/

The problem is that it may allocate memory and hence fail,
while adding an element to the shrinker lists (and setting a bit
in the bitmap) must not fail. So, it's not possible to use a sparse bitmap.

Kirill

2018-04-22 17:18:16

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 01/12] mm: Assign id to every memcg-aware shrinker

On Tue, Apr 17, 2018 at 09:53:02PM +0300, Kirill Tkhai wrote:
> The patch introduces shrinker::id number, which is used to enumerate
> memcg-aware shrinkers. The number start from 0, and the code tries
> to maintain it as small as possible.
>
> This will be used as to represent a memcg-aware shrinkers in memcg
> shrinkers map.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/shrinker.h | 2 ++
> mm/vmscan.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 53 insertions(+)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index a3894918a436..86b651fa2846 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -66,6 +66,8 @@ struct shrinker {
>
> /* These are for internal use */
> struct list_head list;

> + /* ID in shrinkers_id_idr */
> + int id;

This should be under #ifdef CONFIG_MEMCG && !CONFIG_SLOB.

> /* objs pending delete, per node */
> atomic_long_t *nr_deferred;
> };
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8b920ce3ae02..4f02fe83537e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -169,6 +169,43 @@ unsigned long vm_total_pages;
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)

> +static DEFINE_IDR(shrinkers_id_idr);

IMO shrinker_idr would be a better name.

> +
> +static int add_memcg_shrinker(struct shrinker *shrinker)
> +{
> + int id, ret;
> +
> + down_write(&shrinker_rwsem);
> + ret = id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
> + if (ret < 0)
> + goto unlock;
> + shrinker->id = id;
> + ret = 0;
> +unlock:
> + up_write(&shrinker_rwsem);
> + return ret;
> +}
> +
> +static void del_memcg_shrinker(struct shrinker *shrinker)
> +{
> + int id = shrinker->id;
> +
> + down_write(&shrinker_rwsem);
> + idr_remove(&shrinkers_id_idr, id);
> + up_write(&shrinker_rwsem);
> +}
> +#else /* CONFIG_MEMCG && !CONFIG_SLOB */
> +static int add_memcg_shrinker(struct shrinker *shrinker)
> +{
> + return 0;
> +}
> +
> +static void del_memcg_shrinker(struct shrinker *shrinker)
> +{
> +}
> +#endif /* CONFIG_MEMCG && !CONFIG_SLOB */
> +
> #ifdef CONFIG_MEMCG
> static bool global_reclaim(struct scan_control *sc)
> {
> @@ -306,6 +343,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
> int register_shrinker(struct shrinker *shrinker)
> {
> size_t size = sizeof(*shrinker->nr_deferred);
> + int ret;
>
> if (shrinker->flags & SHRINKER_NUMA_AWARE)
> size *= nr_node_ids;
> @@ -314,10 +352,21 @@ int register_shrinker(struct shrinker *shrinker)
> if (!shrinker->nr_deferred)
> return -ENOMEM;
>
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> + ret = add_memcg_shrinker(shrinker);
> + if (ret)
> + goto free_deferred;
> + }
> +

This doesn't apply anymore, not after commit 8e04944f0ea8a ("mm,vmscan:
Allow preallocating memory for register_shrinker()"). Please rebase.

I guess now you have to allocate an id in prealloc_shrinker and set the
pointer (with idr_replace) in register_shrinker_prepared.
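
For illustration, a minimal sketch of how that split could look (the helper
names and the SHRINKER_PREALLOCATED placeholder are assumptions of this
sketch, not code from the thread):

#define SHRINKER_PREALLOCATED ((struct shrinker *)~0UL)

/* Called from prealloc_shrinker(): reserve an id, but keep the shrinker
 * invisible to shrink_slab() until registration completes. */
static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
	int id;

	down_write(&shrinker_rwsem);
	id = idr_alloc(&shrinker_idr, SHRINKER_PREALLOCATED, 0, 0, GFP_KERNEL);
	up_write(&shrinker_rwsem);
	if (id < 0)
		return id;
	shrinker->id = id;
	return 0;
}

/* Called from register_shrinker_prepared(): publish the real pointer. */
static void register_memcg_shrinker(struct shrinker *shrinker)
{
	down_write(&shrinker_rwsem);
	idr_replace(&shrinker_idr, shrinker, shrinker->id);
	up_write(&shrinker_rwsem);
}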

> down_write(&shrinker_rwsem);
> list_add_tail(&shrinker->list, &shrinker_list);
> up_write(&shrinker_rwsem);
> return 0;
> +
> +free_deferred:
> + kfree(shrinker->nr_deferred);
> + shrinker->nr_deferred = NULL;
> + return -ENOMEM;
> }
> EXPORT_SYMBOL(register_shrinker);
>
> @@ -328,6 +377,8 @@ void unregister_shrinker(struct shrinker *shrinker)
> {
> if (!shrinker->nr_deferred)
> return;
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> + del_memcg_shrinker(shrinker);
> down_write(&shrinker_rwsem);
> list_del(&shrinker->list);
> up_write(&shrinker_rwsem);
>

2018-04-22 18:00:32

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On Tue, Apr 17, 2018 at 09:53:31PM +0300, Kirill Tkhai wrote:
> Imagine a big node with many cpus, memory cgroups and containers.
> Let we have 200 containers, every container has 10 mounts,
> and 10 cgroups. All container tasks don't touch foreign
> containers mounts. If there is intensive pages write,
> and global reclaim happens, a writing task has to iterate
> over all memcgs to shrink slab, before it's able to go
> to shrink_page_list().
>
> Iteration over all the memcg slabs is very expensive:
> the task has to visit 200 * 10 = 2000 shrinkers
> for every memcg, and since there are 2000 memcgs,
> the total calls are 2000 * 2000 = 4000000.
>
> So, the shrinker makes 4 million do_shrink_slab() calls
> just to try to isolate SWAP_CLUSTER_MAX pages in one
> of the actively writing memcg via shrink_page_list().
> I've observed a node spending almost 100% in kernel,
> making useless iteration over already shrinked slab.
>
> This patch adds bitmap of memcg-aware shrinkers to memcg.
> The size of the bitmap depends on bitmap_nr_ids, and during
> memcg life it's maintained to be enough to fit bitmap_nr_ids
> shrinkers. Every bit in the map is related to corresponding
> shrinker id.
>
> Next patches will maintain set bit only for really charged
> memcg. This will allow shrink_slab() to increase its
> performance in significant way. See the last patch for
> the numbers.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/memcontrol.h | 15 +++++
> mm/memcontrol.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 21 +++++++
> 3 files changed, 160 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index af9eed2e3e04..2ec96ab46b01 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -115,6 +115,7 @@ struct mem_cgroup_per_node {
> unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>
> struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];

> + struct memcg_shrinker_map __rcu *shrinkers_map;

shrinker_map

>
> struct rb_node tree_node; /* RB tree node */
> unsigned long usage_in_excess;/* Set to the value by which */
> @@ -153,6 +154,11 @@ struct mem_cgroup_thresholds {
> struct mem_cgroup_threshold_ary *spare;
> };
>
> +struct memcg_shrinker_map {
> + struct rcu_head rcu;
> + unsigned long map[0];
> +};
> +

This struct should be defined before struct mem_cgroup_per_node.

A comment explaining what this map is for and what it maps would be
really helpful.
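
For instance, a comment along these lines (my wording, condensed from the
patch description above) would do:

/*
 * Bitmap of memcg-aware shrinkers that may have objects charged to this
 * memcg on this node: bit 'i' corresponds to the shrinker with id 'i',
 * so shrink_slab() can walk only the set bits instead of every registered
 * shrinker. The map is grown (RCU-replaced) as new shrinker ids appear.
 */
struct memcg_shrinker_map {
	struct rcu_head rcu;
	unsigned long map[0];
};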

> enum memcg_kmem_state {
> KMEM_NONE,
> KMEM_ALLOCATED,
> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
> void memcg_get_cache_ids(void);
> void memcg_put_cache_ids(void);
>
> +extern int shrinkers_max_nr;
> +

memcg_shrinker_id_max?

> /*
> * Helper macro to loop through all memcg-specific caches. Callers must still
> * check if the cache is valid (it is either valid or NULL).
> @@ -1223,6 +1231,13 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
> return memcg ? memcg->kmemcg_id : -1;
> }
>
> +extern struct memcg_shrinker_map __rcu *root_shrinkers_map[];
> +#define SHRINKERS_MAP(memcg, nid) \
> + (memcg == root_mem_cgroup || !memcg ? \
> + root_shrinkers_map[nid] : memcg->nodeinfo[nid]->shrinkers_map)
> +
> +extern int expand_shrinker_maps(int old_id, int id);
> +

I'm strongly against using a special map for the root cgroup. I'd prefer
to disable this optimization for the root cgroup altogether and simply
iterate over all registered shrinkers when it comes to global reclaim.
It shouldn't degrade performance as the root cgroup is singular.

> #else
> #define for_each_memcg_cache_index(_idx) \
> for (; NULL; )
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2959a454a072..562dfb1be9ef 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -305,6 +305,113 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
>
> struct workqueue_struct *memcg_kmem_cache_wq;
>
> +static DECLARE_RWSEM(shrinkers_max_nr_rwsem);

Why rwsem? AFAIU you want to synchronize between two code paths: when a
memory cgroup is allocated (or switched online?) to allocate a shrinker
map for it and in the function growing shrinker maps for all cgroups.
A mutex should be enough for this.
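
A sketch of what that could look like (the mutex and helper names here are
assumptions; error handling is minimal):

static DEFINE_MUTEX(memcg_shrinker_map_mutex);

/* Allocate the initial map for one node of a new memcg; the mutex
 * serializes this against the code growing the maps of all cgroups. */
static int memcg_alloc_shrinker_map(struct mem_cgroup *memcg, int nid, int nr_ids)
{
	struct memcg_shrinker_map *map;
	int size = DIV_ROUND_UP(nr_ids, BITS_PER_BYTE);

	mutex_lock(&memcg_shrinker_map_mutex);
	map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
	if (map)
		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinkers_map, map);
	mutex_unlock(&memcg_shrinker_map_mutex);

	return map ? 0 : -ENOMEM;
}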

> +struct memcg_shrinker_map __rcu *root_shrinkers_map[MAX_NUMNODES] = { 0 };
> +
> +static void get_shrinkers_max_nr(void)
> +{
> + down_read(&shrinkers_max_nr_rwsem);
> +}
> +
> +static void put_shrinkers_max_nr(void)
> +{
> + up_read(&shrinkers_max_nr_rwsem);
> +}
> +
> +static void kvfree_map_rcu(struct rcu_head *head)

free_shrinker_map_rcu

> +{
> + kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> +}
> +
> +static int memcg_expand_maps(struct mem_cgroup *memcg, int nid,

Bad name: the function reallocates just one map, not many maps; the name
doesn't indicate that it is about shrinkers; the name is inconsistent
with alloc_shrinker_maps and free_shrinker_maps. Please fix.

> + int size, int old_size)
> +{
> + struct memcg_shrinker_map *new, *old;
> +
> + lockdep_assert_held(&shrinkers_max_nr_rwsem);
> +
> + new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
> + if (!new)
> + return -ENOMEM;
> +
> + /* Set all old bits, clear all new bits */
> + memset(new->map, (int)0xff, old_size);
> + memset((void *)new->map + old_size, 0, size - old_size);
> +
> + old = rcu_dereference_protected(SHRINKERS_MAP(memcg, nid), true);
> +
> + if (memcg)
> + rcu_assign_pointer(memcg->nodeinfo[nid]->shrinkers_map, new);
> + else
> + rcu_assign_pointer(root_shrinkers_map[nid], new);
> +
> + if (old)
> + call_rcu(&old->rcu, kvfree_map_rcu);
> +
> + return 0;
> +}
> +
> +static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
> +{
> + /* Skip allocation, when we're initializing root_mem_cgroup */
> + if (!root_mem_cgroup)
> + return 0;
> +
> + return memcg_expand_maps(memcg, nid, shrinkers_max_nr/BITS_PER_BYTE, 0);
> +}
> +
> +static void free_shrinker_maps(struct mem_cgroup *memcg,
> + struct mem_cgroup_per_node *pn)
> +{
> + struct memcg_shrinker_map *map;
> +
> + if (memcg == root_mem_cgroup)
> + return;
> +
> + /* IDR unhashed long ago, and expand_shrinker_maps can't race with us */
> + map = rcu_dereference_protected(pn->shrinkers_map, true);
> + kvfree_map_rcu(&map->rcu);
> +}
> +
> +static struct idr mem_cgroup_idr;
> +
> +int expand_shrinker_maps(int old_nr, int nr)
> +{
> + int id, size, old_size, node, ret;
> + struct mem_cgroup *memcg;
> +
> + old_size = old_nr / BITS_PER_BYTE;
> + size = nr / BITS_PER_BYTE;
> +
> + down_write(&shrinkers_max_nr_rwsem);
> + for_each_node(node) {

Iterating over cgroups first, numa nodes second seems like a better idea
to me. I think you should fold for_each_node in memcg_expand_maps.
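
Roughly like this, as a sketch (the function name is an assumption; the
bit copying follows the patch, and kvfree_map_rcu is the patch's RCU
callback):

static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
					 int size, int old_size)
{
	struct memcg_shrinker_map *new, *old;
	int nid;

	for_each_node(nid) {
		old = rcu_dereference_protected(
				memcg->nodeinfo[nid]->shrinkers_map, true);
		new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
		if (!new)
			return -ENOMEM;

		/* Keep all previously set bits, clear only the new tail */
		memset(new->map, (int)0xff, old_size);
		memset((void *)new->map + old_size, 0, size - old_size);

		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinkers_map, new);
		if (old)
			call_rcu(&old->rcu, kvfree_map_rcu);
	}
	return 0;
}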

> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {

Iterating over mem_cgroup_idr looks strange. Why don't you use
for_each_mem_cgroup?

> + if (id == 1)
> + memcg = NULL;
> + ret = memcg_expand_maps(memcg, node, size, old_size);
> + if (ret)
> + goto unlock;
> + }
> +
> + /* root_mem_cgroup is not initialized yet */
> + if (id == 0)
> + ret = memcg_expand_maps(NULL, node, size, old_size);
> + }
> +unlock:
> + up_write(&shrinkers_max_nr_rwsem);
> + return ret;
> +}
> +#else /* CONFIG_SLOB */
> +static void get_shrinkers_max_nr(void) { }
> +static void put_shrinkers_max_nr(void) { }
> +
> +static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
> +{
> + return 0;
> +}
> +static void free_shrinker_maps(struct mem_cgroup *memcg,
> + struct mem_cgroup_per_node *pn) { }
> +
> #endif /* !CONFIG_SLOB */
>
> /**
> @@ -3002,6 +3109,8 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
> }
>
> #ifndef CONFIG_SLOB
> +int shrinkers_max_nr;
> +
> static int memcg_online_kmem(struct mem_cgroup *memcg)
> {
> int memcg_id;
> @@ -4266,7 +4375,10 @@ static DEFINE_IDR(mem_cgroup_idr);
> static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
> {
> if (memcg->id.id > 0) {
> + /* Removing IDR must be visible for expand_shrinker_maps() */
> + get_shrinkers_max_nr();
> idr_remove(&mem_cgroup_idr, memcg->id.id);
> + put_shrinkers_max_nr();
> memcg->id.id = 0;
> }
> }
> @@ -4333,12 +4445,17 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> if (!pn->lruvec_stat_cpu)
> goto err_pcpu;
>
> + if (alloc_shrinker_maps(memcg, node))
> + goto err_maps;
> +
> lruvec_init(&pn->lruvec);
> pn->usage_in_excess = 0;
> pn->on_tree = false;
> pn->memcg = memcg;
> return 0;
>
> +err_maps:
> + free_percpu(pn->lruvec_stat_cpu);
> err_pcpu:
> memcg->nodeinfo[node] = NULL;
> kfree(pn);
> @@ -4352,6 +4469,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> if (!pn)
> return;
>
> + free_shrinker_maps(memcg, pn);
> free_percpu(pn->lruvec_stat_cpu);
> kfree(pn);
> }
> @@ -4407,13 +4525,18 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> #ifdef CONFIG_CGROUP_WRITEBACK
> INIT_LIST_HEAD(&memcg->cgwb_list);
> #endif
> +
> + get_shrinkers_max_nr();
> for_each_node(node)
> - if (alloc_mem_cgroup_per_node_info(memcg, node))
> + if (alloc_mem_cgroup_per_node_info(memcg, node)) {
> + put_shrinkers_max_nr();
> goto fail;
> + }
>
> memcg->id.id = idr_alloc(&mem_cgroup_idr, memcg,
> 1, MEM_CGROUP_ID_MAX,
> GFP_KERNEL);
> + put_shrinkers_max_nr();
> if (memcg->id.id < 0)
> goto fail;
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f02fe83537e..f63eb5596c35 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -172,6 +172,22 @@ static DECLARE_RWSEM(shrinker_rwsem);
> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> static DEFINE_IDR(shrinkers_id_idr);
>
> +static int expand_shrinker_id(int id)
> +{
> + if (likely(id < shrinkers_max_nr))
> + return 0;
> +
> + id = shrinkers_max_nr * 2;
> + if (id == 0)
> + id = BITS_PER_BYTE;
> +
> + if (expand_shrinker_maps(shrinkers_max_nr, id))
> + return -ENOMEM;
> +
> + shrinkers_max_nr = id;
> + return 0;
> +}
> +

I think this function should live in memcontrol.c and shrinkers_max_nr
should be private to memcontrol.c.

> static int add_memcg_shrinker(struct shrinker *shrinker)
> {
> int id, ret;
> @@ -180,6 +196,11 @@ static int add_memcg_shrinker(struct shrinker *shrinker)
> ret = id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
> if (ret < 0)
> goto unlock;
> + ret = expand_shrinker_id(id);
> + if (ret < 0) {
> + idr_remove(&shrinkers_id_idr, id);
> + goto unlock;
> + }
> shrinker->id = id;
> ret = 0;
> unlock:
>

2018-04-22 18:05:00

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] fs: Propagate shrinker::id to list_lru

On Tue, Apr 17, 2018 at 09:53:47PM +0300, Kirill Tkhai wrote:
> The patch adds list_lru::shrinker_id field, and populates
> it by registered shrinker id.
>
> This will be used to set correct bit in memcg shrinkers
> map by lru code in next patches, after there appeared
> the first related to memcg element in list_lru.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> fs/super.c | 4 +++-
> include/linux/list_lru.h | 1 +
> include/linux/shrinker.h | 8 +++++++-
> mm/list_lru.c | 6 ++++++
> mm/vmscan.c | 15 ++++++++++-----
> mm/workingset.c | 3 ++-
> 6 files changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 5fa9a8d8d865..9bc5698c8c3c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -518,7 +518,9 @@ struct super_block *sget_userns(struct file_system_type *type,
> hlist_add_head(&s->s_instances, &type->fs_supers);
> spin_unlock(&sb_lock);
> get_filesystem(type);
> - err = register_shrinker(&s->s_shrink);
> + err = register_shrinker_args(&s->s_shrink, 2,
> + &s->s_dentry_lru.shrinker_id,
> + &s->s_inode_lru.shrinker_id);

This looks ugly. Maybe we could allocate an id in prealloc_shrinker() and
then simply pass it to list_lru_init() as an argument?
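
A rough sketch of that flow for the superblock case (prealloc_shrinker()
and free_prealloced_shrinker() are from commit 8e04944f0ea8; the
two-argument list_lru_init_memcg() shown here is hypothetical):

static int sb_init_lrus(struct super_block *s)
{
	int err;

	/* reserves s->s_shrink.id without publishing the shrinker */
	err = prealloc_shrinker(&s->s_shrink);
	if (err)
		return err;

	/* hypothetical signature: the lru just records the preallocated id */
	err = list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink);
	if (err)
		goto free_shrinker;
	err = list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink);
	if (err)
		goto free_dentry_lru;
	return 0;

free_dentry_lru:
	list_lru_destroy(&s->s_dentry_lru);
free_shrinker:
	free_prealloced_shrinker(&s->s_shrink);
	return err;
}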

2018-04-22 18:21:28

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 10/12] mm: Iterate only over charged shrinkers during memcg shrink_slab()

On Tue, Apr 17, 2018 at 09:54:34PM +0300, Kirill Tkhai wrote:
> Using the preparations made in previous patches, in case of memcg
> shrink, we may avoid shrinkers, which are not set in memcg's shrinkers
> bitmap. To do that, we separate iterations over memcg-aware and
> !memcg-aware shrinkers, and memcg-aware shrinkers are chosen
> via for_each_set_bit() from the bitmap. In case of big nodes,
> having many isolated environments, this gives significant
> performance growth. See next patches for the details.
>
> Note, that the patch does not respect to empty memcg shrinkers,
> since we never clear the bitmap bits after we set it once.
> Their shrinkers will be called again, with no shrinked objects
> as result. This functionality is provided by next patches.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> mm/vmscan.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 72 insertions(+), 16 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 34cd1d9b8b22..b81b8a7727b5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -169,6 +169,20 @@ unsigned long vm_total_pages;
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> +static void link_shrinker(struct shrinker *shrinker)
> +{
> + down_write(&shrinker_rwsem);
> + list_add_tail(&shrinker->list, &shrinker_list);
> + up_write(&shrinker_rwsem);
> +}
> +
> +static void unlink_shrinker(struct shrinker *shrinker)
> +{
> + down_write(&shrinker_rwsem);
> + list_del(&shrinker->list);
> + up_write(&shrinker_rwsem);
> +}
> +
> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> static DEFINE_IDR(shrinkers_id_idr);
>
> @@ -221,11 +235,13 @@ static void del_memcg_shrinker(struct shrinker *shrinker)
> #else /* CONFIG_MEMCG && !CONFIG_SLOB */
> static int add_memcg_shrinker(struct shrinker *shrinker, int nr, va_list args)
> {
> + link_shrinker(shrinker);
> return 0;
> }
>
> static void del_memcg_shrinker(struct shrinker *shrinker)
> {
> + unlink_shrinker(shrinker);
> }
> #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
>
> @@ -382,11 +398,9 @@ int __register_shrinker(struct shrinker *shrinker, int nr, ...)
> va_end(args);
> if (ret)
> goto free_deferred;
> - }
> + } else
> + link_shrinker(shrinker);
>
> - down_write(&shrinker_rwsem);
> - list_add_tail(&shrinker->list, &shrinker_list);
> - up_write(&shrinker_rwsem);
> return 0;
>
> free_deferred:
> @@ -405,9 +419,8 @@ void unregister_shrinker(struct shrinker *shrinker)
> return;
> if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> del_memcg_shrinker(shrinker);
> - down_write(&shrinker_rwsem);
> - list_del(&shrinker->list);
> - up_write(&shrinker_rwsem);
> + else
> + unlink_shrinker(shrinker);

I really don't like that depending on the config, the shrinker_list
stores either all shrinkers or only memcg-unaware ones. I think it
should always store all shrinkers and it should be used in case of
global reclaim. That is IMO shrink_slab should look like this:

shrink_slab(memcg)
{
	if (!mem_cgroup_is_root(memcg))
		return shrink_slab_memcg()
	list_for_each(shrinker, shrinker_list, link)
		do_shrink_slab()
}

Yeah, that means that for the root mem cgroup we will always call all
shrinkers, but IMO that is OK: there is only one root mem cgroup, and it is
visited only on global reclaim, so this shouldn't degrade performance.

> kfree(shrinker->nr_deferred);
> shrinker->nr_deferred = NULL;
> }
> @@ -532,6 +545,53 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> return freed;
> }
>
> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> + struct mem_cgroup *memcg,
> + int priority)
> +{
> + struct memcg_shrinker_map *map;
> + unsigned long freed = 0;
> + int ret, i;
> +
> + if (!down_read_trylock(&shrinker_rwsem))
> + return 0;
> +
> + /*
> + * 1)Caller passes only alive memcg, so map can't be NULL.
> + * 2)shrinker_rwsem protects from maps expanding.
> + */
> + map = rcu_dereference_protected(SHRINKERS_MAP(memcg, nid), true);
> + BUG_ON(!map);
> +
> + for_each_set_bit(i, map->map, shrinkers_max_nr) {
> + struct shrink_control sc = {
> + .gfp_mask = gfp_mask,
> + .nid = nid,
> + .memcg = memcg,
> + };
> + struct shrinker *shrinker;
> +
> + shrinker = idr_find(&shrinkers_id_idr, i);
> + if (!shrinker) {
> + clear_bit(i, map->map);
> + continue;
> + }
> +
> + ret = do_shrink_slab(&sc, shrinker, priority);
> + freed += ret;
> +
> + if (rwsem_is_contended(&shrinker_rwsem)) {
> + freed = freed ? : 1;
> + break;
> + }
> + }
> +
> + up_read(&shrinker_rwsem);
> + return freed;
> +}
> +#endif
> +
> /**
> * shrink_slab - shrink slab caches
> * @gfp_mask: allocation context
> @@ -564,6 +624,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
> return 0;

The check above should be moved to shrink_slab_memcg.

>
> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)

Please don't use ifdef here - define a stub function for no-memcg case.
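
For illustration, a stub along these lines (a sketch) would let the call
site in shrink_slab() stay ifdef-free:

#if !defined(CONFIG_MEMCG) || defined(CONFIG_SLOB)
/* No memcg-aware shrinking in this configuration. */
static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
				       struct mem_cgroup *memcg,
				       int priority)
{
	return 0;
}
#endif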

> + if (memcg)
> + return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> +#endif
> +
> if (!down_read_trylock(&shrinker_rwsem))
> goto out;
>
> @@ -574,15 +639,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> .memcg = memcg,
> };
>
> - /*
> - * If kernel memory accounting is disabled, we ignore
> - * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> - * passing NULL for memcg.
> - */
> - if (memcg_kmem_enabled() &&
> - !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> - continue;
> -
> if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> sc.nid = 0;
>
>

2018-04-22 18:22:57

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 12/12] mm: Clear shrinker bit if there are no objects related to memcg

On Tue, Apr 17, 2018 at 09:54:51PM +0300, Kirill Tkhai wrote:
> To avoid further unneed calls of do_shrink_slab()
> for shrinkers, which already do not have any charged
> objects in a memcg, their bits have to be cleared.
>
> This patch introduces a lockless mechanism to do that
> without races without parallel list lru add. After
> do_shrink_slab() returns SHRINK_EMPTY the first time,
> we clear the bit and call it once again. Then we restore
> the bit, if the new return value is different.
>
> Note, that single smp_mb__after_atomic() in shrink_slab_memcg()
> covers two situations:
>
> 1)list_lru_add()     shrink_slab_memcg
>     list_add_tail()    for_each_set_bit() <--- read bit
>                          do_shrink_slab() <--- missed list update (no barrier)
>     <MB>                 <MB>
>     set_bit()            do_shrink_slab() <--- seen list update
>
> This situation, when the first do_shrink_slab() sees set bit,
> but it doesn't see list update (i.e., race with the first element
> queueing), is rare. So we don't add <MB> before the first call
> of do_shrink_slab() instead of this to do not slow down generic
> case. Also, it's need the second call as seen in below in (2).
>
> 2)list_lru_add()      shrink_slab_memcg()
>     list_add_tail()     ...
>     set_bit()           ...
>     ...                 for_each_set_bit()
>     do_shrink_slab()      do_shrink_slab()
>       clear_bit()           ...
>     ...                     ...
>     list_lru_add()          ...
>       list_add_tail()       clear_bit()
>     <MB>                    <MB>
>     set_bit()               do_shrink_slab()
>
> The barriers guarantees, the second do_shrink_slab()
> in the right side task sees list update if really
> cleared the bit. This case is drawn in the code comment.
>
> [Results/performance of the patchset]
>
> After the whole patchset applied the below test shows signify
> increase of performance:
>
> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
> $mkdir /sys/fs/cgroup/memory/ct
> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
> $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done
>
> Then, 4 sequential calls of drop caches:
> $time echo 3 > /proc/sys/vm/drop_caches
>
> 1)Before:
> 0.00user 8.99system 0:08.99elapsed 99%CPU
> 0.00user 5.97system 0:05.97elapsed 100%CPU
> 0.00user 5.97system 0:05.97elapsed 100%CPU
> 0.00user 5.85system 0:05.85elapsed 100%CPU
>
> 2)After
> 0.00user 1.11system 0:01.12elapsed 99%CPU
> 0.00user 0.00system 0:00.00elapsed 100%CPU
> 0.00user 0.00system 0:00.00elapsed 100%CPU
> 0.00user 0.00system 0:00.00elapsed 100%CPU
>
> Even if we round 0:00.00 up to 0:00.01, the results shows
> the performance increases at least in 585 times.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> include/linux/memcontrol.h | 2 ++
> mm/vmscan.c | 19 +++++++++++++++++--
> 2 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e1c1fa8e417a..1c5c68550e2f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1245,6 +1245,8 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
>
> rcu_read_lock();
> map = SHRINKERS_MAP(memcg, nid);
> + /* Pairs with smp mb in shrink_slab() */
> + smp_mb__before_atomic();
> set_bit(nr, map->map);
> rcu_read_unlock();
> }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3be9b4d81c13..a8733bc5377b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -579,8 +579,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> }
>
> ret = do_shrink_slab(&sc, shrinker, priority);
> - if (ret == SHRINK_EMPTY)
> - ret = 0;
> + if (ret == SHRINK_EMPTY) {
> + clear_bit(i, map->map);
> + /*
> + * Pairs with mb in set_shrinker_bit():
> + *
> + * list_lru_add() shrink_slab_memcg()
> + * list_add_tail() clear_bit()
> + * <MB> <MB>
> + * set_bit() do_shrink_slab()
> + */
> + smp_mb__after_atomic();
> + ret = do_shrink_slab(&sc, shrinker, priority);
> + if (ret == SHRINK_EMPTY)
> + ret = 0;
> + else
> + set_shrinker_bit(memcg, nid, i);
> + }

This is mind-boggling. Are there any alternatives? For instance, can't
we clear the bit in list_lru_del, when we hold the list lock?

> freed += ret;
>
> if (rwsem_is_contended(&shrinker_rwsem)) {
>

2018-04-23 10:02:52

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 12/12] mm: Clear shrinker bit if there are no objects related to memcg

On 22.04.2018 21:21, Vladimir Davydov wrote:
> On Tue, Apr 17, 2018 at 09:54:51PM +0300, Kirill Tkhai wrote:
>> To avoid further unneed calls of do_shrink_slab()
>> for shrinkers, which already do not have any charged
>> objects in a memcg, their bits have to be cleared.
>>
>> This patch introduces a lockless mechanism to do that
>> without races without parallel list lru add. After
>> do_shrink_slab() returns SHRINK_EMPTY the first time,
>> we clear the bit and call it once again. Then we restore
>> the bit, if the new return value is different.
>>
>> Note, that single smp_mb__after_atomic() in shrink_slab_memcg()
>> covers two situations:
>>
>> 1)list_lru_add()     shrink_slab_memcg
>>     list_add_tail()    for_each_set_bit() <--- read bit
>>                          do_shrink_slab() <--- missed list update (no barrier)
>>     <MB>                 <MB>
>>     set_bit()            do_shrink_slab() <--- seen list update
>>
>> This situation, when the first do_shrink_slab() sees set bit,
>> but it doesn't see list update (i.e., race with the first element
>> queueing), is rare. So we don't add <MB> before the first call
>> of do_shrink_slab() instead of this to do not slow down generic
>> case. Also, it's need the second call as seen in below in (2).
>>
>> 2)list_lru_add()      shrink_slab_memcg()
>>     list_add_tail()     ...
>>     set_bit()           ...
>>     ...                 for_each_set_bit()
>>     do_shrink_slab()      do_shrink_slab()
>>       clear_bit()           ...
>>     ...                     ...
>>     list_lru_add()          ...
>>       list_add_tail()       clear_bit()
>>     <MB>                    <MB>
>>     set_bit()               do_shrink_slab()
>>
>> The barriers guarantees, the second do_shrink_slab()
>> in the right side task sees list update if really
>> cleared the bit. This case is drawn in the code comment.
>>
>> [Results/performance of the patchset]
>>
>> After the whole patchset applied the below test shows signify
>> increase of performance:
>>
>> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
>> $mkdir /sys/fs/cgroup/memory/ct
>> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
>> $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done
>>
>> Then, 4 sequential calls of drop caches:
>> $time echo 3 > /proc/sys/vm/drop_caches
>>
>> 1)Before:
>> 0.00user 8.99system 0:08.99elapsed 99%CPU
>> 0.00user 5.97system 0:05.97elapsed 100%CPU
>> 0.00user 5.97system 0:05.97elapsed 100%CPU
>> 0.00user 5.85system 0:05.85elapsed 100%CPU
>>
>> 2)After
>> 0.00user 1.11system 0:01.12elapsed 99%CPU
>> 0.00user 0.00system 0:00.00elapsed 100%CPU
>> 0.00user 0.00system 0:00.00elapsed 100%CPU
>> 0.00user 0.00system 0:00.00elapsed 100%CPU
>>
>> Even if we round 0:00.00 up to 0:00.01, the results shows
>> the performance increases at least in 585 times.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/memcontrol.h | 2 ++
>> mm/vmscan.c | 19 +++++++++++++++++--
>> 2 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index e1c1fa8e417a..1c5c68550e2f 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -1245,6 +1245,8 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
>>
>> rcu_read_lock();
>> map = SHRINKERS_MAP(memcg, nid);
>> + /* Pairs with smp mb in shrink_slab() */
>> + smp_mb__before_atomic();
>> set_bit(nr, map->map);
>> rcu_read_unlock();
>> }
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 3be9b4d81c13..a8733bc5377b 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -579,8 +579,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>> }
>>
>> ret = do_shrink_slab(&sc, shrinker, priority);
>> - if (ret == SHRINK_EMPTY)
>> - ret = 0;
>> + if (ret == SHRINK_EMPTY) {
>> + clear_bit(i, map->map);
>> + /*
>> + * Pairs with mb in set_shrinker_bit():
>> + *
>> + * list_lru_add() shrink_slab_memcg()
>> + * list_add_tail() clear_bit()
>> + * <MB> <MB>
>> + * set_bit() do_shrink_slab()
>> + */
>> + smp_mb__after_atomic();
>> + ret = do_shrink_slab(&sc, shrinker, priority);
>> + if (ret == SHRINK_EMPTY)
>> + ret = 0;
>> + else
>> + set_shrinker_bit(memcg, nid, i);
>> + }
>
> This is mind-boggling. Are there any alternatives? For instance, can't
> we clear the bit in list_lru_del, when we hold the list lock?

Since a single shrinker may iterate over several lru lists, we can't do that.
Otherwise, from the lru list that became empty in list_lru_del(), we would
have to probe the shrinker's other lru lists as well.

The solution I suggested is generic and does not depend on the low-level
structure type used by the shrinker. It doesn't even have to be an lru list.
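
For illustration (a sketch using the superblock shrinker as the example;
the function name is made up): one shrinker bit covers several list_lrus,
so the emptiness of a single lru says nothing about the bit:

static unsigned long sb_memcg_objects(struct super_block *sb,
				      struct shrink_control *sc)
{
	unsigned long count = 0;

	/* One shrinker, two lrus: the bit may be cleared only when both
	 * are empty for this memcg, which list_lru_del() of one of them
	 * cannot know without poking at the other under a different lock. */
	count += list_lru_shrink_count(&sb->s_dentry_lru, sc);
	count += list_lru_shrink_count(&sb->s_inode_lru, sc);
	return count;
}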

>> freed += ret;
>>
>> if (rwsem_is_contended(&shrinker_rwsem)) {

Kirill

2018-04-23 10:57:12

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On 22.04.2018 20:59, Vladimir Davydov wrote:
> On Tue, Apr 17, 2018 at 09:53:31PM +0300, Kirill Tkhai wrote:
>> Imagine a big node with many cpus, memory cgroups and containers.
>> Let we have 200 containers, every container has 10 mounts,
>> and 10 cgroups. All container tasks don't touch foreign
>> containers mounts. If there is intensive pages write,
>> and global reclaim happens, a writing task has to iterate
>> over all memcgs to shrink slab, before it's able to go
>> to shrink_page_list().
>>
>> Iteration over all the memcg slabs is very expensive:
>> the task has to visit 200 * 10 = 2000 shrinkers
>> for every memcg, and since there are 2000 memcgs,
>> the total calls are 2000 * 2000 = 4000000.
>>
>> So, the shrinker makes 4 million do_shrink_slab() calls
>> just to try to isolate SWAP_CLUSTER_MAX pages in one
>> of the actively writing memcg via shrink_page_list().
>> I've observed a node spending almost 100% in kernel,
>> making useless iteration over already shrinked slab.
>>
>> This patch adds bitmap of memcg-aware shrinkers to memcg.
>> The size of the bitmap depends on bitmap_nr_ids, and during
>> memcg life it's maintained to be enough to fit bitmap_nr_ids
>> shrinkers. Every bit in the map is related to corresponding
>> shrinker id.
>>
>> Next patches will maintain set bit only for really charged
>> memcg. This will allow shrink_slab() to increase its
>> performance in significant way. See the last patch for
>> the numbers.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/memcontrol.h | 15 +++++
>> mm/memcontrol.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>> mm/vmscan.c | 21 +++++++
>> 3 files changed, 160 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index af9eed2e3e04..2ec96ab46b01 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -115,6 +115,7 @@ struct mem_cgroup_per_node {
>> unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>>
>> struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
>
>> + struct memcg_shrinker_map __rcu *shrinkers_map;
>
> shrinker_map
>
>>
>> struct rb_node tree_node; /* RB tree node */
>> unsigned long usage_in_excess;/* Set to the value by which */
>> @@ -153,6 +154,11 @@ struct mem_cgroup_thresholds {
>> struct mem_cgroup_threshold_ary *spare;
>> };
>>
>> +struct memcg_shrinker_map {
>> + struct rcu_head rcu;
>> + unsigned long map[0];
>> +};
>> +
>
> This struct should be defined before struct mem_cgroup_per_node.
>
> A comment explaining what this map is for and what it maps would be
> really helpful.
>
>> enum memcg_kmem_state {
>> KMEM_NONE,
>> KMEM_ALLOCATED,
>> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
>> void memcg_get_cache_ids(void);
>> void memcg_put_cache_ids(void);
>>
>> +extern int shrinkers_max_nr;
>> +
>
> memcg_shrinker_id_max?

memcg_shrinker_id_max sounds like an inclusive value, doesn't it?
While in fact shrinker->id < shrinkers_max_nr.

Let's use memcg_shrinker_nr_max instead.

>> /*
>> * Helper macro to loop through all memcg-specific caches. Callers must still
>> * check if the cache is valid (it is either valid or NULL).
>> @@ -1223,6 +1231,13 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
>> return memcg ? memcg->kmemcg_id : -1;
>> }
>>
>> +extern struct memcg_shrinker_map __rcu *root_shrinkers_map[];
>> +#define SHRINKERS_MAP(memcg, nid) \
>> + (memcg == root_mem_cgroup || !memcg ? \
>> + root_shrinkers_map[nid] : memcg->nodeinfo[nid]->shrinkers_map)
>> +
>> +extern int expand_shrinker_maps(int old_id, int id);
>> +
>
> I'm strongly against using a special map for the root cgroup. I'd prefer
> to disable this optimization for the root cgroup altogether and simply
> iterate over all registered shrinkers when it comes to global reclaim.
> It shouldn't degrade performance as the root cgroup is singular.
>
>> #else
>> #define for_each_memcg_cache_index(_idx) \
>> for (; NULL; )
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 2959a454a072..562dfb1be9ef 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -305,6 +305,113 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
>>
>> struct workqueue_struct *memcg_kmem_cache_wq;
>>
>> +static DECLARE_RWSEM(shrinkers_max_nr_rwsem);
>
> Why rwsem? AFAIU you want to synchronize between two code paths: when a
> memory cgroup is allocated (or switched online?) to allocate a shrinker
> map for it and in the function growing shrinker maps for all cgroups.
> A mutex should be enough for this.
>
>> +struct memcg_shrinker_map __rcu *root_shrinkers_map[MAX_NUMNODES] = { 0 };
>> +
>> +static void get_shrinkers_max_nr(void)
>> +{
>> + down_read(&shrinkers_max_nr_rwsem);
>> +}
>> +
>> +static void put_shrinkers_max_nr(void)
>> +{
>> + up_read(&shrinkers_max_nr_rwsem);
>> +}
>> +
>> +static void kvfree_map_rcu(struct rcu_head *head)
>
> free_shrinker_map_rcu
>
>> +{
>> + kvfree(container_of(head, struct memcg_shrinker_map, rcu));
>> +}
>> +
>> +static int memcg_expand_maps(struct mem_cgroup *memcg, int nid,
>
> Bad name: the function reallocates just one map, not many maps; the name
> doesn't indicate that it is about shrinkers; the name is inconsistent
> with alloc_shrinker_maps and free_shrinker_maps. Please fix.
>
>> + int size, int old_size)
>> +{
>> + struct memcg_shrinker_map *new, *old;
>> +
>> + lockdep_assert_held(&shrinkers_max_nr_rwsem);
>> +
>> + new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
>> + if (!new)
>> + return -ENOMEM;
>> +
>> + /* Set all old bits, clear all new bits */
>> + memset(new->map, (int)0xff, old_size);
>> + memset((void *)new->map + old_size, 0, size - old_size);
>> +
>> + old = rcu_dereference_protected(SHRINKERS_MAP(memcg, nid), true);
>> +
>> + if (memcg)
>> + rcu_assign_pointer(memcg->nodeinfo[nid]->shrinkers_map, new);
>> + else
>> + rcu_assign_pointer(root_shrinkers_map[nid], new);
>> +
>> + if (old)
>> + call_rcu(&old->rcu, kvfree_map_rcu);
>> +
>> + return 0;
>> +}
>> +
>> +static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
>> +{
>> + /* Skip allocation, when we're initializing root_mem_cgroup */
>> + if (!root_mem_cgroup)
>> + return 0;
>> +
>> + return memcg_expand_maps(memcg, nid, shrinkers_max_nr/BITS_PER_BYTE, 0);
>> +}
>> +
>> +static void free_shrinker_maps(struct mem_cgroup *memcg,
>> + struct mem_cgroup_per_node *pn)
>> +{
>> + struct memcg_shrinker_map *map;
>> +
>> + if (memcg == root_mem_cgroup)
>> + return;
>> +
>> + /* IDR unhashed long ago, and expand_shrinker_maps can't race with us */
>> + map = rcu_dereference_protected(pn->shrinkers_map, true);
>> + kvfree_map_rcu(&map->rcu);
>> +}
>> +
>> +static struct idr mem_cgroup_idr;
>> +
>> +int expand_shrinker_maps(int old_nr, int nr)
>> +{
>> + int id, size, old_size, node, ret;
>> + struct mem_cgroup *memcg;
>> +
>> + old_size = old_nr / BITS_PER_BYTE;
>> + size = nr / BITS_PER_BYTE;
>> +
>> + down_write(&shrinkers_max_nr_rwsem);
>> + for_each_node(node) {
>
> Iterating over cgroups first, numa nodes second seems like a better idea
> to me. I think you should fold for_each_node in memcg_expand_maps.
>
>> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
>
> Iterating over mem_cgroup_idr looks strange. Why don't you use
> for_each_mem_cgroup?

We want to allocate the shrinker maps in mem_cgroup_css_alloc(), since
mem_cgroup_css_online() must not fail (that's a requirement of the currently
existing design of mem_cgroup::id).

A new memcg is added to the parent's list between these two calls:

css_create()
  ss->css_alloc()
  list_add_tail_rcu(&css->sibling, &parent_css->children)
  ss->css_online()

so for_each_mem_cgroup() does not see children that are allocated but not
yet linked.

>> + if (id == 1)
>> + memcg = NULL;
>> + ret = memcg_expand_maps(memcg, node, size, old_size);
>> + if (ret)
>> + goto unlock;
>> + }
>> +
>> + /* root_mem_cgroup is not initialized yet */
>> + if (id == 0)
>> + ret = memcg_expand_maps(NULL, node, size, old_size);
>> + }
>> +unlock:
>> + up_write(&shrinkers_max_nr_rwsem);
>> + return ret;
>> +}
>> +#else /* CONFIG_SLOB */
>> +static void get_shrinkers_max_nr(void) { }
>> +static void put_shrinkers_max_nr(void) { }
>> +
>> +static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
>> +{
>> + return 0;
>> +}
>> +static void free_shrinker_maps(struct mem_cgroup *memcg,
>> + struct mem_cgroup_per_node *pn) { }
>> +
>> #endif /* !CONFIG_SLOB */
>>
>> /**
>> @@ -3002,6 +3109,8 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>> }
>>
>> #ifndef CONFIG_SLOB
>> +int shrinkers_max_nr;
>> +
>> static int memcg_online_kmem(struct mem_cgroup *memcg)
>> {
>> int memcg_id;
>> @@ -4266,7 +4375,10 @@ static DEFINE_IDR(mem_cgroup_idr);
>> static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
>> {
>> if (memcg->id.id > 0) {
>> + /* Removing IDR must be visible for expand_shrinker_maps() */
>> + get_shrinkers_max_nr();
>> idr_remove(&mem_cgroup_idr, memcg->id.id);
>> + put_shrinkers_max_nr();
>> memcg->id.id = 0;
>> }
>> }
>> @@ -4333,12 +4445,17 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>> if (!pn->lruvec_stat_cpu)
>> goto err_pcpu;
>>
>> + if (alloc_shrinker_maps(memcg, node))
>> + goto err_maps;
>> +
>> lruvec_init(&pn->lruvec);
>> pn->usage_in_excess = 0;
>> pn->on_tree = false;
>> pn->memcg = memcg;
>> return 0;
>>
>> +err_maps:
>> + free_percpu(pn->lruvec_stat_cpu);
>> err_pcpu:
>> memcg->nodeinfo[node] = NULL;
>> kfree(pn);
>> @@ -4352,6 +4469,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>> if (!pn)
>> return;
>>
>> + free_shrinker_maps(memcg, pn);
>> free_percpu(pn->lruvec_stat_cpu);
>> kfree(pn);
>> }
>> @@ -4407,13 +4525,18 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>> #ifdef CONFIG_CGROUP_WRITEBACK
>> INIT_LIST_HEAD(&memcg->cgwb_list);
>> #endif
>> +
>> + get_shrinkers_max_nr();
>> for_each_node(node)
>> - if (alloc_mem_cgroup_per_node_info(memcg, node))
>> + if (alloc_mem_cgroup_per_node_info(memcg, node)) {
>> + put_shrinkers_max_nr();
>> goto fail;
>> + }
>>
>> memcg->id.id = idr_alloc(&mem_cgroup_idr, memcg,
>> 1, MEM_CGROUP_ID_MAX,
>> GFP_KERNEL);
>> + put_shrinkers_max_nr();
>> if (memcg->id.id < 0)
>> goto fail;
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 4f02fe83537e..f63eb5596c35 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -172,6 +172,22 @@ static DECLARE_RWSEM(shrinker_rwsem);
>> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>> static DEFINE_IDR(shrinkers_id_idr);
>>
>> +static int expand_shrinker_id(int id)
>> +{
>> + if (likely(id < shrinkers_max_nr))
>> + return 0;
>> +
>> + id = shrinkers_max_nr * 2;
>> + if (id == 0)
>> + id = BITS_PER_BYTE;
>> +
>> + if (expand_shrinker_maps(shrinkers_max_nr, id))
>> + return -ENOMEM;
>> +
>> + shrinkers_max_nr = id;
>> + return 0;
>> +}
>> +
>
> I think this function should live in memcontrol.c and shrinkers_max_nr
> should be private to memcontrol.c.
>
>> static int add_memcg_shrinker(struct shrinker *shrinker)
>> {
>> int id, ret;
>> @@ -180,6 +196,11 @@ static int add_memcg_shrinker(struct shrinker *shrinker)
>> ret = id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
>> if (ret < 0)
>> goto unlock;
>> + ret = expand_shrinker_id(id);
>> + if (ret < 0) {
>> + idr_remove(&shrinkers_id_idr, id);
>> + goto unlock;
>> + }
>> shrinker->id = id;
>> ret = 0;
>> unlock:
>>

2018-04-23 11:05:19

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On 22.04.2018 20:59, Vladimir Davydov wrote:
> On Tue, Apr 17, 2018 at 09:53:31PM +0300, Kirill Tkhai wrote:
>> Imagine a big node with many cpus, memory cgroups and containers.
>> Let we have 200 containers, every container has 10 mounts,
>> and 10 cgroups. All container tasks don't touch foreign
>> containers mounts. If there is intensive pages write,
>> and global reclaim happens, a writing task has to iterate
>> over all memcgs to shrink slab, before it's able to go
>> to shrink_page_list().
>>
>> Iteration over all the memcg slabs is very expensive:
>> the task has to visit 200 * 10 = 2000 shrinkers
>> for every memcg, and since there are 2000 memcgs,
>> the total calls are 2000 * 2000 = 4000000.
>>
>> So, the shrinker makes 4 million do_shrink_slab() calls
>> just to try to isolate SWAP_CLUSTER_MAX pages in one
>> of the actively writing memcg via shrink_page_list().
>> I've observed a node spending almost 100% in kernel,
>> making useless iteration over already shrinked slab.
>>
>> This patch adds bitmap of memcg-aware shrinkers to memcg.
>> The size of the bitmap depends on bitmap_nr_ids, and during
>> memcg life it's maintained to be enough to fit bitmap_nr_ids
>> shrinkers. Every bit in the map is related to corresponding
>> shrinker id.
>>
>> Next patches will maintain set bit only for really charged
>> memcg. This will allow shrink_slab() to increase its
>> performance in significant way. See the last patch for
>> the numbers.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/memcontrol.h | 15 +++++
>> mm/memcontrol.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>> mm/vmscan.c | 21 +++++++
>> 3 files changed, 160 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index af9eed2e3e04..2ec96ab46b01 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -115,6 +115,7 @@ struct mem_cgroup_per_node {
>> unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>>
>> struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
>
>> + struct memcg_shrinker_map __rcu *shrinkers_map;
>
> shrinker_map
>
>>
>> struct rb_node tree_node; /* RB tree node */
>> unsigned long usage_in_excess;/* Set to the value by which */
>> @@ -153,6 +154,11 @@ struct mem_cgroup_thresholds {
>> struct mem_cgroup_threshold_ary *spare;
>> };
>>
>> +struct memcg_shrinker_map {
>> + struct rcu_head rcu;
>> + unsigned long map[0];
>> +};
>> +
>
> This struct should be defined before struct mem_cgroup_per_node.
>
> A comment explaining what this map is for and what it maps would be
> really helpful.
>
>> enum memcg_kmem_state {
>> KMEM_NONE,
>> KMEM_ALLOCATED,
>> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
>> void memcg_get_cache_ids(void);
>> void memcg_put_cache_ids(void);
>>
>> +extern int shrinkers_max_nr;
>> +
>
> memcg_shrinker_id_max?
>
>> /*
>> * Helper macro to loop through all memcg-specific caches. Callers must still
>> * check if the cache is valid (it is either valid or NULL).
>> @@ -1223,6 +1231,13 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
>> return memcg ? memcg->kmemcg_id : -1;
>> }
>>
>> +extern struct memcg_shrinker_map __rcu *root_shrinkers_map[];
>> +#define SHRINKERS_MAP(memcg, nid) \
>> + (memcg == root_mem_cgroup || !memcg ? \
>> + root_shrinkers_map[nid] : memcg->nodeinfo[nid]->shrinkers_map)
>> +
>> +extern int expand_shrinker_maps(int old_id, int id);
>> +
>
> I'm strongly against using a special map for the root cgroup. I'd prefer
> to disable this optimization for the root cgroup altogether and simply
> iterate over all registered shrinkers when it comes to global reclaim.
> It shouldn't degrade performance as the root cgroup is singular.
>
>> #else
>> #define for_each_memcg_cache_index(_idx) \
>> for (; NULL; )
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 2959a454a072..562dfb1be9ef 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -305,6 +305,113 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
>>
>> struct workqueue_struct *memcg_kmem_cache_wq;
>>
>> +static DECLARE_RWSEM(shrinkers_max_nr_rwsem);
>
> Why rwsem? AFAIU you want to synchronize between two code paths: when a
> memory cgroup is allocated (or switched online?) to allocate a shrinker
> map for it and in the function growing shrinker maps for all cgroups.
> A mutex should be enough for this.
>
>> +struct memcg_shrinker_map __rcu *root_shrinkers_map[MAX_NUMNODES] = { 0 };
>> +
>> +static void get_shrinkers_max_nr(void)
>> +{
>> + down_read(&shrinkers_max_nr_rwsem);
>> +}
>> +
>> +static void put_shrinkers_max_nr(void)
>> +{
>> + up_read(&shrinkers_max_nr_rwsem);
>> +}
>> +
>> +static void kvfree_map_rcu(struct rcu_head *head)
>
> free_shrinker_map_rcu
>
>> +{
>> + kvfree(container_of(head, struct memcg_shrinker_map, rcu));
>> +}
>> +
>> +static int memcg_expand_maps(struct mem_cgroup *memcg, int nid,
>
> Bad name: the function reallocates just one map, not many maps; the name
> doesn't indicate that it is about shrinkers; the name is inconsistent
> with alloc_shrinker_maps and free_shrinker_maps. Please fix.
>
>> + int size, int old_size)
>> +{
>> + struct memcg_shrinker_map *new, *old;
>> +
>> + lockdep_assert_held(&shrinkers_max_nr_rwsem);
>> +
>> + new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
>> + if (!new)
>> + return -ENOMEM;
>> +
>> + /* Set all old bits, clear all new bits */
>> + memset(new->map, (int)0xff, old_size);
>> + memset((void *)new->map + old_size, 0, size - old_size);
>> +
>> + old = rcu_dereference_protected(SHRINKERS_MAP(memcg, nid), true);
>> +
>> + if (memcg)
>> + rcu_assign_pointer(memcg->nodeinfo[nid]->shrinkers_map, new);
>> + else
>> + rcu_assign_pointer(root_shrinkers_map[nid], new);
>> +
>> + if (old)
>> + call_rcu(&old->rcu, kvfree_map_rcu);
>> +
>> + return 0;
>> +}
>> +
>> +static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
>> +{
>> + /* Skip allocation, when we're initializing root_mem_cgroup */
>> + if (!root_mem_cgroup)
>> + return 0;
>> +
>> + return memcg_expand_maps(memcg, nid, shrinkers_max_nr/BITS_PER_BYTE, 0);
>> +}
>> +
>> +static void free_shrinker_maps(struct mem_cgroup *memcg,
>> + struct mem_cgroup_per_node *pn)
>> +{
>> + struct memcg_shrinker_map *map;
>> +
>> + if (memcg == root_mem_cgroup)
>> + return;
>> +
>> + /* IDR unhashed long ago, and expand_shrinker_maps can't race with us */
>> + map = rcu_dereference_protected(pn->shrinkers_map, true);
>> + kvfree_map_rcu(&map->rcu);
>> +}
>> +
>> +static struct idr mem_cgroup_idr;
>> +
>> +int expand_shrinker_maps(int old_nr, int nr)
>> +{
>> + int id, size, old_size, node, ret;
>> + struct mem_cgroup *memcg;
>> +
>> + old_size = old_nr / BITS_PER_BYTE;
>> + size = nr / BITS_PER_BYTE;
>> +
>> + down_write(&shrinkers_max_nr_rwsem);
>> + for_each_node(node) {
>
> Iterating over cgroups first, numa nodes second seems like a better idea
> to me. I think you should fold for_each_node in memcg_expand_maps.

This function is also used in alloc_shrinker_maps(), which takes a node id
argument. We can't change the order, since the maps are stored in
mem_cgroup::nodeinfo[nid].

>
>> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
>
> Iterating over mem_cgroup_idr looks strange. Why don't you use
> for_each_mem_cgroup?
>
>> + if (id == 1)
>> + memcg = NULL;
>> + ret = memcg_expand_maps(memcg, node, size, old_size);
>> + if (ret)
>> + goto unlock;
>> + }
>> +
>> + /* root_mem_cgroup is not initialized yet */
>> + if (id == 0)
>> + ret = memcg_expand_maps(NULL, node, size, old_size);
>> + }
>> +unlock:
>> + up_write(&shrinkers_max_nr_rwsem);
>> + return ret;
>> +}
>> +#else /* CONFIG_SLOB */
>> +static void get_shrinkers_max_nr(void) { }
>> +static void put_shrinkers_max_nr(void) { }
>> +
>> +static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
>> +{
>> + return 0;
>> +}
>> +static void free_shrinker_maps(struct mem_cgroup *memcg,
>> + struct mem_cgroup_per_node *pn) { }
>> +
>> #endif /* !CONFIG_SLOB */
>>
>> /**
>> @@ -3002,6 +3109,8 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>> }
>>
>> #ifndef CONFIG_SLOB
>> +int shrinkers_max_nr;
>> +
>> static int memcg_online_kmem(struct mem_cgroup *memcg)
>> {
>> int memcg_id;
>> @@ -4266,7 +4375,10 @@ static DEFINE_IDR(mem_cgroup_idr);
>> static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
>> {
>> if (memcg->id.id > 0) {
>> + /* Removing IDR must be visible for expand_shrinker_maps() */
>> + get_shrinkers_max_nr();
>> idr_remove(&mem_cgroup_idr, memcg->id.id);
>> + put_shrinkers_max_nr();
>> memcg->id.id = 0;
>> }
>> }
>> @@ -4333,12 +4445,17 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>> if (!pn->lruvec_stat_cpu)
>> goto err_pcpu;
>>
>> + if (alloc_shrinker_maps(memcg, node))
>> + goto err_maps;
>> +
>> lruvec_init(&pn->lruvec);
>> pn->usage_in_excess = 0;
>> pn->on_tree = false;
>> pn->memcg = memcg;
>> return 0;
>>
>> +err_maps:
>> + free_percpu(pn->lruvec_stat_cpu);
>> err_pcpu:
>> memcg->nodeinfo[node] = NULL;
>> kfree(pn);
>> @@ -4352,6 +4469,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>> if (!pn)
>> return;
>>
>> + free_shrinker_maps(memcg, pn);
>> free_percpu(pn->lruvec_stat_cpu);
>> kfree(pn);
>> }
>> @@ -4407,13 +4525,18 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>> #ifdef CONFIG_CGROUP_WRITEBACK
>> INIT_LIST_HEAD(&memcg->cgwb_list);
>> #endif
>> +
>> + get_shrinkers_max_nr();
>> for_each_node(node)
>> - if (alloc_mem_cgroup_per_node_info(memcg, node))
>> + if (alloc_mem_cgroup_per_node_info(memcg, node)) {
>> + put_shrinkers_max_nr();
>> goto fail;
>> + }
>>
>> memcg->id.id = idr_alloc(&mem_cgroup_idr, memcg,
>> 1, MEM_CGROUP_ID_MAX,
>> GFP_KERNEL);
>> + put_shrinkers_max_nr();
>> if (memcg->id.id < 0)
>> goto fail;
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 4f02fe83537e..f63eb5596c35 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -172,6 +172,22 @@ static DECLARE_RWSEM(shrinker_rwsem);
>> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>> static DEFINE_IDR(shrinkers_id_idr);
>>
>> +static int expand_shrinker_id(int id)
>> +{
>> + if (likely(id < shrinkers_max_nr))
>> + return 0;
>> +
>> + id = shrinkers_max_nr * 2;
>> + if (id == 0)
>> + id = BITS_PER_BYTE;
>> +
>> + if (expand_shrinker_maps(shrinkers_max_nr, id))
>> + return -ENOMEM;
>> +
>> + shrinkers_max_nr = id;
>> + return 0;
>> +}
>> +
>
> I think this function should live in memcontrol.c and shrinkers_max_nr
> should be private to memcontrol.c.
>
>> static int add_memcg_shrinker(struct shrinker *shrinker)
>> {
>> int id, ret;
>> @@ -180,6 +196,11 @@ static int add_memcg_shrinker(struct shrinker *shrinker)
>> ret = id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
>> if (ret < 0)
>> goto unlock;
>> + ret = expand_shrinker_id(id);
>> + if (ret < 0) {
>> + idr_remove(&shrinkers_id_idr, id);
>> + goto unlock;
>> + }
>> shrinker->id = id;
>> ret = 0;
>> unlock:
>>

2018-04-23 11:09:16

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On 22.04.2018 20:59, Vladimir Davydov wrote:
> On Tue, Apr 17, 2018 at 09:53:31PM +0300, Kirill Tkhai wrote:
>> Imagine a big node with many cpus, memory cgroups and containers.
>> Let we have 200 containers, every container has 10 mounts,
>> and 10 cgroups. All container tasks don't touch foreign
>> containers mounts. If there is intensive pages write,
>> and global reclaim happens, a writing task has to iterate
>> over all memcgs to shrink slab, before it's able to go
>> to shrink_page_list().
>>
>> Iteration over all the memcg slabs is very expensive:
>> the task has to visit 200 * 10 = 2000 shrinkers
>> for every memcg, and since there are 2000 memcgs,
>> the total calls are 2000 * 2000 = 4000000.
>>
>> So, the shrinker makes 4 million do_shrink_slab() calls
>> just to try to isolate SWAP_CLUSTER_MAX pages in one
>> of the actively writing memcg via shrink_page_list().
>> I've observed a node spending almost 100% in kernel,
>> making useless iteration over already shrinked slab.
>>
>> This patch adds bitmap of memcg-aware shrinkers to memcg.
>> The size of the bitmap depends on bitmap_nr_ids, and during
>> memcg life it's maintained to be enough to fit bitmap_nr_ids
>> shrinkers. Every bit in the map is related to corresponding
>> shrinker id.
>>
>> Next patches will maintain set bit only for really charged
>> memcg. This will allow shrink_slab() to increase its
>> performance in significant way. See the last patch for
>> the numbers.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> include/linux/memcontrol.h | 15 +++++
>> mm/memcontrol.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>> mm/vmscan.c | 21 +++++++
>> 3 files changed, 160 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index af9eed2e3e04..2ec96ab46b01 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -115,6 +115,7 @@ struct mem_cgroup_per_node {
>> unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>>
>> struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
>
>> + struct memcg_shrinker_map __rcu *shrinkers_map;
>
> shrinker_map
>
>>
>> struct rb_node tree_node; /* RB tree node */
>> unsigned long usage_in_excess;/* Set to the value by which */
>> @@ -153,6 +154,11 @@ struct mem_cgroup_thresholds {
>> struct mem_cgroup_threshold_ary *spare;
>> };
>>
>> +struct memcg_shrinker_map {
>> + struct rcu_head rcu;
>> + unsigned long map[0];
>> +};
>> +
>
> This struct should be defined before struct mem_cgroup_per_node.
>
> A comment explaining what this map is for and what it maps would be
> really helpful.
>
>> enum memcg_kmem_state {
>> KMEM_NONE,
>> KMEM_ALLOCATED,
>> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
>> void memcg_get_cache_ids(void);
>> void memcg_put_cache_ids(void);
>>
>> +extern int shrinkers_max_nr;
>> +
>
> memcg_shrinker_id_max?
>
>> /*
>> * Helper macro to loop through all memcg-specific caches. Callers must still
>> * check if the cache is valid (it is either valid or NULL).
>> @@ -1223,6 +1231,13 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
>> return memcg ? memcg->kmemcg_id : -1;
>> }
>>
>> +extern struct memcg_shrinker_map __rcu *root_shrinkers_map[];
>> +#define SHRINKERS_MAP(memcg, nid) \
>> + (memcg == root_mem_cgroup || !memcg ? \
>> + root_shrinkers_map[nid] : memcg->nodeinfo[nid]->shrinkers_map)
>> +
>> +extern int expand_shrinker_maps(int old_id, int id);
>> +
>
> I'm strongly against using a special map for the root cgroup. I'd prefer
> to disable this optimization for the root cgroup altogether and simply
> iterate over all registered shrinkers when it comes to global reclaim.
> It shouldn't degrade performance as the root cgroup is singular.
>
>> #else
>> #define for_each_memcg_cache_index(_idx) \
>> for (; NULL; )
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 2959a454a072..562dfb1be9ef 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -305,6 +305,113 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
>>
>> struct workqueue_struct *memcg_kmem_cache_wq;
>>
>> +static DECLARE_RWSEM(shrinkers_max_nr_rwsem);
>
> Why rwsem? AFAIU you want to synchronize between two code paths: when a
> memory cgroup is allocated (or switched online?) to allocate a shrinker
> map for it and in the function growing shrinker maps for all cgroups.
> A mutex should be enough for this.
>
>> +struct memcg_shrinker_map __rcu *root_shrinkers_map[MAX_NUMNODES] = { 0 };
>> +
>> +static void get_shrinkers_max_nr(void)
>> +{
>> + down_read(&shrinkers_max_nr_rwsem);
>> +}
>> +
>> +static void put_shrinkers_max_nr(void)
>> +{
>> + up_read(&shrinkers_max_nr_rwsem);
>> +}
>> +
>> +static void kvfree_map_rcu(struct rcu_head *head)
>
> free_shrinker_map_rcu
>
>> +{
>> + kvfree(container_of(head, struct memcg_shrinker_map, rcu));
>> +}
>> +
>> +static int memcg_expand_maps(struct mem_cgroup *memcg, int nid,
>
> Bad name: the function reallocates just one map, not many maps; the name
> doesn't indicate that it is about shrinkers; the name is inconsistent
> with alloc_shrinker_maps and free_shrinker_maps. Please fix.
>
>> + int size, int old_size)
>> +{
>> + struct memcg_shrinker_map *new, *old;
>> +
>> + lockdep_assert_held(&shrinkers_max_nr_rwsem);
>> +
>> + new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
>> + if (!new)
>> + return -ENOMEM;
>> +
>> + /* Set all old bits, clear all new bits */
>> + memset(new->map, (int)0xff, old_size);
>> + memset((void *)new->map + old_size, 0, size - old_size);
>> +
>> + old = rcu_dereference_protected(SHRINKERS_MAP(memcg, nid), true);
>> +
>> + if (memcg)
>> + rcu_assign_pointer(memcg->nodeinfo[nid]->shrinkers_map, new);
>> + else
>> + rcu_assign_pointer(root_shrinkers_map[nid], new);
>> +
>> + if (old)
>> + call_rcu(&old->rcu, kvfree_map_rcu);
>> +
>> + return 0;
>> +}
>> +
>> +static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
>> +{
>> + /* Skip allocation, when we're initializing root_mem_cgroup */
>> + if (!root_mem_cgroup)
>> + return 0;
>> +
>> + return memcg_expand_maps(memcg, nid, shrinkers_max_nr/BITS_PER_BYTE, 0);
>> +}
>> +
>> +static void free_shrinker_maps(struct mem_cgroup *memcg,
>> + struct mem_cgroup_per_node *pn)
>> +{
>> + struct memcg_shrinker_map *map;
>> +
>> + if (memcg == root_mem_cgroup)
>> + return;
>> +
>> + /* IDR unhashed long ago, and expand_shrinker_maps can't race with us */
>> + map = rcu_dereference_protected(pn->shrinkers_map, true);
>> + kvfree_map_rcu(&map->rcu);
>> +}
>> +
>> +static struct idr mem_cgroup_idr;
>> +
>> +int expand_shrinker_maps(int old_nr, int nr)
>> +{
>> + int id, size, old_size, node, ret;
>> + struct mem_cgroup *memcg;
>> +
>> + old_size = old_nr / BITS_PER_BYTE;
>> + size = nr / BITS_PER_BYTE;
>> +
>> + down_write(&shrinkers_max_nr_rwsem);
>> + for_each_node(node) {
>
> Iterating over cgroups first, numa nodes second seems like a better idea
> to me. I think you should fold for_each_node in memcg_expand_maps.
>
>> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
>
> Iterating over mem_cgroup_idr looks strange. Why don't you use
> for_each_mem_cgroup?
>
>> + if (id == 1)
>> + memcg = NULL;
>> + ret = memcg_expand_maps(memcg, node, size, old_size);
>> + if (ret)
>> + goto unlock;
>> + }
>> +
>> + /* root_mem_cgroup is not initialized yet */
>> + if (id == 0)
>> + ret = memcg_expand_maps(NULL, node, size, old_size);
>> + }
>> +unlock:
>> + up_write(&shrinkers_max_nr_rwsem);
>> + return ret;
>> +}
>> +#else /* CONFIG_SLOB */
>> +static void get_shrinkers_max_nr(void) { }
>> +static void put_shrinkers_max_nr(void) { }
>> +
>> +static int alloc_shrinker_maps(struct mem_cgroup *memcg, int nid)
>> +{
>> + return 0;
>> +}
>> +static void free_shrinker_maps(struct mem_cgroup *memcg,
>> + struct mem_cgroup_per_node *pn) { }
>> +
>> #endif /* !CONFIG_SLOB */
>>
>> /**
>> @@ -3002,6 +3109,8 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>> }
>>
>> #ifndef CONFIG_SLOB
>> +int shrinkers_max_nr;
>> +
>> static int memcg_online_kmem(struct mem_cgroup *memcg)
>> {
>> int memcg_id;
>> @@ -4266,7 +4375,10 @@ static DEFINE_IDR(mem_cgroup_idr);
>> static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
>> {
>> if (memcg->id.id > 0) {
>> + /* Removing IDR must be visible for expand_shrinker_maps() */
>> + get_shrinkers_max_nr();
>> idr_remove(&mem_cgroup_idr, memcg->id.id);
>> + put_shrinkers_max_nr();
>> memcg->id.id = 0;
>> }
>> }
>> @@ -4333,12 +4445,17 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>> if (!pn->lruvec_stat_cpu)
>> goto err_pcpu;
>>
>> + if (alloc_shrinker_maps(memcg, node))
>> + goto err_maps;
>> +
>> lruvec_init(&pn->lruvec);
>> pn->usage_in_excess = 0;
>> pn->on_tree = false;
>> pn->memcg = memcg;
>> return 0;
>>
>> +err_maps:
>> + free_percpu(pn->lruvec_stat_cpu);
>> err_pcpu:
>> memcg->nodeinfo[node] = NULL;
>> kfree(pn);
>> @@ -4352,6 +4469,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>> if (!pn)
>> return;
>>
>> + free_shrinker_maps(memcg, pn);
>> free_percpu(pn->lruvec_stat_cpu);
>> kfree(pn);
>> }
>> @@ -4407,13 +4525,18 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>> #ifdef CONFIG_CGROUP_WRITEBACK
>> INIT_LIST_HEAD(&memcg->cgwb_list);
>> #endif
>> +
>> + get_shrinkers_max_nr();
>> for_each_node(node)
>> - if (alloc_mem_cgroup_per_node_info(memcg, node))
>> + if (alloc_mem_cgroup_per_node_info(memcg, node)) {
>> + put_shrinkers_max_nr();
>> goto fail;
>> + }
>>
>> memcg->id.id = idr_alloc(&mem_cgroup_idr, memcg,
>> 1, MEM_CGROUP_ID_MAX,
>> GFP_KERNEL);
>> + put_shrinkers_max_nr();
>> if (memcg->id.id < 0)
>> goto fail;
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 4f02fe83537e..f63eb5596c35 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -172,6 +172,22 @@ static DECLARE_RWSEM(shrinker_rwsem);
>> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>> static DEFINE_IDR(shrinkers_id_idr);
>>
>> +static int expand_shrinker_id(int id)
>> +{
>> + if (likely(id < shrinkers_max_nr))
>> + return 0;
>> +
>> + id = shrinkers_max_nr * 2;
>> + if (id == 0)
>> + id = BITS_PER_BYTE;
>> +
>> + if (expand_shrinker_maps(shrinkers_max_nr, id))
>> + return -ENOMEM;
>> +
>> + shrinkers_max_nr = id;
>> + return 0;
>> +}
>> +
>
> I think this function should live in memcontrol.c and shrinkers_max_nr
> should be private to memcontrol.c.

It can't be private, as shrink_slab_memcg() uses this value as the upper bound
when walking the per-memcg shrinker bitmap.
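
Roughly, the v2 loop uses it like this (a fragment for illustration only, with
locking/RCU and the shrink_control setup elided):

	map = SHRINKERS_MAP(memcg, nid);
	for_each_set_bit(i, map->map, shrinkers_max_nr) {
		struct shrinker *shrinker;

		/* Resolve the set bit back to its shrinker. */
		shrinker = idr_find(&shrinkers_id_idr, i);
		if (!shrinker)
			continue;

		ret = do_shrink_slab(&sc, shrinker, priority);
		if (ret == SHRINK_EMPTY)
			ret = 0;
		freed += ret;
	}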

>> static int add_memcg_shrinker(struct shrinker *shrinker)
>> {
>> int id, ret;
>> @@ -180,6 +196,11 @@ static int add_memcg_shrinker(struct shrinker *shrinker)
>> ret = id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
>> if (ret < 0)
>> goto unlock;
>> + ret = expand_shrinker_id(id);
>> + if (ret < 0) {
>> + idr_remove(&shrinkers_id_idr, id);
>> + goto unlock;
>> + }
>> shrinker->id = id;
>> ret = 0;
>> unlock:
>>

2018-04-24 11:38:39

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 12/12] mm: Clear shrinker bit if there are no objects related to memcg

On Mon, Apr 23, 2018 at 01:01:08PM +0300, Kirill Tkhai wrote:
> On 22.04.2018 21:21, Vladimir Davydov wrote:
> > On Tue, Apr 17, 2018 at 09:54:51PM +0300, Kirill Tkhai wrote:
> >> To avoid further unneed calls of do_shrink_slab()
> >> for shrinkers, which already do not have any charged
> >> objects in a memcg, their bits have to be cleared.
> >>
> >> This patch introduces a lockless mechanism to do that
> >> without races without parallel list lru add. After
> >> do_shrink_slab() returns SHRINK_EMPTY the first time,
> >> we clear the bit and call it once again. Then we restore
> >> the bit, if the new return value is different.
> >>
> >> Note, that single smp_mb__after_atomic() in shrink_slab_memcg()
> >> covers two situations:
> >>
> >> 1)list_lru_add() shrink_slab_memcg
> >> list_add_tail() for_each_set_bit() <--- read bit
> >> do_shrink_slab() <--- missed list update (no barrier)
> >> <MB> <MB>
> >> set_bit() do_shrink_slab() <--- seen list update
> >>
> >> This situation, when the first do_shrink_slab() sees set bit,
> >> but it doesn't see list update (i.e., race with the first element
> >> queueing), is rare. So we don't add <MB> before the first call
> >> of do_shrink_slab() instead of this to do not slow down generic
> >> case. Also, it's need the second call as seen in below in (2).
> >>
> >> 2)list_lru_add() shrink_slab_memcg()
> >> list_add_tail() ...
> >> set_bit() ...
> >> ... for_each_set_bit()
> >> do_shrink_slab() do_shrink_slab()
> >> clear_bit() ...
> >> ... ...
> >> list_lru_add() ...
> >> list_add_tail() clear_bit()
> >> <MB> <MB>
> >> set_bit() do_shrink_slab()
> >>
> >> The barriers guarantees, the second do_shrink_slab()
> >> in the right side task sees list update if really
> >> cleared the bit. This case is drawn in the code comment.
> >>
> >> [Results/performance of the patchset]
> >>
> >> After the whole patchset applied the below test shows signify
> >> increase of performance:
> >>
> >> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
> >> $mkdir /sys/fs/cgroup/memory/ct
> >> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
> >> $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done
> >>
> >> Then, 4 sequential calls of drop caches:
> >> $time echo 3 > /proc/sys/vm/drop_caches
> >>
> >> 1)Before:
> >> 0.00user 8.99system 0:08.99elapsed 99%CPU
> >> 0.00user 5.97system 0:05.97elapsed 100%CPU
> >> 0.00user 5.97system 0:05.97elapsed 100%CPU
> >> 0.00user 5.85system 0:05.85elapsed 100%CPU
> >>
> >> 2)After
> >> 0.00user 1.11system 0:01.12elapsed 99%CPU
> >> 0.00user 0.00system 0:00.00elapsed 100%CPU
> >> 0.00user 0.00system 0:00.00elapsed 100%CPU
> >> 0.00user 0.00system 0:00.00elapsed 100%CPU
> >>
> >> Even if we round 0:00.00 up to 0:00.01, the results shows
> >> the performance increases at least in 585 times.
> >>
> >> Signed-off-by: Kirill Tkhai <[email protected]>
> >> ---
> >> include/linux/memcontrol.h | 2 ++
> >> mm/vmscan.c | 19 +++++++++++++++++--
> >> 2 files changed, 19 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> index e1c1fa8e417a..1c5c68550e2f 100644
> >> --- a/include/linux/memcontrol.h
> >> +++ b/include/linux/memcontrol.h
> >> @@ -1245,6 +1245,8 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int nr)
> >>
> >> rcu_read_lock();
> >> map = SHRINKERS_MAP(memcg, nid);
> >> + /* Pairs with smp mb in shrink_slab() */
> >> + smp_mb__before_atomic();
> >> set_bit(nr, map->map);
> >> rcu_read_unlock();
> >> }
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 3be9b4d81c13..a8733bc5377b 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -579,8 +579,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> >> }
> >>
> >> ret = do_shrink_slab(&sc, shrinker, priority);
> >> - if (ret == SHRINK_EMPTY)
> >> - ret = 0;
> >> + if (ret == SHRINK_EMPTY) {
> >> + clear_bit(i, map->map);
> >> + /*
> >> + * Pairs with mb in set_shrinker_bit():
> >> + *
> >> + * list_lru_add() shrink_slab_memcg()
> >> + * list_add_tail() clear_bit()
> >> + * <MB> <MB>
> >> + * set_bit() do_shrink_slab()
> >> + */
> >> + smp_mb__after_atomic();
> >> + ret = do_shrink_slab(&sc, shrinker, priority);
> >> + if (ret == SHRINK_EMPTY)
> >> + ret = 0;
> >> + else
> >> + set_shrinker_bit(memcg, nid, i);
> >> + }
> >
> > This is mind-boggling. Are there any alternatives? For instance, can't
> > we clear the bit in list_lru_del, when we hold the list lock?
>
> Since a single shrinker may iterate over several lru lists, we can't do that.
> Otherwise, we would have to probe another shrinker's lru list from a lru list,
> which became empty in list_lru_del().
>
> The solution I suggested, is generic, and it does not depend on low-level
> structure type, used by shrinker. This even doesn't have to be a lru list.

Fair enough. I guess this is the best way to go after all. Please try to
improve the comment so that it isn't just a pure diagram.

Also, please prefix all memcg-related function names (such as
set_shrinker_bit) with memcg_ (or mem_cgroup_) in this and all
other patches.

2018-04-24 11:49:44

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On Mon, Apr 23, 2018 at 01:54:50PM +0300, Kirill Tkhai wrote:
> >> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
> >> void memcg_get_cache_ids(void);
> >> void memcg_put_cache_ids(void);
> >>
> >> +extern int shrinkers_max_nr;
> >> +
> >
> > memcg_shrinker_id_max?
>
> memcg_shrinker_id_max sounds like an inclusive value, doesn't it?
> While shrinker->id < shrinkers_max_nr.
>
> Let's better use memcg_shrinker_nr_max.

or memcg_nr_shrinker_ids (to match memcg_nr_cache_ids), not sure...

Come to think of it, this variable is kinda awkward: it is defined in
vmscan.c but declared in memcontrol.h; it is used by vmscan.c for max
shrinker id and by memcontrol.c for shrinker map capacity. Just a raw
idea: what about splitting it in two: one is private to vmscan.c, used
as max id, say we call it shrinker_id_max; the other is defined in
memcontrol.c and is used for shrinker map capacity, say we call it
memcg_shrinker_map_capacity. What do you think?
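
A very rough sketch of how such a split could look (both variable names and
the memcg_expand_shrinker_maps() interface are only tentative suggestions from
this thread):

/* mm/vmscan.c -- private: highest shrinker id handed out so far. Unlike
 * the map capacity below, it may go down when a shrinker is unregistered. */
static int shrinker_id_max;

/* mm/memcontrol.c -- private: number of bits each per-memcg shrinker map
 * holds. It only ever grows, doubling on every expansion. */
static int memcg_shrinker_map_capacity;

/* Called from vmscan.c when a freshly allocated id may not fit. */
int memcg_expand_shrinker_maps(int new_id)
{
	int capacity = memcg_shrinker_map_capacity;

	if (new_id < capacity)
		return 0;

	capacity = capacity ? capacity * 2 : BITS_PER_BYTE;
	/* ... reallocate every cgroup's per-node map to 'capacity' bits ... */
	memcg_shrinker_map_capacity = capacity;
	return 0;
}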

> >> +int expand_shrinker_maps(int old_nr, int nr)
> >> +{
> >> + int id, size, old_size, node, ret;
> >> + struct mem_cgroup *memcg;
> >> +
> >> + old_size = old_nr / BITS_PER_BYTE;
> >> + size = nr / BITS_PER_BYTE;
> >> +
> >> + down_write(&shrinkers_max_nr_rwsem);
> >> + for_each_node(node) {
> >
> > Iterating over cgroups first, numa nodes second seems like a better idea
> > to me. I think you should fold for_each_node in memcg_expand_maps.
> >
> >> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
> >
> > Iterating over mem_cgroup_idr looks strange. Why don't you use
> > for_each_mem_cgroup?
>
> We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
> mem_cgroup_css_online() mustn't fail (it's a requirement of currently
> existing design of memcg_cgroup::id).
>
> A new memcg is added to parent's list between two of these calls:
>
> css_create()
> ss->css_alloc()
> list_add_tail_rcu(&css->sibling, &parent_css->children)
> ss->css_online()
>
> for_each_mem_cgroup() does not see allocated, but not linked children.

Why don't we move shrinker map allocation to css_online then?

>
> >> + if (id == 1)
> >> + memcg = NULL;
> >> + ret = memcg_expand_maps(memcg, node, size, old_size);
> >> + if (ret)
> >> + goto unlock;
> >> + }
> >> +
> >> + /* root_mem_cgroup is not initialized yet */
> >> + if (id == 0)
> >> + ret = memcg_expand_maps(NULL, node, size, old_size);
> >> + }
> >> +unlock:
> >> + up_write(&shrinkers_max_nr_rwsem);
> >> + return ret;
> >> +}

2018-04-24 12:35:23

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On Tue, Apr 24, 2018 at 02:38:51PM +0300, Kirill Tkhai wrote:
> On 24.04.2018 14:28, Vladimir Davydov wrote:
> > On Mon, Apr 23, 2018 at 01:54:50PM +0300, Kirill Tkhai wrote:
> >>>> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
> >>>> void memcg_get_cache_ids(void);
> >>>> void memcg_put_cache_ids(void);
> >>>>
> >>>> +extern int shrinkers_max_nr;
> >>>> +
> >>>
> >>> memcg_shrinker_id_max?
> >>
> >> memcg_shrinker_id_max sounds like an inclusive value, doesn't it?
> >> While shrinker->id < shrinkers_max_nr.
> >>
> >> Let's better use memcg_shrinker_nr_max.
> >
> > or memcg_nr_shrinker_ids (to match memcg_nr_cache_ids), not sure...
> >
> > Come to think of it, this variable is kinda awkward: it is defined in
> > vmscan.c but declared in memcontrol.h; it is used by vmscan.c for max
> > shrinker id and by memcontrol.c for shrinker map capacity. Just a raw
> > idea: what about splitting it in two: one is private to vmscan.c, used
> > as max id, say we call it shrinker_id_max; the other is defined in
> > memcontrol.c and is used for shrinker map capacity, say we call it
> > memcg_shrinker_map_capacity. What do you think?
>
> I don't much like a duplication of the single variable...

Well, it's not really a duplication. For example, shrinker_id_max could
decrease when a shrinker is unregistered while shrinker_map_capacity can
only grow exponentially.

> Are there real problems, if it defined in memcontrol.{c,h} and use in
> both of the places?

The code is more difficult to follow when variables are shared like that
IMHO. I suggest you try it and see how it looks. Maybe it will only
get worse and we'll have to revert to what we have now. Difficult to say
without seeing the code.

>
> >>>> +int expand_shrinker_maps(int old_nr, int nr)
> >>>> +{
> >>>> + int id, size, old_size, node, ret;
> >>>> + struct mem_cgroup *memcg;
> >>>> +
> >>>> + old_size = old_nr / BITS_PER_BYTE;
> >>>> + size = nr / BITS_PER_BYTE;
> >>>> +
> >>>> + down_write(&shrinkers_max_nr_rwsem);
> >>>> + for_each_node(node) {
> >>>
> >>> Iterating over cgroups first, numa nodes second seems like a better idea
> >>> to me. I think you should fold for_each_node in memcg_expand_maps.
> >>>
> >>>> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
> >>>
> >>> Iterating over mem_cgroup_idr looks strange. Why don't you use
> >>> for_each_mem_cgroup?
> >>
> >> We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
> >> mem_cgroup_css_online() mustn't fail (it's a requirement of currently
> >> existing design of memcg_cgroup::id).
> >>
> >> A new memcg is added to parent's list between two of these calls:
> >>
> >> css_create()
> >> ss->css_alloc()
> >> list_add_tail_rcu(&css->sibling, &parent_css->children)
> >> ss->css_online()
> >>
> >> for_each_mem_cgroup() does not see allocated, but not linked children.
> >
> > Why don't we move shrinker map allocation to css_online then?
>
> Because the design of memcg_cgroup::id prohibits mem_cgroup_css_online() to fail.
> This function can't fail.

I fail to understand why it is so. Could you please elaborate?

>
> I don't think it will be good to dive into reworking of this stuff for this patchset,
> which is really already big. Also, it will be assymmetric to allocate one part of
> data in css_alloc(), while another data in css_free(). This breaks cgroup design,
> which specially introduces this two function to differ allocation and onlining.
> Also, I've just move the allocation to alloc_mem_cgroup_per_node_info() like it was
> suggested in comments to v1...

Yeah, but (ab)using mem_cgroup_idr for iterating over all allocated
memory cgroups looks rather dubious to me...

2018-04-24 12:37:54

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

Let's discuss this on code that already has the changes from your comments
on v2 applied, so that they are visible. v3 is on the way.

Kirill

On 24.04.2018 14:28, Vladimir Davydov wrote:
> On Mon, Apr 23, 2018 at 01:54:50PM +0300, Kirill Tkhai wrote:
>>>> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
>>>> void memcg_get_cache_ids(void);
>>>> void memcg_put_cache_ids(void);
>>>>
>>>> +extern int shrinkers_max_nr;
>>>> +
>>>
>>> memcg_shrinker_id_max?
>>
>> memcg_shrinker_id_max sounds like an inclusive value, doesn't it?
>> While shrinker->id < shrinkers_max_nr.
>>
>> Let's better use memcg_shrinker_nr_max.
>
> or memcg_nr_shrinker_ids (to match memcg_nr_cache_ids), not sure...
>
> Come to think of it, this variable is kinda awkward: it is defined in
> vmscan.c but declared in memcontrol.h; it is used by vmscan.c for max
> shrinker id and by memcontrol.c for shrinker map capacity. Just a raw
> idea: what about splitting it in two: one is private to vmscan.c, used
> as max id, say we call it shrinker_id_max; the other is defined in
> memcontrol.c and is used for shrinker map capacity, say we call it
> memcg_shrinker_map_capacity. What do you think?
>
>>>> +int expand_shrinker_maps(int old_nr, int nr)
>>>> +{
>>>> + int id, size, old_size, node, ret;
>>>> + struct mem_cgroup *memcg;
>>>> +
>>>> + old_size = old_nr / BITS_PER_BYTE;
>>>> + size = nr / BITS_PER_BYTE;
>>>> +
>>>> + down_write(&shrinkers_max_nr_rwsem);
>>>> + for_each_node(node) {
>>>
>>> Iterating over cgroups first, numa nodes second seems like a better idea
>>> to me. I think you should fold for_each_node in memcg_expand_maps.
>>>
>>>> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
>>>
>>> Iterating over mem_cgroup_idr looks strange. Why don't you use
>>> for_each_mem_cgroup?
>>
>> We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
>> mem_cgroup_css_online() mustn't fail (it's a requirement of currently
>> existing design of memcg_cgroup::id).
>>
>> A new memcg is added to parent's list between two of these calls:
>>
>> css_create()
>> ss->css_alloc()
>> list_add_tail_rcu(&css->sibling, &parent_css->children)
>> ss->css_online()
>>
>> for_each_mem_cgroup() does not see allocated, but not linked children.
>
> Why don't we move shrinker map allocation to css_online then?
>
>>
>>>> + if (id == 1)
>>>> + memcg = NULL;
>>>> + ret = memcg_expand_maps(memcg, node, size, old_size);
>>>> + if (ret)
>>>> + goto unlock;
>>>> + }
>>>> +
>>>> + /* root_mem_cgroup is not initialized yet */
>>>> + if (id == 0)
>>>> + ret = memcg_expand_maps(NULL, node, size, old_size);
>>>> + }
>>>> +unlock:
>>>> + up_write(&shrinkers_max_nr_rwsem);
>>>> + return ret;
>>>> +}

2018-04-24 12:43:43

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On 24.04.2018 15:15, Vladimir Davydov wrote:
> On Tue, Apr 24, 2018 at 02:38:51PM +0300, Kirill Tkhai wrote:
>> On 24.04.2018 14:28, Vladimir Davydov wrote:
>>> On Mon, Apr 23, 2018 at 01:54:50PM +0300, Kirill Tkhai wrote:
>>>>>> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
>>>>>> void memcg_get_cache_ids(void);
>>>>>> void memcg_put_cache_ids(void);
>>>>>>
>>>>>> +extern int shrinkers_max_nr;
>>>>>> +
>>>>>
>>>>> memcg_shrinker_id_max?
>>>>
>>>> memcg_shrinker_id_max sounds like an inclusive value, doesn't it?
>>>> While shrinker->id < shrinkers_max_nr.
>>>>
>>>> Let's better use memcg_shrinker_nr_max.
>>>
>>> or memcg_nr_shrinker_ids (to match memcg_nr_cache_ids), not sure...
>>>
>>> Come to think of it, this variable is kinda awkward: it is defined in
>>> vmscan.c but declared in memcontrol.h; it is used by vmscan.c for max
>>> shrinker id and by memcontrol.c for shrinker map capacity. Just a raw
>>> idea: what about splitting it in two: one is private to vmscan.c, used
>>> as max id, say we call it shrinker_id_max; the other is defined in
>>> memcontrol.c and is used for shrinker map capacity, say we call it
>>> memcg_shrinker_map_capacity. What do you think?
>>
>> I don't much like a duplication of the single variable...
>
> Well, it's not really a duplication. For example, shrinker_id_max could
> decrease when a shrinker is unregistered while shrinker_map_capacity can
> only grow exponentially.
>
>> Are there real problems, if it defined in memcontrol.{c,h} and use in
>> both of the places?
>
> The code is more difficult to follow when variables are shared like that
> IMHO. I suggest you try it and see how it looks. May be, it will only
> get worse and we'll have to revert to what we have now. Difficult to say
> without seeing the code.
>
>>
>>>>>> +int expand_shrinker_maps(int old_nr, int nr)
>>>>>> +{
>>>>>> + int id, size, old_size, node, ret;
>>>>>> + struct mem_cgroup *memcg;
>>>>>> +
>>>>>> + old_size = old_nr / BITS_PER_BYTE;
>>>>>> + size = nr / BITS_PER_BYTE;
>>>>>> +
>>>>>> + down_write(&shrinkers_max_nr_rwsem);
>>>>>> + for_each_node(node) {
>>>>>
>>>>> Iterating over cgroups first, numa nodes second seems like a better idea
>>>>> to me. I think you should fold for_each_node in memcg_expand_maps.
>>>>>
>>>>>> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
>>>>>
>>>>> Iterating over mem_cgroup_idr looks strange. Why don't you use
>>>>> for_each_mem_cgroup?
>>>>
>>>> We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
>>>> mem_cgroup_css_online() mustn't fail (it's a requirement of currently
>>>> existing design of memcg_cgroup::id).
>>>>
>>>> A new memcg is added to parent's list between two of these calls:
>>>>
>>>> css_create()
>>>> ss->css_alloc()
>>>> list_add_tail_rcu(&css->sibling, &parent_css->children)
>>>> ss->css_online()
>>>>
>>>> for_each_mem_cgroup() does not see allocated, but not linked children.
>>>
>>> Why don't we move shrinker map allocation to css_online then?
>>
>> Because the design of memcg_cgroup::id prohibits mem_cgroup_css_online() to fail.
>> This function can't fail.
>
> I fail to understand why it is so. Could you please elaborate?

mem_cgroup::id is freed not in mem_cgroup_css_free(), but earlier. It's freed
between mem_cgroup_css_offline() and mem_cgroup_free(), after the last reference
is put.

If we sometimes want to free it in mem_cgroup_css_free() instead, this will
introduce asymmetry into the logic, which makes it more difficult to follow.
There is already a bug, which I fixed in

"memcg: remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure"

and a new change like this would make the code completely non-modular and unreadable.

>>
>> I don't think it will be good to dive into reworking of this stuff for this patchset,
>> which is really already big. Also, it will be assymmetric to allocate one part of
>> data in css_alloc(), while another data in css_free(). This breaks cgroup design,
>> which specially introduces this two function to differ allocation and onlining.
>> Also, I've just move the allocation to alloc_mem_cgroup_per_node_info() like it was
>> suggested in comments to v1...
>
> Yeah, but (ab)using mem_cgroup_idr for iterating over all allocated
> memory cgroups looks rather dubious to me...

But we have to iterate over all allocated memory cgroups anyway,
as all of them must have their maps expanded. What is the problem?
It's a rather simple method, and it's faster than a for_each_mem_cgroup()
loop, since it does not have to take and drop refcounts along the way.
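
For comparison, the two ways of walking the cgroups look roughly like this
(memcg_expand_one_map() is just a placeholder for the per-memcg map
reallocation):

	/*
	 * css iterator: only sees cgroups already linked into the
	 * hierarchy and takes/puts a css reference on every step.
	 */
	for_each_mem_cgroup(memcg) {
		if (memcg_expand_one_map(memcg, size, old_size)) {
			mem_cgroup_iter_break(NULL, memcg);
			return -ENOMEM;
		}
	}

	/*
	 * IDR walk: no reference juggling, and it also covers cgroups
	 * that are allocated but not linked into the hierarchy yet.
	 */
	idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
		if (memcg_expand_one_map(memcg, size, old_size))
			return -ENOMEM;
	}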

Kirill

2018-04-24 14:30:04

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On Mon, Apr 23, 2018 at 02:06:31PM +0300, Kirill Tkhai wrote:
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 4f02fe83537e..f63eb5596c35 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -172,6 +172,22 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> >> static DEFINE_IDR(shrinkers_id_idr);
> >>
> >> +static int expand_shrinker_id(int id)
> >> +{
> >> + if (likely(id < shrinkers_max_nr))
> >> + return 0;
> >> +
> >> + id = shrinkers_max_nr * 2;
> >> + if (id == 0)
> >> + id = BITS_PER_BYTE;
> >> +
> >> + if (expand_shrinker_maps(shrinkers_max_nr, id))
> >> + return -ENOMEM;
> >> +
> >> + shrinkers_max_nr = id;
> >> + return 0;
> >> +}
> >> +
> >
> > I think this function should live in memcontrol.c and shrinkers_max_nr
> > should be private to memcontrol.c.
>
> It can't be private as shrink_slab_memcg() uses this value to get the last bit of bitmap.

Yeah, you're right, sorry, I hadn't noticed that.

What about moving id allocation to this function as well? IMHO it would
make the code flow a little bit more straightforward. I mean,

alloc_shrinker_id()
{
int id = idr_alloc(...)
if (id >= memcg_nr_shrinker_ids)
memcg_grow_shrinker_map(...)
return id;
}
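
For illustration only, that flow might look like this (memcg_nr_shrinker_ids
and memcg_grow_shrinker_map() are placeholder names from this discussion, not
existing interfaces):

static int alloc_shrinker_id(struct shrinker *shrinker)
{
	int id, ret = 0;

	down_write(&shrinker_rwsem);
	id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
	if (id < 0) {
		ret = id;
		goto unlock;
	}

	/* Grow the per-memcg maps only when the new id does not fit. */
	if (id >= memcg_nr_shrinker_ids && memcg_grow_shrinker_map(id)) {
		idr_remove(&shrinkers_id_idr, id);
		ret = -ENOMEM;
		goto unlock;
	}

	shrinker->id = id;
unlock:
	up_write(&shrinker_rwsem);
	return ret;
}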

>
> >> static int add_memcg_shrinker(struct shrinker *shrinker)
> >> {
> >> int id, ret;
> >> @@ -180,6 +196,11 @@ static int add_memcg_shrinker(struct shrinker *shrinker)
> >> ret = id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
> >> if (ret < 0)
> >> goto unlock;
> >> + ret = expand_shrinker_id(id);
> >> + if (ret < 0) {
> >> + idr_remove(&shrinkers_id_idr, id);
> >> + goto unlock;
> >> + }
> >> shrinker->id = id;
> >> ret = 0;
> >> unlock:
> >>

2018-04-24 14:31:51

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On 24.04.2018 14:28, Vladimir Davydov wrote:
> On Mon, Apr 23, 2018 at 01:54:50PM +0300, Kirill Tkhai wrote:
>>>> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
>>>> void memcg_get_cache_ids(void);
>>>> void memcg_put_cache_ids(void);
>>>>
>>>> +extern int shrinkers_max_nr;
>>>> +
>>>
>>> memcg_shrinker_id_max?
>>
>> memcg_shrinker_id_max sounds like an inclusive value, doesn't it?
>> While shrinker->id < shrinkers_max_nr.
>>
>> Let's better use memcg_shrinker_nr_max.
>
> or memcg_nr_shrinker_ids (to match memcg_nr_cache_ids), not sure...
>
> Come to think of it, this variable is kinda awkward: it is defined in
> vmscan.c but declared in memcontrol.h; it is used by vmscan.c for max
> shrinker id and by memcontrol.c for shrinker map capacity. Just a raw
> idea: what about splitting it in two: one is private to vmscan.c, used
> as max id, say we call it shrinker_id_max; the other is defined in
> memcontrol.c and is used for shrinker map capacity, say we call it
> memcg_shrinker_map_capacity. What do you think?

I don't much like duplicating a single variable... Are there real problems
if it is defined in memcontrol.{c,h} and used in both places?

>>>> +int expand_shrinker_maps(int old_nr, int nr)
>>>> +{
>>>> + int id, size, old_size, node, ret;
>>>> + struct mem_cgroup *memcg;
>>>> +
>>>> + old_size = old_nr / BITS_PER_BYTE;
>>>> + size = nr / BITS_PER_BYTE;
>>>> +
>>>> + down_write(&shrinkers_max_nr_rwsem);
>>>> + for_each_node(node) {
>>>
>>> Iterating over cgroups first, numa nodes second seems like a better idea
>>> to me. I think you should fold for_each_node in memcg_expand_maps.
>>>
>>>> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
>>>
>>> Iterating over mem_cgroup_idr looks strange. Why don't you use
>>> for_each_mem_cgroup?
>>
>> We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
>> mem_cgroup_css_online() mustn't fail (it's a requirement of currently
>> existing design of memcg_cgroup::id).
>>
>> A new memcg is added to parent's list between two of these calls:
>>
>> css_create()
>> ss->css_alloc()
>> list_add_tail_rcu(&css->sibling, &parent_css->children)
>> ss->css_online()
>>
>> for_each_mem_cgroup() does not see allocated, but not linked children.
>
> Why don't we move shrinker map allocation to css_online then?

Because the design of mem_cgroup::id prohibits mem_cgroup_css_online() from failing.
This function can't fail.

I don't think it would be good to dive into reworking this stuff for this patchset,
which is already really big. Also, it would be asymmetric to allocate one part of the
data in css_alloc() and another part in css_online(). This breaks the cgroup design,
which specifically introduces these two functions to separate allocation from onlining.
Also, I've just moved the allocation to alloc_mem_cgroup_per_node_info(), as was
suggested in the comments to v1...

>>
>>>> + if (id == 1)
>>>> + memcg = NULL;
>>>> + ret = memcg_expand_maps(memcg, node, size, old_size);
>>>> + if (ret)
>>>> + goto unlock;
>>>> + }
>>>> +
>>>> + /* root_mem_cgroup is not initialized yet */
>>>> + if (id == 0)
>>>> + ret = memcg_expand_maps(NULL, node, size, old_size);
>>>> + }
>>>> +unlock:
>>>> + up_write(&shrinkers_max_nr_rwsem);
>>>> + return ret;
>>>> +}

Kirill

2018-04-28 15:10:00

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On Tue, Apr 24, 2018 at 03:24:53PM +0300, Kirill Tkhai wrote:
> >>>>>> +int expand_shrinker_maps(int old_nr, int nr)
> >>>>>> +{
> >>>>>> + int id, size, old_size, node, ret;
> >>>>>> + struct mem_cgroup *memcg;
> >>>>>> +
> >>>>>> + old_size = old_nr / BITS_PER_BYTE;
> >>>>>> + size = nr / BITS_PER_BYTE;
> >>>>>> +
> >>>>>> + down_write(&shrinkers_max_nr_rwsem);
> >>>>>> + for_each_node(node) {
> >>>>>
> >>>>> Iterating over cgroups first, numa nodes second seems like a better idea
> >>>>> to me. I think you should fold for_each_node in memcg_expand_maps.
> >>>>>
> >>>>>> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
> >>>>>
> >>>>> Iterating over mem_cgroup_idr looks strange. Why don't you use
> >>>>> for_each_mem_cgroup?
> >>>>
> >>>> We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
> >>>> mem_cgroup_css_online() mustn't fail (it's a requirement of currently
> >>>> existing design of memcg_cgroup::id).
> >>>>
> >>>> A new memcg is added to parent's list between two of these calls:
> >>>>
> >>>> css_create()
> >>>> ss->css_alloc()
> >>>> list_add_tail_rcu(&css->sibling, &parent_css->children)
> >>>> ss->css_online()
> >>>>
> >>>> for_each_mem_cgroup() does not see allocated, but not linked children.
> >>>
> >>> Why don't we move shrinker map allocation to css_online then?
> >>
> >> Because the design of memcg_cgroup::id prohibits mem_cgroup_css_online() to fail.
> >> This function can't fail.
> >
> > I fail to understand why it is so. Could you please elaborate?
>
> mem_cgroup::id is freed not in mem_cgroup_css_free(), but earlier. It's freed
> between mem_cgroup_css_offline() and mem_cgroup_free(), after the last reference
> is put.
>
> In case of sometimes we want to free it in mem_cgroup_css_free(), this will
> introduce assymmetric in the logic, which makes it more difficult. There is
> already a bug, which I fixed in
>
> "memcg: remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure"
>
> new change will make this code completely not-modular and unreadable.

How is that? AFAIU all we need to do to handle css_online failure
properly is call mem_cgroup_id_remove() from mem_cgroup_css_free().
That's it, as mem_cgroup_id_remove() is already safe to call more
than once for the same cgroup - the first call will free the id
while all subsequent calls will do nothing.
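
In other words, roughly (a sketch; memcg_alloc_shrinker_maps() and
memcg_free_shrinker_maps() stand in for whatever helpers end up doing the
allocation and freeing):

static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(css);

	/* Allocation may fail here; cgroup core will then call ->css_free(). */
	if (memcg_alloc_shrinker_maps(memcg))
		return -ENOMEM;

	return 0;
}

static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(css);

	/*
	 * Harmless if the id was already released on the normal offline
	 * path: the first call frees it, repeated calls do nothing.
	 */
	mem_cgroup_id_remove(memcg);
	memcg_free_shrinker_maps(memcg);
}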

>
> >>
> >> I don't think it will be good to dive into reworking of this stuff for this patchset,
> >> which is really already big. Also, it will be assymmetric to allocate one part of
> >> data in css_alloc(), while another data in css_free(). This breaks cgroup design,
> >> which specially introduces this two function to differ allocation and onlining.
> >> Also, I've just move the allocation to alloc_mem_cgroup_per_node_info() like it was
> >> suggested in comments to v1...
> >
> > Yeah, but (ab)using mem_cgroup_idr for iterating over all allocated
> > memory cgroups looks rather dubious to me...
>
> But we have to iterate over all allocated memory cgroups in any way,
> as all of them must have expanded maps. What is the problem?
> It's rather simple method, and it faster then for_each_mem_cgroup()
> cycle, since it does not have to play with get and put of refcounters.

I don't like this, because mem_cgroup_idr was initially introduced to
avoid depletion of css ids by offline cgroups. We could fix that problem
by extending swap_cgroup to UINT_MAX, in which case mem_cgroup_idr
wouldn't be needed at all. Reusing mem_cgroup_idr for iterating over
allocated cgroups deprives us of the ability to reconsider that design
decision in future, neither does it look natural IMO. Besides, in order
to use mem_cgroup_idr for your purpose, you have to reshuffle the code
of mem_cgroup_css_alloc in a rather contrived way IMO.

I agree that allocating parts of struct mem_cgroup in css_online may
look dubious, but IMHO it's better than inventing a new way to iterate
over cgroups instead of using the iterator provided by cgroup core.
May be, you should ask Tejun which way he thinks is better.

Thanks,
Vladimir

2018-05-03 11:17:28

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

On 28.04.2018 18:08, Vladimir Davydov wrote:
> On Tue, Apr 24, 2018 at 03:24:53PM +0300, Kirill Tkhai wrote:
>>>>>>>> +int expand_shrinker_maps(int old_nr, int nr)
>>>>>>>> +{
>>>>>>>> + int id, size, old_size, node, ret;
>>>>>>>> + struct mem_cgroup *memcg;
>>>>>>>> +
>>>>>>>> + old_size = old_nr / BITS_PER_BYTE;
>>>>>>>> + size = nr / BITS_PER_BYTE;
>>>>>>>> +
>>>>>>>> + down_write(&shrinkers_max_nr_rwsem);
>>>>>>>> + for_each_node(node) {
>>>>>>>
>>>>>>> Iterating over cgroups first, numa nodes second seems like a better idea
>>>>>>> to me. I think you should fold for_each_node in memcg_expand_maps.
>>>>>>>
>>>>>>>> + idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
>>>>>>>
>>>>>>> Iterating over mem_cgroup_idr looks strange. Why don't you use
>>>>>>> for_each_mem_cgroup?
>>>>>>
>>>>>> We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
>>>>>> mem_cgroup_css_online() mustn't fail (it's a requirement of currently
>>>>>> existing design of memcg_cgroup::id).
>>>>>>
>>>>>> A new memcg is added to parent's list between two of these calls:
>>>>>>
>>>>>> css_create()
>>>>>> ss->css_alloc()
>>>>>> list_add_tail_rcu(&css->sibling, &parent_css->children)
>>>>>> ss->css_online()
>>>>>>
>>>>>> for_each_mem_cgroup() does not see allocated, but not linked children.
>>>>>
>>>>> Why don't we move shrinker map allocation to css_online then?
>>>>
>>>> Because the design of memcg_cgroup::id prohibits mem_cgroup_css_online() to fail.
>>>> This function can't fail.
>>>
>>> I fail to understand why it is so. Could you please elaborate?
>>
>> mem_cgroup::id is freed not in mem_cgroup_css_free(), but earlier. It's freed
>> between mem_cgroup_css_offline() and mem_cgroup_free(), after the last reference
>> is put.
>>
>> In case of sometimes we want to free it in mem_cgroup_css_free(), this will
>> introduce assymmetric in the logic, which makes it more difficult. There is
>> already a bug, which I fixed in
>>
>> "memcg: remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure"
>>
>> new change will make this code completely not-modular and unreadable.
>
> How is that? AFAIU all we need to do to handle css_online failure
> properly is call mem_cgroup_id_remove() from mem_cgroup_css_free().
> That's it, as mem_cgroup_id_remove() is already safe to call more
> than once for the same cgroup - the first call will free the id
> while all subsequent calls will do nothing.

It seemed to me it would be confusing for a reader, but now I'll agree with
you, since it reads OK to you as a reader.

>>
>>>>
>>>> I don't think it will be good to dive into reworking of this stuff for this patchset,
>>>> which is really already big. Also, it will be assymmetric to allocate one part of
>>>> data in css_alloc(), while another data in css_free(). This breaks cgroup design,
>>>> which specially introduces this two function to differ allocation and onlining.
>>>> Also, I've just move the allocation to alloc_mem_cgroup_per_node_info() like it was
>>>> suggested in comments to v1...
>>>
>>> Yeah, but (ab)using mem_cgroup_idr for iterating over all allocated
>>> memory cgroups looks rather dubious to me...
>>
>> But we have to iterate over all allocated memory cgroups in any way,
>> as all of them must have expanded maps. What is the problem?
>> It's rather simple method, and it faster then for_each_mem_cgroup()
>> cycle, since it does not have to play with get and put of refcounters.
>
> I don't like this, because mem_cgroup_idr was initially introduced to
> avoid depletion of css ids by offline cgroups. We could fix that problem
> by extending swap_cgroup to UINT_MAX, in which case mem_cgroup_idr
> wouldn't be needed at all. Reusing mem_cgroup_idr for iterating over
> allocated cgroups deprives us of the ability to reconsider that design
> decision in future, neither does it look natural IMO. Besides, in order
> to use mem_cgroup_idr for your purpose, you have to reshuffle the code
> of mem_cgroup_css_alloc in a rather contrived way IMO.
>
> I agree that allocating parts of struct mem_cgroup in css_online may
> look dubious, but IMHO it's better than inventing a new way to iterate
> over cgroups instead of using the iterator provided by cgroup core.
> May be, you should ask Tejun which way he thinks is better.
>
> Thanks,
> Vladimir
>