2017-08-22 12:29:11

by Kirill Tkhai

Subject: [PATCH 0/3] Make counting of list_lru_one::nr_items lockless

This series aims to improve the scalability of list_lru shrinking
and to make list_lru_count_one() work more efficiently.

On a RHEL7 3.10 kernel I observe high system time usage and much time
spent in super_cache_count() during slab shrinking:

0,94% mysqld [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
0,57% mysqld [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
0,51% mysqld [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
0,32% mysqld [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
0,32% mysqld [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2

(percentage of all node time; collected via $perf record --call-graph fp -j k -a).
This is an example of what the process traces look like; many processes spend
their time in the paths above.

There is a node with many containers (more than 200), and (as usually happens)
the containers have no free memory (the cache is actively used). Since shrink_slab()
iterates over all superblocks, and this happens frequently, shrinking scales badly,
and the node spends more than 90% of its time in sys.

The patchset makes list_lru_count_one() lockless using RCU. Patch [1/3]
adds a new rcu field to struct list_lru_memcg and makes the allocation
functions account for its size. Patch [2/3] makes list_lru_node::memcg_lrus
RCU-protected and RCU-accessible. Patch [3/3] removes the lock and adds RCU
read protection to __list_lru_count_one().
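
For reference, the read and update sides end up following the usual RCU
publish/subscribe pattern. A minimal sketch (made-up names, not the actual
list_lru code; the updater is assumed to be serialized by a mutex):

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/rcupdate.h>

struct example_array {
	struct rcu_head rcu;
	void *slot[0];
};

static struct example_array __rcu *example;

/* Lockless reader: no spinlock, only an RCU read-side critical section. */
static void *example_read(int idx)
{
	void *ret;

	rcu_read_lock();
	ret = rcu_dereference(example)->slot[idx];
	rcu_read_unlock();
	return ret;
}

/* Updater (externally serialized): publish a bigger copy of the array. */
static int example_grow(int old_size, int new_size)
{
	struct example_array *old, *new;

	old = rcu_dereference_protected(example, true);
	new = kmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
	if (!new)
		return -ENOMEM;

	memcpy(&new->slot, &old->slot, old_size * sizeof(void *));
	rcu_assign_pointer(example, new);	/* readers see old or new */
	kfree_rcu(old, rcu);			/* free after a grace period */
	return 0;
}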

---

Kirill Tkhai (3):
mm: Add rcu field to struct list_lru_memcg
mm: Make list_lru_node::memcg_lrus RCU protected
mm: Count list_lru_one::nr_items lockless


include/linux/list_lru.h | 3 +-
mm/list_lru.c | 77 ++++++++++++++++++++++++++++++----------------
2 files changed, 53 insertions(+), 27 deletions(-)

--
Signed-off-by: Kirill Tkhai <[email protected]>


2017-08-22 12:29:19

by Kirill Tkhai

Subject: [PATCH 1/3] mm: Add rcu field to struct list_lru_memcg

This patch adds the new field and makes the kmalloc() calls
allocate memory for it.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/list_lru.h | 1 +
mm/list_lru.c | 7 ++++---
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index fa7fd03cb5f9..b65505b32a3d 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,6 +31,7 @@ struct list_lru_one {
};

struct list_lru_memcg {
+ struct rcu_head rcu;
/* array of per cgroup lists, indexed by memcg_cache_id */
struct list_lru_one *lru[0];
};
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 7a40fa2be858..a726e321bf3e 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -325,7 +325,8 @@ static int memcg_init_list_lru_node(struct list_lru_node *nlru)
{
int size = memcg_nr_cache_ids;

- nlru->memcg_lrus = kmalloc(size * sizeof(void *), GFP_KERNEL);
+ nlru->memcg_lrus = kmalloc(sizeof(struct list_lru_memcg) +
+ size * sizeof(void *), GFP_KERNEL);
if (!nlru->memcg_lrus)
return -ENOMEM;

@@ -351,7 +352,7 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
BUG_ON(old_size > new_size);

old = nlru->memcg_lrus;
- new = kmalloc(new_size * sizeof(void *), GFP_KERNEL);
+ new = kmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
if (!new)
return -ENOMEM;

@@ -360,7 +361,7 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
return -ENOMEM;
}

- memcpy(new, old, old_size * sizeof(void *));
+ memcpy(&new->lru, &old->lru, old_size * sizeof(void *));

/*
* The lock guarantees that we won't race with a reader

2017-08-22 12:29:37

by Kirill Tkhai

Subject: [PATCH 3/3] mm: Count list_lru_one::nr_items lockless

During slab reclaim for a memcg, shrink_slab() iterates over all
registered shrinkers in the system and tries to count and consume
objects related to the cgroup. Under memory pressure, this behaves
badly: I observe high system time and much time spent in
list_lru_count_one() for many processes on a RHEL7 kernel (collected
via $perf record --call-graph fp -j k -a):

0,50% nixstatsagent [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
0,26% nixstatsagent [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
0,23% nixstatsagent [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
0,15% nixstatsagent [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
0,15% nixstatsagent [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2

0,94% mysqld [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
0,57% mysqld [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
0,51% mysqld [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
0,32% mysqld [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
0,32% mysqld [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2

0,73% sshd [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
0,35% sshd [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
0,32% sshd [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
0,21% sshd [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
0,21% sshd [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2

This patch aims to make super_cache_count() more efficient. It
makes __list_lru_count_one() count nr_items locklessly, to minimize
the overhead introduced by the locking operation and to make parallel
reclaims more scalable.

The lock is no longer taken in shrinker::count_objects(); it is
taken only for the real shrink, by the thread that performs it.

https://jira.sw.ru/browse/PSBM-69296

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/list_lru.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 2db3cdadb577..8d1d2db5f4fb 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -177,10 +177,10 @@ static unsigned long __list_lru_count_one(struct list_lru *lru,
struct list_lru_one *l;
unsigned long count;

- spin_lock(&nlru->lock);
+ rcu_read_lock();
l = list_lru_from_memcg_idx(nlru, memcg_idx);
count = l->nr_items;
- spin_unlock(&nlru->lock);
+ rcu_read_unlock();

return count;
}

2017-08-22 12:29:54

by Kirill Tkhai

Subject: [PATCH 2/3] mm: Make list_lru_node::memcg_lrus RCU protected

The array list_lru_node::memcg_lrus::list_lru_one[] only grows;
it never shrinks. The growth happens in memcg_update_list_lru_node(),
and the old array's members remain the same after it.

So access to the array's members may become RCU-protected, and it's
possible to avoid using list_lru_node::lock to dereference it. The
next patch will use this to read a list's nr_items locklessly.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/list_lru.h | 2 +
mm/list_lru.c | 70 +++++++++++++++++++++++++++++++---------------
2 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index b65505b32a3d..a55258100e40 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -43,7 +43,7 @@ struct list_lru_node {
struct list_lru_one lru;
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
/* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
- struct list_lru_memcg *memcg_lrus;
+ struct list_lru_memcg __rcu *memcg_lrus;
#endif
long nr_items;
} ____cacheline_aligned_in_smp;
diff --git a/mm/list_lru.c b/mm/list_lru.c
index a726e321bf3e..2db3cdadb577 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -42,24 +42,30 @@ static void list_lru_unregister(struct list_lru *lru)
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
static inline bool list_lru_memcg_aware(struct list_lru *lru)
{
+ struct list_lru_memcg *memcg_lrus;
/*
* This needs node 0 to be always present, even
* in the systems supporting sparse numa ids.
+ *
+ * Here we only check the pointer is not NULL,
+ * so RCU lock is not need.
*/
- return !!lru->node[0].memcg_lrus;
+ memcg_lrus = rcu_dereference_check(lru->node[0].memcg_lrus, true);
+ return !!memcg_lrus;
}

static inline struct list_lru_one *
list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
{
+ struct list_lru_memcg *memcg_lrus;
/*
- * The lock protects the array of per cgroup lists from relocation
- * (see memcg_update_list_lru_node).
+ * Either lock and RCU protects the array of per cgroup lists
+ * from relocation (see memcg_update_list_lru_node).
*/
- lockdep_assert_held(&nlru->lock);
- if (nlru->memcg_lrus && idx >= 0)
- return nlru->memcg_lrus->lru[idx];
-
+ memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
+ lockdep_is_held(&nlru->lock));
+ if (memcg_lrus && idx >= 0)
+ return memcg_lrus->lru[idx];
return &nlru->lru;
}

@@ -76,9 +82,12 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
static inline struct list_lru_one *
list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
{
+ struct list_lru_memcg *memcg_lrus;
struct mem_cgroup *memcg;

- if (!nlru->memcg_lrus)
+ /* Here we only check the pointer is not NULL, so RCU lock isn't need */
+ memcg_lrus = rcu_dereference_check(nlru->memcg_lrus, true);
+ if (!memcg_lrus)
return &nlru->lru;

memcg = mem_cgroup_from_kmem(ptr);
@@ -323,25 +332,33 @@ static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,

static int memcg_init_list_lru_node(struct list_lru_node *nlru)
{
+ struct list_lru_memcg *memcg_lrus;
int size = memcg_nr_cache_ids;

- nlru->memcg_lrus = kmalloc(sizeof(struct list_lru_memcg) +
- size * sizeof(void *), GFP_KERNEL);
- if (!nlru->memcg_lrus)
+ memcg_lrus = kmalloc(sizeof(*memcg_lrus) +
+ size * sizeof(void *), GFP_KERNEL);
+ if (!memcg_lrus)
return -ENOMEM;

- if (__memcg_init_list_lru_node(nlru->memcg_lrus, 0, size)) {
- kfree(nlru->memcg_lrus);
+ if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
+ kfree(memcg_lrus);
return -ENOMEM;
}
+ rcu_assign_pointer(nlru->memcg_lrus, memcg_lrus);

return 0;
}

static void memcg_destroy_list_lru_node(struct list_lru_node *nlru)
{
- __memcg_destroy_list_lru_node(nlru->memcg_lrus, 0, memcg_nr_cache_ids);
- kfree(nlru->memcg_lrus);
+ struct list_lru_memcg *memcg_lrus;
+ /*
+ * This is called when shrinker has already been unregistered,
+ * and nobody can use it. So, it's not need to use kfree_rcu().
+ */
+ memcg_lrus = rcu_dereference_check(nlru->memcg_lrus, true);
+ __memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids);
+ kfree(memcg_lrus);
}

static int memcg_update_list_lru_node(struct list_lru_node *nlru,
@@ -350,8 +367,10 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
struct list_lru_memcg *old, *new;

BUG_ON(old_size > new_size);
+ lockdep_assert_held(&list_lrus_mutex);

- old = nlru->memcg_lrus;
+ /* list_lrus_mutex is held, nobody can change memcg_lrus. Silence RCU */
+ old = rcu_dereference_check(nlru->memcg_lrus, true);
new = kmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
if (!new)
return -ENOMEM;
@@ -364,26 +383,31 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
memcpy(&new->lru, &old->lru, old_size * sizeof(void *));

/*
- * The lock guarantees that we won't race with a reader
- * (see list_lru_from_memcg_idx).
+ * The locking below allows the readers, that already take nlru->lock,
+ * not to use additional rcu_read_lock()/rcu_read_unlock() pair.
*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
* we have to use IRQ-safe primitives here to avoid deadlock.
*/
spin_lock_irq(&nlru->lock);
- nlru->memcg_lrus = new;
+ rcu_assign_pointer(nlru->memcg_lrus, new);
spin_unlock_irq(&nlru->lock);

- kfree(old);
+ kfree_rcu(old, rcu);
return 0;
}

static void memcg_cancel_update_list_lru_node(struct list_lru_node *nlru,
int old_size, int new_size)
{
+ struct list_lru_memcg *memcg_lrus;
+
+ lockdep_assert_held(&list_lrus_mutex);
+ memcg_lrus = rcu_dereference_check(nlru->memcg_lrus, true);
+
/* do not bother shrinking the array back to the old size, because we
* cannot handle allocation failures here */
- __memcg_destroy_list_lru_node(nlru->memcg_lrus, old_size, new_size);
+ __memcg_destroy_list_lru_node(memcg_lrus, old_size, new_size);
}

static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
@@ -400,7 +424,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
return 0;
fail:
for (i = i - 1; i >= 0; i--) {
- if (!lru->node[i].memcg_lrus)
+ if (!rcu_dereference_check(lru->node[i].memcg_lrus, true))
continue;
memcg_destroy_list_lru_node(&lru->node[i]);
}
@@ -434,7 +458,7 @@ static int memcg_update_list_lru(struct list_lru *lru,
return 0;
fail:
for (i = i - 1; i >= 0; i--) {
- if (!lru->node[i].memcg_lrus)
+ if (!rcu_dereference_check(lru->node[i].memcg_lrus, true))
continue;

memcg_cancel_update_list_lru_node(&lru->node[i],

2017-08-22 19:34:12

by Vladimir Davydov

Subject: Re: [PATCH 2/3] mm: Make list_lru_node::memcg_lrus RCU protected

Hello Kirill,

On Tue, Aug 22, 2017 at 03:29:26PM +0300, Kirill Tkhai wrote:
> The array list_lru_node::memcg_lrus::list_lru_one[] only grows,
> and it never shrinks. The growths happens in memcg_update_list_lru_node(),
> and old array's members remain the same after it.
>
> So, the access to the array's members may become RCU protected,
> and it's possible to avoid using list_lru_node::lock to dereference it.
> This will be used to get list's nr_items in next patch lockless.
>
> Signed-off-by: Kirill Tkhai <[email protected]>

The patch looks very nice. A few really minor comments below.

First, I don't think it's worth splitting this into three patches: patch #1
introduces a structure member that is only used in patch #2, while patch
#2 adds RCU protection, but nobody benefits from it until patch #3 is
applied. Since patches #1 and #3 are tiny, why don't you fold them into
patch #2?

> diff --git a/mm/list_lru.c b/mm/list_lru.c
> @@ -42,24 +42,30 @@ static void list_lru_unregister(struct list_lru *lru)
> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> static inline bool list_lru_memcg_aware(struct list_lru *lru)
> {
> + struct list_lru_memcg *memcg_lrus;
> /*
> * This needs node 0 to be always present, even
> * in the systems supporting sparse numa ids.
> + *
> + * Here we only check the pointer is not NULL,
> + * so RCU lock is not need.
> */
> - return !!lru->node[0].memcg_lrus;
> + memcg_lrus = rcu_dereference_check(lru->node[0].memcg_lrus, true);
> + return !!memcg_lrus;

IIRC you don't need rcu_dereference() here, because you don't actually
dereference anything. The compiler shouldn't complain if you left this
as is.

> }
>
> static inline struct list_lru_one *
> list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
> {
> + struct list_lru_memcg *memcg_lrus;
> /*
> - * The lock protects the array of per cgroup lists from relocation
> - * (see memcg_update_list_lru_node).
> + * Either lock and RCU protects the array of per cgroup lists

Typo: s/and/or/

> + * from relocation (see memcg_update_list_lru_node).
> */
> - lockdep_assert_held(&nlru->lock);
> - if (nlru->memcg_lrus && idx >= 0)
> - return nlru->memcg_lrus->lru[idx];
> -
> + memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
> + lockdep_is_held(&nlru->lock));
> + if (memcg_lrus && idx >= 0)
> + return memcg_lrus->lru[idx];
> return &nlru->lru;
> }
>
> @@ -76,9 +82,12 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
> static inline struct list_lru_one *
> list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
> {
> + struct list_lru_memcg *memcg_lrus;
> struct mem_cgroup *memcg;
>
> - if (!nlru->memcg_lrus)
> + /* Here we only check the pointer is not NULL, so RCU lock isn't need */
> + memcg_lrus = rcu_dereference_check(nlru->memcg_lrus, true);
> + if (!memcg_lrus)

Again, rcu_dereference() is redundant.

> return &nlru->lru;
>
> memcg = mem_cgroup_from_kmem(ptr);
> @@ -323,25 +332,33 @@ static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,
>
> static int memcg_init_list_lru_node(struct list_lru_node *nlru)
> {
> + struct list_lru_memcg *memcg_lrus;
> int size = memcg_nr_cache_ids;
>
> - nlru->memcg_lrus = kmalloc(sizeof(struct list_lru_memcg) +
> - size * sizeof(void *), GFP_KERNEL);
> - if (!nlru->memcg_lrus)
> + memcg_lrus = kmalloc(sizeof(*memcg_lrus) +
> + size * sizeof(void *), GFP_KERNEL);
> + if (!memcg_lrus)
> return -ENOMEM;
>
> - if (__memcg_init_list_lru_node(nlru->memcg_lrus, 0, size)) {
> - kfree(nlru->memcg_lrus);
> + if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
> + kfree(memcg_lrus);
> return -ENOMEM;
> }
> + rcu_assign_pointer(nlru->memcg_lrus, memcg_lrus);

You don't need a memory barrier here, so RCU_INIT_POINTER() would fit
better.

>
> return 0;
> }
>
> static void memcg_destroy_list_lru_node(struct list_lru_node *nlru)
> {
> - __memcg_destroy_list_lru_node(nlru->memcg_lrus, 0, memcg_nr_cache_ids);
> - kfree(nlru->memcg_lrus);
> + struct list_lru_memcg *memcg_lrus;
> + /*
> + * This is called when shrinker has already been unregistered,

> + * and nobody can use it. So, it's not need to use kfree_rcu().

Typo: s/it's not need/there's no need/

> + */
> + memcg_lrus = rcu_dereference_check(nlru->memcg_lrus, true);

IIRC there's rcu_dereference_protected() for cases when you don't
expect any changes to an __rcu variable. Let's use it instead of
rcu_dereference_check() where appropriate.
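
For reference, a tiny made-up example of the difference, assuming updates
are serialized by some mutex:

#include <linux/mutex.h>
#include <linux/rcupdate.h>

struct foo { int val; };

static DEFINE_MUTEX(foo_mutex);
static struct foo __rcu *foo_ptr;

/* Update side only: foo_mutex serializes all changes, so no read-side
 * markers are needed and PROVE_RCU just checks the lockdep condition. */
static struct foo *foo_deref_locked(void)
{
	return rcu_dereference_protected(foo_ptr,
					 lockdep_is_held(&foo_mutex));
}

/* May be called either under rcu_read_lock() or with foo_mutex held:
 * rcu_dereference_check() accepts both. */
static struct foo *foo_deref(void)
{
	return rcu_dereference_check(foo_ptr,
				     lockdep_is_held(&foo_mutex));
}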

> + __memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids);
> + kfree(memcg_lrus);
> }
>
> static int memcg_update_list_lru_node(struct list_lru_node *nlru,
> @@ -350,8 +367,10 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
> struct list_lru_memcg *old, *new;
>
> BUG_ON(old_size > new_size);
> + lockdep_assert_held(&list_lrus_mutex);
>
> - old = nlru->memcg_lrus;
> + /* list_lrus_mutex is held, nobody can change memcg_lrus. Silence RCU */

> + old = rcu_dereference_check(nlru->memcg_lrus, true);

s/rcu_dereference_check/rcu_dereference_protected/

> new = kmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
> if (!new)
> return -ENOMEM;
> @@ -364,26 +383,31 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
> memcpy(&new->lru, &old->lru, old_size * sizeof(void *));
>
> /*
> - * The lock guarantees that we won't race with a reader
> - * (see list_lru_from_memcg_idx).

> + * The locking below allows the readers, that already take nlru->lock,
> + * not to use additional rcu_read_lock()/rcu_read_unlock() pair.

Rephrase a little bit?

The locking below allows readers that hold nlru->lock to avoid taking
rcu_read_lock (see list_lru_from_memcg_idx).

> *
> * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> * we have to use IRQ-safe primitives here to avoid deadlock.
> */
> spin_lock_irq(&nlru->lock);
> - nlru->memcg_lrus = new;
> + rcu_assign_pointer(nlru->memcg_lrus, new);
> spin_unlock_irq(&nlru->lock);
>
> - kfree(old);
> + kfree_rcu(old, rcu);
> return 0;
> }
>
> static void memcg_cancel_update_list_lru_node(struct list_lru_node *nlru,
> int old_size, int new_size)
> {
> + struct list_lru_memcg *memcg_lrus;
> +
> + lockdep_assert_held(&list_lrus_mutex);
> + memcg_lrus = rcu_dereference_check(nlru->memcg_lrus, true);

s/rcu_dereference_check/rcu_dereference_protected/

> +
> /* do not bother shrinking the array back to the old size, because we
> * cannot handle allocation failures here */
> - __memcg_destroy_list_lru_node(nlru->memcg_lrus, old_size, new_size);
> + __memcg_destroy_list_lru_node(memcg_lrus, old_size, new_size);
> }
>
> static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
> @@ -400,7 +424,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
> return 0;
> fail:
> for (i = i - 1; i >= 0; i--) {
> - if (!lru->node[i].memcg_lrus)
> + if (!rcu_dereference_check(lru->node[i].memcg_lrus, true))

No need in rcu_dereference() here as you don't actually dereference
anything.

> continue;
> memcg_destroy_list_lru_node(&lru->node[i]);
> }
> @@ -434,7 +458,7 @@ static int memcg_update_list_lru(struct list_lru *lru,
> return 0;
> fail:
> for (i = i - 1; i >= 0; i--) {
> - if (!lru->node[i].memcg_lrus)
> + if (!rcu_dereference_check(lru->node[i].memcg_lrus, true))

Ditto.

> continue;
>
> memcg_cancel_update_list_lru_node(&lru->node[i],
>

2017-08-22 19:47:31

by Vladimir Davydov

Subject: Re: [PATCH 3/3] mm: Count list_lru_one::nr_items lockless

On Tue, Aug 22, 2017 at 03:29:35PM +0300, Kirill Tkhai wrote:
> During the reclaiming slab of a memcg, shrink_slab iterates
> over all registered shrinkers in the system, and tries to count
> and consume objects related to the cgroup. In case of memory
> pressure, this behaves bad: I observe high system time and
> time spent in list_lru_count_one() for many processes on RHEL7
> kernel (collected via $perf record --call-graph fp -j k -a):
>
> 0,50% nixstatsagent [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> 0,26% nixstatsagent [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> 0,23% nixstatsagent [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> 0,15% nixstatsagent [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> 0,15% nixstatsagent [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>
> 0,94% mysqld [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> 0,57% mysqld [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> 0,51% mysqld [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> 0,32% mysqld [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> 0,32% mysqld [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>
> 0,73% sshd [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> 0,35% sshd [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> 0,32% sshd [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> 0,21% sshd [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> 0,21% sshd [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2

It would be nice to see how this is improved by this patch.
Can you try to record the traces on the vanilla kernel with
and without this patch?

>
> This patch aims to make super_cache_count() more effective. It
> makes __list_lru_count_one() count nr_items lockless to minimize
> overhead introducing by locking operation, and to make parallel
> reclaims more scalable.
>
> The lock won't be taken on shrinker::count_objects(),
> it would be taken only for the real shrink by the thread,
> who realizes it.
>

> https://jira.sw.ru/browse/PSBM-69296

Not relevant.

2017-08-23 08:01:03

by Kirill Tkhai

Subject: Re: [PATCH 3/3] mm: Count list_lru_one::nr_items lockless

On 22.08.2017 22:47, Vladimir Davydov wrote:
> On Tue, Aug 22, 2017 at 03:29:35PM +0300, Kirill Tkhai wrote:
>> During the reclaiming slab of a memcg, shrink_slab iterates
>> over all registered shrinkers in the system, and tries to count
>> and consume objects related to the cgroup. In case of memory
>> pressure, this behaves bad: I observe high system time and
>> time spent in list_lru_count_one() for many processes on RHEL7
>> kernel (collected via $perf record --call-graph fp -j k -a):
>>
>> 0,50% nixstatsagent [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
>> 0,26% nixstatsagent [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
>> 0,23% nixstatsagent [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
>> 0,15% nixstatsagent [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
>> 0,15% nixstatsagent [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>>
>> 0,94% mysqld [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
>> 0,57% mysqld [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
>> 0,51% mysqld [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
>> 0,32% mysqld [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
>> 0,32% mysqld [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>>
>> 0,73% sshd [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
>> 0,35% sshd [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
>> 0,32% sshd [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
>> 0,21% sshd [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
>> 0,21% sshd [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>
> It would be nice to see how this is improved by this patch.
> Can you try to record the traces on the vanilla kernel with
> and without this patch?

Sadly, this is about a production node, and it's impossible to use a vanilla kernel there.

>>
>> This patch aims to make super_cache_count() more effective. It
>> makes __list_lru_count_one() count nr_items lockless to minimize
>> overhead introducing by locking operation, and to make parallel
>> reclaims more scalable.
>>
>> The lock won't be taken on shrinker::count_objects(),
>> it would be taken only for the real shrink by the thread,
>> who realizes it.
>>
>
>> https://jira.sw.ru/browse/PSBM-69296
>
> Not relevant.
>

2017-08-23 08:27:19

by Vladimir Davydov

Subject: Re: [PATCH 3/3] mm: Count list_lru_one::nr_items lockless

On Wed, Aug 23, 2017 at 11:00:56AM +0300, Kirill Tkhai wrote:
> On 22.08.2017 22:47, Vladimir Davydov wrote:
> > On Tue, Aug 22, 2017 at 03:29:35PM +0300, Kirill Tkhai wrote:
> >> During the reclaiming slab of a memcg, shrink_slab iterates
> >> over all registered shrinkers in the system, and tries to count
> >> and consume objects related to the cgroup. In case of memory
> >> pressure, this behaves bad: I observe high system time and
> >> time spent in list_lru_count_one() for many processes on RHEL7
> >> kernel (collected via $perf record --call-graph fp -j k -a):
> >>
> >> 0,50% nixstatsagent [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> >> 0,26% nixstatsagent [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> >> 0,23% nixstatsagent [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> >> 0,15% nixstatsagent [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> >> 0,15% nixstatsagent [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
> >>
> >> 0,94% mysqld [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> >> 0,57% mysqld [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> >> 0,51% mysqld [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> >> 0,32% mysqld [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> >> 0,32% mysqld [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
> >>
> >> 0,73% sshd [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> >> 0,35% sshd [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> >> 0,32% sshd [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> >> 0,21% sshd [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> >> 0,21% sshd [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
> >
> > It would be nice to see how this is improved by this patch.
> > Can you try to record the traces on the vanilla kernel with
> > and without this patch?
>
> Sadly, the talk is about a production node, and it's impossible to use vanila kernel there.

I see :-( Then maybe you could try to come up with a contrived test?

2017-08-23 12:26:13

by Kirill Tkhai

Subject: Re: [PATCH 3/3] mm: Count list_lru_one::nr_items lockless

On 23.08.2017 11:27, Vladimir Davydov wrote:
> On Wed, Aug 23, 2017 at 11:00:56AM +0300, Kirill Tkhai wrote:
>> On 22.08.2017 22:47, Vladimir Davydov wrote:
>>> On Tue, Aug 22, 2017 at 03:29:35PM +0300, Kirill Tkhai wrote:
>>>> During the reclaiming slab of a memcg, shrink_slab iterates
>>>> over all registered shrinkers in the system, and tries to count
>>>> and consume objects related to the cgroup. In case of memory
>>>> pressure, this behaves bad: I observe high system time and
>>>> time spent in list_lru_count_one() for many processes on RHEL7
>>>> kernel (collected via $perf record --call-graph fp -j k -a):
>>>>
>>>> 0,50% nixstatsagent [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
>>>> 0,26% nixstatsagent [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
>>>> 0,23% nixstatsagent [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
>>>> 0,15% nixstatsagent [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
>>>> 0,15% nixstatsagent [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>>>>
>>>> 0,94% mysqld [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
>>>> 0,57% mysqld [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
>>>> 0,51% mysqld [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
>>>> 0,32% mysqld [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
>>>> 0,32% mysqld [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>>>>
>>>> 0,73% sshd [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
>>>> 0,35% sshd [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
>>>> 0,32% sshd [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
>>>> 0,21% sshd [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
>>>> 0,21% sshd [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>>>
>>> It would be nice to see how this is improved by this patch.
>>> Can you try to record the traces on the vanilla kernel with
>>> and without this patch?
>>
>> Sadly, the talk is about a production node, and it's impossible to use vanila kernel there.
>
> I see :-( Then maybe you could try to come up with a contrived test?

I've tried, and I'm not sure I'm able to reproduce on my 8-cpu test node the situation
I saw on the production node. Maybe you have an idea how to measure that?

I've changed the places you commented on, and the merged patch is below.
What do you think of it?

[PATCH] mm: Make counting of list_lru_one::nr_items lockless

During slab reclaim for a memcg, shrink_slab() iterates over all
registered shrinkers in the system and tries to count and consume
objects related to the cgroup. Under memory pressure, this behaves
badly: I observe high system time and much time spent in
list_lru_count_one() for many processes on a RHEL7 kernel (collected
via $perf record --call-graph fp -j k -a):

0,50% nixstatsagent [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
0,26% nixstatsagent [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
0,23% nixstatsagent [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
0,15% nixstatsagent [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
0,15% nixstatsagent [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2

0,94% mysqld [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
0,57% mysqld [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
0,51% mysqld [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
0,32% mysqld [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
0,32% mysqld [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2

0,73% sshd [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
0,35% sshd [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
0,32% sshd [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
0,21% sshd [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
0,21% sshd [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2

This patch aims to make super_cache_count() (and the other functions
that count LRU nr_items) more efficient.
It allows list_lru_node::memcg_lrus to be RCU-accessed, and makes
__list_lru_count_one() count nr_items locklessly, to minimize the
overhead introduced by the locking operation and to make parallel
reclaims more scalable.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/list_lru.h | 3 ++-
mm/list_lru.c | 59 ++++++++++++++++++++++++++++++------------------
2 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index fa7fd03cb5f9..a55258100e40 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,6 +31,7 @@ struct list_lru_one {
};

struct list_lru_memcg {
+ struct rcu_head rcu;
/* array of per cgroup lists, indexed by memcg_cache_id */
struct list_lru_one *lru[0];
};
@@ -42,7 +43,7 @@ struct list_lru_node {
struct list_lru_one lru;
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
/* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
- struct list_lru_memcg *memcg_lrus;
+ struct list_lru_memcg __rcu *memcg_lrus;
#endif
long nr_items;
} ____cacheline_aligned_in_smp;
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 7a40fa2be858..9fdb24818dae 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -52,14 +52,15 @@ static inline bool list_lru_memcg_aware(struct list_lru *lru)
static inline struct list_lru_one *
list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
{
+ struct list_lru_memcg *memcg_lrus;
/*
- * The lock protects the array of per cgroup lists from relocation
- * (see memcg_update_list_lru_node).
+ * Either lock or RCU protects the array of per cgroup lists
+ * from relocation (see memcg_update_list_lru_node).
*/
- lockdep_assert_held(&nlru->lock);
- if (nlru->memcg_lrus && idx >= 0)
- return nlru->memcg_lrus->lru[idx];
-
+ memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
+ lockdep_is_held(&nlru->lock));
+ if (memcg_lrus && idx >= 0)
+ return memcg_lrus->lru[idx];
return &nlru->lru;
}

@@ -168,10 +169,10 @@ static unsigned long __list_lru_count_one(struct list_lru *lru,
struct list_lru_one *l;
unsigned long count;

- spin_lock(&nlru->lock);
+ rcu_read_lock();
l = list_lru_from_memcg_idx(nlru, memcg_idx);
count = l->nr_items;
- spin_unlock(&nlru->lock);
+ rcu_read_unlock();

return count;
}
@@ -323,24 +324,33 @@ static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,

static int memcg_init_list_lru_node(struct list_lru_node *nlru)
{
+ struct list_lru_memcg *memcg_lrus;
int size = memcg_nr_cache_ids;

- nlru->memcg_lrus = kmalloc(size * sizeof(void *), GFP_KERNEL);
- if (!nlru->memcg_lrus)
+ memcg_lrus = kmalloc(sizeof(*memcg_lrus) +
+ size * sizeof(void *), GFP_KERNEL);
+ if (!memcg_lrus)
return -ENOMEM;

- if (__memcg_init_list_lru_node(nlru->memcg_lrus, 0, size)) {
- kfree(nlru->memcg_lrus);
+ if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
+ kfree(memcg_lrus);
return -ENOMEM;
}
+ RCU_INIT_POINTER(nlru->memcg_lrus, memcg_lrus);

return 0;
}

static void memcg_destroy_list_lru_node(struct list_lru_node *nlru)
{
- __memcg_destroy_list_lru_node(nlru->memcg_lrus, 0, memcg_nr_cache_ids);
- kfree(nlru->memcg_lrus);
+ struct list_lru_memcg *memcg_lrus;
+ /*
+ * This is called when shrinker has already been unregistered,
+ * and nobody can use it. So, there is no need to use kfree_rcu().
+ */
+ memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
+ __memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids);
+ kfree(memcg_lrus);
}

static int memcg_update_list_lru_node(struct list_lru_node *nlru,
@@ -350,8 +360,9 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,

BUG_ON(old_size > new_size);

- old = nlru->memcg_lrus;
- new = kmalloc(new_size * sizeof(void *), GFP_KERNEL);
+ old = rcu_dereference_protected(nlru->memcg_lrus,
+ lockdep_is_held(&list_lrus_mutex));
+ new = kmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
if (!new)
return -ENOMEM;

@@ -360,29 +371,33 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
return -ENOMEM;
}

- memcpy(new, old, old_size * sizeof(void *));
+ memcpy(&new->lru, &old->lru, old_size * sizeof(void *));

/*
- * The lock guarantees that we won't race with a reader
- * (see list_lru_from_memcg_idx).
+ * The locking below allows readers that hold nlru->lock to avoid taking
+ * rcu_read_lock (see list_lru_from_memcg_idx).
*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
* we have to use IRQ-safe primitives here to avoid deadlock.
*/
spin_lock_irq(&nlru->lock);
- nlru->memcg_lrus = new;
+ rcu_assign_pointer(nlru->memcg_lrus, new);
spin_unlock_irq(&nlru->lock);

- kfree(old);
+ kfree_rcu(old, rcu);
return 0;
}

static void memcg_cancel_update_list_lru_node(struct list_lru_node *nlru,
int old_size, int new_size)
{
+ struct list_lru_memcg *memcg_lrus;
+
+ memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus,
+ lockdep_is_held(&list_lrus_mutex));
/* do not bother shrinking the array back to the old size, because we
* cannot handle allocation failures here */
- __memcg_destroy_list_lru_node(nlru->memcg_lrus, old_size, new_size);
+ __memcg_destroy_list_lru_node(memcg_lrus, old_size, new_size);
}

static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)

2017-08-26 17:57:39

by Vladimir Davydov

Subject: Re: [PATCH 3/3] mm: Count list_lru_one::nr_items lockless

On Wed, Aug 23, 2017 at 03:26:12PM +0300, Kirill Tkhai wrote:
> On 23.08.2017 11:27, Vladimir Davydov wrote:
> > On Wed, Aug 23, 2017 at 11:00:56AM +0300, Kirill Tkhai wrote:
> >> On 22.08.2017 22:47, Vladimir Davydov wrote:
> >>> On Tue, Aug 22, 2017 at 03:29:35PM +0300, Kirill Tkhai wrote:
> >>>> During the reclaiming slab of a memcg, shrink_slab iterates
> >>>> over all registered shrinkers in the system, and tries to count
> >>>> and consume objects related to the cgroup. In case of memory
> >>>> pressure, this behaves bad: I observe high system time and
> >>>> time spent in list_lru_count_one() for many processes on RHEL7
> >>>> kernel (collected via $perf record --call-graph fp -j k -a):
> >>>>
> >>>> 0,50% nixstatsagent [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> >>>> 0,26% nixstatsagent [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> >>>> 0,23% nixstatsagent [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> >>>> 0,15% nixstatsagent [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> >>>> 0,15% nixstatsagent [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
> >>>>
> >>>> 0,94% mysqld [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> >>>> 0,57% mysqld [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> >>>> 0,51% mysqld [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> >>>> 0,32% mysqld [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> >>>> 0,32% mysqld [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
> >>>>
> >>>> 0,73% sshd [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> >>>> 0,35% sshd [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> >>>> 0,32% sshd [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> >>>> 0,21% sshd [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> >>>> 0,21% sshd [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
> >>>
> >>> It would be nice to see how this is improved by this patch.
> >>> Can you try to record the traces on the vanilla kernel with
> >>> and without this patch?
> >>
> >> Sadly, the talk is about a production node, and it's impossible to use vanila kernel there.
> >
> > I see :-( Then maybe you could try to come up with a contrived test?
>
> I've tried and I'm not sure I'm able to reproduce on my test 8-cpu
> node the situation like I saw on production node via a test. Maybe you
> have an idea how to measure that?

Since the issue here is heavy lock contention while counting shrinkable
objects, what we need to do in order to demonstrate that this patch
really helps is trigger parallel direct reclaim that would walk over a
large number of cgroups. For the effect of this patch to be perceptible,
the cgroups should have no shrinkable objects (inodes, dentries)
accounted to them, so that direct reclaim would spend most of its time
counting objects, not shrinking them. So I would try to create a cgroup
with a large number (say 1000) of empty sub-cgroups, set its limit to be
less than system memory (to easily trigger direct reclaim and avoid
kswapd weighing in and disrupting the picture), and run N processes in
it, each reading its own large file that does not fit in the limit,
where N is greater than the number of cores (say 4 processes per core).
I would expect each of the processes to spend a lot of time trying to
acquire a list_lru lock to count inodes/dentries, which should be
observed via perf. With this patch the processes would be able to
proceed in parallel without stalling on the list_lru lock.
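
For illustration only, the reader part of such a test could be as simple as
the program below (the cgroup setup described above is assumed to be done
separately; NPROC and the "file.N" names are arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NPROC	32		/* ~4 readers per core on an 8-cpu node */
#define BUFSZ	(1 << 20)	/* read in 1M chunks */

static void reader(int id)
{
	char path[64], *buf;
	ssize_t ret;
	int fd;

	snprintf(path, sizeof(path), "file.%d", id);
	fd = open(path, O_RDONLY);
	buf = malloc(BUFSZ);
	if (fd < 0 || !buf)
		exit(1);

	for (;;) {
		/* Re-read the file forever; it does not fit in the memcg
		 * limit, so every pass keeps direct reclaim busy. */
		while ((ret = read(fd, buf, BUFSZ)) > 0)
			;
		if (ret < 0 || lseek(fd, 0, SEEK_SET) < 0)
			exit(1);
	}
}

int main(void)
{
	int i;

	for (i = 0; i < NPROC; i++)
		if (fork() == 0)
			reader(i);

	while (wait(NULL) > 0)
		;
	return 0;
}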

>
> I've changed the places, you commented, and the merged patch is below.
> How are you about it?

Looks good, thanks.

Acked-by: Vladimir Davydov <[email protected]>

>
> [PATCH]mm: Make count list_lru_one::nr_items lockless
>
> During the reclaiming slab of a memcg, shrink_slab iterates
> over all registered shrinkers in the system, and tries to count
> and consume objects related to the cgroup. In case of memory
> pressure, this behaves bad: I observe high system time and
> time spent in list_lru_count_one() for many processes on RHEL7
> kernel (collected via $perf record --call-graph fp -j k -a):
>
> 0,50% nixstatsagent [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> 0,26% nixstatsagent [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> 0,23% nixstatsagent [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> 0,15% nixstatsagent [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> 0,15% nixstatsagent [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>
> 0,94% mysqld [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> 0,57% mysqld [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> 0,51% mysqld [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> 0,32% mysqld [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> 0,32% mysqld [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>
> 0,73% sshd [kernel.vmlinux] [k] _raw_spin_lock [k] _raw_spin_lock
> 0,35% sshd [kernel.vmlinux] [k] shrink_slab [k] shrink_slab
> 0,32% sshd [kernel.vmlinux] [k] super_cache_count [k] super_cache_count
> 0,21% sshd [kernel.vmlinux] [k] __list_lru_count_one.isra.2 [k] _raw_spin_lock
> 0,21% sshd [kernel.vmlinux] [k] list_lru_count_one [k] __list_lru_count_one.isra.2
>
> This patch aims to make super_cache_count() (and other functions,
> which count LRU nr_items) more effective.
> It allows list_lru_node::memcg_lrus to be RCU-accessed, and makes
> __list_lru_count_one() count nr_items lockless to minimize
> overhead introduced by locking operation, and to make parallel
> reclaims more scalable.
>
> Signed-off-by: Kirill Tkhai <[email protected]>