2019-03-07 23:02:33

by Roman Gushchin

Subject: [PATCH 0/5] mm: reduce the memory footprint of dying memory cgroups

A cgroup can remain in the dying state for a long time, pinned in memory
by any kernel object. It can be pinned by a page shared with another
cgroup (e.g. mlocked by a process in that other cgroup), by a vfs cache
object, etc.

Mostly because of percpu data, the size of a memcg structure in kernel
memory is quite large. Depending on the machine size and the kernel config,
it can easily reach hundreds of kilobytes per cgroup.

Depending on the memory pressure and the reclaim approach (which is a separate
topic), several hundred (if not a few thousand) dying cgroups looks like a
typical number. On a moderately sized machine the overall memory footprint
is measured in hundreds of megabytes.

So if we can't completely get rid of dying cgroups, let's make them smaller.
This patchset aims to reduce the size of a dying memory cgroup by prematurely
releasing its percpu data during cgroup removal and switching to the atomic
counterparts instead. Currently it covers the per-memcg vmstats_percpu and the
per-memcg per-node lruvec_stat_cpu. The same approach can be further applied
to other percpu data.
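
The idea, in a simplified form (just an illustration; the actual names and
details differ in the patches below): while the cgroup is online, updates go
to a percpu counter behind an rcu-protected pointer; once the cgroup has been
offlined and the percpu part released, updates fall back to the atomic
counterpart.

        struct counter {
                long __rcu /* __percpu */ *percpu;      /* NULL once offlined */
                atomic_long_t atomic;                   /* always valid */
        };

        static void counter_add(struct counter *c, long val)
        {
                long __percpu *pcp;

                rcu_read_lock();
                pcp = (long __percpu *)rcu_dereference(c->percpu);
                if (likely(pcp))
                        this_cpu_add(*pcp, val);        /* fast path: online cgroup */
                else
                        atomic_long_add(val, &c->atomic); /* fallback: dying cgroup */
                rcu_read_unlock();
        }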

Results on my test machine (32 CPUs, single node):

                          With the patchset:    Originally:

nr_dying_descendants 0
  Slab:                       66640 kB            67644 kB
  Percpu:                      6912 kB             6912 kB

nr_dying_descendants 1000
  Slab:                       85912 kB            84704 kB
  Percpu:                     26880 kB            64128 kB

So one dying cgroup went from 75 kB to 39 kB (the per-cgroup Slab + Percpu
delta), i.e. almost half the size. The difference will be even bigger on a
larger machine, especially with NUMA.

To test the patchset, I used the following script:
CG=/sys/fs/cgroup/percpu_test/

mkdir ${CG}
echo "+memory" > ${CG}/cgroup.subtree_control

cat ${CG}/cgroup.stat | grep nr_dying_descendants
cat /proc/meminfo | grep -e Percpu -e Slab

for i in `seq 1 1000`; do
        mkdir ${CG}/${i}
        echo $$ > ${CG}/${i}/cgroup.procs
        dd if=/dev/urandom of=/tmp/test-${i} count=1 2> /dev/null
        echo $$ > /sys/fs/cgroup/cgroup.procs
        rmdir ${CG}/${i}
done

cat /sys/fs/cgroup/cgroup.stat | grep nr_dying_descendants
cat /proc/meminfo | grep -e Percpu -e Slab

rmdir ${CG}


Roman Gushchin (5):
mm: prepare to premature release of memcg->vmstats_percpu
mm: prepare to premature release of per-node lruvec_stat_cpu
mm: release memcg percpu data prematurely
mm: release per-node memcg percpu data prematurely
mm: spill memcg percpu stats and events before releasing

include/linux/memcontrol.h | 66 ++++++++++----
mm/memcontrol.c | 173 +++++++++++++++++++++++++++++++++----
2 files changed, 204 insertions(+), 35 deletions(-)

--
2.20.1



2019-03-07 23:01:25

by Roman Gushchin

Subject: [PATCH 4/5] mm: release per-node memcg percpu data prematurely

Similar to memcg-level statistics, per-node data isn't expected
to be hot after cgroup removal. Switching over to atomics and
prematurely releasing percpu data helps to reduce the memory
footprint of dying cgroups.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 24 +++++++++++++++++++++++-
2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 569337514230..f296693d102b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -127,6 +127,7 @@ struct mem_cgroup_per_node {
struct lruvec lruvec;

struct lruvec_stat __rcu /* __percpu */ *lruvec_stat_cpu;
+ struct lruvec_stat __percpu *lruvec_stat_cpu_offlined;
atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];

unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c55954e6f23..18e863890392 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4459,7 +4459,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return;

- free_percpu(pn->lruvec_stat_cpu);
+ WARN_ON_ONCE(pn->lruvec_stat_cpu != NULL);
kfree(pn);
}

@@ -4615,7 +4615,17 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
static void mem_cgroup_free_percpu(struct rcu_head *rcu)
{
struct mem_cgroup *memcg = container_of(rcu, struct mem_cgroup, rcu);
+ int node;
+
+ for_each_node(node) {
+ struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];

+ if (!pn)
+ continue;
+
+ free_percpu(pn->lruvec_stat_cpu_offlined);
+ WARN_ON_ONCE(pn->lruvec_stat_cpu != NULL);
+ }
free_percpu(memcg->vmstats_percpu_offlined);
WARN_ON_ONCE(memcg->vmstats_percpu);

@@ -4624,6 +4634,18 @@ static void mem_cgroup_free_percpu(struct rcu_head *rcu)

static void mem_cgroup_offline_percpu(struct mem_cgroup *memcg)
{
+ int node;
+
+ for_each_node(node) {
+ struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
+
+ if (!pn)
+ continue;
+
+ pn->lruvec_stat_cpu_offlined = (struct lruvec_stat __percpu *)
+ rcu_dereference(pn->lruvec_stat_cpu);
+ rcu_assign_pointer(pn->lruvec_stat_cpu, NULL);
+ }
memcg->vmstats_percpu_offlined = (struct memcg_vmstats_percpu __percpu*)
rcu_dereference(memcg->vmstats_percpu);
rcu_assign_pointer(memcg->vmstats_percpu, NULL);
--
2.20.1


2019-03-07 23:01:32

by Roman Gushchin

Subject: [PATCH 5/5] mm: spill memcg percpu stats and events before releasing

Spill percpu stats and events data to the corresponding atomic counters
before releasing the percpu memory.

Although per-cpu stats are never exactly precise, dropping them on the
floor regularly may lead to an accumulation of error. So it's safer to
sync them before releasing.

To minimize the number of atomic updates, let's sum all stats/events
on all cpus locally, and then make a single update per entry.

Signed-off-by: Roman Gushchin <[email protected]>
---
mm/memcontrol.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18e863890392..b7eb6fac735e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4612,11 +4612,63 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}

+/*
+ * Spill all per-cpu stats and events into atomics.
+ * Try to minimize the number of atomic writes by gathering data from
+ * all cpus locally, and then make one atomic update.
+ * No locking is required, because no one has access to
+ * the offlined percpu data.
+ */
+static void mem_cgroup_spill_offlined_percpu(struct mem_cgroup *memcg)
+{
+ struct memcg_vmstats_percpu __percpu *vmstats_percpu;
+ struct lruvec_stat __percpu *lruvec_stat_cpu;
+ struct mem_cgroup_per_node *pn;
+ int cpu, i;
+ long x;
+
+ vmstats_percpu = memcg->vmstats_percpu_offlined;
+
+ for (i = 0; i < MEMCG_NR_STAT; i++) {
+ int nid;
+
+ x = 0;
+ for_each_possible_cpu(cpu)
+ x += per_cpu(vmstats_percpu->stat[i], cpu);
+ if (x)
+ atomic_long_add(x, &memcg->vmstats[i]);
+
+ if (i >= NR_VM_NODE_STAT_ITEMS)
+ continue;
+
+ for_each_node(nid) {
+ pn = mem_cgroup_nodeinfo(memcg, nid);
+ lruvec_stat_cpu = pn->lruvec_stat_cpu_offlined;
+
+ x = 0;
+ for_each_possible_cpu(cpu)
+ x += per_cpu(lruvec_stat_cpu->count[i], cpu);
+ if (x)
+ atomic_long_add(x, &pn->lruvec_stat[i]);
+ }
+ }
+
+ for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
+ x = 0;
+ for_each_possible_cpu(cpu)
+ x += per_cpu(vmstats_percpu->events[i], cpu);
+ if (x)
+ atomic_long_add(x, &memcg->vmevents[i]);
+ }
+}
+
static void mem_cgroup_free_percpu(struct rcu_head *rcu)
{
struct mem_cgroup *memcg = container_of(rcu, struct mem_cgroup, rcu);
int node;

+ mem_cgroup_spill_offlined_percpu(memcg);
+
for_each_node(node) {
struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];

--
2.20.1


2019-03-07 23:01:42

by Roman Gushchin

Subject: [PATCH 3/5] mm: release memcg percpu data prematurely

To reduce the memory footprint of a dying memory cgroup, let's
release massive percpu data (vmstats_percpu) as early as possible,
and use atomic counterparts instead.

A dying cgroup can remain in the dying state for quite a long
time, being pinned in memory by any reference. For example,
if a page mlocked by some other cgroup is charged to the dying
cgroup, it won't go away until the page is released.

A dying memory cgroup can have some memory activity (e.g. dirty
pages can be flushed after cgroup removal), but in general it's
not expected to be very active in comparison to living cgroups.

So reducing the memory footprint by releasing percpu data
and switching over to atomics seems to be a good trade off.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 4 ++++
mm/memcontrol.c | 24 +++++++++++++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8ac04632002a..569337514230 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -275,6 +275,10 @@ struct mem_cgroup {

/* memory.stat */
struct memcg_vmstats_percpu __rcu /* __percpu */ *vmstats_percpu;
+ struct memcg_vmstats_percpu __percpu *vmstats_percpu_offlined;
+
+ /* used to release unused percpu memory */
+ struct rcu_head rcu;

MEMCG_PADDING(_pad2_);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8f3cac02221a..8c55954e6f23 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4469,7 +4469,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)

for_each_node(node)
free_mem_cgroup_per_node_info(memcg, node);
- free_percpu(memcg->vmstats_percpu);
+ WARN_ON_ONCE(memcg->vmstats_percpu != NULL);
kfree(memcg);
}

@@ -4612,6 +4612,26 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}

+static void mem_cgroup_free_percpu(struct rcu_head *rcu)
+{
+ struct mem_cgroup *memcg = container_of(rcu, struct mem_cgroup, rcu);
+
+ free_percpu(memcg->vmstats_percpu_offlined);
+ WARN_ON_ONCE(memcg->vmstats_percpu);
+
+ css_put(&memcg->css);
+}
+
+static void mem_cgroup_offline_percpu(struct mem_cgroup *memcg)
+{
+ memcg->vmstats_percpu_offlined = (struct memcg_vmstats_percpu __percpu*)
+ rcu_dereference(memcg->vmstats_percpu);
+ rcu_assign_pointer(memcg->vmstats_percpu, NULL);
+
+ css_get(&memcg->css);
+ call_rcu(&memcg->rcu, mem_cgroup_free_percpu);
+}
+
static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -4638,6 +4658,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
drain_all_stock(memcg);

mem_cgroup_id_put(memcg);
+
+ mem_cgroup_offline_percpu(memcg);
}

static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
--
2.20.1


2019-03-07 23:01:56

by Roman Gushchin

Subject: [PATCH 1/5] mm: prepare to premature release of memcg->vmstats_percpu

Prepare to handle premature release of memcg->vmstats_percpu data.
Currently it's a generic pointer which is expected to be non-NULL
during the whole lifetime of a memcg. Switch over to an
rcu-protected pointer, and carefully check it for being non-NULL.

This change is a required step towards dynamic premature release
of percpu memcg data.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 40 +++++++++++++++++-------
mm/memcontrol.c | 62 +++++++++++++++++++++++++++++---------
2 files changed, 77 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 534267947664..05ca77767c6a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -274,7 +274,7 @@ struct mem_cgroup {
struct task_struct *move_lock_task;

/* memory.stat */
- struct memcg_vmstats_percpu __percpu *vmstats_percpu;
+ struct memcg_vmstats_percpu __rcu /* __percpu */ *vmstats_percpu;

MEMCG_PADDING(_pad2_);

@@ -597,17 +597,26 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
static inline void __mod_memcg_state(struct mem_cgroup *memcg,
int idx, int val)
{
+ struct memcg_vmstats_percpu __percpu *vmstats_percpu;
long x;

if (mem_cgroup_disabled())
return;

- x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
- if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
- atomic_long_add(x, &memcg->vmstats[idx]);
- x = 0;
+ rcu_read_lock();
+ vmstats_percpu = (struct memcg_vmstats_percpu __percpu *)
+ rcu_dereference(memcg->vmstats_percpu);
+ if (likely(vmstats_percpu)) {
+ x = val + __this_cpu_read(vmstats_percpu->stat[idx]);
+ if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &memcg->vmstats[idx]);
+ x = 0;
+ }
+ __this_cpu_write(vmstats_percpu->stat[idx], x);
+ } else {
+ atomic_long_add(val, &memcg->vmstats[idx]);
}
- __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
+ rcu_read_unlock();
}

/* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -740,17 +749,26 @@ static inline void __count_memcg_events(struct mem_cgroup *memcg,
enum vm_event_item idx,
unsigned long count)
{
+ struct memcg_vmstats_percpu __percpu *vmstats_percpu;
unsigned long x;

if (mem_cgroup_disabled())
return;

- x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
- if (unlikely(x > MEMCG_CHARGE_BATCH)) {
- atomic_long_add(x, &memcg->vmevents[idx]);
- x = 0;
+ rcu_read_lock();
+ vmstats_percpu = (struct memcg_vmstats_percpu __percpu *)
+ rcu_dereference(memcg->vmstats_percpu);
+ if (likely(vmstats_percpu)) {
+ x = count + __this_cpu_read(vmstats_percpu->events[idx]);
+ if (unlikely(x > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &memcg->vmevents[idx]);
+ x = 0;
+ }
+ __this_cpu_write(vmstats_percpu->events[idx], x);
+ } else {
+ atomic_long_add(count, &memcg->vmevents[idx]);
}
- __this_cpu_write(memcg->vmstats_percpu->events[idx], x);
+ rcu_read_unlock();
}

static inline void count_memcg_events(struct mem_cgroup *memcg,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c532f8685aa3..803c772f354b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -697,6 +697,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
struct page *page,
bool compound, int nr_pages)
{
+ struct memcg_vmstats_percpu __percpu *vmstats_percpu;
+
/*
* Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
* counted as CACHE even if it's on ANON LRU.
@@ -722,7 +724,12 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
nr_pages = -nr_pages; /* for event */
}

- __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
+ rcu_read_lock();
+ vmstats_percpu = (struct memcg_vmstats_percpu __percpu *)
+ rcu_dereference(memcg->vmstats_percpu);
+ if (likely(vmstats_percpu))
+ __this_cpu_add(vmstats_percpu->nr_page_events, nr_pages);
+ rcu_read_unlock();
}

unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
@@ -756,10 +763,18 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
enum mem_cgroup_events_target target)
{
+ struct memcg_vmstats_percpu __percpu *vmstats_percpu;
unsigned long val, next;
+ bool ret = false;

- val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
- next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
+ rcu_read_lock();
+ vmstats_percpu = (struct memcg_vmstats_percpu __percpu *)
+ rcu_dereference(memcg->vmstats_percpu);
+ if (!vmstats_percpu)
+ goto out;
+
+ val = __this_cpu_read(vmstats_percpu->nr_page_events);
+ next = __this_cpu_read(vmstats_percpu->targets[target]);
/* from time_after() in jiffies.h */
if ((long)(next - val) < 0) {
switch (target) {
@@ -775,10 +790,12 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
default:
break;
}
- __this_cpu_write(memcg->vmstats_percpu->targets[target], next);
- return true;
+ __this_cpu_write(vmstats_percpu->targets[target], next);
+ ret = true;
}
- return false;
+out:
+ rcu_read_unlock();
+ return ret;
}

/*
@@ -2104,22 +2121,29 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)

static int memcg_hotplug_cpu_dead(unsigned int cpu)
{
+ struct memcg_vmstats_percpu __percpu *vmstats_percpu;
struct memcg_stock_pcp *stock;
struct mem_cgroup *memcg;

stock = &per_cpu(memcg_stock, cpu);
drain_stock(stock);

+ rcu_read_lock();
for_each_mem_cgroup(memcg) {
int i;

+ vmstats_percpu = (struct memcg_vmstats_percpu __percpu *)
+ rcu_dereference(memcg->vmstats_percpu);
+
for (i = 0; i < MEMCG_NR_STAT; i++) {
int nid;
long x;

- x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
- if (x)
- atomic_long_add(x, &memcg->vmstats[i]);
+ if (vmstats_percpu) {
+ x = this_cpu_xchg(vmstats_percpu->stat[i], 0);
+ if (x)
+ atomic_long_add(x, &memcg->vmstats[i]);
+ }

if (i >= NR_VM_NODE_STAT_ITEMS)
continue;
@@ -2137,11 +2161,14 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
long x;

- x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
- if (x)
- atomic_long_add(x, &memcg->vmevents[i]);
+ if (vmstats_percpu) {
+ x = this_cpu_xchg(vmstats_percpu->events[i], 0);
+ if (x)
+ atomic_long_add(x, &memcg->vmevents[i]);
+ }
}
}
+ rcu_read_unlock();

return 0;
}
@@ -4464,7 +4491,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (memcg->id.id < 0)
goto fail;

- memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
+ rcu_assign_pointer(memcg->vmstats_percpu,
+ alloc_percpu(struct memcg_vmstats_percpu));
if (!memcg->vmstats_percpu)
goto fail;

@@ -6054,6 +6082,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
{
unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
unsigned long flags;
+ struct memcg_vmstats_percpu __percpu *vmstats_percpu;

if (!mem_cgroup_is_root(ug->memcg)) {
page_counter_uncharge(&ug->memcg->memory, nr_pages);
@@ -6070,7 +6099,12 @@ static void uncharge_batch(const struct uncharge_gather *ug)
__mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge);
__mod_memcg_state(ug->memcg, NR_SHMEM, -ug->nr_shmem);
__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
- __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, nr_pages);
+ rcu_read_lock();
+ vmstats_percpu = (struct memcg_vmstats_percpu __percpu *)
+ rcu_dereference(ug->memcg->vmstats_percpu);
+ if (likely(vmstats_percpu))
+ __this_cpu_add(vmstats_percpu->nr_page_events, nr_pages);
+ rcu_read_unlock();
memcg_check_events(ug->memcg, ug->dummy_page);
local_irq_restore(flags);

--
2.20.1


2019-03-07 23:02:50

by Roman Gushchin

Subject: [PATCH 2/5] mm: prepare to premature release of per-node lruvec_stat_cpu

Similar to the memcg's vmstats_percpu, per-memcg per-node stats
consist of percpu and atomic counterparts, and we do expect
that both coexist during the whole life-cycle of the memcg.

To prepare for a premature release of percpu per-node data,
let's pretend that lruvec_stat_cpu is an rcu-protected pointer,
which can be NULL. This patch adds corresponding checks whenever
required.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 21 +++++++++++++++------
mm/memcontrol.c | 11 +++++++++--
2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 05ca77767c6a..8ac04632002a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -126,7 +126,7 @@ struct memcg_shrinker_map {
struct mem_cgroup_per_node {
struct lruvec lruvec;

- struct lruvec_stat __percpu *lruvec_stat_cpu;
+ struct lruvec_stat __rcu /* __percpu */ *lruvec_stat_cpu;
atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];

unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
@@ -682,6 +682,7 @@ static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
static inline void __mod_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
{
+ struct lruvec_stat __percpu *lruvec_stat_cpu;
struct mem_cgroup_per_node *pn;
long x;

@@ -697,12 +698,20 @@ static inline void __mod_lruvec_state(struct lruvec *lruvec,
__mod_memcg_state(pn->memcg, idx, val);

/* Update lruvec */
- x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
- if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
- atomic_long_add(x, &pn->lruvec_stat[idx]);
- x = 0;
+ rcu_read_lock();
+ lruvec_stat_cpu = (struct lruvec_stat __percpu *)
+ rcu_dereference(pn->lruvec_stat_cpu);
+ if (likely(lruvec_stat_cpu)) {
+ x = val + __this_cpu_read(lruvec_stat_cpu->count[idx]);
+ if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &pn->lruvec_stat[idx]);
+ x = 0;
+ }
+ __this_cpu_write(lruvec_stat_cpu->count[idx], x);
+ } else {
+ atomic_long_add(val, &pn->lruvec_stat[idx]);
}
- __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
+ rcu_read_unlock();
}

static inline void mod_lruvec_state(struct lruvec *lruvec,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 803c772f354b..8f3cac02221a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2122,6 +2122,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
static int memcg_hotplug_cpu_dead(unsigned int cpu)
{
struct memcg_vmstats_percpu __percpu *vmstats_percpu;
+ struct lruvec_stat __percpu *lruvec_stat_cpu;
struct memcg_stock_pcp *stock;
struct mem_cgroup *memcg;

@@ -2152,7 +2153,12 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
struct mem_cgroup_per_node *pn;

pn = mem_cgroup_nodeinfo(memcg, nid);
- x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
+
+ lruvec_stat_cpu = (struct lruvec_stat __percpu*)
+ rcu_dereference(pn->lruvec_stat_cpu);
+ if (!lruvec_stat_cpu)
+ continue;
+ x = this_cpu_xchg(lruvec_stat_cpu->count[i], 0);
if (x)
atomic_long_add(x, &pn->lruvec_stat[i]);
}
@@ -4430,7 +4436,8 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;

- pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
+ rcu_assign_pointer(pn->lruvec_stat_cpu,
+ alloc_percpu(struct lruvec_stat));
if (!pn->lruvec_stat_cpu) {
kfree(pn);
return 1;
--
2.20.1


2019-03-11 17:16:23

by Johannes Weiner

Subject: Re: [PATCH 1/5] mm: prepare to premature release of memcg->vmstats_percpu

On Thu, Mar 07, 2019 at 03:00:29PM -0800, Roman Gushchin wrote:
> Prepare to handle premature release of memcg->vmstats_percpu data.
> Currently it's a generic pointer which is expected to be non-NULL
> during the whole lifetime of a memcg. Switch over to an
> rcu-protected pointer, and carefully check it for being non-NULL.
>
> This change is a required step towards dynamic premature release
> of percpu memcg data.
>
> Signed-off-by: Roman Gushchin <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2019-03-11 17:19:00

by Johannes Weiner

Subject: Re: [PATCH 2/5] mm: prepare to premature release of per-node lruvec_stat_cpu

On Thu, Mar 07, 2019 at 03:00:30PM -0800, Roman Gushchin wrote:
> Similar to the memcg's vmstats_percpu, per-memcg per-node stats
> consist of percpu and atomic counterparts, and we do expect
> that both coexist during the whole life-cycle of the memcg.
>
> To prepare for a premature release of percpu per-node data,
> let's pretend that lruvec_stat_cpu is an rcu-protected pointer,
> which can be NULL. This patch adds corresponding checks whenever
> required.
>
> Signed-off-by: Roman Gushchin <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

> @@ -4430,7 +4436,8 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> if (!pn)
> return 1;
>
> - pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> + rcu_assign_pointer(pn->lruvec_stat_cpu,
> + alloc_percpu(struct lruvec_stat));
> if (!pn->lruvec_stat_cpu) {

Nitpick: wouldn't this have to use rcu_dereference()? Might be cleaner
to use an intermediate variable and only assign after the NULL check.
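
Something like this, maybe (rough sketch, untested):

        struct lruvec_stat __percpu *lruvec_stat_cpu;

        lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
        if (!lruvec_stat_cpu) {
                kfree(pn);
                return 1;
        }
        rcu_assign_pointer(pn->lruvec_stat_cpu, lruvec_stat_cpu);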

2019-03-11 17:26:49

by Johannes Weiner

Subject: Re: [PATCH 3/5] mm: release memcg percpu data prematurely

On Thu, Mar 07, 2019 at 03:00:31PM -0800, Roman Gushchin wrote:
> To reduce the memory footprint of a dying memory cgroup, let's
> release massive percpu data (vmstats_percpu) as early as possible,
> and use atomic counterparts instead.
>
> A dying cgroup can remain in the dying state for quite a long
> time, being pinned in memory by any reference. For example,
> if a page mlocked by some other cgroup is charged to the dying
> cgroup, it won't go away until the page is released.
>
> A dying memory cgroup can have some memory activity (e.g. dirty
> pages can be flushed after cgroup removal), but in general it's
> not expected to be very active in comparison to living cgroups.
>
> So reducing the memory footprint by releasing percpu data
> and switching over to atomics seems to be a good trade off.
>
> Signed-off-by: Roman Gushchin <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

One nitpick below:

> @@ -4612,6 +4612,26 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> return 0;
> }
>
> +static void mem_cgroup_free_percpu(struct rcu_head *rcu)
> +{
> + struct mem_cgroup *memcg = container_of(rcu, struct mem_cgroup, rcu);
> +
> + free_percpu(memcg->vmstats_percpu_offlined);
> + WARN_ON_ONCE(memcg->vmstats_percpu);
> +
> + css_put(&memcg->css);

Nitpick: I had to do a double take seeing a "mem_cgroup_free_*" function
that does css_put(). We use "free" terminology (mem_cgroup_css_free,
memcg_free_kmem, memcg_free_shrinker_maps, mem_cgroup_free) from the
.css_free callback, which only happens when the last reference is put.

Can we go with something less ambiguous? We can add "rcu" and drop the
mem_cgroup prefix since it's narrowly scoped. "percpu_rcu_free"?

> +static void mem_cgroup_offline_percpu(struct mem_cgroup *memcg)
> +{
> + memcg->vmstats_percpu_offlined = (struct memcg_vmstats_percpu __percpu*)
> + rcu_dereference(memcg->vmstats_percpu);
> + rcu_assign_pointer(memcg->vmstats_percpu, NULL);
> +
> + css_get(&memcg->css);
> + call_rcu(&memcg->rcu, mem_cgroup_free_percpu);
> +}
> +
> static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);

2019-03-11 17:27:54

by Johannes Weiner

Subject: Re: [PATCH 4/5] mm: release per-node memcg percpu data prematurely

On Thu, Mar 07, 2019 at 03:00:32PM -0800, Roman Gushchin wrote:
> Similar to memcg-level statistics, per-node data isn't expected
> to be hot after cgroup removal. Switching over to atomics and
> prematurely releasing percpu data helps to reduce the memory
> footprint of dying cgroups.
>
> Signed-off-by: Roman Gushchin <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2019-03-11 17:39:04

by Johannes Weiner

Subject: Re: [PATCH 5/5] mm: spill memcg percpu stats and events before releasing

On Thu, Mar 07, 2019 at 03:00:33PM -0800, Roman Gushchin wrote:
> Spill percpu stats and events data to the corresponding atomic counters
> before releasing the percpu memory.
>
> Although per-cpu stats are never exactly precise, dropping them on the
> floor regularly may lead to an accumulation of error. So it's safer to
> sync them before releasing.
>
> To minimize the number of atomic updates, let's sum all stats/events
> on all cpus locally, and then make a single update per entry.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> mm/memcontrol.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 52 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 18e863890392..b7eb6fac735e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4612,11 +4612,63 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> return 0;
> }
>
> +/*
> + * Spill all per-cpu stats and events into atomics.
> + * Try to minimize the number of atomic writes by gathering data from
> + * all cpus locally, and then make one atomic update.
> + * No locking is required, because no one has access to
> + * the offlined percpu data.
> + */
> +static void mem_cgroup_spill_offlined_percpu(struct mem_cgroup *memcg)
> +{
> + struct memcg_vmstats_percpu __percpu *vmstats_percpu;
> + struct lruvec_stat __percpu *lruvec_stat_cpu;
> + struct mem_cgroup_per_node *pn;
> + int cpu, i;
> + long x;
> +
> + vmstats_percpu = memcg->vmstats_percpu_offlined;
> +
> + for (i = 0; i < MEMCG_NR_STAT; i++) {
> + int nid;
> +
> + x = 0;
> + for_each_possible_cpu(cpu)
> + x += per_cpu(vmstats_percpu->stat[i], cpu);
> + if (x)
> + atomic_long_add(x, &memcg->vmstats[i]);
> +
> + if (i >= NR_VM_NODE_STAT_ITEMS)
> + continue;
> +
> + for_each_node(nid) {
> + pn = mem_cgroup_nodeinfo(memcg, nid);
> + lruvec_stat_cpu = pn->lruvec_stat_cpu_offlined;
> +
> + x = 0;
> + for_each_possible_cpu(cpu)
> + x += per_cpu(lruvec_stat_cpu->count[i], cpu);
> + if (x)
> + atomic_long_add(x, &pn->lruvec_stat[i]);
> + }
> + }
> +
> + for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> + x = 0;
> + for_each_possible_cpu(cpu)
> + x += per_cpu(vmstats_percpu->events[i], cpu);
> + if (x)
> + atomic_long_add(x, &memcg->vmevents[i]);
> + }

This looks good, but couldn't this be merged with the cpu offlining?
It seems to be exactly the same code, except for the nesting of the
for_each_possible_cpu() iteration here.

This could be a function that takes a CPU argument and then iterates
the cgroups and stat items to collect and spill the counters of that
specified CPU; offlining would call it once, and this spill code here
would call it for_each_possible_cpu().

We shouldn't need the atomicity of this_cpu_xchg() during hotunplug,
the scheduler isn't even active on that CPU anymore when it's called.

2019-03-11 19:28:16

by Roman Gushchin

Subject: Re: [PATCH 5/5] mm: spill memcg percpu stats and events before releasing

On Mon, Mar 11, 2019 at 01:38:25PM -0400, Johannes Weiner wrote:
> On Thu, Mar 07, 2019 at 03:00:33PM -0800, Roman Gushchin wrote:
> > Spill percpu stats and events data to the corresponding atomic counters
> > before releasing the percpu memory.
> >
> > Although per-cpu stats are never exactly precise, dropping them on the
> > floor regularly may lead to an accumulation of error. So it's safer to
> > sync them before releasing.
> >
> > To minimize the number of atomic updates, let's sum all stats/events
> > on all cpus locally, and then make a single update per entry.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
> > ---
> > mm/memcontrol.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 52 insertions(+)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 18e863890392..b7eb6fac735e 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4612,11 +4612,63 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> > return 0;
> > }
> >
> > +/*
> > + * Spill all per-cpu stats and events into atomics.
> > + * Try to minimize the number of atomic writes by gathering data from
> > + * all cpus locally, and then make one atomic update.
> > + * No locking is required, because no one has access to
> > + * the offlined percpu data.
> > + */
> > +static void mem_cgroup_spill_offlined_percpu(struct mem_cgroup *memcg)
> > +{
> > + struct memcg_vmstats_percpu __percpu *vmstats_percpu;
> > + struct lruvec_stat __percpu *lruvec_stat_cpu;
> > + struct mem_cgroup_per_node *pn;
> > + int cpu, i;
> > + long x;
> > +
> > + vmstats_percpu = memcg->vmstats_percpu_offlined;
> > +
> > + for (i = 0; i < MEMCG_NR_STAT; i++) {
> > + int nid;
> > +
> > + x = 0;
> > + for_each_possible_cpu(cpu)
> > + x += per_cpu(vmstats_percpu->stat[i], cpu);
> > + if (x)
> > + atomic_long_add(x, &memcg->vmstats[i]);
> > +
> > + if (i >= NR_VM_NODE_STAT_ITEMS)
> > + continue;
> > +
> > + for_each_node(nid) {
> > + pn = mem_cgroup_nodeinfo(memcg, nid);
> > + lruvec_stat_cpu = pn->lruvec_stat_cpu_offlined;
> > +
> > + x = 0;
> > + for_each_possible_cpu(cpu)
> > + x += per_cpu(lruvec_stat_cpu->count[i], cpu);
> > + if (x)
> > + atomic_long_add(x, &pn->lruvec_stat[i]);
> > + }
> > + }
> > +
> > + for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> > + x = 0;
> > + for_each_possible_cpu(cpu)
> > + x += per_cpu(vmstats_percpu->events[i], cpu);
> > + if (x)
> > + atomic_long_add(x, &memcg->vmevents[i]);
> > + }
>
> This looks good, but couldn't this be merged with the cpu offlining?
> It seems to be exactly the same code, except for the nesting of the
> for_each_possible_cpu() iteration here.
>
> This could be a function that takes a CPU argument and then iterates
> the cgroups and stat items to collect and spill the counters of that
> specified CPU; offlining would call it once, and this spill code here
> would call it for_each_possible_cpu().
>
> We shouldn't need the atomicity of this_cpu_xchg() during hotunplug,
> the scheduler isn't even active on that CPU anymore when it's called.

Good point!
I initially tried to adapt the cpu offlining code, but it didn't work
well: the code became too complex and ugly. But the opposite can be done
easily: mem_cgroup_spill_offlined_percpu() can take a cpumask,
and the cpu offlining code will look like:

        for_each_mem_cgroup(memcg)
                mem_cgroup_spill_offlined_percpu(memcg, cpumask);
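
Internally it's then just a matter of summing over the given cpumask instead
of all possible cpus, e.g. (rough sketch):

        for (i = 0; i < MEMCG_NR_STAT; i++) {
                x = 0;
                for_each_cpu(cpu, cpumask)
                        x += per_cpu(vmstats_percpu->stat[i], cpu);
                if (x)
                        atomic_long_add(x, &memcg->vmstats[i]);
        }

with the events and per-node lruvec stats handled the same way.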
I'll prepare a separate patch.

Thank you!