2019-04-12 15:17:29

by Johannes Weiner

Subject: [PATCH 0/4] mm: memcontrol: memory.stat cost & correctness

The cgroup memory.stat file holds recursive statistics for the entire
subtree. The current implementation does this tree walk on-demand
whenever the file is read. This is giving us problems in production.

1. The cost of aggregating the statistics on-demand is high. A lot of
system service cgroups are mostly idle and their stats don't change
between reads, yet we always have to check them. There are also always
some lazily-dying cgroups sitting around that are pinned by a handful
of remaining page cache; the same applies to them.

In an application that periodically monitors memory.stat in our fleet,
we have seen the aggregation consume up to 5% CPU time.

2. When cgroups die and disappear from the cgroup tree, so do their
accumulated vm events. The result is that the event counters at
higher-level cgroups can go backwards and confuse some of our
automation, let alone people looking at the graphs over time.

To address both issues, this patch series changes the stat
implementation to spill counts upwards when the counters change.

The upward spilling is batched using the existing per-cpu cache. In a
sparse file stress test with 5 level cgroup nesting, the additional
cost of the flushing was negligible (a little under 1% of CPU at 100%
CPU utilization, compared to the 5% of reading memory.stat during
regular operation).

include/linux/memcontrol.h | 96 +++++++-------
mm/memcontrol.c | 290 +++++++++++++++++++++++++++----------------
mm/vmscan.c | 4 +-
mm/workingset.c | 7 +-
4 files changed, 234 insertions(+), 163 deletions(-)



2019-04-12 15:16:22

by Johannes Weiner

Subject: [PATCH 4/4] mm: memcontrol: fix NUMA round-robin reclaim at intermediate level

When a cgroup is reclaimed on behalf of a configured limit, reclaim
needs to round-robin through all NUMA nodes that hold pages of the
memcg in question. However, when assembling the mask of candidate NUMA
nodes, the code only consults the *local* cgroup LRU counters, not the
recursive counters for the entire subtree. Cgroup limits are
frequently configured against intermediate cgroups that do not have
memory on their own LRUs. In this case, the node mask will always come
up empty and reclaim falls back to scanning only the current node.

If a cgroup subtree has some memory on one node but the processes are
bound to another node afterwards, the limit reclaim will never age or
reclaim that memory anymore.

To fix this, use the recursive LRU counts for a cgroup subtree to
determine which nodes hold memory of that cgroup.

The code has been broken like this forever, so it doesn't seem to be a
problem in practice. I just noticed it while reviewing the way the LRU
counters are used in general.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2eb2d4ef9b34..2535e54e7989 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1512,13 +1512,13 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
{
struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);

- if (lruvec_page_state_local(lruvec, NR_INACTIVE_FILE) ||
- lruvec_page_state_local(lruvec, NR_ACTIVE_FILE))
+ if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) ||
+ lruvec_page_state(lruvec, NR_ACTIVE_FILE))
return true;
if (noswap || !total_swap_pages)
return false;
- if (lruvec_page_state_local(lruvec, NR_INACTIVE_ANON) ||
- lruvec_page_state_local(lruvec, NR_ACTIVE_ANON))
+ if (lruvec_page_state(lruvec, NR_INACTIVE_ANON) ||
+ lruvec_page_state(lruvec, NR_ACTIVE_ANON))
return true;
return false;

--
2.21.0

2019-04-12 15:16:23

by Johannes Weiner

Subject: [PATCH 3/4] mm: memcontrol: fix recursive statistics correctness & scalability

Right now, when somebody needs to know the recursive memory statistics
and events of a cgroup subtree, they need to walk the entire subtree
and sum up the counters manually.

There are two issues with this:

1. When a cgroup gets deleted, its stats are lost. The state counters
should all be 0 at that point, of course, but the events are not. When
this happens, the event counters, which are supposed to be monotonic,
can go backwards in the parent cgroups.

2. During regular operation, we always have a certain number of lazily
freed cgroups sitting around that have been deleted, have no tasks,
but have a few cache pages remaining. These groups' statistics do not
change until we eventually hit memory pressure, but somebody watching,
say, memory.stat on an ancestor has to iterate those every time.

This patch addresses both issues by introducing recursive counters at
each level that are propagated from the write side when stats change.

Upward propagation happens when the per-cpu caches spill over into the
local atomic counter. This is the same thing we do during charge and
uncharge, except that the latter uses atomic RMWs, which are more
expensive; stat changes happen at around the same rate. In a sparse
file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
nesting levels, perf shows __mod_memcg_page_state at ~1%.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 54 +++++++++-
mm/memcontrol.c | 205 ++++++++++++++++++-------------------
2 files changed, 150 insertions(+), 109 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cae7d1b11eea..36bdfe8e5965 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -128,6 +128,7 @@ struct mem_cgroup_per_node {

struct lruvec_stat __percpu *lruvec_stat_cpu;
atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];
+ atomic_long_t lruvec_stat_local[NR_VM_NODE_STAT_ITEMS];

unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];

@@ -279,8 +280,12 @@ struct mem_cgroup {
MEMCG_PADDING(_pad2_);

atomic_long_t vmstats[MEMCG_NR_STAT];
+ atomic_long_t vmstats_local[MEMCG_NR_STAT];
+
atomic_long_t vmevents[NR_VM_EVENT_ITEMS];
- atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
+ atomic_long_t vmevents_local[NR_VM_EVENT_ITEMS];
+
+ atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];

unsigned long socket_pressure;

@@ -565,6 +570,20 @@ struct mem_cgroup *lock_page_memcg(struct page *page);
void __unlock_page_memcg(struct mem_cgroup *memcg);
void unlock_page_memcg(struct page *page);

+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page_state().
+ */
+static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
+{
+ long x = atomic_long_read(&memcg->vmstats[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
/*
* idx can be of type enum memcg_stat_item or node_stat_item.
* Keep in sync with memcg_exact_page_state().
@@ -572,7 +591,7 @@ void unlock_page_memcg(struct page *page);
static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
int idx)
{
- long x = atomic_long_read(&memcg->vmstats[idx]);
+ long x = atomic_long_read(&memcg->vmstats_local[idx]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -624,6 +643,24 @@ static inline void mod_memcg_page_state(struct page *page,
mod_memcg_state(page->mem_cgroup, idx, val);
}

+static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
+ enum node_stat_item idx)
+{
+ struct mem_cgroup_per_node *pn;
+ long x;
+
+ if (mem_cgroup_disabled())
+ return node_page_state(lruvec_pgdat(lruvec), idx);
+
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ x = atomic_long_read(&pn->lruvec_stat[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
enum node_stat_item idx)
{
@@ -634,7 +671,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return node_page_state(lruvec_pgdat(lruvec), idx);

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = atomic_long_read(&pn->lruvec_stat[idx]);
+ x = atomic_long_read(&pn->lruvec_stat_local[idx]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -991,6 +1028,11 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
{
}

+static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
+{
+ return 0;
+}
+
static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
int idx)
{
@@ -1021,6 +1063,12 @@ static inline void mod_memcg_page_state(struct page *page,
{
}

+static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
+ enum node_stat_item idx)
+{
+ return node_page_state(lruvec_pgdat(lruvec), idx);
+}
+
static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
enum node_stat_item idx)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3535270ebeec..2eb2d4ef9b34 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -702,12 +702,27 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)

x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
- atomic_long_add(x, &memcg->vmstats[idx]);
+ struct mem_cgroup *mi;
+
+ atomic_long_add(x, &memcg->vmstats_local[idx]);
+ for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
+ atomic_long_add(x, &mi->vmstats[idx]);
x = 0;
}
__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
}

+static struct mem_cgroup_per_node *
+parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
+{
+ struct mem_cgroup *parent;
+
+ parent = parent_mem_cgroup(pn->memcg);
+ if (!parent)
+ return NULL;
+ return mem_cgroup_nodeinfo(parent, nid);
+}
+
/**
* __mod_lruvec_state - update lruvec memory statistics
* @lruvec: the lruvec
@@ -721,24 +736,31 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val)
{
+ pg_data_t *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup_per_node *pn;
+ struct mem_cgroup *memcg;
long x;

/* Update node */
- __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
+ __mod_node_page_state(pgdat, idx, val);

if (mem_cgroup_disabled())
return;

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ memcg = pn->memcg;

/* Update memcg */
- __mod_memcg_state(pn->memcg, idx, val);
+ __mod_memcg_state(memcg, idx, val);

/* Update lruvec */
x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
- atomic_long_add(x, &pn->lruvec_stat[idx]);
+ struct mem_cgroup_per_node *pi;
+
+ atomic_long_add(x, &pn->lruvec_stat_local[idx]);
+ for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
+ atomic_long_add(x, &pi->lruvec_stat[idx]);
x = 0;
}
__this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
@@ -760,18 +782,26 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,

x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
if (unlikely(x > MEMCG_CHARGE_BATCH)) {
- atomic_long_add(x, &memcg->vmevents[idx]);
+ struct mem_cgroup *mi;
+
+ atomic_long_add(x, &memcg->vmevents_local[idx]);
+ for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
+ atomic_long_add(x, &mi->vmevents[idx]);
x = 0;
}
__this_cpu_write(memcg->vmstats_percpu->events[idx], x);
}

-static unsigned long memcg_events_local(struct mem_cgroup *memcg,
- int event)
+static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
{
return atomic_long_read(&memcg->vmevents[event]);
}

+static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
+{
+ return atomic_long_read(&memcg->vmevents_local[event]);
+}
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
struct page *page,
bool compound, int nr_pages)
@@ -2162,7 +2192,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
static int memcg_hotplug_cpu_dead(unsigned int cpu)
{
struct memcg_stock_pcp *stock;
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg, *mi;

stock = &per_cpu(memcg_stock, cpu);
drain_stock(stock);
@@ -2175,8 +2205,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
long x;

x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
- if (x)
- atomic_long_add(x, &memcg->vmstats[i]);
+ if (x) {
+ atomic_long_add(x, &memcg->vmstats_local[i]);
+ for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
+ atomic_long_add(x, &mi->vmstats[i]);
+ }

if (i >= NR_VM_NODE_STAT_ITEMS)
continue;
@@ -2186,8 +2219,12 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)

pn = mem_cgroup_nodeinfo(memcg, nid);
x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
- if (x)
- atomic_long_add(x, &pn->lruvec_stat[i]);
+ if (x) {
+ atomic_long_add(x, &pn->lruvec_stat_local[i]);
+ do {
+ atomic_long_add(x, &pn->lruvec_stat[i]);
+ } while ((pn = parent_nodeinfo(pn, nid)));
+ }
}
}

@@ -2195,8 +2232,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
long x;

x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
- if (x)
- atomic_long_add(x, &memcg->vmevents[i]);
+ if (x) {
+ atomic_long_add(x, &memcg->vmevents_local[i]);
+ for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
+ atomic_long_add(x, &mi->vmevents[i]);
+ }
}
}

@@ -3036,54 +3076,15 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
return retval;
}

-struct accumulated_vmstats {
- unsigned long vmstats[MEMCG_NR_STAT];
- unsigned long vmevents[NR_VM_EVENT_ITEMS];
- unsigned long lru_pages[NR_LRU_LISTS];
-
- /* overrides for v1 */
- const unsigned int *vmstats_array;
- const unsigned int *vmevents_array;
-
- int vmstats_size;
- int vmevents_size;
-};
-
-static void accumulate_vmstats(struct mem_cgroup *memcg,
- struct accumulated_vmstats *acc)
-{
- struct mem_cgroup *mi;
- int i;
-
- for_each_mem_cgroup_tree(mi, memcg) {
- for (i = 0; i < acc->vmstats_size; i++)
- acc->vmstats[i] += memcg_page_state_local(mi,
- acc->vmstats_array ? acc->vmstats_array[i] : i);
-
- for (i = 0; i < acc->vmevents_size; i++)
- acc->vmevents[i] += memcg_events_local(mi,
- acc->vmevents_array
- ? acc->vmevents_array[i] : i);
-
- for (i = 0; i < NR_LRU_LISTS; i++)
- acc->lru_pages[i] += memcg_page_state_local(mi,
- NR_LRU_BASE + i);
- }
-}
-
static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
{
- unsigned long val = 0;
+ unsigned long val;

if (mem_cgroup_is_root(memcg)) {
- struct mem_cgroup *iter;
-
- for_each_mem_cgroup_tree(iter, memcg) {
- val += memcg_page_state_local(iter, MEMCG_CACHE);
- val += memcg_page_state_local(iter, MEMCG_RSS);
- if (swap)
- val += memcg_page_state_local(iter, MEMCG_SWAP);
- }
+ val = memcg_page_state(memcg, MEMCG_CACHE) +
+ memcg_page_state(memcg, MEMCG_RSS);
+ if (swap)
+ val += memcg_page_state(memcg, MEMCG_SWAP);
} else {
if (!swap)
val = page_counter_read(&memcg->memory);
@@ -3514,7 +3515,6 @@ static int memcg_stat_show(struct seq_file *m, void *v)
unsigned long memory, memsw;
struct mem_cgroup *mi;
unsigned int i;
- struct accumulated_vmstats acc;

BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
@@ -3548,27 +3548,21 @@ static int memcg_stat_show(struct seq_file *m, void *v)
seq_printf(m, "hierarchical_memsw_limit %llu\n",
(u64)memsw * PAGE_SIZE);

- memset(&acc, 0, sizeof(acc));
- acc.vmstats_size = ARRAY_SIZE(memcg1_stats);
- acc.vmstats_array = memcg1_stats;
- acc.vmevents_size = ARRAY_SIZE(memcg1_events);
- acc.vmevents_array = memcg1_events;
- accumulate_vmstats(memcg, &acc);
-
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
continue;
seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
- (u64)acc.vmstats[i] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, i) * PAGE_SIZE);
}

for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
seq_printf(m, "total_%s %llu\n", memcg1_event_names[i],
- (u64)acc.vmevents[i]);
+ (u64)memcg_events(memcg, i));

for (i = 0; i < NR_LRU_LISTS; i++)
seq_printf(m, "total_%s %llu\n", mem_cgroup_lru_names[i],
- (u64)acc.lru_pages[i] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
+ PAGE_SIZE);

#ifdef CONFIG_DEBUG_VM
{
@@ -5661,7 +5655,6 @@ static int memory_events_show(struct seq_file *m, void *v)
static int memory_stat_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- struct accumulated_vmstats acc;
int i;

/*
@@ -5675,31 +5668,27 @@ static int memory_stat_show(struct seq_file *m, void *v)
* Current memory state:
*/

- memset(&acc, 0, sizeof(acc));
- acc.vmstats_size = MEMCG_NR_STAT;
- acc.vmevents_size = NR_VM_EVENT_ITEMS;
- accumulate_vmstats(memcg, &acc);
-
seq_printf(m, "anon %llu\n",
- (u64)acc.vmstats[MEMCG_RSS] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, MEMCG_RSS) * PAGE_SIZE);
seq_printf(m, "file %llu\n",
- (u64)acc.vmstats[MEMCG_CACHE] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, MEMCG_CACHE) * PAGE_SIZE);
seq_printf(m, "kernel_stack %llu\n",
- (u64)acc.vmstats[MEMCG_KERNEL_STACK_KB] * 1024);
+ (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) * 1024);
seq_printf(m, "slab %llu\n",
- (u64)(acc.vmstats[NR_SLAB_RECLAIMABLE] +
- acc.vmstats[NR_SLAB_UNRECLAIMABLE]) * PAGE_SIZE);
+ (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
+ memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
+ PAGE_SIZE);
seq_printf(m, "sock %llu\n",
- (u64)acc.vmstats[MEMCG_SOCK] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, MEMCG_SOCK) * PAGE_SIZE);

seq_printf(m, "shmem %llu\n",
- (u64)acc.vmstats[NR_SHMEM] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_SHMEM) * PAGE_SIZE);
seq_printf(m, "file_mapped %llu\n",
- (u64)acc.vmstats[NR_FILE_MAPPED] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_FILE_MAPPED) * PAGE_SIZE);
seq_printf(m, "file_dirty %llu\n",
- (u64)acc.vmstats[NR_FILE_DIRTY] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_FILE_DIRTY) * PAGE_SIZE);
seq_printf(m, "file_writeback %llu\n",
- (u64)acc.vmstats[NR_WRITEBACK] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_WRITEBACK) * PAGE_SIZE);

/*
* TODO: We should eventually replace our own MEMCG_RSS_HUGE counter
@@ -5708,43 +5697,47 @@ static int memory_stat_show(struct seq_file *m, void *v)
* where the page->mem_cgroup is set up and stable.
*/
seq_printf(m, "anon_thp %llu\n",
- (u64)acc.vmstats[MEMCG_RSS_HUGE] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, MEMCG_RSS_HUGE) * PAGE_SIZE);

for (i = 0; i < NR_LRU_LISTS; i++)
seq_printf(m, "%s %llu\n", mem_cgroup_lru_names[i],
- (u64)acc.lru_pages[i] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
+ PAGE_SIZE);

seq_printf(m, "slab_reclaimable %llu\n",
- (u64)acc.vmstats[NR_SLAB_RECLAIMABLE] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
+ PAGE_SIZE);
seq_printf(m, "slab_unreclaimable %llu\n",
- (u64)acc.vmstats[NR_SLAB_UNRECLAIMABLE] * PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
+ PAGE_SIZE);

/* Accumulated memory events */

- seq_printf(m, "pgfault %lu\n", acc.vmevents[PGFAULT]);
- seq_printf(m, "pgmajfault %lu\n", acc.vmevents[PGMAJFAULT]);
+ seq_printf(m, "pgfault %lu\n", memcg_events(memcg, PGFAULT));
+ seq_printf(m, "pgmajfault %lu\n", memcg_events(memcg, PGMAJFAULT));

seq_printf(m, "workingset_refault %lu\n",
- acc.vmstats[WORKINGSET_REFAULT]);
+ memcg_page_state(memcg, WORKINGSET_REFAULT));
seq_printf(m, "workingset_activate %lu\n",
- acc.vmstats[WORKINGSET_ACTIVATE]);
+ memcg_page_state(memcg, WORKINGSET_ACTIVATE));
seq_printf(m, "workingset_nodereclaim %lu\n",
- acc.vmstats[WORKINGSET_NODERECLAIM]);
-
- seq_printf(m, "pgrefill %lu\n", acc.vmevents[PGREFILL]);
- seq_printf(m, "pgscan %lu\n", acc.vmevents[PGSCAN_KSWAPD] +
- acc.vmevents[PGSCAN_DIRECT]);
- seq_printf(m, "pgsteal %lu\n", acc.vmevents[PGSTEAL_KSWAPD] +
- acc.vmevents[PGSTEAL_DIRECT]);
- seq_printf(m, "pgactivate %lu\n", acc.vmevents[PGACTIVATE]);
- seq_printf(m, "pgdeactivate %lu\n", acc.vmevents[PGDEACTIVATE]);
- seq_printf(m, "pglazyfree %lu\n", acc.vmevents[PGLAZYFREE]);
- seq_printf(m, "pglazyfreed %lu\n", acc.vmevents[PGLAZYFREED]);
+ memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
+
+ seq_printf(m, "pgrefill %lu\n", memcg_events(memcg, PGREFILL));
+ seq_printf(m, "pgscan %lu\n", memcg_events(memcg, PGSCAN_KSWAPD) +
+ memcg_events(memcg, PGSCAN_DIRECT));
+ seq_printf(m, "pgsteal %lu\n", memcg_events(memcg, PGSTEAL_KSWAPD) +
+ memcg_events(memcg, PGSTEAL_DIRECT));
+ seq_printf(m, "pgactivate %lu\n", memcg_events(memcg, PGACTIVATE));
+ seq_printf(m, "pgdeactivate %lu\n", memcg_events(memcg, PGDEACTIVATE));
+ seq_printf(m, "pglazyfree %lu\n", memcg_events(memcg, PGLAZYFREE));
+ seq_printf(m, "pglazyfreed %lu\n", memcg_events(memcg, PGLAZYFREED));

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- seq_printf(m, "thp_fault_alloc %lu\n", acc.vmevents[THP_FAULT_ALLOC]);
+ seq_printf(m, "thp_fault_alloc %lu\n",
+ memcg_events(memcg, THP_FAULT_ALLOC));
seq_printf(m, "thp_collapse_alloc %lu\n",
- acc.vmevents[THP_COLLAPSE_ALLOC]);
+ memcg_events(memcg, THP_COLLAPSE_ALLOC));
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

return 0;
--
2.21.0

2019-04-12 15:16:51

by Johannes Weiner

Subject: [PATCH 2/4] mm: memcontrol: move stat/event counting functions out-of-line

These are getting too big to be inlined in every callsite. They were
stolen from vmstat.c, which already out-of-lines them, and they have
only been growing since. The callsites aren't that hot, either.

Move __mod_memcg_state(), __mod_lruvec_state() and __count_memcg_events()
out of line and add kerneldoc comments.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 62 +++---------------------------
mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++
2 files changed, 84 insertions(+), 57 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 139be7d44c29..cae7d1b11eea 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -580,22 +580,7 @@ static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
return x;
}

-/* idx can be of type enum memcg_stat_item or node_stat_item */
-static inline void __mod_memcg_state(struct mem_cgroup *memcg,
- int idx, int val)
-{
- long x;
-
- if (mem_cgroup_disabled())
- return;
-
- x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
- if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
- atomic_long_add(x, &memcg->vmstats[idx]);
- x = 0;
- }
- __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
-}
+void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);

/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
@@ -657,31 +642,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return x;
}

-static inline void __mod_lruvec_state(struct lruvec *lruvec,
- enum node_stat_item idx, int val)
-{
- struct mem_cgroup_per_node *pn;
- long x;
-
- /* Update node */
- __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
-
- if (mem_cgroup_disabled())
- return;
-
- pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-
- /* Update memcg */
- __mod_memcg_state(pn->memcg, idx, val);
-
- /* Update lruvec */
- x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
- if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
- atomic_long_add(x, &pn->lruvec_stat[idx]);
- x = 0;
- }
- __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
-}
+void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+ int val);

static inline void mod_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
@@ -723,22 +685,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);

-static inline void __count_memcg_events(struct mem_cgroup *memcg,
- enum vm_event_item idx,
- unsigned long count)
-{
- unsigned long x;
-
- if (mem_cgroup_disabled())
- return;
-
- x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
- if (unlikely(x > MEMCG_CHARGE_BATCH)) {
- atomic_long_add(x, &memcg->vmevents[idx]);
- x = 0;
- }
- __this_cpu_write(memcg->vmstats_percpu->events[idx], x);
-}
+void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
+ unsigned long count);

static inline void count_memcg_events(struct mem_cgroup *memcg,
enum vm_event_item idx,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 109608b8091f..3535270ebeec 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -687,6 +687,85 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
}

+/**
+ * __mod_memcg_state - update cgroup memory statistics
+ * @memcg: the memory cgroup
+ * @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item
+ * @val: delta to add to the counter, can be negative
+ */
+void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
+{
+ long x;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
+ if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &memcg->vmstats[idx]);
+ x = 0;
+ }
+ __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
+}
+
+/**
+ * __mod_lruvec_state - update lruvec memory statistics
+ * @lruvec: the lruvec
+ * @idx: the stat item
+ * @val: delta to add to the counter, can be negative
+ *
+ * The lruvec is the intersection of the NUMA node and a cgroup. This
+ * function updates all three counters that are affected by a
+ * change of state at this level: per-node, per-cgroup, per-lruvec.
+ */
+void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+ int val)
+{
+ struct mem_cgroup_per_node *pn;
+ long x;
+
+ /* Update node */
+ __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
+
+ if (mem_cgroup_disabled())
+ return;
+
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+
+ /* Update memcg */
+ __mod_memcg_state(pn->memcg, idx, val);
+
+ /* Update lruvec */
+ x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
+ if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &pn->lruvec_stat[idx]);
+ x = 0;
+ }
+ __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
+}
+
+/**
+ * __count_memcg_events - account VM events in a cgroup
+ * @memcg: the memory cgroup
+ * @idx: the event item
+ * @count: the number of events that occurred
+ */
+void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
+ unsigned long count)
+{
+ unsigned long x;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
+ if (unlikely(x > MEMCG_CHARGE_BATCH)) {
+ atomic_long_add(x, &memcg->vmevents[idx]);
+ x = 0;
+ }
+ __this_cpu_write(memcg->vmstats_percpu->events[idx], x);
+}
+
static unsigned long memcg_events_local(struct mem_cgroup *memcg,
int event)
{
--
2.21.0

2019-04-12 15:18:25

by Johannes Weiner

Subject: [PATCH 1/4] mm: memcontrol: make cgroup stats and events query API explicitly local

memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
currently returning the state of the local memcg or lruvec, not the
recursive state.

In practice there is a demand for both versions, although the callers
that want the recursive counts currently sum them up by hand.

Per default, cgroups are considered recursive entities and generally
we expect more users of the recursive counters, with the local counts
being special cases. To reflect that in the name, add a _local suffix
to the current implementations.

The following patch will re-incarnate these functions with recursive
semantics, but with an O(1) implementation.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 16 +++++++--------
mm/memcontrol.c | 40 ++++++++++++++++++++------------------
mm/vmscan.c | 4 ++--
mm/workingset.c | 7 ++++---
4 files changed, 35 insertions(+), 32 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3823cb335b60..139be7d44c29 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -569,8 +569,8 @@ void unlock_page_memcg(struct page *page);
* idx can be of type enum memcg_stat_item or node_stat_item.
* Keep in sync with memcg_exact_page_state().
*/
-static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
- int idx)
+static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
+ int idx)
{
long x = atomic_long_read(&memcg->vmstats[idx]);
#ifdef CONFIG_SMP
@@ -639,8 +639,8 @@ static inline void mod_memcg_page_state(struct page *page,
mod_memcg_state(page->mem_cgroup, idx, val);
}

-static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
- enum node_stat_item idx)
+static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
+ enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
long x;
@@ -1043,8 +1043,8 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
{
}

-static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
- int idx)
+static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
+ int idx)
{
return 0;
}
@@ -1073,8 +1073,8 @@ static inline void mod_memcg_page_state(struct page *page,
{
}

-static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
- enum node_stat_item idx)
+static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
+ enum node_stat_item idx)
{
return node_page_state(lruvec_pgdat(lruvec), idx);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cd03b1181f7f..109608b8091f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -687,8 +687,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
}

-static unsigned long memcg_sum_events(struct mem_cgroup *memcg,
- int event)
+static unsigned long memcg_events_local(struct mem_cgroup *memcg,
+ int event)
{
return atomic_long_read(&memcg->vmevents[event]);
}
@@ -1325,12 +1325,14 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
if (memcg1_stats[i] == MEMCG_SWAP && !do_swap_account)
continue;
pr_cont(" %s:%luKB", memcg1_stat_names[i],
- K(memcg_page_state(iter, memcg1_stats[i])));
+ K(memcg_page_state_local(iter,
+ memcg1_stats[i])));
}

for (i = 0; i < NR_LRU_LISTS; i++)
pr_cont(" %s:%luKB", mem_cgroup_lru_names[i],
- K(memcg_page_state(iter, NR_LRU_BASE + i)));
+ K(memcg_page_state_local(iter,
+ NR_LRU_BASE + i)));

pr_cont("\n");
}
@@ -1401,13 +1403,13 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
{
struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);

- if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) ||
- lruvec_page_state(lruvec, NR_ACTIVE_FILE))
+ if (lruvec_page_state_local(lruvec, NR_INACTIVE_FILE) ||
+ lruvec_page_state_local(lruvec, NR_ACTIVE_FILE))
return true;
if (noswap || !total_swap_pages)
return false;
- if (lruvec_page_state(lruvec, NR_INACTIVE_ANON) ||
- lruvec_page_state(lruvec, NR_ACTIVE_ANON))
+ if (lruvec_page_state_local(lruvec, NR_INACTIVE_ANON) ||
+ lruvec_page_state_local(lruvec, NR_ACTIVE_ANON))
return true;
return false;

@@ -2976,16 +2978,16 @@ static void accumulate_vmstats(struct mem_cgroup *memcg,

for_each_mem_cgroup_tree(mi, memcg) {
for (i = 0; i < acc->vmstats_size; i++)
- acc->vmstats[i] += memcg_page_state(mi,
+ acc->vmstats[i] += memcg_page_state_local(mi,
acc->vmstats_array ? acc->vmstats_array[i] : i);

for (i = 0; i < acc->vmevents_size; i++)
- acc->vmevents[i] += memcg_sum_events(mi,
+ acc->vmevents[i] += memcg_events_local(mi,
acc->vmevents_array
? acc->vmevents_array[i] : i);

for (i = 0; i < NR_LRU_LISTS; i++)
- acc->lru_pages[i] += memcg_page_state(mi,
+ acc->lru_pages[i] += memcg_page_state_local(mi,
NR_LRU_BASE + i);
}
}
@@ -2998,10 +3000,10 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
struct mem_cgroup *iter;

for_each_mem_cgroup_tree(iter, memcg) {
- val += memcg_page_state(iter, MEMCG_CACHE);
- val += memcg_page_state(iter, MEMCG_RSS);
+ val += memcg_page_state_local(iter, MEMCG_CACHE);
+ val += memcg_page_state_local(iter, MEMCG_RSS);
if (swap)
- val += memcg_page_state(iter, MEMCG_SWAP);
+ val += memcg_page_state_local(iter, MEMCG_SWAP);
}
} else {
if (!swap)
@@ -3343,7 +3345,7 @@ static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
for_each_lru(lru) {
if (!(BIT(lru) & lru_mask))
continue;
- nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru);
+ nr += lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
}
return nr;
}
@@ -3357,7 +3359,7 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
for_each_lru(lru) {
if (!(BIT(lru) & lru_mask))
continue;
- nr += memcg_page_state(memcg, NR_LRU_BASE + lru);
+ nr += memcg_page_state_local(memcg, NR_LRU_BASE + lru);
}
return nr;
}
@@ -3442,17 +3444,17 @@ static int memcg_stat_show(struct seq_file *m, void *v)
if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
continue;
seq_printf(m, "%s %lu\n", memcg1_stat_names[i],
- memcg_page_state(memcg, memcg1_stats[i]) *
+ memcg_page_state_local(memcg, memcg1_stats[i]) *
PAGE_SIZE);
}

for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
seq_printf(m, "%s %lu\n", memcg1_event_names[i],
- memcg_sum_events(memcg, memcg1_events[i]));
+ memcg_events_local(memcg, memcg1_events[i]));

for (i = 0; i < NR_LRU_LISTS; i++)
seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i],
- memcg_page_state(memcg, NR_LRU_BASE + i) *
+ memcg_page_state_local(memcg, NR_LRU_BASE + i) *
PAGE_SIZE);

/* Hierarchical information */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c9f8afe61ae3..6e99a8b9b2ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -346,7 +346,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
int zid;

if (!mem_cgroup_disabled())
- lru_size = lruvec_page_state(lruvec, NR_LRU_BASE + lru);
+ lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
else
lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);

@@ -2163,7 +2163,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
* is being established. Disable active list protection to get
* rid of the stale workingset quickly.
*/
- refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
+ refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
if (file && actual_reclaim && lruvec->refaults != refaults) {
inactive_ratio = 0;
} else {
diff --git a/mm/workingset.c b/mm/workingset.c
index 6419baebd306..e0b4edcb88c8 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -430,9 +430,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,

lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg);
for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
- pages += lruvec_page_state(lruvec, NR_LRU_BASE + i);
- pages += lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE);
- pages += lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE);
+ pages += lruvec_page_state_local(lruvec,
+ NR_LRU_BASE + i);
+ pages += lruvec_page_state_local(lruvec, NR_SLAB_RECLAIMABLE);
+ pages += lruvec_page_state_local(lruvec, NR_SLAB_UNRECLAIMABLE);
} else
#endif
pages = node_present_pages(sc->nid);
--
2.21.0

2019-04-12 19:58:14

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 3/4] mm: memcontrol: fix recursive statistics correctness & scalabilty

On Fri, Apr 12, 2019 at 8:15 AM Johannes Weiner <[email protected]> wrote:
>
> Right now, when somebody needs to know the recursive memory statistics
> and events of a cgroup subtree, they need to walk the entire subtree
> and sum up the counters manually.
>
> There are two issues with this:
>
> 1. When a cgroup gets deleted, its stats are lost. The state counters
> should all be 0 at that point, of course, but the events are not. When
> this happens, the event counters, which are supposed to be monotonic,
> can go backwards in the parent cgroups.
>

We faced this exact same issue as well and had a similar solution.

> 2. During regular operation, we always have a certain number of lazily
> freed cgroups sitting around that have been deleted, have no tasks,
> but have a few cache pages remaining. These groups' statistics do not
> change until we eventually hit memory pressure, but somebody watching,
> say, memory.stat on an ancestor has to iterate those every time.
>
> This patch addresses both issues by introducing recursive counters at
> each level that are propagated from the write side when stats change.
>
> Upward propagation happens when the per-cpu caches spill over into the
> local atomic counter. This is the same thing we do during charge and
> uncharge, except that the latter uses atomic RMWs, which are more
> expensive; stat changes happen at around the same rate. In a sparse
> file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
> nesting levels, perf shows __mod_memcg_page_state at ~1%.
>

(Unrelated to this patchset) I think there should also be a way to get
the exact memcg stats. As the machines are getting bigger (more cpus
and larger basic page size), the accuracy of the stats is getting
worse. Internally we have an additional interface, memory.stat_exact,
for that. However, I am not sure whether an additional interface would
be better for the upstream kernel, or something like
/proc/sys/vm/stat_refresh, which syncs all the per-cpu stats.

> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

> ---
> include/linux/memcontrol.h | 54 +++++++++-
> mm/memcontrol.c | 205 ++++++++++++++++++-------------------
> 2 files changed, 150 insertions(+), 109 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index cae7d1b11eea..36bdfe8e5965 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -128,6 +128,7 @@ struct mem_cgroup_per_node {
>
> struct lruvec_stat __percpu *lruvec_stat_cpu;
> atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];
> + atomic_long_t lruvec_stat_local[NR_VM_NODE_STAT_ITEMS];
>
> unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>
> @@ -279,8 +280,12 @@ struct mem_cgroup {
> MEMCG_PADDING(_pad2_);
>
> atomic_long_t vmstats[MEMCG_NR_STAT];
> + atomic_long_t vmstats_local[MEMCG_NR_STAT];
> +
> atomic_long_t vmevents[NR_VM_EVENT_ITEMS];
> - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
> + atomic_long_t vmevents_local[NR_VM_EVENT_ITEMS];
> +
> + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
>
> unsigned long socket_pressure;
>
> @@ -565,6 +570,20 @@ struct mem_cgroup *lock_page_memcg(struct page *page);
> void __unlock_page_memcg(struct mem_cgroup *memcg);
> void unlock_page_memcg(struct page *page);
>
> +/*
> + * idx can be of type enum memcg_stat_item or node_stat_item.
> + * Keep in sync with memcg_exact_page_state().
> + */
> +static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> +{
> + long x = atomic_long_read(&memcg->vmstats[idx]);
> +#ifdef CONFIG_SMP
> + if (x < 0)
> + x = 0;
> +#endif
> + return x;
> +}
> +
> /*
> * idx can be of type enum memcg_stat_item or node_stat_item.
> * Keep in sync with memcg_exact_page_state().
> @@ -572,7 +591,7 @@ void unlock_page_memcg(struct page *page);
> static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
> int idx)
> {
> - long x = atomic_long_read(&memcg->vmstats[idx]);
> + long x = atomic_long_read(&memcg->vmstats_local[idx]);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -624,6 +643,24 @@ static inline void mod_memcg_page_state(struct page *page,
> mod_memcg_state(page->mem_cgroup, idx, val);
> }
>
> +static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
> + enum node_stat_item idx)
> +{
> + struct mem_cgroup_per_node *pn;
> + long x;
> +
> + if (mem_cgroup_disabled())
> + return node_page_state(lruvec_pgdat(lruvec), idx);
> +
> + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> + x = atomic_long_read(&pn->lruvec_stat[idx]);
> +#ifdef CONFIG_SMP
> + if (x < 0)
> + x = 0;
> +#endif
> + return x;
> +}
> +
> static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> enum node_stat_item idx)
> {
> @@ -634,7 +671,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> return node_page_state(lruvec_pgdat(lruvec), idx);
>
> pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - x = atomic_long_read(&pn->lruvec_stat[idx]);
> + x = atomic_long_read(&pn->lruvec_stat_local[idx]);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -991,6 +1028,11 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
> {
> }
>
> +static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> +{
> + return 0;
> +}
> +
> static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
> int idx)
> {
> @@ -1021,6 +1063,12 @@ static inline void mod_memcg_page_state(struct page *page,
> {
> }
>
> +static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
> + enum node_stat_item idx)
> +{
> + return node_page_state(lruvec_pgdat(lruvec), idx);
> +}
> +
> static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> enum node_stat_item idx)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3535270ebeec..2eb2d4ef9b34 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -702,12 +702,27 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>
> x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
> if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> - atomic_long_add(x, &memcg->vmstats[idx]);
> + struct mem_cgroup *mi;
> +
> + atomic_long_add(x, &memcg->vmstats_local[idx]);
> + for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> + atomic_long_add(x, &mi->vmstats[idx]);
> x = 0;
> }
> __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
> }
>
> +static struct mem_cgroup_per_node *
> +parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
> +{
> + struct mem_cgroup *parent;
> +
> + parent = parent_mem_cgroup(pn->memcg);
> + if (!parent)
> + return NULL;
> + return mem_cgroup_nodeinfo(parent, nid);
> +}
> +
> /**
> * __mod_lruvec_state - update lruvec memory statistics
> * @lruvec: the lruvec
> @@ -721,24 +736,31 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
> void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> int val)
> {
> + pg_data_t *pgdat = lruvec_pgdat(lruvec);
> struct mem_cgroup_per_node *pn;
> + struct mem_cgroup *memcg;
> long x;
>
> /* Update node */
> - __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
> + __mod_node_page_state(pgdat, idx, val);
>
> if (mem_cgroup_disabled())
> return;
>
> pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> + memcg = pn->memcg;
>
> /* Update memcg */
> - __mod_memcg_state(pn->memcg, idx, val);
> + __mod_memcg_state(memcg, idx, val);
>
> /* Update lruvec */
> x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
> if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> - atomic_long_add(x, &pn->lruvec_stat[idx]);
> + struct mem_cgroup_per_node *pi;
> +
> + atomic_long_add(x, &pn->lruvec_stat_local[idx]);
> + for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
> + atomic_long_add(x, &pi->lruvec_stat[idx]);
> x = 0;
> }
> __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
> @@ -760,18 +782,26 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>
> x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
> if (unlikely(x > MEMCG_CHARGE_BATCH)) {
> - atomic_long_add(x, &memcg->vmevents[idx]);
> + struct mem_cgroup *mi;
> +
> + atomic_long_add(x, &memcg->vmevents_local[idx]);
> + for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> + atomic_long_add(x, &mi->vmevents[idx]);
> x = 0;
> }
> __this_cpu_write(memcg->vmstats_percpu->events[idx], x);
> }
>
> -static unsigned long memcg_events_local(struct mem_cgroup *memcg,
> - int event)
> +static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
> {
> return atomic_long_read(&memcg->vmevents[event]);
> }
>
> +static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
> +{
> + return atomic_long_read(&memcg->vmevents_local[event]);
> +}
> +
> static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> struct page *page,
> bool compound, int nr_pages)
> @@ -2162,7 +2192,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> static int memcg_hotplug_cpu_dead(unsigned int cpu)
> {
> struct memcg_stock_pcp *stock;
> - struct mem_cgroup *memcg;
> + struct mem_cgroup *memcg, *mi;
>
> stock = &per_cpu(memcg_stock, cpu);
> drain_stock(stock);
> @@ -2175,8 +2205,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
> long x;
>
> x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
> - if (x)
> - atomic_long_add(x, &memcg->vmstats[i]);
> + if (x) {
> + atomic_long_add(x, &memcg->vmstats_local[i]);
> + for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> + atomic_long_add(x, &mi->vmstats[i]);
> + }
>
> if (i >= NR_VM_NODE_STAT_ITEMS)
> continue;
> @@ -2186,8 +2219,12 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>
> pn = mem_cgroup_nodeinfo(memcg, nid);
> x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
> - if (x)
> - atomic_long_add(x, &pn->lruvec_stat[i]);
> + if (x) {
> + atomic_long_add(x, &pn->lruvec_stat_local[i]);
> + do {
> + atomic_long_add(x, &pn->lruvec_stat[i]);
> + } while ((pn = parent_nodeinfo(pn, nid)));
> + }
> }
> }
>
> @@ -2195,8 +2232,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
> long x;
>
> x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
> - if (x)
> - atomic_long_add(x, &memcg->vmevents[i]);
> + if (x) {
> + atomic_long_add(x, &memcg->vmevents_local[i]);
> + for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> + atomic_long_add(x, &mi->vmevents[i]);
> + }
> }
> }
>
> @@ -3036,54 +3076,15 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
> return retval;
> }
>
> -struct accumulated_vmstats {
> - unsigned long vmstats[MEMCG_NR_STAT];
> - unsigned long vmevents[NR_VM_EVENT_ITEMS];
> - unsigned long lru_pages[NR_LRU_LISTS];
> -
> - /* overrides for v1 */
> - const unsigned int *vmstats_array;
> - const unsigned int *vmevents_array;
> -
> - int vmstats_size;
> - int vmevents_size;
> -};
> -
> -static void accumulate_vmstats(struct mem_cgroup *memcg,
> - struct accumulated_vmstats *acc)
> -{
> - struct mem_cgroup *mi;
> - int i;
> -
> - for_each_mem_cgroup_tree(mi, memcg) {
> - for (i = 0; i < acc->vmstats_size; i++)
> - acc->vmstats[i] += memcg_page_state_local(mi,
> - acc->vmstats_array ? acc->vmstats_array[i] : i);
> -
> - for (i = 0; i < acc->vmevents_size; i++)
> - acc->vmevents[i] += memcg_events_local(mi,
> - acc->vmevents_array
> - ? acc->vmevents_array[i] : i);
> -
> - for (i = 0; i < NR_LRU_LISTS; i++)
> - acc->lru_pages[i] += memcg_page_state_local(mi,
> - NR_LRU_BASE + i);
> - }
> -}
> -
> static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
> {
> - unsigned long val = 0;
> + unsigned long val;
>
> if (mem_cgroup_is_root(memcg)) {
> - struct mem_cgroup *iter;
> -
> - for_each_mem_cgroup_tree(iter, memcg) {
> - val += memcg_page_state_local(iter, MEMCG_CACHE);
> - val += memcg_page_state_local(iter, MEMCG_RSS);
> - if (swap)
> - val += memcg_page_state_local(iter, MEMCG_SWAP);
> - }
> + val = memcg_page_state(memcg, MEMCG_CACHE) +
> + memcg_page_state(memcg, MEMCG_RSS);
> + if (swap)
> + val += memcg_page_state(memcg, MEMCG_SWAP);
> } else {
> if (!swap)
> val = page_counter_read(&memcg->memory);
> @@ -3514,7 +3515,6 @@ static int memcg_stat_show(struct seq_file *m, void *v)
> unsigned long memory, memsw;
> struct mem_cgroup *mi;
> unsigned int i;
> - struct accumulated_vmstats acc;
>
> BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
> BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
> @@ -3548,27 +3548,21 @@ static int memcg_stat_show(struct seq_file *m, void *v)
> seq_printf(m, "hierarchical_memsw_limit %llu\n",
> (u64)memsw * PAGE_SIZE);
>
> - memset(&acc, 0, sizeof(acc));
> - acc.vmstats_size = ARRAY_SIZE(memcg1_stats);
> - acc.vmstats_array = memcg1_stats;
> - acc.vmevents_size = ARRAY_SIZE(memcg1_events);
> - acc.vmevents_array = memcg1_events;
> - accumulate_vmstats(memcg, &acc);
> -
> for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
> if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
> continue;
> seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
> - (u64)acc.vmstats[i] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, i) * PAGE_SIZE);
> }
>
> for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
> seq_printf(m, "total_%s %llu\n", memcg1_event_names[i],
> - (u64)acc.vmevents[i]);
> + (u64)memcg_events(memcg, i));
>
> for (i = 0; i < NR_LRU_LISTS; i++)
> seq_printf(m, "total_%s %llu\n", mem_cgroup_lru_names[i],
> - (u64)acc.lru_pages[i] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
> + PAGE_SIZE);
>
> #ifdef CONFIG_DEBUG_VM
> {
> @@ -5661,7 +5655,6 @@ static int memory_events_show(struct seq_file *m, void *v)
> static int memory_stat_show(struct seq_file *m, void *v)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> - struct accumulated_vmstats acc;
> int i;
>
> /*
> @@ -5675,31 +5668,27 @@ static int memory_stat_show(struct seq_file *m, void *v)
> * Current memory state:
> */
>
> - memset(&acc, 0, sizeof(acc));
> - acc.vmstats_size = MEMCG_NR_STAT;
> - acc.vmevents_size = NR_VM_EVENT_ITEMS;
> - accumulate_vmstats(memcg, &acc);
> -
> seq_printf(m, "anon %llu\n",
> - (u64)acc.vmstats[MEMCG_RSS] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, MEMCG_RSS) * PAGE_SIZE);
> seq_printf(m, "file %llu\n",
> - (u64)acc.vmstats[MEMCG_CACHE] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, MEMCG_CACHE) * PAGE_SIZE);
> seq_printf(m, "kernel_stack %llu\n",
> - (u64)acc.vmstats[MEMCG_KERNEL_STACK_KB] * 1024);
> + (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) * 1024);
> seq_printf(m, "slab %llu\n",
> - (u64)(acc.vmstats[NR_SLAB_RECLAIMABLE] +
> - acc.vmstats[NR_SLAB_UNRECLAIMABLE]) * PAGE_SIZE);
> + (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
> + memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
> + PAGE_SIZE);
> seq_printf(m, "sock %llu\n",
> - (u64)acc.vmstats[MEMCG_SOCK] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, MEMCG_SOCK) * PAGE_SIZE);
>
> seq_printf(m, "shmem %llu\n",
> - (u64)acc.vmstats[NR_SHMEM] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, NR_SHMEM) * PAGE_SIZE);
> seq_printf(m, "file_mapped %llu\n",
> - (u64)acc.vmstats[NR_FILE_MAPPED] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, NR_FILE_MAPPED) * PAGE_SIZE);
> seq_printf(m, "file_dirty %llu\n",
> - (u64)acc.vmstats[NR_FILE_DIRTY] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, NR_FILE_DIRTY) * PAGE_SIZE);
> seq_printf(m, "file_writeback %llu\n",
> - (u64)acc.vmstats[NR_WRITEBACK] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, NR_WRITEBACK) * PAGE_SIZE);
>
> /*
> * TODO: We should eventually replace our own MEMCG_RSS_HUGE counter
> @@ -5708,43 +5697,47 @@ static int memory_stat_show(struct seq_file *m, void *v)
> * where the page->mem_cgroup is set up and stable.
> */
> seq_printf(m, "anon_thp %llu\n",
> - (u64)acc.vmstats[MEMCG_RSS_HUGE] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, MEMCG_RSS_HUGE) * PAGE_SIZE);
>
> for (i = 0; i < NR_LRU_LISTS; i++)
> seq_printf(m, "%s %llu\n", mem_cgroup_lru_names[i],
> - (u64)acc.lru_pages[i] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
> + PAGE_SIZE);
>
> seq_printf(m, "slab_reclaimable %llu\n",
> - (u64)acc.vmstats[NR_SLAB_RECLAIMABLE] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
> + PAGE_SIZE);
> seq_printf(m, "slab_unreclaimable %llu\n",
> - (u64)acc.vmstats[NR_SLAB_UNRECLAIMABLE] * PAGE_SIZE);
> + (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
> + PAGE_SIZE);
>
> /* Accumulated memory events */
>
> - seq_printf(m, "pgfault %lu\n", acc.vmevents[PGFAULT]);
> - seq_printf(m, "pgmajfault %lu\n", acc.vmevents[PGMAJFAULT]);
> + seq_printf(m, "pgfault %lu\n", memcg_events(memcg, PGFAULT));
> + seq_printf(m, "pgmajfault %lu\n", memcg_events(memcg, PGMAJFAULT));
>
> seq_printf(m, "workingset_refault %lu\n",
> - acc.vmstats[WORKINGSET_REFAULT]);
> + memcg_page_state(memcg, WORKINGSET_REFAULT));
> seq_printf(m, "workingset_activate %lu\n",
> - acc.vmstats[WORKINGSET_ACTIVATE]);
> + memcg_page_state(memcg, WORKINGSET_ACTIVATE));
> seq_printf(m, "workingset_nodereclaim %lu\n",
> - acc.vmstats[WORKINGSET_NODERECLAIM]);
> -
> - seq_printf(m, "pgrefill %lu\n", acc.vmevents[PGREFILL]);
> - seq_printf(m, "pgscan %lu\n", acc.vmevents[PGSCAN_KSWAPD] +
> - acc.vmevents[PGSCAN_DIRECT]);
> - seq_printf(m, "pgsteal %lu\n", acc.vmevents[PGSTEAL_KSWAPD] +
> - acc.vmevents[PGSTEAL_DIRECT]);
> - seq_printf(m, "pgactivate %lu\n", acc.vmevents[PGACTIVATE]);
> - seq_printf(m, "pgdeactivate %lu\n", acc.vmevents[PGDEACTIVATE]);
> - seq_printf(m, "pglazyfree %lu\n", acc.vmevents[PGLAZYFREE]);
> - seq_printf(m, "pglazyfreed %lu\n", acc.vmevents[PGLAZYFREED]);
> + memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
> +
> + seq_printf(m, "pgrefill %lu\n", memcg_events(memcg, PGREFILL));
> + seq_printf(m, "pgscan %lu\n", memcg_events(memcg, PGSCAN_KSWAPD) +
> + memcg_events(memcg, PGSCAN_DIRECT));
> + seq_printf(m, "pgsteal %lu\n", memcg_events(memcg, PGSTEAL_KSWAPD) +
> + memcg_events(memcg, PGSTEAL_DIRECT));
> + seq_printf(m, "pgactivate %lu\n", memcg_events(memcg, PGACTIVATE));
> + seq_printf(m, "pgdeactivate %lu\n", memcg_events(memcg, PGDEACTIVATE));
> + seq_printf(m, "pglazyfree %lu\n", memcg_events(memcg, PGLAZYFREE));
> + seq_printf(m, "pglazyfreed %lu\n", memcg_events(memcg, PGLAZYFREED));
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - seq_printf(m, "thp_fault_alloc %lu\n", acc.vmevents[THP_FAULT_ALLOC]);
> + seq_printf(m, "thp_fault_alloc %lu\n",
> + memcg_events(memcg, THP_FAULT_ALLOC));
> seq_printf(m, "thp_collapse_alloc %lu\n",
> - acc.vmevents[THP_COLLAPSE_ALLOC]);
> + memcg_events(memcg, THP_COLLAPSE_ALLOC));
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> return 0;
> --
> 2.21.0
>

2019-04-12 20:10:16

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 0/4] mm: memcontrol: memory.stat cost & correctness

On Fri, Apr 12, 2019 at 8:15 AM Johannes Weiner <[email protected]> wrote:
>
> The cgroup memory.stat file holds recursive statistics for the entire
> subtree. The current implementation does this tree walk on-demand
> whenever the file is read. This is giving us problems in production.
>
> 1. The cost of aggregating the statistics on-demand is high. A lot of
> system service cgroups are mostly idle and their stats don't change
> between reads, yet we always have to check them. There are also always
> some lazily-dying cgroups sitting around that are pinned by a handful
> of remaining page cache; the same applies to them.
>
> In an application that periodically monitors memory.stat in our fleet,
> we have seen the aggregation consume up to 5% CPU time.
>
> 2. When cgroups die and disappear from the cgroup tree, so do their
> accumulated vm events. The result is that the event counters at
> higher-level cgroups can go backwards and confuse some of our
> automation, let alone people looking at the graphs over time.
>
> To address both issues, this patch series changes the stat
> implementation to spill counts upwards when the counters change.
>
> The upward spilling is batched using the existing per-cpu cache. In a
> sparse file stress test with 5 level cgroup nesting, the additional
> cost of the flushing was negligible (a little under 1% of CPU at 100%
> CPU utilization, compared to the 5% of reading memory.stat during
> regular operation).

For the whole series:

Reviewed-by: Shakeel Butt <[email protected]>

>
> include/linux/memcontrol.h | 96 +++++++-------
> mm/memcontrol.c | 290 +++++++++++++++++++++++++++----------------
> mm/vmscan.c | 4 +-
> mm/workingset.c | 7 +-
> 4 files changed, 234 insertions(+), 163 deletions(-)
>
>

2019-04-12 20:11:32

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 3/4] mm: memcontrol: fix recursive statistics correctness & scalabilty

On Fri, Apr 12, 2019 at 12:55:10PM -0700, Shakeel Butt wrote:
> We also faced this exact same issue as well and had the similar solution.
>
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Reviewed-by: Shakeel Butt <[email protected]>

Thanks for the review!

> (Unrelated to this patchset) I think there should also a way to get
> the exact memcg stats. As the machines are getting bigger (more cpus
> and larger basic page size) the accuracy of stats are getting worse.
> Internally we have an additional interface memory.stat_exact for that.
> However I am not sure in the upstream kernel will an additional
> interface is better or something like /proc/sys/vm/stat_refresh which
> sync all per-cpu stats.

I was talking to Roman about this earlier as well and he mentioned it
would be nice to have periodic flushing of the per-cpu caches. The
global vmstat has something similar. We might be able to hook into
those workers, but it would likely require some smarts so we don't
walk the entire cgroup tree every couple of seconds.

We haven't had any actual problems with the per-cpu fuzziness, mainly
because the cgroups of interest also grow in size as the machines get
bigger, and so the relative error doesn't increase.

Are your requirements that the error dissipates over time (waiting for
a threshold convergence somewhere?) or do you have automation that
gets decisions wrong due to the error at any given point in time?

2019-04-12 20:17:37

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 3/4] mm: memcontrol: fix recursive statistics correctness & scalabilty

On Fri, Apr 12, 2019 at 12:55:10PM -0700, Shakeel Butt wrote:
> On Fri, Apr 12, 2019 at 8:15 AM Johannes Weiner <[email protected]> wrote:
> >
> > Right now, when somebody needs to know the recursive memory statistics
> > and events of a cgroup subtree, they need to walk the entire subtree
> > and sum up the counters manually.
> >
> > There are two issues with this:
> >
> > 1. When a cgroup gets deleted, its stats are lost. The state counters
> > should all be 0 at that point, of course, but the events are not. When
> > this happens, the event counters, which are supposed to be monotonic,
> > can go backwards in the parent cgroups.
> >
>
> We also faced this exact same issue as well and had the similar solution.
>
> > 2. During regular operation, we always have a certain number of lazily
> > freed cgroups sitting around that have been deleted, have no tasks,
> > but have a few cache pages remaining. These groups' statistics do not
> > change until we eventually hit memory pressure, but somebody watching,
> > say, memory.stat on an ancestor has to iterate those every time.
> >
> > This patch addresses both issues by introducing recursive counters at
> > each level that are propagated from the write side when stats change.
> >
> > Upward propagation happens when the per-cpu caches spill over into the
> > local atomic counter. This is the same thing we do during charge and
> > uncharge, except that the latter uses atomic RMWs, which are more
> > expensive; stat changes happen at around the same rate. In a sparse
> > file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
> > nesting levels, perf shows __mod_memcg_page_state at ~1%.
> >
>
> (Unrelated to this patchset) I think there should also a way to get
> the exact memcg stats. As the machines are getting bigger (more cpus
> and larger basic page size) the accuracy of stats are getting worse.
> Internally we have an additional interface memory.stat_exact for that.
> However I am not sure in the upstream kernel will an additional
> interface is better or something like /proc/sys/vm/stat_refresh which
> sync all per-cpu stats.

I was thinking about eventually consistent counters: sync them periodically
from a worker thread. It should keep the cost of reading small, but
should increase the accuracy. Will it work for you?

2019-04-12 20:39:27

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 3/4] mm: memcontrol: fix recursive statistics correctness & scalabilty

On Fri, Apr 12, 2019 at 1:10 PM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Apr 12, 2019 at 12:55:10PM -0700, Shakeel Butt wrote:
> > We also faced this exact same issue as well and had the similar solution.
> >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> >
> > Reviewed-by: Shakeel Butt <[email protected]>
>
> Thanks for the review!
>
> > (Unrelated to this patchset) I think there should also a way to get
> > the exact memcg stats. As the machines are getting bigger (more cpus
> > and larger basic page size) the accuracy of stats are getting worse.
> > Internally we have an additional interface memory.stat_exact for that.
> > However I am not sure in the upstream kernel will an additional
> > interface is better or something like /proc/sys/vm/stat_refresh which
> > sync all per-cpu stats.
>
> I was talking to Roman about this earlier as well and he mentioned it
> would be nice to have periodic flushing of the per-cpu caches. The
> global vmstat has something similar. We might be able to hook into
> those workers, but it would likely require some smarts so we don't
> walk the entire cgroup tree every couple of seconds.
>
> We haven't had any actual problems with the per-cpu fuzziness, mainly
> because the cgroups of interest also grow in size as the machines get
> bigger, and so the relative error doesn't increase.
>

Yes, this is very machine size dependent. We see this issue more often
on larger machines.

> Are your requirements that the error dissipates over time (waiting for
> a threshold convergence somewhere?) or do you have automation that
> gets decisions wrong due to the error at any given point in time?

Not sure about the first one, but we do have the second case. The node
controller does make decisions in an online way based on the stats.
We also periodically collect and store stats for all jobs across
the fleet. This data is processed offline and is used in a lot of
ways. The inaccuracy in the stats does affect all that analysis,
particularly for small jobs.

2019-04-12 20:51:08

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 3/4] mm: memcontrol: fix recursive statistics correctness & scalabilty

On Fri, Apr 12, 2019 at 1:16 PM Roman Gushchin <[email protected]> wrote:
>
> On Fri, Apr 12, 2019 at 12:55:10PM -0700, Shakeel Butt wrote:
> > On Fri, Apr 12, 2019 at 8:15 AM Johannes Weiner <[email protected]> wrote:
> > >
> > > Right now, when somebody needs to know the recursive memory statistics
> > > and events of a cgroup subtree, they need to walk the entire subtree
> > > and sum up the counters manually.
> > >
> > > There are two issues with this:
> > >
> > > 1. When a cgroup gets deleted, its stats are lost. The state counters
> > > should all be 0 at that point, of course, but the events are not. When
> > > this happens, the event counters, which are supposed to be monotonic,
> > > can go backwards in the parent cgroups.
> > >
> >
> > We also faced this exact same issue as well and had the similar solution.
> >
> > > 2. During regular operation, we always have a certain number of lazily
> > > freed cgroups sitting around that have been deleted, have no tasks,
> > > but have a few cache pages remaining. These groups' statistics do not
> > > change until we eventually hit memory pressure, but somebody watching,
> > > say, memory.stat on an ancestor has to iterate those every time.
> > >
> > > This patch addresses both issues by introducing recursive counters at
> > > each level that are propagated from the write side when stats change.
> > >
> > > Upward propagation happens when the per-cpu caches spill over into the
> > > local atomic counter. This is the same thing we do during charge and
> > > uncharge, except that the latter uses atomic RMWs, which are more
> > > expensive; stat changes happen at around the same rate. In a sparse
> > > file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
> > > nesting levels, perf shows __mod_memcg_page_state at ~1%.
> > >
> >
> > (Unrelated to this patchset) I think there should also be a way to
> > get the exact memcg stats. As the machines are getting bigger (more
> > cpus and larger base page size), the accuracy of the stats is
> > getting worse. Internally we have an additional interface,
> > memory.stat_exact, for that. However, I am not sure whether an
> > additional interface would be better for the upstream kernel, or
> > something like /proc/sys/vm/stat_refresh, which syncs all per-cpu
> > stats.
>
> I was thinking about eventually consistent counters: sync them periodically
> from a worker thread. It should keep the cost of reading small, but
> should increase the accuracy. Will it work for you?

A worker-thread-based solution seems fine to me, but Johannes said it
would be best not to traverse the whole tree every few seconds.

2019-04-12 22:05:25

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 0/4] mm: memcontrol: memory.stat cost & correctness

On Fri, Apr 12, 2019 at 11:15:03AM -0400, Johannes Weiner wrote:
> The cgroup memory.stat file holds recursive statistics for the entire
> subtree. The current implementation does this tree walk on-demand
> whenever the file is read. This is giving us problems in production.
>
> 1. The cost of aggregating the statistics on-demand is high. A lot of
> system service cgroups are mostly idle and their stats don't change
> between reads, yet we always have to check them. There are also always
> some lazily-dying cgroups sitting around that are pinned by a handful
> of remaining page cache; the same applies to them.
>
> In an application that periodically monitors memory.stat in our fleet,
> we have seen the aggregation consume up to 5% CPU time.
>
> 2. When cgroups die and disappear from the cgroup tree, so do their
> accumulated vm events. The result is that the event counters at
> higher-level cgroups can go backwards and confuse some of our
> automation, let alone people looking at the graphs over time.
>
> To address both issues, this patch series changes the stat
> implementation to spill counts upwards when the counters change.
>
> The upward spilling is batched using the existing per-cpu cache. In a
> sparse file stress test with 5 level cgroup nesting, the additional
> cost of the flushing was negligible (a little under 1% of CPU at 100%
> CPU utilization, compared to the 5% of reading memory.stat during
> regular operation).
>
> include/linux/memcontrol.h | 96 +++++++-------
> mm/memcontrol.c | 290 +++++++++++++++++++++++++++----------------
> mm/vmscan.c | 4 +-
> mm/workingset.c | 7 +-
> 4 files changed, 234 insertions(+), 163 deletions(-)
>
>

For the series:
Reviewed-by: Roman Gushchin <[email protected]>

Thanks!