2024-05-01 17:26:36

by Shakeel Butt

Subject: [PATCH v4 0/8] memcg: reduce memory consumption by memcg stats

Most of the memory overhead of a memcg object is due to memcg stats
maintained by the kernel. Since stats updates happen in performance
critical codepaths, the stats are maintained per-cpu and numa specific
stats are maintained per-node * per-cpu. This drastically increases the
overhead on large machines, i.e. those with a large number of CPUs and
multiple numa nodes.
This patch series tries to reduce the overhead by at least not
allocating the memory for stats which are not memcg specific.

Changelog since v3:
Minor changes related to changing macro names and changing the order of
the if conditions.

Changelog since v2:
Using WARN_ONCE() instead of pr_warn_once() and some commit log
changes. Also included a patch from Roman.

Changelog since v1:
The main change from the v1 is the indirection approach used in this
patchset instead of rearranging the members of node_stat_item.


Roman Gushchin (1):
mm: memcg: account memory used for memcg vmstats and lruvec stats

Shakeel Butt (7):
memcg: reduce memory size of mem_cgroup_events_index
memcg: dynamically allocate lruvec_stats
memcg: reduce memory for the lruvec and memcg stats
memcg: cleanup __mod_memcg_lruvec_state
mm: cleanup WORKINGSET_NODES in workingset
memcg: warn for unexpected events and stats
memcg: use proper type for mod_memcg_state

include/linux/memcontrol.h | 75 ++----------
mm/memcontrol.c | 244 +++++++++++++++++++++++++++++++------
mm/workingset.c | 7 +-
3 files changed, 222 insertions(+), 104 deletions(-)

--
2.43.0



2024-05-01 17:26:43

by Shakeel Butt

Subject: [PATCH v4 1/8] memcg: reduce memory size of mem_cgroup_events_index

mem_cgroup_events_index is a translation table to get the right index of
the memcg relevant entry for the general vm_event_item. At the moment,
it is defined as an integer array. However, on a typical system the max
entry of vm_event_item (NR_VM_EVENT_ITEMS) is 113, so we don't need to
use int as the storage type of the array. For now, just use int8_t as
the type and add a BUILD_BUG_ON().

Another benefit of this change is that the translation table fits in 2
cachelines, while previously it would require 8 cachelines (assuming
64-byte cachelines).
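
The cacheline figures follow directly from the array size. A minimal
userspace sketch of the arithmetic, assuming the 113-entry table and the
64-byte cacheline stated in this message (the helper name
cachelines_needed is hypothetical, and the table is assumed to start on a
cacheline boundary):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of cachelines covered by an array of
 * nr_entries elements of entry_size bytes, assuming the array starts
 * on a cacheline boundary.
 */
static size_t cachelines_needed(size_t nr_entries, size_t entry_size,
				size_t cacheline_size)
{
	size_t bytes = nr_entries * entry_size;

	/* Round up to a whole number of cachelines. */
	return (bytes + cacheline_size - 1) / cacheline_size;
}
```

With 113 entries, a 4-byte int gives 452 bytes and therefore 8
cachelines, while int8_t gives 113 bytes and therefore 2 cachelines.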

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
---
Changes since v3:
- N/A

mm/memcontrol.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 602ad5faad4d..c146187cda9c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -607,11 +607,13 @@ static const unsigned int memcg_vm_event_stat[] = {
};

#define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat)
-static int mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly;
+static int8_t mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly;

static void init_memcg_events(void)
{
- int i;
+ int8_t i;
+
+ BUILD_BUG_ON(NR_VM_EVENT_ITEMS >= S8_MAX);

for (i = 0; i < NR_MEMCG_EVENTS; ++i)
mem_cgroup_events_index[memcg_vm_event_stat[i]] = i + 1;
--
2.43.0


2024-05-01 17:27:01

by Shakeel Butt

Subject: [PATCH v4 2/8] memcg: dynamically allocate lruvec_stats

To decouple the dependency of lruvec_stats on NR_VM_NODE_STAT_ITEMS, we
need to dynamically allocate lruvec_stats in the mem_cgroup_per_node
structure. Also move the definitions of lruvec_stats_percpu and
lruvec_stats, and the related functions, to memcontrol.c to facilitate
later patches. No functional changes in the patch.
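
The shape of this change can be illustrated outside the kernel. The
following is only a userspace sketch, not the kernel code: all names
(node_stats, per_node_info, NR_ITEMS and the two functions) are
hypothetical, and calloc()/free() stand in for the kernel allocators. It
shows that a forward declaration is enough for the containing structure
once the member is a pointer, and how a single fail label handles
partial allocation failures:

```c
#include <stdlib.h>

/* Forward declaration: all a header needs once the member is a pointer. */
struct node_stats;

struct per_node_info {
	struct node_stats *stats;	/* previously an embedded struct */
	long *percpu_stats;		/* stand-in for a per-cpu allocation */
};

/* Full definition, private to the .c file, so its size can change freely. */
#define NR_ITEMS 4	/* hypothetical item count */
struct node_stats {
	long state[NR_ITEMS];
};

static struct per_node_info *alloc_per_node_info(size_t nr_cpus)
{
	struct per_node_info *pn = calloc(1, sizeof(*pn));

	if (!pn)
		return NULL;

	pn->stats = calloc(1, sizeof(*pn->stats));
	if (!pn->stats)
		goto fail;

	pn->percpu_stats = calloc(nr_cpus, sizeof(long));
	if (!pn->percpu_stats)
		goto fail;

	return pn;
fail:
	free(pn->stats);	/* free(NULL) is a no-op */
	free(pn);
	return NULL;
}

static void free_per_node_info(struct per_node_info *pn)
{
	if (!pn)
		return;
	free(pn->percpu_stats);
	free(pn->stats);
	free(pn);
}
```

The cost of the indirection is one extra pointer dereference on each
access to the stats.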

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
---
Changes since v3:
- N/A

include/linux/memcontrol.h | 62 +++------------------------
mm/memcontrol.c | 87 ++++++++++++++++++++++++++++++++------
2 files changed, 81 insertions(+), 68 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9aba0d0462ca..ab8a6e884375 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -83,6 +83,8 @@ enum mem_cgroup_events_target {

struct memcg_vmstats_percpu;
struct memcg_vmstats;
+struct lruvec_stats_percpu;
+struct lruvec_stats;

struct mem_cgroup_reclaim_iter {
struct mem_cgroup *position;
@@ -90,25 +92,6 @@ struct mem_cgroup_reclaim_iter {
unsigned int generation;
};

-struct lruvec_stats_percpu {
- /* Local (CPU and cgroup) state */
- long state[NR_VM_NODE_STAT_ITEMS];
-
- /* Delta calculation for lockless upward propagation */
- long state_prev[NR_VM_NODE_STAT_ITEMS];
-};
-
-struct lruvec_stats {
- /* Aggregated (CPU and subtree) state */
- long state[NR_VM_NODE_STAT_ITEMS];
-
- /* Non-hierarchical (CPU aggregated) state */
- long state_local[NR_VM_NODE_STAT_ITEMS];
-
- /* Pending child counts during tree propagation */
- long state_pending[NR_VM_NODE_STAT_ITEMS];
-};
-
/*
* per-node information in memory controller.
*/
@@ -116,7 +99,7 @@ struct mem_cgroup_per_node {
struct lruvec lruvec;

struct lruvec_stats_percpu __percpu *lruvec_stats_percpu;
- struct lruvec_stats lruvec_stats;
+ struct lruvec_stats *lruvec_stats;

unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];

@@ -1037,42 +1020,9 @@ static inline void mod_memcg_page_state(struct page *page,
}

unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
-
-static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
- enum node_stat_item idx)
-{
- struct mem_cgroup_per_node *pn;
- long x;
-
- if (mem_cgroup_disabled())
- return node_page_state(lruvec_pgdat(lruvec), idx);
-
- pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats.state[idx]);
-#ifdef CONFIG_SMP
- if (x < 0)
- x = 0;
-#endif
- return x;
-}
-
-static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
- enum node_stat_item idx)
-{
- struct mem_cgroup_per_node *pn;
- long x = 0;
-
- if (mem_cgroup_disabled())
- return node_page_state(lruvec_pgdat(lruvec), idx);
-
- pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats.state_local[idx]);
-#ifdef CONFIG_SMP
- if (x < 0)
- x = 0;
-#endif
- return x;
-}
+unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
+unsigned long lruvec_page_state_local(struct lruvec *lruvec,
+ enum node_stat_item idx);

void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c146187cda9c..7126459ec56a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -576,6 +576,60 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
}

+struct lruvec_stats_percpu {
+ /* Local (CPU and cgroup) state */
+ long state[NR_VM_NODE_STAT_ITEMS];
+
+ /* Delta calculation for lockless upward propagation */
+ long state_prev[NR_VM_NODE_STAT_ITEMS];
+};
+
+struct lruvec_stats {
+ /* Aggregated (CPU and subtree) state */
+ long state[NR_VM_NODE_STAT_ITEMS];
+
+ /* Non-hierarchical (CPU aggregated) state */
+ long state_local[NR_VM_NODE_STAT_ITEMS];
+
+ /* Pending child counts during tree propagation */
+ long state_pending[NR_VM_NODE_STAT_ITEMS];
+};
+
+unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx)
+{
+ struct mem_cgroup_per_node *pn;
+ long x;
+
+ if (mem_cgroup_disabled())
+ return node_page_state(lruvec_pgdat(lruvec), idx);
+
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ x = READ_ONCE(pn->lruvec_stats->state[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
+unsigned long lruvec_page_state_local(struct lruvec *lruvec,
+ enum node_stat_item idx)
+{
+ struct mem_cgroup_per_node *pn;
+ long x = 0;
+
+ if (mem_cgroup_disabled())
+ return node_page_state(lruvec_pgdat(lruvec), idx);
+
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ x = READ_ONCE(pn->lruvec_stats->state_local[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
/* Subset of vm_event_item to report for memcg event stats */
static const unsigned int memcg_vm_event_stat[] = {
PGPGIN,
@@ -5491,18 +5545,25 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;

+ pn->lruvec_stats = kzalloc_node(sizeof(struct lruvec_stats), GFP_KERNEL,
+ node);
+ if (!pn->lruvec_stats)
+ goto fail;
+
pn->lruvec_stats_percpu = alloc_percpu_gfp(struct lruvec_stats_percpu,
GFP_KERNEL_ACCOUNT);
- if (!pn->lruvec_stats_percpu) {
- kfree(pn);
- return 1;
- }
+ if (!pn->lruvec_stats_percpu)
+ goto fail;

lruvec_init(&pn->lruvec);
pn->memcg = memcg;

memcg->nodeinfo[node] = pn;
return 0;
+fail:
+ kfree(pn->lruvec_stats);
+ kfree(pn);
+ return 1;
}

static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
@@ -5513,6 +5574,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
return;

free_percpu(pn->lruvec_stats_percpu);
+ kfree(pn->lruvec_stats);
kfree(pn);
}

@@ -5865,18 +5927,19 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)

for_each_node_state(nid, N_MEMORY) {
struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
- struct mem_cgroup_per_node *ppn = NULL;
+ struct lruvec_stats *lstats = pn->lruvec_stats;
+ struct lruvec_stats *plstats = NULL;
struct lruvec_stats_percpu *lstatc;

if (parent)
- ppn = parent->nodeinfo[nid];
+ plstats = parent->nodeinfo[nid]->lruvec_stats;

lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);

for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
- delta = pn->lruvec_stats.state_pending[i];
+ delta = lstats->state_pending[i];
if (delta)
- pn->lruvec_stats.state_pending[i] = 0;
+ lstats->state_pending[i] = 0;

delta_cpu = 0;
v = READ_ONCE(lstatc->state[i]);
@@ -5887,12 +5950,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
}

if (delta_cpu)
- pn->lruvec_stats.state_local[i] += delta_cpu;
+ lstats->state_local[i] += delta_cpu;

if (delta) {
- pn->lruvec_stats.state[i] += delta;
- if (ppn)
- ppn->lruvec_stats.state_pending[i] += delta;
+ lstats->state[i] += delta;
+ if (plstats)
+ plstats->state_pending[i] += delta;
}
}
}
--
2.43.0


2024-05-01 17:27:13

by Shakeel Butt

Subject: [PATCH v4 3/8] mm: memcg: account memory used for memcg vmstats and lruvec stats

From: Roman Gushchin <[email protected]>

The percpu memory used by memcg's memory statistics is already accounted.
For consistency, let's enable accounting for vmstats and lruvec stats
as well.

Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
---
Changes since v3:
- N/A

mm/memcontrol.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7126459ec56a..434cff91b65e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5545,8 +5545,8 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;

- pn->lruvec_stats = kzalloc_node(sizeof(struct lruvec_stats), GFP_KERNEL,
- node);
+ pn->lruvec_stats = kzalloc_node(sizeof(struct lruvec_stats),
+ GFP_KERNEL_ACCOUNT, node);
if (!pn->lruvec_stats)
goto fail;

@@ -5617,7 +5617,8 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
goto fail;
}

- memcg->vmstats = kzalloc(sizeof(struct memcg_vmstats), GFP_KERNEL);
+ memcg->vmstats = kzalloc(sizeof(struct memcg_vmstats),
+ GFP_KERNEL_ACCOUNT);
if (!memcg->vmstats)
goto fail;

--
2.43.0


2024-05-01 17:27:27

by Shakeel Butt

Subject: [PATCH v4 4/8] memcg: reduce memory for the lruvec and memcg stats

At the moment, the amount of memory allocated for stats related structs
in the mem_cgroup corresponds to the size of enum node_stat_item.
However, not all fields in enum node_stat_item have corresponding memcg
stats. So, let's use an indirection mechanism similar to the one used
for memcg vmstats management.

For a given x86_64 config, the size of stats with and without patch is:

structs size in bytes         w/o     with

struct lruvec_stats           1128     648
struct lruvec_stats_percpu     752     432
struct memcg_vmstats          1832    1352
struct memcg_vmstats_percpu   1280     960

The memory savings are further compounded by the fact that these structs
are allocated for each cpu and for each node. To be precise, for each
memcg the memory saved would be:

Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODES * NR_CPUS) +
(21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long)

Where 21 is the number of fields eliminated.
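
As a worked instance of the formula above, here is a small userspace
sketch. The function name memcg_stats_bytes_saved and the example
machine size are assumptions made for illustration; the per-term
comments map each term to the struct whose arrays shrink in this patch:

```c
#include <stddef.h>

/*
 * Bytes saved per memcg according to the formula in this message.
 * nr_eliminated is the number of stat fields dropped (21 above).
 */
static size_t memcg_stats_bytes_saved(size_t nr_eliminated, size_t nr_nodes,
				      size_t nr_cpus)
{
	size_t longs = 0;

	/* struct lruvec_stats: 3 arrays, one instance per node */
	longs += nr_eliminated * 3 * nr_nodes;
	/* struct lruvec_stats_percpu: 2 arrays, per node and per cpu */
	longs += nr_eliminated * 2 * nr_nodes * nr_cpus;
	/* struct memcg_vmstats: 3 arrays, one instance per memcg */
	longs += nr_eliminated * 3;
	/* struct memcg_vmstats_percpu: 2 arrays, per cpu */
	longs += nr_eliminated * 2 * nr_cpus;

	return longs * sizeof(long);
}
```

For a hypothetical machine with 2 nodes and 64 CPUs, this evaluates to
8253 longs per memcg, which is 66024 bytes where sizeof(long) is 8.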

Signed-off-by: Shakeel Butt <[email protected]>
---
Changes since v3:
- Use S8_MAX instead of 127 (Yosry)
- Change the ordering of if conditions (Yosry)
- Changed the stat array size macro to MEMCG_VMSTAT_SIZE (Johannes)

mm/memcontrol.c | 134 ++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 114 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 434cff91b65e..f3b6be5a0605 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -576,35 +576,106 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
}

+/* Subset of node_stat_item for memcg stats */
+static const unsigned int memcg_node_stat_items[] = {
+ NR_INACTIVE_ANON,
+ NR_ACTIVE_ANON,
+ NR_INACTIVE_FILE,
+ NR_ACTIVE_FILE,
+ NR_UNEVICTABLE,
+ NR_SLAB_RECLAIMABLE_B,
+ NR_SLAB_UNRECLAIMABLE_B,
+ WORKINGSET_REFAULT_ANON,
+ WORKINGSET_REFAULT_FILE,
+ WORKINGSET_ACTIVATE_ANON,
+ WORKINGSET_ACTIVATE_FILE,
+ WORKINGSET_RESTORE_ANON,
+ WORKINGSET_RESTORE_FILE,
+ WORKINGSET_NODERECLAIM,
+ NR_ANON_MAPPED,
+ NR_FILE_MAPPED,
+ NR_FILE_PAGES,
+ NR_FILE_DIRTY,
+ NR_WRITEBACK,
+ NR_SHMEM,
+ NR_SHMEM_THPS,
+ NR_FILE_THPS,
+ NR_ANON_THPS,
+ NR_KERNEL_STACK_KB,
+ NR_PAGETABLE,
+ NR_SECONDARY_PAGETABLE,
+#ifdef CONFIG_SWAP
+ NR_SWAPCACHE,
+#endif
+};
+
+static const unsigned int memcg_stat_items[] = {
+ MEMCG_SWAP,
+ MEMCG_SOCK,
+ MEMCG_PERCPU_B,
+ MEMCG_VMALLOC,
+ MEMCG_KMEM,
+ MEMCG_ZSWAP_B,
+ MEMCG_ZSWAPPED,
+};
+
+#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
+#define MEMCG_VMSTAT_SIZE (NR_MEMCG_NODE_STAT_ITEMS + \
+ ARRAY_SIZE(memcg_stat_items))
+static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;
+
+static void init_memcg_stats(void)
+{
+ int8_t i, j = 0;
+
+ BUILD_BUG_ON(MEMCG_NR_STAT >= S8_MAX);
+
+ for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; ++i)
+ mem_cgroup_stats_index[memcg_node_stat_items[i]] = ++j;
+
+ for (i = 0; i < ARRAY_SIZE(memcg_stat_items); ++i)
+ mem_cgroup_stats_index[memcg_stat_items[i]] = ++j;
+}
+
+static inline int memcg_stats_index(int idx)
+{
+ return mem_cgroup_stats_index[idx] - 1;
+}
+
struct lruvec_stats_percpu {
/* Local (CPU and cgroup) state */
- long state[NR_VM_NODE_STAT_ITEMS];
+ long state[NR_MEMCG_NODE_STAT_ITEMS];

/* Delta calculation for lockless upward propagation */
- long state_prev[NR_VM_NODE_STAT_ITEMS];
+ long state_prev[NR_MEMCG_NODE_STAT_ITEMS];
};

struct lruvec_stats {
/* Aggregated (CPU and subtree) state */
- long state[NR_VM_NODE_STAT_ITEMS];
+ long state[NR_MEMCG_NODE_STAT_ITEMS];

/* Non-hierarchical (CPU aggregated) state */
- long state_local[NR_VM_NODE_STAT_ITEMS];
+ long state_local[NR_MEMCG_NODE_STAT_ITEMS];

/* Pending child counts during tree propagation */
- long state_pending[NR_VM_NODE_STAT_ITEMS];
+ long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
};

unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
long x;
+ int i;

if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);

+ i = memcg_stats_index(idx);
+ if (i < 0)
+ return 0;
+
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats->state[idx]);
+ x = READ_ONCE(pn->lruvec_stats->state[i]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -617,12 +688,17 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
{
struct mem_cgroup_per_node *pn;
long x = 0;
+ int i;

if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);

+ i = memcg_stats_index(idx);
+ if (i < 0)
+ return 0;
+
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats->state_local[idx]);
+ x = READ_ONCE(pn->lruvec_stats->state_local[i]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -689,11 +765,11 @@ struct memcg_vmstats_percpu {
/* The above should fit a single cacheline for memcg_rstat_updated() */

/* Local (CPU and cgroup) page state & events */
- long state[MEMCG_NR_STAT];
+ long state[MEMCG_VMSTAT_SIZE];
unsigned long events[NR_MEMCG_EVENTS];

/* Delta calculation for lockless upward propagation */
- long state_prev[MEMCG_NR_STAT];
+ long state_prev[MEMCG_VMSTAT_SIZE];
unsigned long events_prev[NR_MEMCG_EVENTS];

/* Cgroup1: threshold notifications & softlimit tree updates */
@@ -703,15 +779,15 @@ struct memcg_vmstats_percpu {

struct memcg_vmstats {
/* Aggregated (CPU and subtree) page state & events */
- long state[MEMCG_NR_STAT];
+ long state[MEMCG_VMSTAT_SIZE];
unsigned long events[NR_MEMCG_EVENTS];

/* Non-hierarchical (CPU aggregated) page state & events */
- long state_local[MEMCG_NR_STAT];
+ long state_local[MEMCG_VMSTAT_SIZE];
unsigned long events_local[NR_MEMCG_EVENTS];

/* Pending child counts during tree propagation */
- long state_pending[MEMCG_NR_STAT];
+ long state_pending[MEMCG_VMSTAT_SIZE];
unsigned long events_pending[NR_MEMCG_EVENTS];

/* Stats updates since the last flush */
@@ -844,7 +920,13 @@ static void flush_memcg_stats_dwork(struct work_struct *w)

unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
{
- long x = READ_ONCE(memcg->vmstats->state[idx]);
+ long x;
+ int i = memcg_stats_index(idx);
+
+ if (i < 0)
+ return 0;
+
+ x = READ_ONCE(memcg->vmstats->state[i]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -876,18 +958,25 @@ static int memcg_state_val_in_pages(int idx, int val)
*/
void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
- if (mem_cgroup_disabled())
+ int i = memcg_stats_index(idx);
+
+ if (mem_cgroup_disabled() || i < 0)
return;

- __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
+ __this_cpu_add(memcg->vmstats_percpu->state[i], val);
memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
}

/* idx can be of type enum memcg_stat_item or node_stat_item. */
static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
{
- long x = READ_ONCE(memcg->vmstats->state_local[idx]);
+ long x;
+ int i = memcg_stats_index(idx);
+
+ if (i < 0)
+ return 0;

+ x = READ_ONCE(memcg->vmstats->state_local[i]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -901,6 +990,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
{
struct mem_cgroup_per_node *pn;
struct mem_cgroup *memcg;
+ int i = memcg_stats_index(idx);
+
+ if (i < 0)
+ return;

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
memcg = pn->memcg;
@@ -930,10 +1023,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
}

/* Update memcg */
- __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
+ __this_cpu_add(memcg->vmstats_percpu->state[i], val);

/* Update lruvec */
- __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
+ __this_cpu_add(pn->lruvec_stats_percpu->state[i], val);

memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
memcg_stats_unlock();
@@ -5702,6 +5795,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
page_counter_init(&memcg->kmem, &parent->kmem);
page_counter_init(&memcg->tcpmem, &parent->tcpmem);
} else {
+ init_memcg_stats();
init_memcg_events();
page_counter_init(&memcg->memory, NULL);
page_counter_init(&memcg->swap, NULL);
@@ -5873,7 +5967,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)

statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);

- for (i = 0; i < MEMCG_NR_STAT; i++) {
+ for (i = 0; i < MEMCG_VMSTAT_SIZE; i++) {
/*
* Collect the aggregated propagation counts of groups
* below us. We're in a per-cpu loop here and this is
@@ -5937,7 +6031,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)

lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);

- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+ for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; i++) {
delta = lstats->state_pending[i];
if (delta)
lstats->state_pending[i] = 0;
--
2.43.0


2024-05-01 17:28:08

by Shakeel Butt

Subject: [PATCH v4 7/8] memcg: warn for unexpected events and stats

To reduce memory usage by the memcg events and stats, the kernel uses
an indirection table and only allocates the stats and events which are
being used by the memcg code. To make this more robust, let's add
warnings where unexpected stat and event indexes are used.
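
The indirection-plus-warning pattern can be sketched in userspace as
follows. This is an illustration under stated assumptions, not the
kernel implementation: the enum, the table names and mod_stat() are
hypothetical, and a static flag with fprintf() stands in for
WARN_ONCE():

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical global stat enum; only some items are tracked. */
enum stat_item { STAT_A, STAT_B, STAT_C, STAT_D, NR_STAT_ITEMS };

/* Subset of stat_item that gets a slot in the compact array. */
static const unsigned int tracked_items[] = { STAT_B, STAT_D };
#define NR_TRACKED (sizeof(tracked_items) / sizeof(tracked_items[0]))

/* Stores compact index + 1, so the zero default means "not tracked". */
static int8_t stats_index_table[NR_STAT_ITEMS];
static long tracked_state[NR_TRACKED];

static void init_stats_index(void)
{
	int8_t i;

	for (i = 0; i < (int8_t)NR_TRACKED; ++i)
		stats_index_table[tracked_items[i]] = i + 1;
}

/* Translate a global item into a compact index, or -1 if untracked. */
static int stats_index(int idx)
{
	return stats_index_table[idx] - 1;
}

/* Returns 0 on success, -1 (after warning once) for an untracked item. */
static int mod_stat(int idx, long val)
{
	static int warned;
	int i = stats_index(idx);

	if (i < 0) {
		if (!warned) {
			warned = 1;
			fprintf(stderr, "%s: missing stat item %d\n",
				__func__, idx);
		}
		return -1;
	}

	tracked_state[i] += val;
	return 0;
}
```

Storing the index plus one lets a zero-initialised table mark every
item as untracked by default, so only the tracked subset needs explicit
initialisation.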

Signed-off-by: Shakeel Butt <[email protected]>
---
Changes since v3:
- N/A

mm/memcontrol.c | 39 +++++++++++++++++++++++----------------
1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a78cf00dd537..b4a1b4bb599d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -671,7 +671,7 @@ unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx)
return node_page_state(lruvec_pgdat(lruvec), idx);

i = memcg_stats_index(idx);
- if (i < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return 0;

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
@@ -687,14 +687,14 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
- long x = 0;
+ long x;
int i;

if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);

i = memcg_stats_index(idx);
- if (i < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return 0;

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
@@ -923,7 +923,7 @@ unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
long x;
int i = memcg_stats_index(idx);

- if (i < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return 0;

x = READ_ONCE(memcg->vmstats->state[i]);
@@ -960,7 +960,10 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
int i = memcg_stats_index(idx);

- if (mem_cgroup_disabled() || i < 0)
+ if (mem_cgroup_disabled())
+ return;
+
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return;

__this_cpu_add(memcg->vmstats_percpu->state[i], val);
@@ -973,7 +976,7 @@ static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
long x;
int i = memcg_stats_index(idx);

- if (i < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return 0;

x = READ_ONCE(memcg->vmstats->state_local[i]);
@@ -992,7 +995,7 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
struct mem_cgroup *memcg;
int i = memcg_stats_index(idx);

- if (i < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return;

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
@@ -1106,34 +1109,38 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
unsigned long count)
{
- int index = memcg_events_index(idx);
+ int i = memcg_events_index(idx);

- if (mem_cgroup_disabled() || index < 0)
+ if (mem_cgroup_disabled())
+ return;
+
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return;

memcg_stats_lock();
- __this_cpu_add(memcg->vmstats_percpu->events[index], count);
+ __this_cpu_add(memcg->vmstats_percpu->events[i], count);
memcg_rstat_updated(memcg, count);
memcg_stats_unlock();
}

static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
{
- int index = memcg_events_index(event);
+ int i = memcg_events_index(event);

- if (index < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, event))
return 0;
- return READ_ONCE(memcg->vmstats->events[index]);
+
+ return READ_ONCE(memcg->vmstats->events[i]);
}

static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
{
- int index = memcg_events_index(event);
+ int i = memcg_events_index(event);

- if (index < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, event))
return 0;

- return READ_ONCE(memcg->vmstats->events_local[index]);
+ return READ_ONCE(memcg->vmstats->events_local[i]);
}

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
--
2.43.0


2024-05-01 17:28:45

by Shakeel Butt

Subject: [PATCH v4 5/8] memcg: cleanup __mod_memcg_lruvec_state

There are no memcg specific stats for NR_SHMEM_PMDMAPPED and
NR_FILE_PMDMAPPED. Let's remove them.

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
---
Changes since v3:
- N/A

mm/memcontrol.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f3b6be5a0605..a78cf00dd537 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1010,8 +1010,6 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
case NR_ANON_MAPPED:
case NR_FILE_MAPPED:
case NR_ANON_THPS:
- case NR_SHMEM_PMDMAPPED:
- case NR_FILE_PMDMAPPED:
if (WARN_ON_ONCE(!in_task()))
pr_warn("stat item index: %d\n", idx);
break;
--
2.43.0


2024-05-01 17:28:47

by Shakeel Butt

Subject: [PATCH v4 8/8] memcg: use proper type for mod_memcg_state

The memcg stats update functions can take an arbitrary integer, but the
only input which makes sense is enum memcg_stat_item, and we don't want
these functions to be called with an arbitrary integer. So, replace the
parameter type with enum memcg_stat_item, so that the compiler is able
to warn if the memcg stat update functions are called with an incorrect
index value.

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
---
Changes since v3:
- N/A

include/linux/memcontrol.h | 13 +++++++------
mm/memcontrol.c | 3 ++-
2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ab8a6e884375..030d34e9d117 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -974,7 +974,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
void folio_memcg_lock(struct folio *folio);
void folio_memcg_unlock(struct folio *folio);

-void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
+void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
+ int val);

/* try to stablize folio_memcg() for all the pages in a memcg */
static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
@@ -995,7 +996,7 @@ static inline void mem_cgroup_unlock_pages(void)

/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
- int idx, int val)
+ enum memcg_stat_item idx, int val)
{
unsigned long flags;

@@ -1005,7 +1006,7 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
}

static inline void mod_memcg_page_state(struct page *page,
- int idx, int val)
+ enum memcg_stat_item idx, int val)
{
struct mem_cgroup *memcg;

@@ -1491,19 +1492,19 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
}

static inline void __mod_memcg_state(struct mem_cgroup *memcg,
- int idx,
+ enum memcg_stat_item idx,
int nr)
{
}

static inline void mod_memcg_state(struct mem_cgroup *memcg,
- int idx,
+ enum memcg_stat_item idx,
int nr)
{
}

static inline void mod_memcg_page_state(struct page *page,
- int idx, int val)
+ enum memcg_stat_item idx, int val)
{
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b4a1b4bb599d..39f8b0df46f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -956,7 +956,8 @@ static int memcg_state_val_in_pages(int idx, int val)
* @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item
* @val: delta to add to the counter, can be negative
*/
-void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
+void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
+ int val)
{
int i = memcg_stats_index(idx);

--
2.43.0


2024-05-01 17:29:16

by Shakeel Butt

Subject: [PATCH v4 6/8] mm: cleanup WORKINGSET_NODES in workingset

WORKINGSET_NODES is not exposed in the memcg stats and thus there is no
need to use the memcg specific stat update functions for it. In the
future, if we decide to expose WORKINGSET_NODES in the memcg stats, we
can revert this patch.

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
---
Changes since v3:
- N/A

mm/workingset.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index f2a0ecaf708d..c22adb93622a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -618,6 +618,7 @@ struct list_lru shadow_nodes;
void workingset_update_node(struct xa_node *node)
{
struct address_space *mapping;
+ struct page *page = virt_to_page(node);

/*
* Track non-empty nodes that contain only shadow entries;
@@ -633,12 +634,12 @@ void workingset_update_node(struct xa_node *node)
if (node->count && node->count == node->nr_values) {
if (list_empty(&node->private_list)) {
list_lru_add_obj(&shadow_nodes, &node->private_list);
- __inc_lruvec_kmem_state(node, WORKINGSET_NODES);
+ __inc_node_page_state(page, WORKINGSET_NODES);
}
} else {
if (!list_empty(&node->private_list)) {
list_lru_del_obj(&shadow_nodes, &node->private_list);
- __dec_lruvec_kmem_state(node, WORKINGSET_NODES);
+ __dec_node_page_state(page, WORKINGSET_NODES);
}
}
}
@@ -742,7 +743,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
}

list_lru_isolate(lru, item);
- __dec_lruvec_kmem_state(node, WORKINGSET_NODES);
+ __dec_node_page_state(virt_to_page(node), WORKINGSET_NODES);

spin_unlock(lru_lock);

--
2.43.0


2024-05-01 23:20:07

by Roman Gushchin

Subject: Re: [PATCH v4 2/8] memcg: dynamically allocate lruvec_stats

On Wed, May 01, 2024 at 10:26:11AM -0700, Shakeel Butt wrote:
> To decouple the dependency of lruvec_stats on NR_VM_NODE_STAT_ITEMS, we
> need to dynamically allocate lruvec_stats in the mem_cgroup_per_node
> structure. Also move the definition of lruvec_stats_percpu and
> lruvec_stats and related functions to the memcontrol.c to facilitate
> later patches. No functional changes in the patch.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reviewed-by: Yosry Ahmed <[email protected]>
> Reviewed-by: T.J. Mercier <[email protected]>

Reviewed-by: Roman Gushchin <[email protected]>

2024-05-01 23:20:59

by Roman Gushchin

Subject: Re: [PATCH v4 4/8] memcg: reduce memory for the lruvec and memcg stats

On Wed, May 01, 2024 at 10:26:13AM -0700, Shakeel Butt wrote:
> At the moment, the amount of memory allocated for stats related structs
> in the mem_cgroup corresponds to the size of enum node_stat_item.
> However not all fields in enum node_stat_item have corresponding memcg
> stats. So, let's use indirection mechanism similar to the one used for
> memcg vmstats management.
>
> For a given x86_64 config, the size of stats with and without patch is:
>
> structs size in bytes w/o with
>
> struct lruvec_stats 1128 648
> struct lruvec_stats_percpu 752 432
> struct memcg_vmstats 1832 1352
> struct memcg_vmstats_percpu 1280 960
>
> The memory savings is further compounded by the fact that these structs
> are allocated for each cpu and for each node. To be precise, for each
> memcg the memory saved would be:
>
> Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODES * NR_CPUS) +
> (21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long)
>
> Where 21 is the number of fields eliminated.
>
> Signed-off-by: Shakeel Butt <[email protected]>

Reviewed-by: Roman Gushchin <[email protected]>

2024-05-01 23:23:25

by Roman Gushchin

Subject: Re: [PATCH v4 7/8] memcg: warn for unexpected events and stats

On Wed, May 01, 2024 at 10:26:16AM -0700, Shakeel Butt wrote:
> To reduce memory usage by the memcg events and stats, the kernel uses
> indirection table and only allocate stats and events which are being
> used by the memcg code. To make this more robust, let's add warnings
> where unexpected stats and events indexes are used.
>
> Signed-off-by: Shakeel Butt <[email protected]>

Reviewed-by: Roman Gushchin <[email protected]>

2024-05-01 23:24:07

by Roman Gushchin

Subject: Re: [PATCH v4 8/8] memcg: use proper type for mod_memcg_state

On Wed, May 01, 2024 at 10:26:17AM -0700, Shakeel Butt wrote:
> The memcg stats update functions can take arbitrary integer but the
> only input which make sense is enum memcg_stat_item and we don't
> want these functions to be called with arbitrary integer, so replace
> the parameter type with enum memcg_stat_item and compiler will be able
> to warn if memcg stat update functions are called with incorrect index
> value.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reviewed-by: T.J. Mercier <[email protected]>

Reviewed-by: Roman Gushchin <[email protected]>

Thanks!