2024-04-30 06:06:33

by Shakeel Butt

Subject: [PATCH v3 0/8] memcg: reduce memory consumption by memcg stats

Most of the memory overhead of a memcg object is due to the memcg stats
maintained by the kernel. Since stat updates happen in performance-critical
codepaths, the stats are maintained per-cpu, and NUMA-specific stats are
maintained per-node * per-cpu. This drastically increases the overhead on
large machines, i.e. machines with many CPUs and multiple NUMA nodes. This
patch series tries to reduce that overhead by, at a minimum, not allocating
memory for stats which are not memcg specific.

Changelog since v2:
Use WARN_ONCE() instead of pr_warn_once() and make some commit log
changes. Also included a patch from Roman.

Changelog since v1:
The main change from v1 is the indirection approach used in this
patchset instead of rearranging the members of node_stat_item.


Roman Gushchin (1):
mm: memcg: account memory used for memcg vmstats and lruvec stats

Shakeel Butt (7):
memcg: reduce memory size of mem_cgroup_events_index
memcg: dynamically allocate lruvec_stats
memcg: reduce memory for the lruvec and memcg stats
memcg: cleanup __mod_memcg_lruvec_state
mm: cleanup WORKINGSET_NODES in workingset
memcg: warn for unexpected events and stats
memcg: use proper type for mod_memcg_state

include/linux/memcontrol.h | 75 ++----------
mm/memcontrol.c | 244 +++++++++++++++++++++++++++++++------
mm/workingset.c | 7 +-
3 files changed, 222 insertions(+), 104 deletions(-)

--
2.43.0



2024-04-30 06:06:40

by Shakeel Butt

Subject: [PATCH v3 1/8] memcg: reduce memory size of mem_cgroup_events_index

mem_cgroup_events_index is a translation table used to get the right index of
the memcg-relevant entry for the general vm_event_item. At the moment, it is
defined as an integer array. However, on a typical system the max entry of
vm_event_item (NR_VM_EVENT_ITEMS) is 113, so we don't need to use int as the
storage type of the array. For now just use int8_t as the type and add a
BUILD_BUG_ON(); we will switch to short once NR_VM_EVENT_ITEMS reaches 127.

Another benefit of this change is that the translation table fits in 2
cachelines while previously it would require 8 cachelines (assuming a
64-byte cacheline).
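
For reference, here is the cacheline arithmetic as a small standalone sketch
(113 and the 64-byte line size are the values assumed above, not taken from
any particular config):

#include <stdint.h>
#include <stdio.h>

#define NR_ITEMS  113   /* assumed NR_VM_EVENT_ITEMS */
#define CACHELINE 64    /* assumed cacheline size in bytes */

int main(void)
{
        size_t as_int  = sizeof(int) * NR_ITEMS;        /* 452 bytes */
        size_t as_int8 = sizeof(int8_t) * NR_ITEMS;     /* 113 bytes */

        printf("int:    %zu bytes -> %zu cachelines\n",
               as_int, (as_int + CACHELINE - 1) / CACHELINE);
        printf("int8_t: %zu bytes -> %zu cachelines\n",
               as_int8, (as_int8 + CACHELINE - 1) / CACHELINE);
        return 0;
}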

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
---
Changes since v2:
- Used S8_MAX instead of 127
- Update commit message based on Yosry's feedback.

mm/memcontrol.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 602ad5faad4d..c146187cda9c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -607,11 +607,13 @@ static const unsigned int memcg_vm_event_stat[] = {
};

#define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat)
-static int mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly;
+static int8_t mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly;

static void init_memcg_events(void)
{
- int i;
+ int8_t i;
+
+ BUILD_BUG_ON(NR_VM_EVENT_ITEMS >= S8_MAX);

for (i = 0; i < NR_MEMCG_EVENTS; ++i)
mem_cgroup_events_index[memcg_vm_event_stat[i]] = i + 1;
--
2.43.0


2024-04-30 06:06:56

by Shakeel Butt

Subject: [PATCH v3 2/8] memcg: dynamically allocate lruvec_stats

To decouple the dependency of lruvec_stats on NR_VM_NODE_STAT_ITEMS, we
need to dynamically allocate lruvec_stats in the mem_cgroup_per_node
structure. Also move the definitions of lruvec_stats_percpu and
lruvec_stats and the related functions to memcontrol.c to facilitate
later patches. No functional changes in this patch.
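
The header/source split relies on the standard opaque-type pattern: once
mem_cgroup_per_node holds only a pointer, the header needs just a forward
declaration and no longer has to know the array sizes. A generic sketch of
the idea (the "foo" names are placeholders, not memcontrol identifiers):

/* header side: a forward declaration and a pointer are enough */
struct foo_stats;

struct foo {
        struct foo_stats *stats;        /* previously an embedded struct */
};

unsigned long foo_stats_read(struct foo *f, int idx);

/* .c side: the full definition and the accessors live together */
struct foo_stats {
        long state[16];
};

unsigned long foo_stats_read(struct foo *f, int idx)
{
        return (unsigned long)f->stats->state[idx];
}

Keeping the definition private to memcontrol.c is what lets a later patch
shrink these arrays without touching the header.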

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
---

Changes since v2:
- N/A

include/linux/memcontrol.h | 62 +++------------------------
mm/memcontrol.c | 87 ++++++++++++++++++++++++++++++++------
2 files changed, 81 insertions(+), 68 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9aba0d0462ca..ab8a6e884375 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -83,6 +83,8 @@ enum mem_cgroup_events_target {

struct memcg_vmstats_percpu;
struct memcg_vmstats;
+struct lruvec_stats_percpu;
+struct lruvec_stats;

struct mem_cgroup_reclaim_iter {
struct mem_cgroup *position;
@@ -90,25 +92,6 @@ struct mem_cgroup_reclaim_iter {
unsigned int generation;
};

-struct lruvec_stats_percpu {
- /* Local (CPU and cgroup) state */
- long state[NR_VM_NODE_STAT_ITEMS];
-
- /* Delta calculation for lockless upward propagation */
- long state_prev[NR_VM_NODE_STAT_ITEMS];
-};
-
-struct lruvec_stats {
- /* Aggregated (CPU and subtree) state */
- long state[NR_VM_NODE_STAT_ITEMS];
-
- /* Non-hierarchical (CPU aggregated) state */
- long state_local[NR_VM_NODE_STAT_ITEMS];
-
- /* Pending child counts during tree propagation */
- long state_pending[NR_VM_NODE_STAT_ITEMS];
-};
-
/*
* per-node information in memory controller.
*/
@@ -116,7 +99,7 @@ struct mem_cgroup_per_node {
struct lruvec lruvec;

struct lruvec_stats_percpu __percpu *lruvec_stats_percpu;
- struct lruvec_stats lruvec_stats;
+ struct lruvec_stats *lruvec_stats;

unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];

@@ -1037,42 +1020,9 @@ static inline void mod_memcg_page_state(struct page *page,
}

unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
-
-static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
- enum node_stat_item idx)
-{
- struct mem_cgroup_per_node *pn;
- long x;
-
- if (mem_cgroup_disabled())
- return node_page_state(lruvec_pgdat(lruvec), idx);
-
- pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats.state[idx]);
-#ifdef CONFIG_SMP
- if (x < 0)
- x = 0;
-#endif
- return x;
-}
-
-static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
- enum node_stat_item idx)
-{
- struct mem_cgroup_per_node *pn;
- long x = 0;
-
- if (mem_cgroup_disabled())
- return node_page_state(lruvec_pgdat(lruvec), idx);
-
- pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats.state_local[idx]);
-#ifdef CONFIG_SMP
- if (x < 0)
- x = 0;
-#endif
- return x;
-}
+unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
+unsigned long lruvec_page_state_local(struct lruvec *lruvec,
+ enum node_stat_item idx);

void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c146187cda9c..7126459ec56a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -576,6 +576,60 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
}

+struct lruvec_stats_percpu {
+ /* Local (CPU and cgroup) state */
+ long state[NR_VM_NODE_STAT_ITEMS];
+
+ /* Delta calculation for lockless upward propagation */
+ long state_prev[NR_VM_NODE_STAT_ITEMS];
+};
+
+struct lruvec_stats {
+ /* Aggregated (CPU and subtree) state */
+ long state[NR_VM_NODE_STAT_ITEMS];
+
+ /* Non-hierarchical (CPU aggregated) state */
+ long state_local[NR_VM_NODE_STAT_ITEMS];
+
+ /* Pending child counts during tree propagation */
+ long state_pending[NR_VM_NODE_STAT_ITEMS];
+};
+
+unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx)
+{
+ struct mem_cgroup_per_node *pn;
+ long x;
+
+ if (mem_cgroup_disabled())
+ return node_page_state(lruvec_pgdat(lruvec), idx);
+
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ x = READ_ONCE(pn->lruvec_stats->state[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
+unsigned long lruvec_page_state_local(struct lruvec *lruvec,
+ enum node_stat_item idx)
+{
+ struct mem_cgroup_per_node *pn;
+ long x = 0;
+
+ if (mem_cgroup_disabled())
+ return node_page_state(lruvec_pgdat(lruvec), idx);
+
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ x = READ_ONCE(pn->lruvec_stats->state_local[idx]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
/* Subset of vm_event_item to report for memcg event stats */
static const unsigned int memcg_vm_event_stat[] = {
PGPGIN,
@@ -5491,18 +5545,25 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;

+ pn->lruvec_stats = kzalloc_node(sizeof(struct lruvec_stats), GFP_KERNEL,
+ node);
+ if (!pn->lruvec_stats)
+ goto fail;
+
pn->lruvec_stats_percpu = alloc_percpu_gfp(struct lruvec_stats_percpu,
GFP_KERNEL_ACCOUNT);
- if (!pn->lruvec_stats_percpu) {
- kfree(pn);
- return 1;
- }
+ if (!pn->lruvec_stats_percpu)
+ goto fail;

lruvec_init(&pn->lruvec);
pn->memcg = memcg;

memcg->nodeinfo[node] = pn;
return 0;
+fail:
+ kfree(pn->lruvec_stats);
+ kfree(pn);
+ return 1;
}

static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
@@ -5513,6 +5574,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
return;

free_percpu(pn->lruvec_stats_percpu);
+ kfree(pn->lruvec_stats);
kfree(pn);
}

@@ -5865,18 +5927,19 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)

for_each_node_state(nid, N_MEMORY) {
struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
- struct mem_cgroup_per_node *ppn = NULL;
+ struct lruvec_stats *lstats = pn->lruvec_stats;
+ struct lruvec_stats *plstats = NULL;
struct lruvec_stats_percpu *lstatc;

if (parent)
- ppn = parent->nodeinfo[nid];
+ plstats = parent->nodeinfo[nid]->lruvec_stats;

lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);

for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
- delta = pn->lruvec_stats.state_pending[i];
+ delta = lstats->state_pending[i];
if (delta)
- pn->lruvec_stats.state_pending[i] = 0;
+ lstats->state_pending[i] = 0;

delta_cpu = 0;
v = READ_ONCE(lstatc->state[i]);
@@ -5887,12 +5950,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
}

if (delta_cpu)
- pn->lruvec_stats.state_local[i] += delta_cpu;
+ lstats->state_local[i] += delta_cpu;

if (delta) {
- pn->lruvec_stats.state[i] += delta;
- if (ppn)
- ppn->lruvec_stats.state_pending[i] += delta;
+ lstats->state[i] += delta;
+ if (plstats)
+ plstats->state_pending[i] += delta;
}
}
}
--
2.43.0


2024-04-30 06:07:17

by Shakeel Butt

Subject: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

At the moment, the amount of memory allocated for stats related structs
in the mem_cgroup corresponds to the size of enum node_stat_item.
However not all fields in enum node_stat_item has corresponding memcg
stats. So, let's use an indirection mechanism similar to the one used for
memcg vmstats management.
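
The indirection follows the same scheme as the events table from patch 1: a
small, zero-initialized lookup array maps the sparse public enum onto a dense
private index, with 0 reserved for "not tracked". A standalone sketch of the
scheme (item names and counts are made up for illustration):

#include <stdint.h>
#include <stdio.h>

enum item { ITEM_A, ITEM_B, ITEM_C, ITEM_D, NR_ITEMS };  /* public, sparse */

/* only a subset of the items is actually tracked */
static const unsigned int tracked[] = { ITEM_B, ITEM_D };
#define NR_TRACKED ((int)(sizeof(tracked) / sizeof(tracked[0])))

static int8_t index_table[NR_ITEMS];  /* zero-initialized: 0 means "not tracked" */

static void init_index(void)
{
        int i;
        int8_t j = 0;

        for (i = 0; i < NR_TRACKED; ++i)
                index_table[tracked[i]] = ++j;  /* store dense index + 1 */
}

static int dense_index(int idx)
{
        return index_table[idx] - 1;            /* -1 when not tracked */
}

int main(void)
{
        long state[NR_TRACKED] = { 0 };         /* dense array: 2 slots, not 4 */
        int i;

        init_index();

        i = dense_index(ITEM_D);
        if (i >= 0)
                state[i] += 1;                  /* ITEM_D lands in slot 1 */

        printf("ITEM_A -> %d, ITEM_D -> %d\n",
               dense_index(ITEM_A), dense_index(ITEM_D));
        return 0;
}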

For a given x86_64 config, the size of stats with and without patch is:

structs (size in bytes)          w/o    with

struct lruvec_stats             1128     648
struct lruvec_stats_percpu       752     432
struct memcg_vmstats            1832    1352
struct memcg_vmstats_percpu     1280     960

The memory savings is further compounded by the fact that these structs
are allocated for each cpu and for each node. To be precise, for each
memcg the memory saved would be:

Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODS * NR_CPUS) +
(21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long)

Where 21 is the number of fields eliminated.
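
As a worked example, plugging, say, NR_NODES = 2 and NR_CPUS = 64 into the
formula (an illustrative machine size, not a measurement):

  ((21 * 3 * 2) + (21 * 2 * 2 * 64) + (21 * 3) + (21 * 2 * 64)) * 8
= (126 + 5376 + 63 + 2688) * 8
= 66024 bytes, i.e. roughly 64 KiB saved per memcg.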

Signed-off-by: Shakeel Butt <[email protected]>
---

Changes since v2:
- N/A

mm/memcontrol.c | 138 ++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 115 insertions(+), 23 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 434cff91b65e..f424c5b2ba9b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -576,35 +576,105 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
}

+/* Subset of node_stat_item for memcg stats */
+static const unsigned int memcg_node_stat_items[] = {
+ NR_INACTIVE_ANON,
+ NR_ACTIVE_ANON,
+ NR_INACTIVE_FILE,
+ NR_ACTIVE_FILE,
+ NR_UNEVICTABLE,
+ NR_SLAB_RECLAIMABLE_B,
+ NR_SLAB_UNRECLAIMABLE_B,
+ WORKINGSET_REFAULT_ANON,
+ WORKINGSET_REFAULT_FILE,
+ WORKINGSET_ACTIVATE_ANON,
+ WORKINGSET_ACTIVATE_FILE,
+ WORKINGSET_RESTORE_ANON,
+ WORKINGSET_RESTORE_FILE,
+ WORKINGSET_NODERECLAIM,
+ NR_ANON_MAPPED,
+ NR_FILE_MAPPED,
+ NR_FILE_PAGES,
+ NR_FILE_DIRTY,
+ NR_WRITEBACK,
+ NR_SHMEM,
+ NR_SHMEM_THPS,
+ NR_FILE_THPS,
+ NR_ANON_THPS,
+ NR_KERNEL_STACK_KB,
+ NR_PAGETABLE,
+ NR_SECONDARY_PAGETABLE,
+#ifdef CONFIG_SWAP
+ NR_SWAPCACHE,
+#endif
+};
+
+static const unsigned int memcg_stat_items[] = {
+ MEMCG_SWAP,
+ MEMCG_SOCK,
+ MEMCG_PERCPU_B,
+ MEMCG_VMALLOC,
+ MEMCG_KMEM,
+ MEMCG_ZSWAP_B,
+ MEMCG_ZSWAPPED,
+};
+
+#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
+#define NR_MEMCG_STATS (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
+static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;
+
+static void init_memcg_stats(void)
+{
+ int8_t i, j = 0;
+
+ /* Switch to short once this failure occurs. */
+ BUILD_BUG_ON(NR_MEMCG_STATS >= 127 /* INT8_MAX */);
+
+ for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; ++i)
+ mem_cgroup_stats_index[memcg_node_stat_items[i]] = ++j;
+
+ for (i = 0; i < ARRAY_SIZE(memcg_stat_items); ++i)
+ mem_cgroup_stats_index[memcg_stat_items[i]] = ++j;
+}
+
+static inline int memcg_stats_index(int idx)
+{
+ return mem_cgroup_stats_index[idx] - 1;
+}
+
struct lruvec_stats_percpu {
/* Local (CPU and cgroup) state */
- long state[NR_VM_NODE_STAT_ITEMS];
+ long state[NR_MEMCG_NODE_STAT_ITEMS];

/* Delta calculation for lockless upward propagation */
- long state_prev[NR_VM_NODE_STAT_ITEMS];
+ long state_prev[NR_MEMCG_NODE_STAT_ITEMS];
};

struct lruvec_stats {
/* Aggregated (CPU and subtree) state */
- long state[NR_VM_NODE_STAT_ITEMS];
+ long state[NR_MEMCG_NODE_STAT_ITEMS];

/* Non-hierarchical (CPU aggregated) state */
- long state_local[NR_VM_NODE_STAT_ITEMS];
+ long state_local[NR_MEMCG_NODE_STAT_ITEMS];

/* Pending child counts during tree propagation */
- long state_pending[NR_VM_NODE_STAT_ITEMS];
+ long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
};

unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
- long x;
+ long x = 0;
+ int i;

if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);

- pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats->state[idx]);
+ i = memcg_stats_index(idx);
+ if (i >= 0) {
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ x = READ_ONCE(pn->lruvec_stats->state[i]);
+ }
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -617,12 +687,16 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
{
struct mem_cgroup_per_node *pn;
long x = 0;
+ int i;

if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);

- pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats->state_local[idx]);
+ i = memcg_stats_index(idx);
+ if (i >= 0) {
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ x = READ_ONCE(pn->lruvec_stats->state_local[i]);
+ }
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -689,11 +763,11 @@ struct memcg_vmstats_percpu {
/* The above should fit a single cacheline for memcg_rstat_updated() */

/* Local (CPU and cgroup) page state & events */
- long state[MEMCG_NR_STAT];
+ long state[NR_MEMCG_STATS];
unsigned long events[NR_MEMCG_EVENTS];

/* Delta calculation for lockless upward propagation */
- long state_prev[MEMCG_NR_STAT];
+ long state_prev[NR_MEMCG_STATS];
unsigned long events_prev[NR_MEMCG_EVENTS];

/* Cgroup1: threshold notifications & softlimit tree updates */
@@ -703,15 +777,15 @@ struct memcg_vmstats_percpu {

struct memcg_vmstats {
/* Aggregated (CPU and subtree) page state & events */
- long state[MEMCG_NR_STAT];
+ long state[NR_MEMCG_STATS];
unsigned long events[NR_MEMCG_EVENTS];

/* Non-hierarchical (CPU aggregated) page state & events */
- long state_local[MEMCG_NR_STAT];
+ long state_local[NR_MEMCG_STATS];
unsigned long events_local[NR_MEMCG_EVENTS];

/* Pending child counts during tree propagation */
- long state_pending[MEMCG_NR_STAT];
+ long state_pending[NR_MEMCG_STATS];
unsigned long events_pending[NR_MEMCG_EVENTS];

/* Stats updates since the last flush */
@@ -844,7 +918,13 @@ static void flush_memcg_stats_dwork(struct work_struct *w)

unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
{
- long x = READ_ONCE(memcg->vmstats->state[idx]);
+ long x;
+ int i = memcg_stats_index(idx);
+
+ if (i < 0)
+ return 0;
+
+ x = READ_ONCE(memcg->vmstats->state[i]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -876,18 +956,25 @@ static int memcg_state_val_in_pages(int idx, int val)
*/
void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
- if (mem_cgroup_disabled())
+ int i = memcg_stats_index(idx);
+
+ if (mem_cgroup_disabled() || i < 0)
return;

- __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
+ __this_cpu_add(memcg->vmstats_percpu->state[i], val);
memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
}

/* idx can be of type enum memcg_stat_item or node_stat_item. */
static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
{
- long x = READ_ONCE(memcg->vmstats->state_local[idx]);
+ long x;
+ int i = memcg_stats_index(idx);
+
+ if (i < 0)
+ return 0;

+ x = READ_ONCE(memcg->vmstats->state_local[i]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -901,6 +988,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
{
struct mem_cgroup_per_node *pn;
struct mem_cgroup *memcg;
+ int i = memcg_stats_index(idx);
+
+ if (i < 0)
+ return;

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
memcg = pn->memcg;
@@ -930,10 +1021,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
}

/* Update memcg */
- __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
+ __this_cpu_add(memcg->vmstats_percpu->state[i], val);

/* Update lruvec */
- __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
+ __this_cpu_add(pn->lruvec_stats_percpu->state[i], val);

memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
memcg_stats_unlock();
@@ -5702,6 +5793,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
page_counter_init(&memcg->kmem, &parent->kmem);
page_counter_init(&memcg->tcpmem, &parent->tcpmem);
} else {
+ init_memcg_stats();
init_memcg_events();
page_counter_init(&memcg->memory, NULL);
page_counter_init(&memcg->swap, NULL);
@@ -5873,7 +5965,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)

statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);

- for (i = 0; i < MEMCG_NR_STAT; i++) {
+ for (i = 0; i < NR_MEMCG_STATS; i++) {
/*
* Collect the aggregated propagation counts of groups
* below us. We're in a per-cpu loop here and this is
@@ -5937,7 +6029,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)

lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);

- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+ for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; i++) {
delta = lstats->state_pending[i];
if (delta)
lstats->state_pending[i] = 0;
--
2.43.0


2024-04-30 06:07:30

by Shakeel Butt

Subject: [PATCH v3 5/8] memcg: cleanup __mod_memcg_lruvec_state

There are no memcg specific stats for NR_SHMEM_PMDMAPPED and
NR_FILE_PMDMAPPED. Let's remove them.

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
---
Changes since v2:
- N/A

mm/memcontrol.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f424c5b2ba9b..df94abc0088f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1008,8 +1008,6 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
case NR_ANON_MAPPED:
case NR_FILE_MAPPED:
case NR_ANON_THPS:
- case NR_SHMEM_PMDMAPPED:
- case NR_FILE_PMDMAPPED:
if (WARN_ON_ONCE(!in_task()))
pr_warn("stat item index: %d\n", idx);
break;
--
2.43.0


2024-04-30 06:07:35

by Shakeel Butt

Subject: [PATCH v3 3/8] mm: memcg: account memory used for memcg vmstats and lruvec stats

From: Roman Gushchin <[email protected]>

The percpu memory used by memcg's memory statistics is already accounted.
For consistency, let's enable accounting for vmstats and lruvec stats
as well.

Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Shakeel Butt <[email protected]>
---
mm/memcontrol.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7126459ec56a..434cff91b65e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5545,8 +5545,8 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;

- pn->lruvec_stats = kzalloc_node(sizeof(struct lruvec_stats), GFP_KERNEL,
- node);
+ pn->lruvec_stats = kzalloc_node(sizeof(struct lruvec_stats),
+ GFP_KERNEL_ACCOUNT, node);
if (!pn->lruvec_stats)
goto fail;

@@ -5617,7 +5617,8 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
goto fail;
}

- memcg->vmstats = kzalloc(sizeof(struct memcg_vmstats), GFP_KERNEL);
+ memcg->vmstats = kzalloc(sizeof(struct memcg_vmstats),
+ GFP_KERNEL_ACCOUNT);
if (!memcg->vmstats)
goto fail;

--
2.43.0


2024-04-30 06:07:41

by Shakeel Butt

Subject: [PATCH v3 6/8] mm: cleanup WORKINGSET_NODES in workingset

WORKINGSET_NODES is not exposed in the memcg stats and thus there is no
need to use the memcg specific stat update functions for it. In future
if we decide to expose WORKINGSET_NODES in the memcg stats, we can
revert this patch.

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
---

Changes since v2:
- N/A

mm/workingset.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index f2a0ecaf708d..c22adb93622a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -618,6 +618,7 @@ struct list_lru shadow_nodes;
void workingset_update_node(struct xa_node *node)
{
struct address_space *mapping;
+ struct page *page = virt_to_page(node);

/*
* Track non-empty nodes that contain only shadow entries;
@@ -633,12 +634,12 @@ void workingset_update_node(struct xa_node *node)
if (node->count && node->count == node->nr_values) {
if (list_empty(&node->private_list)) {
list_lru_add_obj(&shadow_nodes, &node->private_list);
- __inc_lruvec_kmem_state(node, WORKINGSET_NODES);
+ __inc_node_page_state(page, WORKINGSET_NODES);
}
} else {
if (!list_empty(&node->private_list)) {
list_lru_del_obj(&shadow_nodes, &node->private_list);
- __dec_lruvec_kmem_state(node, WORKINGSET_NODES);
+ __dec_node_page_state(page, WORKINGSET_NODES);
}
}
}
@@ -742,7 +743,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
}

list_lru_isolate(lru, item);
- __dec_lruvec_kmem_state(node, WORKINGSET_NODES);
+ __dec_node_page_state(virt_to_page(node), WORKINGSET_NODES);

spin_unlock(lru_lock);

--
2.43.0


2024-04-30 06:07:53

by Shakeel Butt

Subject: [PATCH v3 7/8] memcg: warn for unexpected events and stats

To reduce the memory usage of memcg events and stats, the kernel uses an
indirection table and only allocates the stats and events which are
actually used by the memcg code. To make this more robust, let's add
warnings where unexpected stat and event indexes are used.

Signed-off-by: Shakeel Butt <[email protected]>
---

Changes since v2:
- Based on feedback from Johannes, switched to WARN_ONCE() from
pr_warn_once().

mm/memcontrol.c | 55 ++++++++++++++++++++++++++++---------------------
1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df94abc0088f..72e36977a96e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -664,17 +664,18 @@ struct lruvec_stats {
unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
- long x = 0;
+ long x;
int i;

if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);

i = memcg_stats_index(idx);
- if (i >= 0) {
- pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats->state[i]);
- }
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
+ return 0;
+
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ x = READ_ONCE(pn->lruvec_stats->state[i]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -686,17 +687,18 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
- long x = 0;
+ long x;
int i;

if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);

i = memcg_stats_index(idx);
- if (i >= 0) {
- pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats->state_local[i]);
- }
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
+ return 0;
+
+ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ x = READ_ONCE(pn->lruvec_stats->state_local[i]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -921,7 +923,7 @@ unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
long x;
int i = memcg_stats_index(idx);

- if (i < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return 0;

x = READ_ONCE(memcg->vmstats->state[i]);
@@ -958,7 +960,10 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
int i = memcg_stats_index(idx);

- if (mem_cgroup_disabled() || i < 0)
+ if (mem_cgroup_disabled())
+ return;
+
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return;

__this_cpu_add(memcg->vmstats_percpu->state[i], val);
@@ -971,7 +976,7 @@ static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
long x;
int i = memcg_stats_index(idx);

- if (i < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return 0;

x = READ_ONCE(memcg->vmstats->state_local[i]);
@@ -990,7 +995,7 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
struct mem_cgroup *memcg;
int i = memcg_stats_index(idx);

- if (i < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return;

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
@@ -1104,34 +1109,38 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
unsigned long count)
{
- int index = memcg_events_index(idx);
+ int i = memcg_events_index(idx);

- if (mem_cgroup_disabled() || index < 0)
+ if (mem_cgroup_disabled())
+ return;
+
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx))
return;

memcg_stats_lock();
- __this_cpu_add(memcg->vmstats_percpu->events[index], count);
+ __this_cpu_add(memcg->vmstats_percpu->events[i], count);
memcg_rstat_updated(memcg, count);
memcg_stats_unlock();
}

static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
{
- int index = memcg_events_index(event);
+ int i = memcg_events_index(event);

- if (index < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, event))
return 0;
- return READ_ONCE(memcg->vmstats->events[index]);
+
+ return READ_ONCE(memcg->vmstats->events[i]);
}

static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
{
- int index = memcg_events_index(event);
+ int i = memcg_events_index(event);

- if (index < 0)
+ if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, event))
return 0;

- return READ_ONCE(memcg->vmstats->events_local[index]);
+ return READ_ONCE(memcg->vmstats->events_local[i]);
}

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
--
2.43.0


2024-04-30 06:08:16

by Shakeel Butt

Subject: [PATCH v3 8/8] memcg: use proper type for mod_memcg_state

The memcg stats update functions can take an arbitrary integer, but the
only input which makes sense is enum memcg_stat_item, and we don't want
these functions to be called with an arbitrary integer. So replace the
parameter type with enum memcg_stat_item; the compiler will then be able
to warn if the memcg stat update functions are called with an incorrect
index value.
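
A tiny sketch of the kind of mistake the stricter prototype lets the compiler
catch (toy enums below, not the real memcg ones; the warning comes from
-Wenum-conversion, which clang enables by default and recent GCC also
provides):

/* toy stand-ins for enum memcg_stat_item and enum node_stat_item */
enum toy_memcg_stat { TOY_MEMCG_SWAP, TOY_MEMCG_SOCK };
enum toy_node_stat  { TOY_NR_FILE_PAGES, TOY_NR_SHMEM };

static void mod_toy_state(enum toy_memcg_stat idx, int val)
{
        (void)idx;      /* body elided; only the prototype matters here */
        (void)val;
}

void caller(void)
{
        mod_toy_state(TOY_MEMCG_SOCK, 1);       /* fine */
        mod_toy_state(TOY_NR_SHMEM, 1);         /* implicit enum conversion -> warning */
}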

Signed-off-by: Shakeel Butt <[email protected]>
---
Change since v2:
- Fixed whitespace issue based on TJ's suggestion.

include/linux/memcontrol.h | 13 +++++++------
mm/memcontrol.c | 3 ++-
2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ab8a6e884375..030d34e9d117 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -974,7 +974,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
void folio_memcg_lock(struct folio *folio);
void folio_memcg_unlock(struct folio *folio);

-void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
+void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
+ int val);

/* try to stablize folio_memcg() for all the pages in a memcg */
static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
@@ -995,7 +996,7 @@ static inline void mem_cgroup_unlock_pages(void)

/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
- int idx, int val)
+ enum memcg_stat_item idx, int val)
{
unsigned long flags;

@@ -1005,7 +1006,7 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
}

static inline void mod_memcg_page_state(struct page *page,
- int idx, int val)
+ enum memcg_stat_item idx, int val)
{
struct mem_cgroup *memcg;

@@ -1491,19 +1492,19 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
}

static inline void __mod_memcg_state(struct mem_cgroup *memcg,
- int idx,
+ enum memcg_stat_item idx,
int nr)
{
}

static inline void mod_memcg_state(struct mem_cgroup *memcg,
- int idx,
+ enum memcg_stat_item idx,
int nr)
{
}

static inline void mod_memcg_page_state(struct page *page,
- int idx, int val)
+ enum memcg_stat_item idx, int val)
{
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 72e36977a96e..f5fc16b918ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -956,7 +956,8 @@ static int memcg_state_val_in_pages(int idx, int val)
* @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item
* @val: delta to add to the counter, can be negative
*/
-void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
+void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
+ int val)
{
int i = memcg_stats_index(idx);

--
2.43.0


2024-04-30 08:34:55

by Yosry Ahmed

Subject: Re: [PATCH v3 3/8] mm: memcg: account memory used for memcg vmstats and lruvec stats

On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
>
> From: Roman Gushchin <[email protected]>
>
> The percpu memory used by memcg's memory statistics is already accounted.
> For consistency, let's enable accounting for vmstats and lruvec stats
> as well.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Signed-off-by: Shakeel Butt <[email protected]>

Reviewed-by: Yosry Ahmed <[email protected]>

> ---
> mm/memcontrol.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7126459ec56a..434cff91b65e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5545,8 +5545,8 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> if (!pn)
> return 1;
>
> - pn->lruvec_stats = kzalloc_node(sizeof(struct lruvec_stats), GFP_KERNEL,
> - node);
> + pn->lruvec_stats = kzalloc_node(sizeof(struct lruvec_stats),
> + GFP_KERNEL_ACCOUNT, node);
> if (!pn->lruvec_stats)
> goto fail;
>
> @@ -5617,7 +5617,8 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
> goto fail;
> }
>
> - memcg->vmstats = kzalloc(sizeof(struct memcg_vmstats), GFP_KERNEL);
> + memcg->vmstats = kzalloc(sizeof(struct memcg_vmstats),
> + GFP_KERNEL_ACCOUNT);
> if (!memcg->vmstats)
> goto fail;
>
> --
> 2.43.0
>

2024-04-30 08:35:38

by Yosry Ahmed

Subject: Re: [PATCH v3 1/8] memcg: reduce memory size of mem_cgroup_events_index

On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
>
> mem_cgroup_events_index is a translation table to get the right index of
> the memcg relevant entry for the general vm_event_item. At the moment,
> it is defined as integer array. However on a typical system the max
> entry of vm_event_item (NR_VM_EVENT_ITEMS) is 113, so we don't need to
> use int as storage type of the array. For now just use int8_t as type
> and add a BUILD_BUG_ON() and will switch to short once NR_VM_EVENT_ITEMS
> touches 127.
>
> Another benefit of this change is that the translation table fits in 2
> cachelines while previously it would require 8 cachelines (assuming 64
> bytes cachesline).
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reviewed-by: Roman Gushchin <[email protected]>

Reviewed-by: Yosry Ahmed <[email protected]>

> ---
> Changes since v2:
> - Used S8_MAX instead of 127
> - Update commit message based on Yosry's feedback.
>
> mm/memcontrol.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 602ad5faad4d..c146187cda9c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -607,11 +607,13 @@ static const unsigned int memcg_vm_event_stat[] = {
> };
>
> #define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat)
> -static int mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly;
> +static int8_t mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly;
>
> static void init_memcg_events(void)
> {
> - int i;
> + int8_t i;
> +
> + BUILD_BUG_ON(NR_VM_EVENT_ITEMS >= S8_MAX);
>
> for (i = 0; i < NR_MEMCG_EVENTS; ++i)
> mem_cgroup_events_index[memcg_vm_event_stat[i]] = i + 1;
> --
> 2.43.0
>

2024-04-30 09:08:07

by Yosry Ahmed

Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
>
> At the moment, the amount of memory allocated for stats related structs
> in the mem_cgroup corresponds to the size of enum node_stat_item.
> However not all fields in enum node_stat_item has corresponding memcg
> stats. So, let's use indirection mechanism similar to the one used for
> memcg vmstats management.
>
> For a given x86_64 config, the size of stats with and without patch is:
>
> structs size in bytes w/o with
>
> struct lruvec_stats 1128 648
> struct lruvec_stats_percpu 752 432
> struct memcg_vmstats 1832 1352
> struct memcg_vmstats_percpu 1280 960
>
> The memory savings is further compounded by the fact that these structs
> are allocated for each cpu and for each node. To be precise, for each
> memcg the memory saved would be:
>
> Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODS * NR_CPUS) +
> (21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long)
>
> Where 21 is the number of fields eliminated.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> ---
>
> Changes since v2:
> - N/A
>
> mm/memcontrol.c | 138 ++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 115 insertions(+), 23 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 434cff91b65e..f424c5b2ba9b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -576,35 +576,105 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> return mz;
> }
>
> +/* Subset of node_stat_item for memcg stats */
> +static const unsigned int memcg_node_stat_items[] = {
> + NR_INACTIVE_ANON,
> + NR_ACTIVE_ANON,
> + NR_INACTIVE_FILE,
> + NR_ACTIVE_FILE,
> + NR_UNEVICTABLE,
> + NR_SLAB_RECLAIMABLE_B,
> + NR_SLAB_UNRECLAIMABLE_B,
> + WORKINGSET_REFAULT_ANON,
> + WORKINGSET_REFAULT_FILE,
> + WORKINGSET_ACTIVATE_ANON,
> + WORKINGSET_ACTIVATE_FILE,
> + WORKINGSET_RESTORE_ANON,
> + WORKINGSET_RESTORE_FILE,
> + WORKINGSET_NODERECLAIM,
> + NR_ANON_MAPPED,
> + NR_FILE_MAPPED,
> + NR_FILE_PAGES,
> + NR_FILE_DIRTY,
> + NR_WRITEBACK,
> + NR_SHMEM,
> + NR_SHMEM_THPS,
> + NR_FILE_THPS,
> + NR_ANON_THPS,
> + NR_KERNEL_STACK_KB,
> + NR_PAGETABLE,
> + NR_SECONDARY_PAGETABLE,
> +#ifdef CONFIG_SWAP
> + NR_SWAPCACHE,
> +#endif
> +};
> +
> +static const unsigned int memcg_stat_items[] = {
> + MEMCG_SWAP,
> + MEMCG_SOCK,
> + MEMCG_PERCPU_B,
> + MEMCG_VMALLOC,
> + MEMCG_KMEM,
> + MEMCG_ZSWAP_B,
> + MEMCG_ZSWAPPED,
> +};
> +
> +#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
> +#define NR_MEMCG_STATS (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
> +static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;

NR_MEMCG_STATS and MEMCG_NR_STAT are awfully close and have different
meanings. I think we should come up with better names (sorry nothing
comes to mind) or add a comment to make the difference more obvious.

> +
> +static void init_memcg_stats(void)
> +{
> + int8_t i, j = 0;
> +
> + /* Switch to short once this failure occurs. */
> + BUILD_BUG_ON(NR_MEMCG_STATS >= 127 /* INT8_MAX */);

Should we use S8_MAX here too?

> +
> + for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; ++i)
> + mem_cgroup_stats_index[memcg_node_stat_items[i]] = ++j;
> +
> + for (i = 0; i < ARRAY_SIZE(memcg_stat_items); ++i)
> + mem_cgroup_stats_index[memcg_stat_items[i]] = ++j;
> +}
> +
> +static inline int memcg_stats_index(int idx)
> +{
> + return mem_cgroup_stats_index[idx] - 1;
> +}
> +
> struct lruvec_stats_percpu {
> /* Local (CPU and cgroup) state */
> - long state[NR_VM_NODE_STAT_ITEMS];
> + long state[NR_MEMCG_NODE_STAT_ITEMS];
>
> /* Delta calculation for lockless upward propagation */
> - long state_prev[NR_VM_NODE_STAT_ITEMS];
> + long state_prev[NR_MEMCG_NODE_STAT_ITEMS];
> };
>
> struct lruvec_stats {
> /* Aggregated (CPU and subtree) state */
> - long state[NR_VM_NODE_STAT_ITEMS];
> + long state[NR_MEMCG_NODE_STAT_ITEMS];
>
> /* Non-hierarchical (CPU aggregated) state */
> - long state_local[NR_VM_NODE_STAT_ITEMS];
> + long state_local[NR_MEMCG_NODE_STAT_ITEMS];
>
> /* Pending child counts during tree propagation */
> - long state_pending[NR_VM_NODE_STAT_ITEMS];
> + long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
> };
>
> unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx)
> {
> struct mem_cgroup_per_node *pn;
> - long x;
> + long x = 0;
> + int i;
>
> if (mem_cgroup_disabled())
> return node_page_state(lruvec_pgdat(lruvec), idx);
>
> - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - x = READ_ONCE(pn->lruvec_stats->state[idx]);
> + i = memcg_stats_index(idx);
> + if (i >= 0) {

nit: we could return here if (i < 0) like you did in
memcg_page_state() and others below, less indentation. Same for
lruvec_page_state_local().

> + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> + x = READ_ONCE(pn->lruvec_stats->state[i]);
> + }
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -617,12 +687,16 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> {
> struct mem_cgroup_per_node *pn;
> long x = 0;
> + int i;
>
> if (mem_cgroup_disabled())
> return node_page_state(lruvec_pgdat(lruvec), idx);
>
> - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - x = READ_ONCE(pn->lruvec_stats->state_local[idx]);
> + i = memcg_stats_index(idx);
> + if (i >= 0) {
> + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> + x = READ_ONCE(pn->lruvec_stats->state_local[i]);
> + }
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -689,11 +763,11 @@ struct memcg_vmstats_percpu {
> /* The above should fit a single cacheline for memcg_rstat_updated() */
>
> /* Local (CPU and cgroup) page state & events */
> - long state[MEMCG_NR_STAT];
> + long state[NR_MEMCG_STATS];
> unsigned long events[NR_MEMCG_EVENTS];
>
> /* Delta calculation for lockless upward propagation */
> - long state_prev[MEMCG_NR_STAT];
> + long state_prev[NR_MEMCG_STATS];
> unsigned long events_prev[NR_MEMCG_EVENTS];
>
> /* Cgroup1: threshold notifications & softlimit tree updates */
> @@ -703,15 +777,15 @@ struct memcg_vmstats_percpu {
>
> struct memcg_vmstats {
> /* Aggregated (CPU and subtree) page state & events */
> - long state[MEMCG_NR_STAT];
> + long state[NR_MEMCG_STATS];
> unsigned long events[NR_MEMCG_EVENTS];
>
> /* Non-hierarchical (CPU aggregated) page state & events */
> - long state_local[MEMCG_NR_STAT];
> + long state_local[NR_MEMCG_STATS];
> unsigned long events_local[NR_MEMCG_EVENTS];
>
> /* Pending child counts during tree propagation */
> - long state_pending[MEMCG_NR_STAT];
> + long state_pending[NR_MEMCG_STATS];
> unsigned long events_pending[NR_MEMCG_EVENTS];
>
> /* Stats updates since the last flush */
> @@ -844,7 +918,13 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
>
> unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> {
> - long x = READ_ONCE(memcg->vmstats->state[idx]);
> + long x;
> + int i = memcg_stats_index(idx);
> +
> + if (i < 0)
> + return 0;
> +
> + x = READ_ONCE(memcg->vmstats->state[i]);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -876,18 +956,25 @@ static int memcg_state_val_in_pages(int idx, int val)
> */
> void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
> {
> - if (mem_cgroup_disabled())
> + int i = memcg_stats_index(idx);
> +
> + if (mem_cgroup_disabled() || i < 0)
> return;
>
> - __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> + __this_cpu_add(memcg->vmstats_percpu->state[i], val);
> memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
> }
>
> /* idx can be of type enum memcg_stat_item or node_stat_item. */
> static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
> {
> - long x = READ_ONCE(memcg->vmstats->state_local[idx]);
> + long x;
> + int i = memcg_stats_index(idx);
> +
> + if (i < 0)
> + return 0;
>
> + x = READ_ONCE(memcg->vmstats->state_local[i]);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -901,6 +988,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
> {
> struct mem_cgroup_per_node *pn;
> struct mem_cgroup *memcg;
> + int i = memcg_stats_index(idx);
> +
> + if (i < 0)
> + return;
>
> pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> memcg = pn->memcg;
> @@ -930,10 +1021,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
> }
>
> /* Update memcg */
> - __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> + __this_cpu_add(memcg->vmstats_percpu->state[i], val);
>
> /* Update lruvec */
> - __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
> + __this_cpu_add(pn->lruvec_stats_percpu->state[i], val);
>
> memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
> memcg_stats_unlock();
> @@ -5702,6 +5793,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> page_counter_init(&memcg->kmem, &parent->kmem);
> page_counter_init(&memcg->tcpmem, &parent->tcpmem);
> } else {
> + init_memcg_stats();
> init_memcg_events();
> page_counter_init(&memcg->memory, NULL);
> page_counter_init(&memcg->swap, NULL);
> @@ -5873,7 +5965,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>
> statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
>
> - for (i = 0; i < MEMCG_NR_STAT; i++) {
> + for (i = 0; i < NR_MEMCG_STATS; i++) {
> /*
> * Collect the aggregated propagation counts of groups
> * below us. We're in a per-cpu loop here and this is
> @@ -5937,7 +6029,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>
> lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);
>
> - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> + for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; i++) {
> delta = lstats->state_pending[i];
> if (delta)
> lstats->state_pending[i] = 0;
> --
> 2.43.0
>

2024-04-30 17:30:15

by T.J. Mercier

Subject: Re: [PATCH v3 2/8] memcg: dynamically allocate lruvec_stats

On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
>
> To decouple the dependency of lruvec_stats on NR_VM_NODE_STAT_ITEMS, we
> need to dynamically allocate lruvec_stats in the mem_cgroup_per_node
> structure. Also move the definition of lruvec_stats_percpu and
> lruvec_stats and related functions to the memcontrol.c to facilitate
> later patches. No functional changes in the patch.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reviewed-by: Yosry Ahmed <[email protected]>

Reviewed-by: T.J. Mercier <[email protected]>

2024-04-30 17:31:12

by T.J. Mercier

Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
>
> At the moment, the amount of memory allocated for stats related structs
> in the mem_cgroup corresponds to the size of enum node_stat_item.
> However not all fields in enum node_stat_item has corresponding memcg

typo: "have corresponding"

> stats. So, let's use indirection mechanism similar to the one used for
> memcg vmstats management.
>
> For a given x86_64 config, the size of stats with and without patch is:
>
> structs size in bytes w/o with
>
> struct lruvec_stats 1128 648
> struct lruvec_stats_percpu 752 432
> struct memcg_vmstats 1832 1352
> struct memcg_vmstats_percpu 1280 960
>
> The memory savings is further compounded by the fact that these structs
> are allocated for each cpu and for each node. To be precise, for each
> memcg the memory saved would be:
>
> Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODS * NR_CPUS) +

typo: "NR_NODES"

> (21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long)
>
> Where 21 is the number of fields eliminated.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> ---
>
> Changes since v2:
> - N/A
>
> mm/memcontrol.c | 138 ++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 115 insertions(+), 23 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 434cff91b65e..f424c5b2ba9b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -576,35 +576,105 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> return mz;
> }
>
> +/* Subset of node_stat_item for memcg stats */
> +static const unsigned int memcg_node_stat_items[] = {
> + NR_INACTIVE_ANON,
> + NR_ACTIVE_ANON,
> + NR_INACTIVE_FILE,
> + NR_ACTIVE_FILE,
> + NR_UNEVICTABLE,
> + NR_SLAB_RECLAIMABLE_B,
> + NR_SLAB_UNRECLAIMABLE_B,
> + WORKINGSET_REFAULT_ANON,
> + WORKINGSET_REFAULT_FILE,
> + WORKINGSET_ACTIVATE_ANON,
> + WORKINGSET_ACTIVATE_FILE,
> + WORKINGSET_RESTORE_ANON,
> + WORKINGSET_RESTORE_FILE,
> + WORKINGSET_NODERECLAIM,
> + NR_ANON_MAPPED,
> + NR_FILE_MAPPED,
> + NR_FILE_PAGES,
> + NR_FILE_DIRTY,
> + NR_WRITEBACK,
> + NR_SHMEM,
> + NR_SHMEM_THPS,
> + NR_FILE_THPS,
> + NR_ANON_THPS,
> + NR_KERNEL_STACK_KB,
> + NR_PAGETABLE,
> + NR_SECONDARY_PAGETABLE,
> +#ifdef CONFIG_SWAP
> + NR_SWAPCACHE,
> +#endif
> +};
> +
> +static const unsigned int memcg_stat_items[] = {
> + MEMCG_SWAP,
> + MEMCG_SOCK,
> + MEMCG_PERCPU_B,
> + MEMCG_VMALLOC,
> + MEMCG_KMEM,
> + MEMCG_ZSWAP_B,
> + MEMCG_ZSWAPPED,
> +};
> +
> +#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
> +#define NR_MEMCG_STATS (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
> +static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;
> +
> +static void init_memcg_stats(void)
> +{
> + int8_t i, j = 0;
> +
> + /* Switch to short once this failure occurs. */
> + BUILD_BUG_ON(NR_MEMCG_STATS >= 127 /* INT8_MAX */);
> +
> + for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; ++i)
> + mem_cgroup_stats_index[memcg_node_stat_items[i]] = ++j;
> +
> + for (i = 0; i < ARRAY_SIZE(memcg_stat_items); ++i)
> + mem_cgroup_stats_index[memcg_stat_items[i]] = ++j;
> +}
> +
> +static inline int memcg_stats_index(int idx)
> +{
> + return mem_cgroup_stats_index[idx] - 1;

Could this just be: return mem_cgroup_stats_index[idx];
with a postfix increment of j in init_memcg_stats instead of prefix increment?


> +}
> +
> struct lruvec_stats_percpu {
> /* Local (CPU and cgroup) state */
> - long state[NR_VM_NODE_STAT_ITEMS];
> + long state[NR_MEMCG_NODE_STAT_ITEMS];
>
> /* Delta calculation for lockless upward propagation */
> - long state_prev[NR_VM_NODE_STAT_ITEMS];
> + long state_prev[NR_MEMCG_NODE_STAT_ITEMS];
> };
>
> struct lruvec_stats {
> /* Aggregated (CPU and subtree) state */
> - long state[NR_VM_NODE_STAT_ITEMS];
> + long state[NR_MEMCG_NODE_STAT_ITEMS];
>
> /* Non-hierarchical (CPU aggregated) state */
> - long state_local[NR_VM_NODE_STAT_ITEMS];
> + long state_local[NR_MEMCG_NODE_STAT_ITEMS];
>
> /* Pending child counts during tree propagation */
> - long state_pending[NR_VM_NODE_STAT_ITEMS];
> + long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
> };
>
> unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx)
> {
> struct mem_cgroup_per_node *pn;
> - long x;
> + long x = 0;
> + int i;
>
> if (mem_cgroup_disabled())
> return node_page_state(lruvec_pgdat(lruvec), idx);
>
> - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - x = READ_ONCE(pn->lruvec_stats->state[idx]);
> + i = memcg_stats_index(idx);
> + if (i >= 0) {
> + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> + x = READ_ONCE(pn->lruvec_stats->state[i]);
> + }
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -617,12 +687,16 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> {
> struct mem_cgroup_per_node *pn;
> long x = 0;
> + int i;
>
> if (mem_cgroup_disabled())
> return node_page_state(lruvec_pgdat(lruvec), idx);
>
> - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - x = READ_ONCE(pn->lruvec_stats->state_local[idx]);
> + i = memcg_stats_index(idx);
> + if (i >= 0) {
> + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> + x = READ_ONCE(pn->lruvec_stats->state_local[i]);
> + }
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -689,11 +763,11 @@ struct memcg_vmstats_percpu {
> /* The above should fit a single cacheline for memcg_rstat_updated() */
>
> /* Local (CPU and cgroup) page state & events */
> - long state[MEMCG_NR_STAT];
> + long state[NR_MEMCG_STATS];
> unsigned long events[NR_MEMCG_EVENTS];
>
> /* Delta calculation for lockless upward propagation */
> - long state_prev[MEMCG_NR_STAT];
> + long state_prev[NR_MEMCG_STATS];
> unsigned long events_prev[NR_MEMCG_EVENTS];
>
> /* Cgroup1: threshold notifications & softlimit tree updates */
> @@ -703,15 +777,15 @@ struct memcg_vmstats_percpu {
>
> struct memcg_vmstats {
> /* Aggregated (CPU and subtree) page state & events */
> - long state[MEMCG_NR_STAT];
> + long state[NR_MEMCG_STATS];
> unsigned long events[NR_MEMCG_EVENTS];
>
> /* Non-hierarchical (CPU aggregated) page state & events */
> - long state_local[MEMCG_NR_STAT];
> + long state_local[NR_MEMCG_STATS];
> unsigned long events_local[NR_MEMCG_EVENTS];
>
> /* Pending child counts during tree propagation */
> - long state_pending[MEMCG_NR_STAT];
> + long state_pending[NR_MEMCG_STATS];
> unsigned long events_pending[NR_MEMCG_EVENTS];
>
> /* Stats updates since the last flush */
> @@ -844,7 +918,13 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
>
> unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> {
> - long x = READ_ONCE(memcg->vmstats->state[idx]);
> + long x;
> + int i = memcg_stats_index(idx);
> +
> + if (i < 0)
> + return 0;
> +
> + x = READ_ONCE(memcg->vmstats->state[i]);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -876,18 +956,25 @@ static int memcg_state_val_in_pages(int idx, int val)
> */
> void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
> {
> - if (mem_cgroup_disabled())
> + int i = memcg_stats_index(idx);
> +
> + if (mem_cgroup_disabled() || i < 0)
> return;
>
> - __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> + __this_cpu_add(memcg->vmstats_percpu->state[i], val);
> memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
> }
>
> /* idx can be of type enum memcg_stat_item or node_stat_item. */
> static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
> {
> - long x = READ_ONCE(memcg->vmstats->state_local[idx]);
> + long x;
> + int i = memcg_stats_index(idx);
> +
> + if (i < 0)
> + return 0;
>
> + x = READ_ONCE(memcg->vmstats->state_local[i]);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -901,6 +988,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
> {
> struct mem_cgroup_per_node *pn;
> struct mem_cgroup *memcg;
> + int i = memcg_stats_index(idx);
> +
> + if (i < 0)
> + return;
>
> pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> memcg = pn->memcg;
> @@ -930,10 +1021,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
> }
>
> /* Update memcg */
> - __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> + __this_cpu_add(memcg->vmstats_percpu->state[i], val);
>
> /* Update lruvec */
> - __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
> + __this_cpu_add(pn->lruvec_stats_percpu->state[i], val);
>
> memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
> memcg_stats_unlock();
> @@ -5702,6 +5793,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> page_counter_init(&memcg->kmem, &parent->kmem);
> page_counter_init(&memcg->tcpmem, &parent->tcpmem);
> } else {
> + init_memcg_stats();
> init_memcg_events();
> page_counter_init(&memcg->memory, NULL);
> page_counter_init(&memcg->swap, NULL);
> @@ -5873,7 +5965,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>
> statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
>
> - for (i = 0; i < MEMCG_NR_STAT; i++) {
> + for (i = 0; i < NR_MEMCG_STATS; i++) {
> /*
> * Collect the aggregated propagation counts of groups
> * below us. We're in a per-cpu loop here and this is
> @@ -5937,7 +6029,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>
> lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);
>
> - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> + for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; i++) {
> delta = lstats->state_pending[i];
> if (delta)
> lstats->state_pending[i] = 0;
> --
> 2.43.0
>

2024-04-30 17:31:32

by T.J. Mercier

Subject: Re: [PATCH v3 5/8] memcg: cleanup __mod_memcg_lruvec_state

On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
>
> There are no memcg specific stats for NR_SHMEM_PMDMAPPED and
> NR_FILE_PMDMAPPED. Let's remove them.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reviewed-by: Yosry Ahmed <[email protected]>
> Reviewed-by: Roman Gushchin <[email protected]>

Reviewed-by: T.J. Mercier <[email protected]>


> ---
> Changes since v2:
> - N/A
>
> mm/memcontrol.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f424c5b2ba9b..df94abc0088f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1008,8 +1008,6 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
> case NR_ANON_MAPPED:
> case NR_FILE_MAPPED:
> case NR_ANON_THPS:
> - case NR_SHMEM_PMDMAPPED:
> - case NR_FILE_PMDMAPPED:
> if (WARN_ON_ONCE(!in_task()))
> pr_warn("stat item index: %d\n", idx);
> break;
> --
> 2.43.0
>

2024-04-30 17:32:41

by T.J. Mercier

Subject: Re: [PATCH v3 6/8] mm: cleanup WORKINGSET_NODES in workingset

On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
>
> WORKINGSET_NODES is not exposed in the memcg stats and thus there is no
> need to use the memcg specific stat update functions for it. In future
> if we decide to expose WORKINGSET_NODES in the memcg stats, we can
> revert this patch.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reviewed-by: Roman Gushchin <[email protected]>

Reviewed-by: T.J. Mercier <[email protected]>

> ---
>
> Changes since v2:
> - N/A
>
> mm/workingset.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index f2a0ecaf708d..c22adb93622a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -618,6 +618,7 @@ struct list_lru shadow_nodes;
> void workingset_update_node(struct xa_node *node)
> {
> struct address_space *mapping;
> + struct page *page = virt_to_page(node);
>
> /*
> * Track non-empty nodes that contain only shadow entries;
> @@ -633,12 +634,12 @@ void workingset_update_node(struct xa_node *node)
> if (node->count && node->count == node->nr_values) {
> if (list_empty(&node->private_list)) {
> list_lru_add_obj(&shadow_nodes, &node->private_list);
> - __inc_lruvec_kmem_state(node, WORKINGSET_NODES);
> + __inc_node_page_state(page, WORKINGSET_NODES);
> }
> } else {
> if (!list_empty(&node->private_list)) {
> list_lru_del_obj(&shadow_nodes, &node->private_list);
> - __dec_lruvec_kmem_state(node, WORKINGSET_NODES);
> + __dec_node_page_state(page, WORKINGSET_NODES);
> }
> }
> }
> @@ -742,7 +743,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
> }
>
> list_lru_isolate(lru, item);
> - __dec_lruvec_kmem_state(node, WORKINGSET_NODES);
> + __dec_node_page_state(virt_to_page(node), WORKINGSET_NODES);
>
> spin_unlock(lru_lock);
>
> --
> 2.43.0
>

2024-04-30 17:38:41

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

On Tue, Apr 30, 2024 at 01:41:38AM -0700, Yosry Ahmed wrote:
> On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
> >
[...]
> > +
> > +#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
> > +#define NR_MEMCG_STATS (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
> > +static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;
>
> NR_MEMCG_STATS and MEMCG_NR_STAT are awfully close and have different
> meanings. I think we should come up with better names (sorry nothing
> comes to mind) or add a comment to make the difference more obvious.
>

How about the following comment?

/*
* Please note that NR_MEMCG_STATS represents the number of memcg stats
* we store in memory while MEMCG_NR_STAT represents the max enum value
* of the memcg stats.
*/

> > +
> > +static void init_memcg_stats(void)
> > +{
> > + int8_t i, j = 0;
> > +
> > + /* Switch to short once this failure occurs. */
> > + BUILD_BUG_ON(NR_MEMCG_STATS >= 127 /* INT8_MAX */);
>
> Should we use S8_MAX here too?
>

Yes. Andrew, can you please add the above comment and replacement of
127 with S8_MAX in the patch?

[...]
> >
> > - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> > - x = READ_ONCE(pn->lruvec_stats->state[idx]);
> > + i = memcg_stats_index(idx);
> > + if (i >= 0) {
>
> nit: we could return here if (i < 0) like you did in
> memcg_page_state() and others below, less indentation. Same for
> lruvec_page_state_local().
>

I have fixed this in the following patch which adds warnings.
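
For reference, the early-return shape being suggested is roughly the
following (sketch only; the actual fix lives in the later warnings
patch and may differ):

	i = memcg_stats_index(idx);
	if (i < 0)
		return 0;

	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	x = READ_ONCE(pn->lruvec_stats->state[i]);

i.e. bail out before touching the per-node state when the item has no
memcg counterpart, instead of nesting the reads under "if (i >= 0)".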


Thanks for the reviews.

2024-04-30 17:42:53

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

On Tue, Apr 30, 2024 at 10:38 AM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Apr 30, 2024 at 01:41:38AM -0700, Yosry Ahmed wrote:
> > On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
> > >
> [...]
> > > +
> > > +#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
> > > +#define NR_MEMCG_STATS (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
> > > +static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;
> >
> > NR_MEMCG_STATS and MEMCG_NR_STAT are awfully close and have different
> > meanings. I think we should come up with better names (sorry nothing
> > comes to mind) or add a comment to make the difference more obvious.
> >
>
> How about the following comment?

The comment LGTM. I prefer renaming them though if someone can come up
with better names.

>
> /*
> * Please note that NR_MEMCG_STATS represents the number of memcg stats
> * we store in memory while MEMCG_NR_STAT represents the max enum value
> * of the memcg stats.
> */
>
> > > +
> > > +static void init_memcg_stats(void)
> > > +{
> > > + int8_t i, j = 0;
> > > +
> > > + /* Switch to short once this failure occurs. */
> > > + BUILD_BUG_ON(NR_MEMCG_STATS >= 127 /* INT8_MAX */);
> >
> > Should we use S8_MAX here too?
> >
>
> Yes. Andrew, can you please add the above comment and replacement of
> 127 with S8_MAX in the patch?
>
> [...]
> > >
> > > - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> > > - x = READ_ONCE(pn->lruvec_stats->state[idx]);
> > > + i = memcg_stats_index(idx);
> > > + if (i >= 0) {
> >
> > nit: we could return here if (i < 0) like you did in
> > memcg_page_state() and others below, less indentation. Same for
> > lruvec_page_state_local().
> >
>
> I have fixed this in the following patch which adds warnings.

Yeah I saw that after reviewing this one.

FWIW, *if* you respin this, fixing this here would reduce the diff
noise in the patch that adds the warnings.

2024-04-30 17:46:45

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

On Tue, Apr 30, 2024 at 10:30:51AM -0700, T.J. Mercier wrote:
> On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
> >
> > +
> > +static inline int memcg_stats_index(int idx)
> > +{
> > + return mem_cgroup_stats_index[idx] - 1;
>
> Could this just be: return mem_cgroup_stats_index[idx];
> with a postfix increment of j in init_memcg_stats instead of prefix increment?
>

The -1 is basically for error checking, but I will do a followup patch
to initialize the array/indirection-table with -1 and remove the
subtraction from the fast path.
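
Roughly, that followup would look something like this (untested sketch,
guessing at the shape of the init loop; the real patch may differ):

	/* Mark every entry invalid first ... */
	memset(mem_cgroup_stats_index, -1, sizeof(mem_cgroup_stats_index));

	/* ... then record a 0-based compressed index for memcg-relevant items. */
	for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; ++i)
		mem_cgroup_stats_index[memcg_node_stat_items[i]] = j++;

so that memcg_stats_index() can simply do

	return mem_cgroup_stats_index[idx];

with -1 still meaning "no memcg stat for this item".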

Thanks for the review.

2024-04-30 17:49:39

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

On Tue, Apr 30, 2024 at 10:41:01AM -0700, Yosry Ahmed wrote:
> On Tue, Apr 30, 2024 at 10:38 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Tue, Apr 30, 2024 at 01:41:38AM -0700, Yosry Ahmed wrote:
> > > On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
[...]
> > >
> > > nit: we could return here if (i < 0) like you did in
> > > memcg_page_state() and others below, less indentation. Same for
> > > lruvec_page_state_local().
> > >
> >
> > I have fixed this in the following patch which adds warnings.
>
> Yeah I saw that after reviewing this one.
>
> FWIW, *if* you respin this, fixing this here would reduce the diff
> noise in the patch that adds the warnings.

Yeah, if I need to respin, I will change this.

2024-04-30 18:09:34

by T.J. Mercier

[permalink] [raw]
Subject: Re: [PATCH v3 8/8] memcg: use proper type for mod_memcg_state

On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
>
> The memcg stats update functions can take an arbitrary integer, but the
> only inputs which make sense are the enum memcg_stat_item values, and
> we don't want these functions to be called with an arbitrary integer.
> Replace the parameter type with enum memcg_stat_item so that the
> compiler is able to warn if the memcg stat update functions are called
> with an incorrect index value.
>
> Signed-off-by: Shakeel Butt <[email protected]>

Reviewed-by: T.J. Mercier <[email protected]>
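
For reference, the parameter-type change described above amounts to
something like the following (illustrative sketch only, not the exact
hunk; the sketch assumes the current prototype takes a plain int idx,
and the real series touches several of the update helpers):

-static inline void mod_memcg_state(struct mem_cgroup *memcg, int idx,
-				   int val)
+static inline void mod_memcg_state(struct mem_cgroup *memcg,
+				   enum memcg_stat_item idx, int val)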

2024-04-30 23:01:01

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

On Tue, Apr 30, 2024 at 10:41:01AM -0700, Yosry Ahmed wrote:
> On Tue, Apr 30, 2024 at 10:38 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Tue, Apr 30, 2024 at 01:41:38AM -0700, Yosry Ahmed wrote:
> > > On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
> > > >
> > [...]
> > > > +
> > > > +#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
> > > > +#define NR_MEMCG_STATS (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
> > > > +static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;
> > >
> > > NR_MEMCG_STATS and MEMCG_NR_STAT are awfully close and have different
> > > meanings. I think we should come up with better names (sorry nothing
> > > comes to mind) or add a comment to make the difference more obvious.
> > >
> >
> > How about the following comment?
>
> The comment LGTM. I prefer renaming them though if someone can come up
> with better names.
>

I will be posting v4 and will change the name (still thinking about the
name) because:

> > > > +static void init_memcg_stats(void)
> > > > +{
> > > > + int8_t i, j = 0;
> > > > +
> > > > + /* Switch to short once this failure occurs. */
> > > > + BUILD_BUG_ON(NR_MEMCG_STATS >= 127 /* INT8_MAX */);

The above should be MEMCG_NR_STAT instead of NR_MEMCG_STATS.
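
With that and Yosry's earlier S8_MAX suggestion folded in, the check
would read roughly (sketch):

	/* Switch to short once this failure occurs. */
	BUILD_BUG_ON(MEMCG_NR_STAT >= S8_MAX);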

2024-04-30 23:09:15

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

On Tue, Apr 30, 2024 at 4:00 PM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Apr 30, 2024 at 10:41:01AM -0700, Yosry Ahmed wrote:
> > On Tue, Apr 30, 2024 at 10:38 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Tue, Apr 30, 2024 at 01:41:38AM -0700, Yosry Ahmed wrote:
> > > > On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
> > > > >
> > > [...]
> > > > > +
> > > > > +#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
> > > > > +#define NR_MEMCG_STATS (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
> > > > > +static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;
> > > >
> > > > NR_MEMCG_STATS and MEMCG_NR_STAT are awfully close and have different
> > > > meanings. I think we should come up with better names (sorry nothing
> > > > comes to mind) or add a comment to make the difference more obvious.
> > > >
> > >
> > > How about the following comment?
> >
> > The comment LGTM. I prefer renaming them though if someone can come up
> > with better names.
> >
>
> I will be posting v4 and will change the name (still thinking about the
> name) because:
>
> > > > > +static void init_memcg_stats(void)
> > > > > +{
> > > > > + int8_t i, j = 0;
> > > > > +
> > > > > + /* Switch to short once this failure occurs. */
> > > > > + BUILD_BUG_ON(NR_MEMCG_STATS >= 127 /* INT8_MAX */);
>
> The above should be MEMCG_NR_STAT instead of NR_MEMCG_STATS.

Yeah it's pretty confusing :)

How about something explicit like:

NR_MEMCG_POSSIBLE_STAT_ITEMS / MEMCG_MAX_STAT_ITEM
NR_MEMCG_ACTUAL_STAT_ITEMS / MEMCG_ACTUAL_NR_STAT

They look ugly, but I can't think of anything better. Maybe they will
inspire something better :)

2024-05-01 00:51:14

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats

On Tue, Apr 30, 2024 at 04:07:05PM -0700, Yosry Ahmed wrote:
> On Tue, Apr 30, 2024 at 4:00 PM Shakeel Butt <[email protected]> wrote:
> >
> > On Tue, Apr 30, 2024 at 10:41:01AM -0700, Yosry Ahmed wrote:
> > > On Tue, Apr 30, 2024 at 10:38 AM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Tue, Apr 30, 2024 at 01:41:38AM -0700, Yosry Ahmed wrote:
> > > > > On Mon, Apr 29, 2024 at 11:06 PM Shakeel Butt <[email protected]> wrote:
> > > > > >
> > > > [...]
> > > > > > +
> > > > > > +#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
> > > > > > +#define NR_MEMCG_STATS (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
> > > > > > +static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;
> > > > >
> > > > > NR_MEMCG_STATS and MEMCG_NR_STAT are awfully close and have different
> > > > > meanings. I think we should come up with better names (sorry nothing
> > > > > comes to mind) or add a comment to make the difference more obvious.
> > > > >
> > > >
> > > > How about the following comment?
> > >
> > > The comment LGTM. I prefer renaming them though if someone can come up
> > > with better names.
> > >
> >
> > I will be posting v4 and will change the name (still thinking about the
> > name) because:
> >
> > > > > > +static void init_memcg_stats(void)
> > > > > > +{
> > > > > > + int8_t i, j = 0;
> > > > > > +
> > > > > > + /* Switch to short once this failure occurs. */
> > > > > > + BUILD_BUG_ON(NR_MEMCG_STATS >= 127 /* INT8_MAX */);
> >
> > The above should be MEMCG_NR_STAT instead of NR_MEMCG_STATS.
>
> Yeah it's pretty confusing :)
>
> How about something explicit like:
>
> NR_MEMCG_POSSIBLE_STAT_ITEMS / MEMCG_MAX_STAT_ITEM
> NR_MEMCG_ACTUAL_STAT_ITEMS / MEMCG_ACTUAL_NR_STAT

NR is pretty common to mark the end of an enum range. It would be good
to keep that for enum memcg_stat_item.

The other one is about an array, where we usually use "size" or
"len". How about using one of those instead? I think it should be
sufficiently distinguished then:

- MEMORY_STAT_LEN
- MEMCG_VMSTAT_LEN
- MEMCG_VMSTAT_SIZE
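
Applied to the hunk quoted above, the last of those suggestions would
give something like this (sketch, names still up for discussion):

#define MEMCG_VMSTAT_SIZE (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;

i.e. keep MEMCG_NR_STAT as the end-of-enum marker and rename only the
array-size macro.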