2024-04-23 05:19:05

by Shakeel Butt

[permalink] [raw]
Subject: [PATCH 0/4] memcg: reduce memory consumption by memcg stats

Most of the memory overhead of a memcg object is due to the memcg stats
maintained by the kernel. Since stats updates happen in performance
critical codepaths, the stats are maintained per-cpu, and NUMA-specific
stats are maintained per-node * per-cpu. This drastically increases the
overhead on large machines, i.e. machines with a large number of CPUs
and multiple NUMA nodes. This patch series tries to reduce the overhead
by at least not allocating memory for stats which are not memcg
specific.

Shakeel Butt (4):
mm: rearrange node_stat_item to put memcg stats at start
memcg: reduce memory for the lruvec and memcg stats
memcg: use proper type for mod_memcg_state
memcg: restrict __mod_memcg_lruvec_state to memcg stats

include/linux/memcontrol.h | 25 +++++++++++++------------
include/linux/mmzone.h | 29 +++++++++++++++++------------
mm/memcontrol.c | 12 +++++++-----
mm/vmstat.c | 24 ++++++++++++------------
4 files changed, 49 insertions(+), 41 deletions(-)

--
2.43.0



2024-04-23 05:19:18

by Shakeel Butt

[permalink] [raw]
Subject: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start

At the moment the memcg stats are sized based on the size of enum
node_stat_item, but not all fields in node_stat_item correspond to memcg
stats. So, rearrange the contents of node_stat_item such that all the
memcg-specific stats are at the top; later patches will then make sure
that the memcg code does not waste space on non-memcg stats.

Signed-off-by: Shakeel Butt <[email protected]>
---
include/linux/mmzone.h | 25 +++++++++++++------------
mm/vmstat.c | 24 ++++++++++++------------
2 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..989ca97402c6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -166,9 +166,6 @@ enum node_stat_item {
NR_UNEVICTABLE, /* " " " " " */
NR_SLAB_RECLAIMABLE_B,
NR_SLAB_UNRECLAIMABLE_B,
- NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
- NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
- WORKINGSET_NODES,
WORKINGSET_REFAULT_BASE,
WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
WORKINGSET_REFAULT_FILE,
@@ -179,39 +176,43 @@ enum node_stat_item {
WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE,
WORKINGSET_RESTORE_FILE,
WORKINGSET_NODERECLAIM,
+ NR_PAGETABLE, /* used for pagetables */
+ NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
+ NR_KERNEL_STACK_KB, /* measured in KiB */
NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
NR_FILE_PAGES,
+#ifdef CONFIG_SWAP
+ NR_SWAPCACHE,
+#endif
NR_FILE_DIRTY,
NR_WRITEBACK,
- NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_SHMEM_THPS,
- NR_SHMEM_PMDMAPPED,
NR_FILE_THPS,
- NR_FILE_PMDMAPPED,
NR_ANON_THPS,
+ /* No memcg stats for the following fields. */
+ NR_SHMEM_PMDMAPPED,
+ NR_FILE_PMDMAPPED,
+ NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_VMSCAN_WRITE,
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
+ NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
+ NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
+ WORKINGSET_NODES,
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
NR_THROTTLED_WRITTEN, /* NR_WRITTEN while reclaim throttled */
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
- NR_KERNEL_STACK_KB, /* measured in KiB */
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
NR_KERNEL_SCS_KB, /* measured in KiB */
#endif
- NR_PAGETABLE, /* used for pagetables */
- NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
#ifdef CONFIG_IOMMU_SUPPORT
NR_IOMMU_PAGES, /* # of pages allocated by IOMMU */
#endif
-#ifdef CONFIG_SWAP
- NR_SWAPCACHE,
-#endif
#ifdef CONFIG_NUMA_BALANCING
PGPROMOTE_SUCCESS, /* promote successfully */
PGPROMOTE_CANDIDATE, /* candidate pages to promote */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8507c497218b..4eac2f6322a3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1206,9 +1206,6 @@ const char * const vmstat_text[] = {
"nr_unevictable",
"nr_slab_reclaimable",
"nr_slab_unreclaimable",
- "nr_isolated_anon",
- "nr_isolated_file",
- "workingset_nodes",
"workingset_refault_anon",
"workingset_refault_file",
"workingset_activate_anon",
@@ -1216,38 +1213,41 @@ const char * const vmstat_text[] = {
"workingset_restore_anon",
"workingset_restore_file",
"workingset_nodereclaim",
+ "nr_page_table_pages",
+ "nr_sec_page_table_pages",
+ "nr_kernel_stack",
"nr_anon_pages",
"nr_mapped",
"nr_file_pages",
+#ifdef CONFIG_SWAP
+ "nr_swapcached",
+#endif
"nr_dirty",
"nr_writeback",
- "nr_writeback_temp",
"nr_shmem",
"nr_shmem_hugepages",
- "nr_shmem_pmdmapped",
"nr_file_hugepages",
- "nr_file_pmdmapped",
"nr_anon_transparent_hugepages",
+ "nr_shmem_pmdmapped",
+ "nr_file_pmdmapped",
+ "nr_writeback_temp",
"nr_vmscan_write",
"nr_vmscan_immediate_reclaim",
+ "nr_isolated_anon",
+ "nr_isolated_file",
+ "workingset_nodes",
"nr_dirtied",
"nr_written",
"nr_throttled_written",
"nr_kernel_misc_reclaimable",
"nr_foll_pin_acquired",
"nr_foll_pin_released",
- "nr_kernel_stack",
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
"nr_shadow_call_stack",
#endif
- "nr_page_table_pages",
- "nr_sec_page_table_pages",
#ifdef CONFIG_IOMMU_SUPPORT
"nr_iommu_pages",
#endif
-#ifdef CONFIG_SWAP
- "nr_swapcached",
-#endif
#ifdef CONFIG_NUMA_BALANCING
"pgpromote_success",
"pgpromote_candidate",
--
2.43.0


2024-04-23 05:19:34

by Shakeel Butt

[permalink] [raw]
Subject: [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats

At the moment, the amount of memory allocated for the stats related
structs in the mem_cgroup corresponds to the size of enum
node_stat_item. However, not all fields in enum node_stat_item have
corresponding memcg stats. The fields of enum node_stat_item are sorted
in such a way that all the fields with corresponding memcg stats are at
the start of enum node_stat_item. So, let's just make an explicit
boundary within enum node_stat_item and use that boundary to allocate
memory for the stats related structs of memcgs.

For a given x86_64 config, the size of stats with and without patch is:

structs                        size in bytes
                               w/o      with

struct lruvec_stats            1128      648
struct lruvec_stats_percpu      752      432
struct memcg_vmstats           1832     1352
struct memcg_vmstats_percpu    1280      960

The memory savings are further compounded by the fact that these
structs are allocated for each cpu and for each node. To be precise,
for each memcg, the memory saved would be:

Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODES * NR_CPUS) +
		(21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long)

Where 21 is the number of fields eliminated.

Signed-off-by: Shakeel Butt <[email protected]>
---
include/linux/memcontrol.h | 12 ++++++------
include/linux/mmzone.h | 8 ++++++--
mm/memcontrol.c | 5 ++++-
3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9aba0d0462ca..d68db7a0e829 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -32,7 +32,7 @@ struct kmem_cache;

/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
- MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
+ MEMCG_SWAP = NR_VM_NODE_MEMCG_STAT_ITEMS,
MEMCG_SOCK,
MEMCG_PERCPU_B,
MEMCG_VMALLOC,
@@ -92,21 +92,21 @@ struct mem_cgroup_reclaim_iter {

struct lruvec_stats_percpu {
/* Local (CPU and cgroup) state */
- long state[NR_VM_NODE_STAT_ITEMS];
+ long state[NR_VM_NODE_MEMCG_STAT_ITEMS];

/* Delta calculation for lockless upward propagation */
- long state_prev[NR_VM_NODE_STAT_ITEMS];
+ long state_prev[NR_VM_NODE_MEMCG_STAT_ITEMS];
};

struct lruvec_stats {
/* Aggregated (CPU and subtree) state */
- long state[NR_VM_NODE_STAT_ITEMS];
+ long state[NR_VM_NODE_MEMCG_STAT_ITEMS];

/* Non-hierarchical (CPU aggregated) state */
- long state_local[NR_VM_NODE_STAT_ITEMS];
+ long state_local[NR_VM_NODE_MEMCG_STAT_ITEMS];

/* Pending child counts during tree propagation */
- long state_pending[NR_VM_NODE_STAT_ITEMS];
+ long state_pending[NR_VM_NODE_MEMCG_STAT_ITEMS];
};

/*
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 989ca97402c6..59592f3c7d9b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -192,8 +192,12 @@ enum node_stat_item {
NR_SHMEM_THPS,
NR_FILE_THPS,
NR_ANON_THPS,
- /* No memcg stats for the following fields. */
- NR_SHMEM_PMDMAPPED,
+ /*
+ * No memcg stats for the following fields. Please add stats which have
+ * memcg counterpart above NR_VM_NODE_MEMCG_STAT_ITEMS.
+ */
+ NR_VM_NODE_MEMCG_STAT_ITEMS,
+ NR_SHMEM_PMDMAPPED = NR_VM_NODE_MEMCG_STAT_ITEMS,
NR_FILE_PMDMAPPED,
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_VMSCAN_WRITE,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 833d09c1d523..bb1bbf417a46 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1648,6 +1648,9 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
{
int i;

+ /* Reduce by 1 for MEMCG_SWAP as that is not exposed in v2. */
+ BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
+
/*
* Provide statistics on the state of the memory subsystem as
* well as cumulative event counters that show past behavior.
@@ -5869,7 +5872,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)

lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);

- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+ for (i = 0; i < NR_VM_NODE_MEMCG_STAT_ITEMS; i++) {
delta = pn->lruvec_stats.state_pending[i];
if (delta)
pn->lruvec_stats.state_pending[i] = 0;
--
2.43.0


2024-04-23 05:22:45

by Shakeel Butt

[permalink] [raw]
Subject: [PATCH 3/4] memcg: use proper type for mod_memcg_state

The memcg stats update functions can take an arbitrary integer, but the
only inputs which make sense are values of enum memcg_stat_item, and we
don't want these functions to be called with arbitrary integers. So,
replace the parameter type with enum memcg_stat_item; the compiler will
then be able to warn if a memcg stat update function is called with an
incorrect index value.

Signed-off-by: Shakeel Butt <[email protected]>
---
include/linux/memcontrol.h | 13 +++++++------
mm/memcontrol.c | 3 ++-
2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d68db7a0e829..1b4a6201c78c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -991,7 +991,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
void folio_memcg_lock(struct folio *folio);
void folio_memcg_unlock(struct folio *folio);

-void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
+void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
+ int val);

/* try to stablize folio_memcg() for all the pages in a memcg */
static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
@@ -1012,7 +1013,7 @@ static inline void mem_cgroup_unlock_pages(void)

/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
- int idx, int val)
+ enum memcg_stat_item idx, int val)
{
unsigned long flags;

@@ -1022,7 +1023,7 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
}

static inline void mod_memcg_page_state(struct page *page,
- int idx, int val)
+ enum memcg_stat_item idx, int val)
{
struct mem_cgroup *memcg;

@@ -1541,19 +1542,19 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
}

static inline void __mod_memcg_state(struct mem_cgroup *memcg,
- int idx,
+ enum memcg_stat_item idx,
int nr)
{
}

static inline void mod_memcg_state(struct mem_cgroup *memcg,
- int idx,
+ enum memcg_stat_item idx,
int nr)
{
}

static inline void mod_memcg_page_state(struct page *page,
- int idx, int val)
+ enum memcg_stat_item idx, int val)
{
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bb1bbf417a46..4e991e913393 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -816,7 +816,8 @@ static int memcg_state_val_in_pages(int idx, int val)
* @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item
* @val: delta to add to the counter, can be negative
*/
-void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
+void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
+ int val)
{
if (mem_cgroup_disabled())
return;
--
2.43.0


2024-04-23 05:23:57

by Shakeel Butt

[permalink] [raw]
Subject: [PATCH 4/4] memcg: restrict __mod_memcg_lruvec_state to memcg stats

Currently __mod_memcg_lruvec_state takes enum node_stat_item as a
parameter, but enum node_stat_item contains both memcg and non-memcg
stats while __mod_memcg_lruvec_state can only handle the memcg stats.
So, simply call __mod_memcg_lruvec_state only for memcg stats.

Signed-off-by: Shakeel Butt <[email protected]>
---
mm/memcontrol.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e991e913393..531b6ff711c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -860,8 +860,6 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
case NR_ANON_MAPPED:
case NR_FILE_MAPPED:
case NR_ANON_THPS:
- case NR_SHMEM_PMDMAPPED:
- case NR_FILE_PMDMAPPED:
if (WARN_ON_ONCE(!in_task()))
pr_warn("stat item index: %d\n", idx);
break;
@@ -899,7 +897,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);

/* Update memcg and lruvec */
- if (!mem_cgroup_disabled())
+ if (!mem_cgroup_disabled() && idx < NR_VM_NODE_MEMCG_STAT_ITEMS)
__mod_memcg_lruvec_state(lruvec, idx, val);
}

--
2.43.0


2024-04-23 13:59:14

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start

On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> At the moment the memcg stats are sized based on the size of enum
> node_stat_item, but not all fields in node_stat_item correspond to memcg
> stats. So, rearrange the contents of node_stat_item such that all the
> memcg-specific stats are at the top; later patches will then make sure
> that the memcg code does not waste space on non-memcg stats.
>
> Signed-off-by: Shakeel Butt <[email protected]>

This series is a great idea and the savings speak for themselves.

But rearranging and splitting vmstats along the memcg-nomemcg line
seems like an undue burden on the non-memcg codebase and interface.

- It messes with user-visible /proc/vmstat ordering, and sets things
up to do so on an ongoing basis as stats are added to memcg.

- It also separates related stats (like the workingset ones) in
/proc/vmstat when memcg only accounts a subset.

Would it make more sense to have a translation table inside memcg?
Like we have with memcg1_events.

2024-04-23 14:42:52

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats

Hi Shakeel,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20240422]
[cannot apply to akpm-mm/mm-everything linus/master v6.9-rc5 v6.9-rc4 v6.9-rc3 v6.9-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Shakeel-Butt/mm-rearrange-node_stat_item-to-put-memcg-stats-at-start/20240423-132451
base: next-20240422
patch link: https://lore.kernel.org/r/20240423051826.791934-3-shakeel.butt%40linux.dev
patch subject: [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats
config: x86_64-buildonly-randconfig-002-20240423 (https://download.01.org/0day-ci/archive/20240423/[email protected]/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240423/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

>> mm/memcontrol.c:1651:2: error: call to '__compiletime_assert_963' declared with 'error' attribute: BUILD_BUG_ON failed: ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1
1651 | BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
| ^
include/linux/build_bug.h:50:2: note: expanded from macro 'BUILD_BUG_ON'
50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
| ^
include/linux/build_bug.h:39:37: note: expanded from macro 'BUILD_BUG_ON_MSG'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^
include/linux/compiler_types.h:460:2: note: expanded from macro 'compiletime_assert'
460 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:448:2: note: expanded from macro '_compiletime_assert'
448 | __compiletime_assert(condition, msg, prefix, suffix)
| ^
include/linux/compiler_types.h:441:4: note: expanded from macro '__compiletime_assert'
441 | prefix ## suffix(); \
| ^
<scratch space>:40:1: note: expanded from here
40 | __compiletime_assert_963
| ^
1 error generated.


vim +1651 mm/memcontrol.c

1645
1646 static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
1647 {
1648 int i;
1649
1650 /* Reduce by 1 for MEMCG_SWAP as that is not exposed in v2. */
> 1651 BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
1652
1653 /*
1654 * Provide statistics on the state of the memory subsystem as
1655 * well as cumulative event counters that show past behavior.
1656 *
1657 * This list is ordered following a combination of these gradients:
1658 * 1) generic big picture -> specifics and details
1659 * 2) reflecting userspace activity -> reflecting kernel heuristics
1660 *
1661 * Current memory state:
1662 */
1663 mem_cgroup_flush_stats(memcg);
1664
1665 for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
1666 u64 size;
1667
1668 size = memcg_page_state_output(memcg, memory_stats[i].idx);
1669 seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
1670
1671 if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
1672 size += memcg_page_state_output(memcg,
1673 NR_SLAB_RECLAIMABLE_B);
1674 seq_buf_printf(s, "slab %llu\n", size);
1675 }
1676 }
1677
1678 /* Accumulated memory events */
1679 seq_buf_printf(s, "pgscan %lu\n",
1680 memcg_events(memcg, PGSCAN_KSWAPD) +
1681 memcg_events(memcg, PGSCAN_DIRECT) +
1682 memcg_events(memcg, PGSCAN_KHUGEPAGED));
1683 seq_buf_printf(s, "pgsteal %lu\n",
1684 memcg_events(memcg, PGSTEAL_KSWAPD) +
1685 memcg_events(memcg, PGSTEAL_DIRECT) +
1686 memcg_events(memcg, PGSTEAL_KHUGEPAGED));
1687
1688 for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) {
1689 if (memcg_vm_event_stat[i] == PGPGIN ||
1690 memcg_vm_event_stat[i] == PGPGOUT)
1691 continue;
1692
1693 seq_buf_printf(s, "%s %lu\n",
1694 vm_event_name(memcg_vm_event_stat[i]),
1695 memcg_events(memcg, memcg_vm_event_stat[i]));
1696 }
1697
1698 /* The above should easily fit into one page */
1699 WARN_ON_ONCE(seq_buf_has_overflowed(s));
1700 }
1701

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-04-23 17:53:22

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start

On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> > At the moment the memcg stats are sized based on the size of enum
> > node_stat_item, but not all fields in node_stat_item correspond to memcg
> > stats. So, rearrange the contents of node_stat_item such that all the
> > memcg-specific stats are at the top; later patches will then make sure
> > that the memcg code does not waste space on non-memcg stats.
> >
> > Signed-off-by: Shakeel Butt <[email protected]>
>
> This series is a great idea and the savings speak for themselves.
>
> But rearranging and splitting vmstats along the memcg-nomemcg line
> seems like an undue burden on the non-memcg codebase and interface.
>
> - It messes with user-visible /proc/vmstat ordering, and sets things
> up to do so on an ongoing basis as stats are added to memcg.
>
> - It also separates related stats (like the workingset ones) in
> /proc/vmstat when memcg only accounts a subset.
>
> Would it make more sense to have a translation table inside memcg?
> Like we have with memcg1_events.

Thanks for taking a look. I will look into the translation table
approach. The reason I went with this approach is that I am, in
parallel, looking into rearranging the fields of important MM structs
and enums to improve cache locality. For example, the field NR_SWAPCACHE
is always accessed together with NR_FILE_PAGES, so it makes sense to
have them on the same cacheline. So, is the rearrangement of vmstats a
big NO, or is a little bit here and there fine, unlike what I did with
this patch?

thanks,
Shakeel

2024-04-23 18:51:10

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start

On Tue, Apr 23, 2024 at 10:44:07AM -0700, Shakeel Butt wrote:
> On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> > On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> > > At the moment the memcg stats are sized based on the size of enum
> > > node_stat_item, but not all fields in node_stat_item correspond to memcg
> > > stats. So, rearrange the contents of node_stat_item such that all the
> > > memcg-specific stats are at the top; later patches will then make sure
> > > that the memcg code does not waste space on non-memcg stats.
> > >
> > > Signed-off-by: Shakeel Butt <[email protected]>
> >
> > This series is a great idea and the savings speak for themselves.
> >
> > But rearranging and splitting vmstats along the memcg-nomemcg line
> > seems like an undue burden on the non-memcg codebase and interface.
> >
> > - It messes with user-visible /proc/vmstat ordering, and sets things
> > up to do so on an ongoing basis as stats are added to memcg.
> >
> > - It also separates related stats (like the workingset ones) in
> > /proc/vmstat when memcg only accounts a subset.
> >
> > Would it make more sense to have a translation table inside memcg?
> > Like we have with memcg1_events.
>
> Thanks for taking a look. I will look into the translation table
> approach. The reason I went with this approach was that I am in parallel
> looking into rearranging fields of important MM structs and also enums
> to improve cache locality. For example, the field NR_SWAPCACHE is always
> accessed together with NR_FILE_PAGES, so it makes sense to have them on
> same cacheline. So, is the rearrangement of vmstats a big NO or a little
> bit here and there is fine unlike what I did with this patch?

I'm curious what other folks think.

The cache optimization is a stronger argument, IMO, because it
directly benefits the users of /proc/vmstat. And it would be fairly
self contained inside the node_stat_item enum - "ordered for cache".

I was more hesitant about imposing a memcg requirement on the generic
vmstat ordering.

2024-04-23 20:58:58

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats

Hi Shakeel,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20240422]
[cannot apply to akpm-mm/mm-everything linus/master v6.9-rc5 v6.9-rc4 v6.9-rc3 v6.9-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Shakeel-Butt/mm-rearrange-node_stat_item-to-put-memcg-stats-at-start/20240423-132451
base: next-20240422
patch link: https://lore.kernel.org/r/20240423051826.791934-3-shakeel.butt%40linux.dev
patch subject: [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20240424/[email protected]/config)
compiler: sh4-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240424/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

In file included from <command-line>:
In function 'memcg_stat_format',
inlined from 'memory_stat_format.constprop' at mm/memcontrol.c:1707:3:
>> include/linux/compiler_types.h:460:45: error: call to '__compiletime_assert_633' declared with attribute error: BUILD_BUG_ON failed: ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1
460 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:441:25: note: in definition of macro '__compiletime_assert'
441 | prefix ## suffix(); \
| ^~~~~~
include/linux/compiler_types.h:460:9: note: in expansion of macro '_compiletime_assert'
460 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
| ^~~~~~~~~~~~~~~~
mm/memcontrol.c:1651:9: note: in expansion of macro 'BUILD_BUG_ON'
1651 | BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
| ^~~~~~~~~~~~
In function 'memcg_stat_format',
inlined from 'memory_stat_format' at mm/memcontrol.c:1707:3,
inlined from 'memory_stat_show' at mm/memcontrol.c:6946:2:
>> include/linux/compiler_types.h:460:45: error: call to '__compiletime_assert_633' declared with attribute error: BUILD_BUG_ON failed: ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1
460 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:441:25: note: in definition of macro '__compiletime_assert'
441 | prefix ## suffix(); \
| ^~~~~~
include/linux/compiler_types.h:460:9: note: in expansion of macro '_compiletime_assert'
460 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
| ^~~~~~~~~~~~~~~~
mm/memcontrol.c:1651:9: note: in expansion of macro 'BUILD_BUG_ON'
1651 | BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
| ^~~~~~~~~~~~


vim +/__compiletime_assert_633 +460 include/linux/compiler_types.h

eb5c2d4b45e3d2 Will Deacon 2020-07-21 446
eb5c2d4b45e3d2 Will Deacon 2020-07-21 447 #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 448 __compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21 449
eb5c2d4b45e3d2 Will Deacon 2020-07-21 450 /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21 451 * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21 452 * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21 453 * @msg: a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21 454 *
eb5c2d4b45e3d2 Will Deacon 2020-07-21 455 * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21 456 * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21 457 * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21 458 */
eb5c2d4b45e3d2 Will Deacon 2020-07-21 459 #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @460 _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21 461

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-04-25 18:10:07

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start

On Mon, Apr 22, 2024 at 10:19 PM Shakeel Butt <[email protected]> wrote:
>
> At the moment the memcg stats are sized based on the size of enum
> node_stat_item, but not all fields in node_stat_item correspond to memcg
> stats. So, rearrange the contents of node_stat_item such that all the
> memcg-specific stats are at the top; later patches will then make sure
> that the memcg code does not waste space on non-memcg stats.

Is this patch meant to have no functional change other than the output order?
It would be better to clarify it in the commit message if that is the case.

Chris

>
> Signed-off-by: Shakeel Butt <[email protected]>
> ---
> include/linux/mmzone.h | 25 +++++++++++++------------
> mm/vmstat.c | 24 ++++++++++++------------
> 2 files changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8f9c9590a42c..989ca97402c6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -166,9 +166,6 @@ enum node_stat_item {
> NR_UNEVICTABLE, /* " " " " " */
> NR_SLAB_RECLAIMABLE_B,
> NR_SLAB_UNRECLAIMABLE_B,
> - NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
> - NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
> - WORKINGSET_NODES,
> WORKINGSET_REFAULT_BASE,
> WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
> WORKINGSET_REFAULT_FILE,
> @@ -179,39 +176,43 @@ enum node_stat_item {
> WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE,
> WORKINGSET_RESTORE_FILE,
> WORKINGSET_NODERECLAIM,
> + NR_PAGETABLE, /* used for pagetables */
> + NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
> + NR_KERNEL_STACK_KB, /* measured in KiB */
> NR_ANON_MAPPED, /* Mapped anonymous pages */
> NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> only modified from process context */
> NR_FILE_PAGES,
> +#ifdef CONFIG_SWAP
> + NR_SWAPCACHE,
> +#endif
> NR_FILE_DIRTY,
> NR_WRITEBACK,
> - NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
> NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
> NR_SHMEM_THPS,
> - NR_SHMEM_PMDMAPPED,
> NR_FILE_THPS,
> - NR_FILE_PMDMAPPED,
> NR_ANON_THPS,
> + /* No memcg stats for the following fields. */
> + NR_SHMEM_PMDMAPPED,
> + NR_FILE_PMDMAPPED,
> + NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
> NR_VMSCAN_WRITE,
> NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
> + NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
> + NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
> + WORKINGSET_NODES,
> NR_DIRTIED, /* page dirtyings since bootup */
> NR_WRITTEN, /* page writings since bootup */
> NR_THROTTLED_WRITTEN, /* NR_WRITTEN while reclaim throttled */
> NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
> NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
> NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
> - NR_KERNEL_STACK_KB, /* measured in KiB */
> #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
> NR_KERNEL_SCS_KB, /* measured in KiB */
> #endif
> - NR_PAGETABLE, /* used for pagetables */
> - NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
> #ifdef CONFIG_IOMMU_SUPPORT
> NR_IOMMU_PAGES, /* # of pages allocated by IOMMU */
> #endif
> -#ifdef CONFIG_SWAP
> - NR_SWAPCACHE,
> -#endif
> #ifdef CONFIG_NUMA_BALANCING
> PGPROMOTE_SUCCESS, /* promote successfully */
> PGPROMOTE_CANDIDATE, /* candidate pages to promote */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8507c497218b..4eac2f6322a3 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1206,9 +1206,6 @@ const char * const vmstat_text[] = {
> "nr_unevictable",
> "nr_slab_reclaimable",
> "nr_slab_unreclaimable",
> - "nr_isolated_anon",
> - "nr_isolated_file",
> - "workingset_nodes",
> "workingset_refault_anon",
> "workingset_refault_file",
> "workingset_activate_anon",
> @@ -1216,38 +1213,41 @@ const char * const vmstat_text[] = {
> "workingset_restore_anon",
> "workingset_restore_file",
> "workingset_nodereclaim",
> + "nr_page_table_pages",
> + "nr_sec_page_table_pages",
> + "nr_kernel_stack",
> "nr_anon_pages",
> "nr_mapped",
> "nr_file_pages",
> +#ifdef CONFIG_SWAP
> + "nr_swapcached",
> +#endif
> "nr_dirty",
> "nr_writeback",
> - "nr_writeback_temp",
> "nr_shmem",
> "nr_shmem_hugepages",
> - "nr_shmem_pmdmapped",
> "nr_file_hugepages",
> - "nr_file_pmdmapped",
> "nr_anon_transparent_hugepages",
> + "nr_shmem_pmdmapped",
> + "nr_file_pmdmapped",
> + "nr_writeback_temp",
> "nr_vmscan_write",
> "nr_vmscan_immediate_reclaim",
> + "nr_isolated_anon",
> + "nr_isolated_file",
> + "workingset_nodes",
> "nr_dirtied",
> "nr_written",
> "nr_throttled_written",
> "nr_kernel_misc_reclaimable",
> "nr_foll_pin_acquired",
> "nr_foll_pin_released",
> - "nr_kernel_stack",
> #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
> "nr_shadow_call_stack",
> #endif
> - "nr_page_table_pages",
> - "nr_sec_page_table_pages",
> #ifdef CONFIG_IOMMU_SUPPORT
> "nr_iommu_pages",
> #endif
> -#ifdef CONFIG_SWAP
> - "nr_swapcached",
> -#endif
> #ifdef CONFIG_NUMA_BALANCING
> "pgpromote_success",
> "pgpromote_candidate",
> --
> 2.43.0
>
>

2024-04-25 22:54:58

by Chris Li

Subject: Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start

On Tue, Apr 23, 2024 at 11:30 AM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Apr 23, 2024 at 10:44:07AM -0700, Shakeel Butt wrote:
> > On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> > > On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> > > > At the moment the memcg stats are sized based on the size of enum
> > > > node_stat_item, but not all fields in node_stat_item correspond to memcg
> > > > stats. So, rearrange the contents of node_stat_item such that all the
> > > > memcg-specific stats are at the top, and then the later patches will
> > > > make sure that the memcg code does not waste space on non-memcg stats.
> > > >
> > > > Signed-off-by: Shakeel Butt <[email protected]>
> > >
> > > This series is a great idea and the savings speak for themselves.
> > >
> > > But rearranging and splitting vmstats along the memcg-nomemcg line
> > > seems like an undue burden on the non-memcg codebase and interface.
> > >
> > > - It messes with user-visible /proc/vmstat ordering, and sets things
> > > up to do so on an ongoing basis as stats are added to memcg.
> > >
> > > - It also separates related stats (like the workingset ones) in
> > > /proc/vmstat when memcg only accounts a subset.
> > >
> > > Would it make more sense to have a translation table inside memcg?
> > > Like we have with memcg1_events.
> >
> > Thanks for taking a look. I will look into the translation table
> > approach. The reason I went with this approach was that I am in parallel
> > looking into rearranging fields of important MM structs and also enums
> > to improve cache locality. For example, the field NR_SWAPCACHE is always
> > accessed together with NR_FILE_PAGES, so it makes sense to have them on
> > the same cacheline. So, is rearranging vmstats a big NO, or is a little
> > bit here and there fine, unlike what I did with this patch?
>
> I'm curious what other folks think.
>
I slightly prefer not to change the user-visible ordering without a good
reason. That's not to say the order is carved in stone; it depends on the ROI.

> The cache optimization is a stronger argument, IMO, because it
> directly benefits the users of /proc/vmstat. And it would be fairly
> self contained inside the node_stat_item enum - "ordered for cache".

I'm not sure how much of the cache optimization is measurable here.
I suspect it is going to be hard to measure a meaningful difference
just from the cache line order alone.

> I was more hesitant about imposing a memcg requirement on the generic
> vmstat ordering.

That is a valid reason.

Chris