2022-12-14 22:53:54

by Yuanchu Xie

Subject: [RFC PATCH 0/2] mm: multi-gen LRU: working set extensions

Introduce a way of monitoring the working set of a workload, per page
type and per NUMA node, with granularity in minutes. It has page-level
granularity and minimal memory overhead by building on the
Multi-generational LRU framework, which already has most of the
infrastructure and is just missing a useful interface.

MGLRU organizes pages in generations, where an older generation contains
colder pages, and aging promotes the recently used pages into the young
generation and creates a new one. The working set size is how much
memory an application needs to keep working, the amount of "hot" memory
that's frequently used. The only missing pieces between MGLRU
generations and working set estimation are a consistent aging cadence
and an interface; we introduce the two additions.

Periodic aging
======
MGLRU aging is currently driven by reclaim, so the amount of time
between generations is non-deterministic. With memcgs being aged
regularly, MGLRU generations become time-based working set information.

- memory.periodic_aging: a new root-level only file in cgroupfs
Writing to memory.periodic_aging sets the aging interval and opts into
periodic aging.
- kold: a new kthread that ages memcgs based on the set aging interval.
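
As an illustration of the interface (assuming cgroup v2 is mounted at
/sys/fs/cgroup; the values are hypothetical), setting a 60-second aging
interval and reading the stats back:

  # echo 60 > /sys/fs/cgroup/memory.periodic_aging
  # cat /sys/fs/cgroup/memory.periodic_aging
  aging_interval 60
  late_count 0

Writing 0 stops the kold kthread and turns periodic aging off.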

Page idle age stats
======
- memory.page_idle_age: we group pages into idle age ranges, and present
the number of pages per node per pagetype in each range. This
aggregates the time information from MGLRU generations hierarchically.
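
For illustration, the file contents would look like the following, with
hypothetical counts on a two-node machine (middle ranges elided):

  0-1 anon N0=0 N1=0
  0-1 file N0=2816 N1=0
  1-2 anon N0=512 N1=0
  1-2 file N0=1024 N1=0
  ...
  3840-inf anon N0=24064 N1=512
  3840-inf file N0=169728 N1=4096

Each line is the number of pages of one type on each NUMA node whose
generation falls within that idle age range, in seconds.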

Use case: proactive reclaimer
======
The proactive reclaimer sets the aging interval and periodically reads
the page idle age stats to form a working set estimation, from which it
calculates an amount to write to memory.reclaim.

With the page idle age stats, a proactive reclaimer could calculate a
precise amount of memory to reclaim without continuously probing and
inducing reclaim.

A proactive reclaimer that uses a similar interface is used in the
Google data centers.
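
Below is a minimal userspace sketch of such a reclaimer. It is purely
illustrative: the cgroup path, the 120-second coldness threshold, the
fixed 60-second poll, and the output parsing are all assumptions, not
part of this series.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sum pages (anon + file, all nodes) idle for at least min_idle seconds. */
static unsigned long cold_pages(const char *cg, unsigned int min_idle)
{
	char path[256], line[1024];
	unsigned long total = 0;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.page_idle_age", cg);
	f = fopen(path, "r");
	if (!f)
		return 0;

	/* each line: "<lower>-<upper|inf> <anon|file> N0=<pages> N1=..." */
	while (fgets(line, sizeof(line), f)) {
		unsigned int lower;
		char *tok;

		if (sscanf(line, "%u-", &lower) != 1 || lower < min_idle)
			continue;
		for (tok = strchr(line, '='); tok; tok = strchr(tok, '='))
			total += strtoul(tok + 1, &tok, 10);
	}
	fclose(f);
	return total;
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/workload";	/* assumed memcg path */

	for (;;) {
		unsigned long bytes =
			cold_pages(cg, 120) * sysconf(_SC_PAGESIZE);
		char path[256];
		FILE *f;

		/* ask the kernel to reclaim roughly the cold portion */
		if (bytes) {
			snprintf(path, sizeof(path), "%s/memory.reclaim", cg);
			f = fopen(path, "w");
			if (f) {
				fprintf(f, "%lu", bytes);
				fclose(f);
			}
		}
		sleep(60);	/* matches the aging interval */
	}
	return 0;
}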

Use case: workload introspection
======
A workload may use the working set estimates to adjust application
behavior as needed, e.g. preemptively killing some of its workers to
avoid its working set thrashing, or dropping caches to fit within a
limit.
It can also be valuable to application developers, who can benefit from
an out-of-the-box overview of the application's usage behaviors.

TODO List
======
- selftests
- a userspace demonstrator combining periodic aging, page idle age
stats, memory.reclaim, and/or PSI

Open questions
======
- The MGLRU aging mechanism has a flag called force_scan. With
force_scan=false, invoking MGLRU aging when an lruvec has the maximum
number of generations does not actually perform aging.
However, with force_scan=true, MGLRU moves the pages in the oldest
generation to the second oldest generation. The force_scan=true flag
also disables some optimizations in MGLRU's page table walks.
The current patch sets force_scan=true, so that periodic aging would
work without a proactive reclaimer evicting the oldest generation.

- The page idle age format uses a fixed set of time ranges in seconds.
I have considered having it be based on the aging interval, or just
compiling the raw timestamps.
With the age ranges based on the aging interval, a memcg that's
undergoing memcg reclaim might have its generations in the 10
seconds range, and a much longer aging interval would obscure this
fact.
The raw timestamps from MGLRU could lead to a very large file when
aggregated hierarchically.

Yuanchu Xie (2):
mm: multi-gen LRU: periodic aging
mm: multi-gen LRU: cgroup working set stats

include/linux/kold.h | 44 ++++++++++
include/linux/mmzone.h | 4 +-
mm/Makefile | 3 +
mm/kold.c | 150 ++++++++++++++++++++++++++++++++
mm/memcontrol.c | 188 +++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 35 +++++++-
6 files changed, 422 insertions(+), 2 deletions(-)
create mode 100644 include/linux/kold.h
create mode 100644 mm/kold.c

--
2.39.0.314.g84b9a713c41-goog


2022-12-14 23:08:24

by Yuanchu Xie

Subject: [RFC PATCH 2/2] mm: multi-gen LRU: cgroup working set stats

Expose MGLRU generations as working set stats in cgroupfs as
memory.page_idle_age, where we group pages into idle age ranges, and
present the number of pages per node per pagetype in each range. This
aggregates the time information from MGLRU generations hierarchically.

Signed-off-by: Yuanchu Xie <[email protected]>
---
mm/memcontrol.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 136 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7d2fb3fc4580..86554e17be58 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1655,6 +1655,130 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
pr_info("%s", buf);
}

+#ifdef CONFIG_LRU_GEN
+static const unsigned long page_idle_age_ranges[] = {
+ 1, 2, 5, 10, 20, 30, 45, 60, 90, 120, 180,
+ 240, 360, 480, 720, 960, 1440, 1920, 2880, 3840, -1
+};
+
+#define PAGE_IDLE_AGE_NR_RANGES ARRAY_SIZE(page_idle_age_ranges)
+
+static unsigned int lru_gen_time_to_page_idle_age_range(unsigned long timestamp)
+{
+ unsigned int i;
+ unsigned long gen_age = jiffies_to_msecs(jiffies - timestamp) / MSEC_PER_SEC;
+
+ for (i = 0; i < PAGE_IDLE_AGE_NR_RANGES - 1; ++i)
+ if (gen_age <= page_idle_age_ranges[i])
+ return i;
+
+ return PAGE_IDLE_AGE_NR_RANGES - 1;
+}
+
+static void lru_gen_fill_page_idle_age_table(unsigned long *table,
+ struct lru_gen_struct *lrugen,
+ int nid)
+{
+ unsigned long max_seq = READ_ONCE(lrugen->max_seq);
+ unsigned long min_seq[ANON_AND_FILE] = {
+ READ_ONCE(lrugen->min_seq[LRU_GEN_ANON]),
+ READ_ONCE(lrugen->min_seq[LRU_GEN_FILE]),
+ };
+ unsigned long seq;
+ unsigned int pagetype;
+
+ /*
+ * Bucket each generation's page count by its idle age, for both
+ * anon and file pages; the flat table is laid out as
+ * [pagetype][nid][idle_age_range].
+ */
+
+ for (pagetype = LRU_GEN_ANON; pagetype < ANON_AND_FILE; ++pagetype) {
+ for (seq = min_seq[pagetype]; seq <= max_seq; ++seq) {
+ unsigned int zone;
+ unsigned int gen = lru_gen_from_seq(seq);
+ unsigned int idle_age = lru_gen_time_to_page_idle_age_range(
+ READ_ONCE(lrugen->timestamps[gen]));
+ unsigned long page_count = 0;
+
+ for (zone = 0; zone < MAX_NR_ZONES; ++zone) {
+ page_count += READ_ONCE(
+ lrugen->nr_pages[gen][pagetype][zone]);
+ }
+ table[pagetype * PAGE_IDLE_AGE_NR_RANGES *
+ nr_node_ids +
+ PAGE_IDLE_AGE_NR_RANGES * nid + idle_age] +=
+ page_count;
+ }
+ }
+}
+
+static void memory_page_idle_age_print(struct seq_file *m, unsigned long *table)
+{
+ static const char *type_str[ANON_AND_FILE] = { "anon", "file" };
+ unsigned int i, nid, pagetype;
+ unsigned int lower = 0;
+
+ for (i = 0; i < PAGE_IDLE_AGE_NR_RANGES; ++i) {
+ unsigned int upper = page_idle_age_ranges[i];
+
+ for (pagetype = LRU_GEN_ANON; pagetype < ANON_AND_FILE;
+ ++pagetype) {
+ if (upper == -1)
+ seq_printf(m, "%u-inf %s", lower,
+ type_str[pagetype]);
+ else
+ seq_printf(m, "%u-%u %s", lower, upper,
+ type_str[pagetype]);
+ for_each_node_state(nid, N_MEMORY) {
+ unsigned long page_count = table
+ [pagetype *
+ PAGE_IDLE_AGE_NR_RANGES *
+ nr_node_ids +
+ PAGE_IDLE_AGE_NR_RANGES * nid +
+ i];
+ seq_printf(m, " N%u=%lu", nid, page_count);
+ }
+ seq_puts(m, "\n");
+ }
+
+ lower = upper;
+ }
+}
+
+static int memory_page_idle_age_format(struct mem_cgroup *root,
+ struct seq_file *m)
+{
+ struct mem_cgroup *memcg;
+ unsigned long *table;
+
+ /*
+ * table contains PAGE_IDLE_AGE_NR_RANGES entries
+ * per node per pagetype
+ */
+ table = kmalloc_array(PAGE_IDLE_AGE_NR_RANGES * nr_node_ids *
+ ANON_AND_FILE,
+ sizeof(*table), __GFP_ZERO | GFP_KERNEL);
+
+ if (!table)
+ return -ENOMEM;
+
+ memcg = mem_cgroup_iter(root, NULL, NULL);
+ do {
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY) {
+ struct lru_gen_struct *lrugen =
+ &memcg->nodeinfo[nid]->lruvec.lrugen;
+
+ lru_gen_fill_page_idle_age_table(table, lrugen, nid);
+ }
+ } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+
+ memory_page_idle_age_print(m, table);
+ return 0;
+}
+#endif /* CONFIG_LRU_GEN */
+
/*
* Return the memory (and swap, if configured) limit for a memcg.
*/
@@ -6571,6 +6695,13 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
}

#ifdef CONFIG_LRU_GEN
+static int memory_page_idle_age_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ return memory_page_idle_age_format(memcg, m);
+}
+
static int memory_periodic_aging_show(struct seq_file *m, void *v)
{
unsigned int interval = kold_get_interval();
@@ -6724,6 +6855,11 @@ static struct cftype memory_files[] = {
.write = memory_reclaim,
},
#ifdef CONFIG_LRU_GEN
+ {
+ .name = "page_idle_age",
+ .flags = CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_page_idle_age_show,
+ },
{
.name = "periodic_aging",
.flags = CFTYPE_ONLY_ON_ROOT,
--
2.39.0.314.g84b9a713c41-goog

2022-12-14 23:21:52

by Yuanchu Xie

Subject: [RFC PATCH 1/2] mm: multi-gen LRU: periodic aging

Periodically age MGLRU-enabled lruvecs to turn MGLRU generations into
time-based working set information. This includes an interface to set
the periodic aging interval and a new kthread to perform aging.

memory.periodic_aging: a new root-level only file in cgroupfs
Writing to memory.periodic_aging sets the aging interval and opts into
periodic aging.
kold: a new kthread that ages memcgs based on the set aging interval.

Signed-off-by: Yuanchu Xie <[email protected]>
---
include/linux/kold.h | 44 ++++++++++++
include/linux/mmzone.h | 4 +-
mm/Makefile | 3 +
mm/kold.c | 150 +++++++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 52 ++++++++++++++
mm/vmscan.c | 35 +++++++++-
6 files changed, 286 insertions(+), 2 deletions(-)
create mode 100644 include/linux/kold.h
create mode 100644 mm/kold.c

diff --git a/include/linux/kold.h b/include/linux/kold.h
new file mode 100644
index 000000000000..10b0dbe09a5c
--- /dev/null
+++ b/include/linux/kold.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * Periodic aging for multi-gen LRU
+ *
+ * Copyright (C) 2022 Yuanchu Xie <[email protected]>
+ */
+#ifndef KOLD_H_
+#define KOLD_H_
+
+#include <linux/memcontrol.h>
+
+struct kold_stats {
+ /* late is defined as spending an entire interval aging without sleep
+ * stat is aggregated every aging interval
+ */
+ unsigned int late_count;
+};
+
+/* returns the creation timestamp of the youngest generation */
+unsigned long lru_gen_force_age_lruvec(struct mem_cgroup *memcg, int nid,
+ unsigned long min_ttl);
+
+#ifdef CONFIG_MEMCG
+int kold_set_interval(unsigned int interval);
+unsigned int kold_get_interval(void);
+int kold_get_stats(struct kold_stats *stats);
+#else
+/* static inline stubs; non-static definitions in a header would
+ * produce duplicate symbols in every file that includes it
+ */
+static inline int kold_set_interval(unsigned int interval)
+{
+ return 0;
+}
+
+static inline unsigned int kold_get_interval(void)
+{
+ return 0;
+}
+
+static inline int kold_get_stats(struct kold_stats *stats)
+{
+ return -1;
+}
+#endif /* CONFIG_MEMCG */
+
+#endif /* KOLD_H_ */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5f74891556f3..929c777b826a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1218,7 +1218,9 @@ typedef struct pglist_data {

#ifdef CONFIG_LRU_GEN
/* kswap mm walk data */
- struct lru_gen_mm_walk mm_walk;
+ struct lru_gen_mm_walk mm_walk;
+ /* kold periodic aging walk data */
+ struct lru_gen_mm_walk kold_mm_walk;
#endif

CACHELINE_PADDING(_pad2_);
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..8bd554a6eb7d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -98,6 +98,9 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+ifdef CONFIG_LRU_GEN
+obj-$(CONFIG_MEMCG) += kold.o
+endif
ifdef CONFIG_SWAP
obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
diff --git a/mm/kold.c b/mm/kold.c
new file mode 100644
index 000000000000..094574177968
--- /dev/null
+++ b/mm/kold.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Yuanchu Xie <[email protected]>
+ */
+#include <linux/stddef.h>
+#include <linux/topology.h>
+#include <linux/cpumask.h>
+#include <linux/mmzone.h>
+#include <linux/nodemask.h>
+#include <linux/sched/mm.h>
+#include <linux/swap.h>
+#include <linux/memcontrol.h>
+#include <linux/err.h>
+#include <linux/jiffies.h>
+#include <linux/sched.h>
+#include <linux/cache.h>
+#include <linux/init.h>
+#include <linux/mutex.h>
+#include <linux/kold.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/mm_inline.h>
+
+static struct task_struct *kold_thread __read_mostly;
+/* protects kold_thread */
+static DEFINE_MUTEX(kold_mutex);
+
+static unsigned int aging_interval __read_mostly;
+static unsigned int late_count;
+
+/* try to move to a cpu on the target node */
+static void try_move_current_to_node(int nid)
+{
+ struct cpumask node_cpus;
+
+ cpumask_and(&node_cpus, cpumask_of_node(nid), cpu_online_mask);
+ if (!cpumask_empty(&node_cpus))
+ set_cpus_allowed_ptr(current, &node_cpus);
+}
+
+static int kold_run(void *none)
+{
+ int nid;
+ unsigned int flags;
+ unsigned long last_interval_start_time = jiffies;
+ bool sleep_since_last_full_scan = false;
+ struct mem_cgroup *memcg;
+ struct reclaim_state reclaim_state = {};
+
+ while (!kthread_should_stop()) {
+ unsigned long interval =
+ (unsigned long)(READ_ONCE(aging_interval)) * HZ;
+ unsigned long next_wakeup_tick = jiffies + interval;
+ long timeout_ticks;
+
+ current->reclaim_state = &reclaim_state;
+ flags = memalloc_noreclaim_save();
+
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ try_move_current_to_node(nid);
+ reclaim_state.mm_walk = &pgdat->kold_mm_walk;
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ unsigned long young_timestamp =
+ lru_gen_force_age_lruvec(memcg, nid,
+ interval);
+
+ if (time_before(young_timestamp + interval,
+ next_wakeup_tick)) {
+ next_wakeup_tick = young_timestamp + interval;
+ }
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+ }
+
+ memalloc_noreclaim_restore(flags);
+ current->reclaim_state = NULL;
+
+ /* late_count stats update */
+ if (time_is_before_jiffies(last_interval_start_time + interval)) {
+ last_interval_start_time += interval;
+ if (!sleep_since_last_full_scan) {
+ WRITE_ONCE(late_count,
+ READ_ONCE(late_count) + 1);
+ }
+ sleep_since_last_full_scan = false;
+ }
+
+ /* sleep until the next aging deadline; the subtraction is
+ * wraparound-safe and negative once the deadline has passed
+ */
+ timeout_ticks = -(long)(jiffies - next_wakeup_tick);
+ if (timeout_ticks > 0 && timeout_ticks != MAX_SCHEDULE_TIMEOUT) {
+ sleep_since_last_full_scan = true;
+ schedule_timeout_idle(timeout_ticks);
+ }
+ }
+ return 0;
+}
+
+int kold_get_stats(struct kold_stats *stats)
+{
+ stats->late_count = READ_ONCE(late_count);
+ return 0;
+}
+
+unsigned int kold_get_interval(void)
+{
+ return READ_ONCE(aging_interval);
+}
+
+int kold_set_interval(unsigned int interval)
+{
+ int err = 0;
+
+ mutex_lock(&kold_mutex);
+ if (interval && !kold_thread) {
+ if (!lru_gen_enabled()) {
+ err = -EOPNOTSUPP;
+ goto cleanup;
+ }
+ kold_thread = kthread_create(kold_run, NULL, "kold");
+
+ if (IS_ERR(kold_thread)) {
+ pr_err("kold: kthread_run(kold_run) failed\n");
+ err = PTR_ERR(kold_thread);
+ kold_thread = NULL;
+ goto cleanup;
+ }
+ WRITE_ONCE(aging_interval, interval);
+ wake_up_process(kold_thread);
+ } else {
+ if (!interval && kold_thread) {
+ kthread_stop(kold_thread);
+ kold_thread = NULL;
+ }
+ WRITE_ONCE(aging_interval, interval);
+ }
+
+cleanup:
+ mutex_unlock(&kold_mutex);
+ return err;
+}
+
+static int __init kold_init(void)
+{
+ return 0;
+}
+
+module_init(kold_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..7d2fb3fc4580 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
#include <linux/resume_user_mode.h>
#include <linux/psi.h>
#include <linux/seq_buf.h>
+#include <linux/kold.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -6569,6 +6570,49 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}

+#ifdef CONFIG_LRU_GEN
+static int memory_periodic_aging_show(struct seq_file *m, void *v)
+{
+ unsigned int interval = kold_get_interval();
+ struct kold_stats stats;
+ int err;
+
+ err = kold_get_stats(&stats);
+
+ if (err)
+ return err;
+
+ seq_printf(m, "aging_interval %u\n", interval);
+ seq_printf(m, "late_count %u\n", stats.late_count);
+ return 0;
+}
+
+static ssize_t memory_periodic_aging_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ unsigned int new_interval;
+ int err;
+
+ if (!lru_gen_enabled())
+ return -EOPNOTSUPP;
+
+ buf = strstrip(buf);
+ if (!buf)
+ return -EINVAL;
+
+ err = kstrtouint(buf, 0, &new_interval);
+ if (err)
+ return err;
+
+ err = kold_set_interval(new_interval);
+ if (err)
+ return err;
+
+ return nbytes;
+}
+#endif /* CONFIG_LRU_GEN */
+
static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
@@ -6679,6 +6723,14 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
+#ifdef CONFIG_LRU_GEN
+ {
+ .name = "periodic_aging",
+ .flags = CFTYPE_ONLY_ON_ROOT,
+ .seq_show = memory_periodic_aging_show,
+ .write = memory_periodic_aging_write,
+ },
+#endif
{ } /* terminate */
};

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04d8b88e5216..0fea21366fc8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -54,6 +54,7 @@
#include <linux/shmem_fs.h>
#include <linux/ctype.h>
#include <linux/debugfs.h>
+#include <linux/kold.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -5279,8 +5280,10 @@ static void lru_gen_change_state(bool enabled)

if (enabled)
static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
- else
+ else {
static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
+ kold_set_interval(0);
+ }

memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
@@ -5760,6 +5763,36 @@ static const struct file_operations lru_gen_ro_fops = {
.release = seq_release,
};

+/******************************************************************************
+ * periodic aging (kold)
+ ******************************************************************************/
+
+/* age the lruvec if its youngest generation is older than min_ttl;
+ * return the creation timestamp of the resulting youngest generation
+ */
+unsigned long lru_gen_force_age_lruvec(struct mem_cgroup *memcg, int nid,
+ unsigned long min_ttl)
+{
+ struct scan_control sc = {
+ .may_writepage = true,
+ .may_unmap = true,
+ .may_swap = true,
+ .reclaim_idx = MAX_NR_ZONES - 1,
+ .gfp_mask = GFP_KERNEL,
+ };
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+ DEFINE_MAX_SEQ(lruvec);
+ int gen = lru_gen_from_seq(max_seq);
+ unsigned long birth_timestamp =
+ READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+ if (time_is_before_jiffies(birth_timestamp + min_ttl))
+ try_to_inc_max_seq(lruvec, max_seq, &sc, true, true);
+
+ return READ_ONCE(lruvec->lrugen.timestamps[lru_gen_from_seq(
+ READ_ONCE((lruvec)->lrugen.max_seq))]);
+}
+
/******************************************************************************
* initialization
******************************************************************************/
--
2.39.0.314.g84b9a713c41-goog

2023-01-10 07:20:52

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH 0/2] mm: multi-gen LRU: working set extensions

Yuanchu Xie <[email protected]> writes:

> Introduce a way of monitoring the working set of a workload, per page
> type and per NUMA node, with granularity in minutes. It has page-level
> granularity and minimal memory overhead by building on the
> Multi-generational LRU framework, which already has most of the
> infrastructure and is just missing a useful interface.
>
> MGLRU organizes pages in generations, where an older generation contains
> colder pages, and aging promotes the recently used pages into the young
> generation and creates a new one. The working set size is how much
> memory an application needs to keep working, the amount of "hot" memory
> that's frequently used. The only missing pieces between MGLRU
> generations and working set estimation are a consistent aging cadence
> and an interface; we introduce the two additions.

So with the kold kthread, do we need aging in reclaim? Should we switch
reclaim to wake up the kold kthread to do aging instead of doing
try_to_inc_max_seq? This would also help us to try different aging
mechanisms which can run better in a kthread.


>
> Periodic aging
> ======
> MGLRU aging is currently driven by reclaim, so the amount of time
> between generations is non-deterministic. With memcgs being aged
> regularly, MGLRU generations become time-based working set information.
>
> - memory.periodic_aging: a new root-level only file in cgroupfs
> Writing to memory.periodic_aging sets the aging interval and opts into
> periodic aging.
> - kold: a new kthread that ages memcgs based on the set aging interval.
>
> Page idle age stats
> ======
> - memory.page_idle_age: we group pages into idle age ranges, and present
> the number of pages per node per pagetype in each range. This
> aggregates the time information from MGLRU generations hierarchically.
>
> Use case: proactive reclaimer
> ======
> The proactive reclaimer sets the aging interval and periodically reads
> the page idle age stats to form a working set estimation, from which it
> calculates an amount to write to memory.reclaim.
>
> With the page idle age stats, a proactive reclaimer could calculate a
> precise amount of memory to reclaim without continuously probing and
> inducing reclaim.
>
> A proactive reclaimer that uses a similar interface is used in the
> Google data centers.
>
> Use case: workload introspection
> ======
> A workload may use the working set estimates to adjust application
> behavior as needed, e.g. preemptively killing some of its workers to
> avoid its working set thrashing, or dropping caches to fit within a
> limit.
> It can also be valuable to application developers, who can benefit from
> an out-of-the-box overview of the application's usage behaviors.
>
> TODO List
> ======
> - selftests
> - a userspace demonstrator combining periodic aging, page idle age
> stats, memory.reclaim, and/or PSI
>
> Open questions
> ======
> - The MGLRU aging mechanism has a flag called force_scan. With
> force_scan=false, invoking MGLRU aging when an lruvec has the maximum
> number of generations does not actually perform aging.
> However, with force_scan=true, MGLRU moves the pages in the oldest
> generation to the second oldest generation. The force_scan=true flag
> also disables some optimizations in MGLRU's page table walks.
> The current patch sets force_scan=true, so that periodic aging would
> work without a proactive reclaimer evicting the oldest generation.
>
> - The page idle age format uses a fixed set of time ranges in seconds.
> I have considered having it be based on the aging interval, or just
> compiling the raw timestamps.
> With the age ranges based on the aging interval, a memcg that's
> undergoing memcg reclaim might have its generations in the 10
> seconds range, and a much longer aging interval would obscure this
> fact.
> The raw timestamps from MGLRU could lead to a very large file when
> aggregated hierarchically.
>
> Yuanchu Xie (2):
> mm: multi-gen LRU: periodic aging
> mm: multi-gen LRU: cgroup working set stats
>
> include/linux/kold.h | 44 ++++++++++
> include/linux/mmzone.h | 4 +-
> mm/Makefile | 3 +
> mm/kold.c | 150 ++++++++++++++++++++++++++++++++
> mm/memcontrol.c | 188 +++++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 35 +++++++-
> 6 files changed, 422 insertions(+), 2 deletions(-)
> create mode 100644 include/linux/kold.h
> create mode 100644 mm/kold.c
>
> --
> 2.39.0.314.g84b9a713c41-goog

2023-01-11 14:34:39

by Michal Koutný

Subject: Re: [RFC PATCH 0/2] mm: multi-gen LRU: working set extensions

On Wed, Dec 14, 2022 at 02:51:21PM -0800, Yuanchu Xie <[email protected]> wrote:
> that's frequently used. The only missing pieces between MGLRU
> generations and working set estimation are a consistent aging cadence
> and an interface; we introduce the two additions.
>
> Periodic aging
> ======
> MGLRU aging is currently driven by reclaim, so the amount of time
> between generations is non-deterministic. With memcgs being aged
> regularly, MGLRU generations become time-based working set information.

Is this periodic aging specific to memcgs? IOW, periodic aging isn't
needed without memcgs (~with root only)?
(Perhaps a similar question to Aneesh's.)

> Use case: proactive reclaimer
> ======
> The proactive reclaimer sets the aging interval and periodically reads
> the page idle age stats to form a working set estimation, from which it
> calculates an amount to write to memory.reclaim.
>
> With the page idle age stats, a proactive reclaimer could calculate a
> precise amount of memory to reclaim without continuously probing and
> inducing reclaim.

Could the aging also be made per-memcg? (Similar to memory.reclaim,
possibly without the new kthread, if global reclaim's aging is enough.)

Thanks,
Michal



2023-01-11 14:48:06

by Michal Koutný

Subject: Re: [RFC PATCH 2/2] mm: multi-gen LRU: cgroup working set stats

On Wed, Dec 14, 2022 at 02:51:23PM -0800, Yuanchu Xie <[email protected]> wrote:
> +static int memory_page_idle_age_format(struct mem_cgroup *root,
> + struct seq_file *m)
> +{
> + struct mem_cgroup *memcg;
> + unsigned long *table;
[...]
> + table = kmalloc_array(PAGE_IDLE_AGE_NR_RANGES * nr_node_ids *
> + ANON_AND_FILE,
> + sizeof(*table), __GFP_ZERO | GFP_KERNEL);
> +
[...]
> + memory_page_idle_age_print(m, table);
> + return 0;

FTR, the table seems leaked here.


Michal



2023-01-12 02:27:03

by Yuanchu Xie

Subject: Re: [RFC PATCH 0/2] mm: multi-gen LRU: working set extensions

On Wed, Jan 11, 2023 at 6:17 AM Michal Koutný <[email protected]> wrote:
>
> On Wed, Dec 14, 2022 at 02:51:21PM -0800, Yuanchu Xie <[email protected]> wrote:
> > that's frequently used. The only missing pieces between MGLRU
> > generations and working set estimation are a consistent aging cadence
> > and an interface; we introduce the two additions.
> >
> > Periodic aging
> > ======
> > MGLRU aging is currently driven by reclaim, so the amount of time
> > between generations is non-deterministic. With memcgs being aged
> > regularly, MGLRU generations become time-based working set information.
>
> Is this periodic aging specific to memcgs? IOW, periodic aging isn't
> needed without memcgs (~with root only)?
> (Perhaps a similar question to Aneesh's.)
Originally, I didn't see much value in periodic aging without memcgs,
as the main goal was to provide working set information.
Periodic aging might lead to MGLRU making better reclaim decisions,
but I don't have any benchmarks to back it up right now.

>
> > Use case: proactive reclaimer
> > ======
> > The proactive reclaimer sets the aging interval and periodically reads
> > the page idle age stats to form a working set estimation, from which it
> > calculates an amount to write to memory.reclaim.
> >
> > With the page idle age stats, a proactive reclaimer could calculate a
> > precise amount of memory to reclaim without continuously probing and
> > inducing reclaim.
>
> Could the aging also be made per-memcg? (Similar to memory.reclaim,
> possibly without the new kthread, if global reclaim's aging is enough.)
It is possible. We could have hierarchical aging, invoked by writing a
time duration to a memory.aging file: for every descendant memcg whose
youngest generation was created before (current time - specified
duration), perform aging.
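For example, with this hypothetical interface, "echo 30 > memory.aging"
would age every descendant memcg whose youngest generation is more than
30 seconds old.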
However, now we need a userspace tool to drive the aging, invoking
this interface every few seconds, since every memcg is aged at a
different cadence.
Having a kthread perform aging has the benefit of simplicity: it gives
a single source of truth for the aging interval and makes the feature
more accessible. Application developers who want to look at the page
idle age stats could do so without additional ceremony.

Thanks,
Yuanchu

2023-01-12 02:27:10

by Yuanchu Xie

Subject: Re: [RFC PATCH 0/2] mm: multi-gen LRU: working set extensions

On Mon, Jan 9, 2023 at 10:25 PM Aneesh Kumar K.V
<[email protected]> wrote:
>
> Yuanchu Xie <[email protected]> writes:
>
> > Introduce a way of monitoring the working set of a workload, per page
> > type and per NUMA node, with granularity in minutes. It has page-level
> > granularity and minimal memory overhead by building on the
> > Multi-generational LRU framework, which already has most of the
> > infrastructure and is just missing a useful interface.
> >
> > MGLRU organizes pages in generations, where an older generation contains
> > colder pages, and aging promotes the recently used pages into the young
> > generation and creates a new one. The working set size is how much
> > memory an application needs to keep working, the amount of "hot" memory
> > that's frequently used. The only missing pieces between MGLRU
> > generations and working set estimation are a consistent aging cadence
> > and an interface; we introduce the two additions.
>
> So with the kold kthread, do we need aging in reclaim? Should we switch
> reclaim to wake up the kold kthread to do aging instead of doing
> try_to_inc_max_seq? This would also help us to try different aging
> mechanisms which can run better in a kthread.
If I understand correctly, MGLRU tries to put off aging as much as
possible for reclaim, and prefers aging uniformly in kswapd, so that
already sort of happens. With periodic aging on, reclaim only triggers
aging when the memory it is reclaiming is less than
(MIN_NR_GENS * aging_interval) cold.
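For instance, assuming MIN_NR_GENS is 2 (its current value) and a
60-second aging interval, reclaim would drive aging only once it is
down to memory that has been idle for less than about two minutes.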