2022-08-15 07:25:26

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v14 00/14] Multi-Gen LRU Framework

What's new
==========
Retested on v6.0-rc1; rebased to the latest mm-unstable.

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/[email protected]/

01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.

03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
its sole caller"
Minor refactors to improve readability for the following patches.

05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.

06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.

07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.

08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.

09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.

10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.

11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.

13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
Apache Cassandra Memcached
Apache Hadoop MongoDB
Apache Spark PostgreSQL
MariaDB (MySQL) Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
less wall time to sort three billion random integers, respectively,
under the medium- and the high-concurrency conditions, when
overcommitting memory. There were no statistically significant
changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
more transactions per minute (TPM), respectively, under the medium-
and the high-concurrency conditions, when overcommitting memory.
There were no statistically significant changes in TPM for the rest
of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
and [21.59, 30.02]% more operations per second (OPS), respectively,
for sequential access, random access and Gaussian (distribution)
access, when THP=always; 95% CIs [13.85, 15.97]% and
[23.94, 29.92]% more OPS, respectively, for random access and
Gaussian access, when THP=never. There were no statistically
significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
[2.16, 3.55]% more operations per second (OPS), respectively, for
exponential (distribution) access, random access and Zipfian
(distribution) access, when underutilizing memory; 95% CIs
[8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
respectively, for exponential access, random access and Zipfian
access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
and [4.11, 7.50]% more operations per second (OPS), respectively,
for exponential (distribution) access, random access and Zipfian
(distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
[6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
exponential access, random access and Zipfian access, when swap was
on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
less average wall time to finish twelve parallel TeraSort jobs,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in average wall time for the rest of the
benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
minute (TPM) under the high-concurrency condition, when swap was
off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
[11.47, 19.36]% more total operations per second (OPS),
respectively, for sequential access, random access and Gaussian
(distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
[10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
for sequential access, random access and Gaussian access, when
THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
fs_fio_bench_hdd_mq pft
fs_lmbench pgsql-hammerdb
fs_parallelio redis
fs_postmark stream
hackbench sysbenchthread
kernbench tpcc_spark
memcached unixbench
multichase vm-scalability
mutilate will-it-scale
nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/r/[email protected]/
[03] https://lore.kernel.org/r/[email protected]/
[04] https://lore.kernel.org/r/[email protected]/
[05] https://lore.kernel.org/r/[email protected]/
[06] https://lore.kernel.org/r/[email protected]/
[07] https://lore.kernel.org/r/[email protected]/
[08] https://lore.kernel.org/r/[email protected]/
[09] https://lore.kernel.org/r/[email protected]/
[10] https://lore.kernel.org/r/[email protected]/

Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
I have Archlinux with 8G RAM + zswap + swap. While developing, I
have lots of apps opened such as multiple LSP-servers for different
langs, chats, two browsers, etc... Usually, my system gets quickly
to a point of SWAP-storms, where I have to kill LSP-servers,
restart browsers to free memory, etc, otherwise the system lags
heavily and is barely usable.

1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
patchset, and I started up by opening lots of apps to create memory
pressure, and worked for a day like this. Till now I had not a
single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
getting to the point of 3G in SWAP before without a single
SWAP-storm.

Vaibhav from IBM reported [12]:
In a synthetic MongoDB Benchmark, seeing an average of ~19%
throughput improvement on POWER10(Radix MMU + 64K Page Size) with
MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
three different request distributions, namely, Exponential, Uniform
and Zipfan.

Shuang from U of Rochester reported [13]:
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random
access, Zipfian (distribution) access and Gaussian (distribution)
access, when the average number of jobs per CPU is 1; 95% CIs
[42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
throughput, respectively, for random access, Zipfian access and
Gaussian access, when the average number of jobs per CPU is 2.

Daniel from Michigan Tech reported [14]:
With Memcached allocating ~100GB of byte-addressable Optante,
performance improvement in terms of throughput (measured as queries
per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of Chrome OS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.

The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. Chrome OS [19]
5. Liquorix [20]
6. post-factum [21]
7. XanMod [22]

[11] https://lore.kernel.org/r/[email protected]/
[12] https://lore.kernel.org/r/[email protected]/
[13] https://lore.kernel.org/r/[email protected]/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://codeberg.org/pf-kernel
[22] https://xanmod.org

Summery
=======
The facts are:
1. The independent lab results and the real-world applications
indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have been
demonstrated similar effects.

Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
will likely be put to use for both personal computers and data
centers.
3. Based on Google's track record, the new code will likely be well
maintained in the long term. It'd be more difficult if not
impossible to achieve similar effects with other approaches.

Yu Zhao (14):
mm: x86, arm64: add arch_has_hw_pte_young()
mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
mm/vmscan.c: refactor shrink_node()
Revert "include/linux/mm_inline.h: fold __update_lru_size() into its
sole caller"
mm: multi-gen LRU: groundwork
mm: multi-gen LRU: minimal implementation
mm: multi-gen LRU: exploit locality in rmap
mm: multi-gen LRU: support page table walks
mm: multi-gen LRU: optimize multiple memcgs
mm: multi-gen LRU: kill switch
mm: multi-gen LRU: thrashing prevention
mm: multi-gen LRU: debugfs interface
mm: multi-gen LRU: admin guide
mm: multi-gen LRU: design doc

Documentation/admin-guide/mm/index.rst | 1 +
Documentation/admin-guide/mm/multigen_lru.rst | 156 +
Documentation/mm/index.rst | 1 +
Documentation/mm/multigen_lru.rst | 159 +
arch/Kconfig | 8 +
arch/arm64/include/asm/pgtable.h | 15 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 9 +-
arch/x86/mm/pgtable.c | 5 +-
fs/exec.c | 2 +
fs/fuse/dev.c | 3 +-
include/linux/cgroup.h | 15 +-
include/linux/memcontrol.h | 36 +
include/linux/mm.h | 5 +
include/linux/mm_inline.h | 231 +-
include/linux/mm_types.h | 77 +
include/linux/mmzone.h | 214 ++
include/linux/nodemask.h | 1 +
include/linux/page-flags-layout.h | 16 +-
include/linux/page-flags.h | 4 +-
include/linux/pgtable.h | 17 +-
include/linux/sched.h | 4 +
include/linux/swap.h | 4 +
kernel/bounds.c | 7 +
kernel/cgroup/cgroup-internal.h | 1 -
kernel/exit.c | 1 +
kernel/fork.c | 9 +
kernel/sched/core.c | 1 +
mm/Kconfig | 26 +
mm/huge_memory.c | 3 +-
mm/internal.h | 1 +
mm/memcontrol.c | 28 +
mm/memory.c | 39 +-
mm/mm_init.c | 6 +-
mm/mmzone.c | 2 +
mm/rmap.c | 6 +
mm/swap.c | 54 +-
mm/vmscan.c | 2972 ++++++++++++++++-
mm/workingset.c | 110 +-
39 files changed, 4095 insertions(+), 155 deletions(-)
create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
create mode 100644 Documentation/mm/multigen_lru.rst


base-commit: d2af7b221349ff6241e25fa8c67bcfae2b360700
--
2.37.1.595.g718a3a8f04-goog


2022-08-15 07:26:21

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v14 13/14] mm: multi-gen LRU: admin guide

Add an admin guide.

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
Documentation/admin-guide/mm/index.rst | 1 +
Documentation/admin-guide/mm/multigen_lru.rst | 156 ++++++++++++++++++
mm/Kconfig | 3 +-
mm/vmscan.c | 4 +
4 files changed, 163 insertions(+), 1 deletion(-)
create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 1bd11118dfb1..d1064e0ba34a 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
+ multigen_lru
nommu-mmap
numa_memory_policy
numaperf
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..6355f2b5019d
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,156 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enabled`` accepts different values to enable or disable the
+following components. Its default value depends on
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
+unless some of them have unforeseen side effects. Writing to
+``enabled`` has no effect when a component is not supported by the
+hardware, and valid values will be accepted even when the main switch
+is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+ batches, when MMU sets it (e.g., on x86). This behavior can
+ theoretically worsen lock contention (mmap_lock). If it is
+ disabled, the multi-gen LRU will suffer a minor performance
+ degradation for workloads that contiguously map hot pages,
+ whose accessed bits can be otherwise cleared by fewer larger
+ batches.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+ well, when MMU sets it (e.g., on x86). This behavior was not
+ verified on x86 varieties other than Intel and AMD. If it is
+ disabled, the multi-gen LRU will suffer a negligible
+ performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+ echo y >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0007
+ echo 5 >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+The default value ``0`` means disabled.
+
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, so does
+concatenation with delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application needs
+in a given time interval, and it is usually done with little impact on
+the performance of the application. E.g., data centers want to
+optimize job scheduling (bin packing) to improve memory utilizations.
+When a new job comes in, the job scheduler needs to find out whether
+each server it manages can allocate a certain amount of memory for
+this new job before it can pick a candidate. To do so, the job
+scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram. The
+histograms are noncumulative.
+::
+
+ memcg memcg_id memcg_path
+ node node_id
+ min_gen_nr age_in_ms nr_anon_pages nr_file_pages
+ ...
+ max_gen_nr age_in_ms nr_anon_pages nr_file_pages
+
+Each bin contains an estimated number of pages that have been accessed
+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
+the former is the largest and that of the latter is the smallest.
+
+Users can write ``+ memcg_id node_id max_gen_nr
+[can_swap [force_scan]]`` to ``lru_gen`` to create a new generation
+``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it
+is set to ``1``, it forces the scan of anon pages when swap is off,
+and vice versa. ``force_scan`` defaults to ``1`` and, if it is set to
+``0``, it employs heuristics to reduce the overhead, which is likely
+to reduce the coverage as well.
+
+A typical use case is that a job scheduler writes to ``lru_gen`` at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold pages defined by
+this time interval.
+
+Proactive reclaim
+-----------------
+Proactive reclaim induces page reclaim when there is no memory
+pressure. It usually targets cold pages only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim cold pages on
+the server it selected to improve the chance of successfully landing
+this new job.
+
+Users can write ``- memcg_id node_id min_gen_nr [swappiness
+[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
+equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
+``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
+aged and therefore cannot be evicted. ``swappiness`` overrides the
+default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
+the number of pages to evict.
+
+A typical use case is that a job scheduler writes to ``lru_gen``
+before it tries to land a new job on a server. If it fails to
+materialize enough cold pages because of the overestimation, it
+retries on the next server according to the ranking result obtained
+from the working set estimation step. This less forceful approach
+limits the impacts on the existing jobs.
diff --git a/mm/Kconfig b/mm/Kconfig
index 6c86849c4db9..96cd3ae25c6f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1131,7 +1131,8 @@ config LRU_GEN
# make sure folio->flags has enough spare bits
depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
help
- A high performance LRU implementation to overcommit memory.
+ A high performance LRU implementation to overcommit memory. See
+ Documentation/admin-guide/mm/multigen_lru.rst for details.

config LRU_GEN_ENABLED
bool "Enable by default"
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 509989fb39ef..f693720047db 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5288,6 +5288,7 @@ static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, c
return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
}

+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t len)
{
@@ -5321,6 +5322,7 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
}

+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t len)
{
@@ -5468,6 +5470,7 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
seq_putc(m, '\n');
}

+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
static int lru_gen_seq_show(struct seq_file *m, void *v)
{
unsigned long seq;
@@ -5626,6 +5629,7 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
return err;
}

+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
size_t len, loff_t *pos)
{
--
2.37.1.595.g718a3a8f04-goog

2022-08-15 07:39:55

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v14 07/14] mm: multi-gen LRU: exploit locality in rmap

Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, searching the rmap
can incur the highest CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.

Server benchmark results:
Single workload:
fio (buffered I/O): no change

Single workload:
memcached (anon): +[3, 5]%
Ops/sec KB/sec
patch1-6: 1106168.46 43025.04
patch1-7: 1147696.57 44640.29

Configurations:
no change

Client benchmark results:
kswapd profiles:
patch1-6
39.03% lzo1x_1_do_compress (real work)
18.47% page_vma_mapped_walk (overhead)
6.74% _raw_spin_unlock_irq
3.97% do_raw_spin_lock
2.49% ptep_clear_flush
2.48% anon_vma_interval_tree_iter_first
1.92% folio_referenced_one
1.88% __zram_bvec_write
1.48% memmove
1.31% vma_interval_tree_iter_next

patch1-7
48.16% lzo1x_1_do_compress (real work)
8.20% page_vma_mapped_walk (overhead)
7.06% _raw_spin_unlock_irq
2.92% ptep_clear_flush
2.53% __zram_bvec_write
2.11% do_raw_spin_lock
2.02% memmove
1.93% lru_gen_look_around
1.56% free_unref_page_list
1.40% memset

Configurations:
no change

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Barry Song <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
include/linux/memcontrol.h | 31 +++++++
include/linux/mm.h | 5 +
include/linux/mmzone.h | 6 ++
mm/internal.h | 1 +
mm/memcontrol.c | 1 +
mm/rmap.c | 6 ++
mm/swap.c | 4 +-
mm/vmscan.c | 184 +++++++++++++++++++++++++++++++++++++
8 files changed, 236 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce55b1c0..47829f378fcb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -444,6 +444,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
+ * - mem_cgroup_trylock_pages()
*
* For a kmem folio a caller should hold an rcu read lock to protect memcg
* associated with a kmem folio from being released.
@@ -505,6 +506,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
+ * - mem_cgroup_trylock_pages()
*
* For a kmem page a caller should hold an rcu read lock to protect memcg
* associated with a kmem page from being released.
@@ -959,6 +961,23 @@ void unlock_page_memcg(struct page *page);

void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);

+/* try to stablize folio_memcg() for all the pages in a memcg */
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+ rcu_read_lock();
+
+ if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
+ return true;
+
+ rcu_read_unlock();
+ return false;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+ rcu_read_unlock();
+}
+
/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
int idx, int val)
@@ -1422,6 +1441,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
{
}

+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+ /* to match folio_memcg_rcu() */
+ rcu_read_lock();
+ return true;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+ rcu_read_unlock();
+}
+
static inline void mem_cgroup_handle_over_high(void)
{
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fbe2e72e7bca..8ff7227c6cb1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1490,6 +1490,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
return page_to_pfn(&folio->page);
}

+static inline struct folio *pfn_folio(unsigned long pfn)
+{
+ return page_folio(pfn_to_page(pfn));
+}
+
static inline atomic_t *folio_pincount_ptr(struct folio *folio)
{
return &folio_page(folio, 1)->compound_pincount;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 019d7c8ee834..850c6171af68 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -375,6 +375,7 @@ enum lruvec_flags {
#ifndef __GENERATING_BOUNDS_H

struct lruvec;
+struct page_vma_mapped_walk;

#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
@@ -430,6 +431,7 @@ struct lru_gen_struct {
};

void lru_gen_init_lruvec(struct lruvec *lruvec);
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

#ifdef CONFIG_MEMCG
void lru_gen_init_memcg(struct mem_cgroup *memcg);
@@ -442,6 +444,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}

+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
#ifdef CONFIG_MEMCG
static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
{
diff --git a/mm/internal.h b/mm/internal.h
index 4df67b6b8cce..0082d5fdddac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -83,6 +83,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
void folio_rotate_reclaimable(struct folio *folio);
bool __folio_end_writeback(struct folio *folio);
void deactivate_file_folio(struct folio *folio);
+void folio_activate(struct folio *folio);

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5fd38d12149c..882180866e31 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2789,6 +2789,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
+ * - mem_cgroup_trylock_pages()
*/
folio->memcg_data = (unsigned long)memcg;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 28aef434ea41..7dc6d77ae865 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -825,6 +825,12 @@ static bool folio_referenced_one(struct folio *folio,
}

if (pvmw.pte) {
+ if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
+ !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+ lru_gen_look_around(&pvmw);
+ referenced++;
+ }
+
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte)) {
/*
diff --git a/mm/swap.c b/mm/swap.c
index f74fd51fa9e1..0a3871a70952 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -366,7 +366,7 @@ static void folio_activate_drain(int cpu)
folio_batch_move_lru(fbatch, folio_activate_fn);
}

-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
{
if (folio_test_lru(folio) && !folio_test_active(folio) &&
!folio_test_unevictable(folio)) {
@@ -385,7 +385,7 @@ static inline void folio_activate_drain(int cpu)
{
}

-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
{
struct lruvec *lruvec;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4c57fb749a74..f365386eb441 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1635,6 +1635,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
if (!sc->may_unmap && folio_mapped(folio))
goto keep_locked;

+ /* folio_update_gen() tried to promote this page? */
+ if (lru_gen_enabled() && !ignore_references &&
+ folio_mapped(folio) && folio_test_referenced(folio))
+ goto keep_locked;
+
/*
* The number of dirty pages determines if a node is marked
* reclaim_congested. kswapd will stall and start writing
@@ -3219,6 +3224,29 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
* the aging
******************************************************************************/

+/* promote pages accessed through page tables */
+static int folio_update_gen(struct folio *folio, int gen)
+{
+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
+
+ VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
+ VM_WARN_ON_ONCE(!rcu_read_lock_held());
+
+ do {
+ /* lru_gen_del_folio() has isolated this page? */
+ if (!(old_flags & LRU_GEN_MASK)) {
+ /* for shrink_page_list() */
+ new_flags = old_flags | BIT(PG_referenced);
+ continue;
+ }
+
+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
+ new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
+
+ return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
/* protect pages accessed multiple times through file descriptors */
static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
{
@@ -3230,6 +3258,11 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio);

do {
+ new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+ /* folio_update_gen() has promoted this page? */
+ if (new_gen >= 0 && new_gen != old_gen)
+ return new_gen;
+
new_gen = (old_gen + 1) % MAX_NR_GENS;

new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
@@ -3244,6 +3277,43 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
return new_gen;
}

+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
+{
+ unsigned long pfn = pte_pfn(pte);
+
+ VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end);
+
+ if (!pte_present(pte) || is_zero_pfn(pfn))
+ return -1;
+
+ if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte)))
+ return -1;
+
+ if (WARN_ON_ONCE(!pfn_valid(pfn)))
+ return -1;
+
+ return pfn;
+}
+
+static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
+ struct pglist_data *pgdat)
+{
+ struct folio *folio;
+
+ /* try to avoid unnecessary memory loads */
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return NULL;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ return NULL;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ return NULL;
+
+ return folio;
+}
+
static void inc_min_seq(struct lruvec *lruvec, int type)
{
struct lru_gen_struct *lrugen = &lruvec->lrugen;
@@ -3445,6 +3515,114 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
}

+/*
+ * This function exploits spatial locality when shrink_page_list() walks the
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ */
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+ int i;
+ pte_t *pte;
+ unsigned long start;
+ unsigned long end;
+ unsigned long addr;
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
+ struct folio *folio = pfn_folio(pvmw->pfn);
+ struct mem_cgroup *memcg = folio_memcg(folio);
+ struct pglist_data *pgdat = folio_pgdat(folio);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ DEFINE_MAX_SEQ(lruvec);
+ int old_gen, new_gen = lru_gen_from_seq(max_seq);
+
+ lockdep_assert_held(pvmw->ptl);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
+
+ if (spin_is_contended(pvmw->ptl))
+ return;
+
+ start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+ end = min(pvmw->address | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1;
+
+ if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
+ if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
+ end = start + MIN_LRU_BATCH * PAGE_SIZE;
+ else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
+ start = end - MIN_LRU_BATCH * PAGE_SIZE;
+ else {
+ start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
+ end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
+ }
+ }
+
+ pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+
+ rcu_read_lock();
+ arch_enter_lazy_mmu_mode();
+
+ for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ unsigned long pfn;
+
+ pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
+ if (pfn == -1)
+ continue;
+
+ if (!pte_young(pte[i]))
+ continue;
+
+ folio = get_pfn_folio(pfn, memcg, pgdat);
+ if (!folio)
+ continue;
+
+ if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
+ continue;
+
+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ folio_mark_dirty(folio);
+
+ old_gen = folio_lru_gen(folio);
+ if (old_gen < 0)
+ folio_set_referenced(folio);
+ else if (old_gen != new_gen)
+ __set_bit(i, bitmap);
+ }
+
+ arch_leave_lazy_mmu_mode();
+ rcu_read_unlock();
+
+ if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+ folio = pfn_folio(pte_pfn(pte[i]));
+ folio_activate(folio);
+ }
+ return;
+ }
+
+ /* folio_update_gen() requires stable folio_memcg() */
+ if (!mem_cgroup_trylock_pages(memcg))
+ return;
+
+ spin_lock_irq(&lruvec->lru_lock);
+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+
+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+ folio = pfn_folio(pte_pfn(pte[i]));
+ if (folio_memcg_rcu(folio) != memcg)
+ continue;
+
+ old_gen = folio_update_gen(folio, new_gen);
+ if (old_gen < 0 || old_gen == new_gen)
+ continue;
+
+ lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+ }
+
+ spin_unlock_irq(&lruvec->lru_lock);
+
+ mem_cgroup_unlock_pages();
+}
+
/******************************************************************************
* the eviction
******************************************************************************/
@@ -3481,6 +3659,12 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
return true;
}

+ /* promoted */
+ if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+ return true;
+ }
+
/* protected */
if (tier > tier_idx) {
int hist = lru_hist_from_seq(lrugen->min_seq[type]);
--
2.37.1.595.g718a3a8f04-goog

2022-08-15 07:42:36

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v14 05/14] mm: multi-gen LRU: groundwork

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These techniques are
commonly used to optimize job scheduling (bin packing) in data
centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking rendering
threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros,
i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
fs/fuse/dev.c | 3 +-
include/linux/mm_inline.h | 175 ++++++++++++++++++++++++++++++
include/linux/mmzone.h | 102 +++++++++++++++++
include/linux/page-flags-layout.h | 13 ++-
include/linux/page-flags.h | 4 +-
include/linux/sched.h | 4 +
kernel/bounds.c | 5 +
mm/Kconfig | 8 ++
mm/huge_memory.c | 3 +-
mm/memcontrol.c | 2 +
mm/memory.c | 25 +++++
mm/mm_init.c | 6 +-
mm/mmzone.c | 2 +
mm/swap.c | 11 +-
mm/vmscan.c | 75 +++++++++++++
15 files changed, 424 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 51897427a534..b4a6e0a1b945 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -776,7 +776,8 @@ static int fuse_check_page(struct page *page)
1 << PG_active |
1 << PG_workingset |
1 << PG_reclaim |
- 1 << PG_waiters))) {
+ 1 << PG_waiters |
+ LRU_GEN_MASK | LRU_REFS_MASK))) {
dump_page(page, "fuse: trying to steal weird page");
return 1;
}
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index fb8aadb81cd6..2ff703900fd0 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -40,6 +40,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
{
struct pglist_data *pgdat = lruvec_pgdat(lruvec);

+ lockdep_assert_held(&lruvec->lru_lock);
+ WARN_ON_ONCE(nr_pages != (int)nr_pages);
+
__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
__mod_zone_page_state(&pgdat->node_zones[zid],
NR_ZONE_LRU_BASE + lru, nr_pages);
@@ -101,11 +104,177 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
return lru;
}

+#ifdef CONFIG_LRU_GEN
+
+static inline bool lru_gen_enabled(void)
+{
+ return true;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+ return current->in_lru_fault;
+}
+
+static inline int lru_gen_from_seq(unsigned long seq)
+{
+ return seq % MAX_NR_GENS;
+}
+
+static inline int folio_lru_gen(struct folio *folio)
+{
+ unsigned long flags = READ_ONCE(folio->flags);
+
+ return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
+{
+ unsigned long max_seq = lruvec->lrugen.max_seq;
+
+ VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
+
+ /* see the comment on MIN_NR_GENS */
+ return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
+}
+
+static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio,
+ int old_gen, int new_gen)
+{
+ int type = folio_is_file_lru(folio);
+ int zone = folio_zonenum(folio);
+ int delta = folio_nr_pages(folio);
+ enum lru_list lru = type * LRU_INACTIVE_FILE;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ VM_WARN_ON_ONCE(old_gen != -1 && old_gen >= MAX_NR_GENS);
+ VM_WARN_ON_ONCE(new_gen != -1 && new_gen >= MAX_NR_GENS);
+ VM_WARN_ON_ONCE(old_gen == -1 && new_gen == -1);
+
+ if (old_gen >= 0)
+ WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
+ lrugen->nr_pages[old_gen][type][zone] - delta);
+ if (new_gen >= 0)
+ WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
+ lrugen->nr_pages[new_gen][type][zone] + delta);
+
+ /* addition */
+ if (old_gen < 0) {
+ if (lru_gen_is_active(lruvec, new_gen))
+ lru += LRU_ACTIVE;
+ __update_lru_size(lruvec, lru, zone, delta);
+ return;
+ }
+
+ /* deletion */
+ if (new_gen < 0) {
+ if (lru_gen_is_active(lruvec, old_gen))
+ lru += LRU_ACTIVE;
+ __update_lru_size(lruvec, lru, zone, -delta);
+ return;
+ }
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+ unsigned long seq;
+ unsigned long flags;
+ int gen = folio_lru_gen(folio);
+ int type = folio_is_file_lru(folio);
+ int zone = folio_zonenum(folio);
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ VM_WARN_ON_ONCE_FOLIO(gen != -1, folio);
+
+ if (folio_test_unevictable(folio))
+ return false;
+ /*
+ * There are three common cases for this page:
+ * 1. If it's hot, e.g., freshly faulted in or previously hot and
+ * migrated, add it to the youngest generation.
+ * 2. If it's cold but can't be evicted immediately, i.e., an anon page
+ * not in swapcache or a dirty page pending writeback, add it to the
+ * second oldest generation.
+ * 3. Everything else (clean, cold) is added to the oldest generation.
+ */
+ if (folio_test_active(folio))
+ seq = lrugen->max_seq;
+ else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
+ (folio_test_reclaim(folio) &&
+ (folio_test_dirty(folio) || folio_test_writeback(folio))))
+ seq = lrugen->min_seq[type] + 1;
+ else
+ seq = lrugen->min_seq[type];
+
+ gen = lru_gen_from_seq(seq);
+ flags = (gen + 1UL) << LRU_GEN_PGOFF;
+ /* see the comment on MIN_NR_GENS about PG_active */
+ set_mask_bits(&folio->flags, LRU_GEN_MASK | BIT(PG_active), flags);
+
+ lru_gen_update_size(lruvec, folio, -1, gen);
+ /* for folio_rotate_reclaimable() */
+ if (reclaiming)
+ list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+ else
+ list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+ return true;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+ unsigned long flags;
+ int gen = folio_lru_gen(folio);
+
+ if (gen < 0)
+ return false;
+
+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
+
+ /* for folio_migrate_flags() */
+ flags = !reclaiming && lru_gen_is_active(lruvec, gen) ? BIT(PG_active) : 0;
+ flags = set_mask_bits(&folio->flags, LRU_GEN_MASK, flags);
+ gen = ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+
+ lru_gen_update_size(lruvec, folio, gen, -1);
+ list_del(&folio->lru);
+
+ return true;
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline bool lru_gen_enabled(void)
+{
+ return false;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+ return false;
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+ return false;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+ return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
static __always_inline
void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);

+ if (lru_gen_add_folio(lruvec, folio, false))
+ return;
+
update_lru_size(lruvec, lru, folio_zonenum(folio),
folio_nr_pages(folio));
if (lru != LRU_UNEVICTABLE)
@@ -123,6 +292,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);

+ if (lru_gen_add_folio(lruvec, folio, true))
+ return;
+
update_lru_size(lruvec, lru, folio_zonenum(folio),
folio_nr_pages(folio));
/* This is not expected to be used on LRU_UNEVICTABLE */
@@ -140,6 +312,9 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);

+ if (lru_gen_del_folio(lruvec, folio, false))
+ return;
+
if (lru != LRU_UNEVICTABLE)
list_del(&folio->lru);
update_lru_size(lruvec, lru, folio_zonenum(folio),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 025754b0bc09..86147f04bf76 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -317,6 +317,102 @@ enum lruvec_flags {
*/
};

+#endif /* !__GENERATING_BOUNDS_H */
+
+/*
+ * Evictable pages are divided into multiple generations. The youngest and the
+ * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
+ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
+ * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the
+ * corresponding generation. The gen counter in folio->flags stores gen+1 while
+ * a page is on one of lrugen->lists[]. Otherwise it stores 0.
+ *
+ * A page is added to the youngest generation on faulting. The aging needs to
+ * check the accessed bit at least twice before handing this page over to the
+ * eviction. The first check takes care of the accessed bit set on the initial
+ * fault; the second check makes sure this page hasn't been used since then.
+ * This process, AKA second chance, requires a minimum of two generations,
+ * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
+ * LRU, e.g., /proc/vmstat, these two generations are considered active; the
+ * rest of generations, if they exist, are considered inactive. See
+ * lru_gen_is_active().
+ *
+ * PG_active is always cleared while a page is on one of lrugen->lists[] so that
+ * the aging needs not to worry about it. And it's set again when a page
+ * considered active is isolated for non-reclaiming purposes, e.g., migration.
+ * See lru_gen_add_folio() and lru_gen_del_folio().
+ *
+ * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the
+ * number of categories of the active/inactive LRU when keeping track of
+ * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits
+ * in folio->flags.
+ */
+#define MIN_NR_GENS 2U
+#define MAX_NR_GENS 4U
+
+#ifndef __GENERATING_BOUNDS_H
+
+struct lruvec;
+
+#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
+#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
+
+#ifdef CONFIG_LRU_GEN
+
+enum {
+ LRU_GEN_ANON,
+ LRU_GEN_FILE,
+};
+
+/*
+ * The youngest generation number is stored in max_seq for both anon and file
+ * types as they are aged on an equal footing. The oldest generation numbers are
+ * stored in min_seq[] separately for anon and file types as clean file pages
+ * can be evicted regardless of swap constraints.
+ *
+ * Normally anon and file min_seq are in sync. But if swapping is constrained,
+ * e.g., out of swap space, file min_seq is allowed to advance and leave anon
+ * min_seq behind.
+ *
+ * The number of pages in each generation is eventually consistent and therefore
+ * can be transiently negative.
+ */
+struct lru_gen_struct {
+ /* the aging increments the youngest generation number */
+ unsigned long max_seq;
+ /* the eviction increments the oldest generation numbers */
+ unsigned long min_seq[ANON_AND_FILE];
+ /* the multi-gen LRU lists, lazily sorted on eviction */
+ struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+ /* the multi-gen LRU sizes, eventually consistent */
+ long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+};
+
+void lru_gen_init_lruvec(struct lruvec *lruvec);
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg);
+void lru_gen_exit_memcg(struct mem_cgroup *memcg);
+#endif
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
+{
+}
+#endif
+
+#endif /* CONFIG_LRU_GEN */
+
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
/* per lruvec lru_lock for memcg */
@@ -334,6 +430,10 @@ struct lruvec {
unsigned long refaults[ANON_AND_FILE];
/* Various lruvec state flags (enum lruvec_flags) */
unsigned long flags;
+#ifdef CONFIG_LRU_GEN
+ /* evictable pages divided into generations */
+ struct lru_gen_struct lrugen;
+#endif
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
@@ -749,6 +849,8 @@ static inline bool zone_is_empty(struct zone *zone)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
#define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)
#define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
+#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
+#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH)

/*
* Define the bit shifts to access each section. For non-existent
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index ef1e3e736e14..240905407a18 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -55,7 +55,8 @@
#define SECTIONS_WIDTH 0
#endif

-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
+ <= BITS_PER_LONG - NR_PAGEFLAGS
#define NODES_WIDTH NODES_SHIFT
#elif defined(CONFIG_SPARSEMEM_VMEMMAP)
#error "Vmemmap: No space for nodes field in page flags"
@@ -89,8 +90,8 @@
#define LAST_CPUPID_SHIFT 0
#endif

-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
- <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+ KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
#else
#define LAST_CPUPID_WIDTH 0
@@ -100,10 +101,12 @@
#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
#endif

-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
- > BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+ KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
#error "Not enough bits in page flags"
#endif

+#define LRU_REFS_WIDTH 0
+
#endif
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 465ff35a8c00..0b0ae5084e60 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1058,7 +1058,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
1UL << PG_private | 1UL << PG_private_2 | \
1UL << PG_writeback | 1UL << PG_reserved | \
1UL << PG_slab | 1UL << PG_active | \
- 1UL << PG_unevictable | __PG_MLOCKED)
+ 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK)

/*
* Flags checked when a page is prepped for return by the page allocator.
@@ -1069,7 +1069,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
* alloc-free cycle to prevent from reusing the page.
*/
#define PAGE_FLAGS_CHECK_AT_PREP \
- (PAGEFLAGS_MASK & ~__PG_HWPOISON)
+ ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)

#define PAGE_FLAGS_PRIVATE \
(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e7b2f8a5c711..8cc46a789193 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -914,6 +914,10 @@ struct task_struct {
#ifdef CONFIG_MEMCG
unsigned in_user_fault:1;
#endif
+#ifdef CONFIG_LRU_GEN
+ /* whether the LRU algorithm may apply to this access */
+ unsigned in_lru_fault:1;
+#endif
#ifdef CONFIG_COMPAT_BRK
unsigned brk_randomized:1;
#endif
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 9795d75b09b2..5ee60777d8e4 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -22,6 +22,11 @@ int main(void)
DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
#endif
DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
+#ifdef CONFIG_LRU_GEN
+ DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
+#else
+ DEFINE(LRU_GEN_WIDTH, 0);
+#endif
/* End of constants */

return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 0331f1461f81..d95f07cd6dcf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1124,6 +1124,14 @@ config PTE_MARKER_UFFD_WP
purposes. It is required to enable userfaultfd write protection on
file-backed memory types like shmem and hugetlbfs.

+config LRU_GEN
+ bool "Multi-Gen LRU"
+ depends on MMU
+ # make sure folio->flags has enough spare bits
+ depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
+ help
+ A high performance LRU implementation to overcommit memory.
+
source "mm/damon/Kconfig"

endmenu
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9afc4c3b4d49..83c47a989260 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2443,7 +2443,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
#ifdef CONFIG_64BIT
(1L << PG_arch_2) |
#endif
- (1L << PG_dirty)));
+ (1L << PG_dirty) |
+ LRU_GEN_MASK | LRU_REFS_MASK));

/* ->mapping in first tail page is compound_mapcount */
VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b69979c9ced5..5fd38d12149c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5170,6 +5170,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)

static void mem_cgroup_free(struct mem_cgroup *memcg)
{
+ lru_gen_exit_memcg(memcg);
memcg_wb_domain_exit(memcg);
__mem_cgroup_free(memcg);
}
@@ -5228,6 +5229,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
memcg->deferred_split_queue.split_queue_len = 0;
#endif
idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
+ lru_gen_init_memcg(memcg);
return memcg;
fail:
mem_cgroup_id_remove(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index 46071cf00b47..f9abc10ea7e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5111,6 +5111,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
}

+#ifdef CONFIG_LRU_GEN
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
+{
+ /* the LRU algorithm doesn't apply to sequential or random reads */
+ current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
+}
+
+static void lru_gen_exit_fault(void)
+{
+ current->in_lru_fault = false;
+}
+#else
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
+{
+}
+
+static void lru_gen_exit_fault(void)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
/*
* By the time we get here, we already hold the mm semaphore
*
@@ -5142,11 +5163,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (flags & FAULT_FLAG_USER)
mem_cgroup_enter_user_fault();

+ lru_gen_enter_fault(vma);
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
ret = __handle_mm_fault(vma, address, flags);

+ lru_gen_exit_fault();
+
if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9ddaf0e1b0ab..0d7b2bd2454a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)

shift = 8 * sizeof(unsigned long);
width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
- - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
+ - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
- "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
+ "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
SECTIONS_WIDTH,
NODES_WIDTH,
ZONES_WIDTH,
LAST_CPUPID_WIDTH,
KASAN_TAG_WIDTH,
+ LRU_GEN_WIDTH,
+ LRU_REFS_WIDTH,
NR_PAGEFLAGS);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 0ae7571e35ab..68e1511be12d 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -88,6 +88,8 @@ void lruvec_init(struct lruvec *lruvec)
* Poison its list head, so that any operations on it would crash.
*/
list_del(&lruvec->lists[LRU_UNEVICTABLE]);
+
+ lru_gen_init_lruvec(lruvec);
}

#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
diff --git a/mm/swap.c b/mm/swap.c
index 9cee7f6a3809..0e423b7d458b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -484,6 +484,11 @@ void folio_add_lru(struct folio *folio)
folio_test_unevictable(folio), folio);
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

+ /* see the comment in lru_gen_add_folio() */
+ if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
+ lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
+ folio_set_active(folio);
+
folio_get(folio);
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
@@ -575,7 +580,7 @@ static void lru_deactivate_file_fn(struct lruvec *lruvec, struct folio *folio)

static void lru_deactivate_fn(struct lruvec *lruvec, struct folio *folio)
{
- if (folio_test_active(folio) && !folio_test_unevictable(folio)) {
+ if (!folio_test_unevictable(folio) && (folio_test_active(folio) || lru_gen_enabled())) {
long nr_pages = folio_nr_pages(folio);

lruvec_del_folio(lruvec, folio);
@@ -688,8 +693,8 @@ void deactivate_page(struct page *page)
{
struct folio *folio = page_folio(page);

- if (folio_test_lru(folio) && folio_test_active(folio) &&
- !folio_test_unevictable(folio)) {
+ if (folio_test_lru(folio) && !folio_test_unevictable(folio) &&
+ (folio_test_active(folio) || lru_gen_enabled())) {
struct folio_batch *fbatch;

folio_get(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e12715202ca7..ed9e149b13c3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3050,6 +3050,81 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
return can_demote(pgdat->node_id, sc);
}

+#ifdef CONFIG_LRU_GEN
+
+/******************************************************************************
+ * shorthand helpers
+ ******************************************************************************/
+
+#define for_each_gen_type_zone(gen, type, zone) \
+ for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \
+ for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
+ for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
+
+static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid)
+{
+ struct pglist_data *pgdat = NODE_DATA(nid);
+
+#ifdef CONFIG_MEMCG
+ if (memcg) {
+ struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
+
+ /* for hotadd_new_pgdat() */
+ if (!lruvec->pgdat)
+ lruvec->pgdat = pgdat;
+
+ return lruvec;
+ }
+#endif
+ VM_WARN_ON_ONCE(!mem_cgroup_disabled());
+
+ return pgdat ? &pgdat->__lruvec : NULL;
+}
+
+/******************************************************************************
+ * initialization
+ ******************************************************************************/
+
+void lru_gen_init_lruvec(struct lruvec *lruvec)
+{
+ int gen, type, zone;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ lrugen->max_seq = MIN_NR_GENS + 1;
+
+ for_each_gen_type_zone(gen, type, zone)
+ INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+}
+
+void lru_gen_exit_memcg(struct mem_cgroup *memcg)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
+ sizeof(lruvec->lrugen.nr_pages)));
+ }
+}
+#endif
+
+static int __init init_lru_gen(void)
+{
+ BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
+ BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+
+ return 0;
+};
+late_initcall(init_lru_gen);
+
+#endif /* CONFIG_LRU_GEN */
+
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
--
2.37.1.595.g718a3a8f04-goog

2022-08-15 07:42:52

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v14 01/14] mm: x86, arm64: add arch_has_hw_pte_young()

Some architectures automatically set the accessed bit in PTEs, e.g.,
x86 and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault
following the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty
page faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it
should not be used in architecture-independent code that involves
correctness, e.g., to determine whether TLB flushes are required (in
combination with the accessed bit).

Signed-off-by: Yu Zhao <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Acked-by: Will Deacon <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 15 ++-------------
arch/x86/include/asm/pgtable.h | 6 +++---
include/linux/pgtable.h | 13 +++++++++++++
mm/memory.c | 14 +-------------
4 files changed, 19 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b5df82aa99e6..71a1af42f0e8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1082,24 +1082,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
* page after fork() + CoW for pfn mappings. We don't always have a
* hardware-managed access flag on arm64.
*/
-static inline bool arch_faults_on_old_pte(void)
-{
- /* The register read below requires a stable CPU to make any sense */
- cant_migrate();
-
- return !cpu_has_hw_af();
-}
-#define arch_faults_on_old_pte arch_faults_on_old_pte
+#define arch_has_hw_pte_young cpu_has_hw_af

/*
* Experimentally, it's cheap to set the access flag in hardware and we
* benefit from prefaulting mappings as 'old' to start with.
*/
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
- return !arch_faults_on_old_pte();
-}
-#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
+#define arch_wants_old_prefaulted_pte cpu_has_hw_af

static inline bool pud_sect_supported(void)
{
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 44e2d6f1dbaa..dc5f7d8ef68a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1431,10 +1431,10 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
}

-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(void)
{
- return false;
+ return true;
}

#ifdef CONFIG_PAGE_TABLE_CHECK
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 014ee8f0fbaa..95f408df4695 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -260,6 +260,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif

+#ifndef arch_has_hw_pte_young
+/*
+ * Return whether the accessed bit is supported on the local CPU.
+ *
+ * This stub assumes accessing through an old PTE triggers a page fault.
+ * Architectures that automatically set the access bit should overwrite it.
+ */
+static inline bool arch_has_hw_pte_young(void)
+{
+ return false;
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index b994784158f5..46071cf00b47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -126,18 +126,6 @@ int randomize_va_space __read_mostly =
2;
#endif

-#ifndef arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
-{
- /*
- * Those arches which don't have hw access flag feature need to
- * implement their own helper. By default, "true" means pagefault
- * will be hit on old pte.
- */
- return true;
-}
-#endif
-
#ifndef arch_wants_old_prefaulted_pte
static inline bool arch_wants_old_prefaulted_pte(void)
{
@@ -2871,7 +2859,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
* On architectures with software "accessed" bits, we would
* take a double page fault, so mark it accessed here.
*/
- if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+ if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
pte_t entry;

vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
--
2.37.1.595.g718a3a8f04-goog

2022-08-15 07:45:50

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v14 14/14] mm: multi-gen LRU: design doc

Add a design doc.

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
Documentation/mm/index.rst | 1 +
Documentation/mm/multigen_lru.rst | 159 ++++++++++++++++++++++++++++++
2 files changed, 160 insertions(+)
create mode 100644 Documentation/mm/multigen_lru.rst

diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index 575ccd40e30c..4aa12b8be278 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose.
ksm
memory-model
mmu_notifier
+ multigen_lru
numa
overcommit-accounting
page_migration
diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst
new file mode 100644
index 000000000000..d7062c6a8946
--- /dev/null
+++ b/Documentation/mm/multigen_lru.rst
@@ -0,0 +1,159 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Design overview
+===============
+Objectives
+----------
+The design objectives are:
+
+* Good representation of access recency
+* Try to profit from spatial locality
+* Fast paths to make obvious choices
+* Simple self-correcting heuristics
+
+The representation of access recency is at the core of all LRU
+implementations. In the multi-gen LRU, each generation represents a
+group of pages with similar access recency. Generations establish a
+(time-based) common frame of reference and therefore help make better
+choices, e.g., between different memcgs on a computer or different
+computers in a data center (for job scheduling).
+
+Exploiting spatial locality improves efficiency when gathering the
+accessed bit. A rmap walk targets a single page and does not try to
+profit from discovering a young PTE. A page table walk can sweep all
+the young PTEs in an address space, but the address space can be too
+sparse to make a profit. The key is to optimize both methods and use
+them in combination.
+
+Fast paths reduce code complexity and runtime overhead. Unmapped pages
+do not require TLB flushes; clean pages do not require writeback.
+These facts are only helpful when other conditions, e.g., access
+recency, are similar. With generations as a common frame of reference,
+additional factors stand out. But obvious choices might not be good
+choices; thus self-correction is necessary.
+
+The benefits of simple self-correcting heuristics are self-evident.
+Again, with generations as a common frame of reference, this becomes
+attainable. Specifically, pages in the same generation can be
+categorized based on additional factors, and a feedback loop can
+statistically compare the refault percentages across those categories
+and infer which of them are better choices.
+
+Assumptions
+-----------
+The protection of hot pages and the selection of cold pages are based
+on page access channels and patterns. There are two access channels:
+
+* Accesses through page tables
+* Accesses through file descriptors
+
+The protection of the former channel is by design stronger because:
+
+1. The uncertainty in determining the access patterns of the former
+ channel is higher due to the approximation of the accessed bit.
+2. The cost of evicting the former channel is higher due to the TLB
+ flushes required and the likelihood of encountering the dirty bit.
+3. The penalty of underprotecting the former channel is higher because
+ applications usually do not prepare themselves for major page
+ faults like they do for blocked I/O. E.g., GUI applications
+ commonly use dedicated I/O threads to avoid blocking rendering
+ threads.
+
+There are also two access patterns:
+
+* Accesses exhibiting temporal locality
+* Accesses not exhibiting temporal locality
+
+For the reasons listed above, the former channel is assumed to follow
+the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
+present, and the latter channel is assumed to follow the latter
+pattern unless outlying refaults have been observed.
+
+Workflow overview
+=================
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[]`` separately for anon and file types as clean file
+pages can be evicted regardless of swap constraints. These three
+variables are monotonically increasing.
+
+Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
+bits in order to fit into the gen counter in ``folio->flags``. Each
+truncated generation number is an index to ``lrugen->lists[]``. The
+sliding window technique is used to track at least ``MIN_NR_GENS`` and
+at most ``MAX_NR_GENS`` generations. The gen counter stores a value
+within ``[1, MAX_NR_GENS]`` while a page is on one of
+``lrugen->lists[]``; otherwise it stores zero.
+
+Each generation is divided into multiple tiers. A page accessed ``N``
+times through file descriptors is in tier ``order_base_2(N)``. Unlike
+generations, tiers do not have dedicated ``lrugen->lists[]``. In
+contrast to moving across generations, which requires the LRU lock,
+moving across tiers only involves atomic operations on
+``folio->flags`` and therefore has a negligible cost. A feedback loop
+modeled after the PID controller monitors refaults over all the tiers
+from anon and file types and decides which tiers from which types to
+evict or protect.
+
+There are two conceptually independent procedures: the aging and the
+eviction. They form a closed-loop system, i.e., the page reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, it
+increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
+``MIN_NR_GENS``. The aging promotes hot pages to the youngest
+generation when it finds them accessed through page tables; the
+demotion of cold pages happens consequently when it increments
+``max_seq``. The aging uses page table walks and rmap walks to find
+young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
+and calls ``walk_page_range()`` with each ``mm_struct`` on this list
+to scan PTEs, and after each iteration, it increments ``max_seq``. For
+the latter, when the eviction walks the rmap and finds a young PTE,
+the aging scans the adjacent PTEs. For both, on finding a young PTE,
+the aging clears the accessed bit and updates the gen counter of the
+page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, it
+increments ``min_seq`` when ``lrugen->lists[]`` indexed by
+``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
+evict from, it first compares ``min_seq[]`` to select the older type.
+If both types are equally old, it selects the one whose first tier has
+a lower refault percentage. The first tier contains single-use
+unmapped clean pages, which are the best bet. The eviction sorts a
+page according to its gen counter if the aging has found this page
+accessed through page tables and updated its gen counter. It also
+moves a page to the next generation, i.e., ``min_seq+1``, if this page
+was accessed multiple times through file descriptors and the feedback
+loop has detected outlying refaults from the tier this page is in. To
+this end, the feedback loop uses the first tier as the baseline, for
+the reason stated earlier.
+
+Summary
+-------
+The multi-gen LRU can be disassembled into the following parts:
+
+* Generations
+* Rmap walks
+* Page table walks
+* Bloom filters
+* PID controller
+
+The aging and the eviction form a producer-consumer model;
+specifically, the latter drives the former by the sliding window over
+generations. Within the aging, rmap walks drive page table walks by
+inserting hot densely populated page tables to the Bloom filters.
+Within the eviction, the PID controller uses refaults as the feedback
+to select types to evict and tiers to protect.
--
2.37.1.595.g718a3a8f04-goog

2022-08-15 09:18:14

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH v14 13/14] mm: multi-gen LRU: admin guide

On Mon, Aug 15, 2022 at 01:13:32AM -0600, Yu Zhao wrote:
> Add an admin guide.
>
> Signed-off-by: Yu Zhao <[email protected]>
> Acked-by: Brian Geffon <[email protected]>
> Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
> Acked-by: Oleksandr Natalenko <[email protected]>
> Acked-by: Steven Barrett <[email protected]>
> Acked-by: Suleiman Souhlal <[email protected]>
> Tested-by: Daniel Byrne <[email protected]>
> Tested-by: Donald Carr <[email protected]>
> Tested-by: Holger Hoffstätte <[email protected]>
> Tested-by: Konstantin Kharlamov <[email protected]>
> Tested-by: Shuang Zhai <[email protected]>
> Tested-by: Sofia Trinh <[email protected]>
> Tested-by: Vaibhav Jain <[email protected]>

LGTM to me (no new warnings).

Reviewed-by: Bagas Sanjaya <[email protected]>

--
An old man doll... just what I always wanted! - Clara


Attachments:
(No filename) (921.00 B)
signature.asc (235.00 B)
Download all attachments

2022-08-15 09:26:24

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v14 13/14] mm: multi-gen LRU: admin guide

On Mon, Aug 15, 2022 at 01:13:32AM -0600, Yu Zhao wrote:
> Add an admin guide.
>
> Signed-off-by: Yu Zhao <[email protected]>
> Acked-by: Brian Geffon <[email protected]>
> Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
> Acked-by: Oleksandr Natalenko <[email protected]>
> Acked-by: Steven Barrett <[email protected]>
> Acked-by: Suleiman Souhlal <[email protected]>
> Tested-by: Daniel Byrne <[email protected]>
> Tested-by: Donald Carr <[email protected]>
> Tested-by: Holger Hoffst?tte <[email protected]>
> Tested-by: Konstantin Kharlamov <[email protected]>
> Tested-by: Shuang Zhai <[email protected]>
> Tested-by: Sofia Trinh <[email protected]>
> Tested-by: Vaibhav Jain <[email protected]>

A few formatting nits below, otherwise

Acked-by: Mike Rapoport <[email protected]>

> ---
> Documentation/admin-guide/mm/index.rst | 1 +
> Documentation/admin-guide/mm/multigen_lru.rst | 156 ++++++++++++++++++
> mm/Kconfig | 3 +-
> mm/vmscan.c | 4 +
> 4 files changed, 163 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index 1bd11118dfb1..d1064e0ba34a 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
> idle_page_tracking
> ksm
> memory-hotplug
> + multigen_lru
> nommu-mmap
> numa_memory_policy
> numaperf
> diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> new file mode 100644
> index 000000000000..6355f2b5019d
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,156 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Multi-Gen LRU
> +=============
> +The multi-gen LRU is an alternative LRU implementation that optimizes
> +page reclaim and improves performance under memory pressure. Page
> +reclaim decides the kernel's caching policy and ability to overcommit
> +memory. It directly impacts the kswapd CPU usage and RAM efficiency.
> +
> +Quick start
> +===========
> +Build the kernel with the following configurations.
> +
> +* ``CONFIG_LRU_GEN=y``
> +* ``CONFIG_LRU_GEN_ENABLED=y``
> +
> +All set!
> +
> +Runtime options
> +===============
> +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
> +following subsections.
> +
> +Kill switch
> +-----------
> +``enabled`` accepts different values to enable or disable the
> +following components. Its default value depends on
> +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
> +unless some of them have unforeseen side effects. Writing to
> +``enabled`` has no effect when a component is not supported by the
> +hardware, and valid values will be accepted even when the main switch
> +is off.
> +
> +====== ===============================================================
> +Values Components
> +====== ===============================================================
> +0x0001 The main switch for the multi-gen LRU.
> +0x0002 Clearing the accessed bit in leaf page table entries in large
> + batches, when MMU sets it (e.g., on x86). This behavior can
> + theoretically worsen lock contention (mmap_lock). If it is
> + disabled, the multi-gen LRU will suffer a minor performance
> + degradation for workloads that contiguously map hot pages,
> + whose accessed bits can be otherwise cleared by fewer larger
> + batches.
> +0x0004 Clearing the accessed bit in non-leaf page table entries as
> + well, when MMU sets it (e.g., on x86). This behavior was not
> + verified on x86 varieties other than Intel and AMD. If it is
> + disabled, the multi-gen LRU will suffer a negligible
> + performance degradation.
> +[yYnN] Apply to all the components above.
> +====== ===============================================================
> +
> +E.g.,
> +::
> +
> + echo y >/sys/kernel/mm/lru_gen/enabled
> + cat /sys/kernel/mm/lru_gen/enabled
> + 0x0007
> + echo 5 >/sys/kernel/mm/lru_gen/enabled
> + cat /sys/kernel/mm/lru_gen/enabled
> + 0x0005
> +
> +Thrashing prevention
> +--------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multi-gen LRU offers thrashing prevention to the
> +majority of laptop and desktop users who do not have ``oomd``.
> +
> +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
> +``N`` milliseconds from getting evicted. The OOM killer is triggered
> +if this working set cannot be kept in memory. In other words, this
> +option works as an adjustable pressure relief valve, and when open, it
> +terminates applications that are hopefully not being used.
> +
> +Based on the average human detectable lag (~100ms), ``N=1000`` usually
> +eliminates intolerable janks due to thrashing. Larger values like
> +``N=3000`` make janks less noticeable at the risk of premature OOM
> +kills.
> +
> +The default value ``0`` means disabled.
> +
> +Experimental features
> +=====================
> +``/sys/kernel/debug/lru_gen`` accepts commands described in the
> +following subsections. Multiple command lines are supported, so does
> +concatenation with delimiters ``,`` and ``;``.
> +
> +``/sys/kernel/debug/lru_gen_full`` provides additional stats for
> +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
> +evicted generations in this file.
> +
> +Working set estimation
> +----------------------
> +Working set estimation measures how much memory an application needs
> +in a given time interval, and it is usually done with little impact on
> +the performance of the application. E.g., data centers want to
> +optimize job scheduling (bin packing) to improve memory utilizations.
> +When a new job comes in, the job scheduler needs to find out whether
> +each server it manages can allocate a certain amount of memory for
> +this new job before it can pick a candidate. To do so, the job
> +scheduler needs to estimate the working sets of the existing jobs.
> +
> +When it is read, ``lru_gen`` returns a histogram of numbers of pages
> +accessed over different time intervals for each memcg and node.
> +``MAX_NR_GENS`` decides the number of bins for each histogram. The
> +histograms are noncumulative.
> +::
> +
> + memcg memcg_id memcg_path
> + node node_id
> + min_gen_nr age_in_ms nr_anon_pages nr_file_pages
> + ...
> + max_gen_nr age_in_ms nr_anon_pages nr_file_pages
> +
> +Each bin contains an estimated number of pages that have been accessed
> +within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
> +and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
> +the former is the largest and that of the latter is the smallest.
> +
> +Users can write ``+ memcg_id node_id max_gen_nr
> +[can_swap [force_scan]]`` to ``lru_gen`` to create a new generation

I think this would look nicer if the command would be a literal block, say

Users can write:

+ memcg_id node_id max_gen_nr [can swap [force_scan]]

to ``lru_gen`` ...

> +``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it
> +is set to ``1``, it forces the scan of anon pages when swap is off,
> +and vice versa. ``force_scan`` defaults to ``1`` and, if it is set to
> +``0``, it employs heuristics to reduce the overhead, which is likely
> +to reduce the coverage as well.
> +
> +A typical use case is that a job scheduler writes to ``lru_gen`` at a
> +certain time interval to create new generations, and it ranks the
> +servers it manages based on the sizes of their cold pages defined by
> +this time interval.
> +
> +Proactive reclaim
> +-----------------
> +Proactive reclaim induces page reclaim when there is no memory
> +pressure. It usually targets cold pages only. E.g., when a new job
> +comes in, the job scheduler wants to proactively reclaim cold pages on
> +the server it selected to improve the chance of successfully landing
> +this new job.
> +
> +Users can write ``- memcg_id node_id min_gen_nr [swappiness
> +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or

Same here.

> +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
> +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
> +aged and therefore cannot be evicted. ``swappiness`` overrides the
> +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
> +the number of pages to evict.
> +
> +A typical use case is that a job scheduler writes to ``lru_gen``
> +before it tries to land a new job on a server. If it fails to
> +materialize enough cold pages because of the overestimation, it
> +retries on the next server according to the ranking result obtained
> +from the working set estimation step. This less forceful approach
> +limits the impacts on the existing jobs.
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 6c86849c4db9..96cd3ae25c6f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1131,7 +1131,8 @@ config LRU_GEN
> # make sure folio->flags has enough spare bits
> depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
> help
> - A high performance LRU implementation to overcommit memory.
> + A high performance LRU implementation to overcommit memory. See
> + Documentation/admin-guide/mm/multigen_lru.rst for details.
>
> config LRU_GEN_ENABLED
> bool "Enable by default"
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 509989fb39ef..f693720047db 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5288,6 +5288,7 @@ static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, c
> return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
> }
>
> +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
> static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
> const char *buf, size_t len)
> {
> @@ -5321,6 +5322,7 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
> }
>
> +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
> static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr,
> const char *buf, size_t len)
> {
> @@ -5468,6 +5470,7 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
> seq_putc(m, '\n');
> }
>
> +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
> static int lru_gen_seq_show(struct seq_file *m, void *v)
> {
> unsigned long seq;
> @@ -5626,6 +5629,7 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
> return err;
> }
>
> +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
> static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
> size_t len, loff_t *pos)
> {
> --
> 2.37.1.595.g718a3a8f04-goog
>

--
Sincerely yours,
Mike.

2022-08-15 09:37:21

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH v14 14/14] mm: multi-gen LRU: design doc

On Mon, Aug 15, 2022 at 01:13:33AM -0600, Yu Zhao wrote:
> Add a design doc.
>
> Signed-off-by: Yu Zhao <[email protected]>
> Acked-by: Brian Geffon <[email protected]>
> Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
> Acked-by: Oleksandr Natalenko <[email protected]>
> Acked-by: Steven Barrett <[email protected]>
> Acked-by: Suleiman Souhlal <[email protected]>
> Tested-by: Daniel Byrne <[email protected]>
> Tested-by: Donald Carr <[email protected]>
> Tested-by: Holger Hoffstätte <[email protected]>
> Tested-by: Konstantin Kharlamov <[email protected]>
> Tested-by: Shuang Zhai <[email protected]>
> Tested-by: Sofia Trinh <[email protected]>
> Tested-by: Vaibhav Jain <[email protected]>

LGTM to me (no new warnings).

Reviewed-by: Bagas Sanjaya <[email protected]>

--
An old man doll... just what I always wanted! - Clara

2022-08-17 22:53:39

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v14 13/14] mm: multi-gen LRU: admin guide

On Mon, Aug 15, 2022 at 3:13 AM Mike Rapoport <[email protected]> wrote:
>
> On Mon, Aug 15, 2022 at 01:13:32AM -0600, Yu Zhao wrote:
> > Add an admin guide.
...
> > +Users can write ``+ memcg_id node_id max_gen_nr
> > +[can_swap [force_scan]]`` to ``lru_gen`` to create a new generation
>
> I think this would look nicer if the command would be a literal block, say
>
> Users can write:
>
> + memcg_id node_id max_gen_nr [can swap [force_scan]]
>
> to ``lru_gen`` ...

Thanks. I have queued this and will refresh the series one more time
probably around rc5.

2022-08-31 04:45:03

by Yu Zhao

[permalink] [raw]
Subject: OpenWrt / MIPS benchmark with MGLRU

TLDR
====
RAM utilization Throughput (95% CI) P99 Latency (95% CI)
----------------------------------------------------------
~90% NS NS
~110% +[12, 16]% -[20, 22]%

Abbreviations
=============
CI: confidence interval
NS: no statistically significant difference
DUT: device under test
ATE: automatic test equipment

Rational
========
1. OpenWrt is the most popular distro for WiFi routers; many of its
targets use big endianness [1].
2. 4 out of the top 5 bestselling WiFi routers in the US use MIPS [2];
MIPS uses software-managed TLB.
3. Memcached is the best available memory benchmark on OpenWrt;
admittedly such a use case is very limited in the real world.

Hardware
========
DUT: Ubiquiti EdgeRouter (ER-8) [3]

DUT # cat /proc/cpuinfo
system type : UBNT_E200 (CN6120p1.1-800-NSP)
machine : Unknown
processor : 0
cpu model : Cavium Octeon II V0.1
BogoMIPS : 1600.00
wait instruction : yes
microsecond timers : yes
tlb_entries : 128
extra interrupt vector : yes
hardware watchpoint : yes, count: 2, address/irw mask: [0x0ffc, 0x0ffb]
isa : mips1 mips2 mips3 mips4 mips5 mips32r1 mips32r2 mips64r1 mips64r2
ASEs implemented :
Options implemented : tlb rixiex 4kex octeon_cache 32fpr prefetch mcheck ejtag llsc rixi lpa vtag_icache userlocal perf_cntr_intr_bit perf
shadow register sets : 1
kscratch registers : 3
package : 0
core : 0
VCED exceptions : not available
VCEI exceptions : not available

processor : 1
cpu model : Cavium Octeon II V0.1
BogoMIPS : 1600.00
wait instruction : yes
microsecond timers : yes
tlb_entries : 128
extra interrupt vector : yes
hardware watchpoint : yes, count: 2, address/irw mask: [0x0ffc, 0x0ffb]
isa : mips1 mips2 mips3 mips4 mips5 mips32r1 mips32r2 mips64r1 mips64r2
ASEs implemented :
Options implemented : tlb rixiex 4kex octeon_cache 32fpr prefetch mcheck ejtag llsc rixi lpa vtag_icache userlocal perf_cntr_intr_bit perf
shadow register sets : 1
kscratch registers : 3
package : 0
core : 1
VCED exceptions : not available
VCEI exceptions : not available

DUT # cat /proc/meminfo
MemTotal: 1991964 kB
MemFree: 1917304 kB
MemAvailable: 1896856 kB
Buffers: 4 kB
Cached: 33464 kB
SwapCached: 0 kB
Active: 1316 kB
Inactive: 33500 kB
Active(anon): 1316 kB
Inactive(anon): 33496 kB
Active(file): 0 kB
Inactive(file): 4 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 995324 kB
SwapFree: 995324 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 1360 kB
Mapped: 2688 kB
Shmem: 33464 kB
KReclaimable: 8244 kB
Slab: 19772 kB
SReclaimable: 8244 kB
SUnreclaim: 11528 kB
KernelStack: 1056 kB
PageTables: 336 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 1991304 kB
Committed_AS: 38916 kB
VmallocTotal: 1069547512 kB
VmallocUsed: 4856 kB
VmallocChunk: 0 kB
Percpu: 272 kB

Software
========
DUT # cat /etc/openwrt_release
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='22.03.0-rc6'
DISTRIB_REVISION='r19590-042d558536'
DISTRIB_TARGET='octeon/generic'
DISTRIB_ARCH='mips64_octeonplus'
DISTRIB_DESCRIPTION='OpenWrt 22.03.0-rc6 r19590-042d558536'
DISTRIB_TAINTS='no-all no-ipv6'

DUT # uname -a
Linux OpenWrt 6.0.0-rc3+ #0 SMP Sun Jul 31 15:12:47 2022 mips64 GNU/Linux

DUT # cat /proc/swaps
Filename Type Size Used Priority
/dev/zram0 partition 995324 0 100

DUT # memcached -V
memcached 1.6.9

DUT # cat /etc/config/memcached
config memcached
option user 'memcached'
option maxconn '1024'
option listen '0.0.0.0'
option port '11211'
option memory '6400'

ATE $ memtier_benchmark -v
memtier_benchmark 1.3.0
Copyright (C) 2011-2022 Redis Ltd.
This is free software. You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Procedure
=========
ATE $ cat run_benchmark_matrix.sh
run_memtier_benchmark()
{
# boot to kernel $3

# populate dataset
memtier_benchmark/memtier_benchmark -s $DUT_IP -p 11211 \
-P memcache_binary -n allkeys -c 1 --ratio 1:0 --pipeline 8 \
--key-minimum=1 --key-maximum=$2 --key-pattern=P:P \
-d 1000

# access dataset using Guassian pattern
memtier_benchmark/memtier_benchmark -s $DUT_IP -p 11211 \
-P memcache_binary --test-time $1 -c 1 --ratio 0:1 \
--pipeline 8 --key-minimum=1 --key-maximum=$2 \
--key-pattern=G:G --randomize --distinct-client-seed

# collect results
}

run_duration_secs=1200
mem_utils_90_110=(1600000 2000000)
kernels=("baseline" "patched")

for mem_util in ${mem_utils_90_110[@]}; do
for kernel in ${kernels[@]}; do
run_memtier_benchmark $run_duration_secs $mem_util $kernel
done
done

Results
=======
Baseline 90% RAM utilization
------------------------------------------------------------
Ops/sec Avg. Lat. p50 Lat. p99 Lat. p99.9 Lat. KB/sec
------------------------------------------------------------
48550.71 0.65687 0.48700 2.84700 5.56700 1812.25
48600.55 0.65629 0.48700 2.86300 5.59900 1814.11
48562.37 0.65674 0.48700 2.84700 5.50300 1812.68
48556.66 0.65688 0.48700 2.84700 5.53500 1812.47
48619.50 0.65600 0.48700 2.87900 5.63100 1814.82
48579.74 0.65654 0.48700 2.84700 5.56700 1813.33
48593.25 0.65764 0.48700 2.86300 5.56700 1814.10
48535.52 0.65716 0.48700 2.86300 5.56700 1811.68
48587.24 0.65645 0.48700 2.83100 5.50300 1813.61
48541.92 0.65704 0.48700 2.81500 5.47100 1811.92

MGLRU 90% RAM utilization
------------------------------------------------------------
Ops/sec Avg. Lat. p50 Lat. p99 Lat. p99.9 Lat. KB/sec
------------------------------------------------------------
48622.38 0.65594 0.48700 2.81500 5.47100 1814.92
48537.74 0.65715 0.48700 2.84700 5.53500 1811.76
48586.82 0.65646 0.48700 2.84700 5.50300 1813.59
48552.44 0.65695 0.48700 2.83100 5.43900 1812.31
48557.35 0.65680 0.49500 2.83100 5.53500 1812.49
48625.48 0.65593 0.48700 2.81500 5.43900 1815.04
48655.75 0.65557 0.48700 2.84700 5.53500 1816.17
48625.67 0.65595 0.48700 2.84700 5.53500 1815.04
48622.22 0.65600 0.48700 2.84700 5.47100 1814.91
48617.10 0.65610 0.48700 2.84700 5.56700 1814.73

Baseline 110% RAM utilization
------------------------------------------------------------
Ops/sec Avg. Lat. p50 Lat. p99 Lat. p99.9 Lat. KB/sec
------------------------------------------------------------
19813.79 1.61245 0.63100 17.79100 31.74300 744.91
20328.29 1.57158 0.62300 17.27900 31.10300 764.25
20104.12 1.58913 0.62300 17.40700 31.10300 755.82
20342.03 1.57053 0.61500 17.27900 30.84700 764.77
19688.05 1.62268 0.62300 17.91900 31.35900 740.18
19607.31 1.62943 0.63900 17.91900 31.23100 737.15
19250.96 1.65963 0.65500 17.91900 31.10300 723.75
20182.79 1.58290 0.63100 17.40700 30.84700 758.78
20181.88 1.58299 0.63100 17.40700 30.84700 758.75
20615.90 1.54963 0.62300 17.02300 30.84700 775.06

MGLRU 110% RAM utilization
------------------------------------------------------------
Ops/sec Avg. Lat. p50 Lat. p99 Lat. p99.9 Lat. KB/sec
------------------------------------------------------------
22911.33 1.39405 0.61500 13.69500 28.79900 861.36
22339.08 1.42989 0.61500 14.07900 30.07900 839.85
23394.22 1.36521 0.59900 13.56700 29.05500 879.51
22521.48 1.41830 0.61500 13.88700 29.82300 846.70
22678.10 1.40818 0.61500 13.82300 29.69500 852.59
22344.50 1.42952 0.61500 14.07900 29.95100 840.05
23245.65 1.37406 0.60700 13.50300 28.92700 873.93
23140.17 1.38032 0.59900 13.69500 29.18300 869.96
23003.34 1.38856 0.61500 13.63100 29.05500 864.82
22937.52 1.39253 0.61500 13.69500 29.43900 862.35

Flame graphs
------------
Baseline: https://drive.google.com/file/d/1-Ac4HMPAyZIqxtvKerUTqNNAgBLhpX9R
MGLRU: https://drive.google.com/file/d/1-9x0W2yIYeiRvXWiYRzL6niTqW7zCVPX

References
==========
[1] https://openwrt.org/docs/platforms/start
[2] https://www.amazon.com/bestsellers/pc/300189
[3] https://openwrt.org/toh/ubiquiti/edgerouter

2022-08-31 12:34:29

by Arnd Bergmann

[permalink] [raw]
Subject: Re: OpenWrt / MIPS benchmark with MGLRU

On Wed, Aug 31, 2022, at 6:17 AM, Yu Zhao wrote:
>
> Rational
> ========
> 1. OpenWrt is the most popular distro for WiFi routers; many of its
> targets use big endianness [1].
> 2. 4 out of the top 5 bestselling WiFi routers in the US use MIPS [2];
> MIPS uses software-managed TLB.
> 3. Memcached is the best available memory benchmark on OpenWrt;
> admittedly such a use case is very limited in the real world.
>
> Hardware
> ========
> DUT: Ubiquiti EdgeRouter (ER-8) [3]

I don't know if it makes any difference to your findings, but
I would point out the test hardware is neither representative
of most devices supported by OpenWRT, nor those on the amazon
best-seller list that I see looking from Germany:

Five of the top-10 devices on that list are arm64 (little-endian,
hardware TLB walker, typically 512MB of RAM), the others are
mips32 (typically only 128MB, mostly single-core) and only
the oldest one (Archer C7) of them is big-endian. I would not
expect endianness to make any difference, but the 16x smaller
memory of typical mips devices (ath79, mt76) might.

Arnd

2022-08-31 15:40:20

by Dave Hansen

[permalink] [raw]
Subject: Re: OpenWrt / MIPS benchmark with MGLRU

On 8/30/22 21:17, Yu Zhao wrote:
> TLDR
> ====
> RAM utilization Throughput (95% CI) P99 Latency (95% CI)
> ----------------------------------------------------------
> ~90% NS NS
> ~110% +[12, 16]% -[20, 22]%

I'll give you points for thinking out of the box on this one. This is a
piece of hardware where both latency and bandwidth theoretically matter.
I've got a slightly older but similar piece of Ubiquiti hardware with
512MB of RAM. It doesn't run OpenWRT, fwiw. Maybe my firmware is a bit
outdated.

*But*, most of the heavy lifting for packet flow on these systems is
done in hardware. They have some hardware acceleration to be able to
_route_ at gigabit speeds, so they're probably not quite as sensitive to
software hiccups as lower-end routers.

That said, my system at least does not typically have *any* memory
pressure. Right now, it hasn't even filled free memory with page cache
and it's been up for over a month:

# cat /proc/meminfo
MemTotal: 491552 kB
MemFree: 160188 kB
MemAvailable: 373088 kB
Cached: 151004 kB

I think a better tl;dr would be:

MGLRU doesn't help much or cause any regressions on this
hardware. Under (atypical) synthetic memory pressure, MGLRU did
show some modest but measurable throughput and latency benefits.

In other words, this provides more of a data point that MGLRU doesn't
hurt medium-ish sized embedded systems. I think you could make an even
stronger case with even smaller hardware or something that actually sees
memory pressure on a regular basis in the wild.

2022-08-31 23:11:21

by Yu Zhao

[permalink] [raw]
Subject: Re: OpenWrt / MIPS benchmark with MGLRU

On Tue, Aug 30, 2022 at 10:17 PM Yu Zhao <[email protected]> wrote:
>
> TLDR
> ====
> RAM utilization Throughput (95% CI) P99 Latency (95% CI)
> ----------------------------------------------------------
> ~90% NS NS
> ~110% +[12, 16]% -[20, 22]%
>
> Abbreviations
> =============
> CI: confidence interval
> NS: no statistically significant difference
> DUT: device under test
> ATE: automatic test equipment
>
> Rational
> ========
> 1. OpenWrt is the most popular distro for WiFi routers; many of its
> targets use big endianness [1].
> 2. 4 out of the top 5 bestselling WiFi routers in the US use MIPS [2];
> MIPS uses software-managed TLB.
> 3. Memcached is the best available memory benchmark on OpenWrt;
> admittedly such a use case is very limited in the real world.

Thanks.

My goal is to encourage MM people to extend their test coverage to
some commonly used but less tested configurations. I carefully
constructed this benchmark with the balance between its
representativeness and the effort to reproduce.

When I wear my MM hat, I see ER-8 as the ideal choice because it comes
with a serial port, a replaceable memory DIMM and one of the two cores
that can be disabled. The same SoC is also what the Debian MIPS port
mainly uses for their testing [1]. So if I need help, I might be able
to get it from them.

From OpenWrt's / MIPS OEMs' POVs, I do see ER-8 as an uninteresting
platform. Currently the best selling WiFi router on Amazon US is
Archer A7, a knockoff of Archer C7. The latter comes with not only the
serial port header but also the JTAG header, and that's what I use.
But I seriously doubt showing how I work on C7 would encourage MM
people to try it. I snapped a pictures of it during lunch:
https://drive.google.com/file/d/1rYBwLOyMqBSr6WKUZd7Gbf9RfwA641X5/
And other boards I routinely test the MM performance on:
https://drive.google.com/file/d/1yBMx9OPWw-5czvz3maNUy6WBFwPvAqG5/
All the way dates back to this vintage:
https://drive.google.com/file/d/12N21qiWSoyJgZwVkwAhY8_5Fj4dKftqD/

[1] https://wiki.debian.org/MIPSPort

2022-09-01 09:39:13

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v14 07/14] mm: multi-gen LRU: exploit locality in rmap



> On Aug 15, 2022, at 12:13 AM, Yu Zhao <[email protected]> wrote:
>
> Searching the rmap for PTEs mapping each page on an LRU list (to test
> and clear the accessed bit) can be expensive because pages from
> different VMAs (PA space) are not cache friendly to the rmap (VA
> space). For workloads mostly using mapped pages, searching the rmap
> can incur the highest CPU cost in the reclaim path.

Impressive work. Sorry if my feedback is not timely.

Just one minor point for thought, that can be left for a later cleanup.

>
> + for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> + unsigned long pfn;
> +
> + pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
> + if (pfn == -1)
> + continue;
> +
> + if (!pte_young(pte[i]))
> + continue;
> +
> + folio = get_pfn_folio(pfn, memcg, pgdat);
> + if (!folio)
> + continue;
> +
> + if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> + continue;
> +

You have already checked that the PTE is old (not young), so this check
seems redundant. I do not see a way in which the access-bit can be cleared
since you hold the ptl. IOW, there is no need for the “if" and “continue".

Makes me also wonder whether having a separate ptep_clear_young() can
slightly help, since anyhow the access-bit is more of an estimation,
and having a separate ptep_clear_young() can enable optimizations.

On x86, for instance, if the PTE is dirty, we may be able to clear the
access-bit without an atomic operation, which should be faster.

2022-09-02 01:52:09

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v14 07/14] mm: multi-gen LRU: exploit locality in rmap

On Thu, Sep 1, 2022 at 7:17 PM Yu Zhao <[email protected]> wrote:
>
> On Thu, Sep 1, 2022 at 3:18 AM Nadav Amit <[email protected]> wrote:
> >
> >
> >
> > > On Aug 15, 2022, at 12:13 AM, Yu Zhao <[email protected]> wrote:
> > >
> > > Searching the rmap for PTEs mapping each page on an LRU list (to test
> > > and clear the accessed bit) can be expensive because pages from
> > > different VMAs (PA space) are not cache friendly to the rmap (VA
> > > space). For workloads mostly using mapped pages, searching the rmap
> > > can incur the highest CPU cost in the reclaim path.
> >
> > Impressive work.

Thanks.

> > Sorry if my feedback is not timely.
> >
> > Just one minor point for thought, that can be left for a later cleanup.
> >
> > >
> > > + for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> > > + unsigned long pfn;
> > > +
> > > + pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
> > > + if (pfn == -1)
> > > + continue;
> > > +
> > > + if (!pte_young(pte[i]))
> > > + continue;
> > > +
> > > + folio = get_pfn_folio(pfn, memcg, pgdat);
> > > + if (!folio)
> > > + continue;
> > > +
> > > + if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> > > + continue;
> > > +
> >
> > You have already checked that the PTE is old (not young) so this check
> > seems redundant.
>
> You are right, for x86, which belongs to category 1: hardware and
> OS share the same paging data structure.
>
> > I do not see a way in which the access-bit can be cleared
> > since you hold the ptl.
>
> There is also category 2: the OS paging data structure is a shadow of what
> hardware actually uses, e.g., POWER9 radix.
>
> To make both categories work, the general rule is that the OS paging
> data structure must be more strict, i.e., it can have A/D bits set
> while the hardware paging data structure may not. The opposite is not
> allowed, even for the A bit, because the A bit can also be used to
> determine whether a TLB flush is required. The Linux kernel doesn't do
> this but there are other OSes that do.
>
> For prefaulted PTEs, we generally mark them young unless
> arch_wants_old_prefaulted_pte() returns true (currently only ARMv8.2+
> do). On POWER9, we'd see those PTEs pass the first check but fail the
> second.

Because the first check (non-atomic) is allowed to fetch from the OS
paging data structure (which is more strict) while the second check
(atomic) must fetch from the hardware page data structure (which does
not have the A bit because those PTEs are preffaulted).

> > IOW, there is no need for the “if" and “continue".
> >
> > Makes me also wonder whether having a separate ptep_clear_young() can
> > slightly help, since anyhow the access-bit is more of an estimation,
> > and having a separate ptep_clear_young() can enable optimizations.
> >
> > On x86, for instance, if the PTE is dirty, we may be able to clear the
> > access-bit without an atomic operation, which should be faster.
>
> Agreed.

2022-09-02 01:52:10

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v14 07/14] mm: multi-gen LRU: exploit locality in rmap

On Thu, Sep 1, 2022 at 3:18 AM Nadav Amit <[email protected]> wrote:
>
>
>
> > On Aug 15, 2022, at 12:13 AM, Yu Zhao <[email protected]> wrote:
> >
> > Searching the rmap for PTEs mapping each page on an LRU list (to test
> > and clear the accessed bit) can be expensive because pages from
> > different VMAs (PA space) are not cache friendly to the rmap (VA
> > space). For workloads mostly using mapped pages, searching the rmap
> > can incur the highest CPU cost in the reclaim path.
>
> Impressive work. Sorry if my feedback is not timely.
>
> Just one minor point for thought, that can be left for a later cleanup.
>
> >
> > + for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> > + unsigned long pfn;
> > +
> > + pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
> > + if (pfn == -1)
> > + continue;
> > +
> > + if (!pte_young(pte[i]))
> > + continue;
> > +
> > + folio = get_pfn_folio(pfn, memcg, pgdat);
> > + if (!folio)
> > + continue;
> > +
> > + if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> > + continue;
> > +
>
> You have already checked that the PTE is old (not young) so this check
> seems redundant.

You are right, for x86, which belongs to category 1: hardware and
OS share the same paging data structure.

> I do not see a way in which the access-bit can be cleared
> since you hold the ptl.

There is also category 2: the OS paging data structure is a shadow of what
hardware actually uses, e.g., POWER9 radix.

To make both categories work, the general rule is that the OS paging
data structure must be more strict, i.e., it can have A/D bits set
while the hardware paging data structure may not. The opposite is not
allowed, even for the A bit, because the A bit can also be used to
determine whether a TLB flush is required. The Linux kernel doesn't do
this but there are other OSes that do.

For prefaulted PTEs, we generally mark them young unless
arch_wants_old_prefaulted_pte() returns true (currently only ARMv8.2+
do). On POWER9, we'd see those PTEs pass the first check but fail the
second.

> IOW, there is no need for the “if" and “continue".
>
> Makes me also wonder whether having a separate ptep_clear_young() can
> slightly help, since anyhow the access-bit is more of an estimation,
> and having a separate ptep_clear_young() can enable optimizations.
>
> On x86, for instance, if the PTE is dirty, we may be able to clear the
> access-bit without an atomic operation, which should be faster.

Agreed.

2022-09-12 00:21:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v14 00/14] Multi-Gen LRU Framework

I'd like to move mglru into the mm-stable branch late this week.

I'm not terribly happy about the level of review nor the carefulness of
the code commenting (these things are related) and I have a note here
that "mm: multi-gen LRU: admin guide" is due for an update and everyone
is at conference anyway. But let's please try to push things along anyway.

2022-09-15 18:19:27

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v14 00/14] Multi-Gen LRU Framework

On Sun, Sep 11, 2022 at 6:08 PM Andrew Morton <[email protected]> wrote:
>
> I'd like to move mglru into the mm-stable branch late this week.
>
> I'm not terribly happy about the level of review nor the carefulness of
> the code commenting (these things are related) and I have a note here
> that "mm: multi-gen LRU: admin guide" is due for an update and everyone
> is at conference anyway. But let's please try to push things along anyway.

Thanks for the heads-up. Will add as many comments as I can and wrap
it up by the end of tomorrow.

2022-09-18 21:28:18

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v14 00/14] Multi-Gen LRU Framework

On Thu, Sep 15, 2022 at 11:56 AM Yu Zhao <[email protected]> wrote:
>
> On Sun, Sep 11, 2022 at 6:08 PM Andrew Morton <[email protected]> wrote:
> >
> > I'd like to move mglru into the mm-stable branch late this week.
> >
> > I'm not terribly happy about the level of review nor the carefulness of
> > the code commenting (these things are related) and I have a note here
> > that "mm: multi-gen LRU: admin guide" is due for an update and everyone
> > is at conference anyway. But let's please try to push things along anyway.
>
> Thanks for the heads-up. Will add as many comments as I can and wrap
> it up by the end of tomorrow.

I've posted v15 which can replace what mm-unstable currently has.
Apologies for the delay: an unexpected lockdep warning from the maple
tree forced me to restart all the tests [1].

Let me also post the incremental patches after this email, in case you
strongly prefer to add them on top of v14.

[1] https://lore.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com/

2022-09-19 00:27:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v14 00/14] Multi-Gen LRU Framework

On Sun, 18 Sep 2022 14:40:01 -0600 Yu Zhao <[email protected]> wrote:

> Let me also post the incremental patches after this email, in case you
> strongly prefer to add them on top of v14.

Thanks, helpful.

I have one question regarding 03/11.

The final two updates look pretty substantial. I guess I'll do a
series replacement and let this and mapletree sit another week.

2022-09-20 08:28:28

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH v14 13/14] mm: multi-gen LRU: admin guide

On Mon, Aug 15, 2022 at 01:13:32AM -0600, Yu Zhao wrote:
> +=============
> +Multi-Gen LRU
> +=============
> +The multi-gen LRU is an alternative LRU implementation that optimizes
> +page reclaim and improves performance under memory pressure. Page
> +reclaim decides the kernel's caching policy and ability to overcommit
> +memory. It directly impacts the kswapd CPU usage and RAM efficiency.
> +
> +Quick start
> +===========
> +Build the kernel with the following configurations.
> +
> +* ``CONFIG_LRU_GEN=y``
> +* ``CONFIG_LRU_GEN_ENABLED=y``
> +
> +All set!
> +
> +Runtime options
> +===============
> +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
> +following subsections.
> +
> +Kill switch
> +-----------
> +``enabled`` accepts different values to enable or disable the
> +following components. Its default value depends on
> +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
> +unless some of them have unforeseen side effects. Writing to
> +``enabled`` has no effect when a component is not supported by the
> +hardware, and valid values will be accepted even when the main switch
> +is off.
> +
> +====== ===============================================================
> +Values Components
> +====== ===============================================================
> +0x0001 The main switch for the multi-gen LRU.
> +0x0002 Clearing the accessed bit in leaf page table entries in large
> + batches, when MMU sets it (e.g., on x86). This behavior can
> + theoretically worsen lock contention (mmap_lock). If it is
> + disabled, the multi-gen LRU will suffer a minor performance
> + degradation for workloads that contiguously map hot pages,
> + whose accessed bits can be otherwise cleared by fewer larger
> + batches.
> +0x0004 Clearing the accessed bit in non-leaf page table entries as
> + well, when MMU sets it (e.g., on x86). This behavior was not
> + verified on x86 varieties other than Intel and AMD. If it is
> + disabled, the multi-gen LRU will suffer a negligible
> + performance degradation.
> +[yYnN] Apply to all the components above.
> +====== ===============================================================
> +
> +E.g.,
> +::
> +
> + echo y >/sys/kernel/mm/lru_gen/enabled
> + cat /sys/kernel/mm/lru_gen/enabled
> + 0x0007
> + echo 5 >/sys/kernel/mm/lru_gen/enabled
> + cat /sys/kernel/mm/lru_gen/enabled
> + 0x0005
> +
> +Thrashing prevention
> +--------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multi-gen LRU offers thrashing prevention to the
> +majority of laptop and desktop users who do not have ``oomd``.
> +
> +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
> +``N`` milliseconds from getting evicted. The OOM killer is triggered
> +if this working set cannot be kept in memory. In other words, this
> +option works as an adjustable pressure relief valve, and when open, it
> +terminates applications that are hopefully not being used.
> +
> +Based on the average human detectable lag (~100ms), ``N=1000`` usually
> +eliminates intolerable janks due to thrashing. Larger values like
> +``N=3000`` make janks less noticeable at the risk of premature OOM
> +kills.
> +
> +The default value ``0`` means disabled.
> +
> +Experimental features
> +=====================
> +``/sys/kernel/debug/lru_gen`` accepts commands described in the
> +following subsections. Multiple command lines are supported, so does
> +concatenation with delimiters ``,`` and ``;``.
> +
> +``/sys/kernel/debug/lru_gen_full`` provides additional stats for
> +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
> +evicted generations in this file.
> +
> +Working set estimation
> +----------------------
> +Working set estimation measures how much memory an application needs
> +in a given time interval, and it is usually done with little impact on
> +the performance of the application. E.g., data centers want to
> +optimize job scheduling (bin packing) to improve memory utilizations.
> +When a new job comes in, the job scheduler needs to find out whether
> +each server it manages can allocate a certain amount of memory for
> +this new job before it can pick a candidate. To do so, the job
> +scheduler needs to estimate the working sets of the existing jobs.
> +
> +When it is read, ``lru_gen`` returns a histogram of numbers of pages
> +accessed over different time intervals for each memcg and node.
> +``MAX_NR_GENS`` decides the number of bins for each histogram. The
> +histograms are noncumulative.
> +::
> +
> + memcg memcg_id memcg_path
> + node node_id
> + min_gen_nr age_in_ms nr_anon_pages nr_file_pages
> + ...
> + max_gen_nr age_in_ms nr_anon_pages nr_file_pages
> +
> +Each bin contains an estimated number of pages that have been accessed
> +within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
> +and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
> +the former is the largest and that of the latter is the smallest.
> +
> +Users can write ``+ memcg_id node_id max_gen_nr
> +[can_swap [force_scan]]`` to ``lru_gen`` to create a new generation
> +``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it
> +is set to ``1``, it forces the scan of anon pages when swap is off,
> +and vice versa. ``force_scan`` defaults to ``1`` and, if it is set to
> +``0``, it employs heuristics to reduce the overhead, which is likely
> +to reduce the coverage as well.
> +
> +A typical use case is that a job scheduler writes to ``lru_gen`` at a
> +certain time interval to create new generations, and it ranks the
> +servers it manages based on the sizes of their cold pages defined by
> +this time interval.
> +
> +Proactive reclaim
> +-----------------
> +Proactive reclaim induces page reclaim when there is no memory
> +pressure. It usually targets cold pages only. E.g., when a new job
> +comes in, the job scheduler wants to proactively reclaim cold pages on
> +the server it selected to improve the chance of successfully landing
> +this new job.
> +
> +Users can write ``- memcg_id node_id min_gen_nr [swappiness
> +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
> +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
> +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
> +aged and therefore cannot be evicted. ``swappiness`` overrides the
> +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
> +the number of pages to evict.
> +
> +A typical use case is that a job scheduler writes to ``lru_gen``
> +before it tries to land a new job on a server. If it fails to
> +materialize enough cold pages because of the overestimation, it
> +retries on the next server according to the ranking result obtained
> +from the working set estimation step. This less forceful approach
> +limits the impacts on the existing jobs.

The documentation can be improved, like:

---- >8 ----
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497e7..16d5c3317447f2 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -10,13 +10,11 @@ memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
-Build the kernel with the following configurations.
+Build the kernel with the following configurations:

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

-All set!
-
Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
@@ -51,8 +49,7 @@ Values Components
[yYnN] Apply to all the components above.
====== ===============================================================

-E.g.,
-::
+Below is the example of enabling multi-gen LRU::

echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
@@ -65,21 +62,21 @@ Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
-experience. The multi-gen LRU offers thrashing prevention to the
+experience. The multi-gen LRU offers thrashing prevention for the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
-option works as an adjustable pressure relief valve, and when open, it
-terminates applications that are hopefully not being used.
+option works as an adjustable pressure relief valve; and when it is open,
+it terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

-The default value ``0`` means disabled.
+The default value is ``0`` which means disabled.

Experimental features
=====================
@@ -88,14 +85,14 @@ following subsections. Multiple command lines are supported, so does
concatenation with delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
-debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
-evicted generations in this file.
+debugging. When ``CONFIG_LRU_GEN_STATS=y`` is configured, the kernel keeps
+historical stats from evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
-the performance of the application. E.g., data centers want to
+the performance of the application. For example, data centers want to
optimize job scheduling (bin packing) to improve memory utilizations.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
@@ -115,19 +112,19 @@ histograms are noncumulative.
max_gen_nr age_in_ms nr_anon_pages nr_file_pages

Each bin contains an estimated number of pages that have been accessed
-within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
+within ``age_in_ms``. Here, ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.

Users can write the following command to ``lru_gen`` to create a new
-generation ``max_gen_nr+1``:
+generation ``max_gen_nr+1``::

- ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``
+ memcg_id node_id max_gen_nr [can_swap [force_scan]]

-``can_swap`` defaults to the swap setting and, if it is set to ``1``,
-it forces the scan of anon pages when swap is off, and vice versa.
-``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
-employs heuristics to reduce the overhead, which is likely to reduce
+``can_swap`` defaults to the swap setting. If it is set to ``1``,
+it forces the scan of anon pages when swap is off and vice versa.
+``force_scan`` defaults to ``1``. If it is set to ``0``, it
+uses heuristics to reduce the overhead, which is likely to reduce
the coverage as well.

A typical use case is that a job scheduler runs this command at a
@@ -138,15 +135,15 @@ this time interval.
Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
-pressure. It usually targets cold pages only. E.g., when a new job
+pressure. It usually targets cold pages only. For example, when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
-generations less than or equal to ``min_gen_nr``.
+generations less than or equal to ``min_gen_nr``::

- ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``
+ memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
@@ -155,7 +152,7 @@ overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.

A typical use case is that a job scheduler runs this command before it
-tries to land a new job on a server. If it fails to materialize enough
+tries to land a new job on a server. If it fails to get enough
cold pages because of the overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impacts on the


> +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
> static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
> const char *buf, size_t len)
> {
> @@ -5321,6 +5322,7 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
> }
>
> +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
> static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr,
> const char *buf, size_t len)
> {
> @@ -5468,6 +5470,7 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
> seq_putc(m, '\n');
> }
>
> +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
> static int lru_gen_seq_show(struct seq_file *m, void *v)
> {
> unsigned long seq;
> @@ -5626,6 +5629,7 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
> return err;
> }
>
> +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
> static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
> size_t len, loff_t *pos)
> {

Shouldn't these functions be described by writing kernel-doc comments?
I don't see any mentions of these in multigen_lru.rst.

Thanks.

--
An old man doll... just what I always wanted! - Clara