2022-05-18 04:24:11

by Yu Zhao

Subject: [PATCH v11 00/14] Multi-Gen LRU Framework

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/[email protected]/

01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.

03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
its sole caller"
Minor refactors to improve readability for the following patches.

05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.

06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.

07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.

08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.

09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.

10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.

11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.

13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
Apache Cassandra
Apache Hadoop
Apache Spark
MariaDB (MySQL)
Memcached
MongoDB
PostgreSQL
Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
less wall time to sort three billion random integers, respectively,
under the medium- and the high-concurrency conditions, when
overcommitting memory. There were no statistically significant
changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
more transactions per minute (TPM), respectively, under the medium-
and the high-concurrency conditions, when overcommitting memory.
There were no statistically significant changes in TPM for the rest
of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
and [21.59, 30.02]% more operations per second (OPS), respectively,
for sequential access, random access and Gaussian (distribution)
access, when THP=always; 95% CIs [13.85, 15.97]% and
[23.94, 29.92]% more OPS, respectively, for random access and
Gaussian access, when THP=never. There were no statistically
significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
[2.16, 3.55]% more operations per second (OPS), respectively, for
exponential (distribution) access, random access and Zipfian
(distribution) access, when underutilizing memory; 95% CIs
[8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
respectively, for exponential access, random access and Zipfian
access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
and [4.11, 7.50]% more operations per second (OPS), respectively,
for exponential (distribution) access, random access and Zipfian
(distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
[6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
exponential access, random access and Zipfian access, when swap was
on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
less average wall time to finish twelve parallel TeraSort jobs,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in average wall time for the rest of the
benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
minute (TPM) under the high-concurrency condition, when swap was
off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
[11.47, 19.36]% more total operations per second (OPS),
respectively, for sequential access, random access and Gaussian
(distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
[10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
for sequential access, random access and Gaussian access, when
THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
fs_fio_bench_hdd_mq
fs_lmbench
fs_parallelio
fs_postmark
hackbench
kernbench
memcached
multichase
mutilate
nginx
pft
pgsql-hammerdb
redis
stream
sysbenchthread
tpcc_spark
unixbench
vm-scalability
will-it-scale

[01] https://trends.google.com
[02] https://lore.kernel.org/r/[email protected]/
[03] https://lore.kernel.org/r/[email protected]/
[04] https://lore.kernel.org/r/[email protected]/
[05] https://lore.kernel.org/r/[email protected]/
[06] https://lore.kernel.org/r/[email protected]/
[07] https://lore.kernel.org/r/[email protected]/
[08] https://lore.kernel.org/r/[email protected]/
[09] https://lore.kernel.org/r/[email protected]/
[10] https://lore.kernel.org/r/[email protected]/

Real-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
I have Archlinux with 8G RAM + zswap + swap. While developing, I
have lots of apps opened such as multiple LSP-servers for different
langs, chats, two browsers, etc... Usually, my system gets quickly
to a point of SWAP-storms, where I have to kill LSP-servers,
restart browsers to free memory, etc, otherwise the system lags
heavily and is barely usable.

1.5 days ago I migrated from 5.11.15 kernel to 5.12 + the LRU
patchset, and I started up by opening lots of apps to create memory
pressure, and worked for a day like this. Till now I had not a
single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
getting to the point of 3G in SWAP before without a single
SWAP-storm.

Vaibhav from IBM reported [12]:
In a synthetic MongoDB Benchmark, seeing an average of ~19%
throughput improvement on POWER10 (Radix MMU + 64K page size) with
MGLRU patches on top of v5.16 kernel for MongoDB + YCSB across
three different request distributions, namely, Exponential, Uniform
and Zipfian.

Shuang from U of Rochester reported [13]:
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random
access, Zipfian (distribution) access and Gaussian (distribution)
access, when the average number of jobs per CPU is 1; 95% CIs
[42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
throughput, respectively, for random access, Zipfian access and
Gaussian access, when the average number of jobs per CPU is 2.

Daniel from Michigan Tech reported [14]:
With Memcached allocating ~100GB of byte-addressable Optane,
performance improvement in terms of throughput (measured as queries
per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
The downstream kernels that have been using MGLRU include:
1. Android [15]
2. Arch Linux Zen [16]
3. Chrome OS [17]
4. Liquorix [18]
5. post-factum [19]
6. XanMod [20]

We've rolled out MGLRU to tens of millions of Chrome OS users and
about a million Android users. Google's fleetwide profiling [21] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.

[11] https://lore.kernel.org/r/[email protected]/
[12] https://lore.kernel.org/r/[email protected]/
[13] https://lore.kernel.org/r/[email protected]/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://android.com
[16] https://archlinux.org
[17] https://chromium.org
[18] https://liquorix.net
[19] https://gitlab.com/post-factum/pf-kernel/
[20] https://xanmod.org
[21] https://research.google/pubs/pub44271/

Summary
=======
The facts are:
1. The independent lab results and the real-world applications
indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no one has demonstrated smaller changes
with similar effects.

Our inferences, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
will likely be put to use for both personal computers and data
centers.
3. Based on Google's track record, the new code will likely be well
maintained in the long term. It would be difficult, if not
impossible, to achieve similar effects with the alternatives.

Yu Zhao (14):
mm: x86, arm64: add arch_has_hw_pte_young()
mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
mm/vmscan.c: refactor shrink_node()
Revert "include/linux/mm_inline.h: fold __update_lru_size() into its
sole caller"
mm: multi-gen LRU: groundwork
mm: multi-gen LRU: minimal implementation
mm: multi-gen LRU: exploit locality in rmap
mm: multi-gen LRU: support page table walks
mm: multi-gen LRU: optimize multiple memcgs
mm: multi-gen LRU: kill switch
mm: multi-gen LRU: thrashing prevention
mm: multi-gen LRU: debugfs interface
mm: multi-gen LRU: admin guide
mm: multi-gen LRU: design doc

Documentation/admin-guide/mm/index.rst | 1 +
Documentation/admin-guide/mm/multigen_lru.rst | 156 +
Documentation/vm/index.rst | 1 +
Documentation/vm/multigen_lru.rst | 159 +
arch/Kconfig | 8 +
arch/arm64/include/asm/pgtable.h | 14 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 9 +-
arch/x86/mm/pgtable.c | 5 +-
fs/exec.c | 2 +
fs/fuse/dev.c | 3 +-
include/linux/cgroup.h | 15 +-
include/linux/memcontrol.h | 36 +
include/linux/mm.h | 7 +
include/linux/mm_inline.h | 233 +-
include/linux/mm_types.h | 77 +
include/linux/mmzone.h | 216 ++
include/linux/nodemask.h | 1 +
include/linux/page-flags-layout.h | 16 +-
include/linux/page-flags.h | 4 +-
include/linux/pgtable.h | 17 +-
include/linux/sched.h | 4 +
include/linux/swap.h | 4 +
kernel/bounds.c | 7 +
kernel/cgroup/cgroup-internal.h | 1 -
kernel/exit.c | 1 +
kernel/fork.c | 9 +
kernel/sched/core.c | 1 +
mm/Kconfig | 26 +
mm/huge_memory.c | 3 +-
mm/internal.h | 1 +
mm/memcontrol.c | 28 +
mm/memory.c | 39 +-
mm/mm_init.c | 6 +-
mm/mmzone.c | 2 +
mm/rmap.c | 7 +
mm/swap.c | 52 +-
mm/vmscan.c | 2888 ++++++++++++++++-
mm/workingset.c | 110 +-
39 files changed, 4017 insertions(+), 153 deletions(-)
create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
create mode 100644 Documentation/vm/multigen_lru.rst

--
2.36.0.550.gb090851708-goog



2022-05-18 04:27:01

by Yu Zhao

Subject: [PATCH v11 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG

Some architectures support the accessed bit in non-leaf PMD entries,
e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
as part of linear address translation [1]. Page table walkers that
clear the accessed bit may use this capability to reduce their search
space.
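As a rough illustration of why this helps (a hypothetical user-space
sketch in plain C, not kernel code; the structures and names below are
invented for the example): a walker that clears accessed bits can skip
every PTE under a non-leaf PMD whose accessed bit is clear, because
the hardware would have set that bit on any translation through the
PMD.

```c
#include <assert.h>   /* for the self-check below */
#include <stdbool.h>
#include <stddef.h>

#define PTRS_PER_PMD 4
#define PTRS_PER_PTE 8

struct pte { bool accessed; };
struct pmd { bool accessed; struct pte ptes[PTRS_PER_PTE]; };

/* Clear accessed bits under each PMD; return how many leaf entries
 * were actually visited. With the non-leaf capability, a clear
 * accessed bit in a PMD proves no PTE under it was used, so the
 * whole range is skipped. */
static size_t clear_young(struct pmd *pmds, size_t n, bool has_nonleaf_young)
{
	size_t visited = 0;

	for (size_t i = 0; i < n; i++) {
		if (has_nonleaf_young && !pmds[i].accessed)
			continue; /* reduced search space */
		pmds[i].accessed = false;
		for (size_t j = 0; j < PTRS_PER_PTE; j++) {
			visited++;
			pmds[i].ptes[j].accessed = false;
		}
	}
	return visited;
}
```

With the capability, the walker touches only the leaf entries under
PMDs that were actually used; without it, every leaf entry must be
visited.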

Note that:
1. Although an inline function is preferable, this capability is added
as a configuration option for consistency with the existing macros.
2. Given the limited interest in other x86 varieties, this capability
was tested only on Intel and AMD CPUs.

Thanks to the following developers for their efforts [2][3].
Randy Dunlap <[email protected]>
Stephen Rothwell <[email protected]>

[1] Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3 (June 2021), section 4.8
[2] https://lore.kernel.org/r/[email protected]/
[3] https://lore.kernel.org/r/[email protected]/

Signed-off-by: Yu Zhao <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
arch/Kconfig | 8 ++++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 3 ++-
arch/x86/mm/pgtable.c | 5 ++++-
include/linux/pgtable.h | 4 ++--
5 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 31c4fdc4a4ba..8a2cb732b09e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1376,6 +1376,14 @@ config DYNAMIC_SIGFRAME
config HAVE_ARCH_NODE_DEV_GROUP
bool

+config ARCH_HAS_NONLEAF_PMD_YOUNG
+ bool
+ help
+ Architectures that select this option are capable of setting the
+ accessed bit in non-leaf PMD entries when using them as part of linear
+ address translations. Page table walkers that clear the accessed bit
+ may use this capability to reduce their search space.
+
source "kernel/gcov/Kconfig"

source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4bed3abf444d..fce1d9f41cc3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -86,6 +86,7 @@ config X86
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
+ select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
select ARCH_HAS_COPY_MC if X86_64
select ARCH_HAS_SET_MEMORY
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 016606a0cf20..9cb3cf4cf6dd 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -820,7 +820,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)

static inline int pmd_bad(pmd_t pmd)
{
- return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+ return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
+ (_KERNPG_TABLE & ~_PAGE_ACCESSED);
}

static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..a224193d84bf 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
return ret;
}

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
@@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,

return ret;
}
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int pudp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pud_t *pudp)
{
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 79f64dcff07d..743e7fc4afda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,7 +212,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
#endif

#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmdp)
@@ -233,7 +233,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
BUILD_BUG();
return 0;
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
#endif

#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
--
2.36.0.550.gb090851708-goog


2022-05-18 04:27:57

by Yu Zhao

Subject: [PATCH v11 14/14] mm: multi-gen LRU: design doc

Add a design doc.

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
Documentation/vm/index.rst | 1 +
Documentation/vm/multigen_lru.rst | 159 ++++++++++++++++++++++++++++++
2 files changed, 160 insertions(+)
create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 44365c4574a3..b48434300226 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -25,6 +25,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the
ksm
memory-model
mmu_notifier
+ multigen_lru
numa
overcommit-accounting
page_migration
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..bc8eaf1b956c
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,159 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Design overview
+===============
+Objectives
+----------
+The design objectives are:
+
+* Good representation of access recency
+* Try to profit from spatial locality
+* Fast paths to make obvious choices
+* Simple self-correcting heuristics
+
+The representation of access recency is at the core of all LRU
+implementations. In the multi-gen LRU, each generation represents a
+group of pages with similar access recency. Generations establish a
+(time-based) common frame of reference and therefore help make better
+choices, e.g., between different memcgs on a computer or different
+computers in a data center (for job scheduling).
+
+Exploiting spatial locality improves efficiency when gathering the
+accessed bit. A rmap walk targets a single page and does not try to
+profit from discovering a young PTE. A page table walk can sweep all
+the young PTEs in an address space, but the address space can be too
+sparse to make a profit. The key is to optimize both methods and use
+them in combination.
+
+Fast paths reduce code complexity and runtime overhead. Unmapped pages
+do not require TLB flushes; clean pages do not require writeback.
+These facts are only helpful when other conditions, e.g., access
+recency, are similar. With generations as a common frame of reference,
+additional factors stand out. But obvious choices might not be good
+choices; thus self-correction is necessary.
+
+The benefits of simple self-correcting heuristics are self-evident.
+Again, with generations as a common frame of reference, this becomes
+attainable. Specifically, pages in the same generation can be
+categorized based on additional factors, and a feedback loop can
+statistically compare the refault percentages across those categories
+and infer which of them are better choices.
+
+Assumptions
+-----------
+The protection of hot pages and the selection of cold pages are based
+on page access channels and patterns. There are two access channels:
+
+* Accesses through page tables
+* Accesses through file descriptors
+
+The protection of the former channel is by design stronger because:
+
+1. The uncertainty in determining the access patterns of the former
+ channel is higher due to the approximation of the accessed bit.
+2. The cost of evicting the former channel is higher due to the TLB
+ flushes required and the likelihood of encountering the dirty bit.
+3. The penalty of underprotecting the former channel is higher because
+ applications usually do not prepare themselves for major page
+ faults like they do for blocked I/O. E.g., GUI applications
+ commonly use dedicated I/O threads to avoid blocking rendering
+ threads.
+
+There are also two access patterns:
+
+* Accesses exhibiting temporal locality
+* Accesses not exhibiting temporal locality
+
+For the reasons listed above, the former channel is assumed to follow
+the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
+present, and the latter channel is assumed to follow the latter
+pattern unless outlying refaults have been observed.
+
+Workflow overview
+=================
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[]`` separately for anon and file types as clean file
+pages can be evicted regardless of swap constraints. These three
+variables are monotonically increasing.
+
+Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
+bits in order to fit into the gen counter in ``folio->flags``. Each
+truncated generation number is an index to ``lrugen->lists[]``. The
+sliding window technique is used to track at least ``MIN_NR_GENS`` and
+at most ``MAX_NR_GENS`` generations. The gen counter stores a value
+within ``[1, MAX_NR_GENS]`` while a page is on one of
+``lrugen->lists[]``; otherwise it stores zero.
+
+Each generation is divided into multiple tiers. Tiers represent
+different ranges of numbers of accesses through file descriptors. A
+page accessed ``N`` times through file descriptors is in tier
+``order_base_2(N)``. In contrast to moving across generations, which
+requires the LRU lock, moving across tiers only involves atomic
+operations on ``folio->flags`` and therefore has a negligible cost. A
+feedback loop modeled after the PID controller monitors refaults over
+all the tiers from anon and file types and decides which tiers from
+which types to evict or protect.
+
+There are two conceptually independent procedures: the aging and the
+eviction. They form a closed-loop system, i.e., the page reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, it
+increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
+``MIN_NR_GENS``. The aging promotes hot pages to the youngest
+generation when it finds them accessed through page tables; the
+demotion of cold pages happens consequently when it increments
+``max_seq``. The aging uses page table walks and rmap walks to find
+young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
+and calls ``walk_page_range()`` with each ``mm_struct`` on this list
+to scan PTEs, and after each iteration, it increments ``max_seq``. For
+the latter, when the eviction walks the rmap and finds a young PTE,
+the aging scans the adjacent PTEs. For both, on finding a young PTE,
+the aging clears the accessed bit and updates the gen counter of the
+page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, it
+increments ``min_seq`` when ``lrugen->lists[]`` indexed by
+``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
+evict from, it first compares ``min_seq[]`` to select the older type.
+If both types are equally old, it selects the one whose first tier has
+a lower refault percentage. The first tier contains single-use
+unmapped clean pages, which are the best bet. The eviction sorts a
+page according to its gen counter if the aging has found this page
+accessed through page tables and updated its gen counter. It also
+moves a page to the next generation, i.e., ``min_seq+1``, if this page
+was accessed multiple times through file descriptors and the feedback
+loop has detected outlying refaults from the tier this page is in. To
+do this, the feedback loop uses the first tier as the baseline, for
+the reason stated earlier.
+
+Summary
+-------
+The multi-gen LRU can be disassembled into the following parts:
+
+* Generations
+* Page table walks
+* Rmap walks
+* Bloom filters
+* PID controller
+
+The aging and the eviction form a producer-consumer model;
+specifically, the latter drives the former by the sliding window over
+generations. Within the aging, rmap walks drive page table walks by
+inserting hot densely populated page tables to the Bloom filters.
+Within the eviction, the PID controller uses refaults as the feedback
+to select types to evict and tiers to protect.
--
2.36.0.550.gb090851708-goog
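To make the arithmetic in the design doc above concrete, here is a
small hypothetical user-space model (plain C, not the kernel
implementation; the constants and helper names are illustrative): the
gen counter kept in ``folio->flags`` is the monotonically increasing
sequence number truncated modulo ``MAX_NR_GENS`` plus one, and a page
accessed ``N`` times through file descriptors lands in tier
``order_base_2(N)``.

```c
#include <assert.h>   /* for the self-check below */

#define MAX_NR_GENS 4U

/* Gen counter kept in folio->flags while a page is on one of
 * lrugen->lists[]: the sequence number truncated to [1, MAX_NR_GENS];
 * zero means the page is not on any multi-gen LRU list. */
static unsigned int gen_counter(unsigned long seq)
{
	return (unsigned int)(seq % MAX_NR_GENS) + 1;
}

/* Tier of a page accessed n >= 1 times through file descriptors:
 * order_base_2(n), i.e., ceil(log2(n)). */
static unsigned int tier_of(unsigned int n)
{
	unsigned int tier = 0;

	while ((1U << tier) < n)
		tier++;
	return tier;
}
```

So the first access through a file descriptor puts a page in tier 0,
the second in tier 1, and the third and fourth both in tier 2: higher
tiers fill progressively more slowly, which is what lets the feedback
loop compare refault percentages across them.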


2022-05-18 04:33:44

by Jens Axboe

Subject: Re: [PATCH v11 00/14] Multi-Gen LRU Framework

On 5/17/22 7:46 PM, Yu Zhao wrote:
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.

Where's the changelog since v10?

--
Jens Axboe


2022-05-18 04:36:25

by Yu Zhao

Subject: [PATCH v11 13/14] mm: multi-gen LRU: admin guide

Add an admin guide.

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
Documentation/admin-guide/mm/index.rst | 1 +
Documentation/admin-guide/mm/multigen_lru.rst | 156 ++++++++++++++++++
mm/Kconfig | 3 +-
3 files changed, 159 insertions(+), 1 deletion(-)
create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..2cf5bae62036 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
+ multigen_lru
nommu-mmap
numa_memory_policy
numaperf
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..6355f2b5019d
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,156 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enabled`` accepts different values to enable or disable the
+following components. Its default value depends on
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
+unless some of them have unforeseen side effects. Writing to
+``enabled`` has no effect when a component is not supported by the
+hardware, and valid values will be accepted even when the main switch
+is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+ batches, when MMU sets it (e.g., on x86). This behavior can
+ theoretically worsen lock contention (mmap_lock). If it is
+ disabled, the multi-gen LRU will suffer a minor performance
+ degradation for workloads that contiguously map hot pages,
+ whose accessed bits can be otherwise cleared by fewer larger
+ batches.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+ well, when MMU sets it (e.g., on x86). This behavior was not
+ verified on x86 varieties other than Intel and AMD. If it is
+ disabled, the multi-gen LRU will suffer a negligible
+ performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+ echo y >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0007
+ echo 5 >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+The default value ``0`` means disabled.
+
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, as is
+concatenation with the delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application needs
+in a given time interval, and it is usually done with little impact on
+the performance of the application. E.g., data centers want to
+optimize job scheduling (bin packing) to improve memory utilization.
+When a new job comes in, the job scheduler needs to find out whether
+each server it manages can allocate a certain amount of memory for
+this new job before it can pick a candidate. To do so, the job
+scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram. The
+histograms are noncumulative.
+::
+
+ memcg memcg_id memcg_path
+ node node_id
+ min_gen_nr age_in_ms nr_anon_pages nr_file_pages
+ ...
+ max_gen_nr age_in_ms nr_anon_pages nr_file_pages
+
+Each bin contains an estimated number of pages that have been accessed
+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
+the former is the largest and that of the latter is the smallest.
+
+Users can write ``+ memcg_id node_id max_gen_nr
+[can_swap [force_scan]]`` to ``lru_gen`` to create a new generation
+``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if
+set to ``1``, forces the scan of anon pages when swap is off; if set
+to ``0``, it skips anon pages even when swap is on. ``force_scan``
+defaults to ``1`` and, if set to ``0``, employs heuristics to reduce
+the overhead, which is likely to reduce the coverage as well.
+
+A typical use case is that a job scheduler writes to ``lru_gen`` at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold pages defined by
+this time interval.
+
+Proactive reclaim
+-----------------
+Proactive reclaim induces page reclaim when there is no memory
+pressure. It usually targets cold pages only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim cold pages on
+the server it selected to improve the chance of successfully landing
+this new job.
+
+Users can write ``- memcg_id node_id min_gen_nr [swappiness
+[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
+equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
+``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
+aged and therefore cannot be evicted. ``swappiness`` overrides the
+default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
+the number of pages to evict.
+
+A typical use case is that a job scheduler writes to ``lru_gen``
+before it tries to land a new job on a server. If it fails to
+materialize enough cold pages because of the overestimation, it
+retries on the next server according to the ranking result obtained
+from the working set estimation step. This less forceful approach
+limits the impacts on the existing jobs.
diff --git a/mm/Kconfig b/mm/Kconfig
index 426ea5f57d88..05291697055a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -916,7 +916,8 @@ config LRU_GEN
# make sure folio->flags has enough spare bits
depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
help
- A high performance LRU implementation to overcommit memory.
+ A high performance LRU implementation to overcommit memory. See
+ Documentation/admin-guide/mm/multigen_lru.rst for details.

config LRU_GEN_ENABLED
bool "Enable by default"
--
2.36.0.550.gb090851708-goog


2022-05-18 04:47:11

by Yu Zhao

Subject: [PATCH v11 12/14] mm: multi-gen LRU: debugfs interface

Add /sys/kernel/debug/lru_gen for working set estimation and proactive
reclaim. These techniques are commonly used to optimize job scheduling
(bin packing) in data centers [1][2].

Compared with the page table-based approach and the PFN-based
approach, this lruvec-based approach has the following advantages:
1. It offers better choices because it is aware of memcgs, NUMA nodes,
shared mappings and unmapped page cache.
2. It is more scalable because it is O(nr_hot_pages), whereas the
PFN-based approach is O(nr_total_pages).

Add /sys/kernel/debug/lru_gen_full for debugging.
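The write syntax these commands use can be sketched in userspace with
the same scanf format that lru_gen_seq_write() in this patch applies to
each command line. parse_cmd() is a hypothetical helper, and -1 stands
in for the kernel's -EINVAL; only the format string and the validity
checks mirror the kernel side:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one lru_gen debugfs command, e.g. "- 2 0 4 150 1048576".
 * The format string and the "at least four fields, no trailing junk"
 * rule match lru_gen_seq_write(); everything else is illustrative.
 */
static int parse_cmd(const char *cur, char *cmd, unsigned int *memcg_id,
		     unsigned int *nid, unsigned long *seq,
		     unsigned int *swappiness, unsigned long *opt)
{
	int end = 0;
	int n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", cmd, memcg_id,
		       nid, seq, &end, swappiness, &end, opt, &end);

	/* cmd, memcg_id, nid and seq are required; the rest are optional */
	if (n < 4 || cur[end])
		return -1;
	return n;
}
```

Commands that parse successfully are then dispatched on the leading
character: ``+`` runs aging and ``-`` runs eviction, as run_cmd() in
the diff below shows.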

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
include/linux/nodemask.h | 1 +
mm/vmscan.c | 403 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 394 insertions(+), 10 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 567c3ddba2c4..90840c459abc 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state)
#define first_online_node 0
#define first_memory_node 0
#define next_online_node(nid) (MAX_NUMNODES)
+#define next_memory_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1U
#define nr_online_nodes 1U

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 823c55e17e64..aef71c45c424 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -53,6 +53,7 @@
#include <linux/pagewalk.h>
#include <linux/shmem_fs.h>
#include <linux/ctype.h>
+#include <linux/debugfs.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -4063,12 +4064,39 @@ static void free_mm_walk(struct lru_gen_mm_walk *walk)
kfree(walk);
}

-static void inc_min_seq(struct lruvec *lruvec, int type)
+static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
{
+ int zone;
+ int remaining = MAX_LRU_BATCH;
struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);

+ if (type == LRU_GEN_ANON && !can_swap)
+ goto done;
+
+ for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+ struct list_head *head = &lrugen->lists[old_gen][type][zone];
+
+ while (!list_empty(head)) {
+ struct folio *folio = lru_to_folio(head);
+
+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
+
+ new_gen = folio_inc_gen(lruvec, folio, false);
+ list_move_tail(&folio->lru, &lrugen->lists[new_gen][type][zone]);
+
+ if (!--remaining)
+ return false;
+ }
+ }
+done:
reset_ctrl_pos(lruvec, type, true);
WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+
+ return true;
}

static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
@@ -4114,7 +4142,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
return success;
}

-static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
+static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
{
int prev, next;
int type, zone;
@@ -4128,9 +4156,13 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
continue;

- VM_WARN_ON_ONCE(type == LRU_GEN_FILE || can_swap);
+ VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));

- inc_min_seq(lruvec, type);
+ while (!inc_min_seq(lruvec, type, can_swap)) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ cond_resched();
+ spin_lock_irq(&lruvec->lru_lock);
+ }
}

/*
@@ -4167,7 +4199,7 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
}

static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
- struct scan_control *sc, bool can_swap)
+ struct scan_control *sc, bool can_swap, bool force_scan)
{
bool success;
struct lru_gen_mm_walk *walk;
@@ -4182,7 +4214,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
* handful of PTEs. Spreading the work out over a period of time usually
* is less efficient, but it avoids bursty page faults.
*/
- if (!(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) {
+ if (!force_scan && !(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) {
success = iterate_mm_list_nowalk(lruvec, max_seq);
goto done;
}
@@ -4196,7 +4228,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
walk->lruvec = lruvec;
walk->max_seq = max_seq;
walk->can_swap = can_swap;
- walk->force_scan = false;
+ walk->force_scan = force_scan;

do {
success = iterate_mm_list(lruvec, walk, &mm);
@@ -4218,7 +4250,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,

VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq));

- inc_max_seq(lruvec, can_swap);
+ inc_max_seq(lruvec, can_swap, force_scan);
/* either this sees any waiters or they will see updated max_seq */
if (wq_has_sleeper(&lruvec->mm_state.wait))
wake_up_all(&lruvec->mm_state.wait);
@@ -4317,7 +4349,7 @@ static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
nr_to_scan++;

if (nr_to_scan && need_aging)
- try_to_inc_max_seq(lruvec, max_seq, sc, swappiness);
+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);

return true;
}
@@ -4887,7 +4919,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
return 0;
}

- if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap))
+ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false))
return nr_to_scan;

return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0;
@@ -5175,6 +5207,354 @@ static struct attribute_group lru_gen_attr_group = {
.attrs = lru_gen_attrs,
};

+/******************************************************************************
+ * debugfs interface
+ ******************************************************************************/
+
+static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos)
+{
+ struct mem_cgroup *memcg;
+ loff_t nr_to_skip = *pos;
+
+ m->private = kvmalloc(PATH_MAX, GFP_KERNEL);
+ if (!m->private)
+ return ERR_PTR(-ENOMEM);
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (!nr_to_skip--)
+ return get_lruvec(memcg, nid);
+ }
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ return NULL;
+}
+
+static void lru_gen_seq_stop(struct seq_file *m, void *v)
+{
+ if (!IS_ERR_OR_NULL(v))
+ mem_cgroup_iter_break(NULL, lruvec_memcg(v));
+
+ kvfree(m->private);
+ m->private = NULL;
+}
+
+static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ int nid = lruvec_pgdat(v)->node_id;
+ struct mem_cgroup *memcg = lruvec_memcg(v);
+
+ ++*pos;
+
+ nid = next_memory_node(nid);
+ if (nid == MAX_NUMNODES) {
+ memcg = mem_cgroup_iter(NULL, memcg, NULL);
+ if (!memcg)
+ return NULL;
+
+ nid = first_memory_node;
+ }
+
+ return get_lruvec(memcg, nid);
+}
+
+static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
+ unsigned long max_seq, unsigned long *min_seq,
+ unsigned long seq)
+{
+ int i;
+ int type, tier;
+ int hist = lru_hist_from_seq(seq);
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+ seq_printf(m, " %10d", tier);
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ unsigned long n[3] = {};
+
+ if (seq == max_seq) {
+ n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
+ n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
+
+ seq_printf(m, " %10luR %10luT %10lu ", n[0], n[1], n[2]);
+ } else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
+ n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+ n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]);
+ if (tier)
+ n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]);
+
+ seq_printf(m, " %10lur %10lue %10lup", n[0], n[1], n[2]);
+ } else
+ seq_puts(m, " 0 0 0 ");
+ }
+ seq_putc(m, '\n');
+ }
+
+ seq_puts(m, " ");
+ for (i = 0; i < NR_MM_STATS; i++) {
+ if (seq == max_seq && NR_HIST_GENS == 1)
+ seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+ toupper(MM_STAT_CODES[i]));
+ else if (seq != max_seq && NR_HIST_GENS > 1)
+ seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+ MM_STAT_CODES[i]);
+ else
+ seq_puts(m, " 0 ");
+ }
+ seq_putc(m, '\n');
+}
+
+static int lru_gen_seq_show(struct seq_file *m, void *v)
+{
+ unsigned long seq;
+ bool full = !debugfs_real_fops(m->file)->write;
+ struct lruvec *lruvec = v;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ int nid = lruvec_pgdat(lruvec)->node_id;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ DEFINE_MAX_SEQ(lruvec);
+ DEFINE_MIN_SEQ(lruvec);
+
+ if (nid == first_memory_node) {
+ const char *path = memcg ? m->private : "";
+
+#ifdef CONFIG_MEMCG
+ if (memcg)
+ cgroup_path(memcg->css.cgroup, m->private, PATH_MAX);
+#endif
+ seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path);
+ }
+
+ seq_printf(m, " node %5d\n", nid);
+
+ if (!full)
+ seq = min_seq[LRU_GEN_ANON];
+ else if (max_seq >= MAX_NR_GENS)
+ seq = max_seq - MAX_NR_GENS + 1;
+ else
+ seq = 0;
+
+ for (; seq <= max_seq; seq++) {
+ int type, zone;
+ int gen = lru_gen_from_seq(seq);
+ unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+ seq_printf(m, " %10lu %10u", seq, jiffies_to_msecs(jiffies - birth));
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ long size = 0;
+ char mark = full && seq < min_seq[type] ? 'x' : ' ';
+
+ for (zone = 0; zone < MAX_NR_ZONES; zone++)
+ size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+ seq_printf(m, " %10lu%c", max(size, 0L), mark);
+ }
+
+ seq_putc(m, '\n');
+
+ if (full)
+ lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq);
+ }
+
+ return 0;
+}
+
+static const struct seq_operations lru_gen_seq_ops = {
+ .start = lru_gen_seq_start,
+ .stop = lru_gen_seq_stop,
+ .next = lru_gen_seq_next,
+ .show = lru_gen_seq_show,
+};
+
+static int run_aging(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+ bool can_swap, bool force_scan)
+{
+ DEFINE_MAX_SEQ(lruvec);
+ DEFINE_MIN_SEQ(lruvec);
+
+ if (seq < max_seq)
+ return 0;
+
+ if (seq > max_seq || (!force_scan && min_seq[!can_swap] + MAX_NR_GENS - 1 <= max_seq))
+ return -EINVAL;
+
+ try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, force_scan);
+
+ return 0;
+}
+
+static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+ int swappiness, unsigned long nr_to_reclaim)
+{
+ struct blk_plug plug;
+ int err = -EINTR;
+ DEFINE_MAX_SEQ(lruvec);
+
+ if (seq + MIN_NR_GENS > max_seq)
+ return -EINVAL;
+
+ sc->nr_reclaimed = 0;
+
+ blk_start_plug(&plug);
+
+ while (!signal_pending(current)) {
+ DEFINE_MIN_SEQ(lruvec);
+
+ if (seq < min_seq[!swappiness] || sc->nr_reclaimed >= nr_to_reclaim ||
+ !evict_folios(lruvec, sc, swappiness, NULL)) {
+ err = 0;
+ break;
+ }
+
+ cond_resched();
+ }
+
+ blk_finish_plug(&plug);
+
+ return err;
+}
+
+static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
+ struct scan_control *sc, int swappiness, unsigned long opt)
+{
+ struct lruvec *lruvec;
+ int err = -EINVAL;
+ struct mem_cgroup *memcg = NULL;
+
+ if (!mem_cgroup_disabled()) {
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(memcg_id);
+#ifdef CONFIG_MEMCG
+ if (memcg && !css_tryget(&memcg->css))
+ memcg = NULL;
+#endif
+ rcu_read_unlock();
+
+ if (!memcg)
+ goto done;
+ }
+ if (memcg_id != mem_cgroup_id(memcg))
+ goto done;
+
+ if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY))
+ goto done;
+
+ lruvec = get_lruvec(memcg, nid);
+
+ if (swappiness < 0)
+ swappiness = get_swappiness(lruvec, sc);
+ else if (swappiness > 200)
+ goto done;
+
+ switch (cmd) {
+ case '+':
+ err = run_aging(lruvec, seq, sc, swappiness, opt);
+ break;
+ case '-':
+ err = run_eviction(lruvec, seq, sc, swappiness, opt);
+ break;
+ }
+done:
+ mem_cgroup_put(memcg);
+
+ return err;
+}
+
+static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
+ size_t len, loff_t *pos)
+{
+ void *buf;
+ char *cur, *next;
+ unsigned int flags;
+ int err = 0;
+ struct scan_control sc = {
+ .may_writepage = true,
+ .may_unmap = true,
+ .may_swap = true,
+ .reclaim_idx = MAX_NR_ZONES - 1,
+ .gfp_mask = GFP_KERNEL,
+ };
+
+ buf = kvmalloc(len + 1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ if (copy_from_user(buf, src, len)) {
+ kvfree(buf);
+ return -EFAULT;
+ }
+
+ next = buf;
+ next[len] = '\0';
+
+ sc.reclaim_state.mm_walk = alloc_mm_walk();
+ if (!sc.reclaim_state.mm_walk) {
+ kvfree(buf);
+ return -ENOMEM;
+ }
+
+ set_task_reclaim_state(current, &sc.reclaim_state);
+ flags = memalloc_noreclaim_save();
+
+ while ((cur = strsep(&next, ",;\n"))) {
+ int n;
+ int end;
+ char cmd;
+ unsigned int memcg_id;
+ unsigned int nid;
+ unsigned long seq;
+ unsigned int swappiness = -1;
+ unsigned long opt = -1;
+
+ cur = skip_spaces(cur);
+ if (!*cur)
+ continue;
+
+ n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
+ &seq, &end, &swappiness, &end, &opt, &end);
+ if (n < 4 || cur[end]) {
+ err = -EINVAL;
+ break;
+ }
+
+ err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt);
+ if (err)
+ break;
+ }
+
+ memalloc_noreclaim_restore(flags);
+ set_task_reclaim_state(current, NULL);
+
+ free_mm_walk(sc.reclaim_state.mm_walk);
+ kvfree(buf);
+
+ return err ? : len;
+}
+
+static int lru_gen_seq_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &lru_gen_seq_ops);
+}
+
+static const struct file_operations lru_gen_rw_fops = {
+ .open = lru_gen_seq_open,
+ .read = seq_read,
+ .write = lru_gen_seq_write,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static const struct file_operations lru_gen_ro_fops = {
+ .open = lru_gen_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
/******************************************************************************
* initialization
******************************************************************************/
@@ -5233,6 +5613,9 @@ static int __init init_lru_gen(void)
if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
pr_err("lru_gen: failed to create sysfs group\n");

+ debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
+ debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
+
return 0;
};
late_initcall(init_lru_gen);
--
2.36.0.550.gb090851708-goog


2022-05-18 04:48:45

by Yu Zhao

Subject: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
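The gen arithmetic above can be sketched in plain C. The bit positions
below are illustrative (the kernel packs the field into folio->flags
alongside other page flags), but the gen+1 encoding, where a stored 0
means the folio is not on an lru_gen list, mirrors folio_update_gen()
in this patch:

```c
#include <assert.h>

/* illustrative layout; only MAX_NR_GENS matches the kernel's value */
#define MAX_NR_GENS	4UL
#define LRU_GEN_PGOFF	8
#define LRU_GEN_WIDTH	3
#define LRU_GEN_MASK	(((1UL << LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)

static int lru_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;	/* the gen counter wraps around */
}

/*
 * Sketch of folio_update_gen(): store the new gen, return the old one,
 * or -1 if the folio was not on an lru_gen list. (The kernel version
 * additionally sets PG_referenced in that case and uses cmpxchg.)
 */
static int update_gen(unsigned long *flags, int gen)
{
	unsigned long old = *flags;

	if (!(old & LRU_GEN_MASK))
		return -1;

	*flags = (old & ~LRU_GEN_MASK) | ((gen + 1UL) << LRU_GEN_PGOFF);
	return (int)((old & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}
```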

Server benchmark results:
Single workload:
fio (buffered I/O): no change

Single workload:
memcached (anon): +[5.5, 7.5]%
Ops/sec KB/sec
patch1-6: 1120643.70 43588.06
patch1-7: 1193918.93 46438.15

Configurations:
no change

Client benchmark results:
kswapd profiles:
patch1-6
35.99% lzo1x_1_do_compress (real work)
19.40% page_vma_mapped_walk
6.31% _raw_spin_unlock_irq
3.95% do_raw_spin_lock
2.39% anon_vma_interval_tree_iter_first
2.25% ptep_clear_flush
1.92% __anon_vma_interval_tree_subtree_search
1.70% folio_referenced_one
1.68% __zram_bvec_write
1.43% anon_vma_interval_tree_iter_next

patch1-7
45.90% lzo1x_1_do_compress (real work)
9.14% page_vma_mapped_walk
6.81% _raw_spin_unlock_irq
2.80% ptep_clear_flush
2.34% __zram_bvec_write
2.29% do_raw_spin_lock
1.84% lru_gen_look_around
1.78% memmove
1.74% obj_malloc
1.50% free_unref_page_list

Configurations:
no change

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
include/linux/memcontrol.h | 31 ++++++++
include/linux/mm.h | 5 ++
include/linux/mmzone.h | 6 ++
mm/internal.h | 1 +
mm/memcontrol.c | 1 +
mm/rmap.c | 7 ++
mm/swap.c | 4 +-
mm/vmscan.c | 157 +++++++++++++++++++++++++++++++++++++
8 files changed, 210 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 89b14729d59f..2bfdcc77648a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -438,6 +438,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
+ * - mem_cgroup_trylock_pages()
*
* For a kmem folio a caller should hold an rcu read lock to protect memcg
* associated with a kmem folio from being released.
@@ -499,6 +500,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
+ * - mem_cgroup_trylock_pages()
*
* For a kmem page a caller should hold an rcu read lock to protect memcg
* associated with a kmem page from being released.
@@ -948,6 +950,23 @@ void unlock_page_memcg(struct page *page);

void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);

+/* try to stabilize folio_memcg() for all the pages in a memcg */
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+ rcu_read_lock();
+
+ if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
+ return true;
+
+ rcu_read_unlock();
+ return false;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+ rcu_read_unlock();
+}
+
/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
int idx, int val)
@@ -1386,6 +1405,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
{
}

+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+ /* to match folio_memcg_rcu() */
+ rcu_read_lock();
+ return true;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+ rcu_read_unlock();
+}
+
static inline void mem_cgroup_handle_over_high(void)
{
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 894c289c2c06..4e8ab4ad4473 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1523,6 +1523,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
return page_to_pfn(&folio->page);
}

+static inline struct folio *pfn_folio(unsigned long pfn)
+{
+ return page_folio(pfn_to_page(pfn));
+}
+
static inline atomic_t *folio_pincount_ptr(struct folio *folio)
{
return &folio_page(folio, 1)->compound_pincount;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2d023d243e73..f0b980362186 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -374,6 +374,7 @@ enum lruvec_flags {
#ifndef __GENERATING_BOUNDS_H

struct lruvec;
+struct page_vma_mapped_walk;

#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
@@ -429,6 +430,7 @@ struct lru_gen_struct {
};

void lru_gen_init_lruvec(struct lruvec *lruvec);
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

#ifdef CONFIG_MEMCG
void lru_gen_init_memcg(struct mem_cgroup *memcg);
@@ -441,6 +443,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}

+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
#ifdef CONFIG_MEMCG
static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
{
diff --git a/mm/internal.h b/mm/internal.h
index cf16280ce132..59d2422b647d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -68,6 +68,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
void folio_rotate_reclaimable(struct folio *folio);
bool __folio_end_writeback(struct folio *folio);
void deactivate_file_folio(struct folio *folio);
+void folio_activate(struct folio *folio);

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ee074f80e72..98aa720ac639 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2769,6 +2769,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
+ * - mem_cgroup_trylock_pages()
*/
folio->memcg_data = (unsigned long)memcg;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index fedb82371efe..7cb7ef29088a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@
#include <linux/page_idle.h>
#include <linux/memremap.h>
#include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>

#include <asm/tlbflush.h>

@@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
}

if (pvmw.pte) {
+ if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
+ !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+ lru_gen_look_around(&pvmw);
+ referenced++;
+ }
+
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte)) {
/*
diff --git a/mm/swap.c b/mm/swap.c
index a99d22308f28..0aa1d0b33d42 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -342,7 +342,7 @@ static bool need_activate_page_drain(int cpu)
return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
}

-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
{
if (folio_test_lru(folio) && !folio_test_active(folio) &&
!folio_test_unevictable(folio)) {
@@ -362,7 +362,7 @@ static inline void activate_page_drain(int cpu)
{
}

-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
{
struct lruvec *lruvec;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 891f0ab69b3a..cf89a28c3b0e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1554,6 +1554,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
if (!sc->may_unmap && page_mapped(page))
goto keep_locked;

+ /* folio_update_gen() tried to promote this page? */
+ if (lru_gen_enabled() && !ignore_references &&
+ page_mapped(page) && PageReferenced(page))
+ goto keep_locked;
+
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

@@ -3137,6 +3142,28 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
* the aging
******************************************************************************/

+static int folio_update_gen(struct folio *folio, int gen)
+{
+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
+
+ VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
+ VM_WARN_ON_ONCE(!rcu_read_lock_held());
+
+ do {
+ /* lru_gen_del_folio() has isolated this page? */
+ if (!(old_flags & LRU_GEN_MASK)) {
+ /* for shrink_page_list() */
+ new_flags = old_flags | BIT(PG_referenced);
+ continue;
+ }
+
+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
+ new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
+
+ return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
{
int type = folio_is_file_lru(folio);
@@ -3147,6 +3174,11 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio);

do {
+ new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+ /* folio_update_gen() has promoted this page? */
+ if (new_gen >= 0 && new_gen != old_gen)
+ return new_gen;
+
new_gen = (old_gen + 1) % MAX_NR_GENS;

new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
@@ -3365,6 +3397,125 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
}

+/*
+ * This function exploits spatial locality when shrink_page_list() walks the
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ */
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+ int i;
+ pte_t *pte;
+ unsigned long start;
+ unsigned long end;
+ unsigned long addr;
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
+ struct folio *folio = pfn_folio(pvmw->pfn);
+ struct mem_cgroup *memcg = folio_memcg(folio);
+ struct pglist_data *pgdat = folio_pgdat(folio);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ DEFINE_MAX_SEQ(lruvec);
+ int old_gen, new_gen = lru_gen_from_seq(max_seq);
+
+ lockdep_assert_held(pvmw->ptl);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
+
+ if (spin_is_contended(pvmw->ptl))
+ return;
+
+ start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+ end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+
+ if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
+ if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
+ end = start + MIN_LRU_BATCH * PAGE_SIZE;
+ else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
+ start = end - MIN_LRU_BATCH * PAGE_SIZE;
+ else {
+ start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
+ end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
+ }
+ }
+
+ pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+
+ rcu_read_lock();
+ arch_enter_lazy_mmu_mode();
+
+ for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ unsigned long pfn = pte_pfn(pte[i]);
+
+ VM_WARN_ON_ONCE(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
+
+ if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+ continue;
+
+ if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+ continue;
+
+ if (!pte_young(pte[i]))
+ continue;
+
+ VM_WARN_ON_ONCE(!pfn_valid(pfn));
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ continue;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ continue;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ continue;
+
+ if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
+ continue;
+
+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ folio_mark_dirty(folio);
+
+ old_gen = folio_lru_gen(folio);
+ if (old_gen < 0)
+ folio_set_referenced(folio);
+ else if (old_gen != new_gen)
+ __set_bit(i, bitmap);
+ }
+
+ arch_leave_lazy_mmu_mode();
+ rcu_read_unlock();
+
+ if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+ folio = pfn_folio(pte_pfn(pte[i]));
+ folio_activate(folio);
+ }
+ return;
+ }
+
+ /* folio_update_gen() requires stable folio_memcg() */
+ if (!mem_cgroup_trylock_pages(memcg))
+ return;
+
+ spin_lock_irq(&lruvec->lru_lock);
+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+
+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+ folio = pfn_folio(pte_pfn(pte[i]));
+ if (folio_memcg_rcu(folio) != memcg)
+ continue;
+
+ old_gen = folio_update_gen(folio, new_gen);
+ if (old_gen < 0 || old_gen == new_gen)
+ continue;
+
+ lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+ }
+
+ spin_unlock_irq(&lruvec->lru_lock);
+
+ mem_cgroup_unlock_pages();
+}
+
/******************************************************************************
* the eviction
******************************************************************************/
@@ -3401,6 +3552,12 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
return true;
}

+ /* promoted */
+ if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+ return true;
+ }
+
/* protected */
if (tier > tier_idx) {
int hist = lru_hist_from_seq(lrugen->min_seq[type]);
--
2.36.0.550.gb090851708-goog


2022-05-18 04:54:43

by Yu Zhao

Subject: [PATCH v11 08/14] mm: multi-gen LRU: support page table walks

To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.

NB: this behavior has nothing in common with the page table scanning
in the 2.4 kernel [1], which searched page tables for old PTEs, added
cold pages to the swapcache and unmapped them.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
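The hand-out of unique mm_structs to concurrent walkers can be sketched as a shared cursor advanced under the mm_list lock; the toy model below uses illustrative names (`next_mm`, `struct mm`) rather than the kernel's, and omits the locking, skip heuristics and refcounting of the real `iterate_mm_list()`:

```c
/* Toy model of the per-memcg mm_list hand-out: all walkers share one
 * iteration cursor, so each call yields a distinct mm_struct and the
 * walkers can run concurrently without visiting the same mm twice.
 * Names are illustrative only; the kernel holds mm_list->lock here. */
#include <assert.h>
#include <stddef.h>

struct mm {
	int id;
	struct mm *next;
};

struct mm_list {
	struct mm *head;	/* the shared iteration cursor */
};

/* Return the next unvisited mm, or NULL once the list is exhausted. */
static struct mm *next_mm(struct mm_list *list)
{
	struct mm *mm = list->head;

	if (mm)
		list->head = mm->next;
	return mm;
}
```

Two walkers calling `next_mm()` on the same list each receive a different mm, which is the property the real iterator relies on for concurrency.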

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_structs between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
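The generational Bloom filters in optimization 2 can be sketched in userspace as two fixed-size bitmaps selected by `seq % 2`, with k=2 bit indices derived from one hash of the item. This is a minimal illustration, not the kernel's implementation: the hash (Fibonacci hashing instead of `hash_ptr()`), names and sizes are assumptions.

```c
/* Minimal double-buffered ("generational") Bloom filter sketch:
 * m = 1 << 15 bits, k = 2 hash functions, two filters that flip by
 * generation so stale entries age out after one full iteration. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_SHIFT	15
#define BLOOM_BITS	(1UL << BLOOM_SHIFT)
#define NR_FILTERS	2

static uint64_t filters[NR_FILTERS][BLOOM_BITS / 64];

static int filter_gen(unsigned long seq)
{
	return seq % NR_FILTERS;
}

/* k=2: split one multiplicative hash of the pointer into two indices */
static void item_keys(const void *item, unsigned *k0, unsigned *k1)
{
	uint64_t h = (uint64_t)(uintptr_t)item * 0x9E3779B97F4A7C15ULL;

	*k0 = (h >> 3) & (BLOOM_BITS - 1);
	*k1 = (h >> (3 + BLOOM_SHIFT)) & (BLOOM_BITS - 1);
}

/* clear the filter that generation "seq" will write into */
static void bloom_reset(unsigned long seq)
{
	memset(filters[filter_gen(seq)], 0, sizeof(filters[0]));
}

static void bloom_add(unsigned long seq, const void *item)
{
	unsigned k0, k1;
	uint64_t *f = filters[filter_gen(seq)];

	item_keys(item, &k0, &k1);
	f[k0 / 64] |= 1ULL << (k0 % 64);
	f[k1 / 64] |= 1ULL << (k1 % 64);
}

/* may return false positives, never false negatives */
static bool bloom_test(unsigned long seq, const void *item)
{
	unsigned k0, k1;
	uint64_t *f = filters[filter_gen(seq)];

	item_keys(item, &k0, &k1);
	return (f[k0 / 64] >> (k0 % 64) & 1) && (f[k1 / 64] >> (k1 % 64) & 1);
}
```

Because the filter for generation `seq + 2` reuses the slot of generation `seq`, resetting it drops every branch that was not re-inserted in the meantime, which is how populated-branch hints age out.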

Server benchmark results:
Single workload:
fio (buffered I/O): no change

Single workload:
memcached (anon): +[8, 10]%
Ops/sec KB/sec
patch1-7: 1193918.93 46438.15
patch1-8: 1301954.44 50640.27

Configurations:
no change

Client benchmark results:
kswapd profiles:
patch1-7
45.90% lzo1x_1_do_compress (real work)
9.14% page_vma_mapped_walk
6.81% _raw_spin_unlock_irq
2.80% ptep_clear_flush
2.34% __zram_bvec_write
2.29% do_raw_spin_lock
1.84% lru_gen_look_around
1.78% memmove
1.74% obj_malloc
1.50% free_unref_page_list

patch1-8
46.96% lzo1x_1_do_compress (real work)
7.55% page_vma_mapped_walk
5.89% _raw_spin_unlock_irq
3.33% walk_pte_range
2.65% ptep_clear_flush
2.23% __zram_bvec_write
2.08% do_raw_spin_lock
1.83% memmove
1.65% obj_malloc
1.47% free_unref_page_list

Configurations:
no change

Thanks to the following developers for their efforts [3].
kernel test robot <[email protected]>

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo
[3] https://lore.kernel.org/r/[email protected]/

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
fs/exec.c | 2 +
include/linux/memcontrol.h | 5 +
include/linux/mm_types.h | 77 +++
include/linux/mmzone.h | 61 ++-
include/linux/swap.h | 4 +
kernel/exit.c | 1 +
kernel/fork.c | 9 +
kernel/sched/core.c | 1 +
mm/memcontrol.c | 25 +
mm/vmscan.c | 981 ++++++++++++++++++++++++++++++++++++-
10 files changed, 1150 insertions(+), 16 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index e3e55d5e0be1..bba8fc44926f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1011,6 +1011,7 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
+ lru_gen_add_mm(mm);
/*
* This prevents preemption while active_mm is being loaded and
* it and mm are being updated, which could cause problems for
@@ -1023,6 +1024,7 @@ static int exec_mmap(struct mm_struct *mm)
activate_mm(active_mm, mm);
if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
+ lru_gen_use_mm(mm);
tsk->mm->vmacache_seqnum = 0;
vmacache_flush(tsk);
task_unlock(tsk);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2bfdcc77648a..cc16ba262464 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -344,6 +344,11 @@ struct mem_cgroup {
struct deferred_split deferred_split_queue;
#endif

+#ifdef CONFIG_LRU_GEN
+ /* per-memcg mm_struct list */
+ struct lru_gen_mm_list mm_list;
+#endif
+
struct mem_cgroup_per_node *nodeinfo[];
};

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..ef46d5b3da9b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -3,6 +3,7 @@
#define _LINUX_MM_TYPES_H

#include <linux/mm_types_task.h>
+#include <linux/sched.h>

#include <linux/auxvec.h>
#include <linux/kref.h>
@@ -17,6 +18,7 @@
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
#include <linux/seqlock.h>
+#include <linux/mmdebug.h>

#include <asm/mmu.h>

@@ -655,6 +657,22 @@ struct mm_struct {
#ifdef CONFIG_IOMMU_SVA
u32 pasid;
#endif
+#ifdef CONFIG_LRU_GEN
+ struct {
+ /* this mm_struct is on lru_gen_mm_list */
+ struct list_head list;
+#ifdef CONFIG_MEMCG
+ /* points to the memcg of "owner" above */
+ struct mem_cgroup *memcg;
+#endif
+ /*
+ * Set when switching to this mm_struct, as a hint of
+ * whether it has been used since the last time per-node
+ * page table walkers cleared the corresponding bits.
+ */
+ unsigned long bitmap;
+ } lru_gen;
+#endif /* CONFIG_LRU_GEN */
} __randomize_layout;

/*
@@ -681,6 +699,65 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return (struct cpumask *)&mm->cpu_bitmap;
}

+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+ /* mm_struct list for page table walkers */
+ struct list_head fifo;
+ /* protects the list above */
+ spinlock_t lock;
+};
+
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+ INIT_LIST_HEAD(&mm->lru_gen.list);
+#ifdef CONFIG_MEMCG
+ mm->lru_gen.memcg = NULL;
+#endif
+ mm->lru_gen.bitmap = 0;
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+ /* unlikely but not a bug when racing with lru_gen_migrate_mm() */
+ VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list));
+
+ if (!(current->flags & PF_KTHREAD))
+ WRITE_ONCE(mm->lru_gen.bitmap, -1);
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
struct mmu_gather;
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f0b980362186..6aa715a2fea1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -407,7 +407,7 @@ enum {
* min_seq behind.
*
* nr_pages[] are eventually consistent and therefore can be transiently
- * negative.
+ * negative when reset_batch_size() is pending.
*/
struct lru_gen_struct {
/* the aging increments the youngest generation number */
@@ -429,6 +429,58 @@ struct lru_gen_struct {
atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
};

+enum {
+ MM_PTE_TOTAL, /* total leaf entries */
+ MM_PTE_OLD, /* old leaf entries */
+ MM_PTE_YOUNG, /* young leaf entries */
+ MM_PMD_TOTAL, /* total non-leaf entries */
+ MM_PMD_FOUND, /* non-leaf entries found in Bloom filters */
+ MM_PMD_ADDED, /* non-leaf entries added to Bloom filters */
+ NR_MM_STATS
+};
+
+/* mnemonic codes for the mm stats above */
+#define MM_STAT_CODES "toydfa"
+
+/* double-buffering Bloom filters */
+#define NR_BLOOM_FILTERS 2
+
+struct lru_gen_mm_state {
+ /* set to max_seq after each iteration */
+ unsigned long seq;
+ /* where the current iteration starts (inclusive) */
+ struct list_head *head;
+ /* where the last iteration ends (exclusive) */
+ struct list_head *tail;
+ /* to wait for the last page table walker to finish */
+ struct wait_queue_head wait;
+ /* Bloom filters flip after each iteration */
+ unsigned long *filters[NR_BLOOM_FILTERS];
+ /* the mm stats for debugging */
+ unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
+ /* the number of concurrent page table walkers */
+ int nr_walkers;
+};
+
+struct lru_gen_mm_walk {
+ /* the lruvec under reclaim */
+ struct lruvec *lruvec;
+ /* unstable max_seq from lru_gen_struct */
+ unsigned long max_seq;
+ /* the next address within an mm to scan */
+ unsigned long next_addr;
+ /* to batch page table entries */
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+ /* to batch promoted pages */
+ int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+ /* to batch the mm stats */
+ int mm_stats[NR_MM_STATS];
+ /* total batched items */
+ int batched;
+ bool can_swap;
+ bool force_scan;
+};
+
void lru_gen_init_lruvec(struct lruvec *lruvec);
void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

@@ -479,6 +531,8 @@ struct lruvec {
#ifdef CONFIG_LRU_GEN
/* evictable pages divided into generations */
struct lru_gen_struct lrugen;
+ /* to concurrently iterate lru_gen_mm_list */
+ struct lru_gen_mm_state mm_state;
#endif
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
@@ -1072,6 +1126,11 @@ typedef struct pglist_data {

unsigned long flags;

+#ifdef CONFIG_LRU_GEN
+ /* kswapd mm walk data */
+ struct lru_gen_mm_walk mm_walk;
+#endif
+
ZONE_PADDING(_pad2_)

/* Per-node vmstats */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 27093b477c5f..7bdd7bcb135d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -137,6 +137,10 @@ union swap_header {
*/
struct reclaim_state {
unsigned long reclaimed_slab;
+#ifdef CONFIG_LRU_GEN
+ /* per-thread mm walk data */
+ struct lru_gen_mm_walk *mm_walk;
+#endif
};

#ifdef __KERNEL__
diff --git a/kernel/exit.c b/kernel/exit.c
index f072959fcab7..f2d4d48ea790 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -466,6 +466,7 @@ void mm_update_next_owner(struct mm_struct *mm)
goto retry;
}
WRITE_ONCE(mm->owner, c);
+ lru_gen_migrate_mm(mm);
task_unlock(c);
put_task_struct(c);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 35a3beff140b..3e3cab26e66c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1149,6 +1149,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
goto fail_nocontext;

mm->user_ns = get_user_ns(user_ns);
+ lru_gen_init_mm(mm);
return mm;

fail_nocontext:
@@ -1191,6 +1192,7 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ lru_gen_del_mm(mm);
mmdrop(mm);
}

@@ -2660,6 +2662,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
get_task_struct(p);
}

+ if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+ /* lock the task to synchronize with memcg migration */
+ task_lock(p);
+ lru_gen_add_mm(p->mm);
+ task_unlock(p);
+ }
+
wake_up_new_task(p);

/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d58c0389eb23..a8f9e5ebb5dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5057,6 +5057,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
+ lru_gen_use_mm(next->mm);

if (!prev->mm) { // from kernel
/* will mmdrop() in finish_task_switch(). */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 98aa720ac639..ff2b0a09e6c8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6093,6 +6093,30 @@ static void mem_cgroup_move_task(void)
}
#endif

+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+ struct task_struct *task;
+ struct cgroup_subsys_state *css;
+
+ /* find the first leader if there is any */
+ cgroup_taskset_for_each_leader(task, css, tset)
+ break;
+
+ if (!task)
+ return;
+
+ task_lock(task);
+ if (task->mm && task->mm->owner == task)
+ lru_gen_migrate_mm(task->mm);
+ task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
{
if (value == PAGE_COUNTER_MAX)
@@ -6438,6 +6462,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_reset = mem_cgroup_css_reset,
.css_rstat_flush = mem_cgroup_css_rstat_flush,
.can_attach = mem_cgroup_can_attach,
+ .attach = mem_cgroup_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cf89a28c3b0e..f292e7e761b1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,8 @@
#include <linux/printk.h>
#include <linux/dax.h>
#include <linux/psi.h>
+#include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3000,7 +3002,7 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)

-static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid)
+static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
{
struct pglist_data *pgdat = NODE_DATA(nid);

@@ -3045,6 +3047,375 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
}

+/******************************************************************************
+ * mm_struct list
+ ******************************************************************************/
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+ static struct lru_gen_mm_list mm_list = {
+ .fifo = LIST_HEAD_INIT(mm_list.fifo),
+ .lock = __SPIN_LOCK_UNLOCKED(mm_list.lock),
+ };
+
+#ifdef CONFIG_MEMCG
+ if (memcg)
+ return &memcg->mm_list;
+#endif
+ VM_WARN_ON_ONCE(!mem_cgroup_disabled());
+
+ return &mm_list;
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+ int nid;
+ struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+ VM_WARN_ON_ONCE(!list_empty(&mm->lru_gen.list));
+#ifdef CONFIG_MEMCG
+ VM_WARN_ON_ONCE(mm->lru_gen.memcg);
+ mm->lru_gen.memcg = memcg;
+#endif
+ spin_lock(&mm_list->lock);
+
+ for_each_node_state(nid, N_MEMORY) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ if (!lruvec)
+ continue;
+
+ if (lruvec->mm_state.tail == &mm_list->fifo)
+ lruvec->mm_state.tail = &mm->lru_gen.list;
+ }
+
+ list_add_tail(&mm->lru_gen.list, &mm_list->fifo);
+
+ spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+ int nid;
+ struct lru_gen_mm_list *mm_list;
+ struct mem_cgroup *memcg = NULL;
+
+ if (list_empty(&mm->lru_gen.list))
+ return;
+
+#ifdef CONFIG_MEMCG
+ memcg = mm->lru_gen.memcg;
+#endif
+ mm_list = get_mm_list(memcg);
+
+ spin_lock(&mm_list->lock);
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ if (!lruvec)
+ continue;
+
+ if (lruvec->mm_state.tail == &mm->lru_gen.list)
+ lruvec->mm_state.tail = lruvec->mm_state.tail->next;
+
+ if (lruvec->mm_state.head != &mm->lru_gen.list)
+ continue;
+
+ lruvec->mm_state.head = lruvec->mm_state.head->next;
+ if (lruvec->mm_state.head == &mm_list->fifo)
+ WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1);
+ }
+
+ list_del_init(&mm->lru_gen.list);
+
+ spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+ mem_cgroup_put(mm->lru_gen.memcg);
+ mm->lru_gen.memcg = NULL;
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+ struct mem_cgroup *memcg;
+
+ lockdep_assert_held(&mm->owner->alloc_lock);
+
+ /* for mm_update_next_owner() */
+ if (mem_cgroup_disabled())
+ return;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ rcu_read_unlock();
+ if (memcg == mm->lru_gen.memcg)
+ return;
+
+ VM_WARN_ON_ONCE(!mm->lru_gen.memcg);
+ VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list));
+
+ lru_gen_del_mm(mm);
+ lru_gen_add_mm(mm);
+}
+#endif
+
+/*
+ * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when
+ * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of
+ * bits in a bitmap, k is the number of hash functions and n is the number of
+ * inserted items.
+ *
+ * Page table walkers use one of the two filters to reduce their search space.
+ * To get rid of non-leaf entries that no longer have enough leaf entries, the
+ * aging uses the double-buffering technique to flip to the other filter each
+ * time it produces a new generation. For non-leaf entries that have enough
+ * leaf entries, the aging carries them over to the next generation in
+ * walk_pmd_range(); the eviction also reports them when walking the rmap
+ * in lru_gen_look_around().
+ *
+ * For future optimizations:
+ * 1. It's not necessary to keep both filters all the time. The spare one can be
+ * freed after the RCU grace period and reallocated if needed again.
+ * 2. And when reallocating, it's worth scaling its size according to the number
+ * of inserted entries in the other filter, to reduce the memory overhead on
+ * small systems and false positives on large systems.
+ * 3. Jenkins' hash function is an alternative to Knuth's.
+ */
+#define BLOOM_FILTER_SHIFT 15
+
+static inline int filter_gen_from_seq(unsigned long seq)
+{
+ return seq % NR_BLOOM_FILTERS;
+}
+
+static void get_item_key(void *item, int *key)
+{
+ u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2);
+
+ BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32));
+
+ key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1);
+ key[1] = hash >> BLOOM_FILTER_SHIFT;
+}
+
+static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq)
+{
+ unsigned long *filter;
+ int gen = filter_gen_from_seq(seq);
+
+ lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+ filter = lruvec->mm_state.filters[gen];
+ if (filter) {
+ bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT));
+ return;
+ }
+
+ filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT), GFP_ATOMIC);
+ WRITE_ONCE(lruvec->mm_state.filters[gen], filter);
+}
+
+static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+ int key[2];
+ unsigned long *filter;
+ int gen = filter_gen_from_seq(seq);
+
+ filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+ if (!filter)
+ return;
+
+ get_item_key(item, key);
+
+ if (!test_bit(key[0], filter))
+ set_bit(key[0], filter);
+ if (!test_bit(key[1], filter))
+ set_bit(key[1], filter);
+}
+
+static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+ int key[2];
+ unsigned long *filter;
+ int gen = filter_gen_from_seq(seq);
+
+ filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+ if (!filter)
+ return true;
+
+ get_item_key(item, key);
+
+ return test_bit(key[0], filter) && test_bit(key[1], filter);
+}
+
+static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last)
+{
+ int i;
+ int hist;
+
+ lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+ if (walk) {
+ hist = lru_hist_from_seq(walk->max_seq);
+
+ for (i = 0; i < NR_MM_STATS; i++) {
+ WRITE_ONCE(lruvec->mm_state.stats[hist][i],
+ lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]);
+ walk->mm_stats[i] = 0;
+ }
+ }
+
+ if (NR_HIST_GENS > 1 && last) {
+ hist = lru_hist_from_seq(lruvec->mm_state.seq + 1);
+
+ for (i = 0; i < NR_MM_STATS; i++)
+ WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0);
+ }
+}
+
+static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+ int type;
+ unsigned long size = 0;
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+ int key = pgdat->node_id % BITS_PER_TYPE(mm->lru_gen.bitmap);
+
+ if (!walk->force_scan && cpumask_empty(mm_cpumask(mm)) &&
+ !test_bit(key, &mm->lru_gen.bitmap))
+ return true;
+
+ clear_bit(key, &mm->lru_gen.bitmap);
+
+ for (type = !walk->can_swap; type < ANON_AND_FILE; type++) {
+ size += type ? get_mm_counter(mm, MM_FILEPAGES) :
+ get_mm_counter(mm, MM_ANONPAGES) +
+ get_mm_counter(mm, MM_SHMEMPAGES);
+ }
+
+ if (size < MIN_LRU_BATCH)
+ return true;
+
+ if (mm_is_oom_victim(mm))
+ return true;
+
+ return !mmget_not_zero(mm);
+}
+
+static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
+ struct mm_struct **iter)
+{
+ bool first = false;
+ bool last = true;
+ struct mm_struct *mm = NULL;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+ struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+ /*
+ * There are four interesting cases for this page table walker:
+ * 1. It tries to start a new iteration of mm_list with a stale max_seq;
+ * there is nothing to be done.
+ * 2. It's the first of the current generation, and it needs to reset
+ * the Bloom filter for the next generation.
+ * 3. It reaches the end of mm_list, and it needs to increment
+ * mm_state->seq; the iteration is done.
+ * 4. It's the last of the current generation, and it needs to reset the
+ * mm stats counters for the next generation.
+ */
+ if (*iter)
+ mmput_async(*iter);
+ else if (walk->max_seq <= READ_ONCE(mm_state->seq))
+ return false;
+
+ spin_lock(&mm_list->lock);
+
+ VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->max_seq);
+ VM_WARN_ON_ONCE(*iter && mm_state->seq > walk->max_seq);
+ VM_WARN_ON_ONCE(*iter && !mm_state->nr_walkers);
+
+ if (walk->max_seq <= mm_state->seq) {
+ if (!*iter)
+ last = false;
+ goto done;
+ }
+
+ if (!mm_state->nr_walkers) {
+ VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo);
+
+ mm_state->head = mm_list->fifo.next;
+ first = true;
+ }
+
+ while (!mm && mm_state->head != &mm_list->fifo) {
+ mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
+
+ mm_state->head = mm_state->head->next;
+
+ /* force scan for those added after the last iteration */
+ if (!mm_state->tail || mm_state->tail == &mm->lru_gen.list) {
+ mm_state->tail = mm_state->head;
+ walk->force_scan = true;
+ }
+
+ if (should_skip_mm(mm, walk))
+ mm = NULL;
+ }
+
+ if (mm_state->head == &mm_list->fifo)
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+done:
+ if (*iter && !mm)
+ mm_state->nr_walkers--;
+ if (!*iter && mm)
+ mm_state->nr_walkers++;
+
+ if (mm_state->nr_walkers)
+ last = false;
+
+ if (mm && first)
+ reset_bloom_filter(lruvec, walk->max_seq + 1);
+
+ if (*iter || last)
+ reset_mm_stats(lruvec, walk, last);
+
+ spin_unlock(&mm_list->lock);
+
+ *iter = mm;
+
+ return last;
+}
+
+static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq)
+{
+ bool success = false;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+ struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+ if (max_seq <= READ_ONCE(mm_state->seq))
+ return false;
+
+ spin_lock(&mm_list->lock);
+
+ VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
+
+ if (max_seq > mm_state->seq && !mm_state->nr_walkers) {
+ VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo);
+
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+ reset_mm_stats(lruvec, NULL, true);
+ success = true;
+ }
+
+ spin_unlock(&mm_list->lock);
+
+ return success;
+}
+
/******************************************************************************
* refault feedback loop
******************************************************************************/
@@ -3193,6 +3564,479 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
return new_gen;
}

+static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
+ int old_gen, int new_gen)
+{
+ int type = folio_is_file_lru(folio);
+ int zone = folio_zonenum(folio);
+ int delta = folio_nr_pages(folio);
+
+ VM_WARN_ON_ONCE(old_gen >= MAX_NR_GENS);
+ VM_WARN_ON_ONCE(new_gen >= MAX_NR_GENS);
+
+ walk->batched++;
+
+ walk->nr_pages[old_gen][type][zone] -= delta;
+ walk->nr_pages[new_gen][type][zone] += delta;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
+{
+ int gen, type, zone;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ walk->batched = 0;
+
+ for_each_gen_type_zone(gen, type, zone) {
+ enum lru_list lru = type * LRU_INACTIVE_FILE;
+ int delta = walk->nr_pages[gen][type][zone];
+
+ if (!delta)
+ continue;
+
+ walk->nr_pages[gen][type][zone] = 0;
+ WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+ lrugen->nr_pages[gen][type][zone] + delta);
+
+ if (lru_gen_is_active(lruvec, gen))
+ lru += LRU_ACTIVE;
+ __update_lru_size(lruvec, lru, zone, delta);
+ }
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *args)
+{
+ struct address_space *mapping;
+ struct vm_area_struct *vma = args->vma;
+ struct lru_gen_mm_walk *walk = args->private;
+
+ if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) ||
+ (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ)) ||
+ vma == get_gate_vma(vma->vm_mm))
+ return true;
+
+ if (vma_is_anonymous(vma))
+ return !walk->can_swap;
+
+ if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
+ return true;
+
+ mapping = vma->vm_file->f_mapping;
+ if (mapping_unevictable(mapping))
+ return true;
+
+ /* check readpage to exclude special mappings like dax, etc. */
+ return shmem_mapping(mapping) ? !walk->can_swap : !mapping->a_ops->readpage;
+}
+
+/*
+ * Some userspace memory allocators map many single-page VMAs. Instead of
+ * returning to the PGD table for each such VMA, finish an entire PMD
+ * table to reduce zigzags and improve cache performance.
+ */
+static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk *args,
+ unsigned long *start, unsigned long *end)
+{
+ unsigned long next = round_up(*end, size);
+
+ VM_WARN_ON_ONCE(mask & size);
+ VM_WARN_ON_ONCE(*start >= *end);
+ VM_WARN_ON_ONCE((next & mask) != (*start & mask));
+
+ while (args->vma) {
+ if (next >= args->vma->vm_end) {
+ args->vma = args->vma->vm_next;
+ continue;
+ }
+
+ if ((next & mask) != (args->vma->vm_start & mask))
+ return false;
+
+ if (should_skip_vma(args->vma->vm_start, args->vma->vm_end, args)) {
+ args->vma = args->vma->vm_next;
+ continue;
+ }
+
+ *start = max(next, args->vma->vm_start);
+ next = (next | ~mask) + 1;
+ /* rounded-up boundaries can wrap to 0 */
+ *end = next && next < args->vma->vm_end ? next : args->vma->vm_end;
+
+ return true;
+ }
+
+ return false;
+}
+
+static bool suitable_to_scan(int total, int young)
+{
+ int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
+
+ /* suitable if the average number of young PTEs per cacheline is >=1 */
+ return young * n >= total;
+}
+
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+ struct mm_walk *args)
+{
+ int i;
+ pte_t *pte;
+ spinlock_t *ptl;
+ unsigned long addr;
+ int total = 0;
+ int young = 0;
+ struct lru_gen_mm_walk *walk = args->private;
+ struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+ int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
+
+ VM_WARN_ON_ONCE(pmd_leaf(*pmd));
+
+ ptl = pte_lockptr(args->mm, pmd);
+ if (!spin_trylock(ptl))
+ return false;
+
+ arch_enter_lazy_mmu_mode();
+
+ pte = pte_offset_map(pmd, start & PMD_MASK);
+restart:
+ for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ struct folio *folio;
+ unsigned long pfn = pte_pfn(pte[i]);
+
+ VM_WARN_ON_ONCE(addr < args->vma->vm_start || addr >= args->vma->vm_end);
+
+ total++;
+ walk->mm_stats[MM_PTE_TOTAL]++;
+
+ if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+ continue;
+
+ if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+ continue;
+
+ if (!pte_young(pte[i])) {
+ walk->mm_stats[MM_PTE_OLD]++;
+ continue;
+ }
+
+ VM_WARN_ON_ONCE(!pfn_valid(pfn));
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ continue;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ continue;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ continue;
+
+ if (!folio_is_file_lru(folio) && !walk->can_swap)
+ continue;
+
+ if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
+ continue;
+
+ young++;
+ walk->mm_stats[MM_PTE_YOUNG]++;
+
+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ folio_mark_dirty(folio);
+
+ old_gen = folio_update_gen(folio, new_gen);
+ if (old_gen >= 0 && old_gen != new_gen)
+ update_batch_size(walk, folio, old_gen, new_gen);
+ }
+
+ if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
+ goto restart;
+
+ pte_unmap(pte);
+
+ arch_leave_lazy_mmu_mode();
+ spin_unlock(ptl);
+
+ return suitable_to_scan(total, young);
+}
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+ struct mm_walk *args, unsigned long *start)
+{
+ int i;
+ pmd_t *pmd;
+ spinlock_t *ptl;
+ struct lru_gen_mm_walk *walk = args->private;
+ struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+ int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
+
+ VM_WARN_ON_ONCE(pud_leaf(*pud));
+
+ /* try to batch at most 1+MIN_LRU_BATCH+1 entries */
+ if (*start == -1) {
+ *start = next;
+ return;
+ }
+
+ i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start);
+ if (i && i <= MIN_LRU_BATCH) {
+ __set_bit(i - 1, walk->bitmap);
+ return;
+ }
+
+ pmd = pmd_offset(pud, *start);
+
+ ptl = pmd_lockptr(args->mm, pmd);
+ if (!spin_trylock(ptl))
+ goto done;
+
+ arch_enter_lazy_mmu_mode();
+
+ do {
+ struct folio *folio;
+ unsigned long pfn = pmd_pfn(pmd[i]);
+ unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start;
+
+ VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end);
+
+ if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i]))
+ goto next;
+
+ if (WARN_ON_ONCE(pmd_devmap(pmd[i])))
+ goto next;
+
+ if (!pmd_trans_huge(pmd[i])) {
+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+ pmdp_test_and_clear_young(vma, addr, pmd + i);
+ goto next;
+ }
+
+ VM_WARN_ON_ONCE(!pfn_valid(pfn));
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ goto next;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ goto next;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ goto next;
+
+ if (!folio_is_file_lru(folio) && !walk->can_swap)
+ goto next;
+
+ if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+ goto next;
+
+ walk->mm_stats[MM_PTE_YOUNG]++;
+
+ if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ folio_mark_dirty(folio);
+
+ old_gen = folio_update_gen(folio, new_gen);
+ if (old_gen >= 0 && old_gen != new_gen)
+ update_batch_size(walk, folio, old_gen, new_gen);
+next:
+ i = i > MIN_LRU_BATCH ? 0 :
+ find_next_bit(walk->bitmap, MIN_LRU_BATCH, i) + 1;
+ } while (i <= MIN_LRU_BATCH);
+
+ arch_leave_lazy_mmu_mode();
+ spin_unlock(ptl);
+done:
+ *start = -1;
+ bitmap_zero(walk->bitmap, MIN_LRU_BATCH);
+}
+#else
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+ struct mm_walk *args, unsigned long *start)
+{
+}
+#endif
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+ struct mm_walk *args)
+{
+ int i;
+ pmd_t *pmd;
+ unsigned long next;
+ unsigned long addr;
+ struct vm_area_struct *vma;
+ unsigned long pos = -1;
+ struct lru_gen_mm_walk *walk = args->private;
+
+ VM_WARN_ON_ONCE(pud_leaf(*pud));
+
+ /*
+ * Finish an entire PMD in two passes: the first only reaches to PTE
+ * tables to avoid taking the PMD lock; the second, if necessary, takes
+ * the PMD lock to clear the accessed bit in PMD entries.
+ */
+ pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+ /* walk_pte_range() may call get_next_vma() */
+ vma = args->vma;
+ for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+ pmd_t val = pmd_read_atomic(pmd + i);
+
+ /* for pmd_read_atomic() */
+ barrier();
+
+ next = pmd_addr_end(addr, end);
+
+ if (!pmd_present(val)) {
+ walk->mm_stats[MM_PTE_TOTAL]++;
+ continue;
+ }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (pmd_trans_huge(val)) {
+ unsigned long pfn = pmd_pfn(val);
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+
+ walk->mm_stats[MM_PTE_TOTAL]++;
+
+ if (is_huge_zero_pmd(val))
+ continue;
+
+ if (!pmd_young(val)) {
+ walk->mm_stats[MM_PTE_OLD]++;
+ continue;
+ }
+
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ continue;
+
+ walk_pmd_range_locked(pud, addr, vma, args, &pos);
+ continue;
+ }
+#endif
+ walk->mm_stats[MM_PMD_TOTAL]++;
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+ if (!pmd_young(val))
+ continue;
+
+ walk_pmd_range_locked(pud, addr, vma, args, &pos);
+#endif
+ if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i))
+ continue;
+
+ walk->mm_stats[MM_PMD_FOUND]++;
+
+ if (!walk_pte_range(&val, addr, next, args))
+ continue;
+
+ walk->mm_stats[MM_PMD_ADDED]++;
+
+ /* carry over to the next generation */
+ update_bloom_filter(walk->lruvec, walk->max_seq + 1, pmd + i);
+ }
+
+ walk_pmd_range_locked(pud, -1, vma, args, &pos);
+
+ if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
+ goto restart;
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+ struct mm_walk *args)
+{
+ int i;
+ pud_t *pud;
+ unsigned long addr;
+ unsigned long next;
+ struct lru_gen_mm_walk *walk = args->private;
+
+ VM_WARN_ON_ONCE(p4d_leaf(*p4d));
+
+ pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+ for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+ pud_t val = READ_ONCE(pud[i]);
+
+ next = pud_addr_end(addr, end);
+
+ if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+ continue;
+
+ walk_pmd_range(&val, addr, next, args);
+
+ if (walk->batched >= MAX_LRU_BATCH) {
+ end = (addr | ~PUD_MASK) + 1;
+ goto done;
+ }
+ }
+
+ if (i < PTRS_PER_PUD && get_next_vma(P4D_MASK, PUD_SIZE, args, &start, &end))
+ goto restart;
+
+ end = round_up(end, P4D_SIZE);
+done:
+ /* rounded-up boundaries can wrap to 0 */
+ walk->next_addr = end && args->vma ? max(end, args->vma->vm_start) : 0;
+
+ return -EAGAIN;
+}
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+ static const struct mm_walk_ops mm_walk_ops = {
+ .test_walk = should_skip_vma,
+ .p4d_entry = walk_pud_range,
+ };
+
+ int err;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+ walk->next_addr = FIRST_USER_ADDRESS;
+
+ do {
+ err = -EBUSY;
+
+ /* folio_update_gen() requires stable folio_memcg() */
+ if (!mem_cgroup_trylock_pages(memcg))
+ break;
+
+ /* the caller might be holding the lock for write */
+ if (mmap_read_trylock(mm)) {
+ err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
+
+ mmap_read_unlock(mm);
+
+ if (walk->batched) {
+ spin_lock_irq(&lruvec->lru_lock);
+ reset_batch_size(lruvec, walk);
+ spin_unlock_irq(&lruvec->lru_lock);
+ }
+ }
+
+ mem_cgroup_unlock_pages();
+
+ cond_resched();
+ } while (err == -EAGAIN && walk->next_addr && !mm_is_oom_victim(mm));
+}
+
+static struct lru_gen_mm_walk *alloc_mm_walk(void)
+{
+ if (current->reclaim_state && current->reclaim_state->mm_walk)
+ return current->reclaim_state->mm_walk;
+
+ return kzalloc(sizeof(struct lru_gen_mm_walk),
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+}
+
+static void free_mm_walk(struct lru_gen_mm_walk *walk)
+{
+ if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+ kfree(walk);
+}
+
static void inc_min_seq(struct lruvec *lruvec, int type)
{
struct lru_gen_struct *lrugen = &lruvec->lrugen;
@@ -3244,7 +4088,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
return success;
}

-static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap)
+static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
{
int prev, next;
int type, zone;
@@ -3254,9 +4098,6 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s

VM_WARN_ON_ONCE(!seq_is_valid(lruvec));

- if (max_seq != lrugen->max_seq)
- goto unlock;
-
for (type = 0; type < ANON_AND_FILE; type++) {
if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
continue;
@@ -3294,10 +4135,72 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s

/* make sure preceding modifications appear */
smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
-unlock:
+
spin_unlock_irq(&lruvec->lru_lock);
}

+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+ struct scan_control *sc, bool can_swap)
+{
+ bool success;
+ struct lru_gen_mm_walk *walk;
+ struct mm_struct *mm = NULL;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq));
+
+ /*
+ * If the hardware doesn't automatically set the accessed bit, fallback
+ * to lru_gen_look_around(), which only clears the accessed bit in a
+ * handful of PTEs. Spreading the work out over a period of time usually
+ * is less efficient, but it avoids bursty page faults.
+ */
+ if (!arch_has_hw_pte_young()) {
+ success = iterate_mm_list_nowalk(lruvec, max_seq);
+ goto done;
+ }
+
+ walk = alloc_mm_walk();
+ if (!walk) {
+ success = iterate_mm_list_nowalk(lruvec, max_seq);
+ goto done;
+ }
+
+ walk->lruvec = lruvec;
+ walk->max_seq = max_seq;
+ walk->can_swap = can_swap;
+ walk->force_scan = false;
+
+ do {
+ success = iterate_mm_list(lruvec, walk, &mm);
+ if (mm)
+ walk_mm(lruvec, mm, walk);
+
+ cond_resched();
+ } while (mm);
+
+ free_mm_walk(walk);
+done:
+ if (!success) {
+ if (!current_is_kswapd() && !sc->priority)
+ wait_event_killable(lruvec->mm_state.wait,
+ max_seq < READ_ONCE(lrugen->max_seq));
+
+ return max_seq < READ_ONCE(lrugen->max_seq);
+ }
+
+ VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq));
+
+ inc_max_seq(lruvec, can_swap);
+ /* either this sees any waiters or they will see updated max_seq */
+ if (wq_has_sleeper(&lruvec->mm_state.wait))
+ wake_up_all(&lruvec->mm_state.wait);
+
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+ return true;
+}
+
static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
unsigned long *min_seq, bool can_swap, bool *need_aging)
{
@@ -3378,7 +4281,7 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
nr_to_scan++;

if (nr_to_scan && need_aging)
- inc_max_seq(lruvec, max_seq, swappiness);
+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness);
}

static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -3387,6 +4290,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)

VM_WARN_ON_ONCE(!current_is_kswapd());

+ current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -3395,11 +4300,16 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)

cond_resched();
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ current->reclaim_state->mm_walk = NULL;
}

/*
* This function exploits spatial locality when shrink_page_list() walks the
- * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
+ * the scan was done cacheline efficiently, it adds the PMD entry pointing to
+ * the PTE table to the Bloom filter. This forms a feedback loop between the
+ * eviction and the aging.
*/
void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
@@ -3408,6 +4318,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
unsigned long start;
unsigned long end;
unsigned long addr;
+ struct lru_gen_mm_walk *walk;
+ int young = 0;
unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
struct folio *folio = pfn_folio(pvmw->pfn);
struct mem_cgroup *memcg = folio_memcg(folio);
@@ -3469,6 +4381,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
continue;

+ young++;
+
if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
!(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
!folio_test_swapcache(folio)))
@@ -3484,7 +4398,13 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
arch_leave_lazy_mmu_mode();
rcu_read_unlock();

- if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+ /* feedback from rmap walkers to page table walkers */
+ if (suitable_to_scan(i, young))
+ update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+
+ if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
folio = pfn_folio(pte_pfn(pte[i]));
folio_activate(folio);
@@ -3496,8 +4416,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
if (!mem_cgroup_trylock_pages(memcg))
return;

- spin_lock_irq(&lruvec->lru_lock);
- new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+ if (!walk) {
+ spin_lock_irq(&lruvec->lru_lock);
+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+ }

for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
folio = pfn_folio(pte_pfn(pte[i]));
@@ -3508,10 +4430,14 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
if (old_gen < 0 || old_gen == new_gen)
continue;

- lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+ if (walk)
+ update_batch_size(walk, folio, old_gen, new_gen);
+ else
+ lru_gen_update_size(lruvec, folio, old_gen, new_gen);
}

- spin_unlock_irq(&lruvec->lru_lock);
+ if (!walk)
+ spin_unlock_irq(&lruvec->lru_lock);

mem_cgroup_unlock_pages();
}
@@ -3782,6 +4708,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
struct folio *folio;
enum vm_event_item item;
struct reclaim_stat stat;
+ struct lru_gen_mm_walk *walk;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);

@@ -3821,6 +4748,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap

move_pages_to_lru(lruvec, &list);

+ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+ if (walk && walk->batched)
+ reset_batch_size(lruvec, walk);
+
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
if (!cgroup_reclaim(sc))
__count_vm_events(item, reclaimed);
@@ -3875,20 +4806,25 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
return 0;
}

- inc_max_seq(lruvec, max_seq, can_swap);
+ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap))
+ return nr_to_scan;

- return nr_to_scan;
+ return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0;
}

static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
struct blk_plug plug;
long scanned = 0;
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);

lru_add_drain();

blk_start_plug(&plug);

+ if (current_is_kswapd())
+ current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
while (true) {
int delta;
int swappiness;
@@ -3916,6 +4852,9 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
cond_resched();
}

+ if (current_is_kswapd())
+ current->reclaim_state->mm_walk = NULL;
+
blk_finish_plug(&plug);
}

@@ -3932,15 +4871,21 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)

for_each_gen_type_zone(gen, type, zone)
INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+
+ lruvec->mm_state.seq = MIN_NR_GENS;
+ init_waitqueue_head(&lruvec->mm_state.wait);
}

#ifdef CONFIG_MEMCG
void lru_gen_init_memcg(struct mem_cgroup *memcg)
{
+ INIT_LIST_HEAD(&memcg->mm_list.fifo);
+ spin_lock_init(&memcg->mm_list.lock);
}

void lru_gen_exit_memcg(struct mem_cgroup *memcg)
{
+ int i;
int nid;

for_each_node(nid) {
@@ -3948,6 +4893,11 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg)

VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
sizeof(lruvec->lrugen.nr_pages)));
+
+ for (i = 0; i < NR_BLOOM_FILTERS; i++) {
+ bitmap_free(lruvec->mm_state.filters[i]);
+ lruvec->mm_state.filters[i] = NULL;
+ }
}
}
#endif
@@ -3956,6 +4906,7 @@ static int __init init_lru_gen(void)
{
BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+ BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);

return 0;
};
--
2.36.0.550.gb090851708-goog


2022-05-18 04:55:10

by Yu Zhao


Subject: [PATCH v11 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller"

This patch undoes the following refactor:
commit 289ccba18af4 ("include/linux/mm_inline.h: fold __update_lru_size() into its sole caller")

The upcoming changes to include/linux/mm_inline.h will reuse
__update_lru_size().

Signed-off-by: Yu Zhao <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
---
include/linux/mm_inline.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index ac32125745ab..7c9c2157e9a8 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -32,7 +32,7 @@ static inline int page_is_file_lru(struct page *page)
return folio_is_file_lru(page_folio(page));
}

-static __always_inline void update_lru_size(struct lruvec *lruvec,
+static __always_inline void __update_lru_size(struct lruvec *lruvec,
enum lru_list lru, enum zone_type zid,
long nr_pages)
{
@@ -41,6 +41,13 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
__mod_zone_page_state(&pgdat->node_zones[zid],
NR_ZONE_LRU_BASE + lru, nr_pages);
+}
+
+static __always_inline void update_lru_size(struct lruvec *lruvec,
+ enum lru_list lru, enum zone_type zid,
+ int nr_pages)
+{
+ __update_lru_size(lruvec, lru, zid, nr_pages);
#ifdef CONFIG_MEMCG
mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
#endif
--
2.36.0.550.gb090851708-goog


2022-06-06 09:45:56

by Barry Song

Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Wed, May 18, 2022 at 4:49 PM Yu Zhao <[email protected]> wrote:
>
> Searching the rmap for PTEs mapping each page on an LRU list (to test
> and clear the accessed bit) can be expensive because pages from
> different VMAs (PA space) are not cache friendly to the rmap (VA
> space). For workloads mostly using mapped pages, the rmap has a high
> CPU cost in the reclaim path.
>
> This patch exploits spatial locality to reduce the trips into the
> rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
> new function lru_gen_look_around() scans at most BITS_PER_LONG-1
> adjacent PTEs. On finding another young PTE, it clears the accessed
> bit and updates the gen counter of the page mapped by this PTE to
> (max_seq%MAX_NR_GENS)+1.
>
> Server benchmark results:
> Single workload:
> fio (buffered I/O): no change
>
> Single workload:
> memcached (anon): +[5.5, 7.5]%
> Ops/sec KB/sec
> patch1-6: 1120643.70 43588.06
> patch1-7: 1193918.93 46438.15
>
> Configurations:
> no change
>
> Client benchmark results:
> kswapd profiles:
> patch1-6
> 35.99% lzo1x_1_do_compress (real work)
> 19.40% page_vma_mapped_walk
> 6.31% _raw_spin_unlock_irq
> 3.95% do_raw_spin_lock
> 2.39% anon_vma_interval_tree_iter_first
> 2.25% ptep_clear_flush
> 1.92% __anon_vma_interval_tree_subtree_search
> 1.70% folio_referenced_one
> 1.68% __zram_bvec_write
> 1.43% anon_vma_interval_tree_iter_next
>
> patch1-7
> 45.90% lzo1x_1_do_compress (real work)
> 9.14% page_vma_mapped_walk
> 6.81% _raw_spin_unlock_irq
> 2.80% ptep_clear_flush
> 2.34% __zram_bvec_write
> 2.29% do_raw_spin_lock
> 1.84% lru_gen_look_around
> 1.78% memmove
> 1.74% obj_malloc
> 1.50% free_unref_page_list
>
> Configurations:
> no change
>
> Signed-off-by: Yu Zhao <[email protected]>
> Acked-by: Brian Geffon <[email protected]>
> Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
> Acked-by: Oleksandr Natalenko <[email protected]>
> Acked-by: Steven Barrett <[email protected]>
> Acked-by: Suleiman Souhlal <[email protected]>
> Tested-by: Daniel Byrne <[email protected]>
> Tested-by: Donald Carr <[email protected]>
> Tested-by: Holger Hoffstätte <[email protected]>
> Tested-by: Konstantin Kharlamov <[email protected]>
> Tested-by: Shuang Zhai <[email protected]>
> Tested-by: Sofia Trinh <[email protected]>
> Tested-by: Vaibhav Jain <[email protected]>
> ---
> include/linux/memcontrol.h | 31 ++++++++
> include/linux/mm.h | 5 ++
> include/linux/mmzone.h | 6 ++
> mm/internal.h | 1 +
> mm/memcontrol.c | 1 +
> mm/rmap.c | 7 ++
> mm/swap.c | 4 +-
> mm/vmscan.c | 157 +++++++++++++++++++++++++++++++++++++
> 8 files changed, 210 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 89b14729d59f..2bfdcc77648a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -438,6 +438,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
> * - LRU isolation
> * - lock_page_memcg()
> * - exclusive reference
> + * - mem_cgroup_trylock_pages()
> *
> * For a kmem folio a caller should hold an rcu read lock to protect memcg
> * associated with a kmem folio from being released.
> @@ -499,6 +500,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
> * - LRU isolation
> * - lock_page_memcg()
> * - exclusive reference
> + * - mem_cgroup_trylock_pages()
> *
> * For a kmem page a caller should hold an rcu read lock to protect memcg
> * associated with a kmem page from being released.
> @@ -948,6 +950,23 @@ void unlock_page_memcg(struct page *page);
>
> void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
>
> +/* try to stabilize folio_memcg() for all the pages in a memcg */
> +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> +{
> + rcu_read_lock();
> +
> + if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
> + return true;
> +
> + rcu_read_unlock();
> + return false;
> +}
> +
> +static inline void mem_cgroup_unlock_pages(void)
> +{
> + rcu_read_unlock();
> +}
> +
> /* idx can be of type enum memcg_stat_item or node_stat_item */
> static inline void mod_memcg_state(struct mem_cgroup *memcg,
> int idx, int val)
> @@ -1386,6 +1405,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
> {
> }
>
> +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> +{
> + /* to match folio_memcg_rcu() */
> + rcu_read_lock();
> + return true;
> +}
> +
> +static inline void mem_cgroup_unlock_pages(void)
> +{
> + rcu_read_unlock();
> +}
> +
> static inline void mem_cgroup_handle_over_high(void)
> {
> }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 894c289c2c06..4e8ab4ad4473 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1523,6 +1523,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
> return page_to_pfn(&folio->page);
> }
>
> +static inline struct folio *pfn_folio(unsigned long pfn)
> +{
> + return page_folio(pfn_to_page(pfn));
> +}
> +
> static inline atomic_t *folio_pincount_ptr(struct folio *folio)
> {
> return &folio_page(folio, 1)->compound_pincount;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2d023d243e73..f0b980362186 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -374,6 +374,7 @@ enum lruvec_flags {
> #ifndef __GENERATING_BOUNDS_H
>
> struct lruvec;
> +struct page_vma_mapped_walk;
>
> #define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> #define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> @@ -429,6 +430,7 @@ struct lru_gen_struct {
> };
>
> void lru_gen_init_lruvec(struct lruvec *lruvec);
> +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
>
> #ifdef CONFIG_MEMCG
> void lru_gen_init_memcg(struct mem_cgroup *memcg);
> @@ -441,6 +443,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
> {
> }
>
> +static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> +{
> +}
> +
> #ifdef CONFIG_MEMCG
> static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
> {
> diff --git a/mm/internal.h b/mm/internal.h
> index cf16280ce132..59d2422b647d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -68,6 +68,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> void folio_rotate_reclaimable(struct folio *folio);
> bool __folio_end_writeback(struct folio *folio);
> void deactivate_file_folio(struct folio *folio);
> +void folio_activate(struct folio *folio);
>
> void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> unsigned long floor, unsigned long ceiling);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2ee074f80e72..98aa720ac639 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2769,6 +2769,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
> * - LRU isolation
> * - lock_page_memcg()
> * - exclusive reference
> + * - mem_cgroup_trylock_pages()
> */
> folio->memcg_data = (unsigned long)memcg;
> }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fedb82371efe..7cb7ef29088a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -73,6 +73,7 @@
> #include <linux/page_idle.h>
> #include <linux/memremap.h>
> #include <linux/userfaultfd_k.h>
> +#include <linux/mm_inline.h>
>
> #include <asm/tlbflush.h>
>
> @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
> }
>
> if (pvmw.pte) {
> + if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> + lru_gen_look_around(&pvmw);
> + referenced++;
> + }
> +
> if (ptep_clear_flush_young_notify(vma, address,

Hello, Yu.
look_around() calls only ptep_test_and_clear_young(pvmw->vma, addr, pte + i),
without the flush and the notify. For the flush, arm64 has a TLB operation:
static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
int young = ptep_test_and_clear_young(vma, address, ptep);

if (young) {
/*
* We can elide the trailing DSB here since the worst that can
* happen is that a CPU continues to use the young entry in its
* TLB and we mistakenly reclaim the associated page. The
* window for such an event is bounded by the next
* context-switch, which provides a DSB to complete the TLB
* invalidation.
*/
flush_tlb_page_nosync(vma, address);
}

return young;
}

Does this mean the current kernel is overly cautious? Is it
safe to call ptep_test_and_clear_young() only?

By the way, lru_gen_look_around() already includes 'address', so are we
doing the PTE check for 'address' twice here?


> pvmw.pte)) {
> /*
> diff --git a/mm/swap.c b/mm/swap.c
> index a99d22308f28..0aa1d0b33d42 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -342,7 +342,7 @@ static bool need_activate_page_drain(int cpu)
> return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
> }
>
> -static void folio_activate(struct folio *folio)
> +void folio_activate(struct folio *folio)
> {
> if (folio_test_lru(folio) && !folio_test_active(folio) &&
> !folio_test_unevictable(folio)) {
> @@ -362,7 +362,7 @@ static inline void activate_page_drain(int cpu)
> {
> }
>
> -static void folio_activate(struct folio *folio)
> +void folio_activate(struct folio *folio)
> {
> struct lruvec *lruvec;
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 891f0ab69b3a..cf89a28c3b0e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1554,6 +1554,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> if (!sc->may_unmap && page_mapped(page))
> goto keep_locked;
>
> + /* folio_update_gen() tried to promote this page? */
> + if (lru_gen_enabled() && !ignore_references &&
> + page_mapped(page) && PageReferenced(page))
> + goto keep_locked;
> +
> may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>
> @@ -3137,6 +3142,28 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
> * the aging
> ******************************************************************************/
>
> +static int folio_update_gen(struct folio *folio, int gen)
> +{
> + unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
> +
> + VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
> + VM_WARN_ON_ONCE(!rcu_read_lock_held());
> +
> + do {
> + /* lru_gen_del_folio() has isolated this page? */
> + if (!(old_flags & LRU_GEN_MASK)) {
> + /* for shrink_page_list() */
> + new_flags = old_flags | BIT(PG_referenced);
> + continue;
> + }
> +
> + new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
> + new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
> + } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
> +
> + return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +}
> +
> static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> {
> int type = folio_is_file_lru(folio);
> @@ -3147,6 +3174,11 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
> VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio);
>
> do {
> + new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> + /* folio_update_gen() has promoted this page? */
> + if (new_gen >= 0 && new_gen != old_gen)
> + return new_gen;
> +
> new_gen = (old_gen + 1) % MAX_NR_GENS;
>
> new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
> @@ -3365,6 +3397,125 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> }
>
> +/*
> + * This function exploits spatial locality when shrink_page_list() walks the
> + * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
> + */
> +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> +{
> + int i;
> + pte_t *pte;
> + unsigned long start;
> + unsigned long end;
> + unsigned long addr;
> + unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
> + struct folio *folio = pfn_folio(pvmw->pfn);
> + struct mem_cgroup *memcg = folio_memcg(folio);
> + struct pglist_data *pgdat = folio_pgdat(folio);
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> + DEFINE_MAX_SEQ(lruvec);
> + int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +
> + lockdep_assert_held(pvmw->ptl);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
> +
> + if (spin_is_contended(pvmw->ptl))
> + return;
> +
> + start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
> + end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
> +
> + if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
> + if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
> + end = start + MIN_LRU_BATCH * PAGE_SIZE;
> + else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
> + start = end - MIN_LRU_BATCH * PAGE_SIZE;
> + else {
> + start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
> + end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
> + }
> + }
> +
> + pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
> +
> + rcu_read_lock();
> + arch_enter_lazy_mmu_mode();
> +
> + for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> + unsigned long pfn = pte_pfn(pte[i]);
> +
> + VM_WARN_ON_ONCE(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
> +
> + if (!pte_present(pte[i]) || is_zero_pfn(pfn))
> + continue;
> +
> + if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
> + continue;
> +
> + if (!pte_young(pte[i]))
> + continue;
> +
> + VM_WARN_ON_ONCE(!pfn_valid(pfn));
> + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
> + continue;
> +
> + folio = pfn_folio(pfn);
> + if (folio_nid(folio) != pgdat->node_id)
> + continue;
> +
> + if (folio_memcg_rcu(folio) != memcg)
> + continue;
> +
> + if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> + continue;
> +
> + if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
> + !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
> + !folio_test_swapcache(folio)))
> + folio_mark_dirty(folio);
> +
> + old_gen = folio_lru_gen(folio);
> + if (old_gen < 0)
> + folio_set_referenced(folio);
> + else if (old_gen != new_gen)
> + __set_bit(i, bitmap);
> + }
> +
> + arch_leave_lazy_mmu_mode();
> + rcu_read_unlock();
> +
> + if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
> + for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
> + folio = pfn_folio(pte_pfn(pte[i]));
> + folio_activate(folio);
> + }
> + return;
> + }
> +
> + /* folio_update_gen() requires stable folio_memcg() */
> + if (!mem_cgroup_trylock_pages(memcg))
> + return;
> +
> + spin_lock_irq(&lruvec->lru_lock);
> + new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
> +
> + for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
> + folio = pfn_folio(pte_pfn(pte[i]));
> + if (folio_memcg_rcu(folio) != memcg)
> + continue;
> +
> + old_gen = folio_update_gen(folio, new_gen);
> + if (old_gen < 0 || old_gen == new_gen)
> + continue;
> +
> + lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> + }
> +
> + spin_unlock_irq(&lruvec->lru_lock);
> +
> + mem_cgroup_unlock_pages();
> +}
> +
> /******************************************************************************
> * the eviction
> ******************************************************************************/
> @@ -3401,6 +3552,12 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
> return true;
> }
>
> + /* promoted */
> + if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
> + list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
> + return true;
> + }
> +
> /* protected */
> if (tier > tier_idx) {
> int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> --
> 2.36.0.550.gb090851708-goog
>

Thanks
Barry

2022-06-07 09:30:45

by Barry Song

Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Mon, Jun 6, 2022 at 9:25 PM Barry Song <[email protected]> wrote:
>
> On Wed, May 18, 2022 at 4:49 PM Yu Zhao <[email protected]> wrote:
> >
> > Searching the rmap for PTEs mapping each page on an LRU list (to test
> > and clear the accessed bit) can be expensive because pages from
> > different VMAs (PA space) are not cache friendly to the rmap (VA
> > space). For workloads mostly using mapped pages, the rmap has a high
> > CPU cost in the reclaim path.
> >
> > This patch exploits spatial locality to reduce the trips into the
> > rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
> > new function lru_gen_look_around() scans at most BITS_PER_LONG-1
> > adjacent PTEs. On finding another young PTE, it clears the accessed
> > bit and updates the gen counter of the page mapped by this PTE to
> > (max_seq%MAX_NR_GENS)+1.
> >
> > Server benchmark results:
> > Single workload:
> > fio (buffered I/O): no change
> >
> > Single workload:
> > memcached (anon): +[5.5, 7.5]%
> > Ops/sec KB/sec
> > patch1-6: 1120643.70 43588.06
> > patch1-7: 1193918.93 46438.15
> >
> > Configurations:
> > no change
> >
> > Client benchmark results:
> > kswapd profiles:
> > patch1-6
> > 35.99% lzo1x_1_do_compress (real work)
> > 19.40% page_vma_mapped_walk
> > 6.31% _raw_spin_unlock_irq
> > 3.95% do_raw_spin_lock
> > 2.39% anon_vma_interval_tree_iter_first
> > 2.25% ptep_clear_flush
> > 1.92% __anon_vma_interval_tree_subtree_search
> > 1.70% folio_referenced_one
> > 1.68% __zram_bvec_write
> > 1.43% anon_vma_interval_tree_iter_next
> >
> > patch1-7
> > 45.90% lzo1x_1_do_compress (real work)
> > 9.14% page_vma_mapped_walk
> > 6.81% _raw_spin_unlock_irq
> > 2.80% ptep_clear_flush
> > 2.34% __zram_bvec_write
> > 2.29% do_raw_spin_lock
> > 1.84% lru_gen_look_around
> > 1.78% memmove
> > 1.74% obj_malloc
> > 1.50% free_unref_page_list
> >
> > Configurations:
> > no change
> >
> > Signed-off-by: Yu Zhao <[email protected]>
> > Acked-by: Brian Geffon <[email protected]>
> > Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
> > Acked-by: Oleksandr Natalenko <[email protected]>
> > Acked-by: Steven Barrett <[email protected]>
> > Acked-by: Suleiman Souhlal <[email protected]>
> > Tested-by: Daniel Byrne <[email protected]>
> > Tested-by: Donald Carr <[email protected]>
> > Tested-by: Holger Hoffstätte <[email protected]>
> > Tested-by: Konstantin Kharlamov <[email protected]>
> > Tested-by: Shuang Zhai <[email protected]>
> > Tested-by: Sofia Trinh <[email protected]>
> > Tested-by: Vaibhav Jain <[email protected]>
> > ---
> > include/linux/memcontrol.h | 31 ++++++++
> > include/linux/mm.h | 5 ++
> > include/linux/mmzone.h | 6 ++
> > mm/internal.h | 1 +
> > mm/memcontrol.c | 1 +
> > mm/rmap.c | 7 ++
> > mm/swap.c | 4 +-
> > mm/vmscan.c | 157 +++++++++++++++++++++++++++++++++++++
> > 8 files changed, 210 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 89b14729d59f..2bfdcc77648a 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -438,6 +438,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
> > * - LRU isolation
> > * - lock_page_memcg()
> > * - exclusive reference
> > + * - mem_cgroup_trylock_pages()
> > *
> > * For a kmem folio a caller should hold an rcu read lock to protect memcg
> > * associated with a kmem folio from being released.
> > @@ -499,6 +500,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
> > * - LRU isolation
> > * - lock_page_memcg()
> > * - exclusive reference
> > + * - mem_cgroup_trylock_pages()
> > *
> > * For a kmem page a caller should hold an rcu read lock to protect memcg
> > * associated with a kmem page from being released.
> > @@ -948,6 +950,23 @@ void unlock_page_memcg(struct page *page);
> >
> > void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
> >
> > +/* try to stablize folio_memcg() for all the pages in a memcg */
> > +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> > +{
> > + rcu_read_lock();
> > +
> > + if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
> > + return true;
> > +
> > + rcu_read_unlock();
> > + return false;
> > +}
> > +
> > +static inline void mem_cgroup_unlock_pages(void)
> > +{
> > + rcu_read_unlock();
> > +}
> > +
> > /* idx can be of type enum memcg_stat_item or node_stat_item */
> > static inline void mod_memcg_state(struct mem_cgroup *memcg,
> > int idx, int val)
> > @@ -1386,6 +1405,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
> > {
> > }
> >
> > +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> > +{
> > + /* to match folio_memcg_rcu() */
> > + rcu_read_lock();
> > + return true;
> > +}
> > +
> > +static inline void mem_cgroup_unlock_pages(void)
> > +{
> > + rcu_read_unlock();
> > +}
> > +
> > static inline void mem_cgroup_handle_over_high(void)
> > {
> > }
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 894c289c2c06..4e8ab4ad4473 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1523,6 +1523,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
> > return page_to_pfn(&folio->page);
> > }
> >
> > +static inline struct folio *pfn_folio(unsigned long pfn)
> > +{
> > + return page_folio(pfn_to_page(pfn));
> > +}
> > +
> > static inline atomic_t *folio_pincount_ptr(struct folio *folio)
> > {
> > return &folio_page(folio, 1)->compound_pincount;
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 2d023d243e73..f0b980362186 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -374,6 +374,7 @@ enum lruvec_flags {
> > #ifndef __GENERATING_BOUNDS_H
> >
> > struct lruvec;
> > +struct page_vma_mapped_walk;
> >
> > #define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> > #define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> > @@ -429,6 +430,7 @@ struct lru_gen_struct {
> > };
> >
> > void lru_gen_init_lruvec(struct lruvec *lruvec);
> > +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
> >
> > #ifdef CONFIG_MEMCG
> > void lru_gen_init_memcg(struct mem_cgroup *memcg);
> > @@ -441,6 +443,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
> > {
> > }
> >
> > +static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> > +{
> > +}
> > +
> > #ifdef CONFIG_MEMCG
> > static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
> > {
> > diff --git a/mm/internal.h b/mm/internal.h
> > index cf16280ce132..59d2422b647d 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -68,6 +68,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> > void folio_rotate_reclaimable(struct folio *folio);
> > bool __folio_end_writeback(struct folio *folio);
> > void deactivate_file_folio(struct folio *folio);
> > +void folio_activate(struct folio *folio);
> >
> > void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > unsigned long floor, unsigned long ceiling);
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2ee074f80e72..98aa720ac639 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2769,6 +2769,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
> > * - LRU isolation
> > * - lock_page_memcg()
> > * - exclusive reference
> > + * - mem_cgroup_trylock_pages()
> > */
> > folio->memcg_data = (unsigned long)memcg;
> > }
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index fedb82371efe..7cb7ef29088a 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -73,6 +73,7 @@
> > #include <linux/page_idle.h>
> > #include <linux/memremap.h>
> > #include <linux/userfaultfd_k.h>
> > +#include <linux/mm_inline.h>
> >
> > #include <asm/tlbflush.h>
> >
> > @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
> > }
> >
> > if (pvmw.pte) {
> > + if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> > + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> > + lru_gen_look_around(&pvmw);
> > + referenced++;
> > + }
> > +
> > if (ptep_clear_flush_young_notify(vma, address,
>
> Hello, Yu.
> look_around() is calling ptep_test_and_clear_young(pvmw->vma, addr, pte + i)
> only without flush and notify. for flush, there is a tlb operation for arm64:
> static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> unsigned long address, pte_t *ptep)
> {
> int young = ptep_test_and_clear_young(vma, address, ptep);
>
> if (young) {
> /*
> * We can elide the trailing DSB here since the worst that can
> * happen is that a CPU continues to use the young entry in its
> * TLB and we mistakenly reclaim the associated page. The
> * window for such an event is bounded by the next
> * context-switch, which provides a DSB to complete the TLB
> * invalidation.
> */
> flush_tlb_page_nosync(vma, address);
> }
>
> return young;
> }
>
> Does it mean the current kernel is over cautious? is it
> safe to call ptep_test_and_clear_young() only?

I can't really explain why we are getting random app/Java VM crashes in monkey
testing when using ptep_test_and_clear_young() alone in lru_gen_look_around() on
an ARMv8-A machine without hardware PTE young support.

Moving to ptep_clear_flush_young() in look_around makes the random
hang disappear, according to zhanyuan (Cc'ed).

On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
after
'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
the accessed bit instead of flushing the TLB")'

But on arm64 they are different, according to Will's comments in this
thread, which tried to make arm64 the same as x86:
https://www.mail-archive.com/[email protected]/msg1793881.html

"
This is blindly copied from x86 and isn't true for us: we don't invalidate
the TLB on context switch. That means our window for keeping the stale
entries around is potentially much bigger and might not be a great idea.

If we roll a TLB invalidation routine without the trailing DSB, what sort of
performance does that get you?
"
So we shouldn't assume that clearing PTE young without a TLB flush is safe
enough in the LRU? Any comments from Will?

>
> btw, lru_gen_look_around() has already included 'address', are we doing
> pte check for 'address' twice here?
>

Thanks
Barry

2022-06-07 12:05:15

by Will Deacon

Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Tue, Jun 07, 2022 at 10:37:46AM +1200, Barry Song wrote:
> On Tue, Jun 7, 2022 at 10:21 PM Will Deacon <[email protected]> wrote:
> > On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> > > I can't really explain why we are getting a random app/java vm crash in monkey
> > > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > > armv8-a machine without hardware PTE young support.
> > >
> > > Moving to ptep_clear_flush_young() in look_around can make the random
> > > hang disappear according to zhanyuan(Cc-ed).
> > >
> > > On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> > > after
> > > 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> > > the accessed bit instead of flushing the TLB")'
> > >
> > > But on arm64, they are different. according to Will's comments in this
> > > thread which
> > > tried to make arm64 same with x86,
> > > https://www.mail-archive.com/[email protected]/msg1793881.html
> > >
> > > "
> > > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > > the TLB on context switch. That means our window for keeping the stale
> > > entries around is potentially much bigger and might not be a great idea.
> > >
> > > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > > performance does that get you?
> > > "
> > > We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> > > clear PTE young? Any comments from Will?
> >
> > Given that this issue is specific to the multi-gen LRU work, I think Yu is
> > the best person to comment. However, looking quickly at your analysis above,
> > I wonder if the code is relying on this sequence:
> >
> >
> > ptep_test_and_clear_young(vma, address, ptep);
> > ptep_clear_flush_young(vma, address, ptep);
> >
> >
> > to invalidate the TLB. On arm64, that won't be the case, as the invalidation
> > in ptep_clear_flush_young() is predicated on the pte being young (and this
> > patches the generic implementation in mm/pgtable-generic.c. In fact, that
> > second function call is always going to be a no-op unless the pte became
> > young again in the middle.
>
> thanks for your reply, sorry for failing to let you understand my question.
> my question is actually as below,
> right now lru_gen_look_around() is using ptep_test_and_clear_young()
> only without flush to clear pte for a couple of pages including the specific
> address:
> void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> {
> ...
>
> for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> ...
>
> if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> continue;
>
> ...
> }
>
> I wonder if it is safe to arm64. Do we need to move to ptep_clear_flush_young()
> in the loop?

I don't know what this code is doing, so Yu is the best person to answer
that. There's nothing inherently dangerous about eliding the TLB
maintenance; it really depends on the guarantees needed by the caller.

However, the snippet you posted from folio_referenced_one():

| if (pvmw.pte) {
| + if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
| + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
| + lru_gen_look_around(&pvmw);
| + referenced++;
| + }
| +
| if (ptep_clear_flush_young_notify(vma, address,


It does seem to call lru_gen_look_around() *and*
ptep_clear_flush_young_notify(), which is what prompted my question, as it
looks pretty suspicious to me.

Will

2022-06-07 17:27:35

by Will Deacon

Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> On Mon, Jun 6, 2022 at 9:25 PM Barry Song <[email protected]> wrote:
> > On Wed, May 18, 2022 at 4:49 PM Yu Zhao <[email protected]> wrote:
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index fedb82371efe..7cb7ef29088a 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -73,6 +73,7 @@
> > > #include <linux/page_idle.h>
> > > #include <linux/memremap.h>
> > > #include <linux/userfaultfd_k.h>
> > > +#include <linux/mm_inline.h>
> > >
> > > #include <asm/tlbflush.h>
> > >
> > > @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
> > > }
> > >
> > > if (pvmw.pte) {
> > > + if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> > > + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> > > + lru_gen_look_around(&pvmw);
> > > + referenced++;
> > > + }
> > > +
> > > if (ptep_clear_flush_young_notify(vma, address,
> >
> > Hello, Yu.
> > look_around() is calling ptep_test_and_clear_young(pvmw->vma, addr, pte + i)
> > only without flush and notify. for flush, there is a tlb operation for arm64:
> > static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> > unsigned long address, pte_t *ptep)
> > {
> > int young = ptep_test_and_clear_young(vma, address, ptep);
> >
> > if (young) {
> > /*
> > * We can elide the trailing DSB here since the worst that can
> > * happen is that a CPU continues to use the young entry in its
> > * TLB and we mistakenly reclaim the associated page. The
> > * window for such an event is bounded by the next
> > * context-switch, which provides a DSB to complete the TLB
> > * invalidation.
> > */
> > flush_tlb_page_nosync(vma, address);
> > }
> >
> > return young;
> > }
> >
> > Does it mean the current kernel is over cautious? is it
> > safe to call ptep_test_and_clear_young() only?
>
> I can't really explain why we are getting a random app/java vm crash in monkey
> test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> armv8-a machine without hardware PTE young support.
>
> Moving to ptep_clear_flush_young() in look_around can make the random
> hang disappear according to zhanyuan(Cc-ed).
>
> On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> after
> 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> the accessed bit instead of flushing the TLB")'
>
> But on arm64, they are different. according to Will's comments in this
> thread which
> tried to make arm64 same with x86,
> https://www.mail-archive.com/[email protected]/msg1793881.html
>
> "
> This is blindly copied from x86 and isn't true for us: we don't invalidate
> the TLB on context switch. That means our window for keeping the stale
> entries around is potentially much bigger and might not be a great idea.
>
> If we roll a TLB invalidation routine without the trailing DSB, what sort of
> performance does that get you?
> "
> We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> clear PTE young? Any comments from Will?

Given that this issue is specific to the multi-gen LRU work, I think Yu is
the best person to comment. However, looking quickly at your analysis above,
I wonder if the code is relying on this sequence:


ptep_test_and_clear_young(vma, address, ptep);
ptep_clear_flush_young(vma, address, ptep);


to invalidate the TLB. On arm64, that won't be the case, as the invalidation
in ptep_clear_flush_young() is predicated on the pte being young (and this
matches the generic implementation in mm/pgtable-generic.c). In fact, that
second function call is always going to be a no-op unless the pte became
young again in the middle.

Will

2022-06-08 03:56:09

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Tue, Jun 7, 2022 at 10:21 PM Will Deacon <[email protected]> wrote:
>
> On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> > On Mon, Jun 6, 2022 at 9:25 PM Barry Song <[email protected]> wrote:
> > > On Wed, May 18, 2022 at 4:49 PM Yu Zhao <[email protected]> wrote:
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index fedb82371efe..7cb7ef29088a 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -73,6 +73,7 @@
> > > > #include <linux/page_idle.h>
> > > > #include <linux/memremap.h>
> > > > #include <linux/userfaultfd_k.h>
> > > > +#include <linux/mm_inline.h>
> > > >
> > > > #include <asm/tlbflush.h>
> > > >
> > > > @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
> > > > }
> > > >
> > > > if (pvmw.pte) {
> > > > + if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> > > > + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> > > > + lru_gen_look_around(&pvmw);
> > > > + referenced++;
> > > > + }
> > > > +
> > > > if (ptep_clear_flush_young_notify(vma, address,
> > >
> > > Hello, Yu.
> > > look_around() is calling ptep_test_and_clear_young(pvmw->vma, addr, pte + i)
> > > only without flush and notify. for flush, there is a tlb operation for arm64:
> > > static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> > > unsigned long address, pte_t *ptep)
> > > {
> > > int young = ptep_test_and_clear_young(vma, address, ptep);
> > >
> > > if (young) {
> > > /*
> > > * We can elide the trailing DSB here since the worst that can
> > > * happen is that a CPU continues to use the young entry in its
> > > * TLB and we mistakenly reclaim the associated page. The
> > > * window for such an event is bounded by the next
> > > * context-switch, which provides a DSB to complete the TLB
> > > * invalidation.
> > > */
> > > flush_tlb_page_nosync(vma, address);
> > > }
> > >
> > > return young;
> > > }
> > >
> > > Does it mean the current kernel is over cautious? is it
> > > safe to call ptep_test_and_clear_young() only?
> >
> > I can't really explain why we are getting a random app/java vm crash in monkey
> > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > armv8-a machine without hardware PTE young support.
> >
> > Moving to ptep_clear_flush_young() in look_around can make the random
> > hang disappear according to zhanyuan(Cc-ed).
> >
> > On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> > after
> > 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> > the accessed bit instead of flushing the TLB")'
> >
> > But on arm64, they are different. according to Will's comments in this
> > thread which
> > tried to make arm64 same with x86,
> > https://www.mail-archive.com/[email protected]/msg1793881.html
> >
> > "
> > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > the TLB on context switch. That means our window for keeping the stale
> > entries around is potentially much bigger and might not be a great idea.
> >
> > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > performance does that get you?
> > "
> > We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> > clear PTE young? Any comments from Will?
>
> Given that this issue is specific to the multi-gen LRU work, I think Yu is
> the best person to comment. However, looking quickly at your analysis above,
> I wonder if the code is relying on this sequence:
>
>
> ptep_test_and_clear_young(vma, address, ptep);
> ptep_clear_flush_young(vma, address, ptep);
>
>
> to invalidate the TLB. On arm64, that won't be the case, as the invalidation
> in ptep_clear_flush_young() is predicated on the pte being young (and this
> patches the generic implementation in mm/pgtable-generic.c. In fact, that
> second function call is always going to be a no-op unless the pte became
> young again in the middle.

Hi Will,
Thanks for your reply; sorry for not making my question clear.
My question is actually as below:
right now lru_gen_look_around() uses ptep_test_and_clear_young()
alone, without a flush, to clear the PTE young bit for a number of pages
including the specific address:
void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
...

for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
...

if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
continue;

...
}

I wonder if it is safe on arm64. Do we need to move to ptep_clear_flush_young()
in the loop?

>
> Will

Thanks
Barry

2022-06-08 04:03:37

by Yu Zhao

Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Mon, Jun 6, 2022 at 3:25 AM Barry Song <[email protected]> wrote:
>
> On Wed, May 18, 2022 at 4:49 PM Yu Zhao <[email protected]> wrote:

...

> > @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
> > }
> >
> > if (pvmw.pte) {
> > + if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> > + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> > + lru_gen_look_around(&pvmw);
> > + referenced++;
> > + }
> > +
> > if (ptep_clear_flush_young_notify(vma, address,
>
> Hello, Yu.
> look_around() is calling ptep_test_and_clear_young(pvmw->vma, addr, pte + i)
> only without flush and notify. for flush, there is a tlb operation for arm64:
> static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> unsigned long address, pte_t *ptep)
> {
> int young = ptep_test_and_clear_young(vma, address, ptep);
>
> if (young) {
> /*
> * We can elide the trailing DSB here since the worst that can
> * happen is that a CPU continues to use the young entry in its
> * TLB and we mistakenly reclaim the associated page. The
> * window for such an event is bounded by the next
> * context-switch, which provides a DSB to complete the TLB
> * invalidation.
> */
> flush_tlb_page_nosync(vma, address);
> }
>
> return young;
> }
>
> Does it mean the current kernel is over cautious?

Hi Barry,

This is up to individual archs. For x86, ptep_clear_flush_young() is
ptep_test_and_clear_young(). For arm64, I'd say yes, based on Figure 1
of Navarro, Juan, et al. "Practical, transparent operating system
support for superpages." [1].

int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
/*
* On x86 CPUs, clearing the accessed bit without a TLB flush
* doesn't cause data corruption. [ It could cause incorrect
* page aging and the (mistaken) reclaim of hot pages, but the
* chance of that should be relatively low. ]
*
* So as a performance optimization don't flush the TLB when
* clearing the accessed bit, it will eventually be flushed by
* a context switch or a VM operation anyway. [ In the rare
* event of it not getting flushed for a long time the delay
* shouldn't really matter because there's no real memory
* pressure for swapout to react to. ]
*/
return ptep_test_and_clear_young(vma, address, ptep);
}

[1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf

> is it
> safe to call ptep_test_and_clear_young() only?

Yes. Though the h/w A-bit is designed to allow OSes to skip TLB
flushes when unmapping, the Linux kernel doesn't do this.

> btw, lru_gen_look_around() has already included 'address', are we doing
> pte check for 'address' twice here?

Yes for the host MMU but not for the KVM MMU. ptep_clear_flush_young_notify() goes
into the MMU notifier. We don't use the _notify variant in
lru_gen_look_around() because GPA space generally exhibits no memory
locality.

Thanks.

2022-06-08 06:04:44

by Yu Zhao

Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Tue, Jun 7, 2022 at 4:44 AM Will Deacon <[email protected]> wrote:
>
> On Tue, Jun 07, 2022 at 10:37:46AM +1200, Barry Song wrote:
> > On Tue, Jun 7, 2022 at 10:21 PM Will Deacon <[email protected]> wrote:
> > > On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> > > > I can't really explain why we are getting a random app/java vm crash in monkey
> > > > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > > > armv8-a machine without hardware PTE young support.
> > > >
> > > > Moving to ptep_clear_flush_young() in look_around can make the random
> > > > hang disappear according to zhanyuan(Cc-ed).
> > > >
> > > > On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> > > > after
> > > > 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> > > > the accessed bit instead of flushing the TLB")'
> > > >
> > > > But on arm64, they are different. according to Will's comments in this
> > > > thread which
> > > > tried to make arm64 same with x86,
> > > > https://www.mail-archive.com/[email protected]/msg1793881.html
> > > >
> > > > "
> > > > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > > > the TLB on context switch. That means our window for keeping the stale
> > > > entries around is potentially much bigger and might not be a great idea.
> > > >
> > > > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > > > performance does that get you?
> > > > "
> > > > We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> > > > clear PTE young? Any comments from Will?
> > >
> > > Given that this issue is specific to the multi-gen LRU work, I think Yu is
> > > the best person to comment. However, looking quickly at your analysis above,
> > > I wonder if the code is relying on this sequence:
> > >
> > >
> > > ptep_test_and_clear_young(vma, address, ptep);
> > > ptep_clear_flush_young(vma, address, ptep);
> > >
> > >
> > > to invalidate the TLB. On arm64, that won't be the case, as the invalidation
> > > in ptep_clear_flush_young() is predicated on the pte being young (and this
> > > patches the generic implementation in mm/pgtable-generic.c. In fact, that
> > > second function call is always going to be a no-op unless the pte became
> > > young again in the middle.
> >
> > thanks for your reply, sorry for failing to let you understand my question.
> > my question is actually as below,
> > right now lru_gen_look_around() is using ptep_test_and_clear_young()
> > only without flush to clear pte for a couple of pages including the specific
> > address:
> > void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> > {
> > ...
> >
> > for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> > ...
> >
> > if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> > continue;
> >
> > ...
> > }
> >
> > I wonder if it is safe to arm64. Do we need to move to ptep_clear_flush_young()
> > in the loop?
>
> I don't know what this code is doing, so Yu is the best person to answer
> that. There's nothing inherently dangerous about eliding the TLB
> maintenance; it really depends on the guarantees needed by the caller.

Ack.

> However, the snippet you posted from folio_referenced_one():
>
> | if (pvmw.pte) {
> | + if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> | + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> | + lru_gen_look_around(&pvmw);
> | + referenced++;
> | + }
> | +
> | if (ptep_clear_flush_young_notify(vma, address,
>
>
> Does seem to call lru_gen_look_around() *and*
> ptep_clear_flush_young_notify(), which is what prompted my question as it
> looks pretty suspicious to me.

The _notify variant reaches into the MMU notifier;
lru_gen_look_around() doesn't do that because GPA space generally has
no locality. I hope this explains why both are called.

As to why the code is organized this way -- it depends on the point of
view. Mine is that lru_gen_look_around() is an add-on, since its logic
is independent/separable from ptep_clear_flush_young_notify(). We can
make lru_gen_look_around() include ptep_clear_flush_young_notify(),
but that would make the code functionally intertwined, which is bad
for my taste.

2022-06-08 06:05:26

by Yu Zhao

Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Tue, Jun 7, 2022 at 1:37 AM Barry Song <[email protected]> wrote:
>
> On Mon, Jun 6, 2022 at 9:25 PM Barry Song <[email protected]> wrote:
> >
> > On Wed, May 18, 2022 at 4:49 PM Yu Zhao <[email protected]> wrote:

...

> I can't really explain why we are getting a random app/java vm crash in monkey
> test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> armv8-a machine without hardware PTE young support.
>
> Moving to ptep_clear_flush_young() in look_around can make the random
> hang disappear according to zhanyuan(Cc-ed).

This sounds too familiar -- let me ask again: was the following commit
included during the test?

07509e10dcc7 arm64: pgtable: Fix pte_accessible()

If not, it will cause exactly the problem you described. And what
about this one?

e914d8f00391 mm: fix unexpected zeroed page mapping with zram swap

Missing it also causes userspace memory corruption on Android, i.e.,
random app crashes.

> On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> after
> 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> the accessed bit instead of flushing the TLB")'
>
> But on arm64, they are different. according to Will's comments in this
> thread which
> tried to make arm64 same with x86,
> https://www.mail-archive.com/[email protected]/msg1793881.html
>
> "
> This is blindly copied from x86 and isn't true for us: we don't invalidate
> the TLB on context switch. That means our window for keeping the stale
> entries around is potentially much bigger and might not be a great idea.
>
> If we roll a TLB invalidation routine without the trailing DSB, what sort of
> performance does that get you?
> "
> We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> clear PTE young? Any comments from Will?
>
> >
> > btw, lru_gen_look_around() has already included 'address', are we doing
> > pte check for 'address' twice here?

Explained in the previous reply. Hope that clarifies things.

2022-06-08 06:16:28

by Yu Zhao

Subject: Re: [PATCH v11 00/14] Multi-Gen LRU Framework

On Tue, May 17, 2022 at 8:05 PM Jens Axboe <[email protected]> wrote:
>
> On 5/17/22 7:46 PM, Yu Zhao wrote:
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and it
> > often makes poor choices about what to evict. This patchset offers an
> > alternative solution that is performant, versatile and
> > straightforward.
>
> Where's the changelog since v10?

Apologies for my laziness.

The changes are mainly nits, e.g., small refactorings, additional
comments, etc.; relatively major ones are:

* VM_BUG_ON() -> VM_WARN_ON_ONCE()
* Removed `depends on !MAXSMP`

There are no bug fixes or structural changes in v11. I do have a bug
fix [1] queued for the coming v12, which I have been benchmarking on
top of 5.19-rc1. So far I see no improvements or regressions, compared
with v10 on top of 5.18-rc1.

[1] https://forum.armbian.com/topic/20018-mglru-patches-to-bring-down-kswapd-cpu-usage

2022-06-08 07:33:12

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Wed, Jun 8, 2022 at 9:07 AM Yu Zhao <[email protected]> wrote:
>
> On Tue, Jun 7, 2022 at 4:44 AM Will Deacon <[email protected]> wrote:
> >
> > On Tue, Jun 07, 2022 at 10:37:46AM +1200, Barry Song wrote:
> > > On Tue, Jun 7, 2022 at 10:21 PM Will Deacon <[email protected]> wrote:
> > > > On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> > > > > I can't really explain why we are getting a random app/java vm crash in monkey
> > > > > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > > > > armv8-a machine without hardware PTE young support.
> > > > >
> > > > > Moving to ptep_clear_flush_young() in look_around can make the random
> > > > > hang disappear according to zhanyuan(Cc-ed).
> > > > >
> > > > > On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> > > > > after
> > > > > 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> > > > > the accessed bit instead of flushing the TLB")'
> > > > >
> > > > > But on arm64, they are different. according to Will's comments in this
> > > > > thread which
> > > > > tried to make arm64 same with x86,
> > > > > https://www.mail-archive.com/[email protected]/msg1793881.html
> > > > >
> > > > > "
> > > > > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > > > > the TLB on context switch. That means our window for keeping the stale
> > > > > entries around is potentially much bigger and might not be a great idea.
> > > > >
> > > > > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > > > > performance does that get you?
> > > > > "
> > > > > We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> > > > > clear PTE young? Any comments from Will?
> > > >
> > > > Given that this issue is specific to the multi-gen LRU work, I think Yu is
> > > > the best person to comment. However, looking quickly at your analysis above,
> > > > I wonder if the code is relying on this sequence:
> > > >
> > > >
> > > > ptep_test_and_clear_young(vma, address, ptep);
> > > > ptep_clear_flush_young(vma, address, ptep);
> > > >
> > > >
> > > > to invalidate the TLB. On arm64, that won't be the case, as the invalidation
> > > > in ptep_clear_flush_young() is predicated on the pte being young (and this
> > > > patches the generic implementation in mm/pgtable-generic.c. In fact, that
> > > > second function call is always going to be a no-op unless the pte became
> > > > young again in the middle.
> > >
> > > thanks for your reply, sorry for failing to let you understand my question.
> > > my question is actually as below,
> > > right now lru_gen_look_around() is using ptep_test_and_clear_young()
> > > only without flush to clear pte for a couple of pages including the specific
> > > address:
> > > void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> > > {
> > > 	...
> > >
> > > 	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> > > 		...
> > >
> > > 		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> > > 			continue;
> > >
> > > 		...
> > > 	}
> > >
> > > I wonder if it is safe to arm64. Do we need to move to ptep_clear_flush_young()
> > > in the loop?
> >
> > I don't know what this code is doing, so Yu is the best person to answer
> > that. There's nothing inherently dangerous about eliding the TLB
> > maintenance; it really depends on the guarantees needed by the caller.
>
> Ack.
>
> > However, the snippet you posted from folio_referenced_one():
> >
> > | if (pvmw.pte) {
> > | + if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> > | + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> > | + lru_gen_look_around(&pvmw);
> > | + referenced++;
> > | + }
> > | +
> > | if (ptep_clear_flush_young_notify(vma, address,
> >
> >
> > Does seem to call lru_gen_look_around() *and*
> > ptep_clear_flush_young_notify(), which is what prompted my question as it
> > looks pretty suspicious to me.
>
> The _notify variant reaches into the MMU notifier --
> lru_gen_look_around() doesn't do that because GPA space generally has
> no locality. I hope this explains why both.
>
> As to why the code is organized this way -- it depends on the point of
> view. Mine is that lru_gen_look_around() is an add-on, since its logic
> is independent/separable from ptep_clear_flush_young_notify(). We can
> make lru_gen_look_around() include ptep_clear_flush_young_notify(),
> but that would make the code functionally intertwined, which is bad
> for my taste.

Given that the LRU used to flush when clearing pte young, and the flush is now
a no-op in almost all cases (unless the address becomes young again exactly
between look_around and ptep_clear_flush_young_notify()), we are effectively
dropping the flush. So the question is: were we overcautious? Do we actually
not need the flush at all, even without MGLRU? On arm64, without the flush,
stale TLB entries might be used for a relatively long time, as commented in
[1]; does that actually do any harm?
[1]https://www.mail-archive.com/[email protected]/msg1793881.html

Thanks
Barry

2022-06-08 09:00:22

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Wed, Jun 8, 2022 at 7:07 AM Yu Zhao <[email protected]> wrote:
>
> On Tue, Jun 7, 2022 at 1:37 AM Barry Song <[email protected]> wrote:
> >
> > On Mon, Jun 6, 2022 at 9:25 PM Barry Song <[email protected]> wrote:
> > >
> > > On Wed, May 18, 2022 at 4:49 PM Yu Zhao <[email protected]> wrote:
>
> ...
>
> > I can't really explain why we are getting a random app/java vm crash in monkey
> > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > armv8-a machine without hardware PTE young support.
> >
> > Moving to ptep_clear_flush_young() in look_around can make the random
> > hang disappear according to zhanyuan(Cc-ed).
>
> This sounds too familiar -- let me ask again: was the following commit
> included during the test?
>
> 07509e10dcc7 arm64: pgtable: Fix pte_accessible()
>
> If not, it will cause exactly the problem you described. And what
> about this one?
>
> e914d8f00391 mm: fix unexpected zeroed page mapping with zram swap
>
> Missing it also causes userspace memory corruption on Android, i.e.,
> random app crashes.
>

According to zhanyuan's testing, we can confirm that the above two commits
fix the random Android crashes.

Thanks
Barry

2022-06-08 16:19:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Tue, Jun 7, 2022 at 5:43 PM Barry Song <[email protected]> wrote:
>
> Given we used to have a flush for clear pte young in LRU, right now we are
> moving to nop in almost all cases for the flush unless the address becomes
> young exactly after look_around and before ptep_clear_flush_young_notify.
> It means we are actually dropping flush. So the question is, were we
> overcautious? we actually don't need the flush at all even without mglru?

We stopped flushing the TLB on A bit clears on x86 back in 2014.

See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
clear the accessed bit instead of flushing the TLB").

Linus

2022-06-08 23:15:31

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Jun 7, 2022 at 5:43 PM Barry Song <[email protected]> wrote:
> >
> > Given we used to have a flush for clear pte young in LRU, right now we are
> > moving to nop in almost all cases for the flush unless the address becomes
> > young exactly after look_around and before ptep_clear_flush_young_notify.
> > It means we are actually dropping flush. So the question is, were we
> > overcautious? we actually don't need the flush at all even without mglru?
>
> We stopped flushing the TLB on A bit clears on x86 back in 2014.
>
> See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> clear the accessed bit instead of flushing the TLB").

This is true for x86, RISC-V, powerpc and s390, but it is not true for
most platforms.

There was an attempt to do the same thing on arm64:
https://www.mail-archive.com/[email protected]/msg1793830.html
but arm64 still sends a nosync TLBI and depends on a deferred DSB:
https://www.mail-archive.com/[email protected]/msg1794484.html

Plus, the generic code will also send a TLB flush:

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	int young;

	young = ptep_test_and_clear_young(vma, address, ptep);
	if (young)
		flush_tlb_page(vma, address);

	return young;
}

We used to use ptep_test_and_clear_young() only in rmap.c for page_referenced()
in 2.6.0:
https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/tree/mm/rmap.c?h=v2.6.0
int page_referenced(struct page *page)
{
	...
	if (ptep_test_and_clear_young(p))
	...
}

but by 2.6.12 it had already become ptep_clear_flush_young() in
page_referenced_one():
https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/tree/mm/rmap.c?h=v2.6.12

I failed to find the history to figure out the motivation for switching to
ptep_clear_flush_young() in the LRU in 2.6.12, but I am still curious how
flushing or not affects the LRU on those platforms where
ptep_clear_flush_young() and ptep_test_and_clear_young() are different.

>
> Linus

Thanks
Barry

2022-06-16 22:03:30

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Wed, Jun 8, 2022 at 4:46 PM Barry Song <[email protected]> wrote:
>
> On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <[email protected]> wrote:
> > >
> > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > moving to nop in almost all cases for the flush unless the address becomes
> > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > It means we are actually dropping flush. So the question is, were we
> > > overcautious? we actually don't need the flush at all even without mglru?
> >
> > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> >
> > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > clear the accessed bit instead of flushing the TLB").
>
> This is true for x86, RISC-V, powerpc and S390. but it is not true for
> most platforms.
>
> There was an attempt to do the same thing in arm64:
> https://www.mail-archive.com/[email protected]/msg1793830.html
> but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> https://www.mail-archive.com/[email protected]/msg1794484.html

Barry, you've already answered your own question.

Without commit 07509e10dcc7 ("arm64: pgtable: Fix pte_accessible()"):
#define pte_accessible(mm, pte) \
- (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
+ (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))

You missed all TLB flushes for PTEs that have gone through
ptep_test_and_clear_young() on the reclaim path. But most of the time,
you got away with it, only occasional app crashes:
https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/

Why?

2022-06-16 22:42:49

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <[email protected]> wrote:
>
> On Wed, Jun 8, 2022 at 4:46 PM Barry Song <[email protected]> wrote:
> >
> > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <[email protected]> wrote:
> > > >
> > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > It means we are actually dropping flush. So the question is, were we
> > > > overcautious? we actually don't need the flush at all even without mglru?
> > >
> > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > >
> > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > clear the accessed bit instead of flushing the TLB").
> >
> > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > most platforms.
> >
> > There was an attempt to do the same thing in arm64:
> > https://www.mail-archive.com/[email protected]/msg1793830.html
> > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > https://www.mail-archive.com/[email protected]/msg1794484.html
>
> Barry, you've already answered your own question.
>
> Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> #define pte_accessible(mm, pte) \
> - (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> + (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
>
> You missed all TLB flushes for PTEs that have gone through
> ptep_test_and_clear_young() on the reclaim path. But most of the time,
> you got away with it, only occasional app crashes:
> https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
>
> Why?

Yes. On arm64, ptep_test_and_clear_young() without a flush can cause random
app crashes; ptep_test_and_clear_young() + flush doesn't have this kind of
crash. But after applying commit 07509e10dcc7 ("arm64: pgtable: Fix
pte_accessible()"), ptep_test_and_clear_young() without a flush no longer
causes apps to crash.

ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7: OK
ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7: OK
ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7: CRASH

So is it possible that other platforms have problems similar to arm64's while
a commit like 07509e10dcc7 isn't there? Anyway, we depend on the platforms
that can really use MGLRU to expose this kind of potential bug.

BTW, do you think it is safe to totally remove the flush code even for the
original LRU? I don't see a fundamental difference between MGLRU and the
original LRU on this "flush" thing: since MGLRU doesn't need the flush, why
does the original LRU need it? The flush is very expensive, and if we do
think it is unnecessary, will we remove it for the original LRU as well?

BTW, look_around is a great idea. Our experiments also show a great decrease
in the CPU consumption of page_referenced().

Thanks
Barry

2022-06-16 23:54:48

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Thu, Jun 16, 2022 at 4:33 PM Barry Song <[email protected]> wrote:
>
> On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <[email protected]> wrote:
> >
> > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <[email protected]> wrote:
> > >
> > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <[email protected]> wrote:
> > > > >
> > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > It means we are actually dropping flush. So the question is, were we
> > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > >
> > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > >
> > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > clear the accessed bit instead of flushing the TLB").
> > >
> > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > most platforms.
> > >
> > > There was an attempt to do the same thing in arm64:
> > > https://www.mail-archive.com/[email protected]/msg1793830.html
> > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > https://www.mail-archive.com/[email protected]/msg1794484.html
> >
> > Barry, you've already answered your own question.
> >
> > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > #define pte_accessible(mm, pte) \
> > - (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > + (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> >
> > You missed all TLB flushes for PTEs that have gone through
> > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > you got away with it, only occasional app crashes:
> > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> >
> > Why?
>
> Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> can cause random
> App to crash.
> ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> pte_accessible(), on arm64,
> ptep_test_and_clear_young() without flush won't cause App to crash.
>
> ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7: OK
> ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7: OK
> ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7: CRASH

I agree -- my question was rhetorical :)

I was trying to imply this logic:
1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
3 (case 1 & 2 guarantee flushes)
3. We saw crashes, but only occasionally

Had the TLB cached those PTEs, we would have seen the crashes more often,
which contradicts our observation. So the conclusion is that the TLB didn't
cache them most of the time, meaning flushing the TLB just for the sake of
the A-bit isn't necessary.

> do you think it is safe to totally remove the flush code even for
> the original
> LRU?

Affirmative, based not only on my words, but on third parties':
1. Your (indirect) observation
2. Alexander's benchmark:
https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
3. The fundamental hardware limitation in terms of the TLB scalability
(Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf

2022-06-17 01:58:38

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Thu, Jun 16, 2022 at 5:29 PM Yu Zhao <[email protected]> wrote:
>
> On Thu, Jun 16, 2022 at 4:33 PM Barry Song <[email protected]> wrote:
> >
> > On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <[email protected]> wrote:
> > >
> > > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <[email protected]> wrote:
> > > >
> > > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <[email protected]> wrote:
> > > > > >
> > > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > > It means we are actually dropping flush. So the question is, were we
> > > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > > >
> > > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > > >
> > > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > > clear the accessed bit instead of flushing the TLB").
> > > >
> > > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > > most platforms.
> > > >
> > > > There was an attempt to do the same thing in arm64:
> > > > https://www.mail-archive.com/[email protected]/msg1793830.html
> > > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > > https://www.mail-archive.com/[email protected]/msg1794484.html
> > >
> > > Barry, you've already answered your own question.
> > >
> > > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > > #define pte_accessible(mm, pte) \
> > > - (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > > + (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> > >
> > > You missed all TLB flushes for PTEs that have gone through
> > > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > > you got away with it, only occasional app crashes:
> > > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> > >
> > > Why?
> >
> > Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> > can cause random
> > App to crash.
> > ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> > But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> > pte_accessible(), on arm64,
> > ptep_test_and_clear_young() without flush won't cause App to crash.
> >
> > ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7: OK
> > ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7: OK
> > ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7: CRASH
>
> I agree -- my question was rhetorical :)
>
> I was trying to imply this logic:
> 1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
> 2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
> 3 (case 1 & 2 guarantee flushes)
> 3. We saw crashes, but only occasionally
>
> Assuming TLB cached those PTEs, we would have seen the crashes more
> often, which contradicts our observation. So the conclusion is TLB
> didn't cache them most of the time, meaning flushing TLB just for the
> sake of the A-bit isn't necessary.
>
> > do you think it is safe to totally remove the flush code even for
> > the original
> > LRU?
>
> Affirmative, based on not only my words, but 3rd parties':
> 1. Your (indirect) observation
> 2. Alexander's benchmark:
> https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
> 3. The fundamental hardware limitation in terms of the TLB scalability
> (Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf

4. Intel's commit b13b1d2d8692 ("x86/mm: In the PTE swapout page
reclaim case clear the accessed bit instead of flushing the TLB")

2022-06-17 02:22:40

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Fri, Jun 17, 2022 at 1:43 PM Yu Zhao <[email protected]> wrote:
>
> On Thu, Jun 16, 2022 at 5:29 PM Yu Zhao <[email protected]> wrote:
> >
> > On Thu, Jun 16, 2022 at 4:33 PM Barry Song <[email protected]> wrote:
> > >
> > > On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <[email protected]> wrote:
> > > >
> > > > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <[email protected]> wrote:
> > > > > > >
> > > > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > > > It means we are actually dropping flush. So the question is, were we
> > > > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > > > >
> > > > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > > > >
> > > > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > > > clear the accessed bit instead of flushing the TLB").
> > > > >
> > > > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > > > most platforms.
> > > > >
> > > > > There was an attempt to do the same thing in arm64:
> > > > > https://www.mail-archive.com/[email protected]/msg1793830.html
> > > > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > > > https://www.mail-archive.com/[email protected]/msg1794484.html
> > > >
> > > > Barry, you've already answered your own question.
> > > >
> > > > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > > > #define pte_accessible(mm, pte) \
> > > > - (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > > > + (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> > > >
> > > > You missed all TLB flushes for PTEs that have gone through
> > > > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > > > you got away with it, only occasional app crashes:
> > > > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> > > >
> > > > Why?
> > >
> > > Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> > > can cause random
> > > App to crash.
> > > ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> > > But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> > > pte_accessible(), on arm64,
> > > ptep_test_and_clear_young() without flush won't cause App to crash.
> > >
> > > ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7: OK
> > > ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7: OK
> > > ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7: CRASH
> >
> > I agree -- my question was rhetorical :)
> >
> > I was trying to imply this logic:
> > 1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
> > 2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
> > 3 (case 1 & 2 guarantee flushes)
> > 3. We saw crashes, but only occasionally
> >
> > Assuming TLB cached those PTEs, we would have seen the crashes more
> > often, which contradicts our observation. So the conclusion is TLB
> > didn't cache them most of the time, meaning flushing TLB just for the
> > sake of the A-bit isn't necessary.
> >
> > > do you think it is safe to totally remove the flush code even for
> > > the original
> > > LRU?
> >
> > Affirmative, based on not only my words, but 3rd parties':
> > 1. Your (indirect) observation
> > 2. Alexander's benchmark:
> > https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
> > 3. The fundamental hardware limitation in terms of the TLB scalability
> > (Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
>
> 4. Intel's commit b13b1d2d8692 ("x86/mm: In the PTE swapout page
> reclaim case clear the accessed bit instead of flushing the TLB")

Hi Yu,
I am going to send a RFC based on the above discussion.

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..7ce6f0b6c330 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
 	}
 
 	if (pvmw.pte) {
-		if (ptep_clear_flush_young_notify(vma, address,
+		if (ptep_clear_young_notify(vma, address,
 						  pvmw.pte)) {
 			/*
 			 * Don't treat a reference through

Thanks
Barry

2022-06-17 03:06:10

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Thu, Jun 16, 2022 at 8:01 PM Barry Song <[email protected]> wrote:
>
> On Fri, Jun 17, 2022 at 1:43 PM Yu Zhao <[email protected]> wrote:
> >
> > On Thu, Jun 16, 2022 at 5:29 PM Yu Zhao <[email protected]> wrote:
> > >
> > > On Thu, Jun 16, 2022 at 4:33 PM Barry Song <[email protected]> wrote:
> > > >
> > > > On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <[email protected]> wrote:
> > > > >
> > > > > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > > > > It means we are actually dropping flush. So the question is, were we
> > > > > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > > > > >
> > > > > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > > > > >
> > > > > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > > > > clear the accessed bit instead of flushing the TLB").
> > > > > >
> > > > > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > > > > most platforms.
> > > > > >
> > > > > > There was an attempt to do the same thing in arm64:
> > > > > > https://www.mail-archive.com/[email protected]/msg1793830.html
> > > > > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > > > > https://www.mail-archive.com/[email protected]/msg1794484.html
> > > > >
> > > > > Barry, you've already answered your own question.
> > > > >
> > > > > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > > > > #define pte_accessible(mm, pte) \
> > > > > - (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > > > > + (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> > > > >
> > > > > You missed all TLB flushes for PTEs that have gone through
> > > > > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > > > > you got away with it, only occasional app crashes:
> > > > > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> > > > >
> > > > > Why?
> > > >
> > > > Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> > > > can cause random
> > > > App to crash.
> > > > ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> > > > But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> > > > pte_accessible(), on arm64,
> > > > ptep_test_and_clear_young() without flush won't cause App to crash.
> > > >
> > > > ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7: OK
> > > > ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7: OK
> > > > ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7: CRASH
> > >
> > > I agree -- my question was rhetorical :)
> > >
> > > I was trying to imply this logic:
> > > 1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
> > > 2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
> > > 3 (case 1 & 2 guarantee flushes)
> > > 3. We saw crashes, but only occasionally
> > >
> > > Assuming TLB cached those PTEs, we would have seen the crashes more
> > > often, which contradicts our observation. So the conclusion is TLB
> > > didn't cache them most of the time, meaning flushing TLB just for the
> > > sake of the A-bit isn't necessary.
> > >
> > > > do you think it is safe to totally remove the flush code even for
> > > > the original
> > > > LRU?
> > >
> > > Affirmative, based on not only my words, but 3rd parties':
> > > 1. Your (indirect) observation
> > > 2. Alexander's benchmark:
> > > https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
> > > 3. The fundamental hardware limitation in terms of the TLB scalability
> > > (Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
> >
> > 4. Intel's commit b13b1d2d8692 ("x86/mm: In the PTE swapout page
> > reclaim case clear the accessed bit instead of flushing the TLB")
>
> Hi Yu,
> I am going to send an RFC based on the above discussion.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5bcb334cd6f2..7ce6f0b6c330 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
> }
>
> if (pvmw.pte) {
> - if (ptep_clear_flush_young_notify(vma, address,
> + if (ptep_clear_young_notify(vma, address,
> pvmw.pte)) {
> /*
> * Don't treat a reference through

Thanks!

This might make a difference on my 64 core Altra -- I'll test after
you post the RFC.

2022-06-17 03:44:21

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Thu, Jun 16, 2022 at 9:03 PM Yu Zhao <[email protected]> wrote:
>
> On Thu, Jun 16, 2022 at 8:01 PM Barry Song <[email protected]> wrote:
> >
> > [...]
> >
> > Hi Yu,
> > I am going to send an RFC based on the above discussion.
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 5bcb334cd6f2..7ce6f0b6c330 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
> > }
> >
> > if (pvmw.pte) {
> > - if (ptep_clear_flush_young_notify(vma, address,
> > + if (ptep_clear_young_notify(vma, address,
> > pvmw.pte)) {
> > /*
> > * Don't treat a reference through
>
> Thanks!
>
> This might make a difference on my 64 core Altra -- I'll test after
> you post the RFC.

Also, IIRC, it made no difference on POWER9 because POWER9
flushes the TLB regardless of which variant is used.

2022-06-19 21:28:11

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Thu, Jun 16, 2022 at 9:17 PM Yu Zhao <[email protected]> wrote:
>
> [...]
>
> Also, IIRC, it made no difference on POWER9 because POWER9
> flushes the TLB regardless of which variant is used.
^^^^^^^ doesn't flush

I just verified this on POWER9. So on top of exhibits 1-4, we got:
5. 3cb1aa7aa3940 ("powerpc/64s: Implement ptep_clear_flush_young
that does not flush TLBs")

2022-06-19 22:18:40

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

On Mon, Jun 20, 2022 at 8:37 AM Yu Zhao <[email protected]> wrote:
>
> On Thu, Jun 16, 2022 at 9:17 PM Yu Zhao <[email protected]> wrote:
> >
> > [...]
> >
> > Also, IIRC, it made no difference on POWER9 because POWER9
> > flushes the TLB regardless of which variant is used.
> ^^^^^^^ doesn't flush
>
> I just verified this on POWER9. So on top of exhibits 1-4, we got:
> 5. 3cb1aa7aa3940 ("powerpc/64s: Implement ptep_clear_flush_young
> that does not flush TLBs")

Thanks, Yu. I posted an RFC:
https://lore.kernel.org/lkml/[email protected]/

We can clarify everything in that thread :-)

Thanks
Barry