2022-01-04 20:23:24

by Yu Zhao

Subject: [PATCH v6 0/9] Multigenerational LRU Framework

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Design objectives
=================
The design objectives are:
1. Better representation of access recency
2. Try to profit from spatial locality
3. Clear fast path making obvious choices
4. Simple self-correcting heuristics

The representation of access recency is at the core of all LRU
approximations. The multigenerational LRU (MGLRU) divides pages into
multiple lists (generations), each having bounded access recency (a
time interval). Generations establish a common frame of reference and
help make better choices, e.g., between different memcgs on a computer
or different computers in a data center (for cluster job scheduling).
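
To make the frame-of-reference idea concrete, here is a minimal
standalone sketch (plain C, not the kernel code; MAX_NR_GENS and the
field names are assumptions for illustration): two monotonically
increasing counters bound a set of generation lists, and a sequence
number maps to a fixed slot while it stays within the window.

/* Illustrative userspace model only; the in-kernel structures differ. */
#include <stdio.h>

#define MAX_NR_GENS 4	/* assumed bound for this sketch */

struct gen_window {
	unsigned long max_seq;			/* youngest generation, bumped by the aging */
	unsigned long min_seq;			/* oldest generation, bumped by the eviction */
	unsigned long nr_pages[MAX_NR_GENS];	/* pages per generation */
};

/* A sequence number keeps the same slot for as long as it is in the window. */
static int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

int main(void)
{
	struct gen_window w = { .max_seq = 3, .min_seq = 1 };

	/* A page faulted in now would be tagged with the youngest generation. */
	w.nr_pages[gen_from_seq(w.max_seq)] += 1;

	/* The eviction drains the oldest generation before advancing min_seq. */
	printf("evict from slot %d, tag new pages with slot %d\n",
	       gen_from_seq(w.min_seq), gen_from_seq(w.max_seq));
	return 0;
}

Because every page carries a generation number with bounded recency,
"older generation" means the same thing whether two memcgs on one
machine or two machines in a cluster are being compared.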

Exploiting spatial locality improves efficiency when gathering the
accessed bit. An rmap walk targets a single page and doesn't try to
profit from discovering an accessed PTE. A page table walk can sweep
all hotspots in an address space, but its search space can be too
large to make a profit. The key is to optimize both methods and use
them in combination. (PMU is another option for further exploration.)
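
One such optimization, used by patch 2 below on x86_64, is to prune
the page table walk with the accessed bit in non-leaf PMD entries. A
rough standalone model follows (not the kernel walker; PTRS_PER_PTE
and all other names here exist only for the sketch):

/* Simplified model of pruning a page table walk by the PMD accessed bit. */
#include <stdbool.h>
#include <stdio.h>

#define PTRS_PER_PTE 512	/* one PTE table per PMD entry in this model */

struct pte_model { bool accessed; };

struct pmd_model {
	bool accessed;				/* set by hardware on capable CPUs */
	struct pte_model ptes[PTRS_PER_PTE];
};

static int scan_pmd(struct pmd_model *pmd, bool nonleaf_pmd_young)
{
	int young = 0;
	int i;

	/* Nothing below this PMD was accessed since the last scan: skip it. */
	if (nonleaf_pmd_young && !pmd->accessed)
		return 0;

	for (i = 0; i < PTRS_PER_PTE; i++) {
		if (pmd->ptes[i].accessed) {
			pmd->ptes[i].accessed = false;	/* clear and count */
			young++;
		}
	}
	pmd->accessed = false;
	return young;
}

int main(void)
{
	struct pmd_model cold = { .accessed = false };

	/* With the capability, the 512 cold PTEs are never touched. */
	printf("young PTEs found: %d\n", scan_pmd(&cold, true));
	return 0;
}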

A clear fast path reduces code complexity and runtime overhead. Unmapped pages
don't require TLB flushes; clean pages don't require writeback. These
facts are only helpful when other conditions, e.g., access recency,
are similar. With generations as a common frame of reference,
additional factors stand out. But obvious choices might not be good
choices; thus self-correction is required (the next objective).

The benefits of simple self-correcting heuristics are self-evident.
Again with generations as a common frame of reference, this becomes
attainable. Specifically, pages in the same generation are categorized
based on additional factors, and a closed-loop control statistically
compares the refault percentages across all categories and throttles
the eviction of those that have higher percentages.
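
As a minimal sketch of such a feedback loop (invented numbers and
names, not the kernel's controller): track evictions and refaults per
category, compare the refault ratios by cross-multiplication, and
throttle the category whose ratio is higher.

/* Illustrative model of refault feedback; not the in-kernel controller. */
#include <stdbool.h>
#include <stdio.h>

struct category_stats {
	unsigned long evicted;		/* pages evicted from this category */
	unsigned long refaulted;	/* evicted pages that came back */
};

/*
 * Compare refault ratios without division; the +1 guards the case where a
 * category hasn't seen any evictions yet.
 */
static bool should_throttle(const struct category_stats *baseline,
			    const struct category_stats *candidate)
{
	return (unsigned long long)candidate->refaulted * (baseline->evicted + 1) >
	       (unsigned long long)baseline->refaulted * (candidate->evicted + 1);
}

int main(void)
{
	struct category_stats base = { .evicted = 1000, .refaulted = 50 };	/* 5% */
	struct category_stats cand = { .evicted = 200,  .refaulted = 40 };	/* 20% */

	/* The candidate refaults more often, so its eviction gets throttled. */
	printf("throttle: %d\n", should_throttle(&base, &cand));
	return 0;
}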

Patchset overview
=================
1. mm: x86, arm64: add arch_has_hw_pte_young()
2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
These two patches materialize hardware optimizations for clearing the
accessed bit in many PTEs. If hardware automatically sets the accessed
bit in PTEs, there is no need to worry about bursty page faults
(emulating the accessed bit). If it also sets the accessed bit in
non-leaf PMD entries, there is no need to search the PTE table pointed
to by a PMD entry that doesn't have the accessed bit set.

3. mm/vmscan.c: refactor shrink_node()
A minor refactor.

4. mm: multigenerational lru: groundwork
Adding the basic data structure and the functions to initialize it and
insert/remove pages.

5. mm: multigenerational lru: mm_struct list
An infrastructure that keeps track of mm_structs for page table
walkers and provides them with optimizations, i.e., switch_mm()
tracking and Bloom filters.

6. mm: multigenerational lru: aging
7. mm: multigenerational lru: eviction
"The page reclaim" is a producer/consumer model. "The aging" produces
cold pages, whereas "the eviction " consumes them. Cold pages flow
through generations. The aging uses the mm_struct list infra to sweep
dense hotspots in page tables. During a page table walk, the aging
clears the accessed bit and tags accessed pages with the youngest
generation number. The eviction sorts those pages when it encounters
them. For pages in the oldest generation, eviction walks the rmap to
check the accessed bit one more time before evicting them. During an
rmap walk, the eviction feeds dense hotspots back to the aging. Dense
hotspots flow through the Bloom filters. For pages not mapped in page
tables, the eviction uses the PID controller to statistically
determine whether they have higher refaults. If so, the eviction
throttles their eviction by moving them to the next generation (the
second oldest).

8. mm: multigenerational lru: user interface
The knobs to turn MGLRU on/off and to provide userspace with
thrashing prevention, working set estimation (the aging) and
proactive reclaim (the eviction).

9. mm: multigenerational lru: Kconfig
The Kconfig options.
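
Below is the sketch referenced under patches 6-7: a minimal,
self-contained model (plain C, invented names; the real aging and
eviction operate on lruvec lists under locks) of how the eviction
keeps pages that the aging has already re-tagged with a newer
generation, and gives the remaining pages one more accessed-bit check
before evicting them.

/* Illustrative sketch only; pages, lists and locking are vastly simplified. */
#include <stdbool.h>
#include <stdio.h>

struct page_model {
	unsigned long gen_seq;	/* generation the page was last tagged with */
	bool accessed;		/* modeled accessed bit found by the rmap walk */
};

/* Returns true if the page should be evicted, false if it is kept. */
static bool sort_or_evict(struct page_model *page,
			  unsigned long min_seq, unsigned long max_seq)
{
	/* The aging already tagged this page with a newer generation: keep it. */
	if (page->gen_seq > min_seq)
		return false;

	/* One more accessed-bit check before giving up on the page. */
	if (page->accessed) {
		page->gen_seq = max_seq;	/* promote to the youngest generation */
		page->accessed = false;
		return false;
	}

	return true;
}

int main(void)
{
	struct page_model p = { .gen_seq = 1, .accessed = true };

	printf("evict: %d\n", sort_or_evict(&p, 1, 3));	/* kept and promoted */
	printf("evict: %d\n", sort_or_evict(&p, 1, 3));	/* now young: kept again */
	return 0;
}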

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
Apache Cassandra      Memcached
Apache Hadoop         MongoDB
Apache Spark          PostgreSQL
MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
less wall time to sort three billion random integers, respectively,
under the medium- and the high-concurrency conditions, when
overcommitting memory. There were no statistically significant
changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
more transactions per minute (TPM), respectively, under the medium-
and the high-concurrency conditions, when overcommitting memory.
There were no statistically significant changes in TPM for the rest
of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
and [21.59, 30.02]% more operations per second (OPS), respectively,
for sequential access, random access and Gaussian (distribution)
access, when THP=always; 95% CIs [13.85, 15.97]% and
[23.94, 29.92]% more OPS, respectively, for random access and
Gaussian access, when THP=never. There were no statistically
significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
[2.16, 3.55]% more operations per second (OPS), respectively, for
exponential (distribution) access, random access and Zipfian
(distribution) access, when underutilizing memory; 95% CIs
[8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
respectively, for exponential access, random access and Zipfian
access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
and [4.11, 7.50]% more operations per second (OPS), respectively,
for exponential (distribution) access, random access and Zipfian
(distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
[6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
exponential access, random access and Zipfian access, when swap was
on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
less average wall time to finish twelve parallel TeraSort jobs,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in average wall time for the rest of the
benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
minute (TPM) under the high-concurrency condition, when swap was
off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
[11.47, 19.36]% more total operations per second (OPS),
respectively, for sequential access, random access and Gaussian
(distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
[10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
for sequential access, random access and Gaussian access, when
THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
are popular among MM developers, but we prefer large-scale A/B
experiments to validate improvements.)
fs_fio_bench_hdd_mq   pft
fs_lmbench            pgsql-hammerdb
fs_parallelio         redis
fs_postmark           stream
hackbench             sysbenchthread
kernbench             tpcc_spark
memcached             unixbench
multichase            vm-scalability
mutilate              will-it-scale
nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/linux-mm/[email protected]/
[03] https://lore.kernel.org/linux-mm/[email protected]/
[04] https://lore.kernel.org/linux-mm/[email protected]/
[05] https://lore.kernel.org/linux-mm/[email protected]/
[06] https://lore.kernel.org/linux-mm/[email protected]/
[07] https://lore.kernel.org/linux-mm/[email protected]/
[08] https://lore.kernel.org/linux-mm/[email protected]/
[09] https://lore.kernel.org/linux-mm/[email protected]/
[10] https://lore.kernel.org/linux-mm/[email protected]/

Real-world applications
=======================
Third-party testimonials
------------------------
Konstantin wrote [11]:
I have Archlinux with 8G RAM + zswap + swap. While developing, I
have lots of apps opened such as multiple LSP-servers for different
langs, chats, two browsers, etc... Usually, my system gets quickly
to a point of SWAP-storms, where I have to kill LSP-servers,
restart browsers to free memory, etc, otherwise the system lags
heavily and is barely usable.

1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
patchset, and I started up by opening lots of apps to create memory
pressure, and worked for a day like this. Till now I had *not a
single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
getting to the point of 3G in SWAP before without a single
SWAP-storm.

The Arch Linux Zen kernel [12] has been using MGLRU since 5.12. Many
of its users reported their positive experiences to me, e.g., Shivodit
wrote:
I've tried the latest Zen kernel (5.14.13-zen1-1-zen in the
archlinux testing repos), everything's been smooth so far. I also
decided to copy a large volume of files to check performance under
I/O load, and everything went smoothly - no stuttering was present,
everything was responsive.

Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of Chrome OS users and
about a million Android users. Google's fleetwide profiling [13] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
rendering latency at the 50th percentile.

[11] https://lore.kernel.org/linux-mm/[email protected]/
[12] https://github.com/zen-kernel/zen-kernel/
[13] https://research.google/pubs/pub44271/

Summary
=======
The facts are:
1. The independent lab results and the real-world applications
indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
work out of the box; there are no equivalent solutions.
3. There is a lot of new code; nobody has demonstrated smaller changes
with similar effects.

Our conclusions, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
materialize for a wide range of workloads.
2. Gauging the interest from the past discussions [14][15][16], the
new features will likely be put to use for both personal computers
and data centers.
3. Based on Google's track record, the new code will likely be well
maintained in the long term. It'd be more difficult if not
impossible to achieve similar effects on top of the existing
design.

[14] https://lore.kernel.org/lkml/[email protected]/
[15] https://lore.kernel.org/lkml/[email protected]/
[16] https://lore.kernel.org/lkml/[email protected]/

Yu Zhao (9):
mm: x86, arm64: add arch_has_hw_pte_young()
mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
mm/vmscan.c: refactor shrink_node()
mm: multigenerational lru: groundwork
mm: multigenerational lru: mm_struct list
mm: multigenerational lru: aging
mm: multigenerational lru: eviction
mm: multigenerational lru: user interface
mm: multigenerational lru: Kconfig

Documentation/vm/index.rst | 1 +
Documentation/vm/multigen_lru.rst | 80 +
arch/Kconfig | 9 +
arch/arm64/include/asm/cpufeature.h | 5 +
arch/arm64/include/asm/pgtable.h | 13 +-
arch/arm64/kernel/cpufeature.c | 19 +
arch/arm64/tools/cpucaps | 1 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 9 +-
arch/x86/mm/pgtable.c | 5 +-
fs/exec.c | 2 +
fs/fuse/dev.c | 3 +-
include/linux/cgroup.h | 15 +-
include/linux/memcontrol.h | 11 +
include/linux/mm.h | 42 +
include/linux/mm_inline.h | 204 ++
include/linux/mm_types.h | 78 +
include/linux/mmzone.h | 175 ++
include/linux/nodemask.h | 1 +
include/linux/oom.h | 16 +
include/linux/page-flags-layout.h | 19 +-
include/linux/page-flags.h | 4 +-
include/linux/pgtable.h | 17 +-
include/linux/sched.h | 4 +
include/linux/swap.h | 4 +
kernel/bounds.c | 3 +
kernel/cgroup/cgroup-internal.h | 1 -
kernel/exit.c | 1 +
kernel/fork.c | 9 +
kernel/sched/core.c | 1 +
mm/Kconfig | 48 +
mm/huge_memory.c | 3 +-
mm/memcontrol.c | 26 +
mm/memory.c | 21 +-
mm/mm_init.c | 6 +-
mm/oom_kill.c | 4 +-
mm/page_alloc.c | 1 +
mm/rmap.c | 7 +
mm/swap.c | 51 +-
mm/vmscan.c | 2691 ++++++++++++++++++++++++++-
mm/workingset.c | 119 +-
41 files changed, 3591 insertions(+), 139 deletions(-)
create mode 100644 Documentation/vm/multigen_lru.rst

--
2.34.1.448.ga2b2bfdf31-goog



2022-01-04 20:23:29

by Yu Zhao

Subject: [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young()

Some architectures automatically set the accessed bit in PTEs, e.g.,
x86 and arm64 v8.2. On architectures that don't have this capability,
clearing the accessed bit in a PTE usually triggers a page fault
following the TLB miss of this PTE.

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to avoid bursty
page faults when trying to clear the accessed bit in a large number of
PTEs.
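
For illustration only (a hypothetical caller, not part of this patch;
the stub and the batch cap below are invented so the sketch compiles
on its own), a walker could cap how many accessed bits it clears per
pass when the bit has to be emulated through page faults:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the real helper added by this patch. */
static bool arch_has_hw_pte_young(bool local)
{
	(void)local;
	return false;	/* pretend the accessed bit must be emulated */
}

/* Spread the work out when clearing the bit will cause page faults. */
static unsigned int clear_young_batch(unsigned int nr_ptes)
{
	const unsigned int max_batch = 64;	/* invented cap for the sketch */

	if (arch_has_hw_pte_young(false))
		return nr_ptes;			/* hardware refills the bit cheaply */

	return nr_ptes < max_batch ? nr_ptes : max_batch;
}

int main(void)
{
	printf("clear %u of 512 PTEs this pass\n", clear_young_batch(512));
	return 0;
}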

Signed-off-by: Yu Zhao <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
---
arch/arm64/include/asm/cpufeature.h | 5 +++++
arch/arm64/include/asm/pgtable.h | 13 ++++++++-----
arch/arm64/kernel/cpufeature.c | 19 +++++++++++++++++++
arch/arm64/tools/cpucaps | 1 +
arch/x86/include/asm/pgtable.h | 6 +++---
include/linux/pgtable.h | 13 +++++++++++++
mm/memory.c | 14 +-------------
7 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index ef6be92b1921..99518b4b2a9e 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -779,6 +779,11 @@ static inline bool system_supports_tlb_range(void)
cpus_have_const_cap(ARM64_HAS_TLB_RANGE);
}

+static inline bool system_has_hw_af(void)
+{
+ return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && cpus_have_const_cap(ARM64_HW_AF);
+}
+
extern int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);

static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c4ba047a82d2..e736f47436c7 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -999,13 +999,16 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
* page after fork() + CoW for pfn mappings. We don't always have a
* hardware-managed access flag on arm64.
*/
-static inline bool arch_faults_on_old_pte(void)
+static inline bool arch_has_hw_pte_young(bool local)
{
- WARN_ON(preemptible());
+ if (local) {
+ WARN_ON(preemptible());
+ return cpu_has_hw_af();
+ }

- return !cpu_has_hw_af();
+ return system_has_hw_af();
}
-#define arch_faults_on_old_pte arch_faults_on_old_pte
+#define arch_has_hw_pte_young arch_has_hw_pte_young

/*
* Experimentally, it's cheap to set the access flag in hardware and we
@@ -1013,7 +1016,7 @@ static inline bool arch_faults_on_old_pte(void)
*/
static inline bool arch_wants_old_prefaulted_pte(void)
{
- return !arch_faults_on_old_pte();
+ return arch_has_hw_pte_young(true);
}
#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 6f3e677d88f1..5bb553ee2c0e 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2171,6 +2171,25 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
.matches = has_hw_dbm,
.cpu_enable = cpu_enable_hw_dbm,
},
+ {
+ /*
+ * __cpu_setup always enables this capability. But if the boot
+ * CPU has it and a late CPU doesn't, the absent
+ * ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU will prevent this late CPU
+ * from going online. There is neither known hardware that does
+ * this nor any obvious reason to design hardware that works that
+ * way, hence no point in leaving the door open here. If the need
+ * arises, a new weak system feature flag should do the trick.
+ */
+ .desc = "Hardware update of the Access flag",
+ .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+ .capability = ARM64_HW_AF,
+ .sys_reg = SYS_ID_AA64MMFR1_EL1,
+ .sign = FTR_UNSIGNED,
+ .field_pos = ID_AA64MMFR1_HADBS_SHIFT,
+ .min_field_value = 1,
+ .matches = has_cpuid_feature,
+ },
#endif
{
.desc = "CRC32 instructions",
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 870c39537dd0..56e4ef5d95fa 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -36,6 +36,7 @@ HAS_STAGE2_FWB
HAS_SYSREG_GIC_CPUIF
HAS_TLB_RANGE
HAS_VIRT_HOST_EXTN
+HW_AF
HW_DBM
KVM_PROTECTED_MODE
MISMATCHED_CACHE_TYPE
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..c60b16f8b741 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
}

-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(bool local)
{
- return false;
+ return true;
}

#endif /* __ASSEMBLY__ */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..53bd6a26918f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -258,6 +258,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif

+#ifndef arch_has_hw_pte_young
+/*
+ * Return whether the accessed bit is supported by the local CPU or system-wide.
+ *
+ * This stub assumes accessing thru an old PTE triggers a page fault.
+ * Architectures that automatically set the accessed bit should override it.
+ */
+static inline bool arch_has_hw_pte_young(bool local)
+{
+ return false;
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..ead6c7d4b9a1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -121,18 +121,6 @@ int randomize_va_space __read_mostly =
2;
#endif

-#ifndef arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
-{
- /*
- * Those arches which don't have hw access flag feature need to
- * implement their own helper. By default, "true" means pagefault
- * will be hit on old pte.
- */
- return true;
-}
-#endif
-
#ifndef arch_wants_old_prefaulted_pte
static inline bool arch_wants_old_prefaulted_pte(void)
{
@@ -2755,7 +2743,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
* On architectures with software "accessed" bits, we would
* take a double page fault, so mark it accessed here.
*/
- if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+ if (!arch_has_hw_pte_young(true) && !pte_young(vmf->orig_pte)) {
pte_t entry;

vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 20:23:31

by Yu Zhao

Subject: [PATCH v6 2/9] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG

Some architectures support the accessed bit in non-leaf PMD entries,
e.g., x86_64 sets the accessed bit in a non-leaf PMD entry when using
it as part of linear address translation [1]. Page table walkers that
clear the accessed bit may use this feature to reduce their search
space.

Although an inline function is preferable, this capability is added as
a configuration option for consistency with the existing macros.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3 (June 2021), section 4.8

Signed-off-by: Yu Zhao <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
---
arch/Kconfig | 9 +++++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 3 ++-
arch/x86/mm/pgtable.c | 5 ++++-
include/linux/pgtable.h | 4 ++--
5 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d3c4ab249e9c..10f564340f79 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1312,6 +1312,15 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
config DYNAMIC_SIGFRAME
bool

+config ARCH_HAS_NONLEAF_PMD_YOUNG
+ bool
+ depends on PGTABLE_LEVELS > 2
+ help
+ Architectures that select this option are capable of setting the
+ accessed bit in non-leaf PMD entries when using them as part of linear
+ address translations. Page table walkers that clear the accessed bit
+ may use this feature to reduce their search space.
+
source "kernel/gcov/Kconfig"

source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5c2ccb85f2ef..5a4843242f09 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -85,6 +85,7 @@ config X86
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
+ select ARCH_HAS_NONLEAF_PMD_YOUNG if X86_64
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
select ARCH_HAS_COPY_MC if X86_64
select ARCH_HAS_SET_MEMORY
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index c60b16f8b741..36205ec0acac 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -817,7 +817,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)

static inline int pmd_bad(pmd_t pmd)
{
- return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+ return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
+ (_KERNPG_TABLE & ~_PAGE_ACCESSED);
}

static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..a224193d84bf 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
return ret;
}

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
@@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,

return ret;
}
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int pudp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pud_t *pudp)
{
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 53bd6a26918f..b51f939a73f7 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -211,7 +211,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
#endif

#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmdp)
@@ -232,7 +232,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
BUILD_BUG();
return 0;
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
#endif

#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 20:23:38

by Yu Zhao

Subject: [PATCH v6 3/9] mm/vmscan.c: refactor shrink_node()

This patch refactors shrink_node() to improve readability for the
upcoming changes to mm/vmscan.c.

Signed-off-by: Yu Zhao <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
---
mm/vmscan.c | 198 +++++++++++++++++++++++++++-------------------------
1 file changed, 104 insertions(+), 94 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 700434db5735..b6c5fd885216 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2716,6 +2716,109 @@ enum scan_balance {
SCAN_FILE,
};

+static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
+{
+ unsigned long file;
+ struct lruvec *target_lruvec;
+
+ target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
+ /*
+ * Flush the memory cgroup stats, so that we read accurate per-memcg
+ * lruvec stats for heuristics.
+ */
+ mem_cgroup_flush_stats();
+
+ /*
+ * Determine the scan balance between anon and file LRUs.
+ */
+ spin_lock_irq(&target_lruvec->lru_lock);
+ sc->anon_cost = target_lruvec->anon_cost;
+ sc->file_cost = target_lruvec->file_cost;
+ spin_unlock_irq(&target_lruvec->lru_lock);
+
+ /*
+ * Target desirable inactive:active list ratios for the anon
+ * and file LRU lists.
+ */
+ if (!sc->force_deactivate) {
+ unsigned long refaults;
+
+ refaults = lruvec_page_state(target_lruvec,
+ WORKINGSET_ACTIVATE_ANON);
+ if (refaults != target_lruvec->refaults[0] ||
+ inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+ sc->may_deactivate |= DEACTIVATE_ANON;
+ else
+ sc->may_deactivate &= ~DEACTIVATE_ANON;
+
+ /*
+ * When refaults are being observed, it means a new
+ * workingset is being established. Deactivate to get
+ * rid of any stale active pages quickly.
+ */
+ refaults = lruvec_page_state(target_lruvec,
+ WORKINGSET_ACTIVATE_FILE);
+ if (refaults != target_lruvec->refaults[1] ||
+ inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+ sc->may_deactivate |= DEACTIVATE_FILE;
+ else
+ sc->may_deactivate &= ~DEACTIVATE_FILE;
+ } else
+ sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
+
+ /*
+ * If we have plenty of inactive file pages that aren't
+ * thrashing, try to reclaim those first before touching
+ * anonymous pages.
+ */
+ file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
+ if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+ sc->cache_trim_mode = 1;
+ else
+ sc->cache_trim_mode = 0;
+
+ /*
+ * Prevent the reclaimer from falling into the cache trap: as
+ * cache pages start out inactive, every cache fault will tip
+ * the scan balance towards the file LRU. And as the file LRU
+ * shrinks, so does the window for rotation from references.
+ * This means we have a runaway feedback loop where a tiny
+ * thrashing file LRU becomes infinitely more attractive than
+ * anon pages. Try to detect this based on file LRU size.
+ */
+ if (!cgroup_reclaim(sc)) {
+ unsigned long total_high_wmark = 0;
+ unsigned long free, anon;
+ int z;
+
+ free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+ file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+ node_page_state(pgdat, NR_INACTIVE_FILE);
+
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ struct zone *zone = &pgdat->node_zones[z];
+
+ if (!managed_zone(zone))
+ continue;
+
+ total_high_wmark += high_wmark_pages(zone);
+ }
+
+ /*
+ * Consider anon: if that's low too, this isn't a
+ * runaway file reclaim problem, but rather just
+ * extreme pressure. Reclaim as per usual then.
+ */
+ anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+
+ sc->file_is_tiny =
+ file + free <= total_high_wmark &&
+ !(sc->may_deactivate & DEACTIVATE_ANON) &&
+ anon >> sc->priority;
+ }
+}
+
/*
* Determine how aggressively the anon and file LRU lists should be
* scanned. The relative value of each set of LRU lists is determined
@@ -3186,109 +3289,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long nr_reclaimed, nr_scanned;
struct lruvec *target_lruvec;
bool reclaimable = false;
- unsigned long file;

target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);

again:
- /*
- * Flush the memory cgroup stats, so that we read accurate per-memcg
- * lruvec stats for heuristics.
- */
- mem_cgroup_flush_stats();
-
memset(&sc->nr, 0, sizeof(sc->nr));

nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;

- /*
- * Determine the scan balance between anon and file LRUs.
- */
- spin_lock_irq(&target_lruvec->lru_lock);
- sc->anon_cost = target_lruvec->anon_cost;
- sc->file_cost = target_lruvec->file_cost;
- spin_unlock_irq(&target_lruvec->lru_lock);
-
- /*
- * Target desirable inactive:active list ratios for the anon
- * and file LRU lists.
- */
- if (!sc->force_deactivate) {
- unsigned long refaults;
-
- refaults = lruvec_page_state(target_lruvec,
- WORKINGSET_ACTIVATE_ANON);
- if (refaults != target_lruvec->refaults[0] ||
- inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
- sc->may_deactivate |= DEACTIVATE_ANON;
- else
- sc->may_deactivate &= ~DEACTIVATE_ANON;
-
- /*
- * When refaults are being observed, it means a new
- * workingset is being established. Deactivate to get
- * rid of any stale active pages quickly.
- */
- refaults = lruvec_page_state(target_lruvec,
- WORKINGSET_ACTIVATE_FILE);
- if (refaults != target_lruvec->refaults[1] ||
- inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
- sc->may_deactivate |= DEACTIVATE_FILE;
- else
- sc->may_deactivate &= ~DEACTIVATE_FILE;
- } else
- sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
-
- /*
- * If we have plenty of inactive file pages that aren't
- * thrashing, try to reclaim those first before touching
- * anonymous pages.
- */
- file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
- if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
- sc->cache_trim_mode = 1;
- else
- sc->cache_trim_mode = 0;
-
- /*
- * Prevent the reclaimer from falling into the cache trap: as
- * cache pages start out inactive, every cache fault will tip
- * the scan balance towards the file LRU. And as the file LRU
- * shrinks, so does the window for rotation from references.
- * This means we have a runaway feedback loop where a tiny
- * thrashing file LRU becomes infinitely more attractive than
- * anon pages. Try to detect this based on file LRU size.
- */
- if (!cgroup_reclaim(sc)) {
- unsigned long total_high_wmark = 0;
- unsigned long free, anon;
- int z;
-
- free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
- file = node_page_state(pgdat, NR_ACTIVE_FILE) +
- node_page_state(pgdat, NR_INACTIVE_FILE);
-
- for (z = 0; z < MAX_NR_ZONES; z++) {
- struct zone *zone = &pgdat->node_zones[z];
- if (!managed_zone(zone))
- continue;
-
- total_high_wmark += high_wmark_pages(zone);
- }
-
- /*
- * Consider anon: if that's low too, this isn't a
- * runaway file reclaim problem, but rather just
- * extreme pressure. Reclaim as per usual then.
- */
- anon = node_page_state(pgdat, NR_INACTIVE_ANON);
-
- sc->file_is_tiny =
- file + free <= total_high_wmark &&
- !(sc->may_deactivate & DEACTIVATE_ANON) &&
- anon >> sc->priority;
- }
+ prepare_scan_count(pgdat, sc);

shrink_node_memcgs(pgdat, sc);

--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 20:23:45

by Yu Zhao

Subject: [PATCH v6 4/9] mm: multigenerational lru: groundwork

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they're aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages may be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations.

There are two conceptually independent processes (as in manufacturing
process): "the aging", which produces young generations, and "the
eviction", which consumes old generations. They form a closed-loop
system, i.e., "the page reclaim". Both processes can be triggered
separately from userspace for the purposes of working set estimation
and proactive reclaim. These features are required to optimize job
scheduling in data centers. The variable size of the sliding window is
designed for such use cases.

To avoid confusion, the terms "hot" and "cold" will be applied to the
multigenerational lru, as a new convention; the terms "active" and
"inactive" will be applied to the active/inactive lru, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one thru page tables and the other thru file descriptors. The
protection of the former channel is by design stronger because:
1) The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2) The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3) The penalty of underprotecting the former channel is higher because
applications usually don't prepare themselves for major faults like
they do for blocked I/O. For example, GUI applications commonly use
dedicated I/O threads to avoid blocking the rendering threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present, and the latter channel is assumed to follow
the latter pattern unless outlying refaults have been observed.

The "outlying refaults" will be addressed in [PATCH 07/10]. A few
macros, i.e., LRU_REFS_*, used in that patch are added in this one to
make the patchset less diffy.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
hasn't been used since then. This process, AKA second chance, requires
a minimum of two generations, hence MIN_NR_GENS.
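
As a small worked example of the encoding (values assumed for
illustration: CONFIG_NR_LRU_GENS=4, so order_base_2(4 + 1) = 3 bits;
the field position is arbitrary here): the field stores gen + 1 so
that 0 can mean "not on a multigenerational lru list", and a sequence
number folds into a list index with a modulo.

/* Standalone illustration of the gen field encoding; not the kernel code. */
#include <stdio.h>

#define NR_LRU_GENS	4ULL			/* assumed CONFIG_NR_LRU_GENS */
#define LRU_GEN_WIDTH	3			/* order_base_2(NR_LRU_GENS + 1) */
#define LRU_GEN_PGOFF	61			/* position picked for the sketch */
#define LRU_GEN_MASK	(((1ULL << LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)

static unsigned long long set_gen(unsigned long long flags, unsigned long long seq)
{
	unsigned long long gen = seq % NR_LRU_GENS;	/* truncate the sequence number */

	flags &= ~LRU_GEN_MASK;
	return flags | ((gen + 1) << LRU_GEN_PGOFF);	/* store gen+1; 0 means off-list */
}

static long long get_gen(unsigned long long flags)
{
	return (long long)((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;	/* -1: off-list */
}

int main(void)
{
	unsigned long long flags = 0;

	printf("off-list gen: %lld\n", get_gen(flags));	/* -1 */
	flags = set_gen(flags, 6);			/* seq 6 maps to gen 2 */
	printf("on-list gen:  %lld\n", get_gen(flags));	/* 2 */
	return 0;
}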

Signed-off-by: Yu Zhao <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
---
fs/fuse/dev.c | 3 +-
include/linux/cgroup.h | 15 +-
include/linux/mm.h | 37 +++++
include/linux/mm_inline.h | 189 ++++++++++++++++++++++
include/linux/mmzone.h | 76 +++++++++
include/linux/page-flags-layout.h | 19 ++-
include/linux/page-flags.h | 4 +-
include/linux/sched.h | 4 +
kernel/bounds.c | 3 +
kernel/cgroup/cgroup-internal.h | 1 -
mm/huge_memory.c | 3 +-
mm/memcontrol.c | 2 +
mm/memory.c | 7 +
mm/mm_init.c | 6 +-
mm/page_alloc.c | 1 +
mm/swap.c | 9 +-
mm/vmscan.c | 259 ++++++++++++++++++++++++++++++
17 files changed, 623 insertions(+), 15 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd54a529460d..769139a8be86 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page)
1 << PG_active |
1 << PG_workingset |
1 << PG_reclaim |
- 1 << PG_waiters))) {
+ 1 << PG_waiters |
+ LRU_GEN_MASK | LRU_REFS_MASK))) {
dump_page(page, "fuse: trying to steal weird page");
return 1;
}
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75c151413fda..b145025f3eac 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
css_put(&cgrp->self);
}

+extern struct mutex cgroup_mutex;
+
+static inline void cgroup_lock(void)
+{
+ mutex_lock(&cgroup_mutex);
+}
+
+static inline void cgroup_unlock(void)
+{
+ mutex_unlock(&cgroup_mutex);
+}
+
/**
* task_css_set_check - obtain a task's css_set with extra access conditions
* @task: the task to obtain css_set for
@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
* as locks used during the cgroup_subsys::attach() methods.
*/
#ifdef CONFIG_PROVE_RCU
-extern struct mutex cgroup_mutex;
extern spinlock_t css_set_lock;
#define task_css_set_check(task, __c) \
rcu_dereference_check((task)->cgroups, \
@@ -707,6 +718,8 @@ struct cgroup;
static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
static inline void css_get(struct cgroup_subsys_state *css) {}
static inline void css_put(struct cgroup_subsys_state *css) {}
+static inline void cgroup_lock(void) {}
+static inline void cgroup_unlock(void) {}
static inline int cgroup_attach_task_all(struct task_struct *from,
struct task_struct *t) { return 0; }
static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..fadbf8e6abcd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@ int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
#define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)

#define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
+#define lru_to_folio(head) (list_entry((head)->prev, struct folio, lru))

void setup_initial_init_mm(void *start_code, void *end_code,
void *end_data, void *brk);
@@ -1070,6 +1071,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
#define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)
#define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
+#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
+#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH)

/*
* Define the bit shifts to access each section. For non-existent
@@ -1920,6 +1923,40 @@ static inline void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows) { }
#endif

+#ifdef CONFIG_LRU_GEN
+static inline void task_enter_lru_fault(void)
+{
+ WARN_ON_ONCE(current->in_lru_fault);
+
+ current->in_lru_fault = 1;
+}
+
+static inline void task_exit_lru_fault(void)
+{
+ WARN_ON_ONCE(!current->in_lru_fault);
+
+ current->in_lru_fault = 0;
+}
+
+static inline bool task_in_lru_fault(void)
+{
+ return current->in_lru_fault;
+}
+#else
+static inline void task_enter_lru_fault(void)
+{
+}
+
+static inline void task_exit_lru_fault(void)
+{
+}
+
+static inline bool task_in_lru_fault(void)
+{
+ return false;
+}
+#endif /* CONFIG_LRU_GEN */
+
static inline void unmap_shared_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen)
{
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e2ec68b0515c..5f239f67f36b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -90,11 +90,194 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
return lru;
}

+#ifdef CONFIG_LRU_GEN
+
+static inline bool lru_gen_enabled(void)
+{
+#ifdef CONFIG_LRU_GEN_ENABLED
+ DECLARE_STATIC_KEY_TRUE(lru_gen_static_key);
+
+ return static_branch_likely(&lru_gen_static_key);
+#else
+ DECLARE_STATIC_KEY_FALSE(lru_gen_static_key);
+
+ return static_branch_unlikely(&lru_gen_static_key);
+#endif
+}
+
+static inline int lru_gen_from_seq(unsigned long seq)
+{
+ return seq % MAX_NR_GENS;
+}
+
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
+{
+ unsigned long max_seq = lruvec->lrugen.max_seq;
+
+ VM_BUG_ON(gen >= MAX_NR_GENS);
+
+ /* see the comment on MIN_NR_GENS */
+ return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
+}
+
+static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
+ int zone, long delta)
+{
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+ lockdep_assert_held(&lruvec->lru_lock);
+ WARN_ON_ONCE(delta != (int)delta);
+
+ __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
+ __mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
+}
+
+static inline void lru_gen_balance_size(struct lruvec *lruvec, struct folio *folio,
+ int old_gen, int new_gen)
+{
+ int type = folio_is_file_lru(folio);
+ int zone = folio_zonenum(folio);
+ int delta = folio_nr_pages(folio);
+ enum lru_list lru = type * LRU_FILE;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
+ VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
+ VM_BUG_ON(old_gen == -1 && new_gen == -1);
+
+ if (old_gen >= 0)
+ WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
+ lrugen->nr_pages[old_gen][type][zone] - delta);
+ if (new_gen >= 0)
+ WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
+ lrugen->nr_pages[new_gen][type][zone] + delta);
+
+ if (old_gen < 0) {
+ if (lru_gen_is_active(lruvec, new_gen))
+ lru += LRU_ACTIVE;
+ lru_gen_update_size(lruvec, lru, zone, delta);
+ return;
+ }
+
+ if (new_gen < 0) {
+ if (lru_gen_is_active(lruvec, old_gen))
+ lru += LRU_ACTIVE;
+ lru_gen_update_size(lruvec, lru, zone, -delta);
+ return;
+ }
+
+ if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
+ lru_gen_update_size(lruvec, lru, zone, -delta);
+ lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+ }
+
+ /* Promotion is legit while a page is on an lru list, but demotion isn't. */
+ VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+ int gen;
+ unsigned long old_flags, new_flags;
+ int type = folio_is_file_lru(folio);
+ int zone = folio_zonenum(folio);
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ if (folio_test_unevictable(folio) || !lrugen->enabled)
+ return false;
+ /*
+ * There are three cases for this page:
+ * 1) If it shouldn't be evicted, e.g., it was just faulted in, add it
+ * to the youngest generation.
+ * 2) If it can't be evicted immediately, i.e., it's an anon page and
+ * not in swapcache, or a dirty page pending writeback, add it to the
+ * second oldest generation.
+ * 3) If it may be evicted immediately, e.g., it's a clean page, add it
+ * to the oldest generation.
+ */
+ if (folio_test_active(folio))
+ gen = lru_gen_from_seq(lrugen->max_seq);
+ else if ((!type && !folio_test_swapcache(folio)) ||
+ (folio_test_reclaim(folio) &&
+ (folio_test_dirty(folio) || folio_test_writeback(folio))))
+ gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
+ else
+ gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+ do {
+ new_flags = old_flags = READ_ONCE(folio->flags);
+ VM_BUG_ON_FOLIO(new_flags & LRU_GEN_MASK, folio);
+
+ new_flags &= ~(LRU_GEN_MASK | BIT(PG_active));
+ new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+ } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+ lru_gen_balance_size(lruvec, folio, -1, gen);
+ /* for folio_rotate_reclaimable() */
+ if (reclaiming)
+ list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+ else
+ list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+ return true;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+ int gen;
+ unsigned long old_flags, new_flags;
+
+ do {
+ new_flags = old_flags = READ_ONCE(folio->flags);
+ if (!(new_flags & LRU_GEN_MASK))
+ return false;
+
+ VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+ VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+
+ gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+
+ new_flags &= ~LRU_GEN_MASK;
+ /* for shrink_page_list() */
+ if (reclaiming)
+ new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
+ else if (lru_gen_is_active(lruvec, gen))
+ new_flags |= BIT(PG_active);
+ } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+ lru_gen_balance_size(lruvec, folio, gen, -1);
+ list_del(&folio->lru);
+
+ return true;
+}
+
+#else
+
+static inline bool lru_gen_enabled(void)
+{
+ return false;
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+ return false;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+ return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
static __always_inline
void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);

+ if (lru_gen_add_folio(lruvec, folio, false))
+ return;
+
update_lru_size(lruvec, lru, folio_zonenum(folio),
folio_nr_pages(folio));
list_add(&folio->lru, &lruvec->lists[lru]);
@@ -111,6 +294,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);

+ if (lru_gen_add_folio(lruvec, folio, true))
+ return;
+
update_lru_size(lruvec, lru, folio_zonenum(folio),
folio_nr_pages(folio));
list_add_tail(&folio->lru, &lruvec->lists[lru]);
@@ -125,6 +311,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
static __always_inline
void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
{
+ if (lru_gen_del_folio(lruvec, folio, false))
+ return;
+
list_del(&folio->lru);
update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
-folio_nr_pages(folio));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 936dc0b6c226..371c7210d510 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -303,6 +303,78 @@ enum lruvec_flags {
*/
};

+struct lruvec;
+
+#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
+#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
+
+#ifdef CONFIG_LRU_GEN
+
+#define MIN_LRU_BATCH BITS_PER_LONG
+#define MAX_LRU_BATCH (MIN_LRU_BATCH * 128)
+
+/*
+ * Evictable pages are divided into multiple generations. The youngest and the
+ * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
+ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
+ * offset within MAX_NR_GENS, gen, indexes the lru list of the corresponding
+ * generation. The gen counter in folio->flags stores gen+1 while a page is on
+ * lrugen->lists[]. Otherwise, it stores 0.
+ *
+ * A page is added to the youngest generation on faulting. The aging needs to
+ * check the accessed bit at least twice before handing this page over to the
+ * eviction. The first check takes care of the accessed bit set on the initial
+ * fault; the second check makes sure this page hasn't been used since then.
+ * This process, AKA second chance, requires a minimum of two generations,
+ * hence MIN_NR_GENS. And to be compatible with the active/inactive lru, these
+ * two generations are mapped to the active; the rest of generations, if they
+ * exist, are mapped to the inactive. PG_active is always cleared while a page
+ * is on lrugen->lists[] so that demotion, which happens consequently when the
+ * aging creates a new generation, need not worry about it.
+ */
+#define MIN_NR_GENS 2U
+#define MAX_NR_GENS ((unsigned int)CONFIG_NR_LRU_GENS)
+
+struct lru_gen_struct {
+ /* the aging increments the youngest generation number */
+ unsigned long max_seq;
+ /* the eviction increments the oldest generation numbers */
+ unsigned long min_seq[ANON_AND_FILE];
+ /* the birth time of each generation in jiffies */
+ unsigned long timestamps[MAX_NR_GENS];
+ /* the multigenerational lru lists */
+ struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+ /* the sizes of the above lists */
+ unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+ /* whether the multigenerational lru is enabled */
+ bool enabled;
+};
+
+void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg);
+void lru_gen_free_memcg(struct mem_cgroup *memcg);
+#endif
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline void lru_gen_free_memcg(struct mem_cgroup *memcg)
+{
+}
+#endif
+
+#endif /* CONFIG_LRU_GEN */
+
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
/* per lruvec lru_lock for memcg */
@@ -320,6 +392,10 @@ struct lruvec {
unsigned long refaults[ANON_AND_FILE];
/* Various lruvec state flags (enum lruvec_flags) */
unsigned long flags;
+#ifdef CONFIG_LRU_GEN
+ /* evictable pages divided into generations */
+ struct lru_gen_struct lrugen;
+#endif
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index ef1e3e736e14..8cdbbdccb5ad 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -26,6 +26,14 @@

#define ZONES_WIDTH ZONES_SHIFT

+#ifdef CONFIG_LRU_GEN
+/* LRU_GEN_WIDTH is generated from order_base_2(CONFIG_NR_LRU_GENS + 1). */
+#define LRU_REFS_WIDTH (CONFIG_TIERS_PER_GEN - 2)
+#else
+#define LRU_GEN_WIDTH 0
+#define LRU_REFS_WIDTH 0
+#endif /* CONFIG_LRU_GEN */
+
#ifdef CONFIG_SPARSEMEM
#include <asm/sparsemem.h>
#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
@@ -55,7 +63,8 @@
#define SECTIONS_WIDTH 0
#endif

-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
+ <= BITS_PER_LONG - NR_PAGEFLAGS
#define NODES_WIDTH NODES_SHIFT
#elif defined(CONFIG_SPARSEMEM_VMEMMAP)
#error "Vmemmap: No space for nodes field in page flags"
@@ -89,8 +98,8 @@
#define LAST_CPUPID_SHIFT 0
#endif

-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
- <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+ KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
#else
#define LAST_CPUPID_WIDTH 0
@@ -100,8 +109,8 @@
#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
#endif

-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
- > BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+ KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
#error "Not enough bits in page flags"
#endif

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5f14d581113..d609c71ea228 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -961,7 +961,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
1UL << PG_private | 1UL << PG_private_2 | \
1UL << PG_writeback | 1UL << PG_reserved | \
1UL << PG_slab | 1UL << PG_active | \
- 1UL << PG_unevictable | __PG_MLOCKED)
+ 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK)

/*
* Flags checked when a page is prepped for return by the page allocator.
@@ -972,7 +972,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
* alloc-free cycle to prevent from reusing the page.
*/
#define PAGE_FLAGS_CHECK_AT_PREP \
- (PAGEFLAGS_MASK & ~__PG_HWPOISON)
+ ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)

#define PAGE_FLAGS_PRIVATE \
(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..69b35d61c017 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -909,6 +909,10 @@ struct task_struct {
#ifdef CONFIG_MEMCG
unsigned in_user_fault:1;
#endif
+#ifdef CONFIG_LRU_GEN
+ /* whether the lru algorithm may apply for this access */
+ unsigned in_lru_fault:1;
+#endif
#ifdef CONFIG_COMPAT_BRK
unsigned brk_randomized:1;
#endif
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 9795d75b09b2..aba13aa7336c 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -22,6 +22,9 @@ int main(void)
DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
#endif
DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
+#ifdef CONFIG_LRU_GEN
+ DEFINE(LRU_GEN_WIDTH, order_base_2(CONFIG_NR_LRU_GENS + 1));
+#endif
/* End of constants */

return 0;
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index bfbeabc17a9d..bec59189e206 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -146,7 +146,6 @@ struct cgroup_mgctx {
#define DEFINE_CGROUP_MGCTX(name) \
struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)

-extern struct mutex cgroup_mutex;
extern spinlock_t css_set_lock;
extern struct cgroup_subsys *cgroup_subsys[];
extern struct list_head cgroup_roots;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..2e2ca7ecff63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2364,7 +2364,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
#ifdef CONFIG_64BIT
(1L << PG_arch_2) |
#endif
- (1L << PG_dirty)));
+ (1L << PG_dirty) |
+ LRU_GEN_MASK | LRU_REFS_MASK));

/* ->mapping in first tail page is compound_mapcount */
VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ed5f2a0879d..a4359a278e31 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5098,6 +5098,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)

static void mem_cgroup_free(struct mem_cgroup *memcg)
{
+ lru_gen_free_memcg(memcg);
memcg_wb_domain_exit(memcg);
__mem_cgroup_free(memcg);
}
@@ -5161,6 +5162,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
memcg->deferred_split_queue.split_queue_len = 0;
#endif
idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
+ lru_gen_init_memcg(memcg);
return memcg;
fail:
mem_cgroup_id_remove(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index ead6c7d4b9a1..bc54be1d613f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4745,6 +4745,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags, struct pt_regs *regs)
{
vm_fault_t ret;
+ bool lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));

__set_current_state(TASK_RUNNING);

@@ -4766,11 +4767,17 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (flags & FAULT_FLAG_USER)
mem_cgroup_enter_user_fault();

+ if (lru_fault)
+ task_enter_lru_fault();
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
ret = __handle_mm_fault(vma, address, flags);

+ if (lru_fault)
+ task_exit_lru_fault();
+
if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9ddaf0e1b0ab..0d7b2bd2454a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)

shift = 8 * sizeof(unsigned long);
width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
- - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
+ - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
- "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
+ "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
SECTIONS_WIDTH,
NODES_WIDTH,
ZONES_WIDTH,
LAST_CPUPID_WIDTH,
KASAN_TAG_WIDTH,
+ LRU_GEN_WIDTH,
+ LRU_REFS_WIDTH,
NR_PAGEFLAGS);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..05e02fbd1e5d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7416,6 +7416,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)

pgdat_page_ext_init(pgdat);
lruvec_init(&pgdat->__lruvec);
+ lru_gen_init_state(NULL, &pgdat->__lruvec);
}

static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/swap.c b/mm/swap.c
index e8c9dc6d0377..d7dde3b7d4b5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

+ /* see the comment in lru_gen_add_folio() */
+ if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
+ task_in_lru_fault() && !(current->flags & PF_MEMALLOC))
+ folio_set_active(folio);
+
folio_get(folio);
local_lock(&lru_pvecs.lock);
pvec = this_cpu_ptr(&lru_pvecs.lru_add);
@@ -563,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)

static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
{
- if (PageActive(page) && !PageUnevictable(page)) {
+ if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
int nr_pages = thp_nr_pages(page);

del_page_from_lru_list(page, lruvec);
@@ -677,7 +682,7 @@ void deactivate_file_page(struct page *page)
*/
void deactivate_page(struct page *page)
{
- if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
struct pagevec *pvec;

local_lock(&lru_pvecs.lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6c5fd885216..0e487c0ffe17 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
#include <linux/printk.h>
#include <linux/dax.h>
#include <linux/psi.h>
+#include <linux/memory.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3040,6 +3041,264 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
return can_demote(pgdat->node_id, sc);
}

+#ifdef CONFIG_LRU_GEN
+
+/******************************************************************************
+ * shorthand helpers
+ ******************************************************************************/
+
+#define for_each_gen_type_zone(gen, type, zone) \
+ for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \
+ for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
+ for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
+
+static int folio_lru_gen(struct folio *folio)
+{
+ unsigned long flags = READ_ONCE(folio->flags);
+
+ return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
+{
+ struct pglist_data *pgdat = NODE_DATA(nid);
+
+#ifdef CONFIG_MEMCG
+ if (memcg) {
+ struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
+
+ /* for hotadd_new_pgdat() */
+ if (!lruvec->pgdat)
+ lruvec->pgdat = pgdat;
+
+ return lruvec;
+ }
+#endif
+ return pgdat ? &pgdat->__lruvec : NULL;
+}
+
+static int get_nr_gens(struct lruvec *lruvec, int type)
+{
+ return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
+}
+
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
+{
+ /*
+ * Ideally anon and file min_seq should be in sync. But swapping isn't
+ * as reliable as dropping clean file pages, e.g., out of swap space. So
+ * allow file min_seq to advance and leave anon min_seq behind, but not
+ * the other way around.
+ */
+ return get_nr_gens(lruvec, 1) >= MIN_NR_GENS &&
+ get_nr_gens(lruvec, 1) <= get_nr_gens(lruvec, 0) &&
+ get_nr_gens(lruvec, 0) <= MAX_NR_GENS;
+}
+
+/******************************************************************************
+ * state change
+ ******************************************************************************/
+
+#ifdef CONFIG_LRU_GEN_ENABLED
+DEFINE_STATIC_KEY_TRUE(lru_gen_static_key);
+#else
+DEFINE_STATIC_KEY_FALSE(lru_gen_static_key);
+#endif
+
+static bool __maybe_unused state_is_valid(struct lruvec *lruvec)
+{
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ if (lrugen->enabled) {
+ enum lru_list lru;
+
+ for_each_evictable_lru(lru) {
+ if (!list_empty(&lruvec->lists[lru]))
+ return false;
+ }
+ } else {
+ int gen, type, zone;
+
+ for_each_gen_type_zone(gen, type, zone) {
+ if (!list_empty(&lrugen->lists[gen][type][zone]))
+ return false;
+
+ /* unlikely but not a bug when reset_batch_size() is pending */
+ VM_WARN_ON(lrugen->nr_pages[gen][type][zone]);
+ }
+ }
+
+ return true;
+}
+
+static bool fill_evictable(struct lruvec *lruvec)
+{
+ enum lru_list lru;
+ int remaining = MAX_LRU_BATCH;
+
+ for_each_evictable_lru(lru) {
+ int type = is_file_lru(lru);
+ bool active = is_active_lru(lru);
+ struct list_head *head = &lruvec->lists[lru];
+
+ while (!list_empty(head)) {
+ bool success;
+ struct folio *folio = lru_to_folio(head);
+
+ VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+ VM_BUG_ON_FOLIO(folio_test_active(folio) != active, folio);
+ VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+ VM_BUG_ON_FOLIO(folio_lru_gen(folio) < MAX_NR_GENS, folio);
+
+ lruvec_del_folio(lruvec, folio);
+ success = lru_gen_add_folio(lruvec, folio, false);
+ VM_BUG_ON(!success);
+
+ if (!--remaining)
+ return false;
+ }
+ }
+
+ return true;
+}
+
+static bool drain_evictable(struct lruvec *lruvec)
+{
+ int gen, type, zone;
+ int remaining = MAX_LRU_BATCH;
+
+ for_each_gen_type_zone(gen, type, zone) {
+ struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
+
+ while (!list_empty(head)) {
+ bool success;
+ struct folio *folio = lru_to_folio(head);
+
+ VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+ VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+ VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+ VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+ success = lru_gen_del_folio(lruvec, folio, false);
+ VM_BUG_ON(!success);
+ lruvec_add_folio(lruvec, folio);
+
+ if (!--remaining)
+ return false;
+ }
+ }
+
+ return true;
+}
+
+static void lru_gen_change_state(bool enable)
+{
+ static DEFINE_MUTEX(state_mutex);
+
+ struct mem_cgroup *memcg;
+
+ mem_hotplug_begin();
+ cgroup_lock();
+ mutex_lock(&state_mutex);
+
+ if (enable == lru_gen_enabled())
+ goto unlock;
+
+ if (enable)
+ static_branch_enable(&lru_gen_static_key);
+ else
+ static_branch_disable(&lru_gen_static_key);
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ if (!lruvec)
+ continue;
+
+ spin_lock_irq(&lruvec->lru_lock);
+
+ VM_BUG_ON(!seq_is_valid(lruvec));
+ VM_BUG_ON(!state_is_valid(lruvec));
+
+ lruvec->lrugen.enabled = enable;
+
+ while (!(enable ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ cond_resched();
+ spin_lock_irq(&lruvec->lru_lock);
+ }
+
+ spin_unlock_irq(&lruvec->lru_lock);
+ }
+
+ cond_resched();
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+unlock:
+ mutex_unlock(&state_mutex);
+ cgroup_unlock();
+ mem_hotplug_done();
+}
+
+/******************************************************************************
+ * initialization
+ ******************************************************************************/
+
+void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)
+{
+ int i;
+ int gen, type, zone;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ lrugen->max_seq = MIN_NR_GENS + 1;
+ lrugen->enabled = lru_gen_enabled();
+
+ for (i = 0; i <= MIN_NR_GENS + 1; i++)
+ lrugen->timestamps[i] = jiffies;
+
+ for_each_gen_type_zone(gen, type, zone)
+ INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ lru_gen_init_state(memcg, lruvec);
+ }
+}
+
+void lru_gen_free_memcg(struct mem_cgroup *memcg)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
+ sizeof(lruvec->lrugen.nr_pages)));
+ }
+}
+#endif
+
+static int __init init_lru_gen(void)
+{
+ BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
+ BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+
+ return 0;
+};
+late_initcall(init_lru_gen);
+
+#endif /* CONFIG_LRU_GEN */
+
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 20:23:51

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v6 5/9] mm: multigenerational lru: mm_struct list

To exploit spatial locality, the aging prefers to walk page tables to
search for young PTEs; this patch paves the way for that.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

A page table walker, i.e., a thread in the aging path, iterates an
mm_struct list and calls walk_page_range() with each mm_struct on this
list. The iteration finishes when it reaches the end of this list.
When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore the aging can run concurrently.
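
As a minimal sketch, the consumer side of this list boils down to the
loop below. It is modeled on the caller added later in this series:
get_next_mm() is introduced by this patch, walk_mm() by the aging
patch; locking, statistics and error handling are omitted.

	struct mm_struct *mm = NULL;
	bool last;

	do {
		/*
		 * Hands out the next unused mm_struct via @mm, or NULL at
		 * the end. The caller uses "last" to decide which walker
		 * gets to increment max_seq for this iteration.
		 */
		last = get_next_mm(lruvec, walk, &mm);
		if (mm)
			walk_mm(lruvec, mm, walk);
		cond_resched();
	} while (mm);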

This infra also provides the following optimizations:
1) It tracks the usage of mm_struct's between context switches so that
page table walkers may skip processes that have been sleeping since
the last iteration.
2) It provides generational Bloom filters to record populated branches
so that page table walkers may reduce their search space based on
the query results; the double-buffering scheme is sketched below.
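
The double-buffering can be pictured with the following standalone
sketch. It is illustrative only: the kernel version added by this
patch uses hash_ptr() and per-lruvec bitmaps allocated with
GFP_ATOMIC, and the helper names below are made up for the
illustration.

	#include <stdbool.h>
	#include <stdint.h>
	#include <string.h>

	#define SHIFT	15			/* m = 1 << 15 bits per filter */
	#define NBITS	(1u << SHIFT)

	/* two filters; the aging flips to the other one for each new generation */
	static uint8_t filters[2][NBITS / 8];

	/* k = 2: derive two bit indices from one Knuth-style hash of the item */
	static void get_keys(const void *item, uint32_t key[2])
	{
		uint32_t hash = (uint32_t)((uintptr_t)item * 2654435761u);

		key[0] = hash & (NBITS - 1);
		key[1] = (hash >> SHIFT) & (NBITS - 1);
	}

	static void filter_clear(unsigned long seq)
	{
		memset(filters[seq % 2], 0, sizeof(filters[0]));
	}

	static void filter_add(unsigned long seq, const void *item)
	{
		uint32_t key[2];

		get_keys(item, key);
		filters[seq % 2][key[0] / 8] |= 1u << (key[0] % 8);
		filters[seq % 2][key[1] / 8] |= 1u << (key[1] % 8);
	}

	/* may return false positives, never false negatives */
	static bool filter_test(unsigned long seq, const void *item)
	{
		uint32_t key[2];

		get_keys(item, key);
		return (filters[seq % 2][key[0] / 8] & (1u << (key[0] % 8))) &&
		       (filters[seq % 2][key[1] / 8] & (1u << (key[1] % 8)));
	}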

Signed-off-by: Yu Zhao <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
---
fs/exec.c | 2 +
include/linux/memcontrol.h | 5 +
include/linux/mm_inline.h | 5 +
include/linux/mm_types.h | 78 ++++++++
include/linux/mmzone.h | 61 +++++++
kernel/exit.c | 1 +
kernel/fork.c | 9 +
kernel/sched/core.c | 1 +
mm/memcontrol.c | 24 +++
mm/vmscan.c | 352 +++++++++++++++++++++++++++++++++++++
10 files changed, 538 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..308aa88ca15f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1005,6 +1005,7 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
+ lru_gen_add_mm(mm);
/*
* This prevents preemption while active_mm is being loaded and
* it and mm are being updated, which could cause problems for
@@ -1015,6 +1016,7 @@ static int exec_mmap(struct mm_struct *mm)
if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
activate_mm(active_mm, mm);
+ lru_gen_use_mm(mm);
if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403f4be6..aba18cd101db 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -340,6 +340,11 @@ struct mem_cgroup {
struct deferred_split deferred_split_queue;
#endif

+#ifdef CONFIG_LRU_GEN
+ /* per-memcg mm_struct list */
+ struct lru_gen_mm_list mm_list;
+#endif
+
struct mem_cgroup_per_node *nodeinfo[];
};

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 5f239f67f36b..717a2290acb3 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -110,6 +110,11 @@ static inline int lru_gen_from_seq(unsigned long seq)
return seq % MAX_NR_GENS;
}

+static inline int lru_hist_from_seq(unsigned long seq)
+{
+ return seq % NR_HIST_GENS;
+}
+
static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
{
unsigned long max_seq = lruvec->lrugen.max_seq;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c3a6e6209600..bdbd9390adb3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -3,6 +3,7 @@
#define _LINUX_MM_TYPES_H

#include <linux/mm_types_task.h>
+#include <linux/sched.h>

#include <linux/auxvec.h>
#include <linux/list.h>
@@ -16,6 +17,8 @@
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
#include <linux/seqlock.h>
+#include <linux/nodemask.h>
+#include <linux/mmdebug.h>

#include <asm/mmu.h>

@@ -646,6 +649,22 @@ struct mm_struct {
#ifdef CONFIG_IOMMU_SUPPORT
u32 pasid;
#endif
+#ifdef CONFIG_LRU_GEN
+ struct {
+ /* this mm_struct is on lru_gen_mm_list */
+ struct list_head list;
+#ifdef CONFIG_MEMCG
+ /* points to the memcg of "owner" above */
+ struct mem_cgroup *memcg;
+#endif
+ /*
+ * Set when switching to this mm_struct, as a hint of
+ * whether it has been used since the last time per-node
+ * page table walkers cleared the corresponding bits.
+ */
+ nodemask_t nodes;
+ } lru_gen;
+#endif /* CONFIG_LRU_GEN */
} __randomize_layout;

/*
@@ -672,6 +691,65 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return (struct cpumask *)&mm->cpu_bitmap;
}

+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+ /* mm_struct list for page table walkers */
+ struct list_head fifo;
+ /* protects the list above */
+ spinlock_t lock;
+};
+
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+ INIT_LIST_HEAD(&mm->lru_gen.list);
+#ifdef CONFIG_MEMCG
+ mm->lru_gen.memcg = NULL;
+#endif
+ nodes_clear(mm->lru_gen.nodes);
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+ /* unlikely but not a bug when racing with lru_gen_migrate_mm() */
+ VM_WARN_ON(list_empty(&mm->lru_gen.list));
+
+ if (!(current->flags & PF_KTHREAD) && !nodes_full(mm->lru_gen.nodes))
+ nodes_setall(mm->lru_gen.nodes);
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
struct mmu_gather;
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 371c7210d510..5b9bc2532c5b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -335,6 +335,13 @@ struct lruvec;
#define MIN_NR_GENS 2U
#define MAX_NR_GENS ((unsigned int)CONFIG_NR_LRU_GENS)

+/* whether to keep historical stats for evicted generations */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_HIST_GENS ((unsigned int)CONFIG_NR_LRU_GENS)
+#else
+#define NR_HIST_GENS 1U
+#endif
+
struct lru_gen_struct {
/* the aging increments the youngest generation number */
unsigned long max_seq;
@@ -350,6 +357,58 @@ struct lru_gen_struct {
bool enabled;
};

+enum {
+ MM_PTE_TOTAL, /* total leaf entries */
+ MM_PTE_OLD, /* old leaf entries */
+ MM_PTE_YOUNG, /* young leaf entries */
+ MM_PMD_TOTAL, /* total non-leaf entries */
+ MM_PMD_FOUND, /* non-leaf entries found in Bloom filters */
+ MM_PMD_ADDED, /* non-leaf entries added to Bloom filters */
+ NR_MM_STATS
+};
+
+/* mnemonic codes for the mm stats above */
+#define MM_STAT_CODES "toydfa"
+
+/* double-buffering Bloom filters */
+#define NR_BLOOM_FILTERS 2
+
+struct lru_gen_mm_state {
+ /* set to max_seq after each iteration */
+ unsigned long seq;
+ /* where the current iteration starts (inclusive) */
+ struct list_head *head;
+ /* where the last iteration ends (exclusive) */
+ struct list_head *tail;
+ /* to wait for the last page table walker to finish */
+ struct wait_queue_head wait;
+ /* Bloom filters flip after each iteration */
+ unsigned long *filters[NR_BLOOM_FILTERS];
+ /* the mm stats for debugging */
+ unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
+ /* the number of concurrent page table walkers */
+ int nr_walkers;
+};
+
+struct lru_gen_mm_walk {
+ /* the lruvec under reclaim */
+ struct lruvec *lruvec;
+ /* unstable max_seq from lru_gen_struct */
+ unsigned long max_seq;
+ /* the next address within an mm to scan */
+ unsigned long next_addr;
+ /* to batch page table entries */
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+ /* to batch promoted pages */
+ int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+ /* to batch the mm stats */
+ int mm_stats[NR_MM_STATS];
+ /* total batched items */
+ int batched;
+ bool can_swap;
+ bool full_scan;
+};
+
void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);

#ifdef CONFIG_MEMCG
@@ -395,6 +454,8 @@ struct lruvec {
#ifdef CONFIG_LRU_GEN
/* evictable pages divided into generations */
struct lru_gen_struct lrugen;
+ /* to concurrently iterate lru_gen_mm_list */
+ struct lru_gen_mm_state mm_state;
#endif
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..f8bf605c9ba5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -463,6 +463,7 @@ void mm_update_next_owner(struct mm_struct *mm)
goto retry;
}
WRITE_ONCE(mm->owner, c);
+ lru_gen_migrate_mm(mm);
task_unlock(c);
put_task_struct(c);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..be1b58bf11bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1078,6 +1078,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
goto fail_nocontext;

mm->user_ns = get_user_ns(user_ns);
+ lru_gen_init_mm(mm);
return mm;

fail_nocontext:
@@ -1120,6 +1121,7 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ lru_gen_del_mm(mm);
mmdrop(mm);
}

@@ -2603,6 +2605,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
get_task_struct(p);
}

+ if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+ /* lock the task to synchronize with memcg migration */
+ task_lock(p);
+ lru_gen_add_mm(p->mm);
+ task_unlock(p);
+ }
+
wake_up_new_task(p);

/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 77563109c0ea..268b869d326e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4956,6 +4956,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
+ lru_gen_use_mm(next->mm);

if (!prev->mm) { // from kernel
/* will mmdrop() in finish_task_switch(). */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4359a278e31..33576f6814b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6135,6 +6135,29 @@ static void mem_cgroup_move_task(void)
}
#endif

+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *task = NULL;
+
+ cgroup_taskset_for_each_leader(task, css, tset)
+ break;
+
+ if (!task)
+ return;
+
+ task_lock(task);
+ if (task->mm && task->mm->owner == task)
+ lru_gen_migrate_mm(task->mm);
+ task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
{
if (value == PAGE_COUNTER_MAX)
@@ -6478,6 +6501,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_reset = mem_cgroup_css_reset,
.css_rstat_flush = mem_cgroup_css_rstat_flush,
.can_attach = mem_cgroup_can_attach,
+ .attach = mem_cgroup_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0e487c0ffe17..5eaf22aa446a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3095,6 +3095,342 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
get_nr_gens(lruvec, 0) <= MAX_NR_GENS;
}

+/******************************************************************************
+ * mm_struct list
+ ******************************************************************************/
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+ static struct lru_gen_mm_list mm_list = {
+ .fifo = LIST_HEAD_INIT(mm_list.fifo),
+ .lock = __SPIN_LOCK_UNLOCKED(mm_list.lock),
+ };
+
+#ifdef CONFIG_MEMCG
+ if (memcg)
+ return &memcg->mm_list;
+#endif
+ return &mm_list;
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+ int nid;
+ struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+ VM_BUG_ON_MM(!list_empty(&mm->lru_gen.list), mm);
+#ifdef CONFIG_MEMCG
+ VM_BUG_ON_MM(mm->lru_gen.memcg, mm);
+ mm->lru_gen.memcg = memcg;
+#endif
+ spin_lock(&mm_list->lock);
+
+ list_add_tail(&mm->lru_gen.list, &mm_list->fifo);
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ if (!lruvec)
+ continue;
+
+ if (lruvec->mm_state.tail == &mm_list->fifo)
+ lruvec->mm_state.tail = lruvec->mm_state.tail->prev;
+ }
+
+ spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+ int nid;
+ struct lru_gen_mm_list *mm_list;
+ struct mem_cgroup *memcg = NULL;
+
+ if (list_empty(&mm->lru_gen.list))
+ return;
+
+#ifdef CONFIG_MEMCG
+ memcg = mm->lru_gen.memcg;
+#endif
+ mm_list = get_mm_list(memcg);
+
+ spin_lock(&mm_list->lock);
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ if (!lruvec)
+ continue;
+
+ if (lruvec->mm_state.tail == &mm->lru_gen.list)
+ lruvec->mm_state.tail = lruvec->mm_state.tail->next;
+
+ if (lruvec->mm_state.head != &mm->lru_gen.list)
+ continue;
+
+ lruvec->mm_state.head = lruvec->mm_state.head->next;
+ if (lruvec->mm_state.head == &mm_list->fifo)
+ WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1);
+ }
+
+ list_del_init(&mm->lru_gen.list);
+
+ spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+ mem_cgroup_put(mm->lru_gen.memcg);
+ mm->lru_gen.memcg = NULL;
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+ struct mem_cgroup *memcg;
+
+ lockdep_assert_held(&mm->owner->alloc_lock);
+
+ if (mem_cgroup_disabled())
+ return;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(mm->owner);
+ rcu_read_unlock();
+ if (memcg == mm->lru_gen.memcg)
+ return;
+
+ VM_BUG_ON_MM(!mm->lru_gen.memcg, mm);
+ VM_BUG_ON_MM(list_empty(&mm->lru_gen.list), mm);
+
+ lru_gen_del_mm(mm);
+ lru_gen_add_mm(mm);
+}
+#endif
+
+/*
+ * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when
+ * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of
+ * bits in a bitmap, k is the number of hash functions and n is the number of
+ * inserted items.
+ *
+ * Page table walkers use one of the two filters to reduce their search space.
+ * To get rid of non-leaf entries that no longer have enough leaf entries, the
+ * aging uses the double-buffering technique to flip to the other filter each
+ * time it creates a new generation. For non-leaf entries that have enough
+ * leaf entries, the aging carries them over to the next generation in
+ * walk_pmd_range(); the eviction also reports them when walking the rmap
+ * in lru_gen_look_around().
+ *
+ * For future optimizations:
+ * 1) It's not necessary to keep both filters all the time. The spare one can be
+ * freed after the RCU grace period and reallocated if needed again.
+ * 2) And when reallocating, it's worth scaling its size according to the number
+ * of inserted entries in the other filter, to reduce the memory overhead on
+ * small systems and false positives on large systems.
+ * 3) Jenkins' hash function is an alternative to Knuth's.
+ */
+#define BLOOM_FILTER_SHIFT 15
+
+static inline int filter_gen_from_seq(unsigned long seq)
+{
+ return seq % NR_BLOOM_FILTERS;
+}
+
+static void get_item_key(void *item, int *key)
+{
+ u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2);
+
+ BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32));
+
+ key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1);
+ key[1] = hash >> BLOOM_FILTER_SHIFT;
+}
+
+static void clear_bloom_filter(struct lruvec *lruvec, unsigned long seq)
+{
+ unsigned long *filter;
+ int gen = filter_gen_from_seq(seq);
+
+ lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+ filter = lruvec->mm_state.filters[gen];
+ if (filter) {
+ bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT));
+ return;
+ }
+
+ filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT), GFP_ATOMIC);
+ WRITE_ONCE(lruvec->mm_state.filters[gen], filter);
+}
+
+static void set_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+ int key[2];
+ unsigned long *filter;
+ int gen = filter_gen_from_seq(seq);
+
+ filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+ if (!filter)
+ return;
+
+ get_item_key(item, key);
+
+ if (!test_bit(key[0], filter))
+ set_bit(key[0], filter);
+ if (!test_bit(key[1], filter))
+ set_bit(key[1], filter);
+}
+
+static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+ int key[2];
+ unsigned long *filter;
+ int gen = filter_gen_from_seq(seq);
+
+ filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+ if (!filter)
+ return false;
+
+ get_item_key(item, key);
+
+ return test_bit(key[0], filter) && test_bit(key[1], filter);
+}
+
+static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last)
+{
+ int i;
+ int hist = lru_hist_from_seq(walk->max_seq);
+
+ lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+ for (i = 0; i < NR_MM_STATS; i++) {
+ WRITE_ONCE(lruvec->mm_state.stats[hist][i],
+ lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]);
+ walk->mm_stats[i] = 0;
+ }
+
+ if (NR_HIST_GENS == 1 || !last)
+ return;
+
+ hist = lru_hist_from_seq(walk->max_seq + 1);
+ for (i = 0; i < NR_MM_STATS; i++)
+ WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0);
+}
+
+static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+ int type;
+ unsigned long size = 0;
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+
+ if (!walk->full_scan && cpumask_empty(mm_cpumask(mm)) &&
+ !node_isset(pgdat->node_id, mm->lru_gen.nodes))
+ return true;
+
+ for (type = !walk->can_swap; type < ANON_AND_FILE; type++) {
+ size += type ? get_mm_counter(mm, MM_FILEPAGES) :
+ get_mm_counter(mm, MM_ANONPAGES) +
+ get_mm_counter(mm, MM_SHMEMPAGES);
+ }
+
+ if (size < MIN_LRU_BATCH)
+ return true;
+
+ if (mm_is_oom_victim(mm))
+ return true;
+
+ if (!mmget_not_zero(mm))
+ return true;
+
+ node_clear(pgdat->node_id, mm->lru_gen.nodes);
+
+ return false;
+}
+
+static bool get_next_mm(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
+ struct mm_struct **iter)
+{
+ bool first = false;
+ bool last = true;
+ struct mm_struct *mm = NULL;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+ struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+ /*
+ * There are four interesting cases for this page table walker:
+ * 1) It tries to start a new iteration of this list with a stale
+ * max_seq; there is nothing to be done.
+ * 2) It's the first of the current generation, and it needs to prepare
+ * the Bloom filter for the next generation.
+ * 3) It reaches the end of this list, and it needs to increment
+ * mm_state->seq; the iteration is done.
+ * 4) It's the last of the current generation, and it needs to clear the
+ * historical mm stats for the next generation.
+ */
+ if (*iter)
+ mmput_async(*iter);
+ else if (walk->max_seq <= READ_ONCE(mm_state->seq))
+ return false;
+
+ spin_lock(&mm_list->lock);
+
+ VM_BUG_ON(walk->max_seq > mm_state->seq + 1);
+ VM_BUG_ON(*iter && walk->max_seq < mm_state->seq);
+ VM_BUG_ON(*iter && !mm_state->nr_walkers);
+
+ if (walk->max_seq <= mm_state->seq) {
+ if (!*iter)
+ last = false;
+ goto done;
+ }
+
+ if (mm_state->head == &mm_list->fifo) {
+ VM_BUG_ON(mm_state->nr_walkers);
+ mm_state->head = mm_state->head->next;
+ first = true;
+ }
+
+ while (!mm && mm_state->head != &mm_list->fifo) {
+ mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
+
+ mm_state->head = mm_state->head->next;
+
+ /* full scan for those added after the last iteration */
+ if (mm_state->tail == &mm->lru_gen.list) {
+ mm_state->tail = mm_state->tail->next;
+ walk->full_scan = true;
+ }
+
+ if (should_skip_mm(mm, walk))
+ mm = NULL;
+ }
+
+ if (mm_state->head == &mm_list->fifo)
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+done:
+ if (*iter && !mm)
+ mm_state->nr_walkers--;
+ if (!*iter && mm)
+ mm_state->nr_walkers++;
+
+ if (mm_state->nr_walkers)
+ last = false;
+
+ if (mm && first)
+ clear_bloom_filter(lruvec, walk->max_seq + 1);
+
+ if (*iter || last)
+ reset_mm_stats(lruvec, walk, last);
+
+ spin_unlock(&mm_list->lock);
+
+ *iter = mm;
+
+ return last;
+}
+
/******************************************************************************
* state change
******************************************************************************/
@@ -3252,6 +3588,7 @@ void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)
int i;
int gen, type, zone;
struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);

lrugen->max_seq = MIN_NR_GENS + 1;
lrugen->enabled = lru_gen_enabled();
@@ -3261,6 +3598,11 @@ void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)

for_each_gen_type_zone(gen, type, zone)
INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+
+ lruvec->mm_state.seq = MIN_NR_GENS;
+ lruvec->mm_state.head = &mm_list->fifo;
+ lruvec->mm_state.tail = &mm_list->fifo;
+ init_waitqueue_head(&lruvec->mm_state.wait);
}

#ifdef CONFIG_MEMCG
@@ -3268,6 +3610,9 @@ void lru_gen_init_memcg(struct mem_cgroup *memcg)
{
int nid;

+ INIT_LIST_HEAD(&memcg->mm_list.fifo);
+ spin_lock_init(&memcg->mm_list.lock);
+
for_each_node(nid) {
struct lruvec *lruvec = get_lruvec(memcg, nid);

@@ -3280,10 +3625,16 @@ void lru_gen_free_memcg(struct mem_cgroup *memcg)
int nid;

for_each_node(nid) {
+ int i;
struct lruvec *lruvec = get_lruvec(memcg, nid);

VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
sizeof(lruvec->lrugen.nr_pages)));
+
+ for (i = 0; i < NR_BLOOM_FILTERS; i++) {
+ bitmap_free(lruvec->mm_state.filters[i]);
+ lruvec->mm_state.filters[i] = NULL;
+ }
}
}
#endif
@@ -3292,6 +3643,7 @@ static int __init init_lru_gen(void)
{
BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+ BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);

return 0;
};
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 20:23:57

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v6 6/9] mm: multigenerational lru: aging

To avoid confusion, the term "scan" will be applied to PTEs in a page
table and pages on an lru list. It emphasizes consecutive elements in
a set rather than the data structure holding this set together.

The aging produces young generations. Given an lruvec, it iterates
lruvec_memcg()->mm_list and calls walk_page_range() with each
mm_struct on this list to scan PTEs for accessed pages. On finding a
young PTE, it clears the accessed bit and updates the gen counter of
the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. After each
iteration of this list, it increments max_seq. The aging is needed
before the eviction can continue when max_seq-min_seq+1 reaches
MIN_NR_GENS.
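
Concretely, the gen counter is stored in the page flags as gen+1, so
that zero means the page is not on a multigenerational lru list. A
simplified version of the update performed by the aging (see
folio_update_gen() below; the real function also handles pages whose
gen counter is still zero and batches the lru size updates) looks
like:

	unsigned long old_flags, new_flags;
	int new_gen = max_seq % MAX_NR_GENS;	/* i.e., lru_gen_from_seq() */

	do {
		new_flags = old_flags = READ_ONCE(folio->flags);
		new_flags &= ~LRU_GEN_MASK;
		/* store gen+1; the accessed bit was cleared by the caller */
		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
	} while (new_flags != old_flags &&
		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);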

To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multigenerational lru, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
lru, as usual.

IOW, the aging promotes a page to the youngest generation when it
finds this page accessed thru page tables; demotion happens
consequently when it creates a new generation. Note that promotion
doesn't require any lru list operations in the aging path, only the
update of the gen counter and the lru sizes; demotion, unless it is
the result of creating a new generation, requires lru list operations,
e.g., lru_deactivate_fn().
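
In other words, a promotion in the aging path amounts to the flags
update above plus two size-counter adjustments batched in the walk
state and flushed to the lruvec under lru_lock once per mm_struct (a
hedged sketch based on folio_update_gen() and reset_batch_size()
below):

	/* promotion: no list manipulation, only batched size accounting */
	walk->nr_pages[old_gen][type][zone] -= folio_nr_pages(folio);
	walk->nr_pages[new_gen][type][zone] += folio_nr_pages(folio);

	/* later, once per mm_struct */
	spin_lock_irq(&lruvec->lru_lock);
	reset_batch_size(lruvec, walk);
	spin_unlock_irq(&lruvec->lru_lock);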

The aging uses the following optimizations when walking page tables:
1) It uses the accessed bit in non-leaf PMD entries, the hint from the
CPU scheduler and the Bloom filters to reduce its search space (see
the sketch after this list).
2) It doesn't zigzag between a PGD table and the same PMD or PTE table
spanning multiple VMAs. In other words, it finishes all the VMAs
within the range of the same PMD or PTE table before it returns to
a PGD table. This improves the cache performance for workloads that
have large numbers of tiny VMAs, especially when
CONFIG_PGTABLE_LEVELS=5.
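
For 1), the per-PMD pruning in walk_pmd_range() below boils down to
the following simplified sketch ("walk" stands for the lru_gen_mm_walk
private state; THP handling and statistics are dropped):

	/* skip the whole PTE table when its parent entry is known to be old */
	if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && !pmd_young(val))
		continue;

	/* otherwise consult the Bloom filter populated by the last iteration */
	if (!walk->full_scan && !test_bloom_filter(lruvec, max_seq, pmd + i))
		continue;

	/*
	 * scan the PTE table; remember it for the next generation if it
	 * turned out to be densely accessed
	 */
	if (walk_pte_range(&val, addr, next, walk))
		set_bloom_filter(lruvec, max_seq + 1, pmd + i);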

The aging is only interested in accessed pages and therefore has a
complexity of O(nr_hot_evictable_pages). The worst-case scenario is
that the aging fails to exploit any spatial locality and the eviction
has to promote all accessed pages when walking the rmap, which is
similar to the active/inactive lru. However, generations can still
provide better temporal locality.

Signed-off-by: Yu Zhao <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
---
include/linux/memcontrol.h | 6 +
include/linux/mm.h | 5 +
include/linux/mmzone.h | 10 +
include/linux/oom.h | 16 +
include/linux/swap.h | 4 +
mm/oom_kill.c | 4 +-
mm/rmap.c | 7 +
mm/vmscan.c | 896 +++++++++++++++++++++++++++++++++++++
8 files changed, 946 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index aba18cd101db..028afdb81c10 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1393,18 +1393,24 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)

static inline void lock_page_memcg(struct page *page)
{
+ /* to match folio_memcg_rcu() */
+ rcu_read_lock();
}

static inline void unlock_page_memcg(struct page *page)
{
+ rcu_read_unlock();
}

static inline void folio_memcg_lock(struct folio *folio)
{
+ /* to match folio_memcg_rcu() */
+ rcu_read_lock();
}

static inline void folio_memcg_unlock(struct folio *folio)
{
+ rcu_read_unlock();
}

static inline void mem_cgroup_handle_over_high(void)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fadbf8e6abcd..3d42118b7f5e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1599,6 +1599,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
return page_to_pfn(&folio->page);
}

+static inline struct folio *pfn_folio(unsigned long pfn)
+{
+ return page_folio(pfn_to_page(pfn));
+}
+
/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
#ifdef CONFIG_MIGRATION
static inline bool is_pinnable_page(struct page *page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5b9bc2532c5b..94af12507788 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -304,6 +304,7 @@ enum lruvec_flags {
};

struct lruvec;
+struct page_vma_mapped_walk;

#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
@@ -410,6 +411,7 @@ struct lru_gen_mm_walk {
};

void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

#ifdef CONFIG_MEMCG
void lru_gen_init_memcg(struct mem_cgroup *memcg);
@@ -422,6 +424,10 @@ static inline void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *l
{
}

+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
#ifdef CONFIG_MEMCG
static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
{
@@ -1048,6 +1054,10 @@ typedef struct pglist_data {

unsigned long flags;

+#ifdef CONFIG_LRU_GEN
+ /* kswap mm walk data */
+ struct lru_gen_mm_walk mm_walk;
+#endif
ZONE_PADDING(_pad2_)

/* Per-node vmstats */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..9c7a4fae0661 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -57,6 +57,22 @@ struct oom_control {
extern struct mutex oom_lock;
extern struct mutex oom_adj_mutex;

+#ifdef CONFIG_MMU
+extern struct task_struct *oom_reaper_list;
+extern struct wait_queue_head oom_reaper_wait;
+
+static inline bool oom_reaping_in_progress(void)
+{
+ /* a racy check can be used to reduce the chance of overkilling */
+ return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
+}
+#else
+static inline bool oom_reaping_in_progress(void)
+{
+ return false;
+}
+#endif
+
static inline void set_current_oom_origin(void)
{
current->signal->oom_flag_origin = true;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1ea44b31f19..bb93bba97115 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -137,6 +137,10 @@ union swap_header {
*/
struct reclaim_state {
unsigned long reclaimed_slab;
+#ifdef CONFIG_LRU_GEN
+ /* per-thread mm walk data */
+ struct lru_gen_mm_walk *mm_walk;
+#endif
};

#ifdef __KERNEL__
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..ef5860fc7d22 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -508,8 +508,8 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
* victim (if that is possible) to help the OOM killer to move on.
*/
static struct task_struct *oom_reaper_th;
-static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+struct task_struct *oom_reaper_list;
static DEFINE_SPINLOCK(oom_reaper_lock);

bool __oom_reap_task_mm(struct mm_struct *mm)
diff --git a/mm/rmap.c b/mm/rmap.c
index 163ac4e6bcee..2f023e6c0f82 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@
#include <linux/page_idle.h>
#include <linux/memremap.h>
#include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>

#include <asm/tlbflush.h>

@@ -790,6 +791,12 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
}

if (pvmw.pte) {
+ if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
+ !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+ lru_gen_look_around(&pvmw);
+ referenced++;
+ }
+
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte)) {
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5eaf22aa446a..fbf1337a1632 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,8 @@
#include <linux/dax.h>
#include <linux/psi.h>
#include <linux/memory.h>
+#include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1555,6 +1557,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
if (!sc->may_unmap && page_mapped(page))
goto keep_locked;

+ /* folio_update_gen() tried to promote this page? */
+ if (lru_gen_enabled() && !ignore_references &&
+ page_mapped(page) && PageReferenced(page))
+ goto keep_locked;
+
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

@@ -3047,6 +3054,15 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
* shorthand helpers
******************************************************************************/

+#define DEFINE_MAX_SEQ(lruvec) \
+ unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec) \
+ unsigned long min_seq[ANON_AND_FILE] = { \
+ READ_ONCE((lruvec)->lrugen.min_seq[0]), \
+ READ_ONCE((lruvec)->lrugen.min_seq[1]), \
+ }
+
#define for_each_gen_type_zone(gen, type, zone) \
for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \
for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
@@ -3077,6 +3093,12 @@ static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
return pgdat ? &pgdat->__lruvec : NULL;
}

+static int get_swappiness(struct mem_cgroup *memcg)
+{
+ return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
+ mem_cgroup_swappiness(memcg) : 0;
+}
+
static int get_nr_gens(struct lruvec *lruvec, int type)
{
return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
@@ -3431,6 +3453,869 @@ static bool get_next_mm(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
return last;
}

+/******************************************************************************
+ * the aging
+ ******************************************************************************/
+
+static void folio_update_gen(struct folio *folio, struct lru_gen_mm_walk *walk)
+{
+ unsigned long old_flags, new_flags;
+ int type = folio_is_file_lru(folio);
+ int zone = folio_zonenum(folio);
+ int delta = folio_nr_pages(folio);
+ int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
+
+ do {
+ new_flags = old_flags = READ_ONCE(folio->flags);
+
+ /* for shrink_page_list() */
+ if (!(new_flags & LRU_GEN_MASK)) {
+ new_flags |= BIT(PG_referenced);
+ continue;
+ }
+
+ new_flags &= ~LRU_GEN_MASK;
+ new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+ } while (new_flags != old_flags &&
+ cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+ old_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+ if (old_gen < 0 || old_gen == new_gen)
+ return;
+
+ walk->batched++;
+ walk->nr_pages[old_gen][type][zone] -= delta;
+ walk->nr_pages[new_gen][type][zone] += delta;
+}
+
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+ unsigned long old_flags, new_flags;
+ int type = folio_is_file_lru(folio);
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+ do {
+ new_flags = old_flags = READ_ONCE(folio->flags);
+ VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
+
+ new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+ /* folio_update_gen() has promoted this page? */
+ if (new_gen >= 0 && new_gen != old_gen)
+ return new_gen;
+
+ new_gen = (old_gen + 1) % MAX_NR_GENS;
+
+ new_flags &= ~LRU_GEN_MASK;
+ new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+ /* for folio_end_writeback() */
+ if (reclaiming)
+ new_flags |= BIT(PG_reclaim);
+ } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+ lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
+
+ return new_gen;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
+{
+ int gen, type, zone;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ walk->batched = 0;
+
+ for_each_gen_type_zone(gen, type, zone) {
+ enum lru_list lru = type * LRU_FILE;
+ int delta = walk->nr_pages[gen][type][zone];
+
+ if (!delta)
+ continue;
+
+ walk->nr_pages[gen][type][zone] = 0;
+ WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+ lrugen->nr_pages[gen][type][zone] + delta);
+
+ if (lru_gen_is_active(lruvec, gen))
+ lru += LRU_ACTIVE;
+ lru_gen_update_size(lruvec, lru, zone, delta);
+ }
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+ struct address_space *mapping;
+ struct vm_area_struct *vma = walk->vma;
+ struct lru_gen_mm_walk *priv = walk->private;
+
+ if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) ||
+ (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ)))
+ return true;
+
+ if (vma_is_anonymous(vma))
+ return !priv->can_swap;
+
+ if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
+ return true;
+
+ mapping = vma->vm_file->f_mapping;
+ if (!mapping->a_ops->writepage)
+ return true;
+
+ return (shmem_mapping(mapping) && !priv->can_swap) || mapping_unevictable(mapping);
+}
+
+/*
+ * Some userspace memory allocators map many single-page VMAs. Instead of
+ * returning back to the PGD table for each of such VMAs, finish an entire PMD
+ * table to reduce zigzags and improve cache performance.
+ */
+static bool get_next_vma(struct mm_walk *walk, unsigned long mask, unsigned long size,
+ unsigned long *start, unsigned long *end)
+{
+ unsigned long next = round_up(*end, size);
+
+ VM_BUG_ON(mask & size);
+ VM_BUG_ON(*start >= *end);
+ VM_BUG_ON((next & mask) != (*start & mask));
+
+ while (walk->vma) {
+ if (next >= walk->vma->vm_end) {
+ walk->vma = walk->vma->vm_next;
+ continue;
+ }
+
+ if ((next & mask) != (walk->vma->vm_start & mask))
+ return false;
+
+ if (should_skip_vma(walk->vma->vm_start, walk->vma->vm_end, walk)) {
+ walk->vma = walk->vma->vm_next;
+ continue;
+ }
+
+ *start = max(next, walk->vma->vm_start);
+ next = (next | ~mask) + 1;
+ /* rounded-up boundaries can wrap to 0 */
+ *end = next && next < walk->vma->vm_end ? next : walk->vma->vm_end;
+
+ return true;
+ }
+
+ return false;
+}
+
+static bool suitable_to_scan(int total, int young)
+{
+ int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
+
+ /* suitable if the average number of young PTEs per cacheline is >=1 */
+ return young * n >= total;
+}
+
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ int i;
+ pte_t *pte;
+ spinlock_t *ptl;
+ unsigned long addr;
+ int total = 0;
+ int young = 0;
+ struct lru_gen_mm_walk *priv = walk->private;
+ struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+
+ VM_BUG_ON(pmd_leaf(*pmd));
+
+ pte = pte_offset_map_lock(walk->mm, pmd, start & PMD_MASK, &ptl);
+ arch_enter_lazy_mmu_mode();
+restart:
+ for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ struct folio *folio;
+ unsigned long pfn = pte_pfn(pte[i]);
+
+ total++;
+ priv->mm_stats[MM_PTE_TOTAL]++;
+
+ if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+ continue;
+
+ if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+ continue;
+
+ if (!pte_young(pte[i])) {
+ priv->mm_stats[MM_PTE_OLD]++;
+ continue;
+ }
+
+ VM_BUG_ON(!pfn_valid(pfn));
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ continue;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ continue;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ continue;
+
+ VM_BUG_ON(addr < walk->vma->vm_start || addr >= walk->vma->vm_end);
+ if (ptep_test_and_clear_young(walk->vma, addr, pte + i)) {
+ folio_update_gen(folio, priv);
+ priv->mm_stats[MM_PTE_YOUNG]++;
+ young++;
+ }
+
+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ folio_mark_dirty(folio);
+ }
+
+ if (i < PTRS_PER_PTE && get_next_vma(walk, PMD_MASK, PAGE_SIZE, &start, &end))
+ goto restart;
+
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte, ptl);
+
+ return suitable_to_scan(total, young);
+}
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+ struct mm_walk *walk, unsigned long *start)
+{
+ int i;
+ pmd_t *pmd;
+ spinlock_t *ptl;
+ struct lru_gen_mm_walk *priv = walk->private;
+ struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+
+ VM_BUG_ON(pud_leaf(*pud));
+
+ /* try to batch at most 1+MIN_LRU_BATCH+1 entries */
+ if (*start == -1) {
+ *start = next;
+ return;
+ }
+
+ i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start);
+ if (i && i <= MIN_LRU_BATCH) {
+ __set_bit(i - 1, priv->bitmap);
+ return;
+ }
+
+ pmd = pmd_offset(pud, *start);
+ ptl = pmd_lock(walk->mm, pmd);
+ arch_enter_lazy_mmu_mode();
+
+ do {
+ struct folio *folio;
+ unsigned long pfn = pmd_pfn(pmd[i]);
+ unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start;
+
+ if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i]))
+ goto next;
+
+ if (WARN_ON_ONCE(pmd_devmap(pmd[i])))
+ goto next;
+
+ if (!pmd_trans_huge(pmd[i])) {
+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+ pmdp_test_and_clear_young(vma, addr, pmd + i);
+ goto next;
+ }
+
+ VM_BUG_ON(!pfn_valid(pfn));
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ goto next;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ goto next;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ goto next;
+
+ VM_BUG_ON(addr < vma->vm_start || addr >= vma->vm_end);
+ if (pmdp_test_and_clear_young(vma, addr, pmd + i)) {
+ folio_update_gen(folio, priv);
+ priv->mm_stats[MM_PTE_YOUNG]++;
+ }
+
+ if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ folio_mark_dirty(folio);
+next:
+ i = i > MIN_LRU_BATCH ? 0 :
+ find_next_bit(priv->bitmap, MIN_LRU_BATCH, i) + 1;
+ } while (i <= MIN_LRU_BATCH);
+
+ arch_leave_lazy_mmu_mode();
+ spin_unlock(ptl);
+
+ *start = -1;
+ bitmap_zero(priv->bitmap, MIN_LRU_BATCH);
+}
+#else
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+ struct mm_walk *walk, unsigned long *start)
+{
+}
+#endif
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ int i;
+ pmd_t *pmd;
+ unsigned long next;
+ unsigned long addr;
+ struct vm_area_struct *vma;
+ unsigned long pos = -1;
+ struct lru_gen_mm_walk *priv = walk->private;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+#endif
+
+ VM_BUG_ON(pud_leaf(*pud));
+
+ /*
+ * Finish an entire PMD in two passes: the first only reaches to PTE
+ * tables to avoid taking the PMD lock; the second, if necessary, takes
+ * the PMD lock to clear the accessed bit in PMD entries.
+ */
+ pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+ /* walk_pte_range() may call get_next_vma() */
+ vma = walk->vma;
+ for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+ pmd_t val = pmd_read_atomic(pmd + i);
+
+ /* for pmd_read_atomic() */
+ barrier();
+
+ next = pmd_addr_end(addr, end);
+
+ if (!pmd_present(val)) {
+ priv->mm_stats[MM_PTE_TOTAL]++;
+ continue;
+ }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (pmd_trans_huge(val)) {
+ unsigned long pfn = pmd_pfn(val);
+
+ priv->mm_stats[MM_PTE_TOTAL]++;
+
+ if (is_huge_zero_pmd(val))
+ continue;
+
+ if (!pmd_young(val)) {
+ priv->mm_stats[MM_PTE_OLD]++;
+ continue;
+ }
+
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ continue;
+
+ walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+ continue;
+ }
+#endif
+ priv->mm_stats[MM_PMD_TOTAL]++;
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+ if (!pmd_young(val))
+ continue;
+
+ walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+#endif
+ if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
+ continue;
+
+ priv->mm_stats[MM_PMD_FOUND]++;
+
+ if (!walk_pte_range(&val, addr, next, walk))
+ continue;
+
+ set_bloom_filter(priv->lruvec, priv->max_seq + 1, pmd + i);
+
+ priv->mm_stats[MM_PMD_ADDED]++;
+ }
+
+ walk_pmd_range_locked(pud, -1, vma, walk, &pos);
+
+ if (i < PTRS_PER_PMD && get_next_vma(walk, PUD_MASK, PMD_SIZE, &start, &end))
+ goto restart;
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ int i;
+ pud_t *pud;
+ unsigned long addr;
+ unsigned long next;
+ struct lru_gen_mm_walk *priv = walk->private;
+
+ VM_BUG_ON(p4d_leaf(*p4d));
+
+ pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+ for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+ pud_t val = READ_ONCE(pud[i]);
+
+ next = pud_addr_end(addr, end);
+
+ if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+ continue;
+
+ walk_pmd_range(&val, addr, next, walk);
+
+ if (priv->batched >= MAX_LRU_BATCH) {
+ end = (addr | ~PUD_MASK) + 1;
+ goto done;
+ }
+ }
+
+ if (i < PTRS_PER_PUD && get_next_vma(walk, P4D_MASK, PUD_SIZE, &start, &end))
+ goto restart;
+
+ end = round_up(end, P4D_SIZE);
+done:
+ /* rounded-up boundaries can wrap to 0 */
+ priv->next_addr = end && walk->vma ? max(end, walk->vma->vm_start) : 0;
+
+ return -EAGAIN;
+}
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+ static const struct mm_walk_ops mm_walk_ops = {
+ .test_walk = should_skip_vma,
+ .p4d_entry = walk_pud_range,
+ };
+
+ int err;
+#ifdef CONFIG_MEMCG
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+#endif
+
+ walk->next_addr = FIRST_USER_ADDRESS;
+
+ do {
+ unsigned long start = walk->next_addr;
+ unsigned long end = mm->highest_vm_end;
+
+ err = -EBUSY;
+
+ rcu_read_lock();
+#ifdef CONFIG_MEMCG
+ if (memcg && atomic_read(&memcg->moving_account))
+ goto contended;
+#endif
+ if (!mmap_read_trylock(mm))
+ goto contended;
+
+ err = walk_page_range(mm, start, end, &mm_walk_ops, walk);
+
+ mmap_read_unlock(mm);
+
+ if (walk->batched) {
+ spin_lock_irq(&lruvec->lru_lock);
+ reset_batch_size(lruvec, walk);
+ spin_unlock_irq(&lruvec->lru_lock);
+ }
+contended:
+ rcu_read_unlock();
+
+ cond_resched();
+ } while (err == -EAGAIN && walk->next_addr && !mm_is_oom_victim(mm));
+}
+
+static struct lru_gen_mm_walk *alloc_mm_walk(void)
+{
+ if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+ return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
+
+ return current->reclaim_state->mm_walk;
+}
+
+static void free_mm_walk(struct lru_gen_mm_walk *walk)
+{
+ if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+ kvfree(walk);
+}
+
+static void inc_min_seq(struct lruvec *lruvec)
+{
+ int gen, type;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ VM_BUG_ON(!seq_is_valid(lruvec));
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+ continue;
+
+ WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+ }
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+{
+ int gen, type, zone;
+ bool success = false;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ DEFINE_MIN_SEQ(lruvec);
+
+ VM_BUG_ON(!seq_is_valid(lruvec));
+
+ for (type = !can_swap; type < ANON_AND_FILE; type++) {
+ while (lrugen->max_seq >= min_seq[type] + MIN_NR_GENS) {
+ gen = lru_gen_from_seq(min_seq[type]);
+
+ for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+ if (!list_empty(&lrugen->lists[gen][type][zone]))
+ goto next;
+ }
+
+ min_seq[type]++;
+ }
+next:
+ ;
+ }
+
+ /* see the comment in seq_is_valid() */
+ if (can_swap) {
+ min_seq[0] = min(min_seq[0], min_seq[1]);
+ min_seq[1] = max(min_seq[0], lrugen->min_seq[1]);
+ }
+
+ for (type = !can_swap; type < ANON_AND_FILE; type++) {
+ if (min_seq[type] == lrugen->min_seq[type])
+ continue;
+
+ WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+ success = true;
+ }
+
+ return success;
+}
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+ int prev, next;
+ int type, zone;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ spin_lock_irq(&lruvec->lru_lock);
+
+ VM_BUG_ON(!seq_is_valid(lruvec));
+
+ if (max_seq != lrugen->max_seq)
+ goto unlock;
+
+ inc_min_seq(lruvec);
+
+ /* update the active/inactive lru sizes for compatibility */
+ prev = lru_gen_from_seq(lrugen->max_seq - 1);
+ next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+ enum lru_list lru = type * LRU_FILE;
+ long delta = lrugen->nr_pages[prev][type][zone] -
+ lrugen->nr_pages[next][type][zone];
+
+ if (!delta)
+ continue;
+
+ lru_gen_update_size(lruvec, lru, zone, delta);
+ lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+ }
+ }
+
+ WRITE_ONCE(lrugen->timestamps[next], jiffies);
+ /* make sure preceding modifications appear */
+ smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+ spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+ struct scan_control *sc, bool can_swap, bool full_scan)
+{
+ bool last;
+ struct lru_gen_mm_walk *walk;
+ struct mm_struct *mm = NULL;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ VM_BUG_ON(max_seq > READ_ONCE(lrugen->max_seq));
+
+ /*
+ * If the hardware doesn't automatically set the accessed bit, fallback
+ * to lru_gen_look_around(), which only clears the accessed bit in a
+ * handful of PTEs. Spreading the work out over a period of time usually
+ * is less efficient, but it avoids bursty page faults.
+ */
+ if (!full_scan && !arch_has_hw_pte_young(false)) {
+ inc_max_seq(lruvec, max_seq);
+ return true;
+ }
+
+ walk = alloc_mm_walk();
+ if (!walk)
+ return false;
+
+ walk->lruvec = lruvec;
+ walk->max_seq = max_seq;
+ walk->can_swap = can_swap;
+ walk->full_scan = full_scan;
+
+ do {
+ last = get_next_mm(lruvec, walk, &mm);
+ if (mm)
+ walk_mm(lruvec, mm, walk);
+
+ cond_resched();
+ } while (mm);
+
+ free_mm_walk(walk);
+
+ if (!last) {
+ if (!current_is_kswapd() && sc->priority < DEF_PRIORITY - 2)
+ wait_event_killable(lruvec->mm_state.wait,
+ max_seq < READ_ONCE(lrugen->max_seq));
+
+ return max_seq < READ_ONCE(lrugen->max_seq);
+ }
+
+ VM_BUG_ON(max_seq != READ_ONCE(lrugen->max_seq));
+
+ inc_max_seq(lruvec, max_seq);
+ /* either this sees any waiters or they will see updated max_seq */
+ if (wq_has_sleeper(&lruvec->mm_state.wait))
+ wake_up_all(&lruvec->mm_state.wait);
+
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+ return true;
+}
+
+static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq,
+ struct scan_control *sc, bool can_swap, bool *need_aging)
+{
+ int gen, type, zone;
+ long max = 0;
+ long min = 0;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ /*
+ * The upper bound of evictable pages is all eligible pages; the lower
+ * bound is aged eligible file pages. The aging is due if the number of
+ * aged generations and the number of aged eligible file pages are both
+ * low.
+ */
+ for (type = !can_swap; type < ANON_AND_FILE; type++) {
+ unsigned long seq;
+
+ for (seq = min_seq[type]; seq <= max_seq; seq++) {
+ long size = 0;
+
+ gen = lru_gen_from_seq(seq);
+
+ for (zone = 0; zone <= sc->reclaim_idx; zone++)
+ size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+ max += size;
+ if (type && max_seq >= seq + MIN_NR_GENS)
+ min += size;
+ }
+ }
+
+ *need_aging = max_seq <= min_seq[1] + MIN_NR_GENS && min < MIN_LRU_BATCH;
+
+ return max > 0 ? max : 0;
+}
+
+static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
+ unsigned long min_ttl)
+{
+ bool need_aging;
+ long nr_to_scan;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ int swappiness = get_swappiness(memcg);
+ DEFINE_MAX_SEQ(lruvec);
+ DEFINE_MIN_SEQ(lruvec);
+
+ if (mem_cgroup_below_min(memcg))
+ return false;
+
+ if (min_ttl) {
+ int gen = lru_gen_from_seq(min_seq[1]);
+ unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+ if (time_is_after_jiffies(birth + min_ttl))
+ return false;
+ }
+
+ nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, sc, swappiness, &need_aging);
+ if (!nr_to_scan)
+ return false;
+
+ nr_to_scan >>= sc->priority;
+
+ if (!mem_cgroup_online(memcg))
+ nr_to_scan++;
+
+ if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
+
+ return true;
+}
+
+/* to protect the working set of the last N jiffies */
+static unsigned long lru_gen_min_ttl __read_mostly;
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+ struct mem_cgroup *memcg;
+ bool success = false;
+ unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
+
+ VM_BUG_ON(!current_is_kswapd());
+
+ current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+ if (age_lruvec(lruvec, sc, min_ttl))
+ success = true;
+
+ cond_resched();
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ if (!success && mutex_trylock(&oom_lock)) {
+ struct oom_control oc = {
+ .gfp_mask = sc->gfp_mask,
+ .order = sc->order,
+ };
+
+ if (!oom_reaping_in_progress())
+ out_of_memory(&oc);
+
+ mutex_unlock(&oom_lock);
+ }
+
+ current->reclaim_state->mm_walk = NULL;
+}
+
+/*
+ * This function exploits spatial locality when shrink_page_list() walks the
+ * rmap. It scans the vicinity of a young PTE in a PTE table and promotes
+ * accessed pages. If the scan was done cacheline efficiently, it adds the PMD
+ * entry pointing to this PTE table to the Bloom filter. This process is a
+ * feedback loop from the eviction to the aging.
+ */
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+ int i;
+ pte_t *pte;
+ unsigned long start;
+ unsigned long end;
+ unsigned long addr;
+ struct lru_gen_mm_walk *walk;
+ int total = 0;
+ int young = 0;
+ struct mem_cgroup *memcg = page_memcg(pvmw->page);
+ struct pglist_data *pgdat = page_pgdat(pvmw->page);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ DEFINE_MAX_SEQ(lruvec);
+
+ lockdep_assert_held(pvmw->ptl);
+ VM_BUG_ON_PAGE(PageLRU(pvmw->page), pvmw->page);
+
+ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+ if (!walk)
+ return;
+
+ walk->max_seq = max_seq;
+
+ start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+ end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+
+ if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
+ if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
+ end = start + MIN_LRU_BATCH * PAGE_SIZE;
+ else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
+ start = end - MIN_LRU_BATCH * PAGE_SIZE;
+ else {
+ start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
+ end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
+ }
+ }
+
+ pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+
+ lock_page_memcg(pvmw->page);
+ arch_enter_lazy_mmu_mode();
+
+ for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ struct folio *folio;
+ unsigned long pfn = pte_pfn(pte[i]);
+
+ total++;
+
+ if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+ continue;
+
+ if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+ continue;
+
+ if (!pte_young(pte[i]))
+ continue;
+
+ VM_BUG_ON(!pfn_valid(pfn));
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ continue;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ continue;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ continue;
+
+ VM_BUG_ON(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
+ if (ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) {
+ folio_update_gen(folio, walk);
+ young++;
+ }
+
+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ __set_bit(i, walk->bitmap);
+ }
+
+ arch_leave_lazy_mmu_mode();
+ unlock_page_memcg(pvmw->page);
+
+ if (suitable_to_scan(total, young))
+ set_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+ for_each_set_bit(i, walk->bitmap, MIN_LRU_BATCH)
+ set_page_dirty(pte_page(pte[i]));
+
+ bitmap_zero(walk->bitmap, MIN_LRU_BATCH);
+}
+
/******************************************************************************
* state change
******************************************************************************/
@@ -3649,6 +4534,12 @@ static int __init init_lru_gen(void)
};
late_initcall(init_lru_gen);

+#else
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
#endif /* CONFIG_LRU_GEN */

static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -4536,6 +5427,11 @@ static void age_active_anon(struct pglist_data *pgdat,
struct mem_cgroup *memcg;
struct lruvec *lruvec;

+ if (lru_gen_enabled()) {
+ lru_gen_age_node(pgdat, sc);
+ return;
+ }
+
if (!can_age_anon_pages(pgdat, sc))
return;

--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 20:24:00

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v6 7/9] mm: multigenerational lru: eviction

The eviction consumes old generations. Given an lruvec, it scans pages
on lrugen->lists[] indexed by min_seq%MAX_NR_GENS. A feedback loop
modeled after the PID controller monitors refaults over anon and file
types and decides which type to evict when both are available from the
same generation.

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses thru file descriptors. A page
accessed N times thru file descriptors is in tier order_base_2(N). The
feedback loop also monitors refaults over all tiers and decides when
to promote pages in which tiers (N>1), using the first tier (N=0,1) as
a baseline.

The eviction sorts a page according to the gen counter if the aging
has found this page accessed thru page tables, which completes the
promotion of this page. The eviction also promotes a page to the next
generation (min_seq+1 rather than max_seq) if this page was accessed
multiple times thru file descriptors and the feedback loop has
detected higher refaults from the tier this page is in. This approach
has the following advantages:
1) It removes the cost of activation (recall the terms) in the
buffered access path by inferring whether pages accessed multiple
times thru file descriptors are statistically hot and thus worth
promoting in the eviction path.
2) It takes pages accessed thru page tables into account and avoids
overprotecting pages accessed multiple times thru file descriptors.
3) More tiers, which require additional bits in folio->flags, provide
better protection for pages accessed more than twice thru file
descriptors, when under heavy buffered I/O workloads.

The eviction increments min_seq when lrugen->lists[] indexed by
min_seq%MAX_NR_GENS is empty.
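
For illustration only, a minimal user-space sketch (not part of the patch)
of the access-count-to-tier mapping described above; order_base_2_demo() is
a stand-in for the kernel's order_base_2(), and the N=0 case is folded into
tier 0 as lru_tier_from_refs() does:

#include <stdio.h>

/* mimics the kernel's order_base_2(): ceil(log2(n)), with n <= 1 mapping to 0 */
static int order_base_2_demo(unsigned int n)
{
	int order = 0;

	while (n > (1u << order))
		order++;

	return order;
}

int main(void)
{
	unsigned int n;

	/* N=0,1 -> tier 0; N=2 -> tier 1; N=3,4 -> tier 2; N=5..8 -> tier 3 */
	for (n = 0; n <= 8; n++)
		printf("accessed %u times -> tier %d\n", n, order_base_2_demo(n ? n : 1));

	return 0;
}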

Signed-off-by: Yu Zhao <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
---
include/linux/mm_inline.h | 10 +
include/linux/mmzone.h | 28 ++
mm/swap.c | 42 +++
mm/vmscan.c | 571 +++++++++++++++++++++++++++++++++++++-
mm/workingset.c | 119 +++++++-
5 files changed, 767 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 717a2290acb3..1907098ba908 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -115,6 +115,14 @@ static inline int lru_hist_from_seq(unsigned long seq)
return seq % NR_HIST_GENS;
}

+static inline int lru_tier_from_refs(int refs)
+{
+ VM_BUG_ON(refs > BIT(LRU_REFS_WIDTH));
+
+ /* see the comment on MAX_NR_TIERS */
+ return order_base_2(refs + 1);
+}
+
static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
{
unsigned long max_seq = lruvec->lrugen.max_seq;
@@ -243,6 +251,8 @@ static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio,
gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;

new_flags &= ~LRU_GEN_MASK;
+ if ((new_flags & LRU_REFS_FLAGS) != LRU_REFS_FLAGS)
+ new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
/* for shrink_page_list() */
if (reclaiming)
new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 94af12507788..8f1262bb815a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -336,6 +336,25 @@ struct page_vma_mapped_walk;
#define MIN_NR_GENS 2U
#define MAX_NR_GENS ((unsigned int)CONFIG_NR_LRU_GENS)

+/*
+ * Each generation is divided into multiple tiers. Tiers represent different
+ * ranges of numbers of accesses thru file descriptors. A page accessed N times
+ * thru file descriptors is in tier order_base_2(N). A page in the first tier
+ * (N=0,1) is marked by PG_referenced unless it was faulted in thru page tables
+ * or read ahead. A page in any other tier (N>1) is marked by PG_referenced and
+ * PG_workingset. Additional bits in folio->flags are required to support more
+ * than two tiers.
+ *
+ * In contrast to moving across generations (promotion), moving across tiers
+ * only requires operations on folio->flags and therefore has a negligible cost
+ * in the buffered access path. In the eviction path, comparisons of
+ * refaulted/(evicted+promoted) from the first tier and the rest infer whether
+ * pages accessed multiple times thru file descriptors are statistically hot
+ * and thus worth promoting.
+ */
+#define MAX_NR_TIERS ((unsigned int)CONFIG_TIERS_PER_GEN)
+#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset))
+
/* whether to keep historical stats for evicted generations */
#ifdef CONFIG_LRU_GEN_STATS
#define NR_HIST_GENS ((unsigned int)CONFIG_NR_LRU_GENS)
@@ -354,6 +373,15 @@ struct lru_gen_struct {
struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
/* the sizes of the above lists */
unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+ /* the exponential moving average of refaulted */
+ unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+ /* the exponential moving average of evicted+promoted */
+ unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+ /* the first tier doesn't need promotion, hence the minus one */
+ unsigned long promoted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+ /* can be modified without holding the lru lock */
+ atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+ atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
/* whether the multigenerational lru is enabled */
bool enabled;
};
diff --git a/mm/swap.c b/mm/swap.c
index d7dde3b7d4b5..ae8d56848602 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -407,6 +407,43 @@ static void __lru_cache_activate_folio(struct folio *folio)
local_unlock(&lru_pvecs.lock);
}

+#ifdef CONFIG_LRU_GEN
+static void folio_inc_refs(struct folio *folio)
+{
+ unsigned long refs;
+ unsigned long old_flags, new_flags;
+
+ if (folio_test_unevictable(folio))
+ return;
+
+ /* see the comment on MAX_NR_TIERS */
+ do {
+ new_flags = old_flags = READ_ONCE(folio->flags);
+
+ if (!(new_flags & BIT(PG_referenced))) {
+ new_flags |= BIT(PG_referenced);
+ continue;
+ }
+
+ if (!(new_flags & BIT(PG_workingset))) {
+ new_flags |= BIT(PG_workingset);
+ continue;
+ }
+
+ refs = new_flags & LRU_REFS_MASK;
+ refs = min(refs + BIT(LRU_REFS_PGOFF), LRU_REFS_MASK);
+
+ new_flags &= ~LRU_REFS_MASK;
+ new_flags |= refs;
+ } while (new_flags != old_flags &&
+ cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+}
+#else
+static void folio_inc_refs(struct folio *folio)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
/*
* Mark a page as having seen activity.
*
@@ -419,6 +456,11 @@ static void __lru_cache_activate_folio(struct folio *folio)
*/
void folio_mark_accessed(struct folio *folio)
{
+ if (lru_gen_enabled()) {
+ folio_inc_refs(folio);
+ return;
+ }
+
if (!folio_test_referenced(folio)) {
folio_set_referenced(folio);
} else if (folio_test_unevictable(folio)) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fbf1337a1632..b232f711dbdb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -128,6 +128,13 @@ struct scan_control {
/* Always discard instead of demoting to lower tier memory */
unsigned int no_demotion:1;

+#ifdef CONFIG_LRU_GEN
+ /* help make better choices when multiple memcgs are eligible */
+ unsigned int memcgs_need_aging:1;
+ unsigned int memcgs_need_swapping:1;
+ unsigned int memcgs_avoid_swapping:1;
+#endif
+
/* Allocation order */
s8 order;

@@ -1288,9 +1295,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,

if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
- mem_cgroup_swapout(page, swap);
+
+ /* get a shadow entry before mem_cgroup_swapout() clears memcg_data */
if (reclaimed && !mapping_exiting(mapping))
shadow = workingset_eviction(page, target_memcg);
+ mem_cgroup_swapout(page, swap);
__delete_from_swap_cache(page, swap, shadow);
xa_unlock_irq(&mapping->i_pages);
put_swap_page(page, swap);
@@ -2729,6 +2738,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
unsigned long file;
struct lruvec *target_lruvec;

+ if (lru_gen_enabled())
+ return;
+
target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);

/*
@@ -3075,6 +3087,17 @@ static int folio_lru_gen(struct folio *folio)
return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}

+static int folio_lru_tier(struct folio *folio)
+{
+ int refs;
+ unsigned long flags = READ_ONCE(folio->flags);
+
+ refs = (flags & LRU_REFS_FLAGS) == LRU_REFS_FLAGS ?
+ ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + 1 : 0;
+
+ return lru_tier_from_refs(refs);
+}
+
static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
{
struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3453,6 +3476,92 @@ static bool get_next_mm(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
return last;
}

+/******************************************************************************
+ * refault feedback loop
+ ******************************************************************************/
+
+/*
+ * A feedback loop based on Proportional-Integral-Derivative (PID) controller.
+ *
+ * The P term is refaulted/(evicted+promoted) from a tier in the generation
+ * currently being evicted; the I term is the exponential moving average of the
+ * P term over the generations previously evicted, using the smoothing factor
+ * 1/2; the D term isn't used.
+ *
+ * The setpoint (SP) is always the first tier of one type; the process variable
+ * (PV) is either any tier of the other type or any other tier of the same
+ * type.
+ *
+ * The error is the difference between the SP and the PV; the correction is
+ * to turn off promotion when SP>PV or to turn on promotion when SP<PV.
+ */
+struct ctrl_pos {
+ unsigned long refaulted;
+ unsigned long total;
+ int gain;
+};
+
+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
+ struct ctrl_pos *pos)
+{
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+ pos->refaulted = lrugen->avg_refaulted[type][tier] +
+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+ pos->total = lrugen->avg_total[type][tier] +
+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
+ if (tier)
+ pos->total += lrugen->promoted[hist][type][tier - 1];
+ pos->gain = gain;
+}
+
+static void reset_ctrl_pos(struct lruvec *lruvec, int gen, int type)
+{
+ int tier;
+ int hist = lru_hist_from_seq(gen);
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ bool carryover = gen == lru_gen_from_seq(lrugen->min_seq[type]);
+ bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
+
+ if (!carryover && !clear)
+ return;
+
+ for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+ if (carryover) {
+ unsigned long sum;
+
+ sum = lrugen->avg_refaulted[type][tier] +
+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+ WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+
+ sum = lrugen->avg_total[type][tier] +
+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
+ if (tier)
+ sum += lrugen->promoted[hist][type][tier - 1];
+ WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+ }
+
+ if (clear) {
+ atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
+ atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
+ if (tier)
+ WRITE_ONCE(lrugen->promoted[hist][type][tier - 1], 0);
+ }
+ }
+}
+
+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
+{
+ /*
+ * Return true if the PV has a limited number of refaults or a lower
+ * refaulted/total than the SP.
+ */
+ return pv->refaulted < MIN_LRU_BATCH ||
+ pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
+ (sp->refaulted + 1) * pv->total * pv->gain;
+}
+
/******************************************************************************
* the aging
******************************************************************************/
@@ -3476,6 +3585,7 @@ static void folio_update_gen(struct folio *folio, struct lru_gen_mm_walk *walk)

new_flags &= ~LRU_GEN_MASK;
new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+ new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
} while (new_flags != old_flags &&
cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);

@@ -3508,6 +3618,7 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai

new_flags &= ~LRU_GEN_MASK;
new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+ new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
/* for folio_end_writeback() */
if (reclaiming)
new_flags |= BIT(PG_reclaim);
@@ -3961,6 +4072,8 @@ static void inc_min_seq(struct lruvec *lruvec)
if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
continue;

+ gen = lru_gen_from_seq(lrugen->min_seq[type]);
+ reset_ctrl_pos(lruvec, gen, type);
WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
}
}
@@ -3999,6 +4112,8 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
if (min_seq[type] == lrugen->min_seq[type])
continue;

+ gen = lru_gen_from_seq(lrugen->min_seq[type]);
+ reset_ctrl_pos(lruvec, gen, type);
WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
success = true;
}
@@ -4039,6 +4154,9 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
}
}

+ for (type = 0; type < ANON_AND_FILE; type++)
+ reset_ctrl_pos(lruvec, next, type);
+
WRITE_ONCE(lrugen->timestamps[next], jiffies);
/* make sure preceding modifications appear */
smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
@@ -4189,6 +4307,22 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)

VM_BUG_ON(!current_is_kswapd());

+ /*
+ * To avoid the aging path and reduce the chance of swapping, which can
+ * be costly, optimistically skip them unless their corresponding flags
+ * were cleared in the eviction path. This improves the overall
+ * performance when multiple memcgs are eligible.
+ */
+ if (!sc->memcgs_need_aging) {
+ sc->memcgs_need_aging = 1;
+ sc->memcgs_avoid_swapping = !sc->memcgs_need_swapping;
+ sc->memcgs_need_swapping = 1;
+ return;
+ }
+
+ sc->memcgs_need_swapping = 1;
+ sc->memcgs_avoid_swapping = 1;
+
current->reclaim_state->mm_walk = &pgdat->mm_walk;

memcg = mem_cgroup_iter(NULL, NULL, NULL);
@@ -4316,6 +4450,429 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
bitmap_zero(walk->bitmap, MIN_LRU_BATCH);
}

+/******************************************************************************
+ * the eviction
+ ******************************************************************************/
+
+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
+{
+ bool success;
+ int gen = folio_lru_gen(folio);
+ int type = folio_is_file_lru(folio);
+ int zone = folio_zonenum(folio);
+ int tier = folio_lru_tier(folio);
+ int delta = folio_nr_pages(folio);
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
+
+ if (!folio_evictable(folio)) {
+ success = lru_gen_del_folio(lruvec, folio, true);
+ VM_BUG_ON_FOLIO(!success, folio);
+ folio_set_unevictable(folio);
+ lruvec_add_folio(lruvec, folio);
+ __count_vm_events(UNEVICTABLE_PGCULLED, delta);
+ return true;
+ }
+
+ if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
+ success = lru_gen_del_folio(lruvec, folio, true);
+ VM_BUG_ON_FOLIO(!success, folio);
+ folio_set_swapbacked(folio);
+ lruvec_add_folio_tail(lruvec, folio);
+ return true;
+ }
+
+ if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+ return true;
+ }
+
+ if (tier > tier_idx) {
+ int hist = lru_hist_from_seq(gen);
+
+ gen = folio_inc_gen(lruvec, folio, false);
+ list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+ WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
+ lrugen->promoted[hist][type][tier - 1] + delta);
+ __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+ return true;
+ }
+
+ if (folio_test_writeback(folio) || (type && folio_test_dirty(folio))) {
+ gen = folio_inc_gen(lruvec, folio, true);
+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+ return true;
+ }
+
+ return false;
+}
+
+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
+{
+ bool success;
+
+ if (!sc->may_unmap && folio_mapped(folio))
+ return false;
+
+ if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+ (folio_test_dirty(folio) ||
+ (folio_test_anon(folio) && !folio_test_swapcache(folio))))
+ return false;
+
+ if (!folio_try_get(folio))
+ return false;
+
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return false;
+ }
+
+ success = lru_gen_del_folio(lruvec, folio, true);
+ VM_BUG_ON_FOLIO(!success, folio);
+
+ return true;
+}
+
+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
+ int type, int tier, struct list_head *list)
+{
+ int gen, zone;
+ enum vm_event_item item;
+ int sorted = 0;
+ int scanned = 0;
+ int isolated = 0;
+ int remaining = MAX_LRU_BATCH;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+ VM_BUG_ON(!list_empty(list));
+
+ if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
+ return 0;
+
+ gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+ for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+ LIST_HEAD(moved);
+ int skipped = 0;
+ struct list_head *head = &lrugen->lists[gen][type][zone];
+
+ while (!list_empty(head)) {
+ struct folio *folio = lru_to_folio(head);
+ int delta = folio_nr_pages(folio);
+
+ VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+ VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+ VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+ VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+ scanned += delta;
+
+ if (sort_folio(lruvec, folio, tier))
+ sorted += delta;
+ else if (isolate_folio(lruvec, folio, sc)) {
+ list_add(&folio->lru, list);
+ isolated += delta;
+ } else {
+ list_move(&folio->lru, &moved);
+ skipped += delta;
+ }
+
+ if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH)
+ break;
+ }
+
+ if (skipped) {
+ list_splice(&moved, head);
+ __count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
+ }
+
+ if (!remaining || isolated >= MIN_LRU_BATCH)
+ break;
+ }
+
+ item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+ if (!cgroup_reclaim(sc)) {
+ __count_vm_events(item, isolated);
+ __count_vm_events(PGREFILL, sorted);
+ }
+ __count_memcg_events(memcg, item, isolated);
+ __count_memcg_events(memcg, PGREFILL, sorted);
+ __count_vm_events(PGSCAN_ANON + type, isolated);
+
+ /*
+ * There might not be eligible pages due to reclaim_idx, may_unmap and
+ * may_writepage. Check the remaining to prevent livelock if there is no
+ * progress.
+ */
+ return isolated || !remaining ? scanned : 0;
+}
+
+static int get_tier_idx(struct lruvec *lruvec, int type)
+{
+ int tier;
+ struct ctrl_pos sp, pv;
+
+ /*
+ * To leave a margin for fluctuations, use a larger gain factor (1:2).
+ * This value is chosen because any other tier would have at least twice
+ * as many refaults as the first tier.
+ */
+ read_ctrl_pos(lruvec, type, 0, 1, &sp);
+ for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+ read_ctrl_pos(lruvec, type, tier, 2, &pv);
+ if (!positive_ctrl_err(&sp, &pv))
+ break;
+ }
+
+ return tier - 1;
+}
+
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
+{
+ int type, tier;
+ struct ctrl_pos sp, pv;
+ int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+
+ /*
+ * Compare the first tier of anon with that of file to determine which
+ * type to scan. Also need to compare other tiers of the selected type
+ * with the first tier of the other type to determine the last tier (of
+ * the selected type) to evict.
+ */
+ read_ctrl_pos(lruvec, 0, 0, gain[0], &sp);
+ read_ctrl_pos(lruvec, 1, 0, gain[1], &pv);
+ type = positive_ctrl_err(&sp, &pv);
+
+ read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
+ for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+ read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
+ if (!positive_ctrl_err(&sp, &pv))
+ break;
+ }
+
+ *tier_idx = tier - 1;
+
+ return type;
+}
+
+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+ int *type_scanned, struct list_head *list)
+{
+ int i;
+ int type;
+ int scanned;
+ int tier = -1;
+ DEFINE_MIN_SEQ(lruvec);
+
+ VM_BUG_ON(!seq_is_valid(lruvec));
+
+ /*
+ * Try to make the obvious choice first. When anon and file are both
+ * available from the same generation, interpret swappiness 1 as file
+ * first and 200 as anon first.
+ */
+ if (!swappiness)
+ type = 1;
+ else if (min_seq[0] < min_seq[1])
+ type = 0;
+ else if (swappiness == 1)
+ type = 1;
+ else if (swappiness == 200)
+ type = 0;
+ else
+ type = get_type_to_scan(lruvec, swappiness, &tier);
+
+ for (i = !swappiness; i < ANON_AND_FILE; i++) {
+ if (tier < 0)
+ tier = get_tier_idx(lruvec, type);
+
+ scanned = scan_folios(lruvec, sc, type, tier, list);
+ if (scanned)
+ break;
+
+ type = !type;
+ tier = -1;
+ }
+
+ *type_scanned = type;
+
+ return scanned;
+}
+
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+ bool *swapped)
+{
+ int type;
+ int scanned;
+ int reclaimed;
+ LIST_HEAD(list);
+ struct folio *folio;
+ enum vm_event_item item;
+ struct reclaim_stat stat;
+ struct lru_gen_mm_walk *walk;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+ spin_lock_irq(&lruvec->lru_lock);
+
+ scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
+
+ if (try_to_inc_min_seq(lruvec, swappiness))
+ scanned++;
+
+ if (get_nr_gens(lruvec, 1) == MIN_NR_GENS)
+ scanned = 0;
+
+ spin_unlock_irq(&lruvec->lru_lock);
+
+ if (list_empty(&list))
+ return scanned;
+
+ reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
+
+ /*
+ * To avoid livelock, don't add rejected pages back to the same lists
+ * they were isolated from.
+ */
+ list_for_each_entry(folio, &list, lru) {
+ if (!folio_test_reclaim(folio) ||
+ !(folio_test_dirty(folio) || folio_test_writeback(folio)))
+ folio_set_active(folio);
+
+ folio_clear_referenced(folio);
+ folio_clear_workingset(folio);
+ }
+
+ spin_lock_irq(&lruvec->lru_lock);
+
+ move_pages_to_lru(lruvec, &list);
+
+ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+ if (walk && walk->batched)
+ reset_batch_size(lruvec, walk);
+
+ item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+ if (!cgroup_reclaim(sc))
+ __count_vm_events(item, reclaimed);
+ __count_memcg_events(memcg, item, reclaimed);
+ __count_vm_events(PGSTEAL_ANON + type, reclaimed);
+
+ spin_unlock_irq(&lruvec->lru_lock);
+
+ mem_cgroup_uncharge_list(&list);
+ free_unref_page_list(&list);
+
+ sc->nr_reclaimed += reclaimed;
+
+ if (!type && swapped)
+ *swapped = true;
+
+ return scanned;
+}
+
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
+{
+ bool need_aging;
+ long nr_to_scan;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ DEFINE_MAX_SEQ(lruvec);
+ DEFINE_MIN_SEQ(lruvec);
+
+ if (mem_cgroup_below_min(memcg) ||
+ (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+ return 0;
+
+ nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, sc, can_swap, &need_aging);
+ if (!nr_to_scan)
+ return 0;
+
+ nr_to_scan >>= sc->priority;
+
+ if (!mem_cgroup_online(memcg))
+ nr_to_scan++;
+
+ if (!nr_to_scan)
+ return 0;
+
+ if (current_is_kswapd()) {
+ /* leave the work to lru_gen_age_node() */
+ if (need_aging)
+ return 0;
+
+ sc->memcgs_need_aging = 0;
+ return nr_to_scan;
+ }
+
+ if (max_seq >= min_seq[1] + MIN_NR_GENS)
+ return nr_to_scan;
+
+ /* try slab and other memcgs before going to the aging path */
+ if (!sc->force_deactivate) {
+ sc->skipped_deactivate = 1;
+ return 0;
+ }
+
+ return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? nr_to_scan : 0;
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+ struct blk_plug plug;
+ long scanned = 0;
+ bool swapped = false;
+ unsigned long reclaimed = sc->nr_reclaimed;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+ lru_add_drain();
+
+ if (current_is_kswapd())
+ current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
+ blk_start_plug(&plug);
+
+ while (true) {
+ int delta;
+ int swappiness;
+ long nr_to_scan;
+
+ if (sc->may_swap)
+ swappiness = get_swappiness(memcg);
+ else if (!cgroup_reclaim(sc) && get_swappiness(memcg))
+ swappiness = 1;
+ else
+ swappiness = 0;
+
+ nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
+ if (!nr_to_scan)
+ break;
+
+ delta = evict_folios(lruvec, sc, swappiness, &swapped);
+ if (!delta)
+ break;
+
+ if (sc->memcgs_avoid_swapping && swappiness < 200 && swapped)
+ break;
+
+ scanned += delta;
+ if (scanned >= nr_to_scan) {
+ if (!swapped && sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH)
+ sc->memcgs_need_swapping = 0;
+ break;
+ }
+
+ cond_resched();
+ }
+
+ blk_finish_plug(&plug);
+
+ if (current_is_kswapd())
+ current->reclaim_state->mm_walk = NULL;
+}
+
/******************************************************************************
* state change
******************************************************************************/
@@ -4540,6 +5097,10 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
{
}

+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+}
+
#endif /* CONFIG_LRU_GEN */

static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -4553,6 +5114,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
struct blk_plug plug;
bool scan_adjusted;

+ if (lru_gen_enabled()) {
+ lru_gen_shrink_lruvec(lruvec, sc);
+ return;
+ }
+
get_scan_count(lruvec, sc, nr);

/* Record the original scan target for proportional adjustments later */
@@ -5057,6 +5623,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
struct lruvec *target_lruvec;
unsigned long refaults;

+ if (lru_gen_enabled())
+ return;
+
target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
target_lruvec->refaults[0] = refaults;
diff --git a/mm/workingset.c b/mm/workingset.c
index 8c03afe1d67c..c2e433d76de1 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly;
static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
bool workingset)
{
- eviction >>= bucket_order;
eviction &= EVICTION_MASK;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
@@ -212,10 +211,116 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,

*memcgidp = memcgid;
*pgdat = NODE_DATA(nid);
- *evictionp = entry << bucket_order;
+ *evictionp = entry;
*workingsetp = workingset;
}

+#ifdef CONFIG_LRU_GEN
+
+static int folio_lru_refs(struct folio *folio)
+{
+ unsigned long flags = READ_ONCE(folio->flags);
+
+ BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+
+ /* see the comment on MAX_NR_TIERS */
+ return flags & BIT(PG_workingset) ? (flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF : 0;
+}
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+ int hist, tier;
+ unsigned long token;
+ unsigned long min_seq;
+ struct lruvec *lruvec;
+ struct lru_gen_struct *lrugen;
+ int type = folio_is_file_lru(folio);
+ int refs = folio_lru_refs(folio);
+ int delta = folio_nr_pages(folio);
+ bool workingset = folio_test_workingset(folio);
+ struct mem_cgroup *memcg = folio_memcg(folio);
+ struct pglist_data *pgdat = folio_pgdat(folio);
+
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ lrugen = &lruvec->lrugen;
+ min_seq = READ_ONCE(lrugen->min_seq[type]);
+ token = (min_seq << LRU_REFS_WIDTH) | refs;
+
+ hist = lru_hist_from_seq(min_seq);
+ tier = lru_tier_from_refs(refs + workingset);
+ atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+
+ return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+ int hist, tier, refs;
+ int memcg_id;
+ bool workingset;
+ unsigned long token;
+ unsigned long min_seq;
+ struct lruvec *lruvec;
+ struct lru_gen_struct *lrugen;
+ struct mem_cgroup *memcg;
+ struct pglist_data *pgdat;
+ int type = folio_is_file_lru(folio);
+ int delta = folio_nr_pages(folio);
+
+ unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
+
+ refs = token & (BIT(LRU_REFS_WIDTH) - 1);
+ if (refs && !workingset)
+ return;
+
+ if (folio_pgdat(folio) != pgdat)
+ return;
+
+ rcu_read_lock();
+ memcg = folio_memcg_rcu(folio);
+ if (mem_cgroup_id(memcg) != memcg_id)
+ goto unlock;
+
+ token >>= LRU_REFS_WIDTH;
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ lrugen = &lruvec->lrugen;
+ min_seq = READ_ONCE(lrugen->min_seq[type]);
+ if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
+ goto unlock;
+
+ hist = lru_hist_from_seq(min_seq);
+ tier = lru_tier_from_refs(refs + workingset);
+ atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+ mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
+
+ /*
+ * Count the following two cases as stalls:
+ * 1) For pages accessed thru page tables, hotter pages pushed out hot
+ * pages which refaulted immediately.
+ * 2) For pages accessed thru file descriptors, numbers of accesses
+ * might have been beyond the limit.
+ */
+ if (task_in_lru_fault() || refs + workingset == BIT(LRU_REFS_WIDTH)) {
+ folio_set_workingset(folio);
+ mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
+ }
+unlock:
+ rcu_read_unlock();
+}
+
+#else
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+ return NULL;
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
/**
* workingset_age_nonresident - age non-resident entries as LRU ages
* @lruvec: the lruvec that was aged
@@ -264,10 +369,14 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);

+ if (lru_gen_enabled())
+ return lru_gen_eviction(page_folio(page));
+
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
/* XXX: target_memcg can be NULL, go through lruvec */
memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
eviction = atomic_long_read(&lruvec->nonresident_age);
+ eviction >>= bucket_order;
workingset_age_nonresident(lruvec, thp_nr_pages(page));
return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
}
@@ -297,7 +406,13 @@ void workingset_refault(struct folio *folio, void *shadow)
int memcgid;
long nr;

+ if (lru_gen_enabled()) {
+ lru_gen_refault(folio, shadow);
+ return;
+ }
+
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+ eviction <<= bucket_order;

rcu_read_lock();
/*
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 20:24:02

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v6 8/9] mm: multigenerational lru: user interface

Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.

Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention.
Compared with the size-based approach, e.g., [1], this time-based
approach has the following advantages:
1) It's easier to configure because it's agnostic to applications and
memory sizes.
2) It's more reliable because it's directly wired to the OOM killer.

Add /sys/kernel/debug/lru_gen for working set estimation and proactive
reclaim. Compared with the page table-based approach and the PFN-based
approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
the following advantages:
1) It offers better choices because it's aware of memcgs, NUMA nodes,
shared mappings and unmapped page cache.
2) It's more scalable because it's O(nr_hot_evictable_pages), whereas
the PFN-based approach is O(nr_total_pages).

Add /sys/kernel/debug/lru_gen_full for debugging.

[1] https://lore.kernel.org/lkml/[email protected]/
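
For illustration only, a minimal user-space sketch of driving the interfaces
above; the memcg_id/node_id/max_gen values in the debugfs command are made-up
examples, and error handling is omitted:

#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;

	fputs(val, f);

	return fclose(f);
}

int main(void)
{
	/* runtime kill switch */
	write_str("/sys/kernel/mm/lru_gen/enabled", "1");

	/* thrashing prevention: protect the working set of the last second */
	write_str("/sys/kernel/mm/lru_gen/min_ttl_ms", "1000");

	/* working set estimation: "+ memcg_id node_id max_gen" (example values) */
	write_str("/sys/kernel/debug/lru_gen", "+ 1 0 4");

	return 0;
}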

Signed-off-by: Yu Zhao <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
---
Documentation/vm/index.rst | 1 +
Documentation/vm/multigen_lru.rst | 62 +++++
include/linux/nodemask.h | 1 +
mm/vmscan.c | 415 ++++++++++++++++++++++++++++++
4 files changed, 479 insertions(+)
create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 6f5ffef4b716..f25e755b4ff4 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the
unevictable-lru
z3fold
zsmalloc
+ multigen_lru
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..6f9e0181348b
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,62 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigenerational LRU
+=====================
+
+Quick start
+===========
+Runtime configurations
+----------------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enabled`` if the
+ feature wasn't enabled by default.
+
+Recipes
+=======
+Personal computers
+------------------
+:Thrashing prevention: Write ``N`` to
+ ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
+ ``N`` milliseconds from getting evicted. The OOM killer is invoked if
+ this working set can't be kept in memory. Based on the average human
+ detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
+ lags due to thrashing. Larger values like ``N=3000`` make lags less
+ noticeable at the cost of more OOM kills.
+
+Data centers
+------------
+:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
+ format:
+ ::
+
+ memcg memcg_id memcg_path
+ node node_id
+ min_gen birth_time anon_size file_size
+ ...
+ max_gen birth_time anon_size file_size
+
+ ``min_gen`` is the oldest generation number and ``max_gen`` is the
+ youngest generation number. ``birth_time`` is in milliseconds.
+ ``anon_size`` and ``file_size`` are in pages.
+
+ This file also accepts commands in the following subsections.
+ Multiple command lines are supported, as is concatenation with the
+ delimiters ``,`` and ``;``.
+
+ ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
+ debugging.
+
+:Working set estimation: Write ``+ memcg_id node_id max_gen
+ [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to trigger
+ the aging. It scans PTEs for accessed pages and promotes them to the
+ youngest generation ``max_gen``. Then it creates a new generation
+ ``max_gen+1``. Set ``can_swap`` to 1 to scan for accessed anon pages
+ when swap is off. Set ``full_scan`` to 0 to reduce the overhead as
+ well as the coverage when scanning PTEs.
+
+:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
+ [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to trigger the
+ eviction. It evicts generations less than or equal to ``min_gen``.
+ ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
+ ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
+ ``nr_to_reclaim`` to limit the number of pages to evict.
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 567c3ddba2c4..90840c459abc 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state)
#define first_online_node 0
#define first_memory_node 0
#define next_online_node(nid) (MAX_NUMNODES)
+#define next_memory_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1U
#define nr_online_nodes 1U

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b232f711dbdb..20f45ff849fc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -53,6 +53,8 @@
#include <linux/memory.h>
#include <linux/pagewalk.h>
#include <linux/shmem_fs.h>
+#include <linux/ctype.h>
+#include <linux/debugfs.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -5021,6 +5023,413 @@ static void lru_gen_change_state(bool enable)
mem_hotplug_done();
}

+/******************************************************************************
+ * sysfs interface
+ ******************************************************************************/
+
+static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
+}
+
+static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buf, size_t len)
+{
+ unsigned int msecs;
+
+ if (kstrtouint(buf, 10, &msecs))
+ return -EINVAL;
+
+ WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs));
+
+ return len;
+}
+
+static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR(
+ min_ttl_ms, 0644, show_min_ttl, store_min_ttl
+);
+
+static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+ return snprintf(buf, PAGE_SIZE, "%d\n", lru_gen_enabled());
+}
+
+static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buf, size_t len)
+{
+ bool enable;
+
+ if (kstrtobool(buf, &enable))
+ return -EINVAL;
+
+ lru_gen_change_state(enable);
+
+ return len;
+}
+
+static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
+ enabled, 0644, show_enable, store_enable
+);
+
+static struct attribute *lru_gen_attrs[] = {
+ &lru_gen_min_ttl_attr.attr,
+ &lru_gen_enabled_attr.attr,
+ NULL
+};
+
+static struct attribute_group lru_gen_attr_group = {
+ .name = "lru_gen",
+ .attrs = lru_gen_attrs,
+};
+
+/******************************************************************************
+ * debugfs interface
+ ******************************************************************************/
+
+static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos)
+{
+ struct mem_cgroup *memcg;
+ loff_t nr_to_skip = *pos;
+
+ m->private = kvmalloc(PATH_MAX, GFP_KERNEL);
+ if (!m->private)
+ return ERR_PTR(-ENOMEM);
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (!nr_to_skip--)
+ return get_lruvec(memcg, nid);
+ }
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ return NULL;
+}
+
+static void lru_gen_seq_stop(struct seq_file *m, void *v)
+{
+ if (!IS_ERR_OR_NULL(v))
+ mem_cgroup_iter_break(NULL, lruvec_memcg(v));
+
+ kvfree(m->private);
+ m->private = NULL;
+}
+
+static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ int nid = lruvec_pgdat(v)->node_id;
+ struct mem_cgroup *memcg = lruvec_memcg(v);
+
+ ++*pos;
+
+ nid = next_memory_node(nid);
+ if (nid == MAX_NUMNODES) {
+ memcg = mem_cgroup_iter(NULL, memcg, NULL);
+ if (!memcg)
+ return NULL;
+
+ nid = first_memory_node;
+ }
+
+ return get_lruvec(memcg, nid);
+}
+
+static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
+ unsigned long max_seq, unsigned long *min_seq,
+ unsigned long seq)
+{
+ int i;
+ int type, tier;
+ int hist = lru_hist_from_seq(seq);
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+ seq_printf(m, " %10d", tier);
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ unsigned long n[3] = {};
+
+ if (seq == max_seq) {
+ n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
+ n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
+
+ seq_printf(m, " %10luR %10luT %10lu ", n[0], n[1], n[2]);
+ } else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
+ n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+ n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]);
+ if (tier)
+ n[2] = READ_ONCE(lrugen->promoted[hist][type][tier - 1]);
+
+ seq_printf(m, " %10lur %10lue %10lup", n[0], n[1], n[2]);
+ } else
+ seq_puts(m, " 0 0 0 ");
+ }
+ seq_putc(m, '\n');
+ }
+
+ seq_puts(m, " ");
+ for (i = 0; i < NR_MM_STATS; i++) {
+ if (seq == max_seq && NR_HIST_GENS == 1)
+ seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+ toupper(MM_STAT_CODES[i]));
+ else if (seq != max_seq && NR_HIST_GENS > 1)
+ seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+ MM_STAT_CODES[i]);
+ else
+ seq_puts(m, " 0 ");
+ }
+ seq_putc(m, '\n');
+}
+
+static int lru_gen_seq_show(struct seq_file *m, void *v)
+{
+ unsigned long seq;
+ bool full = !debugfs_real_fops(m->file)->write;
+ struct lruvec *lruvec = v;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ int nid = lruvec_pgdat(lruvec)->node_id;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ DEFINE_MAX_SEQ(lruvec);
+ DEFINE_MIN_SEQ(lruvec);
+
+ if (nid == first_memory_node) {
+ const char *path = memcg ? m->private : "";
+
+#ifdef CONFIG_MEMCG
+ if (memcg)
+ cgroup_path(memcg->css.cgroup, m->private, PATH_MAX);
+#endif
+ seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path);
+ }
+
+ seq_printf(m, " node %5d\n", nid);
+
+ if (!full)
+ seq = min_seq[0];
+ else if (max_seq >= MAX_NR_GENS)
+ seq = max_seq - MAX_NR_GENS + 1;
+ else
+ seq = 0;
+
+ for (; seq <= max_seq; seq++) {
+ int gen, type, zone;
+ unsigned int msecs;
+
+ gen = lru_gen_from_seq(seq);
+ msecs = jiffies_to_msecs(jiffies - READ_ONCE(lrugen->timestamps[gen]));
+
+ seq_printf(m, " %10lu %10u", seq, msecs);
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ long size = 0;
+
+ if (seq < min_seq[type]) {
+ seq_puts(m, " -0 ");
+ continue;
+ }
+
+ for (zone = 0; zone < MAX_NR_ZONES; zone++)
+ size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+ seq_printf(m, " %10lu ", max(size, 0L));
+ }
+
+ seq_putc(m, '\n');
+
+ if (full)
+ lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq);
+ }
+
+ return 0;
+}
+
+static const struct seq_operations lru_gen_seq_ops = {
+ .start = lru_gen_seq_start,
+ .stop = lru_gen_seq_stop,
+ .next = lru_gen_seq_next,
+ .show = lru_gen_seq_show,
+};
+
+static int run_aging(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+ bool can_swap, bool full_scan)
+{
+ DEFINE_MAX_SEQ(lruvec);
+
+ if (seq == max_seq)
+ try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, full_scan);
+
+ return seq > max_seq ? -EINVAL : 0;
+}
+
+static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+ int swappiness, unsigned long nr_to_reclaim)
+{
+ struct blk_plug plug;
+ int err = -EINTR;
+ DEFINE_MAX_SEQ(lruvec);
+
+ if (max_seq < seq + MIN_NR_GENS)
+ return -EINVAL;
+
+ sc->nr_reclaimed = 0;
+
+ blk_start_plug(&plug);
+
+ while (!signal_pending(current)) {
+ DEFINE_MIN_SEQ(lruvec);
+
+ if (seq < min_seq[!swappiness] || sc->nr_reclaimed >= nr_to_reclaim ||
+ !evict_folios(lruvec, sc, swappiness, NULL)) {
+ err = 0;
+ break;
+ }
+
+ cond_resched();
+ }
+
+ blk_finish_plug(&plug);
+
+ return err;
+}
+
+static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
+ struct scan_control *sc, int swappiness, unsigned long opt)
+{
+ struct lruvec *lruvec;
+ int err = -EINVAL;
+ struct mem_cgroup *memcg = NULL;
+
+ if (!mem_cgroup_disabled()) {
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(memcg_id);
+#ifdef CONFIG_MEMCG
+ if (memcg && !css_tryget(&memcg->css))
+ memcg = NULL;
+#endif
+ rcu_read_unlock();
+
+ if (!memcg)
+ goto done;
+ }
+ if (memcg_id != mem_cgroup_id(memcg))
+ goto done;
+
+ if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY))
+ goto done;
+
+ lruvec = get_lruvec(memcg, nid);
+
+ if (swappiness < 0)
+ swappiness = get_swappiness(memcg);
+ else if (swappiness > 200)
+ goto done;
+
+ switch (cmd) {
+ case '+':
+ err = run_aging(lruvec, seq, sc, swappiness, opt);
+ break;
+ case '-':
+ err = run_eviction(lruvec, seq, sc, swappiness, opt);
+ break;
+ }
+done:
+ mem_cgroup_put(memcg);
+
+ return err;
+}
+
+static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
+ size_t len, loff_t *pos)
+{
+ void *buf;
+ char *cur, *next;
+ unsigned int flags;
+ int err = 0;
+ struct scan_control sc = {
+ .may_writepage = 1,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .reclaim_idx = MAX_NR_ZONES - 1,
+ .gfp_mask = GFP_KERNEL,
+ };
+
+ buf = kvmalloc(len + 1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ if (copy_from_user(buf, src, len)) {
+ kvfree(buf);
+ return -EFAULT;
+ }
+
+ next = buf;
+ next[len] = '\0';
+
+ sc.reclaim_state.mm_walk = alloc_mm_walk();
+ if (!sc.reclaim_state.mm_walk) {
+ kvfree(buf);
+ return -ENOMEM;
+ }
+
+ flags = memalloc_noreclaim_save();
+ set_task_reclaim_state(current, &sc.reclaim_state);
+
+ while ((cur = strsep(&next, ",;\n"))) {
+ int n;
+ int end;
+ char cmd;
+ unsigned int memcg_id;
+ unsigned int nid;
+ unsigned long seq;
+ unsigned int swappiness = -1;
+ unsigned long opt = -1;
+
+ cur = skip_spaces(cur);
+ if (!*cur)
+ continue;
+
+ n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
+ &seq, &end, &swappiness, &end, &opt, &end);
+ if (n < 4 || cur[end]) {
+ err = -EINVAL;
+ break;
+ }
+
+ err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt);
+ if (err)
+ break;
+ }
+
+ set_task_reclaim_state(current, NULL);
+ memalloc_noreclaim_restore(flags);
+
+ free_mm_walk(sc.reclaim_state.mm_walk);
+ kvfree(buf);
+
+ return err ? : len;
+}
+
+static int lru_gen_seq_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &lru_gen_seq_ops);
+}
+
+static const struct file_operations lru_gen_rw_fops = {
+ .open = lru_gen_seq_open,
+ .read = seq_read,
+ .write = lru_gen_seq_write,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static const struct file_operations lru_gen_ro_fops = {
+ .open = lru_gen_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
/******************************************************************************
* initialization
******************************************************************************/
@@ -5087,6 +5496,12 @@ static int __init init_lru_gen(void)
BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);

+ if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
+ pr_err("lru_gen: failed to create sysfs group\n");
+
+ debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
+ debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
+
return 0;
};
late_initcall(init_lru_gen);
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 20:24:22

by Yu Zhao

[permalink] [raw]
Subject: [PATCH v6 9/9] mm: multigenerational lru: Kconfig

Add configuration options for the multigenerational lru.
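
As a back-of-the-envelope check (not part of the patch), the page-flag cost
stated in the Kconfig help text below works out as follows for the defaults;
order_base_2_demo() is a stand-in for the kernel's order_base_2():

#include <stdio.h>

static int order_base_2_demo(unsigned int n)
{
	int order = 0;

	while (n > (1u << order))
		order++;

	return order;
}

int main(void)
{
	int nr_gens = 4;	/* CONFIG_NR_LRU_GENS default */
	int tiers_per_gen = 4;	/* CONFIG_TIERS_PER_GEN default */

	/* "order_base_2(N+1) bits in page flags" -> 3 bits for 4 generations */
	printf("generation bits: %d\n", order_base_2_demo(nr_gens + 1));

	/* "N-2 bits in page flags" -> 2 bits for 4 tiers */
	printf("tier bits: %d\n", tiers_per_gen - 2);

	return 0;
}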

Signed-off-by: Yu Zhao <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
---
Documentation/vm/multigen_lru.rst | 18 ++++++++++++
mm/Kconfig | 48 +++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+)

diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
index 6f9e0181348b..a54c5637c455 100644
--- a/Documentation/vm/multigen_lru.rst
+++ b/Documentation/vm/multigen_lru.rst
@@ -6,6 +6,13 @@ Multigenerational LRU

Quick start
===========
+Build configurations
+--------------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable this feature by
+ default.
+
Runtime configurations
----------------------
 :Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enabled`` if the
@@ -25,6 +32,17 @@ Personal computers

Data centers
------------
+:Optional: Change ``CONFIG_NR_LRU_GENS`` to a larger value to support
+ more generations for ``Working set estimation`` and
+ ``Proactive reclaim``.
+
+:Optional: Change ``CONFIG_TIERS_PER_GEN`` to a larger value to
+ support more tiers, which generally provide better protection for
+ page cache when under heavy buffered I/O workloads.
+
+:Optional: Set ``CONFIG_LRU_GEN_STATS=y`` to enable full stats for
+ debugging. See ``Debugfs interface``.
+
:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
format:
::
diff --git a/mm/Kconfig b/mm/Kconfig
index 356f4f2c779e..8a33605917f5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -900,6 +900,54 @@ config IO_MAPPING
config SECRETMEM
def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED

+# multigenerational lru {
+config LRU_GEN
+ bool "Multigenerational LRU"
+ depends on MMU
+ # the following options can use up the spare bits in page flags
+ depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
+ help
+ A high performance LRU implementation for memory overcommit. See
+ Documentation/vm/multigen_lru.rst for details.
+
+config LRU_GEN_ENABLED
+ bool "Enable by default"
+ depends on LRU_GEN
+ help
+ This option enables the multigenerational lru by default.
+
+config NR_LRU_GENS
+ int "Max number of generations"
+ depends on LRU_GEN
+ range 4 31
+ default 4
+ help
+ This option uses order_base_2(N+1) bits in page flags.
+
+ Do not configure more generations than you plan to use. They have a
+ per-memcg and per-node memory overhead.
+
+config TIERS_PER_GEN
+ int "Number of tiers per generation"
+ depends on LRU_GEN
+ range 2 5
+ default 4
+ help
+ This option uses N-2 bits in page flags.
+
+ Larger values generally provide better protection for page cache when
+ under heavy buffered I/O workloads.
+
+config LRU_GEN_STATS
+ bool "Full stats for debugging"
+ depends on LRU_GEN
+ help
+ This option keeps historical stats for evicted generations.
+
+ Do not enable full stats unless you plan to look at them. They have a
+ per-memcg and per-node memory overhead.
+# }
+
source "mm/damon/Kconfig"

endmenu
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 20:30:09

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.

<snipped>

> Summary
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
> indicate substantial improvements; there are no known regressions.
> 2. Thrashing prevention, working set estimation and proactive reclaim
> work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
> with similar effects.
>
> Our options, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
> materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [14][15][16], the
> new features will likely be put to use for both personal computers
> and data centers.
> 3. Based on Google's track record, the new code will likely be well
> maintained in the long term. It'd be more difficult if not
> impossible to achieve similar effects on top of the existing
> design.

Hi Andrew, Linus,

Can you please take a look at this patchset and let me know if it's
5.17 material?

My goal is to get it merged asap so that users can reap the benefits
and I can push the sequels. Please examine the data provided -- I
think the unprecedented coverage and the magnitude of the improvements
warrant a green light.

Thanks!

2022-01-04 21:24:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v6 2/9] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG

On Tue, Jan 4, 2022 at 12:23 PM Yu Zhao <[email protected]> wrote:
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5c2ccb85f2ef..5a4843242f09 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -85,6 +85,7 @@ config X86
> + select ARCH_HAS_NONLEAF_PMD_YOUNG if X86_64

Why is this limited to 64-bit?

I'm ok with that - maybe it's a simple case of "this is not worth
doing on 32-bit", but I'd like the explanation to be written out.

Right now the commit message literally points to the architecture manual
that is relevant for both 32-bit and 64-bit - and then the patch
itself makes it 64-bit only.

Linus

2022-01-04 21:34:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v6 4/9] mm: multigenerational lru: groundwork

On Tue, Jan 4, 2022 at 12:23 PM Yu Zhao <[email protected]> wrote:
>

> index a7e4a9e7d807..fadbf8e6abcd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
>
> +#ifdef CONFIG_LRU_GEN
> +static inline void task_enter_lru_fault(void)
> +{
> + WARN_ON_ONCE(current->in_lru_fault);
...

Why are these in this very core header file?

They are used in one single file - mm/memory.c.

They should be just static functions there.

I'm also not sure why the calling convention is

if (lru_fault)
task_enter_lru_fault();

instead of doing just

task_enter_lru_fault(vma);

and having that function do

/* Don't do LRU fault accounting for SEQ/RAND files */
if (unlikely(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ)))
return;

which would seem to be a lot more legible and straightforward.

In fact, you could do it without any conditionals at all, if you just
remove the WARN_ON_ONCE() from the exit path, turning it into just

current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));

for 'enter' and just

current->in_lru_fault = 0;

for exit.

It seems pointless to have that extra variable, and the extra
conditionals, for a case that is probably very unusual indeed.
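
In other words, a minimal sketch of that simplification (assuming the
helpers become static functions in mm/memory.c; names illustrative):

static void task_enter_lru_fault(struct vm_area_struct *vma)
{
	/* don't do LRU fault accounting for SEQ/RAND files */
	current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
}

static void task_exit_lru_fault(void)
{
	current->in_lru_fault = 0;
}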

Linus

2022-01-04 21:40:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v6 9/9] mm: multigenerational lru: Kconfig

On Tue, Jan 4, 2022 at 12:23 PM Yu Zhao <[email protected]> wrote:
>
> Add configuration options for the multigenerational lru.
> def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>
> +config NR_LRU_GENS
> + int "Max number of generations"
> + depends on LRU_GEN
> + range 4 31
> + default 4
> + help
> + This option uses order_base_2(N+1) bits in page flags.
> +
> + Do not configure more generations than you plan to use. They have a
> + per-memcg and per-node memory overhead.
> +
> +config TIERS_PER_GEN
> + int "Number of tiers per generation"
> + depends on LRU_GEN
> + range 2 5
> + default 4
> + help
> + This option uses N-2 bits in page flags.
> +
> + Larger values generally provide better protection for page cache when
> + under heavy buffered I/O workloads.

These are not appropriate questions to ask users.

No user has any idea what the answer should be. And no, we don't add
"benchmark tuning Kconfig questions" to the kernel. We leave those
kinds of games to companies that need to fake their benchmark numbers.

If *you* can't give a good number for these config options, then no
user or distro can either.

So just pick a number, and stand by it.

Don't do this kind of "I don't know what the right number is, so I'll
just push the blame on the user".
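
For reference, a sketch of that alternative, assuming the two values simply
become constants in mm/vmscan.c (the names and defaults here are only
illustrative, taken from the help texts above):

#define MAX_NR_GENS	4U	/* uses order_base_2(N + 1) bits in page flags */
#define MAX_NR_TIERS	4U	/* uses N - 2 bits in page flags */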

Linus

2022-01-04 21:43:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 4, 2022 at 12:30 PM Yu Zhao <[email protected]> wrote:
>
> My goal is to get it merged asap so that users can reap the benefits
> and I can push the sequels. Please examine the data provided -- I
> think the unprecedented coverage and the magnitude of the improvements
> warrant a green light.

I'll leave this to Andrew. I had some stylistic nits, but all the
actual complexity is in that aging and eviction, and while I looked at
the patches, I certainly couldn't make much of a judgement on them.

The proof is in the numbers, and they look fine, but who knows what
happens when others test it. I don't see anything that looks worrisome
per se, I just see the silly small things that made me go "Eww".

Linus

2022-01-05 03:34:41

by Shuang Zhai

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

Fio / pmem benchmark with MGLRU

TLDR
====
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random
access, Zipfian (distribution) access and Gaussian (distribution)
access, when the average number of jobs per CPU is 1; 95% CIs
[42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput,
respectively, for random access, Zipfian access and Gaussian access,
when the average number of jobs per CPU is 2.

Background
==========
Many applications running on warehouse-scale computers heavily use
POSIX read(2)/write(2) and page cache, e.g., Apache Kafka, a
distributed streaming application used by "more than 80% of all
Fortune 100 companies" [1] and PostgreSQL, "the world's most advanced
open source relational database" [2].

Intel DC Persistent Memory, as an affordable alternative to DRAM, can
deliver large capacity and data persistence. Specifically, the device
used in this benchmark can achieve up to 36 GiB/s and 15 GiB/s
throughput, respectively, for sequential and random read access.

Our research group at the University of Rochester focuses on the
intersection of computer architecture and system software. My current
research interest is memory management on tiered memory systems.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.15
* Patched: 5.15 + MGLRU

Access patterns (4KB read):
* Random (uniform)
* Zipfian (theta 0.8; the recommended range is 0-2)
* Gaussian (deviation 40; the possible range is 0-100)

Concurrency conditions (the average number of jobs per CPU):
* 1
* 2

Total file size (GB): 400 (~2x memory capacity)
Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~30

Notes
-----
1. All files were stored on pmem. Each job had exclusive access to
a single file.
2. Due to the hardware limitation when accessing remote pmem [3],
numactl was used to bind the fio processes to the local pmem. Only
one of the two NUMA nodes was used during the benchmark.
3. During dry runs, we observed that the throughput doesn't improve
beyond 2 jobs per CPU for random access. Moreover, the patched
kernel showed consistent improvements over the baseline kernel
when using 3 or 4 jobs per CPU.
4. We wanted to simulate real-world scenarios and therefore used the
default swap configuration (on). Moreover, we didn't observe any
negative impact on performance in dry runs that disabled swap.

Procedure
=========
<for each kernel>
  grub2-reboot <baseline, patched>
  <for each concurrency condition>
    <generate test files>
    <for each access pattern>
      <for each data point>
        <reboot>
        <run fio>

Hardware
--------
Memory (GiB per socket): 192
CPU (# per socket): 40
Pmem (GiB per socket): 768

Fio
---
$ fio -version
fio-3.28

$ numactl --cpubind=0 --membind=0 fio --name=randread \
--directory=/mnt/pmem/ --size={10G, 5G} --io_size=1000TB \
--time_based --numjobs={40, 80} --ioengine=io_uring \
--ramp_time=20m --runtime=10m --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution={random, zipf:0.8, normal:40} \
--direct=0 --norandommap --group_reporting

Results
=======
Throughput
----------
The patched kernel achieved substantially higher throughput for all
three access patterns and two concurrency conditions. Specifically,
comparing the patched with the baseline kernel, fio achieved 95% CIs
[38.95, 40.26]%, [4.12, 6.64]% and [9.26, 10.36]% higher throughput,
respectively, for random access, Zipfian access, and Gaussian access,
when the average number of jobs per CPU is 1; 95% CIs [42.32, 49.15]%,
[9.44, 9.89]% and [20.99, 22.86]% higher throughput, respectively, for
random access, Zipfian access and Gaussian access, when the average
number of jobs per CPU is 2.

+---------------------+---------------+---------------+
| Mean MiB/s [95% CI] | 1 job / CPU | 2 jobs / CPU |
+---------------------+---------------+---------------+
| Random access | 8411 / 11742 | 8417 / 12267 |
| | [3275, 3387] | [3562, 4137] |
+---------------------+---------------+---------------+
| Zipfian access | 14576 / 15360 | 12932 / 14181 |
| | [600, 967] | [1220, 1279] |
+---------------------+---------------+---------------+
| Gaussian access | 14564 / 15993 | 11513 / 14037 |
| | [1348, 1508] | [2417, 2631] |
+---------------------+---------------+---------------+
Table 1. Throughput comparison between the baseline and the patched
kernels

The patched kernel exhibited less degradation in throughput when
running more concurrent jobs. Comparing 2 jobs per CPU with 1 job per
CPU, fio achieved 95% CIs [-11.54, -11.02]%, [-16.91, -12.01]% and
[-21.61, -20.30]% higher throughput, respectively, for random access,
Zipfian access and Gaussian access, when using the baseline kernel;
95% CIs [2.04, 6.92]%, [-8.86, -6.48]% and [-12.83, -11.64]% higher
throughput, respectively, for random access, Zipfian access and
Gaussian access, when using the patched kernel. There were no
statistically significant changes in throughput for the rest of the
test matrix.

+---------------------+-----------------+----------------+
| Mean MiB/s [95% CI] | Baseline kernel | Patched kernel |
+---------------------+-----------------+----------------+
| Random access | 8411 / 8417 | 11741 / 12267 |
| | [-55, 69] | [239, 812] |
+---------------------+-----------------+----------------+
| Zipfian access | 14576 / 12932 | 15360 / 14181 |
| | [-1682, -1607] | [-1361, -996] |
+---------------------+-----------------+----------------+
| Gaussian access | 14565 / 11513 | 15993 / 14037 |
| | [-3147, -2957] | [-2051, -1861] |
+---------------------+-----------------+----------------+
Table 2. Throughput comparison between 1 job per CPU and 2 jobs per
CPU

Tail Latency
------------
Comparing the patched with the baseline kernel, fio experienced 95%
CIs [-41.77, -40.35]% and [6.64, 13.95]% higher latency at the 99th
percentile, respectively, for random access and Gaussian access, when
the average number of jobs per CPU is 1; 95% CIs [-41.97, -40.59]%,
[-47.74, -47.04]% and [-51.32, -50.27]% higher latency at the 99th
percentile, respectively, for random access, Zipfian access and
Gaussian access, when the average number of jobs per CPU is 2. There
were no statistically significant changes in latency at the 99th
percentile for the rest of the test matrix.

+------------------------------+----------------+------------------+
| 99th percentile latency (us) | 1 job / CPU | 2 jobs / CPU |
+------------------------------+----------------+------------------+
| Random access | 12466 / 7347 | 25560 / 15008 |
| | [-5207, -5030] | [-10729, -10375] |
+------------------------------+----------------+------------------+
| Zipfian access | 3395 / 3382 | 14563 / 7661 |
| | [-131, 105] | [-6953, -6850] |
+------------------------------+----------------+------------------+
| Gaussian access | 3280 / 3618 | 15611 / 7681 |
| | [217, 457] | [-8012, -7848] |
+------------------------------+----------------+------------------+
Table 3. Comparison of the 99th percentile latency between the
baseline and the patched kernels (lower is better)

Metrics collected during each run are available at:
https://github.com/zhaishuang1/MglruPerf/tree/master

A peek at 5.16-rc6
------------------
We also ran the benchmark on 5.16-rc6 with swap off. However, we
haven't collected enough data points to establish a 95% CI. Here are
a few numbers we've collected:

+----------------+------------+----------+----------------+----------+
| Access pattern | Jobs / CPU | 5.16-rc6 | 5.16-rc6-mglru | % change |
+----------------+------------+----------+----------------+----------+
| Random access | 1 | 7467 | 10440 | 39.8% |
+----------------+------------+----------+----------------+----------+
| Random access | 2 | 7504 | 13417 | 78.8% |
+----------------+------------+----------+----------------+----------+
| Random access | 3 | 7511 | 13954 | 85.8% |
+----------------+------------+----------+----------------+----------+
| Random access | 4 | 7542 | 13925 | 84.6% |
+----------------+------------+----------+----------------+----------+

Reference
=========
[1] https://kafka.apache.org/documentation/#design_filesystem
[2] https://www.postgresql.org/docs/11/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
[3] System Evaluation of the Intel Optane byte-addressable NVM, MEMSYS 2019.

Appendix
========
Throughput
----------
$ cat raw_data_fio.r
v <- c(
# baseline 40 procs random
8467.89, 8428.34, 8383.32, 8253.12, 8464.65, 8307.42, 8424.78, 8434.44, 8474.88, 8468.26,
# baseline 40 procs zipf
14570.44, 14598.03, 14550.74, 14640.29, 14591.4, 14573.35, 14503.18, 14613.39, 14598.61, 14522.27,
# baseline 40 procs gaussian
14504.95, 14427.23, 14652.19, 14519.47, 14557.97, 14617.92, 14555.87, 14446.94, 14678.12, 14688.33,
# baseline 80 procs random
8427.51, 8267.23, 8437.48, 8432.37, 8441.4, 8454.26, 8413.13, 8412.44, 8444.36, 8444.32,
# baseline 80 procs zipf
12980.12, 12946.43, 12911.95, 12925.83, 12952.75, 12841.44, 12920.35, 12924.19, 12944.38, 12967.72,
# baseline 80 procs gaussian
11666.29, 11624.72, 11454.82, 11482.36, 11462.24, 11379.46, 11691.5, 11471.19, 11402.08, 11494.13,
# patched 40 procs random
11706.69, 11778.1, 11774.07, 11750.07, 11744.97, 11766.65, 11727.79, 11708.41, 11745.3, 11716.45,
# patched 40 procs zipf
15498.31, 14647.94, 15423.35, 15467.32, 15467.05, 15342.49, 15511.34, 15414.06, 15401.1, 15431.57,
# patched 40 procs gaussian
15957.86, 15957.13, 16022.69, 16035.85, 16150.2, 15904.5, 15943.36, 16036.78, 16025.95, 15900.56,
# patched 80 procs random
12568.51, 11772.25, 11622.15, 12057.66, 11971.72, 12693.36, 12399.71, 12553.23, 12242.74, 12793.34,
# patched 80 procs zipf
14194.78, 14213.61, 14148.66, 14182.35, 14183.91, 14192.23, 14163.2, 14179.7, 14162.12, 14196.34,
# patched 80 procs gaussian
14084.86, 13706.34, 14089.42, 14058.4, 14096.74, 14108.06, 14043.41, 14072.15, 14088.44, 14024.51
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (concurr in 1:2) {
  for (dist in 1:3) {
    r <- t.test(a[, dist, concurr, 1], a[, dist, concurr, 2])
    print(r)

    p <- r$conf.int * 100 / r$estimate[1]
    if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
      s <- sprintf("concurr%d dist%d: no significance", concurr, dist)
    } else {
      s <- sprintf("concurr%d dist%d: [%.2f, %.2f]%%", concurr, dist, -p[2], -p[1])
    }
    print(s)
  }
}

# low concurr vs high concurr
for (kern in 1:2) {
  for (dist in 1:3) {
    r <- t.test(a[, dist, 1, kern], a[, dist, 2, kern])
    print(r)

    p <- r$conf.int * 100 / r$estimate[1]
    if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
      s <- sprintf("kern%d dist%d: no significance", kern, dist)
    } else {
      s <- sprintf("kern%d dist%d: [%.2f, %.2f]%%", kern, dist, -p[2], -p[1])
    }
    print(s)
  }
}

$ R -q -s -f raw_data_fio.r

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -132.15, df = 11.177, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3386.514 -3275.766
sample estimates:
mean of x mean of y
8410.71 11741.85

[1] "concurr1 dist1: [38.95, 40.26]%"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -9.5917, df = 9.4797, p-value = 3.463e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-967.8353 -600.7307
sample estimates:
mean of x mean of y
14576.17 15360.45

[1] "concurr1 dist2: [4.12, 6.64]%"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -37.744, df = 17.33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1508.328 -1348.850
sample estimates:
mean of x mean of y
14564.90 15993.49

[1] "concurr1 dist3: [9.26, 10.36]%"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -30.144, df = 9.3334, p-value = 1.281e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4137.381 -3562.653
sample estimates:
mean of x mean of y
8417.45 12267.47

[1] "concurr2 dist1: [42.32, 49.15]%"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -92.164, df = 13.276, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1279.417 -1220.931
sample estimates:
mean of x mean of y
12931.52 14181.69

[1] "concurr2 dist2: [9.44, 9.89]%"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -49.453, df = 17.863, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2631.656 -2417.052
sample estimates:
mean of x mean of y
11512.88 14037.23

[1] "concurr2 dist3: [20.99, 22.86]%"

Welch Two Sample t-test

data: a[, dist, 1, kern] and a[, dist, 2, kern]
t = -0.22947, df = 16.403, p-value = 0.8213
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-68.88155 55.40155
sample estimates:
mean of x mean of y
8410.71 8417.45

[1] "kern1 dist1: no significance"

Welch Two Sample t-test

data: a[, dist, 1, kern] and a[, dist, 2, kern]
t = 91.86, df = 17.875, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1607.021 1682.287
sample estimates:
mean of x mean of y
14576.17 12931.52

[1] "kern1 dist2: [-11.54, -11.02]%"

Welch Two Sample t-test

data: a[, dist, 1, kern] and a[, dist, 2, kern]
t = 67.477, df = 17.539, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2956.815 3147.225
sample estimates:
mean of x mean of y
14564.90 11512.88

[1] "kern1 dist3: [-21.61, -20.30]%"

Welch Two Sample t-test

data: a[, dist, 1, kern] and a[, dist, 2, kern]
t = -4.1443, df = 9.0781, p-value = 0.002459
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-812.1507 -239.0833
sample estimates:
mean of x mean of y
11741.85 12267.47

[1] "kern2 dist1: [2.04, 6.92]%"

Welch Two Sample t-test

data: a[, dist, 1, kern] and a[, dist, 2, kern]
t = 14.566, df = 9.1026, p-value = 1.291e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
996.0064 1361.5196
sample estimates:
mean of x mean of y
15360.45 14181.69

[1] "kern2 dist2: [-8.86, -6.48]%"

Welch Two Sample t-test

data: a[, dist, 1, kern] and a[, dist, 2, kern]
t = 43.826, df = 15.275, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1861.263 2051.247
sample estimates:
mean of x mean of y
15993.49 14037.23

[1] "kern2 dist3: [-12.83, -11.64]%"

99th Percentile Latency
-----------------------
$ cat raw_data_fio_lat.r
v <- c(
# baseline 40 procs random
12649, 12387, 12518, 12518, 12518, 12387, 12518, 12518, 12387, 12256,
# baseline 40 procs zipf
3458, 3294, 3425, 3294, 3294, 3359, 3752, 3326, 3294, 3458,
# baseline 40 procs gaussian
3326, 3458, 3195, 3392, 3326, 3228, 3228, 3326, 3130, 3195,
# baseline 80 procs random
25560, 26084, 25560, 25560, 25297, 25297, 25822, 25560, 25560, 25297,
# baseline 80 procs zipf
14484, 14615, 14615, 14484, 14484, 14615, 14615, 14615, 14615, 14484,
# baseline 80 procs gaussian
15664, 15664, 15533, 15533, 15533, 15664, 15795, 15533, 15664, 15533,
# patched 40 procs random
7439, 7242, 7373, 7373, 7373, 7439, 7242, 7308, 7308, 7373,
# patched 40 procs zipf
3261, 3425, 3392, 3294, 3359, 3556, 3228, 3490, 3458, 3359,
# patched 40 procs gaussian
3687, 3523, 3556, 3523, 3752, 3654, 3884, 3490, 3392, 3720,
# patched 80 procs random
15008, 15008, 15008, 15008, 15008, 15008, 15008, 15008, 15008, 15008,
# patched 80 procs zipf
7701, 7635, 7701, 7701, 7635, 7635, 7701, 7635, 7635, 7635,
# patched 80 procs gaussian
7635, 7898, 7701, 7635, 7635, 7635, 7635, 7635, 7701, 7701
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (concurr in 1:2) {
  for (dist in 1:3) {
    r <- t.test(a[, dist, concurr, 1], a[, dist, concurr, 2])
    print(r)

    p <- r$conf.int * 100 / r$estimate[1]
    if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
      s <- sprintf("concurr%d dist%d: no significance", concurr, dist)
    } else {
      s <- sprintf("concurr%d dist%d: [%.2f, %.2f]%%", concurr, dist, -p[2], -p[1])
    }
    print(s)
  }
}

$ R -q -s -f raw_data_fio_lat.r

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 123.52, df = 15.287, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
5030.417 5206.783
sample estimates:
mean of x mean of y
12465.6 7347.0

[1] "concurr1 dist1: [-41.77, -40.35]%"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 0.23667, df = 16.437, p-value = 0.8158
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-104.7812 131.1812
sample estimates:
mean of x mean of y
3395.4 3382.2

[1] "concurr1 dist2: no significance"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -5.9754, df = 16.001, p-value = 1.94e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-457.5065 -217.8935
sample estimates:
mean of x mean of y
3280.4 3618.1

[1] "concurr1 dist3: [6.64, 13.95]%"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 134.89, df = 9, p-value = 3.437e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
10374.74 10728.66
sample estimates:
mean of x mean of y
25559.7 15008.0

[1] "concurr2 dist1: [-41.97, -40.59]%"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 288.1, df = 13.292, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6849.566 6952.834
sample estimates:
mean of x mean of y
14562.6 7661.4

[1] "concurr2 dist2: [-47.74, -47.04]%"

Welch Two Sample t-test

data: a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 203.64, df = 17.798, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
7848.616 8012.384
sample estimates:
mean of x mean of y
15611.6 7681.1

[1] "concurr2 dist3: [-51.32, -50.27]%"

2022-01-05 08:55:42

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

Hi Yu,

On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <[email protected]> wrote:

> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.
>
[...]
> Summary
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
> indicate substantial improvements; there are no known regressions.

So impressive results!

> 2. Thrashing prevention, working set estimation and proactive reclaim
> work out of the box; there are no equivalent solutions.

I think similar works are already available out of the box with the latest
mainline tree, though it might be suboptimal in some cases.

First, you can do thrashing prevention using DAMON-based Operation Scheme
(DAMOS)[1] with MADV_COLD action. Second, for working set estimation, you can
either use the DAMOS again with statistics action, or the damon_aggregated
tracepoint[2]. The DAMON user space tool[3] helps the tracepoint analysis and
visualization. Finally, for the proactive reclaim, you can again use the DAMOS
with MADV_PAGEOUT action, or simply the DAMON-based proactive reclaim
module (DAMON_RECLAIM)[4].

Nevertheless, as noted above, the current DAMON-based solutions might be
suboptimal in some cases. First of all, DAMON currently doesn't provide page
granularity monitoring. Though its monitoring results were useful for our
users' production usages, there could be different requirements and situations.
Secondly, the DAMON-based thrashing prevention wouldn't reduce the CPU usage of
the reclamation logic's access scanning.

So, to me, the MGLRU patchset looks like it provides something that DAMON
doesn't provide, but also something that DAMON already provides. Specifically,
the efficient page granularity access scanning is what DAMON doesn't provide
for now. However, the use of the access information for LRU list manipulation
(thrashing prevention) and proactive reclamation is similar to what DAMON
(specifically, DAMOS) provides. Also, this patchset reduces the reclamation
logic's CPU usage by using the efficient page granularity access scanning.

IMHO, we might be able to reduce the duplication by integrating MGLRU into
DAMON. What I'm saying is, we could 1) introduce the efficient page granularity
access scanning, 2) reduce the reclamation logic's CPU usage by making it use
the efficient page granularity access scanning, and 3) extend DAMON for page
granularity monitoring with the efficient access scanning[5]. Then, users
could get the benefit of MGLRU by using DAMOS but setting it to use your
efficient page granularity access scanning. To make it simpler, we could
extend existing kernel logic to use DAMON in that way, or implement a new
kernel module. Additional advantages of this approach would be 1) reducing the
changes to the existing code, and 2) making the efficient page granularity
access information usable for more general cases.

Of course, the integration might not be as simple as it seems to me now. We
could put DAMON and MGLRU in together as they are for now, and let users select
what they really want. I think it's up to you.

I didn't read this patchset thoroughly yet, so I might be missing many things. If
so, please feel free to let me know.

[1] https://docs.kernel.org/admin-guide/mm/damon/usage.html#schemes
[2] https://docs.kernel.org/admin-guide/mm/damon/usage.html#tracepoint-for-monitoring-results
[3] https://github.com/awslabs/damo
[4] https://docs.kernel.org/admin-guide/mm/damon/reclaim.html
[5] https://docs.kernel.org/vm/damon/design.html#configurable-layers


Thanks,
SJ

[...]

2022-01-05 10:45:39

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young()

On Tue, Jan 04, 2022 at 01:22:20PM -0700, Yu Zhao wrote:
> Some architectures automatically set the accessed bit in PTEs, e.g.,
> x86 and arm64 v8.2. On architectures that don't have this capability,
> clearing the accessed bit in a PTE usually triggers a page fault
> following the TLB miss of this PTE.
>
> Being aware of this capability can help make better decisions, e.g.,
> whether to spread the work out over a period of time to avoid bursty
> page faults when trying to clear the accessed bit in a large number of
> PTEs.
>
> Signed-off-by: Yu Zhao <[email protected]>
> Tested-by: Konstantin Kharlamov <[email protected]>
> ---
> arch/arm64/include/asm/cpufeature.h | 5 +++++
> arch/arm64/include/asm/pgtable.h | 13 ++++++++-----
> arch/arm64/kernel/cpufeature.c | 19 +++++++++++++++++++
> arch/arm64/tools/cpucaps | 1 +
> arch/x86/include/asm/pgtable.h | 6 +++---
> include/linux/pgtable.h | 13 +++++++++++++
> mm/memory.c | 14 +-------------
> 7 files changed, 50 insertions(+), 21 deletions(-)
>
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index ef6be92b1921..99518b4b2a9e 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -779,6 +779,11 @@ static inline bool system_supports_tlb_range(void)
> cpus_have_const_cap(ARM64_HAS_TLB_RANGE);
> }
>
> +static inline bool system_has_hw_af(void)
> +{
> + return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && cpus_have_const_cap(ARM64_HW_AF);
> +}
> +
> extern int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
>
> static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index c4ba047a82d2..e736f47436c7 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -999,13 +999,16 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
> * page after fork() + CoW for pfn mappings. We don't always have a
> * hardware-managed access flag on arm64.
> */
> -static inline bool arch_faults_on_old_pte(void)
> +static inline bool arch_has_hw_pte_young(bool local)
> {
> - WARN_ON(preemptible());
> + if (local) {
> + WARN_ON(preemptible());
> + return cpu_has_hw_af();
> + }
>
> - return !cpu_has_hw_af();
> + return system_has_hw_af();
> }
> -#define arch_faults_on_old_pte arch_faults_on_old_pte
> +#define arch_has_hw_pte_young arch_has_hw_pte_young
>
> /*
> * Experimentally, it's cheap to set the access flag in hardware and we
> @@ -1013,7 +1016,7 @@ static inline bool arch_faults_on_old_pte(void)
> */
> static inline bool arch_wants_old_prefaulted_pte(void)
> {
> - return !arch_faults_on_old_pte();
> + return arch_has_hw_pte_young(true);
> }
> #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
>
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 6f3e677d88f1..5bb553ee2c0e 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -2171,6 +2171,25 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
> .matches = has_hw_dbm,
> .cpu_enable = cpu_enable_hw_dbm,
> },
> + {
> + /*
> + * __cpu_setup always enables this capability. But if the boot
> + * CPU has it and a late CPU doesn't, the absent
> + * ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU will prevent this late CPU
> + * from going online. There is neither known hardware does that
> + * nor obvious reasons to design hardware works that way, hence
> + * no point leaving the door open here. If the need arises, a
> + * new weak system feature flag should do the trick.
> + */
> + .desc = "Hardware update of the Access flag",
> + .type = ARM64_CPUCAP_SYSTEM_FEATURE,
> + .capability = ARM64_HW_AF,
> + .sys_reg = SYS_ID_AA64MMFR1_EL1,
> + .sign = FTR_UNSIGNED,
> + .field_pos = ID_AA64MMFR1_HADBS_SHIFT,
> + .min_field_value = 1,
> + .matches = has_cpuid_feature,
> + },
> #endif
> {
> .desc = "CRC32 instructions",
> diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
> index 870c39537dd0..56e4ef5d95fa 100644
> --- a/arch/arm64/tools/cpucaps
> +++ b/arch/arm64/tools/cpucaps
> @@ -36,6 +36,7 @@ HAS_STAGE2_FWB
> HAS_SYSREG_GIC_CPUIF
> HAS_TLB_RANGE
> HAS_VIRT_HOST_EXTN
> +HW_AF
> HW_DBM
> KVM_PROTECTED_MODE
> MISMATCHED_CACHE_TYPE

As discussed in the previous threads, we really don't need the complexity
of the additional cap for the arm64 part. Please can you just use the
existing code instead? It's both simpler and, as you say, it's equivalent
for existing hardware.

That way, this patch just ends up being a renaming exercise and we're all
good.

Thanks,

Will

2022-01-05 10:53:27

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Wed, Jan 05, 2022 at 08:55:34AM +0000, SeongJae Park wrote:
> Hi Yu,
>
> On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <[email protected]> wrote:
>
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and it
> > often makes poor choices about what to evict. This patchset offers an
> > alternative solution that is performant, versatile and
> > straightforward.
> >
> [...]
> > Summary
> > =======
> > The facts are:
> > 1. The independent lab results and the real-world applications
> > indicate substantial improvements; there are no known regressions.
>
> So impressive results!
>
> > 2. Thrashing prevention, working set estimation and proactive reclaim
> > work out of the box; there are no equivalent solutions.
>
> I think similar works are already available out of the box with the latest
> mainline tree, though it might be suboptimal in some cases.

Ok, I will sound harsh because I hate it when people challenge facts
while having no idea what they are talking about.

Our jobs are to help the leadership make the best decisions by providing them
with facts, not feeding them crap.

Don't get me wrong -- you are welcome to start another thread and have
a casual discussion with me. But this thread is not for that; it's for
the leadership and stakeholder to make a decision. Check who are in
"To" and "Cc" and what my request is.

> I didn't read this patchset thoroughly yet, so I might be missing many things. If
> so, please feel free to let me know.

Yes, apparently you didn't read this patchset thoroughly, and you have
missed all things that matter to this thread.

> First, you can do thrashing prevention using DAMON-based Operation Scheme
> (DAMOS)[1] with MADV_COLD action.

Here is what thrashing prevention really means, from patch 8:
+Personal computers
+------------------
+:Thrashing prevention: Write ``N`` to
+ ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
+ ``N`` milliseconds from getting evicted. The OOM killer is invoked if
+ this working set can't be kept in memory. Based on the average human
+ detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
+ lags due to thrashing. Larger values like ``N=3000`` make lags less
+ noticeable at the cost of more OOM kills.

It's about when to trigger OOM kills. Got it? Or probably you don't
understand what MADV_COLD is either?

> Second, for working set estimation, you can either use the DAMOS
> again with statistics action, or the damon_aggregated tracepoint[2].

This is what you are suggesting:
TRACE_EVENT(damon_aggregated,
	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u",
		__entry->target_id, __entry->nr_regions,
		__entry->start, __entry->end, __entry->nr_accesses)

Now read my doc again:
+Data centers
+------------
+:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
+ format:
+ memcg memcg_id memcg_path
+ node node_id

Have you heard of something called memcg? And NUMA node? How exactly
can this tracepoint provide information about different memcgs and
NUMA nodes?

> The DAMON user space tool[3] helps the tracepoint analysis and
> visualization.

What does "work out of box" mean? Should every Linux desktop, laptop
and phone user install this tool?

> Finally, for the proactive reclaim, you can again use the DAMOS
> with MADV_PAGEOUT action

How exactly does MADV_PAGEOUT find pages that are NOT mapped in page
tables? Let me tell you another fact: they are usually the cheapest to
reclaim.

> or simply the DAMON-based proactive reclaim module (DAMON_RECLAIM)[4].
> [4] https://docs.kernel.org/admin-guide/mm/damon/reclaim.html

How many knobs does DAMON_RECLAIM have? 14? I lost count.

> Of course, the integration might not be as simple as it seems to me now.

Look, I'm open to your suggestion. I probably should have been nicer.
So I'm sorry. I just don't appreciate alternative facts.

2022-01-05 11:12:27

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Wed, Jan 05, 2022 at 03:53:07AM -0700, Yu Zhao wrote:
> Look, I'm open to your suggestion. I probably should have been nicer.
> So I'm sorry. I just don't appreciate alternative facts.

Yes, you should've been *much* nicer. I'm reading lkml for pretty much
20 years now and you just made my eyebrows go up - something which
pretty much never happens these days.

So you need to check yourself before replying. Looking at git history,
you're not a newbie so you've probably picked up - at least from the
sidelines - all those code of conduct discussions. And I'm not going to
point you to it - I'm sure you can find it yourself and peruse it at
your own convenience.

Long story short: we all try to be civil to each other now, even if it
is hard sometimes.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-01-05 11:25:39

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Wed, 5 Jan 2022 03:53:07 -0700 Yu Zhao <[email protected]> wrote:

> On Wed, Jan 05, 2022 at 08:55:34AM +0000, SeongJae Park wrote:
> > Hi Yu,
> >
> > On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <[email protected]> wrote:
[...]
> > I think similar works are already available out of the box with the latest
> > mainline tree, though it might be suboptimal in some cases.
>
> Ok, I will sound harsh because I hate it when people challenge facts
> while having no idea what they are talking about.
>
> Our jobs are help the leadership make best decisions by providing them
> with facts, not feeding them crap.

I was using the word "similar", to represent this is only for a rough concept
level similarity, rather than detailed facts. But, seems it was not enough,
sorry. Anyway, I will not talk more and thus disturb you having the important
discussion with leaders here, as you are asking.

>
> Don't get me wrong -- you are welcome to start another thread and have
> a casual discussion with me. But this thread is not for that; it's for
> the leadership and stakeholder to make a decision. Check who are in
> "To" and "Cc" and what my request is.

Haha. Ok, good luck to you.


Thanks,
SJ

[...]

2022-01-05 20:47:24

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young()

On Wed, Jan 05, 2022 at 10:45:26AM +0000, Will Deacon wrote:
> On Tue, Jan 04, 2022 at 01:22:20PM -0700, Yu Zhao wrote:
> > Some architectures automatically set the accessed bit in PTEs, e.g.,
> > x86 and arm64 v8.2. On architectures that don't have this capability,
> > clearing the accessed bit in a PTE usually triggers a page fault
> > following the TLB miss of this PTE.
> >
> > Being aware of this capability can help make better decisions, e.g.,
> > whether to spread the work out over a period of time to avoid bursty
> > page faults when trying to clear the accessed bit in a large number of
> > PTEs.
> >
> > Signed-off-by: Yu Zhao <[email protected]>
> > Tested-by: Konstantin Kharlamov <[email protected]>
> > ---
> > arch/arm64/include/asm/cpufeature.h | 5 +++++
> > arch/arm64/include/asm/pgtable.h | 13 ++++++++-----
> > arch/arm64/kernel/cpufeature.c | 19 +++++++++++++++++++
> > arch/arm64/tools/cpucaps | 1 +
> > arch/x86/include/asm/pgtable.h | 6 +++---
> > include/linux/pgtable.h | 13 +++++++++++++
> > mm/memory.c | 14 +-------------
> > 7 files changed, 50 insertions(+), 21 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> > index ef6be92b1921..99518b4b2a9e 100644
> > --- a/arch/arm64/include/asm/cpufeature.h
> > +++ b/arch/arm64/include/asm/cpufeature.h
> > @@ -779,6 +779,11 @@ static inline bool system_supports_tlb_range(void)
> > cpus_have_const_cap(ARM64_HAS_TLB_RANGE);
> > }
> >
> > +static inline bool system_has_hw_af(void)
> > +{
> > + return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && cpus_have_const_cap(ARM64_HW_AF);
> > +}
> > +
> > extern int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
> >
> > static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index c4ba047a82d2..e736f47436c7 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -999,13 +999,16 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
> > * page after fork() + CoW for pfn mappings. We don't always have a
> > * hardware-managed access flag on arm64.
> > */
> > -static inline bool arch_faults_on_old_pte(void)
> > +static inline bool arch_has_hw_pte_young(bool local)
> > {
> > - WARN_ON(preemptible());
> > + if (local) {
> > + WARN_ON(preemptible());
> > + return cpu_has_hw_af();
> > + }
> >
> > - return !cpu_has_hw_af();
> > + return system_has_hw_af();
> > }
> > -#define arch_faults_on_old_pte arch_faults_on_old_pte
> > +#define arch_has_hw_pte_young arch_has_hw_pte_young
> >
> > /*
> > * Experimentally, it's cheap to set the access flag in hardware and we
> > @@ -1013,7 +1016,7 @@ static inline bool arch_faults_on_old_pte(void)
> > */
> > static inline bool arch_wants_old_prefaulted_pte(void)
> > {
> > - return !arch_faults_on_old_pte();
> > + return arch_has_hw_pte_young(true);
> > }
> > #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
> >
> > diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> > index 6f3e677d88f1..5bb553ee2c0e 100644
> > --- a/arch/arm64/kernel/cpufeature.c
> > +++ b/arch/arm64/kernel/cpufeature.c
> > @@ -2171,6 +2171,25 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
> > .matches = has_hw_dbm,
> > .cpu_enable = cpu_enable_hw_dbm,
> > },
> > + {
> > + /*
> > + * __cpu_setup always enables this capability. But if the boot
> > + * CPU has it and a late CPU doesn't, the absent
> > + * ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU will prevent this late CPU
> > + * from going online. There is neither known hardware does that
> > + * nor obvious reasons to design hardware works that way, hence
> > + * no point leaving the door open here. If the need arises, a
> > + * new weak system feature flag should do the trick.
> > + */
> > + .desc = "Hardware update of the Access flag",
> > + .type = ARM64_CPUCAP_SYSTEM_FEATURE,
> > + .capability = ARM64_HW_AF,
> > + .sys_reg = SYS_ID_AA64MMFR1_EL1,
> > + .sign = FTR_UNSIGNED,
> > + .field_pos = ID_AA64MMFR1_HADBS_SHIFT,
> > + .min_field_value = 1,
> > + .matches = has_cpuid_feature,
> > + },
> > #endif
> > {
> > .desc = "CRC32 instructions",
> > diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
> > index 870c39537dd0..56e4ef5d95fa 100644
> > --- a/arch/arm64/tools/cpucaps
> > +++ b/arch/arm64/tools/cpucaps
> > @@ -36,6 +36,7 @@ HAS_STAGE2_FWB
> > HAS_SYSREG_GIC_CPUIF
> > HAS_TLB_RANGE
> > HAS_VIRT_HOST_EXTN
> > +HW_AF
> > HW_DBM
> > KVM_PROTECTED_MODE
> > MISMATCHED_CACHE_TYPE
>
> As discussed in the previous threads, we really don't need the complexity
> of the additional cap for the arm64 part. Please can you just use the
> existing code instead? It's both simpler and, as you say, it's equivalent
> for existing hardware.
>
> That way, this patch just ends up being a renaming exercise and we're all
> good.

No, renaming alone isn't enough. A caller needs to disable preemption
before calling system_has_hw_af(), and I don't think it's reasonable
to ask this caller to do it on x86 as well.

It seems you really prefer not to have HW_AF. So the best I can
accommodate, considering other potential archs, e.g., risc-v (I do
plan to provide benchmark results on risc-v, btw), is:

static inline bool arch_has_hw_pte_young(bool local)
{
	bool hw_af;

	if (local) {
		WARN_ON(preemptible());
		return cpu_has_hw_af();
	}

	preempt_disable();
	hw_af = system_has_hw_af();
	preempt_enable();

	return hw_af;
}

Or please give me something else I can call without disabling
preemption, sounds good?

2022-01-05 21:06:33

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Wed, Jan 05, 2022 at 11:25:27AM +0000, SeongJae Park wrote:
> On Wed, 5 Jan 2022 03:53:07 -0700 Yu Zhao <[email protected]> wrote:
>
> > On Wed, Jan 05, 2022 at 08:55:34AM +0000, SeongJae Park wrote:
> > > Hi Yu,
> > >
> > > On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <[email protected]> wrote:
> [...]
> > > I think similar works are already available out of the box with the latest
> > > mainline tree, though it might be suboptimal in some cases.
> >
> > Ok, I will sound harsh because I hate it when people challenge facts
> > while having no idea what they are talking about.
> >
> > Our jobs are to help the leadership make the best decisions by providing them
> > with facts, not feeding them crap.
>
> I was using the word "similar", to represent this is only for a rough concept
> level similarity, rather than detailed facts. But, seems it was not enough,
> sorry. Anyway, I will not talk more and thus disturb you having the important
> discussion with leaders here, as you are asking.

First of all, I want to apologize.

I detested what I read, and I still don't like "a rough concept level
similarity" sitting next to a factual statement. But as Borislav has
reminded me, my tone did cross the line. I should have used an
objective approach to express my (very) different views.

I hope that's all water under the bridge now. And I do plan to carry
on with what I should have done.

2022-01-05 21:12:44

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 04, 2022 at 01:43:13PM -0800, Linus Torvalds wrote:
> On Tue, Jan 4, 2022 at 12:30 PM Yu Zhao <[email protected]> wrote:
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> I'll leave this to Andrew. I had some stylistic nits, but all the
> actual complexity is in that aging and eviction, and while I looked at
> the patches, I certainly couldn't make much of a judgement on them.
>
> The proof is in the numbers, and they look fine, but who knows what
> happens when others test it. I don't see anything that looks worrisome
> per se, I just see the silly small things that made me go "Eww".

I appreciate your time. I'll address all your comments together with
others' in the next spin, after I hear from Andrew. (I'm assuming he
will have comments too.)

2022-01-06 10:30:21

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young()

On Wed, Jan 05, 2022 at 01:47:08PM -0700, Yu Zhao wrote:
> On Wed, Jan 05, 2022 at 10:45:26AM +0000, Will Deacon wrote:
> > On Tue, Jan 04, 2022 at 01:22:20PM -0700, Yu Zhao wrote:
> > > diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
> > > index 870c39537dd0..56e4ef5d95fa 100644
> > > --- a/arch/arm64/tools/cpucaps
> > > +++ b/arch/arm64/tools/cpucaps
> > > @@ -36,6 +36,7 @@ HAS_STAGE2_FWB
> > > HAS_SYSREG_GIC_CPUIF
> > > HAS_TLB_RANGE
> > > HAS_VIRT_HOST_EXTN
> > > +HW_AF
> > > HW_DBM
> > > KVM_PROTECTED_MODE
> > > MISMATCHED_CACHE_TYPE
> >
> > As discussed in the previous threads, we really don't need the complexity
> > of the additional cap for the arm64 part. Please can you just use the
> > existing code instead? It's both simpler and, as you say, it's equivalent
> > for existing hardware.
> >
> > That way, this patch just ends up being a renaming exercise and we're all
> > good.
>
> No, renaming alone isn't enough. A caller needs to disable preemption
> before calling system_has_hw_af(), and I don't think it's reasonable
> to ask this caller to do it on x86 as well.
>
> It seems you really prefer not to have HW_AF. So the best I can
> accommodate, considering other potential archs, e.g., risc-v (I do
> plan to provide benchmark results on risc-v, btw), is:
>
> static inline bool arch_has_hw_pte_young(bool local)
> {
> bool hw_af;
>
> if (local) {
> WARN_ON(preemptible());
> return cpu_has_hw_af();
> }
>
> preempt_disable();
> hw_af = system_has_hw_af();
> preempt_enable();
>
> return hw_af;
> }
>
> Or please give me something else I can call without disabling
> preemption, sounds good?

Sure thing, let me take a look. Do you have your series on a public git
tree someplace?

Cheers,

Will

2022-01-06 16:06:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

I am still reading through the series. It is a lot of code and quite
hard to wrap one's head around, so these are mostly random things I have
run into. More will likely follow up.

On Tue 04-01-22 13:22:25, Yu Zhao wrote:
[...]
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index aba18cd101db..028afdb81c10 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1393,18 +1393,24 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
>
> static inline void lock_page_memcg(struct page *page)
> {
> + /* to match folio_memcg_rcu() */
> + rcu_read_lock();
> }
>
> static inline void unlock_page_memcg(struct page *page)
> {
> + rcu_read_unlock();
> }
>
> static inline void folio_memcg_lock(struct folio *folio)
> {
> + /* to match folio_memcg_rcu() */
> + rcu_read_lock();
> }
>
> static inline void folio_memcg_unlock(struct folio *folio)
> {
> + rcu_read_unlock();
> }

This should go into a separate patch and be merged independently. I
hadn't really realized that the !MEMCG configuration has different
locking scopes.

[...]
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 2db9a1432511..9c7a4fae0661 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -57,6 +57,22 @@ struct oom_control {
> extern struct mutex oom_lock;
> extern struct mutex oom_adj_mutex;
>
> +#ifdef CONFIG_MMU
> +extern struct task_struct *oom_reaper_list;
> +extern struct wait_queue_head oom_reaper_wait;
> +
> +static inline bool oom_reaping_in_progress(void)
> +{
> + /* a racy check can be used to reduce the chance of overkilling */
> + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> +}
> +#else
> +static inline bool oom_reaping_in_progress(void)
> +{
> + return false;
> +}
> +#endif

I do not like this. These are the oom reaper's internals and no code should
really make any decisions based on them. oom_reaping_in_progress is not
telling much anyway. This is a global queue for oom reaper that can
contain oom victims from different oom scopes (e.g. global OOM, memcg
OOM or memory policy OOM).

Your lru_gen_age_node uses this to decide whether to trigger
out_of_memory and that is clearly wrong for the above reasons.
out_of_memory is designed to skip over any action if there is an oom
victim pending from the oom domain (have a look at oom_evaluate_task).

[...]

> +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> + unsigned long min_ttl)
> +{
> + bool need_aging;
> + long nr_to_scan;
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> + int swappiness = get_swappiness(memcg);
> + DEFINE_MAX_SEQ(lruvec);
> + DEFINE_MIN_SEQ(lruvec);
> +
> + if (mem_cgroup_below_min(memcg))
> + return false;

mem_cgroup_below_min requires effective values to be calculated for the
reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection
--
Michal Hocko
SUSE Labs

2022-01-06 16:12:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> +{
> + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> +
> + return current->reclaim_state->mm_walk;
> +}
> +
> +static void free_mm_walk(struct lru_gen_mm_walk *walk)
> +{
> + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> + kvfree(walk);
> +}

Do I get it right that you are allocating from the reclaim context? What
prevents this from completely depleting the memory, as the reclaim context is
PF_MEMALLOC?
--
Michal Hocko
SUSE Labs

2022-01-06 21:28:03

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote:
> I am still reading through the series. It is a lot of code and quite
> hard to wrap one's head around, so these are mostly random things I have
> run into. More will likely follow up.
>
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> [...]
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index aba18cd101db..028afdb81c10 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1393,18 +1393,24 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
> >
> > static inline void lock_page_memcg(struct page *page)
> > {
> > + /* to match folio_memcg_rcu() */
> > + rcu_read_lock();
> > }
> >
> > static inline void unlock_page_memcg(struct page *page)
> > {
> > + rcu_read_unlock();
> > }
> >
> > static inline void folio_memcg_lock(struct folio *folio)
> > {
> > + /* to match folio_memcg_rcu() */
> > + rcu_read_lock();
> > }
> >
> > static inline void folio_memcg_unlock(struct folio *folio)
> > {
> > + rcu_read_unlock();
> > }
>
> This should go into a separate patch and be merged independently. I
> hadn't really realized that the !MEMCG configuration has different
> locking scopes.

Considered it done.

> > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > index 2db9a1432511..9c7a4fae0661 100644
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -57,6 +57,22 @@ struct oom_control {
> > extern struct mutex oom_lock;
> > extern struct mutex oom_adj_mutex;
> >
> > +#ifdef CONFIG_MMU
> > +extern struct task_struct *oom_reaper_list;
> > +extern struct wait_queue_head oom_reaper_wait;
> > +
> > +static inline bool oom_reaping_in_progress(void)
> > +{
> > + /* a racy check can be used to reduce the chance of overkilling */
> > + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> > +}
> > +#else
> > +static inline bool oom_reaping_in_progress(void)
> > +{
> > + return false;
> > +}
> > +#endif
>
> I do not like this. These are the oom reaper's internals and no code should
> really make any decisions based on them. oom_reaping_in_progress is not
> telling much anyway.

There is a perfectly legitimate reason for this.

If there is already an oom kill victim and the oom reaper is making
progress, the system may still be under memory pressure until the oom
reaping is done. The page reclaim has two choices in this transient
state: kill more processes or keep reclaiming (a few more) hot pages.

The first choice, AKA overkilling, is generally a bad one. The oom
reaper is single threaded and it can't go faster with additional
victims. Additional processes are sacrificed for nothing -- this is
an overcorrection of a system that tries to strike a balance between
the tendencies to release memory pressure and to improve memory
utilization.

> This is a global queue for oom reaper that can
> contain oom victims from different oom scopes (e.g. global OOM, memcg
> OOM or memory policy OOM).

True, but this is the wrong reason to draw the conclusion below. Oom
kill scopes do NOT matter; only the pool the freed memory goes into
does. And there is only one global pool of free pages.

> Your lru_gen_age_node uses this to decide whether to trigger
> out_of_memory and that is clearly wrong for the above reasons.

I hope my explanation above is clear enough. There is nothing wrong
with the purpose and the usage of oom_reaping_in_progress(), and it
has been well tested in the Arch Linux Zen kernel.

Without it, overkills can be easily reproduced by the following simple
script. That is, additional oom kills happen to processes other than
"tail".

# enable zram
while true;
do
tail /dev/zero
done

> out_of_memory is designed to skip over any action if there is an oom
> victim pending from the oom domain (have a look at oom_evaluate_task).

Where exactly? Point me to the code please.

I don't see such logic inside out_of_memory() or
oom_evaluate_task(). Currently the only thing that could remotely
prevent overkills is oom_lock. But it's inadequate.

This is the entire pipeline:
low on memory -> out_of_memory() -> oom_reaper() -> free memory

To avoid overkills, we need to consider the latter half of it too.
oom_reaping_in_progress() is exactly for this purpose.
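
For instance, a sketch of how such a check could gate the kill on the
aging side (not the exact code from the patch; the call site and locking
are simplified here):

	/* only trigger the OOM killer if no victim is already being reaped */
	if (min_ttl && !oom_reaping_in_progress() && mutex_trylock(&oom_lock)) {
		struct oom_control oc = {
			.gfp_mask = sc->gfp_mask,
			.order = sc->order,
		};

		out_of_memory(&oc);
		mutex_unlock(&oom_lock);
	}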

> > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> > + unsigned long min_ttl)
> > +{
> > + bool need_aging;
> > + long nr_to_scan;
> > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > + int swappiness = get_swappiness(memcg);
> > + DEFINE_MAX_SEQ(lruvec);
> > + DEFINE_MIN_SEQ(lruvec);
> > +
> > + if (mem_cgroup_below_min(memcg))
> > + return false;
>
> mem_cgroup_below_min requires effective values to be calculated for the
> reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection

I always keep that in mind, and age_lruvec() is called *after*
mem_cgroup_calculate_protection():

balance_pgdat()
  memcgs_need_aging = 0
  do {
    lru_gen_age_node()
      if (!memcgs_need_aging) {
        memcgs_need_aging = 1
        return
      }
      age_lruvec()

    shrink_node_memcgs()
      mem_cgroup_calculate_protection()
      lru_gen_shrink_lruvec()
        if ...
          memcgs_need_aging = 0
  } while ...

2022-01-06 21:41:20

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Thu, Jan 06, 2022 at 05:12:16PM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > +{
> > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> > +
> > + return current->reclaim_state->mm_walk;
> > +}
> > +
> > +static void free_mm_walk(struct lru_gen_mm_walk *walk)
> > +{
> > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > + kvfree(walk);
> > +}
>
> Do I get it right that you are allocating from the reclaim context? What
> prevents this from completely depleting the memory, as the reclaim context is
> PF_MEMALLOC?

Yes, and in general for the same reason that zram/zswap/etc. allocate memory
in the reclaim context: to make more free memory.

In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
reclaimer; and direct reclaimers rarely come here, i.e., only when
kswapd can't keep up in terms of the aging, which is similar to the
condition where the inactive list is empty for the active/inactive
lru.

2022-01-07 07:25:15

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young()

On Thu, Jan 06, 2022 at 10:30:09AM +0000, Will Deacon wrote:
> On Wed, Jan 05, 2022 at 01:47:08PM -0700, Yu Zhao wrote:
> > On Wed, Jan 05, 2022 at 10:45:26AM +0000, Will Deacon wrote:
> > > On Tue, Jan 04, 2022 at 01:22:20PM -0700, Yu Zhao wrote:
> > > > diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
> > > > index 870c39537dd0..56e4ef5d95fa 100644
> > > > --- a/arch/arm64/tools/cpucaps
> > > > +++ b/arch/arm64/tools/cpucaps
> > > > @@ -36,6 +36,7 @@ HAS_STAGE2_FWB
> > > > HAS_SYSREG_GIC_CPUIF
> > > > HAS_TLB_RANGE
> > > > HAS_VIRT_HOST_EXTN
> > > > +HW_AF
> > > > HW_DBM
> > > > KVM_PROTECTED_MODE
> > > > MISMATCHED_CACHE_TYPE
> > >
> > > As discussed in the previous threads, we really don't need the complexity
> > > of the additional cap for the arm64 part. Please can you just use the
> > > existing code instead? It's both simpler and, as you say, it's equivalent
> > > for existing hardware.
> > >
> > > That way, this patch just ends up being a renaming exercise and we're all
> > > good.
> >
> > No, renaming alone isn't enough. A caller needs to disable preemption
> > before calling system_has_hw_af(), and I don't think it's reasonable
> > to ask this caller to do it on x86 as well.
> >
> > It seems you really prefer not to have HW_AF. So the best I can
> > accommodate, considering other potential archs, e.g., risc-v (I do
> > plan to provide benchmark results on risc-v, btw), is:
> >
> > static inline bool arch_has_hw_pte_young(bool local)
> > {
> >         bool hw_af;
> >
> >         if (local) {
> >                 WARN_ON(preemptible());
> >                 return cpu_has_hw_af();
> >         }
> >
> >         preempt_disable();
> >         hw_af = system_has_hw_af();
> >         preempt_enable();
> >
> >         return hw_af;
> > }
> >
> > Or please give me something else I can call without disabling
> > preemption, sounds good?
>
> Sure thing, let me take a look. Do you have your series on a public git
> tree someplace?

Thanks!

This patch (updated) on Gerrit:
https://linux-mm-review.googlesource.com/c/page-reclaim/+/1500/1

And the entire series:
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/08/1508/1

2022-01-07 08:43:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Thu 06-01-22 14:27:52, Yu Zhao wrote:
> On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote:
[...]
> > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > index 2db9a1432511..9c7a4fae0661 100644
> > > --- a/include/linux/oom.h
> > > +++ b/include/linux/oom.h
> > > @@ -57,6 +57,22 @@ struct oom_control {
> > > extern struct mutex oom_lock;
> > > extern struct mutex oom_adj_mutex;
> > >
> > > +#ifdef CONFIG_MMU
> > > +extern struct task_struct *oom_reaper_list;
> > > +extern struct wait_queue_head oom_reaper_wait;
> > > +
> > > +static inline bool oom_reaping_in_progress(void)
> > > +{
> > > + /* a racy check can be used to reduce the chance of overkilling */
> > > + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> > > +}
> > > +#else
> > > +static inline bool oom_reaping_in_progress(void)
> > > +{
> > > + return false;
> > > +}
> > > +#endif
> >
> > I do not like this. These are internal oom reaper's and no code should
> > really make any decisions based on that. oom_reaping_in_progress is not
> > telling much anyway.
>
> There is a perfectly legitimate reason for this.
>
> If there is already an oom kill victim and the oom reaper is making
> progress, the system may still be under memory pressure until the oom
> reaping is done. The page reclaim has two choices in this transient
> state: kill more processes or keep reclaiming (a few more) hot pages.
>
> The first choice, AKA overkilling, is generally a bad one. The oom
> reaper is single threaded and it can't go faster with additional
> victims. Additional processes are sacrificed for nothing -- this is
> an overcorrection of a system that tries to strike a balance between
> the tendencies to release memory pressure and to improve memory
> utilization.
>
> > This is a global queue for oom reaper that can
> > contain oom victims from different oom scopes (e.g. global OOM, memcg
> > OOM or memory policy OOM).
>
> True, but this is the wrong reason to draw the conclusion below. Oom
> kill scopes do NOT matter; only the pool the freed memory goes into
> does. And there is only one global pool of free pages.
>
> > Your lru_gen_age_node uses this to decide whether to trigger
> > out_of_memory and that is clearly wrong for the above reasons.
>
> I hope my explanation above is clear enough. There is nothing wrong
> with the purpose and the usage of oom_reaping_in_progress(), and it
> has been well tested in the Arch Linux Zen kernel.

I disagree. An ongoing oom kill in one domain (say memcg A) shouldn't be
the basis for any decisions in reclaim in another domain (say memcg B or
even the global reclaim). Those are fundamentally different conditions.

> Without it, overkills can be easily reproduced by the following simple
> script. That is, additional oom kills happen to processes other than
> "tail".
>
> # enable zram
> while true;
> do
> tail /dev/zero
> done

I would be interested to hear more (care to send oom reports?).

> > out_of_memory is designed to skip over any action if there is an oom
> > victim pending from the oom domain (have a look at oom_evaluate_task).
>
> Where exactly? Point me to the code please.
>
> I don't see such logic inside out_of_memory() or
> oom_evaluate_task(). Currently the only thing that could remotely
> prevent overkills is oom_lock, but it's inadequate.

OK, let me try to explain. The protocol is rather convoluted. Once the
oom killer is invoked it chooses a victim to kill. oom_evaluate_task will
evaluate _all_ tasks from the respective oom domain (select_bad_process,
which distinguishes memcg vs global oom kills, and oom_cpuset_eligible for
the cpuset domains). If there is any pre-existing oom victim
(tsk_is_oom_victim) then the scan is aborted and the oom killer bails
out. An OOM victim stops being considered relevant once the oom reaper
manages to release its address space (or gives up on the mmap_sem
contention) and sets the MMF_OOM_SKIP flag for the mm.

That being said, out_of_memory() automatically backs off and relies on
the oom reaper to process its queue.

Does that make it clearer now?
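
For illustration, the abort path is roughly this (a simplified sketch of
the mm/oom_kill.c logic, not the exact upstream code):

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
        struct oom_control *oc = arg;

        /*
         * Simplified: a pre-existing victim whose mm the oom reaper has
         * not yet marked with MMF_OOM_SKIP aborts the whole scan, so
         * out_of_memory() backs off instead of picking another victim
         * from the same oom domain.
         */
        if (tsk_is_oom_victim(task) &&
            !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
                oc->chosen = (void *)-1UL;      /* abort select_bad_process() */
                return 1;
        }

        /* otherwise the task is scored via oom_badness() as usual */
        return 0;
}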

> This is the entire pipeline:
> low on memory -> out_of_memory() -> oom_reaper() -> free memory
>
> To avoid overkills, we need to consider the latter half of it too.
> oom_reaping_in_progress() is exactly for this purpose.
>
> > > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> > > + unsigned long min_ttl)
> > > +{
> > > + bool need_aging;
> > > + long nr_to_scan;
> > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > + int swappiness = get_swappiness(memcg);
> > > + DEFINE_MAX_SEQ(lruvec);
> > > + DEFINE_MIN_SEQ(lruvec);
> > > +
> > > + if (mem_cgroup_below_min(memcg))
> > > + return false;
> >
> > mem_cgroup_below_min requires effective values to be calculated for the
> > reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection
>
> I always keep that in mind, and age_lruvec() is called *after*
> mem_cgroup_calculate_protection():

> balance_pgdat()
>     memcgs_need_aging = 0
>     do {
>         lru_gen_age_node()
>             if (!memcgs_need_aging) {
>                 memcgs_need_aging = 1
>                 return
>             }
>             age_lruvec()
>
>         shrink_node_memcgs()
>             mem_cgroup_calculate_protection()
>             lru_gen_shrink_lruvec()
>                 if ...
>                     memcgs_need_aging = 0
>     } while ...

Uff, this is really subtle. I really think you should be following the
existing pattern where the effective values are calculated right in the
same context as they are evaluated.
--
Michal Hocko
SUSE Labs

2022-01-07 08:55:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Thu 06-01-22 14:41:12, Yu Zhao wrote:
> On Thu, Jan 06, 2022 at 05:12:16PM +0100, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > > +{
> > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> > > +
> > > + return current->reclaim_state->mm_walk;
> > > +}
> > > +
> > > +static void free_mm_walk(struct lru_gen_mm_walk *walk)
> > > +{
> > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > + kvfree(walk);
> > > +}
> >
> > Do I get it right that you are allocating from the reclaim context? What
> > prevents this to completely deplete the memory as the reclaim context is
> > PF_MEMALLOC?
>
> Yes, and in general for the same reason that zram/zswap/etc. allocate
> memory in the reclaim context: to make more free memory.

I have to admit that I am not really familiar with zram/zswap but I find
the concept of requiring memory to do the reclaim really problematic.

> In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> reclaimer; and direct reclaimers rarely come here, i.e., only when
> kswapd can't keep up in terms of the aging, which is similar to the
> condition where the inactive list is empty for the active/inactive
> lru.

Well, this is not a strong argument to be honest. Kswapd being stuck
and the majority of the reclaim being done in the direct reclaim
context is a situation I have seen many many times. We used to have
problems with direct reclaimers throttling to prevent over-eager OOM
situations.

Have you considered using a pool of preallocated objects instead?
--
Michal Hocko
SUSE Labs

2022-01-07 09:00:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Fri 07-01-22 09:55:09, Michal Hocko wrote:
[...]
> > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> > reclaimer; and direct reclaimers rarely come here, i.e., only when
> > kswapd can't keep up in terms of the aging, which is similar to the
> > condition where the inactive list is empty for the active/inactive
> > lru.
>
> Well, this is not a strong argument to be honest. Kswapd being stuck
> and the majority of the reclaim being done in the direct reclaim
> context is a situation I have seen many many times.

Also do not forget that memcg reclaim is effectively only direct
reclaim. Not that the memcg reclaim indicates a global memory shortage
but it can add up and race with the global reclaim as well.

--
Michal Hocko
SUSE Labs

2022-01-07 09:06:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 5/9] mm: multigenerational lru: mm_struct list

On Tue 04-01-22 13:22:24, Yu Zhao wrote:
> To exploit spatial locality, the aging prefers to walk page tables to
> search for young PTEs. And this patch paves the way for that.
>
> An mm_struct list is maintained for each memcg, and an mm_struct
> follows its owner task to the new memcg when this task is migrated.

How does this actually work for the memcg reclaim? I can see you call
lru_gen_migrate_mm() on task migration. My concern is, though, that
such a task leaves all the memory behind in the previous memcg (in
cgroup v2; in v1 you can opt in for charge migration). If you move the
mm to a new memcg then you age it somewhere where the memory is not
really consumed.
--
Michal Hocko
SUSE Labs

2022-01-07 09:38:25

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue 04-01-22 13:30:00, Yu Zhao wrote:
[...]
> Hi Andrew, Linus,
>
> Can you please take a look at this patchset and let me know if it's
> 5.17 material?

I am still not done with the review and have seen at least a few problems
that would need to be addressed.

But more fundamentally I believe there are really some important
questions to be answered. First and foremost this is a major addition
to the memory reclaim and there should be a wider consensus that we
really want to go that way. The patchset doesn't have a single ack nor
reviewed-by AFAICS. I haven't seen a lot of discussion since v2
(http://lkml.kernel.org/r/[email protected])
nor do I see any clarification on how concerns raised there have been
addressed or at least how they are planned to be addressed.

Johannes has made some excellent points
http://lkml.kernel.org/r/[email protected]. Let me quote
for reference part of it I find the most important:
: Realistically, I think incremental changes are unavoidable to get this
: merged upstream.
:
: Not just in the sense that they need to be smaller changes, but also
: in the sense that they need to replace old code. It would be
: impossible to maintain both, focus development and testing resources,
: and provide a reasonably stable experience with both systems tugging
: at a complicated shared code base.
:
: On the other hand, the existing code also has billions of hours of
: production testing and tuning. We can't throw this all out overnight -
: it needs to be surgical and the broader consequences of each step need
: to be well understood.
:
: We also have millions of servers relying on being able to do upgrades
: for drivers and fixes in other subsystems that we can't put on hold
: until we stabilized a new reclaim implementation from scratch.

Fully agreed on all points here.

I do appreciate there is a lot of work behind this patchset and I
also do understand it has gained a considerable amount of testing as
well. Your numbers are impressive but my experience tells me that it is
equally important to understand the worst case behavior and there is not
really much mentioned about those in changelogs.

We also shouldn't ignore costs the code is adding. One of them would be
a further page flags depletion. We have been hitting problems on that
front for years and many features had to be reworked to bypass a lack of
space in page->flags.

I will be looking more into the code (especially the memcg side of it)
but I really believe that a consensus on above Johannes' points need to
be found first before this work can move forward.

Thanks!
--
Michal Hocko
SUSE Labs

2022-01-07 13:11:34

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Tue 04-01-22 13:22:25, Yu Zhao wrote:
[...]
> +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> +{
> + struct mem_cgroup *memcg;
> + bool success = false;
> + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> +
> + VM_BUG_ON(!current_is_kswapd());
> +
> + current->reclaim_state->mm_walk = &pgdat->mm_walk;
> +
> + memcg = mem_cgroup_iter(NULL, NULL, NULL);
> + do {
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
> + if (age_lruvec(lruvec, sc, min_ttl))
> + success = true;
> +
> + cond_resched();
> + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> +
> + if (!success && mutex_trylock(&oom_lock)) {
> + struct oom_control oc = {
> + .gfp_mask = sc->gfp_mask,
> + .order = sc->order,
> + };
> +
> + if (!oom_reaping_in_progress())
> + out_of_memory(&oc);
> +
> + mutex_unlock(&oom_lock);
> + }

Why do you need to trigger oom killer from this path? Why cannot you
rely on the page allocator to do that like we do now?
--
Michal Hocko
SUSE Labs

2022-01-07 14:44:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Tue 04-01-22 13:22:25, Yu Zhao wrote:
[...]
> +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> +{
> + static const struct mm_walk_ops mm_walk_ops = {
> + .test_walk = should_skip_vma,
> + .p4d_entry = walk_pud_range,
> + };
> +
> + int err;
> +#ifdef CONFIG_MEMCG
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +#endif
> +
> + walk->next_addr = FIRST_USER_ADDRESS;
> +
> + do {
> + unsigned long start = walk->next_addr;
> + unsigned long end = mm->highest_vm_end;
> +
> + err = -EBUSY;
> +
> + rcu_read_lock();
> +#ifdef CONFIG_MEMCG
> + if (memcg && atomic_read(&memcg->moving_account))
> + goto contended;
> +#endif
> + if (!mmap_read_trylock(mm))
> + goto contended;

Have you evaluated the behavior under mmap_sem contention? I mean what
would be an effect of some mms being excluded from the walk? This path
is called from direct reclaim and we do allocate with exclusive mmap_sem
IIRC and the trylock can fail in a presence of pending writer if I am
not mistaken so even the read lock holder (e.g. an allocation from the #PF)
can bypass the walk.

Or is this considered statistically insignificant thus a theoretical
problem?
--
Michal Hocko
SUSE Labs

2022-01-07 18:45:49

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Fri, Jan 07, 2022 at 10:38:18AM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:30:00, Yu Zhao wrote:
> [...]
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
>
> I am still not done with the review and have seen at least a few problems
> that would need to be addressed.
>
> But more fundamentally I believe there are really some important
> questions to be answered. First and foremost this is a major addition
> to the memory reclaim and there should be a wider consensus that we
> really want to go that way. The patchset doesn't have a single ack nor
> reviewed-by AFAICS. I haven't seen a lot of discussion since v2
> (http://lkml.kernel.org/r/[email protected])
> nor do I see any clarification on how concerns raised there have been
> addressed or at least how they are planned to be addressed.
>
> Johannes has made some excellent points
> http://lkml.kernel.org/r/[email protected]. Let me quote
> for reference part of it I find the most important:
> : Realistically, I think incremental changes are unavoidable to get this
> : merged upstream.
> :
> : Not just in the sense that they need to be smaller changes, but also
> : in the sense that they need to replace old code. It would be
> : impossible to maintain both, focus development and testing resources,
> : and provide a reasonably stable experience with both systems tugging
> : at a complicated shared code base.
> :
> : On the other hand, the existing code also has billions of hours of
> : production testing and tuning. We can't throw this all out overnight -
> : it needs to be surgical and the broader consequences of each step need
> : to be well understood.
> :
> : We also have millions of servers relying on being able to do upgrades
> : for drivers and fixes in other subsystems that we can't put on hold
> : until we stabilized a new reclaim implementation from scratch.
>
> Fully agreed on all points here.
>
> I do appreciate there is a lot of work behind this patchset and I
> also do understand it has gained a considerable amount of testing as
> well. Your numbers are impressive but my experience tells me that it is
> equally important to understand the worst case behavior and there is not
> really much mentioned about those in changelogs.
>
> We also shouldn't ignore costs the code is adding. One of them would be
> a further page flags depletion. We have been hitting problems on that
> front for years and many features had to be reworked to bypass a lack of
> space in page->flags.
>
> I will be looking more into the code (especially the memcg side of it)
> but I really believe that a consensus on above Johannes' points need to
> be found first before this work can move forward.

Thanks for the summary. I appreciate your time and I agree your
assessment is fair.

So I've acknowledged your concerns, and you've acknowledged my numbers
(the performance improvements) are impressive.

Now we are in agreement, cheers.

Next, I argue that the benefits of this patchset outweigh its risks,
because, drawing from my past experience,
1. There have been many larger and/or riskier patchsets taken; I'll
assemble a list if you disagree. And this patchset is fully guarded
by #ifdef; Linus has also weighed in on this point.
2. There have been none that came with the testing/benchmarking
coverage as this one did. Please point me to some if I'm mistaken,
and I'll gladly match them.

The numbers might not materialize in the real world; the code is not
perfect; and many other risks... But all the top eight open source
memory hogs were covered, which is unprecedented; memcached and fio
showed significant improvements and it only takes a few commands to
see for yourselves.

Regarding the acks and the reviewed-bys, I certainly can ask people
who have reaped the benefits of this patchset to do them, if it's
required. But I see less fun in that. I prefer to provide empirical
evidence and convince people who are on the other side of the aisle.

2022-01-07 21:12:52

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Fri, Jan 07, 2022 at 09:43:49AM +0100, Michal Hocko wrote:
> On Thu 06-01-22 14:27:52, Yu Zhao wrote:
> > On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote:
> [...]
> > > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > > index 2db9a1432511..9c7a4fae0661 100644
> > > > --- a/include/linux/oom.h
> > > > +++ b/include/linux/oom.h
> > > > @@ -57,6 +57,22 @@ struct oom_control {
> > > > extern struct mutex oom_lock;
> > > > extern struct mutex oom_adj_mutex;
> > > >
> > > > +#ifdef CONFIG_MMU
> > > > +extern struct task_struct *oom_reaper_list;
> > > > +extern struct wait_queue_head oom_reaper_wait;
> > > > +
> > > > +static inline bool oom_reaping_in_progress(void)
> > > > +{
> > > > + /* a racy check can be used to reduce the chance of overkilling */
> > > > + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> > > > +}
> > > > +#else
> > > > +static inline bool oom_reaping_in_progress(void)
> > > > +{
> > > > + return false;
> > > > +}
> > > > +#endif
> > >
> > > I do not like this. These are internal oom reaper's and no code should
> > > really make any decisions based on that. oom_reaping_in_progress is not
> > > telling much anyway.
> >
> > There is a perfectly legitimate reason for this.
> >
> > If there is already an oom kill victim and the oom reaper is making
> > progress, the system may still be under memory pressure until the oom
> > reaping is done. The page reclaim has two choices in this transient
> > state: kill more processes or keep reclaiming (a few more) hot pages.
> >
> > The first choice, AKA overkilling, is generally a bad one. The oom
> > reaper is single threaded and it can't go faster with additional
> > victims. Additional processes are sacrificed for nothing -- this is
> > an overcorrection of a system that tries to strike a balance between
> > the tendencies to release memory pressure and to improve memory
> > utilization.
> >
> > > This is a global queue for oom reaper that can
> > > contain oom victims from different oom scopes (e.g. global OOM, memcg
> > > OOM or memory policy OOM).
> >
> > True, but this is the wrong reason to draw the conclusion below. Oom
> > kill scopes do NOT matter; only the pool the freed memory goes into
> > does. And there is only one global pool of free pages.
> >
> > > Your lru_gen_age_node uses this to decide whether to trigger
> > > out_of_memory and that is clearly wrong for the above reasons.
> >
> > I hope my explanation above is clear enough. There is nothing wrong
> > with the purpose and the usage of oom_reaping_in_progress(), and it
> > has been well tested in the Arch Linux Zen kernel.
>
> I disagree. An ongoing oom kill in one domain (say memcg A) shouldn't be
> the basis for any decisions in reclaim in another domain (say memcg B or
> even the global reclaim). Those are fundamentally different conditions.

I agree for the memcg A oom and memcg B reclaim case, because memory
freed from A doesn't go to B.

I still think for the memcg A and the global reclaim case, memory
freed from A can be considered when deciding whether to make more
kills during global reclaim.

But this is something really minor, and I'll go with your suggestion,
i.e., getting rid of oom_reaping_in_progress().

> > Without it, overkills can be easily reproduced by the following simple
> > script. That is, additional oom kills happen to processes other than
> > "tail".
> >
> > # enable zram
> > while true;
> > do
> > tail /dev/zero
> > done
>
> I would be interested to hear more (care to send oom reports?).

I agree with what's said below. I think those additional ooms might have
been from different oom domains. I plan to leave this for now and go
with your suggestion as mentioned above.

> > > out_of_memory is designed to skip over any action if there is an oom
> > > victim pending from the oom domain (have a look at oom_evaluate_task).
> >
> > Where exactly? Point me to the code please.
> >
> > I don't see such logic inside out_of_memory() or
> > oom_evaluate_task(). Currently the only thing that could remotely
> > prevent overkills is oom_lock, but it's inadequate.
>
> OK, let me try to explain. The protocol is rather convoluted. Once the
> oom killer is invoked it chooses a victim to kill. oom_evaluate_task will
> evaluate _all_ tasks from the respective oom domain (select_bad_process,
> which distinguishes memcg vs global oom kills, and oom_cpuset_eligible for
> the cpuset domains). If there is any pre-existing oom victim
> (tsk_is_oom_victim) then the scan is aborted and the oom killer bails
> out. An OOM victim stops being considered relevant once the oom reaper
> manages to release its address space (or gives up on the mmap_sem
> contention) and sets the MMF_OOM_SKIP flag for the mm.
>
> That being said, out_of_memory() automatically backs off and relies on
> the oom reaper to process its queue.
>
> Does that make it clearer now?

Yes, you are right, thanks.

> > This is the entire pipeline:
> > low on memory -> out_of_memory() -> oom_reaper() -> free memory
> >
> > To avoid overkills, we need to consider the latter half of it too.
> > oom_reaping_in_progress() is exactly for this purpose.
> >
> > > > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> > > > + unsigned long min_ttl)
> > > > +{
> > > > + bool need_aging;
> > > > + long nr_to_scan;
> > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > + int swappiness = get_swappiness(memcg);
> > > > + DEFINE_MAX_SEQ(lruvec);
> > > > + DEFINE_MIN_SEQ(lruvec);
> > > > +
> > > > + if (mem_cgroup_below_min(memcg))
> > > > + return false;
> > >
> > > mem_cgroup_below_min requires effective values to be calculated for the
> > > reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection
> >
> > I always keep that in mind, and age_lruvec() is called *after*
> > mem_cgroup_calculate_protection():
>
> > balance_pgdat()
> >     memcgs_need_aging = 0
> >     do {
> >         lru_gen_age_node()
> >             if (!memcgs_need_aging) {
> >                 memcgs_need_aging = 1
> >                 return
> >             }
> >             age_lruvec()
> >
> >         shrink_node_memcgs()
> >             mem_cgroup_calculate_protection()
> >             lru_gen_shrink_lruvec()
> >                 if ...
> >                     memcgs_need_aging = 0
> >     } while ...
>
> Uff, this is really subtle. I really think you should be following the
> existing pattern where the effective values are calculated right in the
> same context as they are evaluated.

Consider it done.
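
Concretely, something along these lines in age_lruvec() (a sketch of the
intended change, not final code):

static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
                       unsigned long min_ttl)
{
        struct mem_cgroup *memcg = lruvec_memcg(lruvec);

        /*
         * Follow the existing pattern: calculate the effective protection
         * values in the same context where they are evaluated, instead of
         * relying on shrink_node_memcgs() having done it in a previous
         * iteration of balance_pgdat().
         */
        mem_cgroup_calculate_protection(NULL, memcg);

        if (mem_cgroup_below_min(memcg))
                return false;
        ...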

2022-01-07 23:36:18

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> [...]
> > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> > +{
> > + struct mem_cgroup *memcg;
> > + bool success = false;
> > + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> > +
> > + VM_BUG_ON(!current_is_kswapd());
> > +
> > + current->reclaim_state->mm_walk = &pgdat->mm_walk;
> > +
> > + memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > + do {
> > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > +
> > + if (age_lruvec(lruvec, sc, min_ttl))
> > + success = true;
> > +
> > + cond_resched();
> > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> > +
> > + if (!success && mutex_trylock(&oom_lock)) {
> > + struct oom_control oc = {
> > + .gfp_mask = sc->gfp_mask,
> > + .order = sc->order,
> > + };
> > +
> > + if (!oom_reaping_in_progress())
> > + out_of_memory(&oc);
> > +
> > + mutex_unlock(&oom_lock);
> > + }
>
> Why do you need to trigger oom killer from this path? Why cannot you
> rely on the page allocator to do that like we do now?

This is per desktop users' (repeated) requests. They can't tolerate
thrashing as servers do because of UI lags, and they usually don't
have fancy tools like oomd.

Related discussions I saw:
https://github.com/zen-kernel/zen-kernel/issues/218
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/

From patch 8:
Personal computers
------------------
:Thrashing prevention: Write ``N`` to
``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is invoked if
this working set can't be kept in memory. Based on the average human
detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
lags due to thrashing. Larger values like ``N=3000`` make lags less
noticeable at the cost of more OOM kills.

2022-01-08 00:19:38

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 5/9] mm: multigenerational lru: mm_struct list

On Fri, Jan 07, 2022 at 10:06:15AM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:24, Yu Zhao wrote:
> > To exploit spatial locality, the aging prefers to walk page tables to
> > search for young PTEs. And this patch paves the way for that.
> >
> > An mm_struct list is maintained for each memcg, and an mm_struct
> > follows its owner task to the new memcg when this task is migrated.
>
> How does this actually work for the memcg reclaim? I can see you call
> lru_gen_migrate_mm() on task migration. My concern is, though, that
> such a task leaves all the memory behind in the previous memcg (in
> cgroup v2; in v1 you can opt in for charge migration). If you move the
> mm to a new memcg then you age it somewhere where the memory is not
> really consumed.

There are two options to gather the accessed bit: page table walks and
rmap walks. Page table walks sweep dense hotspots that are NOT
misplaced in terms of reclaim scope (lruvec); rmap walks cover what
page table walks miss, e.g., misplaced dense hotspots or sparse ones.

Dense hotspots are stored in Bloom filters for each lruvec.

If an mm leaves everything in the old memcg, page table walks in the
new memcg reclaim path basically ignore this mm after the first scan,
because everything is misplaced.

In the old memcg reclaim path, page table walks won't see this mm
at all. But rmap walks will catch everything later in the eviction
path, i.e., lru_gen_look_around(). This function is less efficient
compared with page table walks because, for each rmap walk of a
non-shared page, it can only gather the accessed bit from 64 PTEs at
most. But it's still a lot faster than the original rmap, which only
gathers the accessed bit from a single PTE, for each walk of a
non-shared page.
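
To make the comparison concrete, the look-around idea is roughly the
following (a hypothetical helper for illustration, not the actual
lru_gen_look_around() implementation):

static void look_around_sketch(pte_t *pte_table, unsigned int index)
{
        unsigned int i, start, end;

        /* the rmap walk located pte_table[index]; also test its neighbors */
        start = round_down(index, 64);
        end = min_t(unsigned int, start + 64, PTRS_PER_PTE);

        for (i = start; i < end; i++) {
                if (!pte_young(pte_table[i]))
                        continue;
                /*
                 * The page mapped here was accessed: tag it with the
                 * youngest generation number, so one rmap walk can harvest
                 * up to 64 accessed bits instead of just one.
                 */
        }
}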

2022-01-10 03:58:16

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Fri, Jan 07, 2022 at 10:00:31AM +0100, Michal Hocko wrote:
> On Fri 07-01-22 09:55:09, Michal Hocko wrote:
> [...]
> > > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> > > reclaimer; and direct reclaimers rarely come here, i.e., only when
> > > kswapd can't keep up in terms of the aging, which is similar to the
> > > condition where the inactive list is empty for the active/inactive
> > > lru.
> >
> > Well, this is not a strong argument to be honest. Kswapd being stuck
> > and the majority of the reclaim being done in the direct reclaim
> > context is a situation I have seen many many times.
>
> > Also do not forget that memcg reclaim is effectively only direct
> reclaim. Not that the memcg reclaim indicates a global memory shortage
> but it can add up and race with the global reclaim as well.

I don't dispute any of the above, and I probably don't like this code
more than you do.

But let's not forget the purposes of PF_MEMALLOC, besides preventing
recursive reclaims, include letting reclaim dip into reserves so that
it can make more free memory. So I think it's acceptable if the
following conditions are met:
1. The allocation size is small.
2. The number of allocations is bounded.
3. Its failure doesn't stall reclaim.
And it'd be nice if
4. The allocation happens rarely, e.g., slow path only.

The code in question meets all of them.

1. This allocation is 160 bytes.
2. It's bounded by the number of page table walkers which, in the
worst case, is the same as the number of mm_struct's.
3. Most importantly, its failure doesn't stall the aging. The aging
will fall back to the rmap-based function lru_gen_look_around().
But this function only gathers the accessed bit from at most 64
PTEs, meaning it's less efficient (it retains ~80% of the performance gains).
4. This allocation is rare, i.e., only when the aging is required,
which is similar to the low inactive case for the active/inactive
lru.

The bottom line is I can try various optimizations, e.g., preallocating
a few buffers for a limited number of page walkers and, if this number
has been reached, falling back to the rmap-based function. But I have yet
to see evidence that calls for the additional complexity.
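
For what it's worth, a minimal sketch of such a preallocation scheme
(hypothetical names, not something in this patchset):

#define NR_MM_WALK_BUFS 4

static struct lru_gen_mm_walk mm_walk_bufs[NR_MM_WALK_BUFS];
static unsigned long mm_walk_busy;      /* bitmap of buffers in use */

static struct lru_gen_mm_walk *get_mm_walk_buf(void)
{
        int i;

        for (i = 0; i < NR_MM_WALK_BUFS; i++) {
                if (!test_and_set_bit(i, &mm_walk_busy))
                        return memset(&mm_walk_bufs[i], 0,
                                      sizeof(mm_walk_bufs[i]));
        }
        /* pool exhausted: the caller falls back to lru_gen_look_around() */
        return NULL;
}

static void put_mm_walk_buf(struct lru_gen_mm_walk *walk)
{
        clear_bit(walk - mm_walk_bufs, &mm_walk_busy);
}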

2022-01-10 04:48:08

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> [...]
> > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > +{
> > + static const struct mm_walk_ops mm_walk_ops = {
> > + .test_walk = should_skip_vma,
> > + .p4d_entry = walk_pud_range,
> > + };
> > +
> > + int err;
> > +#ifdef CONFIG_MEMCG
> > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +#endif
> > +
> > + walk->next_addr = FIRST_USER_ADDRESS;
> > +
> > + do {
> > + unsigned long start = walk->next_addr;
> > + unsigned long end = mm->highest_vm_end;
> > +
> > + err = -EBUSY;
> > +
> > + rcu_read_lock();
> > +#ifdef CONFIG_MEMCG
> > + if (memcg && atomic_read(&memcg->moving_account))
> > + goto contended;
> > +#endif
> > + if (!mmap_read_trylock(mm))
> > + goto contended;
>
> Have you evaluated the behavior under mmap_sem contention? I mean what
> would be an effect of some mms being excluded from the walk? This path
> is called from direct reclaim and we do allocate with exclusive mmap_sem
> IIRC and the trylock can fail in a presence of pending writer if I am
> not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> can bypass the walk.

You are right. Here it must be a trylock; otherwise it can deadlock.

I think there might be a misunderstanding: the aging doesn't
exclusively rely on page table walks to gather the accessed bit. It
prefers page table walks but it can also fall back to the rmap-based
function, i.e., lru_gen_look_around(), which only gathers the accessed
bit from at most 64 PTEs and therefore is less efficient. But it still
retains about 80% of the performance gains.

> Or is this considered statistically insignificant thus a theoretical
> problem?

Yes. People who work on the maple tree and SPF at Google expressed the
same concern during the design review meeting (all stakeholders on the
mailing list were also invited). So we had a counter to monitor the
contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here:
https://lore.kernel.org/lkml/[email protected]/

And we also combined this patchset with the SPF patchset to see if the
latter makes any difference. Our conclusion was the contention is
statistically insignificant to the performance under memory pressure.

This can be explained by how often we create a new generation. (We
only walk page tables when we create a new generation. And it's
similar to the low inactive condition for the active/inactive lru.)

Usually we only do so every few seconds. We'd run into problems with
other parts of the kernel, e.g., lru lock contention, i/o congestion,
etc., if we create more than a few generations every second.

2022-01-10 10:27:39

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

Hi,

On Tue, Jan 04, 2022 at 01:22:27PM -0700, Yu Zhao wrote:
> Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.
>
> Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention.
> Compared with the size-based approach, e.g., [1], this time-based
> approach has the following advantages:
> 1) It's easier to configure because it's agnostic to applications and
> memory sizes.
> 2) It's more reliable because it's directly wired to the OOM killer.
>
> Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> reclaim. Compared with the page table-based approach and the PFN-based
> approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
> the following advantages:
> 1) It offers better choices because it's aware of memcgs, NUMA nodes,
> shared mappings and unmapped page cache.
> 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas
> the PFN-based approach is O(nr_total_pages).
>
> Add /sys/kernel/debug/lru_gen_full for debugging.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Yu Zhao <[email protected]>
> Tested-by: Konstantin Kharlamov <[email protected]>
> ---
> Documentation/vm/index.rst | 1 +
> Documentation/vm/multigen_lru.rst | 62 +++++

The description of user visible interfaces should go to
Documentation/admin-guide/mm

Documentation/vm/multigen_lru.rst should contain the design description
and the implementation details, and it would be great to actually have
such a document.

> include/linux/nodemask.h | 1 +
> mm/vmscan.c | 415 ++++++++++++++++++++++++++++++
> 4 files changed, 479 insertions(+)
> create mode 100644 Documentation/vm/multigen_lru.rst
>
> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index 6f5ffef4b716..f25e755b4ff4 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the
> unevictable-lru
> z3fold
> zsmalloc
> + multigen_lru
> diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
> new file mode 100644
> index 000000000000..6f9e0181348b
> --- /dev/null
> +++ b/Documentation/vm/multigen_lru.rst
> @@ -0,0 +1,62 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigenerational LRU
> +=====================
> +
> +Quick start
> +===========
> +Runtime configurations
> +----------------------
> +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enabled`` if the
> + feature wasn't enabled by default.

Required for what? This sentence seems to lack context. Maybe add an
overview of what the Multigenerational LRU is so that users will have an
idea of what these knobs control.

> +
> +Recipes
> +=======

Some more context here will be also helpful.

> +Personal computers
> +------------------
> +:Thrashing prevention: Write ``N`` to
> + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> + ``N`` milliseconds from getting evicted. The OOM killer is invoked if
> + this working set can't be kept in memory. Based on the average human
> + detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
> + lags due to thrashing. Larger values like ``N=3000`` make lags less
> + noticeable at the cost of more OOM kills.
> +
> +Data centers
> +------------
> +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> + format:
> + ::
> +
> + memcg memcg_id memcg_path
> + node node_id
> + min_gen birth_time anon_size file_size
> + ...
> + max_gen birth_time anon_size file_size
> +
> + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> + youngest generation number. ``birth_time`` is in milliseconds.
> + ``anon_size`` and ``file_size`` are in pages.

And what do the oldest and youngest generations mean from the user's
perspective?

> +
> + This file also accepts commands in the following subsections.
> + Multiple command lines are supported, so does concatenation with
> + delimiters ``,`` and ``;``.
> +
> + ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
> + debugging.
> +
> +:Working set estimation: Write ``+ memcg_id node_id max_gen
> + [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to trigger
> + the aging. It scans PTEs for accessed pages and promotes them to the
> + youngest generation ``max_gen``. Then it creates a new generation
> + ``max_gen+1``. Set ``can_swap`` to 1 to scan for accessed anon pages
> + when swap is off. Set ``full_scan`` to 0 to reduce the overhead as
> + well as the coverage when scanning PTEs.
> +
> +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
> + [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to trigger the
> + eviction. It evicts generations less than or equal to ``min_gen``.
> + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
> + ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
> + ``nr_to_reclaim`` to limit the number of pages to evict.

...

--
Sincerely yours,
Mike.

2022-01-10 10:54:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > [...]
> > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > +{
> > > + static const struct mm_walk_ops mm_walk_ops = {
> > > + .test_walk = should_skip_vma,
> > > + .p4d_entry = walk_pud_range,
> > > + };
> > > +
> > > + int err;
> > > +#ifdef CONFIG_MEMCG
> > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > +#endif
> > > +
> > > + walk->next_addr = FIRST_USER_ADDRESS;
> > > +
> > > + do {
> > > + unsigned long start = walk->next_addr;
> > > + unsigned long end = mm->highest_vm_end;
> > > +
> > > + err = -EBUSY;
> > > +
> > > + rcu_read_lock();
> > > +#ifdef CONFIG_MEMCG
> > > + if (memcg && atomic_read(&memcg->moving_account))
> > > + goto contended;
> > > +#endif
> > > + if (!mmap_read_trylock(mm))
> > > + goto contended;
> >
> > Have you evaluated the behavior under mmap_sem contention? I mean what
> > would be an effect of some mms being excluded from the walk? This path
> > is called from direct reclaim and we do allocate with exclusive mmap_sem
> > IIRC and the trylock can fail in a presence of pending writer if I am
> > not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> > can bypass the walk.
>
> You are right. Here it must be a trylock; otherwise it can deadlock.

Yeah, this is clear.

> I think there might be a misunderstanding: the aging doesn't
> exclusively rely on page table walks to gather the accessed bit. It
> prefers page table walks but it can also fall back to the rmap-based
> function, i.e., lru_gen_look_around(), which only gathers the accessed
> bit from at most 64 PTEs and therefore is less efficient. But it still
> retains about 80% of the performance gains.

I have to say that I have a really hard time understanding the runtime
behavior depending on that interaction. How does the reclaim behave when
the virtual scan is enabled, partially enabled or almost completely
disabled due to different constraints? I do not see any such
evaluation described in the changelogs, and I consider this to be rather
important information for judging the overall behavior.

> > Or is this considered statistically insignificant thus a theoretical
> > problem?
>
> Yes. People who work on the maple tree and SPF at Google expressed the
> same concern during the design review meeting (all stakeholders on the
> mailing list were also invited). So we had a counter to monitor the
> contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here:
> https://lore.kernel.org/lkml/[email protected]/
>
> And we also combined this patchset with the SPF patchset to see if the
> latter makes any difference. Our conclusion was the contention is
> statistically insignificant to the performance under memory pressure.
>
> This can be explained by how often we create a new generation. (We
> only walk page tables when we create a new generation. And it's
> similar to the low inactive condition for the active/inactive lru.)
>
> Usually we only do so every few seconds. We'd run into problems with
> other parts of the kernel, e.g., lru lock contention, i/o congestion,
> etc., if we create more than a few generations every second.

This would be very good information to have in the changelogs, ideally
with some numbers and analysis.

--
Michal Hocko
SUSE Labs

2022-01-10 14:37:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Sun 09-01-22 20:58:02, Yu Zhao wrote:
> On Fri, Jan 07, 2022 at 10:00:31AM +0100, Michal Hocko wrote:
> > On Fri 07-01-22 09:55:09, Michal Hocko wrote:
> > [...]
> > > > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> > > > reclaimer; and direct reclaimers rarely come here, i.e., only when
> > > > kswapd can't keep up in terms of the aging, which is similar to the
> > > > condition where the inactive list is empty for the active/inactive
> > > > lru.
> > >
> > > Well, this is not a strong argument to be honest. Kswapd being stuck
> > > and the majority of the reclaim being done in the direct reclaim
> > > context is a situation I have seen many many times.
> >
> > Also do not forget that memcg reclaim is effectively only direct
> > reclaim. Not that the memcg reclaim indicates a global memory shortage
> > but it can add up and race with the global reclaim as well.
>
> I don't dispute any of the above, and I probably don't like this code
> more than you do.
>
> But let's not forget the purposes of PF_MEMALLOC, besides preventing
> recursive reclaims, include letting reclaim dip into reserves so that
> it can make more free memory. So I think it's acceptable if the
> following conditions are met:
> 1. The allocation size is small.
> 2. The number of allocations is bounded.
> 3. Its failure doesn't stall reclaim.
> And it'd be nice if
> 4. The allocation happens rarely, e.g., slow path only.

I would add
0. The allocation should be done only if absolutely _necessary_.

Please keep in mind that whatever you allocate from that context will be
consuming very precious memory reserves which are shared with other
components of the system. Even worse, these can go all the way to
depleting memory completely, at which point other things can fall apart.

> The code in question meets all of them.
>
> 1. This allocation is 160 bytes.
> 2. It's bounded by the number of page table walkers which, in the
> worst case, is the same as the number of mm_struct's.
> 3. Most importantly, its failure doesn't stall the aging. The aging
> will fall back to the rmap-based function lru_gen_look_around().
> But this function only gathers the accessed bit from at most 64
> PTEs, meaning it's less efficient (it retains ~80% of the performance gains).
> 4. This allocation is rare, i.e., only when the aging is required,
> which is similar to the low inactive case for the active/inactive
> lru.

I think this fallback behavior deserves much more detailed explanation
in changelogs.

> The bottom line is I can try various optimizations, e.g., preallocating
> a few buffers for a limited number of page walkers and, if this number
> has been reached, falling back to the rmap-based function. But I have yet
> to see evidence that calls for the additional complexity.

I would disagree here. This is not an optimization. You should be
avoiding allocations from the memory reclaim because any allocation just
adds runtime complexity and potential corner cases.
--
Michal Hocko
SUSE Labs

2022-01-10 14:49:51

by Alexey Avramov

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

Note that with vm.swappiness=0, the vm.watermark_scale_factor value does
not affect whether swapping is possible: MGLRU ignores sc->file_is_tiny.

With the classic 2-gen LRU, swapping works as expected at swappiness=0
and a high vm.watermark_scale_factor, which is consistent with the
documentation:
"At 0, the kernel will not initiate swap until the amount of free and
file-backed pages is less than the high watermark in a zone." [1]

With MGLRU, however, no swapping occurs at vm.swappiness=0, regardless
of the vm.watermark_scale_factor value.

In practice (with MGLRU v3), I have seen swapping become impossible
when vm.swappiness=0 and vm.watermark_scale_factor=1000.

At a minimum, this will require updating the documentation for
vm.swappiness.

BTW, why doesn't MGLRU use something like sc->file_is_tiny?

[1] https://github.com/torvalds/linux/blob/v5.16/Documentation/admin-guide/sysctl/vm.rst#swappiness
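
For context, a rough paraphrase of what the classic path does with
sc->file_is_tiny (simplified, not the exact mm/vmscan.c code):

/* in prepare_scan_count(), roughly: file LRUs are "tiny" when free plus
 * file-backed pages fall below the node's high watermarks */
sc->file_is_tiny = file + free <= total_high_wmark;

/* in get_scan_count(), roughly: in that case anon scanning is forced
 * even at swappiness == 0, which is why a large watermark_scale_factor
 * can make the classic LRU swap */
if (sc->file_is_tiny) {
        scan_balance = SCAN_ANON;
        goto out;
}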

2022-01-10 15:02:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > +{
> > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);

One thing I have overlooked completely. You cannot really use GFP_KERNEL
allocation here because the reclaim context can be constrained (e.g.
GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC
but I suspect that the lockdep will complain anyway.

Also kvmalloc is not really great here. a) vmalloc path is never
executed for small objects and b) we do not really want to make a
dependency between vmalloc and the reclaim (by vmalloc -> reclaim ->
vmalloc).

Even if we rule out vmalloc and look at kmalloc alone, is this really
safe? I do not see any recursion prevention in the SL.B code. Maybe this
just happens to work, but the dependency should really be documented so
that future SL.B changes won't break the whole scheme.
--
Michal Hocko
SUSE Labs

2022-01-10 15:21:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 5/9] mm: multigenerational lru: mm_struct list

On Fri 07-01-22 17:19:28, Yu Zhao wrote:
> On Fri, Jan 07, 2022 at 10:06:15AM +0100, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:24, Yu Zhao wrote:
> > > To exploit spatial locality, the aging prefers to walk page tables to
> > > search for young PTEs. And this patch paves the way for that.
> > >
> > > An mm_struct list is maintained for each memcg, and an mm_struct
> > > follows its owner task to the new memcg when this task is migrated.
> >
> > How does this actually work for the memcg reclaim? I can see you call
> > lru_gen_migrate_mm() on task migration. My concern is, though, that
> > such a task leaves all the memory behind in the previous memcg (in
> > cgroup v2; in v1 you can opt in for charge migration). If you move the
> > mm to a new memcg then you age it somewhere where the memory is not
> > really consumed.
>
> There are two options to gather the accessed bit: page table walks and
> rmap walks. Page table walks sweep dense hotspots that are NOT
> misplaced in terms of reclaim scope (lruvec); rmap walks cover what
> page table walks miss, e.g., misplaced dense hotspots or sparse ones.
>
> Dense hotspots are stored in Bloom filters for each lruvec.
>
> If an mm leaves everything in the old memcg, page table walks in the
> new memcg reclaim path basically ignore this mm after the first scan,
> because everything is misplaced.

OK, so do I get it right that pages mapped from a different memcg than
the reclaimed one are considered effectively non-present from the
reclaim logic's POV? This would be worth mentioning in the migration
callback because it is not really that straightforward to put those two
together.

> In the old memcg reclaim path, page table walks won't see this mm
> at all. But rmap walks will catch everything later in the eviction
> path, i.e., lru_gen_look_around(). This function is less efficient
> compared with page table walks because, for each rmap walk of a
> non-shared page, it can only gather the accessed bit from 64 PTEs at
> most. But it's still a lot faster than the original rmap, which only
> gathers the accessed bit from a single PTE, for each walk of a
> non-shared page.

Again, something that should really be documented.
--
Michal Hocko
SUSE Labs

2022-01-10 15:35:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Fri 07-01-22 16:36:11, Yu Zhao wrote:
> On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > [...]
> > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> > > +{
> > > + struct mem_cgroup *memcg;
> > > + bool success = false;
> > > + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> > > +
> > > + VM_BUG_ON(!current_is_kswapd());
> > > +
> > > + current->reclaim_state->mm_walk = &pgdat->mm_walk;
> > > +
> > > + memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > > + do {
> > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > +
> > > + if (age_lruvec(lruvec, sc, min_ttl))
> > > + success = true;
> > > +
> > > + cond_resched();
> > > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> > > +
> > > + if (!success && mutex_trylock(&oom_lock)) {
> > > + struct oom_control oc = {
> > > + .gfp_mask = sc->gfp_mask,
> > > + .order = sc->order,
> > > + };
> > > +
> > > + if (!oom_reaping_in_progress())
> > > + out_of_memory(&oc);
> > > +
> > > + mutex_unlock(&oom_lock);
> > > + }
> >
> > Why do you need to trigger oom killer from this path? Why cannot you
> > rely on the page allocator to do that like we do now?
>
> This is per desktop users' (repeated) requests. They can't tolerate
> thrashing as servers do because of UI lags, and they usually don't
> have fancy tools like oomd.
>
> Related discussions I saw:
> https://github.com/zen-kernel/zen-kernel/issues/218
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/

I do not really see any argument why userspace-based thrashing
detection cannot be used for those. Could you clarify?

Also, my question was pointing to why out_of_memory() is called from the
reclaim rather than from the allocator (the memcg charging path). It is
up to the caller of the reclaim to control the different reclaim
strategies and to tell when all hope is lost and the oom killer should be
invoked. This allows for different policies at the allocator level, and
this change will break that AFAICS. E.g. what if the underlying
allocation context is __GFP_NORETRY?

> From patch 8:
> Personal computers
> ------------------
> :Thrashing prevention: Write ``N`` to
> ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> ``N`` milliseconds from getting evicted. The OOM killer is invoked if
> this working set can't be kept in memory. Based on the average human
> detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
> lags due to thrashing. Larger values like ``N=3000`` make lags less
> noticeable at the cost of more OOM kills.

This is a very good example of something that should be a self-contained
patch with its own justification. TBH it is really not all that clear to
me that we want to provide any user-visible knob to control OOM behavior
based on a time-based QoS.

--
Michal Hocko
SUSE Labs

2022-01-10 15:39:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Fri 07-01-22 11:45:40, Yu Zhao wrote:
[...]
> Next, I argue that the benefits of this patchset outweigh its risks,
> because, drawing from my past experience,
> 1. There have been many larger and/or riskier patchsets taken; I'll
> assemble a list if you disagree.

No question about that. Changes in the reclaim path are paved with
failures and reverts and fine tuning on top of existing fine tuning.
The difference from your patchset is that they tend to be much, much
smaller and more incremental, and are therefore easier to review.

> And this patchset is fully guarded
> by #ifdef; Linus has also weighed in on this point.

I appreciate you made the new behavior an opt-in and therefore existing
workloads are less likely to regress. I do not think ifdefs help
all that much, though, because a) realistically the config will
likely be enabled for most distribution kernels and b) the parallel
reclaim implementation adds a maintenance overhead regardless of those
ifdef. The later point is especially worrying because the memory reclaim
is a complex and hard to review beast already. Any future changes would
need to consider both reclaim algorithms of course.

Hence I argue we really need a wider consensus this is the right
direction we want to pursue.

> 2. There have been none that came with the testing/benchmarking
> coverage as this one did. Please point me to some if I'm mistaken,
> and I'll gladly match them.

I do appreciate your numbers but you should realize that this is an area
that is really hard to get any conclusive testing for. We keep learning
about fallouts on workloads we haven't really anticipated or where the
runtime effects happen to disagree with our intuition. So while those
numbers are nice there are other important aspects to consider like the
maintenance cost for example.

> The numbers might not materialize in the real world; the code is not
> perfect; and many other risks... But all the top eight open source
> memory hogs were covered, which is unprecedented; memcached and fio
> showed significant improvements and it only takes a few commands to
> see for yourselves.
>
> Regarding the acks and the reviewed-bys, I certainly can ask people
> who have reaped the benefits of this patchset to do them, if it's
> required. But I see less fun in that. I prefer to provide empirical
> evidence and convince people who are on the other side of the aisle.

I like to hear from users who benefit from your work and that certainly
gives more credit to it. But it will be the MM community that maintains the
code and addresses future issues.

We do not have a dedicated maintainer for the memory reclaim but
certainly there are people who have helped shape the existing code and
have learned a lot from past issues - like Johannes, Rik and Mel, just
to name a few. If I were you I would really be looking into finding an
agreement with them. I myself can help you with the memcg and oom side of
things (we already have discussions about those).

Thanks!
--
Michal Hocko
SUSE Labs

2022-01-10 16:01:11

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On 1/10/22 16:01, Michal Hocko wrote:
> On Thu 06-01-22 17:12:18, Michal Hocko wrote:
>> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
>> > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
>> > +{
>> > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
>> > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
>
> One thing I have overlooked completely. You cannot really use GFP_KERNEL
> allocation here because the reclaim context can be constrained (e.g.
> GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC
> but I suspect that the lockdep will complain anyway.
>
> Also kvmalloc is not really great here. a) vmalloc path is never
> executed for small objects and b) we do not really want to make a
> dependency between vmalloc and the reclaim (by vmalloc -> reclaim ->
> vmalloc).
>
> Even if we rule out vmalloc and look at kmalloc alone, is this really
> safe? I do not see any recursion prevention in the SL.B code. Maybe this
> just happens to work but the dependency should be really documented so
> that future SL.B changes won't break the whole scheme.

Slab implementations drop all locks before calling into page allocator (thus
possibly reclaim) so slab itself should be fine and I don't expect it to
change. But we could eventually reach the page allocator recursively again,
that's true and not great.
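
For context, a minimal sketch of why the nested allocation does not re-enter
reclaim itself, as noted earlier in the thread (illustrative only, not the
patch's code): reclaim runs with PF_MEMALLOC set, typically via
memalloc_noreclaim_save()/restore() from <linux/sched/mm.h>, and the page
allocator skips direct reclaim (and may dip into reserves) for allocations
made while that flag is set. The helper name below is hypothetical.

#include <linux/sched/mm.h>
#include <linux/slab.h>

static void *alloc_from_reclaim_context(size_t size, gfp_t gfp)
{
        unsigned int noreclaim_flag;
        void *ptr;

        noreclaim_flag = memalloc_noreclaim_save();  /* sets PF_MEMALLOC */
        ptr = kzalloc(size, gfp);      /* won't recurse into direct reclaim */
        memalloc_noreclaim_restore(noreclaim_flag);

        return ptr;
}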

2022-01-10 16:25:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Mon 10-01-22 17:01:07, Vlastimil Babka wrote:
> On 1/10/22 16:01, Michal Hocko wrote:
> > On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> >> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> >> > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> >> > +{
> >> > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> >> > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> >
> > One thing I have overlooked completely. You cannot really use GFP_KERNEL
> > allocation here because the reclaim context can be constrained (e.g.
> > GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC
> > but I suspect that the lockdep will complain anyway.
> >
> > Also kvmalloc is not really great here. a) vmalloc path is never
> > executed for small objects and b) we do not really want to make a
> > dependency between vmalloc and the reclaim (by vmalloc -> reclaim ->
> > vmalloc).
> >
> > Even if we rule out vmalloc and look at kmalloc alone, is this really
> > safe? I do not see any recursion prevention in the SL.B code. Maybe this
> > just happens to work but the dependency should be really documented so
> > that future SL.B changes won't break the whole scheme.
>
> Slab implementations drop all locks before calling into page allocator (thus
> possibly reclaim) so slab itself should be fine and I don't expect it to
> change. But we could eventually reach the page allocator recursively again,
> that's true and not great.

Thanks for double checking. If recursion is really intended and
something SL.B allocators should support then this is definitely worth
documenting so that a subtle change won't break in the future.

--
Michal Hocko
SUSE Labs

2022-01-10 16:57:45

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Tue 04-01-22 13:22:25, Yu Zhao wrote:
[...]
> +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> +{
> + static const struct mm_walk_ops mm_walk_ops = {
> + .test_walk = should_skip_vma,
> + .p4d_entry = walk_pud_range,
> + };
> +
> + int err;
> +#ifdef CONFIG_MEMCG
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +#endif
> +
> + walk->next_addr = FIRST_USER_ADDRESS;
> +
> + do {
> + unsigned long start = walk->next_addr;
> + unsigned long end = mm->highest_vm_end;
> +
> + err = -EBUSY;
> +
> + rcu_read_lock();
> +#ifdef CONFIG_MEMCG
> + if (memcg && atomic_read(&memcg->moving_account))
> + goto contended;
> +#endif

Why do you need to check for moving_account?
--
Michal Hocko
SUSE Labs

2022-01-10 22:05:09

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Mon, Jan 10, 2022 at 04:39:51PM +0100, Michal Hocko wrote:
> On Fri 07-01-22 11:45:40, Yu Zhao wrote:
> [...]
> > Next, I argue that the benefits of this patchset outweigh its risks,
> > because, drawing from my past experience,
> > 1. There have been many larger and/or riskier patchsets taken; I'll
> > assemble a list if you disagree.
>
> No question about that. Changes in the reclaim path are paved with
> failures and reverts and fine tuning on top of existing fine tuning.
> The difference from your patchset is that they tend to be much, much
> smaller and incremental, and therefore easier to review.

No argument here.

> > And this patchset is fully guarded
> > by #ifdef; Linus has also weighed in on this point.
>
> I appreciate you made the new behavior an opt-in and therefore existing
> workloads are less likely to regress. I do not think ifdefs help
> all that much, though, because a) realistically the config will
> likely be enabled for most distribution kernels

There is also a runtime kill switch.

> b) the parallel
> reclaim implementation adds a maintenance overhead regardless of those
> ifdefs. The latter point is especially worrying because the memory reclaim
> is a complex and hard-to-review beast already. Any future changes would
> need to consider both reclaim algorithms of course.

A perfectly legitimate concern.

If this patchset is taken:
1. There will be refactoring that makes the long-term maintenance as
affordable as possible, i.e., similar to the SL.B model, but with a
runtime switch as well.
2. There will also be optimizations for the mmu notifier (KVM), THP, etc.
3. Most importantly, Google will be committing more resources to this.
And that's why we need to hear a decision -- our resource planning
depends on it.

> Hence I argue we really need a wider consensus this is the right
> direction we want to pursue.

We've been doing our best to get this consensus -- we invited all
the stakeholders to meetings a long time ago -- but unfortunately we
couldn't move the needle.

I agree consensus is important. But, IMO, progress is even more
important. And personally, I'd rather try something wrong than do
nothing.

> > 2. There have been none that came with the testing/benchmarking
> > coverage as this one did. Please point me to some if I'm mistaken,
> > and I'll gladly match them.
>
> I do appreciate your numbers but you should realize that this is an area
> that is really hard to get any conclusive testing for.

Fully agreed. That's why we started a new initiative, and we hope more
people will follow these practices:
1. All results in this area should be reported with at least standard
deviations, or preferably confidence intervals.
2. Real applications should be benchmarked (with synthetic load
generators), not just synthetic benchmarks.
3. A wide range of devices should be covered, i.e., servers, desktops,
laptops and phones.

I'm very confident in saying our benchmark reports were held to the
highest standards. We have worked with MariaDB (company), EnterpriseDB
(Postgres), Redis (company), etc. on these reports. They have copies
of these reports (PDF version):
https://linux-mm.googlesource.com/benchmarks/

We welcome any expert in those applications to examine our reports,
and we'll be happy to run any other benchmarks, or the same benchmarks with
different configurations, that anybody thinks are important and we've
missed.

> We keep learning
> about fallouts on workloads we haven't really anticipated or where the
> runtime effects happen to disagree with our intuition. So while those
> numbers are nice there are other important aspects to consider like the
> maintenance cost for example.

I assume we agree this is not an easy decision. Can I also assume we
agree that this decision should be made within a reasonable time frame?

> > The numbers might not materialize in the real world; the code is not
> > perfect; and many other risks... But all the top eight open source
> > memory hogs were covered, which is unprecedented; memcached and fio
> > showed significant improvements and it only takes a few commands to
> > see for yourselves.
> >
> > Regarding the acks and the reviewed-bys, I certainly can ask people
> > who have reaped the benefits of this patchset to do them, if it's
> > required. But I see less fun in that. I prefer to provide empirical
> > evidence and convince people who are on the other side of the aisle.
>
> I like to hear from users who benefit from your work and that certainly
> gives more credit to it. But it will be the MM community that maintains the
> code and addresses future issues.

I'll ask downstream kernel maintainers (from different distros) who have
taken this patchset to ack it.

I'll ask credible testers who are professionals, researchers and
contributors to other subsystems to provide Tested-by's. There are many
other individual testers whose efforts I may not be able to acknowledge;
e.g., my coworker just sent this to me:

"Using that v5 for some time and confirm that difference under heavy
load and memory pressure is significant."
https://www.phoronix.com/forums/forum/software/general-linux-open-source/1301258-mglru-is-a-very-enticing-enhancement-for-linux-in-2022#post1301275

I'll leave the reviews in your capable hands. As I said, I prefer to
convince people with empirical evidence.

> We do not have a dedicated maintainer for the memory reclaim but
> certainly there are people who have helped shape the existing code and
> have learned a lot from past issues - like Johannes, Rik and Mel, just
> to name a few. If I were you I would really be looking into finding an
> agreement with them. I myself can help you with the memcg and oom side of
> things (we already have discussions about those).

Unfortunately people have different priorities. As I said, we tried
to get all the stakeholders in the same (conference) room so that we
can make some good progress. But we failed.

Rest assured, we'll keep trying. But please understand we need to do
cost control and therefore we can't keep investing in this effort
forever. So I think it's not unreasonable, after I've addressed all
pending comments, to ask for some clear instructions from the
leadership:
Yes
No
Or something specific

Thanks!

2022-01-10 22:46:22

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

> > > 2. There have been none that came with the testing/benchmarking
> > > coverage as this one did. Please point me to some if I'm mistaken,
> > > and I'll gladly match them.
> >
> > I do appreciate your numbers but you should realize that this is an area
> > that is really hard to get any conclusive testing for.
>
> Fully agreed. That's why we started a new initiative, and we hope more
> people will follow these practices:
> 1. All results in this area should be reported with at least standard
> deviations, or preferably confidence intervals.
> 2. Real applications should be benchmarked (with synthetic load
> generators), not just synthetic benchmarks.
> 3. A wide range of devices should be covered, i.e., servers, desktops,
> laptops and phones.
>
> I'm very confident in saying our benchmark reports were held to the
> highest standards. We have worked with MariaDB (company), EnterpriseDB
> (Postgres), Redis (company), etc. on these reports. They have copies
> of these reports (PDF version):
> https://linux-mm.googlesource.com/benchmarks/
>
> We welcome any expert in those applications to examine our reports,
> and we'll be happy to run any other benchmarks, or the same benchmarks with
> different configurations, that anybody thinks are important and we've
> missed.

I really think this gets at the heart of the issue with mm
development, and is one of the reasons it's been extra frustrating to
not have an MM conf for the past couple of years; I think sorting out
how we measure & proceed on changes would be easier done f2f. E.g.
concluding with a consensus that if something doesn't regress on X, Y,
and Z, and has reasonably maintainable and readable code, we should
merge it and try it out.

But since f2f isn't an option until 2052 at the earliest...

I understand the desire for an "incremental approach that gets us from
A->B". In the abstract it sounds great. However, with a change like
this one, I think it's highly likely that such a path would be
littered with regressions both large and small, and would probably be
more difficult to reason about than the relatively clean design of
MGLRU. On top of that, I don't think we'll get the kind of user
feedback we need for something like this *without* merging it. Yu has
done a tremendous job collecting data here (and the results are really
incredible), but I think we can all agree that without extensive
testing in the field with all sorts of weird codes, we're not going to
find the problematic behaviors we're concerned about.

So unless we want to eschew big mm changes entirely (we shouldn't!
look at net or scheduling for how important big rewrites are to
progress), I think we should be open to experimenting with new stuff.
We can always revert if things get too unwieldy.

None of this is to say that there may not be lots more comments on the
code or potential fixes/changes to incorporate before merging; I'm
mainly arguing about the mindset we should have to changes like this,
not all the stuff the community is already really good at (i.e.
testing and reviewing code on a nuts & bolts level).

Thanks,
Jesse

2022-01-11 01:19:04

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Mon, Jan 10, 2022 at 04:35:46PM +0100, Michal Hocko wrote:
> On Fri 07-01-22 16:36:11, Yu Zhao wrote:
> > On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote:
> > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > [...]
> > > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> > > > +{
> > > > + struct mem_cgroup *memcg;
> > > > + bool success = false;
> > > > + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> > > > +
> > > > + VM_BUG_ON(!current_is_kswapd());
> > > > +
> > > > + current->reclaim_state->mm_walk = &pgdat->mm_walk;
> > > > +
> > > > + memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > > > + do {
> > > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > > +
> > > > + if (age_lruvec(lruvec, sc, min_ttl))
> > > > + success = true;
> > > > +
> > > > + cond_resched();
> > > > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> > > > +
> > > > + if (!success && mutex_trylock(&oom_lock)) {
> > > > + struct oom_control oc = {
> > > > + .gfp_mask = sc->gfp_mask,
> > > > + .order = sc->order,
> > > > + };
> > > > +
> > > > + if (!oom_reaping_in_progress())
> > > > + out_of_memory(&oc);
> > > > +
> > > > + mutex_unlock(&oom_lock);
> > > > + }
> > >
> > > Why do you need to trigger oom killer from this path? Why cannot you
> > > rely on the page allocator to do that like we do now?
> >
> > This is per desktop users' (repeated) requests. They can't tolerate
> > thrashing as servers do because of UI lags; and they usually don't
> > have fancy tools like oomd.
> >
> > Related discussions I saw:
> > https://github.com/zen-kernel/zen-kernel/issues/218
> > https://lore.kernel.org/lkml/[email protected]/
> > https://lore.kernel.org/lkml/[email protected]/
> > https://lore.kernel.org/lkml/[email protected]/
> > https://lore.kernel.org/lkml/[email protected]/
>
> I do not really see any arguments why a userspace-based thrashing
> detection cannot be used for those. Could you clarify?

It definitely can be done. But who is going to do it for every distro
and all individual users? AFAIK, not a single distro provides such a
solution for desktop/laptop/phone users.

There is also the theoretical question of how reliable a userspace
solution can be. What if this userspace solution itself gets stuck in
the direct reclaim path? I'm not sure if anybody has done the research to
prove or debunk it.

In addition, exactly which PSI values should be used on different
models of consumer electronics? Nobody knows. We have a team working
on this and we haven't figured it out for all our Chromebook models.

As Andrew said, "a blunt instrument like this would be useful".
https://lore.kernel.org/lkml/[email protected]/

I'd like to have less code in kernel too, but I've learned never to
walk over users. If I remove this and they come after me asking why,
I'd have a hard time convincing them.

> Also my question was pointing to why out_of_memory is called from the
> reclaim rather than the allocator (memcg charging path). It is the
> caller of the reclaim that controls different reclaim strategies and tells
> when all hope is lost and the oom killer should be invoked. This
> allows for different policies at the allocator level, and this change
> will break that AFAICS. E.g. what if the underlying allocation context
> is __GFP_NORETRY?

This is called in kswapd only, and by default (min_ttl=0) it doesn't
do anything. So __GFP_NORETRY doesn't apply. The question would be
more along the lines of long-term ABI support.

And I'll add the following comments, if you think we can keep this
logic:
OOM kill if every generation from all memcgs is younger than min_ttl.
Another theoretical possibility is all memcgs are either below min or
ineligible at priority 0, but this isn't the main goal.

(Please read my reply at the bottom to decide whether we should keep
it or not. Thanks.)
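
To make that proposed comment concrete, here is a minimal hypothetical
sketch of the check being described (field names follow the lru_gen_struct
quoted elsewhere in this thread; the helper name and exact layout are
assumptions for illustration, not the v6 code):

/* Hypothetical: does this lruvec have a generation older than min_ttl? */
static bool lruvec_has_old_gen(struct lruvec *lruvec, unsigned long min_ttl)
{
        struct lru_gen_struct *lrugen = &lruvec->lrugen;
        /* index of the oldest file generation; seq wraps around MAX_NR_GENS */
        int gen = READ_ONCE(lrugen->min_seq[1]) % MAX_NR_GENS;
        unsigned long birth = READ_ONCE(lrugen->timestamps[gen]);

        /* true if the oldest generation was created at least min_ttl ago */
        return time_is_before_jiffies(birth + min_ttl);
}

In lru_gen_age_node() above, "success" would then mean at least one memcg
on the node has such an old generation; only when none does is the OOM
killer considered.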

> > From patch 8:
> > Personal computers
> > ------------------
> > :Thrashing prevention: Write ``N`` to
> > ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > ``N`` milliseconds from getting evicted. The OOM killer is invoked if
> > this working set can't be kept in memory. Based on the average human
> > detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
> > lags due to thrashing. Larger values like ``N=3000`` make lags less
> > noticeable at the cost of more OOM kills.
>
> This is a very good example of something that should be a self-contained
> patch with its own justification.

Consider it done.

> TBH it is really not all that clear to
> me that we want to provide any user-visible knob to control OOM behavior
> based on a time-based QoS.

Agreed, and it didn't exist until v4, i.e., after I was asked to
provide it several times.

For example:
https://github.com/zen-kernel/zen-kernel/issues/223

And another example:
Your Multigenerational LRU patchset is pretty complex and
effective, but does not eliminate thrashing condition fully on an
old PCs with slow HDD.

I'm kindly asking you to cooperate with hakavlad if it's possible
and maybe re-implement parts of le9 patch in your patchset wherever
acceptable, as they are quite similar in the core concept.

This is an excerpt of an email from [email protected], and he has
posted demo videos in this discussion:
https://lore.kernel.org/lkml/[email protected]/

2022-01-11 01:42:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Mon, Jan 10, 2022 at 2:46 PM Jesse Barnes <[email protected]> wrote:
>
> So unless we want to eschew big mm changes entirely (we shouldn't!
> look at net or scheduling for how important big rewrites are to
> progress), I think we should be open to experimenting with new stuff.

So I personally think this is worth going with, partly simply due to
the reported improvements that have been measured.

But also to a large extent because the whole notion of doing
multi-generational LRU isn't exactly some wackadoodle crazy thing. We
already do active vs inactive, the whole multi-generational thing just
doesn't seem to be so "far out".

But yes, numbers talk, and I get the feeling that we just need to try
it. Maybe not 5.17, but..

Linus

2022-01-11 08:17:50

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v6 4/9] mm: multigenerational lru: groundwork

Yu Zhao <[email protected]> writes:

.....

+
> +/*
> + * Evictable pages are divided into multiple generations. The youngest and the
> + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
> + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
> + * offset within MAX_NR_GENS, gen, indexes the lru list of the corresponding
> + * generation. The gen counter in folio->flags stores gen+1 while a page is on
> + * lrugen->lists[]. Otherwise, it stores 0.
> + *
> + * A page is added to the youngest generation on faulting. The aging needs to
> + * check the accessed bit at least twice before handing this page over to the
> + * eviction. The first check takes care of the accessed bit set on the initial
> + * fault; the second check makes sure this page hasn't been used since then.
> + * This process, AKA second chance, requires a minimum of two generations,
> + * hence MIN_NR_GENS. And to be compatible with the active/inactive lru, these
> + * two generations are mapped to the active; the rest of generations, if they
> + * exist, are mapped to the inactive. PG_active is always cleared while a page
> + * is on lrugen->lists[] so that demotion, which happens consequently when the
> + * aging creates a new generation, needs not to worry about it.
> + */

Where do we clear PG_active in the code? Is this the reason we end up
with

void deactivate_page(struct page *page)
{
- if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {




> +#define MIN_NR_GENS 2U
> +#define MAX_NR_GENS ((unsigned int)CONFIG_NR_LRU_GENS)
> +
> +struct lru_gen_struct {
> + /* the aging increments the youngest generation number */
> + unsigned long max_seq;
> + /* the eviction increments the oldest generation numbers */
> + unsigned long min_seq[ANON_AND_FILE];
> + /* the birth time of each generation in jiffies */
> + unsigned long timestamps[MAX_NR_GENS];
> + /* the multigenerational lru lists */
> + struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> + /* the sizes of the above lists */
> + unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> + /* whether the multigenerational lru is enabled */
> + bool enabled;
> +};
> +
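
As an aside on the quoted comment and struct above, a minimal illustration
of the mapping being described (the helper names are assumptions, and
LRU_GEN_MASK/LRU_GEN_PGOFF stand in for whatever flag bits the patch
reserves in folio->flags):

/* A sequence number maps onto lrugen->lists[] by wrapping around MAX_NR_GENS. */
static inline int lru_gen_from_seq(unsigned long seq)
{
        return seq % MAX_NR_GENS;
}

/*
 * folio->flags stores gen + 1 while the folio is on lrugen->lists[], so that
 * 0 can mean "not on a multigenerational list"; decoding is the reverse:
 */
static inline int folio_lru_gen(struct folio *folio)
{
        unsigned long flags = READ_ONCE(folio->flags);

        return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}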

....

> static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
> diff --git a/mm/swap.c b/mm/swap.c
> index e8c9dc6d0377..d7dde3b7d4b5 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
> VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
> VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> + /* see the comment in lru_gen_add_folio() */
> + if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> + task_in_lru_fault() && !(current->flags & PF_MEMALLOC))
> + folio_set_active(folio);
> +


Can you explain this better? What is the significance of marking the
folio active here? Do we need to differentiate parallel page faults (across
different vmas) w.r.t. task_in_lru_fault()?


> folio_get(folio);
> local_lock(&lru_pvecs.lock);
> pvec = this_cpu_ptr(&lru_pvecs.lru_add);
> @@ -563,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>

2022-01-11 08:41:30

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and it
> > often makes poor choices about what to evict. This patchset offers an
> > alternative solution that is performant, versatile and
> > straightforward.
>
> <snipped>
>
> > Summary
> > =======
> > The facts are:
> > 1. The independent lab results and the real-world applications
> > indicate substantial improvements; there are no known regressions.
> > 2. Thrashing prevention, working set estimation and proactive reclaim
> > work out of the box; there are no equivalent solutions.
> > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > with similar effects.
> >
> > Our options, accordingly, are:
> > 1. Given the amount of evidence, the reported improvements will likely
> > materialize for a wide range of workloads.
> > 2. Gauging the interest from the past discussions [14][15][16], the
> > new features will likely be put to use for both personal computers
> > and data centers.
> > 3. Based on Google's track record, the new code will likely be well
> > maintained in the long term. It'd be more difficult if not
> > impossible to achieve similar effects on top of the existing
> > design.
>
> Hi Andrew, Linus,
>
> Can you please take a look at this patchset and let me know if it's
> 5.17 material?
>
> My goal is to get it merged asap so that users can reap the benefits
> and I can push the sequels. Please examine the data provided -- I
> think the unprecedented coverage and the magnitude of the improvements
> warrant a green light.

Downstream kernel maintainers who have been carrying MGLRU for more than
3 versions, can you please provide your Acked-by tags?

Having this patchset in the mainline will make your job easier :)

Alexandre - the XanMod Kernel maintainer
https://xanmod.org

Brian - the Chrome OS kernel memory maintainer
https://www.chromium.org

Jan - the Arch Linux Zen kernel maintainer
https://archlinux.org

Steven - the Liquorix kernel maintainer
https://liquorix.net

Suleiman - the ARCVM (Android downstream) kernel memory maintainer
https://chromium.googlesource.com/chromiumos/third_party/kernel

Also my gratitude to those who have helped test MGLRU:

Daniel - researcher at Michigan Tech
benchmarked memcached

Holger - who has been testing/patching/contributing to various
subsystems since ~2008

Shuang - researcher at University of Rochester
benchmarked fio and provided a report

Sofia - EDI https://www.edi.works
benchmarked the top eight memory hogs and provided reports

Can you please provide your Tested-by tags? This will ensure the credit
for your contributions.

Thanks!

2022-01-11 09:00:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Mon 10-01-22 18:18:55, Yu Zhao wrote:
> On Mon, Jan 10, 2022 at 04:35:46PM +0100, Michal Hocko wrote:
> > On Fri 07-01-22 16:36:11, Yu Zhao wrote:
> > > On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote:
> > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > [...]
> > > > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> > > > > +{
> > > > > + struct mem_cgroup *memcg;
> > > > > + bool success = false;
> > > > > + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> > > > > +
> > > > > + VM_BUG_ON(!current_is_kswapd());
> > > > > +
> > > > > + current->reclaim_state->mm_walk = &pgdat->mm_walk;
> > > > > +
> > > > > + memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > > > > + do {
> > > > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > > > +
> > > > > + if (age_lruvec(lruvec, sc, min_ttl))
> > > > > + success = true;
> > > > > +
> > > > > + cond_resched();
> > > > > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> > > > > +
> > > > > + if (!success && mutex_trylock(&oom_lock)) {
> > > > > + struct oom_control oc = {
> > > > > + .gfp_mask = sc->gfp_mask,
> > > > > + .order = sc->order,
> > > > > + };
> > > > > +
> > > > > + if (!oom_reaping_in_progress())
> > > > > + out_of_memory(&oc);
> > > > > +
> > > > > + mutex_unlock(&oom_lock);
> > > > > + }
> > > >
> > > > Why do you need to trigger oom killer from this path? Why cannot you
> > > > rely on the page allocator to do that like we do now?
> > >
> > > This is per desktop users' (repeated) requests. They can't tolerate
> > > thrashing as servers do because of UI lags; and they usually don't
> > > have fancy tools like oomd.
> > >
> > > Related discussions I saw:
> > > https://github.com/zen-kernel/zen-kernel/issues/218
> > > https://lore.kernel.org/lkml/[email protected]/
> > > https://lore.kernel.org/lkml/[email protected]/
> > > https://lore.kernel.org/lkml/[email protected]/
> > > https://lore.kernel.org/lkml/[email protected]/
> >
> > I do not really see any arguments why a userspace-based thrashing
> > detection cannot be used for those. Could you clarify?
>
> It definitely can be done. But who is going to do it for every distro
> and all individual users? AFAIK, not a single distro provides such a
> solution for desktop/laptop/phone users.

If existing interfaces provide sufficient information to make those
calls then I would definitely prefer a userspace solution.

> There is also the theoretical question of how reliable a userspace
> solution can be. What if this userspace solution itself gets stuck in
> the direct reclaim path? I'm not sure if anybody has done the research to
> prove or debunk it.

I have to confess I haven't checked oomd or other solutions but with
sufficient care (all the code mlocked in + no allocations done while
collecting data) I believe this should be achievable.

> In addition, exactly which PSI values should be used on different
> models of consumer electronics? Nobody knows. We have a team working
> on this and we haven't figured it out for all our Chromebook models.

I believe this is a matter of tuning for a specific deployment. We have
not only PSI but also refault counters that can be used.
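
As a concrete illustration of the refault side, a minimal userspace sketch
(not from any existing tool) that samples the workingset refault counters
from /proc/vmstat; the counter names vary by kernel version, with older
kernels exposing a single workingset_refault:

#include <stdio.h>

/* Sum the workingset refault counters; sample periodically and watch the rate. */
static unsigned long long read_refaults(void)
{
        char line[256];
        unsigned long long val, sum = 0;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return 0;

        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "workingset_refault_anon %llu", &val) == 1 ||
                    sscanf(line, "workingset_refault_file %llu", &val) == 1 ||
                    sscanf(line, "workingset_refault %llu", &val) == 1)
                        sum += val;
        }
        fclose(f);
        return sum;
}

A steadily rising refault rate, combined with PSI, is the kind of signal a
workload-specific policy could act on.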

> As Andrew said, "a blunt instrument like this would be useful".
> https://lore.kernel.org/lkml/[email protected]/
>
> I'd like to have less code in kernel too, but I've learned never to
> walk over users. If I remove this and they come after me asking why,
> I'd have a hard time convincing them.
>
> > Also my question was pointing to why out_of_memory is called from the
> > reclaim rather than the allocator (memcg charging path). It is the
> > caller of the reclaim that controls different reclaim strategies and tells
> > when all hope is lost and the oom killer should be invoked. This
> > allows for different policies at the allocator level, and this change
> > will break that AFAICS. E.g. what if the underlying allocation context
> > is __GFP_NORETRY?
>
> This is called in kswapd only, and by default (min_ttl=0) it doesn't
> do anything. So __GFP_NORETRY doesn't apply.

My bad. I must have got lost when traversing the code but I can see you
are enforcing that by a VM_BUG_ON. So the limited scope reclaim is not a
problem indeed.

> The question would be
> more along the lines of long-term ABI support.
>
> And I'll add the following comments, if you think we can keep this
> logic:
> OOM kill if every generation from all memcgs is younger than min_ttl.
> Another theoretical possibility is all memcgs are either below min or
> ineligible at priority 0, but this isn't the main goal.
>
> (Please read my reply at the bottom to decide whether we should keep
> it or not. Thanks.)
>
> > > From patch 8:
> > > Personal computers
> > > ------------------
> > > :Thrashing prevention: Write ``N`` to
> > > ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > > ``N`` milliseconds from getting evicted. The OOM killer is invoked if
> > > this working set can't be kept in memory. Based on the average human
> > > detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
> > > lags due to thrashing. Larger values like ``N=3000`` make lags less
> > > noticeable at the cost of more OOM kills.
> >
> > This is a very good example of something that should be a self-contained
> > patch with its own justification.
>
> Consider it done.
>
> > TBH it is really not all that clear to
> > me that we want to provide any user-visible knob to control OOM behavior
> > based on a time-based QoS.
>
> Agreed, and it didn't exist until v4, i.e., after I was asked to
> provide it several times.
>
> For example:
> https://github.com/zen-kernel/zen-kernel/issues/223
>
> And another example:
> Your Multigenerational LRU patchset is pretty complex and
> effective, but does not eliminate thrashing condition fully on an
> old PCs with slow HDD.
>
> I'm kindly asking you to cooperate with hakavlad if it's possible
> and maybe re-implement parts of le9 patch in your patchset wherever
> acceptable, as they are quite similar in the core concept.
>
> This is an excerpt of an email from [email protected], and he has
> posted demo videos in this discussion:
> https://lore.kernel.org/lkml/[email protected]/

That is all interesting feedback but we should be really careful about
ABI constraints and the future maintainability of the knob. I still stand
behind my statement that the kernel should implement such features only if
it is clear that we cannot really implement similar logic in
userspace.

--
Michal Hocko
SUSE Labs

2022-01-11 09:01:19

by Holger Hoffstätte

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On 2022-01-11 09:41, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
>> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
>>> TLDR
>>> ====
>>> The current page reclaim is too expensive in terms of CPU usage and it
>>> often makes poor choices about what to evict. This patchset offers an
>>> alternative solution that is performant, versatile and
>>> straightforward.
>>
>> <snipped>
>>
>>> Summary
>>> =======
>>> The facts are:
>>> 1. The independent lab results and the real-world applications
>>> indicate substantial improvements; there are no known regressions.
>>> 2. Thrashing prevention, working set estimation and proactive reclaim
>>> work out of the box; there are no equivalent solutions.
>>> 3. There is a lot of new code; nobody has demonstrated smaller changes
>>> with similar effects.
>>>
>>> Our options, accordingly, are:
>>> 1. Given the amount of evidence, the reported improvements will likely
>>> materialize for a wide range of workloads.
>>> 2. Gauging the interest from the past discussions [14][15][16], the
>>> new features will likely be put to use for both personal computers
>>> and data centers.
>>> 3. Based on Google's track record, the new code will likely be well
>>> maintained in the long term. It'd be more difficult if not
>>> impossible to achieve similar effects on top of the existing
>>> design.
>>
>> Hi Andrew, Linus,
>>
>> Can you please take a look at this patchset and let me know if it's
>> 5.17 material?
>>
>> My goal is to get it merged asap so that users can reap the benefits
>> and I can push the sequels. Please examine the data provided -- I
>> think the unprecedented coverage and the magnitude of the improvements
>> warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> Alexandre - the XanMod Kernel maintainer
> https://xanmod.org
>
> Brian - the Chrome OS kernel memory maintainer
> https://www.chromium.org
>
> Jan - the Arch Linux Zen kernel maintainer
> https://archlinux.org
>
> Steven - the Liquorix kernel maintainer
> https://liquorix.net
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
> https://chromium.googlesource.com/chromiumos/third_party/kernel
>
> Also my gratitude to those who have helped test MGLRU:
>
> Daniel - researcher at Michigan Tech
> benchmarked memcached
>
> Holger - who has been testing/patching/contributing to various
> subsystems since ~2008
>
> Shuang - researcher at University of Rochester
> benchmarked fio and provided a report
>
> Sofia - EDI https://www.edi.works
> benchmarked the top eight memory hogs and provided reports
>
> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
>
> Thanks!
>

Have been pounding on this "in production" on several different machines
(server, desktop, laptop) with 5.15.x, without any issues, so:

Tested-by: Holger Hoffstätte <[email protected]>

Looking forward to seeing this in mainline!

cheers,
Holger

2022-01-11 10:25:11

by Alexey Avramov

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

In some of my benchmarks MGLRU really gave unrivaled performance.
I assume the adoption of MGLRU into the kernel would save billions of
dollars and greatly reduce carbon dioxide emissions.

However, there are also cases where MGLRU loses.
There are cases where MGLRU does not achieve the performance that the
classic LRU gives (at least I got such results when testing MGLRU before[1],
but I did not report them here).

As a Linux user, I would like to see both variants of LRU in the kernel, so
that it is possible to switch to the suitable variant when needed: none of
the LRU variants allowed me to squeeze the maximum for all cases.

I hope to test MGLRU v6 later and show you some of its weaknesses and
anomalies with specific logs and benchmarks.

[1] I didn't have enough time and energy to decipher the results at that time:
https://github.com/hakavlad/cache-tests/tree/main/mg-LRU-v3_vs_classic-LRU
(but you can try to guess what it all means)


2022-01-11 10:39:23

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v6 7/9] mm: multigenerational lru: eviction

...

> +static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
> + int *type_scanned, struct list_head *list)
> +{
> + int i;
> + int type;
> + int scanned;
> + int tier = -1;
> + DEFINE_MIN_SEQ(lruvec);
> +
> + VM_BUG_ON(!seq_is_valid(lruvec));
> +
> + /*
> + * Try to make the obvious choice first. When anon and file are both
> + * available from the same generation, interpret swappiness 1 as file
> + * first and 200 as anon first.
> + */
> + if (!swappiness)
> + type = 1;
> + else if (min_seq[0] < min_seq[1])
> + type = 0;
> + else if (swappiness == 1)
> + type = 1;
> + else if (swappiness == 200)
> + type = 0;
> + else
> + type = get_type_to_scan(lruvec, swappiness, &tier);
> +

Wondering whether it will make it simpler to use
#define ANON 0
#define FILE 1

and then
else if (min_seq[ANON] < min_seq[FILE])
type = ANON;

The usage of 0/1 across the code does confuse

-aneesh
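
For illustration, the suggestion applied to the quoted snippet would read
roughly as follows (the constants are Aneesh's proposal, not the patch's):

#define ANON	0
#define FILE	1

	if (!swappiness)
		type = FILE;
	else if (min_seq[ANON] < min_seq[FILE])
		type = ANON;
	else if (swappiness == 1)
		type = FILE;
	else if (swappiness == 200)
		type = ANON;
	else
		type = get_type_to_scan(lruvec, swappiness, &tier);

In practice, more specific names (e.g. something like
LRU_GEN_ANON/LRU_GEN_FILE) would reduce the chance of colliding with such
generic identifiers elsewhere in the tree, but the readability point stands
either way.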

2022-01-11 10:41:02

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Mon 10-01-22 14:46:08, Jesse Barnes wrote:
> > > > 2. There have been none that came with the testing/benchmarking
> > > > coverage as this one did. Please point me to some if I'm mistaken,
> > > > and I'll gladly match them.
> > >
> > > I do appreciate your numbers but you should realize that this is an area
> > > that is really hard to get any conclusive testing for.
> >
> > Fully agreed. That's why we started a new initiative, and we hope more
> > people will follow these practices:
> > 1. All results in this area should be reported with at least standard
> > deviations, or preferably confidence intervals.
> > 2. Real applications should be benchmarked (with synthetic load
> > generators), not just synthetic benchmarks.
> > 3. A wide range of devices should be covered, i.e., servers, desktops,
> > laptops and phones.
> >
> > I'm very confident in saying our benchmark reports were held to the
> > highest standards. We have worked with MariaDB (company), EnterpriseDB
> > (Postgres), Redis (company), etc. on these reports. They have copies
> > of these reports (PDF version):
> > https://linux-mm.googlesource.com/benchmarks/
> >
> > We welcome any expert in those applications to examine our reports,
> > and we'll be happy to run any other benchmarks, or the same benchmarks with
> > different configurations, that anybody thinks are important and we've
> > missed.
>
> I really think this gets at the heart of the issue with mm
> development, and is one of the reasons it's been extra frustrating to
> not have an MM conf for the past couple of years; I think sorting out
> how we measure & proceed on changes would be easier done f2f. E.g.
> concluding with a consensus that if something doesn't regress on X, Y,
> and Z, and has reasonably maintainable and readable code, we should
> merge it and try it out.

I am fully with you on that! I hope we can have LSFMM this year finally.

> But since f2f isn't an option until 2052 at the earliest...

Let's be more optimistic than that ;)

> I understand the desire for an "incremental approach that gets us from
> A->B". In the abstract it sounds great. However, with a change like
> this one, I think it's highly likely that such a path would be
> littered with regressions both large and small, and would probably be
> more difficult to reason about than the relatively clean design of
> MGLRU.

There are certainly things that do not make much sense to split up, of
course. On the other hand, the patchset is making a lot of decisions and
assumptions that are documented neither in the code nor in the
changelog. From my past experience these are really problematic from a
long-term maintenance POV. We are struggling with those already because
changelogs tended to be much more coarse in the past, yet the code stays
with us, and we have been really "great" at not touching many of those
because "something might break". This results in complexity growth and
further maintenance burden.

> On top of that, I don't think we'll get the kind of user
> feedback we need for something like this *without* merging it. Yu has
> done a tremendous job collecting data here (and the results are really
> incredible), but I think we can all agree that without extensive
> testing in the field with all sorts of weird codes, we're not going to
> find the problematic behaviors we're concerned about.

This is understood.

> So unless we want to eschew big mm changes entirely (we shouldn't!
> look at net or scheduling for how important big rewrites are to
> progress), I think we should be open to experimenting with new stuff.
> We can always revert if things get too unwieldy.

As long as the patchset doesn't include new user-visible interfaces,
which have proven to be really hard to revert.

> None of this is to say that there may not be lots more comments on the
> code or potential fixes/changes to incorporate before merging; I'm
> mainly arguing about the mindset we should have to changes like this,
> not all the stuff the community is already really good at (i.e.
> testing and reviewing code on a nuts & bolts level).

From my reading of this and previous discussions I have gathered that
there was no opposition just for the sake of it. There have been very
specific questions regarding the implementation and/or future plans to
address issues expressed in the past.

So far I have only managed to check the memcg and oom integration,
finding some issues there. All of them should be fixable reasonably
easily, but it also shows that a deep dive into this is really
necessary.

I have also raised questions about the future maintainability of the
resulting code. As you may have noticed, the review capacity in the MM
community is lagging behind and we tend to have more code producers than
reviewers and maintainers.
Not to mention other things like page flags depletion, which is something
we have been struggling with for quite some time already.

All that being said there is a lot of work for such a large change to be
merged.
--
Michal Hocko
SUSE Labs

2022-01-11 12:15:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Tue 11-01-22 11:21:48, Alexey Avramov wrote:
> > I do not really see any arguments why a userspace-based thrashing
> > detection cannot be used for those.
>
> Firstly,
> because this is the task of the kernel, not the user space.
> Memory is managed by the kernel, not by the user space.
> The absence of such a mechanism in the kernel is a fundamental problem.
> The userspace tools are ugly hacks:
> some of them consume a lot of CPU [1],
> some of them consume a lot of memory [2],
> some of them cannot use process_mrelease() (earlyoom, nohang),
> some of them kill only the whole cgroup (systemd-oomd, oomd) [3]
> and depend on systemd and cgroup_v2 (oomd, systemd-oomd).

Thanks for those links. I read through them and my understanding is that
most of those are very specific to the tool used and not really
fundamental problems caused by a lack of kernel support.

> One of the biggest challenges for userspace oom-killers is that they may
> have to function under intense memory pressure and are prone to getting
> stuck in memory reclaim themselves [4].

This one is more interesting, and the truth is that handling the complete
OOM situation from userspace is really tricky, especially with
a more complex oom decision policy. In the past we have discussed
potential ways to implement an oom kill policy via kernel modules or eBPF,
without anybody following up on that.

But I suspect you are mixing up two things here. One of them is the
out-of-memory situation where no memory can be reclaimed or allocated.

The other is one where memory can be reclaimed and progress is made, but
that leads to thrashing, where most of the time is spent refaulting
memory that was reclaimed shortly before.

The first one is addressed by the global oom killer, which tries to
be as conservative as possible because this is a very
disruptive operation. But the latter one is more complex and proper
handling really depends on the particular workload,
because it is more of a QoS matter than an emergency action to keep
the system alive.

There are workloads which prefer temporary thrashing of their working
set during a peak memory demand over an OOM kill, because way too
much work would be thrown away. On the other hand, workloads that are
latency sensitive can see even direct reclaim as a runtime-visible
problem.

I hope you can imagine there is a really large gap between those
two cases and no simple solution can be applied to the whole
range. Therefore we have PSI and refault stats exported to userspace
so that a workload-specific policy can be implemented there.

If userspace has a hard time using that data and acting upon it, then let's
talk about specifics. For the most steady thrashing situations I have
seen, userspace with mlocked memory and code can make forward
progress and mediate the situation.
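
To make the mlocked-userspace point concrete, a minimal illustrative sketch
of such a monitor (the 10% threshold and 1-second period are arbitrary
examples; PSI must be enabled, and a real tool would pre-open everything it
needs and actually pick a victim to act on):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
        char buf[256];
        double avg10;
        int fd;

        /* lock ourselves in memory so the monitor doesn't get stuck in reclaim */
        if (mlockall(MCL_CURRENT | MCL_FUTURE))
                perror("mlockall");

        fd = open("/proc/pressure/memory", O_RDONLY);
        if (fd < 0)
                return 1;

        for (;;) {
                ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0);

                /* first line: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" */
                if (n > 0) {
                        buf[n] = '\0';
                        if (sscanf(buf, "some avg10=%lf", &avg10) == 1 && avg10 > 10.0)
                                fprintf(stderr, "memory pressure: some avg10=%.2f\n", avg10);
                }
                sleep(1);
        }
}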

[...]

> [1] https://github.com/facebookincubator/oomd/issues/79
> [2] https://github.com/hakavlad/nohang#memory-and-cpu-usage
> [3] https://github.com/facebookincubator/oomd/issues/125
> [4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/
> [5] https://github.com/zen-kernel/zen-kernel/issues/223
> [6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2
> [7] https://lore.kernel.org/linux-mm/[email protected]/
--
Michal Hocko
SUSE Labs

2022-01-11 14:19:15

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young()

On Fri, Jan 07, 2022 at 12:25:07AM -0700, Yu Zhao wrote:
> On Thu, Jan 06, 2022 at 10:30:09AM +0000, Will Deacon wrote:
> > On Wed, Jan 05, 2022 at 01:47:08PM -0700, Yu Zhao wrote:
> > > On Wed, Jan 05, 2022 at 10:45:26AM +0000, Will Deacon wrote:
> > > > On Tue, Jan 04, 2022 at 01:22:20PM -0700, Yu Zhao wrote:
> > > > > diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
> > > > > index 870c39537dd0..56e4ef5d95fa 100644
> > > > > --- a/arch/arm64/tools/cpucaps
> > > > > +++ b/arch/arm64/tools/cpucaps
> > > > > @@ -36,6 +36,7 @@ HAS_STAGE2_FWB
> > > > > HAS_SYSREG_GIC_CPUIF
> > > > > HAS_TLB_RANGE
> > > > > HAS_VIRT_HOST_EXTN
> > > > > +HW_AF
> > > > > HW_DBM
> > > > > KVM_PROTECTED_MODE
> > > > > MISMATCHED_CACHE_TYPE
> > > >
> > > > As discussed in the previous threads, we really don't need the complexity
> > > > of the additional cap for the arm64 part. Please can you just use the
> > > > existing code instead? It's both simpler and, as you say, it's equivalent
> > > > for existing hardware.
> > > >
> > > > That way, this patch just ends up being a renaming exercise and we're all
> > > > good.
> > >
> > > No, renaming alone isn't enough. A caller needs to disable preemption
> > > before calling system_has_hw_af(), and I don't think it's reasonable
> > > to ask this caller to do it on x86 as well.
> > >
> > > It seems you really prefer not to have HW_AF. So the best I can
> > > accommodate, considering other potential archs, e.g., risc-v (I do
> > > plan to provide benchmark results on risc-v, btw), is:
> > >
> > > static inline bool arch_has_hw_pte_young(bool local)
> > > {
> > > bool hw_af;
> > >
> > > if (local) {
> > > WARN_ON(preemptible());
> > > return cpu_has_hw_af();
> > > }
> > >
> > > preempt_disable();
> > > hw_af = system_has_hw_af();
> > > preempt_enable();
> > >
> > > return hw_af;
> > > }
> > >
> > > Or please give me something else I can call without disabling
> > > preemption, sounds good?
> >
> > Sure thing, let me take a look. Do you have your series on a public git
> > tree someplace?
>
> Thanks!
>
> This patch (updated) on Gerrit:
> https://linux-mm-review.googlesource.com/c/page-reclaim/+/1500/1

How about folding in something like the diff below? I've basically removed
that 'bool local' argument and dropped the preemptible() check from the
arm64 code.

Will

--->8

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 280123916fc2..990358eca359 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -998,27 +998,14 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
* the pte is old and cannot be marked young. So we always end up with zeroed
* page after fork() + CoW for pfn mappings. We don't always have a
* hardware-managed access flag on arm64.
- *
- * The system-wide support isn't used when involving correctness and therefore
- * is allowed to be flaky.
*/
-static inline bool arch_has_hw_pte_young(bool local)
-{
- WARN_ON(local && preemptible());
-
- return cpu_has_hw_af();
-}
-#define arch_has_hw_pte_young arch_has_hw_pte_young
+#define arch_has_hw_pte_young cpu_has_hw_af

/*
* Experimentally, it's cheap to set the access flag in hardware and we
* benefit from prefaulting mappings as 'old' to start with.
*/
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
- return arch_has_hw_pte_young(true);
-}
-#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
+#define arch_wants_old_prefaulted_pte cpu_has_hw_af

static inline pgprot_t arch_filter_pgprot(pgprot_t prot)
{
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index c60b16f8b741..3908780fc408 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1398,7 +1398,7 @@ static inline bool arch_has_pfn_modify_check(void)
}

#define arch_has_hw_pte_young arch_has_hw_pte_young
-static inline bool arch_has_hw_pte_young(bool local)
+static inline bool arch_has_hw_pte_young(void)
{
return true;
}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 599cc232d5c4..0bd1beadb545 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -260,15 +260,12 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,

#ifndef arch_has_hw_pte_young
/*
- * Return whether the accessed bit is supported by the local CPU or system-wide.
+ * Return whether the accessed bit is supported by the local CPU.
*
- * This stub assumes accessing thru an old PTE triggers a page fault.
+ * This stub assumes accessing through an old PTE triggers a page fault.
* Architectures that automatically set the access bit should overwrite it.
- *
- * Note that the system-wide support can be flaky and therefore shouldn't be
- * used when involving correctness.
*/
-static inline bool arch_has_hw_pte_young(bool local)
+static inline bool arch_has_hw_pte_young(void)
{
return false;
}
diff --git a/mm/memory.c b/mm/memory.c
index ead6c7d4b9a1..1f02de6d51e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2743,7 +2743,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
* On architectures with software "accessed" bits, we would
* take a double page fault, so mark it accessed here.
*/
- if (!arch_has_hw_pte_young(true) && !pte_young(vmf->orig_pte)) {
+ if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
pte_t entry;

vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);

2022-01-11 14:23:05

by Alexey Avramov

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

> I do not really see any arguments why an userspace based trashing
> detection cannot be used for those.

Firstly,
because this is the task of the kernel, not the user space.
Memory is managed by the kernel, not by the user space.
The absence of such a mechanism in the kernel is a fundamental problem.
The userspace tools are ugly hacks:
some of them consume a lot of CPU [1],
some of them consume a lot of memory [2],
some of them cannot use process_mrelease() (earlyoom, nohang),
some of them kill only the whole cgroup (systemd-oomd, oomd) [3]
and depend on systemd and cgroup_v2 (oomd, systemd-oomd).
One of the biggest challenges for userspace oom-killers is that they may
have to function under intense memory pressure and are prone to getting
stuck in memory reclaim themselves [4].

It is strange that after decades of user complaints about thrashing and
a non-working OOM killer, I have to explain the obvious things.
The basic mechanism must be implemented in the kernel.
Stop shifting responsibility to the user space!

Secondly,
the real reason for the min_ttl_ms mechanism is that without it,
multi-minute stalls are possible [5] even when the killer is expected to
arrive, and memory pressure stays close to 100 during this period [6].
This fixes a bug that does not exist in the mainline LRU (this is an
MGLRU-specific bug). BTW, similar symptoms were recently fixed in the
mainline [7].

[1] https://github.com/facebookincubator/oomd/issues/79
[2] https://github.com/hakavlad/nohang#memory-and-cpu-usage
[3] https://github.com/facebookincubator/oomd/issues/125
[4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/
[5] https://github.com/zen-kernel/zen-kernel/issues/223
[6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2
[7] https://lore.kernel.org/linux-mm/[email protected]/

[I am duplicating a previous message here - it was not delivered to mailing lists]

2022-01-11 16:07:02

by Shuang Zhai

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
>> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
>>> TLDR
>>> ====
>>> The current page reclaim is too expensive in terms of CPU usage and it
>>> often makes poor choices about what to evict. This patchset offers an
>>> alternative solution that is performant, versatile and
>>> straightforward.
>>
>> <snipped>
>>
>>> Summary
>>> =======
>>> The facts are:
>>> 1. The independent lab results and the real-world applications
>>> indicate substantial improvements; there are no known regressions.
>>> 2. Thrashing prevention, working set estimation and proactive reclaim
>>> work out of the box; there are no equivalent solutions.
>>> 3. There is a lot of new code; nobody has demonstrated smaller changes
>>> with similar effects.
>>>
>>> Our options, accordingly, are:
>>> 1. Given the amount of evidence, the reported improvements will likely
>>> materialize for a wide range of workloads.
>>> 2. Gauging the interest from the past discussions [14][15][16], the
>>> new features will likely be put to use for both personal computers
>>> and data centers.
>>> 3. Based on Google's track record, the new code will likely be well
>>> maintained in the long term. It'd be more difficult if not
>>> impossible to achieve similar effects on top of the existing
>>> design.
>>
>> Hi Andrew, Linus,
>>
>> Can you please take a look at this patchset and let me know if it's
>> 5.17 material?
>>
>> My goal is to get it merged asap so that users can reap the benefits
>> and I can push the sequels. Please examine the data provided -- I
>> think the unprecedented coverage and the magnitude of the improvements
>> warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> Alexandre - the XanMod Kernel maintainer
> https://xanmod.org
>
> Brian - the Chrome OS kernel memory maintainer
> https://www.chromium.org
>
> Jan - the Arch Linux Zen kernel maintainer
> https://archlinux.org
>
> Steven - the Liquorix kernel maintainer
> https://liquorix.net
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
> https://chromium.googlesource.com/chromiumos/third_party/kernel
>
> Also my gratitude to those who have helped test MGLRU:
>
> Daniel - researcher at Michigan Tech
> benchmarked memcached
>
> Holger - who has been testing/patching/contributing to various
> subsystems since ~2008
>
> Shuang - researcher at University of Rochester
> benchmarked fio and provided a report
>
> Sofia - EDI https://www.edi.works
> benchmarked the top eight memory hogs and provided reports
>
> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
>
> Thanks!

I have tested MGLRU using fio [1]. The performance improvement is fabulous.
I hope this patchset can eventually get merged to enable large-scale testing
and let more users share their experience.

Tested-by: Shuang Zhai <[email protected]>

[1] https://lore.kernel.org/lkml/[email protected]/

2022-01-11 22:27:15

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young()

On Tue, Jan 11, 2022 at 02:19:02PM +0000, Will Deacon wrote:
> On Fri, Jan 07, 2022 at 12:25:07AM -0700, Yu Zhao wrote:
> > On Thu, Jan 06, 2022 at 10:30:09AM +0000, Will Deacon wrote:
> > > On Wed, Jan 05, 2022 at 01:47:08PM -0700, Yu Zhao wrote:
> > > > On Wed, Jan 05, 2022 at 10:45:26AM +0000, Will Deacon wrote:
> > > > > On Tue, Jan 04, 2022 at 01:22:20PM -0700, Yu Zhao wrote:
> > > > > > diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
> > > > > > index 870c39537dd0..56e4ef5d95fa 100644
> > > > > > --- a/arch/arm64/tools/cpucaps
> > > > > > +++ b/arch/arm64/tools/cpucaps
> > > > > > @@ -36,6 +36,7 @@ HAS_STAGE2_FWB
> > > > > > HAS_SYSREG_GIC_CPUIF
> > > > > > HAS_TLB_RANGE
> > > > > > HAS_VIRT_HOST_EXTN
> > > > > > +HW_AF
> > > > > > HW_DBM
> > > > > > KVM_PROTECTED_MODE
> > > > > > MISMATCHED_CACHE_TYPE
> > > > >
> > > > > As discussed in the previous threads, we really don't need the complexity
> > > > > of the additional cap for the arm64 part. Please can you just use the
> > > > > existing code instead? It's both simpler and, as you say, it's equivalent
> > > > > for existing hardware.
> > > > >
> > > > > That way, this patch just ends up being a renaming exercise and we're all
> > > > > good.
> > > >
> > > > No, renaming alone isn't enough. A caller needs to disable preemption
> > > > before calling system_has_hw_af(), and I don't think it's reasonable
> > > > to ask this caller to do it on x86 as well.
> > > >
> > > > It seems you really prefer not to have HW_AF. So the best I can
> > > > accommodate, considering other potential archs, e.g., risc-v (I do
> > > > plan to provide benchmark results on risc-v, btw), is:
> > > >
> > > > static inline bool arch_has_hw_pte_young(bool local)
> > > > {
> > > >         bool hw_af;
> > > >
> > > >         if (local) {
> > > >                 WARN_ON(preemptible());
> > > >                 return cpu_has_hw_af();
> > > >         }
> > > >
> > > >         preempt_disable();
> > > >         hw_af = system_has_hw_af();
> > > >         preempt_enable();
> > > >
> > > >         return hw_af;
> > > > }
> > > >
> > > > Or please give me something else I can call without disabling
> > > > preemption, sounds good?
> > >
> > > Sure thing, let me take a look. Do you have your series on a public git
> > > tree someplace?
> >
> > Thanks!
> >
> > This patch (updated) on Gerrit:
> > https://linux-mm-review.googlesource.com/c/page-reclaim/+/1500/1
>
> How about folding in something like the diff below? I've basically removed
> that 'bool local' argument and dropped the preemptible() check from the
> arm64 code.

This looks great, thanks.
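
For anyone following along, a minimal sketch of what the two arch overrides
might end up looking like once the "bool local" argument is dropped. This is
only an illustration, not Will's actual folded diff: it assumes cpu_has_hw_af()
remains the arm64 helper and that x86 can simply return true.

    /* arch/x86/include/asm/pgtable.h -- x86 always sets the accessed bit in HW */
    static inline bool arch_has_hw_pte_young(void)
    {
            return true;
    }
    #define arch_has_hw_pte_young arch_has_hw_pte_young

    /* arch/arm64/include/asm/pgtable.h -- per-CPU hardware AF detection */
    static inline bool arch_has_hw_pte_young(void)
    {
            return cpu_has_hw_af();
    }
    #define arch_has_hw_pte_young arch_has_hw_pte_young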

2022-01-11 23:17:05

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Mon, Jan 10, 2022 at 04:01:13PM +0100, Michal Hocko wrote:
> On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > > +{
> > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
>
> One thing I have overlooked completely.

I appreciate your attention to details but GFP_KERNEL is legit in the
reclaim path. It's been used many years in our production, e.g.,
    page reclaim
      swap_writepage()
        frontswap_store()
          zswap_frontswap_store()
            zswap_entry_cache_alloc(GFP_KERNEL)

(And I always test my changes with lockdep, kasan, DEBUG_VM, etc., no
warnings ever seen from using GFP_KERNEL in the reclaim path.)

> You cannot really use GFP_KERNEL
> allocation here because the reclaim context can be constrained (e.g.
> GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC
> but I suspect that the lockdep will complain anyway.
>
> Also kvmalloc is not really great here. a) vmalloc path is never
> executed for small objects and b) we do not really want to make a
> dependency between vmalloc and the reclaim (by vmalloc -> reclaim ->
> vmalloc).
>
> Even if we rule out vmalloc and look at kmalloc alone. Is this really
> safe? I do not see any recursion prevention in the SL.B code. Maybe this
> just happens to work but the dependency should be really documented so
> that future SL.B changes won't break the whole scheme.

Affirmative, as Vlastimil has clarified.

Thanks!

2022-01-12 01:02:25

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Mon, Jan 10, 2022 at 05:57:39PM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> [...]
> > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > +{
> > + static const struct mm_walk_ops mm_walk_ops = {
> > + .test_walk = should_skip_vma,
> > + .p4d_entry = walk_pud_range,
> > + };
> > +
> > + int err;
> > +#ifdef CONFIG_MEMCG
> > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +#endif
> > +
> > + walk->next_addr = FIRST_USER_ADDRESS;
> > +
> > + do {
> > + unsigned long start = walk->next_addr;
> > + unsigned long end = mm->highest_vm_end;
> > +
> > + err = -EBUSY;
> > +
> > + rcu_read_lock();
> > +#ifdef CONFIG_MEMCG
> > + if (memcg && atomic_read(&memcg->moving_account))
> > + goto contended;
> > +#endif
>
> Why do you need to check for moving_account?

This check, if it succeeds, blocks memcg migration.

Our goal is to move pages between different generations of the same
lruvec (the first arg). Meanwhile, pages can also be migrated between
different memcgs (different lruvecs).

The active/inactive lru uses isolation to block memcg migration.

Generations account pages similarly to the active/inactive lru, i.e.,
each generation has an nr_pages counter. However, unlike the active/
inactive lru, a page can be moved to a different generation without
being isolated, or even without holding the lru lock, as long as
the delta is eventually accounted for (which does require the lru lock
when it happens).

The generation counter in page->flags (folio->flags to be precise)
stores 0 when a page is isolated, to synchronize with isolation.
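
To make that concrete, here is a hedged sketch of the kind of lockless
generation update described above. LRU_GEN_MASK and LRU_GEN_PGOFF follow the
patchset's naming, but folio_update_gen() below is illustrative rather than a
verbatim copy of the patch.

    /*
     * Move a folio to new_gen without isolating it. Returns the old generation
     * so the caller can account the nr_pages[] delta later, under the lru lock.
     * A stored value of 0 means "isolated"; in that case, back off and let the
     * isolation path win.
     */
    static int folio_update_gen(struct folio *folio, int new_gen)
    {
            unsigned long old_flags, new_flags;

            do {
                    new_flags = old_flags = READ_ONCE(folio->flags);

                    if (!(old_flags & LRU_GEN_MASK))
                            return -1;

                    new_flags &= ~LRU_GEN_MASK;
                    new_flags |= ((unsigned long)new_gen + 1) << LRU_GEN_PGOFF;
            } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);

            return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
    }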

2022-01-12 01:46:36

by Suleiman Souhlal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 11, 2022 at 5:41 PM Yu Zhao <[email protected]> wrote:
>
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > > Summary
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > > indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > > work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > > with similar effects.
> > >
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > > materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > > new features will likely be put to use for both personal computers
> > > and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > > maintained in the long term. It'd be more difficult if not
> > > impossible to achieve similar effects on top of the existing
> > > design.
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> Alexandre - the XanMod Kernel maintainer
> https://xanmod.org
>
> Brian - the Chrome OS kernel memory maintainer
> https://www.chromium.org
>
> Jan - the Arch Linux Zen kernel maintainer
> https://archlinux.org
>
> Steven - the Liquorix kernel maintainer
> https://liquorix.net
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
> https://chromium.googlesource.com/chromiumos/third_party/kernel

Android on ChromeOS has been using MGLRU for a while now, with great results.
It would be great for more people to more easily be able to benefit from it.

Acked-by: Suleiman Souhlal <[email protected]>

-- Suleiman

2022-01-12 02:16:22

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 4/9] mm: multigenerational lru: groundwork

On Tue, Jan 11, 2022 at 01:46:24PM +0530, Aneesh Kumar K.V wrote:
> Yu Zhao <[email protected]> writes:
>
> .....
>
> +
> > +/*
> > + * Evictable pages are divided into multiple generations. The youngest and the
> > + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
> > + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
> > + * offset within MAX_NR_GENS, gen, indexes the lru list of the corresponding
> > + * generation. The gen counter in folio->flags stores gen+1 while a page is on
> > + * lrugen->lists[]. Otherwise, it stores 0.
> > + *
> > + * A page is added to the youngest generation on faulting. The aging needs to
> > + * check the accessed bit at least twice before handing this page over to the
> > + * eviction. The first check takes care of the accessed bit set on the initial
> > + * fault; the second check makes sure this page hasn't been used since then.
> > + * This process, AKA second chance, requires a minimum of two generations,
> > + * hence MIN_NR_GENS. And to be compatible with the active/inactive lru, these
> > + * two generations are mapped to the active; the rest of generations, if they
> > + * exist, are mapped to the inactive. PG_active is always cleared while a page
> > + * is on lrugen->lists[] so that demotion, which happens consequently when the
> > + * aging creates a new generation, needs not to worry about it.
> > + */
>
> Where do we clear PG_active in the code? Is this the reason we end up
> with

We clear PG_active when we add a page (folio) to MGLRU lists:
  include/linux/mm_inline.h
    lru_gen_add_folio()
        do {
                new_flags = old_flags = READ_ONCE(folio->flags);

                ...

                new_flags &= ~(LRU_GEN_MASK | BIT(PG_active));
                                              ^^^^^^^^^^^^^^
                ...

        } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);

We also set it when we isolate a page (for page migration):
  include/linux/mm_inline.h
    lru_gen_del_folio()
        do {
                new_flags = old_flags = READ_ONCE(folio->flags);

                ...

                else if (lru_gen_is_active(lruvec, gen))
                        new_flags |= BIT(PG_active);
                                     ^^^^^^^^^^^^^^
        } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);

>
> void deactivate_page(struct page *page)
> {
> - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> + if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {

That's correct.

> > +#define MIN_NR_GENS 2U
> > +#define MAX_NR_GENS ((unsigned int)CONFIG_NR_LRU_GENS)
> > +
> > +struct lru_gen_struct {
> > + /* the aging increments the youngest generation number */
> > + unsigned long max_seq;
> > + /* the eviction increments the oldest generation numbers */
> > + unsigned long min_seq[ANON_AND_FILE];
> > + /* the birth time of each generation in jiffies */
> > + unsigned long timestamps[MAX_NR_GENS];
> > + /* the multigenerational lru lists */
> > + struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> > + /* the sizes of the above lists */
> > + unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> > + /* whether the multigenerational lru is enabled */
> > + bool enabled;
> > +};
> > +
>
> ....
>
> > static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
> > diff --git a/mm/swap.c b/mm/swap.c
> > index e8c9dc6d0377..d7dde3b7d4b5 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
> > VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
> > VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> >
> > + /* see the comment in lru_gen_add_folio() */
> > + if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > + task_in_lru_fault() && !(current->flags & PF_MEMALLOC))
> > + folio_set_active(folio);
> > +
>
>
> Can you explain this better? What is the significance of marking the
> folio active here. Do we need to differentiate parallel page faults (across
> different vmas) w.r.t task_in_lru_fault()?

All pages faulted in need to be added to the youngest generation. But
without PG_active, lru_gen_add_folio() doesn't know whether a page was
faulted in, or something else, e.g., page cache readahead. This is
because pages aren't immediately sent to lru_gen_add_folio(). They are
batched by lru_pvecs:

/**
 * folio_add_lru - Add a folio to an LRU list.
 * @folio: The folio to be added to the LRU.
 *
 * Queue the folio for addition to the LRU. The decision on whether
 * to add the page to the [in]active [file|anon] list is deferred until the
 * pagevec is drained. This gives a chance for the caller of folio_add_lru()
 * have the folio added to the active list using folio_mark_accessed().
 */
void folio_add_lru(struct folio *folio)
{
        struct pagevec *pvec;

        VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        /* see the comment in lru_gen_add_folio() */
        if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
            lru_gen_in_pgfault() && !(current->flags & PF_MEMALLOC))
                folio_set_active(folio);

        folio_get(folio);
        local_lock(&lru_pvecs.lock);
        pvec = this_cpu_ptr(&lru_pvecs.lru_add);
        if (pagevec_add_and_need_flush(pvec, &folio->page))
                __pagevec_lru_add(pvec);
        local_unlock(&lru_pvecs.lock);
}

2022-01-12 06:07:34

by Sofia Trinh

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 11, 2022 at 12:41 AM Yu Zhao <[email protected]> wrote:
>
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > > Summary
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > > indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > > work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > > with similar effects.
> > >
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > > materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > > new features will likely be put to use for both personal computers
> > > and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > > maintained in the long term. It'd be more difficult if not
> > > impossible to achieve similar effects on top of the existing
> > > design.
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> Alexandre - the XanMod Kernel maintainer
> https://xanmod.org
>
> Brian - the Chrome OS kernel memory maintainer
> https://www.chromium.org
>
> Jan - the Arch Linux Zen kernel maintainer
> https://archlinux.org
>
> Steven - the Liquorix kernel maintainer
> https://liquorix.net
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
> https://chromium.googlesource.com/chromiumos/third_party/kernel
>
> Also my gratitude to those who have helped test MGLRU:
>
> Daniel - researcher at Michigan Tech
> benchmarked memcached
>
> Holger - who has been testing/patching/contributing to various
> subsystems since ~2008
>
> Shuang - researcher at University of Rochester
> benchmarked fio and provided a report
>
> Sofia - EDI https://www.edi.works
> benchmarked the top eight memory hogs and provided reports

Tested-by: Sofia Trinh <[email protected]>

2022-01-12 08:06:03

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 7/9] mm: multigenerational lru: eviction

On Tue, Jan 11, 2022 at 04:07:57PM +0530, Aneesh Kumar K.V wrote:
> ...
>
> > +static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
> > +                          int *type_scanned, struct list_head *list)
> > +{
> > +        int i;
> > +        int type;
> > +        int scanned;
> > +        int tier = -1;
> > +        DEFINE_MIN_SEQ(lruvec);
> > +
> > +        VM_BUG_ON(!seq_is_valid(lruvec));
> > +
> > +        /*
> > +         * Try to make the obvious choice first. When anon and file are both
> > +         * available from the same generation, interpret swappiness 1 as file
> > +         * first and 200 as anon first.
> > +         */
> > +        if (!swappiness)
> > +                type = 1;
> > +        else if (min_seq[0] < min_seq[1])
> > +                type = 0;
> > +        else if (swappiness == 1)
> > +                type = 1;
> > +        else if (swappiness == 200)
> > +                type = 0;
> > +        else
> > +                type = get_type_to_scan(lruvec, swappiness, &tier);
> > +
>
> Wondering wether it will make it simpler to use
> #define ANON 0
> #define FILE 1
>
> and then
> else if (min_seq[ANON] < min_seq[FILE])
> type = ANON;
>
> The usage of 0/1 across the code does confuse

I agree, and I plan to do this later because the existing code uses
this convention and needs renaming too.
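
For the record, the same decision tree with named constants would read
something like the sketch below. ANON, FILE and choose_type() are illustrative
names only; the patch still uses 0/1 at this point.

    enum { ANON, FILE };

    /*
     * Swappiness 0 means file only and 200 means anon only; anything in
     * between falls through to the PID controller via get_type_to_scan().
     */
    static int choose_type(struct lruvec *lruvec, unsigned long min_seq[2],
                           int swappiness, int *tier)
    {
            if (!swappiness)
                    return FILE;
            if (min_seq[ANON] < min_seq[FILE])
                    return ANON;
            if (swappiness == 1)
                    return FILE;
            if (swappiness == 200)
                    return ANON;

            return get_type_to_scan(lruvec, swappiness, tier);
    }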

2022-01-12 08:08:22

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 5/9] mm: multigenerational lru: mm_struct list

On Mon, Jan 10, 2022 at 04:21:53PM +0100, Michal Hocko wrote:
> On Fri 07-01-22 17:19:28, Yu Zhao wrote:
> > On Fri, Jan 07, 2022 at 10:06:15AM +0100, Michal Hocko wrote:
> > > On Tue 04-01-22 13:22:24, Yu Zhao wrote:
> > > > To exploit spatial locality, the aging prefers to walk page tables to
> > > > search for young PTEs. And this patch paves the way for that.
> > > >
> > > > An mm_struct list is maintained for each memcg, and an mm_struct
> > > > follows its owner task to the new memcg when this task is migrated.
> > >
> > > How does this work actually for the memcg reclaim? I can see you
> > > lru_gen_migrate_mm on the task migration. My concern is, though, that
> > > such a task leaves all the memory behind in the previous memcg (in
> > > cgroup v2, in v1 you can opt in for charge migration). If you move the
> > > mm to a new memcg then you age it somewhere where the memory is not
> > > really consumed.
> >
> > There are two options to gather the accessed bit: page table walks and
> > rmap walks. Page table walks sweep dense hotspots that are NOT
> > misplaced in terms of reclaim scope (lruvec); rmap walks cover what
> > page table walks miss, e.g., misplaced dense hotspots or sparse ones.
> >
> > Dense hotspots are stored in Bloom filters for each lruvec.
> >
> > If an mm leaves everything in the old memcg, page table walks in the
> > new memcg reclaim path basically ignore this mm after the first scan,
> > because everything is misplaced.
>
> OK, so do I get it right that pages mapped from a different memcg than
> the reclaimed one are considered effectivelly non-present from the the
> reclaim logic POV? This would be worth mentioning in the migration
> callback because it is not really that straightforward to put those two
> together.

That's correct. Will document this in detail.

> > In the old memcg reclaim path, page table walks won't see this mm
> > at all. But rmap walks will catch everything later in the eviction
> > path, i.e., lru_gen_look_around(). This function is less efficient
> > compared with page table walks because, for each rmap walk of a
> > non-shared page, it only can gather the accessed bit from 64 PTEs at
> > most. But it's still a lot faster than the original rmap, which only
> > gathers the accessed bit from a single PTE, for each walk of a
> > non-shared page.
>
> Again, something that should be really documented.

Noted.
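
Since the Bloom filters keep coming up, here is a rough sketch of the idea for
anyone not reading the patch itself: the aging records the PMD entries that
turned out to contain young PTEs, and the next walk tests the filter to skip
page tables that were cold last time. The names below (bloom_filter,
bloom_set, bloom_test, BLOOM_SHIFT) are illustrative, not the patchset's exact
helpers.

    #include <linux/bitops.h>
    #include <linux/hash.h>
    #include <linux/types.h>

    #define BLOOM_SHIFT     15      /* 2^15 bits, i.e., a 4KB filter */

    struct bloom_filter {
            unsigned long bits[BIT(BLOOM_SHIFT) / BITS_PER_LONG];
    };

    /* Remember that this page table (keyed by its PMD entry) had young PTEs. */
    static void bloom_set(struct bloom_filter *bf, void *item)
    {
            u32 key = hash_ptr(item, 2 * BLOOM_SHIFT);

            set_bit(key & (BIT(BLOOM_SHIFT) - 1), bf->bits);
            set_bit((key >> BLOOM_SHIFT) & (BIT(BLOOM_SHIFT) - 1), bf->bits);
    }

    /* May return false positives (harmless extra work), never false negatives. */
    static bool bloom_test(struct bloom_filter *bf, void *item)
    {
            u32 key = hash_ptr(item, 2 * BLOOM_SHIFT);

            return test_bit(key & (BIT(BLOOM_SHIFT) - 1), bf->bits) &&
                   test_bit((key >> BLOOM_SHIFT) & (BIT(BLOOM_SHIFT) - 1), bf->bits);
    }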

2022-01-12 08:36:00

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

On Mon, Jan 10, 2022 at 12:27:19PM +0200, Mike Rapoport wrote:
> Hi,
>
> On Tue, Jan 04, 2022 at 01:22:27PM -0700, Yu Zhao wrote:
> > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.
> >
> > Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention.
> > Compared with the size-based approach, e.g., [1], this time-based
> > approach has the following advantages:
> > 1) It's easier to configure because it's agnostic to applications and
> > memory sizes.
> > 2) It's more reliable because it's directly wired to the OOM killer.
> >
> > Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> > reclaim. Compared with the page table-based approach and the PFN-based
> > approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
> > the following advantages:
> > 1) It offers better choices because it's aware of memcgs, NUMA nodes,
> > shared mappings and unmapped page cache.
> > 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas
> > the PFN-based approach is O(nr_total_pages).
> >
> > Add /sys/kernel/debug/lru_gen_full for debugging.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> >
> > Signed-off-by: Yu Zhao <[email protected]>
> > Tested-by: Konstantin Kharlamov <[email protected]>
> > ---
> > Documentation/vm/index.rst | 1 +
> > Documentation/vm/multigen_lru.rst | 62 +++++
>
> The description of user visible interfaces should go to
> Documentation/admin-guide/mm
>
> Documentation/vm/multigen_lru.rst should have contained design description
> and the implementation details and it would be great to actually have such
> document.

Will do, thanks.

> > include/linux/nodemask.h | 1 +
> > mm/vmscan.c | 415 ++++++++++++++++++++++++++++++
> > 4 files changed, 479 insertions(+)
> > create mode 100644 Documentation/vm/multigen_lru.rst
> >
> > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> > index 6f5ffef4b716..f25e755b4ff4 100644
> > --- a/Documentation/vm/index.rst
> > +++ b/Documentation/vm/index.rst
> > @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the
> > unevictable-lru
> > z3fold
> > zsmalloc
> > + multigen_lru
> > diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
> > new file mode 100644
> > index 000000000000..6f9e0181348b
> > --- /dev/null
> > +++ b/Documentation/vm/multigen_lru.rst
> > @@ -0,0 +1,62 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=====================
> > +Multigenerational LRU
> > +=====================
> > +
> > +Quick start
> > +===========
> > +Runtime configurations
> > +----------------------
> > +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
> > + feature wasn't enabled by default.
>
> Required for what? This sentence seem to lack context. Maybe add an
> overview what is Multigenerational LRU so that users will have an idea what
> these knobs control.

Apparently I left an important part of this quick start in the next
patch, where Kconfig options are added. I'm wondering whether I should
squash the next patch into this one.

I always separate Kconfig changes and leave them in the last patch
because it gives me peace of mind knowing it'll never give any auto
bisectors a hard time.

But I saw people not following this practice, and I'm also tempted to
do so. Can anybody remind me whether it's considered a bad practice to
have code changes and Kconfig changes in the same patch?

> > +
> > +Recipes
> > +=======
>
> Some more context here will be also helpful.

Will do.

> > +Personal computers
> > +------------------
> > +:Thrashing prevention: Write ``N`` to
> > + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > + ``N`` milliseconds from getting evicted. The OOM killer is invoked if
> > + this working set can't be kept in memory. Based on the average human
> > + detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
> > + lags due to thrashing. Larger values like ``N=3000`` make lags less
> > + noticeable at the cost of more OOM kills.
> > +
> > +Data centers
> > +------------
> > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > + format:
> > + ::
> > +
> > + memcg memcg_id memcg_path
> > + node node_id
> > + min_gen birth_time anon_size file_size
> > + ...
> > + max_gen birth_time anon_size file_size
> > +
> > + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> > + youngest generation number. ``birth_time`` is in milliseconds.
> > + ``anon_size`` and ``file_size`` are in pages.
>
> And what does oldest and youngest generations mean from the user
> perspective?

Good question. Will add more details in the next spin.
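
In case it helps reviewers picture the intended usage, here is a minimal
userspace sketch of the thrashing-prevention recipe quoted above. The paths
follow the commit message, the 1000 ms value follows the documentation, and
write_knob() is just a helper for this example.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void write_knob(const char *path, const char *val)
    {
            int fd = open(path, O_WRONLY);

            if (fd < 0 || write(fd, val, strlen(val)) < 0)
                    perror(path);
            if (fd >= 0)
                    close(fd);
    }

    int main(void)
    {
            /* Turn MGLRU on (if it wasn't enabled by default)... */
            write_knob("/sys/kernel/mm/lru_gen/enabled", "1");
            /* ...and keep the last second's working set resident. */
            write_knob("/sys/kernel/mm/lru_gen/min_ttl_ms", "1000");

            return 0;
    }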

2022-01-12 10:18:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Tue 11-01-22 18:01:29, Yu Zhao wrote:
> On Mon, Jan 10, 2022 at 05:57:39PM +0100, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > [...]
> > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > +{
> > > + static const struct mm_walk_ops mm_walk_ops = {
> > > + .test_walk = should_skip_vma,
> > > + .p4d_entry = walk_pud_range,
> > > + };
> > > +
> > > + int err;
> > > +#ifdef CONFIG_MEMCG
> > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > +#endif
> > > +
> > > + walk->next_addr = FIRST_USER_ADDRESS;
> > > +
> > > + do {
> > > + unsigned long start = walk->next_addr;
> > > + unsigned long end = mm->highest_vm_end;
> > > +
> > > + err = -EBUSY;
> > > +
> > > + rcu_read_lock();
> > > +#ifdef CONFIG_MEMCG
> > > + if (memcg && atomic_read(&memcg->moving_account))
> > > + goto contended;
> > > +#endif
> >
> > Why do you need to check for moving_account?
>
> This check, if succeeds, blocks memcg migration.

OK, I can see that you rely on the RCU here for the synchronization. A
comment which mentions mem_cgroup_move_charge would be helpful for
clarity. Is there any reason you are not using folio_memcg_lock in the
pte walk instead?
--
Michal Hocko
SUSE Labs

2022-01-12 10:29:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Tue 11-01-22 16:16:57, Yu Zhao wrote:
> On Mon, Jan 10, 2022 at 04:01:13PM +0100, Michal Hocko wrote:
> > On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > > > +{
> > > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> >
> > One thing I have overlooked completely.
>
> I appreciate your attention to details but GFP_KERNEL is legit in the
> reclaim path. It's been used many years in our production, e.g.,
> page reclaim
> swap_writepage()
> frontswap_store()
> zswap_frontswap_store()
> zswap_entry_cache_alloc(GFP_KERNEL)
>
> (And I always test my changes with lockdep, kasan, DEBUG_VM, etc., no
> warnings ever seen from using GFP_KERNEL in the reclaim path.)

OK, I can see it now. __need_reclaim will check for PF_MEMALLOC and skip
the fs_reclaim tracking.

I still maintain I am not really happy about (nor in the zswap example)
allocations from the direct reclaim context. I would really recommend
using a pre-allocated pool of objects.

If there are strong reasons for not doing so then at least change that
to kzalloc.

Thanks!
--
Michal Hocko
SUSE Labs

2022-01-12 10:32:01

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

On Wed 12-01-22 01:35:52, Yu Zhao wrote:
[...]
> But I saw people not following this practice, and I'm also tempted to
> do so. Can anybody remind me whether it's considered a bad practice to
> have code changes and Kconfig changes in the same patch?

If you want to have the patch series bisectable then it is preferable to
add kconfig options early so that the code is enabled in the respective
steps. Sometimes that can be impractical though (e.g. when the feature is
incomplete at that stage).
--
Michal Hocko
SUSE Labs

2022-01-12 15:46:00

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

On Wed, Jan 12, 2022 at 01:35:52AM -0700, Yu Zhao wrote:
> On Mon, Jan 10, 2022 at 12:27:19PM +0200, Mike Rapoport wrote:
> > Hi,
> >
> > On Tue, Jan 04, 2022 at 01:22:27PM -0700, Yu Zhao wrote:
> > > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.
> > >
> > > Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention.
> > > Compared with the size-based approach, e.g., [1], this time-based
> > > approach has the following advantages:
> > > 1) It's easier to configure because it's agnostic to applications and
> > > memory sizes.
> > > 2) It's more reliable because it's directly wired to the OOM killer.
> > >
> > > Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> > > reclaim. Compared with the page table-based approach and the PFN-based
> > > approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
> > > the following advantages:
> > > 1) It offers better choices because it's aware of memcgs, NUMA nodes,
> > > shared mappings and unmapped page cache.
> > > 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas
> > > the PFN-based approach is O(nr_total_pages).
> > >
> > > Add /sys/kernel/debug/lru_gen_full for debugging.
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Signed-off-by: Yu Zhao <[email protected]>
> > > Tested-by: Konstantin Kharlamov <[email protected]>
> > > ---
> > > Documentation/vm/index.rst | 1 +
> > > Documentation/vm/multigen_lru.rst | 62 +++++
> >
> > The description of user visible interfaces should go to
> > Documentation/admin-guide/mm
> >
> > Documentation/vm/multigen_lru.rst should have contained design description
> > and the implementation details and it would be great to actually have such
> > document.
>
> Will do, thanks.
>
> > > include/linux/nodemask.h | 1 +
> > > mm/vmscan.c | 415 ++++++++++++++++++++++++++++++
> > > 4 files changed, 479 insertions(+)
> > > create mode 100644 Documentation/vm/multigen_lru.rst
> > >
> > > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> > > index 6f5ffef4b716..f25e755b4ff4 100644
> > > --- a/Documentation/vm/index.rst
> > > +++ b/Documentation/vm/index.rst
> > > @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the
> > > unevictable-lru
> > > z3fold
> > > zsmalloc
> > > + multigen_lru
> > > diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
> > > new file mode 100644
> > > index 000000000000..6f9e0181348b
> > > --- /dev/null
> > > +++ b/Documentation/vm/multigen_lru.rst
> > > @@ -0,0 +1,62 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +
> > > +=====================
> > > +Multigenerational LRU
> > > +=====================
> > > +
> > > +Quick start
> > > +===========
> > > +Runtime configurations
> > > +----------------------
> > > +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
> > > + feature wasn't enabled by default.
> >
> > Required for what? This sentence seem to lack context. Maybe add an
> > overview what is Multigenerational LRU so that users will have an idea what
> > these knobs control.
>
> Apparently I left an important part of this quick start in the next
> patch, where Kconfig options are added. I'm wondering whether I should
> squash the next patch into this one.

I think documentation deserves a separate patch.


--
Sincerely yours,
Mike.

2022-01-12 21:03:29

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

Hello.

On Tuesday, 4 January 2022 21:22:19 CET Yu Zhao wrote:
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.
>
> Design objectives
> =================
> The design objectives are:
> 1. Better representation of access recency
> 2. Try to profit from spatial locality
> 3. Clear fast path making obvious choices
> 4. Simple self-correcting heuristics
>
> The representation of access recency is at the core of all LRU
> approximations. The multigenerational LRU (MGLRU) divides pages into
> multiple lists (generations), each having bounded access recency (a
> time interval). Generations establish a common frame of reference and
> help make better choices, e.g., between different memcgs on a computer
> or different computers in a data center (for cluster job scheduling).
>
> Exploiting spatial locality improves the efficiency when gathering the
> accessed bit. A rmap walk targets a single page and doesn't try to
> profit from discovering an accessed PTE. A page table walk can sweep
> all hotspots in an address space, but its search space can be too
> large to make a profit. The key is to optimize both methods and use
> them in combination. (PMU is another option for further exploration.)
>
> Fast path reduces code complexity and runtime overhead. Unmapped pages
> don't require TLB flushes; clean pages don't require writeback. These
> facts are only helpful when other conditions, e.g., access recency,
> are similar. With generations as a common frame of reference,
> additional factors stand out. But obvious choices might not be good
> choices; thus self-correction is required (the next objective).
>
> The benefits of simple self-correcting heuristics are self-evident.
> Again with generations as a common frame of reference, this becomes
> attainable. Specifically, pages in the same generation are categorized
> based on additional factors, and a closed-loop control statistically
> compares the refault percentages across all categories and throttles
> the eviction of those that have higher percentages.
>
> Patchset overview
> =================
> 1. mm: x86, arm64: add arch_has_hw_pte_young()
> 2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> Materializing hardware optimizations when trying to clear the accessed
> bit in many PTEs. If hardware automatically sets the accessed bit in
> PTEs, there is no need to worry about bursty page faults (emulating
> the accessed bit). If it also sets the accessed bit in non-leaf PMD
> entries, there is no need to search the PTE table pointed to by a PMD
> entry that doesn't have the accessed bit set.
>
> 3. mm/vmscan.c: refactor shrink_node()
> A minor refactor.
>
> 4. mm: multigenerational lru: groundwork
> Adding the basic data structure and the functions to initialize it and
> insert/remove pages.
>
> 5. mm: multigenerational lru: mm_struct list
> An infra keeps track of mm_struct's for page table walkers and
> provides them with optimizations, i.e., switch_mm() tracking and Bloom
> filters.
>
> 6. mm: multigenerational lru: aging
> 7. mm: multigenerational lru: eviction
> "The page reclaim" is a producer/consumer model. "The aging" produces
> cold pages, whereas "the eviction " consumes them. Cold pages flow
> through generations. The aging uses the mm_struct list infra to sweep
> dense hotspots in page tables. During a page table walk, the aging
> clears the accessed bit and tags accessed pages with the youngest
> generation number. The eviction sorts those pages when it encounters
> them. For pages in the oldest generation, eviction walks the rmap to
> check the accessed bit one more time before evicting them. During an
> rmap walk, the eviction feeds dense hotspots back to the aging. Dense
> hotspots flow through the Bloom filters. For pages not mapped in page
> tables, the eviction uses the PID controller to statistically
> determine whether they have higher refaults. If so, the eviction
> throttles their eviction by moving them to the next generation (the
> second oldest).
>
> 8. mm: multigenerational lru: user interface
> The knobs to turn on/off MGLRU and provide the userspace with
> thrashing prevention, working set estimation (the aging) and proactive
> reclaim (the eviction).
>
> 9. mm: multigenerational lru: Kconfig
> The Kconfig options.
>
> Benchmark results
> =================
> Independent lab results
> -----------------------
> Based on the popularity of searches [01] and the memory usage in
> Google's public cloud, the most popular open-source memory-hungry
> applications, in alphabetical order, are:
> Apache Cassandra Memcached
> Apache Hadoop MongoDB
> Apache Spark PostgreSQL
> MariaDB (MySQL) Redis
>
> An independent lab evaluated MGLRU with the most widely used benchmark
> suites for the above applications. They posted 960 data points along
> with kernel metrics and perf profiles collected over more than 500
> hours of total benchmark time. Their final reports show that, with 95%
> confidence intervals (CIs), the above applications all performed
> significantly better for at least part of their benchmark matrices.
>
> On 5.14:
> 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
> less wall time to sort three billion random integers, respectively,
> under the medium- and the high-concurrency conditions, when
> overcommitting memory. There were no statistically significant
> changes in wall time for the rest of the benchmark matrix.
> 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
> more transactions per minute (TPM), respectively, under the medium-
> and the high-concurrency conditions, when overcommitting memory.
> There were no statistically significant changes in TPM for the rest
> of the benchmark matrix.
> 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
> and [21.59, 30.02]% more operations per second (OPS), respectively,
> for sequential access, random access and Gaussian (distribution)
> access, when THP=always; 95% CIs [13.85, 15.97]% and
> [23.94, 29.92]% more OPS, respectively, for random access and
> Gaussian access, when THP=never. There were no statistically
> significant changes in OPS for the rest of the benchmark matrix.
> 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
> [2.16, 3.55]% more operations per second (OPS), respectively, for
> exponential (distribution) access, random access and Zipfian
> (distribution) access, when underutilizing memory; 95% CIs
> [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
> respectively, for exponential access, random access and Zipfian
> access, when overcommitting memory.
>
> On 5.15:
> 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
> and [4.11, 7.50]% more operations per second (OPS), respectively,
> for exponential (distribution) access, random access and Zipfian
> (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
> [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
> exponential access, random access and Zipfian access, when swap was
> on.
> 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
> less average wall time to finish twelve parallel TeraSort jobs,
> respectively, under the medium- and the high-concurrency
> conditions, when swap was on. There were no statistically
> significant changes in average wall time for the rest of the
> benchmark matrix.
> 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
> minute (TPM) under the high-concurrency condition, when swap was
> off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
> respectively, under the medium- and the high-concurrency
> conditions, when swap was on. There were no statistically
> significant changes in TPM for the rest of the benchmark matrix.
> 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
> [11.47, 19.36]% more total operations per second (OPS),
> respectively, for sequential access, random access and Gaussian
> (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
> [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
> for sequential access, random access and Gaussian access, when
> THP=never.
>
> Our lab results
> ---------------
> To supplement the above results, we ran the following benchmark suites
> on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
> are popular among MM developers, but we prefer large-scale A/B
> experiments to validate improvements.)
> fs_fio_bench_hdd_mq pft
> fs_lmbench pgsql-hammerdb
> fs_parallelio redis
> fs_postmark stream
> hackbench sysbenchthread
> kernbench tpcc_spark
> memcached unixbench
> multichase vm-scalability
> mutilate will-it-scale
> nginx
>
> [01] https://trends.google.com
> [02] https://lore.kernel.org/linux-mm/[email protected]/
> [03] https://lore.kernel.org/linux-mm/[email protected]/
> [04] https://lore.kernel.org/linux-mm/[email protected]/
> [05] https://lore.kernel.org/linux-mm/[email protected]/
> [06] https://lore.kernel.org/linux-mm/[email protected]/
> [07] https://lore.kernel.org/linux-mm/[email protected]/
> [08] https://lore.kernel.org/linux-mm/[email protected]/
> [09] https://lore.kernel.org/linux-mm/[email protected]/
> [10] https://lore.kernel.org/linux-mm/[email protected]/
>
> Real-world applications
> =======================
> Third-party testimonials
> ------------------------
> Konstantin wrote [11]:
> I have Archlinux with 8G RAM + zswap + swap. While developing, I
> have lots of apps opened such as multiple LSP-servers for different
> langs, chats, two browsers, etc... Usually, my system gets quickly
> to a point of SWAP-storms, where I have to kill LSP-servers,
> restart browsers to free memory, etc, otherwise the system lags
> heavily and is barely usable.
>
> 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
> patchset, and I started up by opening lots of apps to create memory
> pressure, and worked for a day like this. Till now I had *not a
> single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
> getting to the point of 3G in SWAP before without a single
> SWAP-storm.
>
> The Arch Linux Zen kernel [12] has been using MGLRU since 5.12. Many
> of its users reported their positive experiences to me, e.g., Shivodit
> wrote:
> I've tried the latest Zen kernel (5.14.13-zen1-1-zen in the
> archlinux testing repos), everything's been smooth so far. I also
> decided to copy a large volume of files to check performance under
> I/O load, and everything went smoothly - no stuttering was present,
> everything was responsive.
>
> Large-scale deployments
> -----------------------
> We've rolled out MGLRU to tens of millions of Chrome OS users and
> about a million Android users. Google's fleetwide profiling [13] shows
> an overall 40% decrease in kswapd CPU usage, in addition to
> improvements in other UX metrics, e.g., an 85% decrease in the number
> of low-memory kills at the 75th percentile and an 18% decrease in
> rendering latency at the 50th percentile.
>
> [11] https://lore.kernel.org/linux-mm/[email protected]/
> [12] https://github.com/zen-kernel/zen-kernel/
> [13] https://research.google/pubs/pub44271/
>
> Summary
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
> indicate substantial improvements; there are no known regressions.
> 2. Thrashing prevention, working set estimation and proactive reclaim
> work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
> with similar effects.
>
> Our options, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
> materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [14][15][16], the
> new features will likely be put to use for both personal computers
> and data centers.
> 3. Based on Google's track record, the new code will likely be well
> maintained in the long term. It'd be more difficult if not
> impossible to achieve similar effects on top of the existing
> design.
>
> [14] https://lore.kernel.org/lkml/[email protected]/
> [15] https://lore.kernel.org/lkml/[email protected]/
> [16] https://lore.kernel.org/lkml/[email protected]/
>
> Yu Zhao (9):
> mm: x86, arm64: add arch_has_hw_pte_young()
> mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> mm/vmscan.c: refactor shrink_node()
> mm: multigenerational lru: groundwork
> mm: multigenerational lru: mm_struct list
> mm: multigenerational lru: aging
> mm: multigenerational lru: eviction
> mm: multigenerational lru: user interface
> mm: multigenerational lru: Kconfig
>
> Documentation/vm/index.rst | 1 +
> Documentation/vm/multigen_lru.rst | 80 +
> arch/Kconfig | 9 +
> arch/arm64/include/asm/cpufeature.h | 5 +
> arch/arm64/include/asm/pgtable.h | 13 +-
> arch/arm64/kernel/cpufeature.c | 19 +
> arch/arm64/tools/cpucaps | 1 +
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 9 +-
> arch/x86/mm/pgtable.c | 5 +-
> fs/exec.c | 2 +
> fs/fuse/dev.c | 3 +-
> include/linux/cgroup.h | 15 +-
> include/linux/memcontrol.h | 11 +
> include/linux/mm.h | 42 +
> include/linux/mm_inline.h | 204 ++
> include/linux/mm_types.h | 78 +
> include/linux/mmzone.h | 175 ++
> include/linux/nodemask.h | 1 +
> include/linux/oom.h | 16 +
> include/linux/page-flags-layout.h | 19 +-
> include/linux/page-flags.h | 4 +-
> include/linux/pgtable.h | 17 +-
> include/linux/sched.h | 4 +
> include/linux/swap.h | 4 +
> kernel/bounds.c | 3 +
> kernel/cgroup/cgroup-internal.h | 1 -
> kernel/exit.c | 1 +
> kernel/fork.c | 9 +
> kernel/sched/core.c | 1 +
> mm/Kconfig | 48 +
> mm/huge_memory.c | 3 +-
> mm/memcontrol.c | 26 +
> mm/memory.c | 21 +-
> mm/mm_init.c | 6 +-
> mm/oom_kill.c | 4 +-
> mm/page_alloc.c | 1 +
> mm/rmap.c | 7 +
> mm/swap.c | 51 +-
> mm/vmscan.c | 2691 ++++++++++++++++++++++++++-
> mm/workingset.c | 119 +-
> 41 files changed, 3591 insertions(+), 139 deletions(-)
> create mode 100644 Documentation/vm/multigen_lru.rst

For the series:

Tested-by: Oleksandr Natalenko <[email protected]>

I have run this (and one of the previous spins) on nine machines (physical, virtual, workstations, servers) for quite some time with no hassle.

Thanks for your work, and please keep me in Cc once you post new spins. I'm more than happy to deploy those across the fleet.

--
Oleksandr Natalenko (post-factum)



2022-01-12 23:43:23

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Wed, Jan 12, 2022 at 11:17:53AM +0100, Michal Hocko wrote:
> On Tue 11-01-22 18:01:29, Yu Zhao wrote:
> > On Mon, Jan 10, 2022 at 05:57:39PM +0100, Michal Hocko wrote:
> > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > [...]
> > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > +{
> > > > + static const struct mm_walk_ops mm_walk_ops = {
> > > > + .test_walk = should_skip_vma,
> > > > + .p4d_entry = walk_pud_range,
> > > > + };
> > > > +
> > > > + int err;
> > > > +#ifdef CONFIG_MEMCG
> > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > +#endif
> > > > +
> > > > + walk->next_addr = FIRST_USER_ADDRESS;
> > > > +
> > > > + do {
> > > > + unsigned long start = walk->next_addr;
> > > > + unsigned long end = mm->highest_vm_end;
> > > > +
> > > > + err = -EBUSY;
> > > > +
> > > > + rcu_read_lock();
> > > > +#ifdef CONFIG_MEMCG
> > > > + if (memcg && atomic_read(&memcg->moving_account))
> > > > + goto contended;
> > > > +#endif
> > >
> > > Why do you need to check for moving_account?
> >
> > This check, if succeeds, blocks memcg migration.
>
> OK, I can see that you rely on the RCU here for the synchronization. A
> comment which mentions mem_cgroup_move_charge would be helpful for
> clarity.

Will do

> Is there any reason you are not using folio_memcg_lock in the
> pte walk instead?

We have a particular lruvec (the first arg), hence a particular memcg
to lock. But we don't have a particular page to lock.

2022-01-13 09:00:28

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Wed, Jan 12, 2022 at 09:56:58PM +0100, Oleksandr Natalenko wrote:
> Hello.
>
> On Tuesday, 4 January 2022 21:22:19 CET Yu Zhao wrote:
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and it
> > often makes poor choices about what to evict. This patchset offers an
> > alternative solution that is performant, versatile and
> > straightforward.

<snipped>

> For the series:
>
> Tested-by: Oleksandr Natalenko <[email protected]>
>
> I have run this (and one of the previous spins) on nine machines (physical, virtual, workstations, servers) for quite some time with no hassle.
>
> Thanks for your work, and please keep me in Cc once you post new spins. I'm more than happy to deploy those across the fleet.

Thanks, Oleksandr. And if I may take the liberty of introducing you as:

Oleksandr - the post-factum kernel maintainer
https://gitlab.com/post-factum/pf-kernel

in addition to other downstream kernel maintainers I've introduced:

> Alexandre - the XanMod kernel maintainer
> https://xanmod.org
>
> Brian - the Chrome OS kernel memory maintainer
> https://www.chromium.org
>
> Jan - the Arch Linux Zen kernel maintainer
> https://archlinux.org
>
> Steven - the Liquorix kernel maintainer
> https://liquorix.net
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
> https://chromium.googlesource.com/chromiumos/third_party/kernel

2022-01-13 09:25:40

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Wed, Jan 12, 2022 at 11:28:57AM +0100, Michal Hocko wrote:
> On Tue 11-01-22 16:16:57, Yu Zhao wrote:
> > On Mon, Jan 10, 2022 at 04:01:13PM +0100, Michal Hocko wrote:
> > > On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > > > > +{
> > > > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> > >
> > > One thing I have overlooked completely.
> >
> > I appreciate your attention to details but GFP_KERNEL is legit in the
> > reclaim path. It's been used many years in our production, e.g.,
> > page reclaim
> > swap_writepage()
> > frontswap_store()
> > zswap_frontswap_store()
> > zswap_entry_cache_alloc(GFP_KERNEL)
> >
> > (And I always test my changes with lockdep, kasan, DEBUG_VM, etc., no
> > warnings ever seen from using GFP_KERNEL in the reclaim path.)
>
> OK, I can see it now. __need_reclaim will check for PF_MEMALLOC and skip
> the fs_reclaim tracking.
>
> I still maintain I am not really happy about (nor in the zswap example)
> allocations from the direct reclaim context. I would really recommend
> using a pre-allocated pool of objects.

Not trying to argue anything -- there are many other places in the
reclaim path that must allocate memory to make progress, e.g.,

    add_to_swap_cache()
      xas_nomem()

    __swap_writepage()
      bio_alloc()

The only way to not allocate memory is to drop clean pages. Writing dirty
pages (not swap) might require allocations as well. (But we only write
dirty pages in kswapd, not in the direct reclaim path.)

> If there are strong reasons for not doing so then at least change that
> to kzalloc.

Consider it done.

2022-01-13 09:43:49

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Mon, Jan 10, 2022 at 03:37:28PM +0100, Michal Hocko wrote:
> On Sun 09-01-22 20:58:02, Yu Zhao wrote:
> > On Fri, Jan 07, 2022 at 10:00:31AM +0100, Michal Hocko wrote:
> > > On Fri 07-01-22 09:55:09, Michal Hocko wrote:
> > > [...]
> > > > > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> > > > > reclaimer; and direct reclaimers rarely come here, i.e., only when
> > > > > kswapd can't keep up in terms of the aging, which is similar to the
> > > > > condition where the inactive list is empty for the active/inactive
> > > > > lru.
> > > >
> > > > Well, this is not a strong argument to be honest. Kswapd being stuck
> > > > and the majority of the reclaim being done in the direct reclaim
> > > > context is a situation I have seen many many times.
> > >
> > > Also do not forget that memcg reclaim is effectivelly only direct
> > > reclaim. Not that the memcg reclaim indicates a global memory shortage
> > > but it can add up and race with the global reclaim as well.
> >
> > I don't dispute any of the above, and I probably don't like this code
> > more than you do.
> >
> > But let's not forget the purposes of PF_MEMALLOC, besides preventing
> > recursive reclaims, include letting reclaim dip into reserves so that
> > it can make more free memory. So I think it's acceptable if the
> > following conditions are met:
> > 1. The allocation size is small.
> > 2. The number of allocations is bounded.
> > 3. Its failure doesn't stall reclaim.
> > And it'd be nice if
> > 4. The allocation happens rarely, e.g., slow path only.
>
> I would add
> 0. The allocation should be done only if absolutely _necessary_.
>
> Please keep in mind that whatever you allocate from that context will be
> consuming a very precious memory reserves which are shared with other
> components of the system. Even worse these can go all the way to
> depleting memory completely where other things can fall apart.

I agree but I also see a distinction:
1,2,3 are objective;
0,4 are subjective.

For some users, page reclaim itself might not be absolutely necessary
because they are okay with OOM kills. For others, the situation could
be the reverse.

> > The code in question meets all of them.
> >
> > 1. This allocation is 160 bytes.
> > 2. It's bounded by the number of page table walkers which, in the
> > worst, is same as the number of mm_struct's.
> > 3. Most importantly, its failure doesn't stall the aging. The aging
> > will fallback to the rmap-based function lru_gen_look_around().
> > But this function only gathers the accessed bit from at most 64
> > PTEs, meaning it's less efficient (retains ~80% performance gains).
> > 4. This allocation is rare, i.e., only when the aging is required,
> > which is similar to the low inactive case for the active/inactive
> > lru.
>
> I think this fallback behavior deserves much more detailed explanation
> in changelogs.

Will do.

> > The bottom line is I can try various optimizations, e.g., preallocate
> > a few buffers for a limited number of page walkers and if this number
> > has been reached, fallback to the rmap-based function. But I have yet
> > to see evidence that calls for additional complexity.
>
> I would disagree here. This is not an optimization. You should be
> avoiding allocations from the memory reclaim because any allocation just
> add a runtime behavior complexity and potential corner cases.

Would __GFP_NOMEMALLOC address your concern? It prevents allocations
from accessing the reserves even under PF_MEMALLOC.

2022-01-13 09:48:00

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

On Wed, Jan 12, 2022 at 05:45:40PM +0200, Mike Rapoport wrote:
> On Wed, Jan 12, 2022 at 01:35:52AM -0700, Yu Zhao wrote:
> > On Mon, Jan 10, 2022 at 12:27:19PM +0200, Mike Rapoport wrote:
> > > Hi,
> > >
> > > On Tue, Jan 04, 2022 at 01:22:27PM -0700, Yu Zhao wrote:
> > > > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.
> > > >
> > > > Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention.
> > > > Compared with the size-based approach, e.g., [1], this time-based
> > > > approach has the following advantages:
> > > > 1) It's easier to configure because it's agnostic to applications and
> > > > memory sizes.
> > > > 2) It's more reliable because it's directly wired to the OOM killer.
> > > >
> > > > Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> > > > reclaim. Compared with the page table-based approach and the PFN-based
> > > > approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
> > > > the following advantages:
> > > > 1) It offers better choices because it's aware of memcgs, NUMA nodes,
> > > > shared mappings and unmapped page cache.
> > > > 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas
> > > > the PFN-based approach is O(nr_total_pages).
> > > >
> > > > Add /sys/kernel/debug/lru_gen_full for debugging.
> > > >
> > > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > >
> > > > Signed-off-by: Yu Zhao <[email protected]>
> > > > Tested-by: Konstantin Kharlamov <[email protected]>
> > > > ---
> > > > Documentation/vm/index.rst | 1 +
> > > > Documentation/vm/multigen_lru.rst | 62 +++++
> > >
> > > The description of user visible interfaces should go to
> > > Documentation/admin-guide/mm
> > >
> > > Documentation/vm/multigen_lru.rst should have contained design description
> > > and the implementation details and it would be great to actually have such
> > > document.
> >
> > Will do, thanks.
> >
> > > > include/linux/nodemask.h | 1 +
> > > > mm/vmscan.c | 415 ++++++++++++++++++++++++++++++
> > > > 4 files changed, 479 insertions(+)
> > > > create mode 100644 Documentation/vm/multigen_lru.rst
> > > >
> > > > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> > > > index 6f5ffef4b716..f25e755b4ff4 100644
> > > > --- a/Documentation/vm/index.rst
> > > > +++ b/Documentation/vm/index.rst
> > > > @@ -38,3 +38,4 @@ algorithms. If you are looking for advice on simply allocating memory, see the
> > > > unevictable-lru
> > > > z3fold
> > > > zsmalloc
> > > > + multigen_lru
> > > > diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
> > > > new file mode 100644
> > > > index 000000000000..6f9e0181348b
> > > > --- /dev/null
> > > > +++ b/Documentation/vm/multigen_lru.rst
> > > > @@ -0,0 +1,62 @@
> > > > +.. SPDX-License-Identifier: GPL-2.0
> > > > +
> > > > +=====================
> > > > +Multigenerational LRU
> > > > +=====================
> > > > +
> > > > +Quick start
> > > > +===========
> > > > +Runtime configurations
> > > > +----------------------
> > > > +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
> > > > + feature wasn't enabled by default.
> > >
> > > Required for what? This sentence seems to lack context. Maybe add an
> > > overview of what the Multigenerational LRU is so that users will have an idea
> > > what these knobs control.
> >
> > Apparently I left an important part of this quick start in the next
> > patch, where Kconfig options are added. I'm wondering whether I should
> > squash the next patch into this one.
>
> I think documentation deserves a separate patch.

Will do.

2022-01-13 10:33:04

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

Yu Zhao <[email protected]> writes:

> Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.


Got the below lockdep warning while using the above kill/enable switch


[ 84.252952] ======================================================
[ 84.253012] WARNING: possible circular locking dependency detected
[ 84.253074] 5.16.0-rc8-16204-g1cdcf1120b31 #511 Not tainted
[ 84.253135] ------------------------------------------------------
[ 84.253194] bash/2862 is trying to acquire lock:
[ 84.253243] c0000000021ff740 (cgroup_mutex){+.+.}-{3:3}, at: store_enable+0x80/0x1510
[ 84.253340]
but task is already holding lock:
[ 84.253410] c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50
[ 84.253503]
which lock already depends on the new lock.

[ 84.253608]
the existing dependency chain (in reverse order) is:
[ 84.253693]
-> #2 (mem_hotplug_lock){++++}-{0:0}:
[ 84.253768] lock_acquire+0x134/0x4a0
[ 84.253821] percpu_down_write+0x80/0x1c0
[ 84.253872] try_online_node+0x40/0x90
[ 84.253924] cpu_up+0x7c/0x160
[ 84.253976] bringup_nonboot_cpus+0xc4/0x120
[ 84.254027] smp_init+0x48/0xd4
[ 84.254079] kernel_init_freeable+0x274/0x45c
[ 84.254134] kernel_init+0x44/0x194
[ 84.254188] ret_from_kernel_thread+0x5c/0x64
[ 84.254241]
-> #1 (cpu_hotplug_lock){++++}-{0:0}:
[ 84.254321] lock_acquire+0x134/0x4a0
[ 84.254373] cpus_read_lock+0x6c/0x180
[ 84.254426] static_key_disable+0x24/0x50
[ 84.254477] rebind_subsystems+0x3b0/0x5a0
[ 84.254528] cgroup_setup_root+0x24c/0x530
[ 84.254581] cgroup1_get_tree+0x7d8/0xb80
[ 84.254638] vfs_get_tree+0x48/0x150
[ 84.254695] path_mount+0x8b8/0xd20
[ 84.254752] do_mount+0xb8/0xe0
[ 84.254808] sys_mount+0x250/0x390
[ 84.254863] system_call_exception+0x15c/0x2b0
[ 84.254932] system_call_common+0xec/0x250
[ 84.254989]
-> #0 (cgroup_mutex){+.+.}-{3:3}:
[ 84.255072] check_prev_add+0x180/0x1050
[ 84.255129] __lock_acquire+0x17b8/0x25c0
[ 84.255186] lock_acquire+0x134/0x4a0
[ 84.255243] __mutex_lock+0xdc/0xa90
[ 84.255300] store_enable+0x80/0x1510
[ 84.255356] kobj_attr_store+0x2c/0x50
[ 84.255413] sysfs_kf_write+0x6c/0xb0
[ 84.255471] kernfs_fop_write_iter+0x1bc/0x2b0
[ 84.255539] new_sync_write+0x130/0x1d0
[ 84.255594] vfs_write+0x2cc/0x4c0
[ 84.255645] ksys_write+0x84/0x140
[ 84.255699] system_call_exception+0x15c/0x2b0
[ 84.255771] system_call_common+0xec/0x250
[ 84.255829]
other info that might help us debug this:

[ 84.255933] Chain exists of:
cgroup_mutex --> cpu_hotplug_lock --> mem_hotplug_lock

[ 84.256070] Possible unsafe locking scenario:

[ 84.256149] CPU0 CPU1
[ 84.256201] ---- ----
[ 84.256255] lock(mem_hotplug_lock);
[ 84.256311] lock(cpu_hotplug_lock);
[ 84.256380] lock(mem_hotplug_lock);
[ 84.256448] lock(cgroup_mutex);
[ 84.256491]
*** DEADLOCK ***

[ 84.256571] 5 locks held by bash/2862:
[ 84.256626] #0: c00000002043d460 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x84/0x140
[ 84.256728] #1: c00000004bafc888 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x178/0x2b0
[ 84.256830] #2: c000000020b993b8 (kn->active#207){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x184/0x2b0
[ 84.256942] #3: c0000000020e5cd0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x20/0x50
[ 84.257045] #4: c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50
[ 84.257152]
stack backtrace:
[ 84.257220] CPU: 107 PID: 2862 Comm: bash Not tainted 5.16.0-rc8-16204-g1cdcf1120b31 #511
[ 84.257309] Call Trace:
[ 84.257346] [c000000040d5b4a0] [c000000000a89f94] dump_stack_lvl+0x98/0xe0 (unreliable)
[ 84.257438] [c000000040d5b4e0] [c000000000267244] print_circular_bug.isra.0+0x3b4/0x3e0
[ 84.257528] [c000000040d5b580] [c0000000002673e0] check_noncircular+0x170/0x1a0
[ 84.257605] [c000000040d5b650] [c000000000268be0] check_prev_add+0x180/0x1050
[ 84.257683] [c000000040d5b710] [c00000000026ca48] __lock_acquire+0x17b8/0x25c0
[ 84.257760] [c000000040d5b840] [c00000000026e4c4] lock_acquire+0x134/0x4a0
[ 84.257837] [c000000040d5b940] [c00000000148a53c] __mutex_lock+0xdc/0xa90
[ 84.257914] [c000000040d5ba60] [c0000000004d5080] store_enable+0x80/0x1510
[ 84.257989] [c000000040d5bbc0] [c000000000a9286c] kobj_attr_store+0x2c/0x50
[ 84.258066] [c000000040d5bbe0] [c000000000752c4c] sysfs_kf_write+0x6c/0xb0
[ 84.258143] [c000000040d5bc20] [c000000000750fcc] kernfs_fop_write_iter+0x1bc/0x2b0
[ 84.258219] [c000000040d5bc70] [c000000000615df0] new_sync_write+0x130/0x1d0
[ 84.258295] [c000000040d5bd10] [c00000000061997c] vfs_write+0x2cc/0x4c0
[ 84.258373] [c000000040d5bd60] [c000000000619d54] ksys_write+0x84/0x140
[ 84.258450] [c000000040d5bdb0] [c00000000002c91c] system_call_exception+0x15c/0x2b0
[ 84.258528] [c000000040d5be10] [c00000000000c64c] system_call_common+0xec/0x250
[ 84.258604] --- interrupt: c00 at 0x79c551e76554
[ 84.258658] NIP: 000079c551e76554 LR: 000079c551de2674 CTR: 0000000000000000
[ 84.258732] REGS: c000000040d5be80 TRAP: 0c00 Not tainted (5.16.0-rc8-16204-g1cdcf1120b31)
[ 84.258817] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 28422428 XER: 00000000
[ 84.258931] IRQMASK: 0
GPR00: 0000000000000004 00007fffc8e9a320 000079c551f77100 0000000000000001
GPR04: 0000017190973cc0 0000000000000002 0000000000000010 0000017190973cc0
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 000079c5520ab1d0 0000017190943590 000001718749b738
GPR16: 00000171873b0ae0 0000000000000000 0000000020000000 0000017190973a60
GPR20: 0000000000000000 0000000000000001 0000017187443ca0 00007fffc8e9a514
GPR24: 00007fffc8e9a510 000001718749b0d0 000079c551f719d8 000079c551f72308
GPR28: 0000000000000002 000079c551f717e8 0000017190973cc0 0000000000000002
[ 84.259600] NIP [000079c551e76554] 0x79c551e76554
[ 84.259651] LR [000079c551de2674] 0x79c551de2674
[ 84.259701] --- interrupt: c00

2022-01-13 11:58:36

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Wed 12-01-22 16:43:15, Yu Zhao wrote:
> On Wed, Jan 12, 2022 at 11:17:53AM +0100, Michal Hocko wrote:
[...]
> > Is there any reason you are not using folio_memcg_lock in the
> > pte walk instead?
>
> We have a particular lruvec (the first arg), hence a particular memcg
> to lock. But we don't have a particular page to lock.

That is certainly true at this layer but the locking should be needed
only for specific pages, no? So you can move the lock down to the
callback which examines respective pages. Or is there anything
preventing that?

To be honest, and that is the reason I am asking, I really do not like
to open code the migration synchronization outside of the memcg proper.
Code paths which need a stable memcg are supposed to be using
folio_memcg_lock for the specific examination time. If you prefer a
trylock approach for this usecase then we can add one.

--
Michal Hocko
SUSE Labs

2022-01-13 12:02:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Thu 13-01-22 02:43:38, Yu Zhao wrote:
[...]
> > > The bottom line is I can try various optimizations, e.g., preallocate
> > > a few buffers for a limited number of page walkers and if this number
> > > has been reached, fallback to the rmap-based function. But I have yet
> > > to see evidence that calls for additional complexity.
> >
> > I would disagree here. This is not an optimization. You should be
> > avoiding allocations from the memory reclaim because any allocation just
> > add a runtime behavior complexity and potential corner cases.
>
> Would __GFP_NOMEMALLOC address your concern? It prevents allocations
> from accessing the reserves even under PF_MEMALLOC.

__GFP_NOMEMALLOC would deal with the complete memory depletion concern
for sure but I am not sure how any of these allocations would succeed
when called from the direct reclaim. Some access to memory reserves is
necessary if you insist on allocating from the reclaim process.

You can have a look at the limited memory reserves access by oom victims
for an example of how this can be done.

--
Michal Hocko
SUSE Labs

2022-01-13 17:05:05

by Alexey Avramov

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

> But the later one is more complex and a proper
> handling really depends on the particular workload

That is why I advocate the introduction of new tunables.

> There are workloads which prefer a temporary trashing over its working
> set during a peak memory demand rather than an OOM kill

OK, for such cases, the OOM handles can be set to 0.
It can even be the default value.

> On the other side workloads that are
> latency sensitive

I daresay that this is the case with most workloads.
An internet server that falls into thrashing is a dead server.

> no simple solution can be applied to the whole

There are several solutions, and they all work; they can be taken into
the kernel at the same time:
- min_ttl_ms + MGLRU
- vm.min_filelist_kbytes-like knobs
- PSI-based solutions.

> For the most steady trashing situations I have
> seen the userspace with mlocked memory and the code can make a forward
> progress and mediate the situation.

I still don't see a problem with having all of these kernel-space
solutions in the kernel.

2022-01-13 23:02:23

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

On Thu, Jan 13, 2022 at 04:01:31PM +0530, Aneesh Kumar K.V wrote:
> Yu Zhao <[email protected]> writes:
>
> > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.
>
>
> Got the below lockdep warning while using the above kill/enable switch
>
>
> [ 84.252952] ======================================================
> [ 84.253012] WARNING: possible circular locking dependency detected
> [ 84.253074] 5.16.0-rc8-16204-g1cdcf1120b31 #511 Not tainted
> [ 84.253135] ------------------------------------------------------
> [ 84.253194] bash/2862 is trying to acquire lock:
> [ 84.253243] c0000000021ff740 (cgroup_mutex){+.+.}-{3:3}, at: store_enable+0x80/0x1510
> [ 84.253340]
> but task is already holding lock:
> [ 84.253410] c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50
> [ 84.253503]
> which lock already depends on the new lock.
>
> [ 84.255933] Chain exists of:
> cgroup_mutex --> cpu_hotplug_lock --> mem_hotplug_lock

Thanks. Will reverse the order between mem_hotplug_lock and
cgroup_mutex in the next spin.

2022-01-14 05:22:30

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

Yu Zhao <[email protected]> writes:

> On Thu, Jan 13, 2022 at 04:01:31PM +0530, Aneesh Kumar K.V wrote:
>> Yu Zhao <[email protected]> writes:
>>
>> > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.
>>
>>
>> Got the below lockdep warning while using the above kill/enable switch
>>
>>
>> [ 84.252952] ======================================================
>> [ 84.253012] WARNING: possible circular locking dependency detected
>> [ 84.253074] 5.16.0-rc8-16204-g1cdcf1120b31 #511 Not tainted
>> [ 84.253135] ------------------------------------------------------
>> [ 84.253194] bash/2862 is trying to acquire lock:
>> [ 84.253243] c0000000021ff740 (cgroup_mutex){+.+.}-{3:3}, at: store_enable+0x80/0x1510
>> [ 84.253340]
>> but task is already holding lock:
>> [ 84.253410] c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50
>> [ 84.253503]
>> which lock already depends on the new lock.
>>
>> [ 84.255933] Chain exists of:
>> cgroup_mutex --> cpu_hotplug_lock --> mem_hotplug_lock
>
> Thanks. Will reverse the order between mem_hotplug_lock and
> cgroup_mutex in the next spin.

It also needs the unlocked variant of static_key_enable/disable.

[ 71.204397][ T2819] bash/2819 is trying to acquire lock:
[ 71.204446][ T2819] c0000000020e5cd0 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_disable+0x24/0x50
[ 71.204542][ T2819]
[ 71.204542][ T2819] but task is already holding lock:
[ 71.204613][ T2819] c0000000020e5cd0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x20/0x50
[ 71.204710][ T2819]
[ 71.204710][ T2819] other info that might help us debug this:
[ 71.204787][ T2819] Possible unsafe locking scenario:
[ 71.204787][ T2819]
[ 71.204860][ T2819] CPU0
[ 71.204901][ T2819] ----
[ 71.204941][ T2819] lock(cpu_hotplug_lock);
[ 71.204998][ T2819] lock(cpu_hotplug_lock);
[ 71.205053][ T2819]
[ 71.205053][ T2819] *** DEADLOCK ***

-aneesh

2022-01-14 06:50:24

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] mm: multigenerational lru: user interface

On Fri, Jan 14, 2022 at 10:50:05AM +0530, Aneesh Kumar K.V wrote:
> Yu Zhao <[email protected]> writes:
> > On Thu, Jan 13, 2022 at 04:01:31PM +0530, Aneesh Kumar K.V wrote:
> >> Yu Zhao <[email protected]> writes:
> >>
> >> > Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch.
> >>
> >> Got the below lockdep warning while using the above kill/enable switch
> >>
> >>
> >> [ 84.252952] ======================================================
> >> [ 84.253012] WARNING: possible circular locking dependency detected
> >> [ 84.253074] 5.16.0-rc8-16204-g1cdcf1120b31 #511 Not tainted
> >> [ 84.253135] ------------------------------------------------------
> >> [ 84.253194] bash/2862 is trying to acquire lock:
> >> [ 84.253243] c0000000021ff740 (cgroup_mutex){+.+.}-{3:3}, at: store_enable+0x80/0x1510
> >> [ 84.253340]
> >> but task is already holding lock:
> >> [ 84.253410] c000000002221348 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x30/0x50
> >> [ 84.253503]
> >> which lock already depends on the new lock.
> >>
> >> [ 84.255933] Chain exists of:
> >> cgroup_mutex --> cpu_hotplug_lock --> mem_hotplug_lock
> >
> > Thanks. Will reverse the order between mem_hotplug_lock and
> > cgroup_mutex in the next spin.
>
> It also needs the unlocked variant of static_key_enable/disable.

Right. This is what I have at the moment. Tested with QEMU memory
hotplug. Can you please give it a try too? Thanks.

cgroup_lock()
cpus_read_lock()
get_online_mems()

if (enable)
	static_branch_enable_cpuslocked()
else
	static_branch_disable_cpuslocked()

put_online_mems()
cpus_read_unlock()
cgroup_unlock()
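
A minimal sketch of what this ordering might look like in C, illustrative
only and not the actual patch; the static key name (lru_gen_static_key)
and the wrapper function are hypothetical, while the locking helpers
follow the pseudocode above:

	#include <linux/cgroup.h>
	#include <linux/cpu.h>
	#include <linux/jump_label.h>
	#include <linux/memory_hotplug.h>

	/* Hypothetical name for the MGLRU static key. */
	DEFINE_STATIC_KEY_TRUE(lru_gen_static_key);

	/*
	 * Take cgroup_mutex before cpu_hotplug_lock and mem_hotplug_lock,
	 * matching the dependency chain reported by lockdep, then toggle
	 * the static key with the cpuslocked variants.
	 */
	static void lru_gen_set_enabled(bool enable)
	{
		cgroup_lock();			/* cgroup_mutex */
		cpus_read_lock();		/* cpu_hotplug_lock */
		get_online_mems();		/* mem_hotplug_lock */

		if (enable)
			static_branch_enable_cpuslocked(&lru_gen_static_key);
		else
			static_branch_disable_cpuslocked(&lru_gen_static_key);

		put_online_mems();
		cpus_read_unlock();
		cgroup_unlock();
	}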

2022-01-20 00:52:49

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 11, 2022 at 01:41:22AM -0700, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > > Summery
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > > indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > > work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > > with similar effects.
> > >
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > > materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > > new features will likely be put to use for both personal computers
> > > and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > > maintained in the long term. It'd be more difficult if not
> > > impossible to achieve similar effects on top of the existing
> > > design.
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.

My gratitude to Donald who has been helping test MGLRU since v2:

Donald Carr ([email protected])

Founder of Chaos Reins (http://chaos-reins.com), an SF based
consultancy company specializing in designing/creating embedded
Linux appliances.

Can you please provide your Tested-by tags? This will ensure the credit
for your contributions.

Thanks!

2022-01-20 01:12:24

by Donald Carr

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

January 18, 2022 1:21 AM, "Yu Zhao" <[email protected]> wrote:

> On Tue, Jan 11, 2022 at 01:41:22AM -0700, Yu Zhao wrote:
>
>> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
>> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
>>> TLDR
>>> ====
>>> The current page reclaim is too expensive in terms of CPU usage and it
>>> often makes poor choices about what to evict. This patchset offers an
>>> alternative solution that is performant, versatile and
>>> straightforward.
>>
>> <snipped>
>>
>>> Summery
>>> =======
>>> The facts are:
>>> 1. The independent lab results and the real-world applications
>>> indicate substantial improvements; there are no known regressions.
>>> 2. Thrashing prevention, working set estimation and proactive reclaim
>>> work out of the box; there are no equivalent solutions.
>>> 3. There is a lot of new code; nobody has demonstrated smaller changes
>>> with similar effects.
>>>
>>> Our options, accordingly, are:
>>> 1. Given the amount of evidence, the reported improvements will likely
>>> materialize for a wide range of workloads.
>>> 2. Gauging the interest from the past discussions [14][15][16], the
>>> new features will likely be put to use for both personal computers
>>> and data centers.
>>> 3. Based on Google's track record, the new code will likely be well
>>> maintained in the long term. It'd be more difficult if not
>>> impossible to achieve similar effects on top of the existing
>>> design.
>>
>> Hi Andrew, Linus,
>>
>> Can you please take a look at this patchset and let me know if it's
>> 5.17 material?
>>
>> My goal is to get it merged asap so that users can reap the benefits
>> and I can push the sequels. Please examine the data provided -- I
>> think the unprecedented coverage and the magnitude of the improvements
>> warrant a green light.
>
> My gratitude to Donald who has been helping test MGLRU since v2:
>
> Donald Carr ([email protected])
>
> Founder of Chaos Reins (http://chaos-reins.com), an SF based
> consultancy company specializing in designing/creating embedded
> Linux appliances.

Tested-by: Donald Carr <[email protected]>

> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
>
> Thanks!

2022-01-21 16:58:19

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Thu, Jan 13, 2022 at 01:02:26PM +0100, Michal Hocko wrote:
> On Thu 13-01-22 02:43:38, Yu Zhao wrote:
> [...]
> > > > The bottom line is I can try various optimizations, e.g., preallocate
> > > > a few buffers for a limited number of page walkers and if this number
> > > > has been reached, fallback to the rmap-based function. But I have yet
> > > > to see evidence that calls for additional complexity.
> > >
> > > I would disagree here. This is not an optimization. You should be
> > > avoiding allocations from the memory reclaim because any allocation just
> > > add a runtime behavior complexity and potential corner cases.
> >
> > Would __GFP_NOMEMALLOC address your concern? It prevents allocations
> > from accessing the reserves even under PF_MEMALLOC.
>
> __GFP_NOMEMALLOC would deal with the complete memory depletion concern
> for sure but I am not sure how any of these allocations would succeed
> when called from the direct reclaim. Some access to memory reserves is
> necessary if you insist on allocating from the reclaim process.
>
> You can have a look at the limited memory reserves access by oom victims
> for an example of how this can be done.

Thanks. I'll change GFP_KERNEL to __GFP_HIGH | __GFP_NOMEMALLOC.
__GFP_HIGH allows some access to memory reserves and __GFP_NOMEMALLOC
prevents the complete depletion. Basically, the combination lowers the
min watermark by 1/2, and we have been using these flags for
add_to_swap_cache().
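
For reference, a sketch of how alloc_mm_walk() might look with the
changes discussed in this thread (kzalloc as requested earlier, plus the
new GFP flags). The kswapd fast path that reuses a preallocated walk and
the __GFP_NOWARN flag are assumptions inferred from the snippets quoted
above, not the final patch:

	/*
	 * Illustrative sketch: reuse the walk preallocated by kswapd
	 * (current->reclaim_state->mm_walk) when it exists; otherwise do a
	 * small, bounded allocation that may dip into part of the reserves
	 * (__GFP_HIGH) but cannot deplete them (__GFP_NOMEMALLOC).
	 */
	static struct lru_gen_mm_walk *alloc_mm_walk(void)
	{
		if (current->reclaim_state && current->reclaim_state->mm_walk)
			return current->reclaim_state->mm_walk;

		return kzalloc(sizeof(struct lru_gen_mm_walk),
			       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
	}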

2022-01-21 17:42:44

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote:
> On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > [...]
> > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > +{
> > > > + static const struct mm_walk_ops mm_walk_ops = {
> > > > + .test_walk = should_skip_vma,
> > > > + .p4d_entry = walk_pud_range,
> > > > + };
> > > > +
> > > > + int err;
> > > > +#ifdef CONFIG_MEMCG
> > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > +#endif
> > > > +
> > > > + walk->next_addr = FIRST_USER_ADDRESS;
> > > > +
> > > > + do {
> > > > + unsigned long start = walk->next_addr;
> > > > + unsigned long end = mm->highest_vm_end;
> > > > +
> > > > + err = -EBUSY;
> > > > +
> > > > + rcu_read_lock();
> > > > +#ifdef CONFIG_MEMCG
> > > > + if (memcg && atomic_read(&memcg->moving_account))
> > > > + goto contended;
> > > > +#endif
> > > > + if (!mmap_read_trylock(mm))
> > > > + goto contended;
> > >
> > > Have you evaluated the behavior under mmap_sem contention? I mean what
> > > would be an effect of some mms being excluded from the walk? This path
> > > is called from direct reclaim and we do allocate with exclusive mmap_sem
> > > IIRC and the trylock can fail in a presence of pending writer if I am
> > > not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> > > can bypass the walk.
> >
> > You are right. Here it must be a trylock; otherwise it can deadlock.
>
> Yeah, this is clear.
>
> > I think there might be a misunderstanding: the aging doesn't
> > exclusively rely on page table walks to gather the accessed bit. It
> > prefers page table walks but it can also fallback to the rmap-based
> > function, i.e., lru_gen_look_around(), which only gathers the accessed
> > bit from at most 64 PTEs and therefore is less efficient. But it still
> > retains about 80% of the performance gains.
>
> I have to say that I really have hard time to understand the runtime
> behavior depending on that interaction. How does the reclaim behave when
> the virtual scan is enabled, partially enabled and almost completely
> disabled due to different constrains? I do not see any such an
> evaluation described in changelogs and I consider this to be a rather
> important information to judge the overall behavior.

It doesn't have (partially) enabled/disabled states nor does its
behavior change with different reclaim constraints. Having either
would make its design too complex to implement or benchmark.

There is a feedback loop connecting page table walks and rmap walks via
Bloom filters. The Bloom filters hold dense hot areas. Page table walks
test whether virtual areas are in the Bloom filters and scan those that
test positive. Anything they miss will be caught by rmap walks later
(shrink_page_list()). And when rmap walks find new dense hot areas, they
add those areas to the Bloom filters.

A dense hot area is one with many accessed pages belonging to the
reclaim domain; clearing the accessed bit in all PTEs within such an
area with one page table walk is more efficient, in terms of cacheline
utilization, than doing it one PTE at a time with many rmap walks.
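
For intuition, a minimal, self-contained Bloom filter in the spirit of
the feedback loop above; the hash scheme, sizes and names are made up
for illustration and are not MGLRU's actual implementation:

	#include <linux/bitops.h>
	#include <linux/hash.h>
	#include <linux/types.h>

	#define FILTER_BITS	15	/* 2^15 bits per filter */
	#define FILTER_MASK	((1UL << FILTER_BITS) - 1)

	static unsigned long filter[BITS_TO_LONGS(1UL << FILTER_BITS)];

	/* rmap walks add a dense hot area (keyed by, e.g., its PMD address) */
	static void filter_add(unsigned long key)
	{
		unsigned long hash = hash_long(key, BITS_PER_LONG);

		set_bit(hash & FILTER_MASK, filter);
		set_bit((hash >> FILTER_BITS) & FILTER_MASK, filter);
	}

	/* page table walks test whether an area is worth scanning */
	static bool filter_test(unsigned long key)
	{
		unsigned long hash = hash_long(key, BITS_PER_LONG);

		return test_bit(hash & FILTER_MASK, filter) &&
		       test_bit((hash >> FILTER_BITS) & FILTER_MASK, filter);
	}

A false positive only means an area is scanned unnecessarily; anything
the filters miss is still caught by the rmap walks, as described above.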

> > > Or is this considered statistically insignificant thus a theoretical
> > > problem?
> >
> > Yes. People who work on the maple tree and SPF at Google expressed the
> > same concern during the design review meeting (all stakeholders on the
> > mailing list were also invited). So we had a counter to monitor the
> > contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > And we also combined this patchset with the SPF patchset to see if the
> > latter makes any difference. Our conclusion was the contention is
> > statistically insignificant to the performance under memory pressure.
> >
> > This can be explained by how often we create a new generation. (We
> > only walk page tables when we create a new generation. And it's
> > similar to the low inactive condition for the active/inactive lru.)
> >
> > Usually we only do so every few seconds. We'd run into problems with
> > other parts of the kernel, e.g., lru lock contention, i/o congestion,
> > etc. if we create more than a few generation every second.
>
> This would be a very good information to have in changelogs. Ideally
> with some numbers and analysis.

Will do. Thanks.

2022-01-21 19:04:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Wed 19-01-22 00:04:10, Yu Zhao wrote:
> On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote:
> > On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > [...]
> > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > > +{
> > > > > + static const struct mm_walk_ops mm_walk_ops = {
> > > > > + .test_walk = should_skip_vma,
> > > > > + .p4d_entry = walk_pud_range,
> > > > > + };
> > > > > +
> > > > > + int err;
> > > > > +#ifdef CONFIG_MEMCG
> > > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > > +#endif
> > > > > +
> > > > > + walk->next_addr = FIRST_USER_ADDRESS;
> > > > > +
> > > > > + do {
> > > > > + unsigned long start = walk->next_addr;
> > > > > + unsigned long end = mm->highest_vm_end;
> > > > > +
> > > > > + err = -EBUSY;
> > > > > +
> > > > > + rcu_read_lock();
> > > > > +#ifdef CONFIG_MEMCG
> > > > > + if (memcg && atomic_read(&memcg->moving_account))
> > > > > + goto contended;
> > > > > +#endif
> > > > > + if (!mmap_read_trylock(mm))
> > > > > + goto contended;
> > > >
> > > > Have you evaluated the behavior under mmap_sem contention? I mean what
> > > > would be an effect of some mms being excluded from the walk? This path
> > > > is called from direct reclaim and we do allocate with exclusive mmap_sem
> > > > IIRC and the trylock can fail in a presence of pending writer if I am
> > > > not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> > > > can bypass the walk.
> > >
> > > You are right. Here it must be a trylock; otherwise it can deadlock.
> >
> > Yeah, this is clear.
> >
> > > I think there might be a misunderstanding: the aging doesn't
> > > exclusively rely on page table walks to gather the accessed bit. It
> > > prefers page table walks but it can also fallback to the rmap-based
> > > function, i.e., lru_gen_look_around(), which only gathers the accessed
> > > bit from at most 64 PTEs and therefore is less efficient. But it still
> > > retains about 80% of the performance gains.
> >
> > I have to say that I really have hard time to understand the runtime
> > behavior depending on that interaction. How does the reclaim behave when
> > the virtual scan is enabled, partially enabled and almost completely
> > disabled due to different constrains? I do not see any such an
> > evaluation described in changelogs and I consider this to be a rather
> > important information to judge the overall behavior.
>
> It doesn't have (partially) enabled/disabled states nor does its
> behavior change with different reclaim constraints. Having either
> would make its design too complex to implement or benchmark.

Let me clarify. By "partially enabled" I really meant behavior depending
on runtime conditions. Say mmap_sem cannot be locked for half of the scanned
tasks and/or the allocation for the mm walker fails due to lack of memory.
How is this going to affect reclaim efficiency? How does a user/admin
know that the memory reclaim is in a "degraded" mode because of the
contention?
--
Michal Hocko
SUSE Labs

2022-01-21 19:05:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Tue 18-01-22 23:31:07, Yu Zhao wrote:
> On Thu, Jan 13, 2022 at 01:02:26PM +0100, Michal Hocko wrote:
> > On Thu 13-01-22 02:43:38, Yu Zhao wrote:
> > [...]
> > > > > The bottom line is I can try various optimizations, e.g., preallocate
> > > > > a few buffers for a limited number of page walkers and if this number
> > > > > has been reached, fallback to the rmap-based function. But I have yet
> > > > > to see evidence that calls for additional complexity.
> > > >
> > > > I would disagree here. This is not an optimization. You should be
> > > > avoiding allocations from the memory reclaim because any allocation just
> > > > add a runtime behavior complexity and potential corner cases.
> > >
> > > Would __GFP_NOMEMALLOC address your concern? It prevents allocations
> > > from accessing the reserves even under PF_MEMALLOC.
> >
> > __GFP_NOMEMALLOC would deal with the complete memory depletion concern
> > for sure but I am not sure how any of these allocations would succeed
> > when called from the direct reclaim. Some access to memory reserves is
> > necessary if you insist on allocating from the reclaim process.
> >
> > You can have a look at the limited memory reserves access by oom victims
> > for an example of how this can be done.
>
> Thanks. I'll change GFP_KERNEL to __GFP_HIGH | __GFP_NOMEMALLOC.
> __GFP_HIGH allows some access to memory reserves and __GFP_NOMEMALLOC
> prevents the complete depletion. Basically, the combination lowers the
> min watermark by 1/2, and we have been using these flags for
> add_to_swap_cache().

Yes, this will prevent complete memory depletion. There are other
users of this portion of the memory reserves, so the reclaim might be out of
luck. How this turns out in practice remains to be seen, but it
certainly is an opportunity for corner cases and hard-to-test behavior.

--
Michal Hocko
SUSE Labs

2022-01-21 20:02:05

by Steven Barrett

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 11, 2022, at 2:41 AM, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > > Summery
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > > indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > > work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > > with similar effects.
> > >
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > > materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > > new features will likely be put to use for both personal computers
> > > and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > > maintained in the long term. It'd be more difficult if not
> > > impossible to achieve similar effects on top of the existing
> > > design.
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> Alexandre - the XanMod Kernel maintainer
> https://xanmod.org
>
> Brian - the Chrome OS kernel memory maintainer
> https://www.chromium.org
>
> Jan - the Arch Linux Zen kernel maintainer
> https://archlinux.org
>
> Steven - the Liquorix kernel maintainer
> https://liquorix.net
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
> https://chromium.googlesource.com/chromiumos/third_party/kernel
>
> Also my gratitude to those who have helped test MGLRU:
>
> Daniel - researcher at Michigan Tech
> benchmarked memcached
>
> Holger - who has been testing/patching/contributing to various
> subsystems since ~2008
>
> Shuang - researcher at University of Rochester
> benchmarked fio and provided a report
>
> Sofia - EDI https://www.edi.works
> benchmarked the top eight memory hogs and provided reports
>
> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
>
> Thanks!

This feature has been a huge improvement for desktop Linux; system
responsiveness has hit a new level under high memory pressure. Thanks Yu!

Acked-by: Steven Barrett <[email protected]>

2022-01-21 20:06:48

by Brian Geffon

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 11, 2022 at 3:41 AM Yu Zhao <[email protected]> wrote:
>
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > > Summery
> > > =======
> > > The facts are:
> > > 1. The independent lab results and the real-world applications
> > > indicate substantial improvements; there are no known regressions.
> > > 2. Thrashing prevention, working set estimation and proactive reclaim
> > > work out of the box; there are no equivalent solutions.
> > > 3. There is a lot of new code; nobody has demonstrated smaller changes
> > > with similar effects.
> > >
> > > Our options, accordingly, are:
> > > 1. Given the amount of evidence, the reported improvements will likely
> > > materialize for a wide range of workloads.
> > > 2. Gauging the interest from the past discussions [14][15][16], the
> > > new features will likely be put to use for both personal computers
> > > and data centers.
> > > 3. Based on Google's track record, the new code will likely be well
> > > maintained in the long term. It'd be more difficult if not
> > > impossible to achieve similar effects on top of the existing
> > > design.
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> Alexandre - the XanMod Kernel maintainer
> https://xanmod.org
>
> Brian - the Chrome OS kernel memory maintainer
> https://www.chromium.org

MGLRU has been maturing in ChromeOS for quite some time; we've
maintained it in a number of different kernels between 4.14 and 5.15,
and it has become the default for tens of millions of users. We've seen
substantial improvements in CPU utilization and memory pressure,
resulting in fewer OOM kills and reduced UI latency. I would love to
see this make it upstream so more desktop users can benefit.

Acked-by: Brian Geffon <[email protected]>


>
> Jan - the Arch Linux Zen kernel maintainer
> https://archlinux.org
>
> Steven - the Liquorix kernel maintainer
> https://liquorix.net
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
> https://chromium.googlesource.com/chromiumos/third_party/kernel
>
> Also my gratitude to those who have helped test MGLRU:
>
> Daniel - researcher at Michigan Tech
> benchmarked memcached
>
> Holger - who has been testing/patching/contributing to various
> subsystems since ~2008
>
> Shuang - researcher at University of Rochester
> benchmarked fio and provided a report
>
> Sofia - EDI https://www.edi.works
> benchmarked the top eight memory hogs and provided reports
>
> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
>
> Thanks!

2022-01-24 05:55:09

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <[email protected]> wrote:
>
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.
>
> Design objectives
> =================
> The design objectives are:
> 1. Better representation of access recency
> 2. Try to profit from spatial locality
> 3. Clear fast path making obvious choices
> 4. Simple self-correcting heuristics
>
> The representation of access recency is at the core of all LRU
> approximations. The multigenerational LRU (MGLRU) divides pages into
> multiple lists (generations), each having bounded access recency (a
> time interval). Generations establish a common frame of reference and
> help make better choices, e.g., between different memcgs on a computer
> or different computers in a data center (for cluster job scheduling).
>
> Exploiting spatial locality improves the efficiency when gathering the
> accessed bit. A rmap walk targets a single page and doesn't try to
> profit from discovering an accessed PTE. A page table walk can sweep
> all hotspots in an address space, but its search space can be too
> large to make a profit. The key is to optimize both methods and use
> them in combination. (PMU is another option for further exploration.)
>
> Fast path reduces code complexity and runtime overhead. Unmapped pages
> don't require TLB flushes; clean pages don't require writeback. These
> facts are only helpful when other conditions, e.g., access recency,
> are similar. With generations as a common frame of reference,
> additional factors stand out. But obvious choices might not be good
> choices; thus self-correction is required (the next objective).
>
> The benefits of simple self-correcting heuristics are self-evident.
> Again with generations as a common frame of reference, this becomes
> attainable. Specifically, pages in the same generation are categorized
> based on additional factors, and a closed-loop control statistically
> compares the refault percentages across all categories and throttles
> the eviction of those that have higher percentages.
>
> Patchset overview
> =================
> 1. mm: x86, arm64: add arch_has_hw_pte_young()
> 2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> Materializing hardware optimizations when trying to clear the accessed
> bit in many PTEs. If hardware automatically sets the accessed bit in
> PTEs, there is no need to worry about bursty page faults (emulating
> the accessed bit). If it also sets the accessed bit in non-leaf PMD
> entries, there is no need to search the PTE table pointed to by a PMD
> entry that doesn't have the accessed bit set.
>
> 3. mm/vmscan.c: refactor shrink_node()
> A minor refactor.
>
> 4. mm: multigenerational lru: groundwork
> Adding the basic data structure and the functions to initialize it and
> insert/remove pages.
>
> 5. mm: multigenerational lru: mm_struct list
> An infra keeps track of mm_struct's for page table walkers and
> provides them with optimizations, i.e., switch_mm() tracking and Bloom
> filters.
>
> 6. mm: multigenerational lru: aging
> 7. mm: multigenerational lru: eviction
> "The page reclaim" is a producer/consumer model. "The aging" produces
> cold pages, whereas "the eviction " consumes them. Cold pages flow
> through generations. The aging uses the mm_struct list infra to sweep
> dense hotspots in page tables. During a page table walk, the aging
> clears the accessed bit and tags accessed pages with the youngest
> generation number. The eviction sorts those pages when it encounters
> them. For pages in the oldest generation, eviction walks the rmap to
> check the accessed bit one more time before evicting them. During an
> rmap walk, the eviction feeds dense hotspots back to the aging. Dense
> hotspots flow through the Bloom filters. For pages not mapped in page
> tables, the eviction uses the PID controller to statistically
> determine whether they have higher refaults. If so, the eviction
> throttles their eviction by moving them to the next generation (the
> second oldest).
>
> 8. mm: multigenerational lru: user interface
> The knobs to turn on/off MGLRU and provide the userspace with
> thrashing prevention, working set estimation (the aging) and proactive
> reclaim (the eviction).
>
> 9. mm: multigenerational lru: Kconfig
> The Kconfig options.
>
> Benchmark results
> =================
> Independent lab results
> -----------------------
> Based on the popularity of searches [01] and the memory usage in
> Google's public cloud, the most popular open-source memory-hungry
> applications, in alphabetical order, are:
> Apache Cassandra Memcached
> Apache Hadoop MongoDB
> Apache Spark PostgreSQL
> MariaDB (MySQL) Redis
>
> An independent lab evaluated MGLRU with the most widely used benchmark
> suites for the above applications. They posted 960 data points along
> with kernel metrics and perf profiles collected over more than 500
> hours of total benchmark time. Their final reports show that, with 95%
> confidence intervals (CIs), the above applications all performed
> significantly better for at least part of their benchmark matrices.
>
> On 5.14:
> 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
> less wall time to sort three billion random integers, respectively,
> under the medium- and the high-concurrency conditions, when
> overcommitting memory. There were no statistically significant
> changes in wall time for the rest of the benchmark matrix.
> 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
> more transactions per minute (TPM), respectively, under the medium-
> and the high-concurrency conditions, when overcommitting memory.
> There were no statistically significant changes in TPM for the rest
> of the benchmark matrix.
> 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
> and [21.59, 30.02]% more operations per second (OPS), respectively,
> for sequential access, random access and Gaussian (distribution)
> access, when THP=always; 95% CIs [13.85, 15.97]% and
> [23.94, 29.92]% more OPS, respectively, for random access and
> Gaussian access, when THP=never. There were no statistically
> significant changes in OPS for the rest of the benchmark matrix.
> 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
> [2.16, 3.55]% more operations per second (OPS), respectively, for
> exponential (distribution) access, random access and Zipfian
> (distribution) access, when underutilizing memory; 95% CIs
> [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
> respectively, for exponential access, random access and Zipfian
> access, when overcommitting memory.
>
> On 5.15:
> 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
> and [4.11, 7.50]% more operations per second (OPS), respectively,
> for exponential (distribution) access, random access and Zipfian
> (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
> [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
> exponential access, random access and Zipfian access, when swap was
> on.
> 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
> less average wall time to finish twelve parallel TeraSort jobs,
> respectively, under the medium- and the high-concurrency
> conditions, when swap was on. There were no statistically
> significant changes in average wall time for the rest of the
> benchmark matrix.
> 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
> minute (TPM) under the high-concurrency condition, when swap was
> off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
> respectively, under the medium- and the high-concurrency
> conditions, when swap was on. There were no statistically
> significant changes in TPM for the rest of the benchmark matrix.
> 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
> [11.47, 19.36]% more total operations per second (OPS),
> respectively, for sequential access, random access and Gaussian
> (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
> [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
> for sequential access, random access and Gaussian access, when
> THP=never.
>
> Our lab results
> ---------------
> To supplement the above results, we ran the following benchmark suites
> on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
> are popular among MM developers, but we prefer large-scale A/B
> experiments to validate improvements.)
> fs_fio_bench_hdd_mq pft
> fs_lmbench pgsql-hammerdb
> fs_parallelio redis
> fs_postmark stream
> hackbench sysbenchthread
> kernbench tpcc_spark
> memcached unixbench
> multichase vm-scalability
> mutilate will-it-scale
> nginx
>
> [01] https://trends.google.com
> [02] https://lore.kernel.org/linux-mm/[email protected]/
> [03] https://lore.kernel.org/linux-mm/[email protected]/
> [04] https://lore.kernel.org/linux-mm/[email protected]/
> [05] https://lore.kernel.org/linux-mm/[email protected]/
> [06] https://lore.kernel.org/linux-mm/[email protected]/
> [07] https://lore.kernel.org/linux-mm/[email protected]/
> [08] https://lore.kernel.org/linux-mm/[email protected]/
> [09] https://lore.kernel.org/linux-mm/[email protected]/
> [10] https://lore.kernel.org/linux-mm/[email protected]/
>
> Read-world applications
> =======================
> Third-party testimonials
> ------------------------
> Konstantin wrote [11]:
> I have Archlinux with 8G RAM + zswap + swap. While developing, I
> have lots of apps opened such as multiple LSP-servers for different
> langs, chats, two browsers, etc... Usually, my system gets quickly
> to a point of SWAP-storms, where I have to kill LSP-servers,
> restart browsers to free memory, etc, otherwise the system lags
> heavily and is barely usable.
>
> 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
> patchset, and I started up by opening lots of apps to create memory
> pressure, and worked for a day like this. Till now I had *not a
> single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
> getting to the point of 3G in SWAP before without a single
> SWAP-storm.
>
> The Arch Linux Zen kernel [12] has been using MGLRU since 5.12. Many
> of its users reported their positive experiences to me, e.g., Shivodit
> wrote:
> I've tried the latest Zen kernel (5.14.13-zen1-1-zen in the
> archlinux testing repos), everything's been smooth so far. I also
> decided to copy a large volume of files to check performance under
> I/O load, and everything went smoothly - no stuttering was present,
> everything was responsive.
>
> Large-scale deployments
> -----------------------
> We've rolled out MGLRU to tens of millions of Chrome OS users and
> about a million Android users. Google's fleetwide profiling [13] shows
> an overall 40% decrease in kswapd CPU usage, in addition to

Hi Yu,

Was the overall 40% decrease in kswapd CPU usage seen on x86 or arm64?
And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
Does it help a lot in decreasing the CPU usage? If so, this might be
good proof that arm64 also needs this hardware feature.
In short, I am curious how much the improvement in this patchset depends
on hardware support for NONLEAF_PMD_YOUNG.

> improvements in other UX metrics, e.g., an 85% decrease in the number
> of low-memory kills at the 75th percentile and an 18% decrease in
> rendering latency at the 50th percentile.
>
> [11] https://lore.kernel.org/linux-mm/[email protected]/
> [12] https://github.com/zen-kernel/zen-kernel/
> [13] https://research.google/pubs/pub44271/
>

Thanks
Barry

2022-01-24 14:01:38

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Wed, Jan 19, 2022 at 10:42:47AM +0100, Michal Hocko wrote:
> On Wed 19-01-22 00:04:10, Yu Zhao wrote:
> > On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote:
> > > On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> > > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > > [...]
> > > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > > > +{
> > > > > > + static const struct mm_walk_ops mm_walk_ops = {
> > > > > > + .test_walk = should_skip_vma,
> > > > > > + .p4d_entry = walk_pud_range,
> > > > > > + };
> > > > > > +
> > > > > > + int err;
> > > > > > +#ifdef CONFIG_MEMCG
> > > > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > > > +#endif
> > > > > > +
> > > > > > + walk->next_addr = FIRST_USER_ADDRESS;
> > > > > > +
> > > > > > + do {
> > > > > > + unsigned long start = walk->next_addr;
> > > > > > + unsigned long end = mm->highest_vm_end;
> > > > > > +
> > > > > > + err = -EBUSY;
> > > > > > +
> > > > > > + rcu_read_lock();
> > > > > > +#ifdef CONFIG_MEMCG
> > > > > > + if (memcg && atomic_read(&memcg->moving_account))
> > > > > > + goto contended;
> > > > > > +#endif
> > > > > > + if (!mmap_read_trylock(mm))
> > > > > > + goto contended;
> > > > >
> > > > > Have you evaluated the behavior under mmap_sem contention? I mean what
> > > > > would be an effect of some mms being excluded from the walk? This path
> > > > > is called from direct reclaim and we do allocate with exclusive mmap_sem
> > > > > IIRC and the trylock can fail in a presence of pending writer if I am
> > > > > not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> > > > > can bypass the walk.
> > > >
> > > > You are right. Here it must be a trylock; otherwise it can deadlock.
> > >
> > > Yeah, this is clear.
> > >
> > > > I think there might be a misunderstanding: the aging doesn't
> > > > exclusively rely on page table walks to gather the accessed bit. It
> > > > prefers page table walks, but it can also fall back to the rmap-based
> > > > function, i.e., lru_gen_look_around(), which only gathers the accessed
> > > > bit from at most 64 PTEs and therefore is less efficient. But it still
> > > > retains about 80% of the performance gains.
> > >
> > > I have to say that I really have a hard time understanding the runtime
> > > behavior depending on that interaction. How does the reclaim behave when
> > > the virtual scan is enabled, partially enabled and almost completely
> > > disabled due to different constraints? I do not see any such evaluation
> > > described in the changelogs, and I consider this rather important
> > > information for judging the overall behavior.
> >
> > It doesn't have (partially) enabled/disabled states nor does its
> > behavior change with different reclaim constraints. Having either
> > would make its design too complex to implement or benchmark.
>
> Let me clarify. By "partially enabled" I really meant behavior depending
> on runtime conditions. Say mmap_sem cannot be locked for half of the scanned
> tasks and/or the allocation for the mm walker fails due to lack of memory.
> How is this going to affect reclaim efficiency?

Understood. This is not only possible -- it's the default for our ARM
hardware that doesn't support the accessed bit, i.e., CPUs that don't
automatically set the accessed bit.

In try_to_inc_max_seq(), we have:
/*
* If the hardware doesn't automatically set the accessed bit, fallback
* to lru_gen_look_around(), which only clears the accessed bit in a
* handful of PTEs. Spreading the work out over a period of time usually
* is less efficient, but it avoids bursty page faults.
*/
if the accessed bit is not supported
return

if alloc_mm_walk() fails
return

walk_mm()
if mmap_sem contended
return

scan page tables
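
To make the above concrete, here is a minimal C sketch of that fallback
order; get_next_mm() and free_mm_walk() are hypothetical placeholders for
the mm_struct list iteration and cleanup, the arch_has_hw_pte_young() call
is shorthand for the hardware check added earlier in the series, and error
handling is omitted, so this is not the exact code in the patch:

/* Illustrative only; the "hypothetical" names below are not kernel symbols. */
static bool try_to_walk_page_tables(struct lruvec *lruvec)
{
        struct lru_gen_mm_walk *walk;
        struct mm_struct *mm = NULL;

        /* No hardware-set accessed bit: rely solely on lru_gen_look_around(). */
        if (!arch_has_hw_pte_young())
                return false;

        /* Walk state allocation failed under memory pressure: fall back too. */
        walk = alloc_mm_walk();
        if (!walk)
                return false;

        /* walk_mm() itself bails out per mm when mmap_sem is contended. */
        while ((mm = get_next_mm(lruvec, mm)))  /* hypothetical iterator */
                walk_mm(lruvec, mm, walk);

        free_mm_walk(walk);                     /* hypothetical cleanup */
        return true;
}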

We have a microbenchmark that specifically measures this worst case
scenario by entirely disabling page table scanning. Its results showed
that this still retains more than 90% of the optimal performance. I'll
share this microbenchmark in another email when answering Barry's
questions regarding the accessed bit.

Our profiling infra also indirectly confirms this: it collects data
from real users running on hardware with and without the accessed
bit. Users running on hardware without the accessed bit indeed suffer
a small performance degradation, compared with users running on
hardware with it. But they still benefit almost as much, compared with
users running on the same hardware but without MGLRU.

> How does a user/admin
> know that the memory reclaim is in a "degraded" mode because of the
> contention?

As we previously discussed here:
https://lore.kernel.org/linux-mm/[email protected]/
there used to be a counter measuring the contention, and it was deemed
unnecessary and removed in v4. But I don't have a problem if we want
to revive it.

2022-01-24 14:02:06

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Thu, Jan 13, 2022 at 12:57:35PM +0100, Michal Hocko wrote:
> On Wed 12-01-22 16:43:15, Yu Zhao wrote:
> > On Wed, Jan 12, 2022 at 11:17:53AM +0100, Michal Hocko wrote:
> [...]
> > > Is there any reason you are not using folio_memcg_lock in the
> > > pte walk instead?
> >
> > We have a particular lruvec (the first arg), hence a particular memcg
> > to lock. But we don't have a particular page to lock.
>
> That is certainly true at this layer but the locking should be needed
> only for specific pages, no?

Yes.

> So you can move the lock down to the
> callback which examines respective pages. Or is there anything
> preventing that?

No.

> To be honest, and that is the reason I am asking, I really do not like
> to open code the migration synchronization outside of the memcg proper.

Agreed.

> Code paths which need a stable memcg are supposed to be using
> folio_memcg_lock for the specific examination time.

No argument here, just a clarification: when possible I prefer to
lock a batch of pages rather than individual ones.

> If you prefer a
> trylock approach for this usecase then we can add one.

Done. Thanks.
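
For illustration, a minimal sketch of the per-folio locking discussed
above, assuming the lock is taken inside the walk callback around the
window where a specific folio's memcg binding must stay stable;
update_folio_gen() is a hypothetical placeholder, not a symbol from the
patchset:

static void examine_folio(struct folio *folio)
{
        /* Keep the folio's memcg binding stable only while examining it. */
        folio_memcg_lock(folio);
        update_folio_gen(folio);        /* hypothetical: tag the generation */
        folio_memcg_unlock(folio);
}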

2022-01-24 19:22:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Sun 23-01-22 14:28:30, Yu Zhao wrote:
> On Wed, Jan 19, 2022 at 10:42:47AM +0100, Michal Hocko wrote:
> > On Wed 19-01-22 00:04:10, Yu Zhao wrote:
> > > On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote:
> > > > On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> > > > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > > > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > > > [...]
> > > > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > > > > +{
> > > > > > > + static const struct mm_walk_ops mm_walk_ops = {
> > > > > > > + .test_walk = should_skip_vma,
> > > > > > > + .p4d_entry = walk_pud_range,
> > > > > > > + };
> > > > > > > +
> > > > > > > + int err;
> > > > > > > +#ifdef CONFIG_MEMCG
> > > > > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > > > > +#endif
> > > > > > > +
> > > > > > > + walk->next_addr = FIRST_USER_ADDRESS;
> > > > > > > +
> > > > > > > + do {
> > > > > > > + unsigned long start = walk->next_addr;
> > > > > > > + unsigned long end = mm->highest_vm_end;
> > > > > > > +
> > > > > > > + err = -EBUSY;
> > > > > > > +
> > > > > > > + rcu_read_lock();
> > > > > > > +#ifdef CONFIG_MEMCG
> > > > > > > + if (memcg && atomic_read(&memcg->moving_account))
> > > > > > > + goto contended;
> > > > > > > +#endif
> > > > > > > + if (!mmap_read_trylock(mm))
> > > > > > > + goto contended;
> > > > > >
> > > > > > Have you evaluated the behavior under mmap_sem contention? I mean, what
> > > > > > would be the effect of some mms being excluded from the walk? This path
> > > > > > is called from direct reclaim, and we do allocate with exclusive mmap_sem
> > > > > > held IIRC; the trylock can fail in the presence of a pending writer if I am
> > > > > > not mistaken, so even a read lock holder (e.g., an allocation from the #PF)
> > > > > > can bypass the walk.
> > > > >
> > > > > You are right. Here it must be a trylock; otherwise it can deadlock.
> > > >
> > > > Yeah, this is clear.
> > > >
> > > > > I think there might be a misunderstanding: the aging doesn't
> > > > > exclusively rely on page table walks to gather the accessed bit. It
> > > > > prefers page table walks, but it can also fall back to the rmap-based
> > > > > function, i.e., lru_gen_look_around(), which only gathers the accessed
> > > > > bit from at most 64 PTEs and therefore is less efficient. But it still
> > > > > retains about 80% of the performance gains.
> > > >
> > > > I have to say that I really have a hard time understanding the runtime
> > > > behavior depending on that interaction. How does the reclaim behave when
> > > > the virtual scan is enabled, partially enabled and almost completely
> > > > disabled due to different constraints? I do not see any such evaluation
> > > > described in the changelogs, and I consider this rather important
> > > > information for judging the overall behavior.
> > >
> > > It doesn't have (partially) enabled/disabled states nor does its
> > > behavior change with different reclaim constraints. Having either
> > > would make its design too complex to implement or benchmark.
> >
> > Let me clarify. By "partially enabled" I really meant behavior depending
> > on runtime conditions. Say mmap_sem cannot be locked for half of the scanned
> > tasks and/or the allocation for the mm walker fails due to lack of memory.
> > How is this going to affect reclaim efficiency?
>
> Understood. This is not only possible -- it's the default for our ARM
> hardware that doesn't support the accessed bit, i.e., CPUs that don't
> automatically set the accessed bit.
>
> In try_to_inc_max_seq(), we have:
> /*
> * If the hardware doesn't automatically set the accessed bit, fallback
> * to lru_gen_look_around(), which only clears the accessed bit in a
> * handful of PTEs. Spreading the work out over a period of time usually
> * is less efficient, but it avoids bursty page faults.
> */
> if the accessed bit is not supported
> return
>
> if alloc_mm_walk() fails
> return
>
> walk_mm()
> if mmap_sem contended
> return
>
> scan page tables
>
> We have a microbenchmark that specifically measures this worst case
> scenario by entirely disabling page table scanning. Its results showed
> that this still retains more than 90% of the optimal performance. I'll
> share this microbenchmark in another email when answering Barry's
> questions regarding the accessed bit.
>
> Our profiling infra also indirectly confirms this: it collects data
> from real users running on hardware with and without the accessed
> bit. Users running on hardware without the accessed bit indeed suffer
> a small performance degradation, compared with users running on
> hardware with it. But they still benefit almost as much, compared with
> users running on the same hardware but without MGLRU.

This is definitely good information to have in the cover letter.

> > How does a user/admin
> > know that the memory reclaim is in a "degraded" mode because of the
> > contention?
>
> As we previously discussed here:
> https://lore.kernel.org/linux-mm/[email protected]/
> there used to be a counter measuring the contention, and it was deemed
> unnecessary and removed in v4. But I don't have a problem if we want
> to revive it.

Well, a counter might be rather tricky, but a few trace points would make
some sense to me.
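
Something along these lines, purely as an illustration (the event name and
fields are made up, and it would live in a trace events header with the
usual boilerplate):

TRACE_EVENT(lru_gen_walk_contended,
        TP_PROTO(struct mm_struct *mm),
        TP_ARGS(mm),
        TP_STRUCT__entry(
                __field(struct mm_struct *, mm)
        ),
        TP_fast_assign(
                __entry->mm = mm;
        ),
        TP_printk("mm=%p", __entry->mm)
);

/* e.g., call trace_lru_gen_walk_contended(mm) where the walk bails out. */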

--
Michal Hocko
SUSE Labs

2022-01-25 09:28:16

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote:
> On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <[email protected]> wrote:

<snipped>

> > Large-scale deployments
> > -----------------------
> > We've rolled out MGLRU to tens of millions of Chrome OS users and
> > about a million Android users. Google's fleetwide profiling [13] shows
> > an overall 40% decrease in kswapd CPU usage, in addition to
>
> Hi Yu,
>
> Was the overall 40% decrease in kswapd CPU usage seen on x86 or arm64?
> And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
> Does it help a lot in decreasing the CPU usage?

Hi Barry,

The fleet-wide profiling data I shared was from x86. For arm64, I only
have data from synthetic benchmarks at the moment, and it also shows
similar improvements.

For Chrome OS (individual users), walk_pte_range(), the function that
would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small
portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't
that helpful.

> If so, this might be
> a good proof that arm64 also needs this hardware feature?
> In short, I am curious how much the improvement in this patchset depends
> on the hardware ability of NONLEAF_PMD_YOUNG.

For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value.
In addition to cold/hot memory scanning, there are other use cases like
dirty tracking, which can benefit from the accessed bit on non-leaf
entries. I know some proprietary software uses this capability on x86
for different purposes than this patchset does. And AFAIK, x86 is the
only arch that supports this capability, e.g., risc-v and ppc can only
set the accessed bit in PTEs.

In fact, I've discussed this with one of the arm maintainers, Will. So
please check with him too if you are interested in moving forward with
the idea. I might be able to provide additional data if you need
it to make a decision.

Thanks.

2022-01-30 16:17:14

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Tue, Jan 25, 2022 at 7:48 PM Yu Zhao <[email protected]> wrote:
>
> On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote:
> > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <[email protected]> wrote:
>
> <snipped>
>
> > > Large-scale deployments
> > > -----------------------
> > > We've rolled out MGLRU to tens of millions of Chrome OS users and
> > > about a million Android users. Google's fleetwide profiling [13] shows
> > > an overall 40% decrease in kswapd CPU usage, in addition to
> >
> > Hi Yu,
> >
> > Was the overall 40% decrease in kswapd CPU usage seen on x86 or arm64?
> > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
> > Does it help a lot in decreasing the CPU usage?
>
> Hi Barry,
>
> The fleet-wide profiling data I shared was from x86. For arm64, I only
> have data from synthetic benchmarks at the moment, and it also shows
> similar improvements.
>
> For Chrome OS (individual users), walk_pte_range(), the function that
> would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small
> portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't
> that helpful.

Hi Yu,
Thanks!

In the current kernel, which depends on reverse mapping, the CPU usage of
kswapd can be very high while memory is under pressure, especially when a
lot of pages have a large mapcount and thus a huge reverse-mapping cost.

Regarding <4%, I guess the figure came from machines with NONLEAF_PMD_YOUNG?
In this case, we can skip many PTE scans when a PMD has no accessed bit
set. But on a machine without NONLEAF, will the CPU usage be much larger?

>
> > If so, this might be
> > a good proof that arm64 also needs this hardware feature?
> > In short, I am curious how much the improvement in this patchset depends
> > on the hardware ability of NONLEAF_PMD_YOUNG.
>
> For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value.
> In addition to cold/hot memory scanning, there are other use cases like
> dirty tracking, which can benefit from the accessed bit on non-leaf
> entries. I know some proprietary software uses this capability on x86
> for different purposes than this patchset does. And AFAIK, x86 is the
> only arch that supports this capability, e.g., risc-v and ppc can only
> set the accessed bit in PTEs.

Yep. NONLEAF is a nice feature.

btw, page tables should have a separate dirty bit, right? Wouldn't dirty-page
tracking depend on the dirty bit rather than the accessed bit? So does x86 also
have a non-leaf dirty bit? Or do they scan the accessed bit of the PMD before
scanning the dirty bits of the PTEs?

>
> In fact, I've discussed this with one of the arm maintainers, Will. So
> please check with him too if you are interested in moving forward with
> the idea. I might be able to provide additional data if you need
> it to make a decision.

I am interested in running it and getting some data without NONLEAF,
especially while free memory is very limited and the system is
thrashing.

>
> Thanks.

Thanks
Barry

2022-02-09 10:19:55

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Fri, Jan 28, 2022 at 09:54:09PM +1300, Barry Song wrote:
> On Tue, Jan 25, 2022 at 7:48 PM Yu Zhao <[email protected]> wrote:
> >
> > On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote:
> > > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <[email protected]> wrote:
> >
> > <snipped>
> >
> > > > Large-scale deployments
> > > > -----------------------
> > > > We've rolled out MGLRU to tens of millions of Chrome OS users and
> > > > about a million Android users. Google's fleetwide profiling [13] shows
> > > > an overall 40% decrease in kswapd CPU usage, in addition to
> > >
> > > Hi Yu,
> > >
> > > Was the overall 40% decrease in kswapd CPU usage seen on x86 or arm64?
> > > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
> > > Does it help a lot in decreasing the CPU usage?
> >
> > Hi Barry,
> >
> > The fleet-wide profiling data I shared was from x86. For arm64, I only
> > have data from synthetic benchmarks at the moment, and it also shows
> > similar improvements.
> >
> > For Chrome OS (individual users), walk_pte_range(), the function that
> > would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small
> > portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't
> > that helpful.
>
> Hi Yu,
> Thanks!
>
> In the current kernel, which depends on reverse mapping, the CPU usage of
> kswapd can be very high while memory is under pressure, especially when a
> lot of pages have a large mapcount and thus a huge reverse-mapping cost.

Agreed. I've posted v7 which includes kswapd profiles collected from an
arm64 v8.2 laptop under memory pressure.

> Regarding <4%, I guess the figure came from machines with NONLEAF_PMD_YOUNG?

No, it's from Snapdragon 7c. Please see the kswapd profiles in v7.

> In this case, we can skip many PTE scans when a PMD has no accessed bit
> set. But on a machine without NONLEAF, will the CPU usage be much larger?

So NONLEAF_PMD_YOUNG can save at most 4% of kswapd CPU usage. But this
can definitely vary, depending on the workload.

> > > If so, this might be
> > > a good proof that arm64 also needs this hardware feature?
> > > In short, I am curious how much the improvement in this patchset depends
> > > on the hardware ability of NONLEAF_PMD_YOUNG.
> >
> > For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value.
> > In addition to cold/hot memory scanning, there are other use cases like
> > dirty tracking, which can benefit from the accessed bit on non-leaf
> > entries. I know some proprietary software uses this capability on x86
> > for different purposes than this patchset does. And AFAIK, x86 is the
> > only arch that supports this capability, e.g., risc-v and ppc can only
> > set the accessed bit in PTEs.
>
> Yep. NONLEAF is a nice feature.
>
> btw, page tables should have a separate dirty bit, right?

Yes.

> Wouldn't dirty-page
> tracking depend on the dirty bit rather than the accessed bit?

It depends on the goal.

> So does x86 also
> have a non-leaf dirty bit?

No.

> Or do they scan the accessed bit of the PMD before
> scanning the dirty bits of the PTEs?

A mandatory sync to disk must use the dirty bit to ensure data
integrity. But a voluntary sync to disk can use the accessed bit to
narrow the search for dirty pages.

A mandatory sync is used to free specific dirty pages. A voluntary sync
is used to keep the number of dirty pages low in general and it doesn't
target any specific dirty pages.
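
As a minimal sketch of the voluntary case, assuming a non-leaf accessed
bit is available: skip any PTE table whose PMD entry hasn't been marked
accessed since the bit was last cleared, and only otherwise look at the
dirty bits of individual PTEs. queue_for_writeback() is a hypothetical
placeholder, so this is an illustration rather than kernel code:

static void scan_pmd_for_dirty(pmd_t *pmd, unsigned long addr, unsigned long end)
{
        pte_t *start_pte, *pte;

        /* PMD accessed bit clear: nothing below was touched, skip the table. */
        if (!pmd_young(*pmd))
                return;

        start_pte = pte = pte_offset_map(pmd, addr);
        for (; addr != end; pte++, addr += PAGE_SIZE) {
                if (pte_dirty(*pte))
                        queue_for_writeback(pte, addr); /* hypothetical */
        }
        pte_unmap(start_pte);
}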

> > In fact, I've discussed this with one of the arm maintainers, Will. So
> > please check with him too if you are interested in moving forward with
> > the idea. I might be able to provide additional data if you need
> > it to make a decision.
>
> I am interested in running it and getting some data without NONLEAF,
> especially while free memory is very limited and the system is
> thrashing.

The v7 has a switch to disable this feature on x86. If you can run your
workloads on x86, it might help you measure the difference.