LinuxLists.cc - [PATCH v7 00/12] Multigenerational LRU Framework

2022-02-09 08:52:54

Subject: [PATCH v7 00/12] Multigenerational LRU Framework

What's new
==========
1) Addressed all the comments received on the mailing list and in the
meeting with the stakeholders (will note on individual patches).
2) Measured the performance improvements for each patch between 5-8
(reported in the commit messages).

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and straightforward.

Patchset overview
=================
The design and implementation overview was moved to patch 12 so that
people can finish reading this cover letter.

1. mm: x86, arm64: add arch_has_hw_pte_young()
2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Using hardware optimizations when trying to clear the accessed bit in
many PTEs.

3. mm/vmscan.c: refactor shrink_node()
A minor refactor.

4. mm: multigenerational LRU: groundwork
Adding the basic data structure and the functions that insert/remove
pages to/from the multigenerational LRU (MGLRU) lists.

5. mm: multigenerational LRU: minimal implementation
A minimal (functional) implementation without any optimizations.

6. mm: multigenerational LRU: exploit locality in rmap
Improving the efficiency when using the rmap.

7. mm: multigenerational LRU: support page table walks
Adding the (optional) page table scanning.

8. mm: multigenerational LRU: optimize multiple memcgs
Optimizing the overall performance for multiple memcgs running mixed
types of workloads.

9. mm: multigenerational LRU: runtime switch
Adding a runtime switch to enable or disable MGLRU.

10. mm: multigenerational LRU: thrashing prevention
11. mm: multigenerational LRU: debugfs interface
Providing userspace with additional features like thrashing prevention,
working set estimation and proactive reclaim.

12. mm: multigenerational LRU: documentation
Adding a design doc and an admin guide.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
Apache Cassandra Memcached
Apache Hadoop MongoDB
Apache Spark PostgreSQL
MariaDB (MySQL) Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
less wall time to sort three billion random integers, respectively,
under the medium- and the high-concurrency conditions, when
overcommitting memory. There were no statistically significant
changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
more transactions per minute (TPM), respectively, under the medium-
and the high-concurrency conditions, when overcommitting memory.
There were no statistically significant changes in TPM for the rest
of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
and [21.59, 30.02]% more operations per second (OPS), respectively,
for sequential access, random access and Gaussian (distribution)
access, when THP=always; 95% CIs [13.85, 15.97]% and
[23.94, 29.92]% more OPS, respectively, for random access and
Gaussian access, when THP=never. There were no statistically
significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
[2.16, 3.55]% more operations per second (OPS), respectively, for
exponential (distribution) access, random access and Zipfian
(distribution) access, when underutilizing memory; 95% CIs
[8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
respectively, for exponential access, random access and Zipfian
access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
and [4.11, 7.50]% more operations per second (OPS), respectively,
for exponential (distribution) access, random access and Zipfian
(distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
[6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
exponential access, random access and Zipfian access, when swap was
on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
less average wall time to finish twelve parallel TeraSort jobs,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in average wall time for the rest of the
benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
minute (TPM) under the high-concurrency condition, when swap was
off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
[11.47, 19.36]% more total operations per second (OPS),
respectively, for sequential access, random access and Gaussian
(distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
[10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
for sequential access, random access and Gaussian access, when
THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
are popular among MM developers, but we prefer large-scale A/B
experiments to validate improvements.)
fs_fio_bench_hdd_mq pft
fs_lmbench pgsql-hammerdb
fs_parallelio redis
fs_postmark stream
hackbench sysbenchthread
kernbench tpcc_spark
memcached unixbench
multichase vm-scalability
mutilate will-it-scale
nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/lkml/[email protected]/
[03] https://lore.kernel.org/lkml/[email protected]/
[04] https://lore.kernel.org/lkml/[email protected]/
[05] https://lore.kernel.org/lkml/[email protected]/
[06] https://lore.kernel.org/lkml/[email protected]/
[07] https://lore.kernel.org/lkml/[email protected]/
[08] https://lore.kernel.org/lkml/[email protected]/
[09] https://lore.kernel.org/lkml/[email protected]/
[10] https://lore.kernel.org/lkml/[email protected]/

Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin wrote [11]:
I have Archlinux with 8G RAM + zswap + swap. While developing, I
have lots of apps opened such as multiple LSP-servers for different
langs, chats, two browsers, etc... Usually, my system gets quickly
to a point of SWAP-storms, where I have to kill LSP-servers,
restart browsers to free memory, etc, otherwise the system lags
heavily and is barely usable.

1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
patchset, and I started up by opening lots of apps to create memory
pressure, and worked for a day like this. Till now I had *not a
single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
getting to the point of 3G in SWAP before without a single
SWAP-storm.

An anonymous user wrote [12]:
Using that v5 for some time and confirm that difference under heavy
load and memory pressure is significant.

Shuang wrote [13]:
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random
access, Zipfian (distribution) access and Gaussian (distribution)
access, when the average number of jobs per CPU is 1; 95% CIs
[42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput,
respectively, for random access, Zipfian access and Gaussian access,
when the average number of jobs per CPU is 2.

Daniel wrote [14]:
With memcached allocating ~100GB of byte-addressable Optante,
performance improvement in terms of throughput (measured as queries
per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
The downstream kernels that have been using MGLRU include:
1. Android ARCVM [15]
2. Arch Linux Zen [16]
3. Chrome OS [17]
4. Liquorix [18]
5. post-factum [19]
6. XanMod [20]

We've rolled out MGLRU to tens of millions of Chrome OS users and
about a million Android users. Google's fleetwide profiling [21] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
rendering latency at the 50th percentile.

[11] https://lore.kernel.org/lkml/[email protected]/
[12] https://phoronix.com/forums/forum/software/general-linux-open-source/1301258-mglru-is-a-very-enticing-enhancement-for-linux-in-2022?p=1301275#post1301275
[13] https://lore.kernel.org/lkml/[email protected]/
[14] https://lore.kernel.org/linux-mm/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://chromium.googlesource.com/chromiumos/third_party/kernel
[16] https://archlinux.org
[17] https://chromium.org
[18] https://liquorix.net
[19] https://gitlab.com/post-factum/pf-kernel
[20] https://xanmod.org
[21] https://research.google/pubs/pub44271/

Summery
=======
The facts are:
1. The independent lab results and the real-world applications
indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
work out of the box; there are no equivalent solutions.
3. There is a lot of new code; nobody has demonstrated smaller changes
with similar effects.

Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
materialize for a wide range of workloads.
2. Gauging the interest from the past discussions [22][23][24], the
new features will likely be put to use for both personal computers
and data centers.
3. Based on Google's track record, the new code will likely be well
maintained in the long term. It'd be more difficult if not
impossible to achieve similar effects on top of the existing
design.

[22] https://lore.kernel.org/lkml/[email protected]/
[23] https://lore.kernel.org/lkml/[email protected]/
[24] https://lore.kernel.org/lkml/[email protected]/

Yu Zhao (12):
mm: x86, arm64: add arch_has_hw_pte_young()
mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
mm/vmscan.c: refactor shrink_node()
mm: multigenerational LRU: groundwork
mm: multigenerational LRU: minimal implementation
mm: multigenerational LRU: exploit locality in rmap
mm: multigenerational LRU: support page table walks
mm: multigenerational LRU: optimize multiple memcgs
mm: multigenerational LRU: runtime switch
mm: multigenerational LRU: thrashing prevention
mm: multigenerational LRU: debugfs interface
mm: multigenerational LRU: documentation

Documentation/admin-guide/mm/index.rst | 1 +
Documentation/admin-guide/mm/multigen_lru.rst | 121 +
Documentation/vm/index.rst | 1 +
Documentation/vm/multigen_lru.rst | 152 +
arch/Kconfig | 9 +
arch/arm64/include/asm/pgtable.h | 14 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 9 +-
arch/x86/mm/pgtable.c | 5 +-
fs/exec.c | 2 +
fs/fuse/dev.c | 3 +-
include/linux/cgroup.h | 15 +-
include/linux/memcontrol.h | 36 +
include/linux/mm.h | 8 +
include/linux/mm_inline.h | 214 ++
include/linux/mm_types.h | 78 +
include/linux/mmzone.h | 182 ++
include/linux/nodemask.h | 1 +
include/linux/page-flags-layout.h | 19 +-
include/linux/page-flags.h | 4 +-
include/linux/pgtable.h | 17 +-
include/linux/sched.h | 4 +
include/linux/swap.h | 5 +
kernel/bounds.c | 3 +
kernel/cgroup/cgroup-internal.h | 1 -
kernel/exit.c | 1 +
kernel/fork.c | 9 +
kernel/sched/core.c | 1 +
mm/Kconfig | 50 +
mm/huge_memory.c | 3 +-
mm/memcontrol.c | 27 +
mm/memory.c | 39 +-
mm/mm_init.c | 6 +-
mm/page_alloc.c | 1 +
mm/rmap.c | 7 +
mm/swap.c | 55 +-
mm/vmscan.c | 2831 ++++++++++++++++-
mm/workingset.c | 119 +-
38 files changed, 3908 insertions(+), 146 deletions(-)
create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
create mode 100644 Documentation/vm/multigen_lru.rst

--
2.35.0.263.gb82422642f-goog

2022-02-09 09:56:30

by Yu Zhao

[permalink] [raw]

Subject: [PATCH v7 07/12] mm: multigenerational LRU: support page table walks

To avoid confusions, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A runtime
switch will be added in the next patch to enable or disable this
feature. Without it, the aging relies on the rmap only.

NB: this feature has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmap them.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers (threads) iterate the same list, each
of them gets a unique mm_struct; therefore they can run concurrently.
Page table walkers ignore any misplaced pages, e.g., if an mm_struct
was migrated, pages it left in the previous memcg won't be promoted
when its current memcg is under reclaim. Similarly, page table walkers
won't promote pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1) It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2) It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3) It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4) It doesn't zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
Single workload:
fio (buffered I/O): no change

Single workload:
memcached (anon): +[5.5, 7.5]%
Ops/sec KB/sec
patch1-6: 1015292.83 39490.38
patch1-7: 1080856.82 42040.53

Configurations:
no change

Client benchmark results:
kswapd profiles:
patch1-6
45.49% lzo1x_1_do_compress (real work)
7.38% page_vma_mapped_walk
7.24% _raw_spin_unlock_irq
2.64% ptep_clear_flush
2.31% __zram_bvec_write
2.13% do_raw_spin_lock
2.09% lru_gen_look_around
1.89% free_unref_page_list
1.85% memmove
1.74% obj_malloc

patch1-7
47.73% lzo1x_1_do_compress (real work)
6.84% page_vma_mapped_walk
6.14% _raw_spin_unlock_irq
2.86% walk_pte_range
2.79% ptep_clear_flush
2.24% __zram_bvec_write
2.10% do_raw_spin_lock
1.94% free_unref_page_list
1.80% memmove
1.75% obj_malloc

Configurations:
no change

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
---
fs/exec.c | 2 +
include/linux/memcontrol.h | 5 +
include/linux/mm_types.h | 78 +++
include/linux/mmzone.h | 58 +++
include/linux/swap.h | 4 +
kernel/exit.c | 1 +
kernel/fork.c | 9 +
kernel/sched/core.c | 1 +
mm/memcontrol.c | 24 +
mm/vmscan.c | 963 ++++++++++++++++++++++++++++++++++++-
10 files changed, 1132 insertions(+), 13 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..7a69046e9fd8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1006,6 +1006,7 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
+ lru_gen_add_mm(mm);
/*
* This prevents preemption while active_mm is being loaded and
* it and mm are being updated, which could cause problems for
@@ -1016,6 +1017,7 @@ static int exec_mmap(struct mm_struct *mm)
if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
activate_mm(active_mm, mm);
+ lru_gen_use_mm(mm);
if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 51c9bc8e965d..2f0d8e912cfe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -342,6 +342,11 @@ struct mem_cgroup {
struct deferred_split deferred_split_queue;
#endif

+#ifdef CONFIG_LRU_GEN
+ /* per-memcg mm_struct list */
+ struct lru_gen_mm_list mm_list;
+#endif
+
struct mem_cgroup_per_node *nodeinfo[];
};

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5140e5feb486..8d2cdbbdd467 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -3,6 +3,7 @@
#define _LINUX_MM_TYPES_H

#include <linux/mm_types_task.h>
+#include <linux/sched.h>

#include <linux/auxvec.h>
#include <linux/kref.h>
@@ -17,6 +18,8 @@
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
#include <linux/seqlock.h>
+#include <linux/nodemask.h>
+#include <linux/mmdebug.h>

#include <asm/mmu.h>

@@ -634,6 +637,22 @@ struct mm_struct {
#ifdef CONFIG_IOMMU_SUPPORT
u32 pasid;
#endif
+#ifdef CONFIG_LRU_GEN
+ struct {
+ /* this mm_struct is on lru_gen_mm_list */
+ struct list_head list;
+#ifdef CONFIG_MEMCG
+ /* points to the memcg of "owner" above */
+ struct mem_cgroup *memcg;
+#endif
+ /*
+ * Set when switching to this mm_struct, as a hint of
+ * whether it has been used since the last time per-node
+ * page table walkers cleared the corresponding bits.
+ */
+ nodemask_t nodes;
+ } lru_gen;
+#endif /* CONFIG_LRU_GEN */
} __randomize_layout;

/*
@@ -660,6 +679,65 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return (struct cpumask *)&mm->cpu_bitmap;
}

+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+ /* mm_struct list for page table walkers */
+ struct list_head fifo;
+ /* protects the list above */
+ spinlock_t lock;
+};
+
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+ INIT_LIST_HEAD(&mm->lru_gen.list);
+#ifdef CONFIG_MEMCG
+ mm->lru_gen.memcg = NULL;
+#endif
+ nodes_clear(mm->lru_gen.nodes);
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+ /* unlikely but not a bug when racing with lru_gen_migrate_mm() */
+ VM_WARN_ON(list_empty(&mm->lru_gen.list));
+
+ if (!(current->flags & PF_KTHREAD) && !nodes_full(mm->lru_gen.nodes))
+ nodes_setall(mm->lru_gen.nodes);
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
struct mmu_gather;
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3d6ea30a2bdb..fa0a7a84ee58 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -386,6 +386,58 @@ struct lru_gen_struct {
bool enabled;
};

+enum {
+ MM_PTE_TOTAL, /* total leaf entries */
+ MM_PTE_OLD, /* old leaf entries */
+ MM_PTE_YOUNG, /* young leaf entries */
+ MM_PMD_TOTAL, /* total non-leaf entries */
+ MM_PMD_FOUND, /* non-leaf entries found in Bloom filters */
+ MM_PMD_ADDED, /* non-leaf entries added to Bloom filters */
+ NR_MM_STATS
+};
+
+/* mnemonic codes for the mm stats above */
+#define MM_STAT_CODES "toydfa"
+
+/* double-buffering Bloom filters */
+#define NR_BLOOM_FILTERS 2
+
+struct lru_gen_mm_state {
+ /* set to max_seq after each iteration */
+ unsigned long seq;
+ /* where the current iteration starts (inclusive) */
+ struct list_head *head;
+ /* where the last iteration ends (exclusive) */
+ struct list_head *tail;
+ /* to wait for the last page table walker to finish */
+ struct wait_queue_head wait;
+ /* Bloom filters flip after each iteration */
+ unsigned long *filters[NR_BLOOM_FILTERS];
+ /* the mm stats for debugging */
+ unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
+ /* the number of concurrent page table walkers */
+ int nr_walkers;
+};
+
+struct lru_gen_mm_walk {
+ /* the lruvec under reclaim */
+ struct lruvec *lruvec;
+ /* unstable max_seq from lru_gen_struct */
+ unsigned long max_seq;
+ /* the next address within an mm to scan */
+ unsigned long next_addr;
+ /* to batch page table entries */
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+ /* to batch promoted pages */
+ int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+ /* to batch the mm stats */
+ int mm_stats[NR_MM_STATS];
+ /* total batched items */
+ int batched;
+ bool can_swap;
+ bool full_scan;
+};
+
void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

@@ -436,6 +488,8 @@ struct lruvec {
#ifdef CONFIG_LRU_GEN
/* evictable pages divided into generations */
struct lru_gen_struct lrugen;
+ /* to concurrently iterate lru_gen_mm_list */
+ struct lru_gen_mm_state mm_state;
#endif
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
@@ -1028,6 +1082,10 @@ typedef struct pglist_data {

unsigned long flags;

+#ifdef CONFIG_LRU_GEN
+ /* kswap mm walk data */
+ struct lru_gen_mm_walk mm_walk;
+#endif
ZONE_PADDING(_pad2_)

/* Per-node vmstats */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b37520d3ff1d..04d84ac6d1ac 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -137,6 +137,10 @@ union swap_header {
*/
struct reclaim_state {
unsigned long reclaimed_slab;
+#ifdef CONFIG_LRU_GEN
+ /* per-thread mm walk data */
+ struct lru_gen_mm_walk *mm_walk;
+#endif
};

#ifdef __KERNEL__
diff --git a/kernel/exit.c b/kernel/exit.c
index b00a25bb4ab9..54d2ce4b93d1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -463,6 +463,7 @@ void mm_update_next_owner(struct mm_struct *mm)
goto retry;
}
WRITE_ONCE(mm->owner, c);
+ lru_gen_migrate_mm(mm);
task_unlock(c);
put_task_struct(c);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..8dcf6c37b918 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1079,6 +1079,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
goto fail_nocontext;

mm->user_ns = get_user_ns(user_ns);
+ lru_gen_init_mm(mm);
return mm;

fail_nocontext:
@@ -1121,6 +1122,7 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ lru_gen_del_mm(mm);
mmdrop(mm);
}

@@ -2576,6 +2578,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
get_task_struct(p);
}

+ if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+ /* lock the task to synchronize with memcg migration */
+ task_lock(p);
+ lru_gen_add_mm(p->mm);
+ task_unlock(p);
+ }
+
wake_up_new_task(p);

/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 848eaa0efe0e..e5fcfd4557ad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4970,6 +4970,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
+ lru_gen_use_mm(next->mm);

if (!prev->mm) { // from kernel
/* will mmdrop() in finish_task_switch(). */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74373df19d84..662e652f85ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6155,6 +6155,29 @@ static void mem_cgroup_move_task(void)
}
#endif

+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *task = NULL;
+
+ cgroup_taskset_for_each_leader(task, css, tset)
+ break;
+
+ if (!task)
+ return;
+
+ task_lock(task);
+ if (task->mm && task->mm->owner == task)
+ lru_gen_migrate_mm(task->mm);
+ task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
{
if (value == PAGE_COUNTER_MAX)
@@ -6500,6 +6523,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_reset = mem_cgroup_css_reset,
.css_rstat_flush = mem_cgroup_css_rstat_flush,
.can_attach = mem_cgroup_can_attach,
+ .attach = mem_cgroup_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 933d46ae2f68..5ab6cd332fcc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,8 @@
#include <linux/printk.h>
#include <linux/dax.h>
#include <linux/psi.h>
+#include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3135,6 +3137,371 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
get_nr_gens(lruvec, TYPE_ANON) <= MAX_NR_GENS;
}

+/******************************************************************************
+ * mm_struct list
+ ******************************************************************************/
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+ static struct lru_gen_mm_list mm_list = {
+ .fifo = LIST_HEAD_INIT(mm_list.fifo),
+ .lock = __SPIN_LOCK_UNLOCKED(mm_list.lock),
+ };
+
+#ifdef CONFIG_MEMCG
+ if (memcg)
+ return &memcg->mm_list;
+#endif
+ return &mm_list;
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+ int nid;
+ struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+ VM_BUG_ON_MM(!list_empty(&mm->lru_gen.list), mm);
+#ifdef CONFIG_MEMCG
+ VM_BUG_ON_MM(mm->lru_gen.memcg, mm);
+ mm->lru_gen.memcg = memcg;
+#endif
+ spin_lock(&mm_list->lock);
+
+ list_add_tail(&mm->lru_gen.list, &mm_list->fifo);
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ if (!lruvec)
+ continue;
+
+ if (lruvec->mm_state.tail == &mm_list->fifo)
+ lruvec->mm_state.tail = lruvec->mm_state.tail->prev;
+ }
+
+ spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+ int nid;
+ struct lru_gen_mm_list *mm_list;
+ struct mem_cgroup *memcg = NULL;
+
+ if (list_empty(&mm->lru_gen.list))
+ return;
+
+#ifdef CONFIG_MEMCG
+ memcg = mm->lru_gen.memcg;
+#endif
+ mm_list = get_mm_list(memcg);
+
+ spin_lock(&mm_list->lock);
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+ if (!lruvec)
+ continue;
+
+ if (lruvec->mm_state.tail == &mm->lru_gen.list)
+ lruvec->mm_state.tail = lruvec->mm_state.tail->next;
+
+ if (lruvec->mm_state.head != &mm->lru_gen.list)
+ continue;
+
+ lruvec->mm_state.head = lruvec->mm_state.head->next;
+ if (lruvec->mm_state.head == &mm_list->fifo)
+ WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1);
+ }
+
+ list_del_init(&mm->lru_gen.list);
+
+ spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+ mem_cgroup_put(mm->lru_gen.memcg);
+ mm->lru_gen.memcg = NULL;
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+ struct mem_cgroup *memcg;
+
+ lockdep_assert_held(&mm->owner->alloc_lock);
+
+ /* for mm_update_next_owner() */
+ if (mem_cgroup_disabled())
+ return;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(mm->owner);
+ rcu_read_unlock();
+ if (memcg == mm->lru_gen.memcg)
+ return;
+
+ VM_BUG_ON_MM(!mm->lru_gen.memcg, mm);
+ VM_BUG_ON_MM(list_empty(&mm->lru_gen.list), mm);
+
+ lru_gen_del_mm(mm);
+ lru_gen_add_mm(mm);
+}
+#endif
+
+/*
+ * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when
+ * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of
+ * bits in a bitmap, k is the number of hash functions and n is the number of
+ * inserted items.
+ *
+ * Page table walkers use one of the two filters to reduce their search space.
+ * To get rid of non-leaf entries that no longer have enough leaf entries, the
+ * aging uses the double-buffering technique to flip to the other filter each
+ * time it produces a new generation. For non-leaf entries that have enough
+ * leaf entries, the aging carries them over to the next generation in
+ * walk_pmd_range(); the eviction also report them when walking the rmap
+ * in lru_gen_look_around().
+ *
+ * For future optimizations:
+ * 1) It's not necessary to keep both filters all the time. The spare one can be
+ * freed after the RCU grace period and reallocated if needed again.
+ * 2) And when reallocating, it's worth scaling its size according to the number
+ * of inserted entries in the other filter, to reduce the memory overhead on
+ * small systems and false positives on large systems.
+ * 3) Jenkins' hash function is an alternative to Knuth's.
+ */
+#define BLOOM_FILTER_SHIFT 15
+
+static inline int filter_gen_from_seq(unsigned long seq)
+{
+ return seq % NR_BLOOM_FILTERS;
+}
+
+static void get_item_key(void *item, int *key)
+{
+ u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2);
+
+ BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32));
+
+ key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1);
+ key[1] = hash >> BLOOM_FILTER_SHIFT;
+}
+
+static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq)
+{
+ unsigned long *filter;
+ int gen = filter_gen_from_seq(seq);
+
+ lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+ filter = lruvec->mm_state.filters[gen];
+ if (filter) {
+ bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT));
+ return;
+ }
+
+ filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT), GFP_ATOMIC);
+ WRITE_ONCE(lruvec->mm_state.filters[gen], filter);
+}
+
+static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+ int key[2];
+ unsigned long *filter;
+ int gen = filter_gen_from_seq(seq);
+
+ filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+ if (!filter)
+ return;
+
+ get_item_key(item, key);
+
+ if (!test_bit(key[0], filter))
+ set_bit(key[0], filter);
+ if (!test_bit(key[1], filter))
+ set_bit(key[1], filter);
+}
+
+static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+ int key[2];
+ unsigned long *filter;
+ int gen = filter_gen_from_seq(seq);
+
+ filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+ if (!filter)
+ return true;
+
+ get_item_key(item, key);
+
+ return test_bit(key[0], filter) && test_bit(key[1], filter);
+}
+
+static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last)
+{
+ int i;
+ int hist;
+
+ lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+ if (walk) {
+ hist = lru_hist_from_seq(walk->max_seq);
+
+ for (i = 0; i < NR_MM_STATS; i++) {
+ WRITE_ONCE(lruvec->mm_state.stats[hist][i],
+ lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]);
+ walk->mm_stats[i] = 0;
+ }
+ }
+
+ if (NR_HIST_GENS > 1 && last) {
+ hist = lru_hist_from_seq(lruvec->mm_state.seq + 1);
+
+ for (i = 0; i < NR_MM_STATS; i++)
+ WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0);
+ }
+}
+
+static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+ int type;
+ unsigned long size = 0;
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+
+ if (!walk->full_scan && cpumask_empty(mm_cpumask(mm)) &&
+ !node_isset(pgdat->node_id, mm->lru_gen.nodes))
+ return true;
+
+ node_clear(pgdat->node_id, mm->lru_gen.nodes);
+
+ for (type = !walk->can_swap; type < ANON_AND_FILE; type++) {
+ size += type ? get_mm_counter(mm, MM_FILEPAGES) :
+ get_mm_counter(mm, MM_ANONPAGES) +
+ get_mm_counter(mm, MM_SHMEMPAGES);
+ }
+
+ if (size < MIN_LRU_BATCH)
+ return true;
+
+ if (mm_is_oom_victim(mm))
+ return true;
+
+ return !mmget_not_zero(mm);
+}
+
+static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
+ struct mm_struct **iter)
+{
+ bool first = false;
+ bool last = true;
+ struct mm_struct *mm = NULL;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+ struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+ /*
+ * There are four interesting cases for this page table walker:
+ * 1) It tries to start a new iteration of mm_list with a stale max_seq;
+ * there is nothing to be done.
+ * 2) It's the first of the current generation, and it needs to reset
+ * the Bloom filter for the next generation.
+ * 3) It reaches the end of mm_list, and it needs to increment
+ * mm_state->seq; the iteration is done.
+ * 4) It's the last of the current generation, and it needs to reset the
+ * mm stats counters for the next generation.
+ */
+ if (*iter)
+ mmput_async(*iter);
+ else if (walk->max_seq <= READ_ONCE(mm_state->seq))
+ return false;
+
+ spin_lock(&mm_list->lock);
+
+ VM_BUG_ON(walk->max_seq > mm_state->seq + 1);
+ VM_BUG_ON(*iter && walk->max_seq < mm_state->seq);
+ VM_BUG_ON(*iter && !mm_state->nr_walkers);
+
+ if (walk->max_seq <= mm_state->seq) {
+ if (!*iter)
+ last = false;
+ goto done;
+ }
+
+ if (mm_state->head == &mm_list->fifo) {
+ VM_BUG_ON(mm_state->nr_walkers);
+ mm_state->head = mm_state->head->next;
+ first = true;
+ }
+
+ while (!mm && mm_state->head != &mm_list->fifo) {
+ mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
+
+ mm_state->head = mm_state->head->next;
+
+ /* full scan for those added after the last iteration */
+ if (mm_state->tail == &mm->lru_gen.list) {
+ mm_state->tail = mm_state->tail->next;
+ walk->full_scan = true;
+ }
+
+ if (should_skip_mm(mm, walk))
+ mm = NULL;
+ }
+
+ if (mm_state->head == &mm_list->fifo)
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+done:
+ if (*iter && !mm)
+ mm_state->nr_walkers--;
+ if (!*iter && mm)
+ mm_state->nr_walkers++;
+
+ if (mm_state->nr_walkers)
+ last = false;
+
+ if (mm && first)
+ reset_bloom_filter(lruvec, walk->max_seq + 1);
+
+ if (*iter || last)
+ reset_mm_stats(lruvec, walk, last);
+
+ spin_unlock(&mm_list->lock);
+
+ *iter = mm;
+
+ return last;
+}
+
+static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq)
+{
+ bool success = false;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+ struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+ if (max_seq <= READ_ONCE(mm_state->seq))
+ return false;
+
+ spin_lock(&mm_list->lock);
+
+ VM_BUG_ON(max_seq > mm_state->seq + 1);
+
+ if (max_seq > mm_state->seq && !mm_state->nr_walkers) {
+ VM_BUG_ON(mm_state->head != &mm_list->fifo);
+
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+ reset_mm_stats(lruvec, NULL, true);
+ success = true;
+ }
+
+ spin_unlock(&mm_list->lock);
+
+ return success;
+}
+
/******************************************************************************
* refault feedback loop
******************************************************************************/
@@ -3288,6 +3655,465 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
return new_gen;
}

+static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
+ int old_gen, int new_gen)
+{
+ int type = folio_is_file_lru(folio);
+ int zone = folio_zonenum(folio);
+ int delta = folio_nr_pages(folio);
+
+ VM_BUG_ON(old_gen >= MAX_NR_GENS);
+ VM_BUG_ON(new_gen >= MAX_NR_GENS);
+
+ walk->batched++;
+
+ walk->nr_pages[old_gen][type][zone] -= delta;
+ walk->nr_pages[new_gen][type][zone] += delta;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
+{
+ int gen, type, zone;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ walk->batched = 0;
+
+ for_each_gen_type_zone(gen, type, zone) {
+ enum lru_list lru = type * LRU_INACTIVE_FILE;
+ int delta = walk->nr_pages[gen][type][zone];
+
+ if (!delta)
+ continue;
+
+ walk->nr_pages[gen][type][zone] = 0;
+ WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+ lrugen->nr_pages[gen][type][zone] + delta);
+
+ if (lru_gen_is_active(lruvec, gen))
+ lru += LRU_ACTIVE;
+ lru_gen_update_size(lruvec, lru, zone, delta);
+ }
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+ struct address_space *mapping;
+ struct vm_area_struct *vma = walk->vma;
+ struct lru_gen_mm_walk *priv = walk->private;
+
+ if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) ||
+ (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ)) ||
+ vma == get_gate_vma(vma->vm_mm))
+ return true;
+
+ if (vma_is_anonymous(vma))
+ return !priv->can_swap;
+
+ if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
+ return true;
+
+ mapping = vma->vm_file->f_mapping;
+ if (mapping_unevictable(mapping))
+ return true;
+
+ /* check readpage to exclude special mappings like dax, etc. */
+ return shmem_mapping(mapping) ? !priv->can_swap : !mapping->a_ops->readpage;
+}
+
+/*
+ * Some userspace memory allocators map many single-page VMAs. Instead of
+ * returning back to the PGD table for each of such VMAs, finish an entire PMD
+ * table to reduce zigzags and improve cache performance.
+ */
+static bool get_next_vma(struct mm_walk *walk, unsigned long mask, unsigned long size,
+ unsigned long *start, unsigned long *end)
+{
+ unsigned long next = round_up(*end, size);
+
+ VM_BUG_ON(mask & size);
+ VM_BUG_ON(*start >= *end);
+ VM_BUG_ON((next & mask) != (*start & mask));
+
+ while (walk->vma) {
+ if (next >= walk->vma->vm_end) {
+ walk->vma = walk->vma->vm_next;
+ continue;
+ }
+
+ if ((next & mask) != (walk->vma->vm_start & mask))
+ return false;
+
+ if (should_skip_vma(walk->vma->vm_start, walk->vma->vm_end, walk)) {
+ walk->vma = walk->vma->vm_next;
+ continue;
+ }
+
+ *start = max(next, walk->vma->vm_start);
+ next = (next | ~mask) + 1;
+ /* rounded-up boundaries can wrap to 0 */
+ *end = next && next < walk->vma->vm_end ? next : walk->vma->vm_end;
+
+ return true;
+ }
+
+ return false;
+}
+
+static bool suitable_to_scan(int total, int young)
+{
+ int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
+
+ /* suitable if the average number of young PTEs per cacheline is >=1 */
+ return young * n >= total;
+}
+
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ int i;
+ pte_t *pte;
+ spinlock_t *ptl;
+ unsigned long addr;
+ int total = 0;
+ int young = 0;
+ struct lru_gen_mm_walk *priv = walk->private;
+ struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+ int old_gen, new_gen = lru_gen_from_seq(priv->max_seq);
+
+ VM_BUG_ON(pmd_leaf(*pmd));
+
+ pte = pte_offset_map_lock(walk->mm, pmd, start & PMD_MASK, &ptl);
+ arch_enter_lazy_mmu_mode();
+restart:
+ for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ struct folio *folio;
+ unsigned long pfn = pte_pfn(pte[i]);
+
+ VM_BUG_ON(addr < walk->vma->vm_start || addr >= walk->vma->vm_end);
+
+ total++;
+ priv->mm_stats[MM_PTE_TOTAL]++;
+
+ if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+ continue;
+
+ if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+ continue;
+
+ if (!pte_young(pte[i])) {
+ priv->mm_stats[MM_PTE_OLD]++;
+ continue;
+ }
+
+ VM_BUG_ON(!pfn_valid(pfn));
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ continue;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ continue;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ continue;
+
+ if (!ptep_test_and_clear_young(walk->vma, addr, pte + i))
+ continue;
+
+ young++;
+ priv->mm_stats[MM_PTE_YOUNG]++;
+
+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ folio_mark_dirty(folio);
+
+ old_gen = folio_update_gen(folio, new_gen);
+ if (old_gen >= 0 && old_gen != new_gen)
+ update_batch_size(priv, folio, old_gen, new_gen);
+ }
+
+ if (i < PTRS_PER_PTE && get_next_vma(walk, PMD_MASK, PAGE_SIZE, &start, &end))
+ goto restart;
+
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte, ptl);
+
+ return suitable_to_scan(total, young);
+}
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+ struct mm_walk *walk, unsigned long *start)
+{
+ int i;
+ pmd_t *pmd;
+ spinlock_t *ptl;
+ struct lru_gen_mm_walk *priv = walk->private;
+ struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+ int old_gen, new_gen = lru_gen_from_seq(priv->max_seq);
+
+ VM_BUG_ON(pud_leaf(*pud));
+
+ /* try to batch at most 1+MIN_LRU_BATCH+1 entries */
+ if (*start == -1) {
+ *start = next;
+ return;
+ }
+
+ i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start);
+ if (i && i <= MIN_LRU_BATCH) {
+ __set_bit(i - 1, priv->bitmap);
+ return;
+ }
+
+ pmd = pmd_offset(pud, *start);
+ ptl = pmd_lock(walk->mm, pmd);
+ arch_enter_lazy_mmu_mode();
+
+ do {
+ struct folio *folio;
+ unsigned long pfn = pmd_pfn(pmd[i]);
+ unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start;
+
+ VM_BUG_ON(addr < vma->vm_start || addr >= vma->vm_end);
+
+ if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i]))
+ goto next;
+
+ if (WARN_ON_ONCE(pmd_devmap(pmd[i])))
+ goto next;
+
+ if (!pmd_trans_huge(pmd[i])) {
+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+ pmdp_test_and_clear_young(vma, addr, pmd + i);
+ goto next;
+ }
+
+ VM_BUG_ON(!pfn_valid(pfn));
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ goto next;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ goto next;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ goto next;
+
+ if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+ goto next;
+
+ priv->mm_stats[MM_PTE_YOUNG]++;
+
+ if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ folio_mark_dirty(folio);
+
+ old_gen = folio_update_gen(folio, new_gen);
+ if (old_gen >= 0 && old_gen != new_gen)
+ update_batch_size(priv, folio, old_gen, new_gen);
+next:
+ i = i > MIN_LRU_BATCH ? 0 :
+ find_next_bit(priv->bitmap, MIN_LRU_BATCH, i) + 1;
+ } while (i <= MIN_LRU_BATCH);
+
+ arch_leave_lazy_mmu_mode();
+ spin_unlock(ptl);
+
+ *start = -1;
+ bitmap_zero(priv->bitmap, MIN_LRU_BATCH);
+}
+#else
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+ struct mm_walk *walk, unsigned long *start)
+{
+}
+#endif
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ int i;
+ pmd_t *pmd;
+ unsigned long next;
+ unsigned long addr;
+ struct vm_area_struct *vma;
+ unsigned long pos = -1;
+ struct lru_gen_mm_walk *priv = walk->private;
+
+ VM_BUG_ON(pud_leaf(*pud));
+
+ /*
+ * Finish an entire PMD in two passes: the first only reaches to PTE
+ * tables to avoid taking the PMD lock; the second, if necessary, takes
+ * the PMD lock to clear the accessed bit in PMD entries.
+ */
+ pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+ /* walk_pte_range() may call get_next_vma() */
+ vma = walk->vma;
+ for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+ pmd_t val = pmd_read_atomic(pmd + i);
+
+ /* for pmd_read_atomic() */
+ barrier();
+
+ next = pmd_addr_end(addr, end);
+
+ if (!pmd_present(val)) {
+ priv->mm_stats[MM_PTE_TOTAL]++;
+ continue;
+ }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (pmd_trans_huge(val)) {
+ unsigned long pfn = pmd_pfn(val);
+ struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+
+ priv->mm_stats[MM_PTE_TOTAL]++;
+
+ if (is_huge_zero_pmd(val))
+ continue;
+
+ if (!pmd_young(val)) {
+ priv->mm_stats[MM_PTE_OLD]++;
+ continue;
+ }
+
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ continue;
+
+ walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+ continue;
+ }
+#endif
+ priv->mm_stats[MM_PMD_TOTAL]++;
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+ if (!pmd_young(val))
+ continue;
+
+ walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+#endif
+ if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
+ continue;
+
+ priv->mm_stats[MM_PMD_FOUND]++;
+
+ if (!walk_pte_range(&val, addr, next, walk))
+ continue;
+
+ priv->mm_stats[MM_PMD_ADDED]++;
+
+ /* carry over to the next generation */
+ update_bloom_filter(priv->lruvec, priv->max_seq + 1, pmd + i);
+ }
+
+ walk_pmd_range_locked(pud, -1, vma, walk, &pos);
+
+ if (i < PTRS_PER_PMD && get_next_vma(walk, PUD_MASK, PMD_SIZE, &start, &end))
+ goto restart;
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ int i;
+ pud_t *pud;
+ unsigned long addr;
+ unsigned long next;
+ struct lru_gen_mm_walk *priv = walk->private;
+
+ VM_BUG_ON(p4d_leaf(*p4d));
+
+ pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+ for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+ pud_t val = READ_ONCE(pud[i]);
+
+ next = pud_addr_end(addr, end);
+
+ if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+ continue;
+
+ walk_pmd_range(&val, addr, next, walk);
+
+ if (priv->batched >= MAX_LRU_BATCH) {
+ end = (addr | ~PUD_MASK) + 1;
+ goto done;
+ }
+ }
+
+ if (i < PTRS_PER_PUD && get_next_vma(walk, P4D_MASK, PUD_SIZE, &start, &end))
+ goto restart;
+
+ end = round_up(end, P4D_SIZE);
+done:
+ /* rounded-up boundaries can wrap to 0 */
+ priv->next_addr = end && walk->vma ? max(end, walk->vma->vm_start) : 0;
+
+ return -EAGAIN;
+}
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+ static const struct mm_walk_ops mm_walk_ops = {
+ .test_walk = should_skip_vma,
+ .p4d_entry = walk_pud_range,
+ };
+
+ int err;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+ walk->next_addr = FIRST_USER_ADDRESS;
+
+ do {
+ err = -EBUSY;
+
+ /* folio_update_gen() requires stable folio_memcg() */
+ if (!mem_cgroup_trylock_pages(memcg))
+ break;
+
+ /* the caller might be holding the lock for write */
+ if (mmap_read_trylock(mm)) {
+ unsigned long start = walk->next_addr;
+ unsigned long end = mm->highest_vm_end;
+
+ err = walk_page_range(mm, start, end, &mm_walk_ops, walk);
+
+ mmap_read_unlock(mm);
+
+ if (walk->batched) {
+ spin_lock_irq(&lruvec->lru_lock);
+ reset_batch_size(lruvec, walk);
+ spin_unlock_irq(&lruvec->lru_lock);
+ }
+ }
+
+ mem_cgroup_unlock_pages();
+
+ cond_resched();
+ } while (err == -EAGAIN && walk->next_addr && !mm_is_oom_victim(mm));
+}
+
+static struct lru_gen_mm_walk *alloc_mm_walk(void)
+{
+ if (current->reclaim_state && current->reclaim_state->mm_walk)
+ return current->reclaim_state->mm_walk;
+
+ return kzalloc(sizeof(struct lru_gen_mm_walk),
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+}
+
+static void free_mm_walk(struct lru_gen_mm_walk *walk)
+{
+ if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+ kfree(walk);
+}
+
static void inc_min_seq(struct lruvec *lruvec)
{
int type;
@@ -3346,7 +4172,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
return success;
}

-static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+static void inc_max_seq(struct lruvec *lruvec)
{
int prev, next;
int type, zone;
@@ -3356,9 +4182,6 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)

VM_BUG_ON(!seq_is_valid(lruvec));

- if (max_seq != lrugen->max_seq)
- goto unlock;
-
inc_min_seq(lruvec);

/* update the active/inactive LRU sizes for compatibility */
@@ -3385,10 +4208,72 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
WRITE_ONCE(lrugen->timestamps[next], jiffies);
/* make sure preceding modifications appear */
smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
-unlock:
+
spin_unlock_irq(&lruvec->lru_lock);
}

+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+ struct scan_control *sc, bool can_swap, bool full_scan)
+{
+ bool success;
+ struct lru_gen_mm_walk *walk;
+ struct mm_struct *mm = NULL;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+ VM_BUG_ON(max_seq > READ_ONCE(lrugen->max_seq));
+
+ /*
+ * If the hardware doesn't automatically set the accessed bit, fallback
+ * to lru_gen_look_around(), which only clears the accessed bit in a
+ * handful of PTEs. Spreading the work out over a period of time usually
+ * is less efficient, but it avoids bursty page faults.
+ */
+ if (!full_scan && !arch_has_hw_pte_young()) {
+ success = iterate_mm_list_nowalk(lruvec, max_seq);
+ goto done;
+ }
+
+ walk = alloc_mm_walk();
+ if (!walk) {
+ success = iterate_mm_list_nowalk(lruvec, max_seq);
+ goto done;
+ }
+
+ walk->lruvec = lruvec;
+ walk->max_seq = max_seq;
+ walk->can_swap = can_swap;
+ walk->full_scan = full_scan;
+
+ do {
+ success = iterate_mm_list(lruvec, walk, &mm);
+ if (mm)
+ walk_mm(lruvec, mm, walk);
+
+ cond_resched();
+ } while (mm);
+
+ free_mm_walk(walk);
+done:
+ if (!success) {
+ if (!current_is_kswapd() && !sc->priority)
+ wait_event_killable(lruvec->mm_state.wait,
+ max_seq < READ_ONCE(lrugen->max_seq));
+
+ return max_seq < READ_ONCE(lrugen->max_seq);
+ }
+
+ VM_BUG_ON(max_seq != READ_ONCE(lrugen->max_seq));
+
+ inc_max_seq(lruvec);
+ /* either this sees any waiters or they will see updated max_seq */
+ if (wq_has_sleeper(&lruvec->mm_state.wait))
+ wake_up_all(&lruvec->mm_state.wait);
+
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+ return true;
+}
+
static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
unsigned long *min_seq, bool can_swap, bool *need_aging)
{
@@ -3449,7 +4334,7 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
nr_to_scan++;

if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
- inc_max_seq(lruvec, max_seq);
+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
}

static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -3458,6 +4343,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)

VM_BUG_ON(!current_is_kswapd());

+ current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -3466,11 +4353,16 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)

cond_resched();
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ current->reclaim_state->mm_walk = NULL;
}

/*
* This function exploits spatial locality when shrink_page_list() walks the
* rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ * If the scan was done cacheline efficiently, it adds the PMD entry pointing
+ * to the PTE table to the Bloom filter. This process is a feedback loop from
+ * the eviction to the aging.
*/
void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
@@ -3480,6 +4372,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
unsigned long end;
unsigned long addr;
struct folio *folio;
+ struct lru_gen_mm_walk *walk;
+ int young = 0;
unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
struct mem_cgroup *memcg = page_memcg(pvmw->page);
struct pglist_data *pgdat = page_pgdat(pvmw->page);
@@ -3537,6 +4431,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
continue;

+ young++;
+
if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
!(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
!folio_test_swapcache(folio)))
@@ -3552,7 +4448,13 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
arch_leave_lazy_mmu_mode();
rcu_read_unlock();

- if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+ /* feedback from rmap walkers to page table walkers */
+ if (suitable_to_scan(i, young))
+ update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+
+ if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
folio = page_folio(pte_page(pte[i]));
folio_activate(folio);
@@ -3564,8 +4466,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
if (!mem_cgroup_trylock_pages(memcg))
return;

- spin_lock_irq(&lruvec->lru_lock);
- new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+ if (!walk) {
+ spin_lock_irq(&lruvec->lru_lock);
+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+ }

for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
folio = page_folio(pte_page(pte[i]));
@@ -3576,10 +4480,14 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
if (old_gen < 0 || old_gen == new_gen)
continue;

- lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
+ if (walk)
+ update_batch_size(walk, folio, old_gen, new_gen);
+ else
+ lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
}

- spin_unlock_irq(&lruvec->lru_lock);
+ if (!walk)
+ spin_unlock_irq(&lruvec->lru_lock);

mem_cgroup_unlock_pages();
}
@@ -3846,6 +4754,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
struct folio *folio;
enum vm_event_item item;
struct reclaim_stat stat;
+ struct lru_gen_mm_walk *walk;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);

@@ -3884,6 +4793,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap

move_pages_to_lru(lruvec, &list);

+ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+ if (walk && walk->batched)
+ reset_batch_size(lruvec, walk);
+
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
if (!cgroup_reclaim(sc))
__count_vm_events(item, reclaimed);
@@ -3938,9 +4851,10 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
return 0;
}

- inc_max_seq(lruvec, max_seq);
+ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false))
+ return nr_to_scan;

- return nr_to_scan;
+ return max_seq >= min_seq[TYPE_FILE] + MIN_NR_GENS ? nr_to_scan : 0;
}

static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -3948,9 +4862,13 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
struct blk_plug plug;
long scanned = 0;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);

lru_add_drain();

+ if (current_is_kswapd())
+ current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
blk_start_plug(&plug);

while (true) {
@@ -3981,6 +4899,9 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
}

blk_finish_plug(&plug);
+
+ if (current_is_kswapd())
+ current->reclaim_state->mm_walk = NULL;
}

/******************************************************************************
@@ -3992,6 +4913,7 @@ void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)
int i;
int gen, type, zone;
struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);

lrugen->max_seq = MIN_NR_GENS + 1;
lrugen->enabled = lru_gen_enabled();
@@ -4001,6 +4923,11 @@ void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)

for_each_gen_type_zone(gen, type, zone)
INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+
+ lruvec->mm_state.seq = MIN_NR_GENS;
+ lruvec->mm_state.head = &mm_list->fifo;
+ lruvec->mm_state.tail = &mm_list->fifo;
+ init_waitqueue_head(&lruvec->mm_state.wait);
}

#ifdef CONFIG_MEMCG
@@ -4008,6 +4935,9 @@ void lru_gen_init_memcg(struct mem_cgroup *memcg)
{
int nid;

+ INIT_LIST_HEAD(&memcg->mm_list.fifo);
+ spin_lock_init(&memcg->mm_list.lock);
+
for_each_node(nid) {
struct lruvec *lruvec = get_lruvec(memcg, nid);

@@ -4020,10 +4950,16 @@ void lru_gen_free_memcg(struct mem_cgroup *memcg)
int nid;

for_each_node(nid) {
+ int i;
struct lruvec *lruvec = get_lruvec(memcg, nid);

VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
sizeof(lruvec->lrugen.nr_pages)));
+
+ for (i = 0; i < NR_BLOOM_FILTERS; i++) {
+ bitmap_free(lruvec->mm_state.filters[i]);
+ lruvec->mm_state.filters[i] = NULL;
+ }
}
}
#endif
@@ -4032,6 +4968,7 @@ static int __init init_lru_gen(void)
{
BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+ BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);

return 0;
};
--
2.35.0.263.gb82422642f-goog

2022-02-09 09:56:58

by Oleksandr Natalenko

[permalink] [raw]

Subject: Re: [PATCH v7 00/12] Multigenerational LRU Framework

Hello.

On ?ter? 8. ?nora 2022 9:18:50 CET Yu Zhao wrote:
> What's new
> ==========
> 1) Addressed all the comments received on the mailing list and in the
> meeting with the stakeholders (will note on individual patches).
> 2) Measured the performance improvements for each patch between 5-8
> (reported in the commit messages).
>
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and straightforward.
>
> Patchset overview
> =================
> The design and implementation overview was moved to patch 12 so that
> people can finish reading this cover letter.
>
> 1. mm: x86, arm64: add arch_has_hw_pte_young()
> 2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> Using hardware optimizations when trying to clear the accessed bit in
> many PTEs.
>
> 3. mm/vmscan.c: refactor shrink_node()
> A minor refactor.
>
> 4. mm: multigenerational LRU: groundwork
> Adding the basic data structure and the functions that insert/remove
> pages to/from the multigenerational LRU (MGLRU) lists.
>
> 5. mm: multigenerational LRU: minimal implementation
> A minimal (functional) implementation without any optimizations.
>
> 6. mm: multigenerational LRU: exploit locality in rmap
> Improving the efficiency when using the rmap.
>
> 7. mm: multigenerational LRU: support page table walks
> Adding the (optional) page table scanning.
>
> 8. mm: multigenerational LRU: optimize multiple memcgs
> Optimizing the overall performance for multiple memcgs running mixed
> types of workloads.
>
> 9. mm: multigenerational LRU: runtime switch
> Adding a runtime switch to enable or disable MGLRU.
>
> 10. mm: multigenerational LRU: thrashing prevention
> 11. mm: multigenerational LRU: debugfs interface
> Providing userspace with additional features like thrashing prevention,
> working set estimation and proactive reclaim.
>
> 12. mm: multigenerational LRU: documentation
> Adding a design doc and an admin guide.
>
> Benchmark results
> =================
> Independent lab results
> -----------------------
> Based on the popularity of searches [01] and the memory usage in
> Google's public cloud, the most popular open-source memory-hungry
> applications, in alphabetical order, are:
> Apache Cassandra Memcached
> Apache Hadoop MongoDB
> Apache Spark PostgreSQL
> MariaDB (MySQL) Redis
>
> An independent lab evaluated MGLRU with the most widely used benchmark
> suites for the above applications. They posted 960 data points along
> with kernel metrics and perf profiles collected over more than 500
> hours of total benchmark time. Their final reports show that, with 95%
> confidence intervals (CIs), the above applications all performed
> significantly better for at least part of their benchmark matrices.
>
> On 5.14:
> 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
> less wall time to sort three billion random integers, respectively,
> under the medium- and the high-concurrency conditions, when
> overcommitting memory. There were no statistically significant
> changes in wall time for the rest of the benchmark matrix.
> 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
> more transactions per minute (TPM), respectively, under the medium-
> and the high-concurrency conditions, when overcommitting memory.
> There were no statistically significant changes in TPM for the rest
> of the benchmark matrix.
> 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
> and [21.59, 30.02]% more operations per second (OPS), respectively,
> for sequential access, random access and Gaussian (distribution)
> access, when THP=always; 95% CIs [13.85, 15.97]% and
> [23.94, 29.92]% more OPS, respectively, for random access and
> Gaussian access, when THP=never. There were no statistically
> significant changes in OPS for the rest of the benchmark matrix.
> 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
> [2.16, 3.55]% more operations per second (OPS), respectively, for
> exponential (distribution) access, random access and Zipfian
> (distribution) access, when underutilizing memory; 95% CIs
> [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
> respectively, for exponential access, random access and Zipfian
> access, when overcommitting memory.
>
> On 5.15:
> 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
> and [4.11, 7.50]% more operations per second (OPS), respectively,
> for exponential (distribution) access, random access and Zipfian
> (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
> [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
> exponential access, random access and Zipfian access, when swap was
> on.
> 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
> less average wall time to finish twelve parallel TeraSort jobs,
> respectively, under the medium- and the high-concurrency
> conditions, when swap was on. There were no statistically
> significant changes in average wall time for the rest of the
> benchmark matrix.
> 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
> minute (TPM) under the high-concurrency condition, when swap was
> off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
> respectively, under the medium- and the high-concurrency
> conditions, when swap was on. There were no statistically
> significant changes in TPM for the rest of the benchmark matrix.
> 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
> [11.47, 19.36]% more total operations per second (OPS),
> respectively, for sequential access, random access and Gaussian
> (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
> [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
> for sequential access, random access and Gaussian access, when
> THP=never.
>
> Our lab results
> ---------------
> To supplement the above results, we ran the following benchmark suites
> on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
> are popular among MM developers, but we prefer large-scale A/B
> experiments to validate improvements.)
> fs_fio_bench_hdd_mq pft
> fs_lmbench pgsql-hammerdb
> fs_parallelio redis
> fs_postmark stream
> hackbench sysbenchthread
> kernbench tpcc_spark
> memcached unixbench
> multichase vm-scalability
> mutilate will-it-scale
> nginx
>
> [01] https://trends.google.com
> [02] https://lore.kernel.org/lkml/[email protected]/
> [03] https://lore.kernel.org/lkml/[email protected]/
> [04] https://lore.kernel.org/lkml/[email protected]/
> [05] https://lore.kernel.org/lkml/[email protected]/
> [06] https://lore.kernel.org/lkml/[email protected]/
> [07] https://lore.kernel.org/lkml/[email protected]/
> [08] https://lore.kernel.org/lkml/[email protected]/
> [09] https://lore.kernel.org/lkml/[email protected]/
> [10] https://lore.kernel.org/lkml/[email protected]/
>
> Read-world applications
> =======================
> Third-party testimonials
> ------------------------
> Konstantin wrote [11]:
> I have Archlinux with 8G RAM + zswap + swap. While developing, I
> have lots of apps opened such as multiple LSP-servers for different
> langs, chats, two browsers, etc... Usually, my system gets quickly
> to a point of SWAP-storms, where I have to kill LSP-servers,
> restart browsers to free memory, etc, otherwise the system lags
> heavily and is barely usable.
>
> 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
> patchset, and I started up by opening lots of apps to create memory
> pressure, and worked for a day like this. Till now I had *not a
> single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
> getting to the point of 3G in SWAP before without a single
> SWAP-storm.
>
> An anonymous user wrote [12]:
> Using that v5 for some time and confirm that difference under heavy
> load and memory pressure is significant.
>
> Shuang wrote [13]:
> With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
> and [9.26, 10.36]% higher throughput, respectively, for random
> access, Zipfian (distribution) access and Gaussian (distribution)
> access, when the average number of jobs per CPU is 1; 95% CIs
> [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput,
> respectively, for random access, Zipfian access and Gaussian access,
> when the average number of jobs per CPU is 2.
>
> Daniel wrote [14]:
> With memcached allocating ~100GB of byte-addressable Optante,
> performance improvement in terms of throughput (measured as queries
> per second) was about 10% for a series of workloads.
>
> Large-scale deployments
> -----------------------
> The downstream kernels that have been using MGLRU include:
> 1. Android ARCVM [15]
> 2. Arch Linux Zen [16]
> 3. Chrome OS [17]
> 4. Liquorix [18]
> 5. post-factum [19]
> 6. XanMod [20]
>
> We've rolled out MGLRU to tens of millions of Chrome OS users and
> about a million Android users. Google's fleetwide profiling [21] shows
> an overall 40% decrease in kswapd CPU usage, in addition to
> improvements in other UX metrics, e.g., an 85% decrease in the number
> of low-memory kills at the 75th percentile and an 18% decrease in
> rendering latency at the 50th percentile.
>
> [11] https://lore.kernel.org/lkml/[email protected]/
> [12] https://phoronix.com/forums/forum/software/general-linux-open-source/1301258-mglru-is-a-very-enticing-enhancement-for-linux-in-2022?p=1301275#post1301275
> [13] https://lore.kernel.org/lkml/[email protected]/
> [14] https://lore.kernel.org/linux-mm/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
> [15] https://chromium.googlesource.com/chromiumos/third_party/kernel
> [16] https://archlinux.org
> [17] https://chromium.org
> [18] https://liquorix.net
> [19] https://gitlab.com/post-factum/pf-kernel
> [20] https://xanmod.org
> [21] https://research.google/pubs/pub44271/
>
> Summery
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
> indicate substantial improvements; there are no known regressions.
> 2. Thrashing prevention, working set estimation and proactive reclaim
> work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
> with similar effects.
>
> Our options, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
> materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [22][23][24], the
> new features will likely be put to use for both personal computers
> and data centers.
> 3. Based on Google's track record, the new code will likely be well
> maintained in the long term. It'd be more difficult if not
> impossible to achieve similar effects on top of the existing
> design.
>
> [22] https://lore.kernel.org/lkml/[email protected]/
> [23] https://lore.kernel.org/lkml/[email protected]/
> [24] https://lore.kernel.org/lkml/[email protected]/
>
> Yu Zhao (12):
> mm: x86, arm64: add arch_has_hw_pte_young()
> mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> mm/vmscan.c: refactor shrink_node()
> mm: multigenerational LRU: groundwork
> mm: multigenerational LRU: minimal implementation
> mm: multigenerational LRU: exploit locality in rmap
> mm: multigenerational LRU: support page table walks
> mm: multigenerational LRU: optimize multiple memcgs
> mm: multigenerational LRU: runtime switch
> mm: multigenerational LRU: thrashing prevention
> mm: multigenerational LRU: debugfs interface
> mm: multigenerational LRU: documentation
>
> Documentation/admin-guide/mm/index.rst | 1 +
> Documentation/admin-guide/mm/multigen_lru.rst | 121 +
> Documentation/vm/index.rst | 1 +
> Documentation/vm/multigen_lru.rst | 152 +
> arch/Kconfig | 9 +
> arch/arm64/include/asm/pgtable.h | 14 +-
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 9 +-
> arch/x86/mm/pgtable.c | 5 +-
> fs/exec.c | 2 +
> fs/fuse/dev.c | 3 +-
> include/linux/cgroup.h | 15 +-
> include/linux/memcontrol.h | 36 +
> include/linux/mm.h | 8 +
> include/linux/mm_inline.h | 214 ++
> include/linux/mm_types.h | 78 +
> include/linux/mmzone.h | 182 ++
> include/linux/nodemask.h | 1 +
> include/linux/page-flags-layout.h | 19 +-
> include/linux/page-flags.h | 4 +-
> include/linux/pgtable.h | 17 +-
> include/linux/sched.h | 4 +
> include/linux/swap.h | 5 +
> kernel/bounds.c | 3 +
> kernel/cgroup/cgroup-internal.h | 1 -
> kernel/exit.c | 1 +
> kernel/fork.c | 9 +
> kernel/sched/core.c | 1 +
> mm/Kconfig | 50 +
> mm/huge_memory.c | 3 +-
> mm/memcontrol.c | 27 +
> mm/memory.c | 39 +-
> mm/mm_init.c | 6 +-
> mm/page_alloc.c | 1 +
> mm/rmap.c | 7 +
> mm/swap.c | 55 +-
> mm/vmscan.c | 2831 ++++++++++++++++-
> mm/workingset.c | 119 +-
> 38 files changed, 3908 insertions(+), 146 deletions(-)
> create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
> create mode 100644 Documentation/vm/multigen_lru.rst

Thanks for the new spin.

Is the patch submission broken for everyone, or for me only? I see raw emails cluttered with some garbage like =2D, and hence I cannot apply those neither from my email client nor from lore.

Probably, you've got a git repo where things can be pulled from so that we do not depend on mailing systems and/or tools breaking plaintext?

Thanks.

--
Oleksandr Natalenko (post-factum)

2022-02-09 11:14:47

by Yu Zhao

[permalink] [raw]

Subject: [PATCH v7 06/12] mm: multigenerational LRU: exploit locality in rmap

Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs aren't cache friendly to the rmap. For workloads mostly
using mapped pages, the rmap has a high CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.

Server benchmark results:
Single workload:
fio (buffered I/O): no change

Single workload:
memcached (anon): +[3.5, 5.5]%
Ops/sec KB/sec
patch1-5: 972526.07 37826.95
patch1-6: 1015292.83 39490.38

Configurations:
no change

Client benchmark results:
kswapd profiles:
patch1-5
39.73% lzo1x_1_do_compress (real work)
14.96% page_vma_mapped_walk
6.97% _raw_spin_unlock_irq
3.07% do_raw_spin_lock
2.53% anon_vma_interval_tree_iter_first
2.04% ptep_clear_flush
1.82% __zram_bvec_write
1.76% __anon_vma_interval_tree_subtree_search
1.57% memmove
1.45% free_unref_page_list

patch1-6
45.49% lzo1x_1_do_compress (real work)
7.38% page_vma_mapped_walk
7.24% _raw_spin_unlock_irq
2.64% ptep_clear_flush
2.31% __zram_bvec_write
2.13% do_raw_spin_lock
2.09% lru_gen_look_around
1.89% free_unref_page_list
1.85% memmove
1.74% obj_malloc

Configurations:
no change

Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
---
include/linux/memcontrol.h | 31 ++++++++
include/linux/mm.h | 5 ++
include/linux/mmzone.h | 6 ++
include/linux/swap.h | 1 +
mm/memcontrol.c | 1 +
mm/rmap.c | 7 ++
mm/swap.c | 4 +-
mm/vmscan.c | 155 +++++++++++++++++++++++++++++++++++++
8 files changed, 208 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b72d75141e12..51c9bc8e965d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -436,6 +436,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
+ * - mem_cgroup_trylock_pages()
*
* For a kmem folio a caller should hold an rcu read lock to protect memcg
* associated with a kmem folio from being released.
@@ -497,6 +498,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
+ * - mem_cgroup_trylock_pages()
*
* For a kmem page a caller should hold an rcu read lock to protect memcg
* associated with a kmem page from being released.
@@ -934,6 +936,23 @@ void unlock_page_memcg(struct page *page);

void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);

+/* try to stablize folio_memcg() for all the pages in a memcg */
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+ rcu_read_lock();
+
+ if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
+ return true;
+
+ rcu_read_unlock();
+ return false;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+ rcu_read_unlock();
+}
+
/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
int idx, int val)
@@ -1371,6 +1390,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
{
}

+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+ /* to match folio_memcg_rcu() */
+ rcu_read_lock();
+ return true;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+ rcu_read_unlock();
+}
+
static inline void mem_cgroup_handle_over_high(void)
{
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b4b9886ba277..7d70b42b67e1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1588,6 +1588,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
return page_to_pfn(&folio->page);
}

+static inline struct folio *pfn_folio(unsigned long pfn)
+{
+ return page_folio(pfn_to_page(pfn));
+}
+
/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
#ifdef CONFIG_MIGRATION
static inline bool is_pinnable_page(struct page *page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3870dd9246a2..3d6ea30a2bdb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -304,6 +304,7 @@ enum lruvec_flags {
};

struct lruvec;
+struct page_vma_mapped_walk;

#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
@@ -386,6 +387,7 @@ struct lru_gen_struct {
};

void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

#ifdef CONFIG_MEMCG
void lru_gen_init_memcg(struct mem_cgroup *memcg);
@@ -398,6 +400,10 @@ static inline void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *l
{
}

+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
#ifdef CONFIG_MEMCG
static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
{
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1d38d9475c4d..b37520d3ff1d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -372,6 +372,7 @@ extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_cpu_zone(struct zone *zone);
extern void lru_add_drain_all(void);
+extern void folio_activate(struct folio *folio);
extern void deactivate_file_page(struct page *page);
extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cabb5085531b..74373df19d84 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2744,6 +2744,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
+ * - mem_cgroup_trylock_pages()
*/
folio->memcg_data = (unsigned long)memcg;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..112e77dc62f4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@
#include <linux/page_idle.h>
#include <linux/memremap.h>
#include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>

#include <asm/tlbflush.h>

@@ -819,6 +820,12 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
}

if (pvmw.pte) {
+ if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
+ !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+ lru_gen_look_around(&pvmw);
+ referenced++;
+ }
+
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte)) {
/*
diff --git a/mm/swap.c b/mm/swap.c
index f5c0bcac8dcd..e65e7520bebf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -344,7 +344,7 @@ static bool need_activate_page_drain(int cpu)
return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
}

-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
{
if (folio_test_lru(folio) && !folio_test_active(folio) &&
!folio_test_unevictable(folio)) {
@@ -364,7 +364,7 @@ static inline void activate_page_drain(int cpu)
{
}

-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
{
struct lruvec *lruvec;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5f0d92838712..933d46ae2f68 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1556,6 +1556,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
if (!sc->may_unmap && page_mapped(page))
goto keep_locked;

+ /* folio_update_gen() tried to promote this page? */
+ if (lru_gen_enabled() && !ignore_references &&
+ page_mapped(page) && PageReferenced(page))
+ goto keep_locked;
+
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

@@ -3227,6 +3232,31 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
* the aging
******************************************************************************/

+static int folio_update_gen(struct folio *folio, int gen)
+{
+ unsigned long old_flags, new_flags;
+
+ VM_BUG_ON(gen >= MAX_NR_GENS);
+ VM_BUG_ON(!rcu_read_lock_held());
+
+ do {
+ new_flags = old_flags = READ_ONCE(folio->flags);
+
+ /* for shrink_page_list() */
+ if (!(new_flags & LRU_GEN_MASK)) {
+ new_flags |= BIT(PG_referenced);
+ continue;
+ }
+
+ new_flags &= ~LRU_GEN_MASK;
+ new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+ new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
+ } while (new_flags != old_flags &&
+ cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+ return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
{
unsigned long old_flags, new_flags;
@@ -3239,6 +3269,10 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);

new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+ /* folio_update_gen() has promoted this page? */
+ if (new_gen >= 0 && new_gen != old_gen)
+ return new_gen;
+
new_gen = (old_gen + 1) % MAX_NR_GENS;

new_flags &= ~LRU_GEN_MASK;
@@ -3434,6 +3468,122 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
}

+/*
+ * This function exploits spatial locality when shrink_page_list() walks the
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ */
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+ int i;
+ pte_t *pte;
+ unsigned long start;
+ unsigned long end;
+ unsigned long addr;
+ struct folio *folio;
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
+ struct mem_cgroup *memcg = page_memcg(pvmw->page);
+ struct pglist_data *pgdat = page_pgdat(pvmw->page);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ DEFINE_MAX_SEQ(lruvec);
+ int old_gen, new_gen = lru_gen_from_seq(max_seq);
+
+ lockdep_assert_held(pvmw->ptl);
+ VM_BUG_ON_PAGE(PageLRU(pvmw->page), pvmw->page);
+
+ start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+ end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+
+ if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
+ if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
+ end = start + MIN_LRU_BATCH * PAGE_SIZE;
+ else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
+ start = end - MIN_LRU_BATCH * PAGE_SIZE;
+ else {
+ start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
+ end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
+ }
+ }
+
+ pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+
+ rcu_read_lock();
+ arch_enter_lazy_mmu_mode();
+
+ for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ unsigned long pfn = pte_pfn(pte[i]);
+
+ VM_BUG_ON(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
+
+ if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+ continue;
+
+ if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+ continue;
+
+ if (!pte_young(pte[i]))
+ continue;
+
+ VM_BUG_ON(!pfn_valid(pfn));
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ continue;
+
+ folio = pfn_folio(pfn);
+ if (folio_nid(folio) != pgdat->node_id)
+ continue;
+
+ if (folio_memcg_rcu(folio) != memcg)
+ continue;
+
+ if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
+ continue;
+
+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio)))
+ folio_mark_dirty(folio);
+
+ old_gen = folio_lru_gen(folio);
+ if (old_gen < 0)
+ folio_set_referenced(folio);
+ else if (old_gen != new_gen)
+ __set_bit(i, bitmap);
+ }
+
+ arch_leave_lazy_mmu_mode();
+ rcu_read_unlock();
+
+ if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+ folio = page_folio(pte_page(pte[i]));
+ folio_activate(folio);
+ }
+ return;
+ }
+
+ /* folio_update_gen() requires stable folio_memcg() */
+ if (!mem_cgroup_trylock_pages(memcg))
+ return;
+
+ spin_lock_irq(&lruvec->lru_lock);
+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+
+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+ folio = page_folio(pte_page(pte[i]));
+ if (folio_memcg_rcu(folio) != memcg)
+ continue;
+
+ old_gen = folio_update_gen(folio, new_gen);
+ if (old_gen < 0 || old_gen == new_gen)
+ continue;
+
+ lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
+ }
+
+ spin_unlock_irq(&lruvec->lru_lock);
+
+ mem_cgroup_unlock_pages();
+}
+
/******************************************************************************
* the eviction
******************************************************************************/
@@ -3467,6 +3617,11 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
return true;
}

+ if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+ return true;
+ }
+
if (tier > tier_idx) {
int hist = lru_hist_from_seq(lrugen->min_seq[type]);

--
2.35.0.263.gb82422642f-goog

2022-02-09 13:21:25

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v7 00/12] Multigenerational LRU Framework

On Tue 08-02-22 11:11:02, Oleksandr Natalenko wrote:
[...]
> Is the patch submission broken for everyone, or for me only? I see raw
> emails cluttered with some garbage like =2D, and hence I cannot apply
> those neither from my email client nor from lore.

The patchset seems to be OK both in my inbox and b4[1] has downloaded
the full thread without any issues and I could apply all the patches
just fine

[1] https://git.kernel.org/pub/scm/utils/b4/b4.git

--
Michal Hocko
SUSE Labs

2022-02-13 19:14:32

by Yu Zhao

[permalink] [raw]

Subject: Re: [PATCH v7 00/12] Multigenerational LRU Framework

On Sat, Feb 12, 2022 at 05:12:19AM +0900, Alexey Avramov wrote:
> Aggressive swapping even with vm.swappiness=1 with MGLRU
> ========================================================
>
> Reading a large mmapped file leads to a super agressive swapping.
> Reducing vm.swappiness even to 1 does not have effect.

Mind explaining why you think it's "super agressive"? I assume you
expected a different behavior that would perform better. If so,
please spell it out.

> Demo: https://www.youtube.com/watch?v=J81kwJeuW58
>
> Linux 5.17-rc3, Multigenerational LRU v7,
> vm.swappiness=1, MemTotal: 11.5 GiB.
>
> $ cache-bench -r 35000 -m1 -b1 -p1 -f test20000
> Reading mmapped file (file size: 20000 MiB)
> cache-bench v0.2.0: https://github.com/hakavlad/cache-bench

Writing your own benchmark is a good exercise but fio is the standard
benchmark in this case. Please use it with --ioengine=mmap.

> Swapping started with MemAvailable=71%.
> At the end 33 GiB was swapped out when MemAvailable=60%.
>
> Is it OK?

MemAvailable is an estimate (free + page cache), and it doesn't imply
any reclaim preferences. In the worst case scenario, e.g., out of swap
space, MemAvailable *may* be reclaimed.

Here is my benchmark result with file mmap + *high* swap usage. Ram
disk was used to reduce the variance in the result (and SSD wear out
if you care). More details on additional configurations here:
https://lore.kernel.org/linux-mm/[email protected]/

Mixed workloads:
fio (buffered I/O): +13%
IOPS BW
5.17-rc3: 275k 1075MiB/s
v7: 313k 1222MiB/s

memcached (anon): +12%
Ops/sec KB/sec
5.17-rc3: 511282.72 19861.04
v7: 572408.80 22235.49

cat mmap.sh
systemctl restart memcached
swapoff -a
umount /mnt
rmmod brd

modprobe brd rd_nr=2 rd_size=56623104

mkswap /dev/ram0
swapon /dev/ram0

mkfs.ext4 /dev/ram1
mount -t ext4 /dev/ram1 /mnt

memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=50000000 --key-pattern=P:P -c 1 \
-t 36 --ratio 1:0 --pipeline 8 -d 2000

sysctl vm.overcommit_memory=1

fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=mmap --iodepth=128 --iodepth_batch_submit=32 \
--iodepth_batch_complete=32 --rw=randread --random_distribution=random \
--norandommap --time_based --ramp_time=10m --runtime=990m \
--group_reporting &
pid=$!

sleep 200

memcached.sock -P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 --ratio 0:1 \
--pipeline 8 --randomize --distinct-client-seed

kill -INT $pid
wait

2022-03-03 06:30:16

by Vaibhav Jain

[permalink] [raw]

Subject: Re: [PATCH v7 00/12] Multigenerational LRU Framework

In a synthetic MongoDB Benchmark (YCSB) seeing an average of ~19% throughput
improvement on POWER10(Radix MMU + 64K Page Size) with MGLRU patches on
top of v5.16 kernel for MongoDB + YCSB bench across three different
request distriburions namely Exponential,Uniform and Zipfan

Test-Results
============

Average YCSB reported throughput (95% Confidence Interval):
|---------------------+---------------------+---------------------+---------------------|
| Kernel-Type | Exponential | Uniform | Zipfan |
|---------------------+---------------------+---------------------+---------------------|
| Base Kernel (v5.16) | 27324.701 ± 759.652 | 20671.590 ± 412.974 | 37713.761 ± 621.213 |
| v5.16 + MGLRU | 32702.231 ± 287.957 | 24916.239 ± 217.977 | 44308.839 ± 701.829 |
|---------------------+---------------------+---------------------+---------------------|
| Speedup | 19.68% ± 4.03% | 20.11% ± 2.95% | 17.49% ± 2.82% |
|---------------------+---------------------+---------------------+---------------------|

n = 11 Samples x 3 (Distributions) x 2 (Kernels) = 66 Observations

Test Environment
================
Cpu: POWER10 (architected), altivec supported
platform: pSeries
CPUs: 32
MMU: Radix
Page-Size: 64K
Total-Memory: 64G

Distro
-------
# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"

System-config
-------------
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

# cat /proc/swaps
Filename Type Size Used Priority
/dev/dm-5 partition 10485696 940864 -2

# cat /proc/sys/vm/overcommit_memory
0

#cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1 transparent_hugepage=never

MongoDB data partition
----------------------
lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 128G 0 disk <home>/data/mongodb

mount | grep /dev/sdb
/dev/sdb on /root/vajain21/mglru/data/mongodb type ext4 (rw,relatime)

Testing Artifacts
==================

MongoDB-configuration
---------------------
MongoDB Commounity Server built from https://github.com/mongodb/mongo release v5.0.6

# mongod --version
db version v5.0.6
Build Info: {
"version": "5.0.6",
"gitVersion": "212a8dbb47f07427dae194a9c75baec1d81d9259",
"openSSLVersion": "OpenSSL 1.1.1g FIPS 21 Apr 2020",
"modules": [],
"allocator": "tcmalloc",
"environment": {
"distarch": "ppc64le",
"target_arch": "ppc64le"
}
}

# cat /etc/mongod.conf
storage:
dbPath: <home-path>/data/mongodb
journal:
enabled: true
engine: wiredTiger
wiredTiger:
engineConfig:
cacheSizeGB: 50
net:
bindIp: 127.0.0.1
unixDomainSocket:
enabled: true
pathPrefix: /run/mongodb
setParameter:
enableLocalhostAuthBypass: true

YCSB (https://github.com/vaibhav92/YCSB/tree/mongodb-domain-sockets)
--------------------------------------------------------------------

YCSB forked from https://github.com/brianfrankcooper/YCSB.git. This fixes a
problem with YCSB when trying to connect to MongoDB on a unix domain socket. PR
raised to the project at https://github.com/brianfrankcooper/YCSB/pull/1587

Head Commit: fb2555a77005ae70c26e4adc46c945caf4daa2f9(" [core] Generate
classpath from all dependencies rather than just compile scoped")

Kernel-Config
-------------

Base-Kernel: https://github.com/torvalds/linux/ v5.16
Base-Kernel-Config:
https://github.com/vaibhav92/mglru-benchmark/blob/auto_build/config-non-mglru

Test-Kernel: https://linux-mm.googlesource.com/page-reclaim refs/changes/49/1549/1
Test-Kernel-Config:
https://github.com/vaibhav92/mglru-benchmark/blob/auto_build/config-mglru

CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
CONFIG_NR_LRU_GENS=4
CONFIG_TIERS_PER_GEN=4

YCSB:
recordcount=80000000
operationcount=80000000
readproportion=0.8
updateproportion=0.2
workload=site.ycsb.workloads.CoreWorkload
threads=64
requestdistributions={uniform, exponential, zipfian}

Test-Bench
===========
Source: https://github.com/vaibhav92/mglru-benchmark/tree/auto_build

Invoked via following command that will *destroy* contents of /dev/sdd
and use it as data disk for MongoDB:

$ export MONGODB_DISK=/dev/sdd; curl \
https://raw.githubusercontent.com/vaibhav92/mglru-benchmark/auto_build/build.sh
\ | sudo bash -s

Test-Methodology
================

Setup
-----
1. Pull & Build testing artifact v5.16 Base Kernel, MGLRU Kernel,
MongoDB, YCSB & Qemu for qemu-img tools
2. Format and mount provided MongoDB Data disk with ext4.
3. Generate Systemd service/slice files for MongoDB and place them into /etc/systemd/system/
4. Generate MongoDB configration pointing to the data disk mount.
5. Start the built MongoDB instance.
6. Ensure that MongoDB is running.

Load Test Data
---------------
1. Ensure that MongoDB instance is stopped.
2. Unmount the data disk and reformat it with ext4.
3. Restart MongoDB.
4. Spin off YCSB to load data into the Mongo instance.
5. Stop MongoDB + Unmount data Disk
6. Create a qcow2 image of the data disk and store it with test data.
7. Kexec into base kernel.

Test Phase (Happens at each boot)
---------------------------------
1. Select the distribution to be used for YCSB from
{"Uniform","Exponential","Zipfan"}
2. Restore the MongoDB qcow2 data disk Image to the disk
3. Mount the data disk and restart MongoDB daemon.
4. Start YCSB to generate the workload on MongoDB.
5. Once finished collect results.
6. Kexec into next-kernel which keeps switching between Base-Kernel &
MGLRU-Kernel when all three distriutions have been tested.

Setup and Load Test Data stages can be accomplished by following command:
#export MONGODB_DISK=/dev/sdd; \
curl https://raw.githubusercontent.com/vaibhav92/mglru-benchmark/auto_build/build.sh | bash -s

Once completed successfully it will kexec into the base kernel and start the
Test phase on boot via systemd service named 'mglru-benchmark'

Based on above results,
Tested-by: Vaibhav Jain<[email protected]>

Yu Zhao <[email protected]> writes:

> What's new
> ==========
> 1) Addressed all the comments received on the mailing list and in the
> meeting with the stakeholders (will note on individual patches).
> 2) Measured the performance improvements for each patch between 5-8
> (reported in the commit messages).
>
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and straightforward.
>
> Patchset overview
> =================
> The design and implementation overview was moved to patch 12 so that
> people can finish reading this cover letter.
>
> 1. mm: x86, arm64: add arch_has_hw_pte_young()
> 2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> Using hardware optimizations when trying to clear the accessed bit in
> many PTEs.
>
> 3. mm/vmscan.c: refactor shrink_node()
> A minor refactor.
>
> 4. mm: multigenerational LRU: groundwork
> Adding the basic data structure and the functions that insert/remove
> pages to/from the multigenerational LRU (MGLRU) lists.
>
> 5. mm: multigenerational LRU: minimal implementation
> A minimal (functional) implementation without any optimizations.
>
> 6. mm: multigenerational LRU: exploit locality in rmap
> Improving the efficiency when using the rmap.
>
> 7. mm: multigenerational LRU: support page table walks
> Adding the (optional) page table scanning.
>
> 8. mm: multigenerational LRU: optimize multiple memcgs
> Optimizing the overall performance for multiple memcgs running mixed
> types of workloads.
>
> 9. mm: multigenerational LRU: runtime switch
> Adding a runtime switch to enable or disable MGLRU.
>
> 10. mm: multigenerational LRU: thrashing prevention
> 11. mm: multigenerational LRU: debugfs interface
> Providing userspace with additional features like thrashing prevention,
> working set estimation and proactive reclaim.
>
> 12. mm: multigenerational LRU: documentation
> Adding a design doc and an admin guide.
>
> Benchmark results
> =================
> Independent lab results
> -----------------------
> Based on the popularity of searches [01] and the memory usage in
> Google's public cloud, the most popular open-source memory-hungry
> applications, in alphabetical order, are:
> Apache Cassandra Memcached
> Apache Hadoop MongoDB
> Apache Spark PostgreSQL
> MariaDB (MySQL) Redis
>
> An independent lab evaluated MGLRU with the most widely used benchmark
> suites for the above applications. They posted 960 data points along
> with kernel metrics and perf profiles collected over more than 500
> hours of total benchmark time. Their final reports show that, with 95%
> confidence intervals (CIs), the above applications all performed
> significantly better for at least part of their benchmark matrices.
>
> On 5.14:
> 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
> less wall time to sort three billion random integers, respectively,
> under the medium- and the high-concurrency conditions, when
> overcommitting memory. There were no statistically significant
> changes in wall time for the rest of the benchmark matrix.
> 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
> more transactions per minute (TPM), respectively, under the medium-
> and the high-concurrency conditions, when overcommitting memory.
> There were no statistically significant changes in TPM for the rest
> of the benchmark matrix.
> 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
> and [21.59, 30.02]% more operations per second (OPS), respectively,
> for sequential access, random access and Gaussian (distribution)
> access, when THP=always; 95% CIs [13.85, 15.97]% and
> [23.94, 29.92]% more OPS, respectively, for random access and
> Gaussian access, when THP=never. There were no statistically
> significant changes in OPS for the rest of the benchmark matrix.
> 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
> [2.16, 3.55]% more operations per second (OPS), respectively, for
> exponential (distribution) access, random access and Zipfian
> (distribution) access, when underutilizing memory; 95% CIs
> [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
> respectively, for exponential access, random access and Zipfian
> access, when overcommitting memory.
>
> On 5.15:
> 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
> and [4.11, 7.50]% more operations per second (OPS), respectively,
> for exponential (distribution) access, random access and Zipfian
> (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
> [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
> exponential access, random access and Zipfian access, when swap was
> on.
> 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
> less average wall time to finish twelve parallel TeraSort jobs,
> respectively, under the medium- and the high-concurrency
> conditions, when swap was on. There were no statistically
> significant changes in average wall time for the rest of the
> benchmark matrix.
> 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
> minute (TPM) under the high-concurrency condition, when swap was
> off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
> respectively, under the medium- and the high-concurrency
> conditions, when swap was on. There were no statistically
> significant changes in TPM for the rest of the benchmark matrix.
> 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
> [11.47, 19.36]% more total operations per second (OPS),
> respectively, for sequential access, random access and Gaussian
> (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
> [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
> for sequential access, random access and Gaussian access, when
> THP=never.
>
> Our lab results
> ---------------
> To supplement the above results, we ran the following benchmark suites
> on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
> are popular among MM developers, but we prefer large-scale A/B
> experiments to validate improvements.)
> fs_fio_bench_hdd_mq pft
> fs_lmbench pgsql-hammerdb
> fs_parallelio redis
> fs_postmark stream
> hackbench sysbenchthread
> kernbench tpcc_spark
> memcached unixbench
> multichase vm-scalability
> mutilate will-it-scale
> nginx
>
> [01] https://trends.google.com
> [02] https://lore.kernel.org/lkml/[email protected]/
> [03] https://lore.kernel.org/lkml/[email protected]/
> [04] https://lore.kernel.org/lkml/[email protected]/
> [05] https://lore.kernel.org/lkml/[email protected]/
> [06] https://lore.kernel.org/lkml/[email protected]/
> [07] https://lore.kernel.org/lkml/[email protected]/
> [08] https://lore.kernel.org/lkml/[email protected]/
> [09] https://lore.kernel.org/lkml/[email protected]/
> [10] https://lore.kernel.org/lkml/[email protected]/
>
> Read-world applications
> =======================
> Third-party testimonials
> ------------------------
> Konstantin wrote [11]:
> I have Archlinux with 8G RAM + zswap + swap. While developing, I
> have lots of apps opened such as multiple LSP-servers for different
> langs, chats, two browsers, etc... Usually, my system gets quickly
> to a point of SWAP-storms, where I have to kill LSP-servers,
> restart browsers to free memory, etc, otherwise the system lags
> heavily and is barely usable.
>
> 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
> patchset, and I started up by opening lots of apps to create memory
> pressure, and worked for a day like this. Till now I had *not a
> single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
> getting to the point of 3G in SWAP before without a single
> SWAP-storm.
>
> An anonymous user wrote [12]:
> Using that v5 for some time and confirm that difference under heavy
> load and memory pressure is significant.
>
> Shuang wrote [13]:
> With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
> and [9.26, 10.36]% higher throughput, respectively, for random
> access, Zipfian (distribution) access and Gaussian (distribution)
> access, when the average number of jobs per CPU is 1; 95% CIs
> [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput,
> respectively, for random access, Zipfian access and Gaussian access,
> when the average number of jobs per CPU is 2.
>
> Daniel wrote [14]:
> With memcached allocating ~100GB of byte-addressable Optante,
> performance improvement in terms of throughput (measured as queries
> per second) was about 10% for a series of workloads.
>
> Large-scale deployments
> -----------------------
> The downstream kernels that have been using MGLRU include:
> 1. Android ARCVM [15]
> 2. Arch Linux Zen [16]
> 3. Chrome OS [17]
> 4. Liquorix [18]
> 5. post-factum [19]
> 6. XanMod [20]
>
> We've rolled out MGLRU to tens of millions of Chrome OS users and
> about a million Android users. Google's fleetwide profiling [21] shows
> an overall 40% decrease in kswapd CPU usage, in addition to
> improvements in other UX metrics, e.g., an 85% decrease in the number
> of low-memory kills at the 75th percentile and an 18% decrease in
> rendering latency at the 50th percentile.
>
> [11] https://lore.kernel.org/lkml/[email protected]/
> [12] https://phoronix.com/forums/forum/software/general-linux-open-source/1301258-mglru-is-a-very-enticing-enhancement-for-linux-in-2022?p=1301275#post1301275
> [13] https://lore.kernel.org/lkml/[email protected]/
> [14] https://lore.kernel.org/linux-mm/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
> [15] https://chromium.googlesource.com/chromiumos/third_party/kernel
> [16] https://archlinux.org
> [17] https://chromium.org
> [18] https://liquorix.net
> [19] https://gitlab.com/post-factum/pf-kernel
> [20] https://xanmod.org
> [21] https://research.google/pubs/pub44271/
>
> Summery
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
> indicate substantial improvements; there are no known regressions.
> 2. Thrashing prevention, working set estimation and proactive reclaim
> work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
> with similar effects.
>
> Our options, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
> materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [22][23][24], the
> new features will likely be put to use for both personal computers
> and data centers.
> 3. Based on Google's track record, the new code will likely be well
> maintained in the long term. It'd be more difficult if not
> impossible to achieve similar effects on top of the existing
> design.
>
> [22] https://lore.kernel.org/lkml/[email protected]/
> [23] https://lore.kernel.org/lkml/[email protected]/
> [24] https://lore.kernel.org/lkml/[email protected]/
>
> Yu Zhao (12):
> mm: x86, arm64: add arch_has_hw_pte_young()
> mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> mm/vmscan.c: refactor shrink_node()
> mm: multigenerational LRU: groundwork
> mm: multigenerational LRU: minimal implementation
> mm: multigenerational LRU: exploit locality in rmap
> mm: multigenerational LRU: support page table walks
> mm: multigenerational LRU: optimize multiple memcgs
> mm: multigenerational LRU: runtime switch
> mm: multigenerational LRU: thrashing prevention
> mm: multigenerational LRU: debugfs interface
> mm: multigenerational LRU: documentation
>
> Documentation/admin-guide/mm/index.rst | 1 +
> Documentation/admin-guide/mm/multigen_lru.rst | 121 +
> Documentation/vm/index.rst | 1 +
> Documentation/vm/multigen_lru.rst | 152 +
> arch/Kconfig | 9 +
> arch/arm64/include/asm/pgtable.h | 14 +-
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 9 +-
> arch/x86/mm/pgtable.c | 5 +-
> fs/exec.c | 2 +
> fs/fuse/dev.c | 3 +-
> include/linux/cgroup.h | 15 +-
> include/linux/memcontrol.h | 36 +
> include/linux/mm.h | 8 +
> include/linux/mm_inline.h | 214 ++
> include/linux/mm_types.h | 78 +
> include/linux/mmzone.h | 182 ++
> include/linux/nodemask.h | 1 +
> include/linux/page-flags-layout.h | 19 +-
> include/linux/page-flags.h | 4 +-
> include/linux/pgtable.h | 17 +-
> include/linux/sched.h | 4 +
> include/linux/swap.h | 5 +
> kernel/bounds.c | 3 +
> kernel/cgroup/cgroup-internal.h | 1 -
> kernel/exit.c | 1 +
> kernel/fork.c | 9 +
> kernel/sched/core.c | 1 +
> mm/Kconfig | 50 +
> mm/huge_memory.c | 3 +-
> mm/memcontrol.c | 27 +
> mm/memory.c | 39 +-
> mm/mm_init.c | 6 +-
> mm/page_alloc.c | 1 +
> mm/rmap.c | 7 +
> mm/swap.c | 55 +-
> mm/vmscan.c | 2831 ++++++++++++++++-
> mm/workingset.c | 119 +-
> 38 files changed, 3908 insertions(+), 146 deletions(-)
> create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
> create mode 100644 Documentation/vm/multigen_lru.rst
>
> --
> 2.35.0.263.gb82422642f-goog
>
>

--
Cheers
~ Vaibhav

2022-03-03 07:29:16

by Yu Zhao

[permalink] [raw]

Subject: Re: [PATCH v7 00/12] Multigenerational LRU Framework

On Thu, Mar 03, 2022 at 11:36:51AM +0530, Vaibhav Jain wrote:
>
> In a synthetic MongoDB Benchmark (YCSB) seeing an average of ~19% throughput
> improvement on POWER10(Radix MMU + 64K Page Size) with MGLRU patches on
> top of v5.16 kernel for MongoDB + YCSB bench across three different
> request distriburions namely Exponential,Uniform and Zipfan

Thanks, Vaibhav. I'll post the next version in a few days and include
your tested-by tag.