2023-09-21 01:03:55

by Kairui Song

[permalink] [raw]
Subject: [RFC PATCH v3 0/6] Refault distance update with MGLRU support

From: Kairui Song <[email protected]>

I noticed MGLRU not working very well on certain workflows, which is
observed on some workloads with heavy memory stress.

After some debugging, I found this was related to refault distance
detection, when the file page workingset size exceeds total memory,
and the access distance (the left-shift time of a page before it gets
activated or promoted, considering LRU starts from right) of file pages
are larger than total memory. All file pages are stuck on the oldest
generation and getting read-in then evicted permutably, few get activated
and stay in memory.

This series tries to fix this problem by rework the refault distance
based activation to better fit MGLRU, and also tries to use a unified
algorithm for both MGLRU and Inactive/Active LRU, the performance almost
doubled for the workloads that are not working well previously.

Patch 1/5 reworked the refault distance detection model for
Inactive/Active LRU, and updated the comments.

Patch 2/5 splitted the code logic into a helper, prepare for MGLRU.

Patch 3/5 and 4/5 are code simplification and updates for MGLRU.

Patch 5/5 applies the modified refault distance algorithm
for MGLRU.

Following benchmark showed 5x improvement:
To simulate the workflow, I setup a 3-replicated mongodb cluster using
docker, each in a standalone cgroup, set to use 5GB of wiretiger cache
and 10g of oplog, on a 32G VM. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL
query only, for simulating slow query and get a stable result.

Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s

Patched (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s

The performance is 5x times better than before. Testing with lower
stress and some other benchmarks also shows slight improvement or
equivalent performance (eg. fio tests shows a observable performance gain).

Sending out as RFC, I'm still doing more test on it, since this changed
a frequently used algorithm and not really sure if there is any performance
regression on long term. It should improvement the performance for file
pages in general even if there are low memory pressure, since it saved
some cgroup iterations and atomic operations.

Update from V2:
- Rebase to latest mm-stable and redone some tests.
- Split the algorithm change into a stand alone patch as
suggested by Johannes Weiner.

Update from V1:
- Removed the fls operations which previously used in patch 1 for
protecting active pages by expontial ratio, simply compare with number of
inactive pages seems good enough.
- Update some benchmarks results, test result that are basically
identical as before are not updated.

Kairui Song (6):
workingset: simplify and use a more intuitive model
workingset: move refault distance checking into to a helper
workignset: simplify the initilization code
workingset: simplify lru_gen_test_recent
mm, lru_gen: convert avg_total and avg_refaulted to atomic
workingset, lru_gen: apply refault-distance based re-activation

include/linux/mmzone.h | 4 +-
include/linux/swap.h | 2 -
mm/swap.c | 1 -
mm/vmscan.c | 30 +--
mm/workingset.c | 416 +++++++++++++++++++++--------------------
5 files changed, 236 insertions(+), 217 deletions(-)

--
2.41.0


2023-09-21 01:07:31

by Kairui Song

[permalink] [raw]
Subject: [RFC PATCH v3 6/6] workingset, lru_gen: apply refault-distance based re-activation

From: Kairui Song <[email protected]>

I noticed MGLRU not working very well on certain workflows, which is
observed on some heavily stressed databases. That is when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
LRU starts from right) of file pages also larger than total memory.
All file pages are stuck on the oldest generation and getting
read-in then evicted permutably. Despite anon pages being idle,
they never get aged. PID controller didn't kickin until there are some
minor access pattern changes. And file pages are not promoted
or reused.

Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in-memory to help reduce the IO workload significantly.

So apply it for MGLRU as well. The updated refault-distance model
fits well for MGLRU in most cases, if we just consider the last two
generation as the inactive LRU and the first two generations as
active LRU.

Some adjustment is done to fit the logic better, also make the
refault-distance contributed to page tiering and PID refault detection
of MGLRU:

- If a tier-0 page have a qualified refault-distance, just promote
it to higher tier, send it to second oldest gen.
- If a tier >= 1 page have a qualified refault-distance, mark it as
active and send it to youngest gen.
- Increase the reference of every page that have a qualified
refault-distance and increase the PID countroled refault rate
of the updated tier, in hope similar paged will be protected
next time upon eviction.

NOTE: This also changed the meaning of workingset_* fields in
/proc/vmstat, workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking,
workingset_restore_* now stands for all pages promoted by
any reason.

Following benchmark showed 5x improvement. To simulate the optimized
workflow, I setup a 3-replicated mongodb cluster, each in a different
cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with
no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL
query only, for simulating slow query and get a stable result.

Test is done on an EPYC 7K62 with 32G RAM with SATA SSD:

- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s

$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768

$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848

- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801

$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196

The performance is 5x times better than before, and the idle anon pages
now can get swapped out as expected. The result is also better with
lower test stress, testing with lower stress also shows a improvement.

I also checked the benchmark with memtier/memcached and fio,
using similar setup as in commit ac35a4902374 but scaled down to fit in
my test environment:

memtier test (16G ramdisk as swap, 4G memcg limit, VM on a EPYC 7K62):
memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 \
-t 16 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
--key-minimum=1 --key-maximum=36000000 --key-pattern=P:P -c 1 \
-t 16 --ratio 1:0 --pipeline 8 -d 600 -x 6

fio test 1 (16G ramdisk, 4G memcg limit, VM on a EPYC 7K62):
fio -name=mglru --numjobs=16 --directory=/mnt --size=1000m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting

fio test 2 (16G ramdisk, 2G memcg limit, VM on a EPYC 7K62):
fio -name=mglru --numjobs=16 --directory=/mnt --size=1000m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting

mysql test (15G buffer pool with 16G memcg limit, VM on a EPYC 7K62):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=48 --table-size=2000000 --threads=16 --time=1800 run

Before this patch:
memtier: 37794.71 op/s
fio 1: 6327.3k iops
fio 2: 5697.6k iops
mysql: 146104.98 qps

After this patch:
memtier: 37792.61 op/s
fio 1: 6583.3k iops
fio 2: 5929.2k iops
mysql: 146055.88 qps

There is no regression on other tests so far, and a performance gain
is observed on file page heavy tasks.

Signed-off-by: Kairui Song <[email protected]>
---
mm/vmscan.c | 20 +++++---
mm/workingset.c | 130 +++++++++++++++++++++++++++++++-----------------
2 files changed, 95 insertions(+), 55 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 82acc1934c86..c7745b22cc0b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3730,17 +3730,21 @@ static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)

for (tier = 0; tier < MAX_NR_TIERS; tier++) {
if (carryover) {
- unsigned long sum;
+ unsigned long refaulted, total;

- sum = atomic_long_read(&lrugen->avg_refaulted[type][tier]) +
- atomic_long_read(&lrugen->refaulted[hist][type][tier]);
- atomic_long_set(&lrugen->avg_refaulted[type][tier], sum / 2);
+ refaulted = atomic_long_read(&lrugen->avg_refaulted[type][tier]) +
+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);

- sum = atomic_long_read(&lrugen->avg_total[type][tier]) +
- atomic_long_read(&lrugen->evicted[hist][type][tier]);
+ total = atomic_long_read(&lrugen->avg_total[type][tier]) +
+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
if (tier)
- sum += lrugen->protected[hist][type][tier - 1];
- atomic_long_set(&lrugen->avg_total[type][tier], sum / 2);
+ total += lrugen->protected[hist][type][tier - 1];
+
+ /* total could be less than refaulted, see lru_gen_refault */
+ total = max(total, refaulted);
+
+ atomic_long_set(&lrugen->avg_refaulted[type][tier], refaulted / 2);
+ atomic_long_set(&lrugen->avg_total[type][tier], total / 2);
}

if (clear) {
diff --git a/mm/workingset.c b/mm/workingset.c
index 87a16b6158e5..e548c8cee9ad 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -175,6 +175,7 @@
MEM_CGROUP_ID_SHIFT)
#define EVICTION_BITS (BITS_PER_LONG - (EVICTION_SHIFT))
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
+#define LRU_GEN_EVICTION_BITS (EVICTION_BITS - LRU_REFS_WIDTH - LRU_GEN_WIDTH)

/*
* Eviction timestamps need to be able to cover the full range of
@@ -185,6 +186,7 @@
* evictions into coarser buckets by shaving off lower timestamp bits.
*/
static unsigned int bucket_order __read_mostly;
+static unsigned int lru_gen_bucket_order __read_mostly;

static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
bool workingset)
@@ -290,6 +292,34 @@ static inline bool lru_test_refault(struct mem_cgroup *memcg,
(file ? inactive_anon : inactive_file);
}

+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
+{
+ /*
+ * Reclaiming a cgroup means reclaiming all its children in a
+ * round-robin fashion. That means that each cgroup has an LRU
+ * order that is composed of the LRU orders of its child
+ * cgroups; and every page has an LRU position not just in the
+ * cgroup that owns it, but in all of that group's ancestors.
+ *
+ * So when the physical inactive list of a leaf cgroup ages,
+ * the virtual inactive lists of all its parents, including
+ * the root cgroup's, age as well.
+ */
+ do {
+ atomic_long_add(nr_pages, &lruvec->nonresident_age);
+ } while ((lruvec = parent_lruvec(lruvec)));
+}
+
#ifdef CONFIG_LRU_GEN

static void *lru_gen_eviction(struct folio *folio)
@@ -311,10 +341,14 @@ static void *lru_gen_eviction(struct folio *folio)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
min_seq = READ_ONCE(lrugen->min_seq[type]);
+
token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0);
+ token <<= LRU_GEN_EVICTION_BITS;
+ token |= lru_eviction(lruvec, LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);

hist = lru_hist_from_seq(min_seq);
atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+ workingset_age_nonresident(lruvec, folio_nr_pages(folio));

return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
}
@@ -329,15 +363,17 @@ static bool lru_gen_test_recent(struct lruvec *lruvec, bool file,
unsigned long min_seq;

min_seq = READ_ONCE(lruvec->lrugen.min_seq[file]);
+ token >>= LRU_GEN_EVICTION_BITS;
return (token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
}

static void lru_gen_refault(struct folio *folio, void *shadow)
{
int memcgid;
- bool recent;
+ bool refault;
bool workingset;
unsigned long token;
+ bool recent = false;
int hist, tier, refs;
struct lruvec *lruvec;
struct pglist_data *pgdat;
@@ -345,28 +381,36 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
int type = folio_is_file_lru(folio);
int delta = folio_nr_pages(folio);

- rcu_read_lock();
-
unpack_shadow(shadow, &memcgid, &pgdat, &token, &workingset);
lruvec = mem_cgroup_lruvec(mem_cgroup_from_id(memcgid), pgdat);
if (lruvec != folio_lruvec(folio))
- goto unlock;
+ return;

mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
-
+ refault = lru_test_refault(lruvec_memcg(lruvec), lruvec, token, type,
+ LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
recent = lru_gen_test_recent(lruvec, type, token);
- if (!recent)
- goto unlock;
+ if (!recent && !refault)
+ return;

lrugen = &lruvec->lrugen;
-
hist = lru_hist_from_seq(READ_ONCE(lrugen->min_seq[type]));
/* see the comment in folio_lru_refs() */
+ token >>= LRU_GEN_EVICTION_BITS;
refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
tier = lru_tier_from_refs(refs);

- atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
- mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+ if (refault) {
+ if (refs)
+ folio_set_active(folio);
+ /*
+ * Protect higher tier to make it easier
+ * to stay in a stable workingset and prevent refault.
+ */
+ if (refs != BIT(LRU_REFS_WIDTH))
+ tier = lru_tier_from_refs(refs + 1);
+ mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+ }

/*
* Count the following two cases as stalls:
@@ -375,12 +419,25 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
* 2. For pages accessed multiple times through file descriptors,
* numbers of accesses might have been out of the range.
*/
- if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
- folio_set_workingset(folio);
+ if (refault || lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
+ folio_set_workingset(folio);
+ }
+
+ /*
+ * If recent is false, add to global PID counters since the gen which
+ * the page evicted is gone already.
+ */
+ if (recent) {
+ /*
+ * tier may get increased upon refault, which makes refaulted larger
+ * than evicted, this will be reset and accounted by reset_ctrl_pos
+ */
+ atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+ } else {
+ atomic_long_add(delta, &lrugen->avg_total[type][tier]);
+ atomic_long_add(delta, &lrugen->avg_refaulted[type][tier]);
}
-unlock:
- rcu_read_unlock();
}

#else /* !CONFIG_LRU_GEN */
@@ -402,34 +459,6 @@ static void lru_gen_refault(struct folio *folio, void *shadow)

#endif /* CONFIG_LRU_GEN */

-/**
- * workingset_age_nonresident - age non-resident entries as LRU ages
- * @lruvec: the lruvec that was aged
- * @nr_pages: the number of pages to count
- *
- * As in-memory pages are aged, non-resident pages need to be aged as
- * well, in order for the refault distances later on to be comparable
- * to the in-memory dimensions. This function allows reclaim and LRU
- * operations to drive the non-resident aging along in parallel.
- */
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
-{
- /*
- * Reclaiming a cgroup means reclaiming all its children in a
- * round-robin fashion. That means that each cgroup has an LRU
- * order that is composed of the LRU orders of its child
- * cgroups; and every page has an LRU position not just in the
- * cgroup that owns it, but in all of that group's ancestors.
- *
- * So when the physical inactive list of a leaf cgroup ages,
- * the virtual inactive lists of all its parents, including
- * the root cgroup's, age as well.
- */
- do {
- atomic_long_add(nr_pages, &lruvec->nonresident_age);
- } while ((lruvec = parent_lruvec(lruvec)));
-}
-
/**
* workingset_eviction - note the eviction of a folio from memory
* @target_memcg: the cgroup that is causing the reclaim
@@ -529,16 +558,16 @@ void workingset_refault(struct folio *folio, void *shadow)
bool workingset;
long nr;

- if (lru_gen_enabled()) {
- lru_gen_refault(folio, shadow);
- return;
- }
-
/* Flush stats (and potentially sleep) before holding RCU read lock */
mem_cgroup_flush_stats_ratelimited();

rcu_read_lock();

+ if (lru_gen_enabled()) {
+ lru_gen_refault(folio, shadow);
+ goto out;
+ }
+
/*
* The activation decision for this folio is made at the level
* where the eviction occurred, as that is where the LRU order
@@ -785,6 +814,13 @@ static int __init workingset_init(void)
pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
EVICTION_BITS, max_order, bucket_order);

+#ifdef CONFIG_LRU_GEN
+ if (max_order > LRU_GEN_EVICTION_BITS)
+ lru_gen_bucket_order = max_order - LRU_GEN_EVICTION_BITS;
+ pr_info("workingset: lru_gen_timestamp_bits=%d lru_gen_bucket_order=%u\n",
+ LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
+#endif
+
ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
if (ret)
goto err;
--
2.41.0

2023-09-21 06:12:36

by Kairui Song

[permalink] [raw]
Subject: [RFC PATCH v3 3/6] workignset: simplify the initilization code

From: Kairui Song <[email protected]>

Use the new introduced EVICTION_BITS to replace timestamp_bits, compiler
should be able to optimize out the previous variable but this should
make the code more clear and unified.

Signed-off-by: Kairui Song <[email protected]>
---
mm/workingset.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index b0704cbfc667..278c3b9eb549 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -772,7 +772,6 @@ static struct lock_class_key shadow_nodes_key;

static int __init workingset_init(void)
{
- unsigned int timestamp_bits;
unsigned int max_order;
int ret;

@@ -784,12 +783,11 @@ static int __init workingset_init(void)
* some more pages at runtime, so keep working with up to
* double the initial memory by using totalram_pages as-is.
*/
- timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
max_order = fls_long(totalram_pages() - 1);
- if (max_order > timestamp_bits)
- bucket_order = max_order - timestamp_bits;
+ if (max_order > EVICTION_BITS)
+ bucket_order = max_order - EVICTION_BITS;
pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
- timestamp_bits, max_order, bucket_order);
+ EVICTION_BITS, max_order, bucket_order);

ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
if (ret)
--
2.41.0

2023-09-21 08:24:51

by Kairui Song

[permalink] [raw]
Subject: [RFC PATCH v3 5/6] mm, lru_gen: convert avg_total and avg_refaulted to atomic

From: Kairui Song <[email protected]>

No feature change, prepare for later patch.

Signed-off-by: Kairui Song <[email protected]>
---
include/linux/mmzone.h | 4 ++--
mm/vmscan.c | 16 ++++++++--------
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..d944987b67d3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -425,9 +425,9 @@ struct lru_gen_folio {
/* the multi-gen LRU sizes, eventually consistent */
long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
/* the exponential moving average of refaulted */
- unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+ atomic_long_t avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
/* the exponential moving average of evicted+protected */
- unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+ atomic_long_t avg_total[ANON_AND_FILE][MAX_NR_TIERS];
/* the first tier doesn't need protection, hence the minus one */
unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
/* can be modified without holding the LRU lock */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f4de75e5186..82acc1934c86 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3705,9 +3705,9 @@ static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
struct lru_gen_folio *lrugen = &lruvec->lrugen;
int hist = lru_hist_from_seq(lrugen->min_seq[type]);

- pos->refaulted = lrugen->avg_refaulted[type][tier] +
+ pos->refaulted = atomic_long_read(&lrugen->avg_refaulted[type][tier]) +
atomic_long_read(&lrugen->refaulted[hist][type][tier]);
- pos->total = lrugen->avg_total[type][tier] +
+ pos->total = atomic_long_read(&lrugen->avg_total[type][tier]) +
atomic_long_read(&lrugen->evicted[hist][type][tier]);
if (tier)
pos->total += lrugen->protected[hist][type][tier - 1];
@@ -3732,15 +3732,15 @@ static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
if (carryover) {
unsigned long sum;

- sum = lrugen->avg_refaulted[type][tier] +
+ sum = atomic_long_read(&lrugen->avg_refaulted[type][tier]) +
atomic_long_read(&lrugen->refaulted[hist][type][tier]);
- WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+ atomic_long_set(&lrugen->avg_refaulted[type][tier], sum / 2);

- sum = lrugen->avg_total[type][tier] +
+ sum = atomic_long_read(&lrugen->avg_total[type][tier]) +
atomic_long_read(&lrugen->evicted[hist][type][tier]);
if (tier)
sum += lrugen->protected[hist][type][tier - 1];
- WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+ atomic_long_set(&lrugen->avg_total[type][tier], sum / 2);
}

if (clear) {
@@ -5885,8 +5885,8 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,

if (seq == max_seq) {
s = "RT ";
- n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
- n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
+ n[0] = atomic_long_read(&lrugen->avg_refaulted[type][tier]);
+ n[1] = atomic_long_read(&lrugen->avg_total[type][tier]);
} else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
s = "rep";
n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
--
2.41.0

2023-09-21 11:16:42

by Kairui Song

[permalink] [raw]
Subject: [RFC PATCH v3 1/6] workingset: simplify and use a more intuitive model

From: Kairui Song <[email protected]>

This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.

The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based re-activation
for multi-gen LRU.

Currently, refault distance re-activation is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.

Assumption 2 is correct, but assumption 1 is not always true, an activated
page could be anywhere in the LRU list (through mark_page_accessed), it
only left-shift the pages on its right.

And besides, one page can get activate/deactivated for multiple times.

And multi-gen LRU doesn't fit with this model well, pages are getting
aged and activated constantly as the generation sliding window slides.

So instead we introduce a simpler idea here: Just presume the evicted
pages are still in memory, each has an eviction sequence like before.
Let the `nonresistence_age` still be NA and get increased for each
eviction, so we get a "Shadow LRU" here of one evicted page:

Let SP = ((NA's reading @ current) - (NA's reading @ eviction))

+-memory available to cache-+
| |
+-------------------------+===============+===========+
| * shadows O O O | INACTIVE | ACTIVE |
+-+-----------------------+===============+===========+
| |
+-----------------------+
| SP
fault page O -> Hole left by previously faulted in pages
* -> The page corresponding to SP

It can be easily seen that SP stands for how far the current workflow
could push a page out of available memory. Since all evicted page was
once head of INACTIVE list, the page could have such an access distance:

SP + NR_INACTIVE

It *may* get re-activated before getting evicted again if:

SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE

Which can be simplified to:

SP < NR_ACTIVE

Then the page is worth getting re-activated to start from ACTIVE part,
since the access distance is shorter than the total memory to make it
stay.

And since this is only an estimation, based on several hypotheses, and
it could break the ability of LRU to distinguish a workingset out of
caches, so throttle this by two factors:

1. Notice previously re-faulted in pages may leave "holes" on the shadow
part of LRU, that part is left unhandled on purpose to decrease
re-activate rate for pages that have a large SP value (the larger
SP value a page has, the more likely it will be affected by such
holes).
2. When the ACTIVE part of LRU is long enough, chanllaging ACTIVE pages
by re-activating a one-time faulted previously INACTIVE page may not
be a good idea, so throttle the re-activation when ACTIVE > INACTIVE
by comparing with INACTIVE instead.

Another effect of the refault activation worth noticing is that, by
throttling reactivation when ACTIVE part is high, this refault distance
based re-activation can help hold a portion of the caches in memory
instead of letting cached get evicted permutably when the cache size is
larger than total memory, and hotness is similar among all cache pages.
That's because the established workingset (ACTIVE part) will tend to stay
since we throttled reactivation, until the workingset itself start to stall.

This is actually similar with the algoritm before, which introduce such
effect by increasing nonresistence_age in many call paths, trottled
the reactivation when activition/reactivation is massively happenning.

Combined all above, we have:
Upon refault, if any of following conditions is met, mark page as active:

- If ACTIVE LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
SP < NR_ACTIVE

- If ACTIVE LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
SP < NR_INACTIVE

Code-wise, this is simpler than before since no longer need to do lruvec
statistic update when activating a page, and so far, a few benchmarks shows
a similar or better result. And when combined with multi-gen LRU (in
later commits) it shows a measurable performance gain for some workloads.

Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:

memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
--key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting

fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting

mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=36 --table-size=2000000 --threads=12 --time=1800

kernel build test done with 3G memcg limit on an i7-9700.

Before (Average of 6 test run):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6491.5 tps
kernel-build: 1817.13499 seconds

After (Average of 6 test run):
fio: IOPS=5137.5k
fio2: IOPS=7300.67k
memcached: 57878.422 ops/s
mysql: 6491.1 tps
kernel-build: 1813.66231 seconds

Signed-off-by: Kairui Song <[email protected]>
---
include/linux/swap.h | 2 -
mm/swap.c | 1 -
mm/vmscan.c | 2 -
mm/workingset.c | 155 ++++++++++++++++++-------------------------
4 files changed, 64 insertions(+), 96 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 493487ed7c38..ca51d79842b7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,10 +344,8 @@ static inline swp_entry_t page_swap_entry(struct page *page)

/* linux/mm/workingset.c */
bool workingset_test_recent(void *shadow, bool file, bool *workingset);
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
void workingset_refault(struct folio *folio, void *shadow);
-void workingset_activation(struct folio *folio);

/* Only track the nodes of mappings with shadow entries */
void workingset_update_node(struct xa_node *node);
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..685b446fd4f9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,7 +482,6 @@ void folio_mark_accessed(struct folio *folio)
else
__lru_cache_activate_folio(folio);
folio_clear_referenced(folio);
- workingset_activation(folio);
}
if (folio_test_idle(folio))
folio_clear_idle(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..3f4de75e5186 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,8 +2539,6 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
lruvec_add_folio(lruvec, folio);
nr_pages = folio_nr_pages(folio);
nr_moved += nr_pages;
- if (folio_test_active(folio))
- workingset_age_nonresident(lruvec, nr_pages);
}

/*
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..8613945fc66e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -64,74 +64,64 @@
* thrashing on the inactive list, after which refaulting pages can be
* activated optimistically to compete with the existing active pages.
*
- * Approximating inactive page access frequency - Observations:
+ * For such approximation, we introduce a counter `nonresistence_age` (NA)
+ * here. This counter increases each time a page is evicted, and each evicted
+ * page will have a shadow that stores the counter reading at the eviction
+ * time as a timestamp. So when an evicted page was faulted again, we have:
*
- * 1. When a page is accessed for the first time, it is added to the
- * head of the inactive list, slides every existing inactive page
- * towards the tail by one slot, and pushes the current tail page
- * out of memory.
+ * Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
*
- * 2. When a page is accessed for the second time, it is promoted to
- * the active list, shrinking the inactive list by one slot. This
- * also slides all inactive pages that were faulted into the cache
- * more recently than the activated page towards the tail of the
- * inactive list.
+ * +-memory available to cache-+
+ * | |
+ * +-------------------------+===============+===========+
+ * | * shadows O O O | INACTIVE | ACTIVE |
+ * +-+-----------------------+===============+===========+
+ * | |
+ * +-----------------------+
+ * | SP
+ * fault page O -> Hole left by previously faulted in pages
+ * * -> The page corresponding to SP
*
- * Thus:
+ * Here SP can stands for how far the current workflow could push a page
+ * out of available memory. Since all evicted page was once head of
+ * INACTIVE list, the page could have such an access distance of:
*
- * 1. The sum of evictions and activations between any two points in
- * time indicate the minimum number of inactive pages accessed in
- * between.
+ * SP + NR_INACTIVE
*
- * 2. Moving one inactive page N page slots towards the tail of the
- * list requires at least N inactive page accesses.
+ * So if:
*
- * Combining these:
+ * SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
*
- * 1. When a page is finally evicted from memory, the number of
- * inactive pages accessed while the page was in cache is at least
- * the number of page slots on the inactive list.
+ * Which can be simplified to:
*
- * 2. In addition, measuring the sum of evictions and activations (E)
- * at the time of a page's eviction, and comparing it to another
- * reading (R) at the time the page faults back into memory tells
- * the minimum number of accesses while the page was not cached.
- * This is called the refault distance.
+ * SP < NR_ACTIVE
*
- * Because the first access of the page was the fault and the second
- * access the refault, we combine the in-cache distance with the
- * out-of-cache distance to get the complete minimum access distance
- * of this page:
+ * Then the page is worth getting re-activated to start from ACTIVE part,
+ * since the access distance is shorter than total memory to make it stay.
*
- * NR_inactive + (R - E)
+ * And since this is only an estimation, based on several hypotheses, and
+ * it could break the ability of LRU to distinguish a workingset out of
+ * caches, so throttle this by two factors:
*
- * And knowing the minimum access distance of a page, we can easily
- * tell if the page would be able to stay in cache assuming all page
- * slots in the cache were available:
+ * 1. Notice that re-faulted in pages may leave "holes" on the shadow
+ * part of LRU, that part is left unhandled on purpose to decrease
+ * re-activate rate for pages that have a large SP value (the larger
+ * SP value a page have, the more likely it will be affected by such
+ * holes).
+ * 2. When the ACTIVE part of LRU is long enough, challenging ACTIVE pages
+ * by re-activating a one-time faulted previously INACTIVE page may not
+ * be a good idea, so throttle the re-activation when ACTIVE > INACTIVE
+ * by comparing with INACTIVE instead.
*
- * NR_inactive + (R - E) <= NR_inactive + NR_active
+ * Combined all above, we have:
+ * Upon refault, if any of the following conditions is met, mark the page
+ * as active:
*
- * If we have swap we should consider about NR_inactive_anon and
- * NR_active_anon, so for page cache and anonymous respectively:
- *
- * NR_inactive_file + (R - E) <= NR_inactive_file + NR_active_file
- * + NR_inactive_anon + NR_active_anon
- *
- * NR_inactive_anon + (R - E) <= NR_inactive_anon + NR_active_anon
- * + NR_inactive_file + NR_active_file
- *
- * Which can be further simplified to:
- *
- * (R - E) <= NR_active_file + NR_inactive_anon + NR_active_anon
- *
- * (R - E) <= NR_active_anon + NR_inactive_file + NR_active_file
- *
- * Put into words, the refault distance (out-of-cache) can be seen as
- * a deficit in inactive list space (in-cache). If the inactive list
- * had (R - E) more page slots, the page would not have been evicted
- * in between accesses, but activated instead. And on a full system,
- * the only thing eating into inactive list space is active pages.
+ * - If ACTIVE LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
+ * SP < NR_ACTIVE
*
+ * - If ACTIVE LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
+ * SP < NR_INACTIVE
*
* Refaulting inactive pages
*
@@ -419,8 +409,10 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
struct mem_cgroup *eviction_memcg;
struct lruvec *eviction_lruvec;
unsigned long refault_distance;
- unsigned long workingset_size;
+ unsigned long inactive_file;
+ unsigned long inactive_anon;
unsigned long refault;
+ unsigned long active;
int memcgid;
struct pglist_data *pgdat;
unsigned long eviction;
@@ -479,21 +471,27 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
* workingset competition needs to consider anon or not depends
* on having free swap space.
*/
- workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
- if (!file) {
- workingset_size += lruvec_page_state(eviction_lruvec,
- NR_INACTIVE_FILE);
- }
+ active = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+ inactive_file = lruvec_page_state(eviction_lruvec, NR_INACTIVE_FILE);
+
if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
- workingset_size += lruvec_page_state(eviction_lruvec,
+ active += lruvec_page_state(eviction_lruvec,
NR_ACTIVE_ANON);
- if (file) {
- workingset_size += lruvec_page_state(eviction_lruvec,
- NR_INACTIVE_ANON);
- }
+ inactive_anon = lruvec_page_state(eviction_lruvec,
+ NR_INACTIVE_ANON);
+ } else {
+ inactive_anon = 0;
}

- return refault_distance <= workingset_size;
+ /*
+ * When there are already enough active pages, be less aggressive
+ * on reactivating pages, challenge an large set of established
+ * active pages with one time refaulted page may not be a good idea.
+ */
+ if (active >= inactive_anon + inactive_file)
+ return refault_distance < inactive_anon + inactive_file;
+ else
+ return refault_distance < active + (file ? inactive_anon : inactive_file);
}

/**
@@ -543,7 +541,6 @@ void workingset_refault(struct folio *folio, void *shadow)
goto out;

folio_set_active(folio);
- workingset_age_nonresident(lruvec, nr);
mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);

/* Folio was active prior to eviction */
@@ -560,30 +557,6 @@ void workingset_refault(struct folio *folio, void *shadow)
rcu_read_unlock();
}

-/**
- * workingset_activation - note a page activation
- * @folio: Folio that is being activated.
- */
-void workingset_activation(struct folio *folio)
-{
- struct mem_cgroup *memcg;
-
- rcu_read_lock();
- /*
- * Filter non-memcg pages here, e.g. unmap can call
- * mark_page_accessed() on VDSO pages.
- *
- * XXX: See workingset_refault() - this should return
- * root_mem_cgroup even for !CONFIG_MEMCG.
- */
- memcg = folio_memcg_rcu(folio);
- if (!mem_cgroup_disabled() && !memcg)
- goto out;
- workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
-out:
- rcu_read_unlock();
-}
-
/*
* Shadow entries reflect the share of the working set that does not
* fit into memory, so their number depends on the access pattern of
--
2.41.0